Kuromoji

Open source Java morphological analyzer for Japanese.

Features

Word segmentation

Kuromoji can separate a block of text into distinct words, also known as morphemes.

吾輩は猫である。→ 吾輩 は 猫 で ある 。
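As a minimal sketch of word segmentation, the class below collects the surface form of each token. It assumes the kuromoji-ipadic artifact is on the classpath; the class name SegmentationExample and the helper surfaces are made up for illustration.

```java
import com.atilika.kuromoji.ipadic.Token;
import com.atilika.kuromoji.ipadic.Tokenizer;
import java.util.ArrayList;
import java.util.List;

public class SegmentationExample {
    // Built once and reused, since constructing a tokenizer loads the dictionary.
    private static final Tokenizer TOKENIZER = new Tokenizer();

    // Hypothetical helper: returns the surface form of every token, in order.
    static List<String> surfaces(String text) {
        List<String> result = new ArrayList<>();
        for (Token token : TOKENIZER.tokenize(text)) {
            result.add(token.getSurface());
        }
        return result;
    }

    public static void main(String[] args) {
        // Print the tokens separated by " | " so the word boundaries are visible.
        System.out.println(String.join(" | ", surfaces("吾輩は猫である。")));
    }
}
```

Joining the returned surfaces without a separator should give back the original text, which is a handy sanity check when experimenting.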

Part of speech tagging

For each word, Kuromoji assigns a part of speech like noun, verb, adjective, and so on.

相撲 (noun) を (particle) 見る (verb) の (particle) が (particle) 好き (adjectival noun) です (auxiliary verb) 。 (symbol)
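The tagging above can be reproduced roughly as follows. This is a sketch assuming the ipadic dictionary, whose tags are in Japanese (名詞 for noun, 助詞 for particle, and so on) rather than the English labels shown above; PosExample and tag are illustrative names only.

```java
import com.atilika.kuromoji.ipadic.Token;
import com.atilika.kuromoji.ipadic.Tokenizer;
import java.util.ArrayList;
import java.util.List;

public class PosExample {
    private static final Tokenizer TOKENIZER = new Tokenizer();

    // Hypothetical helper: pairs each surface form with its top-level part of speech.
    static List<String> tag(String text) {
        List<String> result = new ArrayList<>();
        for (Token token : TOKENIZER.tokenize(text)) {
            result.add(token.getSurface() + "\t" + token.getPartOfSpeechLevel1());
        }
        return result;
    }

    public static void main(String[] args) {
        for (String line : tag("相撲を見るのが好きです。")) {
            System.out.println(line);
        }
    }
}
```

Finer-grained categories are available through getPartOfSpeechLevel2() to getPartOfSpeechLevel4() on the same Token class.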

Lemmatization

Get the base form for inflected verbs and adjectives.

Surface Form	Base Form
食べたい	食べる
楽しくない	楽しい
帰りました	帰る
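A short sketch of how base forms might be read from tokens, assuming the ipadic dictionary. Note that base forms are reported per token, so an inflected phrase such as 食べたい is expected to come back as more than one token, each with its own base form. LemmaExample and baseForms are hypothetical names.

```java
import com.atilika.kuromoji.ipadic.Token;
import com.atilika.kuromoji.ipadic.Tokenizer;
import java.util.ArrayList;
import java.util.List;

public class LemmaExample {
    private static final Tokenizer TOKENIZER = new Tokenizer();

    // Hypothetical helper: returns the base (dictionary) form of each token.
    static List<String> baseForms(String text) {
        List<String> result = new ArrayList<>();
        for (Token token : TOKENIZER.tokenize(text)) {
            result.add(token.getBaseForm());
        }
        return result;
    }

    public static void main(String[] args) {
        // Print each input next to the base forms of its tokens.
        for (String text : new String[] {"食べたい", "楽しくない", "帰りました"}) {
            System.out.println(text + "\t" + baseForms(text));
        }
    }
}
```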

Readings

Extract readings for kanji.

親譲(おやゆず)りの無鉄砲(むてっぽう)で小供(こども)の時(とき)から損(そん)ばかりしている
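Readings can be pulled from each token as in the sketch below. It assumes the ipadic dictionary, which stores readings in katakana rather than the hiragana shown above. It also assumes that tokens without a dictionary reading report "*", in which case the surface form is kept. ReadingExample and reading are illustrative names.

```java
import com.atilika.kuromoji.ipadic.Token;
import com.atilika.kuromoji.ipadic.Tokenizer;

public class ReadingExample {
    private static final Tokenizer TOKENIZER = new Tokenizer();

    // Hypothetical helper: concatenates the reading of every token in the text.
    static String reading(String text) {
        StringBuilder sb = new StringBuilder();
        for (Token token : TOKENIZER.tokenize(text)) {
            String r = token.getReading();
            // Assumption: "*" marks a missing reading, so fall back to the surface form.
            sb.append("*".equals(r) ? token.getSurface() : r);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(reading("親譲りの無鉄砲で小供の時から損ばかりしている"));
    }
}
```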

Search segmentation mode

Kuromoji comes with a Search Mode for search applications. It splits words further so that you get hits when searching for the parts of compound nouns.

For example, we want a search for 空港 (airport) to match 関西国際空港 (Kansai International Airport), but most analyzers don’t allow this since 関西国際空港 tends to become one token.
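The difference between the two modes can be seen with a sketch like the one below, which assumes the ipadic dictionary and its Tokenizer.Builder. SearchModeExample and surfaces are hypothetical names.

```java
import com.atilika.kuromoji.TokenizerBase.Mode;
import com.atilika.kuromoji.ipadic.Token;
import com.atilika.kuromoji.ipadic.Tokenizer;
import java.util.ArrayList;
import java.util.List;

public class SearchModeExample {
    // Hypothetical helper: tokenizes the text in the given mode and returns the surfaces.
    static List<String> surfaces(Mode mode, String text) {
        Tokenizer tokenizer = new Tokenizer.Builder().mode(mode).build();
        List<String> result = new ArrayList<>();
        for (Token token : tokenizer.tokenize(text)) {
            result.add(token.getSurface());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "関西国際空港";
        // Normal mode tends to keep the compound together; search mode splits it further.
        System.out.println("NORMAL: " + surfaces(Mode.NORMAL, text));
        System.out.println("SEARCH: " + surfaces(Mode.SEARCH, text));
    }
}
```

In a real application the tokenizer would be built once and reused rather than rebuilt on every call.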

Dictionary support

Kuromoji supports a wide range of dictionary backends for different use cases, including ipadic, jumandic, and unidic.

Open Source

Kuromoji is licensed under the Apache License, Version 2.0.

Search Integration

Kuromoji powers the Japanese language support in Apache Lucene and Apache Solr. It is also used in Elasticsearch.

Usage

Kuromoji is an easy-to-use, self-contained Japanese morphological analyzer that does:

Word segmentation. Segment text into words (or morphemes)
Part-of-speech tagging. Assign word categories (nouns, verbs, particles, adjectives, etc.)
Lemmatization. Get dictionary forms for inflected verbs and adjectives
Readings. Extract readings for kanji

Several other features are supported. Please consult each dictionary's Token class for details.
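As an illustration of what a Token class exposes, the sketch below prints a few additional fields from the ipadic Token. Other dictionaries provide different accessors, so treat this as specific to ipadic; TokenDetailsExample and describe are hypothetical names.

```java
import com.atilika.kuromoji.ipadic.Token;
import com.atilika.kuromoji.ipadic.Tokenizer;
import java.util.ArrayList;
import java.util.List;

public class TokenDetailsExample {
    private static final Tokenizer TOKENIZER = new Tokenizer();

    // Hypothetical helper: formats a handful of ipadic token fields, one line per token.
    static List<String> describe(String text) {
        List<String> lines = new ArrayList<>();
        for (Token token : TOKENIZER.tokenize(text)) {
            lines.add(token.getSurface()
                    + "\tconjugationType=" + token.getConjugationType()
                    + "\tconjugationForm=" + token.getConjugationForm()
                    + "\tpronunciation=" + token.getPronunciation()
                    + "\tknown=" + token.isKnown());
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describe("お寿司が食べたい。")) {
            System.out.println(line);
        }
    }
}
```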

Using Kuromoji

The example below shows how to use the Kuromoji morphological analyzer in its simplest form: segmenting text into tokens and printing the features of each token.

package com.atilika.kuromoji.example;

import com.atilika.kuromoji.ipadic.Token;
import com.atilika.kuromoji.ipadic.Tokenizer;
import java.util.List;

public class KuromojiExample {
    public static void main(String[] args) {
        // Create a tokenizer backed by the ipadic dictionary.
        Tokenizer tokenizer = new Tokenizer();
        List<Token> tokens = tokenizer.tokenize("お寿司が食べたい。");
        // Print each token's surface form followed by its dictionary features.
        for (Token token : tokens) {
            System.out.println(token.getSurface() + "\t" + token.getAllFeatures());
        }
    }
}

Make sure you add the dependency below to your pom.xml before building your project.

<dependency>
    <groupId>com.atilika.kuromoji</groupId>
    <artifactId>kuromoji-ipadic</artifactId>
    <version>0.9.0</version>
</dependency>