Kuromoji is an open source Japanese morphological analyzer written in Java.

Kuromoji has been contributed to the Apache Software Foundation and has provided the Japanese language support in Apache Lucene and Apache Solr since version 3.6 (as JapaneseTokenizer). It can also be used on its own for NLP work.

Feature summary

Kuromoji is a Japanese morphological analyzer that performs:

  • Word segmentation. Segments text into words (or morphemes)
  • Part-of-speech tagging. Assigns word categories (nouns, verbs, particles, adjectives, etc.)
  • Lemmatization. Returns dictionary forms for inflected verbs and adjectives
  • Readings. Extracts readings for kanji

Kuromoji also has the following key characteristics:

  • Practical packaging. Packaged as a self-contained jar file with everything included
  • Designed for search. Modes for splitting compounds into their parts for improved recall in search
  • Ease of use. Simple API and Maven integration
  • Practical license. Licensed under the Apache License v2.0, which suits both open source and commercial software

Programming example: using Kuromoji in code

Kuromoji is packaged as a single jar file, is available through Maven, and has no third-party dependencies, which makes it easy to work with.

Below is a simple Java example that segments a short sentence.

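package org.atilika.kuromoji.example;

import org.atilika.kuromoji.Token;
import org.atilika.kuromoji.Tokenizer;

// The class name matches the file name (KuromojiExample.java) used in the
// compile and run commands.
public class KuromojiExample {
  public static void main(String[] args) {
    // Build a tokenizer with the default settings.
    Tokenizer tokenizer = Tokenizer.builder().build();
    // Print each token's surface form and its comma-separated features.
    for (Token token : tokenizer.tokenize("寿司が食べたい。")) {
      System.out.println(token.getSurfaceForm() + "\t" + token.getAllFeatures());
    }
  }
}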

Compile the example program using

% javac -encoding UTF-8 -cp lib/kuromoji-0.7.7.jar \
  src/main/java/org/atilika/kuromoji/example/KuromojiExample.java

and then run it using

% java -Dfile.encoding=UTF-8 \
  -cp lib/kuromoji-0.7.7.jar:src/main/java \
  org.atilika.kuromoji.example.KuromojiExample
  寿司	名詞,一般,*,*,*,*,寿司,スシ,スシ
  が	助詞,格助詞,一般,*,*,*,が,ガ,ガ
  食べ	動詞,自立,*,*,一段,連用形,食べる,タベ,タベ
  たい	助動詞,*,*,*,特殊・タイ,基本形,たい,タイ,タイ
  。	記号,句点,*,*,*,*,。,。,。
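
The comma-separated features in the output above follow the MeCab-IPADIC layout: four part-of-speech levels, conjugation type, conjugation form, base form, reading, and pronunciation. As a minimal sketch of how those fields could be picked apart, the class below splits a feature string and names the fields. IpadicFeatures is a hypothetical helper written for this illustration and is not part of the Kuromoji API; it treats any missing trailing field as "*".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Kuromoji API: gives names to the
// comma-separated feature fields printed by the example above.
final class IpadicFeatures {
  final String partOfSpeech;    // fields 0-3, with "*" placeholders dropped
  final String conjugationType; // field 4
  final String conjugationForm; // field 5
  final String baseForm;        // field 6 (dictionary form)
  final String reading;         // field 7 (katakana)
  final String pronunciation;   // field 8 (katakana)

  private IpadicFeatures(String[] f) {
    List<String> pos = new ArrayList<>();
    for (int i = 0; i < 4; i++) {
      String level = field(f, i);
      if (!level.equals("*")) {
        pos.add(level);
      }
    }
    partOfSpeech = String.join(",", pos);
    conjugationType = field(f, 4);
    conjugationForm = field(f, 5);
    baseForm = field(f, 6);
    reading = field(f, 7);
    pronunciation = field(f, 8);
  }

  // Returns "*" (the "not applicable" marker seen in the output) for a missing field.
  private static String field(String[] f, int i) {
    return i < f.length ? f[i] : "*";
  }

  static IpadicFeatures parse(String features) {
    return new IpadicFeatures(features.split(",", -1));
  }

  public static void main(String[] args) {
    // Two lines copied from the example output: surface form, tab, features.
    String[] lines = {
      "寿司\t名詞,一般,*,*,*,*,寿司,スシ,スシ",
      "食べ\t動詞,自立,*,*,一段,連用形,食べる,タベ,タベ",
    };
    for (String line : lines) {
      String[] cols = line.split("\t", 2);
      IpadicFeatures f = parse(cols[1]);
      System.out.println(cols[0] + ": base form " + f.baseForm
          + ", reading " + f.reading + ", part of speech " + f.partOfSpeech);
    }
  }
}
```

For 食べ, the base form field is 食べる, which is the lemmatization result mentioned in the feature summary.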

Tip: Kuromoji is thread-safe, so you can tokenize text from multiple threads.

Maven artifact repository: for ease of use with Maven or Ivy

To use Kuromoji with Maven, first add the repository to the <repositories> section of your pom.xml as indicated below.

<repository>
  <id>Atilika Open Source repository</id>
  <url>http://www.atilika.org/nexus/content/repositories/atilika</url>
</repository>

Then add the Kuromoji coordinates to the <dependencies> section as follows:

<dependency>
  <groupId>org.atilika.kuromoji</groupId>
  <artifactId>kuromoji</artifactId>
  <version>0.7.7</version>
  <type>jar</type>
  <scope>compile</scope>
</dependency>

You should now be able to use Kuromoji in your project.

Kuromoji with Lucene, Solr and elasticsearch

Kuromoji provides the default Japanese language support in Apache Lucene and Apache Solr thanks to a joint effort of committers and contributors. Kuromoji is also available as a plugin to elasticsearch.

Kuromoji in Lucene/Solr and elasticsearch has a ready-to-use default configuration that performs:

  • Light stopwords/stoptags removal. Removes particles and common words to prevent rank-skew
  • Character width-normalization. Full-width romaji to half-width and half-width kana to full-width
  • Lemmatization. Reduces inflected adjectives and verbs to their base form

which suits many deployments out-of-the-box.
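
To give a feel for what width normalization does, here is a rough sketch that uses the JDK's java.text.Normalizer with NFKC as a stand-in. This is an assumption made for illustration: Lucene ships its own filter for this step, and NFKC folds more than character width, so the sketch only approximates the behavior described above. The class and method names are hypothetical.

```java
import java.text.Normalizer;

// Illustration only: approximates width normalization with Unicode NFKC.
// NFKC also rewrites other compatibility characters, so it is broader
// than the width folding described in the list above.
final class WidthFoldingSketch {
  static String foldWidth(String text) {
    return Normalizer.normalize(text, Normalizer.Form.NFKC);
  }

  public static void main(String[] args) {
    // Full-width romaji becomes half-width (plain ASCII).
    System.out.println(foldWidth("Ｋｕｒｏｍｏｊｉ")); // Kuromoji
    // Half-width kana becomes full-width kana.
    System.out.println(foldWidth("ｽｼ")); // スシ
  }
}
```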

In addition to the above, many useful token attributes are available, including readings, romanized readings, and part-of-speech tags. All of this is available in Lucene as JapaneseAnalyzer and in Solr as the default field type "text_ja" in the example schema.xml. Various configuration options are available.

Tip: To search Japanese using Solr, simply use field type "text_ja".
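
As a sketch, a field that uses this type might be declared in schema.xml as follows. The field name body_ja is hypothetical, and the indexed and stored settings are only one plausible choice.

```xml
<!-- Hypothetical field name; "text_ja" is the field type from Solr's example schema.xml -->
<field name="body_ja" type="text_ja" indexed="true" stored="true"/>
```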

Tip: To search Japanese using Lucene, all the above is available using JapaneseAnalyzer.

Search mode and synonym compounds

In search mode, we want to split compounds in order to make their parts searchable, which is good for recall.

In order to make sure we maintain precision for an exact term match, we also keep the compound in our index as a synonym to get a rank boost (typically from IDF).

In this way Kuromoji balances recall and precision to give good overall ranking.

Tokens for 関西国際空港. We keep the compound as a synonym in position 1.

Position 1      Position 2      Position 3
関西            国際            空港
関西国際空港
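
One way to picture the table: in a Lucene-style token stream each token carries a position increment, and an increment of 0 places a token at the same position as the one before it. The sketch below turns increments into positions for the four tokens in the table. The class name and the specific increments are assumptions chosen to reproduce the table; they are not output captured from Kuromoji.

```java
// Illustration only: models how a synonym can share a position with
// another token by using a position increment of 0.
final class SynonymPositionsSketch {
  // Turns position increments into absolute positions. Counting starts so
  // that a first token with increment 1 lands on position 1, as in the table.
  static int[] positions(int[] increments) {
    int[] result = new int[increments.length];
    int position = 0;
    for (int i = 0; i < increments.length; i++) {
      position += increments[i];
      result[i] = position;
    }
    return result;
  }

  public static void main(String[] args) {
    // Assumed token order and increments; the compound is stacked on 関西.
    String[] terms = {"関西", "関西国際空港", "国際", "空港"};
    int[] increments = {1, 0, 1, 1};
    int[] pos = positions(increments);
    for (int i = 0; i < terms.length; i++) {
      System.out.println("Position " + pos[i] + ": " + terms[i]);
    }
  }
}
```

With these increments, a query for the exact compound and a query for any of its parts can both find the document, which is the recall and precision trade-off described above.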

Apache Solr 4.0 analysis screenshot

[Screenshot: analysis screen of Solr 4.0 RC1]

Info: Several token attributes are available, including part-of-speech tags, readings, and romanized readings.

Frequently Asked Questions

I have a question about Kuromoji. How can I get help?

Please feel free to contact us at kuromoji@atilika.com with any questions and we will do our best to help. Questions in Japanese are welcome, too (日本語でも大丈夫です).

Which license does Kuromoji use?

Kuromoji is licensed under the Apache License v2.0 and uses the MeCab-IPADIC dictionary/statistical model. See NOTICE.md for license details.

Is Kuromoji available in C or C++?

We have a work-in-progress C version of Kuromoji available. Please contact us if you are interested.

Which dictionaries does Kuromoji support?

Kuromoji supports the MeCab-IPADIC dictionary and has experimental support for UniDic.

Please contact us if you need additional dictionary support.

Is Kuromoji thread safe?

Yes.

What does Kuromoji mean?

Good question! Literally, kuromoji means black letters in Japanese, but it also has another meaning.

A kuromoji is a utensil used in the Japanese tea ceremony to transfer sweets (wagashi) from a tray onto one's tissue (kaishi), and then to cut them into suitable pieces before eating.

So – basically it is a tool for input processing. ;)