Kuromoji

Open source Java morphological analyzer for Japanese.

Features

Word segmentation

Kuromoji can separate a block of text into distinct words, also known as morphemes.

吾輩は猫である。 → 吾輩 | は | 猫 | で | ある | 。

Part of speech tagging

For each word, Kuromoji assigns a part of speech, such as noun, verb, or adjective.

相撲 (noun)  を (particle)  見る (verb)  の (particle)  が (particle)  好き (adjectival noun)  です (auxiliary verb)  。 (symbol)

Lemmatization

Get the base form for inflected verbs and adjectives.

Surface Form    Base Form
食べたい        食べる
楽しくない      楽しい
帰りました      帰る

Readings

Extract readings for kanji.

親譲り（おやゆずり）の無鉄砲（むてっぽう）で小供（こども）の時（とき）から損（そん）ばかりしている

Search segmentation mode

Kuromoji comes with a Search Mode for search applications. It splits words further so that you get hits when searching for parts of compound nouns.

For example, we want a search for 空港 (airport) to match 関西国際空港 (Kansai International Airport), but most analyzers don’t allow this since 関西国際空港 tends to become one token.
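
The effect on matching can be sketched with a tiny inverted index. This is a minimal illustration, not Kuromoji code: the two token lists are hard-coded assumptions standing in for what a normal-mode and a search-mode tokenizer would return for 関西国際空港, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class SearchModeSketch {

    // Builds a toy inverted index: token -> ids of the documents containing it.
    public static Map<String, Set<Integer>> index(List<List<String>> tokenizedDocs) {
        Map<String, Set<Integer>> idx = new HashMap<>();
        for (int docId = 0; docId < tokenizedDocs.size(); docId++) {
            for (String token : tokenizedDocs.get(docId)) {
                idx.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
        return idx;
    }

    public static void main(String[] args) {
        // Assumed normal-mode output: the compound stays a single token.
        List<List<String>> normal = Arrays.asList(Arrays.asList("関西国際空港"));
        // Assumed search-mode output: the compound is split into its parts.
        List<List<String>> search = Arrays.asList(Arrays.asList("関西", "国際", "空港"));

        String query = "空港";
        System.out.println("normal mode hit: " + index(normal).containsKey(query));
        System.out.println("search mode hit: " + index(search).containsKey(query));
    }
}
```

With Kuromoji on the classpath, the search-mode list would come from a tokenizer configured for that mode. In 0.9.0 this is believed to be `new Tokenizer.Builder().mode(TokenizerBase.Mode.SEARCH).build()`, but check the Javadoc of the version you use.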

Dictionary support

Kuromoji supports a wide range of dictionary backends for different use cases, including ipadic, jumandic, and unidic, among others.

Open Source

Kuromoji is licensed under the Apache License, Version 2.0.

Search Integration

Kuromoji powers the Japanese language support in Apache Lucene and Apache Solr.
It is also used in Elasticsearch.

Usage

Kuromoji is an easy-to-use, self-contained Japanese morphological analyzer that does:

  • Word segmentation. Segment text into words (or morphemes)
  • Part-of-speech tagging. Assign word categories (nouns, verbs, particles, adjectives, etc.)
  • Lemmatization. Get dictionary forms for inflected verbs and adjectives
  • Readings. Extract readings for kanji

Several other features are supported. Please consult each dictionary's Token class for details.


Using Kuromoji

The example below shows how to use the Kuromoji morphological analyzer in its simplest form: segmenting text into tokens and printing the features of each token.

package com.atilika.kuromoji.example;

import com.atilika.kuromoji.ipadic.Token;
import com.atilika.kuromoji.ipadic.Tokenizer;
import java.util.List;

public class KuromojiExample {
    public static void main(String[] args) {
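        // Create a tokenizer backed by the IPADIC dictionary.
        Tokenizer tokenizer = new Tokenizer();
        List<Token> tokens = tokenizer.tokenize("お寿司が食べたい。");
        // Print each token's surface form and its comma-separated features.
        for (Token token : tokens) {
            System.out.println(token.getSurface() + "\t" + token.getAllFeatures());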
        }
    }
}

Make sure you add the dependency below to your pom.xml before building your project.

<dependency>
    <groupId>com.atilika.kuromoji</groupId>
    <artifactId>kuromoji-ipadic</artifactId>
    <version>0.9.0</version>
</dependency>
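
The feature string printed by the example is comma-separated. As a rough guide to reading it, the sketch below splits one such string into named fields. It assumes the usual IPADIC layout (four part-of-speech levels, conjugation type, conjugation form, base form, reading, pronunciation). The class name, the field names, and the sample string for 食べ are illustrative assumptions, and the code does not depend on Kuromoji. In real code, the accessors on the dictionary's Token class are the better choice.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class IpadicFeatureSketch {

    // Assumed field order of an IPADIC feature string.
    private static final String[] FIELDS = {
        "partOfSpeechLevel1", "partOfSpeechLevel2", "partOfSpeechLevel3", "partOfSpeechLevel4",
        "conjugationType", "conjugationForm", "baseForm", "reading", "pronunciation"
    };

    // Splits a feature string into field name -> value, keeping field order.
    // Shorter strings (for example, for unknown words) simply yield fewer entries.
    public static Map<String, String> parseFeatures(String allFeatures) {
        String[] values = allFeatures.split(",", -1);
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < FIELDS.length && i < values.length; i++) {
            result.put(FIELDS[i], values[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative feature string for the token 食べ in お寿司が食べたい。
        Map<String, String> f = parseFeatures("動詞,自立,*,*,一段,連用形,食べる,タベル,タベル");
        System.out.println("part of speech: " + f.get("partOfSpeechLevel1"));
        System.out.println("base form:      " + f.get("baseForm"));
        System.out.println("reading:        " + f.get("reading"));
    }
}
```

Here the base form field corresponds to the lemmatization feature and the reading field to the readings feature described earlier. An asterisk marks a field that does not apply to the token.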