Open source Java morphological analyzer for Japanese.
Kuromoji can separate a block of text into distinct words, also known as morphemes.
For each word, Kuromoji assigns a part of speech like noun, verb, adjective, and so on.
Get the base form for inflected verbs and adjectives.
Surface Form | Base Form |
---|---|
食べたい | 食べる |
楽しくない | 楽しい |
帰りました | 帰る |
Extract readings for kanji.
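The features above (segmentation, part of speech, base form, reading) can be read off each token with the getters on the ipadic `Token` class. The following is a minimal sketch, assuming the `kuromoji-ipadic` artifact is on the classpath; the class name `TokenFeaturesExample` and the helper `describe` are made up for illustration and are not part of Kuromoji.

```java
import com.atilika.kuromoji.ipadic.Token;
import com.atilika.kuromoji.ipadic.Tokenizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical example class; only Tokenizer and Token come from Kuromoji.
public class TokenFeaturesExample {

    // Returns one tab-separated line per token:
    // surface form, top-level part of speech, base form, reading.
    static List<String> describe(String text) {
        Tokenizer tokenizer = new Tokenizer(); // default ipadic tokenizer
        List<String> lines = new ArrayList<>();
        for (Token token : tokenizer.tokenize(text)) {
            lines.add(String.join("\t",
                    token.getSurface(),
                    token.getPartOfSpeechLevel1(),
                    token.getBaseForm(),
                    token.getReading()));
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describe("お寿司が食べたい。")) {
            System.out.println(line);
        }
    }
}
```

Note that the tokenizer works at the morpheme level, so an inflected phrase such as 食べたい is typically returned as more than one token, each with its own base form.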
Kuromoji comes with a Search Mode for search applications that splits words further, so that you get hits when searching for compound nouns.
For example, we want a search for 空港 (airport) to match 関西国際空港 (Kansai International Airport), but most analyzers don’t allow this since 関西国際空港 tends to become one token.
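As a rough sketch of how the two behaviours can be compared, the ipadic tokenizer can be built with `Tokenizer.Builder` and the `TokenizerBase.Mode.SEARCH` mode. This assumes the `kuromoji-ipadic` artifact is on the classpath; the class `SearchModeExample` and its helper methods are hypothetical names used only for this illustration.

```java
import com.atilika.kuromoji.TokenizerBase;
import com.atilika.kuromoji.ipadic.Token;
import com.atilika.kuromoji.ipadic.Tokenizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical example class comparing normal mode and search mode.
public class SearchModeExample {

    // Collects the surface form of every token produced by the given tokenizer.
    static List<String> surfaces(Tokenizer tokenizer, String text) {
        List<String> result = new ArrayList<>();
        for (Token token : tokenizer.tokenize(text)) {
            result.add(token.getSurface());
        }
        return result;
    }

    // Default tokenizer (normal mode).
    static List<String> normalModeSurfaces(String text) {
        return surfaces(new Tokenizer(), text);
    }

    // Tokenizer configured for search mode via the builder.
    static List<String> searchModeSurfaces(String text) {
        Tokenizer tokenizer = new Tokenizer.Builder()
                .mode(TokenizerBase.Mode.SEARCH)
                .build();
        return surfaces(tokenizer, text);
    }

    public static void main(String[] args) {
        String text = "関西国際空港";
        System.out.println("normal: " + normalModeSurfaces(text));
        System.out.println("search: " + searchModeSurfaces(text));
    }
}
```

If search mode splits the compound as described, a token-level query for 空港 can match the search-mode output, while the normal-mode output keeps the compound as a single token.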
Kuromoji supports a wide range of dictionary backends for different use cases, including ipadic, jumandic, and unidic, among others.
Kuromoji is licensed under the Apache License, Version 2.0.
Kuromoji powers the Japanese language support in Apache Lucene and Apache Solr. It is also used in Elasticsearch.
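Kuromoji is an easy-to-use, self-contained Japanese morphological analyzer that does word segmentation, part-of-speech tagging, lemmatization (base forms), and reading extraction for kanji.
Several other features are supported. Please consult each dictionary’s Token class for details.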
The example below shows how to use the Kuromoji morphological analyzer in its simplest form: segmenting text into tokens and printing the features of each token.
package com.atilika.kuromoji.example;

import com.atilika.kuromoji.ipadic.Token;
import com.atilika.kuromoji.ipadic.Tokenizer;
import java.util.List;

public class KuromojiExample {
    public static void main(String[] args) {
        // Create a tokenizer backed by the ipadic dictionary.
        Tokenizer tokenizer = new Tokenizer();
        List<Token> tokens = tokenizer.tokenize("お寿司が食べたい。");
        // Print each token's surface form followed by all of its features.
        for (Token token : tokens) {
            System.out.println(token.getSurface() + "\t" + token.getAllFeatures());
        }
    }
}
Make sure you add the dependency below to your pom.xml before building your project.
<dependency>
    <groupId>com.atilika.kuromoji</groupId>
    <artifactId>kuromoji-ipadic</artifactId>
    <version>0.9.0</version>
</dependency>