Abstract:
We present an algorithm for segmenting unrestricted Japanese text that detects up to 98% of the words in a corpus. The segmentation technique, which is simple and extremely fast, does not depend on a lexicon or on any formal notion of what constitutes a word in Japanese, and the training procedure requires no annotated text of any kind. Relying almost exclusively on character type information and a table of hiragana bigram frequencies, the algorithm decides whether or not to create a word boundary. The method divides strings of Japanese characters into units that are computationally tractable and that can also be justified on lexical and syntactic grounds.
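
As a rough illustration of the two information sources named above, the following Python sketch counts hiragana bigrams in raw, unannotated text and then segments a string using character types and those counts. The decision rules are assumptions made for illustration only and are not the rules of the algorithm presented here: a boundary is placed wherever the character type changes, and inside a hiragana run wherever the bigram is rarer than a hypothetical `min_count` threshold. The function names, the threshold, and the coarse Unicode ranges are likewise illustrative.

```python
from collections import Counter


def char_type(ch):
    """Classify a character by Unicode block (coarse, illustrative ranges)."""
    code = ord(ch)
    if 0x3041 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"


def count_hiragana_bigrams(raw_text):
    """Count adjacent hiragana pairs in unannotated text (the 'training' step)."""
    counts = Counter()
    for a, b in zip(raw_text, raw_text[1:]):
        if char_type(a) == "hiragana" and char_type(b) == "hiragana":
            counts[a + b] += 1
    return counts


def segment(text, bigram_counts, min_count=2):
    """Split text into units using assumed rules (not the paper's):
    - different character types on either side: boundary;
    - two hiragana: boundary only if the bigram is rarer than min_count;
    - any other same-type pair (e.g. kanji-kanji): no boundary.
    """
    if not text:
        return []
    units = [text[0]]
    for prev, cur in zip(text, text[1:]):
        same_type = char_type(prev) == char_type(cur)
        if same_type and char_type(cur) == "hiragana":
            boundary = bigram_counts.get(prev + cur, 0) < min_count
        else:
            boundary = not same_type
        if boundary:
            units.append(cur)
        else:
            units[-1] += cur
    return units


if __name__ == "__main__":
    counts = count_hiragana_bigrams("ですですから")
    print(segment("東京です", counts))
```

In this toy setting the bigram です occurs twice in the training string, so it is kept together, while the change from kanji to hiragana produces a boundary. A realistic table would be estimated from a large corpus, and the threshold would need tuning.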