A Probabilistic Algorithm for Segmenting Non-Kanji Japanese Strings

Virginia Teller, Eleanor Olds Batchelder

We present an algorithm for segmenting unrestricted Japanese text that detects up to 98% of the words in a corpus. The segmentation technique is simple and extremely fast. It does not depend on a lexicon or on any formal notion of what a word is in Japanese, and the training procedure requires no annotated text of any kind. Relying almost exclusively on character type information and a table of hiragana bigram frequencies, the algorithm decides at each position whether or not to place a word boundary. The method divides strings of Japanese characters into units that are computationally tractable and that can also be justified on lexical and syntactic grounds.
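
To make the idea concrete, the following is a minimal sketch of a segmenter driven by character types and hiragana bigram counts. It is not the authors' procedure, which the abstract does not spell out. The function names (char_type, train_bigrams, segment), the Unicode ranges used to classify characters, the rule that every change of character type is a boundary, and the min_count threshold are all assumptions made for illustration.

```python
from collections import Counter


def char_type(ch):
    """Classify a character by Unicode block (ranges are an assumption)."""
    code = ord(ch)
    if 0x3041 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"


def train_bigrams(corpus):
    """Count adjacent hiragana pairs in raw, unannotated lines of text."""
    counts = Counter()
    for line in corpus:
        for a, b in zip(line, line[1:]):
            if char_type(a) == "hiragana" and char_type(b) == "hiragana":
                counts[a + b] += 1
    return counts


def segment(text, counts, min_count=2):
    """Split text into pieces.

    Hypothetical decision rule: a boundary is placed wherever the character
    type changes, and inside a hiragana run wherever the bigram was seen
    fewer than min_count times in training.
    """
    if not text:
        return []
    pieces = [text[0]]
    for prev, cur in zip(text, text[1:]):
        if char_type(prev) != char_type(cur):
            boundary = True
        elif char_type(cur) == "hiragana":
            boundary = counts[prev + cur] < min_count
        else:
            boundary = False
        if boundary:
            pieces.append(cur)
        else:
            pieces[-1] += cur
    return pieces


# Usage: with an empty table, only character type changes create boundaries.
print(segment("日本語のテキスト", Counter()))  # ['日本語', 'の', 'テキスト']
```

Training here only counts hiragana pairs in raw text, which matches the abstract's statement that no annotation is needed. How the published algorithm actually turns bigram frequencies into boundary decisions may well differ from the threshold used in this sketch.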

