Proceedings:
Natural Language Processing
Volume/Issue:
Proceedings of the AAAI Conference on Artificial Intelligence, 12
Track:
Corpus-Based Natural Language Processing
Abstract:
We present an algorithm for segmenting unrestricted Japanese text that detects up to 98% of the words in a corpus. The segmentation technique is simple and extremely fast. It does not depend on a lexicon or on any formal notion of what a word is in Japanese, and the training procedure requires no annotated text of any kind. Relying almost exclusively on character type information and a table of hiragana bigram frequencies, the algorithm decides whether or not to create a word boundary. The method divides strings of Japanese characters into units that are computationally tractable and that can also be justified on lexical and syntactic grounds.
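The abstract does not give the decision procedure itself, so the following is only a minimal sketch of how character type information and a hiragana bigram table could be combined, not the authors' algorithm. It assumes a boundary at every change of character type and, inside a hiragana run, a boundary wherever a bigram's score reaches a threshold. The names `char_type` and `segment`, the `boundary_scores` table, the 0.5 threshold, and the score values are all hypothetical.

```python
def char_type(ch):
    """Classify a character by Unicode block (a simplification)."""
    code = ord(ch)
    if 0x3041 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"


def segment(text, boundary_scores, threshold=0.5):
    """Split text into units.

    boundary_scores is a hypothetical table mapping a hiragana bigram
    to a score for placing a boundary between its two characters.
    In practice such scores would be derived from bigram frequencies.
    """
    if not text:
        return []
    units = []
    current = text[0]
    for prev, ch in zip(text, text[1:]):
        prev_type, cur_type = char_type(prev), char_type(ch)
        if prev_type != cur_type:
            # Assumed rule: any change of character type is a boundary.
            boundary = True
        elif cur_type == "hiragana":
            # Inside a hiragana run, consult the bigram table.
            boundary = boundary_scores.get(prev + ch, 0.0) >= threshold
        else:
            boundary = False
        if boundary:
            units.append(current)
            current = ch
        else:
            current += ch
    units.append(current)
    return units


# Illustrative score only; not taken from the paper.
scores = {"れは": 0.9}
print(segment("これはペンです", scores))
```

Splitting at every type change is too coarse for real text, since it would separate a kanji stem from its hiragana inflection, so a practical version would need additional rules that the abstract does not describe.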