Using Wikipedia Links to Construct Word Segmentation Corpora

David Gabay, Ziv Ben-Eliahu, Michael Elhadad

Tagged corpora are essential for evaluating and training natural language processing tools. The cost of constructing sufficiently large manually tagged corpora is high, even when the annotation level is shallow. This article describes a simple method to automatically create a partially tagged corpus, using Wikipedia hyperlinks. We used our method to construct a corpus of Modern Hebrew, which we have made available. The resulting corpus contains information about the correct segmentation of 523,599 non-consecutive words in 363,090 sentences. The method can also be applied to other languages where word segmentation is difficult to determine, such as East and South-East Asian languages.
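
The abstract does not spell out how hyperlinks yield segmentation information. One plausible reading, offered here only as an assumption, is that when prefix letters are written directly before a link (for example a Hebrew preposition glued to a linked noun), the link boundary marks the split between prefix and stem. The sketch below illustrates that idea on raw wikitext; the pattern `LINK` and the function `segmented_words` are hypothetical names, not part of the authors' tooling.

```python
import re

# Assumption: letters glued to the left of a wiki link mark a prefix,
# e.g. ב[[ירושלים]] splits into prefix "ב" and stem "ירושלים".
# Groups: 1 = prefix letters, 2 = link target, 3 = optional display label.
LINK = re.compile(r"(?<!\S)(\w+)\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")

def segmented_words(wikitext):
    """Return (prefix, stem) pairs for links that have attached prefix letters."""
    pairs = []
    for match in LINK.finditer(wikitext):
        prefix, target, label = match.groups()
        # The visible word is the label when the link is piped, else the target.
        pairs.append((prefix, label or target))
    return pairs
```

Links with no attached letters are skipped, which is consistent with a corpus that is only partially tagged: only words that happen to carry a link boundary contribute segmentation information.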

Subjects: 13. Natural Language Processing; 1.10 Information Retrieval

Submitted: May 5, 2008

This page is copyrighted by AAAI. All rights reserved.