Eric Brill and Mitch Marcus
In this paper, we discuss a method for assigning part of speech tags to words in an unannotated text corpus whose structure is completely unknown, with a small amount of help from an informant. Starting from scratch, automated and semiautomated methods are employed to build a part of speech tagger for the text. Building the tagger involves three steps: uncovering a set of part of speech tags, building a lexicon that indicates the most likely tag for each word, and learning rules that both correct mistakes in the learned lexicon and discover where contextual information can repair tagging mistakes. The long-term goal of this work is to create a system that would enable someone to take a large text in a language they do not know and, with only a few hours of help from a speaker of the language, accurately annotate the text with part of speech information.
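
To make the last two steps concrete, the following is a minimal sketch of how a learned lexicon and a contextual correction rule might interact once both exist. It is not the procedure described in this paper: the rule form ("change tag X to Y when the previous tag is Z"), the tag names, the toy lexicon, and the helper names `ContextRule`, `tag_with_lexicon`, and `apply_rules` are all assumptions introduced purely for illustration.

```python
from typing import NamedTuple


class ContextRule(NamedTuple):
    """Hypothetical rule form: retag from_tag as to_tag after prev_tag."""
    from_tag: str
    to_tag: str
    prev_tag: str


def tag_with_lexicon(words, lexicon, default_tag):
    """Assign each word its most likely tag; unknown words get default_tag."""
    return [lexicon.get(word.lower(), default_tag) for word in words]


def apply_rules(tags, rules):
    """Apply each rule in order, left to right, over a copy of the tags."""
    tags = list(tags)
    for rule in rules:
        for i in range(1, len(tags)):
            if tags[i] == rule.from_tag and tags[i - 1] == rule.prev_tag:
                tags[i] = rule.to_tag
    return tags


# Toy data, assumed for illustration only (not taken from the paper).
LEXICON = {"the": "DET", "can": "MODAL", "rusted": "VERB"}
RULES = [ContextRule(from_tag="MODAL", to_tag="NOUN", prev_tag="DET")]

words = ["The", "can", "rusted"]
initial = tag_with_lexicon(words, LEXICON, default_tag="NOUN")
final = apply_rules(initial, RULES)
```

In this toy example the lexicon alone tags "can" with its assumed most likely tag, and the single contextual rule then retags it because it follows a determiner. Keeping the lexicon lookup and the rule pass as separate functions mirrors the separation between the second and third steps above.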