AAAI Publications, Workshops at the Twenty-Fourth AAAI Conference on Artificial Intelligence

Learning from the Web: Extracting General World Knowledge from Noisy Text
Jonathan Gordon, Benjamin Van Durme, Lenhart K. Schubert

Last modified: 2010-07-07


The quality and nature of the knowledge that an automated knowledge-extraction system can acquire depend on its inputs. For systems that learn by reading text, the Web offers breadth of topics and currency, but it also presents the problems of casual, unedited writing, non-textual inputs, and the mingling of languages. The results of extraction with the KNEXT system on two Web corpora, Wikipedia and a collection of weblog entries, indicate that, with automatic filtering of the output, even ungrammatical writing on arbitrary topics can yield an extensive knowledge base that human judges find to be of good quality: filtered propositions received an average score of 2.34 across both corpora (on a scale of 1 to 5, where lower is better), versus 3.00 for unfiltered output from the same sources.


Keywords: knowledge extraction; knowledge acquisition; information extraction; Wikipedia; weblogs; corpora; KNEXT; compositional semantic interpretation; knowledge base
