Beginning to Understand Unstructured, Ungrammatical Text: An Information Integration Approach

Matthew Michelson, Craig A. Knoblock

As information agents become pervasive, they will need to read and understand the vast amount of information on the World Wide Web. One valuable source of such information is the unstructured, ungrammatical text found in data sources such as online auctions and internet classifieds. One way to begin to understand this text is to determine the entities that it references. This can be cast as a semantic annotation problem, where the goal is to extract the attributes embedded within the text and then annotate the text with these extracted attributes. If enough attributes can be extracted, the entity referenced in the text can be determined. For example, given a classified ad for a used car, identifying the make, model, and year within the post identifies the car for sale. However, information extraction is difficult because the text does not contain reliable structural or grammatical clues. In this paper we present an unsupervised approach to semantically annotating such unstructured and ungrammatical text, with the intention that this will help with the problem of machine understanding on the Web. Furthermore, we define an architecture that allows for better understanding over time. We present experiments showing that our annotation approach is competitive with the state of the art, which uses supervised machine learning, even though our technique is fully unsupervised.
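
To make the used-car example concrete, the following is a minimal sketch of annotating a post with make, model, and year. It is an illustration under stated assumptions, not the method evaluated in the paper: the small `REFERENCE_SET`, the `annotate` function, the year pattern, and the sample post are all hypothetical, and the matching is a simple token-overlap heuristic that requires no labeled training data.

```python
import re

# Hypothetical reference set of known cars (illustrative data only,
# not taken from the paper).
REFERENCE_SET = [
    {"make": "Honda", "model": "Civic"},
    {"make": "Honda", "model": "Accord"},
    {"make": "Toyota", "model": "Corolla"},
]

# Assumed range of plausible model years: 1950 through 2049.
YEAR_PATTERN = re.compile(r"\b(19[5-9]\d|20[0-4]\d)\b")


def tokenize(text):
    """Lowercase the post and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def annotate(post):
    """Return the attributes found in the post, keyed by attribute name."""
    tokens = set(tokenize(post))
    annotation = {}

    # Pick the reference record that shares the most attribute values
    # with the post; on a tie the earlier record wins.
    best, best_score = None, 0
    for record in REFERENCE_SET:
        score = sum(1 for value in record.values() if value.lower() in tokens)
        if score > best_score:
            best, best_score = record, score

    # Annotate only with the attribute values that actually occur in the post.
    if best is not None:
        for name, value in best.items():
            if value.lower() in tokens:
                annotation[name] = value

    # The year is not in the reference set, so use the first pattern match.
    year = YEAR_PATTERN.search(post)
    if year:
        annotation["year"] = year.group(1)
    return annotation


print(annotate("2001 honda civic ex - low mi, runs grt $3500 obo"))
```

The sketch relies only on token overlap with known attribute values, so it does not depend on the structural or grammatical cues that such posts lack. A realistic system would also need to tolerate misspellings and abbreviations, which this heuristic ignores.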

Subjects: 12. Machine Learning and Discovery; 1.10 Information Retrieval

Submitted: Dec 15, 2006
