The Use of Web-based Statistics to Validate Information Extraction

Stephen Soderland, Oren Etzioni, Tal Shaked, and Daniel S. Weld

The World Wide Web is a powerful and readily available text corpus that can be used effectively to validate the output of an information extraction system. We present experiments that explore how pointwise mutual information (PMI) from search engine hit counts can be used in an Assessor module that assigns a probability that an extracted fact or relationship is correct, thus boosting precision. We find that thresholding on PMI scores is more effective in creating features for the Assessor than using probability density models. Bootstrapping can be effective in finding both positive and negative seeds to train the Assessor, performing better than hand-tagging a sample of actual extractions.
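
As a rough illustration of the thresholding idea, the sketch below turns hit counts into a PMI-style score and then into a Boolean feature. It rests on assumptions not stated in the abstract: the score is taken to be the ratio of hits for an instance together with a discriminator phrase to hits for the instance alone, the threshold is chosen to best separate positive from negative seed scores, and all function names and hit counts are hypothetical.

```python
from typing import List, Sequence


def pmi_score(hits_joint: int, hits_instance: int) -> float:
    """PMI-style ratio (assumed form): joint hits divided by instance hits.

    Returns 0.0 when the instance has no hits, to avoid dividing by zero.
    """
    if hits_instance <= 0:
        return 0.0
    return hits_joint / hits_instance


def choose_threshold(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Pick the candidate threshold that classifies the most seeds correctly.

    A positive seed counts as correct when its score is >= the threshold,
    a negative seed when its score is < the threshold. Ties go to the
    lowest candidate. This is a hypothetical training rule, for illustration.
    """
    candidates = sorted(set(pos_scores) | set(neg_scores))
    if not candidates:
        raise ValueError("need at least one seed score")
    best_t, best_correct = candidates[0], -1
    for t in candidates:
        correct = sum(s >= t for s in pos_scores) + sum(s < t for s in neg_scores)
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t


def threshold_features(scores: Sequence[float], thresholds: Sequence[float]) -> List[bool]:
    """One Boolean feature per discriminator: is the score at or above its threshold?"""
    if len(scores) != len(thresholds):
        raise ValueError("scores and thresholds must have the same length")
    return [s >= t for s, t in zip(scores, thresholds)]


# Made-up seed scores for a single discriminator phrase.
positive_seeds = [0.5, 0.4]
negative_seeds = [0.1, 0.05]
threshold = choose_threshold(positive_seeds, negative_seeds)

# Made-up hit counts for a new extraction: 30 joint hits, 1000 instance hits.
score = pmi_score(30, 1000)
features = threshold_features([score], [threshold])
```

Boolean features of this kind could then be fed to any probabilistic classifier playing the role of the Assessor; the abstract does not specify which model is used, so that step is left out here.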

This page is copyrighted by AAAI. All rights reserved.