Learning to Query the Web

William W. Cohen and Yoram Singer

The World Wide Web (WWW) is filled with "resource directories"--i.e., documents that collect together links to all known documents on a specific topic. Keeping resource directories up-to-date is difficult because of the rapid growth in online documents. We propose using machine learning methods to address this problem. In particular, we propose to treat a resource directory as a list of positive examples of an unknown concept, and then use machine learning methods to construct from these examples a definition of the unknown concept. If the learned definition is in the appropriate form, it can be translated into a query, or series of queries, for a WWW search engine. This query can be used at a later date to detect any new instances of the concept. We present experimental results with two implemented systems, and two learning methods. One system is interactive, and is implemented as an augmented WWW browser; the other is a batch system, which can collect and label documents without any human intervention. The learning methods are the RIPPER rule learning system, and a rule-learning version of a new online weight allocation algorithm called the sleeping experts prediction algorithm. The experiments are performed on real data obtained from the WWW.

This page is copyrighted by AAAI. All rights reserved. Your use of this site constitutes acceptance of all of AAAI's terms and conditions and privacy policy.