William W. Cohen, AT&T Labs - Research
We present general-purpose methods for recognizing certain types of structure in HTML documents. The methods are implemented using WHIRL, a "soft" logic that incorporates a notion of textual similarity developed in the information retrieval community. In an experimental evaluation on 82 Web pages, the top-ranked structure is "meaningful" (a structure that was used in a hand-coded "wrapper" for the page) nearly 70% of the time, improving on a value of 50% obtained by an earlier method. With appropriate background information, the structure-recognition methods we describe can also be used to learn a wrapper from examples, or to maintain a wrapper as a Web page changes format. In these settings, this measure of performance can be improved to 85%.