Proceedings:
Proceedings of the AAAI Conference on Artificial Intelligence, 16
Track:
Technical Papers
Abstract:
We present general-purpose methods for recognizing certain types of structure in HTML documents. The methods are implemented using WHIRL, a "soft" logic that incorporates a notion of textual similarity developed in the information retrieval community. In an experimental evaluation on 82 Web pages, the top-ranked structure is "meaningful" (a structure that was used in a hand-coded "wrapper" for the page) nearly 70% of the time, improving on a value of 50% obtained by an earlier method. With appropriate background information, the structure-recognition methods we describe can also be used to learn a wrapper from examples, or to maintain a wrapper as a Web page changes format. In these settings, this measure of performance can be improved to 85%.
ISBN 978-0-262-51106-3
July 18-22, 1999, Orlando, Florida. Published by The AAAI Press, Menlo Park, California.