Learning Page-Independent Heuristics for Extracting Data from Web Pages

William Cohen and Wei Fan

One bottleneck in implementing a system that intelligently queries the Web is developing "wrappers"--programs that extract data from Web pages. Here we describe a method for learning general, page-independent heuristics for extracting data from HTML documents. The input to our learning system is a set of working wrapper programs, paired with HTML pages they correctly wrap. The output is a general procedure for extracting data that works for many formats and many pages. In experiments with a collection of 84 constrained but realistic extraction problems, we demonstrate that 30% of the problems can be handled perfectly by learned extraction heuristics, and around 50% can be handled acceptably. We also demonstrate that learned page-independent extraction heuristics can substantially improve the performance of methods for learning page-specific wrappers.

This page is copyrighted by AAAI. All rights reserved.