Abstract:
Recent work on Internet information integration assumes a library of wrappers, which are specialized information extraction procedures. Maintaining wrappers is difficult, because the formatting regularities on which they rely often change. The wrapper verification problem is to determine whether a wrapper is correct. Standard regression testing approaches are inappropriate, because both the formatting regularities and a site’s underlying content may change. We introduce RAPTURE, a fully implemented, domain-independent verification algorithm. RAPTURE uses well-motivated heuristics to compute the similarity between a wrapper’s expected and observed output. Experiments with 27 actual Internet sites show a substantial performance improvement over standard regression testing.

Registration: ISBN 978-0-262-51106-3
Copyright: July 18-22, 1999, Orlando, Florida. Published by The AAAI Press, Menlo Park, California.