Automated Analysis of Structured Online Documents

Vladimir Kulyukin, Kristian Hammond, and Robin Burk

We believe that domain-specific knowledge of how information is structurally organized is central to the human ability to deal efficiently with large quantities of data. A better understanding of the computational nature of this ability may lead to solutions to information retrieval problems of practical significance. This paper outlines a Theory of Document Presentation (DPT), which addresses how information can be structurally organized in the documents of a given domain and how the standards for such organization emerge. The paper also describes FAQ Minder, a document processing system whose implementation was guided by DPT. FAQ Minder processes FAQs, the files of "Frequently Asked Questions" associated with USENET newsgroups [1,2]. The system identifies and tags the logical components of FAQs: network headers, tables of contents, sections, glossaries, questions, answers, and bibliographies.
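
To make the idea of tagging logical components concrete, the following is a minimal sketch in Python of a line-level tagger for three of the components listed above (network headers, questions, and answers). It is not FAQ Minder's method, which this abstract does not specify: the patterns HEADER_RE and QUESTION_RE, the function tag_lines, and the tag names are all hypothetical, chosen only to illustrate how structural conventions might be exploited.

```python
import re

# Hypothetical patterns (assumptions, not taken from the paper):
# a network header line looks like "Field-Name: value".
HEADER_RE = re.compile(r"^[A-Za-z][A-Za-z-]*:\s+\S")
# a question starts with "Q:", "[1-1]", or "1.2." / "1.2)" and ends with "?".
QUESTION_RE = re.compile(
    r"^(?:Q:|\[\d+(?:[.-]\d+)*\]|\d+(?:\.\d+)*[.)])\s+.*\?\s*$"
)


def tag_lines(text):
    """Return (tag, line) pairs for every non-blank line of a FAQ-like text."""
    tagged = []
    in_header = True        # assume headers run until the first blank line
    after_question = False  # lines after a question are treated as its answer
    for line in text.splitlines():
        if not line.strip():
            in_header = False
            continue
        if in_header and HEADER_RE.match(line):
            tag = "header"
        elif QUESTION_RE.match(line):
            in_header = False
            after_question = True
            tag = "question"
        elif after_question:
            tag = "answer"
        else:
            in_header = False
            tag = "other"
        tagged.append((tag, line))
    return tagged
```

Calling tag_lines on a FAQ's raw text yields one (tag, line) pair per non-blank line. A fuller treatment would also need rules for tables of contents, sections, glossaries, and bibliographies, which this sketch leaves out.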
