Information Extraction from Online XML-encoded Documents

Patricia Lutsky

Online reference documents tend to be semi-formatted: they contain repeated sections with similar structure, with free text inside each section. XML (Extensible Markup Language) lets document designers define rich tag sets in which the tags for section headings carry information about each section. This contextual information, coupled with the fact that the free-text sections of the documents use a limited sublanguage, means that simple natural-language-based techniques can be used to extract facts from semi-formatted online documents. The SIFT document parser system has demonstrated results for this type of extraction in the area of software testing.
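
As a rough illustration of the idea, and not a description of how SIFT itself works, the sketch below pairs the context supplied by XML section tags with a few simple patterns over the free text. The tag names (command, qualifier, text), the sample document, and the two patterns are all assumptions invented for this example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical semi-formatted reference document. The tag names and
# attribute names are invented for illustration only.
DOC = """
<reference>
  <command name="copy">
    <qualifier name="/LOG">
      <text>The default value is OFF.</text>
    </qualifier>
    <qualifier name="/BLOCK_SIZE">
      <text>The default value is 512. Valid values range from 1 to 65535.</text>
    </qualifier>
  </command>
</reference>
"""

# A limited sublanguage means a small set of patterns can cover many
# sentences. These two patterns are assumptions, not a real grammar.
PATTERNS = [
    ("default", re.compile(r"default value is (\w+)", re.IGNORECASE)),
    ("range", re.compile(r"range from (\d+) to (\d+)", re.IGNORECASE)),
]


def extract_facts(xml_text):
    """Return (command, qualifier, fact kind, values) tuples."""
    root = ET.fromstring(xml_text.strip())
    facts = []
    for command in root.iter("command"):
        for qualifier in command.iter("qualifier"):
            # The enclosing tags say what the free text is about.
            text = " ".join(qualifier.itertext())
            for kind, pattern in PATTERNS:
                for match in pattern.finditer(text):
                    facts.append(
                        (command.get("name"), qualifier.get("name"),
                         kind, match.groups())
                    )
    return facts


if __name__ == "__main__":
    for fact in extract_facts(DOC):
        print(fact)
```

In this sketch the tags do most of the work: each sentence is interpreted only after the enclosing command and qualifier are known, so the patterns never need to resolve what "the default value" refers to.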


This page is copyrighted by AAAI. All rights reserved.