Track:
Information Extraction
Abstract:
Online reference documents tend to be semi-formatted: they contain repeated sections with similar structure, with free text inside each section. XML (Extensible Markup Language) lets document designers define rich tag sets in which the tags for section headings carry information about each section. This contextual information, coupled with the fact that the free-text sections of the documents use a limited sublanguage, means that simple natural-language-based techniques can be used to extract facts from semi-formatted online documents. The SIFT document parser system has demonstrated results for this type of extraction in the area of software testing.
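The idea of using section tags as context for simple free-text patterns can be illustrated with a minimal sketch. This is not the SIFT implementation: the sample document, the tag names (`command`, `purpose`, `restriction`), the `PATTERNS` table, and the `extract_facts` function are all hypothetical, chosen only to show how a tag can select which sublanguage pattern applies to the text inside it.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical reference document; tag names and text are invented for illustration.
DOC = """
<reference>
  <command name="copy">
    <purpose>Copies a file to a new location.</purpose>
    <restriction>The target must not be a directory.</restriction>
  </command>
  <command name="delete">
    <purpose>Removes a file from disk.</purpose>
    <restriction>The file must not be open.</restriction>
  </command>
</reference>
"""

# One simple pattern per section tag. The tag supplies the context, so each
# pattern only has to cover the limited sublanguage used in that section.
PATTERNS = {
    "purpose": re.compile(r"^(?P<verb>\w+) (?P<object>.+?)\.$"),
    "restriction": re.compile(r"^The (?P<subject>\w+) must not (?P<condition>.+?)\.$"),
}


def extract_facts(xml_text):
    """Return (command name, section tag, matched fields) for each section that matches."""
    facts = []
    root = ET.fromstring(xml_text)
    for command in root.iter("command"):
        for section in command:
            pattern = PATTERNS.get(section.tag)
            if pattern is None:
                continue  # no pattern defined for this kind of section
            match = pattern.match((section.text or "").strip())
            if match:
                facts.append((command.get("name"), section.tag, match.groupdict()))
    return facts


facts = extract_facts(DOC)
for fact in facts:
    print(fact)
```

Because every section of a given type is written in the same restricted style, a single regular expression per tag is enough in this toy case. A real document set would likely need more patterns per section and some handling for text that matches none of them.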