Layout and Language: A Corpus of Documents Containing Tables

Matthew Hurst

Though the field of information extraction (IE) has been one of the most successful to come from the NLP/CL community, it has thus far concentrated on documents with simple logical structure. It seems appropriate, however, to consider more complex documents, not solely because of improvements in extraction technologies, but also because of the content and meta-information that complex structure can offer the IE task. One often-used but never exploited component of a more complex document model is the table. Its utility lies in compact information presentation, a definite boon for IE processes; however, it can also offer more information for discourse and domain knowledge sub-processes. A table processing system (TabPro) is being developed, and part of that research requires the construction of a corpus of documents containing tables. This corpus is used for training classification processes and evaluating the performance of the system as a whole. A complete description of the model of tables mentioned in this paper can be found in Hurst (1999).
