Abstract:
Though the field of information extraction (IE) has been one of the most successful to come from the NLP/CL community, it has thus far concentrated on documents with simple logical structure. It seems appropriate, however, to consider more complex documents, not solely because of improvements in extraction technologies but also because of the content and meta-information that complex structure can offer the IE task. One often-used but never exploited component of a more complex document model is the table. Its utility lies in compact information presentation, a definite boon for IE processes; however, it can also offer more information for discourse and domain knowledge sub-processes. A table processing system (TabPro) is being developed, and part of that research requires the construction of a corpus of documents containing tables. This corpus is used for training classification processes and evaluating the performance of the system as a whole. A complete description of the model of tables mentioned in this paper can be found in Hurst (1999).