Thomas Y. Lee
In this paper, we describe a novel approach for learning to extract content from the text segments of regulatory filings to support competitive analysis and regulatory audit. Existing strategies that rely upon an explicit schema or a training set of representative documents are poorly suited to managing thousands of idiosyncratic submissions by independent filers. We introduce a technique that learns from regulatory instructions instead. Knowledge about document structure is drawn from the policy documents to initialize a set of extraction patterns. The patterns are then relaxed to tolerate a single insertion, deletion, or substitution error within an individual filing. Preliminary results are reported on several sets of filings submitted to the SEC in 2004 and 2005.
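To make the idea of relaxing a pattern concrete, the following minimal sketch checks whether an observed heading matches an expected one up to a single insertion, deletion, or substitution. It is an illustration only, not the method used in the paper: the token-level granularity, the function name `within_one_edit`, and the sample heading are all assumptions introduced here.

```python
def within_one_edit(pattern, observed):
    """Return True if `observed` equals `pattern` or differs from it by
    exactly one token insertion, deletion, or substitution.

    Both arguments are lists of tokens. Working at the token level is an
    assumption of this sketch; the same logic applies to characters.
    """
    m, n = len(pattern), len(observed)
    if abs(m - n) > 1:
        return False  # more than one insertion or deletion is needed

    # Advance past the common prefix.
    i = 0
    while i < m and i < n and pattern[i] == observed[i]:
        i += 1

    if i == m and i == n:
        return True  # exact match
    if m == n:
        # Substitution: everything after the mismatch must agree.
        return pattern[i + 1:] == observed[i + 1:]
    if m > n:
        # Deletion: the filing omitted pattern[i].
        return pattern[i + 1:] == observed[i:]
    # Insertion: the filing has one extra token at position i.
    return pattern[i:] == observed[i + 1:]


# Hypothetical section heading taken from a set of filing instructions.
expected = ["item", "1", "business"]

print(within_one_edit(expected, ["item", "one", "business"]))  # substitution
print(within_one_edit(expected, ["item", "business"]))         # deletion
print(within_one_edit(expected, ["item", "one", "overview"]))  # two errors
```

Limiting the tolerance to one edit keeps the relaxed pattern close to the wording of the instructions, which trades recall on heavily reworded filings for fewer spurious matches.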
Subjects: 10. Knowledge Acquisition; 8. Enabling Technologies
Submitted: May 10, 2007