Discovery in Textual Documents
We report on extensive experiments using rule-based induction methods for document classification. The goal is to discover classification patterns automatically, potentially surpassing the humans who currently read and classify these documents. Because we use a decision rule model, the induced results take a form compatible with expensive human-engineered systems that have recently demonstrated excellent performance. Using computer-intensive rule induction methods, we have conducted experiments over a large set of document families, including UPI, Reuters, NTIS, and the Library of Congress Catalog. We report on several approaches to classic problems in such applications, including choosing the right representation for text and handling high dimensionality.
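
As a rough illustration of what a decision rule model over text might look like, the sketch below represents each document as a set of words and applies rules of the form "if all of these words occur, assign this topic". The rule set, topic labels, and function names are hypothetical and are not taken from the experiments reported here. Induced rules would be learned from labeled documents, not written by hand.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of word tokens (a binary bag of words)."""
    return set(re.findall(r"[a-z]+", text.lower()))

# Hypothetical hand-written rules standing in for induced ones.
# Each rule is (required words, topic label); it fires when every
# required word occurs somewhere in the document.
RULES = [
    ({"wheat", "export"}, "grain"),
    ({"crude", "barrel"}, "oil"),
    ({"interest", "rate"}, "money"),
]

def classify(text, rules=RULES):
    """Return the sorted list of topics whose rule fires on the document."""
    words = tokenize(text)
    return sorted({label for required, label in rules if required <= words})

if __name__ == "__main__":
    print(classify("Wheat export quotas rose"))
    print(classify("Crude prices per barrel fell as the interest rate rose"))
```

Rules in this form stay readable to the people who maintain hand-built classifiers, which is the compatibility the abstract refers to. The set-of-words representation is only one possible choice. With a realistic vocabulary it already yields thousands of candidate features, so some form of feature selection would be needed before rule induction.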