Compression-Based Induction and Genome Data

Rattikorn Hewett, John Leuchner, Choh Man Teng, Sean D. Mooney, and Teri E. Klein

Our previous work developed SORCER, a learning system that induces a set of rules from a data set represented as a second-order decision table. Second-order decision tables are database relations in which rows have sets of atomic values as components. Using sets of values, which are interpreted as disjunctions, provides compact representations that facilitate efficient management and enhance comprehensibility. SORCER generates classifiers with a near-minimum number of rows. The induction algorithm can be viewed as a table compression technique in which a table of training data is transformed into a second-order table with fewer rows by merging rows in ways that preserve consistency with the training data. In this paper we propose three new mechanisms in SORCER: (1) compression by removal of table columns, (2) inclusion of simple rules based on statistics, and (3) a method for partitioning continuous data into discrete clusters. We apply our approach to classify clinical phenotypes of a genetic collagenous disorder, Osteogenesis imperfecta, using a data set of point mutations in the COL1A1 gene. Preliminary results show that, averaged over ten 10-fold cross-validations, SORCER obtained an error estimate of 16.7%, compared with 35.1% for the decision tree learner C4.5.
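The row-merging view of induction described above can be illustrated with a small sketch. This is not SORCER's algorithm: it is a naive greedy merge written only to show what a second-order row is (one set of allowed values per attribute, read as a disjunction) and what it means for a merge to preserve consistency with the training data. The function names, the greedy strategy, and the toy attribute values are all assumptions made for illustration.

```python
# Illustrative sketch only; names and the greedy strategy are hypothetical,
# not taken from SORCER.

def covers(row, example):
    """True if each attribute value of the example lies in the row's value set."""
    return all(value in allowed for value, allowed in zip(example, row))

def merge(row_a, row_b):
    """Attribute-wise union of the value sets of two second-order rows."""
    return tuple(a | b for a, b in zip(row_a, row_b))

def consistent(row, label, training):
    """A row is consistent if it covers no training example of another class."""
    return not any(covers(row, x) and y != label for x, y in training)

def compress(training):
    """Start with one row per example; merge same-class rows while consistency holds."""
    rows = [(tuple(frozenset([v]) for v in x), y) for x, y in training]
    merged = True
    while merged:
        merged = False
        for i in range(len(rows)):
            for j in range(i + 1, len(rows)):
                (row_a, label_a), (row_b, label_b) = rows[i], rows[j]
                if label_a != label_b:
                    continue
                candidate = merge(row_a, row_b)
                if consistent(candidate, label_a, training):
                    rows[i] = (candidate, label_a)
                    del rows[j]
                    merged = True
                    break
            if merged:
                break
    return rows

# Toy training data (made-up attribute values and class labels).
training = [
    (("x1", "p"), "class_a"),
    (("x1", "q"), "class_a"),
    (("x2", "p"), "class_b"),
    (("x2", "q"), "class_b"),
]
table = compress(training)
```

In this toy case the four first-order rows collapse into two second-order rows, one per class, because the second attribute can take either value without contradicting any training example. A greedy pass like this gives no guarantee of reaching a near-minimum number of rows.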

This page is copyrighted by AAAI. All rights reserved.