Learning Classifiers When The Training Data Is Not IID

Murat Dundar, Balaji Krishnapuram, Jinbo Bi, R. Bharat Rao

Most methods for classifier design assume that the training samples are drawn independently and identically from an unknown data-generating distribution, although this assumption is violated in many real-life problems. Relaxing this IID assumption, we consider algorithms from the statistics literature for the more realistic situation where batches or sub-groups of training samples may have internal correlations, although samples from different batches may be considered uncorrelated. Next, we propose simpler (more efficient) variants that scale well to large datasets; theoretical results are provided to support their validity. Experimental results from real-life computer-aided diagnosis (CAD) problems indicate that relaxing the IID assumption leads to statistically significant improvements in the accuracy of the learned classifier. Surprisingly, the simpler algorithm proposed here is experimentally found to be even more accurate than the original version.
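To make the non-IID setting described above concrete, the following is a minimal illustrative sketch (not the authors' algorithm) of batch-correlated data: samples within a batch share a random batch-level offset, so they are correlated with one another, while samples from different batches are independent. All parameters (number of batches, batch size, effect scale) are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 50 batches, 20 samples each, 5 features.
n_batches, batch_size, dim = 50, 20, 5

# Each batch draws a shared offset; every sample in that batch inherits it,
# which induces intra-batch correlation while batches stay independent.
batch_effects = rng.normal(scale=2.0, size=(n_batches, dim))
noise = rng.normal(size=(n_batches, batch_size, dim))
X = batch_effects[:, None, :] + noise  # shape (n_batches, batch_size, dim)

# Empirical check on feature 0: two samples from the SAME batch co-vary
# through the shared offset (theoretical corr = 4/(4+1) = 0.8), whereas
# samples from DIFFERENT batches are uncorrelated.
within_batch = np.corrcoef(X[:, 0, 0], X[:, 1, 0])[0, 1]
cross_batch = np.corrcoef(X[:-1, 0, 0], X[1:, 1, 0])[0, 1]
print(f"within-batch corr = {within_batch:.2f}, cross-batch corr = {cross_batch:.2f}")
```

A classifier that treats all n_batches * batch_size samples as IID overstates its effective sample size in such data; the abstract's point is that modeling the batch structure explicitly yields statistically significant accuracy gains.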

Subjects: 12. Machine Learning and Discovery; 9.3 Mathematical Foundations

Submitted: Oct 11, 2006

This page is copyrighted by AAAI. All rights reserved.