Machine Learning for Imbalanced Datasets: Application in Medical Diagnostic

Luis Mena, Jesus A. Gonzalez

In this paper, we present a new rule induction algorithm for machine learning in medical diagnosis. Medical datasets, like many other real-world datasets, exhibit an imbalanced class distribution. However, this is not the only problem to solve for this kind of dataset: beyond the poor classification accuracy caused by the class distribution, other problems must also be considered. We therefore propose a different strategy based on maximizing the classification accuracy of the minority class, as opposed to the commonly used sampling and cost techniques. Our experiments were conducted using an original dataset for cardiovascular disease diagnosis and three public datasets. The experiments were performed using standard classifiers (Naïve Bayes, C4.5 and k-Nearest Neighbor), emergent classifiers (Neural Networks and Support Vector Machines) and other classifiers used for imbalanced datasets (Ripper and Random Forest). In all the tests, our algorithm showed competitive results in terms of accuracy and area under the ROC curve, and outperformed the other classifiers in terms of comprehensibility and validity.
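To illustrate why overall accuracy can mislead on an imbalanced dataset, and why the accuracy of the minority class is a natural quantity to look at, consider the minimal Python sketch below. It is not the algorithm presented in the paper: the labels are synthetic (5 positive cases out of 100), and the helper names `accuracy` and `class_accuracy` are hypothetical, introduced only for this illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of all examples that are classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def class_accuracy(y_true, y_pred, label):
    """Fraction of the examples of one class that are classified correctly."""
    indices = [i for i, t in enumerate(y_true) if t == label]
    return sum(y_pred[i] == label for i in indices) / len(indices)


# Synthetic, illustrative labels: 1 = minority class (e.g. sick), 0 = majority class.
y_true = [1] * 5 + [0] * 95

# A trivial classifier that always predicts the majority class.
majority_pred = [0] * 100

overall = accuracy(y_true, majority_pred)
minority = class_accuracy(y_true, majority_pred, label=1)

print(f"overall accuracy:        {overall:.2f}")   # 0.95
print(f"minority-class accuracy: {minority:.2f}")  # 0.00
```

The trivial classifier reaches 95% overall accuracy while detecting none of the minority cases, which is the kind of result that reporting minority-class accuracy makes visible.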

Subjects: 12. Machine Learning and Discovery; 1.5 Diagnosis

Submitted: Feb 10, 2006

This page is copyrighted by AAAI. All rights reserved.