Luis Mena, Jesus A. Gonzalez
In this paper, we present a new rule induction algorithm for machine learning in medical diagnosis. Medical datasets, like many other real-world datasets, exhibit an imbalanced class distribution. However, this is not the only problem to solve for this kind of dataset; we must also consider problems other than the poor classification accuracy caused by the class distribution. Therefore, we propose a different strategy based on maximizing the classification accuracy of the minority class, as opposed to the commonly used sampling and cost-based techniques. Our experiments were conducted using an original dataset for cardiovascular disease diagnosis and three public datasets. The experiments were performed using standard classifiers (Naïve Bayes, C4.5 and k-Nearest Neighbor), emergent classifiers (Neural Networks and Support Vector Machines) and other classifiers used for imbalanced datasets (Ripper and Random Forest). In all the tests, our algorithm showed competitive results in terms of accuracy and area under the ROC curve, but outperformed the other classifiers in terms of comprehensibility and validity.
Subjects: 12. Machine Learning and Discovery; 1.5 Diagnosis
Submitted: Feb 10, 2006