Learning from Imbalanced Data: Rank Metrics and Extra Tasks

Rich Caruana

Imbalanced data creates two problems for machine learning. First, even when the training set is large, the smaller classes may contain few examples, and learning accurate models from small samples is hard. Multitask learning is one way to learn more accurate models from small samples, and it is particularly well suited to imbalanced data. Second, the usual error metrics (e.g., accuracy or squared error) cause learning to pay more attention to large classes than to small classes. This problem can be mitigated by careful selection of the error metric. We find that rank-based error metrics often perform better when an important class is under-represented.
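
The second problem can be illustrated with a small sketch. The abstract does not name a specific rank-based metric, so this example assumes the area under the ROC curve (AUC) as one common choice; the data, the two models, and the 0.5 threshold are all hypothetical and chosen only to make the contrast visible.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of cases whose thresholded score matches the 0/1 label."""
    correct = sum((s >= threshold) == bool(y) for y, s in zip(labels, scores))
    return correct / len(labels)


def auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. This is a rank-based metric: it depends
    only on the ordering of the scores, not on their values.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical data: 95 majority-class cases (label 0), 5 rare cases (label 1).
labels = [0] * 95 + [1] * 5

# Model A gives every case the same low score: it ignores the rare class.
scores_a = [0.1] * 100

# Model B ranks every rare case above every majority case, but all of
# its scores still fall below the assumed 0.5 decision threshold.
scores_b = [0.2] * 95 + [0.4] * 5

acc_a, acc_b = accuracy(labels, scores_a), accuracy(labels, scores_b)
auc_a, auc_b = auc(labels, scores_a), auc(labels, scores_b)

print(f"accuracy: A={acc_a:.2f}  B={acc_b:.2f}")
print(f"AUC:      A={auc_a:.2f}  B={auc_b:.2f}")
```

Under these assumptions both models score 0.95 on accuracy, because predicting the large class for every case is already right 95% of the time. AUC separates them: 0.5 for the model that ignores the rare class and 1.0 for the model that ranks the rare cases first.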

This page is copyrighted by AAAI. All rights reserved.