DyS: A Framework for Mixture Models in Quantification

Authors

André Maletzke

Universidade de São Paulo

Denis dos Reis

Universidade de São Paulo

Everton Cherman

Universidade de São Paulo

Gustavo Batista

Universidade de São Paulo

Proceedings:

No. 1: AAAI-19, IAAI-19, EAAI-20

Volume

Issue:

Proceedings of the AAAI Conference on Artificial Intelligence, 33

Track:

AAAI Technical Track: Machine Learning

Downloads:

Download PDF

Abstract:

Quantification is an expanding research topic in Machine Learning literature. While in classification we are interested in obtaining the class of individual observations, in quantification we want to estimate the total number of instances that belong to each class. This subtle difference allows the development of several algorithms that incur smaller and more consistent errors than counting the classes issued by a classifier. Among such new quantification methods, one particular family stands out due to its accuracy, simplicity, and ability to operate with imbalanced training samples: Mixture Models (MM). Despite these desirable traits, MM, as a class of algorithms, lacks a more in-depth understanding concerning the influence of internal parameters on its performance. In this paper, we generalize MM with a base framework called DyS: Distribution y-Similarity. With this framework, we perform a thorough evaluation of the most critical design decisions of MM models. For instance, we assess 15 dissimilarity functions to compare histograms with varying numbers of bins from 2 to 110 and, for the first time, make a connection between quantification accuracy and test sample size, with experiments covering 24 public benchmark datasets. We conclude that, when tuned, Topsøe is the histogram distance function that consistently leads to smaller quantification errors and, therefore, is recommended to general use, notwithstanding Hellinger Distance’s popularity. To rid MM models of the dependency on a choice for the number of histogram bins, we introduce two dissimilarity functions that can operate directly on observations. We show that SORD, one of such measures, presents performance that is slightly inferior to the tuned Topsøe, while not requiring the sensible parameterization of the number of bins.

DOI:

10.1609/aaai.v33i01.33014552

AAAI

Proceedings of the AAAI Conference on Artificial Intelligence, 33

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.