Clustering Small Samples With Quality Guarantees: Adaptivity With One2all PPS

Authors

Edith Cohen

Google Research, Tel Aviv University

Shiri Chechik

Tel Aviv University

Haim Kaplan

Tel Aviv University; Google Research

Published:

2018-02-08

Proceedings:

Proceedings of the AAAI Conference on Artificial Intelligence, 32

Issue:

Thirty-Second AAAI Conference on Artificial Intelligence 2018

Track:

AAAI Technical Track: Machine Learning

Abstract:

Clustering of data points is a fundamental tool in data analysis. We consider points X in a relaxed metric space, where the triangle inequality holds within a constant factor. A clustering of X is a partition of X defined by a set of points Q (centroids), according to the closest centroid. The cost of clustering X by Q is V(Q) = ∑_{x ∈ X} d_{xQ}, where d_{xQ} is the distance from x to its closest centroid in Q. This formulation generalizes classic k-means clustering, which uses squared distances. Two basic tasks, parametrized by k ≥ 1, are cost estimation, which returns (approximate) V(Q) for queries Q such that |Q| = k, and clustering, which returns an (approximate) minimizer of V(Q) of size |Q| = k. When the data set X is very large, we seek efficient constructions of small samples that can act as surrogates for performing these tasks. Existing constructions that provide quality guarantees, however, are either worst-case, and unable to benefit from the structure of real data sets, or make explicit strong assumptions on the structure. We show here how to avoid both these pitfalls using adaptive designs. At the core of our design are the novel one2all probabilities, computed for a set M of centroids and α ≥ 1: The clustering cost of each Q with cost V(Q) ≥ V(M)/α can be estimated well from a sample of size O(α |M| ε^{-2}). For cost estimation, we apply one2all with a bicriteria approximate M, while adaptively balancing |M| and α to optimize sample size per quality. For clustering, we present a wrapper that adaptively applies a base clustering algorithm to a sample S, using the smallest sample that provides the desired statistical guarantees on quality. We demonstrate experimentally the huge gains of using our adaptive methods instead of worst-case methods.
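To make the cost definition concrete, the following minimal Python sketch computes V(Q) exactly and estimates it from a weighted sample. It is an illustration only, not the construction from the paper: the abstract does not specify how the one2all probabilities are computed, so the inclusion probabilities `probs` are left as an input. The function names (`sq_dist`, `clustering_cost`, `poisson_sample`, `estimate_cost`) are hypothetical, and squared Euclidean distance is assumed as the relaxed metric, matching the k-means case.

```python
import random


def sq_dist(x, q):
    """Squared Euclidean distance (assumed here; the k-means special case)."""
    return sum((a - b) ** 2 for a, b in zip(x, q))


def clustering_cost(X, Q, dist=sq_dist):
    """Exact cost V(Q): sum over x in X of the distance to the closest centroid."""
    return sum(min(dist(x, q) for q in Q) for x in X)


def poisson_sample(X, probs, rng):
    """Include each point independently with its own probability.

    `probs` is a placeholder for any per-point inclusion probabilities;
    how the paper derives them (one2all) is not reproduced here.
    """
    return [(x, p) for x, p in zip(X, probs) if rng.random() < p]


def estimate_cost(sample, Q, dist=sq_dist):
    """Inverse-probability estimate of V(Q) from (point, probability) pairs."""
    return sum(min(dist(x, q) for q in Q) / p for x, p in sample)


# Tiny example: two well-separated pairs of points on a line.
X = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
Q = [(0.0, 0.0), (10.0, 0.0)]
exact = clustering_cost(X, Q)  # 0 + 1 + 0 + 1 = 2

rng = random.Random(0)
sample = poisson_sample(X, [0.5] * len(X), rng)
approx = estimate_cost(sample, Q)  # unbiased for V(Q), but noisy at this size
```

Dividing each sampled point's contribution by its inclusion probability keeps the estimate unbiased for any choice of positive probabilities; the quality of the estimate then depends entirely on how well those probabilities track each point's contribution to the cost, which is the role the abstract assigns to the one2all design.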

DOI:

10.1609/aaai.v32i1.11772

AAAI

ISSN 2374-3468 (Online) ISSN 2159-5399 (Print)
