Text Bundling: Statistics-Based Data Reduction

Authors

Lawrence Shih, Jason Rennie, Yu-Han Chang, and David Karger

Proceedings:

Proceedings of the Twentieth International Conference on Machine Learning

Abstract:

As text corpora grow larger, the tradeoff between speed and accuracy becomes critical: slow but accurate methods may not finish in a practical amount of time. To keep the training data at a manageable size, a data reduction technique may be necessary. Subsampling, for example, speeds up a classifier by randomly removing training points. In this paper, we describe an alternative method that reduces the number of training points by combining them so that important statistical information is retained. Our algorithm preserves the same statistics that fast, linear-time text algorithms such as Rocchio and Naive Bayes use. We provide empirical results showing that our data reduction technique compares favorably to three other data reduction techniques on four standard text corpora.
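
The idea of combining training points while retaining class-level statistics can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not specify how points are grouped or combined, so the sketch assumes random grouping within each class and element-wise summation of term-count vectors. The function names `bundle_by_class` and `class_totals` and the toy data are hypothetical. Under these assumptions, the per-class total term counts, which are the kind of statistic a multinomial Naive Bayes model is built from, are unchanged by the reduction.

```python
import random


def bundle_by_class(docs, labels, bundle_size, seed=0):
    """Randomly group same-class documents and sum each group into one vector.

    Assumption for illustration only: random grouping and summation.
    """
    rng = random.Random(seed)
    by_class = {}
    for vec, label in zip(docs, labels):
        by_class.setdefault(label, []).append(vec)

    bundled_docs, bundled_labels = [], []
    for label, vecs in by_class.items():
        vecs = vecs[:]  # copy so the caller's list is not reordered
        rng.shuffle(vecs)
        for start in range(0, len(vecs), bundle_size):
            group = vecs[start:start + bundle_size]
            # The element-wise sum keeps the class's total term counts unchanged.
            bundled_docs.append([sum(column) for column in zip(*group)])
            bundled_labels.append(label)
    return bundled_docs, bundled_labels


def class_totals(docs, labels):
    """Return the summed term-count vector for each class."""
    totals = {}
    for vec, label in zip(docs, labels):
        if label not in totals:
            totals[label] = list(vec)
        else:
            totals[label] = [a + b for a, b in zip(totals[label], vec)]
    return totals


# Toy term-count vectors over a three-word vocabulary (made-up data).
docs = [[2, 0, 1], [1, 1, 0], [0, 3, 1], [4, 0, 0], [0, 2, 2]]
labels = ["sports", "sports", "sports", "politics", "politics"]

small_docs, small_labels = bundle_by_class(docs, labels, bundle_size=2)
```

Here five training points shrink to three bundles, yet `class_totals(small_docs, small_labels)` equals `class_totals(docs, labels)`. Subsampling the same data down to three points would discard the counts of the removed documents.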
