TokTrack: A Complete Token Provenance and Change Tracking Dataset for the English Wikipedia

Authors

Fabian Flöck,Kenan Erdogan,Maribel Acosta

GESIS - Leibniz Institute for the Social Sciences,GESIS - Leibniz Institute for the Social Sciences,Karlsruhe Institute of Technology

Proceedings:

Vol. 11 No. 1 (2017): Eleventh International AAAI Conference on Web and Social Media

Volume

Issue:

Vol. 11 No. 1 (2017): Eleventh International AAAI Conference on Web and Social Media

Track:

Dataset Papers

Downloads:

Download PDF

Abstract:

We present a dataset that contains every instance of all tokens (words) ever written in undeleted, non-redirect English Wikipedia articles until October 2016, in total 13,545,349,78 instances. Each token is annotated with (i) the article revision it was originally created in, and (ii) lists with all the revisions in which the token was ever deleted and (potentially) re-added and re-deleted from its article, enabling a complete and straightforward tracking of its history. This data would be exceedingly hard to create by an average potential user as it is (i) very expensive to compute and as (ii) accurately tracking the history of each token in revisioned documents is a non-trivial task. Adapting a state-of-the-art algorithm, we have produced a dataset that allows for a range of analyses and metrics, already popular in research and going beyond, to be generated on complete-Wikipedia scale; ensuring quality and allowing researchers to forego expensive text-comparison computation, which so far has hindered scalable usage. We show how this data enables, on token-level, computation of provenance, measuring survival of content over time, very detailed conflict metrics, and fine-grained interactions of editors like partial reverts, re-additions and other metrics, in the process gaining several novel insights.

DOI:

10.1609/icwsm.v11i1.14860

ICWSM

Vol. 11 No. 1 (2017): Eleventh International AAAI Conference on Web and Social Media

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.