Distinguishing the Wood from the Trees: Contrasting Collection Methods to Understand Bias in a Longitudinal Brexit Twitter Dataset

Authors

Clare Llewellyn, Laura Cram

University of Edinburgh, University of Edinburgh

Proceedings:

Vol. 11 No. 1 (2017): Eleventh International AAAI Conference on Web and Social Media

Track:

Poster Papers

Abstract:

Various methods can be used to search or stream Twitter data in order to gather a sample on a specific topic. Each of these methods introduces bias into the resulting dataset. Here we examine, and try to define, the bias that the different strategies introduce. Understanding the bias allows us to extrapolate wider meaning from the data more precisely. We use datasets collected on topics relating to the UK-EU Brexit referendum held in 2016. Each dataset draws data from Twitter over a twelve-month period, from 1st September 2015 until 31st August 2016. Three data collection strategies are considered: collecting on human-defined, topic-specific hashtags; collecting using a semi-automated technique that identifies topic terms, which are then used to collect tweets; and collecting from predefined users known to be tweeting on the topic. To investigate bias in the data we examine group-level metadata attributes, and find wide variation in them: the size of the dataset; the number of users in each set; the average numbers of friends and followers; likely re-tweet status; and the levels of inclusion of add-ons such as hashtags, URLs and media. We also find that relevance to the topic differs between the sets, being far higher in the known users set. We investigate how the readability of tweets varies within each set, particularly between the known users and topic term sets. Finally, we find a surprising lack of overlap in the data obtained using the different collection methods.
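
As an illustration of the overlap comparison mentioned in the abstract, the following is a minimal sketch of how pairwise overlap between collections could be measured. It is not the authors' method: the choice of Jaccard overlap on tweet IDs, the function names, the collection labels and the toy IDs are all assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap between two collections of tweet IDs."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty collections are treated as having no overlap.
    return len(a & b) / len(union) if union else 0.0


def pairwise_overlap(collections):
    """Map each pair of collection names to its Jaccard overlap."""
    return {
        (x, y): jaccard(collections[x], collections[y])
        for x, y in combinations(sorted(collections), 2)
    }


# Hypothetical toy data: tweet IDs gathered by each collection strategy.
collections = {
    "hashtags": {1, 2, 3, 4},
    "topic_terms": {3, 4, 5, 6, 7, 8},
    "known_users": {4, 9},
}

overlaps = pairwise_overlap(collections)
for pair, score in overlaps.items():
    print(pair, round(score, 3))
```

A value near 0 for a pair would indicate that the two strategies capture largely different tweets; other overlap measures, such as the share of the smaller set contained in the larger, could be substituted in the same structure.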

DOI:

10.1609/icwsm.v11i1.14924
