Crowdsourcing platforms provide a convenient and scalable way to collect human-generated labels on-demand. This data can be used to train Artificial Intelligence (AI) systems or to evaluate the effectiveness of algorithms. The datasets generated by means of crowdsourcing are, however, dependent on many factors that affect their quality. These include, among others, the population sample bias introduced by aspects like task reward, requester reputation, and other filters introduced by the task design.In this paper, we analyse platform-related factors and study how they affect dataset characteristics by running a longitudinal study where we compare the reliability of results collected with repeated experiments over time and across crowdsourcing platforms. Results show that, under certain conditions: 1) experiments replicated across different platforms result in significantly different data quality levels while 2) the quality of data from repeated experiments over time is stable within the same platform. We identify some key task design variables that cause such variations and propose an experimentally validated set of actions to counteract these effects thus achieving reliable and repeatable crowdsourced data collection experiments.