Modeling Data Collection for Social Media Analytics (Master Project, Finished)


Simon Peterhans


With billions of users producing enormous amounts of data, social media platforms are a frequent target of data analytics. Although most platforms offer an application programming interface (API) for data access, researchers often face several challenges when accessing and collecting social media data. For instance, data may be required from multiple platforms and over a longer period of time. In such a scenario, the collecting system has to account for the heterogeneity of the platform APIs and potentially changing access patterns, and it also has to handle situations where data collection is not possible and has to be delayed. Furthermore, as data collection goes on, research questions and information needs may change, and additional data may have to be obtained. In this project, we introduce a conceptual model for social media data collection that addresses these challenges. By formally defining the individual components of the data collection process, the proposed model specifically addresses platform and data heterogeneity and supports long-running, recoverable queries. As a proof of concept, we present a prototype system based on the collection model that is capable of running data collection queries for Twitter, Facebook, and Reddit. Finally, we identify numerous aspects of our work that future work could build upon or improve to further facilitate access to and collection of social media data.
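
To illustrate what a long-running, recoverable query over heterogeneous platforms could look like, the following is a minimal sketch and not the project's actual model or prototype. All names (`Source`, `Checkpoint`, `RateLimited`, `run_query`, `FakeSource`) are hypothetical and chosen for illustration only. The sketch assumes that each platform can be wrapped behind a common cursor-based `fetch` interface and that progress is persisted after every page, so a delayed or interrupted query can resume where it stopped.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Optional, Protocol


class RateLimited(Exception):
    """Hypothetical signal that a platform cannot serve requests right now."""


class Source(Protocol):
    """Assumed common interface hiding platform-specific API differences."""

    def fetch(self, term: str, cursor: Optional[str]) -> tuple[list[dict], Optional[str]]:
        ...


@dataclass
class Checkpoint:
    """Persisted query state; enough to resume after a delay or crash."""

    platform: str
    term: str
    cursor: Optional[str] = None
    done: bool = False


def run_query(source: Source, cp: Checkpoint, state_file: Path) -> list[dict]:
    """Collect pages until the query is done or the platform is unavailable."""
    collected: list[dict] = []
    while not cp.done:
        try:
            items, next_cursor = source.fetch(cp.term, cp.cursor)
        except RateLimited:
            break  # progress is already on disk; the caller may resume later
        collected.extend(items)
        cp.cursor = next_cursor
        cp.done = next_cursor is None
        state_file.write_text(json.dumps(asdict(cp)))  # save after each page
    return collected


class FakeSource:
    """In-memory stand-in for a platform API (assumption, for the demo only)."""

    def __init__(self, pages: list[list[str]], fail_on_call: Optional[int] = None):
        self.pages = pages
        self.calls = 0
        self.fail_on_call = fail_on_call

    def fetch(self, term: str, cursor: Optional[str]) -> tuple[list[dict], Optional[str]]:
        self.calls += 1
        if self.calls == self.fail_on_call:
            raise RateLimited()
        index = int(cursor) if cursor is not None else 0
        items = [{"platform": "fake", "text": text} for text in self.pages[index]]
        next_cursor = str(index + 1) if index + 1 < len(self.pages) else None
        return items, next_cursor
```

A query interrupted by `RateLimited` can be resumed by loading the saved state, for example `Checkpoint(**json.loads(state_file.read_text()))`, and calling `run_query` again with it. Saving after every page keeps the resume logic simple, at the cost of one write per request.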

Start / End Dates

2021/10/01 - 2021/12/12


Research Topics