Multi-Platform Data Collection for Social Media Analytics (Master Thesis, Finished)


Simon Peterhans


With the rise of social media over the last decades, an ever-increasing amount of research is concerned with the analysis of social media content. In the process of collecting data from application programming interfaces (APIs) for subsequent analysis, researchers face numerous challenges: not only are data models and access patterns volatile and heterogeneous across platforms, but queries often also have to run over the course of multiple months and can yield large amounts of multimodal data.

In this thesis, we make multiple contributions towards facilitating data collection for social media analytics. First, we propose a conceptual model for API-based social media data collection. The model is platform-agnostic, accounts for multimodal data, and addresses the challenges imposed by long-running queries. Then, based on the model definitions and properties, we present an architecture model of a generic social media data collection system. Finally, we implement this architecture model in a prototype that is capable of executing long-running queries to obtain multimodal data from Twitter, Facebook, and Instagram. By respecting the data collection model, our system is able to recover interrupted queries and make efficient use of API request limits. Furthermore, the content-based multimedia retrieval system vitrivr is integrated for additional multimedia processing. 

Our evaluation shows that the prototype can process and store hundreds of thousands of social media submissions per minute and is primarily restricted by the API limits of the platforms. To test the implementation in a realistic use case, we deployed our system in a bachelor's thesis, where it successfully collected multimodal social media data from multiple platforms over several months.


Start / End Dates

2022/02/28 - 2022/08/26


Research Topics