Topic Modelling for Tweets (Bachelor Thesis, Finished)


Simon Peterhans


Tweet, like, share! During the last decade, social media have undeniably emerged as a mass phenomenon and become the predominant way we use the internet. With billions of daily users, these new means of communication have become platforms to voice opinions and to discuss and debate virtually anything, such as political elections. For events like these, it would certainly be interesting to know: What are people talking about on social media?

Topic modeling algorithms aim to detect topics in a body of text computationally, in an attempt to answer the preceding question. One specific example is Correlation Explanation (CorEx), an algorithm that extracts topics using an information-theoretic approach. However, before any sort of analysis can occur, the data first has to be collected. To assist with this often tedious task, the research system verifir can obtain large amounts of tweets from Twitter for a given set of keywords. Unfortunately, no openly available software is currently capable of both data collection and semantic analysis in the form of topic models.

In this thesis, we simplify the process of generating and visualizing topic models for social media data by integrating the topic modeling algorithm CorEx into the research system verifir. Our contribution is twofold: First, we create a Python module that generates topic models for data collected by verifir. Second, we visualize the generated topic models in the Angular-based user interface of verifir to simplify model evaluation.
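As a rough illustration of the kind of preprocessing such a Python module might perform, the following sketch turns raw tweets into a binary document-term matrix. It assumes that the topic model is trained on binary bag-of-words input; the stop-word list and the function names tokenize and binary_doc_term_matrix are hypothetical and are not taken from verifir or from the thesis.

```python
import re

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"the", "a", "is", "for", "on", "and", "to", "of"}


def tokenize(tweet):
    """Return the set of word tokens in a tweet.

    URLs and @mentions are removed, text is lowercased, and stop words
    and single-letter tokens are dropped. Hashtags keep their word part.
    """
    text = re.sub(r"https?://\S+|@\w+", " ", tweet.lower())
    tokens = re.findall(r"[^\W\d_]+", text)  # runs of letters only
    return {t for t in tokens if t not in STOP_WORDS and len(t) > 1}


def binary_doc_term_matrix(tweets):
    """Build a sorted vocabulary and a 0/1 matrix (one row per tweet).

    matrix[i][j] is 1 if vocabulary word j occurs in tweet i, else 0.
    """
    token_sets = [tokenize(t) for t in tweets]
    vocab = sorted(set().union(*token_sets))
    matrix = [[1 if word in tokens else 0 for word in vocab]
              for tokens in token_sets]
    return vocab, matrix
```

Each row of the resulting matrix represents one tweet and each column one vocabulary word; a matrix of this form could then be handed to a topic modeling library.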

To demonstrate the new functionality, we use a Twitter data set consisting of thousands of tweets from the 2019 Swiss federal elections and explore this data by generating several topic models. The visualization shows that CorEx finds meaningful topics and reveals several trends. Finally, we evaluate the scalability of CorEx and confirm that its execution time scales linearly as the parameters are varied.
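One simple way to check a linear-scaling claim of this kind is to fit a least-squares line to measured runtimes and inspect the coefficient of determination. The sketch below is only an assumption about how such a check could be done, not the evaluation procedure used in the thesis; fit_line is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept.

    xs: values of the varied parameter (e.g. number of tweets or topics).
    ys: measured execution times for those values.
    Returns (slope, intercept, r_squared). Requires at least two
    distinct xs values and non-constant ys.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - residual sum of squares / total sum of squares
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared
```

An r_squared value close to 1 is consistent with linear scaling over the measured range, while a clearly lower value suggests that the runtime grows in some other way.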

Start / End Dates

2019/08/19 - 2019/12/19


Research Topics