Multi-Modal Video Retrieval (PhD Thesis, finished)

Multimedia content, and video in particular, has grown in both volume and importance in recent years. For this increasing amount of video material to be useful, it must be possible to find the parts of it that are relevant in a given situation. The field of video retrieval addresses this challenge by offering means to retrieve, from a larger pool, video sequences which are similar to a query. Such retrieval processes commonly rely on textual annotations, which often need to be added manually to a video in order to make it retrievable. In contrast, content-based video retrieval operates not on such external metadata but on the content of the video itself.

The aim of this thesis is to make several contributions to the field of content-based video retrieval. It begins with an analysis of one of the largest and most diverse contemporary sources of video material - web video - as it is found in the wild. The analysis outlines several properties of such video material obtained from two large online video platforms and compares them with the properties of several video collections which are commonly used in research. The results of this comparison led to the creation of a new research video dataset which is scheduled to be used for multiple large video retrieval evaluation campaigns.

Next, the notion of similarity, especially in the visual domain, is explored as it is perceived by humans. A human-labelled ground-truth dataset of pair-wise image similarity is obtained through an online platform which uses both crowdsourcing and gamification strategies for input acquisition. This dataset serves as the basis for a number of experiments which explore the relationship between the many available ways of computing the distance between two features describing visual content and visual similarity as perceived by humans. The insights gained from these experiments can inform the choice of distance measures when implementing content-based retrieval systems.

Finally, a content-based video retrieval engine is implemented which supports multiple modalities for query expression. This engine - which goes by the name Cineast - forms a vital component of the content-based retrieval stack vitrivr, which has been made publicly available as open-source software. Cineast, and by extension vitrivr, was later extended to concurrently support multiple media types besides video, such as images, audio and three-dimensional models. This makes vitrivr a full-fledged content-based multimedia retrieval stack.



Prof. Dr. Klaus Schöffmann, Alpen-Adria Universität Klagenfurt
