Scene Text Recognition in Images and Video with Cineast (Bachelor Thesis, Ongoing)


Renato Farruggio


Scene text is an important aspect of images and video segments and has been shown to be very valuable for retrieving particular items of interest from large media collections. As these collections grow, however, manually transcribing the text found in the scenes becomes less and less feasible.

Scene Text Recognition is the task of automatically detecting and transcribing character sequences (text) in natural scenes. Since such text can come in various forms and on many different backgrounds, it is generally a more complex task than simple optical character recognition (OCR) in documents. Conceptually, Scene Text Recognition can be thought of as a two-step process: First, text needs to be detected and localized (e.g., by assigning a bounding box). Once the position(s) of the text block(s) are known, the characters making up the text must be identified and transcribed.
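
The two-step structure can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names `BoundingBox`, `TextInstance`, `recognize_scene_text`, `toy_detect` and `toy_recognize` are hypothetical, the "image" is a toy grid of characters, and the detector and recognizer are trivial stand-ins for the trained models a real system would use. It is not the approach taken in the thesis or in Cineast.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Toy stand-in for an image: rows of characters, "." marks background.
Image = Sequence[str]


@dataclass(frozen=True)
class BoundingBox:
    x: int       # left column
    y: int       # top row
    width: int
    height: int


@dataclass(frozen=True)
class TextInstance:
    box: BoundingBox
    text: str


Detector = Callable[[Image], List[BoundingBox]]
Recognizer = Callable[[Image, BoundingBox], str]


def recognize_scene_text(image: Image, detect: Detector,
                         recognize: Recognizer) -> List[TextInstance]:
    """Step 1: localize text regions. Step 2: transcribe each region."""
    boxes = detect(image)
    return [TextInstance(box, recognize(image, box)) for box in boxes]


def toy_detect(image: Image) -> List[BoundingBox]:
    """Hypothetical detector: one box per horizontal run of non-background cells."""
    boxes = []
    for y, row in enumerate(image):
        x = 0
        while x < len(row):
            if row[x] != ".":
                start = x
                while x < len(row) and row[x] != ".":
                    x += 1
                boxes.append(BoundingBox(start, y, x - start, 1))
            else:
                x += 1
    return boxes


def toy_recognize(image: Image, box: BoundingBox) -> str:
    """Hypothetical recognizer: read the characters inside the box."""
    return image[box.y][box.x:box.x + box.width]


if __name__ == "__main__":
    frame = ["..EXIT...", ".........", ".GATE.12."]
    for instance in recognize_scene_text(frame, toy_detect, toy_recognize):
        print(instance.box, instance.text)
```

Keeping detection and recognition behind separate callables mirrors the conceptual split described above and would allow either step to be swapped for a different model without touching the other.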

Progress in machine learning, and in particular deep learning, has given a huge boost to solving the problem of Scene Text Recognition. There is a large research corpus on the topic, as well as many open-source projects with pre-trained models that try to tackle the issue.

With this project, we would like to integrate state-of-the-art Scene Text Recognition into the multimedia retrieval stack vitrivr. vitrivr consists of three software components: A database backend called Cottontail DB, the feature generation and extraction engine Cineast and the user interface vitrivr NG. Most of the work will take place in Cineast, where a Scene Text Recognition feature module should be integrated into the extraction and retrieval pipeline.

Start / End Dates

2021/04/06 - 2021/08/05


Research Topics