vitrivr reads (Bachelor Project, Ongoing)
Text in a video often conveys information that is not easily expressed otherwise, and retrieval based on scene text has proven invaluable in retrieval competitions such as VBS and LSC. This project deals with the integration of state-of-the-art scene-text transcription into vitrivr using TensorFlow. Ideally, the implementation provides not only the text but also its location. While scene-text transcription is often a two-stage process in which the text is first located and then transcribed, an implementation can also use end-to-end transcription where appropriate.
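The two-stage structure mentioned above can be sketched as a pair of interfaces composed into a pipeline. This is a minimal illustration only: the type and method names (`TextDetector`, `TextRecognizer`, `SceneText`) are assumptions for this sketch, not the Cineast or TensorFlow Java API.

```java
import java.awt.Rectangle;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical result type: transcribed text plus its location in the frame. */
record SceneText(String text, Rectangle location) {}

/** Stage 1: locate text regions in a raw frame (illustrative name, not the Cineast API). */
interface TextDetector {
    List<Rectangle> detect(byte[] frame);
}

/** Stage 2: transcribe a single detected region. */
interface TextRecognizer {
    String recognize(byte[] frame, Rectangle region);
}

/** Two-stage pipeline: detect text regions first, then transcribe each one. */
class TwoStagePipeline {
    private final TextDetector detector;
    private final TextRecognizer recognizer;

    TwoStagePipeline(TextDetector detector, TextRecognizer recognizer) {
        this.detector = detector;
        this.recognizer = recognizer;
    }

    List<SceneText> transcribe(byte[] frame) {
        List<SceneText> results = new ArrayList<>();
        for (Rectangle region : detector.detect(frame)) {
            results.add(new SceneText(recognizer.recognize(frame, region), region));
        }
        return results;
    }
}
```

An end-to-end model would collapse the two interfaces into a single call returning `List<SceneText>` directly; keeping the stages separate makes it easier to swap pre-trained detection and recognition models independently.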
As an extension, the project could tackle the challenge of scrolling text, or of merging text across segments (e.g. subtitles). The following steps are part of the project:
- Survey existing implementations with available pre-trained models and assess their performance and usability
- Re-implement the code as a standalone component using the TensorFlow Java library
- Implement a new feature in Cineast which extracts scene text from images and videos and stores it using the existing pipeline
- Evaluate the performance in terms of runtime per shot / frame
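For the evaluation step, per-frame runtime can be measured with plain wall-clock timing around the extraction call. The helper below is a hypothetical sketch (the `RuntimeBenchmark` name and the `Consumer<byte[]>` signature are assumptions, not part of Cineast); a real evaluation would also want warm-up runs and per-shot aggregation.

```java
import java.util.function.Consumer;

/** Hypothetical helper: average wall-clock runtime per frame of an extraction step. */
class RuntimeBenchmark {
    static double averageMillisPerFrame(Consumer<byte[]> extractor, byte[][] frames) {
        long start = System.nanoTime();
        for (byte[] frame : frames) {
            extractor.accept(frame);  // run the extractor on each frame
        }
        long elapsed = System.nanoTime() - start;
        return elapsed / 1e6 / frames.length;  // nanoseconds -> milliseconds, averaged
    }
}
```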