A Scalable Agent for Multimedia Retrieval (Master's Thesis, Finished)

Description

In this thesis, we set out to explore how open-domain video retrieval can move beyond static indexing toward systems that respond at query time through adaptive weighting, dynamic extraction, and agentic reasoning. Existing systems already employ adaptive methods, but typically within closed or narrowly defined domains such as surveillance footage, egocentric recordings, or analyses of individual long-form videos; these approaches privilege depth over scale. Our aim is to extend such adaptivity to the open domain by dynamically weighting, extracting, and reasoning about multimodal features across large, heterogeneous collections at query time. Built as an extension of the vitrivr framework, the resulting system operates in a distributed environment that orchestrates vision, audio, and language models across both Ahead-of-Time (AoT) and Just-in-Time (JiT) paradigms. Through agentic query variation, temporal expansion, and on-demand extraction, retrieval becomes a reactive process rather than a static lookup. Results show that AoT features, particularly CLIP and VQA, form the structural backbone of performance, while JiT adaptation offers complementary gains for complex or underspecified queries. The findings point toward retrieval architectures in which query, model, and raw data converge, collapsing the boundary between indexing and inference and enabling contextual reasoning within an open world of video.
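The AoT/JiT interplay described above can be sketched in a few lines: precomputed (Ahead-of-Time) scores rank the whole collection cheaply, and an expensive Just-in-Time extractor is run only on a shortlist, with the two signals fused by a weight. All names here (`aot_rank`, `jit_rescore`, the segment identifiers, the fixed fusion weight) are illustrative assumptions, not the vitrivr API or the thesis implementation.

```python
def aot_rank(aot_scores, k):
    """Rank all segments by their precomputed AoT similarity; keep the top k."""
    return sorted(aot_scores, key=aot_scores.get, reverse=True)[:k]

def jit_rescore(candidates, aot_scores, jit_extractor, weight):
    """Run the on-demand (JiT) extractor only on the shortlist and fuse scores."""
    fused = {}
    for seg in candidates:
        jit = jit_extractor(seg)  # expensive: would invoke a model at query time
        fused[seg] = (1 - weight) * aot_scores[seg] + weight * jit
    return sorted(fused, key=fused.get, reverse=True)

# Toy collection: segment id -> precomputed AoT (e.g. CLIP-style) similarity.
aot = {"v1@00:10": 0.82, "v2@03:40": 0.79, "v3@12:05": 0.40, "v4@07:15": 0.75}

shortlist = aot_rank(aot, k=3)  # the JiT stage never touches v3@12:05
ranking = jit_rescore(
    shortlist, aot,
    jit_extractor=lambda s: 0.9 if s == "v2@03:40" else 0.2,  # stand-in model
    weight=0.5,
)
print(ranking[0])  # v2@03:40 overtakes v1@00:10 once JiT evidence is fused in
```

The point of the sketch is the cost asymmetry: AoT scoring touches every segment at index time, while the JiT model runs on a handful of candidates per query, which is what makes query-time adaptation affordable at collection scale.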

Start / End Dates

2025/04/28 - 2025/10/27

Supervisors

Research Topics