vitrivr: A Multimedia Search System Supporting Multimodal Interactions
vitrivr is an open-source, full-stack content-based multimedia retrieval system with a primary focus on video content. At the interface level, vitrivr offers multimodal interaction through a wide variety of query paradigms, such as keyword queries, search for semantic concepts, query-by-example, query-by-sketch, motion queries, and any combination thereof. Despite its focus on video, its modular architecture makes it easy to process other types of media as well. Keyword search is based on manual annotations and on the results of OCR applied to the content of the collection. Semantic queries rely on features obtained from Deep Neural Networks in three areas: semantic class labels for entry-level concepts, hidden-layer activation vectors for query-by-example, and a two-dimensional semantic-similarity display of results. In addition, the system uses various low-level features for query-by-example, query-by-sketch, and motion queries. vitrivr's grid-based result navigation interface supports browsing.

At the database back-end, the distributed polystore ADAMpro supports large-scale multimedia applications and ensures scalability to steadily growing collections. Retrieval functionality is provided by Cineast, vitrivr's retrieval engine.

The user interface allows for keyword specification, sketching, the specification of sample objects, the selection of semantic class labels, and the specification of flow fields for motion. All these query modes can be applied either simultaneously, supporting multi-object, multi-feature queries, or sequentially for query refinement. In addition, users can interact naturally with the vitrivr system by using spoken commands; multimodal commands combine spoken instructions with manual pointing and sketching, which also fosters collaborative interaction with the system.

The IMOTION project, which is based on the vitrivr system, won the 2017 Video Browser Showdown (VBS 2017), an international video retrieval competition comprising known-item search tasks based on visual and textual descriptions, as well as ad-hoc search tasks, all performed on the TRECVID 2016 Ad-hoc Video Search (AVS) dataset of approximately 600 hours (144 GB) of video content.
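To make the role of the hidden-layer activation vectors more concrete, the following is a minimal sketch of query-by-example: indexed video segments are ranked by the cosine similarity between their stored activation vectors and the vector extracted from an example image. All names and data here are illustrative assumptions, not Cineast's actual API; in the real system, such a nearest-neighbour search would be delegated to the database back-end (ADAMpro) rather than performed in application code.

```java
import java.util.*;

/** Minimal sketch of query-by-example over DNN activation vectors:
 *  rank indexed segments by cosine similarity to the query vector.
 *  Names and data are hypothetical; a production system would delegate
 *  this nearest-neighbour search to the database layer. */
public class QueryByExample {

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        // Hidden-layer activations for three indexed segments (toy 4-D
        // vectors; real activation vectors have hundreds of dimensions).
        Map<String, double[]> index = Map.of(
            "seg1", new double[]{0.1, 0.9, 0.2, 0.0},
            "seg2", new double[]{0.8, 0.1, 0.1, 0.5},
            "seg3", new double[]{0.1, 0.8, 0.3, 0.1});

        double[] query = {0.2, 0.9, 0.2, 0.1}; // features of the example image

        index.entrySet().stream()
             .sorted(Comparator.comparingDouble(
                     (Map.Entry<String, double[]> e) -> -cosine(query, e.getValue())))
             .forEach(e -> System.out.printf("%s  cos=%.3f%n",
                     e.getKey(), cosine(query, e.getValue())));
    }
}
```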
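Likewise, to illustrate how simultaneously issued query modes can be combined into multi-feature queries, the sketch below fuses per-feature similarity scores with a weighted sum, a common late-fusion strategy. The class and method names are hypothetical and do not reflect Cineast's actual implementation.

```java
import java.util.*;

/** Illustrative late fusion of per-feature similarity scores for
 *  multi-feature queries; hypothetical names, not Cineast's API. */
public class ScoreFusion {

    /** Combine per-feature score maps (segmentId -> score in [0,1])
     *  into a single ranking using a weighted sum. */
    static Map<String, Double> fuse(List<Map<String, Double>> featureScores,
                                    double[] weights) {
        Map<String, Double> fused = new HashMap<>();
        for (int i = 0; i < featureScores.size(); i++) {
            double w = weights[i];
            for (Map.Entry<String, Double> e : featureScores.get(i).entrySet()) {
                fused.merge(e.getKey(), w * e.getValue(), Double::sum);
            }
        }
        return fused;
    }

    public static void main(String[] args) {
        // Scores from two hypothetical feature modules, e.g. an edge-sketch
        // feature and a semantic-label feature, for three video segments.
        Map<String, Double> sketch = Map.of("seg1", 0.9, "seg2", 0.4, "seg3", 0.1);
        Map<String, Double> labels = Map.of("seg1", 0.2, "seg2", 0.8, "seg3", 0.7);

        Map<String, Double> fused = fuse(List.of(sketch, labels),
                                         new double[]{0.5, 0.5});
        fused.entrySet().stream()
             .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
             .forEach(e -> System.out.println(e.getKey() + " -> " + e.getValue()));
    }
}
```

Other fusion functions, e.g. taking the per-segment maximum instead of a weighted sum, fit the same structure; the choice of weights lets the user emphasize one query mode over another.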