Scalable Near-duplicate Detection (Master Thesis, Finished)
Author
Description
With the ever-increasing volume of available multimedia comes an increase in unattributed sharing and derivatives of multimedia, creating the need for a system that can track the flow of images across the web in general and social media in particular. While a lot of work has been done in individual areas such as network analysis, near-duplicate detection of images, near-duplicate index structures, data collection and retrieval, we are unaware of an openly available system which integrates all of these components. In this thesis, we bridge the gap between those individual components by proposing an architecture for truly scalable near-duplicate detection.
There are various kinds of visual near-duplicates. We select a subset of modifications and implement an image fingerprinting scheme to detect them efficiently. We identify four building blocks for our system. In the first building block, we introduce a generic data model and discuss its requirements in the context of two social networks, Twitter and Reddit. The second building block is a distributed data collection module, since our vision requires data beyond the available research datasets. In the third building block, we discuss data storage for both raw multimedia content and metadata. The fourth and final building block is concerned with data retrieval, for which we extend the vitrivr stack to fit our system.
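To illustrate the general idea of image fingerprinting, the following is a minimal sketch of an average hash, a common textbook technique. It is not the fingerprinting scheme implemented in the thesis, which this description does not specify. The input format (a grayscale image as nested lists of values from 0 to 255) and all function names are assumptions made for illustration.

```python
from typing import List

# Hypothetical input format: grayscale image as rows of values in 0..255.
Grid = List[List[int]]


def downsample(pixels: Grid, size: int = 8) -> Grid:
    """Shrink the image to size x size by averaging rectangular blocks.

    Assumes the image is at least size x size pixels.
    """
    height, width = len(pixels), len(pixels[0])
    out = []
    for by in range(size):
        y0, y1 = by * height // size, (by + 1) * height // size
        row = []
        for bx in range(size):
            x0, x1 = bx * width // size, (bx + 1) * width // size
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(block) // len(block))
        out.append(row)
    return out


def average_hash(pixels: Grid, size: int = 8) -> int:
    """Return a size*size-bit fingerprint: one bit per cell, set if above the mean."""
    small = downsample(pixels, size)
    flat = [value for row in small for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    # Synthetic 16x16 horizontal gradient and a uniformly brightened copy.
    image = [[x * 16 for x in range(16)] for _ in range(16)]
    brighter = [[value + 10 for value in row] for row in image]
    print(hamming_distance(average_hash(image), average_hash(brighter)))
```

Two images would be treated as near-duplicate candidates when the Hamming distance between their fingerprints falls below a chosen threshold. Because each bit compares a cell against the image's own mean, a uniform brightness shift that does not clip leaves the fingerprint unchanged, which is the kind of robustness a fingerprinting scheme aims for.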
For all building blocks, we implement prototypes and use them together to create the PS-Battles dataset, a new dataset for image manipulation and derivation detection. We evaluate the scalability of the data collection module by taking a closer look at its bottleneck and evaluating it for up to 400 million URLs. The scalability of the data storage module is evaluated for up to 1 billion elements for metadata storage, and we examine system behavior for raw file storage. For the data retrieval module, we show that our implemented fingerprinting scheme is faster during extraction than other features from content-based image retrieval while remaining competitive in retrieval performance. The provided building blocks and prototypes are a step towards our vision of a system which is able to find near-duplicates at scale.
Start / End Dates
2017/09/25 - 2018/03/24