Representative and Fair Collection Sampling for Multimedia Retrieval (Bachelor Thesis, Finished)


Abiyuga Thanabalasingam


As technology advances and digital content creation accelerates, multimedia collections have grown exponentially over the last few decades. Evaluating retrieval algorithms on such large collections is difficult because the evaluation requires human involvement. Selecting a representative subset of a large collection allows the retrieval methods to be evaluated on that subset instead. In this thesis, we investigate and apply random sampling, systematic sampling and proportional stratified sampling to the SGV12 collection from the PIA1 project. To implement proportional stratified sampling, K-Means and DBSCAN were used for clustering and t-SNE for dimensionality reduction. The methods were then evaluated by runtime and by the chi-squared test, which measures representativeness. The results show that, overall, systematic sampling performed best with an average score of 0.0058, closely followed by random sampling with a score of 0.00625. The proportional stratified sampling methods are significantly slower but yield similar chi-squared values. Evaluating retrieval methods on a smaller subset of a larger collection saves time and significantly reduces human effort. With a representative subset, important patterns can still be recognized at reduced cost and storage.
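
As a rough illustration of the three sampling schemes named in the abstract, here is a minimal Python sketch. It is not the thesis implementation. The function names, the `labels` argument (standing in for cluster assignments such as those K-Means or DBSCAN would produce) and the fixed seeds are assumptions made for the example. The last function computes the standard Pearson chi-squared statistic between the label distribution of a sample and that of the full collection; how the thesis derives its reported scores is not stated here, so this is only one plausible reading.

```python
import random
from collections import Counter, defaultdict


def random_sample(items, n, seed=0):
    """Simple random sampling: draw n items without replacement."""
    return random.Random(seed).sample(items, n)


def systematic_sample(items, n, seed=0):
    """Systematic sampling: random start, then evenly spaced picks."""
    step = len(items) / n
    start = random.Random(seed).random() * step
    last = len(items) - 1
    # min() guards against floating-point rounding at the upper end.
    return [items[min(int(start + i * step), last)] for i in range(n)]


def proportional_stratified_sample(items, labels, n, seed=0):
    """Draw from each stratum in proportion to its share of the collection.

    `labels` is a hypothetical per-item stratum label (e.g. a cluster id).
    Because of rounding, the result may differ slightly from n in size.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item, label in zip(items, labels):
        strata[label].append(item)
    sample = []
    for members in strata.values():
        k = round(n * len(members) / len(items))
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample


def chi_squared_statistic(sample_labels, population_labels):
    """Pearson statistic: sum of (observed - expected)^2 / expected."""
    observed = Counter(sample_labels)
    population = Counter(population_labels)
    n, total = len(sample_labels), len(population_labels)
    stat = 0.0
    for label, count in population.items():
        expected = n * count / total
        stat += (observed.get(label, 0) - expected) ** 2 / expected
    return stat
```

A lower statistic means the sample's label distribution is closer to that of the full collection. In this sketch a proportional stratified sample scores zero on the labels it was stratified by whenever the per-stratum quotas are whole numbers, so a fair comparison would use a category that is independent of the stratification.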

Start / End Dates

2023/04/26 - 2023/08/25


Research Topics