A Novel Approach for Compound Document Matching
Bulletin of the IEEE Technical Committee on Digital Libraries (TCDL)
IEEE Technical Committee on Digital Libraries
Future digital libraries will contain not only pure text documents but also, increasingly, massive numbers of compound documents that comprise many multimedia objects, e.g., text, images, audio, and video. Existing document collections, e.g., all electronic health records of one clinic, can already form a digital library with millions of multimedia objects and several terabytes of total storage. It is therefore important to provide effective and efficient retrieval for such collections. This paper proposes a novel approach for compound document matching that uses a filter-and-refinement algorithm for similarity-based retrieval within documents, which may consist of arbitrarily many objects of various media types. At the same time, the approach increases effectiveness by establishing only semantically meaningful matches, and it provides greater expressiveness in queries by allowing the number of matches permitted for a single query object to be restricted.
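To make the filter-and-refinement idea concrete, the following is a minimal Python sketch of a generic multi-step nearest-neighbor search over compound documents. It is not the algorithm proposed in this paper. Everything in it is an assumption made for illustration: the `MediaObject` type, the feature-vector representation, the Euclidean object distance, the first-coordinate lower bound used as the filter, the rule that objects match only when their media types agree, and the choice of exactly one match per query object.

```python
from dataclasses import dataclass
from math import inf, sqrt


@dataclass(frozen=True)
class MediaObject:
    media_type: str   # e.g. "text" or "image" (hypothetical labels)
    features: tuple   # hypothetical feature vector


def exact_dist(a, b):
    # Expensive object distance: full Euclidean distance.
    return sqrt(sum((x - y) ** 2 for x, y in zip(a.features, b.features)))


def cheap_dist(a, b):
    # Cheap filter distance: the difference in the first coordinate
    # never exceeds the full Euclidean distance, so it is a lower bound.
    return abs(a.features[0] - b.features[0])


def doc_dist(query, doc, obj_dist):
    # Each query object is matched to its closest object of the same
    # media type (assumption: one match per query object). A query
    # object without a same-type partner makes the distance infinite.
    total = 0.0
    for q in query:
        candidates = [obj_dist(q, o) for o in doc if o.media_type == q.media_type]
        total += min(candidates, default=inf)
    return total


def filter_and_refine(query, collection, k=1):
    # Filter step: rank all documents by the cheap lower bound.
    bounds = sorted(
        (doc_dist(query, doc, cheap_dist), name)
        for name, doc in collection.items()
    )
    results = []  # (exact distance, document name), kept sorted
    for bound, name in bounds:
        # Stop once no remaining document can beat the k-th best result.
        if len(results) == k and bound >= results[-1][0]:
            break
        # Refinement step: compute the exact distance for this candidate.
        exact = doc_dist(query, collection[name], exact_dist)
        results.append((exact, name))
        results.sort()
        del results[k:]
    return results
```

Because the cheap distance never overestimates the exact one, the early stop cannot discard a document that belongs in the top k. The benefit in this sketch is that the exact distance is computed only for the candidates that survive the filter.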