Are You Watching Closely? Content-based Retrieval of Hand Gestures
Gestures play an important role in our daily communication. However, recognizing and retrieving gestures in-the-wild is a challenging task that has not been explored thoroughly in the literature. In this paper, we study the problem of identifying and retrieving gestures in a large-scale video dataset provided by the computer vision community, based on queries recorded in-the-wild. Our proposed pipeline, I3DEF, extracts spatio-temporal features from intermediate layers of an I3D network, a state-of-the-art network for action recognition, and fuses the feature maps obtained from RGB and optical-flow input. The resulting embeddings are used to train a triplet network that captures the similarity between gestures. We further explore the effect of a person- and body-part masking step on both retrieval performance and recognition rate. Our experiments show that I3DEF can recognize and retrieve gestures similar to the queries independently of the depth modality. This performance holds both for queries taken from the test data and for queries using recordings of different people performing relevant gestures in a different setting.
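The retrieval objective sketched in the abstract (fusing RGB and optical-flow embeddings, then training with a triplet loss) can be illustrated as follows. This is a minimal NumPy sketch under stated assumptions: the concatenation-based fusion, the feature dimension of 512, and the margin of 0.2 are illustrative choices, not the paper's exact implementation.

```python
import numpy as np

def fuse(rgb_feat, flow_feat):
    # Assumed fusion operator: concatenate the RGB and optical-flow
    # feature vectors, then L2-normalize into a single embedding.
    emb = np.concatenate([rgb_feat, flow_feat])
    return emb / (np.linalg.norm(emb) + 1e-8)

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Standard triplet margin loss: pull the anchor toward the positive
    # (same gesture) and push it away from the negative (different gesture).
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

# Stand-in features; in the pipeline these would come from
# intermediate I3D layers on the RGB and flow streams.
rng = np.random.default_rng(0)
anchor   = fuse(rng.normal(size=512), rng.normal(size=512))
positive = fuse(rng.normal(size=512), rng.normal(size=512))
negative = fuse(rng.normal(size=512), rng.normal(size=512))
loss = triplet_loss(anchor, positive, negative)
```

At retrieval time, the same fused embedding would be computed for a query clip and ranked against the gallery by Euclidean (or cosine) distance; the triplet training makes those distances reflect gesture similarity.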