Current progress in fine-grained recognition mainly focuses on learning discriminative feature representations by introducing visual supervision, e.g., part labels. However, obtaining accurate annotations is time-consuming and requires professional knowledge. Unlike these existing methods based on visual supervision, in this paper we introduce a novel feature, named audio visual attributes, by discovering the correlations between visual and audio representations. Specifically, our unified framework is trained with video-level category labels and consists of two important modules, the encoder module and the attribute discovery module, which encode the image and audio into vectors and learn the correlations between audio and images, respectively. In the encoder module, we present two types of feed-forward convolutional neural networks for the image and audio modalities, while an attention-driven framework based on a recurrent neural network is developed to generate the audio visual attribute representation. Thus, our proposed architecture can be run end-to-end at inference time. We apply our models to the problem of fine-grained bird recognition on the CUB200-2011 benchmark. The experimental results demonstrate that, with the help of audio visual attributes, we achieve performance superior or comparable to that of strongly supervised approaches on bird recognition.
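As a rough illustration of the attention-driven recurrent step described in the abstract, the following is a minimal NumPy sketch and not the authors' implementation. The abstract does not specify the attention form, the recurrent cell, or the feature dimensions, so everything here is an assumption: dot-product soft attention, a plain tanh recurrent update, a hidden state initialized from the audio vector, random stand-ins for the CNN encoder outputs, and the hypothetical names `attend` and `audio_visual_attributes`.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(regions, query):
    # Assumed dot-product soft attention: weight image regions by
    # their similarity to the query vector.
    scores = regions @ query / np.sqrt(regions.shape[1])
    weights = softmax(scores)
    return weights @ regions, weights

def audio_visual_attributes(regions, audio, W_h, W_c, steps=3):
    # Assumed plain tanh recurrence: the state starts at the audio
    # vector and attends over the image regions at every step.
    h = audio
    attributes = []
    for _ in range(steps):
        context, _ = attend(regions, h)
        h = np.tanh(W_h @ h + W_c @ context)
        attributes.append(h)
    return np.stack(attributes)

rng = np.random.default_rng(0)
D = 8                                  # hypothetical feature size
regions = rng.normal(size=(5, D))      # stand-in for image CNN features (5 regions)
audio = rng.normal(size=D)             # stand-in for audio CNN features
W_h = rng.normal(size=(D, D)) * 0.1    # untrained, random recurrent weights
W_c = rng.normal(size=(D, D)) * 0.1    # untrained, random context weights
attrs = audio_visual_attributes(regions, audio, W_h, W_c)
```

Each row of `attrs` is one recurrent state that mixes the audio-conditioned query with the attended image context; in a trained model such states would feed a classifier supervised by the category label.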
Published Date: 2018-02-08
Registration: ISSN 2374-3468 (Online) ISSN 2159-5399 (Print)
Copyright: Published by AAAI Press, Palo Alto, California USA Copyright © 2018, Association for the Advancement of Artificial Intelligence All Rights Reserved.