Multi-Question Learning for Visual Question Answering

Chenyi Lei; Lei Wu; Dong Liu; Zhao Li; Guoxin Wang; Haihong Tang; Houqiang Li

doi:10.1609/aaai.v34i07.6794

Authors

Chenyi Lei Alibaba Group
Lei Wu Zhejiang University
Dong Liu University of Science and Technology of China
Zhao Li Alibaba Group
Guoxin Wang Alibaba Group
Haihong Tang Alibaba Group
Houqiang Li University of Science and Technology of China


Abstract

Visual Question Answering (VQA) poses a great challenge for the computer vision and natural language processing communities. Most existing approaches consider video-question pairs individually during training. However, we observe that a VQA task usually contains multiple questions (either sequentially generated or not) about the same target video, and these questions have rich semantic relations with one another. To exploit these relations, we propose a new paradigm for VQA termed Multi-Question Learning (MQL). Inspired by multi-task learning, MQL jointly learns from multiple questions, together with their corresponding answers, for a target video sequence. The learned representations of video-question pairs are therefore more general and transfer better to new questions. We further propose an effective VQA framework and design a training procedure for MQL, in which a specifically designed attention network models the relation between the input video and its corresponding questions, enabling multiple video-question pairs to be co-trained. Experimental results on public datasets show that the proposed MQL-VQA framework performs favorably compared with state-of-the-art methods.
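
To make the co-training idea more concrete, here is a minimal, hypothetical Python sketch of a joint objective over several questions asked about one video. It is not the authors' MQL-VQA architecture. The pure-Python vectors, the scaled dot-product attention over frame features, the additive fusion of the attended video context with the question embedding, the shared set of candidate answers, and all function names (`attend`, `answer_logprob`, `joint_loss`) are illustrative assumptions. The sketch covers only the forward computation: each question attends over the same frame features, and the per-question cross-entropy losses are averaged into one loss. In a real trainable model, this would let every question contribute gradients to the shared video representation.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    # Subtract the max score for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(question: Vector, frames: List[Vector]) -> Vector:
    """Scaled dot-product attention of one question over all frames of a video.

    Assumption: question and frame features share the same dimension d.
    """
    d = len(question)
    weights = softmax([dot(question, f) / math.sqrt(d) for f in frames])
    return [sum(w * f[i] for w, f in zip(weights, frames)) for i in range(d)]


def answer_logprob(context: Vector, question: Vector,
                   answers: List[Vector], target: int) -> float:
    """Log-probability of the target answer among the candidate answers.

    Assumption: the attended video context and the question are fused by
    element-wise addition, and each candidate is scored by a dot product.
    """
    fused = [c + q for c, q in zip(context, question)]
    probs = softmax([dot(fused, a) for a in answers])
    return math.log(probs[target])


def joint_loss(frames: List[Vector], questions: List[Vector],
               answers: List[Vector], targets: List[int]) -> float:
    """Average cross-entropy over all questions asked about the same video.

    All questions reuse the same frame features, so in a trainable model the
    shared video representation would receive signal from every question.
    """
    losses = []
    for q, t in zip(questions, targets):
        ctx = attend(q, frames)
        losses.append(-answer_logprob(ctx, q, answers, t))
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Toy example (hypothetical data): 3 frames, 2 questions, 2 candidate answers.
    frames = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    questions = [[1.0, 0.0], [0.0, 1.0]]
    answers = [[1.0, 0.0], [0.0, 1.0]]
    targets = [0, 1]
    print(joint_loss(frames, questions, answers, targets))
```

Averaging the losses rather than training on each video-question pair in isolation is the simplest way to express "co-training" here. A real system would replace the toy vectors with learned video and question encoders and would backpropagate through this joint loss.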
