Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos

Authors

  • Dongliang He Baidu, Inc.
  • Xiang Zhao Baidu, Inc.
  • Jizhou Huang Baidu, Inc.
  • Fu Li Baidu, Inc.
  • Xiao Liu Baidu, Inc.
  • Shilei Wen Baidu Research

DOI:

https://doi.org/10.1609/aaai.v33i01.33018393

Abstract

The task of video grounding, which temporally localizes a natural language description in a video, plays an important role in understanding videos. Existing studies have adopted strategies of sliding a window over the entire video or exhaustively ranking all possible clip-sentence pairs in a presegmented video, both of which inevitably incur the cost of exhaustively enumerating candidates. To alleviate this problem, we formulate the task as a sequential decision-making problem, learning an agent that progressively adjusts the temporal grounding boundaries according to its policy. Specifically, we propose a reinforcement learning based framework improved by multi-task learning, which shows steady performance gains by considering additional supervised boundary information during training. Our proposed framework achieves state-of-the-art performance on the ActivityNet’18 DenseCaption dataset (Krishna et al. 2017) and the Charades-STA dataset (Sigurdsson et al. 2016; Gao et al. 2017) while observing only 10 or fewer clips per video.
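
The sequential formulation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the action names, the step size `delta`, the normalized [0, 1] time axis, and the greedy oracle that stands in for a learned policy are all assumptions made for illustration. The only details taken from the abstract are that an agent adjusts the temporal boundaries step by step and that the number of steps is capped at a small budget (10 here).

```python
from typing import Callable, List, Tuple

Window = Tuple[float, float]  # (start, end) on a time axis normalized to [0, 1]

# Hypothetical action set; the paper's actual actions may differ.
ACTIONS = ["shift_left", "shift_right", "extend", "shrink"]


def temporal_iou(a: Window, b: Window) -> float:
    """Intersection over union of two temporal windows."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def apply_action(window: Window, action: str, delta: float = 0.1) -> Window:
    """Return the window after one boundary adjustment, kept inside [0, 1]."""
    s, e = window
    length = e - s
    if action == "shift_left":
        s = max(0.0, s - delta)
        e = s + length
    elif action == "shift_right":
        e = min(1.0, e + delta)
        s = e - length
    elif action == "extend":
        s = max(0.0, s - delta / 2)
        e = min(1.0, e + delta / 2)
    elif action == "shrink":
        s, e = s + delta / 2, e - delta / 2
    else:
        raise ValueError(f"unknown action: {action}")
    # Reject moves that would collapse the window.
    return (s, e) if e > s else window


def make_oracle_policy(target: Window, min_gain: float = 1e-9) -> Callable[[Window], str]:
    """Greedy stand-in for a learned policy: it peeks at the ground truth,
    which a trained agent cannot do at test time."""
    def policy(window: Window) -> str:
        best_action, best_iou = "stop", temporal_iou(window, target)
        for action in ACTIONS:
            candidate = temporal_iou(apply_action(window, action), target)
            if candidate > best_iou + min_gain:
                best_action, best_iou = action, candidate
        return best_action
    return policy


def ground(policy: Callable[[Window], str], start: Window,
           max_steps: int = 10) -> Tuple[Window, List[Window]]:
    """Run the decision loop until the policy stops or the budget is spent."""
    window, trajectory = start, [start]
    for _ in range(max_steps):
        action = policy(window)
        if action == "stop":
            break
        window = apply_action(window, action)
        trajectory.append(window)
    return window, trajectory
```

Calling `ground(make_oracle_policy((0.3, 0.6)), (0.25, 0.75))` returns the final window together with the sequence of windows visited. In a learned setting, the oracle would be replaced by a policy conditioned on the sentence and the currently observed clip, which is where the step budget translates into a bound on the number of clips observed.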

Published

2019-07-17

How to Cite

He, D., Zhao, X., Huang, J., Li, F., Liu, X., & Wen, S. (2019). Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01), 8393-8400. https://doi.org/10.1609/aaai.v33i01.33018393

Section

AAAI Technical Track: Vision