Further Understanding Videos through Adverbs: A New Video Task

  • Bo Pang, Shanghai Jiao Tong University
  • Kaiwen Zha, Shanghai Jiao Tong University
  • Yifan Zhang, Shanghai Jiao Tong University
  • Cewu Lu, Shanghai Jiao Tong University


Video understanding is an active research area in computer vision, and significant progress has recently been made on video action recognition. However, the semantic information contained in actions is not rich enough to build powerful video understanding models. This paper first introduces a new video semantics, the Behavior Adverb (BA), which is more expressive and more difficult to recognize, covering subtle and inherent characteristics of human action behavior. To exhaustively decode this semantics, we construct the Videos with Action and Adverb Dataset (VAAD), a large-scale dataset with a semantically complete set of BAs. The dataset will be released to the public with this paper. We benchmark several representative video understanding methods (originally designed for action recognition) on both BA recognition and action recognition. The results show that the BA recognition task is more challenging than conventional action recognition. Accordingly, we propose the BA Understanding Network (BAUN) to address this problem, and the experiments show that BAUN is better suited to BA recognition (11% better than I3D). Furthermore, we find that the two semantics (action and BA) can reinforce each other, improving action recognition results by 3.4% on average across three standard action recognition datasets (UCF-101, HMDB-51, Kinetics).

AAAI Technical Track: Vision