StNet: Local and Global Spatial-Temporal Modeling for Action Recognition

Authors

  • Dongliang He Baidu, Inc.
  • Zhichao Zhou Baidu, Inc.
  • Chuang Gan Massachusetts Institute of Technology
  • Fu Li Baidu, Inc.
  • Xiao Liu Baidu, Inc.
  • Yandong Li University of Central Florida
  • Limin Wang Nanjing University
  • Shilei Wen Baidu Research

DOI:

https://doi.org/10.1609/aaai.v33i01.33018401

Abstract

Despite the success of deep learning for static image understanding, it remains unclear which network architectures are most effective for spatial-temporal modeling in videos. In this paper, in contrast to existing CNN+RNN or pure 3D convolution based approaches, we explore a novel spatial-temporal network (StNet) architecture for both local and global modeling in videos. Specifically, StNet stacks N successive video frames into a super-image with 3N channels and applies 2D convolution on super-images to capture local spatial-temporal relationships. To model global spatial-temporal structure, we apply temporal convolution on the local spatial-temporal feature maps. In particular, a novel temporal Xception block is proposed in StNet, which employs separate channel-wise and temporal-wise convolutions over the feature sequence of a video. Extensive experiments on the Kinetics dataset demonstrate that our framework outperforms several state-of-the-art approaches in action recognition and strikes a satisfying trade-off between recognition accuracy and model complexity. We further demonstrate the generalization ability of the learned video representations on the UCF101 dataset.
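
For readers who want a concrete picture, below is a minimal PyTorch sketch of the two ideas the abstract describes: stacking N successive frames into a 3N-channel super-image so a plain 2D convolution captures local spatial-temporal structure, and a separable (channel-wise plus temporal-wise) convolution block over a video's temporal feature sequence. All module names, shapes, and hyper-parameters here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class TemporalXceptionBlock(nn.Module):
    """Illustrative temporal Xception-style block: a channel-wise 1x1
    convolution followed by a temporal-wise depthwise convolution over
    the per-snippet feature sequence of a video."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        # Channel-wise mixing: 1x1 convolution across channels at each time step.
        self.channel_wise = nn.Conv1d(channels, channels, kernel_size=1)
        # Temporal-wise convolution: depthwise along time, one filter per channel.
        self.temporal_wise = nn.Conv1d(
            channels, channels, kernel_size,
            padding=kernel_size // 2, groups=channels,
        )
        self.bn = nn.BatchNorm1d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # x: (batch, channels, T) -- temporal feature sequence of a video.
        return self.relu(self.bn(self.temporal_wise(self.channel_wise(x))))


def super_image(frames: torch.Tensor) -> torch.Tensor:
    """Stack N successive RGB frames into one 3N-channel super-image.

    frames: (batch, N, 3, H, W) -> (batch, 3N, H, W), so an ordinary
    2D convolution over the result mixes local spatial-temporal information.
    """
    b, n, c, h, w = frames.shape
    return frames.reshape(b, n * c, h, w)


if __name__ == "__main__":
    clips = torch.randn(2, 5, 3, 224, 224)   # batch of 2 snippets, N=5 frames each
    x = super_image(clips)                    # (2, 15, 224, 224)
    local_conv = nn.Conv2d(15, 64, kernel_size=3, padding=1)
    print(local_conv(x).shape)                # torch.Size([2, 64, 224, 224])

    seq = torch.randn(2, 64, 7)               # T=7 snippet features per video
    block = TemporalXceptionBlock(64)
    print(block(seq).shape)                    # torch.Size([2, 64, 7])
```

The separable design mirrors the efficiency motivation stated in the abstract: decomposing temporal modeling into cheap channel-wise and temporal-wise operations keeps model complexity low relative to full 3D convolution.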

Published

2019-07-17

How to Cite

He, D., Zhou, Z., Gan, C., Li, F., Liu, X., Li, Y., Wang, L., & Wen, S. (2019). StNet: Local and Global Spatial-Temporal Modeling for Action Recognition. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01), 8401-8408. https://doi.org/10.1609/aaai.v33i01.33018401

Section

AAAI Technical Track: Vision