Published:
2018-02-08
Proceedings:
Proceedings of the AAAI Conference on Artificial Intelligence, 32
Volume
Issue:
Thirty-Second AAAI Conference on Artificial Intelligence 2018
Track:
AAAI Technical Track: Vision
Downloads:
Abstract:
Video-based action recognition with deep neural networks has shown remarkable progress. However, most of the existing approaches are too computationally expensive due to the complex network architecture. To address these problems, we propose a new real-time action recognition architecture, called Temporal Convolutional 3D Network (T-C3D), which learns video action representations in a hierarchical multi-granularity manner. Specifically, we combine a residual 3D convolutional neural network which captures complementary information on the appearance of a single frame and the motion between consecutive frames with a new temporal encoding method to explore the temporal dynamics of the whole video. Thus heavy calculations are avoided when doing the inference, which enables the method to be capable of real-time processing. On two challenging benchmark datasets, UCF101 and HMDB51, our method is significantly better than state-of-the-art real-time methods by over 5.4% in terms of accuracy and 2 times faster in terms of inference speed (969 frames per second), demonstrating comparable recognition performance to the state-of-the-art methods. The source code for the complete system as well as the pre-trained models are publicly available at https://github.com/tc3d.
DOI:
10.1609/aaai.v32i1.12333
AAAI
Thirty-Second AAAI Conference on Artificial Intelligence 2018
ISSN 2374-3468 (Online) ISSN 2159-5399 (Print)
Published by AAAI Press, Palo Alto, California USA Copyright © 2018, Association for the Advancement of Artificial Intelligence All Rights Reserved.