Deep reinforcement learning (DRL) has achieved surpassing human performance on Atari games, using raw pixels and rewards to learn everything. However, first-person-shooter (FPS) games in 3D environments contain higher levels of human concepts (enemy, weapon, spatial structure, etc.) and a large action space. In this paper, we explore a novel method which can plan on temporally-extended action sequences, which we refer as Combo-Action to compress the action space. We further train a deep recurrent Q-learning network model as a high-level controller, called supervisory network, to manage the Combo-Actions. Our method can be boosted with auxiliary tasks (enemy detection and depth prediction), which enable the agent to extract high-level concepts in the FPS games. Extensive experiments show that our method is efficient in training process and outperforms previous stateof-the-art approaches by a large margin. Ablation study experiments also indicate that our method can boost the performance of the FPS agent in a reasonable way.