As an important research issue in affective computing community, multi-modal emotion recognition has become a hot topic in the last few years. However, almost all existing studies perform multiple binary classification for each emotion with focus on complete time series data. In this paper, we focus on multi-modal emotion recognition in a multi-label scenario. In this scenario, we consider not only the label-to-label dependency, but also the feature-to-label and modality-to-label dependencies. Particularly, we propose a heterogeneous hierarchical message passing network to effectively model above dependencies. Furthermore, we propose a new multi-modal multi-label emotion dataset based on partial time-series content to show predominant generalization of our model. Detailed evaluation demonstrates the effectiveness of our approach.