This paper focuses on the emerging Infrared-Visible cross-modal person re-identification task (IV-ReID), which takes infrared images as input and matches with visible color images. IV-ReID is important yet challenging, as there is a significant gap between the visible and infrared images. To reduce this ‘gap’, we introduce an auxiliary X modality as an assistant and reformulate infrared-visible dual-mode cross-modal learning as an X-Infrared-Visible three-mode learning problem. The X modality restates from RGB channels to a format with which cross-modal learning can be easily performed. With this idea, we propose an X-Infrared-Visible (XIV) ReID cross-modal learning framework. Firstly, the X modality is generated by a lightweight network, which is learnt in a self-supervised manner with the labels inherited from visible images. Secondly, under the XIV framework, cross-modal learning is guided by a carefully designed modality gap constraint, with information exchanged cross the visible, X, and infrared modalities. Extensive experiments are performed on two challenging datasets SYSU-MM01 and RegDB to evaluate the proposed XIV-ReID approach. Experimental results show that our method considerably achieves an absolute gain of over 7% in terms of rank 1 and mAP even compared with the latest state-of-the-art methods.