Unsupervised domain adaptation for semantic segmentation is a major challenge in computer vision because acquiring pixel-level labels is time-consuming and requires expensive human labor. A large gap exists between the data distributions of different domains, which causes severe performance loss when a model trained on synthetic data is applied to real data. Hence, we propose a novel domain adaptation approach, called Content Invariant Representation Network, to narrow the domain gap between the source (S) and target (T) domains. Previous works developed networks that transfer knowledge directly from S to T. In contrast, the proposed method progressively reduces the gap between S and T on the basis of a Content Invariant Representation (CIR). CIR is an intermediate domain (I) that shares invariant content with S and has a data distribution similar to that of T. Then, an Ancillary Classifier Module (ACM) is designed to focus on pixel-level details and generate attention-aware results. ACM adaptively assigns different weights to pixels according to their domain offsets, thereby reducing local domain gaps. Enforcing these local alignments also narrows the global domain gap between CIR and T. Finally, we perform self-supervised training on the pseudo-labeled target domain to further fit the distribution of the real data. Comprehensive experiments on two domain adaptation tasks, namely GTAV → Cityscapes and SYNTHIA → Cityscapes, demonstrate the superiority of our method over state-of-the-art methods.
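To make the final self-supervised step more concrete, the following is a minimal sketch of how a pseudo-labeled target set might be built. The abstract does not say how pseudo-labels are selected, so this assumes the common practice of keeping only confident predictions. The function name `make_pseudo_labels`, the threshold of 0.9, and the ignore index 255 are illustrative assumptions, not details taken from the method.

```python
# Hypothetical sketch: confidence-thresholded pseudo-labels for target images.
# Assumption: pixels whose top class probability falls below the threshold
# are marked with an ignore index and excluded from the self-training loss.

IGNORE_LABEL = 255  # assumed ignore index, not specified by the source


def make_pseudo_labels(probs, threshold=0.9):
    """Convert per-pixel class probabilities into pseudo-labels.

    probs: nested list indexed as probs[class][row][col].
    Returns a nested list indexed as labels[row][col].
    """
    num_classes = len(probs)
    height, width = len(probs[0]), len(probs[0][0])
    labels = [[IGNORE_LABEL] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            scores = [probs[c][y][x] for c in range(num_classes)]
            best = max(range(num_classes), key=scores.__getitem__)
            # Keep the prediction only when the model is confident enough.
            if scores[best] >= threshold:
                labels[y][x] = best
    return labels


# Toy 2-class, 2x2 prediction map.
probs = [
    [[0.95, 0.60], [0.05, 0.50]],  # class 0
    [[0.05, 0.40], [0.95, 0.50]],  # class 1
]
pseudo = make_pseudo_labels(probs)
```

In this toy case the two confident pixels keep their predicted class, while the two uncertain pixels receive the ignore index and would not contribute to training.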