Leveraging the rich information in multiple modalities has become crucial in many tasks. Existing works have tried to design multi-modal networks with decent multi-modal fusion modules. Instead, we focus on improving the generalization capability of multi-modal networks, especially of the fusion module. Viewing multi-modal data as different projections of the underlying information, we first observe that a bad projection can cause poor generalization behavior in multi-modal networks. Then, motivated by the low sensitivity of well-generalized networks to perturbation, we propose a novel multi-modal training method, multi-head modularization (MHM). We modularize a multi-modal network as a series of uni-modal embedding, multi-modal embedding, and task-specific head modules. For training, we exploit multiple head modules learned with different datasets and swap them with one another. This makes the multi-modal embedding module robust to all of the heads, which have different generalization behaviors. In the testing phase, we select only one of the head modules, so that the computational cost does not increase. Owing to the perturbation introduced by the head modules, the deployed network generalizes better than a network trained simply end to end, even though it includes only the one selected head. We verify the effectiveness of MHM on various multi-modal tasks. We use state-of-the-art methods as baselines and show notable performance gains for all of them.
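The modularization and head swapping described above can be sketched as a toy example. This is only an illustration under stated assumptions, not the authors' implementation: the names `ModularNet`, `swap_heads`, `make_head`, `concat`, and `identity` are hypothetical, the modules are plain Python functions rather than trained networks, and the random permutation used for swapping is one possible schedule, since the abstract does not specify how heads are exchanged.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = List[float]


@dataclass
class ModularNet:
    """Hypothetical container: uni-modal embeddings -> multi-modal embedding -> head."""
    uni_modal: List[Callable[[Vector], Vector]]  # one embedding module per modality
    fusion: Callable[[List[Vector]], Vector]     # multi-modal embedding module
    head: Callable[[Vector], float]              # task-specific head module

    def forward(self, inputs: Sequence[Vector]) -> float:
        embedded = [embed(x) for embed, x in zip(self.uni_modal, inputs)]
        return self.head(self.fusion(embedded))


def swap_heads(nets: List[ModularNet], rng: random.Random) -> List[int]:
    """Permute the heads across networks in place and return the permutation.

    Assumption: a random permutation stands in for the swapping schedule.
    In real training, gradient steps would run between successive swaps.
    """
    perm = list(range(len(nets)))
    rng.shuffle(perm)
    heads = [net.head for net in nets]
    for i, j in enumerate(perm):
        nets[i].head = heads[j]
    return perm


def make_head(scale: float) -> Callable[[Vector], float]:
    """Toy head: a scaled sum, standing in for a head learned on one dataset."""
    return lambda z: scale * sum(z)


def concat(parts: List[Vector]) -> Vector:
    """Toy fusion: concatenate the uni-modal embeddings."""
    return [v for part in parts for v in part]


def identity(x: Vector) -> Vector:
    """Toy uni-modal embedding: pass the input through unchanged."""
    return list(x)


# Three networks whose heads differ, as if learned with different datasets.
nets = [ModularNet([identity, identity], concat, make_head(s)) for s in (1.0, 2.0, 3.0)]
original_heads = [net.head for net in nets]

# One swap step: each multi-modal embedding module is now paired with another head.
perm = swap_heads(nets, random.Random(0))

# At test time a single network with its one selected head is deployed,
# so inference cost matches that of an ordinary single-head network.
deployed = nets[0]
print(deployed.forward([[1.0, 2.0], [3.0]]))
```

The point of the sketch is structural: because the head is a separate module behind a fixed interface, exchanging heads between networks perturbs what the multi-modal embedding module must feed into, while deployment still uses exactly one head.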