Recognizing ongoing events based on acoustic clues has been a critical yet challenging problem that has attracted significant research attention in recent years. Joint audio-visual analysis can improve the event detection accuracy but may not always be feasible as under many circumstances only audio recordings are available in real-world scenarios. To solve the challenges, we present a novel visual-assisted teacher-student mutual learning framework for robust sound event detection from audio recordings. Our model adopts a multi-modal teacher network based on both acoustic and visual clues, and a single-modal student network based on acoustic clues only. Conventional teacher-student learning performs unsatisfactorily for knowledge transfer from a multi-modality network to a single-modality network. We thus present a mutual learning framework by introducing a single-modal transfer loss and a cross-modal transfer loss to collaboratively learn the audio-visual correlations between the two networks. Our proposed solution takes the advantages of joint audio-visual analysis in training while maximizing the feasibility of the model in use cases. Our extensive experiments on the DCASE17 and the DCASE18 sound event detection datasets show that our proposed method outperforms the state-of-the-art audio tagging approaches.