Regularizing End-to-End Speech Translation with Triangular Decomposition Agreement

Authors

Yichao Du

University of Science and Technology of China, Hefei, China

Zhirui Zhang

Machine Intelligence Technology Lab, Alibaba DAMO Academy

Weizhi Wang

Rutgers University, New Brunswick, USA

Boxing Chen

Machine Intelligence Technology Lab, Alibaba DAMO Academy

Jun Xie

Machine Intelligence Technology Lab, Alibaba DAMO Academy

Tong Xu

University of Science and Technology of China, Hefei, China

Proceedings:

No. 10: AAAI-22 Technical Tracks 10

Volume

Issue:

Proceedings of the AAAI Conference on Artificial Intelligence, 36

Track:

AAAI Technical Track on Speech and Natural Language Processing

Downloads:

Download PDF

Abstract:

End-to-end speech-to-text translation (E2E-ST) is becoming increasingly popular due to the potential of its less error propagation, lower latency, and fewer parameters. Given the triplet training corpus〈speech, transcription, translation〉, the conventional high-quality E2E-ST system leverages the〈speech, transcription〉pair to pre-train the model and then utilizes the〈speech, translation〉pair to optimize it further. However, this process only involves two-tuple data at each stage, and this loose coupling fails to fully exploit the association between triplet data. In this paper, we attempt to model the joint probability of transcription and translation based on the speech input to directly leverage such triplet data. Based on that, we propose a novel regularization method for model training to improve the agreement of dual-path decomposition within triplet data, which should be equal in theory. To achieve this goal, we introduce two Kullback-Leibler divergence regularization terms into the model training objective to reduce the mismatch between output probabilities of dual-path. Then the well-trained model can be naturally transformed as the E2E-ST models by a pre-defined early stop tag. Experiments on the MuST-C benchmark demonstrate that our proposed approach significantly outperforms state-of-the-art E2E-ST baselines on all 8 language pairs while achieving better performance in the automatic speech recognition task.

DOI:

10.1609/aaai.v36i10.21303

AAAI

Proceedings of the AAAI Conference on Artificial Intelligence, 36

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.