Multimodal Summarization with Guidance of Multimodal Reference

Junnan Zhu; Yu Zhou; Jiajun Zhang; Haoran Li; Chengqing Zong; Changliang Li

doi:10.1609/aaai.v34i05.6525

Authors

Junnan Zhu, Chinese Academy of Sciences
Yu Zhou, Chinese Academy of Sciences
Jiajun Zhang, Chinese Academy of Sciences
Haoran Li, JD AI Research
Chengqing Zong, Chinese Academy of Sciences
Changliang Li, Kingsoft AI Lab

DOI:

https://doi.org/10.1609/aaai.v34i05.6525

Abstract

Multimodal summarization with multimodal output (MSMO) aims to generate a multimodal summary for a multimodal news report, which has been shown to improve user satisfaction. Existing MSMO methods are trained with a text-only target, which leads to a modality-bias problem: the quality of the model-selected image is ignored during training. To alleviate this problem, we propose a multimodal objective function, guided by a multimodal reference, that combines the loss from summary generation with the loss from image selection. Because multimodal reference data is lacking, we present two strategies, ROUGE-ranking and Order-ranking, that construct the multimodal reference by extending the text reference. To better evaluate multimodal outputs, we also propose a novel evaluation metric based on joint multimodal representation, which projects the model output and the multimodal reference into a joint semantic space during evaluation. Experimental results show that our proposed model achieves a new state of the art on both automatic and manual evaluation metrics, and that our proposed evaluation method improves the correlation with human judgments.
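
As a rough illustration of the ideas named in the abstract, the sketch below shows one way a ROUGE-style ranking could pick a reference image and how a text loss and an image-selection loss might be summed into a single objective. It is not the authors' implementation. The abstract does not say what is scored against the text reference, so the sketch assumes each image comes with a caption. It also substitutes a simple unigram-overlap F1 for a full ROUGE implementation, and the names `unigram_f1`, `rouge_rank`, `multimodal_loss` and the `weight` parameter are hypothetical.

```python
from collections import Counter
from typing import List


def unigram_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1: a simplified stand-in for ROUGE-1 (assumption)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def rouge_rank(captions: List[str], text_reference: str) -> List[int]:
    """Return image indices ordered from best to worst caption match.

    Assumes every image has a caption; the top-ranked index could then
    serve as the image part of a multimodal reference.
    """
    scores = [unigram_f1(caption, text_reference) for caption in captions]
    return sorted(range(len(captions)), key=lambda i: scores[i], reverse=True)


def multimodal_loss(text_loss: float, image_loss: float, weight: float = 1.0) -> float:
    """Hypothetical combined objective: text loss plus weighted image loss."""
    return text_loss + weight * image_loss


if __name__ == "__main__":
    captions = [
        "a dog runs in the park",
        "the president signs the new climate bill",
        "stock chart",
    ]
    reference = "the president signs a climate bill"
    print(rouge_rank(captions, reference))
    print(multimodal_loss(2.0, 0.5, weight=2.0))
```

In this toy input the second caption shares the most words with the text reference, so its image would be ranked first. How the two losses are actually balanced is not stated in the abstract; the additive form with a single weight is only one plausible choice.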
