Federated Learning for Vision-and-Language Grounding Problems

Fenglin Liu; Xian Wu; Shen Ge; Wei Fan; Yuexian Zou

doi:10.1609/aaai.v34i07.6824

Authors

Fenglin Liu, Peking University
Xian Wu, Tencent
Shen Ge, Tencent
Wei Fan, Tencent
Yuexian Zou, Peking University

Abstract

Recently, vision-and-language grounding problems, e.g., image captioning and visual question answering (VQA), have attracted extensive interest from both academia and industry. However, despite the similarity of these tasks, efforts to obtain better results by combining the merits of their algorithms have not been well studied. Inspired by the recent success of federated learning, we propose a federated learning framework that obtains various types of image representations from different tasks, which are then fused to form fine-grained image representations. These representations merge useful features from different vision-and-language grounding problems and are thus much more powerful than the original representations used alone in individual tasks. To learn such image representations, we propose the Aligning, Integrating and Mapping Network (aimNet). The aimNet is validated in three federated learning settings: horizontal federated learning, vertical federated learning, and federated transfer learning. Experiments with the aimNet-based federated learning framework on two representative tasks, i.e., image captioning and VQA, demonstrate effective and universal improvements on all metrics over the baselines. In image captioning, we obtain relative gains of 14% and 13% on the task-specific metrics CIDEr and SPICE, respectively. In VQA, we also boost the performance of strong baselines by up to 3%.
