Impressive image captioning results (i.e., an objective description for an image) are achieved with plenty of training pairs. In this paper, we take one step further to investigate the creation of narrative paragraph for a photo stream. This task is even more challenging due to the difficulty in modeling an ordered photo sequence and in generating a relevant paragraph with expressive language style for storytelling. The difficulty can even be exacerbated by the limited training data, so that existing approaches almost focus on search-based solutions. To deal with these challenges, we propose a sequence-to-sequence modeling approach with reinforcement learning and adversarial training. First, to model the ordered photo stream, we propose a hierarchical recurrent neural network as story generator, which is optimized by reinforcement learning with rewards. Second, to generate relevant and story-style paragraphs, we design the rewards with two critic networks, including a multi-modal and a language-style discriminator. Third, we further consider the story generator and reward critics as adversaries. The generator aims to create indistinguishable paragraphs to human-level stories, whereas the critics aim at distinguishing them and further improving the generator by policy gradient. Experiments on three widely-used datasets show the effectiveness, against state-of-the-art methods with relative increase of 20.2% by METEOR. We also show the subjective preference for the proposed approach over the baselines through a user study with 30 human subjects.
Published Date: 2018-02-08
Registration: ISSN 2374-3468 (Online) ISSN 2159-5399 (Print)
Copyright: Published by AAAI Press, Palo Alto, California USA Copyright © 2018, Association for the Advancement of Artificial Intelligence All Rights Reserved.