Proceedings:
Proceedings of the AAAI Conference on Artificial Intelligence, 33
Issue:
No. 1: AAAI-19, IAAI-19, EAAI-20
Track:
AAAI Technical Track: Vision
Abstract:
Generating photo-realistic images conditioned on semantic text descriptions is a challenging task in computer vision. Given the hierarchical representations learned by CNNs, it is intuitive to exploit richer convolutional features to improve text-to-image synthesis. In this paper, we propose the Perceptual Pyramid Adversarial Network (PPAN), which directly synthesizes multi-scale images conditioned on text in an adversarial way. Specifically, we design one pyramid generator and three independent discriminators to synthesize and regularize multi-scale photo-realistic images in a single feed-forward pass. At each pyramid level, our method takes coarse-resolution features as input, synthesizes high-resolution images, and uses convolutions to up-sample to the finer level. Furthermore, the generator adopts a perceptual loss to enforce semantic similarity between the synthesized image and the ground truth, while a multi-purpose discriminator encourages semantic consistency, image fidelity, and class invariance. Experimental results show that PPAN sets new records for text-to-image synthesis on two benchmark datasets: CUB (4.38 Inception Score and 0.290 Visual-Semantic Similarity) and Oxford-102 (3.52 Inception Score and 0.297 Visual-Semantic Similarity).
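To make the described architecture concrete, below is a minimal, hypothetical sketch of the two ideas named in the abstract: a pyramid generator that emits images at three scales (each scale would be paired with its own discriminator), and a VGG-based perceptual loss comparing generated and ground-truth images in feature space. This is not the authors' released code; all module names, layer sizes, and the choice of VGG-16 features are illustrative assumptions.

```python
# Hypothetical PyTorch sketch of a pyramid generator and a perceptual loss.
# Layer widths, depths, and the VGG feature cut-off are assumptions, not the
# configuration reported in the paper.
import torch
import torch.nn as nn
import torchvision.models as models


class PyramidGenerator(nn.Module):
    """Maps a text embedding plus noise to 64x64, 128x128, and 256x256 images."""

    def __init__(self, text_dim=256, noise_dim=100, ch=64):
        super().__init__()
        self.ch = ch
        self.fc = nn.Linear(text_dim + noise_dim, ch * 8 * 4 * 4)
        self.up1 = self._block(ch * 8, ch * 4)   # 4x4  -> 8x8
        self.up2 = self._block(ch * 4, ch * 2)   # 8x8  -> 16x16
        self.up3 = self._block(ch * 2, ch)       # 16x16 -> 32x32
        self.up4 = self._block(ch, ch)           # 32x32 -> 64x64
        self.to_img64 = nn.Conv2d(ch, 3, 3, padding=1)
        self.up5 = self._block(ch, ch)           # 64x64 -> 128x128
        self.to_img128 = nn.Conv2d(ch, 3, 3, padding=1)
        self.up6 = self._block(ch, ch)           # 128x128 -> 256x256
        self.to_img256 = nn.Conv2d(ch, 3, 3, padding=1)

    @staticmethod
    def _block(cin, cout):
        # Up-sampling realized with convolutions, as the abstract describes.
        return nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(cin, cout, 3, padding=1),
            nn.BatchNorm2d(cout),
            nn.ReLU(inplace=True),
        )

    def forward(self, text_emb, noise):
        h = self.fc(torch.cat([text_emb, noise], dim=1))
        h = h.view(-1, self.ch * 8, 4, 4)
        h = self.up4(self.up3(self.up2(self.up1(h))))
        img64 = torch.tanh(self.to_img64(h))     # coarsest pyramid level
        h = self.up5(h)
        img128 = torch.tanh(self.to_img128(h))   # intermediate level
        h = self.up6(h)
        img256 = torch.tanh(self.to_img256(h))   # finest level
        return img64, img128, img256


class PerceptualLoss(nn.Module):
    """L2 distance between frozen VGG-16 features of fake and real images."""

    def __init__(self):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features[:16]
        for p in vgg.parameters():
            p.requires_grad = False
        self.vgg = vgg.eval()

    def forward(self, fake, real):
        return nn.functional.mse_loss(self.vgg(fake), self.vgg(real))


if __name__ == "__main__":
    G = PyramidGenerator()
    txt = torch.randn(2, 256)          # stand-in for a sentence embedding
    z = torch.randn(2, 100)
    img64, img128, img256 = G(txt, z)  # one forward pass yields all scales
    loss = PerceptualLoss()(img256, torch.randn_like(img256))
```

In training, each of the three outputs would be scored by its own discriminator, and the perceptual term would be added to the adversarial losses so that the generator is pushed toward both fidelity and semantic agreement with the ground-truth image.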
DOI:
10.1609/aaai.v33i01.33018312