We discuss the obstacles to inferring correspondences between objects in photographic images and their counterpart concepts in descriptive captions of those images. This matters for information retrieval of photographic data, since content analysis of an image is much harder than linguistic analysis of its caption. We argue that the key mapping is between certain caption concepts representing the "linguistic focus" and certain image regions representing the "visual focus". The mapping is one-to-many, however, and many image regions and caption concepts are not mapped at all. We discuss some domain-independent constraints that can restrict the potential mappings. We also report on experiments testing our criteria for the visual focus of images.
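As a minimal sketch of the idea (all names and the area threshold are invented for illustration, not taken from the paper): each caption concept maps to zero or more candidate image regions, and a domain-independent constraint prunes implausible candidates, leaving some concepts unmapped.

```python
# Hypothetical illustration: a one-to-many mapping from caption concepts
# to candidate image regions, pruned by a simple domain-independent
# constraint (minimum region area). Not the paper's actual constraints.

def prune_mappings(candidates, min_area):
    """Keep only region candidates large enough to plausibly be the
    visual focus; concepts left with no candidates remain unmapped."""
    pruned = {}
    for concept, regions in candidates.items():
        kept = [r for r in regions if r["area"] >= min_area]
        if kept:
            pruned[concept] = kept
    return pruned

# Caption concepts mapped to zero or more candidate regions (one-to-many).
candidates = {
    "ship": [{"id": 1, "area": 5000}, {"id": 2, "area": 120}],
    "sky":  [{"id": 3, "area": 90}],   # too small: ends up unmapped
}

print(prune_mappings(candidates, min_area=500))
# {'ship': [{'id': 1, 'area': 5000}]}
```

This mirrors the structure described above: the mapping is one-to-many, and constraint application can leave both regions and concepts unmatched.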