Several software tools index the text of the World Wide Web, but little attention has been paid to its many valuable photographs. We present a relatively simple way to index them by localizing their likely explicit and implicit captions with a kind of expert system. We use multimodal clues from the general appearance of the image, the layout of the Web page, and the words near the image that are likely to describe it. Our MARIE-3 system avoids full image processing and full natural-language processing, yet it demonstrates a surprising degree of success and can thus serve as a preliminary filter for such detailed content analysis. Experiments with a randomly chosen set of Web pages concerning the military showed 41% recall with 41% precision for individual caption identification, or alternatively 70% recall with 30% precision, even though captions averaged only 1.4% of the page text.
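
For readers unfamiliar with the two measures reported above, the following minimal sketch shows how recall and precision are conventionally computed for caption identification. The function name and all counts are hypothetical, chosen only to reproduce the reported percentages; they are not data from the experiments.

```python
def precision_recall(num_proposed, num_correct, num_actual):
    """Return (precision, recall) for a caption-identification run.

    num_proposed: caption candidates the system put forward
    num_correct:  proposed candidates that really are captions
    num_actual:   true captions present in the test pages
    """
    precision = num_correct / num_proposed if num_proposed else 0.0
    recall = num_correct / num_actual if num_actual else 0.0
    return precision, recall


# Hypothetical counts (not from the paper): 100 true captions,
# 100 candidates proposed, 41 of them correct.
strict = precision_recall(num_proposed=100, num_correct=41, num_actual=100)

# Hypothetical looser setting: proposing more candidates finds more
# true captions (higher recall) but admits more false ones (lower precision).
loose = precision_recall(num_proposed=233, num_correct=70, num_actual=100)

print(f"strict: precision={strict[0]:.2f} recall={strict[1]:.2f}")
print(f"loose:  precision={loose[0]:.2f} recall={loose[1]:.2f}")
```

The two settings illustrate the usual trade-off: accepting more candidate captions raises recall at the cost of precision.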