Multimodal summarization aims to distill salient information from multiple modalities, among which text and images are the two most commonly studied. In recent years, many notable works have emerged in this field by modeling image-text interactions; however, they neglect the fact that most multimodal documents are deliberately organized by their authors. As a result, a critical organizational factor, image location, has long received insufficient attention, even though image locations may carry illuminating information and signal the key content of a document. To address this issue, we propose a location-aware approach for multimodal summarization (LAMS) based on the Transformer. We incorporate image locations into multimodal summarization via a stack of multimodal fusion blocks, which model high-order interactions among images and texts. An extensive experimental study on an extended multimodal dataset validates the superior summarization performance of the proposed model.
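To make the core idea concrete, the following is a minimal NumPy sketch of one plausible location-aware fusion step, not the paper's actual implementation: image features are augmented with an embedding of each image's position in the document before text tokens cross-attend to them. All function and variable names (`location_aware_fusion`, `loc_emb`, etc.) are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def location_aware_fusion(text, images, loc_ids, loc_emb):
    """Hypothetical single fusion step (an assumption, not the paper's code).

    text:    (T, d) text token features
    images:  (K, d) image features
    loc_ids: (K,)   index of each image's location in the document
    loc_emb: (L, d) learned table of location embeddings
    """
    d = text.shape[1]
    keys = images + loc_emb[loc_ids]             # inject image-location signal
    attn = softmax(text @ keys.T / np.sqrt(d))   # (T, K) text-to-image attention
    fused = text + attn @ keys                   # residual multimodal fusion
    return fused

# Toy usage: 5 text tokens, 3 images placed at positions 0, 4, 9 of 10 slots.
rng = np.random.default_rng(0)
T, K, L, d = 5, 3, 10, 8
out = location_aware_fusion(rng.normal(size=(T, d)),
                            rng.normal(size=(K, d)),
                            np.array([0, 4, 9]),
                            rng.normal(size=(L, d)))
print(out.shape)
```

Stacking several such blocks, as the abstract describes, would let text and image representations interact repeatedly, yielding the high-order cross-modal interactions the model relies on.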