Current question answering systems perform well in many respects on questions about textual documents. However, information also exists in other media, which presents both opportunities and challenges for question answering. We present results on extending question answering capabilities to video footage captured in a surveillance setting. Our prototype system, called Spot, can answer questions about moving objects that appear within the video. We situate this novel application of vision and language technology within a larger framework designed to integrate language and vision systems under a common representation. We believe that our framework will support the next generation of multimodal natural language information access systems.