Speech and natural language are natural and convenient ways to interact with artificial characters. Current use of language in games, however, is limited to menu systems and inter-player communication. To achieve smooth linguistic communication with synthetic agents, research should focus on how language connects to the situation in which it occurs. Taking into account the physical scene (where is the speaker located, what is around her, when does she speak?) as well as the functional aspects of the situation (why did she choose to speak? What are her likely plans?) can disambiguate the linguistic signal in both form and content. We present a game environment for collecting time-synchronized speech and action streams, for visualizing these data, and for annotating them at different stages of processing. We further sketch a framework for situated speech understanding on such data that takes into account aspects of the physical situation as well as the plans players follow. Our results show that this combination of influences yields marked improvements over the individual situation models despite the very noisy and spontaneous nature of the speech involved. This work provides a basis for developing characters that use situated spoken natural language to communicate meaningfully with human players.
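One way the combination of a physical-situation model and a plan model could work in practice is a weighted log-linear interpolation of their scores over candidate interpretations of an ambiguous utterance. The following is only a minimal sketch under that assumption; the referent names, scores, and weights are hypothetical and not taken from the paper:

```python
import math

def combine(physical: dict, plan: dict, w_phys: float = 0.5, w_plan: float = 0.5) -> dict:
    """Log-linearly interpolate two situation models over the same
    candidate referents, then renormalize to a probability distribution."""
    log_scores = {
        ref: w_phys * math.log(physical[ref]) + w_plan * math.log(plan[ref])
        for ref in physical
    }
    total = sum(math.exp(s) for s in log_scores.values())
    return {ref: math.exp(s) / total for ref, s in log_scores.items()}

# Toy example: disambiguating "take that one" between two blocks.
physical = {"red_block": 0.7, "blue_block": 0.3}  # red block is nearer the speaker
plan     = {"red_block": 0.2, "blue_block": 0.8}  # inferred goal needs the blue block
posterior = combine(physical, plan)
best = max(posterior, key=posterior.get)
```

Here the plan evidence outweighs mere proximity, so the combined model prefers `blue_block` even though the physical model alone would pick `red_block`, illustrating how the two influences can correct one another.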