Extracting the 3D geometry plays an important part in scene understanding. Recently, robust visual descriptors are proposed for extracting the indoor scene layout from a passive agent’s perspective, specifically from a single image. Their robustness is mainly due to modelling the physical interaction of the underlying room geometry with the objects and the humans present in the room. In this work we add the physical constraints coming from acoustic echoes, generated by an audio source, to this visual model. Our audio-visual 3D geometry descriptor improves over the state of the art in passive perception models as we show in our experiments.