A characteristic shared by most approaches to natural language understanding and generation is the use of symbolic representations of word and sentence meanings. Frames and semantic networks are two popular examples. Symbolic methods alone are inadequate for applications such as conversational robotics, which require natural language semantics to be linked to perception and motor control. This paper presents an overview of our efforts towards robust spoken language understanding and generation systems with sensory-motor grounded semantics. Each system is trained by "show-and-tell" using cross-modal language acquisition algorithms. The first system learns to generate natural spoken descriptions of objects in synthetically generated multi-object scenes. The second system performs the converse task: given a spoken description, it finds the best matching object and points at it. This system is embodied as a robotic device that binds the semantics of spoken phrases to objects identified by its real-time computer vision system and uses a laser to point to the selected object. The third system, still in its early phases, is a trainable robotic manipulator that serves as the basis for our experiments in learning the semantics of action verbs. These experimental implementations are part of our larger ongoing effort to develop a comprehensive model of sensory-motor grounded semantics.