We are studying how to create social physical agents, i.e., humanoids whose actions are driven by real-time audio-visual tracking of multiple talkers. Social skills require complex perceptual and motor capabilities as well as communicative ones. It is critical to identify the primary features when designing building blocks for social skills, because the performance of social interaction is usually evaluated for the system as a whole rather than for each component. We investigate the minimum functionality needed for social interaction, supposing that a humanoid is equipped with auditory and visual perception and simple motor control but has no sound output. A real-time audio-visual multiple-talker tracking system is implemented on the humanoid SIG by using sound source localization, stereo vision, face recognition, and motor control. It extracts auditory and visual streams and associates them by their proximity in localization. Socially-oriented attention control makes use of the personality variations classified by the Interpersonal Theory of psychology. The system also provides task-oriented functions with a decaying belief factor for each stream. We demonstrate that the resulting behavior of SIG invites users to participate in the interaction and encourages them to explore its behaviors. These demonstrations show that SIG behaves like a physical, non-verbal Eliza.
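To make the stream association and belief decay mentioned above concrete, the following is a minimal sketch and not the implementation used on SIG. It assumes that each stream carries a single azimuth estimate, that association is a greedy nearest-direction match within a threshold, and that belief decays exponentially over time. The names `Stream`, `decay_belief`, `associate`, the 10-degree threshold, and the 2-second half-life are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Stream:
    kind: str            # "auditory" or "visual"
    azimuth_deg: float   # estimated direction of the talker (assumed representation)
    belief: float = 1.0  # confidence that the stream is still valid


def decay_belief(belief: float, dt_s: float, half_life_s: float = 2.0) -> float:
    """Exponentially decay a belief value; the decay form is an assumption."""
    return belief * 0.5 ** (dt_s / half_life_s)


def associate(auditory, visual, max_gap_deg: float = 10.0):
    """Greedily pair each auditory stream with the nearest unused visual stream.

    A pair is formed only when the two directions differ by at most
    max_gap_deg (a hypothetical threshold). Angle wrap-around is ignored.
    """
    pairs = []
    unused = list(visual)
    for a in auditory:
        if not unused:
            break
        nearest = min(unused, key=lambda v: abs(v.azimuth_deg - a.azimuth_deg))
        if abs(nearest.azimuth_deg - a.azimuth_deg) <= max_gap_deg:
            pairs.append((a, nearest))
            unused.remove(nearest)
    return pairs


audio = [Stream("auditory", 12.0), Stream("auditory", -40.0)]
video = [Stream("visual", 15.0), Stream("visual", 60.0)]
pairs = associate(audio, video)
```

In this example only the auditory stream at 12 degrees and the visual stream at 15 degrees are close enough to be associated; the remaining streams stay unpaired and would lose belief over time unless they are observed again.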