Abstract:
Evidence from psychology suggests that humans process definite descriptions that refer to objects present in a visual scene incrementally upon hearing them, rather than constructing explicit parse trees after the whole sentence was said, which are then used to determine the referents. In this paper, we describe a real-time distributed robotic architectures for human reference resolution that demonstrates various interactions of auditory, visual, and semantic processing components hypothesized to underlie human processes.