We take the view that visual and linguistic information may be intertwined in the same memory structure, or, if they are processed into distinct memory structures, that rich links between them facilitate going from one representation to the other as necessary. We have developed a computer simulation of such a model in which inputs from separate visual and linguistic modalities are processed and then combined. Hierarchical representations of visual inputs and linguistic inputs are built and linked together; this makes it possible to reason with both representations and also grounds each representation in the other. For example, once the words "horse" and "striped" have been grounded for a person using visual inputs (pictures), simply being told that a "zebra is a striped horse" is often adequate for the person to form a workable concept of a zebra without actually being shown one (example due to Harnad, 1990). To teach "striped" in the first place, we might say something like "striped means with long narrow ribbons more or less parallel to one another," but think of how much simpler and more informative it is to show a few pictures of striped objects. Similarly, we are able to visualize a "sphinx" or a "unicorn" quite well from verbal descriptions. Whenever we describe objects or narrate events, we are exploiting the fact that words, or linguistic symbols, are grounded in their visual counterparts.
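The zebra example above can be sketched in code. The following is a minimal, hypothetical illustration (all class and feature names are our own, not part of the simulation described): words grounded visually are linked to sets of visual features, and a new word can then be grounded purely verbally by composing the visual representations of already-grounded words.

```python
class GroundedMemory:
    """Toy dual memory linking linguistic symbols to visual features.

    This is a deliberately simplified sketch: a "visual representation"
    is modeled as a flat set of feature labels rather than the
    hierarchical structures described in the text.
    """

    def __init__(self):
        # word -> set of visual features obtained from pictures
        self.visual = {}

    def ground_visually(self, word, features):
        """Link a word to features extracted from visual input (pictures)."""
        self.visual[word] = set(features)

    def ground_verbally(self, word, *component_words):
        """Ground a new word from already-grounded words, as in
        "a zebra is a striped horse": its visual representation is
        inherited by combining those of its components."""
        combined = set()
        for component in component_words:
            # Raises KeyError if a component was never grounded,
            # mirroring the claim that verbal definitions only work
            # once their terms are grounded.
            combined |= self.visual[component]
        self.visual[word] = combined

    def visualize(self, word):
        """Go from the linguistic symbol back to its visual counterpart."""
        return self.visual[word]


mem = GroundedMemory()
mem.ground_visually("horse", {"four_legs", "mane", "hooves"})
mem.ground_visually("striped", {"long_narrow_parallel_bands"})
mem.ground_verbally("zebra", "striped", "horse")
print(sorted(mem.visualize("zebra")))
```

Here "zebra" acquires a visual representation without any zebra picture ever being shown, because "horse" and "striped" were grounded visually first; attempting `ground_verbally` with an ungrounded component fails, consistent with the need to teach "striped" through pictures before using it in a definition.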