Unsupervised Context Sensitive Language Acquisition from Large, Untagged Corpora

Zach Solan, Eytan Ruppin, David Horn, and Shimon Edelman

A central tenet of generative linguistics is that extensive innate knowledge of grammar is essential to explain the acquisition of language from positive-only data (Chomsky, 1986). We explore an alternative hypothesis, according to which syntax is an abstraction that emerges from exposure to language (Hopper, 1998), coexisting with the corpus data within the same representational mechanism. Far from parsimonious, the representation we introduce allows partial overlap of linguistic patterns or constructions (Croft, 2001). The incremental process of acquisition of patterns is driven both by structural similarities and by statistical information inherent in the data, so that frequent strings of similar composition come to be represented by the same pattern. The degree of abstraction of a pattern varies: it may be high, as in the case of a frame with several slots, each occupied by a member of an equivalence class associated with it, or low, as in the extreme case of idioms or formulaic language snippets, where there is no abstraction at all (Langacker, 1987; Wray, 2002). The acquired patterns fully represent the original data and, crucially, enable structure-sensitive generalization in the production and the assimilation of unseen examples.

