A Richly Annotated Corpus for Probabilistic Parsing

This paper describes the use of a small but syntactically rich parsed corpus of English in probabilistic parsing. Software has been developed to extract probabilistic systemic-functional grammars (SFGs) from the Polytechnic of Wales Corpus in several formalisms; the same software could equally well be applied to other parsed corpora. To complement the large probabilistic grammar, we discuss progress in the provision of lexical resources, which range from corpus wordlists to a large lexical database supplemented with word frequencies and SFG categories. The lexicon and grammar resources may be used in a variety of probabilistic parsing programs, one of which is presented in some detail: the Realistic Annealing Parser. Compared to traditional rule-based methods, such parsers usually implement complex algorithms and are relatively slow, but they are more robust in providing analyses of unrestricted and even semi-grammatical English.

