Applying Statistical Methods to Small Corpora: Benefiting from a Limited Domain

David Fisher and Ellen Riloff

Statistical approaches to problems in natural language processing generally require large corpora (1,000,000+ words) to produce useful results. In this paper we show that a well-known statistical technique, the t test, can be applied to smaller corpora than was previously thought possible by relying on semantic features rather than lexical items in a corpus of limited domain. We apply the t test to the problem of resolving relative pronoun antecedents, using collocation frequency data collected from the 500,000-word MUC-4 corpus. We conduct two experiments in which t is calculated with lexical items and with semantic feature representations. We show that the test cases relevant to the MUC-4 domain produce more significant values of t than the irrelevant ones. We also show that the t test correctly resolves the relative pronoun in 91.07% of the relevant test cases where the value of t is significant.
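
As a rough illustration of the kind of statistic involved, the sketch below computes a t score for a co-occurring pair from raw frequency counts. It uses one common textbook formulation for collocations (observed versus expected co-occurrence probability, with a Bernoulli variance estimate); the abstract does not give the exact formula used in the paper, so this may differ from it. The function name, parameter names, and all counts are hypothetical. The same function applies whether the counted items are words or semantic features; pooling counts over semantic features is what makes the pair counts larger in a small corpus.

```python
import math


def collocation_t_score(pair_count, left_count, right_count, total):
    """Return a t score for a co-occurring pair of items.

    Assumed formulation (not taken from the paper):
        t = (observed - expected) / sqrt(variance / total)
    where observed is the pair's relative frequency, expected is the
    product of the two items' relative frequencies (independence), and
    variance is the Bernoulli estimate observed * (1 - observed).
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if pair_count <= 0:
        # With no observed co-occurrences the variance estimate is zero,
        # so the score is undefined under this formulation.
        raise ValueError("pair_count must be positive")

    observed = pair_count / total
    expected = (left_count / total) * (right_count / total)
    variance = observed * (1.0 - observed)
    return (observed - expected) / math.sqrt(variance / total)


# Hypothetical counts: an item pair seen 30 times in a 500,000-item corpus,
# where the left item occurs 100 times and the right item 200 times.
t = collocation_t_score(pair_count=30, left_count=100, right_count=200, total=500_000)
print(round(t, 2))
```

A larger t means the pair co-occurs more often than independence would predict; what counts as "significant" depends on the threshold chosen, which this sketch leaves to the caller.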
