3aSP2 Statistical grammar inference.

ASA 124th Meeting New Orleans 1992 October

Yves Schabes

Dept. of Comput. Inform. Sci., Univ. of Pennsylvania, Philadelphia, PA 19104-6389

Language can be spoken, written, printed, or encoded in numerous other ways. Like any form of communication, each of these codings can be analyzed statistically for comparative judgment and predictive power. Early language models such as Markov models and N-gram models [C. E. Shannon, Bell Syst. Tech. J. 27(3), 379--423 (1948)], although efficient in practice, were quickly refuted in theory, since they cannot capture long-distance dependencies or describe the syntax of natural languages hierarchically. Stochastic context-free grammar [T. Booth, in 10th Annual IEEE Symp. on Switching and Automata Theory (1969)] is a hierarchical model that assigns a probability to each context-free rewriting rule. However, no such proposal performs as well as the simpler Markov models, because of the difficulty of capturing lexical information [Jelinek et al., Tech. Rep. RC 16374 (72684), IBM, Yorktown Heights, NY 10598 (1990)] [K. Lari and S. J. Young, Comput. Speech Lang. 4, 35--56 (1990)]. This is the case even if supervised training is used [F. Pereira and Y. Schabes, in ACL '1992]. Stochastic lexicalized tree-adjoining grammar (SLTAG) has recently been suggested as the basis for a training algorithm [Y. Schabes, in COLING '1992]. The parameters of an SLTAG correspond to the probability of combining two structures, each associated with a word. This system reconciles abstract structure and concrete data in natural language processing, since it is lexically sensitive and yet hierarchical. The need for a lexical yet hierarchical statistical language model is partially corroborated by preliminary experiments, which show that SLTAG enables better statistical modeling than its hierarchical counterpart, stochastic context-free grammar. The experiments also show that SLTAG is capable of capturing lexical distributions as well as bigram models do, while maintaining the hierarchical nature of language.
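
To make concrete the statement that a stochastic context-free grammar assigns a probability to each rewriting rule, here is a minimal Python sketch. The toy grammar, its probabilities, and the names RULES, TREE, is_proper, and derivation_probability are all hypothetical illustrations and are not taken from the cited work. The sketch scores a derivation as the product of the probabilities of the rules it uses.

```python
from math import isclose

# Hypothetical toy grammar (an assumption for illustration only):
# each left-hand side maps to {right-hand side: rule probability}.
RULES = {
    "S": {("NP", "VP"): 1.0},
    "NP": {("dogs",): 0.5, ("cats",): 0.5},
    "VP": {("V", "NP"): 0.7, ("sleep",): 0.3},
    "V": {("chase",): 1.0},
}


def is_proper(rules):
    """Check that the rule probabilities for each left-hand side sum to 1."""
    return all(isclose(sum(rhs.values()), 1.0) for rhs in rules.values())


def derivation_probability(tree, rules):
    """Multiply the probabilities of every rule used in a derivation tree.

    A tree is a tuple (label, child, ...), where each child is either
    another tree tuple (nonterminal) or a plain string (terminal word).
    """
    label, *children = tree
    # The right-hand side is the sequence of child labels or words.
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rules[label][rhs]
    for child in children:
        if not isinstance(child, str):
            p *= derivation_probability(child, rules)
    return p


# Derivation of "dogs chase cats" under the toy grammar.
TREE = ("S", ("NP", "dogs"), ("VP", ("V", "chase"), ("NP", "cats")))

if __name__ == "__main__":
    print(is_proper(RULES))
    print(derivation_probability(TREE, RULES))
```

In this toy grammar, the probability of rewriting NP as "cats" is the same whichever verb precedes it. That independence from neighboring words is one simple way to picture the difficulty of capturing lexical information mentioned above.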