[PS-3.13]Statistical learning of syntactic structure

Clark, A.

King's College London

Statistical Learning is potentially an important component of language acquisition; however, it is
not clear how this approach can infer the sorts of hierarchically structured grammars that are thought to
be necessary to describe natural language syntax, such as the context-free grammar (CFG). Here
we show
that statistical learning can be used to infer probabilistic CFGs from strings without any structural or
semantic information.
Using randomly generated probabilistic CFGs, we demonstrate that if the grammars satisfy certain natural restrictions,
then a simple statistical learner can infer the structure of the grammar.
The restrictions are that a) there are many more words than categories, b) that the grammars are in Chomsky normal form and c) that the
grammars are not overly ambiguous, which we measure using the performance of an optimal parser.
The learner uses only local n-gram statistics to infer the structure of the grammar, and is based on theoretically motivated algorithms with good convergence guarantees.
On medium sized grammars with 15 nonterminals (categories), the learner produces highly accurate grammars over 95% of the time.
These results imply that statistical learning based on purely local shallow distributional clues can have an important role in the acquisition of recursive syntactic structures.