Clark, A.

King's College London

Statistical Learning is potentially an important component of language acquisition; however, it is

not clear how this approach can infer the sorts of hierarchically structured grammars that are thought to

be necessary to describe natural language syntax, such as the context-free grammar (CFG). Here

we show

that statistical learning can be used to infer probabilistic CFGs from strings without any structural or

semantic information.

Using randomly generated probabilistic CFGs, we demonstrate that if the grammars satisfy certain natural restrictions,

then a simple statistical learner can infer the structure of the grammar.

The restrictions are that a) there are many more words than categories, b) that the grammars are in Chomsky normal form and c) that the

grammars are not overly ambiguous, which we measure using the performance of an optimal parser.

The learner uses only local n-gram statistics to infer the structure of the grammar, and is based on theoretically motivated algorithms with good convergence guarantees.

On medium sized grammars with 15 nonterminals (categories), the learner produces highly accurate grammars over 95% of the time.

These results imply that statistical learning based on purely local shallow distributional clues can have an important role in the acquisition of recursive syntactic structures.