Human languages order information efficiently

Gildea, D. 1 & Jaeger, T. F. 2

1 Computer Science, U. of Rochester
2 Brain and Cognitive Sciences, U. of Rochester

While most languages use the relative order between words to encode meaning relations, languages differ in what orders they use and how these orders are mapped onto different meanings. We test the hypothesis that--despite these differences--human languages might constitute different solutions to common pressures of language use. Specifically, we calculate per-word dependency length and per-word surprisal--both known to be correlated with increased processing difficulty--and compare them to what would be expected by chance.
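As a rough illustration of the two measures, the sketch below computes per-word dependency length from head indices and per-word surprisal under a bigram model with add-one smoothing. The bigram model, the smoothing, and the function names are assumptions made for this example; the abstract does not specify which language model was used.

```python
import math
from collections import Counter

def per_word_dependency_length(heads):
    """heads[i] is the 0-based index of word i's head, or None for the root."""
    total = sum(abs(i - h) for i, h in enumerate(heads) if h is not None)
    return total / len(heads)

def per_word_bigram_surprisal(train_sentences, test_sentence):
    """Average of -log2 P(w_i | w_{i-1}) under an add-one-smoothed bigram model.

    The bigram model is an illustrative assumption, not the authors' model.
    """
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for sent in train_sentences:
        vocab.update(sent)
        tokens = ["<s>"] + sent  # sentence-start marker as first context
        for prev, cur in zip(tokens, tokens[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    vocab.update(test_sentence)
    v = len(vocab)
    tokens = ["<s>"] + test_sentence
    bits = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + v)
        bits -= math.log2(p)
    return bits / len(test_sentence)

# "the dog chased the cat": the->dog, dog->chased, chased is root,
# the->cat, cat->chased
heads = [1, 2, None, 4, 2]
dep_len = per_word_dependency_length(heads)
surprisal = per_word_bigram_surprisal([["a", "b"]], ["a", "b"])
```

Both quantities are averages per word, so they can be compared across sentences and corpora of different sizes.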

Method: We use large-scale syntactic corpora of Arabic, American English, Czech, German, and Mandarin. We use Monte Carlo simulations to create thousands of randomized versions of these languages with different (but internally consistent) word orders. This allows us to calculate the per-word dependency length and surprisal of each language *expected by chance*.
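One way such a chance baseline could be set up is sketched below for dependency length only. It assumes that each randomized language is defined by a random weight per dependency label, so that a given label is ordered the same way relative to its head throughout the corpus; this is one reading of "internally consistent", not a confirmed description of the authors' procedure. The tree encoding, function names, and the toy sentence are hypothetical.

```python
import random

def linearize(node, children, labels, weights):
    """Order a subtree: dependents whose label weight is negative precede
    the head, the others follow it, sorted by weight."""
    items = [(0.0, None)]  # the head itself sits at weight 0
    items += [(weights[labels[c]], c) for c in children.get(node, [])]
    items.sort(key=lambda t: t[0])
    order = []
    for _, child in items:
        if child is None:
            order.append(node)
        else:
            order.extend(linearize(child, children, labels, weights))
    return order

def total_dependency_length(order, heads):
    """Sum of head-dependent distances for one linearized sentence."""
    pos = {w: i for i, w in enumerate(order)}
    return sum(abs(pos[w] - pos[h]) for w, h in heads.items() if h is not None)

def random_grammar_baseline(trees, n_samples=1000, seed=0):
    """Per-word dependency length of the corpus under n_samples random grammars."""
    rng = random.Random(seed)
    label_set = sorted({lab for _, labels in trees for lab in labels.values()})
    n_words = sum(len(heads) for heads, _ in trees)
    samples = []
    for _ in range(n_samples):
        # One weight per label: the same ordering rule applies corpus-wide.
        weights = {lab: rng.uniform(-1.0, 1.0) for lab in label_set}
        total = 0
        for heads, labels in trees:
            children, root = {}, None
            for w, h in heads.items():
                if h is None:
                    root = w
                else:
                    children.setdefault(h, []).append(w)
            order = linearize(root, children, labels, weights)
            total += total_dependency_length(order, heads)
        samples.append(total / n_words)
    return samples

def empirical_p_value(observed, samples):
    """Share of random grammars that do at least as well as the real order."""
    hits = sum(s <= observed for s in samples)
    return (hits + 1) / (len(samples) + 1)

# Toy corpus: "the dog chased the cat" (word ids 0..4, hypothetical labels).
heads = {0: 1, 1: 2, 2: None, 3: 4, 4: 2}
labels = {0: "det", 1: "nsubj", 3: "det", 4: "obj"}
observed = total_dependency_length([0, 1, 2, 3, 4], heads) / len(heads)
samples = random_grammar_baseline([(heads, labels)], n_samples=200)
p = empirical_p_value(observed, samples)
```

The same loop would apply to surprisal by re-estimating a language model on each randomized corpus instead of summing dependency distances.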

Result: All five languages have per-word dependency length and surprisal significantly lower than expected by chance (ps < .01). This holds for both spoken and written data (Futrell et al., 2015, and Gildea & Temperley, 2010, report the same result for dependency length in many languages).

Additionally, we calculated optimal trade-offs between dependency length and surprisal (an NP-complete problem, but close approximations can be obtained through numerical methods). Four of the five languages fall very close to the optimal trade-off boundary. This suggests that these processing pressures shape word order changes over historical time. Of relevance to psycholinguistic theory, surprisal was at least as strong a predictor as dependency length. While the two measures are positively correlated on average, they trade off in the optimization: lowering surprisal eventually comes at the cost of increased dependency length.
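The notion of a trade-off boundary can be illustrated with a small sketch that extracts the non-dominated (Pareto-optimal) points from a set of candidate word-order grammars, each scored by dependency length and surprisal. This is a generic technique shown for intuition, not the numerical approximation the authors used, and the scores below are made-up placeholder values.

```python
def pareto_frontier(points):
    """Return the points not dominated by any other point.

    Each point is (dependency_length, surprisal); lower is better on both.
    """
    frontier = []
    best_surprisal = float("inf")
    # Scan in order of increasing dependency length; a point is on the
    # frontier only if it improves on the best surprisal seen so far.
    for dep_len, surprisal in sorted(points):
        if surprisal < best_surprisal:
            frontier.append((dep_len, surprisal))
            best_surprisal = surprisal
    return frontier

# Hypothetical (dependency length, surprisal) scores for candidate grammars.
candidates = [(2.0, 9.0), (2.5, 8.0), (3.0, 8.5), (3.5, 7.0), (2.2, 9.5)]
frontier = pareto_frontier(candidates)
```

Along the frontier, any further reduction in surprisal requires a larger dependency length, which is the trade-off described above. A language lying close to this boundary cannot be improved much on one measure without getting worse on the other.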