Without a TRACE: A time-invariant model of spoken word recognition

Hannagan, T. (1, 3), Magnuson, J. (2, 4) & Grainger, J. (1, 3)

1 CNRS
2 University of Connecticut
3 Université de Provence
4 Haskins Laboratories

How do we map the rapid input of spoken language onto phonological and lexical representations over time? Attempts at psychologically tractable computational models of spoken word recognition tend either to ignore time or to transform the temporal input into a spatial representation. The latter is the approach taken in TRACE (McClelland & Elman, 1986), a connectionist model with broad and deep coverage of speech perception and spoken word recognition phenomena. TRACE reduplicates featural, phonemic, and lexical inputs at every time step in a large memory trace, with rich interconnections (excitatory forward and backward connections between levels and inhibitory links within levels). This leads to a proliferation of units and connections that grows dramatically with the size of the lexicon and the length of the memory trace. Our starting point is the observation that models of visual object recognition, including visual word recognition, have long grappled with the fundamental problem of how to model spatial invariance in human object recognition, but have not embraced the fully reduplicative strategy of TRACE. We have developed a model that combines time-specific phoneme representations similar to those in TRACE with higher-level representations based on string kernels (Hannagan & Grainger, 2012): time-invariant diphone and lexical units. This reduces the number of units and connections required by several orders of magnitude relative to TRACE. I will introduce the model and compare its performance to that of TRACE on a set of key phenomena. Then I will discuss phenomena that the model does not yet successfully simulate (and why), and novel predictions that follow from this architecture.
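To make the idea of a time-invariant diphone code concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the model's implementation: it assumes that a diphone unit stands for an ordered pair of phonemes, adjacent or not, and that similarity between two words is the dot product of their diphone counts. The function names (open_diphones, kernel_similarity) and the toy phoneme strings are hypothetical.

```python
from collections import Counter
from itertools import combinations


def open_diphones(timed_phonemes):
    """Map (time_slot, phoneme) pairs onto counts of ordered phoneme pairs.

    Assumption for this sketch: a diphone is any ordered pair of phonemes,
    adjacent or not. Only relative order is kept, so absolute time is lost.
    """
    ordered = [phoneme for _, phoneme in sorted(timed_phonemes)]
    return Counter(combinations(ordered, 2))


def kernel_similarity(a, b):
    """Dot product of two diphone count vectors (a simple string kernel)."""
    da, db = open_diphones(a), open_diphones(b)
    return sum(count * db[pair] for pair, count in da.items())


def at_onset(phonemes, onset):
    """Place a phoneme sequence in the input starting at a given time slot."""
    return [(onset + i, p) for i, p in enumerate(phonemes)]


# The same word heard early or late yields the same diphone code.
early = at_onset(["k", "a", "t"], onset=0)
late = at_onset(["k", "a", "t"], onset=7)
same_code = open_diphones(early) == open_diphones(late)

# Order still matters: reordered phonemes share fewer diphones.
cat_vs_cat = kernel_similarity(early, late)
cat_vs_act = kernel_similarity(early, at_onset(["a", "k", "t"], onset=0))
cat_vs_tack = kernel_similarity(early, at_onset(["t", "a", "k"], onset=0))
```

In this sketch the time-specific input (the time slots) is discarded once the ordered pairs are formed, so a single set of diphone units serves every possible word onset instead of one copy per time step.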