[PS-2.6] Modeling the actor-critic architecture by combining recent work in reservoir computing and temporal difference learning in complex environments

Rodny, J. & Noelle, D.

University of California, Merced

Humans are often able to adapt their behavior, based on experience, so as to optimize the likelihood of obtaining future rewards. These reinforcement learning (RL) phenomena have been well captured by computational models employing temporal difference (TD) learning. TD learning typically involves an "actor-critic" (AC) architecture, in which an "adaptive critic" module learns to predict future reward. These models are proven to converge to optimal solutions in simple RL environments; however, the proof assumes a discrete environment. In continuous and very large discrete environments, the "value function" (VF) in the AC must be approximated. Traditional "value function approximators" (VFAs), such as artificial neural networks trained with back-propagation, have proven fruitful (Tesauro, 1992) but are not reliable in complex or continuous environments, often failing to converge to any solution (Boyan and Moore, 1995). Headway has been made using more sophisticated neural network implementations of the VFA: for some problems, convergence to a bounded region of solutions has been proven when the VFA is an "echo state network" (ESN), a form of "reservoir computing" (Szita et al., 2006). More general spiking neural networks have been used to implement an AC architecture (Potjans et al., 2009), and other work suggests that "spike-timing-dependent plasticity" (STDP) may be a possible underlying mechanism for computing temporal differences of reward (Roberts et al., 2008). None of this work, however, has addressed the failure of VFAs in AC architectures to learn, in a general way, in complex environments. We report on simulations in which spiking neural networks, including liquid state machines, are applied to the problem of value function approximation in such complex environments, demonstrating both the benefits and the pitfalls of using the temporal dynamics of spikes to encode continuous state value information.
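As a concrete illustration of the architecture discussed above, the following is a minimal sketch of a TD(0) actor-critic in which the critic's value function is approximated by a linear readout of a fixed random recurrent reservoir (echo-state-network style), in the spirit of Szita et al. (2006). It is not the authors' model: the reservoir size, leak rate, learning rates, softmax actor, and update rules are illustrative assumptions chosen to make the sketch self-contained.

```python
# Hypothetical sketch: TD(0) actor-critic with a reservoir-based value function
# approximator. All parameter values below are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)

N_RES = 100        # reservoir units
N_IN = 2           # dimensionality of the continuous state
N_ACTIONS = 3
GAMMA = 0.95       # discount factor
ALPHA_V = 0.01     # critic (value readout) learning rate
ALPHA_PI = 0.01    # actor learning rate
LEAK = 0.3         # leaky-integration rate of reservoir units

# Fixed random weights: input and recurrent connections are never trained.
W_in = rng.normal(scale=0.5, size=(N_RES, N_IN))
W_res = rng.normal(size=(N_RES, N_RES))
# Rescale so the spectral radius is below 1 (the "echo state" property).
W_res *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_res)))

w_v = np.zeros(N_RES)                 # critic: linear readout of reservoir state
w_pi = np.zeros((N_ACTIONS, N_RES))   # actor: action preferences from the same state


def reservoir_step(x, s):
    """Leaky-integrated reservoir update driven by the environment state s."""
    return (1 - LEAK) * x + LEAK * np.tanh(W_in @ s + W_res @ x)


def select_action(x):
    """Softmax policy over the actor's readout of the reservoir state."""
    prefs = w_pi @ x
    p = np.exp(prefs - prefs.max())
    p /= p.sum()
    return rng.choice(N_ACTIONS, p=p), p


def td_update(x, x_next, r, a, p, terminal):
    """One TD(0) step: the same TD error trains the critic and actor readouts."""
    global w_v, w_pi
    v = w_v @ x
    v_next = 0.0 if terminal else w_v @ x_next
    delta = r + GAMMA * v_next - v        # temporal-difference error
    w_v += ALPHA_V * delta * x            # critic: move V(s) toward the TD target
    grad = -p[:, None] * x[None, :]       # policy-gradient-style actor update
    grad[a] += x
    w_pi += ALPHA_PI * delta * grad
```

Only the two linear readouts are adapted; the recurrent reservoir stays fixed, which is what allows proofs of bounded-region convergence of the kind cited above. The abstract's proposal replaces this rate-coded reservoir with spiking networks such as liquid state machines, so that the value signal is carried by spike timing rather than by a static activation vector.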