ePoster

Reward Bases: instant reward revaluation with temporal difference learning

Beren Millidge, Mark Walton, Rafal Bogacz
COSYNE 2022 (2022)
Lisbon, Portugal

Abstract

The dominant theory of dopamine function in the basal ganglia is model-free reinforcement learning (RL), in which dopaminergic neurons in the ventral tegmental area (VTA) encode reward prediction errors that modulate plasticity at cortico-striatal synapses so as to learn a value function over cortical states (Schultz et al 1998). However, a key assumption of this model (and of model-free RL generally) is that the reward function being optimized is fixed, whereas for biological creatures the 'reward function' can fluctuate over time depending on physiological state: for example, food is rewarding when hungry but not when satiated. Experiments (Robinson et al 2013) have demonstrated that animals can instantly adapt their behaviour when their internal physiological state changes, but the neurocomputational underpinnings of this capability are unknown, and it cannot be accounted for by standard model-free RL methods, which must be retrained from scratch if the reward function changes. In this abstract, we propose a novel and simple extension to temporal difference (TD) learning that allows for zero-shot (instant) generalization to changing reward functions. Specifically, we demonstrate that if we interpret the reward function as a linear combination of reward basis vectors and learn a separate value function for each reward basis using standard TD learning, then we can instantly compute the value function of any reward function in the span of the reward basis vectors. Moreover, this algorithm can be straightforwardly implemented in neural circuitry by simply parallelizing the circuits proposed in Schultz et al (1998). Here, we present the mathematical formalism underlying our algorithm and demonstrate that it can reproduce the behavioural effects of instant generalization (Robinson et al 2013) as well as dopamine responses in ventral striatum (VS) (Papageorgiou et al 2016).
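
The core idea can be illustrated with a minimal tabular sketch. It is not the authors' implementation: the ring-shaped environment, the "food" and "water" bases, the weights, and all function names are assumptions chosen for illustration. Because the TD(0) update is linear in the reward, a value function learned separately for each reward basis can be recombined with new weights to give the value function for any reward in their span, with no further learning.

```python
import random


def td_learn(transitions, rewards, n_states, alpha=0.1, gamma=0.9):
    """Tabular TD(0) policy evaluation over a fixed list of (s, s_next) pairs.

    rewards[s_next] is the reward received on entering s_next.
    """
    values = [0.0] * n_states
    for s, s_next in transitions:
        delta = rewards[s_next] + gamma * values[s_next] - values[s]  # prediction error
        values[s] += alpha * delta
    return values


def sample_transitions(n_states, n_steps, seed=0):
    """Random walk on a ring of states (a stand-in for experience under a fixed policy)."""
    rng = random.Random(seed)
    s, out = 0, []
    for _ in range(n_steps):
        s_next = (s + rng.choice([-1, 1])) % n_states
        out.append((s, s_next))
        s = s_next
    return out


N_STATES = 6
# Hypothetical reward bases: food is found at state 2, water at state 4.
reward_bases = {
    "food": [0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    "water": [0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
}

experience = sample_transitions(N_STATES, 5000)

# One value function per reward basis, each learned with standard TD.
value_bases = {
    name: td_learn(experience, r, N_STATES) for name, r in reward_bases.items()
}


def revalue(weights):
    """Instant revaluation: weighted sum of the learned value bases."""
    return [
        sum(weights[k] * value_bases[k][s] for k in value_bases)
        for s in range(N_STATES)
    ]


def direct_td(weights):
    """Baseline: relearn from scratch on the combined reward function."""
    combined = [
        sum(weights[k] * reward_bases[k][s] for k in reward_bases)
        for s in range(N_STATES)
    ]
    return td_learn(experience, combined, N_STATES)


# Hypothetical physiological states expressed as weights on the bases.
hungry = {"food": 1.0, "water": 0.2}
thirsty = {"food": 0.1, "water": 1.0}

print("hungry :", [round(v, 3) for v in revalue(hungry)])
print("thirsty:", [round(v, 3) for v in revalue(thirsty)])
```

In this sketch, switching from the hungry to the thirsty weights only changes the weighted sum in revalue; the learned value bases are untouched. The direct_td baseline is included only to check that the recombined values agree with retraining on the combined reward using the same experience.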

Unique ID: cosyne-22/reward-bases-instant-reward-revaluation-1aaa2fb6