Reinforcement
Choice between methamphetamine and food is modulated by reinforcement interval and central drug metabolism
Understanding reward-guided learning using large-scale datasets
Understanding the neural mechanisms of reward-guided learning is a long-standing goal of computational neuroscience. Recent methodological innovations enable us to collect ever larger neural and behavioral datasets. This presents opportunities to achieve greater understanding of learning in the brain at scale, as well as methodological challenges. In the first part of the talk, I will discuss our recent insights into the mechanisms by which zebra finch songbirds learn to sing. Dopamine has long been thought to guide reward-based trial-and-error learning by encoding reward prediction errors. However, it is unknown whether the learning of natural behaviours, such as developmental vocal learning, occurs through dopamine-based reinforcement. Longitudinal recordings of dopamine and bird songs reveal that dopamine activity is indeed consistent with encoding a reward prediction error during naturalistic learning. In the second part of the talk, I will discuss recent work at DeepMind on tools for automatically discovering interpretable models of behavior directly from animal choice data. Our method, dubbed CogFunSearch, uses LLMs within an evolutionary search process to "discover" novel models in the form of Python programs that excel at accurately predicting animal behavior during reward-guided learning. The discovered programs reveal novel patterns of learning and choice behavior that update our understanding of how the brain solves reinforcement learning problems.
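To make the search target concrete, below is a minimal sketch of the kind of interpretable Python program such a pipeline evaluates: a candidate cognitive model (here, ordinary Q-learning with a softmax policy and invented parameters) scored by the log-likelihood of an animal's choices. The discovered programs themselves are richer; this only illustrates the program-as-model format.

```python
import numpy as np

def candidate_model(choices, rewards, alpha=0.3, beta=3.0):
    """Score a candidate cognitive model on a two-armed bandit session.

    choices, rewards: integer arrays (0/1) of the animal's choices and outcomes.
    Returns the log-likelihood of the choices under Q-learning + softmax.
    """
    q = np.zeros(2)
    loglik = 0.0
    for c, r in zip(choices, rewards):
        p = np.exp(beta * q) / np.exp(beta * q).sum()  # softmax choice policy
        loglik += np.log(p[c])
        q[c] += alpha * (r - q[c])  # prediction-error value update
    return loglik

# example: score a short 3-trial session
print(candidate_model(np.array([0, 1, 1]), np.array([0, 1, 1])))
```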
Screen Savers: Protecting adolescent mental health in a digital world
In our rapidly evolving digital world, there is increasing concern about the impact of digital technologies and social media on the mental health of young people. Policymakers and the public are nervous. Psychologists are facing mounting pressure to deliver evidence that can inform policies and practices to safeguard both young people and society at large. However, research progress is slow while technological change is accelerating. My talk will reflect on this, both as a question of psychological science and of metascience. Digital companies have designed highly popular environments that differ in important ways from traditional offline spaces. By revisiting the foundations of psychology (e.g. development and cognition) and considering how digital change affects theories and findings, we gain deeper insights into questions such as the following. (1) How do digital environments exacerbate developmental vulnerabilities that predispose young people to mental health conditions? (2) How do digital designs interact with cognitive and learning processes, formalised through computational approaches such as reinforcement learning or Bayesian modelling? However, we also need to face deeper questions about what it means to do science about new technologies and the challenge of keeping pace with technological advancement. I therefore discuss the concept of ‘fast science’, where, during crises, scientists might lower their standards of evidence to reach conclusions more quickly. Might psychologists want to take this approach in the face of technological change and looming concerns? The talk concludes with a discussion of such strategies for 21st-century psychology research in the era of digitalisation.
Decision and Behavior
This webinar addressed computational perspectives on how animals and humans make decisions, spanning normative, descriptive, and mechanistic models. Sam Gershman (Harvard) presented a capacity-limited reinforcement learning framework in which policies are compressed under an information bottleneck constraint. This approach predicts pervasive perseveration, stimulus‐independent “default” actions, and trade-offs between complexity and reward. Such policy compression reconciles observed action stochasticity and response time patterns with an optimal balance between learning capacity and performance. Jonathan Pillow (Princeton) discussed flexible descriptive models for tracking time-varying policies in animals. He introduced dynamic Generalized Linear Models (Sidetrack) and hidden Markov models (GLM-HMMs) that capture day-to-day and trial-to-trial fluctuations in choice behavior, including abrupt switches between “engaged” and “disengaged” states. These models provide new insights into how animals’ strategies evolve under learning. Finally, Kenji Doya (OIST) highlighted the importance of unifying reinforcement learning with Bayesian inference, exploring how cortical-basal ganglia networks might implement model-based and model-free strategies. He also described Japan’s Brain/MINDS 2.0 and Digital Brain initiatives, aiming to integrate multimodal data and computational principles into cohesive “digital brains.”
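As a concrete rendering of the policy-compression idea (our sketch, with an invented task and Q-values, not Gershman's code), the rate-distortion-style iteration below optimizes a policy penalized by the mutual information between states and actions; the marginal action distribution acts as a "default" policy, which is where perseveration-like behavior comes from.

```python
import numpy as np

Q = np.array([[1.0, 0.0],   # Q-values for a hypothetical 3-state, 2-action task
              [0.8, 0.2],
              [0.0, 1.0]])
p_s = np.ones(3) / 3        # state distribution
beta = 2.0                  # resource parameter: lower beta = more compression

pi = np.ones((3, 2)) / 2
for _ in range(200):        # Blahut-Arimoto-style alternating updates
    p_a = p_s @ pi          # marginal over actions: the "default" policy
    pi = p_a * np.exp(beta * Q)
    pi /= pi.sum(axis=1, keepdims=True)

print(np.round(pi, 2))      # compressed policies lean on the default action
```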
Unmotivated bias
In this talk, I will explore how social affective biases arise even in the absence of motivational factors as an emergent outcome of the basic structure of social learning. In several studies, we found that initial negative interactions with some members of a group can cause subsequent avoidance of the entire group, and that this avoidance perpetuates stereotypes. Additional cognitive modeling revealed that approach and avoidance behavior based on biased beliefs not only influences the evaluative (positive or negative) impressions of group members, but also shapes the depth of the cognitive representations available for learning about individuals. In other words, people have richer cognitive representations of members of groups that are not avoided, akin to individualized vs. group-level categories. I will end by presenting a series of multi-agent reinforcement learning simulations that demonstrate the emergence of these social-structural feedback loops in the development and maintenance of affective biases.
Contribution of computational models of reinforcement learning to neuroscience (keywords: computational modeling, reward, learning, decision-making, conditioning, navigation, dopamine, basal ganglia, prefrontal cortex, hippocampus)
Decomposing motivation into value and salience
Humans and other animals approach reward, avoid punishment, and pay attention to cues predicting these events. Motivated behavior thus appears to be guided by value, which directs behavior towards or away from positively or negatively valenced outcomes. Moreover, it is facilitated by (top-down) salience, which enhances attention to behaviorally relevant learned cues predicting the occurrence of valenced outcomes. Using human neuroimaging, we recently separated value (ventral striatum, posterior ventromedial prefrontal cortex) from salience (anterior ventromedial cortex, occipital cortex) in the domain of liquid reward and punishment. Moreover, we investigated potential drivers of learned salience: the probability and uncertainty with which valenced and non-valenced outcomes occur. We find that the brain dissociates valenced from non-valenced probability and uncertainty, indicating that reinforcement matters for the brain over and above the information provided by probability and uncertainty alone, regardless of valence. Finally, we assessed learning signals (unsigned prediction errors) that may underpin the acquisition of salience. The insula in particular appears to be central for this function, encoding a subjective salience prediction error similarly at the time of positively and negatively valenced outcomes. However, it appears to employ domain-specific time constants, leading to stronger salience signals in the aversive than the appetitive domain at the time of cues. These findings explain why previous research associated the insula with both valence-independent salience processing and with preferential encoding of the aversive domain. More generally, the distinction between value and salience appears to provide a useful framework for capturing the neural basis of motivated behavior.
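The value/salience decomposition maps naturally onto signed versus unsigned prediction errors. Below is a toy learner (illustrative outcome statistics and learning rate) tracking both quantities for a single cue that predicts reward or punishment with equal probability:

```python
import numpy as np

rng = np.random.default_rng(0)
value, salience, alpha = 0.0, 0.0, 0.1
for _ in range(500):
    outcome = rng.choice([1.0, -1.0])            # reward or punishment, 50/50
    delta = outcome - value                      # signed prediction error
    value += alpha * delta                       # drives approach/avoidance
    salience += alpha * (abs(delta) - salience)  # unsigned PE drives attention
print(f"value {value:+.2f}, salience {salience:.2f}")  # value ~0, salience high
```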
Maintaining Plasticity in Neural Networks
Nonstationarity presents a variety of challenges for machine learning systems. One surprising pathology which can arise in nonstationary learning problems is plasticity loss, whereby making progress on new learning objectives becomes more difficult as training progresses. Networks which are unable to adapt in response to changes in their environment experience plateaus or even declines in performance in highly non-stationary domains such as reinforcement learning, where the learner must quickly adapt to new information even after hundreds of millions of optimization steps. The loss of plasticity manifests in a cluster of related empirical phenomena which have been identified by a number of recent works, including the primacy bias, implicit under-parameterization, rank collapse, and capacity loss. While this phenomenon is widely observed, it is still not fully understood. This talk will present exciting recent results which shed light on the mechanisms driving the loss of plasticity in a variety of learning problems and survey methods to maintain network plasticity in non-stationary tasks, with a particular focus on deep reinforcement learning.
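Two of the signatures listed above are straightforward to monitor during training. The helpers below (our own sketch, with an illustrative spectrum threshold) compute an effective-rank proxy for representational collapse and the fraction of dead ReLU units from a batch of hidden activations:

```python
import numpy as np

def effective_rank(features, delta=0.01):
    """Smallest k whose top-k singular values carry 1-delta of the spectrum.

    features: (n_samples, n_units) hidden activations. A shrinking value over
    training is one symptom of rank collapse / capacity loss.
    """
    s = np.linalg.svd(features, compute_uv=False)
    cum = np.cumsum(s) / (s.sum() + 1e-12)
    return int(np.searchsorted(cum, 1.0 - delta) + 1)

def dead_unit_fraction(features):
    """Fraction of ReLU units that never activate on the batch."""
    return float((features.max(axis=0) <= 0.0).mean())

rng = np.random.default_rng(0)
acts = np.maximum(rng.normal(size=(256, 64)), 0)  # fake ReLU activations
print(effective_rank(acts), dead_unit_fraction(acts))
```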
A recurrent network model of planning predicts hippocampal replay and human behavior
When interacting with complex environments, humans can rapidly adapt their behavior to changes in task or context. To facilitate this adaptation, we often spend substantial periods of time contemplating possible futures before acting. For such planning to be rational, the benefits of planning to future behavior must at least compensate for the time spent thinking. Here we capture these features of human behavior by developing a neural network model where not only actions, but also planning, are controlled by prefrontal cortex. This model consists of a meta-reinforcement learning agent augmented with the ability to plan by sampling imagined action sequences drawn from its own policy, which we refer to as 'rollouts'. Our results demonstrate that this agent learns to plan when planning is beneficial, explaining the empirical variability in human thinking times. Additionally, the patterns of policy rollouts employed by the artificial agent closely resemble patterns of rodent hippocampal replays recently recorded in a spatial navigation task, in terms of both their spatial statistics and their relationship to subsequent behavior. Our work provides a new theory of how the brain could implement planning through prefrontal-hippocampal interactions, where hippocampal replays are triggered by, and in turn adaptively affect, prefrontal dynamics.
Learning to Express Reward Prediction Error-like Dopaminergic Activity Requires Plastic Representations of Time
The dominant theoretical framework to account for reinforcement learning in the brain is temporal difference (TD) reinforcement learning. The TD framework predicts that some neuronal elements should represent the reward prediction error (RPE), meaning that they signal the difference between the expected and actual future rewards. The prominence of the TD theory arises from the observation that the firing properties of dopaminergic neurons in the ventral tegmental area appear similar to those of RPE model-neurons in TD learning. Previous implementations of TD learning assume a fixed temporal basis for each stimulus that might eventually predict a reward. Here we show that such a fixed temporal basis is implausible and that certain predictions of TD learning are inconsistent with experiments. We propose instead an alternative theoretical framework, coined FLEX (Flexibly Learned Errors in Expected Reward). In FLEX, feature-specific representations of time are learned, allowing neural representations of stimuli to adjust their timing and relation to rewards in an online manner. In FLEX, dopamine acts as an instructive signal which helps build temporal models of the environment. FLEX is a general theoretical framework that has many possible biophysical implementations. To show that FLEX is a feasible approach, we present a specific biophysically plausible model which implements its principles. We show that this implementation can account for various reinforcement learning paradigms, and that its results and predictions are consistent with a preponderance of both existing and reanalyzed experimental data.
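For contrast, here is a minimal version of the fixed temporal basis that FLEX argues against: tabular TD(0) on a "complete serial compound", where every post-cue time bin carries its own value weight (timings and parameters are illustrative):

```python
import numpy as np

T, rew_t = 20, 15          # time bins per trial; cue at t=0, reward at t=15
alpha, gamma = 0.1, 0.98
w = np.zeros(T + 1)        # one value weight per fixed time bin after the cue

for trial in range(1000):
    for t in range(T):
        r = 1.0 if t + 1 == rew_t else 0.0
        delta = r + gamma * w[t + 1] - w[t]   # TD reward prediction error
        w[t] += alpha * delta

# The value ramp back-propagates from reward time to the cue, but only
# because each intervening bin has its own pre-assigned basis element.
print(np.round(w[:rew_t + 1], 2))
```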
Richly structured reward predictions in dopaminergic learning circuits
Theories from reinforcement learning have been highly influential for interpreting neural activity in the biological circuits critical for animal and human learning. Central among these is the identification of phasic activity in dopamine neurons as a reward prediction error signal that drives learning in basal ganglia and prefrontal circuits. However, recent findings suggest that dopaminergic prediction error signals have access to complex, structured reward predictions and are sensitive to more properties of outcomes than learning theories with simple scalar value predictions might suggest. Here, I will present recent work in which we probed the identity-specific structure of reward prediction errors in an odor-guided choice task and found evidence for multiple predictive “threads” that segregate reward predictions, and reward prediction errors, according to the specific sensory features of anticipated outcomes. Our results point to an expanded class of neural reinforcement learning algorithms in which biological agents learn rich associative structure from their environment and leverage it to build reward predictions that include information about the specific, and perhaps idiosyncratic, features of available outcomes, using these to guide behavior in even quite simple reward learning tasks.
Off-policy learning in the basal ganglia
I will discuss work with Jack Lindsey modeling reinforcement learning for action selection in the basal ganglia. I will argue that the presence of multiple brain regions, in addition to the basal ganglia, that contribute to motor control motivates the need for an off-policy basal ganglia learning algorithm. I will then describe a biological implementation of such an algorithm that predicts tuning of dopamine neurons to a quantity we call "action surprise," in addition to reward prediction error. In the same model, an implementation of learning from a motor efference copy also predicts a novel solution to the problem of multiplexing feedforward and efference-related striatal activity. The solution exploits the difference between D1- and D2-expressing medium spiny neurons and leads to predictions about striatal dynamics.
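The circuit model itself is not reproduced here, but the off-policy requirement can be illustrated with a minimal importance-corrected policy-gradient bandit, in which the learned policy improves from actions issued by a different controller; in the talk's model, dopamine is predicted to carry an additional "action surprise" term that plays the analogous corrective role. All parameters below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
pref = np.zeros(3)                      # target policy preferences (the learner)
behavior = np.ones(3) / 3               # a different controller picks actions
reward_probs = np.array([0.2, 0.5, 0.8])
alpha = 0.05

for trial in range(5000):
    pi = np.exp(pref) / np.exp(pref).sum()
    a = rng.choice(3, p=behavior)       # action NOT drawn from the learned policy
    r = float(rng.random() < reward_probs[a])
    w = pi[a] / behavior[a]             # off-policy importance correction
    grad = -pi.copy()
    grad[a] += 1.0                      # gradient of log pi(a)
    pref += alpha * w * r * grad

print(np.round(np.exp(pref) / np.exp(pref).sum(), 2))  # mass on the best arm
```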
Mapping learning and decision-making algorithms onto brain circuitry
In the first half of my talk, I will discuss our recent work on the midbrain dopamine system. The hypothesis that midbrain dopamine neurons broadcast an error signal for the prediction of reward is among the great successes of computational neuroscience. However, our recent results contradict a core aspect of this theory: that the neurons uniformly convey a scalar, global signal. I will review this work, as well as our new efforts to update models of the neural basis of reinforcement learning with our data. In the second half of my talk, I will discuss our recent findings of state-dependent decision-making mechanisms in the striatum.
Memory-enriched computation and learning in spiking neural networks through Hebbian plasticity
Memory is a key component of biological neural systems that enables the retention of information over a huge range of temporal scales, ranging from hundreds of milliseconds up to years. While Hebbian plasticity is believed to play a pivotal role in biological memory, it has so far been analyzed mostly in the context of pattern completion and unsupervised learning. Here, we propose that Hebbian plasticity is fundamental for computations in biological neural systems. We introduce a novel spiking neural network (SNN) architecture that is enriched by Hebbian synaptic plasticity. We experimentally show that our memory-equipped SNN model outperforms state-of-the-art deep learning mechanisms in a sequential pattern-memorization task, as well as demonstrate superior out-of-distribution generalization capabilities compared to these models. We further show that our model can be successfully applied to one-shot learning and classification of handwritten characters, improving over the state-of-the-art SNN model. We also demonstrate the capability of our model to learn associations for audio-to-image synthesis from spoken and handwritten digits. Our SNN model further presents a novel solution to a variety of cognitive question-answering tasks from a standard benchmark, achieving comparable performance to both memory-augmented ANN and SNN-based state-of-the-art solutions to this problem. Finally, we demonstrate that our model is able to learn from rewards on an episodic reinforcement learning task and attain a near-optimal strategy on a memory-based card game. Hence, our results show that Hebbian enrichment renders spiking neural networks surprisingly versatile in terms of their computational as well as learning capabilities. Since local Hebbian plasticity can easily be implemented in neuromorphic hardware, this also suggests that powerful cognitive neuromorphic systems can be built based on this principle.
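At its core, the Hebbian ingredient is one-shot outer-product binding. A rate-based toy version (not the paper's spiking architecture) stores and retrieves an association in two lines:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 100
key = rng.choice([-1.0, 1.0], size=d)     # cue pattern
value = rng.choice([-1.0, 1.0], size=d)   # pattern to be recalled

W = np.outer(value, key)                  # one-shot Hebbian weight update
recalled = np.sign(W @ key)               # cue-driven retrieval
print((recalled == value).mean())         # 1.0: association stored in one shot
```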
Learning Relational Rules from Rewards
Humans perceive the world in terms of objects and relations between them. In fact, for any given pair of objects, there is a myriad of relations that apply to them. How does the cognitive system learn which relations are useful for characterizing the task at hand? And how can it use these representations to build a relational policy to interact effectively with the environment? Here we propose that this problem can be understood through the lens of a sub-field of symbolic machine learning called relational reinforcement learning (RRL). To demonstrate the potential of our approach, we build a simple model of relational policy learning based on a function approximator developed in RRL. We trained and tested our model in three Atari games that required considering an increasing number of potential relations: Breakout, Pong and Demon Attack. In each game, our model was able to select adequate relational representations and build a relational policy incrementally. We discuss the relationship between our model and models of relational and analogical reasoning, as well as its limitations and future directions of research.
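A minimal sketch of the representational side of this approach (object names, relations, and dimensions are invented for illustration): enumerate candidate relations over object pairs and let a linear function approximator weight them, which is the raw material for a relational policy.

```python
import numpy as np

def relational_features(objects):
    """Truth values of candidate relations over all ordered object pairs.

    objects: dict name -> (x, y) position. Returns a binary feature vector
    over predicates such as left-of(i, j), above(i, j), near-x(i, j).
    """
    names = sorted(objects)
    feats = []
    for i in names:
        for j in names:
            if i == j:
                continue
            (xi, yi), (xj, yj) = objects[i], objects[j]
            feats += [float(xi < xj),             # left-of(i, j)
                      float(yi > yj),             # above(i, j)
                      float(abs(xi - xj) < 1.0)]  # near-x(i, j)
    return np.array(feats)

state = {"ball": (3.0, 5.0), "paddle": (4.0, 1.0)}
phi = relational_features(state)
w = np.zeros((3, phi.size))   # one weight row per action, e.g. left/stay/right
q_values = w @ phi            # linear Q-values over relational features
print(phi.astype(int), q_values)
```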
Designing the BEARS (Both Ears) Virtual Reality Training Package to Improve Spatial Hearing in Young People with Bilateral Cochlear Implants
Results: The main areas modified based on participatory feedback were the variety of immersive scenarios, to cover a range of ages and interests; the number of levels of complexity, to ensure small improvements were measured; the feedback and reward schemes, to ensure positive reinforcement; and specific provision for participants with balance issues, who had difficulties when using head-mounted displays. The effectiveness of the finalised BEARS suite will be evaluated in a large-scale clinical trial. We have added additional login options for other members of the family and, based on patient feedback, improved the accompanying reward schemes. Conclusions: Through participatory design we have developed a training package (BEARS) for young people with bilateral cochlear implants. The training games are appropriate for use by the study population and should ultimately lead to patients taking control of their own management, reducing reliance upon outpatient-based rehabilitation programmes. Virtual reality training provides a more relevant and engaging approach to rehabilitation for young people.
Learning in/about/from the basal ganglia
The basal ganglia are a collection of brain areas that are connected by a variety of synaptic pathways and are a site of significant reward-related dopamine release. These properties suggest a possible role for the basal ganglia in action selection, guided by reinforcement learning. In this talk, I will discuss a framework for how this function might be performed and computational results using an upward mapping to identify putative low-dimensional control ensembles that may be involved in tuning decision policy. I will also present some recent experimental results and theory, related to effects of extracellular ion dynamics, that run counter to the classical view of basal ganglia pathways and suggest a new interpretation of certain aspects of this framework. For those not so interested in the basal ganglia, I hope that the upward mapping approach and the impact of extracellular ion dynamics will nonetheless be of interest!
Dissecting the role of accumbal D1 and D2 medium spiny neurons in information encoding
Nearly all motivated behaviors require the ability to associate outcomes with specific actions and make adaptive decisions about future behavior. The nucleus accumbens (NAc) is integrally involved in these processes. The NAc is a heterogeneous population primarily composed of D1 and D2 medium spiny projection neurons (MSNs) that are thought to have opposing roles in behavior, with D1 MSNs promoting reward and D2 MSNs promoting aversion. Here we examined what types of information are encoded by D1 and D2 MSNs using optogenetics, fiber photometry, and cellular-resolution calcium imaging. First, we showed that mice responded for optical self-stimulation of both cell types, suggesting D2-MSN activation is not inherently aversive. Next, we recorded population and single-cell activity patterns of D1 and D2 MSNs during reinforcement as well as Pavlovian learning paradigms that allow dissociation of stimulus value, outcome, cue learning, and action. We demonstrated that D1 MSNs respond to the presence and intensity of unconditioned stimuli, regardless of value. Conversely, D2 MSNs responded to the prediction of these outcomes during specific cues. Overall, these results provide foundational evidence for the discrete aspects of information that are encoded within the NAc D1 and D2 MSN populations. These results will significantly enhance our understanding of the involvement of NAc MSNs in learning and memory as well as how these neurons contribute to the development and maintenance of substance use disorders.
NaV Long-term Inactivation Regulates Adaptation in Place Cells and Depolarization Block in Dopamine Neurons
In behaving rodents, CA1 pyramidal neurons receive spatially-tuned depolarizing synaptic input while traversing a specific location within an environment, called the cell's place field. Midbrain dopamine neurons participate in reinforcement learning, and bursts of action potentials riding a depolarizing wave of synaptic input signal rewards and reward expectation. Interestingly, slice electrophysiology in vitro shows that both types of cells exhibit a pronounced reduction in firing rate (adaptation) and even cessation of firing during sustained depolarization. We included a five-state Markov model of NaV1.6 (for CA1) and NaV1.2 (for dopamine neurons), respectively, in computational models of these two types of neurons. Our simulations suggest that long-term inactivation of this channel is responsible for the adaptation in CA1 pyramidal neurons in response to triangular depolarizing current ramps. We also show that the differential contribution of slow inactivation in two subpopulations of midbrain dopamine neurons can account for their different dynamic ranges, as assessed by their responses to similar depolarizing ramps. These results suggest that long-term inactivation of the sodium channel is a general mechanism for adaptation.
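A minimal flavor of the modeling ingredient, with invented states and rate constants rather than the fitted NaV1.6/NaV1.2 parameters: a five-state channel (two closed states, open, fast-inactivated, and long-term inactivated) integrated under a depolarizing ramp, where slow recovery from the long-term state depletes the available channel pool and yields adaptation.

```python
import numpy as np

def rates(v):
    """Transition rates k[i, j] from state i to j (1/ms), states ordered
    C1, C2, O, If, Is. Values are illustrative, not fitted."""
    k = np.zeros((5, 5))
    k[0, 1] = 2.0 * np.exp(v / 25.0)   # activation speeds up with voltage
    k[1, 0] = 0.5
    k[1, 2] = 3.0 * np.exp(v / 25.0)
    k[2, 1] = 0.4
    k[2, 3] = 1.5                      # O -> If: fast inactivation
    k[3, 2] = 0.1
    k[3, 4] = 0.05                     # If -> Is: long-term inactivation
    k[4, 3] = 0.005                    # Is -> If: very slow recovery
    return k

p = np.array([1.0, 0.0, 0.0, 0.0, 0.0])     # all channels closed at rest
dt, t_end = 0.01, 200.0                     # ms
for step in range(int(t_end / dt)):
    v = -60.0 + 60.0 * (step * dt / t_end)  # depolarizing voltage ramp
    k = rates(v)
    p += dt * (p @ k - p * k.sum(axis=1))   # master equation, Euler step

print(np.round(p, 3))   # occupancy piles up in Is: fewer channels available
```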
Input and target-selective plasticity in sensory neocortex during learning
Behavioral experience shapes neural circuits, adding and subtracting connections between neurons that will ultimately control sensation and perception. We are using natural sensory experience to uncover basic principles of information processing in the cerebral cortex, with a focus on how sensory learning can selectively alter synaptic strength. I will discuss recent findings that differentiate reinforcement learning from sensory experience, showing rapid and selective plasticity of thalamic and inhibitory synapses within primary sensory cortex.
Why would we need Cognitive Science to develop better Collaborative Robots and AI Systems?
While classical industrial robots are mostly designed for repetitive tasks, assistive robots will be challenged by a variety of different tasks in close contact with humans. Here, learning through direct interaction with humans provides a potentially powerful tool for an assistive robot to acquire new skills and to incorporate prior human knowledge during the exploration of novel tasks. Moreover, an intuitive interactive teaching process may allow non-programming experts to contribute to robotic skill learning and may help to increase acceptance of robotic systems in shared workspaces and everyday life. In this talk, I will discuss my recent research on interactive robot skill learning and the remaining challenges on the route to human-centered teaching of assistive robots. In particular, I will also discuss potential connections and overlap with cognitive science. The presented work covers learning a library of probabilistic movement primitives from human demonstrations, intention-aware adaptation of learned skills in shared workspaces, and multi-channel interactive reinforcement learning for sequential tasks.
Astrocytes and oxytocin interaction regulates amygdala neuronal network activity and related behaviors
Oxytocin orchestrates social and emotional behaviors through modulation of neural circuits in brain structures such as the central amygdala (CeA). In this structure, the release of oxytocin modulates inhibitory circuits and subsequently suppresses fear responses and decreases anxiety levels. Using astrocyte-specific gain and loss of function approaches and pharmacology, we demonstrate that oxytocin signaling in the central amygdala relies on a subpopulation of astrocytes that represent a prerequisite for proper function of CeA circuits and adequate behavioral responses, both in rats and mice. Our work identifies astrocytes as crucial cellular intermediaries of oxytocinergic modulation in emotional behaviors related to anxiety or positive reinforcement. To our knowledge, this is the first demonstration of a direct role of astrocytes in oxytocin signaling and challenges the long-held dogma that oxytocin signaling occurs exclusively via direct action on neurons in the central nervous system.
Mice identify subgoal locations through an action-driven mapping process
Mammals instinctively explore and form mental maps of their spatial environments. Models of cognitive mapping in neuroscience mostly depict map-learning as a process of random or biased diffusion. In practice, however, animals explore spaces using structured, purposeful, sensory-guided actions. We have used threat-evoked escape behavior in mice to probe the relationship between ethological exploratory behavior and abstract spatial cognition. First, we show that in arenas with obstacles and a shelter, mice spontaneously learn efficient multi-step escape routes by memorizing allocentric subgoal locations. Using closed-loop neural manipulations to interrupt running movements during exploration, we next found that blocking runs targeting an obstacle edge abolished subgoal learning. We conclude that mice use an action-driven learning process to identify subgoals, and these subgoals are then integrated into an allocentric map-like representation. We suggest a conceptual framework for spatial learning that is compatible with the successor representation from reinforcement learning and sensorimotor enactivism from cognitive science.
NMC4 Short Talk: What can deep reinforcement learning tell us about human motor learning and vice versa?
In the deep reinforcement learning (RL) community, motor control problems are usually approached from a reward-based learning perspective. However, humans are often believed to learn motor control through directed, error-based learning. In this setting, the control system is assumed to have access to exact error signals and their gradients with respect to the control signal. This is unlike reward-based learning, in which errors are assumed to be unsigned, encoding relative successes and failures. Here, we try to understand the relation between these two approaches, reward- and error-based learning, in the context of ballistic arm reaches. To do so, we test canonical (deep) RL algorithms on a well-known sensorimotor perturbation in neuroscience: mirror-reversal of visual feedback during arm reaching. This test leads us to propose a potentially novel RL algorithm, denoted model-based deterministic policy gradient (MB-DPG). This algorithm draws inspiration from error-based learning to qualitatively reproduce human reaching performance under mirror-reversal. Next, we show that MB-DPG outperforms the other canonical (deep) RL algorithms on single- and multi-target ballistic reaching tasks based on a biomechanical model of the human arm. Finally, we propose that MB-DPG may provide an efficient computational framework to help explain error-based learning in neuroscience.
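A toy one-step version of the error-based idea we take MB-DPG to embody (our sketch, not the authors' implementation): a deterministic policy maps targets to motor commands, and the signed endpoint error is backpropagated through a learned forward model rather than the true dynamics, so the agent re-adapts when visual feedback is mirror-reversed.

```python
import numpy as np

rng = np.random.default_rng(0)
A_true = np.eye(2)              # true command -> endpoint map ("the arm")
A_hat = 0.5 * np.eye(2)         # learned forward model
W = np.zeros((2, 2))            # deterministic linear policy: a = W @ target
eta_model, eta_policy = 0.05, 0.05

for trial in range(4000):
    if trial == 2000:
        A_true = np.diag([-1.0, 1.0])          # mirror-reversal of x feedback
    target = rng.normal(size=2)
    a = W @ target                             # deterministic action
    x = A_true @ a                             # observed endpoint
    A_hat += eta_model * np.outer(x - A_hat @ a, a)    # fit the forward model
    err = A_hat @ a - target                   # signed, model-predicted error
    W -= eta_policy * np.outer(A_hat.T @ err, target)  # gradient through model

print(np.round(W, 2))   # ~inverts the mirrored map after re-adaptation
```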
Network dynamics in the basal ganglia and possible implications for Parkinson’s disease
The basal ganglia are a collection of brain areas that are connected by a variety of synaptic pathways and are a site of significant reward-related dopamine release. These properties suggest a possible role for the basal ganglia in action selection, guided by reinforcement learning. In this talk, I will discuss a framework for how this function might be performed. I will also present some recent experimental results and theory that call for a re-evaluation of certain aspects of this framework. Next, I will turn to the changes in basal ganglia activity observed to occur with the dopamine depletion associated with Parkinson’s disease. I will discuss some of the potential functional implications of some of these changes and, if time permits, will conclude with some new results that focus on delta oscillations under dopamine depletion.
Higher cognitive resources for efficient learning
A central issue in reinforcement learning (RL) is the ‘curse of dimensionality’, arising when the degrees of freedom far exceed the number of training samples. In such circumstances, the learning process becomes too slow to be plausible. In the brain, higher cognitive functions (such as abstraction or metacognition) may be part of the solution by generating low-dimensional representations on which RL can operate. In this talk I will discuss a series of studies in which we used functional magnetic resonance imaging (fMRI) and computational modeling to investigate the neuro-computational basis of efficient RL. We found that people can learn remarkably complex task structures non-consciously, but also that, intriguingly, metacognition appears tightly coupled to this learning ability. Furthermore, when people use an explicit (conscious) policy to select relevant information, learning is accelerated by abstractions. At the neural level, prefrontal cortex subregions are differentially involved in separate aspects of learning: dorsolateral prefrontal cortex pairs with metacognitive processes, while ventromedial prefrontal cortex pairs with valuation and abstraction. I will discuss the implications of these findings, in particular new questions on the function of metacognition in adaptive behavior and its link with abstraction.
A reward-learning framework of knowledge acquisition
Recent years have seen a considerable surge of research on interest-based engagement, examining how and why people engage in activities without relying on extrinsic rewards. However, the field of inquiry has been somewhat segregated into three research traditions that have developed relatively independently: research on curiosity, on interest, and on trait curiosity/interest. The current talk sets out an integrative perspective: the reward-learning framework of knowledge acquisition. This conceptual framework takes on the basic premise of existing reward-learning models of information seeking: that knowledge acquisition serves as an inherent reward, which reinforces people’s information-seeking behavior through a reward-learning process. However, the framework reveals how the knowledge-acquisition process is sustained and boosted over long periods of time in real-life settings, allowing us to integrate the different research traditions within reward-learning models. The framework also characterizes the knowledge-acquisition process with four distinct features that are not present in reward learning with extrinsic rewards: (1) cumulativeness, (2) selectivity, (3) vulnerability, and (4) under-appreciation. The talk describes evidence from our lab supporting these claims.
Transforming task representations
Humans can adapt to a novel task on their first try. By contrast, artificial intelligence systems often require immense amounts of data to adapt. In this talk, I will discuss my recent work (https://www.pnas.org/content/117/52/32970) on creating deep learning systems that can adapt on their first try by exploiting relationships between tasks. Specifically, the approach is based on transforming a representation for a known task to produce a representation for the novel task, by inferring and then using a higher-order function that captures a relationship between the tasks. This approach can be interpreted as a type of analogical reasoning. I will show that task transformation can allow systems to adapt to novel tasks on their first try in domains ranging from card games, to mathematical objects, to image classification and reinforcement learning. I will discuss the analogical interpretation of this approach, an analogy between levels of abstraction within the model architecture that I refer to as homoiconicity, and what this work might suggest about using deep-learning models to infer analogies more generally.
On cognitive maps and reinforcement learning in large-scale animal behaviour
Bats are extreme aviators and amazing navigators. Many bat species nightly commute dozens of kilometres in search of food, and some bat species annually migrate over thousands of kilometres. Studying bats in their natural environment has always been extremely challenging because of their small size (mostly <50 g) and agile nature. We have recently developed novel miniature technology allowing us to GPS-tag small bats, thus opening a new window to document their behaviour in the wild. We have used this technology to track fruit-bat pups over 5 months from birth to adulthood. Following the bats’ full movement history allowed us to show that they use novel short-cuts, which are typical of cognitive-map-based navigation. In a second study, we examined how nectar-feeding bats make foraging decisions under competition. We show that by relying on a simple reinforcement learning strategy, the bats can divide the resource between them without aggression or communication. Together, these results demonstrate the power of the large-scale natural approach for studying animal behavior.
From function to cognition: New spectroscopic tools for studying brain neurochemistry in-vivo
In this seminar, I will present new methods in magnetic resonance spectroscopy (MRS) we’ve been working on in the lab. The talk will be divided into two parts. In the first, I will talk about neurochemical changes we observe in glutamate and GABA during various paradigms, including simple motor tasks and reinforcement learning. In the second part, I’ll present a new approach to MRS that focuses on measuring the relaxation times (T1, T2) of metabolites, which reflect changes to specific cellular microenvironments. I will explain why these can be exciting markers for studying several in-vivo pathologies, and also present some preliminary data from a cohort of mild cognitive impairment (MCI) patients, showing changes that correlate with cognitive decline.
The structure of behavior entrained to long intervals
Interpretation of interval timing data generated from animal models is complicated by ostensible motivational effects which arise from the delay-to-reward imposed by interval timing tasks, as well as overlap between timed and non-timed responses. These factors become increasingly prevalent at longer intervals. To address these concerns, two adjustments to long interval timing tasks are proposed. First, subjects should be afforded reinforced non-timing behaviors concurrent with timing. Second, subjects should initiate the onset of timed stimuli. Under these conditions, interference by extraneous behavior would be detected in the rate of concurrent non-timing behaviors, and changes in motivation would be detected in the rate at which timed stimuli are initiated. In a task with these characteristics, rats initiated a concurrent fixed-interval (FI) random-ratio (RR) schedule of reinforcement. This design facilitated response-initiated timing behavior, even at increasingly long delays. Pre-feeding manipulations revealed an effect on the number of initiated trials, but not on the timing peak function.
Learning in pain: probabilistic inference and (mal)adaptive control
Pain is a major clinical problem affecting 1 in 5 people in the world. There are unresolved questions that urgently require answers to treat pain effectively, a crucial one being how the feeling of pain arises from brain activity. Computational models of pain consider how the brain processes noxious information and allow mapping neural circuits and networks to cognition and behaviour. To date, they have generally assumed two largely independent processes: perceptual and/or predictive inference, typically modelled as an approximate Bayesian process, and action control, typically modelled as a reinforcement learning process. However, inference and control are intertwined in complex ways, challenging the clarity of this distinction. I will discuss how they may comprise a parallel hierarchical architecture that combines pain inference, information-seeking, and adaptive value-based control. Finally, I will discuss whether and how these learning processes might contribute to chronic pain.
Mental Simulation, Imagination, and Model-Based Deep RL
Mental simulation—the capacity to imagine what will or what could be—is a salient feature of human cognition, playing a key role in a wide range of cognitive abilities. In artificial intelligence, the last few years have seen the development of methods which are analogous to mental models and mental simulation. In this talk, I will discuss recent methods in deep learning for constructing such models from data and learning to use them via reinforcement learning, and compare such approaches to human mental simulation. While a number of challenges remain in matching the capacity of human mental simulation, I will highlight some recent progress on developing more compositional and efficient model-based algorithms through the use of graph neural networks and tree search.
Choice engineering and the modeling of operant learning
Organisms modify their behavior in response to its consequences, a phenomenon referred to as operant learning. Contemporary modeling of this learning behavior is based on reinforcement learning algorithms. I will discuss some of the challenges that these models face, and propose a new approach to model selection that is based on testing models' ability to engineer behavior. Finally, I will present the results of the Choice Engineering Competition, an academic competition that compared the efficacies of qualitative and quantitative models of operant learning in shaping behavior.
Peril, Prudence and Planning as Risk, Avoidance and Worry
Risk occupies a central role in both the theory and practice of decision-making. Although it is deeply implicated in many conditions involving dysfunctional behavior and thought, modern theoretical approaches to understanding and mitigating risk in either one-shot or sequential settings, which are derived largely from finance and economics, have yet to permeate fully the fields of neural reinforcement learning and computational psychiatry. I will discuss the use of dynamic and static versions of one prominent approach, namely conditional value-at-risk, to examine both the nature of risk avoidant choices, encompassing such things as justified gambler's fallacies, and the optimal planning that can lead to consideration of such choices, with implications for offline, ruminative, thinking.
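For readers unfamiliar with the measure, the static version is easy to state: conditional value-at-risk at level alpha is the expected return over the worst alpha-fraction of outcomes. A minimal empirical sketch (with invented return distributions):

```python
import numpy as np

def cvar(returns, alpha=0.1):
    """Mean of the worst alpha-fraction of outcomes (lower return = worse)."""
    returns = np.sort(np.asarray(returns))
    k = max(1, int(np.ceil(alpha * len(returns))))
    return returns[:k].mean()

rng = np.random.default_rng(0)
safe = rng.normal(1.0, 0.1, 10_000)    # modest mean, low spread
risky = rng.normal(1.2, 2.0, 10_000)   # higher mean, heavy spread
print(cvar(safe), cvar(risky))         # a CVaR-sensitive agent prefers 'safe'
```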
Navigation Turing Test: Toward Human-like RL
tbc
A machine learning way to analyse white matter tractography streamlines / Application of artificial intelligence in correcting motion artifacts and reducing scan time in MRI
1. Embedding is all you need: A machine learning way to analyse white matter tractography streamlines - Dr Shenjun Zhong, Monash Biomedical Imaging. Embedding white matter streamlines of various lengths into fixed-length latent vectors enables users to analyse them with general data mining techniques. However, finding a good embedding scheme is still a challenging task, as existing methods based on spatial coordinates rely on manually engineered features and/or labelled datasets. In this webinar, Dr Shenjun Zhong will discuss his novel deep learning model that identifies a latent space and solves the problem of streamline clustering without needing labelled data. Dr Zhong is a Research Fellow and Informatics Officer at Monash Biomedical Imaging. His research interests are sequence modelling, reinforcement learning and federated learning in the general medical imaging domain. 2. Application of artificial intelligence in correcting motion artifacts and reducing scan time in MRI - Dr Kamlesh Pawar, Monash Biomedical Imaging. Magnetic Resonance Imaging (MRI) is a widely used imaging modality in clinics and research. Although MRI is useful, it comes with the overhead of longer scan times compared to other medical imaging modalities. The longer scan times also make patients uncomfortable, and even subtle movements during the scan may result in severe motion artifacts in the images. In this seminar, Dr Kamlesh Pawar will discuss how artificial intelligence techniques can reduce scan time and correct motion artifacts. Dr Pawar is a Research Fellow at Monash Biomedical Imaging. His research interests include deep learning, MR physics, MR image reconstruction and computer vision.
Uncertainty in learning and decision making
Uncertainty plays a critical role in reinforcement learning and decision making. However, exactly how subjective uncertainty influences behaviour remains unclear. Multi-armed bandits are a useful framework to gain more insight into this. Paired with computational tools such as Kalman filters, they allow us to closely characterize the interplay between trial-by-trial value, uncertainty, learning, and choice. In this talk, I will present recent research in which we also measured participants' visual fixations on the options in a multi-armed bandit task. The estimated value of each option, and the uncertainty in these estimates, influenced what subjects looked at in the period before making a choice and their subsequent choice, as did fixation itself. Uncertainty also determined how long participants looked at the obtained outcomes. Our findings clearly show the importance of uncertainty in learning and decision making.
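For reference, the Kalman-filter value model behind such analyses reduces to an uncertainty-weighted delta rule for each arm; the gain both sets the learning rate and quantifies the subjective uncertainty that can be related to fixations. Noise parameters below are illustrative.

```python
import numpy as np

q_drift, r_noise = 0.01, 1.0    # assumed volatility and observation noise

def kalman_update(mu, var, reward):
    var += q_drift                     # the arm's value may have drifted
    gain = var / (var + r_noise)       # uncertainty-weighted learning rate
    mu += gain * (reward - mu)         # prediction-error update
    var *= 1.0 - gain                  # observing the outcome reduces uncertainty
    return mu, var

rng = np.random.default_rng(0)
mu, var = 0.0, 1.0                     # prior over one arm's value
for _ in range(100):
    mu, var = kalman_update(mu, var, rng.normal(0.7, 1.0))
print(round(mu, 2), round(var, 3))
```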
An inference perspective on meta-learning
While meta-learning algorithms are often viewed as algorithms that learn to learn, an alternative viewpoint frames meta-learning as inferring a hidden task variable from experience consisting of observations and rewards. From this perspective, learning to learn is learning to infer. This viewpoint can be useful in solving problems in meta-RL, which I’ll demonstrate through two examples: (1) enabling off-policy meta-learning, and (2) performing efficient meta-RL from image observations. I’ll also discuss how this perspective leads to an algorithm for few-shot image segmentation.
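In its simplest form, learning-to-infer looks like posterior sampling on a bandit: the hidden task variable is each arm's reward rate, inference is conjugate Bayesian updating, and acting on a sampled hypothesis (Thompson sampling) turns inference directly into a policy. This toy sketch is ours, not the talk's meta-RL architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
task = np.array([0.3, 0.7])        # hidden task variable: Bernoulli reward rates
a_post = np.ones(2)                # Beta posterior parameters per arm
b_post = np.ones(2)

for t in range(500):
    hypothesis = rng.beta(a_post, b_post)   # sample a task from the posterior
    arm = int(np.argmax(hypothesis))        # act optimally for that hypothesis
    r = float(rng.random() < task[arm])
    a_post[arm] += r                        # conjugate posterior update
    b_post[arm] += 1.0 - r

print(a_post / (a_post + b_post))           # inferred reward rates per arm
```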
On climate change, multi-agent systems and the behaviour of networked control
Multi-agent reinforcement learning (MARL) has recently shown great promise as an approach to networked system control. Arguably, one of the most difficult and important tasks for which large-scale networked system control is applicable is common-pool resource (CPR) management. Crucial CPRs include arable land, fresh water, wetlands, wildlife, fish stock, forests and the atmosphere, whose proper management relates to some of society’s greatest challenges, such as food security, inequality and climate change. This talk will consist of three parts. In the first, we will briefly look at climate change and how it poses a significant threat to life on our planet. In the second, we will consider the potential of multi-agent systems for climate change mitigation and adaptation. And finally, in the third, we will discuss recent research from InstaDeep into better understanding the behaviour of networked MARL systems used for CPR management. More specifically, we will see how the tools of empirical game-theoretic analysis may be harnessed to analyse the differences between networked MARL systems. The results give new insights into the consequences associated with certain design choices and provide an additional dimension of comparison between systems beyond efficiency, robustness, scalability and mean control performance.
A journey through connectomics: from manual tracing to the first fully automated basal ganglia connectomes
The "mind of the worm", the first electron microscopy-based connectome of C. elegans, was an early sign of where connectomics is headed, followed by a long time of little progress in a field held back by the immense manual effort required for data acquisition and analysis. This changed over the last few years with several technological breakthroughs, which allowed increases in data set sizes by several orders of magnitude. Brain tissue can now be imaged in 3D up to a millimeter in size at nanometer resolution, revealing tissue features from synapses to the mitochondria of all contained cells. These breakthroughs in acquisition technology were paralleled by a revolution in deep-learning segmentation techniques, that equally reduced manual analysis times by several orders of magnitude, to the point where fully automated reconstructions are becoming useful. Taken together, this gives neuroscientists now access to the first wiring diagrams of thousands of automatically reconstructed neurons connected by millions of synapses, just one line of program code away. In this talk, I will cover these developments by describing the past few years' technological breakthroughs and discuss remaining challenges. Finally, I will show the potential of automated connectomics for neuroscience by demonstrating how hypotheses in reinforcement learning can now be tackled through virtual experiments in synaptic wiring diagrams of the songbird basal ganglia.
The geometry of abstraction in hippocampus and pre-frontal cortex
The curse of dimensionality plagues models of reinforcement learning and decision-making. The process of abstraction solves this by constructing abstract variables describing features shared by different specific instances, reducing dimensionality and enabling generalization in novel situations. Here we characterized neural representations in monkeys performing a task where a hidden variable described the temporal statistics of stimulus-response-outcome mappings. Abstraction was defined operationally using the generalization performance of neural decoders across task conditions not used for training. This type of generalization requires a particular geometric format of neural representations. Neural ensembles in dorsolateral pre-frontal cortex, anterior cingulate cortex and hippocampus, and in simulated neural networks, simultaneously represented multiple hidden and explicit variables in a format reflecting abstraction. Task events engaging cognitive operations modulated this format. These findings elucidate how the brain and artificial systems represent abstract variables, variables critical for generalization that in turn confers cognitive flexibility.
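The decoding logic can be sketched in a few lines on synthetic data (our illustration, with a factorized two-variable code): train a decoder for one variable on a subset of conditions and test it on conditions never used for training. High held-out accuracy requires the abstract geometry described above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 200, 50
axis_a, axis_b = rng.normal(size=d), rng.normal(size=d)

def population(a, b):
    """Noisy population response for condition (a, b) with a factorized code."""
    return a * axis_a + b * axis_b + 0.5 * rng.normal(size=(n, d))

# Decode variable A, training only on conditions where B = -1 ...
X_train = np.vstack([population(-1, -1), population(+1, -1)])
y_train = np.repeat([0, 1], n)
# ... and testing on held-out conditions where B = +1.
X_test = np.vstack([population(-1, +1), population(+1, +1)])
y_test = np.repeat([0, 1], n)

decoder = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(decoder.score(X_test, y_test))   # ~1.0 only for an abstract geometry
```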
Mechanisms of Perceptual Learning
Perceptual learning (PL) is defined as long-term performance improvement on a perceptual task as a result of perceptual experience (Sasaki, Nanez & Watanabe, Nat Rev Neurosci, 2011). We first found that PL occurs for task-irrelevant and subthreshold features and that pairing task-irrelevant features with rewards is the key to forming task-irrelevant PL (TIPL) (Watanabe, Nanez & Sasaki, Nature, 2001; Watanabe et al, Nature Neuroscience, 2002; Seitz & Watanabe, Nature, 2003; Seitz, Kim & Watanabe, Neuron, 2009; Shibata et al, Science, 2011). These results suggest that PL occurs as a result of interactions between reinforcement and bottom-up stimulus signals (Seitz & Watanabe, TICS, 2005). On the other hand, fMRI results indicate that lateral prefrontal cortex fails to detect, and thus to suppress, subthreshold task-irrelevant signals. This leads to the paradoxical effect that a signal that is below, but close to, one’s discrimination threshold ends up being stronger than suprathreshold signals (Tsushima, Sasaki & Watanabe, Science, 2006). We confirmed this mechanism with the following results: task-irrelevant learning occurs only when a presented feature is below and close to the threshold in younger individuals (Tsushima et al, Current Biol, 2009), whereas in older individuals, who tend to have less inhibitory control, task-irrelevant learning occurs with a feature whose signal is much greater than the threshold (Chang et al, Current Biol, 2014). From all of these results, we conclude that attention and reward play important but different roles in PL. I will further discuss different stages and phases in mechanisms of PL (Seitz et al, PNAS, 2005; Yotsumoto, Watanabe & Sasaki, Neuron, 2008; Yotsumoto et al, Curr Biol, 2009; Watanabe & Sasaki, Ann Rev Psychol, 2015; Shibata et al, Nat Neurosci, 2017; Tamaki et al, Nat Neurosci, 2020).
E-prop: A biologically inspired paradigm for learning in recurrent networks of spiking neurons
Transformative advances in deep learning, such as deep reinforcement learning, usually rely on gradient-based learning methods such as backpropagation through time (BPTT) as a core learning algorithm. However, BPTT is generally not considered biologically plausible, since it requires propagating gradients backwards in time and across neurons. Here, we propose e-prop, a novel gradient-based learning method with local and online weight update rules for recurrent neural networks, and in particular recurrent spiking neural networks (RSNNs). As a result, e-prop has the potential to provide a substantial fraction of the power of deep learning to RSNNs. In this presentation, we will motivate e-prop from the perspective of recent insights in neuroscience and show how these can be combined to form an algorithm for online gradient descent. The mathematical results will be supported by empirical evidence in supervised and reinforcement learning tasks. We will also discuss how limitations inherited from gradient-based learning methods, such as sample efficiency, can be addressed by considering an evolution-like optimization that enhances learning on particular task families. The emerging learning architecture can be used to learn tasks from a single demonstration, hence enabling one-shot learning.
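The essential factorization in e-prop is weight change = (online learning signal) x (local eligibility trace). A stripped-down version for leaky rate units rather than spiking neurons (our simplification, with an arbitrary regression target) keeps that structure while avoiding any backward pass through time:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_rec = 5, 10
W_in = 0.1 * rng.normal(size=(n_rec, n_in))
w_out = 0.1 * rng.normal(size=n_rec)
leak, eta = 0.8, 0.01

h = np.zeros(n_rec)               # leaky unit states
trace = np.zeros((n_rec, n_in))   # one local eligibility trace per synapse
for t in range(5000):
    x = rng.normal(size=n_in)
    target = x[:3].sum()          # arbitrary online regression target
    u = W_in @ x
    h = leak * h + np.tanh(u)
    # the trace follows the same leaky recursion as the unit itself
    trace = leak * trace + np.outer(1.0 - np.tanh(u) ** 2, x)
    err = w_out @ h - target
    learning_signal = err * w_out                    # per-neuron, available online
    W_in -= eta * learning_signal[:, None] * trace   # signal x trace update
    w_out -= eta * err * h
print(round(float(err ** 2), 4))  # squared error on the final step
```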
Working memory transforms goals into rewards
Humans continuously need to learn to make good choices, be it using a new video-conferencing setup, figuring out what questions to ask to successfully secure a reliable babysitter, or just selecting which location in a house is least likely to be interrupted by toddlers during work calls. However, the goals we seek to attain, such as using Zoom successfully, are often vaguely defined and previously unexperienced, and in that sense cannot be known by us as being rewarding. We hypothesized that learning to make good choices in such situations nevertheless leverages reinforcement learning processes, and that executive functions in general, and working memory in particular, play a crucial role in defining the reward function for arbitrary outcomes in such a way that they become reinforcing. I will show results from a novel behavioral protocol, as well as preliminary computational and imaging evidence supporting our hypothesis.
Reward foraging task and model-based analysis reveal how fruit flies learn the value of available options
Understanding what drives foraging decisions in animals requires careful manipulation of the value of available options while monitoring animal choices. Value-based decision-making tasks, in combination with formal learning models, have provided both an experimental and a theoretical framework to study foraging decisions in lab settings. While these approaches were successfully used in the past to understand what drives choices in mammals, very little work has been done on fruit flies, even though fruit flies have served as a model organism for many complex behavioural paradigms. To fill this gap, we developed a single-animal, trial-based decision-making task in which freely walking flies experienced optogenetic sugar-receptor neuron stimulation. We controlled the value of available options by manipulating the probabilities of optogenetic stimulation. We show that flies integrate the reward history of chosen options and forget the value of unchosen options. We further discover that flies assign higher values to rewards experienced early in the behavioural session, consistent with formal reinforcement learning models. Finally, we show that probabilistic rewards affect the walking trajectories of flies, suggesting that accumulated value controls the navigation vector of the flies in a graded fashion. These findings establish the fruit fly as a model organism to explore the genetic and circuit basis of value-based decisions.
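The learning rule these findings point to can be written compactly; the sketch below (our parameters, not fitted values) updates the chosen option from reward history while the unchosen option's value decays, i.e. is forgotten:

```python
import numpy as np

rng = np.random.default_rng(0)
v = np.zeros(2)                       # values of the two options
alpha, forget, beta = 0.3, 0.1, 3.0   # illustrative parameters
p_stim = np.array([0.8, 0.2])         # optogenetic stimulation probabilities

for trial in range(300):
    p_choice = np.exp(beta * v) / np.exp(beta * v).sum()
    c = rng.choice(2, p=p_choice)
    r = float(rng.random() < p_stim[c])
    v[c] += alpha * (r - v[c])        # learn the chosen option's value
    v[1 - c] *= 1.0 - forget          # forget the unchosen option's value
print(np.round(v, 2))
```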