Authors & Affiliations
Henrique Granado, Akhil John, Reza Javanmard, John Van Opstal, Alexandre Bernardino
Abstract
Saccades are fast eye movements that change gaze direction in a ballistic fashion. Human saccadic behaviour has been studied extensively in neuroscience, and several methods have been proposed to model the saccadic system. For example, John et al. (2021) adopted a model-based approach, built on a physical model of the eye plant (eyeball, extraocular muscles, and surrounding tissues), and used feedforward optimal-control principles to replicate human saccadic behaviour. Here, we adopted a model-free approach to study the question: “How to learn saccadic behaviours without prior knowledge of the eye plant?”. We addressed saccadic control as a reinforcement learning (RL) problem (Sutton & Barto, 2018). When an agent drives the eyes to a desired gaze direction, it receives a reward signal that accounts for the accuracy of target acquisition (tracking errors are penalized), the saccade duration (the shorter the better), the total movement energy (low energy is better), and overshoot in the response (exceeding the target angle is penalized). We trained an agent to perform saccades with the soft actor-critic (SAC) algorithm (Haarnoja et al., 2019), which maximizes the expected reward over time while promoting exploratory behaviour. The agent is composed of an actor network that learns the command driving the eye from the initial to the desired orientation, and a critic network that learns to predict the reward of a given command. Both networks interact during learning: the actor network learns to maximize the output of the critic network together with the entropy of the command. We validated this approach in a computational simulation of a robotic eye performing horizontal saccades driven by pulse-step inputs. Results show that the pulse-step parameters leading to saccadic behaviours are compatible with human performance and can be learned with high precision within a few tens of thousands of iterations.
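
To make the reward structure and the pulse-step parameterization concrete, the sketch below shows one possible implementation in Python. The weights, the first-order eye plant, and the function names (pulse_step_command, saccade_reward) are hypothetical illustrations and are not the model or values used in the study; the sketch only assumes a single horizontal eye angle driven by a brief pulse followed by a sustained step.

import numpy as np

def pulse_step_command(t, pulse_height, pulse_width, step_height):
    # Hypothetical pulse-step innervation: a brief, strong pulse that
    # accelerates the eye, followed by a sustained step that holds it
    # at the new orientation.
    return np.where(t < pulse_width, pulse_height, step_height)

def saccade_reward(theta, target, command, dt,
                   w_err=1.0, w_dur=0.1, w_energy=0.01, w_over=1.0):
    # Scalar reward combining the four terms named in the abstract:
    # tracking error, saccade duration, movement energy, and overshoot.
    # Weights are illustrative, not the values used in the study.
    error = np.abs(theta - target)
    on_target = error < 0.5                             # within 0.5 deg of the target
    duration = dt * (np.argmax(on_target) if on_target.any() else len(theta))
    energy = dt * np.sum(command ** 2)                  # total movement energy
    overshoot = np.maximum(theta - target, 0.0).max()   # exceeding the target angle
    return -(w_err * dt * error.sum() + w_dur * duration
             + w_energy * energy + w_over * overshoot)

# Example: a 10-degree rightward saccade on a simple first-order plant
# (hypothetical time constant; the real eye plant is more detailed).
dt = 0.001
t = np.arange(0.0, 0.2, dt)
u = pulse_step_command(t, pulse_height=55.0, pulse_width=0.03, step_height=10.0)
theta = np.zeros_like(t)
tau = 0.15
for k in range(1, len(t)):
    theta[k] = theta[k - 1] + dt * (u[k - 1] - theta[k - 1]) / tau

print(saccade_reward(theta, target=10.0, command=u, dt=dt))

In an RL setting such as the one described above, the actor would output the pulse-step parameters (or the command directly), and a scalar reward of this kind would be returned at the end of each simulated saccade.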