Authors & Affiliations
Peter Buttaroni, Friedemann Zenke
Abstract
Cognitive maps are abstract internal representations of the external world that facilitate planning and efficient navigation. How the brain learns such maps from variable sensory experience remains elusive. Previous models relied on discrete environments [1], linear embeddings of continuous states [2], or generative models trained by predicting sensory input [3]. However, the world is neither discrete nor linear, and the brain presumably does not learn a generative model of its sensory inputs. So, how can the brain learn a map and use it to plan strategically?
In this work, we develop a non-generative model based on a joint embedding predictive architecture (JEPA) that learns a map from variable input sequences by predicting future sensory representations in latent space, conditioned on the agent's actions (Fig. a). We test our model in a spatial environment where each grid location is associated with a quasi-continuous set of visual inputs corresponding to handwritten digits or characters (Fig. c). Every time the agent revisits the same area, the visual input changes to a different instance of the same digit or character (Fig. d). After observing many such sequences, the agent learns invariant sensory representations of each location. Specifically, the learned embeddings form a metric map of the environment and capture the effect of the agent's actions (Fig. e). Finally, we show that the agent, following an approach inspired by [2], can exploit the map's Euclidean geometry for long-term planning using only local information. Our work shows how an agent can build a robust internal model of a continuous external world based on latent-space prediction and exploit it to reach long-term goals through local planning.
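To make the two key ingredients concrete, the sketch below illustrates an action-conditioned latent predictor trained without reconstructing its sensory inputs, together with a greedy local planner that picks, at each step, the action whose predicted embedding lies closest to the goal embedding. It is a minimal illustration assuming a PyTorch-style implementation; the network sizes, the exponential-moving-average target encoder used to prevent collapse, and all names (Encoder, Predictor, jepa_loss, greedy_step) are illustrative assumptions rather than the exact architecture used here.

# Minimal sketch: action-conditioned JEPA training and greedy latent-space planning.
# All module names, sizes, and the PyTorch framing are illustrative assumptions.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Maps a visual observation (e.g., a flattened digit image) to a latent embedding."""
    def __init__(self, obs_dim=784, latent_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, obs):
        return self.net(obs)

class Predictor(nn.Module):
    """Predicts the next latent embedding from the current embedding and the chosen action."""
    def __init__(self, latent_dim=32, action_dim=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, z, action):
        return self.net(torch.cat([z, action], dim=-1))

encoder = Encoder()
predictor = Predictor()
# Slowly updated target encoder (one common way to avoid representational collapse).
target_encoder = copy.deepcopy(encoder)
for p in target_encoder.parameters():
    p.requires_grad_(False)

def jepa_loss(obs_t, action_t, obs_tp1):
    """Latent-space prediction loss: no pixels are reconstructed."""
    z_t = encoder(obs_t)
    z_pred = predictor(z_t, action_t)
    with torch.no_grad():
        z_target = target_encoder(obs_tp1)
    return F.mse_loss(z_pred, z_target)

@torch.no_grad()
def ema_update(tau=0.99):
    """Move the target encoder slowly toward the online encoder."""
    for p, tp in zip(encoder.parameters(), target_encoder.parameters()):
        tp.mul_(tau).add_((1 - tau) * p)

@torch.no_grad()
def greedy_step(obs, goal_obs, actions):
    """Local planning: pick the action whose predicted embedding is closest to the goal embedding."""
    z = encoder(obs)
    z_goal = encoder(goal_obs)
    dists = [torch.norm(predictor(z, a) - z_goal) for a in actions]
    return int(torch.stack(dists).argmin())

Because the learned embeddings form a metric map, the Euclidean distance to the goal embedding decreases along a path toward the goal, so repeatedly applying greedy_step with only one-step predictions can, in this sketch, reach distant goals without global search.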