Authors & Affiliations
Hafez Ghaemi, Shahab Bakhtiari, Eilif Muller
Abstract
Predicting sensory input responses to update an internal "world model" is a core principle in cognitive learning theories that seek to explain the mechanisms behind representation learning in the brain [1,2]. Recent machine learning methods have also adopted predictive world modeling for self-supervised representation learning [3,4,5,6]. In this work, we investigate the perceptual and visuospatial capabilities of a self-supervised predictive world model trained in a manner approximating the active nature of human visual perception. We sample a sequence of fixations from image saliency maps [7], extracted from a model pre-trained on human fixations [8]. Using these fixations, we create a sequence of low-resolution glances that are fed to a visual encoder. The encoded representations are concatenated with learnable egocentric "action" embeddings that specify the act of saccading, i.e., moving from one fixation to the next in the sequence. These concatenated representations are passed to an autoregressive memory-like module that produces an aggregated representation of them. This aggregated representation is concatenated with the egocentric action embedding of the next glance and used to predict its visual representation. Regarding perceptual capabilities, a linear probe trained on the aggregated representations achieves object recognition accuracy comparable to that of traditional self-supervised learning methods that induce invariance to hand-crafted image augmentations [5]. In terms of visuospatial capabilities, we show that, thanks to egocentric action conditioning, (1) the output representations of the visual encoder encode egocentric position information, (2) the allocentric location of a future fixation can be decoded from its visual representation together with the aggregated past glances, and (3) our autoregressive world model is capable of visual path integration given aggregated past action embeddings and the egocentric action of a future fixation, even when visual representations are ablated. Interestingly, when path integration drifts after a larger number of saccades, visual representations can partially correct its course, resembling a phenomenon observed in mice [9].
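
To make the architecture described above concrete, the following is a minimal sketch of a saccade-conditioned predictive world model. It is not the authors' implementation: the encoder, the GRU used as the "autoregressive memory-like module", the 2-D saccade vector as the egocentric action, the stop-gradient target, the cosine loss, and all layer sizes and names are illustrative assumptions layered on top of the abstract's description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlancePredictiveWorldModel(nn.Module):
    """Sketch of a predictive world model over fixation glances.

    Assumed components (not specified in the abstract): a small ConvNet
    encoder, a GRU aggregator, a 2-D saccade displacement as the egocentric
    action, and a cosine-similarity prediction loss.
    """

    def __init__(self, feat_dim=256, action_dim=32, hidden_dim=512):
        super().__init__()
        # Visual encoder for low-resolution glances around each fixation.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, feat_dim),
        )
        # Learnable egocentric "action" embedding of the saccade
        # (here: a linear map of the 2-D displacement between fixations).
        self.action_embed = nn.Linear(2, action_dim)
        # Autoregressive memory-like aggregator (GRU chosen as an assumption).
        self.aggregator = nn.GRU(feat_dim + action_dim, hidden_dim, batch_first=True)
        # Predictor: aggregated state + next action -> next glance representation.
        self.predictor = nn.Sequential(
            nn.Linear(hidden_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, feat_dim),
        )

    def forward(self, glances, saccades):
        # glances:  (B, T, 3, H, W) low-resolution crops at each fixation
        # saccades: (B, T, 2) egocentric displacement to each fixation
        B, T = glances.shape[:2]
        feats = self.encoder(glances.flatten(0, 1)).view(B, T, -1)     # (B, T, D)
        actions = self.action_embed(saccades)                          # (B, T, A)
        # Aggregate the action-conditioned glance representations over time.
        agg, _ = self.aggregator(torch.cat([feats, actions], dim=-1))  # (B, T, H)
        # Predict the representation of glance t+1 from the aggregate up to t
        # and the egocentric action leading to glance t+1.
        pred = self.predictor(torch.cat([agg[:, :-1], actions[:, 1:]], dim=-1))
        target = feats[:, 1:].detach()  # stop-gradient target (one common choice)
        loss = 1 - F.cosine_similarity(pred, target, dim=-1).mean()
        return loss, agg
```

In this sketch, the aggregated states `agg` are the representations that the abstract's linear probe would be trained on for object recognition, and the same states (or the action embeddings alone, with visual features ablated) would feed the decoding analyses for allocentric fixation location and path integration.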