scene understanding
Latest
Mouse visual cortex as a limited resource system that self-learns an ecologically-general representation
Studies of the mouse visual system have revealed a variety of visual brain areas in a roughly hierarchical arrangement, together with a multitude of behavioral capacities, ranging from stimulus-reward associations, to goal-directed navigation, and object-centric discriminations. However, an overall understanding of the mouse’s visual cortex organization, and how this organization supports visual behaviors, remains unknown. Here, we take a computational approach to help address these questions, providing a high-fidelity quantitative model of mouse visual cortex. By analyzing factors contributing to model fidelity, we identified key principles underlying the organization of mouse visual cortex. Structurally, we find that comparatively low-resolution and shallow structure were both important for model correctness. Functionally, we find that models trained with task-agnostic, unsupervised objective functions, based on the concept of contrastive embeddings were substantially better than models trained with supervised objectives. Finally, the unsupervised objective builds a general-purpose visual representation that enables the system to achieve better transfer on out-of-distribution visual, scene understanding and reward-based navigation tasks. Our results suggest that mouse visual cortex is a low-resolution, shallow network that makes best use of the mouse’s limited resources to create a light-weight, general-purpose visual system – in contrast to the deep, high-resolution, and more task-specific visual system of primates.
NMC4 Short Talk: Image embeddings informed by natural language improve predictions and understanding of human higher-level visual cortex
To better understand human scene understanding, we extracted features from images using CLIP, a neural network model of visual concept trained with supervision from natural language. We then constructed voxelwise encoding models to explain whole brain responses arising from viewing natural images from the Natural Scenes Dataset (NSD) - a large-scale fMRI dataset collected at 7T. Our results reveal that CLIP, as compared to convolution based image classification models such as ResNet or AlexNet, as well as language models such as BERT, gives rise to representations that enable better prediction performance - up to a 0.86 correlation with test data and an r-square of 0.75 - in higher-level visual cortex in humans. Moreover, CLIP representations explain distinctly unique variance in these higher-level visual areas as compared to models trained with only images or text. Control experiments show that the improvement in prediction observed with CLIP is not due to architectural differences (transformer vs. convolution) or to the encoding of image captions per se (vs. single object labels). Together our results indicate that CLIP and, more generally, multimodal models trained jointly on images and text, may serve as better candidate models of representation in human higher-level visual cortex. The bridge between language and vision provided by jointly trained models such as CLIP also opens up new and more semantically-rich ways of interpreting the visual brain.
Computational psychophysics at the intersection of theory, data and models
Behavioural measurements are often overlooked by computational neuroscientists, who prefer to focus on electrophysiological recordings or neuroimaging data. This attitude is largely due to perceived lack of depth/richness in relation to behavioural datasets. I will show how contemporary psychophysics can deliver extremely rich and highly constraining datasets that naturally interface with computational modelling. More specifically, I will demonstrate how psychophysics can be used to guide/constrain/refine computational models, and how models can be exploited to design/motivate/interpret psychophysical experiments. Examples will span a wide range of topics (from feature detection to natural scene understanding) and methodologies (from cascade models to deep learning architectures).
Uncovering the temporal dynamics of scene understanding using Event-Related Potentials
Probing Motion-Form Interactions in the Macaque Inferior Temporal Cortex and Artificial Neural Networks for Complex Scene Understanding
COSYNE 2025
scene understanding coverage
5 items