Natural Scenes
Neural circuits for vision in the natural world
Context-dependent selectivity to natural scenes in the retina
Synthetic and natural images unlock the power of recurrency in primary visual cortex
During perception, the visual system integrates current sensory evidence with previously acquired knowledge of the visual world. Presumably this computation relies on internal recurrent interactions. We record populations of neurons from the primary visual cortex of cats and macaque monkeys and find evidence for adaptive internal responses to structured stimulation that change on both slow and fast timescales. In the first experiment, we briefly present abstract images, a protocol known to produce strong and persistent recurrent responses in the primary visual cortex. We show that repetitive presentations of a large randomized set of images lead to enhanced stimulus encoding on a timescale of minutes to hours. The enhanced encoding preserves the representational details required for image reconstruction and can be detected in post-exposure spontaneous activity. In a second experiment, we show that the encoding of natural scenes across populations of V1 neurons improves, over a timescale of hundreds of milliseconds, with the allocation of spatial attention. Given the hierarchical organization of the visual cortex, contextual information from higher levels of the processing hierarchy, reflecting high-level image regularities, can inform activity in V1 through feedback. We hypothesize that these fast attentional boosts in stimulus encoding rely on recurrent computations that capitalize on the presence of high-level visual features in natural scenes. We design control images dominated by low-level features and show that, in agreement with our hypothesis, the attentional benefits in stimulus encoding vanish. We conclude that powerful recurrent processes optimize neuronal responses in the visual system already at the earliest stages of cortical processing.
Retinal responses to natural inputs
The research in my lab focuses on sensory signal processing, particularly in cases where sensory systems perform at or near the limits imposed by physics. Photon counting in the visual system is a beautiful example. At its peak sensitivity, the performance of the visual system is limited largely by the division of light into discrete photons. This observation has several implications for phototransduction and signal processing in the retina: rod photoreceptors must transduce single-photon absorptions with high fidelity; single-photon signals in photoreceptors, which are only 0.03–0.1 mV, must be reliably transmitted to second-order cells in the retina; and absorption of a single photon by a single rod must produce a noticeable change in the pattern of action potentials sent from the eye to the brain. My approach is to combine quantitative physiological experiments and theory to understand photon counting in terms of basic biophysical mechanisms. Fortunately, there is more to visual perception than counting photons. The visual system is very adept at operating over a wide range of light intensities (about 12 orders of magnitude). Over most of this range, vision is mediated by cone photoreceptors; adaptation is therefore paramount to cone vision. Again, one would like to understand quantitatively how the biophysical mechanisms involved in phototransduction, synaptic transmission, and neural coding contribute to adaptation.
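To make the photon-counting limit concrete, here is a minimal sketch (an illustration, not the speaker's analysis) of an ideal observer detecting a dim flash when absorptions are Poisson-distributed; the mean count, dark-event rate, and criterion are all hypothetical values.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 100_000
mean_photons = 5      # hypothetical mean absorptions per rod pool per flash
dark_rate = 0.5       # hypothetical rate of spontaneous (thermal) events

# Photon absorptions are discrete and Poisson-distributed, so even an ideal
# detector faces overlapping count distributions on flash and no-flash trials.
flash_counts = rng.poisson(mean_photons + dark_rate, n_trials)
dark_counts = rng.poisson(dark_rate, n_trials)

# Ideal observer: report "flash" whenever the count exceeds a criterion.
criterion = 2
print("hit rate:        ", np.mean(flash_counts > criterion))
print("false-alarm rate:", np.mean(dark_counts > criterion))
```

Any downstream neural code, however faithful, can at best match the hit/false-alarm trade-off set by these photon statistics; that is the sense in which physics bounds performance.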
A Panoramic View on Vision
Statistics of natural scenes are not uniform: their structure varies dramatically from ground to sky. It remains unknown whether these non-uniformities are reflected in the large-scale organization of the early visual system and what benefits such adaptations would confer. By deploying an efficient coding argument, we predict that changes in the structure of receptive fields across visual space increase the efficiency of sensory coding. To test this experimentally, we developed a simple, novel imaging system that is indispensable for studies at this scale. In agreement with our predictions, we show that receptive fields of retinal ganglion cells change their shape along the dorsoventral axis, with a marked surround asymmetry at the visual horizon. Our work demonstrates that, in accordance with principles of efficient coding, the panoramic structure of natural scenes is exploited by the retina across space and cell types.
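The flavor of the efficient coding argument can be sketched numerically. Assuming a roughly 1/f² power spectrum for natural scenes and a Wiener-style optimal filter (a hedged illustration of the general logic, not the paper's actual derivation), the preferred filter shifts from whitening toward smoothing as noise grows, which is the kind of reasoning that predicts receptive field changes across visual space.

```python
import numpy as np

f = np.logspace(-1, 1, 200)      # spatial frequency axis (arbitrary units)
S = 1.0 / f**2                   # assumed natural-scene power spectrum (~1/f^2)

for noise, label in [(1e-3, "low noise (e.g. bright sky)"),
                     (1e-1, "high noise (e.g. dim ground)")]:
    # Wiener-style filter: whitens where signal dominates the noise,
    # rolls off where noise dominates.
    K = np.sqrt(S) / (S + noise)
    print(f"{label}: optimal filter peaks at f = {f[np.argmax(K)]:.2f}")
```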
NMC4 Short Talk: Hypothesis-neutral response-optimized models of higher-order visual cortex reveal strong semantic selectivity
Modeling neural responses to naturalistic stimuli has been instrumental in advancing our understanding of the visual system. Dominant computational modeling efforts in this direction have been deeply rooted in preconceived hypotheses. In contrast, hypothesis-neutral computational methodologies with minimal a priori assumptions, which bring neuroscience data directly to bear on the model development process, are likely to be much more flexible and effective in modeling and understanding tuning properties throughout the visual system. In this study, we develop a hypothesis-neutral approach and characterize response selectivity in the human visual cortex exhaustively and systematically via response-optimized deep neural network models. First, we leverage the unprecedented scale and quality of the recently released Natural Scenes Dataset to constrain parametrized neural models of higher-order visual areas and achieve high predictive precision, in some cases significantly outperforming state-of-the-art task-optimized models. Next, we ask which functional properties emerge spontaneously in these response-optimized models. We examine trained networks through structural (feature visualization) as well as functional (feature verbalization) analyses by running `virtual' fMRI experiments on large-scale probe datasets. Strikingly, although the models are optimized from scratch solely for brain-response prediction, with no category-level supervision, units in the optimized networks act as detectors for semantic concepts like `faces' or `words', providing some of the strongest evidence for categorical selectivity in these visual areas. The observed selectivity in model neurons raises another question: are the category-selective units simply functioning as detectors for their preferred category, or are they a by-product of a non-category-specific visual processing mechanism? To investigate this, we create selective deprivations in the visual diet of these response-optimized networks and study semantic selectivity in the resulting `deprived' networks, thereby also shedding light on the role of specific visual experiences in shaping neuronal tuning. With this new class of data-driven models and novel model interpretability techniques, our study illustrates that DNN models of visual cortex need not be conceived as obscure models with limited explanatory power, but rather as powerful, unifying tools for probing the nature of representations and computations in the brain.
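As a hedged sketch of the response-optimized idea, the toy network below is trained from scratch with no category labels, its only objective being the prediction of measured responses; the architecture, shapes, and data are illustrative stand-ins, not the authors' model.

```python
import torch
import torch.nn as nn

class ResponseOptimizedNet(nn.Module):
    """Toy image-to-voxel model: convolutional core plus a linear readout."""
    def __init__(self, n_voxels: int):
        super().__init__()
        self.core = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
        )
        self.readout = nn.Linear(64 * 4 * 4, n_voxels)

    def forward(self, x):
        return self.readout(self.core(x).flatten(1))

model = ResponseOptimizedNet(n_voxels=1000)
images = torch.randn(8, 3, 128, 128)   # stand-in for NSD stimuli
betas = torch.randn(8, 1000)           # stand-in for measured fMRI responses

# The sole training signal is response-prediction error: no category labels.
loss = nn.functional.mse_loss(model(images), betas)
loss.backward()
```

Any category selectivity found in the units of such a model is therefore emergent rather than built in, which is what makes the reported result striking.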
NMC4 Short Talk: Image embeddings informed by natural language improve predictions and understanding of human higher-level visual cortex
To better understand human scene understanding, we extracted features from images using CLIP, a neural network model of visual concepts trained with supervision from natural language. We then constructed voxelwise encoding models to explain whole-brain responses arising from viewing natural images from the Natural Scenes Dataset (NSD), a large-scale fMRI dataset collected at 7T. Our results reveal that CLIP, as compared to convolution-based image classification models such as ResNet or AlexNet, as well as language models such as BERT, gives rise to representations that enable better prediction performance, up to a 0.86 correlation with test data and an R² of 0.75, in higher-level visual cortex in humans. Moreover, CLIP representations explain unique variance in these higher-level visual areas compared to models trained with only images or only text. Control experiments show that the improvement in prediction observed with CLIP is not due to architectural differences (transformer vs. convolution) or to the encoding of image captions per se (vs. single object labels). Together, our results indicate that CLIP and, more generally, multimodal models trained jointly on images and text may serve as better candidate models of representation in human higher-level visual cortex. The bridge between language and vision provided by jointly trained models such as CLIP also opens up new and more semantically rich ways of interpreting the visual brain.
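A minimal sketch of this voxelwise encoding pipeline, assuming the openai/CLIP package and scikit-learn; the random images and betas stand in for NSD data, and the penalty grid is illustrative.

```python
import numpy as np
import torch, clip
from PIL import Image
from sklearn.linear_model import RidgeCV

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def embed(images):
    """Map a list of PIL images to CLIP image embeddings."""
    batch = torch.stack([preprocess(im) for im in images]).to(device)
    with torch.no_grad():
        return model.encode_image(batch).float().cpu().numpy()

rng = np.random.default_rng(0)
train_images = [Image.fromarray((rng.random((224, 224, 3)) * 255).astype("uint8"))
                for _ in range(40)]                   # stand-in stimuli
train_betas = rng.standard_normal((40, 500))          # stand-in voxel responses

X = embed(train_images)                               # (40, 512) CLIP features
encoder = RidgeCV(alphas=np.logspace(-2, 4, 7)).fit(X, train_betas)
# Evaluate per voxel as the Pearson r between predictions and held-out betas.
```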
Target detection in the natural world
Animal sensory systems are optimally adapted to the features typically encountered in natural surroundings, allowing neurons with limited bandwidth to encode extremely large input ranges. Importantly, natural scenes are not random, and peripheral visual systems have therefore evolved to reduce the predictable redundancy. The vertebrate visual cortex is also optimally tuned to the spatial statistics of natural scenes, but much less is known about how the insect brain responds to them. We are redressing this deficiency using several techniques. Olga Dyakova uses exquisite image manipulation to give natural images unnatural image statistics, or vice versa. Marissa Holden then uses these images as stimuli in electrophysiological recordings of neurons in the fly optic lobes, to see how the brain codes for the statistics typically encountered in natural scenes, and Olga Dyakova measures the behavioral optomotor response on our trackball set-up.
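One standard way to give a natural image unnatural statistics (or vice versa) is to change the slope of its amplitude spectrum while preserving its phase. The sketch below illustrates that general idea; it is an assumption for illustration, not necessarily the lab's exact manipulation.

```python
import numpy as np

def set_spectral_slope(img: np.ndarray, alpha: float) -> np.ndarray:
    """Impose an amplitude spectrum ~ 1/f**alpha while keeping the phase."""
    F = np.fft.fft2(img)
    fy = np.fft.fftfreq(img.shape[0])[:, None]
    fx = np.fft.fftfreq(img.shape[1])[None, :]
    f = np.sqrt(fx**2 + fy**2)
    f[0, 0] = 1.0                              # avoid dividing by zero at DC
    out = np.fft.ifft2((1.0 / f**alpha) * np.exp(1j * np.angle(F))).real
    return (out - out.min()) / (np.ptp(out) + 1e-12)   # rescale to [0, 1]

img = np.random.rand(256, 256)                      # stand-in for a photograph
natural_like = set_spectral_slope(img, alpha=1.0)   # ~1/f, scene-like spectrum
whitened = set_spectral_slope(img, alpha=0.0)       # flat, "unnatural" spectrum
```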
Understanding neural dynamics in high dimensions across multiple timescales: from perception to motor control and learning
Remarkable advances in experimental neuroscience now enable us to simultaneously observe the activity of many neurons, thereby providing an opportunity to understand how the moment-by-moment collective dynamics of the brain instantiate learning and cognition. However, efficiently extracting such a conceptual understanding from large, high-dimensional neural datasets requires concomitant advances in theoretically driven experimental design, data analysis, and neural circuit modeling. We will discuss how the modern frameworks of high-dimensional statistics and deep learning can aid us in this process. In particular, we will discuss: (1) how unsupervised tensor component analysis and time warping can extract unbiased and interpretable descriptions of how rapid single-trial circuit dynamics change slowly over many trials to mediate learning; (2) how to trade off very different experimental resources, such as the numbers of recorded neurons and trials, to accurately discover the structure of collective dynamics and information in the brain, even without spike sorting; (3) deep learning models that accurately capture the retina’s response to natural scenes as well as its internal structure and function; (4) algorithmic approaches for simplifying deep network models of perception; and (5) optimality approaches to explain cell-type diversity in the first steps of vision in the retina. A sketch of the tensor-decomposition idea in point (1) follows below.
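The sketch below illustrates the tensor-decomposition step of point (1), using the tensorly library as a stand-in for the methods discussed; the data, shapes, and rank are synthetic.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import non_negative_parafac

rng = np.random.default_rng(1)
data = rng.random((50, 200, 80))     # neurons x timepoints x trials (synthetic)

cp = non_negative_parafac(tl.tensor(data), rank=3)
neuron_f, time_f, trial_f = cp.factors
# neuron_f: which cells participate in each component
# time_f:   fast within-trial dynamics of each component
# trial_f:  slow across-trial changes, e.g. learning-related drift
```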
Do deep learning latent spaces resemble human brain representations?
In recent years, artificial neural networks have demonstrated human-like or super-human performance in many tasks, including image and speech recognition, natural language processing (NLP), and playing Go, chess, poker, and video games. One remarkable feature of the resulting models is that they can develop very intuitive latent representations of their inputs. In these latent spaces, simple linear operations tend to give meaningful results, as in the well-known analogy QUEEN - WOMAN + MAN = KING. We postulate that human brain representations share essential properties with these deep learning latent spaces. To verify this, we test whether artificial latent spaces can serve as a good model for decoding brain activity. We report improvements over state-of-the-art performance for reconstructing seen and imagined face images from fMRI brain activation patterns, using the latent space of a GAN (Generative Adversarial Network) model coupled with a Variational AutoEncoder (VAE). With another GAN model (BigBiGAN), we can decode and reconstruct natural scenes of any category from the corresponding brain activity. Our results suggest that deep learning can produce high-level representations approaching those found in the human brain. Finally, I will discuss whether these deep learning latent spaces could be relevant to the study of consciousness.
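The decoding logic can be sketched in a few lines: learn a regularized linear map from fMRI patterns to the generative model's latent space, then reconstruct by pushing predicted latents through the generator. Everything below (shapes, penalty, the generator placeholder) is illustrative, not the talk's actual pipeline.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_train, n_voxels, latent_dim = 1000, 4000, 120

fmri = rng.standard_normal((n_train, n_voxels))       # stand-in brain patterns
latents = rng.standard_normal((n_train, latent_dim))  # latents of seen images

decoder = Ridge(alpha=100.0).fit(fmri, latents)       # voxels -> latent space
z_hat = decoder.predict(rng.standard_normal((1, n_voxels)))
# reconstruction = generator(z_hat)   # e.g. a pretrained GAN/VAE generator
```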
Theoretical and computational approaches to neuroscience with complex models in high dimensions across multiple timescales: from perception to motor control and learning
Natural visual stimuli for mice
During the course of evolution, a species’ environment shapes its sensory abilities, as individuals with more optimized sensory abilities are more likely to survive and procreate. Adaptations to the statistics of the natural environment can be observed along the early visual pathway and across species. Therefore, characterising the properties of natural environments and studying the representation of natural scenes along the visual pathway is crucial for advancing our understanding of the structure and function of the visual system. In the past 20 years, mice have become an important model in vision research, but the fact that they live in a different environment than primates and have different visual needs is rarely considered. One particular challenge for characterising the mouse’s visual environment is that mice are dichromats with photoreceptors that detect UV light, which the typical camera does not record. This also has consequences for experimental visual stimulation, as the blue channel of computer screens fails to excite mouse UV cone photoreceptors. In my talk, I will describe our approach to recording “colour” footage of the habitat of mice – from the mouse’s perspective – and to studying retinal circuits in the ex vivo retina with natural movies.
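The UV problem can be illustrated with a back-of-the-envelope calculation: a photoreceptor's drive is the overlap between an emission spectrum and its spectral sensitivity, and a standard blue primary barely overlaps a mouse UV cone. The Gaussian sensitivities and primaries below are rough illustrative approximations, not measured spectra.

```python
import numpy as np

wl = np.arange(300, 700)                              # wavelength (nm)
gauss = lambda peak, width: np.exp(-0.5 * ((wl - peak) / width) ** 2)

uv_cone = gauss(360, 30)       # mouse UV-sensitive cone (~360 nm peak)
m_cone = gauss(510, 40)        # mouse M cone (~510 nm peak)
blue_led = gauss(460, 15)      # typical display blue primary
uv_led = gauss(385, 10)        # UV primary of a custom stimulator (assumed)

for name, led in [("blue LED", blue_led), ("UV LED", uv_led)]:
    uv_drive = (led * uv_cone).sum()   # overlap integral (unit nm steps)
    m_drive = (led * m_cone).sum()
    print(f"{name}: UV-cone drive {uv_drive:.1f}, M-cone drive {m_drive:.1f}")
```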
What the eye tells the brain: Visual feature extraction in the mouse retina
Visual processing begins in the retina: within only two synaptic layers, multiple parallel feature channels emerge, which relay highly processed visual information to different parts of the brain. To functionally characterize these feature channels, we perform calcium and glutamate population activity recordings at different levels of the mouse retina. This allows us to follow the complete visual signal across consecutive processing stages in a systematic way. In my talk, I will summarize our recent findings on the functional diversity of retinal output channels and how they arise within the retinal network. Specifically, I will talk about the role of inhibition and cell-type-specific dendritic processing in generating diverse visual channels. Then, I will focus on how color – a single visual feature – emerges across all retinal processing layers and link our results to behavioral output and the statistics of mouse natural scenes. With our approach, we hope to identify general computational principles of retinal signaling, thereby increasing our understanding of what the eye tells the brain.
Toward a High-fidelity Artificial Retina for Vision Restoration
Electronic interfaces to the retina represent an exciting development in science, engineering, and medicine – an opportunity to exploit our knowledge of neural circuitry and function to restore or even enhance vision. However, although existing devices demonstrate proof of principle in treating incurable blindness, they produce limited visual function. Some of the reasons for this can be understood based on the precise and specific neural circuitry that mediates visual signaling in the retina. Consideration of this circuitry suggests that future devices may need to operate at single-cell, single-spike resolution in order to mediate naturalistic visual function. I will show large-scale multi-electrode recording and stimulation data from the primate retina indicating that, in some cases, such resolution is possible. I will also discuss cases in which it fails, and propose that we can improve artificial vision in such conditions by incorporating our knowledge of the visual system in bi-directional devices that adapt to the host neural circuitry. Finally, I will introduce the Stanford Artificial Retina Project, aimed at developing a retinal implant that more faithfully reproduces the neural code of the retina, and briefly discuss the implications for scientific investigation and for other neural interfaces of the future.
Human reconstruction of local image structure from natural scenes
Retinal projections often poorly represent the structure of the physical world: well-defined boundaries within the eye may correspond to irrelevant features of the physical world, while critical features of the physical world may be nearly invisible in the retinal projection. Visual cortex is equipped with specialized mechanisms for sorting these two types of features according to their utility in interpreting the scene; however, we know little or nothing about the perceptual computations these mechanisms perform. I will present novel paradigms for the characterization of these processes in human vision, alongside examples of how the associated empirical results can be combined with targeted models to shape our understanding of the underlying perceptual mechanisms. Although the emerging view is far from complete, it challenges compartmentalized notions of bottom-up/top-down object segmentation, and suggests instead that these two modes are best viewed as an integrated perceptual mechanism.
Short-term adaptation reshapes retinal ganglion cell selectivity to natural scenes
Bernstein Conference 2024
Coarse-to-fine processing drives the efficient coding of natural scenes in mouse visual cortex
COSYNE 2022
Large retinal populations are collectively organized to efficiently process natural scenes
COSYNE 2022
Normative models of spatio-spectral decorrelation in natural scenes predict experimentally observed ratio of PR types
COSYNE 2022
Human-like behavior and neural representations emerge in a neural network trained to overtly search for objects in natural scenes from pixels
COSYNE 2025
Predictive and Invariant Representations via Motion and Form Factorization in Natural Scenes
COSYNE 2025
Neural pathways and computations that achieve stable contrast processing tuned to natural scenes
FENS Forum 2024