image classification
Comparing supervised learning dynamics: Deep neural networks match human data efficiency but show a generalisation lag
Recent research has seen many behavioral comparisons between humans and deep neural networks (DNNs) in the domain of image classification. Comparison studies often focus on the end result of the learning process, measuring and comparing the similarities between representations of object categories once they have been formed. However, the process by which these representations emerge—that is, the behavioral changes and intermediate stages observed during their acquisition—is less often compared directly and empirically. In this talk, I will report a detailed investigation of the learning dynamics of human observers and of various classic and state-of-the-art DNNs. We develop a constrained supervised learning environment to align learning-relevant conditions such as the starting point, input modality, available input data and the feedback provided. Across the whole learning process, we evaluate and compare how well the learned representations generalize to previously unseen test data. These comparisons indicate that DNNs are about as data-efficient as human learners, challenging some prevailing assumptions in the field. However, our results also reveal representational differences: while DNN learning is characterized by a pronounced generalisation lag, humans appear to acquire generalizable representations immediately, without a preliminary phase of learning training-set-specific information that is only later transferred to novel data.
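As a concrete illustration of the kind of quantity compared here, a generalisation lag can be read off from learning curves as the delay between reaching a criterion on the training set and reaching it on held-out data. The sketch below uses hypothetical accuracy values and a hypothetical criterion; it is not the study's analysis code.

```python
import numpy as np

def generalisation_lag(train_acc, test_acc, threshold=0.75):
    """Illustrative lag measure: number of checkpoints between reaching
    `threshold` accuracy on the training set and on held-out test data
    (0 = no lag, larger = later generalisation). Values are hypothetical."""
    train_acc, test_acc = np.asarray(train_acc), np.asarray(test_acc)
    if train_acc.max() < threshold or test_acc.max() < threshold:
        return np.nan                                      # criterion never reached
    first_train = int(np.argmax(train_acc >= threshold))   # first checkpoint above criterion
    first_test = int(np.argmax(test_acc >= threshold))
    return first_test - first_train

# Hypothetical learning curves (accuracy per checkpoint)
human = dict(train=[0.40, 0.70, 0.80, 0.85], test=[0.40, 0.69, 0.79, 0.84])
dnn   = dict(train=[0.50, 0.80, 0.90, 0.95], test=[0.30, 0.50, 0.70, 0.88])
print(generalisation_lag(human["train"], human["test"]))   # no lag
print(generalisation_lag(dnn["train"], dnn["test"]))        # lag of two checkpoints
```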
Error Consistency between Humans and Machines as a function of presentation duration
Within the last decade, deep artificial neural networks (DNNs) have emerged as powerful computer vision systems that match or exceed human performance on many benchmark tasks such as image classification. But whether current DNNs are suitable computational models of the human visual system remains an open question: while DNNs have proven capable of predicting neural activations in primate visual cortex, psychophysical experiments have shown behavioral differences between DNNs and human subjects, as quantified by error consistency. Error consistency is typically measured by briefly presenting natural or corrupted images to human subjects and asking them to perform an n-way classification task under time pressure. But for how long should stimuli ideally be presented to guarantee a fair comparison with DNNs? Here we investigate the influence of presentation time on error consistency, to test the hypothesis that higher-level processing drives the behavioral differences. We systematically vary presentation times of backward-masked stimuli from 8.3 ms to 266 ms and measure human performance and reaction times on natural, low-pass-filtered and noisy images. Our experiment constitutes a fine-grained analysis of human image classification under both image corruptions and time pressure, showing that even drastically time-constrained humans, exposed to the stimuli for only two frames (16.6 ms), can still solve our 8-way classification task with success rates well above chance. We also find that human-to-human error consistency is already stable at 16.6 ms.
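As background, error consistency is commonly quantified as a Cohen's-kappa-like statistic comparing the observed trial-by-trial agreement of correct/incorrect decisions with the agreement expected from the two observers' accuracies alone. Below is a minimal sketch of that computation on made-up responses; it is an illustration, not the analysis code used in the talk.

```python
import numpy as np

def error_consistency(correct_a, correct_b):
    """Kappa-style error consistency between two observers.
    `correct_a`, `correct_b`: boolean arrays with one entry per shared trial,
    True where the respective observer classified the image correctly."""
    a = np.asarray(correct_a, dtype=bool)
    b = np.asarray(correct_b, dtype=bool)
    c_obs = np.mean(a == b)                      # observed trial-by-trial agreement
    p_a, p_b = a.mean(), b.mean()                # individual accuracies
    c_exp = p_a * p_b + (1 - p_a) * (1 - p_b)    # agreement expected from accuracies alone
    return (c_obs - c_exp) / (1 - c_exp)         # 0 = chance-level agreement, 1 = identical errors

# Hypothetical responses of a human observer and a DNN on 8 shared trials
human = [1, 1, 0, 1, 0, 1, 1, 0]
dnn   = [1, 1, 0, 0, 0, 1, 1, 1]
print(error_consistency(human, dnn))
```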
NMC4 Short Talk: Image embeddings informed by natural language improve predictions and understanding of human higher-level visual cortex
To better understand human scene understanding, we extracted features from images using CLIP, a neural network model of visual concepts trained with supervision from natural language. We then constructed voxelwise encoding models to explain whole-brain responses to natural images from the Natural Scenes Dataset (NSD), a large-scale fMRI dataset collected at 7T. Our results reveal that CLIP, compared to convolution-based image classification models such as ResNet or AlexNet, as well as to language models such as BERT, gives rise to representations that enable better prediction performance - up to a 0.86 correlation with test data and an R² of 0.75 - in human higher-level visual cortex. Moreover, CLIP representations explain unique variance in these higher-level visual areas compared to models trained only on images or only on text. Control experiments show that the improvement in prediction observed with CLIP is not due to architectural differences (transformer vs. convolution) or to the encoding of image captions per se (vs. single object labels). Together, our results indicate that CLIP and, more generally, multimodal models trained jointly on images and text may serve as better candidate models of representation in human higher-level visual cortex. The bridge between language and vision provided by jointly trained models such as CLIP also opens up new and more semantically rich ways of interpreting the visual brain.
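For readers unfamiliar with voxelwise encoding models: they amount to a regularized linear regression from image features to each voxel's response, evaluated by correlating predicted and measured responses on held-out images. The sketch below uses random arrays as hypothetical stand-ins for CLIP embeddings and NSD responses and scikit-learn's Ridge for the regression; it is not the authors' pipeline.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins: X = image embeddings (n_images x n_features),
# Y = fMRI responses (n_images x n_voxels). In practice these would come from
# CLIP's image encoder and the Natural Scenes Dataset, respectively.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 512))
Y = X @ rng.standard_normal((512, 50)) * 0.1 + rng.standard_normal((1000, 50))

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0)

# One linear ridge model per voxel (sklearn fits all output voxels jointly here)
model = Ridge(alpha=1.0).fit(X_tr, Y_tr)
Y_hat = model.predict(X_te)

# Voxelwise prediction performance: Pearson correlation between predicted
# and measured responses on held-out images
r = np.array([np.corrcoef(Y_hat[:, v], Y_te[:, v])[0, 1] for v in range(Y.shape[1])])
print("median voxel correlation:", np.median(r))
```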
Transforming task representations
Humans can adapt to a novel task on their first try. By contrast, artificial intelligence systems often require immense amounts of data to adapt. In this talk, I will discuss my recent work (https://www.pnas.org/content/117/52/32970) on creating deep learning systems that can adapt on their first try by exploiting relationships between tasks. Specifically, the approach transforms a representation for a known task into a representation for the novel task, by inferring and then applying a higher-order function that captures a relationship between the tasks. This approach can be interpreted as a type of analogical reasoning. I will show that task transformation can allow systems to adapt to novel tasks on their first try in domains ranging from card games to mathematical objects to image classification and reinforcement learning. I will discuss the analogical interpretation of this approach, an analogy between levels of abstraction within the model architecture that I refer to as homoiconicity, and what this work might suggest about using deep learning models to infer analogies more generally.
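One schematic way to picture the approach (not the paper's exact architecture): a higher-order network maps the representation of a known task to a representation for a related novel task, and the transformed representation then conditions behavior zero-shot. The sketch below is purely illustrative; all network sizes, initializations and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def mlp(sizes):
    """Randomly initialized weights for a small MLP (illustration only)."""
    return [rng.standard_normal((m, n)) * 0.1 for m, n in zip(sizes[:-1], sizes[1:])]

def forward(ws, x):
    for w in ws[:-1]:
        x = np.tanh(x @ w)
    return x @ ws[-1]

# A known task is summarized by an embedding z; a higher-order "transform"
# network maps z to z' for a related novel task (e.g. the inverse of the task).
task_embedding_known = rng.standard_normal(32)
transform_net = mlp([32, 64, 32])          # higher-order function over task representations
task_embedding_novel = forward(transform_net, task_embedding_known)

# The transformed embedding then conditions the policy/classifier for the new task
policy_net = mlp([32 + 16, 64, 4])         # input: task embedding + observation
observation = rng.standard_normal(16)
action_logits = forward(policy_net, np.concatenate([task_embedding_novel, observation]))
print(action_logits.shape)                 # first-try behavior on the novel task
```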
Recurrent network dynamics lead to interference in sequential learning
Learning in real life is often sequential: a learner first learns task A, then task B. If the tasks are related, the learner may adapt the previously learned representation instead of generating a new one from scratch. Adaptation may ease learning task B but may also decrease performance on task A. Such interference has been observed in both experimental and machine learning studies. In the latter case, it is mediated by correlations between weight updates for the different tasks. In typical applications, like image classification with feed-forward networks, these correlated weight updates can be traced back to input correlations. For many neuroscience tasks, however, networks need to not only transform the input, but also generate substantial internal dynamics. Here we illuminate the role of internal dynamics in interference in recurrent neural networks (RNNs). We analyze RNNs trained sequentially on neuroscience tasks with gradient descent and observe forgetting even for orthogonal tasks. We find that the degree of interference changes systematically with task properties, especially with the relative emphasis on input-driven versus autonomously generated dynamics. To better understand our numerical observations, we thoroughly analyze a simple model of working memory: for task A, a network is presented with an input pattern and trained to generate a fixed point aligned with this pattern; for task B, the network has to memorize a second, orthogonal pattern. Adapting an existing representation corresponds to a rotation of the fixed point in phase space, as opposed to the emergence of a new one. We show that the two modes of learning – rotation vs. new formation – are directly linked to recurrent vs. input-driven dynamics. We make this notion precise in a further simplified, analytically tractable model, where learning is restricted to a 2x2 matrix. In our analysis of trained RNNs, we also make the surprising observation that, across different tasks, larger random initial connectivity reduces interference. Analyzing the fixed-point task reveals the underlying mechanism: the random connectivity strongly accelerates the learning mode of new formation, and has less effect on rotation. The former thus wins the race to zero loss, and interference is reduced. Altogether, our work offers a new perspective on sequential learning in recurrent networks, and the emphasis on internally generated dynamics allows us to take the history of individual learners into account.
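The sequential-training protocol and the interference measure can be illustrated with a toy version of the fixed-point task: train a small RNN's recurrent weights to reach a state aligned with pattern A, then train on an orthogonal pattern B, and compare the task-A loss before and after learning B. The network size, patterns and finite-difference learning rule below are illustrative assumptions and do not reproduce the talk's specific findings.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 4, 10                                     # network size and rollout length (toy values)

def run(W, u):
    """Discrete-time RNN driven by a constant input pattern u."""
    x = np.zeros(N)
    for _ in range(T):
        x = np.tanh(W @ x + u)
    return x

def loss(W, u):
    """Fixed-point task: after T steps the state should match the pattern."""
    return np.sum((run(W, u) - u) ** 2)

def train(W, u, lr=0.05, steps=400, eps=1e-5):
    """Plain gradient descent with finite-difference gradients (sketch only)."""
    for _ in range(steps):
        grad = np.zeros_like(W)
        for i in range(N):
            for j in range(N):
                dW = np.zeros_like(W)
                dW[i, j] = eps
                grad[i, j] = (loss(W + dW, u) - loss(W - dW, u)) / (2 * eps)
        W = W - lr * grad
    return W

u_A, u_B = 0.8 * np.eye(N)[0], 0.8 * np.eye(N)[1]    # orthogonal target patterns
W0 = 0.5 * rng.standard_normal((N, N)) / np.sqrt(N)  # random initial connectivity

W_A = train(W0, u_A)                             # learn task A first ...
W_AB = train(W_A, u_B)                           # ... then task B, sequentially
print("task A loss after learning A:", round(loss(W_A, u_A), 4))
print("task A loss after learning B:", round(loss(W_AB, u_A), 4))  # any increase quantifies interference
```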
Back-propagation in spiking neural networks
Back-propagation is a powerful supervised learning algorithm for artificial neural networks because it solves the credit assignment problem (essentially: what should the hidden layers do?). This algorithm has led to the deep learning revolution. Unfortunately, back-propagation cannot be used directly in spiking neural networks (SNNs): it requires differentiable activation functions, whereas spikes are all-or-none events that cause discontinuities. Here we present two strategies to overcome this problem. The first is to use a so-called 'surrogate gradient', that is, to approximate the derivative of the threshold function with the derivative of a sigmoid. We will present some applications of this method to time-series processing (audio, internet traffic, EEG). The second concerns a specific class of SNNs that process static inputs using latency coding with at most one spike per neuron. Using approximations, we derived a latency-based back-propagation rule for this sort of network, called S4NN, and applied it to image classification.
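As a concrete illustration of the first strategy, here is a minimal surrogate-gradient sketch, assuming PyTorch: the forward pass emits all-or-none spikes via a hard threshold, while the backward pass substitutes the derivative of a sigmoid. The threshold and slope values are arbitrary, and this is a generic illustration rather than the implementations (or S4NN) presented in the talk.

```python
import torch

class SurrogateSpike(torch.autograd.Function):
    """Heaviside spike in the forward pass; sigmoid-derivative surrogate in the backward pass."""

    @staticmethod
    def forward(ctx, v, threshold, beta):
        ctx.save_for_backward(v)
        ctx.threshold, ctx.beta = threshold, beta
        return (v >= threshold).float()                    # all-or-none spike

    @staticmethod
    def backward(ctx, grad_output):
        (v,) = ctx.saved_tensors
        s = torch.sigmoid(ctx.beta * (v - ctx.threshold))
        grad_v = grad_output * ctx.beta * s * (1 - s)      # surrogate in place of the true (zero/undefined) derivative
        return grad_v, None, None                          # no gradients for threshold and beta

def spike(v, threshold=1.0, beta=5.0):
    return SurrogateSpike.apply(v, threshold, beta)

# Minimal usage: membrane potentials -> spikes, with gradients flowing through the surrogate
v = torch.randn(8, requires_grad=True)
spike(v).sum().backward()
print(v.grad)                                              # nonzero thanks to the surrogate derivative
```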
Deep reinforcement learning and its neuroscientific implications
The last few years have seen some dramatic developments in artificial intelligence research. What implications might these have for neuroscience? Investigations of this question have, to date, focused largely on deep neural networks trained using supervised learning on tasks such as image classification. However, another area of recent AI work has so far received less attention from neuroscientists but may have more profound neuroscientific implications: deep reinforcement learning. Deep RL offers a rich framework for studying the interplay among learning, representation and decision-making, giving the brain sciences a new set of research tools and a wide range of novel hypotheses. I'll provide a high-level introduction to deep RL, discuss some recent neuroscience-oriented investigations from my group at DeepMind, and survey some wider implications for research on brain and behavior.
Saccade Mechanisms for Image Classification, Object Detection and Tracking
Neuromatch 5