Human Observers
Comparing supervised learning dynamics: Deep neural networks match human data efficiency but show a generalisation lag
Recent research has produced many behavioral comparisons between humans and deep neural networks (DNNs) in the domain of image classification. Comparison studies often focus on the end result of the learning process, measuring and comparing similarities in the representations of object categories once they have been formed. However, the process by which these representations emerge, that is, the behavioral changes and intermediate stages observed during acquisition, is less often compared directly and empirically. In this talk I report a detailed investigation of the learning dynamics of human observers and of several classic and state-of-the-art DNNs. We developed a constrained supervised learning environment that aligns learning-relevant conditions such as the starting point, input modality, available input data, and the feedback provided. Throughout the learning process, we evaluate and compare how well the learned representations generalise to previously unseen test data. These comparisons indicate that DNNs match the data efficiency of human learners, challenging some prevailing assumptions in the field. However, our results also reveal representational differences: DNN learning is characterized by a pronounced generalisation lag, whereas humans appear to acquire generalisable representations immediately, without a preliminary phase of learning training-set-specific information that is only later transferred to novel data.
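The core measurement behind a comparison like this can be sketched in a few lines: track accuracy on the training set and on held-out data at every step of learning, and read the generalisation lag off the gap between the two curves. The sketch below is purely illustrative, using a toy logistic-regression "learner" on synthetic data (none of this is the authors' actual setup, models, or stimuli).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary classification task: two Gaussian clusters in 2-D.
def make_data(n):
    X0 = rng.normal(loc=-1.0, scale=1.0, size=(n, 2))
    X1 = rng.normal(loc=+1.0, scale=1.0, size=(n, 2))
    return np.vstack([X0, X1]), np.array([0] * n + [1] * n)

X_train, y_train = make_data(100)
X_test, y_test = make_data(100)   # unseen data from the same distribution

# A stand-in learner: logistic regression trained by gradient descent.
w, b, lr = np.zeros(2), 0.0, 0.1
history = []                      # (train accuracy, test accuracy) per epoch
for epoch in range(50):
    p = 1.0 / (1.0 + np.exp(-(X_train @ w + b)))
    w -= lr * X_train.T @ (p - y_train) / len(y_train)
    b -= lr * np.mean(p - y_train)
    acc_train = np.mean((X_train @ w + b > 0) == y_train)
    acc_test = np.mean((X_test @ w + b > 0) == y_test)
    history.append((acc_train, acc_test))

# "Generalisation lag" at each point of the learning curve: how far
# held-out accuracy trails training accuracy.
lag = [tr - te for tr, te in history]
```

A learner that immediately forms generalisable representations (as the human data suggest) would show a near-zero lag throughout, while a learner that first fits training-set-specific information would show a large early gap that closes only later.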
Learning to see stuff
Materials with complex appearances, like textiles and foodstuffs, pose challenges for conventional theories of vision. How does the brain learn to see properties of the world, like the glossiness of a surface, that cannot be measured by any other sense? Recent advances in unsupervised deep learning may help shed light on material perception. I will show how an unsupervised deep neural network, trained on an artificial environment of surfaces with different shapes, materials, and lighting, spontaneously comes to encode those factors in its internal representations. Most strikingly, the model makes patterns of errors in its perception of material that follow, on an image-by-image basis, the patterns of errors made by human observers. Unsupervised deep learning may thus provide a coherent framework for how many perceptual dimensions form, in material perception and beyond.
Exploring Memories of Scenes
State-of-the-art machine vision models can predict human recognition memory for complex scenes with astonishing accuracy. In this talk I present work investigating how memorable scenes are actually remembered and experienced by human observers. We found that memorable scenes were recognized largely on the basis of recollection of specific episodic details, but also on the basis of familiarity with the scene as a whole. I thus highlight current limitations of machine vision models in emulating human recognition memory, along with promising opportunities for future research. We were also interested in what observers specifically remember about complex scenes, and therefore considered the functional role of eye movements as a window into the content of memories, particularly when observers recollected specific information about a scene. We found that when observers formed a memory representation that they later recollected (compared with scenes that only felt familiar), the overall extent of exploration was broader, with a specific subset of fixations clustered around later to-be-recollected scene content, irrespective of the memorability of the scene. I discuss the critical role that viewing behavior plays in visual memory formation and retrieval, and point to potential implications for machine vision models that predict the content of human memories.
Neural and computational principles of the processing of dynamic faces and bodies
Body motion is a fundamental signal of social communication, comprising facial as well as full-body movements. Combining advanced methods from computer animation with motion capture in humans and monkeys, we synthesized highly realistic monkey avatar models. Our face avatar is perceived by monkeys as almost equivalent to a real animal and, unlike all previously used avatar models in studies with monkeys, does not induce an ‘uncanny valley’ effect. Applying machine-learning methods for the control of motion style, we investigated how species-specific shape and dynamic cues influence the perception of human and monkey facial expressions. Human observers showed very fast learning of monkey expressions, and a perceptual encoding of expression dynamics that was largely independent of facial shape. This result is in line with the fact that facial shape evolved faster than neuromuscular control in primate phylogenesis. At the same time, it challenges popular neural network models of the recognition of dynamic faces, which assume a joint encoding of facial shape and dynamics. We propose an alternative, physiologically inspired neural model that realizes such an orthogonal encoding of facial shape and expression from video sequences. As a second example, we investigated the perception of social interactions from abstract stimuli, similar to those of Heider & Simmel (1944), as well as from more realistic stimuli. We developed and validated a new generative model for the synthesis of such social interactions, based on a modification of a human navigation model. We demonstrate that the recognition of such stimuli, including the perception of agency, can be accounted for by a relatively elementary, physiologically inspired hierarchical neural recognition model that does not require the sophisticated inference mechanisms postulated by some cognitive theories of social recognition.
In summary, these findings suggest that essential phenomena in social cognition might be accounted for by a small set of simple neural principles that can easily be implemented by cortical circuits. The developed technologies for stimulus control form the basis of electrophysiological studies that can verify specific neural circuits, such as the ones proposed by our theoretical models.
Bayesian integration of audiovisual speech by DNN models is similar to that of human observers
COSYNE 2025