ePoster

Contrastive-Equivariant self-supervised learning improves alignment with primate visual area IT

Thomas Yerxa, Jenelle Feather, Eero Simoncelli, SueYeon Chung
COSYNE 2025 (2025)
Montreal, Canada


Abstract

Task-trained deep neural networks have emerged as leading models of neural responses in the primate visual system, especially for later stages of the ventral stream such as inferotemporal (IT) cortex. One major criticism of this approach is that the tasks used to train such networks rely on large numbers of labeled examples, and are thus not ecologically plausible. Recent approaches in representation learning circumvent the need for labeled examples, and match or surpass supervised learning methods on a variety of tasks. These approaches generally rely on supervision signals extracted from the data, rather than from human annotations, and are thus called “self-supervised”. Many self-supervised learning methods aim to learn a representation that is (1) invariant to a set of transformations in the input space (often called augmentations) and (2) maximally discriminative across distinct input images. However, this training is not well aligned with known characteristics of visual perception: transformations for which the network is encouraged to be invariant are generally quite perceptible to humans. Moreover, recent work has shown that the factorization of variability due to image transformations is more closely related to neural predictivity than the lack of variability. In this work we present a novel self-supervised learning method that trades off invariance (discarding information about the input transformation) and equivariance (maintaining information about the input transformation). Specifically, a representation is said to be equivariant with respect to some transformation of the inputs if the same transformation, applied to different inputs, results in the same change in the representation. We demonstrate that our equivariant learning approach produces representations that contain more “category orthogonal” information, better factorize the sources of variability in the datasets, and better predict neural activity in visual area IT.
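The invariance/equivariance distinction in the abstract can be made concrete with a small numerical sketch. This is not the authors' actual objective; it is an illustrative numpy example under one simple assumption: an invariance penalty pushes the two augmented views of each image to have identical representations, while an equivariance penalty instead asks that the *change* induced by a shared transformation be the same across images (here measured as deviation from the batch-mean change). The function names, the squared-error form, and the `lam` tradeoff weight are all hypothetical.

```python
import numpy as np

def invariance_loss(z_a, z_b):
    # z_a, z_b: (batch, dim) representations of two views of the same images.
    # Invariance penalizes ANY difference between the two views.
    return np.mean(np.sum((z_a - z_b) ** 2, axis=1))

def equivariance_loss(z_a, z_b):
    # delta_i = change in representation induced by the transformation
    # applied to image i.
    deltas = z_b - z_a
    # Equivariance: the same transformation should move every image's
    # representation the same way, so penalize each delta's deviation
    # from the batch-mean delta (a consistent shift costs nothing).
    mean_delta = deltas.mean(axis=0, keepdims=True)
    return np.mean(np.sum((deltas - mean_delta) ** 2, axis=1))

def combined_loss(z_a, z_b, lam=0.5):
    # Hypothetical tradeoff: lam = 0 recovers pure invariance,
    # lam = 1 pure equivariance.
    return (1 - lam) * invariance_loss(z_a, z_b) + lam * equivariance_loss(z_a, z_b)
```

Under this toy definition, adding the same constant vector to every representation incurs a large invariance loss but zero equivariance loss: the transformation is fully preserved, just consistently so.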

Unique ID: cosyne-25/contrastive-equivariant-self-supervised-87aed023