Authors & Affiliations
Zeyu Yun, Christopher Kymn, Galen Chuang, Yubei Chen, Bruno Olshausen
Abstract
A fundamental goal of visual perception is to generate informative representations that support downstream tasks. Doing so requires disentangling the raw image stream into representations that reflect its underlying causes, namely form and motion. We therefore propose a generative model containing separate latent variables representing form and motion. The model is related to prior work on sparse coding and slow feature analysis, but goes further by positing multiplicative interactions between the latent variables for invariant object representations and the latent variables for equivariant transformations representing motion. These principles enable the model to analyze and synthesize time-varying natural scenes, and the learned representations have potential implications for neural coding. When the model is trained on natural videos, we observe complex cell responses similar to those found in primary visual cortex (V1). However, our goal is to understand not only how complex cells emerge but also why they are useful. We find that, at the population level, complex cells form invariant representations of objects that persist over time. These invariant representations are complemented by units that encode transformations in the video and are not object-specific; collectively, these units represent motion as equivariant transformations of the population. This enables compact representation and prediction of video. These results lead to a hierarchical generative model of visual cortex that is capable of the rich yet rapid inference required for perceiving visual scenes. Overall, these results provide a new perspective on the functional purpose of complex cell populations, and new implications for how both invariance and equivariance are computed in V1 and higher visual areas.
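
To make the idea of multiplicative interactions between form and motion latents concrete, here is a minimal NumPy sketch of a generic bilinear generative model. It is an illustration under stated assumptions, not the paper's implementation: the dimensions, the basis tensor Phi, and the synthesize function are hypothetical choices made for this example.

```python
# Minimal sketch (assumed, illustrative): a bilinear generative model in which
# an image patch is synthesized from a "form" vector a and a "motion" vector b
# that interact multiplicatively through a tensor of basis functions Phi.

import numpy as np

rng = np.random.default_rng(0)

D = 256   # pixels in a 16 x 16 image patch (assumed size)
K = 32    # number of form (object/feature) latents
M = 8     # number of motion (transformation) latents

# Phi[:, k, m] is the basis function paired with form unit k and motion unit m.
Phi = rng.standard_normal((D, K, M)) * 0.1

def synthesize(a, b, Phi):
    """Generate a patch: x = sum_{k,m} Phi[:, k, m] * a[k] * b[m].

    a (shape K) acts as an invariant form code: it stays fixed while
    b (shape M) re-expresses the same content under different transformations.
    """
    return np.einsum('dkm,k,m->d', Phi, a, b)

# A sparse form code shared across two "frames", with different motion codes.
a = np.zeros(K)
a[[3, 17]] = [1.0, 0.5]            # which features are present
b_t0 = rng.standard_normal(M)      # transformation state at time t
b_t1 = rng.standard_normal(M)      # transformation state at time t+1

x_t0 = synthesize(a, b_t0, Phi)    # same form, different pose/motion state
x_t1 = synthesize(a, b_t1, Phi)

# The form code a is identical across frames even though the rendered
# patches differ, mirroring an invariant (complex-cell-like) representation.
print(x_t0.shape, x_t1.shape)
```

In a trained model of this general type, inference would recover a sparse form code that persists across frames while the motion code tracks the frame-to-frame transformation; the snippet above only shows the generative (synthesis) direction.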