AlexNet
Latest
NMC4 Short Talk: Image embeddings informed by natural language improve predictions and understanding of human higher-level visual cortex
To better understand human scene understanding, we extracted features from images using CLIP, a neural network model of visual concept trained with supervision from natural language. We then constructed voxelwise encoding models to explain whole brain responses arising from viewing natural images from the Natural Scenes Dataset (NSD) - a large-scale fMRI dataset collected at 7T. Our results reveal that CLIP, as compared to convolution based image classification models such as ResNet or AlexNet, as well as language models such as BERT, gives rise to representations that enable better prediction performance - up to a 0.86 correlation with test data and an r-square of 0.75 - in higher-level visual cortex in humans. Moreover, CLIP representations explain distinctly unique variance in these higher-level visual areas as compared to models trained with only images or text. Control experiments show that the improvement in prediction observed with CLIP is not due to architectural differences (transformer vs. convolution) or to the encoding of image captions per se (vs. single object labels). Together our results indicate that CLIP and, more generally, multimodal models trained jointly on images and text, may serve as better candidate models of representation in human higher-level visual cortex. The bridge between language and vision provided by jointly trained models such as CLIP also opens up new and more semantically-rich ways of interpreting the visual brain.
NMC4 Short Talk: Untangling Contributions of Distinct Features of Images to Object Processing in Inferotemporal Cortex
How do humans perceive daily objects of various features and categorize these seemingly intuitive and effortless mental representations? Prior literature focusing on the role of the inferotemporal region (IT) has revealed object category clustering that is consistent with the semantic predefined structure (superordinate, ordinate, subordinate). It has however been debated whether the neural signals in the IT regions are a reflection of such categorical hierarchy [Wen et al.,2018; Bracci et al., 2017]. Visual attributes of images that correlated with semantic and category dimensions may have confounded these prior results. Our study aimed to address this debate by building and comparing models using the DNN AlexNet, to explain the variance in representational dissimilarity matrix (RDM) of neural signals in the IT region. We found that mid and high level perceptual attributes of the DNN model contribute the most to neural RDMs in the IT region. Semantic categories, as in predefined structure, were moderately correlated with mid to high DNN layers (r = [0.24 - 0.36]). Variance partitioning analysis also showed that the IT neural representations were mostly explained by DNN layers, while semantic categorical RDMs brought little additional information. In light of these results, we propose future works should focus more on the specific role IT plays in facilitating the extraction and coding of visual features that lead to the emergence of categorical conceptualizations.
AlexNet coverage
2 items