Seminar · Neuroscience · ✓ Recording Available

NMC4 Short Talk: Image embeddings informed by natural language improve predictions and understanding of human higher-level visual cortex

Aria Wang

Graduate Student

Carnegie Mellon University

Schedule

Wednesday, December 1, 2021, 6:00 AM (America/New_York)


Watch the seminar


Recording provided by the organiser.

Event Information

Domain: Neuroscience
Original Event: View source
Host: Neuromatch 4
Duration: 15 minutes

Abstract

To better understand human scene understanding, we extracted features from images using CLIP, a neural network model of visual concepts trained with supervision from natural language. We then constructed voxelwise encoding models to explain whole-brain responses arising from viewing natural images from the Natural Scenes Dataset (NSD), a large-scale fMRI dataset collected at 7T. Our results reveal that CLIP, as compared to convolution-based image classification models such as ResNet or AlexNet, as well as language models such as BERT, gives rise to representations that enable better prediction performance (up to a 0.86 correlation with test data and an R² of 0.75) in higher-level visual cortex in humans. Moreover, CLIP representations explain distinctly unique variance in these higher-level visual areas as compared to models trained with only images or text. Control experiments show that the improvement in prediction observed with CLIP is not due to architectural differences (transformer vs. convolution) or to the encoding of image captions per se (vs. single object labels). Together, our results indicate that CLIP and, more generally, multimodal models trained jointly on images and text, may serve as better candidate models of representation in human higher-level visual cortex. The bridge between language and vision provided by jointly trained models such as CLIP also opens up new and more semantically rich ways of interpreting the visual brain.
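For readers unfamiliar with the method, the following is a minimal sketch of a voxelwise encoding analysis in the spirit of the abstract: precomputed image embeddings (standing in for CLIP features) are mapped to per-voxel fMRI responses with ridge regression, and prediction performance is scored as the per-voxel correlation with held-out data. The array shapes, synthetic data, and regression settings are illustrative assumptions, not the speaker's actual pipeline.

```python
# Sketch of a voxelwise encoding analysis (illustrative, not the authors' code).
# Assumption: `features` would hold precomputed CLIP image embeddings
# (e.g. 512-d per stimulus) and `voxels` the corresponding fMRI responses;
# random arrays are used here so the script runs standalone.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_images, n_features, n_voxels = 1000, 512, 2000    # stand-in sizes, not NSD's
features = rng.standard_normal((n_images, n_features))  # image embeddings
voxels = rng.standard_normal((n_images, n_voxels))       # responses (images x voxels)

X_train, X_test, y_train, y_test = train_test_split(
    features, voxels, test_size=0.2, random_state=0
)

# One linear map per voxel; sklearn's Ridge fits all voxels jointly as multi-output.
model = Ridge(alpha=1.0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

def columnwise_corr(a, b):
    """Pearson correlation between matching columns of two (samples x voxels) arrays."""
    a = (a - a.mean(0)) / a.std(0)
    b = (b - b.mean(0)) / b.std(0)
    return (a * b).mean(0)

# Per-voxel correlation with held-out data: the kind of prediction-performance
# score quoted in the abstract.
r = columnwise_corr(y_pred, y_test)
print(f"median voxel correlation: {np.median(r):.3f}")
```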

Topics

BERT, CLIP, Natural Scenes Dataset, fMRI, higher-level visual cortex, image embeddings, multimodal models, natural language, prediction performance, predictions, understanding, visual brain, voxelwise encoding models

About the Speaker

Aria Wang

Graduate Student

Carnegie Mellon University

Contact & Resources

Personal Website: www.cnbc.cmu.edu/cnbc-directory/name/yuan-wang/
Twitter/X: @ariairaw (twitter.com/ariairaw)

Related Seminars

  • Knight ADRC Seminar · neuro · Jan 20, 2025 · Washington University in St. Louis, Neurology
  • TBD · neuro · Jan 20, 2025 · King's College London
  • Guiding Visual Attention in Dynamic Scenes · neuro · Jan 20, 2025 · Haifa U