ePoster

Bayesian integration of audiovisual speech by DNN models is similar to human observers

Haotian Ma, Xiang Zhang, Zhengjia Wang, John F. Magnotti, Michael S. Beauchamp
COSYNE 2025 (2025)
Montreal, Canada

Abstract

We investigated whether transformer-based machine learning (ML) models behave like human observers and can predict electrophysiological data in audiovisual speech recognition tasks. Human audiovisual speech recognition follows the principles of Bayes-optimal integration, with each modality weighted by its reliability (Author et al., 2009). To determine whether ML models of audiovisual speech recognition follow similar Bayesian principles, we applied the technique used to estimate modality weights in humans: assessing recognition of stimuli containing a mismatch between modalities (Author & Author, 2002). For instance, the incongruent pairing of auditory ba with visual fa (AbaVfa) is usually perceived as fa by human observers. The Bayesian explanation for this finding is that the visual speech feature corresponding to fa is a highly reliable indicator of auditory fa, resulting in an upweighting of the visual modality (Author et al., 2024). We presented two types of incongruent stimuli (AbaVfa and AbaVga) to the ML model AV-HuBERT (audiovisual hidden unit bidirectional encoder representations from transformers). Surprisingly, the model's responses matched human perception: it reported fa for AbaVfa and da for AbaVga. We then tested the Bayesian integration hypothesis by attenuating one modality. Adding pink noise to the auditory input increased da responses, again matching data from human observers. To further investigate whether AV-HuBERT could predict human electrophysiological data, we compared the magnitude of the decoder's attention over the time course of the stimuli with intracranial electroencephalography (iEEG) recordings from participants performing the same task. The average attention scores showed a pattern similar to the iEEG recordings, with both concentrated at the auditory and visual onsets. These similarities between humans and ML models suggest that ML models may be a useful tool for interrogating the perceptual and neural mechanisms of human audiovisual speech perception (Author et al., 2014).
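
To make the reliability-weighting idea concrete, the sketch below shows one common formalization of Bayes-optimal cue combination: per-modality likelihoods over candidate syllables are fused after weighting each modality by its reliability. The syllable set, likelihood values, and weights are illustrative assumptions, not the stimuli or parameters used in the study.

```python
import numpy as np

# A minimal sketch of reliability-weighted (Bayes-optimal) cue combination,
# assuming a small candidate syllable set and made-up likelihoods/weights;
# none of these values come from the study.

syllables = ["ba", "da", "fa", "ga"]

# Per-modality likelihoods P(evidence | syllable) for a hypothetical AbaVfa trial.
p_aud = np.array([0.70, 0.15, 0.10, 0.05])  # auditory evidence favors "ba"
p_vis = np.array([0.05, 0.10, 0.80, 0.05])  # visual evidence favors "fa"

# Relative reliability (weight) of each modality; higher = more trusted.
w_aud, w_vis = 0.4, 0.6

# Weighted fusion in log space (a weighted geometric mean of the likelihoods),
# then normalization to obtain a posterior over syllables.
log_post = w_aud * np.log(p_aud) + w_vis * np.log(p_vis)
posterior = np.exp(log_post - log_post.max())
posterior /= posterior.sum()

for syllable, p in zip(syllables, posterior):
    print(f"{syllable}: {p:.2f}")

# Lowering w_aud (e.g., to mimic noise degradation of the audio) shifts the
# posterior further toward the visually dominant percept.
```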
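
The auditory attenuation manipulation can likewise be sketched as mixing the speech waveform with pink (1/f) noise at a fixed signal-to-noise ratio; under the Bayesian account this lowers the auditory weight and shifts responses toward the visually dominant percept. The function names, SNR value, and placeholder waveform below are assumptions for illustration, not the study's preprocessing code.

```python
import numpy as np

def pink_noise(n_samples, seed=None):
    """Approximate pink noise by shaping white noise with a 1/f power spectrum."""
    rng = np.random.default_rng(seed)
    spectrum = np.fft.rfft(rng.standard_normal(n_samples))
    freqs = np.fft.rfftfreq(n_samples)
    freqs[0] = freqs[1]                 # avoid division by zero at DC
    spectrum /= np.sqrt(freqs)          # 1/f power (1/sqrt(f) amplitude) shaping
    pink = np.fft.irfft(spectrum, n=n_samples)
    return pink / np.std(pink)

def add_pink_noise(speech, snr_db, seed=None):
    """Mix speech with pink noise at the requested signal-to-noise ratio (dB)."""
    noise = pink_noise(len(speech), seed)
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example: degrade a placeholder 1-second, 16 kHz waveform at -6 dB SNR.
waveform = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
noisy = add_pink_noise(waveform, snr_db=-6, seed=0)
```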

Unique ID: cosyne-25/bayesian-integration-audiovisual-2d9de028