Authors & Affiliations
Jiaqi Shang, Shailee Jain, Haim Sompolinsky, Edward Chang
Abstract
Humans possess the remarkable ability to extract discrete words and construct meaningful sentences from continuous streams of acoustic signals. While brain regions such as the superior temporal gyrus (STG) are known to be involved in speech recognition, how populations of neurons coordinate to enable this process remains unclear. Like the brain, state-of-the-art models such as Whisper [1] excel at speech processing by transforming acoustic inputs into high-dimensional representations, making them valuable tools for studying the neural mechanisms underlying speech perception. In this study, we first investigated how the Whisper model recognizes words by examining the separability of word representations across layers. We observed a non-monotonic pattern of word separation shaped by three geometric factors: signal, dimensionality, and overlap. Among these, we identified two mechanisms driving word separability: dimensionality expansion in early layers and signal amplification in later layers. Next, we explored which features drive word separation, revealing a progression from low-level acoustic features, such as word duration, in early layers to phonemic and semantic features in later layers. Finally, we hypothesized that similar geometric mechanisms may underpin word recognition in the brain. To test this, we analyzed spiking data from over 500 single neurons in the human STG, recorded with Neuropixels probes during a speech perception task. Preliminary results show that word separation increases along the posterior-anterior axis of the STG, with dimensionality expanding monotonically while signal peaks in middle regions. These findings suggest that the geometric signatures driving word separability in Whisper may reflect analogous mechanisms in the brain, offering testable hypotheses about how distributed neural activity may support hierarchical word recognition.
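The abstract characterizes word separability through three geometric quantities: signal, dimensionality, and overlap. As an illustration only, the sketch below computes common proxies for these quantities from per-layer word representations (e.g., pooled Whisper hidden states); the array shapes, the specific metric definitions, and the geometry_metrics helper are assumptions made for this example, not the study's actual analysis.

```python
# Minimal sketch. Assumptions: representations for one layer are already pooled into an
# array of shape (n_words, n_exemplars, n_units); the metrics below are standard proxies
# (centroid spread, participation ratio, axis alignment), not the study's exact definitions.
import numpy as np

def geometry_metrics(reps):
    """Return (signal, dimensionality, overlap) proxies for a set of word manifolds.

    reps : array, shape (n_words, n_exemplars, n_units)
        Hidden-state vectors for several spoken exemplars of each word.
    """
    n_words = reps.shape[0]
    centroids = reps.mean(axis=1)                                  # (n_words, n_units)

    # Signal: mean pairwise distance between word centroids.
    diffs = centroids[:, None, :] - centroids[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(-1))
    signal = dists[np.triu_indices(n_words, k=1)].mean()

    # Dimensionality: participation ratio of within-word variability.
    centered = reps - centroids[:, None, :]
    cov = np.einsum("wen,wem->nm", centered, centered) / (reps.shape[0] * reps.shape[1])
    eig = np.linalg.eigvalsh(cov)
    dimensionality = eig.sum() ** 2 / (eig ** 2).sum()

    # Overlap: alignment between each word's leading variability axis and the others'.
    axes = []
    for w in range(n_words):
        _, _, vt = np.linalg.svd(centered[w], full_matrices=False)
        axes.append(vt[0])
    axes = np.stack(axes)
    cos = np.abs(axes @ axes.T)
    overlap = cos[np.triu_indices(n_words, k=1)].mean()

    return signal, dimensionality, overlap

# Hypothetical usage with random data standing in for one Whisper layer:
rng = np.random.default_rng(0)
fake_layer = rng.normal(size=(20, 8, 512))   # 20 words, 8 exemplars, 512 hidden units
print(geometry_metrics(fake_layer))
```

Computed per layer (or per recording site along the STG), these proxies would let one trace how signal, dimensionality, and overlap trade off against word separability in the way the abstract describes.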