Speech
Prof David Brang
We are seeking a full-time post-doctoral research fellow to study computational and neuroscientific models of perception and cognition. The research fellow will be jointly supervised by Dr. David Brang (https://sites.lsa.umich.edu/brang-lab/) and Dr. Zhongming Liu (https://libi.engin.umich.edu). The goal of this collaboration is to build computational models of cognitive and perceptual processes using combined electrocorticography (ECoG) and fMRI data. The successful applicant will also have freedom to conduct additional research based on their interests, using a variety of methods: ECoG, fMRI, DTI, lesion mapping, and EEG. The ideal start date is between spring and fall 2021, and the position is expected to last for at least two years, with the possibility of extension for subsequent years. We are also recruiting a Post-Doc for research on multisensory interactions (particularly how vision modulates speech perception) using cognitive neuroscience techniques, or to help with our large-scale brain tumor collaboration with Shawn Hervey-Jumper at UCSF (https://herveyjumperlab.ucsf.edu). In the latter collaboration we collect iEEG (from ~50 patients/year) and lesion mapping data (from ~150 patients/year) in patients with brain tumors to study sensory and cognitive functions. The goals of this project are to better understand the physiology of tumors, study causal mechanisms of brain functions, and generalize iEEG/ECoG findings from epilepsy patients to a second patient population.
Tejas Savalia
The Department of Psychological and Brain Sciences at the University of Massachusetts, Amherst is inviting applications for a tenure-track, academic-year faculty position at the Assistant Professor level in its Cognition and Cognitive Neuroscience Psychology program, starting in Fall 2024. We are seeking outstanding applicants with expertise in any area of cognitive psychology or cognitive neuroscience, including interdisciplinary fields connected to cognitive psychology, whose work complements and broadens existing strengths in our program. The program has current strengths in attention, decision-making, psycholinguistics, and mathematical modeling, with connections to our Behavioral Neuroscience, Clinical Psychology, Developmental Science, and Social Psychology programs. Across the university, our faculty have strong connections to Linguistics, Information and Computer Sciences, and Speech, Language, and Hearing Sciences, as well as the Initiative in Cognitive Science, the Computational and Social Science Institute, the Institute for Diversity Sciences, and the Institute for Applied Life Sciences.
Jörn Anemüller
We are looking to fill a fully funded 3-year Ph.D. student position in the field of deep learning-based signal processing algorithms for speech enhancement and computational audition. The position is funded by the German Research Foundation (DFG) within the Collaborative Research Centre SFB 1330 “Hearing Acoustics” at the Department of Medical Physics and Acoustics, University of Oldenburg. Within project B3 of the research centre, the Computational Audition Group develops machine learning algorithms for signal processing of speech and audio data.
Steve Schneider
The School of Computer Science and Electronic Engineering is seeking to recruit a full-time Lecturer in Natural Language Processing to grow our AI research. The School is home to two established research centres with expertise in AI and Machine Learning: the Computer Science Research Centre and the Centre for Vision, Speech and Signal Processing (CVSSP). This post is aligned with the Nature Inspired Computing and Engineering group within Computer Science. The role welcomes applicants from across natural language processing, including language modelling, language generation (machine translation/summarisation), explainability and reasoning in NLP, and/or aligned multimodal challenges for NLP (vision-language, audio-language, and so on), and we are particularly interested in candidates who enhance our current strengths and bring complementary areas of AI expertise. Surrey has an established international reputation in AI research, ranked 1st in the UK for computer vision and top 10 for AI, computer vision, machine learning and natural language processing (CSRankings.org), and 7th in the UK for REF2021 outputs in Computer Science research. Computer Science and CVSSP are at the core of the Surrey Institute for People-Centred AI (PAI), established in 2021 as a pan-University initiative which brings together leading AI research with cross-discipline expertise across health, social, behavioural, and engineering sciences, and business, law, and the creative arts to shape future AI to benefit people and society. PAI leads a portfolio of £100m in grant awards, including major research activities in creative industries and healthcare, and two doctoral training programmes with funding for over 100 PhD researchers: the UKRI AI Centre for Doctoral Training in AI for Digital Media Inclusion, and the Leverhulme Trust Doctoral Training Network in AI-Enabled Digital Accessibility.
Simulating Thought Disorder: Fine-Tuning Llama-2 for Synthetic Speech in Schizophrenia
Relating circuit dynamics to computation: robustness and dimension-specific computation in cortical dynamics
Neural dynamics represent the hard-to-interpret substrate of circuit computations. Advances in large-scale recordings have highlighted the sheer spatiotemporal complexity of circuit dynamics within and across circuits, portraying in detail the difficulty of interpreting such dynamics and relating them to computation. Indeed, even in extremely simplified experimental conditions, one observes high-dimensional temporal dynamics in the relevant circuits. This complexity can potentially be addressed by the notion that not all changes in population activity have equal meaning, i.e., a small change in the evolution of activity along a particular dimension may have a bigger effect on a given computation than a large change in another. We term such conditions dimension-specific computation. Considering motor preparatory activity in a delayed response task, we utilized neural recordings performed simultaneously with optogenetic perturbations to probe circuit dynamics. First, we revealed a remarkable robustness in the detailed evolution of certain dimensions of the population activity, beyond what was thought to be the case experimentally and theoretically. Second, the robust dimension in activity space carries nearly all of the decodable behavioral information, whereas other, non-robust dimensions contain nearly no decodable information, as if the circuit were set up to make informative dimensions stiff, i.e., resistive to perturbations, leaving uninformative dimensions sloppy, i.e., sensitive to perturbations. Third, we show that this robustness can be achieved by a modular organization of circuitry, whereby modules whose dynamics normally evolve independently can correct each other’s dynamics when an individual module is perturbed, a common design feature in robust systems engineering. Finally, I will present recent work extending this framework to understanding the neural dynamics underlying the preparation of speech.
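As a toy illustration of dimension-specific computation (simulated data only, not the recordings described above; the array sizes and the simple threshold decoder are my assumptions), the sketch below places the decodable signal along a single population dimension and compares decoding accuracy along that dimension versus a random one.

```python
# Minimal sketch: decoding a binary behavioral variable from one "coding
# dimension" of simulated population activity, illustrating that a single
# dimension can carry most of the decodable information.
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_neurons = 200, 50
labels = rng.integers(0, 2, n_trials)               # hypothetical left/right choices

coding_dim = rng.normal(size=n_neurons)             # direction carrying the signal
coding_dim /= np.linalg.norm(coding_dim)
activity = rng.normal(size=(n_trials, n_neurons))   # "sloppy" background variability
activity += np.outer(labels - 0.5, coding_dim) * 3.0

def decode_along(direction):
    """Project trials onto a direction and decode choice with a simple threshold."""
    proj = activity @ direction
    return np.mean((proj > proj.mean()) == labels)

random_dim = rng.normal(size=n_neurons)
random_dim /= np.linalg.norm(random_dim)

print("accuracy along coding dimension:", decode_along(coding_dim))
print("accuracy along a random dimension:", decode_along(random_dim))
```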
The representation of speech conversations in the human auditory cortex
LLMs and Human Language Processing
This webinar convened researchers at the intersection of Artificial Intelligence and Neuroscience to investigate how large language models (LLMs) can serve as valuable “model organisms” for understanding human language processing. Presenters showcased evidence that brain recordings (fMRI, MEG, ECoG) acquired while participants read or listened to unconstrained speech can be predicted by representations extracted from state-of-the-art text- and speech-based LLMs. In particular, text-based LLMs tend to align better with higher-level language regions, capturing more semantic aspects, while speech-based LLMs excel at explaining early auditory cortical responses. However, purely low-level features can drive part of these alignments, complicating interpretations. New methods, including perturbation analyses, highlight which linguistic variables matter for each cortical area and time scale. Further, “brain tuning” of LLMs—fine-tuning on measured neural signals—can improve semantic representations and downstream language tasks. Despite open questions about interpretability and exact neural mechanisms, these results demonstrate that LLMs provide a promising framework for probing the computations underlying human language comprehension and production at multiple spatiotemporal scales.
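A minimal sketch of the encoding-model logic behind these alignment results, using random stand-in data (the array sizes, ridge penalty, and use of scikit-learn are my choices, not the presenters'): LLM-derived features are regressed onto voxel responses and evaluated on held-out samples.

```python
# Ridge "encoding model": predict brain responses from LLM-derived features,
# then score voxel-wise prediction accuracy on held-out data.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_features, n_voxels = 500, 768, 100     # hypothetical sizes
X = rng.normal(size=(n_samples, n_features))        # stand-in for LLM hidden states
Y = X @ rng.normal(size=(n_features, n_voxels)) * 0.1 + rng.normal(size=(n_samples, n_voxels))

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0)
model = Ridge(alpha=100.0).fit(X_tr, Y_tr)
pred = model.predict(X_te)

# Voxel-wise accuracy: correlation between predicted and held-out responses.
r = [np.corrcoef(pred[:, v], Y_te[:, v])[0, 1] for v in range(n_voxels)]
print("mean voxel-wise correlation:", np.mean(r))
```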
Sophie Scott - The Science of Laughter from Evolution to Neuroscience
Keynote Address to British Association of Cognitive Neuroscience, London, 10th September 2024
Llama 3.1 Paper: The Llama Family of Models
Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models such as GPT-4 on a plethora of tasks. We publicly release Llama 3, including pre-trained and post-trained versions of the 405B parameter language model and our Llama Guard 3 model for input and output safety. The paper also presents the results of experiments in which we integrate image, video, and speech capabilities into Llama 3 via a compositional approach. We observe this approach performs competitively with the state-of-the-art on image, video, and speech recognition tasks. The resulting models are not yet being broadly released as they are still under development.
Exploring the cerebral mechanisms of acoustically-challenging speech comprehension - successes, failures and hope
Comprehending speech under acoustically challenging conditions is an everyday task that we can often execute with ease. However, accomplishing this requires the engagement of cognitive resources, such as auditory attention and working memory. The mechanisms that contribute to the robustness of speech comprehension are of substantial interest in the context of mild to moderate hearing impairment, in which affected individuals typically report specific difficulties in understanding speech in background noise. Although hearing aids can help to mitigate this, they do not represent a universal solution; thus, finding alternative interventions is necessary. Given that age-related hearing loss (“presbycusis”) is inevitable, developing new approaches is all the more important in the context of aging populations. Moreover, untreated hearing loss in middle age has been identified as the most significant potentially modifiable predictor of dementia in later life. I will present research that has used a multi-methodological approach (fMRI, EEG, MEG and non-invasive brain stimulation) to try to elucidate the mechanisms that comprise the cognitive “last mile” of acoustically challenging speech comprehension and to find ways to enhance them.
Dyslexia, Rhythm, Language and the Developing Brain
Recent insights from auditory neuroscience provide a new perspective on how the brain encodes speech. Using these recent insights, I will provide an overview of key factors underpinning individual differences in children’s development of language and phonology, providing a context for exploring atypical reading development (dyslexia). Children with dyslexia are relatively insensitive to acoustic cues related to speech rhythm patterns. This lack of rhythmic sensitivity is related to the atypical neural encoding of rhythm patterns in speech by the brain. I will describe our recent data from infants as well as children, demonstrating developmental continuity in the key neural variables.
Prosody in the voice, face, and hands changes which words you hear
Speech may be characterized as conveying both segmental information (i.e., about vowels and consonants) as well as suprasegmental information - cued through pitch, intensity, and duration - also known as the prosody of speech. In this contribution, I will argue that prosody shapes low-level speech perception, changing which speech sounds we hear. Perhaps the most notable example of how prosody guides word recognition is the phenomenon of lexical stress, whereby suprasegmental F0, intensity, and duration cues can distinguish otherwise segmentally identical words, such as "PLAto" vs. "plaTEAU" in Dutch. Work from our group showcases the vast variability in how different talkers produce stressed vs. unstressed syllables, while also unveiling the remarkable flexibility with which listeners can learn to handle this between-talker variability. It also emphasizes that lexical stress is a multimodal linguistic phenomenon, with the voice, lips, and even hands conveying stress in concert. In turn, human listeners actively weigh these multisensory cues to stress depending on the listening conditions at hand. Finally, lexical stress is presented as having a robust and lasting impact on low-level speech perception, even down to changing vowel perception. Thus, prosody - in all its multisensory forms - is a potent factor in speech perception, determining what speech sounds we hear.
Silences, Spikes and Bursts: Three-Part Knot of the Neural Code
When a neuron breaks silence, it can emit action potentials in a number of patterns. Some responses are so sudden and intense that electrophysiologists felt the need to single them out, labeling action potentials emitted at a particularly high frequency with a metonym – bursts. Is there more to bursts than a figure of speech? After all, sudden bouts of high-frequency firing are expected to occur whenever inputs surge. In this talk, I will discuss the implications of seeing the neural code as having three syllables: silences, spikes and bursts. In particular, I will describe recent theoretical and experimental results that implicate bursting in the implementation of top-down attention and the coordination of learning.
The speaker identification ability of blind and sighted listeners
Previous studies have shown that blind individuals outperform sighted controls in a variety of auditory tasks; however, only a few studies have investigated blind listeners’ speaker identification abilities. In addition, existing studies in the area show conflicting results. The presented empirical investigation with 153 blind (74 of them congenitally blind) and 153 sighted listeners is the first of its kind and scale in which long-term memory effects on blind listeners’ speaker identification abilities are examined. For the empirical investigation, all listeners were evenly assigned to one of nine subgroups (3 x 3 design) in order to investigate the influence of two parameters, each with three levels, on blind and sighted listeners’ speaker identification performance. The parameters were (a) time interval, i.e. an interval of 1, 3 or 6 weeks between the first exposure to the voice to be recognised (familiarisation) and the speaker identification task (voice lineup); and (b) signal quality, i.e. voice recordings were presented in studio quality, in mobile-phone quality, or as recordings of whispered speech. Half of the presented voice lineups were target-present lineups in which the previously heard target voice was included. The other half consisted of target-absent lineups which contained solely distractor voices. Blind individuals outperformed sighted listeners only under studio-quality conditions. Furthermore, for blind and sighted listeners no significant performance differences were found with regard to the three investigated time intervals of 1, 3 and 6 weeks. Blind as well as sighted listeners were significantly better at picking the target voice from target-present lineups than at indicating that the target voice was absent in target-absent lineups. Within the blind group, no significant correlations were found between identification performance and onset or duration of blindness. Implications for the field of forensic phonetics are discussed.
Motor contribution to auditory temporal predictions
Temporal predictions are fundamental instruments for facilitating sensory selection, allowing humans to exploit regularities in the world. Recent evidence indicates that the motor system instantiates predictive timing mechanisms, helping to synchronize temporal fluctuations of attention with the timing of events in a task-relevant stream, thus facilitating sensory selection. Accordingly, in the auditory domain, auditory-motor interactions are observed during perception of speech and music, two temporally structured sensory streams. I will present a behavioral and neurophysiological account of this theory and will detail the parameters governing the emergence of this auditory-motor coupling, through a set of behavioral and magnetoencephalography (MEG) experiments.
Pitch and Time Interact in Auditory Perception
Research into pitch perception and time perception has typically treated the two as independent processes. However, previous studies of music and speech perception have suggested that pitch and timing information may be processed in an integrated manner, such that the pitch of an auditory stimulus can influence a person’s perception, expectation, and memory of its duration and tempo. Typically, higher-pitched sounds are perceived as faster and longer in duration than lower-pitched sounds with identical timing. We conducted a series of experiments to better understand the limits of this pitch-time integrality. Across several experiments, we tested whether the higher-equals-faster illusion generalizes across the broader frequency range of human hearing by asking participants to compare the tempo of a repeating tone played in one of six octaves to a metronomic standard. When participants heard tones from all six octaves, we consistently found an inverted U-shaped effect of the tone’s pitch height, such that perceived tempo peaked between A4 (440 Hz) and A5 (880 Hz) and decreased at lower and higher octaves. However, we found that the decrease in perceived tempo at extremely high octaves could be abolished by exposing participants to high-pitched tones only, suggesting that pitch-induced timing biases are context sensitive. We additionally tested how the timing of an auditory stimulus influences the perception of its pitch, using a pitch discrimination task in which probe tones occurred early, late, or on the beat within a rhythmic context. Probe timing strongly biased participants to rate later tones as lower in pitch than earlier tones. Together, these results suggest that pitch and time exert a bidirectional influence on one another, providing evidence for integrated processing of pitch and timing information in auditory perception. Identifying the mechanisms behind this pitch-time interaction will be critical for integrating current models of pitch and tempo processing.
Hierarchical transformation of visual event timing representations in the human brain: response dynamics in early visual cortex and timing-tuned responses in association cortices
Quantifying the timing (duration and frequency) of brief visual events is vital to human perception, multisensory integration and action planning. For example, this allows us to follow and interact with the precise timing of speech and sports. Here we investigate how visual event timing is represented and transformed across the brain’s hierarchy: from sensory processing areas, through multisensory integration areas, to frontal action planning areas. We hypothesized that the dynamics of neural responses to sensory events in sensory processing areas allows derivation of event timing representations. This would allow higher-level processes such as multisensory integration and action planning to use sensory timing information, without the need for specialized central pacemakers or processes. Using 7T fMRI and neural model-based analyses, we found responses that monotonically increase in amplitude with visual event duration and frequency, becoming increasingly clear from primary visual cortex to lateral occipital visual field maps. Beginning in area MT/V5, we found a gradual transition from monotonic to tuned responses, with response amplitudes peaking at different event timings in different recording sites. While monotonic response components were limited to the retinotopic location of the visual stimulus, timing-tuned response components were independent of the recording sites' preferred visual field positions. These tuned responses formed a network of topographically organized timing maps in superior parietal, postcentral and frontal areas. From anterior to posterior timing maps, multiple events were increasingly integrated, response selectivity narrowed, and responses focused increasingly on the middle of the presented timing range. These results suggest that responses to event timing are transformed from the human brain’s sensory areas to the association cortices, with the event’s temporal properties being increasingly abstracted from the response dynamics and locations of early sensory processing. The resulting abstracted representation of event timing is then propagated through areas implicated in multisensory integration and action planning.
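A toy contrast of the two response forms compared in this analysis, with made-up parameter values (the gain, preferred duration, and tuning width below are illustrative assumptions, not fitted values from the study):

```python
# Monotonic vs. timing-tuned response models of event duration.
import numpy as np

durations = np.linspace(0.05, 1.0, 20)               # s, hypothetical event durations

def monotonic_response(d, gain=1.0):
    """Response amplitude increases with event duration."""
    return gain * d

def tuned_response(d, preferred=0.4, width=0.15):
    """Response peaks at a preferred event duration."""
    return np.exp(-0.5 * ((d - preferred) / width) ** 2)

print(np.round(monotonic_response(durations), 2))
print(np.round(tuned_response(durations), 2))
```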
A Framework for a Conscious AI: Viewing Consciousness through a Theoretical Computer Science Lens
We examine consciousness from the perspective of theoretical computer science (TCS), a branch of mathematics concerned with understanding the underlying principles of computation and complexity, including the implications and surprising consequences of resource limitations. We propose a formal TCS model, the Conscious Turing Machine (CTM). The CTM is influenced by Alan Turing's simple yet powerful model of computation, the Turing machine (TM), and by the global workspace theory (GWT) of consciousness originated by cognitive neuroscientist Bernard Baars and further developed by him, Stanislas Dehaene, Jean-Pierre Changeux, George Mashour, and others. However, the CTM is not a standard Turing Machine. It’s not the input-output map that gives the CTM its feeling of consciousness, but what’s under the hood. Nor is the CTM a standard GW model. In addition to its architecture, what gives the CTM its feeling of consciousness is its predictive dynamics (cycles of prediction, feedback and learning), its internal multi-modal language Brainish, and certain special Long Term Memory (LTM) processors, including its Inner Speech and Model of the World processors. Phenomena generally associated with consciousness, such as blindsight, inattentional blindness, change blindness, dream creation, and free will, are considered. Explanations derived from the model draw confirmation from consistencies at a high level, well above the level of neurons, with the cognitive neuroscience literature. Reference. L. Blum and M. Blum, "A theory of consciousness from a theoretical computer science perspective: Insights from the Conscious Turing Machine," PNAS, vol. 119, no. 21, 24 May 2022. https://www.pnas.org/doi/epdf/10.1073/pnas.2115934119
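A very loose toy of a global-workspace-style cycle, heavily simplified relative to the formal CTM (the Processor class, its weights, and the reward signal below are my own stand-ins, not definitions from the paper): processors submit weighted chunks, the winner is broadcast to all processors, and the winner updates from feedback.

```python
# Toy competition-and-broadcast loop in the spirit of global workspace models.
import random

class Processor:
    def __init__(self, name):
        self.name, self.weight = name, 1.0
    def propose(self):
        return (self.weight * random.random(), f"chunk from {self.name}")
    def receive(self, broadcast):
        pass                                         # update internal state from broadcast
    def learn(self, reward):
        self.weight = max(0.1, self.weight + 0.1 * reward)

processors = [Processor(n) for n in ("Inner Speech", "Model of the World", "Vision")]
for cycle in range(3):
    bids = [(p.propose(), p) for p in processors]
    (score, chunk), winner = max(bids, key=lambda b: b[0][0])
    for p in processors:
        p.receive(chunk)                             # "conscious" broadcast to all processors
    winner.learn(reward=random.choice([-1, 1]))      # crude stand-in for prediction feedback
    print(f"cycle {cycle}: broadcast {chunk!r} (score {score:.2f})")
```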
Language Representations in the Human Brain: A naturalistic approach
Natural language is strongly context-dependent and can be perceived through different sensory modalities. For example, humans can easily comprehend the meaning of complex narratives presented through auditory speech, written text, or visual images. To understand how complex language-related information is represented in the human brain, we need to map the different linguistic and non-linguistic information perceived under different modalities across the cerebral cortex. To map this information to the brain, I suggest following a naturalistic approach: observing the human brain performing tasks in its naturalistic setting, designing quantitative models that transform real-world stimuli into specific hypothesis-related features, and building predictive models that can relate these features to brain responses. In my talk, I will present models of brain responses collected using functional magnetic resonance imaging while human participants listened to or read natural narrative stories. Using natural text and vector representations derived from natural language processing tools, I will present how we can study language processing in the human brain across modalities, at different levels of temporal granularity, and across different languages.
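A hedged sketch of the kind of cross-modality test implied here, on simulated data (the shared feature matrix, array sizes, and RidgeCV settings are assumptions, not the speaker's pipeline): fit an encoding model on responses to listening and ask how well it predicts held-out responses to listening versus reading of the same narrative features.

```python
# Cross-modality generalization of an encoding model on simulated data.
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(1)
n_time, n_feat, n_vox = 600, 300, 50                 # hypothetical sizes
features = rng.normal(size=(n_time, n_feat))         # NLP-derived features of the narrative
W = rng.normal(size=(n_feat, n_vox))                 # shared weights stand in for amodal semantics
listening = features @ W + rng.normal(size=(n_time, n_vox))
reading = features @ W + rng.normal(size=(n_time, n_vox))

half = n_time // 2                                   # first half for fitting, second for testing
enc = RidgeCV(alphas=[1.0, 10.0, 100.0]).fit(features[:half], listening[:half])
pred = enc.predict(features[half:])
within = np.mean([np.corrcoef(pred[:, v], listening[half:, v])[0, 1] for v in range(n_vox)])
across = np.mean([np.corrcoef(pred[:, v], reading[half:, v])[0, 1] for v in range(n_vox)])
print(f"within-modality r = {within:.2f}, across-modality r = {across:.2f}")
```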
Artificial Intelligence and Racism – What are the implications for scientific research?
As questions of race and justice have risen to the fore across the sciences, the ALBA Network has invited Dr Shakir Mohamed (Senior Research Scientist at DeepMind, UK) to deliver a keynote speech on Artificial Intelligence and racism, and the implications for scientific research. The keynote will be followed by a discussion chaired by Dr Konrad Kording (Department of Neuroscience, University of Pennsylvania, US; neuromatch co-founder).
Electrophysiological investigations of natural speech and language processing
Representation of speech temporal structure in human cortex
Towards an inclusive neurobiology of language
Understanding how our brains process language is one of the fundamental issues in cognitive science. In order to reach such understanding, it is critical to cover the full spectrum of manners in which humans acquire and experience language. However, due to a myriad of socioeconomic factors, research has disproportionately focused on monolingual English speakers. In this talk, I present a series of studies that systematically target fundamental questions about bilingual language use across a range of conversational contexts, both in production and comprehension. The results lay the groundwork to propose a more inclusive theory of the neurobiology of language, with an architecture that assumes a common selection principle at each linguistic level and can account for attested features of both bilingual and monolingual speech in, but crucially also out of, experimental settings.
Hearing in an acoustically varied world
In order for animals to thrive in their complex environments, their sensory systems must form representations of objects that are invariant to changes in some dimensions of their physical cues. For example, we can recognize a friend’s speech in a forest, a small office, and a cathedral, even though the sound reaching our ears will be very different in these three environments. I will discuss our recent experiments into how neurons in auditory cortex can form stable representations of sounds in this acoustically varied world. We began by using a normative computational model of hearing to examine how the brain may recognize a sound source across rooms with different levels of reverberation. The model predicted that reverberations can be removed from the original sound by delaying the inhibitory component of spectrotemporal receptive fields in the presence of stronger reverberation. Our electrophysiological recordings then confirmed that neurons in ferret auditory cortex apply this algorithm to adapt to different room sizes. Our results demonstrate that this neural process is dynamic and adaptive. These studies provide new insights into how we can recognize auditory objects even in highly reverberant environments, and direct further research questions about how reverb adaptation is implemented in the cortical circuit.
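A toy numerical illustration of the delayed-inhibition idea, not the authors' normative model (the envelope, room response, and kernel shapes below are invented for illustration): an excitatory lobe followed by a delayed inhibitory lobe partially cancels the slowly decaying energy that reverberation adds, and how well a given delay restores the dry envelope depends on the room's decay.

```python
# Excitatory-inhibitory kernel applied to a reverberant envelope.
import numpy as np

fs = 100                                             # Hz, envelope sampling rate (hypothetical)
t = np.arange(0, 1.0, 1 / fs)
dry = (np.sin(2 * np.pi * 4 * t) > 0.9).astype(float)   # sparse "sound onsets"

def reverberate(x, decay):
    """Convolve with an exponentially decaying room response (decay in seconds)."""
    h = np.exp(-np.arange(0, 0.5, 1 / fs) / decay)
    return np.convolve(x, h)[: len(x)]

def ei_kernel(inhib_delay, length=20):
    """Excitation at lag 0, inhibition delayed by inhib_delay samples."""
    k = np.zeros(length)
    k[0] = 1.0
    k[inhib_delay] = -1.0
    return k

wet = reverberate(dry, decay=0.2)
for delay in (1, 3, 6, 12):
    out = np.convolve(wet, ei_kernel(delay))[: len(wet)]
    print(f"inhibitory delay {delay / fs * 1000:.0f} ms: "
          f"correlation with dry envelope = {np.corrcoef(out, dry)[0, 1]:.2f}")
```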
Conflict in Multisensory Perception
Multisensory perception is often studied through the effects of inter-sensory conflict, such as in the McGurk effect, the Ventriloquist illusion, and the Rubber Hand Illusion. Moreover, Bayesian approaches to cue fusion and causal inference overwhelmingly draw on cross-modal conflict to measure and to model multisensory perception. Given the prevalence of conflict, it is remarkable that accounts of multisensory perception have so far neglected the theory of conflict monitoring and cognitive control, established about twenty years ago. I hope to make a case for the role of conflict monitoring and resolution during multisensory perception. To this end, I will present EEG and fMRI data showing that cross-modal conflict in speech, resulting in either integration or segregation, triggers neural mechanisms of conflict detection and resolution. I will also present data supporting a role of these mechanisms during perceptual conflict in general, using Binocular Rivalry, surrealistic imagery, and cinema. Based on this preliminary evidence, I will argue that it is worth considering the potential role of conflict in multisensory perception and its incorporation in a causal inference framework. Finally, I will raise some potential problems associated with this proposal.
Development of multisensory perception and attention and their role in audiovisual speech processing
Speak your mind: cortical predictions of speech sensory feedback
Encoding and perceiving the texture of sounds: auditory midbrain codes for recognizing and categorizing auditory texture and for listening in noise
Natural soundscapes, such as those of a forest, a busy restaurant, or a busy intersection, are generally composed of a cacophony of sounds that the brain needs to interpret either independently or collectively. In certain instances, sounds - such as those from moving cars, sirens, and people talking - are perceived in unison and recognized collectively as a single sound (e.g., city noise). In other instances, such as in the cocktail party problem, multiple sounds compete for attention so that the surrounding background noise (e.g., speech babble) interferes with the perception of a single sound source (e.g., a single talker). I will describe results from my lab on the perception and neural representation of auditory textures. Textures, such as a babbling brook, restaurant noise, or speech babble, are stationary sounds consisting of multiple independent sound sources that can be quantitatively defined by summary statistics of an auditory model (McDermott & Simoncelli 2011). How and where summary statistics are represented in the auditory system, and which neural codes potentially contribute to their perception, however, remain largely unknown. Using high-density multi-channel recordings from the auditory midbrain of unanesthetized rabbits and complementary perceptual studies on human listeners, I will first describe neural and perceptual strategies for encoding and perceiving auditory textures. I will demonstrate how distinct statistics of sounds, including the sound spectrum and high-order statistics related to the temporal and spectral correlation structure of sounds, contribute to texture perception and are reflected in neural activity. Using decoding methods, I will then demonstrate how various low- and high-order neural response statistics can differentially contribute to a variety of auditory tasks, including texture recognition, discrimination, and categorization. Finally, I will show examples from our recent studies of how high-order sound statistics and the accompanying neural activity underlie difficulties in recognizing speech in background noise.
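In the spirit of the summary-statistics account (McDermott & Simoncelli 2011), though not their actual auditory model, here is a minimal sketch of computing a few texture statistics from channel envelopes (using a plain spectrogram and white-noise input as simplifying assumptions):

```python
# Simple texture statistics: channel means, envelope variances, and the
# cross-channel correlation structure of a spectrogram-like representation.
import numpy as np
from scipy.signal import spectrogram

rng = np.random.default_rng(0)
fs = 16000
x = rng.normal(size=fs * 2)                       # stand-in for 2 s of texture audio

f, t, S = spectrogram(x, fs=fs, nperseg=256)      # crude "cochleagram": frequency x time
env = np.sqrt(S)                                  # channel envelopes

stats = {
    "channel_mean": env.mean(axis=1),             # spectrum-like marginal means
    "channel_var": env.var(axis=1),               # envelope variance per channel
    "cross_corr": np.corrcoef(env),               # spectral correlation structure
}
print({k: v.shape for k, v in stats.items()})
```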
Multisensory speech perception
Exploring the neurogenetic basis of speech, language, and vocal communication
Speech as a biomarker in ataxia: What can it tell us and how should we use it?
The Jena Voice Learning and Memory Test (JVLMT)
The ability to recognize someone’s voice spans a broad spectrum, with phonagnosia at the low end and super recognition at the high end. Yet there is no standardized test to measure the individual ability to learn and recognize newly learnt voices with samples of speech-like phonetic variability. We have developed the Jena Voice Learning and Memory Test (JVLMT), a 20-minute test based on item response theory and applicable across different languages. The JVLMT consists of three phases in which participants are familiarized with eight speakers in two stages and then perform a three-alternative forced-choice recognition task, using pseudo-sentences devoid of semantic content. Acoustic (dis)similarity analyses were used to create items with different levels of difficulty. Test scores are based on 22 Rasch-conform items. Items were selected and validated in online studies based on 232 and 454 participants, respectively. Mean accuracy is 0.51 with an SD of 0.18. The JVLMT showed high and moderate correlations with convergent validation tests (Bangor Voice Matching Test; Glasgow Voice Memory Test) and a weak correlation with a discriminant validation test (Digit Span). Empirical (marginal) reliability is 0.66. Four participants with super recognition (at least 2 SDs above the mean) and 7 participants with phonagnosia (at least 2 SDs below the mean) were identified. The JVLMT is a promising screening tool for voice recognition abilities in a scientific and neuropsychological context.
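For readers unfamiliar with item response theory, a minimal Rasch (1PL) illustration of how ability and item difficulty jointly determine expected scores (the difficulty values and ability levels below are invented; this is not the JVLMT scoring code, and it ignores the 1/3 guessing floor implied by the three-alternative format):

```python
# Rasch model: P(correct) is a logistic function of ability minus item difficulty.
import numpy as np

def rasch_p(theta, b):
    """Probability of a correct response for ability theta and item difficulty b."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

item_difficulties = np.linspace(-2, 2, 22)        # 22 items, as in the JVLMT
for theta in (-1.0, 0.0, 1.5):
    expected_score = rasch_p(theta, item_difficulties).sum()
    print(f"ability {theta:+.1f}: expected raw score {expected_score:.1f} / 22")
```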
Direction selectivity in hearing: monaural phase sensitivity in octopus neurons
The processing of temporal sound features is fundamental to hearing, and the auditory system displays a plethora of specializations, at many levels, to enable such processing. Octopus neurons are the most extreme temporally specialized cells in the auditory brain (and perhaps the entire brain), which makes them intriguing but also difficult to study. Notwithstanding the scant physiological data, these neurons have been a favorite cell type of modeling studies, which have proposed that octopus cells have critical roles in pitch and speech perception. We used a range of in vivo recording and labeling methods to examine the hypothesis that tonotopic ordering of cochlear afferents combines with dendritic delays to compensate for cochlear delay - which would explain the highly entrained responses of octopus cells to sound transients. Unexpectedly, the experiments revealed that these neurons have marked selectivity to the direction of fast frequency glides, which is tied in a surprising way to intrinsic membrane properties and subthreshold events. The data suggest that octopus cells have a role in temporal comparisons across frequency and may play a role in auditory scene analysis.
Learning Speech Perception and Action through Sensorimotor Interactions
Decoding the neural processing of speech
Understanding speech in noisy backgrounds requires selective attention to a particular speaker. Humans excel at this challenging task, while current speech recognition technology still struggles when background noise is loud. The neural mechanisms by which we process speech remain, however, poorly understood, not least due to the complexity of natural speech. Here we describe recent progress obtained by applying machine learning to neuroimaging data from humans listening to speech in different types of background noise. In particular, we develop statistical models that relate characteristic features of speech, such as pitch, amplitude fluctuations and linguistic surprisal, to neural measurements. We find neural correlates of speech processing both at the subcortical level, related to the pitch, and at the cortical level, related to amplitude fluctuations and linguistic structures. We also show that some of these measures can be used to diagnose disorders of consciousness. Our findings may be applied in smart hearing aids that automatically adjust speech processing to assist a user, as well as in the diagnosis of brain disorders.
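A sketch of the kind of statistical model described, on simulated data (the sampling rate, lag range, and ridge penalty are my assumptions): a temporal response function that maps time-lagged speech features, here an amplitude envelope, onto a neural signal via regularized linear regression. In practice such models are fit with cross-validation and with several features (e.g., pitch, envelope, surprisal) stacked into the design matrix.

```python
# Temporal response function (TRF) estimated by ridge regression.
import numpy as np

rng = np.random.default_rng(0)
fs = 64                                            # Hz, hypothetical feature/EEG rate
n = fs * 60
envelope = np.abs(rng.normal(size=n))              # stand-in speech envelope
true_trf = np.hanning(fs // 4)                     # hypothetical ~250 ms neural response
eeg = np.convolve(envelope, true_trf)[:n] + rng.normal(size=n)

# Design matrix of lagged copies of the feature (lags 0 to ~250 ms).
lags = np.arange(fs // 4)
X = np.column_stack([np.roll(envelope, lag) for lag in lags])
X[: lags.max()] = 0                                # discard wrapped samples

lam = 10.0                                         # ridge penalty
trf_hat = np.linalg.solve(X.T @ X + lam * np.eye(len(lags)), X.T @ eeg)
pred = X @ trf_hat
print("prediction accuracy (r):", np.corrcoef(pred, eeg)[0, 1].round(2))
```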
Do deep learning latent spaces resemble human brain representations?
In recent years, artificial neural networks have demonstrated human-like or super-human performance in many tasks including image or speech recognition, natural language processing (NLP), playing Go, chess, poker and video-games. One remarkable feature of the resulting models is that they can develop very intuitive latent representations of their inputs. In these latent spaces, simple linear operations tend to give meaningful results, as in the well-known analogy QUEEN-WOMAN+MAN=KING. We postulate that human brain representations share essential properties with these deep learning latent spaces. To verify this, we test whether artificial latent spaces can serve as a good model for decoding brain activity. We report improvements over state-of-the-art performance for reconstructing seen and imagined face images from fMRI brain activation patterns, using the latent space of a GAN (Generative Adversarial Network) model coupled with a Variational AutoEncoder (VAE). With another GAN model (BigBiGAN), we can decode and reconstruct natural scenes of any category from the corresponding brain activity. Our results suggest that deep learning can produce high-level representations approaching those found in the human brain. Finally, I will discuss whether these deep learning latent spaces could be relevant to the study of consciousness.
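A conceptual sketch of the decoding step only, using random stand-in data (the array sizes and ridge penalty are assumptions; the real pipeline uses latent codes from a trained GAN/VAE and its generator for reconstruction): learn a linear mapping from brain activity patterns to latent vectors, which a pretrained generator would then turn into images.

```python
# Linear decoding from simulated "fMRI" patterns to latent vectors.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_images, n_voxels, latent_dim = 300, 2000, 120    # hypothetical sizes
latents = rng.normal(size=(n_images, latent_dim))  # latent codes of the seen images
brain = latents @ rng.normal(size=(latent_dim, n_voxels)) + rng.normal(size=(n_images, n_voxels))

decoder = Ridge(alpha=1000.0).fit(brain[:250], latents[:250])
z_hat = decoder.predict(brain[250:])               # decoded latents for held-out trials

# In the real analysis, z_hat would be passed to the generator to reconstruct
# the seen or imagined stimulus; here we just score latent recovery.
acc = np.mean([np.corrcoef(z_hat[i], latents[250 + i])[0, 1] for i in range(50)])
print("mean latent reconstruction correlation:", round(float(acc), 2))
```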
Kamala Harris and the Construction of Complex Ethnolinguistic Political Identity
Over the past 50 years, sociolinguistic studies on black Americans have expanded in both theoretical and technical scope, and newer research has moved beyond seeing speakers, especially black speakers, as a monolithic sociolinguistic community (Wolfram 2007, Blake 2014). Yet there remains a dearth of critical work on complex identities existing within black American communities as well as how these identities are reflected and perceived in linguistic practice. At the same time, linguists have begun to take greater interest in the ways in which public figures, such as politicians, may illuminate the wider social meaning of specific linguistic variables. In this talk, I will present results from analyses of multiple aspects of ethnolinguistic variation in the speech of Vice President Kamala Harris during the 2019-2020 Democratic Party Primary debates. Together, these results show how VP Harris expertly employs both enregistered and subtle linguistic variables, including aspects of African American Language morphosyntax, vowels, and intonational phonology in the construction and performance of a highly specific sociolinguistic identity that reflects her unique positions politically, socially, and racially. The results of this study expand our knowledge about how the complexities of speaker identity are reflected in sociolinguistic variation, as well as press on the boundaries of what we know about how speakers in the public sphere use variation to reflect both who they are and who we want them to be.
Space for Thinking - Spatial Reference Frames and Abstract Concepts
People from cultures around the world tend to borrow from the domain of space to represent abstract concepts. For example, in the domain of time, we use spatial metaphors (e.g., describing the future as being in front and the past behind), accompany our speech with spatial gestures (e.g., gesturing to the left to refer to a past event), and use external tools that project time onto a spatial reference frame (e.g., calendars). Importantly, these associations are also present in the way we think and reason about time, suggesting that space and time are also linked in the mind. In this talk, I will explore the developmental origins and functional implications of these types of cross-dimensional associations. To start, I will discuss the roles that language and culture play in shaping how children in the US and India represent time. Next, I will use word learning and memory as test cases for exploring why cross-dimensional associations may be cognitively advantageous. Finally, I will talk about future directions and the practical implications for this line of work, with a focus on how encouraging spatial representations of abstract concepts could improve learning outcomes.
Low dimensional models and electrophysiological experiments to study neural dynamics in songbirds
Birdsong emerges when a set of highly interconnected brain areas manages to generate a complex output. The similarities between birdsong production and human speech have positioned songbirds as unique animal models for studying the learning and production of this complex motor skill. In this work, we developed a low dimensional model for a neural network in which the variables were the average activities of different neural populations within the nuclei of the song system. This neural network is active during production, perception and learning of birdsong. We performed electrophysiological experiments to record neural activity from one of these nuclei and found that the low dimensional model could reproduce the neural dynamics observed during the experiments. The model could also reproduce the respiratory motor patterns used to generate song. We showed that sparse activity in one of the neural nuclei could drive more complex activity downstream in the neural network. This interdisciplinary work shows how low dimensional neural models can be a valuable tool for studying the emergence of complex motor tasks.
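A generic rate-model sketch of this kind of low-dimensional description, with made-up coupling weights and time constants rather than the published parameters: each variable is the average activity of one population, sparse input drives one nucleus, and the network is integrated with a simple Euler scheme.

```python
# Population rate model: dx/dt = (-x + W f(x) + I) / tau, integrated with Euler steps.
import numpy as np

def f(x):
    return np.tanh(x)                                # population input-output function

W = np.array([[0.0,  1.2,  0.0, 0.0],                # hypothetical coupling between nuclei
              [-0.8, 0.0,  1.0, 0.0],
              [0.0, -0.5,  0.0, 1.5],
              [0.0,  0.0, -1.0, 0.0]])
tau, dt = 10.0, 0.1                                  # arbitrary time units
x = np.zeros(4)
drive = np.zeros(4)

trace = []
for step in range(5000):
    drive[0] = 1.0 if (step % 1000) < 50 else 0.0    # sparse upstream input to one nucleus
    x = x + dt / tau * (-x + W @ f(x) + drive)
    trace.append(x.copy())
trace = np.array(trace)                              # population activities over time
print(trace.shape, trace.max(axis=0).round(2))
```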
Monkey Talk – what studies about nonhuman primate vocal communication reveal about the evolution of speech
The evolution of speech is considered to be one of the hardest problems in science. Studies of the communicative abilities of our closest living relatives, the nonhuman primates, aim to contribute to a better understanding of the emergence of this uniquely human capability. Following a brief introduction to the key building blocks that make up the human speech faculty, I will focus on the question of meaning in nonhuman primate vocalizations. While nonhuman primate calls may be highly context specific, thus giving rise to the notion of ‘referentiality’, comparisons across closely related species suggest that this specificity is evolved rather than learned. Yet, as in humans, the structure of calls varies with arousal and affective state, and there is some evidence for effects of sensory-motor integration in vocal production. Thus, the vocal production of nonhuman primates bears little resemblance to the symbolic and combinatorial features of human speech, while basic production mechanisms are shared. Listeners, in contrast, are able to learn the meaning of new sounds. A recent study using an artificial predator shows that this learning may be extremely rapid. Furthermore, listeners are able to integrate information from multiple sources to make adaptive decisions, which renders the vocal communication system as a whole relatively flexible and powerful. In conclusion, constraints on the side of vocal production, including limits in social cognition and motivation to share experiences, rather than constraints on the side of the recipient, explain the differences in communicative abilities between humans and other animals.
Towards a speech neuroprosthesis
I will review advances in understanding the cortical encoding of speech-related oral movements. These discoveries are being translated to develop algorithms to decode speech from population neural activity.
Unsupervised deep learning identifies semantic disentanglement in single inferotemporal neurons
Irina is a research scientist at DeepMind, where she works in the Frontiers team. Her work aims to bring together insights from the fields of neuroscience and physics to advance general artificial intelligence through improved representation learning. Before joining DeepMind, Irina was a British Psychological Society Undergraduate Award winner for her achievements as an undergraduate student in Experimental Psychology at Westminster University, followed by a DPhil at the Oxford Centre for Computational Neuroscience and Artificial Intelligence, where she focused on understanding the computational principles underlying speech processing in the auditory brain. During her DPhil, Irina also worked on developing poker AI, applying machine learning in the finance sector, and speech recognition at Google Research. https://arxiv.org/pdf/2006.14304.pdf
Neural control of vocal interactions in songbirds
During conversations we rapidly switch between listening and speaking, which often requires withholding or delaying our speech in order to hear others and avoid overlapping. This capacity for vocal turn-taking is exhibited by non-linguistic species as well; however, the neural circuit mechanisms that enable us to regulate the precise timing of our vocalizations during interactions are unknown. We aim to identify the neural mechanisms underlying the coordination of vocal interactions. To this end, we paired zebra finches with a vocal robot (1 Hz call playback) and measured the birds’ call response times. We found that individual birds called with a stereotyped delay with respect to the robot call. Pharmacological inactivation of the premotor nucleus HVC revealed its necessity for the temporal coordination of calls. We further investigated the contributing neural activity within HVC by performing intracellular recordings from premotor neurons and inhibitory interneurons in calling zebra finches. We found that inhibition precedes excitation before and during call onset. To test whether inhibition guides call timing, we pharmacologically limited the impact of inhibition on premotor neurons. As a result, zebra finches converged on a similar delay time, i.e. birds called more rapidly after the vocal robot call, suggesting that HVC inhibitory interneurons regulate the coordination of social contact calls. In addition, we aim to investigate the vocal turn-taking capabilities of the common nightingale. Male nightingales learn over 100 different song motifs, which they use to attract mates or defend territories. Previously, it has been shown that nightingales counter-sing with each other following a temporal structure similar to human vocal turn-taking. These animals are also able to spontaneously imitate a motif of another nightingale. The neural mechanisms underlying this behaviour are not yet understood. In my lab, we further probe the capabilities of these animals in order to assess the dynamic range of their vocal turn-taking flexibility.
Rhythm-structured predictive coding for contextualized speech processing
Bernstein Conference 2024
Brain-Rhythm-based Inference (BRyBI) for time-scale invariant speech processing
COSYNE 2023
Cross-trial alignment reveals a low-dimensional cortical manifold of naturalistic speech production
COSYNE 2023
Altered sensory prediction error signaling and dopamine function drive speech hallucinations in schizophrenia
COSYNE 2025
Bayesian integration of audiovisual speech by DNN models is similar to human observers
COSYNE 2025
Geometric Signatures of Speech Recognition: Insights from Deep Neural Networks to the Brain
COSYNE 2025
Human precentral gyrus neurons link speech sequences from listening to speaking
COSYNE 2025
Attentional modulation of the cortical contribution to the frequency-following response evoked by continuous speech
FENS Forum 2024
EEG beta de-synchronization signs the efficacy of a rehabilitation treatment for speech impairment in Parkinson’s disease population
FENS Forum 2024
Brain-rhythm-based inference (BRyBI) for time-scale invariant speech processing
FENS Forum 2024
The cortical frequency-following response to continuous speech in musicians and non-musicians
FENS Forum 2024
Decoding envelope and frequency-following responses to speech using deep neural networks
FENS Forum 2024
Decoding of selective attention to speech in CI patients using linear and non-linear methods
FENS Forum 2024
Decoding spatiotemporal processing of speech and melody in the brain
FENS Forum 2024
The effects and interactions of top-down influences on speech perception
FENS Forum 2024
EEG-based source analysis of the neural response at the fundamental frequency of speech
FENS Forum 2024
Examining speech disfluency through the analysis of grey matter densities in 5-year-olds using voxel-based morphometry
FENS Forum 2024
The neural processing of natural audiovisual speech in noise in autism: A TRF approach
FENS Forum 2024
Web-based speech transcription tool for efficient quantification of memory performance
FENS Forum 2024