Authors & Affiliations
Badr AlKhamissi, Muhammad ElNokrashy, Zeb Kurth-Nelson, Sam Ritter
Abstract
In the human brain, some individual neurons respond selectively to abstract variables, invariant to sensory grounding (Mansouri, Freedman, & Buckley, 2020). Similarly tuned units also appear in artificial networks trained on cognitive tasks (Goh et al., 2021). It is often implicitly assumed that the emergence of such interpretable neurons plays a key role in the behavior of the trained network. Here we show that this is not necessarily the case. We train a biologically inspired artificial agent comprising two key components, a recurrent network and an associative memory, on a canonical rule-based neuroscience task (J. Wallis, Anderson, & Miller, 2001), and observe the emergence of brain-like rule representations in the recurrent network. Crucially, however, these representations are not used to guide behavior at test time: ablating the units that carry them has minimal impact on performance. By contrast, ablating other units in the recurrent network can severely degrade performance. These results call into question the assumption that observing representations in a brain region, together with performance degradation when that region is lesioned, is sufficient to infer that those representations cause the animal's behavior in the task. They also point the way toward further modeling and animal experiments that may improve our understanding of epiphenomenality in the brain.
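To make the ablation analysis concrete, here is a minimal sketch (not the paper's code) of the procedure the abstract describes: selected hidden units of a recurrent network are clamped to zero at test time, and task performance is compared with and without the lesion. All class and function names below (RecurrentAgent, accuracy, the unit index lists) are hypothetical stand-ins, assuming a simple GRU-based policy in PyTorch.

```python
import torch
import torch.nn as nn


class RecurrentAgent(nn.Module):
    """Toy recurrent policy whose hidden units can be masked (ablated) at test time."""

    def __init__(self, obs_dim: int, hidden_dim: int, n_actions: int):
        super().__init__()
        self.rnn = nn.GRUCell(obs_dim, hidden_dim)
        self.policy = nn.Linear(hidden_dim, n_actions)
        self.hidden_dim = hidden_dim

    def forward(self, obs_seq: torch.Tensor, ablate_units: list[int] | None = None):
        # obs_seq: (time, batch, obs_dim)
        h = torch.zeros(obs_seq.shape[1], self.hidden_dim)
        logits = []
        for obs in obs_seq:
            h = self.rnn(obs, h)
            if ablate_units:  # lesion: clamp the selected units to zero at every step
                h = h.clone()
                h[:, ablate_units] = 0.0
            logits.append(self.policy(h))
        return torch.stack(logits)  # (time, batch, n_actions)


def accuracy(agent, obs_seq, targets, ablate_units=None):
    """Fraction of correct final-step actions, with an optional unit ablation."""
    with torch.no_grad():
        logits = agent(obs_seq, ablate_units=ablate_units)
    return (logits[-1].argmax(dim=-1) == targets).float().mean().item()


# Usage: compare performance when ablating rule-selective units vs. other units.
# baseline = accuracy(agent, obs_seq, targets)
# rule_abl = accuracy(agent, obs_seq, targets, ablate_units=rule_selective_units)
# ctrl_abl = accuracy(agent, obs_seq, targets, ablate_units=control_units)
```

Under the paper's reported finding, the first comparison (ablating the rule-selective units) would leave performance largely intact, while ablating the control set of other units would degrade it.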