ePoster

STUDYING THE LINK BETWEEN HUMAN PREFERENCE ALIGNMENT AND NEURAL ALIGNMENT IN LLMS

Mireia Masias and 3 co-authors

Telefonica Innovación Digital

FENS Forum 2026 (2026)

Barcelona, Spain

Board PS02-07PM-571

Presentation

Date TBA

Abstract

Large language models (LLMs) are increasingly post-trained to better follow human preferences, yet it remains unclear whether improved preference alignment is accompanied by stronger alignment to human neural representations. We systematically compare pretrained models, instruction-following variants, and reward models across LLMs. Using a shared set of language stimuli with matched neural recordings, we extract layer-wise representations and estimate neural alignment via representational similarity and encoding-based correlations, while accounting for measurement noise. We then relate neural alignment to performance on human-alignment benchmarks. Finally, we fine-tune a subset of models with a neural-alignment objective and test whether this brain tuning transfers to human-alignment and task performance measures. Preliminary results indicate that post-training shifts peak neural alignment toward middle-to-late layers and increases correspondence with higher-level semantic signals, while reward models show the strongest late-layer alignment. Across models, higher benchmark scores are positively associated with neural alignment in these layers. Brain-tuned models show small but consistent gains on preference benchmarks without degrading task performance, suggesting that neural alignment may provide a complementary training signal for building more human-aligned systems.
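
The layer-wise representational similarity step described in the abstract could be sketched roughly as follows. This is an illustrative sketch on synthetic data, not the authors' pipeline: the function names (rdm, rsa_score, layerwise_alignment) are hypothetical, and the specific choices here (correlation-distance dissimilarity matrices, Spearman comparison, a split-half noise ceiling used to normalize scores) are assumptions, since the abstract does not state which variants were used.

```python
import numpy as np

def rdm(features):
    """Dissimilarity matrix over stimuli (rows): 1 - Pearson correlation."""
    return 1.0 - np.corrcoef(features)

def upper_triangle(matrix):
    """Off-diagonal upper-triangle entries as a flat vector."""
    i, j = np.triu_indices(matrix.shape[0], k=1)
    return matrix[i, j]

def ranks(x):
    """Ordinal ranks (no tie correction; adequate for continuous values)."""
    out = np.empty(len(x))
    out[np.argsort(x)] = np.arange(len(x))
    return out

def rsa_score(features_a, features_b):
    """Spearman correlation between the two dissimilarity structures."""
    a = ranks(upper_triangle(rdm(features_a)))
    b = ranks(upper_triangle(rdm(features_b)))
    return np.corrcoef(a, b)[0, 1]

def layerwise_alignment(layer_features, neural_a, neural_b):
    """Score each layer against the mean recording, normalized by a
    split-half noise ceiling (assumed here; other estimators exist)."""
    ceiling = rsa_score(neural_a, neural_b)
    neural_mean = (neural_a + neural_b) / 2.0
    scores = [rsa_score(f, neural_mean) / ceiling for f in layer_features]
    return ceiling, scores

# Synthetic stand-ins: 40 stimuli, an 8-dimensional latent structure that
# drives 50 simulated recording channels, measured twice with noise.
rng = np.random.default_rng(0)
latent = rng.standard_normal((40, 8))
signal = latent @ rng.standard_normal((8, 50))
neural_a = signal + 0.1 * rng.standard_normal(signal.shape)
neural_b = signal + 0.1 * rng.standard_normal(signal.shape)

# Two hypothetical "layers": one unrelated to the latent structure,
# one that carries it.
layers = [
    rng.standard_normal((40, 16)),
    latent + 0.1 * rng.standard_normal(latent.shape),
]
ceiling, scores = layerwise_alignment(layers, neural_a, neural_b)
print("noise ceiling:", ceiling)
print("normalized alignment per layer:", scores)
```

In a real analysis the rows of each layer matrix would be model activations for the shared stimuli, and the per-layer scores would be compared across pretrained, instruction-following, and reward models to locate where alignment peaks.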
