Gestural priors for speech representation learning (Bio-inspired ML/AI)

Despite recent advances in AI, representation learning methods for continuous signals, such as speech or video, still fall short of human abilities in many respects. For example, current AI systems underperform in terms of data and compute efficiency^1,2, robustness to out-of-distribution examples^3,4, and robustness to adversarial attacks^5,6. This has important implications for the scalability, environmental footprint, reliability and safety of AI systems.

Current AI systems rely on generic inductive biases^7,8 and massive scaling of data and compute^9,10. In contrast, humans implement sophisticated domain-specific inductive biases and are much more frugal learners^11–13. In the speech domain, for example, a body of empirical evidence suggests that human infants may use gestural priors as an inductive bias during speech perception development^14,15. ‘Gestural’ here refers to the nature of speech, which results from a sequence of co-articulated gestures from a small number of articulators that move relatively slowly and according to the constraints of biological motion^16,17. The idea of a gestural prior is to leverage properties of this latent speech structure to guide self-supervised representation learning. From a signal processing point of view, this can be understood as learning to decouple latent time from signal time in a principled way to accommodate speech’s highly irregular timing.

In this PhD, you will design and implement gestural priors within deep learning models and assess their benefits for speech representation learning. Core challenges will include (i) designing gestural biases that are specific enough to effectively guide learning, while being expressive enough to accommodate the complexity of real-world speech; (ii) implementing these biases within deep learning architectures at scale; and (iii) characterizing the conditions under which they provide benefits, for instance in compute- and/or data-limited settings, under domain shifts and/or under adversarial attacks.

Depending on your interests, there are many possible extensions and openings, including, for example, toward video signal processing (e.g., audiovisual speech, sign language or goal-directed motions of people or animals in visual scenes) or toward the modeling of cognitive processing or brain activation during speech listening in humans.

Requirements

The ideal candidate has a strong quantitative background and prior experience with representation learning algorithms. They are interested in cognitive (neuro)science and enjoy hard mathematical or computational challenges.

References

1. Cuervo, S. & Marxer, R. Scaling properties of speech language models. in Proceedings of EMNLP (2024).

2. Long, B. L. et al. The BabyView dataset: High-resolution egocentric videos of infants’ and young children’s everyday experiences. in Proceedings of CCNeuro (2025).

3. Atwany, H., Waheed, A., Singh, R., Choudhury, M. & Raj, B. Lost in transcription, found in distribution shift: Demystifying hallucination in speech foundation models. in Proceedings of ACL (2025).

4. Han, S., Wang, Z. & Zhang, M. Flow snapshot neurons in action: Deep neural networks generalize to biological motion perception. in Proceedings of NeurIPS (2024).

5. Olivier, R. & Raj, B. There is more than one kind of robustness: Fooling Whisper with adversarial examples. in Proceedings of Interspeech (2023).

6. Wichmann, F. A. & Geirhos, R. Are deep neural networks adequate behavioral models of human visual perception? Annual Review of Vision Science 9, 501–524 (2023).

7. Chatterjee, S. & Sudijono, T. Neural networks generalize on low complexity data. The Annals of Statistics 54, 350–382 (2026).

8. Valle-Perez, G., Camargo, C. Q. & Louis, A. A. Deep learning generalizes because the parameter-function map is biased towards simple functions. in Proceedings of ICLR (2019).

9. Sutton, R. The bitter lesson.

10. Zhai, X., Kolesnikov, A., Houlsby, N. & Beyer, L. Scaling vision transformers. in Proceedings of CVPR (2022).

11. Spelke, E. What Babies Know: Core Knowledge and Composition. vol. 1 (Oxford University Press, 2022).

12. Yildirim, I., Belledonne, M., Freiwald, W. & Tenenbaum, J. Efficient inverse graphics in biological face processing. Science Advances 6, (2020).

13. Warstadt, A. et al. Findings of the BabyLM challenge: Sample-efficient pretraining on developmentally plausible corpora. in Proceedings of CoNLL (2023).

14. Choi, D., Yeung, H. H. & Werker, J. F. Sensorimotor foundations of speech perception in infancy. Trends in Cognitive Sciences 27, 773–784 (2023).

15. Kuhl, P. K., Ramírez, R. R., Bosseler, A., Lin, J.-F. L. & Imada, T. Infants’ brain responses to speech suggest analysis by synthesis. Proceedings of the National Academy of Sciences 111, 11238–11245 (2014).

16. Turk, A. & Shattuck-Hufnagel, S. Speech Timing: Implications for Theories of Phonology, Phonetics, and Speech Motor Control. (Oxford University Press, 2020).

17. Todorov, E. Optimality principles in sensorimotor control. Nature Neuroscience 7, 907–915 (2004).