Despite recent advances in AI, representation learning methods for continuous signals, such as speech or video, still fall short of human abilities in many respects. For example, current AI systems underperform in terms of data and compute efficiency [1,2], robustness to out-of-distribution examples [3,4], and robustness to adversarial attacks [5,6]. This has important implications for the scalability, environmental footprint, reliability and safety of AI systems.
Current AI systems rely on generic inductive biases [7,8] and massive scaling of data and compute [9,10]. In contrast, humans implement sophisticated domain-specific inductive biases and are much more frugal learners [11–13]. In the speech domain, for example, a body of empirical evidence suggests that human infants may use gestural priors as an inductive bias during speech perception development [14,15]. ‘Gestural’ here refers to the nature of speech, which results from a sequence of co-articulated gestures from a small number of articulators that move relatively slowly and according to the constraints of biological motion [16,17]. The idea of a gestural prior is to leverage properties of this latent speech structure to guide self-supervised representation learning. From a signal processing point of view, this can be understood as learning to decouple latent time from signal time in a principled way to accommodate speech’s highly irregular timing.
In this PhD, you will design and implement gestural priors within deep learning models and assess their benefits for speech representation learning. Core challenges will include (i) designing gestural biases that are specific enough to effectively guide learning, while being expressive enough to accommodate the complexity of real-world speech; (ii) implementing these biases within deep learning architectures at scale; and (iii) characterizing the conditions under which they provide benefits, for instance in compute- and/or data-limited settings, under domain shifts, and/or under adversarial attacks.
Depending on your interests, the project can be extended in many directions, for example toward video signal processing (e.g., audiovisual speech, sign language, or goal-directed motions of people or animals in visual scenes) or toward modeling cognitive processing or brain activation in humans during speech listening.
Requirements
The ideal candidate has a strong quantitative background and prior experience with representation learning algorithms, is interested in cognitive (neuro)science, and enjoys hard mathematical or computational challenges.