Please join the next WARA Media and Language webinar on May 13, 13:00-14:00 (CET). Jens Edlund (KTH Royal Institute of Technology, Speech, Music and Hearing, and Språkbanken Tal) will present an overview of a methodology that combines the interests of machine learning with those of the speech scientist: analysis-by-synthesis.

You can sign up for the event via Lyyti. Registration closes at 12:00 on May 13. A Zoom link will be distributed shortly before the webinar.


Bio

Jens Edlund is an interdisciplinary researcher with a background in phonetics, linguistics and computational linguistics. He holds a PhD in Speech Communication and is a Docent in Speech Technology. Starting at Telia Research in the mid-90s, Jens has worked on most aspects of speech-centric research. He is an Associate Professor at KTH Speech, Music and Hearing and the Director of the national research infrastructure Språkbanken Tal.

Abstract

From a machine learning perspective, the algorithm and the ML methodology make up the centrepiece, surrounded by tasks – be it speech recognition, speech synthesis, facial recognition, or something else entirely – whose sole purpose is to provide interesting challenges. The tasks in themselves are largely peripheral; they are applications of ML. In speech-centric sciences, the focus is reversed. The fundamental research questions revolve around speech and spoken language – how it works and what it means, how people communicate, how it affects lives and society, and how it evolves. Here, ML is viewed as one of many tools in the speech scientist's toolbox. I propose that one of the key differences between speech-centric AI and pure machine learning applied to speech tasks is that AI must involve, minimally, both of these views. If it focuses on the algorithms alone, it is machine learning – "intelligence" is meaningful only in relation to the outside world.

In this seminar, I will present an overview of a methodology that combines the interests of machine learning with those of the speech scientist: analysis-by-synthesis. The fundamental idea is old, with Gunnar Fant’s first speech synthesis from the 1950s as an early example: Fant created the synthesis not primarily to build spoken dialogue systems, but to verify a theory of how speech is produced and perceived. The words “What I cannot create, I do not understand” written on Richard Feynman’s blackboard at the time of his death expressed the same idea: if we properly understand something, we should be able to create it and verify that it works as expected. In analysis-by-synthesis, this idea is taken further: we build generative models that recreate some phenomenon of interest, manipulate the models in some systematic way, and inspect the effects on the generated phenomenon. 
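The loop described above – build a generative model, verify that it recreates the phenomenon, then manipulate it and inspect the effects – can be sketched in miniature. The following is a hypothetical toy illustration only (a decaying sinusoid standing in for a speech-like signal, with a grid-search fit), not Fant's synthesis or any actual speech model; all function and parameter names are invented for the example.

```python
import numpy as np

def synthesize(amp, freq, decay, t):
    """Toy generative model: an exponentially decaying sinusoid."""
    return amp * np.exp(-decay * t) * np.sin(2 * np.pi * freq * t)

t = np.linspace(0, 1, 1000)
observed = synthesize(1.0, 5.0, 2.0, t)          # the "phenomenon of interest"

# Analysis: search for the model parameter that best recreates the observation.
freqs = np.linspace(1, 10, 91)                   # candidate frequencies, step 0.1
errors = [np.mean((observed - synthesize(1.0, f, 2.0, t)) ** 2) for f in freqs]
best_freq = freqs[int(np.argmin(errors))]

# Verification: the resynthesis should be faithful to the observation.
resynth = synthesize(1.0, best_freq, 2.0, t)
fidelity = np.mean((observed - resynth) ** 2)    # near zero if the fit is good

# Manipulation: change one parameter systematically and inspect the effect.
perturbed = synthesize(1.0, best_freq, 4.0, t)   # double the decay rate
effect = np.mean(np.abs(observed)) - np.mean(np.abs(perturbed))

print(best_freq)   # recovered frequency
print(fidelity)    # resynthesis error
print(effect)      # positive: faster decay lowers the mean amplitude
```

The point of the sketch is the shape of the workflow, not the model: in real analysis-by-synthesis the generative model, the fidelity check, and the inspection step are each research problems in their own right, as the questions below make clear.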

I will present the method in some detail, illustrated by examples. Obvious stumbling blocks include: how do we build the model in the first place? How do we verify its fidelity to the phenomenon of interest? And how do we inspect the effects of our manipulations? I will propose tentative and partial answers to these questions, and I will argue that we cannot and should not strive to remove humans (human behaviours, human perception and cognition, human reactions) from the equation entirely. Instead, we should find meaningful and efficient ways of including them in the loop.
