PhD defense of Adrien Gresse – 6 February 2020 – Laboratoire Informatique d’Avignon

Thursday 6 February 2020, 14:30, at CERI (Amphitheater Ada).

Title: “The Art of Voice: Characterizing Vocal Information in Artistic Choices”

Jury members:

Mr. Emmanuel Vincent, Research Director at Inria-Nancy, LORIA, Reviewer
Mr. Jean-Julien Aucouturier, Research Scientist at CNRS, IRCAM, Reviewer
Ms. Julie Mauclair, Assistant Professor at the University of Toulouse, IRIT, Examiner
Ms. Lori Lamel, Research Director at CNRS, LIMSI, Examiner
Mr. Yannick Estève, Professor at the University of Avignon, LIA, Examiner
Mr. Jean-François Bonastre, Professor at the University of Avignon, LIA, Thesis Supervisor
Mr. Richard Dufour, Assistant Professor at the University of Avignon, LIA, Co-supervisor
Mr. Vincent Labatut, Assistant Professor at the University of Avignon, LIA, Co-supervisor

Abstract: To reach an international audience, audiovisual productions (films, series, video games) need to be translated into other languages. Often, the original-language voices in the work are replaced by new voices in the target language. The vocal casting process, which aims to choose a voice (an actor) that matches both the original voice and the character played, is performed manually by an artistic director (AD). Today, ADs tend to favor new “talents” (less expensive and more available than experienced dubbers), but they cannot conduct large-scale auditions. Automatic tools that measure how well a voice in a target language/culture matches a voice in a source language in a given context would therefore be of strong interest to audiovisual industry professionals. Furthermore, beyond vocal casting, this question of voice selection echoes major scientific challenges in understanding the mechanisms of voice perception.

In this thesis, we use the voices of professional actors selected by an AD, in different languages, for works that have already been dubbed. First, we build a protocol based on a state-of-the-art speaker recognition method to show that our data contain information characteristic of the character. We also identify the influence of linguistic bias on the system’s performance. Next, we establish a methodological framework to evaluate an automatic system’s ability to discriminate pairs of voices dubbing the same character. The system we built relies on siamese neural networks. Within this evaluation framework, we rigorously control biases (linguistic content, gender, etc.) and learn a similarity measure that predicts the AD’s choices significantly better than chance. Finally, we learn a representation space that highlights character-specific information, called the p-vector. Using our methodological framework, we demonstrate that this representation discriminates the voices of new characters better than a representation focused on speaker information. Moreover, we show that generalized knowledge learned from a related dataset can be exploited through knowledge distillation techniques in neural networks.

This thesis provides an initial step toward a vocal casting aid tool capable of preselecting relevant voices from a large set available in a given language. While we have shown that characteristic information reflecting artistic choices, which are often difficult to formalize, can be extracted from a large volume of data, we still need to identify the factors that explain these decisions. Understanding the system’s decision-making process would also help define the “vocal palette”. Following this work, we aim to explore the influence of the target language and culture by extending our research to more languages. In the longer term, this work could help us understand how voice perception has evolved since the early days of dubbing.
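To make the idea of a siamese similarity measure concrete, here is a minimal sketch in pure Python. It is not the architecture used in the thesis: the single tanh layer, the cosine score, the contrastive loss with its margin, and the random placeholder feature vectors are all assumptions chosen for illustration, and no training is performed. It only shows the defining property of a siamese setup, namely that both voices of a pair pass through the same weights before being compared.

```python
import math
import random


def embed(features, weights):
    """Project a feature vector with the shared weights, then apply tanh."""
    return [math.tanh(sum(w * x for w, x in zip(row, features))) for row in weights]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def siamese_score(voice_a, voice_b, weights):
    """Both inputs go through the same weights: the siamese property."""
    return cosine_similarity(embed(voice_a, weights), embed(voice_b, weights))


def contrastive_loss(score, same_character, margin=0.5):
    """Pull same-character pairs toward 1; push other pairs below the margin."""
    if same_character:
        return (1.0 - score) ** 2
    return max(0.0, score - margin) ** 2


# Hypothetical data: random vectors stand in for real voice features.
rng = random.Random(0)
dim_in, dim_out = 8, 4
weights = [[rng.uniform(-1, 1) for _ in range(dim_in)] for _ in range(dim_out)]
original_voice = [rng.gauss(0, 1) for _ in range(dim_in)]
dubbed_voice = [rng.gauss(0, 1) for _ in range(dim_in)]

score = siamese_score(original_voice, dubbed_voice, weights)
print(f"similarity: {score:.3f}")
print(f"loss if same character: {contrastive_loss(score, True):.3f}")
print(f"loss if different character: {contrastive_loss(score, False):.3f}")
```

In a trained system, the shared weights would be adjusted to reduce this kind of loss over many labeled pairs, so that voices dubbing the same character score higher than unrelated pairs.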