PhD defense of Imen Ben-Amor – 25/04/2024 – Laboratoire Informatique d’Avignon

Venue: Centre d’Enseignement et de Recherche en Informatique (CERI), Amphi ADA – 339 Chemin des Meinajaries, CERI, 84000 Avignon.

You can also attend the defense via video conference, using this link.

You can find the slides here.

The jury members are the following:

Pr. Tomi KINNUNEN, University of Eastern Finland – Reviewer
Pr. Alessandro VINCIARELLI, University of Glasgow – Reviewer
Pr. Tanja SCHULTZ, University of Bremen – Examiner
Pr. Didier MEUWLY, Netherlands Forensic Institute, University of Twente – Examiner
Pr. Corinne FREDOUILLE, LIA, Université d’Avignon – Examiner
Pr. Jean-François BONASTRE, Inria, LIA, Université d’Avignon – Thesis supervisor

TITLE: Deep modeling based on voice attributes for explainable speaker recognition. Application in the forensic domain.

Abstract:
Automatic speaker recognition (ASpR) has been integrated into critical applications, ranging from customised assistant services to security systems and forensic investigations. It aims to automatically determine whether two voice samples originate from the same speaker. These systems primarily rely on complex deep neural networks (DNN) and present their results as a single value. Despite the high performance demonstrated by DNN-based ASpR systems, they struggle to provide transparent insights into the nature of speech representations, their encoding, and their use in the decision-making process. This lack of transparency presents significant challenges in addressing ethical and legal concerns, particularly in high-stakes applications such as forensics. This thesis introduces a three-step methodology based on deep learning, designed to provide interpretable and explainable ASpR results.
In the first step, we represent a speech extract by the presence or absence of a set of speech attributes, shared among groups of speakers and selected to be speaker-discriminant. This information is encoded by a binary vector in which a coefficient equal to 1 represents the presence of the corresponding attribute in the speech extract and 0 its absence. This binary, attribute-based modelling facilitates interpretability and allows better handling of the speech information. The results show that the obtained representations are more interpretable and offer a level of performance close to that of State-Of-The-Art (SOTA) ASpR.
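As an illustration of the binary representation described above, here is a minimal sketch in Python. It assumes a hypothetical list of per-attribute activation values and a hypothetical threshold; the abstract does not state how the binary vector is actually obtained, so `binarize`, `present_attributes`, and the sample values are illustrative only.

```python
from typing import List, Sequence


def binarize(activations: Sequence[float], threshold: float = 0.0) -> List[int]:
    """Map each (hypothetical) attribute activation to 1 (present) or 0 (absent)."""
    return [1 if value > threshold else 0 for value in activations]


def present_attributes(vector: Sequence[int]) -> List[int]:
    """Return the indices of the attributes marked as present in a binary vector."""
    return [index for index, bit in enumerate(vector) if bit == 1]


# Hypothetical activations for five attributes of one speech extract.
activations = [0.8, 0.0, 0.3, -0.2, 0.0]
vector = binarize(activations)
print(vector)                      # binary attribute vector
print(present_attributes(vector))  # indices of attributes that are present
```

Listing the indices of the active attributes is what makes such a vector easy to inspect: each 1 can be traced back to a single named attribute.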
In the second step, the goal is to ensure transparent computation of the likelihood ratio (LR), thereby facilitating a more informed assessment of the value of speech evidence in a courtroom setting. We therefore propose the Binary-Attribute-based LR (BA-LR) framework, which breaks down the scoring process into independent sub-processes, each dedicated to an attribute. An attribute-LR is an LR estimated using only the presence or absence of the attribute and its description, defined by three explicit behavioral parameters. The final LR is calculated as the product of the attribute-LRs, assuming independence between them. This framework enables transparent LR computation and a clearer understanding of the value of evidence. It also provides detailed explanations of the contribution of each attribute’s information to the final LR value, aiding juries and judges in decision-making.
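The combination rule stated above (final LR as the product of attribute-LRs under an independence assumption) can be sketched as follows. The attribute names and LR values are hypothetical, and the estimation of each attribute-LR from its three behavioral parameters is not reproduced here, since the abstract does not detail it.

```python
import math
from typing import Dict


def attribute_contributions(attribute_lrs: Dict[str, float]) -> Dict[str, float]:
    """Return the log10 LR of each attribute, i.e. its additive contribution."""
    for name, lr in attribute_lrs.items():
        if lr <= 0:
            raise ValueError(f"LR for attribute {name!r} must be positive")
    return {name: math.log10(lr) for name, lr in attribute_lrs.items()}


def combine_attribute_lrs(attribute_lrs: Dict[str, float]) -> float:
    """Multiply attribute-LRs (summed in the log domain), assuming independence."""
    return 10 ** sum(attribute_contributions(attribute_lrs).values())


# Hypothetical attribute-LRs for one comparison of two speech extracts.
example = {"attr_a": 2.0, "attr_b": 5.0, "attr_c": 0.5}
print(combine_attribute_lrs(example))    # final LR
print(attribute_contributions(example))  # per-attribute log10 contributions
```

Working in the log domain turns the product into a sum, so each attribute's share of the final value can be read off directly: a positive contribution supports the same-speaker hypothesis and a negative one supports the different-speaker hypothesis.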
In the third step, we investigate the nature of the attributes. This investigation employs statistical techniques, surrogate models, as well as backpropagation and alignment strategies to describe the attributes in terms of acoustic, phonetic and phonemic information. The obtained explanations serve as a valuable tool for phoneticians to interpret the attributes contributing to a given LR.
Additionally, our three-step approach is validated through the application of BA-LR on a forensically realistic dataset. In this context, we apply a Logistic Regression model to handle the mismatch between the training conditions and real-world scenarios. Results demonstrate the robustness and the generalisation ability of BA-LR in a forensic context.
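One common way to use logistic regression against condition mismatch is to learn an affine mapping of raw scores on data from the target conditions. The abstract does not say that the thesis uses exactly this scheme, so the sketch below is an assumption: it fits a slope `a` and offset `b` by plain gradient descent on hypothetical scores and labels (1 for same-speaker, 0 for different-speaker).

```python
import math
from typing import Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(a: float, b: float, scores: Sequence[float], labels: Sequence[int]) -> float:
    """Mean cross-entropy of the model sigmoid(a * score + b)."""
    total = 0.0
    for s, y in zip(scores, labels):
        p = sigmoid(a * s + b)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(scores)


def fit_calibration(scores: Sequence[float], labels: Sequence[int],
                    step: float = 0.1, iterations: int = 200) -> Tuple[float, float]:
    """Fit slope a and offset b by batch gradient descent on the log loss."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(iterations):
        errors = [sigmoid(a * s + b) - y for s, y in zip(scores, labels)]
        grad_a = sum(e * s for e, s in zip(errors, scores)) / n
        grad_b = sum(errors) / n
        a -= step * grad_a
        b -= step * grad_b
    return a, b


# Hypothetical raw scores from the target conditions, with their labels.
scores = [2.0, 1.5, 0.5, 1.0, -1.0, -0.5, 0.2, -2.0]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
a, b = fit_calibration(scores, labels)
print(a, b)  # fitted slope and offset
```

With balanced classes, as in this toy data, `a * score + b` can be read as a calibrated natural-log LR; with unbalanced training data the prior log-odds would have to be subtracted.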
Overall, this thesis opens a new perspective on explainable ASpR by proposing a solution for transparent decision-making, with a level of performance comparable to SOTA systems. Our approach shows promise in offering forensic practitioners and the court insights into the value of evidence, while also serving as a discovery tool for phoneticians, helping them better understand and interpret speech information. As always in the field of forensics, these encouraging results require further evaluation through additional studies before being applied in real-world situations.