« Robustness of speaker recognition systems ».
The defense will take place on 15th May at 14:30 at the Centre d’Enseignement et de Recherche en Informatique (Ada Lovelace Auditorium).
PR. MATEJKA Pavel, (Reviewer), Brno University
PR. LARCHER Anthony, (Reviewer), Informatique Le Mans Université
PR. BONASTRE Jean-Francois, (Examiner), Avignon Université
PR. ILLINA Irina, (Examiner), Université de Lorraine-INRIA
PR. BARRAS Claude, (Examiner), Informatique Vocapia Research
PR. LEFEVRE Fabrice, (Examiner), Avignon Université
PR. ROUVIER Mickael, (Examiner), Avignon Université
PR. MATROUF Driss, (Thesis supervisor), Avignon Université
Title: Robustness of DNN-based speaker recognition systems against environmental variabilities
Speaker recognition systems authenticate the identity of speakers from their speech utterances. To authenticate a claimed identity, a system needs a compact, fixed-length, speaker-discriminant representation of each variable-length utterance, known as a speaker embedding. Current speaker recognition systems use deep neural networks (DNNs) to extract speaker embeddings. Although DNN-based speaker recognition systems are relatively robust, their performance degrades in the presence of acoustic variabilities such as additive noise and reverberation. Three main groups of variabilities reduce the performance of speaker recognition systems: internal (e.g. age, emotion, and stress), external (e.g. noise, reverberation, and distance), and content-related (e.g. language and accent). The main theme of this thesis is making DNN-based text-independent speaker recognition systems robust against additive noise and reverberation. The impact of variabilities can be addressed at several levels: the signal, the features, the speaker embedding extractor, the speaker embeddings, and the scoring stage with adaptation techniques. Our work focuses on the speaker embedding extractor and the speaker embeddings in two well-known and successful DNN-based speaker recognition systems: TDNN and ResNet.
The first part of our work (Chapter 5) proposes several noise compensation denoising autoencoders (DAEs), namely a Stacked DAE and a Gaussian DAE, that learn a transformation between pairs of distorted and clean speaker embeddings extracted from the TDNN system. The Stacked DAE is composed of several DAEs, where each DAE receives as input the output of its predecessor DAE concatenated with the difference between the noisy speaker embeddings and its predecessors’ output. The noise compensation modules are tested on additive noise (unseen noises, specific noise), early reverberation, and late reverberation. In all cases the equal error rate (EER) improves significantly, with relative gains ranging from 20% to 76%. In this part, we also propose two configurations for the case where several acoustic distortions are present.
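The Stacked DAE wiring described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy single-hidden-layer networks with random weights stand in for trained DAEs, the embedding size of 256 is arbitrary, the first stage is assumed to take the noisy embedding alone, and "predecessors’ output" is read as the output of the immediately preceding DAE. It is not the configuration used in the thesis.

```python
import numpy as np

def make_toy_dae(in_dim, out_dim, rng, hidden=64):
    """Hypothetical stand-in for a trained DAE: one tanh hidden layer, random weights."""
    w1 = rng.standard_normal((in_dim, hidden)) * 0.1
    w2 = rng.standard_normal((hidden, out_dim)) * 0.1
    return lambda x: np.tanh(x @ w1) @ w2

def stacked_dae_forward(noisy, daes):
    """Run a chain of DAEs on a batch of noisy embeddings.

    Each stage after the first receives the previous stage's output
    concatenated with (noisy embedding - previous output).
    """
    out = daes[0](noisy)  # assumption: the first stage sees only the noisy embedding
    for dae in daes[1:]:
        residual = noisy - out
        out = dae(np.concatenate([out, residual], axis=-1))
    return out

rng = np.random.default_rng(0)
dim = 256  # assumed embedding dimension
daes = [make_toy_dae(dim, dim, rng)]
daes += [make_toy_dae(2 * dim, dim, rng) for _ in range(2)]  # later stages take 2*dim inputs

noisy_batch = rng.standard_normal((4, dim))
denoised = stacked_dae_forward(noisy_batch, daes)
print(denoised.shape)
```

Feeding the residual alongside the previous estimate gives each later stage direct access to what the earlier stages removed from the noisy embedding, which is one plausible motivation for the concatenation.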
In the second part of our work (Chapter 6), we explore the behavior of the ResNet speaker recognition system against noise and reverberation and compare it with the TDNN system. We also investigate noise compensation on ResNet speaker embeddings in two cases: 1) compensation of artificial noise with artificial data, and 2) compensation of real noise with artificial data. The second case is the most desirable scenario because it makes noise compensation affordable without requiring real data to train the denoising techniques. The experimental results show that in the first scenario noise compensation gives a significant improvement with TDNN, while the improvement with ResNet is not significant. In the second scenario, we achieve a 15% improvement in EER on the VOiCES Eval challenge for both the TDNN and ResNet systems. In most cases, ResNet without compensation outperforms TDNN with noise compensation.
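For readers unfamiliar with the metric, the sketch below shows one simple way an equal error rate and a relative improvement between two systems could be computed from verification scores. The threshold sweep is a coarse approximation, the function names are hypothetical, and it is assumed here that the improvements quoted in this summary are relative to a baseline EER; the thesis may use a different scoring toolkit.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: sweep thresholds over observed scores and
    return the error rate where false accepts and false rejects are closest."""
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target, nontarget]))
    # False rejection rate: target trials scoring below the threshold.
    frr = np.array([(target < t).mean() for t in thresholds])
    # False acceptance rate: non-target trials scoring at or above the threshold.
    far = np.array([(nontarget >= t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0

def relative_gain(baseline_eer, new_eer):
    """Relative EER improvement in percent, with respect to the baseline."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer

# Toy scores, invented for illustration only.
eer = equal_error_rate([0.2, 0.8], [0.1, 0.9])
print(eer)
print(relative_gain(10.0, 8.5))
```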
In the next part (Chapter 7), we move towards learning noise-robust speaker embedding extractors. We propose two ResNet-based speaker recognition systems that make the speaker embeddings more robust against additive noise and reverberation. The goal of the proposed systems is to extract speaker embeddings in noisy environments that are close to the corresponding speaker embeddings in a clean environment. The first proposed system learns the same distribution for both noisy and clean environments. The second proposed system shifts the noisy speaker embeddings towards the distribution of the best system obtained in a clean environment. The proposed systems are tested with four evaluation protocols. Across different conditions with real and artificial noise and reverberation, the modified systems outperform the baseline ResNet system. In the presence of artificial noise and reverberation, we achieve a 19% improvement in EER. The main advantage of the proposed systems is their effectiveness against real noise and reverberation, where we achieve a 15% improvement in EER.
In the last part of our work (Chapter 8), we propose a noise-robust self-supervised ResNet speaker recognition system based on the Barlow Twins loss function. The Barlow Twins objective optimizes two criteria. First, it increases the similarity between two versions of the same signal (i.e. the clean signal and its augmented noisy version) to make the speaker embedding invariant to acoustic noise. Second, it reduces the redundancy between the dimensions of the speaker embeddings, which improves their overall quality. The experimental results on the Fabiole corpus show a 22% relative gain in terms of EER in clean environments and an 18% improvement in the presence of noise with low SNR and reverberation.
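The two criteria of the Barlow Twins objective can be made concrete with a small NumPy sketch of the loss in its commonly published form: the cross-correlation matrix between the two views is pushed towards the identity, so that diagonal terms approach 1 (invariance) and off-diagonal terms approach 0 (redundancy reduction). The function name, the off-diagonal weight of 5e-3, and the toy data are assumptions for illustration; how the thesis integrates this loss into ResNet training is not shown here.

```python
import numpy as np

def barlow_twins_loss(z_clean, z_noisy, lambda_offdiag=5e-3, eps=1e-8):
    """Barlow Twins loss for two batches of embeddings, shape (batch, dim).

    z_clean and z_noisy are embeddings of the same utterances under two views
    (here: clean and noise-augmented).
    """
    n = z_clean.shape[0]
    # Standardize each embedding dimension over the batch.
    za = (z_clean - z_clean.mean(axis=0)) / (z_clean.std(axis=0) + eps)
    zb = (z_noisy - z_noisy.mean(axis=0)) / (z_noisy.std(axis=0) + eps)
    # Cross-correlation matrix between the two views, shape (dim, dim).
    c = za.T @ zb / n
    diag = np.diag(c)
    on_diag = np.sum((diag - 1.0) ** 2)          # invariance term
    off_diag = np.sum(c ** 2) - np.sum(diag ** 2)  # redundancy-reduction term
    return on_diag + lambda_offdiag * off_diag

rng = np.random.default_rng(0)
clean = rng.standard_normal((512, 16))               # toy "clean" embeddings
slightly_noisy = clean + 0.05 * rng.standard_normal((512, 16))
unrelated = rng.standard_normal((512, 16))

loss_matched = barlow_twins_loss(clean, slightly_noisy)
loss_unrelated = barlow_twins_loss(clean, unrelated)
print(loss_matched < loss_unrelated)
```

In this toy setup the loss is lower when the second view is a lightly perturbed copy of the first than when it is unrelated, which matches the invariance criterion described above.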