ANR ESSL Project

Efficient Self-Supervised Learning for Inclusive and Innovative Speech Technologies

Self-Supervised Learning (SSL) has recently emerged as a highly promising artificial intelligence (AI) method. It lets AI systems exploit massive amounts of readily available unlabeled data to surpass previous performance levels. In particular, the field of Automatic Speech Processing (ASP) is being rapidly transformed by the arrival of SSL, thanks in part to massive industrial investment and an explosion of data, both provided by a handful of companies. The performance gains are impressive, but the complexity of SSL models demands extraordinary computational capacity from researchers and industry professionals, drastically limiting access to fundamental research in this area and its deployment in everyday products. For instance, a significant portion of the work using an SSL model for ASP relies on a system maintained and provided by a single company (wav2vec 2.0). The entire lifecycle of the technology, from its theoretical foundations to its practical deployment and the analysis of its societal impact, therefore depends solely on institutions with the physical and financial means to sustain the pace of its development. The E-SSL project aims to give the scientific community and the ASP industry back the control over self-supervised learning needed to ensure its evolution and equitable deployment, by facilitating both academic research and its transfer to industry. In practice, E-SSL takes a holistic approach to three key issues of self-supervised learning for ASP: its computational efficiency, its societal impact, and the feasibility of extending it to future products.

Automatic speaker recognition systems are vulnerable not only to speech produced artificially by voice synthesis but also to other forms of attack, such as voice identity conversion and replay. The artifacts introduced when these fraudulent signals are created or manipulated are the traces left in the signal by voice synthesis algorithms, and they make it possible to distinguish the original real voice from a usurped voice.

Under these conditions, detecting identity theft requires evaluating the corresponding countermeasures together with speaker recognition systems. The BRUEL project aims to propose the first methodology, based on a Common Criteria approach, for evaluating and certifying voice identification systems against adversarial attacks.

List of Partners:

Project Coordinator: LIA

Scientific Manager for LIA: Yannick ESTEVE

Start Date: 01/01/2023

End Date: 30/06/2026