Title: Analysis and understanding of the evaluation of automatic speech
recognition systems: towards metrics integrating human perception.
Date: Friday, January 17, at 2:00 pm
Place: Amphithéâtre du bâtiment 34, LS2N, Campus Lombarderie, 2 chemin de la
Houssinière, 44000 Nantes.
The defense will be presented in French.
Abstract:
Today, the word error rate (WER) remains the most widely used metric for
evaluating automatic speech recognition (ASR) systems. However, this
metric correlates imperfectly with human perception and focuses solely
on spelling preservation. In this thesis, we propose alternative metrics
that evaluate not only spelling but also grammar, semantics, and
phonetics.
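For context, WER counts the word-level substitutions (S), deletions (D),
and insertions (I) needed to turn the hypothesis into the reference,
divided by the number of reference words N: WER = (S + D + I) / N. A
minimal, self-contained computation of this standard formula is sketched
below (a toy illustration, not the evaluation tooling used in the thesis):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum number of word edits between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# 2 substitutions + 1 deletion over 6 reference words = 0.5
print(wer("le chat dort sur le canapé", "le chas dors sur canapé"))
```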
To analyze how well these metrics reflect transcript quality from the
user’s point of view, we built a dataset named HATS, annotated by 143
French-speaking subjects. Each annotator examined 50 triplets, each made
up of a manual reference transcription and two hypotheses produced by
different ASR systems, and indicated which of the two hypotheses they
judged more faithful.
By counting how often a metric agrees with the annotators’ choices, we
obtain a measure of its correlation with human perception. The corpus
can thus be used to rank metrics according to the judgment of a human
reader. Our results show that SemDist, a metric based on BERT semantic
representations to compare two sentences, is the most relevant for
evaluating transcriptions from a perceptual point of view. Conversely,
WER is among the worst performers, which calls into question its
systematic use as the sole metric while other promising alternatives
remain largely neglected.
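To make this agreement measure concrete, the sketch below assumes each
annotation is stored as a reference, the two hypotheses, and the
annotator’s preferred one; the data layout, field order, and helper
names are hypothetical and do not reflect the actual HATS release format:

```python
from typing import Callable

# One annotated item: a reference, two system hypotheses, and the annotator's
# choice ("A" or "B"). The layout is illustrative, not the actual HATS format.
Triplet = tuple[str, str, str, str]  # (reference, hyp_a, hyp_b, human_choice)

def agreement_rate(triplets: list[Triplet],
                   score: Callable[[str, str], float]) -> float:
    """Fraction of triplets where the metric prefers the same hypothesis as
    the annotator. `score(reference, hypothesis)` is any error-style metric
    where lower is better (e.g. the wer() sketched above, or
    1 - cosine similarity for a SemDist-style metric). Ties go to "B"."""
    hits = 0
    for reference, hyp_a, hyp_b, human_choice in triplets:
        metric_choice = "A" if score(reference, hyp_a) < score(reference, hyp_b) else "B"
        hits += (metric_choice == human_choice)
    return hits / len(triplets)

# Toy demo scorer: 1 - word-overlap ratio (lower is better).
def toy_score(reference: str, hypothesis: str) -> float:
    ref, hyp = set(reference.split()), set(hypothesis.split())
    return 1 - len(ref & hyp) / max(len(ref | hyp), 1)

data = [("le chat dort", "le chat dort", "le chas dors", "A")]
print(agreement_rate(data, toy_score))  # 1.0
```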
We also investigated the impact of several ASR system hyperparameters,
such as hypothesis rescoring with language models, tokenization, and the
use of self-supervised learning (SSL) modules. Beyond the qualitative
analysis of these parameters, our research reveals that each metric
evaluates different aspects of the systems and that the metrics do not
always agree on how they rank them. This disparity, combined with the
limitations of WER, justifies the use of several metrics for a more
refined evaluation.
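One way to quantify how much two metrics disagree on system ranking is a
rank correlation such as Kendall’s tau; the sketch below uses invented
scores purely for illustration (they are not results from the thesis):

```python
from scipy.stats import kendalltau

# Hypothetical scores for four ASR systems under two metrics (lower is better).
# The numbers are illustrative only, not measurements from the thesis.
wer_scores     = [0.12, 0.15, 0.18, 0.21]  # ranking: system 1 < 2 < 3 < 4
semdist_scores = [0.08, 0.05, 0.11, 0.09]  # ranking: system 2 < 1 < 4 < 3

tau, p_value = kendalltau(wer_scores, semdist_scores)
print(f"Kendall tau = {tau:.2f}")  # 0.33: the two metrics rank the systems differently
```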
Finally, we propose a novel approach to making semantic metrics more
interpretable. These metrics currently provide only raw scores based on
cosine similarities between semantic representations, which makes errors
difficult to interpret. To make them more accessible, we developed a
method called minED, which aims to improve the comprehensibility and
transparency of ASR system evaluation while taking both semantic aspects
and human perception into account. In addition, a variant of this method
evaluates the severity of each error for the overall comprehension of a
sentence, thus providing valuable information not only about system
errors but also about how the metrics themselves behave.
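The abstract does not describe minED’s internals, so the sketch below
makes no attempt to reproduce it. Purely as a generic illustration of
the idea of scoring each error by its impact on overall sentence
comprehension, it measures how much the sentence-level cosine similarity
recovers when a single erroneous word is corrected; the embedding model,
the one-to-one alignment assumption, and all names are assumptions of
this sketch, not the thesis’s algorithm:

```python
# Illustrative only: NOT the minED method, just a generic per-error severity idea.
from sentence_transformers import SentenceTransformer, util

# Assumed multilingual sentence-embedding model (any similar model would do).
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def similarity(a: str, b: str) -> float:
    emb = model.encode([a, b])
    return float(util.cos_sim(emb[0], emb[1]))

def error_severity(reference: str, hypothesis: str) -> dict[int, float]:
    """Severity of each substituted word, assuming the two sentences align
    one-to-one. Severity = gain in semantic similarity when that single
    word is corrected; a larger gain means a more damaging error."""
    ref, hyp = reference.split(), hypothesis.split()
    base = similarity(reference, hypothesis)
    severities = {}
    for i, (r, h) in enumerate(zip(ref, hyp)):
        if r != h:
            repaired = hyp.copy()
            repaired[i] = r  # correct only this word
            severities[i] = similarity(reference, " ".join(repaired)) - base
    return severities

print(error_severity("le chat dort sur le canapé", "le rat dort sur le canapé"))
```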