PhD defense of Lucas Druart – 24/10/2024 – Laboratoire Informatique d’Avignon

Date: Thursday, 24 October 2024, at 3 pm

Venue: thesis room (salle des thèses) on the Hannah Arendt campus.

You can also attend remotely, if you wish, via the following link: https://v-au.univ-avignon.fr/live/bbb-soutenance-these-l-druart-24-octobre-2024/.

Title: Towards Contextual and Structured Spoken Task-Oriented Dialogue Understanding

Abstract: Accurately understanding users’ requests is key to providing smooth interactions with spoken Task-Oriented Dialogue (TOD) systems. Traditionally, such systems adopt cascade approaches, which combine an Automatic Speech Recognition (ASR) component with a Natural Language Understanding (NLU) one. Yet those systems still struggle to accurately map users’ complex requests to their internal representations. Recent work highlights two potential directions for improving them. On the one hand, end-to-end approaches have successfully enhanced the performance of Spoken Language Understanding (SLU) systems: they provide more robust and accurate predictions by leveraging joint optimization and paralinguistic information. On the other hand, textual datasets offer fine-grained semantic representations, which seem better suited to representing users’ complex requests.

This thesis explores both directions towards contextual and structured spoken task-oriented dialogue understanding. We first conduct a preliminary study dedicated to getting to grips with SLU in the context of TOD. We designed a cascade approach to perform spoken Dialogue State Tracking (DST) on MultiWOZ. Our approach ranked first in the Speech Aware Dialogue System Technology Challenge thanks to transcription correction and data augmentation.

Then, we proposed a novel method to perform completely neural spoken DST on both MultiWOZ and SpokenWOZ. Our approach fuses the high-dimensional representation of a textual context with the representation of the current spoken dialogue turns to condition a dialogue state decoder. While it benefits from joint optimization, especially in audio-native settings, it struggles to accurately propagate the dialogue’s context.

Finally, in response to the semantic representation gap between textual and spoken TOD datasets, we introduced the ReMEDIATES dataset and benchmark. This dataset was built with a semi-automatic annotation pipeline to enhance the French MEDIA dataset with semantic trees. The benchmark makes it possible to evaluate spoken dialogue parsing models on structured and contextual representations, which opens up perspectives for future challenges.