SLG Seminar – Yanis Labrak – 27/03/2025

The next SLG team meeting will take place on Thursday, March 27, in room S4, from 12:00 to 13:00.

Title: Text-Speech Language Models with Improved Cross-Modal Transfer by Aligning Abstraction Levels
Abstract: Text-Speech Language Models (TSLMs), language models trained to jointly process and generate text and speech, aim to enable cross-modal knowledge transfer to overcome the scaling limitations of unimodal speech LMs. The predominant approach to TSLM training expands the vocabulary of a pre-trained text LM by appending new embeddings and linear projections for speech, followed by fine-tuning on speech data. We hypothesize that this method limits cross-modal transfer by neglecting feature compositionality, preventing text-learned functions from being fully leveraged at the appropriate abstraction levels. To address this, we propose augmenting vocabulary expansion with modules that better align abstraction levels between speech and text across the model's layers. Representation analyses and improved multimodal performance suggest that our method enhances cross-modal transfer, rivaling or even surpassing state-of-the-art TSLMs trained with orders of magnitude more compute.
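
For readers unfamiliar with the vocabulary-expansion baseline the abstract describes, the sketch below illustrates the general idea in Python with Hugging Face Transformers: new embedding rows (and, via weight tying, output-projection rows) are appended to a pre-trained text LM for discrete speech units, after which the model would be fine-tuned on speech data. The model name, the number of speech units, and the token format are illustrative assumptions, not details from the talk; the alignment modules the speaker proposes are not shown here.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Illustrative choices (not from the talk): a small pre-trained text LM
    # and 500 discrete speech units from some acoustic tokenizer.
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    tokenizer = AutoTokenizer.from_pretrained("gpt2")

    # Append one new token per speech unit to the text vocabulary.
    speech_tokens = [f"<speech_{i}>" for i in range(500)]
    tokenizer.add_tokens(speech_tokens)

    # Grows the embedding matrix (and the tied LM head) with freshly
    # initialized rows for the speech tokens; the pre-trained text rows
    # keep their weights. Fine-tuning on speech data would follow.
    model.resize_token_embeddings(len(tokenizer))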