Multilingual Spoken Language Understanding: Efficient Speech Dataset Collection, Architectural Exploration, and Zero-Shot SLU / Lee, Beomseok. - (2026 Apr 23), pp. 1-178.

Multilingual Spoken Language Understanding: Efficient Speech Dataset Collection, Architectural Exploration, and Zero-Shot SLU

Lee, Beomseok
2026-04-23

Abstract

Spoken Language Understanding (SLU) is a key technology for interpreting a speaker’s intention through intent classification (IC) and extracting relevant information via slot filling (SF). For example, when a user asks, “What is the weather like in London tomorrow?”, the intent is ‘Get Weather’ and the slots are ‘location: London’ and ‘date: tomorrow’. This PhD thesis presents a three-year investigation aimed at advancing SLU technology. A central challenge in SLU is data scarcity: unlike Automatic Speech Recognition (ASR) or Speech Translation (ST), SLU has limited publicly available data, amounting to 760 hours across multiple languages, of which only 300 hours are real human speech. To address this, I developed a multilingual SLU dataset covering 12 languages with human recordings. During crowd-sourced SLU data collection, I observed that validation costs could reach half of the recording expenses; to improve efficiency while maintaining quality, I employed Speech Foundation Models (SFMs) for validation, reducing costs by 40%. Using this dataset, I explored SFMs for SLU, demonstrating effective knowledge transfer between related languages and bridging end-to-end (E2E) and cascaded SLU models through multi-task fine-tuning. I then focused on Speech-enabled Large Language Models (Speech-LLMs) because of their strong zero-shot and instruction-following capabilities. As part of the IWSLT Instruction-following Speech-LLM shared task, I developed models for ASR, ST, and Spoken Question Answering, gained insights into optimizing instruction following and improving performance, and ultimately achieved top results. Building on this work, and motivated by the fact that conventional SLU models are typically constrained to seen intents and slots, a gap between current research and practical deployment, I evaluated Speech-LLMs for multilingual zero-shot SLU. Without task-specific training, Speech-LLMs demonstrated promising performance in identifying speaker intentions, which improved further with self-enriched context. Overall, this thesis advances SLU through three key contributions: the creation of a large multilingual human-voice dataset and an efficient data collection protocol; the bridging of cascaded and E2E SLU models using SFMs; and the pioneering application of Speech-LLMs to multilingual zero-shot SLU and broader speech tasks. These contributions collectively enhance the state of the art and bring SLU closer to real-world deployment.
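To make the IC/SF output described above concrete, the following minimal Python sketch (illustrative only, not code from the thesis; the class name and fields are hypothetical) shows the kind of structured prediction an SLU system would return for the example utterance:

from dataclasses import dataclass, field

@dataclass
class SLUOutput:
    intent: str                                 # intent classification (IC) result
    slots: dict = field(default_factory=dict)   # slot filling (SF) result

# "What is the weather like in London tomorrow?"
prediction = SLUOutput(
    intent="Get Weather",
    slots={"location": "London", "date": "tomorrow"},
)
print(prediction.intent)   # Get Weather
print(prediction.slots)    # {'location': 'London', 'date': 'tomorrow'}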
Cycle: XXXVIII
Academic year: 2024-2025
Doctoral programme: Ingegneria e scienza dell'Informaz (29/10/12-)
Curriculum: Industrial Innovation
Supervisors: Negri, Matteo; Besacier, Laurent
no
Language: English
Files in this item:
File: phd_unitn_Lee_Beomseok.pdf
Access: open access
Type: Doctoral Thesis (Tesi di dottorato)
License: Creative Commons
Size: 6.58 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/483811