Multilingual Spoken Language Understanding: Efficient Speech Dataset Collection, Architectural Exploration, and Zero-Shot SLU / Lee, Beomseok. - (2026 Apr 23), pp. 1-178.

Multilingual Spoken Language Understanding: Efficient Speech Dataset Collection, Architectural Exploration, and Zero-Shot SLU

Lee, Beomseok
2026-04-23

Abstract

Spoken Language Understanding (SLU) is a key technology for interpreting a speaker’s intention through intent classification (IC) and extracting relevant information via slot filling (SF). For example, when a user asks, “What is the weather like in London tomorrow?”, the intent is ‘Get Weather’ and the slots are ‘location: London’ and ‘date: tomorrow’. This PhD thesis presents a three-year investigation aimed at advancing SLU technology. A central challenge in SLU is data scarcity: unlike Automatic Speech Recognition (ASR) or Speech Translation (ST), SLU has limited publicly available data, amounting to 760 hours across multiple languages, of which only 300 hours are real human speech. To address this, I developed a multilingual SLU dataset covering 12 languages with human recordings. During crowd-sourced SLU data collection, I observed that validation costs could reach half of the recording expenses; to improve efficiency while maintaining quality, I employed Speech Foundation Models (SFMs) for validation, reducing costs by 40%. Using this dataset, I explored SFMs for SLU, demonstrating effective knowledge transfer between related languages and bridging end-to-end (E2E) and cascaded SLU models through multi-task fine-tuning. I then focused on Speech-enabled Large Language Models (Speech-LLMs) because of their strong zero-shot and instruction-following capabilities. As part of the IWSLT Instruction-following Speech-LLM shared task, I developed models for ASR, ST, and Spoken Question Answering, gained insights into optimizing instruction following and improving performance, and ultimately achieved top results. Building on this work, and motivated by the fact that conventional SLU models are typically constrained to seen intents and slots, a gap between current research and practical deployment, I evaluated Speech-LLMs for multilingual zero-shot SLU. Without task-specific training, Speech-LLMs demonstrated promising performance in identifying speaker intentions, which improved further with self-enriched context. Overall, this thesis advances SLU through three key contributions: the creation of a large multilingual human-voice dataset and an efficient data collection protocol; the bridging of cascaded and E2E SLU models using SFMs; and the pioneering application of Speech-LLMs to multilingual zero-shot SLU and broader speech tasks. These contributions collectively enhance the state of the art and bring SLU closer to real-world deployment.
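To make the IC/SF output described above concrete, the following minimal Python sketch (illustrative only, not code from the thesis; the class name and fields are hypothetical) shows the kind of structured prediction an SLU system would return for the example utterance:

from dataclasses import dataclass, field

@dataclass
class SLUOutput:
    intent: str                                 # intent classification (IC) result
    slots: dict = field(default_factory=dict)   # slot filling (SF) result

# "What is the weather like in London tomorrow?"
prediction = SLUOutput(
    intent="Get Weather",
    slots={"location": "London", "date": "tomorrow"},
)
print(prediction.intent)   # Get Weather
print(prediction.slots)    # {'location': 'London', 'date': 'tomorrow'}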
Cycle: XXXVIII
Academic year: 2024-2025
Doctoral programme: Ingegneria e scienza dell'Informaz (29/10/12-)
Curriculum: Industrial Innovation
Supervisors: Negri, Matteo; Besacier, Laurent
no
Language: English
Files in this item:
File: phd_unitn_Lee_Beomseok.pdf
Access: open access
Type: Doctoral Thesis (Tesi di dottorato)
License: Creative Commons
Size: 6.58 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/483811