Spoken Language Understanding: from Spoken Utterances to Semantic Structures / Dinarelli, Marco. - (2010), pp. 1-129.

Spoken Language Understanding: from Spoken Utterances to Semantic Structures

Dinarelli, Marco
2010-01-01

Abstract

Over the past two decades there have been several projects on Spoken Language Understanding (SLU). In the early nineties, the DARPA ATIS project aimed at providing a natural language interface to a travel information database. Following ATIS, the DARPA Communicator project aimed at building a spoken dialog system that automatically provides information on flights and travel reservations. These two projects defined the first generation of conversational systems. In the late nineties, the ``How may I help you'' project at AT\&T, with Large Vocabulary Continuous Speech Recognition (LVCSR) and mixed-initiative spoken interfaces, started the second generation of conversational systems, which was later improved by integrating approaches based on machine learning techniques. The European-funded project LUNA aims at starting the third generation of spoken language interfaces. In the context of this project, and in contrast to previous projects, we have acquired the first Italian corpus of spontaneous speech from real users engaged in a problem-solving task. The corpus contains transcriptions and annotations based on a new multilevel protocol designed specifically for the goals of the LUNA project. The task of Spoken Language Understanding is the extraction of the meaning structure from spoken utterances in conversational systems. For this purpose, two main statistical learning paradigms have been proposed in recent decades: generative and discriminative models. The former are robust to over-fitting and less affected by noise, but they cannot easily integrate complex structures (e.g. trees). In contrast, the latter can easily integrate very complex features that capture arbitrarily long-distance dependencies, but they tend to over-fit the training data and are therefore less robust to annotation errors in the data used to learn the model. This work presents an exhaustive study of Spoken Language Understanding models, with particular focus on structural features used in a joint generative and discriminative learning framework, which combines the strengths of both approaches while training segmentation and labeling models for SLU. Its main characteristic is the use of kernel methods to encode structured features in Support Vector Machines, which in turn re-rank the hypotheses produced by a first-step SLU module based either on Stochastic Finite State Transducers or on Conditional Random Fields. Joint models based on transducers can also decode word lattices generated by large-vocabulary speech recognizers. We show the benefit of our approach with comparative experiments among generative, discriminative and joint models on some of the most representative SLU corpora, four corpora in four different languages: the ATIS corpus (English), the MEDIA corpus (French), and the Italian and Polish LUNA corpora. These also cover three different kinds of application domains: informational, transactional and problem-solving. The results, although depending on the task and to some extent on the baseline of the first-step model, show that joint models improve the state of the art in most cases, especially when only a small training set is available.
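The joint framework described in the abstract can be pictured, in very simplified form, as n-best re-ranking. The sketch below is illustrative only and is not the thesis implementation: it assumes a first-stage SLU model has already produced an n-best list of (word, concept) hypotheses, and it uses a plain binary SVM with a simple bigram-overlap kernel as a stand-in for the preference re-ranker with structure kernels; all hypotheses, concept labels and helper names are invented for illustration.

# Minimal sketch of discriminative re-ranking of SLU hypotheses.
# Assumptions (not from the thesis): toy (word, concept) hypotheses,
# a bigram-overlap kernel instead of tree/sequence kernels, and a
# binary SVM instead of a preference re-ranker.
import numpy as np
from sklearn.svm import SVC

def bigram_kernel(a, b):
    """Count shared (word, concept) bigrams between two hypotheses."""
    def bigrams(hyp):
        return set(zip(hyp, hyp[1:]))
    return float(len(bigrams(a) & bigrams(b)))

def gram_matrix(X, Y):
    """Kernel matrix in the form required by SVC(kernel='precomputed')."""
    return np.array([[bigram_kernel(x, y) for y in Y] for x in X])

# Toy training data: each hypothesis is a sequence of (word, concept) pairs,
# labelled +1 if it matches the reference annotation, -1 otherwise.
train_hyps = [
    [("flight", "REQUEST"), ("to", "null"), ("boston", "CITY.dest")],
    [("flight", "REQUEST"), ("to", "null"), ("boston", "CITY.orig")],
    [("problem", "HW.issue"), ("with", "null"), ("printer", "HW.device")],
    [("problem", "null"), ("with", "null"), ("printer", "HW.issue")],
]
train_labels = [+1, -1, +1, -1]

K_train = gram_matrix(train_hyps, train_hyps)
svm = SVC(kernel="precomputed").fit(K_train, train_labels)

def rerank(nbest):
    """Score an n-best list and return the highest-scoring hypothesis."""
    K_test = gram_matrix(nbest, train_hyps)
    scores = svm.decision_function(K_test)
    return nbest[int(np.argmax(scores))]

# N-best list assumed to come from a hypothetical first-stage model;
# the re-ranker prefers the hypothesis closer to positive training examples.
nbest = [
    [("flight", "REQUEST"), ("to", "null"), ("boston", "CITY.orig")],
    [("flight", "REQUEST"), ("to", "null"), ("boston", "CITY.dest")],
]
print(rerank(nbest))

In the thesis setting the first stage would be a Stochastic Finite State Transducer or a Conditional Random Field producing the n-best list, and the SVM would use kernels over richer structured representations; the sketch only shows the overall re-ranking flow.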
2010
XXII
2009-2010
Ingegneria e Scienza dell'Informaz (cess.4/11/12)
Information and Communication Technology
Riccardi, Giuseppe
no
English
Settore INF/01 - Informatica
Files in this item:
File: PhD-Thesis-Dinarelli.pdf
Access: open access
Type: Doctoral Thesis
License: All rights reserved
Size: 1.41 MB
Format: Adobe PDF


Use this identifier to cite or link to this item: https://hdl.handle.net/11572/367830