Direct Speech Translation Toward High-Quality, Inclusive, and Augmented Systems / Gaido, Marco. - (2023 Apr 28), pp. 1-307. [10.15168/11572_374507]

Direct Speech Translation Toward High-Quality, Inclusive, and Augmented Systems

Gaido, Marco
2023-04-28

Abstract

When this PhD started, translating speech into text in a different language was mainly tackled with a cascade of automatic speech recognition (ASR) and machine translation (MT) models, as the emerging direct speech translation (ST) models were not yet competitive. To close this gap, part of the PhD was devoted to improving the quality of direct models, both in the simplified condition of test sets where the audio is split into well-formed sentences and in the realistic condition in which the audio is automatically segmented. First, we investigated how to transfer knowledge from MT models trained on large corpora. Then, we defined encoder architectures that give different weights to the vectors in the input sequence, reflecting how the amount of information in speech varies over time. Finally, we reduced the adverse effects of suboptimal automatic audio segmentation in two ways: on the one hand, we created models robust to this condition; on the other, we improved the audio segmentation itself.

The good results achieved in overall translation quality allowed us to investigate specific behaviors of direct ST systems that are crucial to satisfying real users’ needs. On the one hand, driven by the ethical goal of inclusive systems, we showed that established technical choices geared toward high general performance (statistical word segmentation of the target text, knowledge distillation from MT) exacerbate the gender representational disparities in the training data. Along this line of work, we proposed mitigation techniques that reduce the gender bias of ST models, and showed how gender-specific systems can be used to control the translation of gendered words related to the speakers, regardless of their vocal traits.

On the other hand, motivated by the practical needs of interpreters and translators, we evaluated the potential of direct ST systems in the “augmented translation” scenario, focusing on the translation and recognition of named entities (NEs). Along this line of work, we proposed solutions to cope with the major weakness of ST models (handling person names), and introduced direct models that jointly perform ST and NE recognition, showing their superiority over a pipeline of dedicated tools for the two tasks. Overall, we believe that this thesis moves a step toward the adoption of direct ST systems in real applications, increasing awareness of their strengths and weaknesses compared to the traditional cascade paradigm.
XXXV
2021-2022
Ingegneria e scienza dell'Informaz (29/10/12-)
Information and Communication Technology
Turchi, Marco
Negri, Matteo
no
English
Files in this item:
File: phd_unitn_Gaido_Marco.pdf
Open access
Description: PhD Thesis
Type: Doctoral Thesis
License: Creative Commons
Size: 3.55 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/374507