Contextualized translation of automatically segmented speech

Gaido, M.; Di Gangi, M. A.; Negri, M.; Cettolo, M.; Turchi, M.

doi:10.21437/Interspeech.2020-2860

Direct speech-to-text translation (ST) models are usually trained on corpora segmented at sentence level, but at inference time they are commonly fed audio split by a voice activity detector (VAD). Since VAD segmentation is not syntax-informed, the resulting segments do not necessarily correspond to well-formed sentences uttered by the speaker but, most likely, to fragments of one or more sentences. This segmentation mismatch considerably degrades the quality of ST models' output. So far, researchers have focused on improving audio segmentation to produce sentence-like splits. In this paper, instead, we address the issue in the model, making it more robust to a different, potentially sub-optimal segmentation. To this end, we train our models on randomly segmented data and compare two approaches: fine-tuning and adding the previous segment as context. We show that our context-aware solution is more robust to VAD-segmented input, outperforming both a strong base model and the fine-tuned model on different VAD segmentations of an English-German test set by up to 4.25 BLEU points.
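The two ingredients named in the abstract, random re-segmentation of the training data and the use of the previous segment as context, can be illustrated with a minimal sketch. It is not the authors' implementation: the `Word` alignment format, the `Segment` class, the function names, and the segment-length bounds are all assumptions made for illustration, and only the source (audio/transcript) side is covered, not the splitting of the target translation.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical alignment format: (token, start_sec, end_sec).
Word = Tuple[str, float, float]


@dataclass
class Segment:
    start: float  # start time of the first word, in seconds
    end: float    # end time of the last word, in seconds
    text: str     # space-joined tokens covered by the segment


def random_segments(words: List[Word], min_len: int, max_len: int,
                    rng: random.Random) -> List[Segment]:
    """Cut a time-ordered word sequence at random points, ignoring sentence boundaries."""
    if not 1 <= min_len <= max_len:
        raise ValueError("need 1 <= min_len <= max_len")
    segments: List[Segment] = []
    i = 0
    while i < len(words):
        n = rng.randint(min_len, max_len)  # inclusive on both ends
        chunk = words[i:i + n]             # the last chunk may be shorter than min_len
        segments.append(Segment(chunk[0][1], chunk[-1][2],
                                " ".join(token for token, _, _ in chunk)))
        i += n
    return segments


def with_previous_context(
        segments: List[Segment]) -> List[Tuple[Optional[Segment], Segment]]:
    """Pair each segment with the one before it; the first segment has no context."""
    pairs: List[Tuple[Optional[Segment], Segment]] = []
    prev: Optional[Segment] = None
    for seg in segments:
        pairs.append((prev, seg))
        prev = seg
    return pairs


# Toy input: two sentences, each word lasting 0.5 s (assumed timings).
tokens = "this is the first sentence . and here comes the second one .".split()
words = [(tok, 0.5 * k, 0.5 * (k + 1)) for k, tok in enumerate(tokens)]

segments = random_segments(words, min_len=2, max_len=5, rng=random.Random(0))
pairs = with_previous_context(segments)
for context, current in pairs:
    print(repr(context.text if context else None), "->", repr(current.text))
```

In a context-aware model, each `(context, current)` pair would be fed together so that a fragment can be translated with knowledge of what preceded it; how the context is actually encoded and combined is described in the paper itself and is not shown here.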

Contextualized translation of automatically segmented speech / Gaido, M.; Di Gangi, M. A.; Negri, M.; Cettolo, M.; Turchi, M. - 2020-:(2020), pp. 1471-1475. (Paper presented at the 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020, held in Shanghai, China, 25-29 October 2020) [10.21437/Interspeech.2020-2860].

2020-01-01

Date of publication: 2020
Proceedings title: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Place of publication: Online
Publisher: International Speech Communication Association
Scopus identifier: 2-s2.0-85098172694
WOS identifier: WOS:000833594101125
Publication type: 04.1 Paper in Proceedings

Files in this record:

File	Size	Format
gaido20_interspeech.pdf (open access; description: main article; type: publisher's layout; license: all rights reserved)	687.51 kB	Adobe PDF