Speechformer: Reducing Information Loss in Direct Speech Translation / Papi, S.; Gaido, M.; Negri, M.; Turchi, M. - (2021), pp. 1698-1706. (Paper presented at the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, held online and in Punta Cana, Dominican Republic, 7-11 November 2021.)

Speechformer: Reducing Information Loss in Direct Speech Translation

Papi S.; Gaido M.; Turchi M.
2021-01-01

Abstract

Transformer-based models have gained increasing popularity, achieving state-of-the-art performance in many research fields, including speech translation. However, the Transformer's quadratic complexity with respect to the input sequence length prevents its adoption as-is with audio signals, which are typically represented by long sequences. Current solutions resort to an initial sub-optimal compression based on a fixed sampling of raw audio features. As a result, potentially useful linguistic information is not accessible to the higher-level layers of the architecture. To solve this issue, we propose Speechformer, an architecture that, thanks to reduced memory usage in the attention layers, avoids the initial lossy compression and aggregates information only at a higher level, according to more informed linguistic criteria. Experiments on three language pairs (en→de/es/nl) show the efficacy of our solution, with gains of up to 0.8 BLEU on the standard MuST-C corpus and of up to 4.0 BLEU in a low-resource scenario.
2021
EMNLP 2021 - 2021 Conference on Empirical Methods in Natural Language Processing, Proceedings
Online and Punta Cana, Dominican Republic
Association for Computational Linguistics (ACL)
Papi, S.; Gaido, M.; Negri, M.; Turchi, M.
Files in this record:
There are no files associated with this record.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/369994
Warning: the displayed data have not been validated by the university.

Citations
  • PMC: ND
  • Scopus: 14
  • ISI (Web of Science): ND