Speechformer: Reducing Information Loss in Direct Speech Translation / Papi, S.; Gaido, M.; Negri, M.; Turchi, M. - (2021), pp. 1698-1706. (Paper presented at the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, held online and in Punta Cana, Dominican Republic, 7-11 November 2021.)

Speechformer: Reducing Information Loss in Direct Speech Translation

Papi S.; Gaido M.; Turchi M.
2021-01-01

Abstract

Transformer-based models have gained increasing popularity, achieving state-of-the-art performance in many research fields, including speech translation. However, the Transformer's quadratic complexity with respect to the input sequence length prevents its adoption as-is with audio signals, which are typically represented by long sequences. Current solutions resort to an initial sub-optimal compression based on a fixed sampling of raw audio features. As a result, potentially useful linguistic information is not accessible to the higher-level layers of the architecture. To solve this issue, we propose Speechformer, an architecture that, thanks to reduced memory usage in the attention layers, avoids the initial lossy compression and aggregates information only at a higher level, according to more informed linguistic criteria. Experiments on three language pairs (en→de/es/nl) show the efficacy of our solution, with gains of up to 0.8 BLEU on the standard MuST-C corpus and of up to 4.0 BLEU in a low-resource scenario.
2021
EMNLP 2021 - 2021 Conference on Empirical Methods in Natural Language Processing, Proceedings
Online and Punta Cana, Dominican Republic
Association for Computational Linguistics (ACL)
Papi, S.; Gaido, M.; Negri, M.; Turchi, M.
Files in this record:
There are no files associated with this record.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/369994
Warning: the displayed data have not been validated by the university.

Citations
  • PMC: ND
  • Scopus: 14
  • ISI (Web of Science): ND