Pre-trained Vision and Language Transformers achieve high performance on downstream tasks due to their ability to transfer representational knowledge accumulated during pretraining on substantial amounts of data. In this paper, we ask whether it is possible to compete with such models using features based on transferred (pre-trained, frozen) representations combined with a lightweight architecture. As our testbed, we take a multimodal guessing task, GuessWhat?!. An ensemble of our lightweight model matches the performance of the finetuned pre-trained transformer (LXMERT). An uncertainty analysis of our ensemble shows that the lightweight transferred representations close the data uncertainty gap with LXMERT, while retaining the model diversity that leads to an ensemble boost. We further demonstrate that LXMERT’s performance gain is due solely to its extra V&L pretraining rather than to architectural improvements. These results argue for flexible integration of multiple features and lightweight models as a viable alternative to large, cumbersome, pre-trained models.

A Small but Informed and Diverse Model: The Case of the Multimodal GuessWhat!? Guessing Game / Greco, Claudio; Testoni, Alberto; Bernardi, Raffaella; Frank, Stella. - ELECTRONIC. - (2022), pp. 1-10. (Paper presented at the CLASP conference, held in Gothenburg on 15-16 September 2022).

A Small but Informed and Diverse Model: The Case of the Multimodal GuessWhat!? Guessing Game

Greco, Claudio; Testoni, Alberto; Bernardi, Raffaella; Frank, Stella
2022-01-01

Year: 2022
Published in: Proceedings of the 2022 CLASP Conference on (Dis)embodiment
Place of publication: USA
Publisher: Association for Computational Linguistics
ISBN: 978-1-955917-67-4
Files in this item:
File: 2022.clasp-1.1.pdf
Access: open access
Description: paper
Type: Publisher's version (Publisher’s layout)
License: Creative Commons
Size: 535.72 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/365192