Given a word in context, the task of Visual Word Sense Disambiguation consists of selecting the correct image among a set of candidates. To select the correct image, we propose a solution blending text augmentation and multimodal models. Text augmentation leverages the fine-grained semantic annotation from WordNet to get a better representation of the textual component. We then compare this sense-augmented text to the set of candidate images using the pre-trained multimodal models CLIP and ViLT. Our system ranked 16th for the English language, achieving 68.5 points for hit rate and 79.2 for mean reciprocal rank. The code for this project is available on GitHub.
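The two reported scores can be illustrated with a small sketch. It is not the authors' evaluation code: the function names, file names and toy rankings are hypothetical, and it assumes that "hit rate" means hit rate at rank 1 (the gold image is ranked first), which the abstract does not state explicitly.

```python
from typing import Sequence


def reciprocal_rank(ranked: Sequence[str], gold: str) -> float:
    """Return 1/rank of the gold image in a ranked candidate list, or 0.0 if absent."""
    for position, candidate in enumerate(ranked, start=1):
        if candidate == gold:
            return 1.0 / position
    return 0.0


def hit_rate_at_1(rankings: Sequence[Sequence[str]], golds: Sequence[str]) -> float:
    """Fraction of instances whose top-ranked candidate is the gold image (assumed cutoff: 1)."""
    if not golds:
        raise ValueError("at least one instance is required")
    hits = sum(1 for ranked, gold in zip(rankings, golds) if ranked and ranked[0] == gold)
    return hits / len(golds)


def mean_reciprocal_rank(rankings: Sequence[Sequence[str]], golds: Sequence[str]) -> float:
    """Average of the reciprocal ranks of the gold images over all instances."""
    if not golds:
        raise ValueError("at least one instance is required")
    total = sum(reciprocal_rank(ranked, gold) for ranked, gold in zip(rankings, golds))
    return total / len(golds)


# Hypothetical toy data: each inner list is one instance's candidates, best first.
toy_rankings = [
    ["a1.jpg", "a2.jpg", "a3.jpg", "a4.jpg"],  # gold at rank 1
    ["b1.jpg", "b2.jpg", "b3.jpg", "b4.jpg"],  # gold at rank 2
    ["c1.jpg", "c2.jpg", "c3.jpg", "c4.jpg"],  # gold at rank 4
]
toy_golds = ["a1.jpg", "b2.jpg", "c4.jpg"]

toy_hit_rate = hit_rate_at_1(toy_rankings, toy_golds)
toy_mrr = mean_reciprocal_rank(toy_rankings, toy_golds)
print(f"hit rate@1 = {toy_hit_rate:.3f}, MRR = {toy_mrr:.3f}")
```

On the toy data, only the first instance counts as a hit, while the reciprocal ranks are 1, 1/2 and 1/4. This is why mean reciprocal rank is never lower than hit rate at 1, consistent with the reported 79.2 versus 68.5.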

GPL at SemEval-2023 Task 1: WordNet and CLIP to disambiguate images / Zhang, Shibingfeng; Nath, Shantanu; Mazzaccara, Davide. - Electronic. - (2023), pp. 1592-1597. (17th International Workshop on Semantic Evaluation, SemEval 2023, co-located with the 61st Annual Meeting of the Association for Computational Linguistics, ACL 2023, Toronto, Canada, 13th-14th July 2023) [10.18653/v1/2023.semeval-1.219].

GPL at SemEval-2023 Task 1: WordNet and CLIP to disambiguate images

Zhang, Shibingfeng; Nath, Shantanu; Mazzaccara, Davide
2023-01-01

2023
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)
Toronto, Canada
Association for Computational Linguistics
9781959429999
Zhang, Shibingfeng; Nath, Shantanu; Mazzaccara, Davide
Files in this item:
2023.semeval-1.219.pdf (open access)
Type: Publisher's version (publisher's layout)
License: Creative Commons
Size: 1.19 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/388134