GPL at SemEval-2023 Task 1: WordNet and CLIP to disambiguate images / Zhang, Shibingfeng; Nath, Shantanu; Mazzaccara, Davide. - Electronic. - (2023), pp. 1592-1597. (17th International Workshop on Semantic Evaluation, SemEval 2023, co-located with the 61st Annual Meeting of the Association for Computational Linguistics, ACL 2023, Toronto, Canada, 13th-14th July 2023) [10.18653/v1/2023.semeval-1.219].
GPL at SemEval-2023 Task 1: WordNet and CLIP to disambiguate images
Zhang, Shibingfeng; Nath, Shantanu; Mazzaccara, Davide
2023-01-01
Abstract
Given a word in context, the task of Visual Word Sense Disambiguation consists of selecting the correct image from a set of candidates. To do so, we propose a solution that blends text augmentation with multimodal models. Text augmentation leverages the fine-grained semantic annotation in WordNet to obtain a better representation of the textual component. We then compare this sense-augmented text to the set of candidate images using the pre-trained multimodal models CLIP and ViLT. Our system ranked 16th for English, achieving a hit rate of 68.5 and a mean reciprocal rank of 79.2. The code for this project is available on GitHub.

| File | Access | Type | License | Size | Format |
|---|---|---|---|---|---|
| 2023.semeval-1.219.pdf | Open access | Publisher's layout | Creative Commons | 1.19 MB | Adobe PDF |
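
The abstract reports two ranking metrics, hit rate and mean reciprocal rank, computed after comparing a text representation with a set of candidate images. The following is a minimal sketch of that evaluation, not the authors' implementation. It assumes that hit rate means the gold image is ranked first, and it uses small hand-written vectors as hypothetical stand-ins for the text and image embeddings that a model such as CLIP would produce. The function names `cosine`, `rank_candidates`, and `hit_rate_and_mrr` are illustrative only.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(text_vec, image_vecs):
    """Return candidate indices sorted from most to least similar to the text."""
    scores = [cosine(text_vec, v) for v in image_vecs]
    return sorted(range(len(image_vecs)), key=lambda i: scores[i], reverse=True)


def hit_rate_and_mrr(rankings, gold):
    """Compute (hit rate at rank 1, mean reciprocal rank) over all instances.

    Assumption: hit rate counts an instance as correct only when the gold
    image is ranked first.
    """
    hits = 0
    reciprocal_sum = 0.0
    for ranking, gold_index in zip(rankings, gold):
        position = ranking.index(gold_index) + 1  # 1-based rank of the gold image
        hits += position == 1
        reciprocal_sum += 1.0 / position
    n = len(gold)
    return hits / n, reciprocal_sum / n


# Toy data: two instances with three candidate images each (hypothetical embeddings).
texts = [[1.0, 0.0], [0.0, 1.0]]
candidates = [
    [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]],
    [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]],
]
gold = [1, 1]

rankings = [rank_candidates(t, c) for t, c in zip(texts, candidates)]
hit_rate, mrr = hit_rate_and_mrr(rankings, gold)
```

In this toy example the gold image is ranked first for the first instance and second for the second, so the hit rate is 0.5 and the mean reciprocal rank is 0.75. The scores in the abstract are the same quantities expressed as percentages.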
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.



