Given a word in context, the task of Visual Word Sense Disambiguation consists of selecting the correct image among a set of candidates. To select the correct image, we propose a solution blending text augmentation and multimodal models. Text augmentation leverages the fine-grained semantic annotation from WordNet to get a better representation of the textual component. We then compare this sense-augmented text to the image set using the pre-trained multimodal models CLIP and ViLT. Our system ranked 16th for the English language, achieving 68.5 points for hit rate and 79.2 for mean reciprocal rank.
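The two reported metrics can be illustrated with a minimal sketch. It assumes that "hit rate" means hit rate at rank 1 (the gold image is the top-ranked candidate) and that each instance has exactly one gold image. The function names and the toy image ids are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def hit_rate_at_1(rankings: Sequence[Sequence[str]], gold: Sequence[str]) -> float:
    """Fraction of instances whose top-ranked candidate is the gold image."""
    hits = sum(1 for ranked, g in zip(rankings, gold) if ranked and ranked[0] == g)
    return hits / len(gold)


def mean_reciprocal_rank(rankings: Sequence[Sequence[str]], gold: Sequence[str]) -> float:
    """Average of 1 / (1-based rank of the gold image); 0 if the gold image is absent."""
    total = 0.0
    for ranked, g in zip(rankings, gold):
        if g in ranked:
            total += 1.0 / (list(ranked).index(g) + 1)
    return total / len(gold)


# Toy example: three instances, gold image "a" ranked 1st, 2nd, and 3rd.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
gold = ["a", "a", "a"]
hr = hit_rate_at_1(rankings, gold)          # 1/3
mrr = mean_reciprocal_rank(rankings, gold)  # (1 + 1/2 + 1/3) / 3 = 11/18
```

Multiplying either value by 100 gives a score on the same scale as the 68.5 and 79.2 figures above.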
GPL at SemEval-2023 Task 1: WordNet and CLIP to disambiguate images / Zhang, Shibingfeng; Nath, Shantanu; Mazzaccara, Davide. - ELECTRONIC. - (2023), pp. 1592-1597. (Paper presented at the SemEval-2023 conference held in Toronto, Canada, on 13-14 July 2023) [10.18653/v1/2023.semeval-1.219].
GPL at SemEval-2023 Task 1: WordNet and CLIP to disambiguate images
Mazzaccara, Davide
2023-01-01
File | Size | Format |
---|---|---|
2023.semeval-1.219.pdf (open access; type: publisher's version, publisher's layout; license: Creative Commons) | 1.19 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.