
AlignCAT: Visual-Linguistic Alignment of Category and Attribute for Weakly Supervised Visual Grounding / Wang, Yidan; Zhuang, Chenyi; Liu, Wutao; Gao, Pan; Sebe, Nicu. - (2025), pp. 5100-5109. (ACM Multimedia, Dublin, October 2025) [10.1145/3746027.3755751].

AlignCAT: Visual-Linguistic Alignment of Category and Attribute for Weakly Supervised Visual Grounding

Zhuang, Chenyi; Sebe, Nicu
2025-01-01

Abstract

Weakly supervised visual grounding (VG) aims to locate objects in images based on text descriptions. Despite significant progress, existing methods lack strong cross-modal reasoning to distinguish subtle semantic differences in text expressions due to category-based and attribute-based ambiguity. To address these challenges, we introduce AlignCAT, a novel query-based semantic matching framework for weakly supervised VG. To enhance visual-linguistic alignment, we propose a coarse-grained alignment module that utilizes category information and global context, effectively mitigating interference from category-inconsistent objects. Subsequently, a fine-grained alignment module leverages descriptive information and captures word-level text features to achieve attribute consistency. By exploiting linguistic cues to their fullest extent, our proposed AlignCAT progressively filters out misaligned visual queries and enhances contrastive learning efficiency. Extensive experiments on three VG benchmarks, namely RefCOCO, RefCOCO+, and RefCOCOg, verify the superiority of AlignCAT against existing weakly supervised methods on two VG tasks. Our code is available at: https://github.com/I2-Multimedia-Lab/AlignCAT.
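The abstract describes a coarse-to-fine selection scheme: category-inconsistent visual queries are filtered out first, then the surviving queries are ranked by attribute consistency. The paper's actual modules are learned; the snippet below is only a minimal sketch of that progressive-filtering idea using cosine similarity over embedding vectors, with all thresholds and embeddings invented for illustration (`cat_thresh`, `category_emb`, `attribute_emb` are not from the paper).

```python
import numpy as np

def cosine_sim(rows, vec):
    # Cosine similarity between each row of `rows` and the vector `vec`.
    rows = rows / np.linalg.norm(rows, axis=-1, keepdims=True)
    vec = vec / np.linalg.norm(vec)
    return rows @ vec

def progressive_filter(queries, category_emb, attribute_emb, cat_thresh=0.5):
    """Coarse-to-fine query selection (illustrative only):
    drop category-inconsistent queries, then pick the best
    attribute match among the survivors."""
    cat_scores = cosine_sim(queries, category_emb)
    keep = np.where(cat_scores >= cat_thresh)[0]      # coarse-grained stage
    if keep.size == 0:                                # fall back to best category match
        keep = np.array([int(np.argmax(cat_scores))])
    attr_scores = cosine_sim(queries[keep], attribute_emb)
    best = int(keep[int(np.argmax(attr_scores))])     # fine-grained stage
    return best, keep.tolist()

# Toy example: three 4-d "visual query" embeddings.
queries = np.array([
    [1.0, 0.0, 0.0, 0.0],   # wrong category
    [0.0, 1.0, 0.1, 0.0],   # right category, wrong attribute
    [0.0, 1.0, 0.0, 1.0],   # right category, right attribute
])
category_emb = np.array([0.0, 1.0, 0.0, 0.0])
attribute_emb = np.array([0.0, 0.0, 0.0, 1.0])
best, kept = progressive_filter(queries, category_emb, attribute_emb)
print(best, kept)  # prints "2 [1, 2]"
```

Only queries 1 and 2 pass the coarse category check; the attribute stage then selects query 2, mirroring how the described pipeline narrows candidates before the final match.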
2025
MM '25: Proceedings of the 33rd ACM International Conference on Multimedia
New York
Association for Computing Machinery, Inc
979-8-4007-2035-2
Wang, Yidan; Zhuang, Chenyi; Liu, Wutao; Gao, Pan; Sebe, Nicu
Files in this record:

AlignCAT.pdf
  Access: open access
  Type: Published version (Publisher's layout)
  License: All rights reserved
  Size: 660.71 kB
  Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/467772
Citations
  • PMC: not available
  • Scopus: 0
  • Web of Science: not available
  • OpenAlex: 0