Textual Knowledge Matters: Cross-Modality Co-teaching for Generalized Visual Class Discovery / Zheng, Haiyang; Pu, Nan; Li, Wenjing; Sebe, Nicu; Zhong, Zhun. - 15110:(2025), pp. 41-58. (Paper presented at the 18th European Conference on Computer Vision, ECCV 2024, held in Milan in Sept. 2024) [10.1007/978-3-031-72943-0_3].
Textual Knowledge Matters: Cross-Modality Co-teaching for Generalized Visual Class Discovery
Zheng, Haiyang; Pu, Nan; Li, Wenjing; Sebe, Nicu; Zhong, Zhun
2025-01-01
Abstract
In this paper, we study the problem of Generalized Category Discovery (GCD), which aims to cluster unlabeled data from both known and unknown categories using knowledge from labeled data of known categories. Current GCD methods rely only on visual cues, neglecting the multi-modal perceptive nature of human cognitive processes in discovering novel visual categories. To address this, we propose a two-phase TextGCD framework that accomplishes multi-modality GCD by exploiting powerful Visual-Language Models. TextGCD consists mainly of a retrieval-based text generation (RTG) phase and a cross-modality co-teaching (CCT) phase. First, RTG constructs a visual lexicon using category tags from diverse datasets and attributes from Large Language Models, and generates descriptive texts for images by retrieval. Second, CCT leverages disparities between textual and visual modalities to foster mutual learning, thereby enhancing visual GCD. In addition, we design an adaptive class ali...

File | Access | Type | License | Size | Format
---|---|---|---|---|---
06840.pdf | Embargoed until 29/11/2025 | Post-print (refereed author's manuscript) | All rights reserved | 2.57 MB | Adobe PDF
Textual_Knowledge_Matters.pdf | Archive managers only | Publisher's version (publisher's layout) | All rights reserved | 1.21 MB | Adobe PDF