
Textual Knowledge Matters: Cross-Modality Co-teaching for Generalized Visual Class Discovery / Zheng, Haiyang; Pu, Nan; Li, Wenjing; Sebe, Nicu; Zhong, Zhun. - 15110:(2025), pp. 41-58. (Paper presented at the 18th European Conference on Computer Vision, ECCV 2024, held in Milan, Sept. 2024) [10.1007/978-3-031-72943-0_3].

Textual Knowledge Matters: Cross-Modality Co-teaching for Generalized Visual Class Discovery

Zheng, Haiyang; Pu, Nan; Li, Wenjing; Sebe, Nicu; Zhong, Zhun
2025-01-01

Abstract

In this paper, we study the problem of Generalized Category Discovery (GCD), which aims to cluster unlabeled data from both known and unknown categories using the knowledge of labeled data from known categories. Current GCD methods rely solely on visual cues, neglecting the multi-modal perceptive nature of human cognitive processes in discovering novel visual categories. To address this, we propose a two-phase TextGCD framework that accomplishes multi-modality GCD by exploiting powerful Visual-Language Models. TextGCD mainly comprises a retrieval-based text generation (RTG) phase and a cross-modality co-teaching (CCT) phase. First, RTG constructs a visual lexicon using category tags from diverse datasets and attributes from Large Language Models, and generates descriptive texts for images in a retrieval manner. Second, CCT leverages disparities between the textual and visual modalities to foster mutual learning, thereby enhancing visual GCD. In addition, we design an adaptive class ali...
2025
Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science
GEWERBESTRASSE 11, CHAM, CH-6330, SWITZERLAND
Springer Science and Business Media Deutschland GmbH
9783031729423
9783031729430
Zheng, Haiyang; Pu, Nan; Li, Wenjing; Sebe, Nicu; Zhong, Zhun
Files in this record:

06840.pdf
Embargo until 29/11/2025
Type: Refereed author's manuscript (post-print)
License: All rights reserved
Size: 2.57 MB
Format: Adobe PDF
Textual_Knowledge_Matters.pdf
Access: archive administrators only
Type: Publisher's version (publisher's layout)
License: All rights reserved
Size: 1.21 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this record: https://hdl.handle.net/11572/439536
Citations
  • PMC: n/a
  • Scopus: 0
  • Web of Science: 0
  • OpenAlex: n/a