Democratizing Fine-grained Visual Recognition with Large Language Models

Liu, Mingxuan; Roy, Subhankar; Li, Wenjing; Zhong, Zhun; Sebe, Nicu; Ricci, Elisa

Identifying subordinate-level categories from images is a longstanding task in computer vision, referred to as fine-grained visual recognition (FGVR). It has tremendous significance in real-world applications, since an average layperson does not excel at differentiating species of birds or mushrooms due to the subtle differences among them. A major bottleneck in developing FGVR systems is the need for high-quality paired expert annotations. To circumvent the need for expert knowledge, we propose Fine-grained Semantic Category Reasoning (FineR), which internally leverages the world knowledge of large language models (LLMs) as a proxy in order to reason about fine-grained category names. In detail, to bridge the modality gap between images and LLMs, we extract part-level visual attributes from images as text and feed that information to an LLM. Based on the visual attributes and its internal world knowledge, the LLM reasons about the subordinate-level category names. Our training-free FineR outperforms several state-of-the-art FGVR and language-and-vision assistant models, and shows promise in working in the wild and in new domains where gathering expert annotations is arduous.
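The attribute-to-LLM step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `build_prompt` and `propose_category_names`, the prompt wording, and the `llm` callable are all hypothetical, and the sketch assumes that part-level attributes have already been extracted from each image as text by some upstream model.

```python
from typing import Callable, Dict, List


def build_prompt(attributes: List[Dict[str, str]], super_category: str) -> str:
    """Serialize part-level attributes (already extracted as text) into one prompt.

    Hypothetical prompt format, chosen only for illustration.
    """
    lines = [f"Each item below describes one {super_category} photo by its visible parts."]
    for i, attrs in enumerate(attributes, start=1):
        described = "; ".join(f"{part}: {value}" for part, value in attrs.items())
        lines.append(f"Image {i}: {described}")
    lines.append(
        f"List plausible fine-grained {super_category} category names, one per line."
    )
    return "\n".join(lines)


def propose_category_names(
    attributes: List[Dict[str, str]],
    super_category: str,
    llm: Callable[[str], str],
) -> List[str]:
    """Ask an LLM (any text-in, text-out callable) for candidate category names."""
    reply = llm(build_prompt(attributes, super_category))
    names: List[str] = []
    for raw in reply.splitlines():
        # Drop list markers such as "- " or "2. " and skip empty or repeated lines.
        name = raw.strip().lstrip("-*0123456789. ").strip()
        if name and name not in names:
            names.append(name)
    return names
```

Here `llm` stands for any function that maps a prompt string to a reply string, so the sketch can be exercised with a stub before wiring in a real model. How the candidate names are then matched back to images is outside what the abstract describes and is not shown.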

Democratizing Fine-grained Visual Recognition with Large Language Models / Liu, M., Roy, S., Li, W., Zhong, Z., Sebe, N., Ricci, E. - (2024). (12th International Conference on Learning Representations, ICLR 2024, Vienna, 2024).

2024-01-01

Date of publication: 2024
Proceedings title: 12th International Conference on Learning Representations, ICLR 2024
Place of publication: New York
Publisher: International Conference on Learning Representations, ICLR
Scopus Identifier: 2-s2.0-85197700855
All authors: Liu, Mingxuan; Roy, Subhankar; Li, Wenjing; Zhong, Zhun; Sebe, Nicu; Ricci, Elisa
Citation: Democratizing Fine-grained Visual Recognition with Large Language Models / Liu, M., Roy, S., Li, W., Zhong, Z., Sebe, N., Ricci, E. - (2024). (12th International Conference on Learning Representations, ICLR 2024, Vienna, 2024).
Appears in types: 04.1 Paper in Proceedings

Files in this item:

File	Size	Format
2401.13837v2-compressed.pdf (open access; type: publisher's layout; license: all rights reserved)	1.72 MB	Adobe PDF