Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding

IRIS

The lack of a large-scale 3D-text corpus has led recent works to distill open-vocabulary knowledge from vision-language models (VLMs). However, these methods typically rely on a single VLM to align the feature spaces of 3D models within a common language space, which limits the potential of 3D models to leverage the diverse spatial and semantic capabilities encapsulated in various foundation models. In this paper, we propose Cross-modal and Uncertainty-aware Agglomeration for Open-vocabulary 3D Scene Understanding, dubbed CUA-O3D, the first model to integrate multiple foundation models, such as CLIP, DINOv2, and Stable Diffusion, into 3D scene understanding. We further introduce a deterministic uncertainty estimation to adaptively distill and harmonize the heterogeneous 2D feature embeddings from these models. Our method addresses two key challenges: (1) incorporating semantic priors from VLMs alongside the geometric knowledge of spatially-aware vision foundation models, and (2) using a novel deterministic uncertainty estimation to capture model-specific uncertainties across diverse semantic and geometric sensitivities, helping to reconcile heterogeneous representations during training. Extensive experiments on ScanNetV2 and Matterport3D demonstrate that our method not only advances open-vocabulary segmentation but also achieves robust cross-domain alignment and competitive spatial perception capabilities. Project webpage: CUA-O3D.
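The abstract does not give the distillation loss, so the following is only a minimal sketch of one common way to weight several teachers by a predicted uncertainty: a learned log-variance term scales each teacher's cosine distance and is itself penalized. This weighting is a stand-in chosen for illustration, not the formulation confirmed by the record, and the function names, teacher keys, and array shapes are all hypothetical.

```python
import numpy as np

def cosine_distance(pred, target, eps=1e-8):
    """Per-point cosine distance (1 - cosine similarity) for two (N, D) arrays."""
    pred_n = pred / (np.linalg.norm(pred, axis=1, keepdims=True) + eps)
    target_n = target / (np.linalg.norm(target, axis=1, keepdims=True) + eps)
    return 1.0 - np.sum(pred_n * target_n, axis=1)

def uncertainty_weighted_loss(preds, targets, log_vars):
    """Sum over teachers of mean(exp(-s) * d + s).

    preds, targets: dict mapping a teacher name to an (N, D_k) array
                    (3D-branch prediction and projected 2D teacher feature).
    log_vars:       dict mapping a teacher name to an (N,) array of predicted
                    log-variances s (assumed output of an extra head).
    """
    total = 0.0
    for name in preds:
        d = cosine_distance(preds[name], targets[name])
        s = log_vars[name]
        # A large s down-weights an unreliable teacher; the + s term keeps
        # the model from inflating the uncertainty without cost.
        total += float(np.mean(np.exp(-s) * d + s))
    return total
```

With all log-variances at zero the expression reduces to a plain sum of per-teacher cosine distances, which makes the effect of the uncertainty term easy to inspect in isolation.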

Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding / Li, Jinlong; Saltori, Cristiano; Poiesi, Fabio; Sebe, Nicu. - (2025), pp. 19390-19400. (2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, USA, June 2025) [10.1109/cvpr52734.2025.01806].

Li, Jinlong; Saltori, Cristiano; Poiesi, Fabio; Sebe, Nicu

2025-01-01

Date of publication: 2025
Proceedings title: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Place of publication: New York
Publisher: IEEE
ISBN: 979-8-3315-4364-8
Scopus identifier: 2-s2.0-105017089649
WOS identifier: WOS:001601158200129
All authors: Li, Jinlong; Saltori, Cristiano; Poiesi, Fabio; Sebe, Nicu
Citation: Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding / Li, Jinlong; Saltori, Cristiano; Poiesi, Fabio; Sebe, Nicu. - (2025), pp. 19390-19400. (2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, USA, June 2025) [10.1109/cvpr52734.2025.01806].
Publication type: 04.1 Paper in Proceedings

Files in this record:

File	Access	Type	License	Size	Format
Li_Cross-Modal_and_Uncertainty-Aware_Agglomeration_for_Open-Vocabulary_3D_Scene_Understanding_CVPR_2025_paper.pdf	Open access	Refereed author's manuscript (post-print)	All rights reserved	1.73 MB	Adobe PDF
Cross-Modal_and_Uncertainty-Aware_Agglomeration_for_Open-Vocabulary_3D_Scene_Understanding.pdf	Archive managers only	Publisher's layout	All rights reserved	1.53 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/462252
