
Ren, Bin; Huang, Xiaoshui; Liu, Mengyuan; Liu, Hong; Poiesi, Fabio; Sebe, Nicu; Mei, Guofeng. "Masked Clustering Prediction for Unsupervised Point Cloud Pre-training." Proceedings of the AAAI Conference on Artificial Intelligence, 40(11), pp. 8712-8720, Singapore, January 2026. DOI: 10.1609/aaai.v40i11.37824.

Masked Clustering Prediction for Unsupervised Point Cloud Pre-training

Ren, Bin; Poiesi, Fabio; Sebe, Nicu
2026-01-01

Abstract

Vision transformers (ViTs) have recently been widely applied to 3D point cloud understanding, with masked autoencoding as the predominant pre-training paradigm. However, the challenge of learning dense and informative semantic features from point clouds via standard ViTs remains underexplored. We propose MaskClu, a novel unsupervised pre-training method for ViTs on 3D point clouds that integrates masked point modeling with clustering-based learning. MaskClu is designed to reconstruct both cluster assignments and cluster centers from masked point clouds, thus encouraging the model to capture dense semantic information. Additionally, we introduce a global contrastive learning mechanism that enhances instance-level feature learning by contrasting different masked views of the same point cloud. By jointly optimizing these complementary objectives, i.e., dense semantic reconstruction and instance-level contrastive learning, MaskClu enables ViTs to learn richer and more semantically meaningful representations from 3D point clouds. We validate the effectiveness of MaskClu on multiple 3D tasks, including part segmentation, semantic segmentation, object detection, and classification, achieving competitive results.
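The abstract names three pre-training objectives: reconstructing cluster assignments, reconstructing cluster centers, and instance-level contrastive learning between masked views. A minimal NumPy sketch of plausible forms of these losses follows; the function names, tensor shapes, temperature, and loss forms are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of MaskClu-style objectives (assumed forms, not the
# paper's actual losses, networks, or hyperparameters).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cluster_assignment_loss(pred_logits, target_assign):
    """Cross-entropy between predicted cluster-assignment logits for masked
    patches and target assignments (e.g. obtained by clustering the full,
    unmasked point cloud)."""
    p = softmax(pred_logits)  # (n_patches, n_clusters)
    idx = np.arange(len(target_assign))
    return -np.mean(np.log(p[idx, target_assign] + 1e-9))

def center_reconstruction_loss(pred_centers, target_centers):
    """Mean squared error between predicted and target cluster centers."""
    return np.mean((pred_centers - target_centers) ** 2)

def info_nce(z1, z2, tau=0.1):
    """Instance-level contrastive (InfoNCE) loss between global features of
    two masked views of the same batch; positives lie on the diagonal."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau  # (batch, batch) cosine similarities
    p = softmax(logits)
    idx = np.arange(len(z1))
    return -np.mean(np.log(p[idx, idx] + 1e-9))

# Toy shapes: 8 masked patches, 4 clusters, batch of 2, 16-dim global features.
rng = np.random.default_rng(0)
total = (cluster_assignment_loss(rng.normal(size=(8, 4)),
                                 rng.integers(0, 4, size=8))
         + center_reconstruction_loss(rng.normal(size=(4, 3)),
                                      rng.normal(size=(4, 3)))
         + info_nce(rng.normal(size=(2, 16)), rng.normal(size=(2, 16))))
print(float(total))
```

In practice the three terms would be weighted and minimized jointly; the sketch only shows how dense (per-patch) and global (per-instance) signals can coexist in one objective.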
2026
Proceedings of the AAAI Conference on Artificial Intelligence
New York
Association for the Advancement of Artificial Intelligence (AAAI)
Ren, Bin; Huang, Xiaoshui; Liu, Mengyuan; Liu, Hong; Poiesi, Fabio; Sebe, Nicu; Mei, Guofeng
Files in this record:

File: 37824-Article Text-41916-1-2-20260314-compressed.pdf
Access: open access
Type: Publisher's version (Publisher's layout)
License: All rights reserved
Size: 415.89 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/481371