Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-supervised Learning / Ren, Bin; Mei, Guofeng; Pani Paudel, Danda; Wang, Weijie; Li, Yawei; Liu, Mengyuan; Cucchiara, Rita; Van Gool, Luc; Sebe, Nicu. - LNCS 15478 (2024), pp. 56-75. (Paper presented at the 17th Asian Conference on Computer Vision, ACCV 2024, held in Hanoi in 2024) [10.1007/978-981-96-0963-5_4].

Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-supervised Learning

Ren, Bin; Wang, Weijie; Sebe, Nicu
2024-01-01

Abstract

Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones. However, in 3D point cloud pre-training with ViTs, masked autoencoder (MAE) modeling remains dominant. This raises the question: can we combine the best of both worlds? To answer it, we first empirically validate that integrating MAE-based point cloud pre-training with the standard contrastive learning paradigm, even with meticulous design, can degrade performance. To address this limitation, we reintroduce CL into the MAE-based point cloud pre-training paradigm by leveraging the inherent contrastive properties of MAE. Specifically, rather than relying on extensive data augmentation as is common in the image domain, we randomly mask the input tokens twice to generate a pair of contrastive inputs. A weight-sharing encoder and two identically structured decoders then perform masked-token reconstruction. Additionally, we enforce that for an input token masked by both masks simultaneously, the two reconstructed features should be as similar as possible. This establishes an explicit contrastive constraint within the generative MAE-based pre-training paradigm, yielding our proposed method, Point-CMAE. Point-CMAE consequently improves representation quality and transfer performance over its MAE counterpart. Experimental evaluations across various downstream applications, including classification, part segmentation, and few-shot learning, demonstrate that our framework surpasses state-of-the-art techniques under standard ViTs and single-modal settings. The source code and trained models are available at https://github.com/Amazingren/Point-CMAE.
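For illustration, the following is a minimal PyTorch sketch of the dual-masking contrastive constraint described in the abstract. It is not the authors' implementation: the class names, the toy decoder, the mask-sampling scheme, and the cosine-similarity loss are illustrative assumptions, and the standard MAE point-reconstruction loss is omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDecoder(nn.Module):
    # Illustrative stand-in for the paper's decoder: fills masked positions
    # with a learnable mask token, then projects all tokens.
    def __init__(self, dim):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(dim))
        self.proj = nn.Linear(dim, dim)

    def forward(self, visible_feat, mask):
        B, _, C = visible_feat.shape
        full = self.mask_token.expand(B, mask.numel(), C).clone()
        full[:, ~mask] = visible_feat        # keep encoded visible tokens
        return self.proj(full)               # (B, N, C) reconstructed features

class PointCMAESketch(nn.Module):
    # Hypothetical wrapper: one weight-sharing encoder, two identically
    # structured (but independently parameterized) decoders.
    def __init__(self, encoder, decoder_a, decoder_b, mask_ratio=0.6):
        super().__init__()
        self.encoder, self.mask_ratio = encoder, mask_ratio
        self.decoder_a, self.decoder_b = decoder_a, decoder_b

    def random_mask(self, n, device):
        # Boolean token mask: True = masked; the ratio is an assumption.
        idx = torch.rand(n, device=device).topk(int(n * self.mask_ratio)).indices
        mask = torch.zeros(n, dtype=torch.bool, device=device)
        mask[idx] = True
        return mask

    def forward(self, tokens):               # tokens: (B, N, C) patch embeddings
        n = tokens.shape[1]
        mask_a = self.random_mask(n, tokens.device)   # first random masking
        mask_b = self.random_mask(n, tokens.device)   # second random masking
        feat_a = self.decoder_a(self.encoder(tokens[:, ~mask_a]), mask_a)
        feat_b = self.decoder_b(self.encoder(tokens[:, ~mask_b]), mask_b)
        both = mask_a & mask_b                # tokens masked in *both* views
        # Explicit contrastive constraint: features reconstructed for
        # doubly-masked tokens should agree across the two decoders.
        return 1 - F.cosine_similarity(feat_a[:, both], feat_b[:, both], dim=-1).mean()

A quick smoke test, with nn.Identity() standing in for the ViT encoder: model = PointCMAESketch(nn.Identity(), ToyDecoder(32), ToyDecoder(32)); loss = model(torch.randn(2, 64, 32)). Note that with any mask ratio of at least 0.5 the two masks are guaranteed to overlap, so the constraint always has support.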
Year: 2024
Series: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Place of publication: Heidelberg
Publisher: Springer Science and Business Media Deutschland GmbH
ISBN: 9789819609628; 9789819609635
Authors: Ren, Bin; Mei, Guofeng; Pani Paudel, Danda; Wang, Weijie; Li, Yawei; Liu, Mengyuan; Cucchiara, Rita; Van Gool, Luc; Sebe, Nicu
Files in this record:
No files are associated with this record.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/442591
Warning: the data displayed have not been validated by the university.

Citations
  • PMC: not available
  • Scopus: 0
  • Web of Science: not available
  • OpenAlex: not available