In the last year, universal monocular metric depth estimation (universal MMDE) has gained considerable attention, serving as the foundation model for various multimedia tasks, such as video and image editing. Nonetheless, current approaches face challenges in maintaining consistent accuracy across diverse scenes without scene-specific parameters and pre-training, hindering the practicality of MMDE. Furthermore, these methods rely on extensive datasets comprising millions, if not tens of millions, of data for training, leading to significant time and hardware expenses. This paper presents SM4Depth, a model that seamlessly works for both indoor and outdoor scenes, without needing extensive training data and GPU clusters. Firstly, to obtain consistent depth across diverse scenes, we propose a novel metric scale modeling, i.e., variation- based unnormalized depth bins. It reduces the ambiguity of the conventional metric bins and enables better adaptation to large depth gaps of scenes during training. Secondly, we propose a ''divide and conquer'' solution to reduce reliance on massive training data. Instead of estimating directly from the vast solution space, the metric bins are estimated from multiple solution sub-spaces to reduce complexity. Additionally, we introduce an uncut depth dataset, BUPT Depth, to evaluate the depth accuracy and consistency across various indoor and outdoor scenes. Trained on a consumer-grade GPU using just 150K RGB-D pairs, SM4Depth achieves outstanding performance on the most never-before-seen datasets, especially maintaining consistent accuracy across indoors and outdoors. The code can be found here.

SM4Depth: Seamless Monocular Metric Depth Estimation across Multiple Cameras and Scenes by One Model / Liu, Yihao; Xue, Feng; Ming, Anlong; Zhao, Mingshuai; Ma, Huadong; Sebe, Nicu. - (2024), pp. 3469-3478. ( 32nd ACM International Conference on Multimedia, MM 2024 Melbourne VIC Australia 2024) [10.1145/3664647.3681405].

SM4Depth: Seamless Monocular Metric Depth Estimation across Multiple Cameras and Scenes by One Model

Feng Xue;Nicu Sebe
2024-01-01

Abstract

In the last year, universal monocular metric depth estimation (universal MMDE) has gained considerable attention, serving as the foundation model for various multimedia tasks, such as video and image editing. Nonetheless, current approaches face challenges in maintaining consistent accuracy across diverse scenes without scene-specific parameters and pre-training, hindering the practicality of MMDE. Furthermore, these methods rely on extensive datasets comprising millions, if not tens of millions, of data for training, leading to significant time and hardware expenses. This paper presents SM4Depth, a model that seamlessly works for both indoor and outdoor scenes, without needing extensive training data and GPU clusters. Firstly, to obtain consistent depth across diverse scenes, we propose a novel metric scale modeling, i.e., variation- based unnormalized depth bins. It reduces the ambiguity of the conventional metric bins and enables better adaptation to large depth gaps of scenes during training. Secondly, we propose a ''divide and conquer'' solution to reduce reliance on massive training data. Instead of estimating directly from the vast solution space, the metric bins are estimated from multiple solution sub-spaces to reduce complexity. Additionally, we introduce an uncut depth dataset, BUPT Depth, to evaluate the depth accuracy and consistency across various indoor and outdoor scenes. Trained on a consumer-grade GPU using just 150K RGB-D pairs, SM4Depth achieves outstanding performance on the most never-before-seen datasets, especially maintaining consistent accuracy across indoors and outdoors. The code can be found here.
2024
MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
New York
Association for Computing Machinery, Inc
9798400706868
Liu, Yihao; Xue, Feng; Ming, Anlong; Zhao, Mingshuai; Ma, Huadong; Sebe, Nicu
SM4Depth: Seamless Monocular Metric Depth Estimation across Multiple Cameras and Scenes by One Model / Liu, Yihao; Xue, Feng; Ming, Anlong; Zhao, Mingshuai; Ma, Huadong; Sebe, Nicu. - (2024), pp. 3469-3478. ( 32nd ACM International Conference on Multimedia, MM 2024 Melbourne VIC Australia 2024) [10.1145/3664647.3681405].
File in questo prodotto:
File Dimensione Formato  
3664647.3681405 (1).pdf

Solo gestori archivio

Tipologia: Versione editoriale (Publisher’s layout)
Licenza: Tutti i diritti riservati (All rights reserved)
Dimensione 8.26 MB
Formato Adobe PDF
8.26 MB Adobe PDF   Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11572/439531
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 1
  • ???jsp.display-item.citation.isi??? 0
  • OpenAlex 2
social impact