
Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation / Li, Wenhao; Liu, Mengyuan; Liu, Hong; Wang, Pichao; Cai, Jialun; Sebe, Nicu. - 36:(2024), pp. 604-613. (2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, 16-22 June 2024) [10.1109/cvpr52733.2024.00064].

Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation

Sebe, Nicu
2024-01-01

Abstract

Transformers have been successfully applied in the field of video-based 3D human pose estimation. However, the high computational costs of these video pose transformers (VPTs) make them impractical on resource-constrained devices. In this paper, we present a plug-and-play pruning-and-recovering framework, called Hourglass Tokenizer (HoT), for efficient transformer-based 3D human pose estimation from videos. Our HoT begins with pruning pose tokens of redundant frames and ends with recovering full-length tokens, resulting in a few pose tokens in the intermediate transformer blocks and thus improving the model efficiency. To effectively achieve this, we propose a token pruning cluster (TPC) that dynamically selects a few representative tokens with high semantic diversity while eliminating the redundancy of video frames. In addition, we develop a token recovering attention (TRA) to restore the detailed spatio-temporal information based on the selected tokens, thereby expanding the network...
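The pruning-and-recovering idea in the abstract can be illustrated with a minimal numpy sketch: a clustering step keeps a few representative frame tokens (standing in for TPC), and a cross-attention step expands them back to full length (standing in for TRA). The k-means selection, the learnable query bank, and all shapes below are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def prune_tokens(tokens, k, iters=10, seed=0):
    """Keep <= k representative frame tokens via simple k-means
    clustering (an illustrative stand-in for the TPC module).
    tokens: (T, C) array of per-frame pose tokens."""
    T, _ = tokens.shape
    rng = np.random.default_rng(seed)
    centers = tokens[rng.choice(T, size=k, replace=False)]  # copy
    for _ in range(iters):
        # assign each token to its nearest cluster center
        d = ((tokens[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        for j in range(k):
            members = tokens[assign == j]
            if len(members):
                centers[j] = members.mean(0)
    # keep the real token closest to each center as its representative
    d = ((tokens[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    keep = np.unique(d.argmin(0))
    return tokens[keep]

def recover_tokens(pruned, queries, W_k, W_v):
    """Expand the pruned tokens back to T frames with cross-attention
    (an illustrative stand-in for the TRA module): T per-frame queries
    attend to the few pruned tokens.
    pruned: (k', C); queries: (T, C) hypothetical learnable query bank;
    W_k, W_v: (C, C) projection matrices."""
    keys, values = pruned @ W_k, pruned @ W_v
    attn = queries @ keys.T / np.sqrt(queries.shape[1])  # (T, k')
    attn = np.exp(attn - attn.max(1, keepdims=True))     # stable softmax
    attn /= attn.sum(1, keepdims=True)
    return attn @ values  # (T, C) full-length token sequence
```

Only the few pruned tokens pass through the intermediate transformer blocks, so attention cost there drops from O(T^2) to roughly O(k^2), while the recovery step restores a per-frame output for sequence-to-sequence estimation.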
2024
2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
10662 LOS VAQUEROS CIRCLE, PO BOX 3014, LOS ALAMITOS, CA 90720-1264 USA
IEEE
9798350353006
Li, Wenhao; Liu, Mengyuan; Liu, Hong; Wang, Pichao; Cai, Jialun; Sebe, Nicu
Files in this record:

Li_Hourglass_Tokenizer_for_Efficient_Transformer-Based_3D_Human_Pose_Estimation_CVPR_2024_paper (2).pdf
  Access: open access
  Type: Refereed author's manuscript (post-print)
  License: All rights reserved
  Size: 1.86 MB
  Format: Adobe PDF

Hourglass_Tokenizer_for_Efficient_Transformer-Based_3D_Human_Pose_Estimation.pdf
  Access: archive administrators only
  Type: Publisher's layout (editorial version)
  License: All rights reserved
  Size: 1.73 MB
  Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this record: https://hdl.handle.net/11572/432730
Citations
  • PMC: n/a
  • Scopus: 61
  • Web of Science: 44
  • OpenAlex: 49