Transformers have been successfully applied in the field of video-based 3D human pose estimation. However, the high computational costs of these video pose transformers (VPTs) make them impractical on resource-constrained devices. In this paper, we present a plug-and-play pruning-and-recovering framework, called Hourglass Tokenizer (HoT), for efficient transformer-based 3D human pose estimation from videos. Our HoT begins with pruning pose tokens of re-dundant frames and ends with recovering full-length tokens, resulting in a few pose tokens in the intermediate transformer blocks and thus improving the model efficiency. To effectively achieve this, we propose a token pruning cluster (TPC) that dynamically selects a few representative tokens with high semantic diversity while eliminating the redundancy of video frames. In addition, we develop a token recovering attention (TRA) to restore the detailed spatio-temporal information based on the selected tokens, thereby expanding the network...
Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation / Li, Wenhao; Liu, Mengyuan; Liu, Hong; Wang, Pichao; Cai, Jialun; Sebe, Nicu. - 36:(2024), pp. 604-613. ( 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024 Seattle, WA, USA 16-22 June 2024) [10.1109/cvpr52733.2024.00064].
Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation
Sebe, Nicu
2024-01-01
Abstract
Transformers have been successfully applied in the field of video-based 3D human pose estimation. However, the high computational costs of these video pose transformers (VPTs) make them impractical on resource-constrained devices. In this paper, we present a plug-and-play pruning-and-recovering framework, called Hourglass Tokenizer (HoT), for efficient transformer-based 3D human pose estimation from videos. Our HoT begins with pruning pose tokens of re-dundant frames and ends with recovering full-length tokens, resulting in a few pose tokens in the intermediate transformer blocks and thus improving the model efficiency. To effectively achieve this, we propose a token pruning cluster (TPC) that dynamically selects a few representative tokens with high semantic diversity while eliminating the redundancy of video frames. In addition, we develop a token recovering attention (TRA) to restore the detailed spatio-temporal information based on the selected tokens, thereby expanding the network...| File | Dimensione | Formato | |
|---|---|---|---|
|
Li_Hourglass_Tokenizer_for_Efficient_Transformer-Based_3D_Human_Pose_Estimation_CVPR_2024_paper (2).pdf
accesso aperto
Tipologia:
Post-print referato (Refereed author’s manuscript)
Licenza:
Tutti i diritti riservati (All rights reserved)
Dimensione
1.86 MB
Formato
Adobe PDF
|
1.86 MB | Adobe PDF | Visualizza/Apri |
|
Hourglass_Tokenizer_for_Efficient_Transformer-Based_3D_Human_Pose_Estimation.pdf
Solo gestori archivio
Tipologia:
Versione editoriale (Publisher’s layout)
Licenza:
Tutti i diritti riservati (All rights reserved)
Dimensione
1.73 MB
Formato
Adobe PDF
|
1.73 MB | Adobe PDF | Visualizza/Apri |
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione



