Li, Yidi; Zhao, Wenkai; Wang, Zeyu; Xu, Zhenhuan; Ren, Bin; Sebe, Nicu. "Multi-Stage Multimodal Distillation for Audio-Visual Speaker Tracking." In Proc. 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025), Hyderabad International Convention Centre, Hyderabad, India, 2025, pp. 1-5. doi: 10.1109/ICASSP49660.2025.10888838.
Multi-Stage Multimodal Distillation for Audio-Visual Speaker Tracking
Li, Yidi; Zhao, Wenkai; Wang, Zeyu; Xu, Zhenhuan; Ren, Bin; Sebe, Nicu
2025-01-01
Abstract
Speaker tracking plays a crucial role in many human-robot interaction applications. Recently, leveraging multimodal information such as audio and visual signals has become an important strategy for enhancing the robustness of tracking systems. However, current methods struggle to effectively exploit the complementarity between the audio and visual modalities. To this end, we propose an Audio-Visual Tracker based on Multi-Stage Multimodal Distillation (MSMD-AVT), which uses an audio-visual knowledge distillation framework to progressively fuse audio-visual information over multiple stages. MSMD-AVT is built on an audio-visual teacher-student model with three distinct distillation losses. During the feature extraction stage, a feature alignment distillation loss keeps the student network's feature representations consistent with the teacher's encoded features. During the feature fusion stage, a fusion guidance distillation loss uses deep teacher features to guide the multimodal fusion process in the student network, maximizing the complementary benefits of audio-visual fusion. Finally, a logits distillation loss is applied during the position estimation stage to help the student model better capture localization features through knowledge transfer and output alignment. In addition, we introduce a multimodal fusion module in the student network based on a bidirectional cross-attention mechanism, which dynamically adjusts the contribution of each modality to the tracking task by extracting complementary audio-visual contextual information. Extensive experiments on the widely used AV16.3 dataset show that MSMD-AVT significantly outperforms existing state-of-the-art methods in both accuracy and robustness. Our code is publicly available at https://github.com/moyitech/MSMD-AVT.
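The bidirectional cross-attention fusion and the three-loss distillation objective described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration under our own assumptions, not the authors' implementation: the function names (`bidirectional_fusion`, `msmd_losses`), the loss weights alpha/beta/gamma, the temperature tau, and the residual-connection design are all hypothetical; see the linked repository for the actual code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, context, d):
    # Scaled dot-product attention: each `query` row attends over `context`.
    scores = query @ context.T / np.sqrt(d)       # (Tq, Tc)
    return softmax(scores, axis=-1) @ context     # (Tq, d)

def bidirectional_fusion(audio_feat, visual_feat):
    # Audio queries visual context and vice versa, so each modality is
    # re-weighted by complementary information from the other.
    d = audio_feat.shape[-1]
    a2v = cross_attention(audio_feat, visual_feat, d)
    v2a = cross_attention(visual_feat, audio_feat, d)
    # Residual connections (an assumption here) preserve each modality's
    # own evidence alongside the attended context.
    return audio_feat + a2v, visual_feat + v2a

def msmd_losses(student, teacher, alpha=1.0, beta=1.0, gamma=1.0, tau=2.0):
    # Stage 1 -- feature alignment distillation: match encoder features (MSE).
    l_feat = np.mean((student["feat"] - teacher["feat"]) ** 2)
    # Stage 2 -- fusion guidance distillation: match fused features (MSE).
    l_fuse = np.mean((student["fused"] - teacher["fused"]) ** 2)
    # Stage 3 -- logits distillation: KL between temperature-softened
    # position logits, scaled by tau^2 as is conventional.
    p_t = softmax(teacher["logits"] / tau)
    p_s = softmax(student["logits"] / tau)
    l_logit = np.sum(p_t * (np.log(p_t) - np.log(p_s))) * tau**2
    return alpha * l_feat + beta * l_fuse + gamma * l_logit

audio = np.random.randn(10, 64)    # 10 audio frames, 64-dim features
visual = np.random.randn(8, 64)    # 8 visual tokens, 64-dim features
fused_a, fused_v = bidirectional_fusion(audio, visual)
print(fused_a.shape, fused_v.shape)  # (10, 64) (8, 64)
```

In this sketch the weighted sum of the three losses would be added to the ordinary tracking loss when training the student; when student and teacher outputs coincide, all three terms vanish.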
File: Multi-Stage_Multimodal_Distillation_for_Audio-Visual_Speaker_Tracking (1).pdf
Access: archive managers only
Type: publisher's layout (editorial version)
License: all rights reserved
Size: 1.54 MB (Adobe PDF)

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.