Li, Yidi; Zhao, Wenkai; Wang, Zeyu; Xu, Zhenhuan; Ren, Bin; Sebe, Nicu. "Multi-Stage Multimodal Distillation for Audio-Visual Speaker Tracking." In Proc. 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025), Hyderabad International Convention Centre, Hyderabad, India, 2025, pp. 1-5. doi: 10.1109/ICASSP49660.2025.10888838.
Multi-Stage Multimodal Distillation for Audio-Visual Speaker Tracking
Li, Yidi; Zhao, Wenkai; Wang, Zeyu; Xu, Zhenhuan; Ren, Bin; Sebe, Nicu
2025-01-01
Abstract
Speaker tracking plays a crucial role in many human-robot interaction applications. Recently, leveraging multimodal information such as audio and visual signals has become an important strategy for enhancing the robustness of tracking systems. However, current methods struggle to effectively exploit the complementarity between the audio and visual modalities. To this end, we propose an Audio-Visual Tracker based on Multi-Stage Multimodal Distillation (MSMD-AVT), which uses an audio-visual knowledge distillation framework to progressively fuse audio-visual information over multiple stages. MSMD-AVT is built on an audio-visual teacher-student model with three distinct distillation losses. During the feature extraction stage, a feature alignment distillation loss keeps the student network's feature representations consistent with the teacher's encoded features. During the feature fusion stage, a fusion guidance distillation loss uses deep teacher features to guide the multimodal fusion process in the student network, maximizing the complementary benefits of audio-visual fusion. Finally, a logits distillation loss is applied during the position estimation stage to help the student model better capture localization features through knowledge transfer and output alignment. In addition, we introduce a multimodal fusion module in the student network based on a bidirectional cross-attention mechanism, which dynamically adjusts the contribution of each modality to the tracking task by extracting complementary audio-visual contextual information. Extensive experiments on the widely used AV16.3 dataset show that MSMD-AVT significantly outperforms existing state-of-the-art methods in both accuracy and robustness. Our code is publicly available at https://github.com/moyitech/MSMD-AVT.
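The bidirectional cross-attention fusion and the three-loss distillation objective described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration under our own assumptions, not the authors' implementation: the function names (`bidirectional_fusion`, `msmd_losses`), the loss weights alpha/beta/gamma, the temperature tau, and the residual-connection design are all hypothetical; see the linked repository for the actual code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, context, d):
    # Scaled dot-product attention: each `query` row attends over `context`.
    scores = query @ context.T / np.sqrt(d)       # (Tq, Tc)
    return softmax(scores, axis=-1) @ context     # (Tq, d)

def bidirectional_fusion(audio_feat, visual_feat):
    # Audio queries visual context and vice versa, so each modality is
    # re-weighted by complementary information from the other.
    d = audio_feat.shape[-1]
    a2v = cross_attention(audio_feat, visual_feat, d)
    v2a = cross_attention(visual_feat, audio_feat, d)
    # Residual connections (an assumption here) preserve each modality's
    # own evidence alongside the attended context.
    return audio_feat + a2v, visual_feat + v2a

def msmd_losses(student, teacher, alpha=1.0, beta=1.0, gamma=1.0, tau=2.0):
    # Stage 1 -- feature alignment distillation: match encoder features (MSE).
    l_feat = np.mean((student["feat"] - teacher["feat"]) ** 2)
    # Stage 2 -- fusion guidance distillation: match fused features (MSE).
    l_fuse = np.mean((student["fused"] - teacher["fused"]) ** 2)
    # Stage 3 -- logits distillation: KL between temperature-softened
    # position logits, scaled by tau^2 as is conventional.
    p_t = softmax(teacher["logits"] / tau)
    p_s = softmax(student["logits"] / tau)
    l_logit = np.sum(p_t * (np.log(p_t) - np.log(p_s))) * tau**2
    return alpha * l_feat + beta * l_fuse + gamma * l_logit

audio = np.random.randn(10, 64)    # 10 audio frames, 64-dim features
visual = np.random.randn(8, 64)    # 8 visual tokens, 64-dim features
fused_a, fused_v = bidirectional_fusion(audio, visual)
print(fused_a.shape, fused_v.shape)  # (10, 64) (8, 64)
```

In this sketch the weighted sum of the three losses would be added to the ordinary tracking loss when training the student; when student and teacher outputs coincide, all three terms vanish.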
File: Multi-Stage_Multimodal_Distillation_for_Audio-Visual_Speaker_Tracking (1).pdf
Access: archive managers only
Type: publisher's layout (editorial version)
License: all rights reserved
Size: 1.54 MB (Adobe PDF)

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.