
INPUT CONDITIONED LAYER DROPPING IN SPEECH FOUNDATION MODELS / Hannan, A.; Falavigna, D.; Brutti, A. - (2025). (IEEE International Workshop on Machine Learning for Signal Processing (MLSP) 2025, Istanbul, Turkey, 31 August - 3 September 2025).

INPUT CONDITIONED LAYER DROPPING IN SPEECH FOUNDATION MODELS

Hannan, A.; Falavigna, D.; Brutti, A.
2025-01-01

Abstract

Adapting foundation speech models to edge and IoT settings, where computational resources vary over time, requires dynamic architectures with adaptable reduction strategies. One emerging approach is layer dropping (\mathcal{LD}), which skips a fraction of the layers of a backbone network during inference to reduce the computational load, thereby transforming static models into dynamic ones. However, existing approaches are limited either in how they select layers or because they significantly modify the neural architecture. To this end, we propose input-driven \mathcal{LD}, which employs the network's input features and a lightweight layer-selection network to determine the optimal combination of processing layers. Extensive experimentation on four public speech and audio benchmarks, using two different pre-trained foundation models, demonstrates the effectiveness of our approach, which consistently outperforms random dropping and produces results on par with (or better than) early exit.
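To make the idea concrete, here is a minimal, purely illustrative sketch of input-conditioned layer dropping (not the authors' implementation): a tiny linear "layer-selection network" scores each backbone layer from the input features, and only the top-k scoring layers are executed. All names, weights, and the toy layers below are hypothetical.

```python
# Hypothetical sketch of input-conditioned layer dropping (toy example,
# not the paper's code). A linear selector scores each layer from the
# input features; only the keep_k highest-scoring layers are run.

def selector_scores(features, weights):
    """One dot product per layer: a minimal linear layer-selection network."""
    return [sum(f * w for f, w in zip(features, ws)) for ws in weights]

def forward_with_layer_dropping(x, layers, features, weights, keep_k):
    """Execute only the keep_k layers with the highest selector scores,
    preserving their original order in the backbone."""
    scores = selector_scores(features, weights)
    ranked = sorted(range(len(layers)), key=lambda i: -scores[i])
    keep = sorted(ranked[:keep_k])  # restore network order
    for i in keep:
        x = layers[i](x)
    return x, keep

# Toy backbone: 4 "layers" as scalar functions; features decide which survive.
layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
features = [1.0, 0.5]
weights = [[0.9, 0.1], [0.2, 0.3], [0.8, 0.4], [0.1, 0.2]]  # one row per layer
y, kept = forward_with_layer_dropping(2.0, layers, features, weights, keep_k=2)
# Layers 0 and 2 get the highest scores, so only they are executed.
```

In a real foundation model the layers would be Transformer blocks and the selector a small trained network, but the control flow — score, select a subset, run only that subset — is the same.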
2025
IEEE MLSP 2025 Proceedings
IEEE Xplore

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/476231