
INPUT CONDITIONED LAYER DROPPING IN SPEECH FOUNDATION MODELS / Hannan, A.; Falavigna, D.; Brutti, A. - (2025). (IEEE International Workshop on Machine Learning for Signal Processing (MLSP) 2025, Istanbul, Turkey, 31 August - 3 September 2025).

INPUT CONDITIONED LAYER DROPPING IN SPEECH FOUNDATION MODELS

Hannan, A.; Falavigna, D.; Brutti, A.
2025-01-01

Abstract

Adapting foundation speech models to edge and IoT settings, where computational resources vary over time, requires dynamic architectures with adaptable reduction strategies. One emerging approach is layer dropping (\mathcal{LD}), which skips a fraction of the layers of a backbone network during inference to reduce the computational load, thereby transforming static models into dynamic ones. However, existing approaches are limited either in how they select layers or because they significantly modify the neural architecture. To this end, we propose input-driven \mathcal{LD}, which employs the network's input features and a lightweight layer-selection network to determine the optimal combination of processing layers. Extensive experimentation on four public speech and audio benchmarks, using two different pre-trained foundation models, demonstrates the effectiveness of our approach, which consistently outperforms random dropping and produces results on par with (or better than) early exit.
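To make the idea concrete, here is a minimal, purely illustrative sketch of input-conditioned layer dropping (not the authors' implementation): a tiny linear "layer-selection network" scores each backbone layer from the input features, and only the top-k scoring layers are executed. All names, weights, and the toy layers below are hypothetical.

```python
# Hypothetical sketch of input-conditioned layer dropping (toy example,
# not the paper's code). A linear selector scores each layer from the
# input features; only the keep_k highest-scoring layers are run.

def selector_scores(features, weights):
    """One dot product per layer: a minimal linear layer-selection network."""
    return [sum(f * w for f, w in zip(features, ws)) for ws in weights]

def forward_with_layer_dropping(x, layers, features, weights, keep_k):
    """Execute only the keep_k layers with the highest selector scores,
    preserving their original order in the backbone."""
    scores = selector_scores(features, weights)
    ranked = sorted(range(len(layers)), key=lambda i: -scores[i])
    keep = sorted(ranked[:keep_k])  # restore network order
    for i in keep:
        x = layers[i](x)
    return x, keep

# Toy backbone: 4 "layers" as scalar functions; features decide which survive.
layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
features = [1.0, 0.5]
weights = [[0.9, 0.1], [0.2, 0.3], [0.8, 0.4], [0.1, 0.2]]  # one row per layer
y, kept = forward_with_layer_dropping(2.0, layers, features, weights, keep_k=2)
# Layers 0 and 2 get the highest scores, so only they are executed.
```

In a real foundation model the layers would be Transformer blocks and the selector a small trained network, but the control flow — score, select a subset, run only that subset — is the same.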
2025
IEEE MLSP 2025 Proceedings
IEEE Xplore

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/476231