
Voice activity detection by upper body motion analysis and unsupervised domain adaptation

Beyan C.;
2019-01-01

Abstract

We present a novel vision-based voice activity detection (VAD) method that relies solely on automatic upper body motion (UBM) analysis. Traditionally, VAD is performed using audio features only, but visual cues can be preferable when audio is unavailable, for example due to technical, ethical, or legal constraints. The psychology literature confirms that the way people move while speaking differs from the way they move while not speaking. This motivates our claim that an effective representation of UBM can be used to detect 'Who is Speaking and When'. However, speech-accompanying body motion varies considerably from culture to culture, and even from person to person within the same culture. This results in dissimilar UBM representations, so the distributions of training and test data diverge. To overcome this, we combine stacked sparse autoencoders with a simple subspace alignment method, while a classifier is jointly learned using the VAD labels of the training data only. This produces new domain-invariant feature representations for training and test data and improves VAD performance. Our approach is applicable to any person without requiring re-training. Experiments on a publicly available real-life VAD dataset show better results than state-of-the-art video-only VAD methods. Moreover, an ablation study confirms the advantage of the proposed method and demonstrates the positive contribution of each component.
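The abstract combines stacked sparse autoencoder features with a simple subspace alignment step and a classifier learned from training labels only. As an illustration only, the sketch below shows a generic PCA-based subspace alignment of that kind applied to already-encoded UBM feature vectors; the scikit-learn tooling, the linear SVM classifier, the variable names, and the subspace dimensionality are assumptions for this sketch, not the authors' implementation (see the paper via the DOI below for the actual method).

    from sklearn.decomposition import PCA
    from sklearn.svm import LinearSVC

    def subspace_alignment(X_src, X_tgt, n_components=32):
        # Fit one PCA subspace per domain (source = training, target = test).
        pca_src = PCA(n_components=n_components).fit(X_src)
        pca_tgt = PCA(n_components=n_components).fit(X_tgt)
        # Alignment matrix M = Ps^T Pt rotates source coordinates toward the target basis.
        M = pca_src.components_ @ pca_tgt.components_.T
        Xs_aligned = pca_src.transform(X_src) @ M   # source features, aligned
        Xt_proj = pca_tgt.transform(X_tgt)          # target features, own subspace
        return Xs_aligned, Xt_proj

    # Hypothetical usage with encoded upper-body-motion features:
    # Xs, Xt = subspace_alignment(train_feats, test_feats)
    # clf = LinearSVC().fit(Xs, y_train)   # speaking / not-speaking labels, training data only
    # vad_pred = clf.predict(Xt)

Training the classifier on the aligned source features and evaluating it on the target projection mirrors the unsupervised setting described in the abstract, where only the training-domain VAD labels are available.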
2019
Proceedings - 2019 International Conference on Computer Vision Workshop, ICCVW 2019
10662 LOS VAQUEROS CIRCLE, PO BOX 3014, LOS ALAMITOS, CA 90720-1264 USA
Institute of Electrical and Electronics Engineers Inc.
978-1-7281-5023-9
Shahid, M.; Beyan, C.; Murino, V.
Voice activity detection by upper body motion analysis and unsupervised domain adaptation / Shahid, M.; Beyan, C.; Murino, V. - (2019), pp. 1260-1269. (Paper presented at the 17th IEEE/CVF International Conference on Computer Vision Workshop, ICCVW 2019, held in Seoul, 2019) [10.1109/ICCVW.2019.00159].
Files for this record:
There are no files associated with this record.


Use this identifier to cite or link to this document: https://hdl.handle.net/11572/298030

Citations
  • PMC: n/a
  • Scopus: 11
  • Web of Science (ISI): 3