We present an automatic voice activity detection (VAD) method that relies solely on visual cues. Unlike traditional approaches that process audio, we show that upper-body motion analysis is well suited to the VAD task. The proposed method consists of three components: body motion representation, feature extraction with a Convolutional Neural Network (CNN) architecture, and unsupervised domain adaptation. The feature extraction component takes the body motion representations as images; it is generic and person-invariant, so it can be applied to a subject who has never been seen before. The final component handles the domain-shift problem, which arises because the way people move or gesticulate while speaking varies from subject to subject, resulting in disparate body motion features and, consequently, poorer VAD performance. Experiments on a publicly available real-world VAD dataset show that the proposed method outperforms state-of-the-art video-only and multimodal VAD approaches. Moreover, the proposed method generalizes better, as its VAD results are more consistent across subjects. As another major contribution, we present a new multimodal dataset, RealVAD, created from a real-world (no role-play) panel discussion. This dataset contains many real situations and challenges that are missing from previous VAD datasets. We benchmark the RealVAD dataset by applying the proposed method and by performing cross-dataset analyses. In particular, the results of the cross-dataset experiments highlight the substantial positive contribution of the applied unsupervised domain adaptation.
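
The abstract does not name the adaptation algorithm, so the following is only an illustrative sketch of one common unsupervised technique, a CORAL-style correlation alignment, and not necessarily the method used in the paper. It assumes that CNN features have already been extracted as fixed-length vectors; the function names, the `eps` regularizer, and the added mean shift are assumptions made for this example. Features from labelled training subjects are whitened and then re-colored with the covariance of the unlabelled features of an unseen subject, which reduces the subject-to-subject mismatch the abstract describes.

```python
import numpy as np


def _sym_matrix_power(mat, power):
    """Raise a symmetric positive-definite matrix to a real power
    using its eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(mat)
    return (eigvecs * eigvals**power) @ eigvecs.T


def coral_align(source, target, eps=1e-3):
    """Align source features to the target domain's mean and covariance.

    source: (n_s, d) array, features from labelled training subjects.
    target: (n_t, d) array, unlabelled features from the unseen subject.
    eps:    small ridge term (an assumption) keeping covariances invertible.
    Returns an (n_s, d) array of transformed source features.
    """
    d = source.shape[1]
    cov_s = np.cov(source, rowvar=False) + eps * np.eye(d)
    cov_t = np.cov(target, rowvar=False) + eps * np.eye(d)

    # Remove the source mean, whiten with the source covariance,
    # then re-color with the target covariance and add the target mean.
    centered = source - source.mean(axis=0)
    whitened = centered @ _sym_matrix_power(cov_s, -0.5)
    recolored = whitened @ _sym_matrix_power(cov_t, 0.5)
    return recolored + target.mean(axis=0)
```

Under these assumptions, a speaking/not-speaking classifier would be trained on `coral_align(source, target)` with the original source labels and then applied to the untouched target features; no target labels are needed at any point.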

RealVAD: A Real-world Dataset and A Method for Voice Activity Detection by Body Motion Analysis / Beyan, Cigdem; Shahid, Muhammad; Murino, Vittorio. - In: IEEE TRANSACTIONS ON MULTIMEDIA. - ISSN 1520-9210. - ELECTRONIC. - 2021:23(2021), pp. 2071-2085. [10.1109/TMM.2020.3007350]

RealVAD: A Real-world Dataset and A Method for Voice Activity Detection by Body Motion Analysis

Beyan, Cigdem; Shahid, Muhammad; Murino, Vittorio
2021-01-01

Files in this item:

TMM2020_realVAD.pdf
Access: archive managers only
Description: Main article
Type: Non-refereed preprint
License: All rights reserved
Size: 4.84 MB
Format: Adobe PDF

RealVAD_A_Real-World_Dataset_and_A_Method_for_Voice_Activity_Detection_by_Body_Motion_Analysis.pdf
Access: archive managers only
Type: Publisher's layout
License: All rights reserved
Size: 2.44 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/296384