Comparisons of visual activity primitives for voice activity detection / Shahid, M.; Beyan, C.; Murino, V. - 11751:(2019), pp. 48-59. (Paper presented at the 20th International Conference on Image Analysis and Processing, ICIAP 2019, held in Trento, Italy, in 2019) [10.1007/978-3-030-30642-7_5].

Comparisons of visual activity primitives for voice activity detection

Beyan C.;
2019-01-01

Abstract

Voice activity detection (VAD) from solely visual cues has usually been performed by detecting lip motion, which is not always feasible. On the other hand, visual activity (e.g., head, hand, or whole-body motion) is also correlated with speech and can be used for VAD. Convolutional Neural Networks (CNNs) have demonstrated good results in many applications, including visual activity-related tasks. It may therefore be possible to exploit the effectiveness of CNNs for visual VAD when whole-body visual activity is used. The way visual activity is represented as input to a CNN (the visual activity primitive) might be important for effective VAD. Some primitives might yield better detection and more consistent VAD performance, such that the detector works equally well for all speakers. This paper investigates this question for the first time. We compare visual activity primitives quantitatively, in terms of overall performance and the standard deviation of the performance, and qualitatively, by visualizing the discriminative image regions determined by a CNN trained to identify VAD classes. We perform data-driven VAD with person-invariant training, i.e., without using any labels or features of the test data. This is unlike the state of the art (SOA), which realizes person-specific VAD with hand-crafted features. We demonstrate improved performance with much lower standard deviation compared to the SOA.
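
The evaluation idea in the abstract (person-invariant training, then reporting overall performance together with its standard deviation) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration only: the abstract does not say which metric or split protocol the authors use, so the sketch assumes a leave-one-person-out split and an F1 score, and the helper names `leave_one_person_out`, `f1_score`, and `summarize` are hypothetical.

```python
from statistics import mean, pstdev


def leave_one_person_out(person_ids):
    """Yield (held_out, train_idx, test_idx) so that the held-out
    person contributes no samples to training (assumed protocol)."""
    for held_out in sorted(set(person_ids)):
        train_idx = [i for i, p in enumerate(person_ids) if p != held_out]
        test_idx = [i for i, p in enumerate(person_ids) if p == held_out]
        yield held_out, train_idx, test_idx


def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for the 'speaking' class (assumed metric)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def summarize(per_person_scores):
    """Return (mean, population standard deviation) over per-person scores.
    A lower standard deviation means more consistent performance across speakers."""
    scores = list(per_person_scores.values())
    return mean(scores), pstdev(scores)


# Toy, made-up data: one person label per sample.
person_ids = ["a", "a", "b", "c", "c"]
splits = list(leave_one_person_out(person_ids))

# Toy, made-up per-person scores for one hypothetical primitive.
avg, spread = summarize({"a": 0.8, "b": 0.6, "c": 0.7})
```

Under these assumptions, two primitives with the same mean score can still be told apart by `spread`: the one with the smaller value behaves more uniformly across speakers.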
2019
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
GEWERBESTRASSE 11, CHAM, CH-6330, SWITZERLAND
Springer Verlag
978-3-030-30641-0
978-3-030-30642-7
Shahid, M.; Beyan, C.; Murino, V.
Files in this item:
File: ICIAP_2019.pdf
Access: open access
Description: Main article
Type: Refereed post-print (Refereed author’s manuscript)
License: All rights reserved
Size: 1.4 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/298121
Citations
  • Scopus: 7
  • ISI (Web of Science): 6