Comparisons of visual activity primitives for voice activity detection / Shahid, M.; Beyan, C.; Murino, V. - 11751:(2019), pp. 48-59. (Paper presented at the 20th International Conference on Image Analysis and Processing, ICIAP 2019, held in Trento, Italy, in 2019) [10.1007/978-3-030-30642-7_5].
Comparisons of visual activity primitives for voice activity detection
Beyan C.;
2019-01-01
Abstract
Voice activity detection (VAD) using solely visual cues has usually been performed by detecting lip motion, which is not always feasible. On the other hand, visual activity (e.g., head, hand, or whole-body motion) is also correlated with speech and can be used for VAD. Convolutional Neural Networks (CNNs) have achieved remarkable results in many applications, including visual activity-related tasks, and their effectiveness can be exploited for visual VAD when whole-body visual activity is used. The way visual activity is represented as CNN input (here called visual activity primitives) can be important for effective VAD: some primitives might yield better detection and more consistent performance, such that the detector works equally well for all speakers. This paper investigates this question for the first time. To that end, we compare visual activity primitives quantitatively, in terms of overall performance and the standard deviation of performance across speakers, and qualitatively, by visualizing the discriminative image regions determined by a CNN trained to identify VAD classes. We perform data-driven VAD with person-invariant training, i.e., without using any labels or features of the test data. This is unlike the state of the art (SOA), which realizes person-specific VAD with hand-crafted features. We demonstrate improved performance with a much lower standard deviation compared to the SOA.

File | Size | Format |
---|---|---|
ICIAP_2019.pdf — Open access. Description: Main article. Type: Refereed post-print (author's manuscript). License: All rights reserved. | 1.4 MB | Adobe PDF |
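The abstract above compares visual activity primitives by the mean and the standard deviation of per-speaker performance, under person-invariant training (no labels or features of the test speaker are used). A minimal sketch of such a leave-one-speaker-out evaluation protocol is shown below; all function names and scores are hypothetical illustrations, not the authors' code.

```python
# Hedged sketch of a leave-one-speaker-out ("person-invariant") evaluation:
# train on all speakers except one, test on the held-out speaker, then
# summarise a primitive by the mean and spread of per-speaker F1 scores.
# All names and numbers here are illustrative assumptions.
from statistics import mean, stdev

def leave_one_speaker_out(speakers):
    """Yield (train_speakers, test_speaker) splits, one per speaker."""
    for test in speakers:
        train = [s for s in speakers if s != test]
        yield train, test

def f1_score(y_true, y_pred):
    """Binary F1 over frames: speaking (1) vs. not speaking (0)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def summarise_primitive(per_speaker_f1):
    """Overall performance and its spread across speakers; a lower spread
    means the detector works more equally well for all speakers."""
    return mean(per_speaker_f1), stdev(per_speaker_f1)
```

For example, `summarise_primitive([0.80, 0.90, 1.00])` returns a mean of 0.90 with a standard deviation of 0.10; between two primitives with similar means, the one with the smaller deviation would be the more consistent choice under the criterion the abstract describes.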
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.