Exploiting spatial and spectral information for audio source separation and speaker diarization

IRIS

The goal of multichannel audio source separation is to recover high-quality separated audio signals from observed mixtures of those signals. The problem is difficult not only because the sources propagate through noisy, reverberant environments, but also because the source signals overlap. Among the research directions pursued for this problem, probabilistic and advanced modeling aims to exploit the diversity of multichannel propagation and the redundancy of the source signals. Moreover, prior information about the environments or the signals helps to improve the quality of the separation and to accelerate it. In this thesis, we propose methods that increase the effectiveness of model-based audio source separation by exploiting prior information through spectral and sparse modeling. The work is divided into two main parts. In the first part, spectral modeling based on Nonnegative Matrix Factorization is adopted to represent the source signals. The parameters of Gaussian model-based source separation are estimated in the maximum-likelihood sense using a Generalized Expectation-Maximization algorithm that applies supervised Nonnegative Matrix and Tensor Factorization, given spectral descriptions of the source signals. Three ways of making the descriptions available are addressed: the descriptions are trained on-line during the separation, pre-trained and made directly available, or pre-trained and made indirectly available. In the last case, a detection method is proposed to identify the descriptions that best represent the signals in the mixtures. In the second part, sparse modeling is adopted to represent the propagation environments. Spatial descriptions of the environments, either deterministic or probabilistic, are pre-trained and made indirectly available. A detection method is proposed to identify the deterministic descriptions that best represent the environments. The detected descriptions are then used to perform source separation by minimizing a non-convex $l_0$-norm function. For speaker diarization, where the task is to determine "who spoke when" in real meetings, a Watson mixture model is optimized using an Expectation-Maximization algorithm in order to detect the probabilistic descriptions that best represent the environments and to estimate the temporal activity of each source. The performance of the proposed methods is evaluated experimentally on several datasets, both simulated and live-recorded. The results show that the proposed methods outperform recently developed methods used as baselines.
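As a rough illustration of the "pre-trained and made directly available" case described in the abstract, the sketch below keeps a spectral basis fixed and estimates only the nonnegative activations. It is not the thesis's method: the Gaussian model and the Generalized Expectation-Maximization estimation are not reproduced, and the standard multiplicative update for the Euclidean cost is swapped in for brevity. The function names, the matrix sizes, and the random "pre-trained" basis `W` are all assumptions made for this example.

```python
import numpy as np


def supervised_nmf_activations(V, W, n_iter=200, eps=1e-12, seed=0):
    """Estimate nonnegative activations H such that V ~ W @ H, with W held fixed.

    V: (n_freq, n_frames) nonnegative spectrogram-like matrix.
    W: (n_freq, n_components) fixed basis (stands in for pre-trained spectral descriptions).
    """
    rng = np.random.default_rng(seed)
    # Strictly positive random initialization of the activations.
    H = rng.random((W.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        # Multiplicative update for the Euclidean cost ||V - W H||_F^2;
        # it keeps H nonnegative because every factor is nonnegative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    return H


def demo():
    """Build a toy mixture from a hypothetical basis and fit activations to it."""
    rng = np.random.default_rng(1)
    W = rng.random((6, 2))        # hypothetical pre-trained basis (assumption)
    H_true = rng.random((2, 8))   # hypothetical ground-truth activations
    V = W @ H_true                # exact nonnegative low-rank data
    H0 = supervised_nmf_activations(V, W, n_iter=0)    # initialization only
    H = supervised_nmf_activations(V, W, n_iter=200)   # after the updates
    err0 = np.linalg.norm(V - W @ H0)
    err = np.linalg.norm(V - W @ H)
    return err0, err, H


if __name__ == "__main__":
    err0, err, _ = demo()
    print(f"reconstruction error before: {err0:.4f}, after: {err:.4f}")
```

Holding `W` fixed is what makes the factorization "supervised" in this sketch; training the descriptions on-line during separation would correspond to updating `W` as well.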

Exploiting spatial and spectral information for audio source separation and speaker diarization / Abdelraheem, Mahmoud Fakhry Mahmoud. - (2016), pp. 1-168.

Abdelraheem, Mahmoud Fakhry Mahmoud

2016-01-01



Defended on: 2016
Cycle: XXVIII
Academic year: 2015-2016
Department: Ingegneria e scienza dell'Informaz (29/10/12-)
Doctoral programme: Information and Communication Technology
Unitn internal supervisors: Omologo, Maurizio; Svaizer, Piergiorgio
Bi-nationally supervised doctoral thesis: no
Language: English
Reference SSD (valid until 24/06/2024): INF/01 - Informatica
Appears in types: 08.1 Doctoral Thesis

Files in this record:

File	Access	Type	License	Size	Format
ABDELRAHEEM_disclaimer.pdf	Archive managers only	Doctoral Thesis	All rights reserved	519 kB	Adobe PDF
PhD_Thesis.pdf	Open access	Doctoral Thesis	All rights reserved	2.9 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/368136
