Exploiting spatial and spectral information for audio source separation and speaker diarization / Abdelraheem, Mahmoud Fakhry Mahmoud. - (2016), pp. 1-168.

Exploiting spatial and spectral information for audio source separation and speaker diarization

Abdelraheem, Mahmoud Fakhry Mahmoud
2016-01-01

Abstract

The goal of multichannel audio source separation is to recover high-quality separated audio signals from observed mixtures of those signals. The problem is difficult not only because the sources propagate through noisy, reverberant environments, but also because the source signals overlap. Among the research directions pursued for this problem, probabilistic and advanced modeling aims to exploit the diversity of multichannel propagation and the redundancy of the source signals. Moreover, prior information about the environments or the signals helps to improve the quality of the separation and to accelerate it. In this thesis, we propose methods that increase the effectiveness of model-based audio source separation by exploiting prior information through spectral and sparse modeling. The work is divided into two main parts. In the first part, spectral modeling based on Nonnegative Matrix Factorization is adopted to represent the source signals. The parameters of Gaussian model-based source separation are estimated in the maximum-likelihood sense using a Generalized Expectation-Maximization algorithm that applies supervised Nonnegative Matrix and Tensor Factorization, given spectral descriptions of the source signals. Three ways of making the descriptions available are addressed: the descriptions are trained on-line during the separation, pre-trained and made directly available, or pre-trained and made indirectly available. In the last case, a detection method is proposed to identify the descriptions that best represent the signals in the mixtures. In the second part, sparse modeling is adopted to represent the propagation environments. Spatial descriptions of the environments, either deterministic or probabilistic, are pre-trained and made indirectly available. A detection method is proposed to identify the deterministic descriptions that best represent the environments. The detected descriptions are then used to perform source separation by minimizing a non-convex $\ell_0$-norm function. For speaker diarization, where the task is to determine "who spoke when" in real meetings, a Watson mixture model is optimized using an Expectation-Maximization algorithm in order to detect the probabilistic descriptions that best represent the environments and to estimate the temporal activity of each source. The performance of the proposed methods is evaluated experimentally on several datasets, both simulated and live-recorded. The results show that the proposed methods outperform recently developed methods used as baselines.
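
As a rough illustration of the supervised factorization mentioned in the abstract, the sketch below estimates nonnegative activations for a fixed, pre-trained spectral basis. It is not the thesis's algorithm: the abstract does not state the cost function or update rules, so the Euclidean cost with the standard multiplicative update is an assumption made here, and the function name `supervised_nmf_activations` and all array sizes are hypothetical.

```python
import numpy as np


def supervised_nmf_activations(V, W, n_iter=200, eps=1e-12, seed=0):
    """Estimate H >= 0 such that V is approximately W @ H, keeping W fixed.

    V: (F, N) nonnegative spectrogram-like matrix.
    W: (F, K) nonnegative spectral basis, assumed to be pre-trained.
    Assumption: Euclidean cost with the classic multiplicative update.
    """
    rng = np.random.default_rng(seed)
    # Random positive initialisation keeps every entry of H updatable.
    H = rng.random((W.shape[1], V.shape[1])) + eps
    WtV = W.T @ V      # constant across iterations because W is fixed
    WtW = W.T @ W
    for _ in range(n_iter):
        # The multiplicative update preserves nonnegativity of H.
        H *= WtV / (WtW @ H + eps)
    return H


# Toy usage with synthetic data (sizes are arbitrary, for illustration only).
rng = np.random.default_rng(1)
W_demo = rng.random((20, 4))            # stands in for a pre-trained basis
V_demo = W_demo @ rng.random((4, 30))   # synthetic nonnegative observations
H_demo = supervised_nmf_activations(V_demo, W_demo)
print(np.linalg.norm(V_demo - W_demo @ H_demo))
```

Holding `W` fixed is what makes the factorization "supervised" in this sketch: only the activations are fitted to the observed data, while the spectral descriptions come from elsewhere.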
2016
XXVIII
2015-2016
Ingegneria e scienza dell'Informaz (29/10/12-)
Information and Communication Technology
Omologo, Maurizio
Svaizer, Piergiorgio
no
English
Sector INF/01 - Computer Science
Files in this item:

ABDELRAHEEM_disclaimer.pdf
Access: archive managers only
Type: Doctoral Thesis
License: All rights reserved
Size: 519 kB
Format: Adobe PDF

PhD_Thesis.pdf
Access: open access
Type: Doctoral Thesis
License: All rights reserved
Size: 2.9 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/368136