Multi-omics integration for biomarker discovery and unsupervised subject clusterization. A novel computational method

Fiorentino, Giuseppe

doi:10.15168/11572_394750

The advent of the high-throughput era has resulted in rapid growth in the availability of large biological datasets. These massive datasets are organized in public or private repositories and encompass not only DNA but also multiple biomolecules that represent different layers of biological information. The examination and quantification of one such layer is commonly known as "omics", a term that covers the genome, proteome, transcriptome, and metabolome. It has become commonplace to conduct association analyses between a single omics and a specific phenotype. This practice has significantly enhanced our comprehension of both biological mechanisms and disease, particularly Mendelian disorders. However, the study of a single omics often fails to capture the full variation within a multi-layered mechanism, as well as the interplay between different biological layers, and therefore does not accurately characterize changes in complex disorders and regulatory systems. Hence, the integration of information from multiple omics has emerged as the prevailing approach, leading to the development of computational tools for multi-omics analysis. These tools are essential for further unraveling the underlying causes of complex diseases. However, the landscape of multi-omics analysis software is highly diverse, offering researchers a wide range of options in terms of purposes, data types, integration methods, and development techniques. This diversity provides tailored pipelines that cater to specific research needs, yet it also poses challenges, as the multitude of software options often lacks standardized practices and protocols. Consequently, a universally accepted gold standard is absent, impeding result reproducibility and comparability across different research efforts. To address this issue, we have developed MOUSSE, a novel modular omics-generic pipeline for unsupervised data integration.

The distinguishing feature of our tool is that it uses rank-based subject-specific signatures as input to derive a subject similarity network from each omics. This network maintains the informative content of the input data while reducing its size, and allows for a graph-based integration of multiple omics. Using the resulting integrated network, the pipeline clusters the subjects and allows researchers to identify biomarkers for each cluster. One aspect that sets MOUSSE apart from other techniques is that it requires almost no data preprocessing, making it more robust to noise in the data and more suitable for novel and not yet fully characterized data types. We tested our tool by analyzing ten publicly available benchmark datasets for different types of cancer. Each dataset contained data from three separate omics, namely transcriptome, methylome, and miRNAome. The aim of our analysis was twofold. First, we wanted to demonstrate that MOUSSE was able to identify the different cancer phenotypes as clusters; second, we aimed to demonstrate that the pipeline was also able to identify biomarkers for each cancer type or progression. Moreover, we compared MOUSSE clustering performance against ten multi-omics tools tested on the same data, achieving the highest median classification score. We then performed an additional analysis on the biomarkers selected by the pipeline for a selected number of cancer phenotypes, showing that MOUSSE was able to identify the markers underlying disease progression and differential survival rate between cancer phenotypes. Collectively, these results showed that MOUSSE clustering and biomarker identification can be reliable even when the disease is changing. Finally, we implemented MOUSSE as an R package. To enhance the pipeline, we incorporated an additional omics dataset. This integration allowed us to optimize the selection of subject-specific signatures and introduced the capability of running the tool iteratively. This means that users can refine their clustering results while reducing the number of candidates, thereby enhancing the overall effectiveness of the software.
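
The workflow outlined in the abstract (rank-based signatures, one subject similarity network per omics, graph-based integration, clustering) can be illustrated with a minimal Python sketch. This is not the MOUSSE implementation, which is distributed as an R package; every function name below is hypothetical, and the specific choices (top-k features as the signature, Jaccard overlap as the similarity, element-wise averaging as the integration rule, a two-way spectral split as the clustering step) are assumptions made only to keep the example small. The data are synthetic.

```python
import numpy as np

def top_k_signature(profile, k):
    """Rank-based signature: indices of the k highest-valued features of one subject."""
    return set(np.argsort(profile)[::-1][:k].tolist())

def similarity_network(data, k):
    """Subject-by-subject Jaccard overlap of top-k signatures (rows are subjects)."""
    sigs = [top_k_signature(row, k) for row in data]
    n = len(sigs)
    sim = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            sim[i, j] = sim[j, i] = len(sigs[i] & sigs[j]) / len(sigs[i] | sigs[j])
    return sim

def integrate(networks):
    """Assumed integration rule: element-wise mean of the per-omics networks."""
    return np.mean(networks, axis=0)

def two_clusters(sim):
    """Assumed clustering step: split subjects by the sign of the Fiedler vector."""
    laplacian = np.diag(sim.sum(axis=1)) - sim
    _, vecs = np.linalg.eigh(laplacian)  # eigenvalues are returned in ascending order
    return (vecs[:, 1] > 0).astype(int)

def synthetic_omics(rng, n_features, high_a, high_b):
    """Six subjects: the first three are high on features high_a, the last three on high_b."""
    data = rng.normal(scale=0.01, size=(6, n_features))
    data[:3, high_a] += 1.0
    data[3:, high_b] += 1.0
    return data

rng = np.random.default_rng(0)
omics_1 = synthetic_omics(rng, 20, slice(0, 6), slice(3, 9))   # groups share 3 top features
omics_2 = synthetic_omics(rng, 30, slice(0, 6), slice(4, 10))  # groups share 2 top features

networks = [similarity_network(omics, k=6) for omics in (omics_1, omics_2)]
integrated = integrate(networks)
labels = two_clusters(integrated)
print(labels)
```

Because only feature ranks enter the signature, the two synthetic omics need no rescaling before they are combined, which is the property the abstract attributes to rank-based input. Which group receives label 0 or 1 depends on the sign of the eigenvector and carries no meaning.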

Multi-omics integration for biomarker discovery and unsupervised subject clusterization. A novel computational method / Fiorentino, Giuseppe. - (2023 Nov 08), pp. 1-86. [10.15168/11572_394750]

2023-11-08


Defended on: 8 Nov 2023
Cycle: XXXIV
Academic year: 2022-2023
Department: CIBIO (29/10/12-)
Doctoral programme: Biomolecular Sciences
Unitn internal supervisor: Domenici, Enrico
Unitn co-supervisor: Marchetti, Luca
Bi-nationally supervised doctoral thesis: no
DOI: https://dx.doi.org/10.15168/11572_394750
Language: English
Reference SSD (valid until 24/06/2024): BIO/11 - Molecular Biology
Appears in types: 08.1 Doctoral Thesis

Files in this item:

File	Description	Type	Licence	Access	Size	Format
phd_unitn_Giuseppe_Fiorentino.pdf	Thesis, Giuseppe Fiorentino	Doctoral Thesis	All rights reserved	Open access from 02/11/2025	3.38 MB	Adobe PDF
appendix.zip	Appendix	Other attachments	All rights reserved	Open access from 02/11/2025	7.61 MB	Zip file