The advent of the high-throughput era has resulted in rapid growth in the availability of large biological datasets. These massive datasets are organized in public or private repositories and encompass not only DNA but also multiple biomolecules that represent different layers of biological information. The examination and quantification of one such layer is commonly known as "omics"; examples include the genome, proteome, transcriptome, and metabolome. It is now commonplace to conduct association analyses between a single omics and a specific phenotype. This practice has significantly enhanced our comprehension of both biological mechanisms and disease, particularly Mendelian disorders. However, the study of a single omics often fails to capture the entirety of variations within a multi-layered mechanism, as well as the interplay between different biological layers, and therefore does not accurately characterize changes in complex disorders and regulatory systems. Hence, the integration of information from multiple omics has emerged as the prevailing approach, leading to the development of computational tools for conducting multi-omics analyses. These tools are essential for further unraveling the underlying causes of complex diseases. However, the landscape of multi-omics analysis software is highly diverse, offering researchers a wide range of options in terms of purposes, data types, integration methods, and development techniques. This diversity provides tailored pipelines that cater to specific research needs. Yet it also poses challenges, as the multitude of software options often lacks standardized practices and protocols. Consequently, a universally accepted gold standard is absent, impeding result reproducibility and comparability across different research efforts. To address this issue, we have developed MOUSSE, a novel modular omics-generic pipeline for unsupervised data integration.
Our tool uses rank-based subject-specific signatures as input to derive a subject similarity network from each omics. This network maintains the informative content of the input data while reducing its size, and allows for a graph-based integration of multiple omics. Using the resulting integrated network, the pipeline clusters the subjects and allows researchers to identify biomarkers for each cluster. One aspect that sets MOUSSE apart from other techniques is that it requires almost no data preprocessing, making it more robust to noise in the data and more suitable for novel and not yet fully characterized data types. We tested our tool by analyzing ten publicly available benchmark datasets for different types of cancer. Each dataset contained data from three separate omics, namely transcriptome, methylome, and miRNAome. The aim of our analysis was twofold. First, we wanted to demonstrate that MOUSSE was able to identify the different phenotypes of cancers as clusters; second, we aimed to demonstrate that the pipeline was also able to identify biomarkers for each cancer type or progression. Moreover, we compared MOUSSE clustering performance against ten multi-omics tools tested on the same data, achieving the highest median classification score. We then performed an additional analysis on the biomarkers selected by the pipeline for a selected number of cancer phenotypes, showing that MOUSSE was able to identify the markers underlying disease progression and differential survival rate between cancer phenotypes. Collectively, these results showed that MOUSSE clustering and biomarker identification can be reliable even when the disease is changing. Finally, we successfully compiled and implemented MOUSSE as an R package. To enhance the pipeline, we incorporated an additional omics dataset. This integration allowed us to optimize the selection of subject-specific signatures and introduced the capability of running the tool iteratively.
This means that users can refine their clustering results while reducing the number of candidates, thereby enhancing the overall effectiveness of the software.
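The workflow described above (rank-based signatures, one subject similarity network per omics, graph-based integration, clustering) can be illustrated with a minimal Python sketch. This is not the MOUSSE implementation, which is an R package. The abstract does not state how signatures are built, which similarity measure is used, how networks are merged, or which clustering algorithm is applied. The sketch therefore assumes top-k ranked features as signatures, Jaccard similarity, edge-weight averaging, and thresholded connected components. All function names and the toy data are hypothetical.

```python
from itertools import combinations


def top_k_signature(profile, k):
    """Assumed signature: the k highest-ranked feature names of one subject."""
    ranked = sorted(profile, key=profile.get, reverse=True)
    return frozenset(ranked[:k])


def similarity_network(omics, k):
    """Map each subject pair to the Jaccard similarity of their signatures."""
    signatures = {s: top_k_signature(p, k) for s, p in omics.items()}
    network = {}
    for a, b in combinations(sorted(signatures), 2):
        union = signatures[a] | signatures[b]
        shared = signatures[a] & signatures[b]
        network[(a, b)] = len(shared) / len(union) if union else 0.0
    return network


def integrate(networks):
    """Assumed integration rule: average each edge weight across omics."""
    edges = set().union(*networks)
    return {e: sum(n.get(e, 0.0) for n in networks) / len(networks)
            for e in edges}


def cluster(network, threshold):
    """Connected components over edges with weight >= threshold (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for (a, b), weight in network.items():
        root_a, root_b = find(a), find(b)
        if weight >= threshold and root_a != root_b:
            parent[root_a] = root_b
    groups = {}
    for subject in list(parent):
        groups.setdefault(find(subject), set()).add(subject)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical toy data: four subjects, two omics, four features each.
transcriptome = {
    "s1": {"g1": 9.0, "g2": 8.0, "g3": 1.0, "g4": 0.5},
    "s2": {"g1": 7.5, "g2": 9.1, "g3": 0.2, "g4": 1.0},
    "s3": {"g1": 0.3, "g2": 1.2, "g3": 8.8, "g4": 9.4},
    "s4": {"g1": 1.1, "g2": 0.4, "g3": 9.9, "g4": 7.0},
}
methylome = {
    "s1": {"c1": 0.90, "c2": 0.80, "c3": 0.10, "c4": 0.20},
    "s2": {"c1": 0.70, "c2": 0.95, "c3": 0.30, "c4": 0.10},
    "s3": {"c1": 0.20, "c2": 0.10, "c3": 0.90, "c4": 0.80},
    "s4": {"c1": 0.10, "c2": 0.30, "c3": 0.85, "c4": 0.90},
}

fused = integrate([similarity_network(transcriptome, 2),
                   similarity_network(methylome, 2)])
clusters = cluster(fused, threshold=0.5)
print(clusters)
```

Because the signatures depend only on the within-subject ranking of features, rescaling a subject's values leaves the networks unchanged. This is consistent with the abstract's claim that little preprocessing is needed, although the actual pipeline may achieve this differently.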
Multi-omics integration for biomarker discovery and unsupervised subject clusterization. A novel computational method / Fiorentino, Giuseppe. - (2023 Nov 08), pp. 1-86. [10.15168/11572_394750]
Multi-omics integration for biomarker discovery and unsupervised subject clusterization. A novel computational method
Fiorentino, Giuseppe
2023-11-08
File | Size | Format | |
---|---|---|---|
phd_unitn_Giuseppe_Fiorentino.pdf (embargoed until 1 November 2025). Description: thesis, Giuseppe Fiorentino. Type: Doctoral Thesis. License: All rights reserved | 3.38 MB | Adobe PDF | View/Open |
appendix.zip (embargoed until 1 November 2025). Description: appendix. Type: Other attachments. License: All rights reserved | 7.61 MB | Zip File | View/Open |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.