A high performance computational environment for UHTS studies / Paoli, Silvano. - (2010), pp. 1-104.

A high performance computational environment for UHTS studies

Paoli, Silvano
2010-01-01

Abstract

This work addresses the use of high performance computing (HPC) methods for a new bioinformatics challenge: the analysis of terabyte-scale data generated by ultra high throughput sequencing (UHTS) technology. As with microarray and mass spectrometry data, public repositories are growing to store data from next-generation sequencing studies produced in laboratories around the world. These repositories give access to a large number of samples from experiments covering different individuals, populations and sequencing platforms. The experimental data behind scientific articles are also published there, making it possible to repeat and verify their results (reproducibility). An automatic downloader and analyzer system (the D-Daemons architecture) is proposed that interfaces with a public repository of sequence reads, selects all experiments matching user-defined research parameters, downloads them, and applies an analysis pipeline to highlight their similarity or variability. A software pipeline based on this architecture and operating in an HPC environment has been developed to analyze the downloaded UHTS files in the shortest possible time. Two applications are presented: a case study of the system on "Colorectal Cancer (CRC) cell line" datasets, and an aligner selection for a SNP discovery task on three RNA-Seq datasets (human breast tissue and the BT474 and MCF7 cell lines).
2010
XXII
2009-2010
Ingegneria e Scienza dell'Informaz (cess.4/11/12)
Information and Communication Technology
Furlanello, Cesare
no
English
Sector INF/01 - Computer Science
Sector BIO/18 - Genetics
Files in this item:
PhD-Thesis-Silvano-Paoli.pdf
Access: open access
Type: Doctoral Thesis
License: All rights reserved
Size: 1.96 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/368044