Combining learning and constraints for genome-wide protein annotation

IRIS

Background: The advent of high-throughput experimental techniques paved the way to genome-wide computational analysis and predictive annotation studies. When considering the joint annotation of a large set of related entities, like all proteins of a certain genome, many candidate annotations could be inconsistent, or very unlikely, given the existing knowledge. A sound predictive framework capable of accounting for this type of constraints in making predictions could substantially contribute to the quality of machine-generated annotations at a genomic scale. Results: We present Ocelot, a predictive pipeline which simultaneously addresses functional and interaction annotation of all proteins of a given genome. The system combines sequence-based predictors for functional and protein-protein interaction (PPI) prediction with a consistency layer enforcing (soft) constraints as fuzzy logic rules. The enforced rules represent the available prior knowledge about the classification task, including taxonomic constraints over each GO hierarchy (e.g. a protein labeled with a GO term should also be labeled with all ancestor terms) as well as rules combining interaction and function prediction. An extensive experimental evaluation on the Yeast genome shows that the integration of prior knowledge via rules substantially improves the quality of the predictions. The system largely outperforms GoFDR, the only high-ranking system at the last CAFA challenge with a readily available implementation, when GoFDR is given access to intra-genome information only (as Ocelot), and has comparable or better results (depending on the hierarchy and performance measure) when GoFDR is allowed to use information from other genomes. Our system also compares favorably to recent methods based on deep learning.

Combining learning and constraints for genome-wide protein annotation / Teso, S.; Masera, L.; Diligenti, M.; Passerini, A.. - In: BMC BIOINFORMATICS. - ISSN 1471-2105. - 20:1(2019), p. 338. [10.1186/s12859-019-2875-5]

Combining learning and constraints for genome-wide protein annotation

Teso S.;Masera L.;Diligenti M.;Passerini A.

2019-01-01

Abstract

Background: The advent of high-throughput experimental techniques paved the way to genome-wide computational analysis and predictive annotation studies. When considering the joint annotation of a large set of related entities, like all proteins of a certain genome, many candidate annotations could be inconsistent, or very unlikely, given the existing knowledge. A sound predictive framework capable of accounting for this type of constraints in making predictions could substantially contribute to the quality of machine-generated annotations at a genomic scale. Results: We present Ocelot, a predictive pipeline which simultaneously addresses functional and interaction annotation of all proteins of a given genome. The system combines sequence-based predictors for functional and protein-protein interaction (PPI) prediction with a consistency layer enforcing (soft) constraints as fuzzy logic rules. The enforced rules represent the available prior knowledge about the classification task, including taxonomic constraints over each GO hierarchy (e.g. a protein labeled with a GO term should also be labeled with all ancestor terms) as well as rules combining interaction and function prediction. An extensive experimental evaluation on the Yeast genome shows that the integration of prior knowledge via rules substantially improves the quality of the predictions. The system largely outperforms GoFDR, the only high-ranking system at the last CAFA challenge with a readily available implementation, when GoFDR is given access to intra-genome information only (as Ocelot), and has comparable or better results (depending on the hierarchy and performance measure) when GoFDR is allowed to use information from other genomes. Our system also compares favorably to recent methods based on deep learning.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno di pubblicazione (Date of publication)
	
				2019
			
	Titolo del periodico (Journal title)
	
				BMC BIOINFORMATICS
			
	Numero e parte del fascicolo (Issue number and part)
	
				1
			
	DOI
	
				https://dx.doi.org/10.1186/s12859-019-2875-5
			
	Codice PubMed (PubMed Identifier)
	
				31208327
			
	Codice Scopus (Scopus identifier)
	
				2-s2.0-85067382080
			
	Codice WOS (WOS identifier)
	
				WOS:000471857000006
			
	Tutti gli autori
	
						Teso, S.; Masera, L.; Diligenti, M.; Passerini, A.
					
	Citazione
	
				Combining learning and constraints for genome-wide protein annotation / Teso, S.; Masera, L.; Diligenti, M.; Passerini, A.. - In: BMC BIOINFORMATICS. - ISSN 1471-2105. - 20:1(2019), p. 338. [10.1186/s12859-019-2875-5]
			
	Appare nelle tipologie:
	
				03.1 Articolo su rivista (Journal article)

File in questo prodotto:

Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11572/251227

Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo

Citazioni

0

2

2

ND

social impact