HapScoreDB: a database of protein language model functional scores for haplotype-resolved protein sequences


Deciphering the functional effects of genetic variants, especially those inherited together on the same haplotype, remains a major challenge in human genetics, where epistasis among co-occurring variants can further complicate interpretation. To address this, we present HapScoreDB, a database offering protein language model-derived scores for haplotype-resolved protein-coding sequences across all human transcript isoforms. Leveraging GENCODE and Ensembl annotations with phased variant data from the 1000 Genomes Project, HapScoreDB includes over 130000 distinct protein haplotypes from >18000 genes and 78000 transcripts, encompassing over 94000 coding variants. Fitness scores for each haplotype were computed using state-of-the-art protein language models. Preliminary analyses show that haplotypes harboring cancer GWAS variants tend to have significantly reduced predicted fitness. Moreover, variability in scores across haplotypes of the same transcript highlights known cancer genes, suggesting that dispersion in predicted fitness may capture functionally important variation. HapScoreDB features a user-friendly web interface for interactive exploration, visualization, and download of both full and customized datasets. As a dynamic and expandable platform, it connects real-world human genetic variation with advanced protein modeling, enabling novel approaches in variant interpretation, isoform prioritization, and population-scale functional genomics. Access HapScoreDB at https://bcglab.cibio.unitn.it/hapscoredb.

HapScoreDB: a database of protein language model functional scores for haplotype-resolved protein sequences / Mazza, F., Gastaldello, F., Dalfovo, D., Lattanzi, G., Romanel, A.. - In: NUCLEIC ACIDS RESEARCH. - ISSN 1362-4962. - 54:D1(2026), pp. D1087-D1097. [10.1093/nar/gkaf1184]


Mazza, Fabio;Gastaldello, Filippo;Dalfovo, Davide;Lattanzi, Gianluca;Romanel, Alessandro

2026-01-01



Date of publication	2026
Journal title	NUCLEIC ACIDS RESEARCH
Issue number and part	D1
DOI	https://dx.doi.org/10.1093/nar/gkaf1184
PubMed identifier	41261743
Scopus identifier	2-s2.0-105027759597
WOS identifier	WOS:001618996900001
Appears in types	03.1 Journal article

Files in this item:

File	Size	Format
gkaf1184.pdf (open access; type: publisher's layout; license: Creative Commons)	1.63 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/479550
