HapScoreDB: a database of protein language model functional scores for haplotype-resolved protein sequences / Mazza, Fabio; Gastaldello, Filippo; Dalfovo, Davide; Lattanzi, Gianluca; Romanel, Alessandro. - In: NUCLEIC ACIDS RESEARCH. - ISSN 1362-4962. - 54:D1(2026), pp. D1087-D1097. [10.1093/nar/gkaf1184]
HapScoreDB: a database of protein language model functional scores for haplotype-resolved protein sequences
Mazza, Fabio;Gastaldello, Filippo;Dalfovo, Davide;Lattanzi, Gianluca;Romanel, Alessandro
2026-01-01
Abstract
Deciphering the functional effects of genetic variants, especially those inherited together on the same haplotype, remains a major challenge in human genetics, where epistasis among co-occurring variants can further complicate interpretation. To address this, we present HapScoreDB, a database offering protein language model-derived scores for haplotype-resolved protein-coding sequences across all human transcript isoforms. Leveraging GENCODE and Ensembl annotations with phased variant data from the 1000 Genomes Project, HapScoreDB includes over 130000 distinct protein haplotypes from >18000 genes and 78000 transcripts, encompassing over 94000 coding variants. Fitness scores for each haplotype were computed using state-of-the-art protein language models. Preliminary analyses show that haplotypes harboring cancer GWAS variants tend to have significantly reduced predicted fitness. Moreover, variability in scores across haplotypes of the same transcript highlights known cancer genes, suggesting that dispersion in predicted fitness may capture functionally important variation. HapScoreDB features a user-friendly web interface for interactive exploration, visualization, and download of both full and customized datasets. As a dynamic and expandable platform, it connects real-world human genetic variation with advanced protein modeling, enabling novel approaches in variant interpretation, isoform prioritization, and population-scale functional genomics. Access HapScoreDB at https://bcglab.cibio.unitn.it/hapscoredb.

| File | Access | Type | License | Size | Format |
|---|---|---|---|---|---|
| gkaf1184.pdf | Open access | Publisher's version (publisher's layout) | Creative Commons | 1.63 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.



