CSpace: a concept embedding space for biomedical applications

Tomasoni, Danilo (first author); Marchetti, Luca (last author)
2025-01-01

Abstract

Motivation. The rise of transformer-based architectures has dramatically improved our ability to analyze natural language. However, the power and flexibility of these general-purpose models come at the cost of highly complex architectures with billions of parameters, which are not always needed.

Results. In this work, we present CSpace: a concise word embedding of biomedical concepts that outperforms all alternatives in out-of-vocabulary ratio and on the semantic textual similarity task, and that performs comparably to transformer-based alternatives on the sentence similarity task. This ability can serve as the foundation for semantic search by enabling efficient retrieval of conceptually related terms. Additionally, CSpace incorporates ontological identifiers (MeSH, NCBI gene and taxonomy IDs), enabling computationally efficient measurement of relatedness among diseases, genes, and conditions, and potentially uncovering previously unknown disease-condition associations.
Year: 2025
Issue: 7
Authors: Tomasoni, Danilo; Marchetti, Luca
CSpace: a concept embedding space for biomedical applications / Tomasoni, Danilo; Marchetti, Luca. - In: BIOINFORMATICS. - ISSN 1367-4811. - 2025/41:7(2025), pp. btaf3761-btaf3768. [10.1093/bioinformatics/btaf376]
Files in this item:

File: TomasoniEtAl_CSpace_btaf376-2.pdf
Access: open access
Description: Main text
Type: Published version (Publisher’s layout)
License: Creative Commons
Size: 1.93 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/462591