A Database and Visualization of the Similarity of Contemporary Lexicons

Bella, Gábor; Batsuren, Khuyagbaatar; Giunchiglia, Fausto

doi:10.1007/978-3-030-83527-9_8

Lexical similarity data, quantifying the “proximity” of languages based on the similarity of their lexicons, has been increasingly used to estimate the cross-lingual reusability of language resources, for tasks such as bilingual lexicon induction or cross-lingual transfer. Existing similarity data, however, originates from the field of comparative linguistics, computed from very small expert-curated vocabularies that are not supposed to be representative of modern lexicons. We explore a different, fully automated approach to lexical similarity computation, based on an existing 8-million-entry cognate database created from online lexicons orders of magnitude larger than the word lists typically used in linguistics. We compare our results to earlier efforts, and automatically produce intuitive visualizations that have traditionally been hand-crafted. With a new, freely available database of over 27 thousand language pairs over 331 languages, we hope to provide more relevant data to cross-lingual NLP applications, as well as material for the synchronic study of contemporary lexicons.

A Database and Visualization of the Similarity of Contemporary Lexicons / Bella, G., Batsuren, K., Giunchiglia, F. - 12848 (2021), pp. 95-104. (24th International Conference on Text, Speech, and Dialogue, TSD 2021, Olomouc, Czech Republic, 6th-9th September 2021) [10.1007/978-3-030-83527-9_8].

2021-01-01

Date of publication: 2021
Proceedings title: Text, Speech, and Dialogue: 24th International Conference Proceedings
Place of publication: Cham, CH
Publisher: Springer Science and Business Media Deutschland GmbH
ISBN: 978-3-030-83526-2; 978-3-030-83527-9
Scopus identifier: 2-s2.0-85115252591
WOS identifier: WOS:001310782300008
Publication type: 04.1 Paper in Proceedings

Files in this item:

File	Size	Format
Lexical_Similarity_TSD(1).pdf (Open Access since 01/01/2023; Type: Refereed author’s manuscript (post-print); License: All rights reserved)	2.01 MB	Adobe PDF