String Similarity Measures and Pam-like Matrices for Cognate Identification

We present a new automatic learning system for cognate identification. We design a linguistically inspired substitution matrix to align our training dataset sensibly. We introduce a PAM-like technique, similar to the one successfully used in biological sequence analysis, to learn substitution parameters. We propose a novel family of parameterised string similarity measures and apply them, together with the PAM-like matrices, to the task of cognate identification. We train and test our proposal on standard datasets of Indo-European languages in orthographic format based on the Latin alphabet, but it could easily be adapted to datasets using any other alphabet, including the phonetic alphabet if data were available. We compare our system with other models reported in the literature, and the results show that our method outperforms both the orthographic and the phonetic approaches previously presented, increasing accuracy by approximately 5%.
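As a rough illustration of the PAM-like idea mentioned in the abstract, the sketch below counts character substitutions in word pairs that are already aligned and converts them into log-odds scores, in the spirit of the PAM matrices used in biological sequence analysis. The function name `log_odds_matrix`, the toy German-English pairs, the add-one smoothing and the `-` gap symbol are all assumptions made for illustration. None of them is taken from the paper, and the authors' actual procedure may differ.

```python
import math
from collections import Counter

def log_odds_matrix(aligned_pairs, pseudocount=1.0):
    """Estimate symmetric log-odds substitution scores from aligned pairs.

    aligned_pairs: iterable of (a, b) strings of equal length, where '-'
    marks a gap (hypothetical input format; gap columns are skipped).
    """
    pair_counts = Counter()
    char_counts = Counter()
    for a, b in aligned_pairs:
        if len(a) != len(b):
            raise ValueError("aligned strings must have equal length")
        for x, y in zip(a, b):
            if x == "-" or y == "-":
                continue
            # Count both directions so the resulting matrix is symmetric.
            pair_counts[(x, y)] += 1
            pair_counts[(y, x)] += 1
            char_counts[x] += 1
            char_counts[y] += 1

    alphabet = sorted(char_counts)
    n = len(alphabet)
    # Smoothed total, so that unseen pairs get a small non-zero probability.
    total_pairs = sum(pair_counts.values()) + pseudocount * n * n
    total_chars = sum(char_counts.values())

    scores = {}
    for x in alphabet:
        for y in alphabet:
            observed = (pair_counts[(x, y)] + pseudocount) / total_pairs
            expected = (char_counts[x] / total_chars) * (char_counts[y] / total_chars)
            # Positive score: the pair co-occurs more often than chance.
            scores[(x, y)] = math.log2(observed / expected)
    return scores

# Toy aligned cognate pairs (illustrative only, not from the paper's datasets).
pairs = [("nacht", "night"), ("-acht", "eight"), ("macht", "might")]
scores = log_odds_matrix(pairs)
```

A similarity measure could then sum such scores over the columns of an alignment, so that frequently observed correspondences (here `c`/`g` or `a`/`i`) raise the score of a candidate cognate pair while chance-level pairings lower it.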

String Similarity Measures and Pam-like Matrices for Cognate Identification / Cristianini, N., Delmestri, A. - Electronic. - XII:2,0(2010), pp. 1-11. (12th Annual Conference of the English Department, Bucharest Working Papers in Linguistics, Bucharest, Romania, 3-5 June 2010).

Cristianini, Nello;Delmestri, Antonella

2010-01-01

Date of publication: 2010
Place of publication: Trento
Publisher: Università degli Studi di Trento - Dipartimento di Ingegneria e Scienza dell'Informazione
Publication type: 07.2 Other types of publications

Files in this item:

File	Size	Format
Delmestri_Cristianini_2010a_DISI.pdf (open access; type: publisher's layout; licence: all rights reserved)	390.8 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/358525
