We present a new automatic learning system for cognate identification. We design a linguistically inspired substitution matrix to align our training dataset sensibly. We introduce a PAM-like technique, similar to the one successfully used in biological sequence analysis, to learn substitution parameters. We propose a novel family of parameterised string similarity measures and apply them, together with the PAM-like matrices, to the task of cognate identification. We train and test our proposal on standard datasets of Indo-European languages in orthographic format based on the Latin alphabet, but it could easily be adapted to datasets using any other alphabet, including the phonetic alphabet if data were available. We compare our system with other models reported in the literature, and the results show that our method outperforms both the orthographic and phonetic approaches previously presented, increasing accuracy by approximately 5%.
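As a rough illustration of the PAM-like idea mentioned in the abstract, the sketch below follows the general recipe known from biological sequence analysis: count substitutions in aligned pairs, scale the resulting transition matrix to a 1% expected change rate, extrapolate by matrix powers, and take log-odds. It is not the authors' implementation. The function name `pam_like_scores`, the pseudocount smoothing, the gap symbol `-` and the toy word pairs are all assumptions made for this example.

```python
import math
from collections import Counter


def _matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    k = len(a)
    return [[sum(a[i][t] * b[t][j] for t in range(k)) for j in range(k)]
            for i in range(k)]


def pam_like_scores(aligned_pairs, n=50, pseudocount=1.0):
    """Return log-odds scores {(x, y): score} for a PAM-n-like matrix.

    Hypothetical helper: `aligned_pairs` holds equal-length strings in
    which '-' marks a gap; columns containing a gap are ignored.
    """
    # Symmetric substitution counts from gap-free alignment columns.
    counts = Counter()
    for left, right in aligned_pairs:
        for x, y in zip(left, right):
            if x != "-" and y != "-":
                counts[(x, y)] += 1
                counts[(y, x)] += 1

    alphabet = sorted({x for x, _ in counts})
    k = len(alphabet)

    # Smoothed counts (the pseudocount is an assumption of this sketch;
    # it keeps every probability positive so the logarithm is defined).
    c = [[counts[(x, y)] + pseudocount for y in alphabet] for x in alphabet]
    row = [sum(r) for r in c]
    total = sum(row)
    freq = [r / total for r in row]  # background character frequencies
    p = [[c[i][j] / row[i] for j in range(k)] for i in range(k)]

    # Scale so that, on average, 1% of characters change in one step.
    rate = sum(freq[i] * (1.0 - p[i][i]) for i in range(k))
    lam = 0.01 / rate
    m1 = [[lam * p[i][j] if i != j else 1.0 - lam * (1.0 - p[i][i])
           for j in range(k)] for i in range(k)]
    if any(m1[i][i] < 0.0 for i in range(k)):
        raise ValueError("data too sparse to scale to a 1% change rate")

    # Extrapolate to n steps by repeated matrix multiplication.
    mn = m1
    for _ in range(n - 1):
        mn = _matmul(mn, m1)

    # Log-odds of a substitution against the chance expectation.
    return {(alphabet[i], alphabet[j]): math.log(mn[i][j] / freq[j])
            for i in range(k) for j in range(k)}


# Toy aligned pairs, invented for illustration only (not from the paper).
toy_pairs = [("nacht", "night"), ("va-ter", "father"), ("mutter", "mother")]
scores = pam_like_scores(toy_pairs, n=50)
print(scores[("t", "t")], scores[("t", "h")], scores[("a", "c")])
```

Scores of this kind could then feed an alignment-based similarity measure. The specific measures proposed in the paper are not described in this record, so they are not sketched here.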
String Similarity Measures and Pam-like Matrices for Cognate Identification / Cristianini, Nello; Delmestri, Antonella. - Electronic. - XII:2,0(2010), pp. 1-11. (Paper presented at the 12th Annual Conference of the English Department, Bucharest Working Papers in Linguistics, held in Bucharest, Romania, 3-5 June 2010).
String Similarity Measures and Pam-like Matrices for Cognate Identification
Delmestri, Antonella
2010-01-01
File: Delmestri_Cristianini_2010a_DISI.pdf
Access: open access
Type: Publisher's layout
License: All rights reserved
Size: 390.8 kB
Format: Adobe PDF