Robustness and Statistical Significance of Pam-like Matrices for Cognate Identification

IRIS

This paper tests how the size of the training dataset affects a recently proposed orthographic learning system, inspired by biological sequence analysis and successfully applied to cognate identification. The system automatically aligns a given set of cognate pairs to produce a meaningful training dataset, learns substitution parameters from it using a PAM-like technique, and uses them to recognise cognate pairs. The results show no difference in performance whether the system is trained on about 650 cognate pairs extracted from 6 Indo-European languages or on about 62,000 cognate pairs extracted from 76 Indo-European languages. In both cases the system outperforms all comparable orthographic and phonetic methods previously proposed in the literature. The paper also investigates the statistical significance of these results against earlier proposals. The outcome confirms that the performance reached with either training dataset is significantly higher than that achieved by all the other methods. The training dataset size thus appears to influence neither the accuracy nor the statistical significance of this learning system, which needs only a very small amount of data to reach an outstanding performance.
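The pipeline summarised in the abstract (align cognate pairs, learn substitution parameters, score new pairs) can be illustrated with a minimal sketch. This is not the authors' implementation: it estimates plain log-odds substitution scores from already aligned pairs and omits the evolutionary-distance extrapolation that a real PAM construction involves. The function names, the pseudocount, the gap penalty, the score for unseen symbol pairs, and the three-pair toy dataset are all assumptions made for illustration.

```python
import math
from collections import Counter

GAP = "-"  # gap symbol used in the pre-aligned toy pairs (assumption)


def log_odds_scores(aligned_pairs, pseudocount=1.0):
    """Estimate symmetric log-odds substitution scores from aligned word pairs.

    Simplified stand-in for a PAM-like estimate: score(x, y) is
    log2(observed pair frequency / frequency expected by chance).
    """
    pair_counts = Counter()
    symbol_counts = Counter()
    for a, b in aligned_pairs:
        if len(a) != len(b):
            raise ValueError("aligned words must have equal length")
        for x, y in zip(a, b):
            if x == GAP or y == GAP:
                continue  # gapped columns carry no substitution evidence
            # Count both orders so the resulting matrix is symmetric.
            pair_counts[(x, y)] += 1
            pair_counts[(y, x)] += 1
            symbol_counts[x] += 1
            symbol_counts[y] += 1
    alphabet = sorted(symbol_counts)
    total_pairs = sum(pair_counts.values()) + pseudocount * len(alphabet) ** 2
    total_symbols = sum(symbol_counts.values())
    scores = {}
    for x in alphabet:
        for y in alphabet:
            observed = (pair_counts[(x, y)] + pseudocount) / total_pairs
            expected = (symbol_counts[x] / total_symbols) * (
                symbol_counts[y] / total_symbols
            )
            scores[(x, y)] = math.log2(observed / expected)
    return scores


def alignment_score(a, b, scores, gap_penalty=-2.0, unseen=-1.0):
    """Global (Needleman-Wunsch) alignment score of two words.

    Symbol pairs absent from the learned scores get the `unseen` score.
    """
    prev = [j * gap_penalty for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap_penalty]
        for j in range(1, len(b) + 1):
            sub = scores.get((a[i - 1], b[j - 1]), unseen)
            curr.append(
                max(
                    prev[j - 1] + sub,          # substitute a[i-1] with b[j-1]
                    prev[j] + gap_penalty,      # gap in b
                    curr[j - 1] + gap_penalty,  # gap in a
                )
            )
        prev = curr
    return prev[-1]


# Toy training data, hand-aligned for illustration only (English / German).
toy_aligned = [("night", "nacht"), ("light", "licht"), ("eight", "-acht")]
scores = log_odds_scores(toy_aligned)

print(alignment_score("night", "nacht", scores))
print(alignment_score("night", "zzzzz", scores))
```

In a setting like the one the abstract describes, a pair would be accepted as cognate when its score exceeds some threshold; how that threshold and the gap penalty are chosen is left open here.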

Robustness and Statistical Significance of Pam-like Matrices for Cognate Identification / Delmestri, A., Cristianini, N.. - ELETTRONICO. - (2010), pp. 1-10.


Delmestri, Antonella; Cristianini, Nello

2010-01-01



Date of publication: 2010
Place of publication: Trento
Publisher: University of Trento - Dipartimento di Ingegneria e Scienza dell'Informazione
Citation: Robustness and Statistical Significance of Pam-like Matrices for Cognate Identification / Delmestri, A., Cristianini, N.. - ELETTRONICO. - (2010), pp. 1-10.
All authors: Delmestri, Antonella; Cristianini, Nello
Appears in types: 07.2 Other types of publications

Files in this record:

File	Access	Type	License	Size	Format
CognateIdentification_Followup_DISI.pdf	Open access	Publisher's layout	All rights reserved	595.86 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/358430

