Data Driven Models for Language Evolution / Delmestri, Antonella. - (2011), pp. 1-198.

Data Driven Models for Language Evolution

Delmestri, Antonella
2011-01-01

Abstract

Natural languages that originate from a common ancestor are genetically related. Words are the core of any language, and cognates are words that share the same ancestor and etymology. Cognate identification therefore represents the foundation upon which the evolutionary history of languages may be discovered, while linguistic phylogenetic inference aims to estimate the genetic relationships that exist between them. In this thesis, using several techniques originally developed for biological sequence analysis, we have designed a data-driven orthographic learning system for measuring string similarity, and we have successfully applied it to the tasks of cognate identification and phylogenetic inference. Our system has outperformed the best comparable phonetic and orthographic cognate identification models previously reported in the literature, with results that are statistically significant and remarkably stable regardless of the size of the training dataset. When applied to phylogenetic inference of the Indo-European language family, whose higher-level structure has not yet reached consensus, our method has estimated phylogenies that are compatible with the benchmark tree and has correctly reproduced all the established major language groups and subgroups present in the dataset.
2011
XXI
2010-2011
Ingegneria e Scienza dell'Informaz (cess.4/11/12)
Information and Communication Technology
Marchese, Maurizio
Cristianini, Nello
no
English
Sector INF/01 - Computer Science
Files in this item:
PhD-Thesis_Uploaded.pdf

Open access

Type: Doctoral Thesis
License: All rights reserved
Size: 2.06 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/368357