
Contributions to Semantic Textual Similarity Algorithms / Vo, Ngoc Phuoc An. - (2016), pp. 1-185.

Contributions to Semantic Textual Similarity Algorithms

Vo, Ngoc Phuoc An
2016-01-01

Abstract

Similarity plays a central role in the language understanding process. However, it is difficult to define precisely which types of data and which similarity metrics can be used to assess the similarity of two texts. In this spirit, the Semantic Textual Similarity (STS) task was introduced as a pilot task at the Semantic Evaluation (SemEval) workshop in 2012. This thesis investigates how the performance of STS systems varies across heterogeneous data sources and seeks solutions that reduce this variance and improve system performance. We carry out a series of studies addressing different aspects of measuring the semantic similarity of texts under the umbrella of the STS task. First, we analyze the variance of system performance on different corpora in preliminary experiments and propose the hypothesis that system performance depends heavily on the type of training and test corpora, which come from heterogeneous sources. We then analyze a standard textual similarity model built on vectorial representations and derive a couple of modalities that significantly alleviate the negative influence of the vectorial mapping model. In particular, we study how structural information and the most advanced word alignment models from Machine Translation improve system accuracy. Our analysis also leads us to carry out, for the first time, an analysis of the relationship between Semantic Relatedness and Textual Entailment; we then propose a co-learning model that improves accuracy on both tasks by exploiting their mutual relationship. Together, these steps lead to a consistent improvement over the standard model across corpora. The evaluation shows that our system systematically reaches and surpasses the former state of the art while also reducing the variation in accuracy across different types of corpora.
2016
XXVIII
2015-2016
Ingegneria e scienza dell'Informaz (29/10/12-)
Information and Communication Technology
Popescu, Octavian
Strapparava, Carlo
no
English
Sector INF/01 - Computer Science
Files in this item:
PhD-Thesis_VO.pdf
Access: archive managers only
Type: Doctoral Thesis
License: All rights reserved
Size: 4.82 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/369262