Controlling the effect of crowd noisy annotations in NLP Tasks / Abad, Azad. - (2017), pp. 1-121.

Controlling the effect of crowd noisy annotations in NLP Tasks

Abad, Azad
2017-01-01

Abstract

Natural Language Processing (NLP) is a subfield of Artificial Intelligence and Linguistics that studies problems in the automatic generation and understanding of natural language. It involves identifying and exploiting linguistic rules and variation in code to translate unstructured language data into information with a schema. Empirical methods in NLP employ machine learning techniques to extract linguistic knowledge automatically from large text collections instead of hard-coding that knowledge. Such systems require input data to be prepared so that the computer can more easily find patterns and draw inferences. This is achieved by adding relevant metadata to a dataset; any metadata tag used to mark up elements of the dataset is called an annotation over the input. For the algorithms to learn efficiently and effectively, the annotations must be accurate and relevant to the task the machine is asked to perform. In other words, supervised machine learning methods cannot intrinsically handle inaccurate and noisy annotations, and the performance of the learners correlates strongly with the quality of the input labels. Hence, the annotations have to be prepared by experts. However, collecting labels for a large dataset is impractical for a small group of qualified experts, or when experts are unavailable. This is especially critical for recent deep learning methods, whose algorithms are hungry for large amounts of supervised data. Crowdsourcing has emerged as a new paradigm for obtaining labels for training machine learning models inexpensively and at high volume. The rationale behind it is to harness the “wisdom of the crowd”, where groups of people pool their abilities to show collective intelligence.
Although crowdsourcing is cheap and fast, collecting high-quality data from a non-expert crowd requires careful attention to quality control. The quality control process consists of selecting appropriately qualified workers, providing clear instructions or training that non-experts can understand, and sanitizing the results to reduce the noise in the annotations or to eliminate low-quality workers. This thesis is dedicated to controlling the effect of noisy crowd annotations used for training machine learning models in a variety of natural language processing tasks, namely relation extraction, question answering, and recognizing textual entailment. The first part of the thesis deals with designing a benchmark for evaluating Distant Supervision (DS) for the relation extraction task. We propose a baseline that trains a simple yet accurate one-vs-all strategy using an SVM classifier. Moreover, we exploit an automatic feature extraction technique based on convolutional tree kernels and study several example-filtering techniques for improving the quality of the DS output. In the second part, we focus on the problem of noisy crowd annotations in training two important NLP tasks, i.e., question answering and recognizing textual entailment. We propose two learning methods to handle the noisy labels by (i) taking into account the disagreement between crowd annotators, as well as their skills, for weighting instances in learning algorithms; and (ii) learning an automatic label selection model that jointly combines annotator characteristics and the syntactic structure representation of the task as features. Finally, we observe that in fine-grained tasks like relation extraction, where annotators need deeper expertise, training the crowd workers has more impact on the results than simply filtering out the low-quality crowd workers.
Training crowd workers often requires high-quality labeled data (namely, a gold standard) to provide instruction and feedback to the crowd workers. We instead introduce a self-training strategy for crowd workers in which the training examples are automatically selected by a classifier. Our study shows that, even without using any gold standard, we can still train workers, which opens the door to inexpensive crowd training procedures for different NLP tasks.
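As a rough illustration of the instance-weighting idea in (i), the sketch below derives a per-example weight from annotator disagreement and an estimated annotator skill. It is not the thesis's actual formulation: the majority-vote consensus, the agreement-rate skill estimate, and all function and variable names are assumptions made for this example.

```python
from collections import Counter, defaultdict


def majority_labels(annotations):
    """Map each item to its most frequent crowd label.

    annotations: iterable of (item_id, worker_id, label) triples.
    """
    votes = defaultdict(Counter)
    for item, _worker, label in annotations:
        votes[item][label] += 1
    return {item: counts.most_common(1)[0][0] for item, counts in votes.items()}


def worker_skill(annotations, consensus):
    """Assumed skill estimate: a worker's rate of agreement with the consensus."""
    hits, total = Counter(), Counter()
    for item, worker, label in annotations:
        total[worker] += 1
        hits[worker] += int(label == consensus[item])
    return {worker: hits[worker] / total[worker] for worker in total}


def instance_weights(annotations, consensus, skill):
    """Weight = skill mass behind the consensus label / total skill mass on the item."""
    agree, mass = defaultdict(float), defaultdict(float)
    for item, worker, label in annotations:
        mass[item] += skill[worker]
        if label == consensus[item]:
            agree[item] += skill[worker]
    return {item: (agree[item] / mass[item] if mass[item] else 0.0) for item in mass}


# Toy data (hypothetical): three workers label two question-answer pairs.
annotations = [
    ("q1", "a", 1), ("q1", "b", 1), ("q1", "c", 1),
    ("q2", "a", 0), ("q2", "b", 0), ("q2", "c", 1),
]
consensus = majority_labels(annotations)
skill = worker_skill(annotations, consensus)
weights = instance_weights(annotations, consensus, skill)
```

Items on which the crowd agrees receive a weight close to 1, while contested items are down-weighted. Weights of this kind could then be supplied to a learner that accepts per-example weights.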
2017
XXVII
2017-2018
Ingegneria e scienza dell'Informaz (29/10/12-)
Information and Communication Technology
Moschitti, Alessandro
no
English
Sector INF/01 - Computer Science
Files in this record:
Disclaimer_Abadi.pdf: open access; Type: Doctoral Thesis; License: All rights reserved; Size: 238.82 kB; Format: Adobe PDF
PhD-Thesis.pdf: archive managers only; Type: Doctoral Thesis; License: All rights reserved; Size: 7.34 MB; Format: Adobe PDF


Use this identifier to cite or link to this document: https://hdl.handle.net/11572/369190