Controlling the effect of crowd noisy annotations in NLP Tasks

Abad, Azad

doi:10.15168/11572_369190

Natural Language Processing (NLP) is a sub-field of Artificial Intelligence and Linguistics that studies problems in the automatic generation and understanding of natural language. It involves identifying and exploiting linguistic rules and variation with code to translate unstructured language data into information with a schema. Empirical methods in NLP employ machine learning techniques to automatically extract linguistic knowledge from large textual data instead of hard-coding the necessary knowledge. Such systems require input data to be prepared in a way that lets the computer find patterns and draw inferences more easily. This is made feasible by adding relevant metadata to a dataset. Any metadata tag used to mark up elements of the dataset is called an annotation over the input. For the algorithms to learn efficiently and effectively, the annotations must be accurate and relevant to the task the machine is asked to perform. In other words, supervised machine learning methods intrinsically cannot handle inaccurate and noisy annotations, and the performance of the learners is highly correlated with the quality of the input data labels. Hence, the annotations have to be prepared by experts. However, collecting labels for a large dataset is impractical for a small group of qualified experts, and impossible when experts are unavailable. This is especially crucial for recent deep learning methods, whose algorithms are hungry for large amounts of supervised data. Crowdsourcing has emerged as a new paradigm for obtaining labels for training machine learning models inexpensively and at high data volume. The rationale behind this concept is to harness the “wisdom of the crowd”, where groups of people pool their abilities to show collective intelligence.
Although crowdsourcing is cheap and fast, collecting high-quality data from a non-expert crowd requires careful attention to task quality control. The quality control process consists of selecting appropriately qualified workers, providing clear instructions or training that are understandable to non-experts, and sanitizing the results to reduce the noise in annotations or eliminate low-quality workers. This thesis is dedicated to controlling the effect of noisy crowd annotations used for training machine learning models in a variety of natural language processing tasks, namely relation extraction, question answering and recognizing textual entailment. The first part of the thesis deals with designing a benchmark for evaluating Distant Supervision (DS) for the relation extraction task. We propose a baseline that involves training a simple yet accurate one-vs-all strategy using an SVM classifier. Moreover, we exploit an automatic feature extraction technique using convolutional tree kernels and study several example filtering techniques for improving the quality of the DS output. In the second part, we focus on the problem of noisy crowd annotations in training two important NLP tasks, i.e., question answering and recognizing textual entailment. We propose two learning methods to handle the noisy labels by (i) taking into account the disagreement between crowd annotators as well as their skills for weighting instances in learning algorithms; and (ii) learning an automatic label selection model that jointly combines annotator characteristics and the syntactic structure representation of the task as features. Finally, we observe that in fine-grained tasks like relation extraction, where the annotators need deeper expertise, training the crowd workers has more impact on the results than simply filtering out the low-quality crowd workers.
Training crowd workers often requires high-quality labeled data (namely, a gold standard) to provide instruction and feedback to the crowd workers. We instead introduce a self-training strategy for crowd workers in which the training examples are automatically selected via a classifier. Our study shows that even without any gold standard we can still train workers, which opens the door to inexpensive crowd training procedures for different NLP tasks.
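
The idea of weighting instances by annotator disagreement and annotator skill, mentioned in the abstract, can be illustrated with a minimal sketch. This is not the thesis's actual formulation: the function name `weighted_labels`, the skill scores, the default skill of 0.5 for unknown workers, and the choice of a skill-weighted vote share as the instance weight are all assumptions made here for illustration only.

```python
from collections import defaultdict


def weighted_labels(annotations, skill, default_skill=0.5):
    """Aggregate crowd labels and derive one weight per instance.

    annotations: dict mapping instance id -> list of (worker_id, label)
    skill: dict mapping worker_id -> skill score in [0, 1] (hypothetical values)
    Returns a dict mapping instance id -> (chosen_label, weight), where the
    weight is the share of total skill mass that voted for the chosen label.
    High disagreement among annotators therefore yields a low weight.
    """
    result = {}
    for item, votes in annotations.items():
        mass = defaultdict(float)
        for worker, label in votes:
            # Unknown workers get an assumed neutral skill score.
            mass[label] += skill.get(worker, default_skill)
        total = sum(mass.values())
        best = max(mass, key=mass.get)
        weight = mass[best] / total if total > 0 else 0.0
        result[item] = (best, weight)
    return result


# Toy data, invented for this example.
annotations = {
    "q1": [("w1", "yes"), ("w2", "yes"), ("w3", "no")],
    "q2": [("w1", "no"), ("w2", "yes")],
}
skill = {"w1": 0.9, "w2": 0.6, "w3": 0.5}

for item, (label, weight) in weighted_labels(annotations, skill).items():
    print(item, label, round(weight, 2))
```

The resulting weights could then be handed to any learner that accepts per-instance weights, so that instances with strong, skilled agreement count more during training than contested ones.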

Controlling the effect of crowd noisy annotations in NLP Tasks / Abad, Azad. - (2017), pp. 1-121. [10.15168/11572_369190]

2017-01-01


Defended on: 2017
Cycle: XXVII
Academic year: 2017-2018
Department: Ingegneria e scienza dell'Informaz (29/10/12-)
Doctoral programme: Informatica e telecomunicazioni (until academic year 2020-21, 36th cycle)
Unitn internal supervisor: Moschitti, Alessandro
Bi-nationally supervised doctoral thesis: no
DOI: https://dx.doi.org/10.15168/11572_369190
Language: English
Reference SSD (valid until 24/06/2024): Settore INF/01 - Informatica
Appears in types: 08.1 Tesi di dottorato (Doctoral Thesis)

Files in this item:

File	Access	Type	License	Size	Format
Disclaimer_Abadi.pdf	Open access	Doctoral Thesis	All rights reserved	238.82 kB	Adobe PDF
PhD-Thesis.pdf	Open access from 02/12/2017	Doctoral Thesis	All rights reserved	7.34 MB	Adobe PDF