Crowd and Hybrid Algorithms for Cost-Aware Classification / Krivosheev, Evgeny. - (2020 May 28), pp. 1-141. [10.15168/11572_263787]
Crowd and Hybrid Algorithms for Cost-Aware Classification
Krivosheev, Evgeny
2020-05-28
Abstract
Classification is a pervasive problem in research that aims at grouping items into categories according to established criteria. There are two prevalent ways to classify items of interest: i) to train and exploit machine learning (ML) algorithms, or ii) to resort to human classification (via experts or crowdsourcing). Machine learning algorithms have been improving rapidly, with impressive performance on complex problems such as object recognition and natural language understanding. However, in many cases they cannot yet deliver the required levels of precision and recall, typically due to the difficulty of the problem and the lack of sufficiently large and clean datasets. Research in crowdsourcing has also made impressive progress in the last few years, and the crowd has been shown to perform well even in difficult tasks [Callaghan et al., 2018; Ranard et al., 2014]. However, crowdsourcing remains expensive, especially when aiming at high levels of accuracy, which often implies collecting more votes per item to make classification more robust to workers' errors. Recently, a third direction has been rapidly emerging: hybrid crowd-machine classification, which can achieve superior performance by combining the cost-effectiveness of automatic machine classifiers with the accuracy of human judgment. In this thesis, we focus on designing crowdsourcing strategies and hybrid crowd-machine approaches that optimize the item classification problem in terms of results and budget. We start by investigating crowd-based classification under a budget constraint with different loss implications, i.e., when false positive and false negative errors cause different degrees of harm to the task. Further, we propose and validate a probabilistic crowd classification algorithm that iteratively estimates the statistical parameters of the task and data to efficiently manage the accuracy vs. cost trade-off.
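To make the asymmetric-loss setting concrete, the following is a minimal sketch and not the algorithm proposed in the thesis. It assumes binary votes, workers who are independent and share one known accuracy, and a known prior; the names `posterior_positive`, `classify`, `cost_fp`, and `cost_fn` are hypothetical and introduced only for illustration.

```python
from math import prod


def posterior_positive(votes, accuracy=0.8, prior=0.5):
    """Posterior probability that an item is positive given binary votes.

    Assumption (not from the thesis): every worker is independent and
    answers correctly with the same probability `accuracy`.
    """
    like_pos = prod(accuracy if v == 1 else 1 - accuracy for v in votes)
    like_neg = prod(accuracy if v == 0 else 1 - accuracy for v in votes)
    num = prior * like_pos
    return num / (num + (1 - prior) * like_neg)


def classify(votes, cost_fp=1.0, cost_fn=1.0, accuracy=0.8, prior=0.5):
    """Pick the label with the lower expected loss.

    cost_fp: loss of labelling a negative item as positive.
    cost_fn: loss of labelling a positive item as negative.
    """
    p = posterior_positive(votes, accuracy, prior)
    loss_if_positive = (1 - p) * cost_fp  # wrong only if item is negative
    loss_if_negative = p * cost_fn        # wrong only if item is positive
    return 1 if loss_if_positive < loss_if_negative else 0


# One positive vote out of three: the label depends on the cost ratio.
votes = [1, 0, 0]
label_equal_costs = classify(votes)                # symmetric losses
label_costly_miss = classify(votes, cost_fn=10.0)  # misses are 10x worse
```

Under these assumptions the same three votes lead to different labels once a missed positive is made ten times as costly as a false alarm, which is why the number of votes worth buying per item depends on the loss ratio as well as on worker accuracy.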
We then investigate how the crowd and machines can support each other in tackling classification problems. We present and evaluate a set of hybrid strategies that balance investing money in building machine classifiers against exploiting them jointly with crowd-based classifiers. While analyzing our results on crowd and hybrid classification, we found it relevant to study two further problems: the quality of crowd observations and the confusions they contain, and the linking of entities from structured and unstructured data sources. We propose crowd-based and neural-network-based algorithms to address these challenges and evaluate them extensively on synthetic and real-world datasets.

File | Access | Type | License | Size | Format
---|---|---|---|---|---
evgeny_phd_theses_final.pdf | Open Access from 29/05/2022 | Doctoral Thesis | All rights reserved | 13.28 MB | Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.