Languages describe the world in diverse ways, a phenomenon known as linguistic diversity, which is also reflected in their lexical-semantic resources (LSRs). These resources, such as online lexicons and WordNets, are essential for natural language processing (NLP) applications. However, in many languages these resources suffer from pervasive quality issues, including inaccuracies, incompleteness, and, notably, a bias towards the English language and Anglo-Saxon culture. This bias is evident in the omission of concepts unique to specific languages or cultures, the inclusion of foreign (Anglo-Saxon) concepts, and the lack of explicit indications of untranslatability. The latter phenomenon, known as a cross-lingual lexical gap, occurs when a term in one language has no equivalent in another. The development of diversity-aware multilingual lexical resources faces significant challenges, particularly in addressing cross-lingual lexical gaps and capturing linguistic diversity in low-resource languages. Current approaches often lack systematic methods to identify lexical untranslatability, leading to resources that inadequately represent the linguistic and cultural richness of diverse languages. Low-resource languages are especially underserved due to the lack of comprehensive lexical databases and the limitations of automated methods in capturing linguistic nuances. Existing practices, which are often unidirectional and rely on English as a pivot language, exacerbate English bias and distort the representation of non-English linguistic concepts. Furthermore, expert-driven methods typically focus narrowly on advanced-level lexical gaps in a limited subset of languages, leaving much of the world’s linguistic diversity unaddressed. These challenges impede the creation of high-quality, inclusive datasets essential for advancing NLP applications, particularly for underrepresented languages.
This thesis presents a systematic hybrid approach for generating diversity-aware, multilingual lexical-semantic resources by integrating an expert-driven method with a novel crowdsourcing methodology. The expert-driven method involves language specialists in a structured process to identify lexical gaps, encompassing contribution collection, validation, and concept-level verification. The crowdsourcing methodology complements this by facilitating the bidirectional exploration of lexical gaps between language pairs without relying on English as an intermediary, leveraging contributions from native speakers. Based on these considerations, the main contributions of this research are as follows. (i) a systematic hybrid approach consisting of an expert-driven method and a novel crowdsourcing methodology for collecting data on linguistic diversity. (ii) development of LingoGap, a crowdsourcing platform specifically designed to gather linguistic diversity data from ordinary native speakers. (iii) validation of the expert-driven method through five case studies, including: (a) three large-scale studies conducted on kinship terminology across seven Arabic dialects, three Indonesian languages, and ten languages from various language families. (b) a study focusing on basic-level categories in six languages: Arabic, Turkish, Persian, Indonesian, Banjarese, and Javanese. (c) a study aimed at enhancing the Arabic WordNet to address linguistic diversity by restructuring its content and incorporating lexical gaps and Arabic-specific concepts. (iv) validation of the crowdsourcing methodology using LingoGap through eight large-scale experiments, including: (a) two studies on food-related terminology across English-Arabic and Indonesian-Banjarese language pairs. 
(b) six studies on kinship-related terms involving the following language pairs: English-Persian, Arabic-Persian, English-Turkish, as well as comparisons between language-independent kinship concepts and Turkish, Persian, and Maba. (v) publication of the resulting data, which consists of 7,273 lexical gaps, 8,497 equivalent words, and 9,576 high-quality Arabic synsets—sets of synonymous words—as open, computer-processable datasets, along with their integration into the Universal Knowledge Core multilingual database. (vi) development of high-quality, diversity-aware lexicons, including: Arabic WordNet (V3), the first diversity-aware lexicon for Standard Arabic; ArabicUKC, a multilingual lexicon for Arabic dialects; and IndonesianUKC, a multilingual lexicon for Indonesian languages. (vii) evaluation of the proposed methodology in comparison with large language models (LLMs), focusing on data annotation via crowdsourcing versus LLM agents in two case studies. Results demonstrate that crowdsourcing significantly outperforms LLMs in identifying language- and culture-specific concepts. (viii) introduction of a machine translation evaluation methodology using the produced diversity-aware datasets.
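The notion of a cross-lingual lexical gap described in the abstract can be made concrete in a machine-processable form. The sketch below is purely illustrative and not the thesis's actual dataset schema (the `Entry` structure and its field names are hypothetical): for a language-independent concept and a target language, it records either an equivalent word or an explicit gap marker, in the spirit of the kinship case studies.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical record for one concept/language pair: either an
# equivalent word exists, or the concept is an explicit lexical gap.
@dataclass
class Entry:
    concept: str          # language-independent concept label
    language: str         # ISO 639 code of the target language
    word: Optional[str]   # equivalent word, or None if untranslatable

    @property
    def is_gap(self) -> bool:
        return self.word is None

entries = [
    Entry("paternal-uncle", "ar", "عم"),  # Arabic lexicalizes this concept
    Entry("paternal-uncle", "en", None),  # English lacks a single-word equivalent
]

gaps = [e for e in entries if e.is_gap]
print(len(gaps))  # → 1
```

Marking the gap explicitly, rather than omitting the row, is what distinguishes a diversity-aware resource from one that silently drops untranslatable concepts.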
Developing Language Resources: A Lexical-Diversity-Centric Approach / Khalilia, Hadi Mahmoud Yousef. - (2025 Apr 28), pp. 1-207.
File: Thesis_final version.pdf (Adobe PDF, 17.78 MB)
Access: open access
Type: Doctoral Thesis (Tesi di dottorato)
License: Creative Commons
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.