Towards an Optimal Solution to Lemmatization in Arabic / Freihat, Abed Alhakim; Abbas, Mourad; Bella, Gábor; Giunchiglia, Fausto. - In: PROCEDIA COMPUTER SCIENCE. - ISSN 1877-0509. - 142:(2018), pp. 132-140. [10.1016/j.procs.2018.10.468]
Towards an Optimal Solution to Lemmatization in Arabic
Freihat, Abed Alhakim; Abbas, Mourad; Bella, Gábor; Giunchiglia, Fausto
2018-01-01
Abstract
Lemmatization, the computation of the canonical forms of words in running text, is an important component of any NLP system and a key preprocessing step for most applications that rely on natural language understanding. In Arabic, lemmatization is a complex task because of the language's rich morphology, its agglutinative aspects, and the lexical ambiguity caused by the absence of short vowels in writing. In this paper, we introduce a new lemmatizer tool that combines a machine-learning-based approach with a lemmatization dictionary, the latter providing increased accuracy, robustness, and flexibility to the former. Our evaluations yield a performance of over 98% for the entire lemmatization pipeline. The lemmatizer tools are freely downloadable for private and research purposes.

File | Access | Type | License | Size | Format |
---|---|---|---|---|---|
1-s2.0-S1877050918321707-main.pdf | Open access | Publisher's version (publisher's layout) | Creative Commons | 717.57 kB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.