Lemmatization - computing the canonical forms of words in running text - is an important component in any NLP system and a key preprocessing step for most applications that rely on natural language understanding. In the case of Arabic, lemmatization is a complex task because of the rich morphology, agglutinative aspects, and lexical ambiguity due to the absence of short vowels in writing. In this paper, we introduce a new lemmatizer tool that combines a machine-learning-based approach with a lemmatization dictionary, the latter providing increased accuracy, robustness, and flexibility to the former. Our evaluations yield a performance of over 98% for the entire lemmatization pipeline. The lemmatizer tools are freely downloadable for private and research purposes.

Towards an Optimal Solution to Lemmatization in Arabic / Freihat, Abed Alhakim; Abbas, Mourad; Bella, Gábor; Giunchiglia, Fausto. - In: PROCEDIA COMPUTER SCIENCE. - ISSN 1877-0509. - 142:(2018), pp. 132-140. [10.1016/j.procs.2018.10.468]

Towards an Optimal Solution to Lemmatization in Arabic

Freihat, Abed Alhakim; Abbas, Mourad; Bella, Gábor; Giunchiglia, Fausto
2018

Files in this item:
File: 1-s2.0-S1877050918321707-main.pdf
Access: open access
Type: Published version (publisher's layout)
License: Creative Commons
Size: 717.57 kB
Format: Adobe PDF


Use this identifier to cite or link to this document: https://hdl.handle.net/11572/313156