Normalized Arabic Fragments for Inestimable Stemming (NAFIS)

Submission date: Dec. 5, 2016, 10:07 a.m.

Normalized Arabic Fragments for Inestimable Stemming (NAFIS) is a gold standard corpus for Arabic stemming. It is composed of a collection of sentences, selected to be representative of Arabic stemming tasks and manually annotated. NAFIS is:

Comprehensive: The content of NAFIS can be generalized to the Arabic language as a whole. For stemming, a comprehensive corpus must contain all possible affix combinations. To this end, linguists made an inventory of all Arabic affix combinations. An affix is a prefix-suffix couple that can be agglutinated to a specific word type (noun, verb or particle). Arabic affixes are built from 12 atomic prefixes and 11 atomic suffixes, which combine into about 94 prefixes and 73 suffixes. (We use the terms affix, prefix and suffix instead of clitic, proclitic and enclitic because they are widely used in the literature.) For example, the prefix “وَال” (and the) is composed of two atomic prefixes: “وَ” (the conjunction “and”) and “ال” (the definite article “the”).

Compiled: Linguists gathered a set of sentences containing all of the affixes listed above to satisfy the comprehensiveness criterion. The compiled sentences come from various sources (poems, the Holy Quran, books, and periodicals) and are of diverse kinds (proverb and dictum, article commentary, religious text, literature, historical fiction).

For instance, the sentence "عليكم بالجد فإنه أساس النجاح" is part of the corpus and contains four affix combinations:
1. [-كم]: the empty prefix associated with the suffix pronoun ‘you’,
2. [بال-]: composed of two atomic prefixes ("ب" the preposition 'with' and “ال” the definite article 'the') and the empty suffix,
3. [ف-ه]: composed of the prefix “ف” (the conjunction “then”) and the suffix “ه” (the pronoun “his”),
4. [ال-]: composed of “ال” the definite article 'the' and the empty suffix.

As shown in the extract below, NAFIS is represented according to the TEI standard.
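To make this layout concrete, here is a minimal Python sketch that reads a TEI-like fragment and returns, for each word, the prefix-suffix couple of its first stemming solution. The fragment is hypothetical and is not taken from NAFIS: the element names phr, w, choice and form follow the corpus description, while the child element names prefix, base and suffix, the second alternative given for the first word, and the function names gold_affixes and count_solutions are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment, NOT an extract of NAFIS.
# phr / w / choice / form come from the corpus description;
# prefix / base / suffix child elements are assumed names.
SAMPLE = """
<phr>
  <w>
    <choice>
      <form><prefix>بال</prefix><base>جد</base><suffix></suffix></form>
      <!-- invented alternative, only here to show several solutions -->
      <form><prefix>ب</prefix><base>الجد</base><suffix></suffix></form>
    </choice>
  </w>
  <w>
    <choice>
      <form><prefix>ف</prefix><base>إن</base><suffix>ه</suffix></form>
    </choice>
  </w>
</phr>
"""


def gold_affixes(phr_xml):
    """Return the (prefix, suffix) couple of the first <form> of each <w>.

    The first alternative is taken as the contextual solution, as the
    corpus description states; an empty string stands for an empty affix.
    """
    root = ET.fromstring(phr_xml)
    couples = []
    for word in root.iter("w"):
        first = word.find("./choice/form")  # first alternative only
        if first is None:
            continue  # word without any stemming solution
        prefix = first.findtext("prefix") or ""
        suffix = first.findtext("suffix") or ""
        couples.append((prefix, suffix))
    return couples


def count_solutions(phr_xml):
    """Return the number of stemming alternatives listed for each word."""
    root = ET.fromstring(phr_xml)
    return [len(word.findall("./choice/form")) for word in root.iter("w")]


if __name__ == "__main__":
    print(gold_affixes(SAMPLE))
    print(count_solutions(SAMPLE))
```

On this hypothetical fragment, gold_affixes recovers the couples listed above for "بالجد" and "فإنه". The real corpus may name or nest these elements differently, so the paths would need adjusting against the actual files.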
Sentences are enclosed within the <phr> tag. A sentence is a sequence of segments representing words (<w>). Since a word can have several stemming solutions (<choice>), each alternative is included within a <form> tag, which contains the prefix, base (root and stem) and suffix morphemes. The first alternative is the solution that fits the sentence context; all the others are ordered randomly.

The corpus has the following characteristics:
• 37 sentences
• The average sentence length is 5.05 words, with the longest being 10 words
• Declarative, interrogative, imperative and exclamatory sentences account for 37.84%, 32.43%, 16.22% and 13.51% respectively
• 154 tokens, with an average of 5.95 stemming solutions per token

Citation: Driss Namly, Rachida Tajmout, Karim Bouzoubaa, Lahsen Abouenour. "NAFIS: A Gold Standard Corpus for Arabic Stemmers Evaluation". International Business Information Management Association (IBIMA), November 2016, Seville, Spain.

