Article Information

Title: Towards Improving Rule-Based Arabic Root Extraction Algorithm for Non-Vocalized Text
Authors: Nisrean Thalji; Zyad Thalji; Sohair Al-Hakeem et al.
Journal: International Journal of Computer and Information Technology
Print ISSN: 2279-0764
Year: 2018
Volume: 7
Issue: 6
Pages: 235-242
Publisher: International Journal of Computer and Information Technology
Abstract: Rooting algorithms remove affixes from words and extract the root from which the input word is derived. The rooting process helps standardize terms that refer to the same concept. These algorithms are widely used in Arabic language applications such as information retrieval systems, indexing, text mining, text classification, data compression, spelling checkers, text summarization, question answering systems, machine translation, part-of-speech tagging, stemmers, and morphological analyzers. Khoja’s algorithm is a standard Arabic root extraction algorithm, but it has a number of flaws. The proposed algorithm extends Khoja’s algorithm and resolves most of these flaws. Testing was conducted on Thalji’s corpus, which was built mainly to test and compare Arabic root extraction algorithms and contains 720,000 word-root pairs derived from 12,000 roots. On this corpus, Khoja’s algorithm achieved 63% root extraction accuracy, while the proposed algorithm achieved 92%.
Keywords: Root Extraction; stem; rules; pattern; prefix; suffix; infix
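
Illustrative sketch: the abstract describes stripping affixes to reach a root and scoring extractors by accuracy over word-root pairs. The Python fragment below is a minimal sketch of that idea only. It is not Khoja’s algorithm and not the authors’ proposed algorithm; the affix lists, the minimum length of three letters, the function names, and the four sample pairs are all hypothetical choices made for illustration.

```python
# Toy affix lists (assumption): real rule-based rooters use much larger,
# carefully ordered lists plus pattern matching for infixes.
PREFIXES = ("وال", "ال", "و")
SUFFIXES = ("ها", "ون", "ات", "ين", "ة")
MIN_ROOT_LEN = 3  # assumption: never strip below three letters


def strip_affixes(word):
    """Remove at most one prefix and one suffix, longest match first."""
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_ROOT_LEN:
            word = word[len(prefix):]
            break
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_ROOT_LEN:
            word = word[:-len(suffix)]
            break
    return word


def accuracy(extract, pairs):
    """Fraction of (word, gold_root) pairs for which extract(word) == gold_root."""
    if not pairs:
        return 0.0
    correct = sum(1 for word, gold_root in pairs if extract(word) == gold_root)
    return correct / len(pairs)


# Hypothetical word-root pairs, not taken from Thalji's corpus.
PAIRS = [
    ("الكتب", "كتب"),   # prefix only
    ("والدرس", "درس"),  # compound prefix
    ("درسها", "درس"),   # suffix only
    ("كاتب", "كتب"),    # infix: plain affix stripping cannot recover the root
]

if __name__ == "__main__":
    for word, gold_root in PAIRS:
        print(word, strip_affixes(word), gold_root)
    print("accuracy:", accuracy(strip_affixes, PAIRS))
```

The last sample pair contains an infix, which this toy stripper cannot handle. Cases of that kind are where pattern-based rules, such as those the abstract attributes to Khoja-style algorithms, become necessary.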