Journal: The International Arab Journal of Information Technology
Print ISSN: 1683-3198
Year: 2013
Volume: 10
Issue: 2
Publisher: Zarqa Private University
Abstract: Developments in Arabic information retrieval have not kept pace with the increasing use of the Arabic Web over the last decade. Semantic indexing in a highly inflected language such as Arabic is not a trivial task and requires text analysis in the original language. Apart from cross-language retrieval methods and a few limited studies, the main efforts to develop semantic analysis methods and topic models have not included Arabic text. This paper describes our approach to analyzing semantics in Arabic texts. A new lemma-based stemmer is developed and compared with a root-based one for characterizing Arabic text. The Latent Dirichlet Allocation (LDA) model is adapted to extract Arabic latent topics from various real-world corpora. In addition to the interesting subjects discovered in press articles from the 2007-2009 period, experiments show that classification performance in the topic space is better with lemma-based stemming than with root-based stemming.
Keywords: Arabic stemming; topic model; semantic analysis; classification; test collection