Journal: American Journal of Computing Research Repository
Print ISSN: 2377-4606
Electronic ISSN: 2377-4266
Publication year: 2013
Volume: 1
Issue: 1
Pages: 1-5
DOI: 10.12691/ajcrr-1-1-1
Language: English
Publisher: Science and Education Publishing
Abstract: This paper presents a systematic approach to implementing text format categorization. It also emphasizes defined corpus linguistics and demonstrates how text files in Html, Pdf, Doc and Txt formats can be analyzed. The work compares Arabic and English text across these formats by calculating a distributing factor for the distribution of keywords in the Arabic and English documents. All of the selected text comes from the Computer Technology domain. The text categorization process is applied to a text collection consisting of two main corpora, Arabic and English. The results show that Arabic text is well distributed in Doc files, whereas English text is well distributed in Xml files. These results should contribute to building an effective electronic learning system for Arabic and English texts. The results and conclusions are presented with graphical outputs for better understanding.
Keywords: information retrieval; text categorization; distributing factor; natural language processing; future trends