Journal: International Journal of Reviews in Computing
Print ISSN: 2076-3328
Electronic ISSN: 2076-3336
Year of publication: 2011
Volume: 6
Publisher: Little Lion Scientific Research and Developement
Abstract: In this paper, we propose a set of new features to enhance the classification of Arabic Web pages into spam and non-spam under different classification algorithms, namely Decision Tree, Naïve Bayes, and LogitBoost. We compare our features, which we call Arabic Content Analysis (ACA) features, to state-of-the-art Content Analysis (CA) features for spam detection in the English Web. We show that augmenting the CA features with our ACA features achieves an increase in detection accuracy of Arabic spam pages compared to CA features alone. When combined, ACA and CA features correctly identified 5,536 pages of the 5,645 Arabic spam pages that we used for testing with a false positive rate of 1.9% using the Decision Tree classifier. We also identified the top-ranked features using the Gain Ratio method.
Keywords: Web Spam; Web Pages; Arabic Web Spam; Detecting Arabic Spam; Arabic Corpus; Arabic Keywords; Spamdexing
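
Note: the abstract reports raw counts (5,536 of 5,645 spam pages identified) rather than a detection rate. A minimal sketch of that arithmetic follows; the function name `detection_rate` is a hypothetical helper introduced here for illustration and does not come from the paper. It treats the reported figures as true positives over all spam pages in the test set.

```python
def detection_rate(detected: int, total: int) -> float:
    """Fraction of spam pages correctly flagged (true positive rate)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return detected / total


# Counts taken from the abstract (Decision Tree, ACA + CA features).
spam_total = 5645
spam_detected = 5536

rate = detection_rate(spam_detected, spam_total)
missed = spam_total - spam_detected

print(f"detection rate: {rate:.2%}")  # about 98.07%
print(f"missed spam pages: {missed}")  # 109
```

The 1.9% false positive rate quoted in the abstract cannot be recomputed the same way, because the record does not state how many non-spam pages were in the test set.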