Article information

Title: Web Spam Detection Using C5.0 Classification Algorithm
Authors: Prof Manasi Kulkarni; Ms Rashmi R. Tundalwar; Ms Chaitali R. Tundalwar et al.
Journal: International Journal of Advanced Research In Computer Science and Software Engineering
Print ISSN: 2277-6451
Electronic ISSN: 2277-128X
Year of publication: 2013
Volume: 3
Issue: 2
Publisher: S.S. Mishra
Abstract: Recently, there has been a dramatic increase in the amount of web spam, leading to a degradation of search results. Countering it requires proper classification methods and algorithms. Classification is the most common method used for mining rules from a large database. Decision tree induction is the most commonly used classification method because its simple hierarchical structure is easy for users to understand and to use in decision making. Various algorithms are available, but the decision tree is a simple one. We use C5.0, a modified version of the C4.5 decision tree algorithm. Applying a boosted decision tree algorithm such as C5.0 to the datasets derives a set of rules and builds the decision tree, which helps improve accuracy, whereas C4.5's ruleset methods are slow and memory-hungry. The result is a system that significantly improves the accuracy of Web spam detection using the C5.0 algorithm on public datasets such as WEBSPAM-UK2007. In this paper, we present an efficient spam detection system based on a classifier that combines new link-based features with language-model (LM)-based ones. These features relate not only to quantitative data extracted from the Web pages, but also to qualitative properties, mainly of the page links. We also check the coherence between a page and any other page that one of its links points to: two pages connected by a hyperlink should be semantically related, by at least a weak contextual relation. Specifically, we apply the Kullback–Leibler divergence to different combinations of these sources of information in order to characterize the relationship between two linked pages. We compare the results of various classification algorithms to determine which detects Web spam most accurately.
Keywords: Data mining; Content analysis; Classification algorithm; Decision tree; Web spam detection.
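
The abstract describes measuring coherence between two linked pages with the Kullback–Leibler divergence over language models. The record gives no implementation details, so the following is only a minimal Python sketch under stated assumptions: unigram language models, whitespace tokenization, and add-alpha smoothing. The function names (`unigram_lm`, `kl_divergence`, `link_divergence`) and the `alpha` parameter are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def unigram_lm(text, vocab, alpha=1.0):
    """Build a smoothed unigram language model over a fixed vocabulary.

    Assumption: whitespace tokenization and add-alpha smoothing; the
    paper's actual language models may differ.
    """
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same keys."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def link_divergence(source_text, target_text, alpha=1.0):
    """Divergence between text on the linking page and the linked page.

    A low value suggests the two pages are topically related; a high
    value may hint at an incoherent (possibly spam) link.
    """
    vocab = set(source_text.lower().split()) | set(target_text.lower().split())
    p = unigram_lm(source_text, vocab, alpha)
    q = unigram_lm(target_text, vocab, alpha)
    return kl_divergence(p, q)


if __name__ == "__main__":
    anchor = "cheap flights to paris"
    related = "book cheap flights to paris today"
    unrelated = "casino poker bonus win money"
    print(link_divergence(anchor, related))
    print(link_divergence(anchor, unrelated))
```

In a setup like the one the abstract outlines, a divergence value of this kind would be one feature among several (alongside link-based features) passed to a decision tree classifier; how the paper combines its sources of information is not specified in this record.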