Journal: Journal of Theoretical and Applied Information Technology
Print ISSN: 1992-8645
Electronic ISSN: 1817-3195
Publication year: 2013
Volume: 55
Issue: 1
Publisher: Journal of Theoretical and Applied
Abstract: Automatic web page classification plays an essential role in information retrieval, web mining and semantic web applications. Web pages have special characteristics (such as HTML tags and hyperlinks) that make their classification different from standard text categorization. As a result, traditional text classifiers usually do not produce promising results when applied to web data. In this paper, we propose an approach that categorizes web pages by exploiting both plain text and the text contained in HTML tags. Our method operates in two steps. In Step 1, we use a Support Vector Machine (SVM) classifier to generate, for each target web page (the page to classify), a reduced vector representation based on plain text and text from HTML tags. In Step 2, we submit this vector representation to the Naive Bayes (NB) algorithm to determine the final class of the target page. We conducted our experiments on two large datasets of pages from ODP (Open Directory Project) and WebKB (Web Knowledge Base), both of which we found to be missing many HTML tags. The results show that the NB classifier, supported by our model and using HTML tag content combined with plain text, (1) performs significantly better than the NB classifier using text alone in terms of both Micro-F1 and Macro-F1, even when HTML tags are missing, (2) performs consistently with respect to category distribution, and (3) outperforms the NB classifier using text alone even when only very basic techniques are used to handle missing HTML tags.
Keywords: HTML Tags; Naive Bayes; Semantic Web; SVM; Web Mining; Web Page Classification
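
The abstract reports results in terms of Micro-F1 and Macro-F1. As a reminder of how the two measures differ, here is a minimal sketch using their standard definitions for single-label, multi-class classification. It is not taken from the paper: the function names and the toy category labels are hypothetical, chosen only for illustration.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0, 0.0
    # Micro-F1 pools the counts over all classes before computing F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro-F1 computes F1 per class, then takes the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Toy example with hypothetical category labels (assumed, not from the paper).
y_true = ["course", "course", "course", "faculty"]
y_pred = ["course", "course", "faculty", "faculty"]
micro, macro = micro_macro_f1(y_true, y_pred)
```

Micro-F1 pools the counts, so frequent categories dominate it, while Macro-F1 weights every category equally. Reporting both is a common way to show whether a classifier behaves consistently across an uneven category distribution.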