Journal: Journal of Theoretical and Applied Information Technology
Print ISSN: 1992-8645
Electronic ISSN: 1817-3195
Publication year: 2014
Volume: 59
Issue: 1
Publisher: Journal of Theoretical and Applied
Abstract: Web page classification differs from traditional text classification because of the additional information provided by Hyper Text Markup Language (HTML) structure and the presence of hyperlinks. While efforts have been made to exploit hyperlinks for classification, the structured nature of web pages is rarely considered. A notable feature of HTML documents is their tags and the respective attributes, which ensure that the documents are rendered in browsers and other user agents. This paper proposes a semantic-based feature selection method to improve web page search and retrieval over large document repositories. Web page classification using HTML tags is evaluated on the 4 Universities Dataset. The features are classified using the proposed neural network. The experimental results show improved precision and recall with the presented method.
Keywords: Hyper Text Markup Language (HTML); Web page classification; HTML tags; Neural Network
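
The abstract does not say how the tag-based features are built, so the following is only a minimal sketch of one plausible reading: terms are counted and weighted by the HTML tag that encloses them. The tag weights, the class name `TagWeightedTerms`, and the function `extract_features` are all hypothetical and are not taken from the paper; the resulting feature dictionary could then be passed to any classifier, such as a neural network.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical tag weights, chosen for illustration only; not from the paper.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "a": 1.5}
DEFAULT_WEIGHT = 1.0
# Void elements never receive an end tag, so they are not tracked on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TagWeightedTerms(HTMLParser):
    """Accumulate term counts, weighting each term by its innermost open tag."""

    def __init__(self):
        super().__init__()
        self.stack = []            # tags that are currently open
        self.features = Counter()  # term -> accumulated weight

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed inner tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        innermost = self.stack[-1] if self.stack else None
        weight = TAG_WEIGHTS.get(innermost, DEFAULT_WEIGHT)
        for term in data.lower().split():
            self.features[term] += weight


def extract_features(html):
    """Return a dict mapping each term to its tag-weighted count."""
    parser = TagWeightedTerms()
    parser.feed(html)
    parser.close()
    return dict(parser.features)
```

In this sketch a term appearing in the title counts three times as much as the same term in body text, which is one simple way to let page structure influence the feature vector. Real weights would need to be tuned or learned on data such as the 4 Universities Dataset.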