Article Information

Title: An Extraction Method of an Informative DOM Node from a Web Page by Using Layout Information
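Authors: Masanobu TSURUTA; Shigeru MASUYAMA
Journal: 人工知能学会論文誌 (Transactions of the Japanese Society for Artificial Intelligence)
Print ISSN: 1346-0714
Electronic ISSN: 1346-8030
Publication year: 2010
Volume: 25
Issue: 6
Pages: 742-756
DOI: 10.1527/tjsai.25.742
Publisher: The Japanese Society for Artificial Intelligence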
Abstract: We propose a method for extracting the informative DOM node from a Web page as a preprocessing step for Web content mining. The proposed method, LM, uses layout data of DOM nodes generated by a generic Web browser; its learning set consists of hundreds of Web pages and annotations of the informative DOM nodes of those pages. The method does not require large-scale crawling of the whole Web site to which the target Web page belongs. We design LM so that it uses the information in the learning set more efficiently than the existing method that uses the same learning set. In experiments, we evaluate combinations of a method for extracting the informative DOM node (either the proposed method or an existing method) with existing noise elimination methods: Heur, which removes advertisements and link lists using heuristics, and CE, which removes DOM nodes that occur in the Web pages of the same Web site as the target Web page. Experimental results show that 1) LM outperforms the other methods for extracting the informative DOM node, and 2) the combination (LM, {CE(10), Heur}) based on LM (precision: 0.755, recall: 0.826, F-measure: 0.746) outperforms the other combinations.
Keywords: web content mining; content extraction; noise elimination
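
Note on the reported scores: the abstract gives precision, recall, and F-measure but does not say how they are computed or averaged. The reported F-measure (0.746) is lower than both precision (0.755) and recall (0.826), which cannot happen for a single harmonic mean, so the figures are presumably averages over pages. The following is a minimal sketch under that assumption; the function names `prf` and `macro_average`, the representation of page content as sets of items, and the toy data are all hypothetical and not taken from the paper.

```python
def prf(extracted, gold):
    """Precision, recall, and F-measure for one page (sets of items)."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def macro_average(pages):
    """Average per-page (precision, recall, F) over a list of (extracted, gold) pairs."""
    scores = [prf(extracted, gold) for extracted, gold in pages]
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))


# Hypothetical toy data: two pages with different error profiles.
pages = [
    ({"a", "b"}, {"a", "b", "c"}),      # precise, but misses one item
    ({"x", "y", "z", "w"}, {"x"}),      # complete, but noisy
]
mean_p, mean_r, mean_f = macro_average(pages)
harmonic_of_means = 2 * mean_p * mean_r / (mean_p + mean_r)
```

With this toy data the averaged F-measure is smaller than the harmonic mean of the averaged precision and recall, and also smaller than both averages, which is the same pattern as in the reported figures.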