
Article information

  • Title: An Extraction Method of an Informative DOM Node from a Web Page by Using Layout Information
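  • Authors: Masanobu TSURUTA ; Shigeru MASUYAMA
  • Journal: 人工知能学会論文誌 (Transactions of the Japanese Society for Artificial Intelligence)
  • Print ISSN: 1346-0714
  • Online ISSN: 1346-8030
  • Publication year: 2010
  • Volume: 25
  • Issue: 6
  • Pages: 742-756
  • DOI: 10.1527/tjsai.25.742
  • Publisher: The Japanese Society for Artificial Intelligence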
  • Abstract: We propose a method for extracting the informative DOM node from a Web page as a preprocessing step for Web content mining. The proposed method, LM, uses the layout data of DOM nodes generated by a generic Web browser, together with a learning set consisting of hundreds of Web pages and the annotations of their informative DOM nodes. The method does not require large-scale crawling of the whole Web site to which the target Web page belongs. LM is designed to use the information in the learning set more efficiently than the existing method that uses the same learning set. In experiments, we evaluate combinations of a method for extracting the informative DOM node (either the proposed method or an existing method) with existing noise elimination methods: Heur, which removes advertisements and link lists by heuristics, and CE, which removes DOM nodes that also appear in Web pages of the same Web site as the target Web page. The experimental results show that 1) LM outperforms the other methods for extracting the informative DOM node, and 2) the combination method (LM, {CE(10), Heur}) based on LM (precision: 0.755, recall: 0.826, F-measure: 0.746) outperforms the other combination methods.
  • Keywords: web content mining ; content extraction ; noise elimination