Article Information

  • Title: A Web Page Segmentation Method by using Headlines to Web Contents as Separators and its Evaluations
  • Authors: Hiroyuki Sano; Robin M. E. Swezey; Shun Shiramatsu
  • Journal: International Journal of Computer Science and Network Security
  • Print ISSN: 1738-7906
  • Year: 2013
  • Volume: 13
  • Issue: 1
  • Pages: 1-6
  • Publisher: International Journal of Computer Science and Network Security
  • Abstract: In this paper, we describe a Web page segmentation method based on title blocks and present its evaluation. Title blocks are minimum blocks that function as headlines for specific Web content. A typical Web page consists of multiple elements with different types of features, such as main content, navigation panels, copyright and privacy notices, and advertisements. Web page segmentation is the division of a page into visually and semantically cohesive pieces. Our segmentation method comprises three steps. First, it divides the page into minimum blocks. Second, it classifies the blocks into two classes: title blocks and non-title blocks. Third, it assembles groups of these blocks into Web content blocks. While minimum blocks can play many roles, this study focuses on blocks that serve as the titles of individual pieces of Web content. Decision tree learning with nine features per minimum block is used to extract title blocks from Web pages. Experimental results show that our segmentation method divides Web pages collected from a news site with 96.1 percent accuracy, independently of the amount of content. The results also show that the method segments every Web page used in the experiment in less than 1000 milliseconds.
  • Keywords: Web Page Segmentation; Semi-structured Data; Web Intelligence; Decision Tree Learning.
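
The three steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `MinimumBlock` and `ContentBlock` types, the `segment` function, and the `looks_like_title` rule are all hypothetical names introduced here. In particular, the abstract does not list the nine features used by the decision tree, so a hand-written rule stands in for the trained classifier, and step 1 (splitting a page into minimum blocks) is assumed to have already produced the input list.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class MinimumBlock:
    """Smallest unit of a page (output of step 1); fields are illustrative."""
    tag: str
    text: str


@dataclass
class ContentBlock:
    """A title block plus the non-title blocks that follow it (step 3)."""
    title: Optional[MinimumBlock]
    body: List[MinimumBlock] = field(default_factory=list)


def looks_like_title(block: MinimumBlock) -> bool:
    """Step 2 stand-in: a hand-written rule, NOT the paper's decision tree.

    Assumption: headings, or short bold text, act as headlines.
    """
    if block.tag in {"h1", "h2", "h3", "h4", "h5", "h6"}:
        return True
    return block.tag in {"b", "strong"} and len(block.text) <= 60


def segment(
    blocks: List[MinimumBlock],
    is_title: Callable[[MinimumBlock], bool] = looks_like_title,
) -> List[ContentBlock]:
    """Group minimum blocks into content blocks, using titles as separators."""
    segments: List[ContentBlock] = []
    current: Optional[ContentBlock] = None
    for block in blocks:
        if is_title(block):
            # Every title block opens a new content block.
            current = ContentBlock(title=block)
            segments.append(current)
        else:
            if current is None:
                # Blocks that appear before the first title form an untitled group.
                current = ContentBlock(title=None)
                segments.append(current)
            current.body.append(block)
    return segments
```

Passing a different `is_title` callable, for example the `predict` step of a trained decision tree wrapped in a function, would replace the stand-in rule without changing the grouping logic.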