
Article Information

  • Title: A Web Page Segmentation Method based on Page Layouts and Title Blocks
  • Authors: Hiroyuki Sano; Shun Shiramatsu; Tadachika Ozono
  • Journal: International Journal of Computer Science and Network Security
  • Print ISSN: 1738-7906
  • Year of publication: 2011
  • Volume: 11
  • Issue: 10
  • Pages: 84-90
  • Publisher: International Journal of Computer Science and Network Security
  • Abstract: In this work, we describe a new Web page segmentation method that extracts the semantic structure of a Web page. A typical Web page consists of multiple elements with different functions, such as main content, navigation panels, copyright and privacy notices, and advertisements. Web page segmentation is the division of the page into visually and semantically cohesive pieces. The proposed method comprises three steps. First, it determines the layout template of a Web page by template matching. Second, it divides the page into minimum blocks. Third, it assembles groups of these blocks into Web content blocks. While minimum blocks can play many roles, in this study we focus on those that serve as titles of pieces of Web content. We used decision tree learning with nine parameters for each minimum block to extract the title blocks from Web pages. Experimental results showed that the decision tree generated by the J48 algorithm is the most suitable for this type of extraction.
  • Keywords: Web page segmentation; Page layout; Title block; Machine learning
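
The third step in the abstract, assembling minimum blocks into Web content blocks, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the names `MinimumBlock`, `ContentBlock`, and `assemble`, and the grouping rule itself (each title block opens a new content block that collects the blocks following it). The `is_title` flag is supplied by hand; in the paper that decision comes from a trained decision tree whose nine parameters the abstract does not list.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MinimumBlock:
    """Hypothetical smallest unit produced by dividing a page."""
    text: str
    is_title: bool  # assumed to be given; stands in for a classifier's output


@dataclass
class ContentBlock:
    """Hypothetical group of minimum blocks under one (optional) title."""
    title: Optional[str]
    body: List[str] = field(default_factory=list)


def assemble(blocks: List[MinimumBlock]) -> List[ContentBlock]:
    """Group blocks in document order: each title starts a new content block.

    This grouping rule is an illustrative assumption, not the paper's rule.
    Blocks that appear before any title go into an untitled content block.
    """
    groups: List[ContentBlock] = []
    current: Optional[ContentBlock] = None
    for block in blocks:
        if block.is_title:
            current = ContentBlock(title=block.text)
            groups.append(current)
        else:
            if current is None:
                current = ContentBlock(title=None)
                groups.append(current)
            current.body.append(block.text)
    return groups


page = [
    MinimumBlock("Home | About", False),
    MinimumBlock("Latest News", True),
    MinimumBlock("Story A", False),
    MinimumBlock("Story B", False),
    MinimumBlock("Contact", True),
    MinimumBlock("mail@example.com", False),
]
content_blocks = assemble(page)
```

With the sample `page` above, the sketch yields one untitled block for the navigation text and one content block per title. A real system would also need the layout template from the first step to keep blocks in different page regions apart, which this sketch ignores.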