Basic Article Information

Title: A Web Page Segmentation Method based on Page Layouts and Title Blocks
Authors: Hiroyuki Sano; Shun Shiramatsu; Tadachika Ozono; et al.
Journal: International Journal of Computer Science and Network Security
Print ISSN: 1738-7906
Publication year: 2011
Volume: 11
Issue: 10
Pages: 84-90
Publisher: International Journal of Computer Science and Network Security
Abstract: In this work, we describe a new Web page segmentation method to extract the semantic structure from a Web page. A typical Web page consists of multiple elements with different functionalities, such as main content, navigation panels, copyright and privacy notices, and advertisements, and Web page segmentation is the division of the page into visually and semantically cohesive pieces. The proposed method consists of three steps. First, it determines the layout template of a Web page by template matching. Second, it divides the page into minimum blocks. Third, it assembles groups of these blocks into Web content blocks. While the minimum blocks can play many roles, in this study we have focused on those that serve as the titles of various pieces of Web content. We used decision tree learning with nine parameters for each minimum block to extract the title blocks from Web pages. Experimental results showed that the decision tree generated by the J48 algorithm is the most suitable for this type of extraction.
Keywords: Web page segmentation; Page layout; Title block; Machine learning
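
Illustrative note: the third step in the abstract (assembling minimum blocks into Web content blocks around title blocks) might be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation. The abstract does not list the nine parameters or the grouping rule, so the `MinimumBlock` fields, the thresholds in `looks_like_title`, and the "start a new group at each title" rule are all hypothetical. The hand-written rule only stands in for the learned decision tree.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MinimumBlock:
    """A smallest page unit; the fields are hypothetical features."""
    text: str
    font_size: float
    is_bold: bool


def looks_like_title(block: MinimumBlock) -> bool:
    # Stand-in for the learned decision tree. The thresholds below are
    # invented for illustration and do not come from the paper.
    return block.is_bold and block.font_size >= 16 and len(block.text) <= 60


def assemble_content_blocks(blocks: List[MinimumBlock]) -> List[List[MinimumBlock]]:
    """Group blocks in document order, opening a new group at each title.

    Assumed rule: a title block owns every following non-title block
    until the next title block appears.
    """
    groups: List[List[MinimumBlock]] = []
    for block in blocks:
        if looks_like_title(block) or not groups:
            groups.append([block])
        else:
            groups[-1].append(block)
    return groups


page = [
    MinimumBlock("Latest News", 18, True),
    MinimumBlock("First story summary text.", 12, False),
    MinimumBlock("Second story summary text.", 12, False),
    MinimumBlock("Related Links", 18, True),
    MinimumBlock("Link A", 12, False),
]
content_blocks = assemble_content_blocks(page)
```

In this sketch, `content_blocks` holds one list per title block, each beginning with the title and followed by the blocks assumed to belong to it.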