Basic article information

Title: Automatic Template Extraction from Heterogeneous Web Pages
Authors: Mr. Vinod Kumar Raavi; Satya P Kumar Somayajula
Journal: International Journal of Advanced Research In Computer Science and Software Engineering
Print ISSN: 2277-6451
Electronic ISSN: 2277-128X
Publication year: 2012
Volume: 2
Issue: 8
Publisher: S.S. Mishra
Abstract: Automatically extracting structured information from unstructured and/or semi-structured machine-readable documents plays a major role nowadays, and the WWW is the major resource for such extraction. Most websites populate common templates with content to achieve good publishing productivity. Template detection techniques have recently received a lot of attention as a way to improve the performance of search engines and the clustering and classification of web documents, because irrelevant template terms degrade the performance and accuracy of web applications that process pages by machine. In this paper, we present novel algorithms for extracting templates from a large number of web documents that are generated from heterogeneous templates. We cluster the web documents by the similarity of their underlying template structures, so that the template for each cluster is extracted simultaneously. Applied to real-life data sets, the efficiency of our algorithms can be considered the best among template detection algorithms.
Keywords: Template; Extraction; Information; DOM; Cluster.
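
Illustrative sketch (not the authors' algorithm): the abstract describes clustering pages by the similarity of their underlying template structures and extracting one template per cluster, but it does not give the method itself. The Python sketch below shows one simple way such an idea could be realized, under assumptions that are not from the paper: each page is summarized as the set of its root-to-node DOM tag paths, similarity is measured with the Jaccard index, clustering is greedy with a hypothetical threshold of 0.6, and a cluster's template is taken to be the paths shared by all of its pages. The names `dom_paths`, `jaccard`, and `cluster_and_extract` are made up for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, enough for a sketch).
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class PathCollector(HTMLParser):
    """Collects the set of root-to-node tag paths seen in an HTML document."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags, outermost first
        self.paths = set()  # e.g. {"html", "html/body", "html/body/div"}

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def dom_paths(html):
    """Return the set of DOM tag paths for one page."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def cluster_and_extract(pages, threshold=0.6):
    """Greedily cluster pages; each cluster's template is the intersection
    of the path sets of its members (an assumption of this sketch)."""
    clusters = []  # each item: {"members": [page indices], "template": set}
    for index, page in enumerate(pages):
        paths = dom_paths(page)
        best = max(clusters,
                   key=lambda c: jaccard(paths, c["template"]),
                   default=None)
        if best is not None and jaccard(paths, best["template"]) >= threshold:
            best["members"].append(index)
            best["template"] &= paths  # keep only the shared structure
        else:
            clusters.append({"members": [index], "template": set(paths)})
    return clusters
```

Because the template is refined as each page joins a cluster, clustering and template extraction happen in the same pass, which loosely mirrors the "simultaneous" extraction the abstract mentions. The result depends on page order and on the chosen threshold, so this is a starting point for experimentation rather than a faithful reproduction of the paper.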