Article Information

  • Title: Automatic Template Extraction from Heterogeneous Web Pages
  • Authors: Mr. Vinod Kumar Raavi; Satya P Kumar Somayajula
  • Journal: International Journal of Advanced Research In Computer Science and Software Engineering
  • Print ISSN: 2277-6451
  • Electronic ISSN: 2277-128X
  • Year of publication: 2012
  • Volume: 2
  • Issue: 8
  • Publisher: S.S. Mishra
  • Abstract: Automatically extracting structured information from unstructured or semi-structured machine-readable documents plays a major role nowadays, and the WWW is the major resource for such extraction. Most websites use common templates populated with content to achieve good publishing productivity. Template detection techniques have recently received a lot of attention as a way to improve the performance of search engines and the clustering and classification of web documents, because irrelevant template terms degrade the performance and accuracy of web applications for machines. In this paper, we present novel algorithms for extracting templates from a large number of web documents that are generated from heterogeneous templates. We cluster the web documents by the similarity of their underlying template structures, so that the template for each cluster is extracted simultaneously. Applied to real-life data sets, the efficiency of our algorithms can be considered the best among template detection algorithms.
  • Keywords: Template; Extraction; Information; DOM; Cluster.