
Article Information

  • Title: A Methodology for Template Extraction from Heterogeneous Web Pages
  • Authors: Vidya Kadam; Prakash R. Devale
  • Journal: Indian Journal of Computer Science and Engineering
  • Print ISSN: 2231-3850
  • Electronic ISSN: 0976-5166
  • Publication year: 2012
  • Volume: 3
  • Issue: 3
  • Pages: 449-452
  • Publisher: Engg Journals Publications
  • Abstract: The World Wide Web is a vast and highly useful collection of information. To achieve high productivity in publishing, web pages are automatically populated with content using common templates. Templates are considered harmful because they compromise the relevance judgements of many web information retrieval and web mining methods, such as clustering and classification, and degrade the performance and resource usage of tools that process web pages. Template detection techniques have therefore received a lot of attention as a way to improve the performance of search engines and of the clustering and classification of web documents. In this paper, we present an approach that detects and extracts templates from heterogeneous web documents and clusters the documents into groups such that the pages in each group share the same structure. This saves the time needed to find the best templates among a large number of web documents, and also saves the memory required to find the best template structure.
  • Keywords: MinHash; Minimum Description Length (MDL); parsing