Basic Article Information

  • Title: AN APPROACH TO DESIGN INCREMENTAL PARALLEL WEBCRAWLER
  • Authors: DIVAKAR YADAV; AK SHARMA; SONIA SANCHEZ-CUADRADO
  • Journal: Journal of Theoretical and Applied Information Technology
  • Print ISSN: 1992-8645
  • Electronic ISSN: 1817-3195
  • Year of publication: 2012
  • Volume: 43
  • Issue: 1
  • Pages: 008-029
  • Publisher: Journal of Theoretical and Applied
  • Abstract: The World Wide Web (WWW) is a huge repository of interlinked hypertext documents known as web pages, which users access via the Internet. Since its inception in 1990, the WWW has grown manifold in size; it now contains more than 50 billion publicly accessible web documents distributed over thousands of web servers all over the world, and it is still growing at an exponential rate. Searching such a huge collection is very difficult because web pages are not organized like books on shelves in a library, nor are they completely catalogued at one central location. The search engine is the basic information retrieval tool used to access information on the WWW. In response to a user's search query, a search engine searches its database for relevant documents and produces results ranked by relevance. The search engine builds this database with the help of WebCrawlers. To maximize the download rate and to retrieve the whole, or a significant portion, of the Web, search engines run multiple crawlers in parallel. Overlap among downloaded web documents, quality, network bandwidth, and refreshing of web documents are the major challenges faced by existing parallel WebCrawlers, and they are addressed in this work. A novel Multi-Threaded (MT) server-based architecture for an incremental parallel web crawler has been designed that helps reduce the overlap, quality, and network bandwidth problems. Additionally, web page change detection methods have been developed to refresh web documents by detecting structural, presentation, and content-level changes. These change detection methods determine whether the version of a web page held at the search engine side differs from the one at the web server end; if it has changed, the WebCrawler replaces the version in the search engine's database to keep the repository up to date (a rough sketch of such checksum-based change detection follows this list).
  • Keywords: World Wide Web (WWW); Uniform Resource Locators (URLs); Search engine; WebCrawler; Checksum; Change detection; Ranking algorithms
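The record above gives only the abstract, not the paper's actual algorithms. As a rough illustration of the checksum-based change detection idea mentioned in the abstract, the Python sketch below computes separate checksums for a page's visible text (content-level changes) and its tag sequence (structural changes) and compares them with values stored by the search engine. The function names, the regex-based tag stripping, and the stored-checksum layout are illustrative assumptions, not the authors' method.

```python
import hashlib
import re


def content_checksum(html: str) -> str:
    """Checksum over the visible text, intended to catch content-level changes."""
    text = re.sub(r"<[^>]+>", "", html)       # strip tags (simplified parsing)
    text = re.sub(r"\s+", " ", text).strip()  # normalize whitespace
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def structure_checksum(html: str) -> str:
    """Checksum over the tag sequence only, intended to catch structural changes."""
    tags = re.findall(r"</?([a-zA-Z][a-zA-Z0-9]*)", html)
    return hashlib.md5(" ".join(t.lower() for t in tags).encode("utf-8")).hexdigest()


def page_changed(html: str, stored: dict) -> bool:
    """Compare fresh checksums with those previously stored in the repository."""
    return (content_checksum(html) != stored.get("content") or
            structure_checksum(html) != stored.get("structure"))


# Hypothetical usage: the crawler re-fetches a URL and decides whether to
# replace the copy kept at the search engine side.
stored = {"content": "previous content checksum",
          "structure": "previous structure checksum"}
fresh_html = "<html><body><h1>News</h1><p>Updated story.</p></body></html>"
if page_changed(fresh_html, stored):
    print("Page changed: refresh the stored version in the repository.")
```

In this sketch a mismatch in either checksum is enough to trigger a refresh; a crawler could also track a presentation-level checksum (e.g., over style attributes) along the same lines.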