Article Information

  • Title: Efficient Multi-threaded Crawling Using In Memory Data Structures
  • Author: Mohammad A.R. Abdeen
  • Journal: International Journal of Computer Science and Network Security
  • Print ISSN: 1738-7906
  • Year: 2020
  • Volume: 20
  • Issue: 2
  • Pages: 88-92
  • Publisher: International Journal of Computer Science and Network Security
  • Abstract: Crawling the internet is an important task for any search engine. A crawler is a software program that sends HTTP requests to various web servers available on the world datasphere and downloads their contents. As the size of the internet has gone through a big bang in the last decade, designing efficient parallel crawlers has become a necessity. One of the factors that degrade crawler performance is the disk access incurred every time a file is written. As the process of crawling the web requires the download of tens or hundreds of millions of webpages, much time will be consumed in disk writes due to seek times. This work presents an efficient multi-threaded crawler that incorporates an in-memory data structure to reduce the overall disk write time. The results show that the proposed technique can increase throughput by about 50% at selected sizes of the in-memory data structure, compared with a normal multi-threaded crawler that has no in-memory data structure. In addition, the results show that this design can achieve an average crawler speed of 22 pages/sec, which surpasses previously reported work.
  • Keywords: Web Crawlers; Distributed Applications; Multi-threading; In-memory Data Structures; Performance Evaluation.
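
The abstract describes the idea only at a high level: worker threads download pages, and an in-memory data structure absorbs them so that disk writes happen in batches rather than once per page. The following is a minimal sketch of that general idea, not the author's implementation. The names `PageBuffer`, `crawl`, `fetch`, and `capacity` are hypothetical, and the on-disk record layout is an assumption made for illustration.

```python
import queue
import threading
from typing import Callable, Iterable, List, Tuple


class PageBuffer:
    """Thread-safe in-memory buffer that batches pages into one disk write.

    Hypothetical helper; the paper's actual data structure is not specified
    in the abstract.
    """

    def __init__(self, path: str, capacity: int) -> None:
        self.path = path
        self.capacity = capacity  # pages held in memory before a flush
        self.flushes = 0          # number of batched disk writes so far
        self._pages: List[Tuple[str, bytes]] = []
        self._lock = threading.Lock()

    def add(self, url: str, body: bytes) -> None:
        """Store a page in memory; flush to disk once the buffer is full."""
        with self._lock:
            self._pages.append((url, body))
            if len(self._pages) >= self.capacity:
                self._flush_locked()

    def close(self) -> None:
        """Write any pages still held in memory."""
        with self._lock:
            if self._pages:
                self._flush_locked()

    def _flush_locked(self) -> None:
        # Caller must hold the lock. One append-mode write per batch
        # replaces one file write per page. Assumed record layout:
        # "url<TAB>length\n", then the body, then "\n".
        with open(self.path, "ab") as out:
            for url, body in self._pages:
                out.write(f"{url}\t{len(body)}\n".encode("utf-8"))
                out.write(body)
                out.write(b"\n")
        self._pages.clear()
        self.flushes += 1


def crawl(
    urls: Iterable[str],
    fetch: Callable[[str], bytes],
    buffer: PageBuffer,
    num_threads: int = 4,
) -> None:
    """Download every URL with a pool of threads, storing pages in `buffer`."""
    tasks: "queue.Queue[str]" = queue.Queue()
    for url in urls:
        tasks.put(url)

    def worker() -> None:
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return  # no work left for this thread
            try:
                buffer.add(url, fetch(url))
            except Exception:
                continue  # skip pages that fail to download

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    buffer.close()
```

In this sketch `fetch` is injected, so a real downloader (for example, one built on `urllib.request.urlopen`) or a stub can be passed in. For simplicity the lock is held during the flush, which stalls other threads while a batch is written; a design that swaps in a fresh list and writes outside the lock would avoid that, at the cost of extra bookkeeping. The `capacity` parameter plays the role of the "size of the in-memory data structure" that the abstract reports as affecting throughput.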