Abstract: With the rapid growth of digital technologies, crawlers and search engines
face unpredictable challenges. Focused web-crawlers are essential for mining the
vast amount of data available on the internet. Web-crawlers face an indeterminate
latency problem due to differences in response times. The proposed work attempts to
optimize the design and implementation of focused web-crawlers using a Master-Slave
architecture for Bioinformatics web sources. Focused crawlers ideally should
crawl only relevant pages, but the relevance of a page can be estimated only after
the genomics page has been crawled. A solution for predicting page relevance, based
on Natural Language Processing, is proposed in this paper. The frequency of
the keywords in the top-ranked sentences of a page determines the relevance of
pages within genomics sources. The proposed solution uses the TextRank algorithm to
rank the sentences and to ensure the correct classification of Bioinformatics
web pages. Finally, the model is validated by comparison with a breadth-first
search web-crawler. The comparison shows a significant reduction in run time for the
same harvest rate.
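
The relevance estimate described above (rank the sentences with TextRank, then measure keyword frequency in the top-ranked sentences) can be illustrated with a minimal Python sketch. It is not the paper's implementation. The sentence splitter, the word-overlap similarity, the damping factor, the iteration count, the `top_k` parameter, and all function names are assumptions made for illustration only.

```python
import math
import re


def split_sentences(text):
    """Naive splitter on ., ! and ? (an assumption for this sketch)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase alphanumeric tokens; no stemming or stop-word removal."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def similarity(a, b):
    """Word-overlap similarity normalised by log sentence lengths.

    One common choice for TextRank sentence graphs; the paper may use another.
    """
    if not a or not b:
        return 0.0
    denom = math.log(len(a)) + math.log(len(b))
    if denom <= 0:
        return 0.0
    return len(set(a) & set(b)) / denom


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences by weighted PageRank over the similarity graph."""
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    weights = [
        [similarity(tokens[i], tokens[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping)
            + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def page_relevance(text, keywords, top_k=3):
    """Fraction of tokens in the top_k ranked sentences that are keywords."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    scores = textrank(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    top_tokens = [t for i in ranked[:top_k] for t in tokenize(sentences[i])]
    if not top_tokens:
        return 0.0
    keyword_set = {k.lower() for k in keywords}
    hits = sum(1 for t in top_tokens if t in keyword_set)
    return hits / len(top_tokens)
```

In this sketch a crawler would call `page_relevance(page_text, keywords)` on each fetched page and compare the score with a cut-off of its own choosing. Both the keyword list and the cut-off are hypothetical inputs here, not values taken from the paper.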