Article information

Title: An Improved Extraction Algorithm from Domain Specific Hidden Web
Authors: Juhi Sharma; Mukesh Rawat
Journal: International Journal of Computer Science and Information Technologies
Electronic ISSN: 0975-9646
Publication year: 2014
Volume: 5
Issue: 6
Pages: 8239-8242
Publisher: TechScience Publications
Abstract: The web contains a large amount of information, which grows rapidly every day. The World Wide Web consists of the Surface Web (the publicly indexed Web) and the Deep Web, which consists of hidden data and is also referred to by other names such as the Hidden Web, Deepnet, or the Invisible Web. A user can access the Surface Web directly through a search engine, but to reach hidden data, users must manually enter a set of keywords into a typical search interface to retrieve these Hidden Web pages from their source web sites. The problem we address is devising efficient mechanisms to extract this information automatically beforehand, since crawlers cannot otherwise access it. In this paper we present a mechanism that extracts search forms from HTML pages spread over the web and automatically fills and submits those forms at their source sites, downloading the Hidden Web pages into a repository for further use by web crawlers.
Keywords: Hidden Web; Query Interface; Data Mining
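
The abstract describes extracting search forms from HTML pages and then filling and submitting them automatically. The record does not give the paper's actual algorithm, so the following is only a minimal sketch of that general idea using the Python standard library. The names FormExtractor, is_search_form, build_query_url, and SAMPLE_PAGE, as well as the "has a text field and no password field" heuristic, are assumptions made for illustration and are not taken from the paper. The sketch builds the submission URL for a GET form but does not send any request.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


class FormExtractor(HTMLParser):
    """Collect every <form> on a page with its action, method, and named fields."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None  # form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action") or "",
                "method": (attrs.get("method") or "get").lower(),
                "fields": [],
            }
        elif self._current is not None and tag in ("input", "select", "textarea"):
            name = attrs.get("name")
            if name:  # unnamed controls are not submitted, so skip them
                kind = (attrs.get("type") or "text").lower() if tag == "input" else tag
                self._current["fields"].append(
                    {"name": name, "type": kind, "value": attrs.get("value") or ""}
                )

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def is_search_form(form):
    """Hypothetical heuristic: a free-text field and no password field."""
    kinds = {field["type"] for field in form["fields"]}
    return bool(kinds & {"text", "search"}) and "password" not in kinds


def build_query_url(page_url, form, keyword):
    """Fill the first text field with the keyword and keep hidden field values."""
    params = {}
    filled = False
    for field in form["fields"]:
        if field["type"] in ("text", "search") and not filled:
            params[field["name"]] = keyword
            filled = True
        elif field["type"] == "hidden":
            params[field["name"]] = field["value"]
    return urljoin(page_url, form["action"]) + "?" + urlencode(params)


# Hypothetical page used only to exercise the sketch.
SAMPLE_PAGE = """
<form action="/search" method="get">
  <input type="text" name="q">
  <input type="hidden" name="lang" value="en">
  <input type="submit" value="Go">
</form>
<form action="/login" method="post">
  <input type="text" name="user"><input type="password" name="pw">
</form>
"""

extractor = FormExtractor()
extractor.feed(SAMPLE_PAGE)
search_forms = [form for form in extractor.forms if is_search_form(form)]
query_url = build_query_url(
    "http://example.org/catalog/index.html", search_forms[0], "data mining"
)
print(query_url)
```

A crawler built along these lines would fetch each generated URL and store the response in a repository. POST forms, JavaScript-driven interfaces, and the choice of domain-specific keywords are left out of this sketch.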