Basic Article Information

Title: Web Database Sampling Based on Dependency of Keywords
Authors: Zhang Rui; Wang Feng; Lin Peiguang et al.
Journal: The Open Cybernetics & Systemics Journal
Electronic ISSN: 1874-110X
Publication year: 2015
Volume: 9
Issue: 1
Pages: 375-383
DOI: 10.2174/1874110X01509010375
Publisher: Bentham Science Publishers Ltd
Abstract:
The Information Era has witnessed a huge number of data sources becoming available on websites. This abundance of useful data has made it possible for integration systems to improve the quality of the integrated data. However, how to efficiently choose proper data sources, so as to extract data with high coverage and low redundancy, is still a hot topic in the area. Sampling the databases hidden behind websites makes it possible to obtain the characteristics of these web databases, and in turn to choose appropriate sources when collecting data for integration and query optimization. In this paper we construct a sampling model that represents the data characteristics of web databases by posing keyword queries on the deep web query interface. The dependency among text attribute keywords within the data source is used to construct a dependent-relational probability matrix, which indicates the sample distribution and is used for keyword extension to fetch more sample data and obtain new characteristics of the actual data. Further, we provide an efficient method to evaluate the similarity between the sample databases and the real web databases. We evaluate the proposed method on a real-world dataset, and the results show that our method can sample web data sources with high similarity.
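
The abstract does not give the exact definition of the dependent-relational probability matrix, so the following is only a minimal sketch of one plausible reading. It assumes that the dependency of keyword b on keyword a is estimated as the fraction of sampled records containing a that also contain b, and that keyword extension picks the keywords most dependent on the current query keyword. The function names `dependency_matrix` and `extend_keywords`, the `top_n` parameter, and the toy records are all hypothetical and do not come from the paper.

```python
from collections import Counter
from itertools import permutations


def dependency_matrix(records):
    """Estimate P(b | a) from sampled records (assumed definition, not the paper's).

    Each record is a collection of keywords. The result maps an ordered
    pair (a, b) to the fraction of records containing a that also contain b.
    """
    single = Counter()  # number of records containing each keyword
    pair = Counter()    # number of records containing each ordered pair
    for rec in records:
        kws = set(rec)
        single.update(kws)
        pair.update(permutations(kws, 2))
    return {(a, b): n / single[a] for (a, b), n in pair.items()}


def extend_keywords(matrix, seed, used, top_n=2):
    """Pick up to top_n unused keywords that depend most strongly on seed."""
    candidates = [
        (p, b) for (a, b), p in matrix.items() if a == seed and b not in used
    ]
    # Highest probability first; ties are broken alphabetically for determinism.
    candidates.sort(key=lambda t: (-t[0], t[1]))
    return [b for _, b in candidates[:top_n]]


# Toy sample standing in for records returned by a deep web query interface.
sample = [
    {"database", "sampling"},
    {"database", "query"},
    {"database", "sampling", "keyword"},
    {"keyword", "query"},
]
matrix = dependency_matrix(sample)
next_keywords = extend_keywords(matrix, "database", used={"database"})
```

In this sketch, each round of sampling would issue the keywords in `next_keywords` as new queries, add the returned records to the sample, and recompute the matrix. How the paper actually weights, normalizes, or terminates this process is not stated in the abstract.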