Article Information

  • Title: Detecting Duplicates and near Duplicates Records in Large Datasets
  • Authors: Shailesh Singh; Syed Imtiyaz Hassan
  • Journal: International Journal on Computer Science and Engineering
  • Print ISSN: 2229-5631
  • Electronic ISSN: 0975-3397
  • Publication year: 2017
  • Volume: 9
  • Issue: 05
  • Pages: 178-185
  • Publisher: Engg Journals Publications
  • Abstract: The rapid growth in data volumes and the need to integrate data from heterogeneous sources make the efficient detection of duplicate records in databases a pressing challenge. Because the data sources are mutually inconsistent and autonomous, each may adopt its own conventions, and integrating data from them often introduces erroneous redundancy. To ensure high-quality data, the database must validate and filter incoming data from external sources. Data normalization has therefore become a necessity for maintaining the quality of the data stored in these databases. The process of identifying record pairs that represent the same entity is commonly known as duplicate record detection, and it is one of the most important tasks in data cleansing. The proposed work suggests an approach to improve the accuracy of duplicate record detection which, when combined with two other concepts, text similarity and edit distance, yields well-filtered data. The approach was trialled on data from a Scholarship Portal developed for various organizations. There, it was essential both to identify such records as completely as possible and to ensure that genuine students are not debarred from scholarships, since the portal supports various reservation/quota mechanisms.
  • Keywords: Big Data; Trigrams; Similarity; Levenshtein Edit Distance; Database data mining; Scholarships.
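
The abstract and keywords name trigram-based text similarity and Levenshtein edit distance as the two supporting concepts. The sketch below illustrates both measures in general terms; it is not the authors' implementation. The function names, the space padding of trigrams, the Jaccard formulation of similarity, the thresholds, the "either measure" decision rule, and the sample names are all assumptions made for illustration.

```python
def trigrams(text):
    """Return the set of 3-character substrings of a lower-cased, padded string.

    Padding with spaces (an assumed convention) lets short strings and
    word boundaries contribute trigrams as well.
    """
    padded = f"  {text.lower().strip()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a, b):
    """Jaccard similarity of the two trigram sets, in the range 0.0 to 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn a into b (two-row dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def is_near_duplicate(a, b, min_similarity=0.5, max_distance=2):
    """Hypothetical decision rule: flag a pair when either measure
    suggests a match. The thresholds are illustrative, not from the paper."""
    return (trigram_similarity(a, b) >= min_similarity
            or levenshtein(a.lower(), b.lower()) <= max_distance)


# Made-up sample values, not data from the paper.
print(is_near_duplicate("Rahul Kumar", "Rahul Kumaar"))
print(is_near_duplicate("Rahul Kumar", "Priya Sharma"))
```

In practice the thresholds would need tuning against labelled records, since a rule that is too loose flags genuine applicants as duplicates, which is the outcome the abstract says must be avoided.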