
Article Information

  • Title: An Improving Genetic Programming Approach Based Deduplication Using KFINDMR
  • Authors: P. Shanmugavadivu; N. Baskar
  • Journal: International Journal of Computer Trends and Technology
  • Electronic ISSN: 2231-2803
  • Year of publication: 2012
  • Volume: 3
  • Issue: 5
  • Publisher: Seventh Sense Research Group
  • Abstract: Record deduplication is the task of identifying, in a data repository, records that refer to the same real-world entity or object despite misspellings, typos, different writing styles, or even different schema representations or data types. The existing system provides an Unsupervised Duplication Detection (UDD) method that identifies and removes duplicate records from different data sources. Starting from the non-duplicate set, two cooperating classifiers, a Weighted Component Similarity Summing (WCSS) classifier and a Support Vector Machine (SVM), iteratively identify the duplicate records among the non-duplicate records; a genetic programming (GP) approach to record deduplication is also presented. This GP-based approach can automatically find effective deduplication functions. Because the genetic programming approach is time-consuming, we propose a new algorithm, KFINDMR (KFIND using Most Represented data samples), which finds the most represented data samples in order to improve the accuracy of the classifier. The proposed system calculates the mean value of the data samples as the centroid of the record members, then selects as the first most represented data sample the one closest to that mean, that is, the one at minimum distance from it. The system then removes the duplicate data samples and searches for an optimized solution to the deduplication of records or data samples.
  • Keywords: Extracting data; identifying duplication; deduplication; genetic programming
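
The selection step described in the abstract (compute the mean of the record members, then pick the sample closest to it) can be illustrated with a minimal sketch. This is not the authors' KFINDMR implementation: the function name `most_representative`, the representation of records as numeric feature vectors, and the use of Euclidean distance are all assumptions made for illustration, since the abstract does not specify them.

```python
import math


def most_representative(samples):
    """Return (index, centroid, distance) for the sample closest to the mean.

    Hypothetical helper: assumes each sample is a numeric feature vector of
    equal length and that "closest" means minimum Euclidean distance.
    """
    if not samples:
        raise ValueError("samples must be non-empty")

    n = len(samples)
    dim = len(samples[0])

    # Centroid: per-dimension mean over all samples.
    centroid = [sum(s[i] for s in samples) / n for i in range(dim)]

    # Distance of every sample to the centroid.
    distances = [math.dist(s, centroid) for s in samples]

    # The sample at minimum distance is taken as the most represented one.
    best = min(range(n), key=distances.__getitem__)
    return best, centroid, distances[best]


if __name__ == "__main__":
    records = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
    index, centre, dist = most_representative(records)
    print(index, centre, dist)
```

Under these assumptions, the selected sample could serve as a compact training example for the classifier in place of the full group of records; how the paper actually uses it is not stated in the abstract.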