Detection and elimination of duplicate data using token-based method for a data warehouse: a clustering based approach.
Tamilselvi, J. Jebamalar ; Saravanan, V.
The process of detecting and removing database defects and
duplicates is referred to as data cleaning. The fundamental issue of
duplicate detection is that inexact duplicates in a database may refer
to the same real-world object due to errors and missing data. Duplicate
elimination is hard because duplicates arise from different types of
errors, such as typographical errors, missing values, abbreviations and
different representations of the same logical value. In the existing
approaches, duplicate detection and elimination is domain dependent.
These domain-dependent methods rely on similarity functions and
thresholds for duplicate elimination and produce many false positives.
This research work presents a general sequential framework for
duplicate detection and elimination. The proposed framework uses six
steps to improve the process of duplicate detection and elimination.
First, an attribute selection algorithm is used to identify and select
the attributes best suited for duplicate identification and
elimination. Tokens are then formed for the selected attribute field
values. After token formation, a clustering algorithm or blocking
method is used to group the records based on their similarity values;
the best blocking key is selected by comparing the duplicate detection
performance of each candidate key. In the next step, the threshold
value is calculated based on the similarities between records and
fields. Then, a rule-based approach is used to detect duplicates and to
eliminate poor-quality duplicates by retaining only the best copy of
each duplicate record. Finally, all the cleaned records are grouped or
merged and made available for the next process.
This research work reduces the number of false positives without
missing true duplicates. In contrast to previous approaches, the token
concept is included to speed up the data cleaning process and reduce
its complexity. Several blocking keys are analysed through extensive
experiments to select the best blocking key, which brings similar
records together and avoids comparing all pairs of records. A
rule-based approach is used to identify exact and inexact duplicates
and to eliminate the duplicates.
Introduction
In the 1990s, as large organizations began to need more timely data
for their business, they found that traditional information systems
technology was simply too cumbersome to provide relevant data
efficiently and quickly. Completing reporting requests could take days
or weeks using antiquated reporting tools that were designed more to
'execute' the business than to 'run' the business.
A data warehouse is basically a database, and unintentional
duplication of the records created from the millions of records loaded
from other sources can hardly be avoided. In the data warehousing
community, the task of finding duplicated records within a data
warehouse has long been a persistent problem and has become an area of
active research. There have been many research undertakings to address
the problem of duplicate contamination of data.
There are two issues to be considered for duplicate detection:
Accuracy and Speed. The measure of accuracy in duplicate detection
depends on the number of false negatives (duplicates that were not
classified as such) and false positives (non-duplicates which were
classified as duplicates). The algorithm's speed is mainly affected
by the number of records compared, and how costly these comparisons are.
Generally, CPUs are not able to perform duplicate detection on large
databases within any reasonable time, so the number of record
comparisons normally needs to be cut down [4].
In this research work, a framework is developed to handle duplicate
data in a data warehouse. The main objective of this research work is
to improve data quality and increase the speed of the data cleaning
process. A high-quality, scalable blocking algorithm, a similarity
computation algorithm and a duplicate elimination algorithm are used
and evaluated on real datasets from an operational data warehouse to
achieve this objective.
Framework Design
A sequential framework is developed for the detection and
elimination of duplicate data. This framework comprises some existing
data cleaning approaches and new approaches that reduce the complexity
of duplicate data detection and elimination and allow cleaning with
more flexibility and less effort.
Fig. 1 shows the framework for cleaning duplicate data in a
sequential order. Each step of the framework serves a distinct purpose.
The framework adapts to the data by using a software agent in each
step, with little user interaction.
The principle of this framework is as follows:
A. Selection of attributes: There is a clear need to identify and
select attributes. These selected attributes are used in the other
steps.
B. Formation of tokens: A well-suited token is created to check the
similarity between records as well as fields.
C. Clustering/Blocking of records: A clustering algorithm or blocking
method is used to group the records based on the similarity of the
block-token-key value.
D. Similarity computation for selected attributes: The Jaccard
similarity method is used for token-based similarity computation.
E. Detection and elimination of duplicate records: A rule-based
detection and elimination approach is used for detecting and
eliminating the duplicates.
F. Merge: The resulting cleaned data is merged.
A. Selection of attributes
The data cleaning process is complex with the large amount of data
in the data warehouse. Attribute selection is very important for
reducing the time and effort of the subsequent steps, such as record
similarity computation and the elimination process. Attribute selection
is very important when comparing two records [5]. This step is the
foundation for all the remaining steps. Therefore, time and effort are
two important requirements for promptly and qualitatively selecting the
attributes to be considered.
[FIGURE 1 OMITTED]
A missing value is expressed and treated as a string of blanks; such
missing character values are not the smallest strings. A distinct count
is used to retrieve the number of rows that have unique values for each
attribute. Uniqueness is the characteristic of an attribute that makes
up its relevance for duplicate detection. Uniqueness is a property that
is reflected in the distribution of the similarity of the
attribute pairs of non-duplicates, i.e. if an attribute is highly
unique, the mean similarity on non-duplicates for this attribute will be
low. The measurement type is also considered for attribute selection.
Data cleaning with numeric data alone is not effective; categorical
data is efficient for the data cleaning process.
To identify relevant attributes for the further data cleaning
process, three criteria are used: (a) identifying key attributes, (b)
classifying attributes with high distinct values and low missing
values, and (c) the measurement types of the attributes. Based on this
information, a 'weight' or 'rank value' is calculated for all the
attributes. Finally, the highest-priority attributes are selected for
the further data cleaning process [2].
The flow diagram (Fig. 3) shows the attribute selection procedure in
a sequential way. First, the data set is identified for the data
cleaning process. From this data set, attributes are analyzed by
identifying the type of each attribute, the relationships between
attributes, and the properties of each attribute in order to select the
appropriate attributes. The type of an attribute is classified using
its data type and the size or length of the attribute. A threshold
value is used to identify the best attributes for the data cleaning
process. The threshold value is measured using three different
criteria: high threshold value, data quality and high rank. A high
threshold value identifies attributes with high discriminating power
for the data cleaning process. Attributes are ranked based on the
threshold value and the data quality value. Finally, high-rank
attributes are selected for the next cleaning step to improve the speed
of the data cleaning process. The developed attribute selection
algorithm is presented in Fig. 2.
The attribute selection algorithm can eliminate both irrelevant and
redundant attributes and is applicable to any type of data (nominal,
numeric, etc.). It can also handle data of different attribute types
smoothly. The quality of the algorithm is confirmed by applying a set
of rules. The main purpose of this attribute selection for data
cleaning is to reduce the time and improve the speed of the further
data cleaning steps, such as token formation, record similarity
computation and the elimination process, in an efficient way.
Figure 2: Algorithm for attribute selection.

Input      : N attributes, number of tuples n, relation instance r
Output     : S - subset of the attributes
Initialize : L - temporary relation instance, X - attribute set
Var        : L - temporary relation instance, X - attribute set, F - field set, i, j

begin
 I.   Analyze attribute set X to select the best attributes
 II.  Calculate threshold value a for each attribute
      for each attribute x_i, i in {0, 1, 2, ..., N}
      do
        i)  Assign threshold value 1 for a key attribute and put it into L
        ii) Calculate threshold a with (sigma : D ^ M ^ MT ^ S), where S is the size of the field
            a. Distinct (D) value of attribute x_i: the number of tuples t_i, i = 1..n,
               whose values are distinct
            b. Missing (M) value of attribute x_i: the number of tuples t_i, i = 1..n,
               with t_i = NULL
            c. Measurement type (MT) of attribute x_i (ordinal, nominal, interval, ratio)
            Put the result into L
      end
 III. Select attributes with high threshold value a_i and put them into L
 IV.  Ignore attributes with low threshold value
 V.   Calculate data quality dq_j of the selected attributes
      for each attribute x_i, i in {0, 1, 2, ..., N}
        for each field f_j, j in {0, 1, 2, ..., n}
          compute
            a. Completeness_j = (sum_{i=0..N} completeness_{i,j}) / n
            b. Accuracy_j     = (sum_{i=0..N} accuracy_{i,j}) / n
            c. Consistency_j  = (sum_{i=0..N} consistency_{i,j}) / n
            d. dq_j = (alpha * Completeness_j + beta * Accuracy_j + gamma * Consistency_j) / n
        end for
      end for
 VI.  Rank attributes based on data quality and threshold value
 VII. Select high rank attributes
end
[FIGURE 3 OMITTED]
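To make the ranking step concrete, the following Python sketch scores attributes from distinct values, missing values and measurement type, roughly following steps II-VII of Fig. 2. It is a minimal illustration only: the equal weighting of the three components, the 0.6 cut-off and the toy records are assumptions, not values taken from this work.

# Minimal sketch of the attribute-ranking idea in Fig. 2 (assumed weights and cut-off).
def attribute_threshold(records, attribute, key_attributes=()):
    """Score one attribute in [0, 1]: distinct, rarely missing, categorical is best."""
    if attribute in key_attributes:
        return 1.0                                          # step (i): key attribute
    values = [r.get(attribute) for r in records]
    n = len(values)
    distinct_ratio = len({v for v in values if v not in (None, "")}) / n   # Distinct (D)
    missing_ratio = sum(1 for v in values if v in (None, "")) / n          # Missing (M)
    # Measurement type (MT): categorical data is preferred over purely numeric data.
    non_missing = [v for v in values if v not in (None, "")]
    mt_score = 1.0 if all(isinstance(v, str) and not v.isdigit() for v in non_missing) else 0.5
    return (distinct_ratio + (1.0 - missing_ratio) + mt_score) / 3.0

def select_attributes(records, attributes, key_attributes=(), cutoff=0.6):
    """Rank attributes by threshold value and keep the high-ranking ones (steps VI-VII)."""
    scores = {a: attribute_threshold(records, a, key_attributes) for a in attributes}
    return sorted((a for a, s in scores.items() if s >= cutoff),
                  key=lambda a: scores[a], reverse=True)

# Toy example: 'name' ranks above 'city', which has a missing value.
rows = [{"name": "John Smith", "phone": "4412", "city": "Chennai"},
        {"name": "J. Smith",   "phone": "4412", "city": None}]
print(select_attributes(rows, ["name", "phone", "city"]))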
B. Formation of Tokens
This step makes use of the selected attribute field values to form
tokens. Tokens can be created for a single attribute field value or for
combined attributes. For example, suppose the contact name attribute is
selected for creating a token for the further cleaning process. The
contact name attribute is split into first name, middle name and last
name; here the first name and last name are combined as the contact
name to form a token.
Tokens are formed from numeric values, alphanumeric values and
alphabetic values by selecting some combination of characters.
Unimportant elements (title tokens like Mr., Dr. and so on) are removed
before the token formation [6].
Numeric tokens comprise only digits [0-9]. Alphabetic tokens consist
of alphabetic characters (A-Z, a-z); the first character of each word in
the field is taken and the characters are sorted. Alphanumeric tokens
comprise both numeric and alphabetic parts; a given alphanumeric element
is decomposed into its numeric and alphabetic components [7].
This step eliminates the need to compare the entire string records
in multiple passes for duplicate identification. It also reduces the
similarity computation problem in a large database by forming a token
key from selected fields, which reduces the number of comparisons.
Fig. 4 shows the algorithm for token formation. In this algorithm,
rules are specified to form the tokens, and the algorithm works
according to the type of the data. For example, if the address
attribute is selected, the alphanumeric rule is used to form the
tokens. The formed tokens are stored in a LOG table.
The idea behind this algorithm is to define smart tokens from the
fields of the selected most important attributes by applying simple
rules for numeric, alphabetic, and alphanumeric tokens. The temporary
table then consists of smart token records composed from the field
tokens of the records. These smart token records are sorted using the
block-token-key.
Figure 4: Algorithm for token formation.

Input  : Tables with dirty data, reference table, selected attributes
Output : LOG table with tokens
Var    : i, j, m - attribute set, n - number of records

begin
 for attribute i = 1 to m
   for row j = 1 to n
   do
     i)   remove unimportant characters
     ii)  expand abbreviations using the reference table
     iii) if row(j) is numeric then
            a. convert the string into a number
            b. sort or arrange the digits in order
            c. form a token, then put it into the LOG table
          end if
     iv)  if row(j) is alphanumeric then
            a. separate the numeric and alphabetic parts
            b. split the alphanumeric value into its numeric and alphabetic components
            c. sort the numeric and alphabetic parts separately
            d. form a token, then put it into the LOG table
          end if
     v)   if row(j) is alphabetic then
            a. select the first character of each word
            b. sort these letters in a specific order
            c. string them together
            d. if only one word is present, take its first three characters as the
               token, then sort the characters
            e. form a token, then put it into the LOG table
          end if
   end for
 end for
end
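A small Python sketch of the three token rules in Fig. 4 is given below. The abbreviation table, the title list and the exact sorting choices are illustrative assumptions; the reference table used in this work is not reproduced here.

# Hedged sketch of the numeric, alphabetic and alphanumeric token rules of Fig. 4.
import re

ABBREVIATIONS = {"st": "street", "rd": "road"}   # stand-in for the reference table
TITLES = {"mr", "mrs", "ms", "dr", "prof"}       # unimportant title tokens to drop

def form_token(value):
    """Form a smart token from one field value."""
    words = [w.lower() for w in re.split(r"\W+", str(value)) if w]
    words = [ABBREVIATIONS.get(w, w) for w in words if w not in TITLES]
    text = "".join(words)
    if text.isdigit():                               # numeric rule: sort the digits
        return "".join(sorted(text))
    if text.isalpha():                               # alphabetic rule
        if len(words) == 1:                          # single word: first three characters, sorted
            return "".join(sorted(words[0][:3]))
        return "".join(sorted(w[0] for w in words))  # first letter of each word, sorted
    digits = "".join(sorted(c for c in text if c.isdigit()))    # alphanumeric rule:
    letters = "".join(sorted(c for c in text if c.isalpha()))   # sort the parts separately
    return digits + letters

# Two spellings of the same contact name collapse to the same token.
print(form_token("Mr. John Smith"), form_token("Smith, John"))   # js js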
C. Clustering/Blocking of records
Data mining primarily works with large databases, and sorting large
datasets and running the duplicate elimination process on them face
scalability problems. Clustering techniques are used to group the
dataset into small groups based on distance values or threshold values,
to reduce the time needed for the elimination process.
Blocking methods are used to reduce the huge number of comparisons.
This step groups the records that are most likely to be duplicates
based on the similarity of the block-token-key; i.e. the block-token-key
is used to split the data set into blocks, so the block-token-key is
very important for blocking the records. Four types of block-token-key
generation are used, and the best block-token-key for blocking is
identified: i) blocking with a single attribute, ii) blocking with
multiple attributes, iii) array-based block-token-key and iv)
token-based blocking key. The choice of a good block-token-key can
greatly reduce the number of record pair evaluations to be performed,
so the user can achieve significant results.
Figure 5: Algorithm for blocking records on single and multiple attributes.

Input  : Data set (records) D, keys k[]
Output : Blocked records
Var    : n <- number of records, blocks b[], i, j

begin
 1. Sort the database using the block-token-key
 2. Initialize a new block
 3. Block the records

 i. Single attribute
    for each record i = 0 to n
      for each key j = 1 to kn
        for each block k = 1 to bn
          if distance(key[j], key[j + 1]) > threshold
            add key[j] to block b[k]
          else
            add key[j + 1] to b1
            initialize new block b[k + 1]
          end if
          j = j + 1
        end for
      end for
    end for

 ii. Multiple attributes
    for each record i = 0 to rn
      for each column j = 0 to cn
        for each key k = 1 to kn
          for each block b = 1 to bn
            calculate the distance d(values) over all selected column key values
            if distance(d[values]) > threshold
              add key[j] to block b[k]
            else
              add key[j + 1] to b1
              initialize new block b[k + 1]
            end if
            j = j + 1
          end for
        end for
      end for
    end for
end
This algorithm efficiently identifies exact and inexact duplicate
records based on the selected block-token-key. The number of
comparisons is reduced compared with other existing methods, which
increases the speed of the data cleaning process. By shrinking or
expanding the window size based on the block-token-key, this blocking
method gives better performance than other methods.
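The single-attribute case can be pictured with the short Python sketch below: records are sorted by their block-token-key and a new block is started whenever adjacent keys diverge. The distance function and the 0.4 threshold are placeholders, not the values used in these experiments.

# Hedged sketch of single-attribute blocking on a sorted block-token-key.
import difflib

def key_distance(a, b):
    """Simple string distance in [0, 1]; 0 means identical block-token-keys."""
    return 1.0 - difflib.SequenceMatcher(None, a, b).ratio()

def block_records(records, key_field, threshold=0.4):
    """Sort by block-token-key and start a new block whenever adjacent keys diverge."""
    if not records:
        return []
    ordered = sorted(records, key=lambda r: r[key_field])
    blocks, current = [], [ordered[0]]
    for prev, rec in zip(ordered, ordered[1:]):
        if key_distance(prev[key_field], rec[key_field]) > threshold:
            blocks.append(current)        # keys diverge: close the current block
            current = []
        current.append(rec)
    blocks.append(current)
    return blocks

# Records sharing the token 'js' fall into one block; 'xz' starts a new one.
rows = [{"token": "js", "id": 1}, {"token": "xz", "id": 3}, {"token": "js", "id": 2}]
print([[r["id"] for r in b] for b in block_records(rows, "token")])   # [[1, 2], [3]]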
D. Similarity Computation for Selected Attributes
Record linkage algorithms fundamentally depend on string similarity
functions for record fields as well as on record similarity functions
for comparing whole records. Similarity computation functions depend on
the data type, so the user must choose the function according to the
attribute's data type, for example numerical, string and so on.
This step uses the Jaccard similarity function to compare the token
values of adjacent field values for the selected attributes.
Tokenization is typically performed by treating each individual word of
a certain minimum length as a separate token or by taking the first
character of each word. In step (ii), tokens were created for the
selected attributes. Each function measures the similarity of the
selected attributes with other record fields and assigns a similarity
value to each field. In the next step, clustering techniques are used
to group the fields based on the similarity values.
Accurate similarity functions are important for the clustering and
duplicate detection problem. A better string distance is also useful to
classify record pairs as matches or non-matches. These matching and
non-matching pairs are used for clustering and for eliminating the
duplicates.
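As a reminder of the metric used in this step, a minimal Python sketch of the token-based Jaccard similarity is shown below; the word-level tokenization in the example is an illustrative assumption.

# Jaccard similarity of two token sets: |A intersect B| / |A union B|.
def jaccard(tokens_a, tokens_b):
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0                 # two empty fields are treated as identical
    return len(a & b) / len(a | b)

# Two address spellings share 3 of their 5 distinct word tokens.
print(jaccard("12 main street chennai".split(),
              "12 main st chennai".split()))   # 0.6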
E. Detection and elimination of duplicate data
In step (v), the rule-based duplicate detection and elimination
approach is used to detect and eliminate duplicate records. During the
elimination process, only one copy of each set of duplicated records is
retained and the other duplicate records are eliminated [8], [9]. The
elimination process is very important for producing clean data. The
earlier steps are used to identify the duplicate records; this step
detects and removes the duplicate records from one or many clusters.
Before the elimination process, the similarity threshold values for all
the records in the dataset are calculated. These similarity threshold
values are important for the elimination process.
The comparison process performed in step (iv) is also required in
step (v). To reduce the comparisons, the threshold or similarity values
are stored in a LOG file. The threshold criteria and certainty factors
are used to detect and eliminate the duplicate records. Finally, one
record is retained as the prime representative and this value is
maintained in the LOG file. This prime copy will also be used for the
incremental cleaning process and for further work. This approach can
substantially reduce the probability of false mismatches, with a
relatively small increase in running time.
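The following Python sketch illustrates the elimination idea of this step under simplified assumptions: within a block, any pair whose similarity reaches the threshold is treated as a duplicate and only the most complete record is kept as the prime representative. The completeness rule stands in for the certainty factors used here, and the 0.7 threshold is only an example.

# Hedged sketch of rule-based elimination within one block (assumed threshold and rule).
def completeness(record):
    """Rate a record by how many of its fields are filled in."""
    return sum(1 for v in record.values() if v not in (None, ""))

def eliminate_duplicates(block, similarity, threshold=0.7):
    """Keep one prime representative per duplicate group within a block."""
    kept = []
    for record in block:
        for i, other in enumerate(kept):
            if similarity(record, other) >= threshold:    # rule fires: duplicates
                if completeness(record) > completeness(other):
                    kept[i] = record                       # keep the more complete copy
                break
        else:
            kept.append(record)                            # no duplicate found in kept
    return kept

# Example with a simple name-overlap similarity; the more complete record survives.
def name_overlap(r, s):
    a, b = set(r["name"].lower().split()), set(s["name"].lower().split())
    return len(a & b) / len(a | b)

block = [{"name": "John Smith", "city": ""},
         {"name": "john smith", "city": "Chennai"}]
print(eliminate_duplicates(block, name_overlap))   # keeps the record with the city filled in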
F. Merge
This step merges the corrected data into a single cluster [10], [11].
The user must maintain the merged records and the prime representatives
as a separate file in the data warehouse. This information helps the
user make further changes in the duplicate elimination process. The
merge step is also useful for incremental data cleaning: when new data
enters the data warehouse, incremental data cleaning checks the new
data against the already created LOG file. Hence, this reduces the time
for the data cleaning process.
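A minimal sketch of the incremental-cleaning idea follows, assuming the LOG table can be modelled as an in-memory mapping from tokens to prime representatives; the actual LOG layout used in this work is not specified here.

# New records are screened against tokens already stored in the LOG, so only
# genuinely new data proceeds to the full cleaning pipeline (assumed LOG layout).
log_tokens = {}                          # token -> prime representative record

def incremental_clean(new_record, token):
    """Return the prime representative for a new record, updating the LOG if it is new."""
    if token in log_tokens:
        return log_tokens[token]         # duplicate of an already cleaned record
    log_tokens[token] = new_record       # first occurrence becomes the prime copy
    return new_record

# Example: the second arrival with the same token is recognised as a duplicate.
first = incremental_clean({"name": "John Smith"}, "js")
second = incremental_clean({"name": "J. Smith"}, "js")
print(second is first)                   # True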
Experimental Results
An attribute selection algorithm is implemented to select the most
important attributes, i.e. those having enough information for
identifying duplicate records. The selected attributes are then used
for duplicate record detection. The results below are presented for
different attribute values, numbers of duplicate records detected and
token formation times. The efficiency of duplicate detection and
elimination largely depends on the selection of attributes. The token
formation algorithm is used to form tokens for the selected attribute
field values in order to reduce the time needed for the cleaning
process.
Attribute Selection with parameters
Wrong selection of attributes affects the performance and accuracy
of the data cleaning process. Hence, key fields are selected in such a
way that they contain sufficient information to identify duplicated
records. Fig. 6 shows the attribute selection for the data cleaning
process. The threshold value is calculated using four important
criteria: size of the field, missing values, distinct values and
measurement type. The best attributes are selected for the data
cleaning process based on the threshold value. The name attribute has
the highest threshold value compared to the other attributes, and seven
attributes are selected for the next stage of data cleaning.
[FIGURE 6 OMITTED]
Attribute Vs Duplicates
The identification of duplicates is mainly based on the selection of
attributes and the selection of the window size. In the existing
methods, a fixed-size sliding window is used to minimize the number of
comparisons. In this method, a dynamic window size is used, based on
the similarity of field values, and the best duplicate detection
accuracy is obtained with this dynamic method. Fig. 7 shows how the
number of duplicate records detected varies as the sliding window size
changes and for the dynamic window size. The duplicate detection result
varies with the selected window size and with the dynamic size. To test
this, results are taken by varying the attribute values for each
execution, setting the window size between 10 and 50 as well as using
the dynamic size.
[FIGURE 7 OMITTED]
Duplicates Vs No. of Attributes
The efficiency of the record matching algorithm mainly depends on
the selection of the attributes. It can be observed from Fig. 8 that
the selection of key columns greatly influences the result. Table 1
shows the selected key fields and the approximate number of duplicate
records detected for those key fields. Selecting a combination of
multiple attributes is more useful for identifying exact and inexact
duplicates.
Time Vs Token Formation
In this research work, the token formation step takes very little
time. Table loading and attribute selection from the data warehouse
also take only a few seconds to select the best attributes for the data
cleaning process (shown in Fig. 9). The time varies with the size of
the data set.
[FIGURE 9 OMITTED]
Similarity Time Vs Database Size
In the multiple-similarity method, entire records and fields are
compared character by character using an exact string matching
function. In the token-similarity method, the token values of adjacent
selected attribute fields are compared. The time taken by the
token-based similarity method is lower than that of the
multiple-similarity method. Fig. 10 shows the variation in time between
token-based similarity and multiple-attribute similarity for different
database sizes. The speed of the similarity computation is increased by
token-based similarity computation. As Fig. 10 shows, the time taken
for similarity computation increases with the size of the data;
nevertheless, the time taken by the token-based similarity computation
remains lower than that of the multi-similarity computation.
[FIGURE 10 OMITTED]
Duplicates Vs Size of dataset and attributes
The number of blocks varies with the size of the dataset and the
number of duplicates. Basically, the size of a block depends on the
number of duplicates available in the data set. In Fig. 11, the
identification of duplicates is very low using the name and address2
attributes and high using the phone and address1 attributes. The phone
and address1 attributes identify more duplicates because they have more
distinct values than the other attributes. As a result, high-power
attributes identify more duplicates than other attributes. At the same
time, the identification of duplicates varies for different sizes of
data.
Duplicates Vs Window
Fig. 12 shows the relationship between the identification of
duplicates, the window size and the attributes. In this figure, the
size of each window and the number of windows vary for each attribute,
and the number of identified duplicates also varies with each
attribute. Accurate results are obtained using the address1 and phone
attributes; these two attributes have enough information, i.e. a high
number of distinct values, for duplicate identification. The name and
address2 attributes have fewer unique values than the other attributes,
and the number of wrongly identified duplicates is high with these two
attributes. As a result, smaller blocks require fewer comparisons but
match pairs are missed, while pair completeness results improve with
larger windows. Fig. 12 also shows that the identification of
duplicates varies with the size of the window: as the number of
duplicates increases, the window size also increases for the different
attributes.
[FIGURE 12 OMITTED]
Pair completeness for blocking keys and dynamic window
Fig. 13 shows the pair completeness for different types of blocking
keys. Pair completeness (PC) is defined as PC = S_M / N_M, where S_M is
the number of true-match record pairs in the set of record pairs
produced for comparison by the blocking keys and the dynamic window
size blocking method, and N_M is the total number of true-match record
pairs in the entire data. These results show the pair completeness
obtained using the blocking keys and the dynamic blocking method. Pair
completeness varies with the blocking key as the size of the dataset
increases, and the size of the window varies based on the resemblance
of the records. In this research work, the token-based blocking key
gives the best pair completeness results: it identifies more duplicates
than the other three blocking key methods. Hence, the token-based
blocking key gives better results in pair completeness.
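A small worked example of the metric, with assumed counts rather than figures from these experiments:

# Pair completeness PC = S_M / N_M with assumed counts (not values from Fig. 13):
true_matches_in_blocks = 930     # S_M: true-match pairs that the blocking kept together
true_matches_total = 1000        # N_M: true-match pairs in the entire data set
print(true_matches_in_blocks / true_matches_total)   # PC = 0.93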
To evaluate the performance of this research work, the errors
introduced in duplicate records range from small typographical changes
to large changes of some fields. Generally, the database duplicate
ratio is the ratio of the number of duplicate records to the number of
records in the database. To analyze the efficiency of this research
work, the proposed approach is applied to a selected data warehouse
with varying window sizes, database duplicate ratios and database
sizes. In all tests, the time taken for the duplicate data detection
and elimination processes is analyzed to evaluate the time saved by
this research work.
[FIGURE 13 OMITTED]
Duplicates Vs Threshold Values
Fig. 14 shows the performance of duplicate detection with false
mismatches as the threshold value is varied. Table 2 contains the
number of duplicates detected by each combination of attributes for a
particular threshold value. The values shown in bold represent the
total number of duplicate records detected at the optimum threshold
value. The optimum threshold values are 0.6, 0.7 and 0.8. The results
are not accurate if the threshold value is greater than or less than
the optimum threshold value: the numbers of mismatches and false
mismatches increase and it is not possible to detect exact and inexact
duplicates correctly. Hence, the accuracy of duplicate detection
degrades. The threshold value is therefore an important parameter for
the duplicate detection process.
[FIGURE 14 OMITTED]
Percent of incorrectly detected duplicated pairs Vs Attributes
Fig. 15 shows that the false positive ratio increases as the window
size increases. The identification of duplicates depends on the keys
selected and the size of the window. This figure shows the percentage
of records incorrectly marked as duplicates as a function of the window
size. The percentage of false positives is almost insignificant for
each independent run and grows slowly as the window size increases; it
grows very slowly when the window size is dynamic. This suggests that
the dynamic window size is more accurate than a fixed-size window for
duplicate detection.
[FIGURE 15 OMITTED]
Time taken on databases with different dataset size
Fig. 16 shows the variation of the time taken for different database
sizes, along with the time taken by each step of this research work.
The time taken by the proposed research work increases as the database
size increases, and it also increases when the duplicate ratio of the
dataset increases. The time taken for duplicate data detection and
elimination therefore depends mainly on the size of the dataset and the
duplicate ratio in the dataset. The time saved is much larger than in
existing work because the token-based approach reduces the time taken
for the cleaning process and improves the quality of the data.
Time performance Vs Different size databases and Percentage of
created duplicates
Duplicates amounting to 10%, 30% and 50% of the records are created
in the selected datasets for result analysis, and the results are shown
in Fig. 17. The time taken for duplicate detection and elimination
varies with the size of the data and the percentage of duplicates
available in the dataset. For these relatively large databases, the
time seems to increase linearly as the size of the database increases,
independently of the duplicate factor.
[FIGURE 17 OMITTED]
Conclusion
Deduplication and data linkage are important tasks in the
pre-processing step of many data mining projects. It is important to
improve data quality before data is loaded into the data warehouse.
Locating approximate duplicates in large databases is an important part
of data management and plays a critical role in the data cleaning
process. In this research work, a framework is designed to clean
duplicate data in order to improve data quality and to support any
subject-oriented data. This framework is useful for developing a
powerful data cleaning tool by using the existing data cleaning
techniques in a sequential order.
A new attribute selection algorithm is implemented and evaluated
through extensive experiments. The attribute selection algorithm can
eliminate both irrelevant and redundant attributes and is applicable to
any type of data (nominal, numeric, etc.); it also handles data of
different attribute types smoothly. The quality of the algorithm's
results is confirmed by applying a set of rules. The main purpose of
this attribute selection for data cleaning is to reduce the time of the
further data cleaning steps, such as token formation, record similarity
computation and the elimination process, in an efficient way.
The token formation algorithm is used to form smart tokens for data
cleaning and is suitable for numeric, alphanumeric and alphabetic data;
three different rules are described for numeric, alphabetic, and
alphanumeric tokens. The result of the token-based data cleaning is the
removal of duplicate data in an efficient way. The time is reduced by
the selection of attributes and by the token-based approach. The formed
tokens are stored in the LOG table. The time required to compare entire
strings is greater than that required to compare tokens. The formed
token is used as the blocking key in the further data cleaning process,
so token formation is very important for defining the best, smart
token.
Using an unsuitable key that is not able to group the duplicates
together has a detrimental effect on the result, i.e. many false
duplicates are detected in comparison with the true duplicates when
using, say, the address key. Hence, key creation and the selection of
attributes are important in the blocking method for grouping similar
records together. The selection of the most suitable blocking key
(parameter) for the blocking method is addressed in this research work.
Dynamically adjusting the blocking key for the blocking method is
effective in record linkage algorithms at execution time. The blocking
key is selected based on the type of the data and the usage of the data
in the data warehouse. The dynamically adjusted blocking key and the
token-based blocking key, as well as the dynamic window size SNM
method, are used in this research work. An agent is used to tune the
parameters, so everything is set dynamically for the blocking method
without human intervention, to yield better performance. In most
real-world problems, where expert knowledge is hard to obtain, it is
helpful to have methods that can automatically choose reasonable
parameters.
Time is critical when cleansing large databases. In this research
work, an efficient token-based blocking method and similarity
computation method are used to reduce the time taken for each
comparison, and an efficient duplicate detection and elimination
approach is developed to obtain good duplicate detection and
elimination results by reducing false positives. The performance of
this research work shows significant time savings and better duplicate
detection results than the existing approaches.
Compared with previous approaches, this new framework includes the
token concept to speed up the data cleaning process and reduce its
complexity. Each step of the framework is specified clearly in a
sequential order by means of the six data cleaning processes offered:
attribute selection, token formation, clustering, similarity
computation, elimination, and merging. An agent is used in this
framework to reduce the effort required from the user; this agent works
according to the type and size of the data set. The framework is
flexible for all kinds of data in relational databases.
The framework is mainly developed to increase the speed of the
duplicate data detection and elimination process and to increase the
quality of the data by identifying true duplicates while being strict
enough to keep out false positives. The accuracy and efficiency of the
duplicate elimination strategy are improved by introducing the concept
of a certainty factor for a rule. Data cleansing is a complex and
challenging problem; this rule-based strategy helps to manage the
complexity, but does not remove it. The approach can be applied to
subject-oriented databases in any domain. The proposed research work
maintains LOG files of the whole cleaning process for incremental data
cleaning.
References
[1] Dorian Pyle, Data Preparation for Data Mining, Published by
Morgan Kaufmann, ISBN 1558605290, 9781558605299, 540 pages, 1999.
[2] Jiawei Han, Micheline Kamber, Data Mining: Concepts and
Techniques, Publisher: Elsevier Science & Technology Books,
ISBN-13:9781558609013, March 2006.
[3] Hans-Peter Kriegel, Karsten M. Borgwardt, Peer Kröger, Alexey
Pryakhin, Matthias Schubert, Arthur Zimek, Future trends in data mining,
Data Mining and Knowledge Discovery, Volume 15, Issue 1, pp. 87-97,
ISSN: 1384-5810, 2007.
[4] Robert Leland, Duplicate Detection with PMC - A Parallel
Approach to Pattern Matching, Ph.D. Thesis, Department of Computer and
Information Science, Norwegian University of Science and Technology,
August 2007.
[5] I. Kononenko and S.J. Hong, Attribute Selection for Modeling,
Future Generation Computer Systems, Elsevier, Volume 13, Number 2, pp.
181-195, November 1997.
[6] C.I. Ezeife and Timothy E. Ohanekwu, Use of Smart Tokens in
Cleaning Integrated Warehouse Data, International Journal of Data
Warehousing and Mining (IJDWM), Vol. 1, No. 2, pp. 1-22, Idea Group
Publishing, April-June 2005.
[7] Ohanekwu, T.E. & Ezeife, C.I., A token-based data cleaning
technique for data warehouse systems. Proceedings of the International
Workshop on Data Quality in Cooperative Information Systems (pp. 21-26),
held in conjunction with the 9th International Conference on Database
Theory (ICDT 2003), Siena, Italy, January 2003.
[8] R. Ananthakrishna, S. Chaudhuri, and V. Ganti, Eliminating
Fuzzy Duplicates in Data Warehouses. VLDB, pages 586-597, 2002.
[9] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios.
Duplicate Record Detection: A Survey, IEEE TKDE, 19(1):1-16, 2007.
[10] Hernandez, M.A. & Stolfo, S.J. The merge/purge problem for
large databases, Proceedings of the ACM SIGMOD International Conference
on Management of Data, pp. 127-138, 1995.
[11] M.A. Hernandez and S.J. Stolfo, Real-world Data is Dirty: Data
Cleansing and the Merge/Purge Problem, Data Mining and Knowledge
Discovery, pp. 9-27, 1998.
[12] Patrick Lehti, Unsupervised Duplicate Detection Using Sample
Non-duplicates, Lecture Notes in Computer Science, Vol. 4244, pp.
136-164, 2006.
[13] C.S.R. Prabhu, Data Warehousing: Concepts and Techniques, 3rd
ed., ISBN: 8120336275, ISBN-13: 9788120336278, Prentice-Hall of India
Pvt. Ltd., 2008.
[14] J. Paul Kirby, Tom Pohlmann, Shelby Semmes, Topic Overview:
Data Warehousing, Information & Knowledge Management Professionals,
Forrester Research, January 23, 2007.
[15] Su Yan, Dongwon Lee, Min-Yen Kan, C. Lee Giles, Adaptive sorted
neighborhood methods for efficient record linkage, Proceedings of JCDL
2007, pp. 185-194, 2007.
(1) J. Jebamalar Tamilselvi and (2) V. Saravanan
(1) PhD Research Scholar, Department of Computer Application,
Karunya University, Coimbatore--641114, Tamilnadu, INDIA
E-mail: jjebamalar@gmail.com
(2) Professor & HOD, Department of Computer Application,
Karunya University, Coimbatore--641114, Tamilnadu, INDIA
E-mail: saravanan@karunya.edu
Table 1: Key columns and number of duplicates detected.

No. of columns   Key columns selected                              No. of duplicates detected
1                ADD_ADDRESS1                                      60988
2                ADD_ADDRESS1, ADD_NAME                            58944
3                ADD_ADDRESS1, ADD_NAME, ADD_PHONE1                50860
4                ADD_ADDRESS1, ADD_NAME, ADD_PHONE1, ADD_CDATE     45128
5                ADD_ADDRESS1, ADD_NAME, ADD_PHONE1, ADD_CDATE,    40156
                 ADD_DEL
6                ADD_ADDRESS1, ADD_NAME, ADD_PHONE1, ADD_CDATE,    39160
                 ADD_DEL, ADD_PARENTTYPE
7                ADD_ADDRESS1, ADD_NAME, ADD_PHONE1, ADD_CDATE,    33172
                 ADD_DEL, ADD_PARENTTYPE, ADD_PINCODE
Table 2: Attribute combinations and number of duplicates detected at each threshold value.

Attribute combination                            0.5     0.6     0.7     0.8     0.9     1.0
ADD_ADDRESS1                                     40153   38511   37462   34423   31048   30988
ADD_ADDRESS1, ADD_NAME                           48523   45263   45139   42874   40743   38944
ADD_ADDRESS1, ADD_NAME, ADD_PHONE1               49898   48990   48987   45214   41985   39860
ADD_ADDRESS1, ADD_NAME, ADD_PHONE1, ADD_CDATE    52984   50234   49136   46987   42035   40128
ADD_ADDRESS1, ADD_NAME, ADD_PHONE1, ADD_CDATE,   55133   53134   52456   50156   42542   40156
  ADD_DEL
ADD_ADDRESS1, ADD_NAME, ADD_PHONE1, ADD_CDATE,   55982   53634   53452   50598   42590   40160
  ADD_DEL, ADD_ADDRESS2
ADD_ADDRESS1, ADD_NAME, ADD_PHONE1, ADD_CDATE,   57213   54136   53874   52345   42898   40172
  ADD_DEL, ADD_ADDRESS2, ADD_PINCODE
Figure 8: Duplicates detected vs. number of key columns selected (data read from line graph).

No. of key columns   Duplicates detected
1                    60988
2                    58944
3                    50860
4                    45128
5                    40156
6                    39160
7                    33172
Figure 11: Duplicates vs. size of dataset and attributes (data read from line graph).

Dataset size   Phone    Address2   Name     Address1
100000         45923    14275      12472    44075
200000         66479    36457      36315    63841
300000         73425    52346      51043    85353
400000         118953   707398     778231   103891
500000         189623   83752      98734    148737