Annals of DAAAM & Proceedings, DAAAM International Vienna, 2009. ISSN 1726-9679.

On clustering heterogeneous data and clustering by compression.


Carstoiu, Dorin; Cernian, Alexandra; Sgarciu, Valentin


1. INTRODUCTION

Given the variety of data sources that exist nowadays, we expect the data itself to be highly heterogeneous (text, images, films, programs, etc.), which makes it difficult to organize and structure.

In this paper, we assess the results of a new clustering technique, clustering by compression, when applied to heterogeneous sets of data. The clustering by compression procedure is based on a parameter-free, universal similarity distance, the normalized compression distance (NCD), computed from the lengths of compressed data files (singly and in pairwise concatenation). To validate the results, we compute several cluster quality indices. If the values obtained indicate a high-quality clustering, we plan to include the clustering by compression technique in a framework for clustering web objects in the near future.

2. RELATED WORK

Data clustering is a well-studied problem. Applications of Web object clustering (including web pages, purchase items, queries and users) utilize link information at different levels. Traditional clustering methods ignore the link information and cluster objects solely based on content features (Boley et al., 1999; Zamir & Etzioni, 1998). Finding good similarity functions and selecting better features are the main focus of these works. Some clustering algorithms treat link information as additional features; that is, the heterogeneous data is first represented by homogeneous features before clustering. Su et al. (2001) described a correlation-based document clustering method, which measures the similarity between web pages based on the simultaneous visits to them.

3. CLUSTERING BY COMPRESSION

In 2004, Rudi Cilibrasi and Paul Vitanyi proposed a new method for clustering based on compression (Cilibrasi & Vitanyi, 2005). The method does not use subject-specific features or background knowledge, and works as follows:

* First, it determines a parameter-free, universal, similarity distance, the normalized compression distance or NCD, computed from the lengths of compressed data files (singly and in pairwise concatenation).

* Second, it applies a hierarchical clustering method.

If x and y are the two objects concerned, and C(x) and C(y) are the lengths of the compressed versions of x and y using compressor C, then the NCD is defined as follows:

\[ NCD(x, y) = \frac{C(xy) - \min\{C(x), C(y)\}}{\max\{C(x), C(y)\}} \tag{1} \]

The NCD is not restricted to a specific application area, and works across application area boundaries. A theoretical precursor, the normalized information distance, is provably optimal. However, the optimality comes at the price of using the non-computable notion of Kolmogorov complexity. Nevertheless, experiments have shown that the NCD approximates optimality. To extract a hierarchy of clusters from the distance matrix, a dendrogram (ternary tree) is determined by using a clustering algorithm, under choice of different compressors. To substantiate the claims of universality and robustness, evidence of successful application has been reported in areas such as genomics, virology, languages, literature, music, handwritten digits and astronomy, as well as for combinations of objects from completely different domains, using statistical, dictionary and block sorting compressors.
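As an illustration of Eq. (1), the NCD can be computed with any off-the-shelf compressor. The following is a minimal sketch using Python's standard bz2 module as the compressor C (BZIP2 is also the compressor used in our experiments in Section 4); the example byte strings are hypothetical:

```python
import bz2

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance, Eq. (1): compare the compressed length
    of the concatenation xy against the individually compressed lengths."""
    c_x = len(bz2.compress(x))
    c_y = len(bz2.compress(y))
    c_xy = len(bz2.compress(x + y))
    return (c_xy - min(c_x, c_y)) / max(c_x, c_y)

# Similar objects compress well together, so their NCD is close to 0;
# unrelated objects yield values close to 1 (real compressors may slightly
# overshoot the [0, 1] range).
doc_a = b"the quick brown fox jumps over the lazy dog " * 50
doc_b = b"the quick brown fox leaps over the lazy dog " * 50
doc_c = bytes(range(256)) * 10
print(ncd(doc_a, doc_b))  # small: near-duplicate texts
print(ncd(doc_a, doc_c))  # larger: dissimilar content
```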

4. VALIDATION

The validation procedure went through the following phases:

1) Setting up the data set

2) Forming a pre-defined clustering structure for the data set

3) Applying a clustering algorithm based on NCD

4) Computing cluster validity indices

5) Applying clustering algorithms based on traditional metrics

6) Computing the FScore

The data set we have used is made up of 50 heterogeneous documents in .doc, .pdf, .jpg, .eml and .pps formats.

The pre-defined clustering structure (P) was set up based on our empirical observations. P is the structure we aim to obtain, so all clustering structures obtained through clustering algorithms will be compared against P.

After applying the clustering by compression algorithm to our data set, we obtain a clustering structure C. To obtain the classification tree, we used the BZIP2 compression algorithm and the UPGMA clustering method. This choice is based on our previous experience with the clustering by compression procedure, in which the best clustering results were obtained with this combination of algorithms.
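For concreteness, a sketch of this pipeline follows, assuming SciPy (whose 'average' linkage method implements UPGMA) and a hypothetical docs list standing in for the 50 files:

```python
import bz2
import itertools

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance of Eq. (1), with bz2 as compressor."""
    c_x, c_y = len(bz2.compress(x)), len(bz2.compress(y))
    return (len(bz2.compress(x + y)) - min(c_x, c_y)) / max(c_x, c_y)

# Hypothetical stand-in for the data set; in practice something like
# docs = [open(path, 'rb').read() for path in paths].
docs = [b"alpha " * 200, b"alpha! " * 200, b"omega " * 200, bytes(range(256)) * 8]

# Pairwise NCD distance matrix (symmetric, zero diagonal by construction).
n = len(docs)
dist = np.zeros((n, n))
for i, j in itertools.combinations(range(n), 2):
    dist[i, j] = dist[j, i] = ncd(docs[i], docs[j])

# UPGMA is hierarchical clustering with 'average' linkage; squareform condenses
# the matrix into the vector form linkage() expects (checks=False because NCD
# computed with a real compressor need not satisfy the metric axioms exactly).
tree = linkage(squareform(dist, checks=False), method='average')
print(tree)  # the (n-1) x 4 merge table from which the dendrogram is drawn
```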

4.1 Cluster Validity

The procedure of evaluating the results of a clustering algorithm is known as cluster validity. The objective of clustering methods is to discover significant groups present in a data set. In general, they should search for clusters whose members are close to each other (in other words, have a high degree of similarity) and well separated.

For this work, we have used the external criteria approach to investigate cluster validity. This implies that we evaluate the results of a clustering algorithm based on a pre-specified structure (P), which is imposed on a data set and reflects our intuition about the clustering structure of the data set. We evaluate the resulting clustering structure C by comparing it to P. The result of this comparison is the degree of similitude between C and P.

Consider C = {C_1, ..., C_m} the clustering structure of a data set X and P = {P_1, ..., P_s} the defined partition of the data. We refer to a pair of points (x_v, x_u) from the data set using the following terms:

* SS: if both points belong to the same cluster of the clustering structure C and to the same group of partition P.

* SD: if the points belong to the same cluster of C and to different groups of P.

* DS: if the points belong to different clusters of C and to the same group of P.

* DD: if both points belong to different clusters of C and to different groups of P.

Assuming now that a, b, c and d are the numbers of SS, SD, DS and DD pairs respectively, then a + b + c + d = M, the total number of pairs in the data set (that is, M = N(N - 1)/2, where N is the total number of points in the data set).

Based on the above notations, the indices we calculate are: the Rand Index (R), Hubert's Γ statistic, the Jaccard Coefficient (J) and the Fowlkes-Mallows Index (FM) (Cernian et al., 2008).

The values of these indices lie between 0 and 1, and values close to 1 indicate a high degree of similitude between C and P.
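Given the two label assignments (the computed clustering C and the predefined partition P as parallel label lists), the four pair counts and the indices follow directly. A minimal sketch; since the index formulas are only referenced here (Cernian et al., 2008), Hubert's normalized Γ is written in one common formulation:

```python
from itertools import combinations
from math import sqrt

def pair_counts(c_labels, p_labels):
    """Count SS, SD, DS, DD pairs between clustering C and partition P."""
    a = b = c = d = 0  # a = SS, b = SD, c = DS, d = DD
    for i, j in combinations(range(len(c_labels)), 2):
        same_c = c_labels[i] == c_labels[j]
        same_p = p_labels[i] == p_labels[j]
        if same_c and same_p:
            a += 1
        elif same_c:
            b += 1
        elif same_p:
            c += 1
        else:
            d += 1
    return a, b, c, d

def validity_indices(c_labels, p_labels):
    a, b, c, d = pair_counts(c_labels, p_labels)
    m = a + b + c + d          # M = N(N - 1)/2
    m1, m2 = a + b, a + c      # same-cluster pairs in C, same-group pairs in P
    rand = (a + d) / m
    jaccard = a / (a + b + c)
    fowlkes_mallows = sqrt((a / m1) * (a / m2))
    # Normalized Hubert statistic: the correlation between the two binary
    # pair-membership indicator vectors (one common formulation).
    gamma = (m * a - m1 * m2) / sqrt(m1 * m2 * (m - m1) * (m - m2))
    return rand, jaccard, fowlkes_mallows, gamma

# Toy example: 6 points, computed clustering C vs. predefined partition P.
print(validity_indices([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 1, 2]))
```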

For our data set composed of 50 heterogeneous documents, we obtained the following values for the above-mentioned indices:

* R = 0.90

* J = 0.71

* Γ = 0.70

* FM = 0.84

4.2 FScore

The next step is comparing the performance of a clustering algorithm based on the NCD with the performance of clustering algorithms based on traditional metrics. The traditional metrics we have used are the normalized Kullback-Leibler distance (NKLD) and the normalized Levenshtein distance (NLD).

Given a particular class L_r of size n_r and a particular cluster S_i of size n_i, suppose n_ri documents in the cluster S_i belong to L_r; then the FScore of this class and cluster is defined as:

\[ F(L_r, S_i) = \frac{2 \cdot R(L_r, S_i) \cdot P(L_r, S_i)}{R(L_r, S_i) + P(L_r, S_i)} \tag{2} \]

where R(L_r, S_i) = n_ri / n_r is the recall value for the class L_r and the cluster S_i, and P(L_r, S_i) = n_ri / n_i is the precision value for the class L_r and the cluster S_i.

The FScore of the class L_r is the maximum FScore value obtained for that class over all clusters.

The FScore of the entire clustering solution is defined as the sum of the individual FScores of each class, weighted according to class size:

\[ FScore = \sum_{r=1}^{c} \frac{n_r}{n} F(L_r) \tag{3} \]

where c is the number of classes and n is the total number of documents.

This quality measure takes values between 0 and 1. The closer it is to 1, the better the quality of the clustering structures.
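Eqs. (2) and (3) translate directly into code. A minimal sketch, operating on hypothetical parallel lists of class and cluster labels:

```python
def overall_fscore(class_labels, cluster_labels):
    """Eqs. (2)-(3): per class, take the best F over all clusters, then sum
    the per-class maxima weighted by class size."""
    n = len(class_labels)
    total = 0.0
    for L in set(class_labels):
        n_r = class_labels.count(L)
        best = 0.0
        for S in set(cluster_labels):
            n_i = cluster_labels.count(S)
            n_ri = sum(1 for cl, cu in zip(class_labels, cluster_labels)
                       if cl == L and cu == S)
            if n_ri:
                recall = n_ri / n_r        # R(L_r, S_i)
                precision = n_ri / n_i     # P(L_r, S_i)
                best = max(best, 2 * recall * precision / (recall + precision))
        total += (n_r / n) * best          # weight by class size, Eq. (3)
    return total

# Toy example: 6 documents, true classes vs. produced clusters.
print(overall_fscore(['a', 'a', 'a', 'b', 'b', 'b'],
                     [0, 0, 1, 1, 1, 1]))
```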

The FScore for this clustering solution obtained using the NCD is 0.86, which is rather high, while the FScore for the clustering solution obtained using the NKLD metric is 0.52 and the FScore for the clustering solution obtained using the NLD metric is 0.48. Thus, the quality of the clustering solutions generated with traditional metrics is significantly lower than the quality of the clustering solution obtained using the NCD.

5. CONCLUSION AND FUTURE WORK

The clustering by compression procedure is based on a parameter-free, universal similarity distance, the normalized compression distance (NCD), computed from the lengths of compressed data files (singly and in pairwise concatenation).

The cluster validity indices we have computed take values between 0 and 1; the closer they are to 1, the better the quality of the clustering structure. As shown in Section 4, for our heterogeneous data set we obtained values between 0.70 and 0.90 for the cluster validity indices. This indicates a high degree of similitude between the clustering structure C obtained as a result of the clustering by compression technique and the predefined partition P, designed according to our intuition.

The FScore quality measure also takes values between 0 and 1, with values closer to 1 indicating better clustering quality. As shown in Section 4, for our heterogeneous data set we obtained a value of 0.86 when using clustering by compression. This again indicates a high degree of similitude between the clustering structure obtained as a result of the clustering by compression technique and the predefined structure designed according to our intuition.

The results of these tests encourage us to include the clustering by compression procedure in a framework for clustering heterogeneous data, which could be further tested on heterogeneous web data or integrated into a document management system.

6. REFERENCES

Boley, D.; Gini, M.; Gross, R.; Han, E.H.; Hastings, K.; Karypis, G.; Kumar, V.; Mobasher, B. & Moore, J. (1999). Partitioning-based clustering for web document categorization. Decision Support Systems, Vol. 27, pp. 329-341

Cernian, A.; Carstoiu, D. & Olteanu, A. (2008). Clustering Heterogeneous Web Data using Clustering by Compression. Cluster Validity, Proceedings of SYNASC 2008, 10th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, Timisoara, Romania

Cilibrasi, R. & Vitanyi, P. (2005). Clustering by compression. IEEE Transactions on Information Theory, Vol. 51, No. 4, pp. 1523-1545

Su, Z. et al. (2001). Correlation-based Document Clustering using Web Logs, Proceedings of the 34th Hawaii International Conference on System Sciences (HICSS-34), January 3-6

Zamir, O. & Etzioni, O. (1998). Web document clustering: A feasibility demonstration, Proceedings of SIGIR '98, pp. 46-53