On clustering heterogeneous data and clustering by compression.
Carstoiu, Dorin; Cernian, Alexandra; Sgarciu, Valentin et al.
1. INTRODUCTION
Given the variety of data sources that exist nowadays, we expect
the data itself to be very heterogeneous (text, images, films, programs
etc.), which makes it difficult to organize and structure it.
In this paper, we intend to assess the results of a new clustering
technique--clustering by compression--when applied to heterogeneous sets
of data. The clustering by compression procedure is based on a
parameter-free, universal, similarity distance, the normalized
compression distance or NCD, computed from the lengths of compressed
data files (singly and in pair-wise concatenation). In order to validate
the results, we compute several cluster quality indices. If the values
we obtain indicate a high-quality clustering, we plan in the near future
to include the clustering by compression technique in a framework for
clustering web objects.
2. RELATED WORK
Data clustering is a well-studied problem. Applications of Web
object clustering (covering web pages, purchased items, queries, users,
etc.) utilize link information at different levels. Traditional
clustering methods ignore link information and cluster objects solely
based on content features (Boley et al., 1999; Zamir & Etzioni, 1998);
finding good similarity functions and selecting better features are the
main focus of these works. Some clustering algorithms treat link
information as additional features; that is, the heterogeneous data is
first represented by homogeneous features before clustering. Su et al.
(2001) described a correlation-based document clustering method, which
measures the similarity between web pages based on the simultaneous
visits to them.
3. CLUSTERING BY COMPRESSION
In 2004, Rudi Cilibrasi and Paul Vitanyi proposed a new method for
clustering based on compression (Cilibrasi & Vitanyi, 2005). The
method does not use subject-specific features or background knowledge,
and works as follows:
* First, it determines a parameter-free, universal, similarity
distance, the normalized compression distance or NCD, computed from the
lengths of compressed data files (singly and in pairwise concatenation).
* Second, it applies a hierarchical clustering method.
If x and y are the two objects concerned, C(x) and C(y) are the
lengths of the compressed versions of x and y using compressor C, and
C(xy) is the length of the compressed concatenation of x and y, then
the NCD is defined as follows:
NCD(x,y) = (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)} (1)
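To make the definition concrete, here is a minimal Python sketch of
the NCD computation, using bzip2 as the compressor C (the function
names are ours, for illustration only):

    import bz2

    def clen(data: bytes) -> int:
        """Length of the bzip2-compressed representation of data."""
        return len(bz2.compress(data))

    def ncd(x: bytes, y: bytes) -> float:
        """Normalized compression distance, per equation (1)."""
        cx, cy, cxy = clen(x), clen(y), clen(x + y)
        return (cxy - min(cx, cy)) / max(cx, cy)

Intuitively, similar objects compress well together, so C(xy) stays
close to max{C(x), C(y)} and the NCD approaches 0, while unrelated
objects yield values close to 1.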
The NCD is not restricted to a specific application area, and works
across application area boundaries. A theoretical precursor, the
normalized information distance, is provably optimal. However, the
optimality comes at the price of using the non-computable notion of
Kolmogorov complexity. Nevertheless, experiments have proven that the
NCD approximates optimality. To extract a hierarchy of clusters from the
distance matrix, a dendrogram (ternary tree) is determined using a
clustering algorithm, for various choices of compressor. To
substantiate the claims of universality and robustness, evidence of
successful application has been reported in various areas such as
genomics, virology, languages, literature, music, handwritten digits,
astronomy, and combinations of objects from completely different
domains, using statistical, dictionary, and block sorting compressors.
4. VALIDATION
The validation procedure went through the following phases:
1) Setting up the data set
2) Forming a pre-defined clustering structure for the data set
3) Applying a clustering algorithm based on NCD
4) Computing cluster validity indices
5) Applying clustering algorithms based on traditional metrics
6) Computing the FScore
The data set we have used is made up of 50 heterogeneous documents
in .doc, .pdf, .jpg, .eml and .pps formats.
The pre-defined clustering structure (P) was set up based on our
empirical observations. P is the structure we aim to obtain, so all
clustering structures obtained through clustering algorithms will be
compared against P.
After applying the clustering by compression algorithm to our data
set, we obtain a clustering structure C. In order to obtain the
classification tree, we have used the BZIP2 compression algorithm and
the UPGMA clustering method. This choice is based on our previous
experience with the clustering by compression procedure, in which the
best clustering results were obtained with this combination of
algorithms.
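As an illustrative sketch of this pipeline (not the authors' original
tooling), the NCD matrix and the UPGMA tree can be reproduced with
Python's bz2 module and SciPy, whose "average" linkage method is UPGMA;
the file names below are hypothetical:

    import bz2
    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import squareform

    def ncd(x: bytes, y: bytes) -> float:
        # Equation (1) with bzip2 as the compressor C
        cx, cy = len(bz2.compress(x)), len(bz2.compress(y))
        return (len(bz2.compress(x + y)) - min(cx, cy)) / max(cx, cy)

    # Hypothetical files; NCD works on raw bytes, so any format is fine
    paths = ["report.doc", "slides.pps", "photo.jpg"]
    docs = [open(p, "rb").read() for p in paths]

    # Symmetric pairwise NCD matrix with a zero diagonal
    n = len(docs)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = ncd(docs[i], docs[j])

    # UPGMA is SciPy's "average" linkage; cut the tree into k clusters
    Z = linkage(squareform(dist), method="average")
    labels = fcluster(Z, t=2, criterion="maxclust")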
4.1 Cluster Validity
The procedure of evaluating the results of a clustering algorithm
is known as cluster validity. The objective of the
clustering methods is to discover significant groups present in a data
set. In general, they should search for clusters whose members are close
to each other (in other words have a high degree of similarity) and well
separated.
For this work, we have used the external criteria approach to
investigate cluster validity. This implies that we evaluate the results
of a clustering algorithm based on a pre-specified structure (P), which
is imposed on a data set and reflects our intuition about the clustering
structure of the data set. We evaluate the resulting clustering
structure C by comparing it to P. The result of this comparison is the
degree of similitude between C and P.
Let C = {C1, ..., Cm} be the clustering structure of a data set X
and P = {P1, ..., Ps} the pre-defined partition of the data. We refer to
a pair of points (xv, xu) from the data set using the following terms:
* SS: if both points belong to the same cluster of the clustering
structure C and to the same group of partition P.
* SD: if the points belong to the same cluster of C and to
different groups of P.
* DS: if the points belong to different clusters of C and to the
same group of P.
* DD: if both points belong to different clusters of C and to
different groups of P.
Assuming now that a, b, c and d are the numbers of SS, SD, DS and DD
pairs respectively, then a + b + c + d = M, which is the total number
of pairs in the data set (that is, M = N(N - 1)/2, where N is the total
number of points in the data set).
Based on the above notations, the indices we calculate are: the Rand
Index (R), Hubert's Γ statistic (Γ), the Jaccard Coefficient (J) and
the Fowlkes-Mallows Index (FM) (Cernian et al., 2008).
The values of these indices lie between 0 and 1, and values close
to 1 indicate a high degree of similitude between C and P.
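For concreteness, the four indices can be computed from the pair
counts as sketched below; the formulas are the standard
external-criteria definitions (e.g. Rand = (a+d)/M, Jaccard =
a/(a+b+c)), and the helper names are ours:

    from itertools import combinations
    from math import sqrt

    def pair_counts(c_labels, p_labels):
        # a, b, c, d = numbers of SS, SD, DS, DD pairs
        a = b = c = d = 0
        for i, j in combinations(range(len(c_labels)), 2):
            same_c = c_labels[i] == c_labels[j]
            same_p = p_labels[i] == p_labels[j]
            if same_c and same_p:
                a += 1
            elif same_c:
                b += 1
            elif same_p:
                c += 1
            else:
                d += 1
        return a, b, c, d

    def validity_indices(c_labels, p_labels):
        a, b, c, d = pair_counts(c_labels, p_labels)
        M = a + b + c + d                      # M = N(N-1)/2
        rand = (a + d) / M
        jaccard = a / (a + b + c)
        fm = a / sqrt((a + b) * (a + c))       # Fowlkes-Mallows
        # Normalized Hubert statistic, in one common external form
        m1, m2 = a + b, a + c
        gamma = (M * a - m1 * m2) / sqrt(m1 * m2 * (M - m1) * (M - m2))
        return rand, jaccard, gamma, fm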
For our data set composed of 50 heterogeneous documents, we have
obtained the following values for the above-mentioned indices:
* R = 0.90
* J = 0.71
* Γ = 0.70
* FM = 0.84
4.2 FScore
The next step is comparing the performance of a clustering
algorithm based on the NCD with the performance of clustering
algorithms based on traditional metrics. The traditional metrics we
have used are the normalized Kullback-Leibler distance (NKLD) and the
normalized Levenshtein distance (NLD).
Given a particular class L_r of size n_r and a particular cluster
S_i of size n_i, suppose n_ri documents in the cluster S_i belong to
L_r; then the FScore of this class and cluster is defined to be:
F(L_r, S_i) = 2 * R(L_r, S_i) * P(L_r, S_i) / (R(L_r, S_i) + P(L_r, S_i)) (2)
where R(L_r, S_i) = n_ri/n_r is called the recall value for the class
L_r and the cluster S_i, and P(L_r, S_i) = n_ri/n_i is called the
precision value for the class L_r and the cluster S_i.
The FScore of the class L_r is the maximum FScore value obtained for
that class over all clusters S_i.
The FScore of the entire clustering solution is defined as the sum
of the individual FScores of the classes, weighted according to class
size:
FScore = Σ_{r=1..c} (n_r / n) * F(L_r) (3)
where c is the number of classes and n is the total number of
documents.
This quality measure takes values between 0 and 1. The closer it is
to 1, the better the quality of the clustering structures.
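For concreteness, here is a minimal sketch of the overall FScore per
equations (2) and (3), assuming one true class label and one cluster
label per document (the function name is ours):

    from collections import Counter

    def overall_fscore(class_labels, cluster_labels):
        n = len(class_labels)
        n_r = Counter(class_labels)    # class sizes
        n_i = Counter(cluster_labels)  # cluster sizes
        n_ri = Counter(zip(class_labels, cluster_labels))  # overlaps
        total = 0.0
        for r, size_r in n_r.items():
            best = 0.0
            for i, size_i in n_i.items():
                overlap = n_ri.get((r, i), 0)
                if overlap == 0:
                    continue
                recall = overlap / size_r       # R(L_r, S_i)
                precision = overlap / size_i    # P(L_r, S_i)
                f = 2 * recall * precision / (recall + precision)
                best = max(best, f)             # max over clusters
            total += (size_r / n) * best        # weight by class size
        return total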
The FScore for this clustering solution obtained using the NCD is
0.86, which is rather high, while the FScore for the clustering solution
obtained using the NKLD metric is 0.52 and the FScore for the clustering
solution obtained using the NLD metric is 0.48. Thus, the quality of the
clustering solutions generated with traditional metrics is significantly
lower than the quality of the clustering solution obtained using the
NCD.
5. CONCLUSION AND FUTURE WORK
The clustering by compression procedure is based on a
parameter-free, universal, similarity distance, the normalized
compression distance or NCD, computed from the lengths of compressed
data files (singly and in pair-wise concatenation).
The cluster validity indices we have computed have values between 0
and 1. The closer they are to 1, the better the quality of the
clustering structures. As shown in section 4, for our heterogeneous data
set, we have obtained values around 0.8 - 0.9 for the cluster validity
procedure. Therefore, this indicates a high degree of similitude between
the clustering structure C obtained as a result of the clustering by
compression technique and the predefined partition P, designed according
to our intuition.
The FScore quality measure takes values between 0 and 1. The closer
it is to 1, the better the quality of the clustering structures. As
shown in section 4, for our heterogeneous data set, we have obtained a
value of 0.86 when using clustering by compression. Therefore, this
indicates a high degree of similitude between the clustering structure
obtained as a result of the clustering by compression technique and the
predefined structure designed according to our intuition.
The results of these tests encourage us to proceed with including the
clustering by compression procedure in a framework for clustering
heterogeneous data, which could be further tested for clustering
heterogeneous web data or for integration into a document management
system.
6. REFERENCES
Boley, D.; Gini, M.; Gross, R.; Han, E.H.; Hastings, K.; Karypis,
G.; Kumar, V.; Mobasher, B. & Moore, J. (1999). Partitioning-based
clustering for web document categorization, Decision Support Systems
27:329-341
Cernian, A.; Carstoiu, D. & Olteanu, A. (2008). Clustering
Heterogeneous Web Data using Clustering by Compression. Cluster
Validity, Proceedings of SYNASC 2008, 10th International Symposium on
Symbolic and Numeric Algorithms for Scientific Computing, Timisoara,
Romania
Cilibrasi, R. & Vitanyi, P. (2005). Clustering by compression.
IEEE Transactions on Information Theory, Vol. 51, No. 4, pp 1523-1545.
Su, Z. et al. (2001). Correlation-based Document Clustering using
Web Logs, Proceedings of the 34th Hawaii International Conference on
System Sciences (HICSS-34), January 3-6
Zamir, O. & Etzioni, O. (1998). Web document clustering: A
feasibility demonstration, Proceedings of SIGIR '98, pp 46-53