Estimate of the reliability of clusters generated by specific clustering algorithms.

Several high-level functions implement the computation of the stability indices for specific clustering algorithms, such as the hierarchical [15], k-means [10], fuzzy c-means [5] and Partitioning Around Medoids (PAM) [14] clustering algorithms.

There are 4 functions that implement the computation of the stability indices for the above clustering algorithms:

  1. Random.hclustering.validity
  2. Random.kmeans.validity
  3. Random.fuzzy.kmeans.validity
  4. Random.PAM.validity
These functions compute the validity indices for each individual cluster, the overall validity index of the clustering and the AC indices of each example. To this end, different randomized maps may be chosen (an input parameter of the function) to generate multiple projected data sets. The clustering algorithm is applied to each projected data set; then the similarity matrix across the multiple clusterings of the projected data is computed and, finally, using this similarity matrix, the validity indices for each individual cluster, the overall validity index of the clustering and the AC indices of each example are computed. These functions assume that the data are represented as matrices (or data frames) with the examples in the columns and the variables in the rows. Moreover, they assume that the labels of the examples are the integers from 1 to the number of columns of the data matrix.
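For example, if your data are stored with the examples in the rows (as is common for data frames), a minimal sketch of how to bring them into the expected layout could be the following (expr.df is a hypothetical toy data set, not part of the Clusterv package):

> expr.df <- as.data.frame(matrix(rnorm(20 * 100), nrow = 20))  # toy data: 20 examples, 100 variables
> M <- t(as.matrix(expr.df))   # transpose: variables in the rows, examples in the columns
> dim(M)                       # (number of variables, number of examples)

The examples are then identified by their column indices 1, 2, ..., ncol(M).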

Consider, for instance, a 2000-dimensional synthetic data set composed of three clusters, each one with 15 examples:

> M <- generate.sample1(n = 15, m = 4, sigma = 2, dim = 2000)

Estimate of the reliability of clusters generated by hierarchical clustering algorithms.

To estimate the reliability of clusters generated by hierarchical clustering:
> l2 <- Random.hclustering.validity(M, dim=JL.predict.dim(45,0.2), 
             pmethod = "PMO", c = 2, hmethod = "average", n = 20)
We chose the dimension dim of the projected subspace such that the distortion induced by the PMO projection (parameter pmethod) is less than 1.2. Moreover, we chose an "average linkage" hierarchical clustering (parameter hmethod) with 20 replications (parameter n) of the clustering in the projected subspaces. The stability indices are stored in the list l2 and are computed with respect to a partition with 2 clusters (parameter c). The returned list has 8 elements, comprising the computed stability indices, the similarity matrix, the list of clusterings and more.
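Its full content can be inspected with standard R functions (a quick sketch; the exact element names may differ between releases):

> names(l2)                # names of all the elements of the returned list
> str(l2, max.level = 1)   # compact overview of their structure

The elements overall.validity and validity store, respectively, the overall validity index of the clustering and the stability index of each individual cluster: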
> l2$overall.validity
[1] 0.9353448
> l2$validity
[1] 1.0000000 0.8706897
The overall validity with 2 clusters is equal to 0.9353448, while the stability indices for the two clusters are 1.0000000 and 0.8706897, respectively. The element l2$orig.cluster stores the clustering for which we computed the stability indices:
> l2$orig.cluster
[[1]]
 [1] 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
[[2]]
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 31 32 33 34 35 36 37 38 39 40
[26] 41 42 43 44 45
We see that the first cluster is highly reliable (indeed it corresponds to the examples of the second "natural" cluster), while the second, less reliable cluster corresponds to the merging of the first and third "natural" clusters.

It is very simple to estimate the reliability of clusters obtained with 3-cluster partitions; it is sufficient to change only the c parameter:

> l3 <- Random.hclustering.validity(M, dim=JL.predict.dim(45,0.2), 
             pmethod = "PMO", c = 3, hmethod = "average", n = 20)
> l3$overall.validity
[1] 1
> l3$validity
[1] 1 1 1
> l3$orig.cluster
[[1]]
 [1] 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
[[2]]
 [1] 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
[[3]]
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
In this case the clusters obtained correspond to the "true" clusters generated by the synthetic data generator generate.sample1. Note that the stability indices mark both the overall clustering and the individual clusters as highly reliable.

Using other numbers of clusters (e.g. 4, 5, 6, 8) we obtain less reliable clusterings and less reliable individual clusters. For instance, with 4 clusters:

> l4 <- Random.hclustering.validity(M, dim=JL.predict.dim(45,0.2), 
             pmethod = "PMO", c = 4, hmethod = "average", n = 20)
> l4$overall.validity
[1] 0.7840842
> l4$validity
[1] 0.9666667 0.9542857 0.2500000 0.9653846
> l4$orig.cluster
[[1]]
 [1] 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
[[2]]
 [1] 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
[[3]]
[1] 14
[[4]]
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 15
Note that the third cluster is scored as poorly reliable (s = 0.25): indeed it corresponds to a single example that should belong to the fourth cluster. Looking at the AC indices, which measure how reliable the assignment of an example to a specific cluster is, we see that the AC index of example 14 (the "unnatural" element of the third, singleton cluster) is very low:
> l4$AC[14,]
[1] 0.00 0.00 0.25 0.00
indeed it reveals a membership to the third cluster equal to only 0.25, while the AC indices of, e.g., the first 5 examples (which belong to the fourth cluster) are close to 1:
> l4$AC[1:5,]
     [,1] [,2] [,3]      [,4]
[1,]    0    0    0 0.9807692
[2,]    0    0    0 0.8884615
[3,]    0    0    0 0.9807692
[4,]    0    0    0 0.9807692
[5,]    0    0    0 0.9423077
Note that the AC indices are stored as a matrix, even if a vector would be sufficient, since hierarchical clustering yields, for a given cut of the tree, a strict partition of the data. We chose this implementation because in future releases we plan to implement fuzzy AC indices, specific for fuzzy partitions, in order to allow for each example a fuzzy membership to each cluster.
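Since for a strict partition each row of the AC matrix has a single non-zero entry, the assigned cluster of each example and the reliability of that assignment can be recovered with a couple of base R calls (a minimal sketch, not part of the package):

> cluster.of <- apply(l4$AC, 1, which.max)  # cluster to which each example is assigned
> ac.of <- apply(l4$AC, 1, max)             # AC index of that assignment, for each example
> which(ac.of < 0.5)                        # examples whose assignment is poorly supported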

The reader could try to repeat these computations with other random projections: in this case we should obtain the same results, independently of the chosen projection.
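Since each call draws a new set of random projections, a quick check is simply to run the analysis twice and to compare the resulting indices (other projection types can be selected through the pmethod parameter; see the package documentation for the admissible values):

> l2.a <- Random.hclustering.validity(M, dim=JL.predict.dim(45,0.2), 
              pmethod = "PMO", c = 2, hmethod = "average", n = 20)
> l2.b <- Random.hclustering.validity(M, dim=JL.predict.dim(45,0.2), 
              pmethod = "PMO", c = 2, hmethod = "average", n = 20)
> l2.a$validity   # the two sets of per-cluster indices should be
> l2.b$validity   # (approximately) equal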

Estimate of the reliability of clusters generated by the k-means clustering algorithm.

Similarly, we may estimate the reliability of clusters obtained with k-means clustering. The function Random.kmeans.validity has nearly the same syntax as the previous one:
> l3 <- Random.kmeans.validity(M, dim=JL.predict.dim(45,0.2), 
                             pmethod = "PMO", c = 3, n = 20)
> l3$overall.validity
[1] 0.9504762
> l3$validity
[1] 1.0000000 0.8961905 0.9552381
> l3$orig.cluster
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
[[2]]
 [1] 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
[[3]]
 [1] 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
The results are quite similar to those obtained with hierarchical clustering, but in this case the validity indices are slightly lower. Let's try to repeat the computations:
> l3 <- Random.kmeans.validity(M, dim=JL.predict.dim(45,0.2), 
                             pmethod = "PMO", c = 3, n = 20)
> l3$validity
[1] 0.5089655 0.9500000 0.9433333
> l3 <- Random.kmeans.validity(M, dim=JL.predict.dim(45,0.2), 
                             pmethod = "PMO", c = 3, n = 20)
> l3$validity
[1] 1.0000000 0.8933333 0.9495238
What happened? The results are quite different. This is due to the fact that the k-means clustering algorithm strongly depends on the initial conditions, so we may obtain different results in different runs. If you repeat these experiments, you will likely obtain yet other results.
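To get an idea of this variability, we can repeat the analysis several times and look at the spread of the overall validity index (a simple sketch using only base R):

> ov <- replicate(5, Random.kmeans.validity(M, dim=JL.predict.dim(45,0.2), 
                       pmethod = "PMO", c = 3, n = 20)$overall.validity)
> summary(ov)     # spread of the overall validity across the repeated runs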

Estimate of the reliability of clusters generated by the fuzzy k-means clustering algorithm.

The following example shows how to apply the stability indices to the analysis of clusterings generated through a fuzzy k-means algorithm. Here we consider only the partition obtained by assigning each example to the cluster with the highest membership. We now use a synthetic 6000-dimensional data set with 5 clusters:
> M <- generate.sample4(n = 10, sigma = 0.1)
The 5 clusters obtained (each one with 10 examples) are quite well separated, as shown by plotting the points projected along the first two principal components via PCA (Fig. 4).
Figure 4: Plot of the two principal components of the 6000-dimensional data set. The examples that belong to the 5 different clusters are plotted with different symbols.
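A plot similar to Fig. 4 can be reproduced with base R (a minimal sketch; the original figure may have been produced differently). Remember that the examples are stored in the columns of M, so the matrix must be transposed before applying PCA:

> pc <- prcomp(t(M))                # PCA on the examples (rows of t(M))
> true.cl <- rep(1:5, each = 10)    # "true" cluster labels of the 50 examples
> plot(pc$x[, 1], pc$x[, 2], pch = true.cl, xlab = "PC 1", ylab = "PC 2")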
To apply the stability analysis we may use the Random.fuzzy.kmeans.validity function:
> r2 <- Random.fuzzy.kmeans.validity(M, dim=JL.predict.dim(50,0.2), 
                                   pmethod = "PMO", c = 2, n = 20);
> r2$overall.validity
[1] 1
> r2$validity
[1] 1 1
Hence the two clusters are considered reliable. But we know that we have 5 clusters. Looking at the clusters, we see that the first four "true" clusters are grouped together against the fifth (the first 10 examples belong to the first cluster, the next ten to the second, and so on):
> r2$orig.cluster
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
[21]  21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
[[2]]
 [1] 41 42 43 44 45 46 47 48 49 50
Do the stability indices fail, or do the results depend on how the clusters are defined? Indeed the fuzzy k-means algorithm sees 2 clusters in the data, and the fifth cluster is clearly separated from the rest of the data.

Consider now c=3 clusters:

> r3 <- Random.fuzzy.kmeans.validity(M, dim=JL.predict.dim(50,0.2), 
                                   pmethod = "PMO", c = 3, n = 20);
> r3$overall.validity
[1] 1
> r3$validity
[1] 1 1 1
> r3$orig.cluster
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
[21]  21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
[[2]]
 [1] 31 32 33 34 35 36 37 38 39 40
[[3]]
 [1] 41 42 43 44 45 46 47 48 49 50
In this case 3 clusters are also considered reliable, because the first cluster is actually composed of the three "true" clusters in the center (see Fig. 4), while the other two clusters, on the left and on the right, are more separated.

And with 5 clusters?

> r5 <- Random.fuzzy.kmeans.validity(M, dim=JL.predict.dim(50,0.2), 
                                   pmethod = "PMO", c = 5, n = 20);
> r5$overall.validity
[1] 1
> r5$validity
[1] 1 1 1 1 1
> r5$orig.cluster
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10
[[2]]
 [1] 11 12 13 14 15 16 17 18 19 20
[[3]]
 [1] 21 22 23 24 25 26 27 28 29 30
[[4]]
 [1] 31 32 33 34 35 36 37 38 39 40
[[5]]
 [1] 41 42 43 44 45 46 47 48 49 50
As expected, the stability indices consider the five "true" clusters very reliable.

With 4 clusters, only the two "extreme" clusters are considered very reliable, as the first big cluster loses its example n. 5, which is assigned to the second cluster:

> r4 <- Random.fuzzy.kmeans.validity(M, dim=JL.predict.dim(50,0.2), 
                                   pmethod = "PMO", c = 4, n = 20);
> r4$overall.validity
[1] 0.8899123
> r4$validity
[1] 0.6596491 0.9000000 1.0000000 1.0000000
> r4$orig.cluster
[[1]]
 [1]  1  2  3  4  6  7  8  9 10 21 22 23 24 25 26 27 28 29 30
[[2]]
 [1]  5 11 12 13 14 15 16 17 18 19 20
[[3]]
 [1] 31 32 33 34 35 36 37 38 39 40
[[4]]
 [1] 41 42 43 44 45 46 47 48 49 50

If we use an unnatural number of clusters (e.g. 10), we obtain low values of the stability indices:

> r10 <- Random.fuzzy.kmeans.validity(M, dim=JL.predict.dim(50,0.2), 
                                   pmethod = "PMO", c = 10, n = 20);
> r10$overall.validity
[1] 0.573125
> r10$validity
[1] 0.5972222 0.0500000 0.7366667 0.7411111 0.5500000 0.6300000 
[7] 0.6300000 0.6500000
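This suggests a simple way to use the overall validity index to estimate a "natural" number of clusters: compute it for a range of values of c and look at which values give the highest overall validity (a sketch based on the fuzzy k-means function; any of the four functions could be used in the same way, and the results of the k-means based functions will vary between runs):

> cs <- 2:10
> ov <- sapply(cs, function(k) 
          Random.fuzzy.kmeans.validity(M, dim=JL.predict.dim(50,0.2), 
                                       pmethod = "PMO", c = k, n = 20)$overall.validity)
> names(ov) <- cs
> ov              # overall validity for each number of clusters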

Estimate of the reliability of clusters generated by the PAM (Partitioning Around Medoids) clustering algorithm.

Finally, we perform the clustering analysis with the PAM algorithm, using the same data set we used with the fuzzy k-means algorithm:
> p2 <- Random.PAM.validity(M, dim=JL.predict.dim(50,0.2), 
                          pmethod = "PMO", c = 2, n = 20);
> p2$overall.validity
[1] 0.8846154
> p2$validity
[1] 0.7692308 1.0000000
> p2$orig.cluster
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 [21] 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
[[2]]
 [1] 41 42 43 44 45 46 47 48 49 50
With two clusters, in this case the overall validity is lower, as the first big cluster is considered less reliable. This fact suggests that the PAM clustering algorithm may be better suited to this data set. These results also show that the validity indices depend on the clustering algorithm used: indeed, the similarity matrix used to compute the stability indices is obtained through multiple clusterings of the randomly projected data, and hence depends on the applied clustering algorithm.
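For instance, the overall validity of the 2-cluster partition obtained with the different algorithms can be compared directly (a sketch; the hierarchical version also requires the hmethod parameter, and the k-means based results will vary between runs):

> d <- JL.predict.dim(50, 0.2)
> c(hier   = Random.hclustering.validity(M, dim=d, pmethod="PMO", c=2, 
                                         hmethod="average", n=20)$overall.validity,
    kmeans = Random.kmeans.validity(M, dim=d, pmethod="PMO", c=2, n=20)$overall.validity,
    fuzzy  = Random.fuzzy.kmeans.validity(M, dim=d, pmethod="PMO", c=2, n=20)$overall.validity,
    PAM    = Random.PAM.validity(M, dim=d, pmethod="PMO", c=2, n=20)$overall.validity)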

With partitions composed of 3 clusters, we obtain, as expected, very reliable clusters:

> p3 <- Random.PAM.validity(M, dim=JL.predict.dim(50,0.2), 
                          pmethod = "PMO", c = 3, n = 20);
> p3$overall.validity
[1] 1
> p3$validity
[1] 1 1 1
> p3$orig.cluster
[[1]]
 [1] 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 
 [20] 20 21 22 23 24 25 26 27 28 29 30
[[2]]
 [1] 31 32 33 34 35 36 37 38 39 40
[[3]]
 [1] 41 42 43 44 45 46 47 48 49 50

And, of course, also with 5 clusters:

> p5 <- Random.PAM.validity(M, dim=JL.predict.dim(50,0.2), 
                          pmethod = "PMO", c = 5, n = 20);
> p5$overall.validity
[1] 1
> p5$validity
[1] 1 1 1 1 1

With 4 or, for instance, 8 clusters, the reliability of the obtained clusters is lower, as the "true" clusters are fragmented:

> p4 <- Random.PAM.validity(M, dim=JL.predict.dim(50,0.2), 
                         pmethod = "PMO", c = 4, n = 20);
> p4$overall.validity
[1] 0.9047581
> p4$validity
[1] 0.7099415 0.9090909 1.0000000 1.0000000

> p8 <- Random.PAM.validity(M, dim=JL.predict.dim(50,0.2), 
                          pmethod = "PMO", c = 8, n = 20);
> p8$overall.validity
[1] 0.6332639
> p8$validity
[1] 0.7450000 0.7900000 0.6633333 0.7000000 0.7166667 
[6] 0.6333333 0.0500000 0.7677778

