There are four functions that implement the computation of the stability indices for the clustering algorithms considered above: Random.hclustering.validity, Random.kmeans.validity, Random.fuzzy.kmeans.validity and Random.PAM.validity.
Consider, for instance, a 2000-dimensional synthetic data set composed of three clusters, each with 15 examples:
> M <- generate.sample1(n = 15, m = 4, sigma = 2, dim = 2000)
> l2 <- Random.hclustering.validity(M, dim=JL.predict.dim(45,0.2), pmethod = "PMO", c = 2, hmethod = "average", n = 20)

We chose a dimension dim of the subspace such that the distortion induced by the PMO projection (parameter pmethod) is less than 1.2. Moreover, we chose average-linkage hierarchical clustering (parameter hmethod) with 20 replications (parameter n) of the clusterings in the projected subspace. The stability indices are stored in the list l2 and are computed with respect to a partition with 2 clusters (parameter c). The returned list has 8 elements (comprising the computed stability indices, the similarity matrix, the list of clusterings and others):
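The role of JL.predict.dim can be understood through the Johnson-Lindenstrauss lemma. As a hedged sketch (the exact formula used by JL.predict.dim may differ in its constants), one common form of the bound gives the subspace dimension needed to preserve pairwise distances among n points up to a 1 + eps factor:

```r
# Sketch of a Johnson-Lindenstrauss dimension bound; the exact formula
# used by JL.predict.dim may differ.  For n points and distortion at
# most 1 + eps, a random subspace of dimension
#   d >= 4 * log(n) / (eps^2/2 - eps^3/3)
# preserves all pairwise distances up to a 1 + eps factor with high
# probability.
jl.dim <- function(n, eps) {
  ceiling(4 * log(n) / (eps^2 / 2 - eps^3 / 3))
}

jl.dim(45, 0.2)  # lower bound on the subspace dimension for 45 points
```

Note that the bound depends only on the number of examples (here 45) and the allowed distortion (here eps = 0.2, i.e. distortion below 1.2), not on the original dimension 2000.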
> l2$overall.validity
[1] 0.9353448
> l2$validity
[1] 1.0000000 0.8706897

The overall validity with 2 clusters is 0.9353448, while the stability indices for the two clusters are 1.0000000 and 0.8706897, respectively. The element l2$orig.cluster stores the clustering for which we computed the stability indices:
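From the reported numbers, the overall validity appears to be the mean of the per-cluster stability indices (an observation drawn from the outputs in this section, not a statement about the package internals):

```r
# The overall validity reported above matches the mean of the
# per-cluster stability indices (0.9353448 for the 2-cluster case);
# the same relation holds for the other examples in this section,
# although the package may compute it differently in general.
validity <- c(1.0000000, 0.8706897)
mean(validity)
```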
> l2$orig.cluster
[[1]]
 [1] 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

[[2]]
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 31 32 33 34 35 36 37 38 39 40
[26] 41 42 43 44 45

We see that the first cluster is highly reliable (indeed it corresponds to the examples of the second "natural" cluster), while the second, less reliable, corresponds to the merging of the first and third "natural" clusters.
It is very simple to estimate the reliability of clusters obtained with 3-cluster partitions; it suffices to change only the c parameter:
> l3 <- Random.hclustering.validity(M, dim=JL.predict.dim(45,0.2), pmethod = "PMO", c = 3, hmethod = "average", n = 20)
> l3$overall.validity
[1] 1
> l3$validity
[1] 1 1 1
> l3$orig.cluster
[[1]]
 [1] 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

[[2]]
 [1] 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45

[[3]]
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15

In this case the clusters obtained correspond to the "true" clusters generated by the synthetic data generator generate.sample1. Note that the stability indices mark both the overall clustering and the individual clusters as highly reliable.
Using other numbers of clusters (e.g. 4, 5, 6, 8) we obtain less reliable clusterings and less reliable individual clusters. For instance, with 4 clusters:
> l4 <- Random.hclustering.validity(M, dim=JL.predict.dim(45,0.2), pmethod = "PMO", c = 4, hmethod = "average", n = 20)
> l4$overall.validity
[1] 0.7840842
> l4$validity
[1] 0.9666667 0.9542857 0.2500000 0.9653846
> l4$orig.cluster
[[1]]
 [1] 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

[[2]]
 [1] 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45

[[3]]
[1] 14

[[4]]
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 15

Note that the third cluster is scored as poorly reliable (s = 0.25); indeed it corresponds to an example that should belong to the fourth cluster. Looking at the AC indices, which record how reliable the assignment of each example to a specific cluster is, we see that the AC index of example 14 (the "unnatural" element of the third, singleton cluster) is very low:
> l4$AC[14,]
[1] 0.00 0.00 0.25 0.00

Indeed it reveals a membership to the third cluster equal to 0.25, while the AC indices for, e.g., the first 5 examples (which belong to the fourth cluster) are close to 1:
> l4$AC[1:5,]
     [,1] [,2] [,3]      [,4]
[1,]    0    0    0 0.9807692
[2,]    0    0    0 0.8884615
[3,]    0    0    0 0.9807692
[4,]    0    0    0 0.9807692
[5,]    0    0    0 0.9423077

Note that the AC indices are stored as matrices, even if it would be sufficient to store them as a vector, since with hierarchical clustering a given cut of the tree yields a strict partition of the data. We chose this implementation because in future releases we plan to implement fuzzy AC indices, specific to fuzzy partitions, in order to permit for each example a fuzzy membership to each cluster.
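The matrix representation generalizes naturally: a strict partition corresponds to a 0/1 membership matrix, while a fuzzy partition relaxes the entries to values in [0, 1]. A minimal sketch (the helper crisp.membership is hypothetical, not part of the package):

```r
# Hypothetical helper (not part of the package): encode a strict
# partition, given as a vector of cluster labels, as a 0/1 membership
# matrix with one row per example and one column per cluster.  Fuzzy
# AC indices would simply relax the entries to values in [0, 1].
crisp.membership <- function(labels, c) {
  m <- matrix(0, nrow = length(labels), ncol = c)
  m[cbind(seq_along(labels), labels)] <- 1
  m
}

crisp.membership(c(1, 2, 2, 3), 3)  # each row contains exactly one 1
```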
The reader could try to repeat these computations with other random projections: in this case we should obtain the same results, independently of the chosen projection.
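The PMO projection used above can be sketched as follows. This is a guess at its form (a Plus-Minus-One random matrix in the style of Achlioptas); the package implementation may differ in details such as scaling or the orientation of the data matrix:

```r
# Hedged sketch of a Plus-Minus-One (PMO) random projection: each entry
# of the projection matrix is +1 or -1 with equal probability, scaled
# by 1/sqrt(d.new).  We assume examples are stored as rows of M; the
# package may use a different convention.
pmo.project <- function(M, d.new) {
  R <- matrix(sample(c(-1, 1), ncol(M) * d.new, replace = TRUE),
              nrow = ncol(M)) / sqrt(d.new)
  M %*% R
}

X <- matrix(rnorm(10 * 2000), nrow = 10)  # 10 examples, 2000 dimensions
Y <- pmo.project(X, 400)                  # projected to 400 dimensions
```

Because the projection matrix is drawn at random, each call produces a different subspace; the Johnson-Lindenstrauss guarantee is what makes the results stable across projections.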
We can repeat the analysis using the k-means algorithm:

> l3 <- Random.kmeans.validity(M, dim=JL.predict.dim(45,0.2), pmethod = "PMO", c = 3, n = 20)
> l3$overall.validity
[1] 0.9504762
> l3$validity
[1] 1.0000000 0.8961905 0.9552381
> l3$orig.cluster
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15

[[2]]
 [1] 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

[[3]]
 [1] 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45

The results are quite similar to those obtained with hierarchical clustering, but the validity indices are in this case slightly lower. Let's repeat the computations:
> l3 <- Random.kmeans.validity(M, dim=JL.predict.dim(45,0.2), pmethod = "PMO", c = 3, n = 20)
> l3$validity
[1] 0.5089655 0.9500000 0.9433333
> l3 <- Random.kmeans.validity(M, dim=JL.predict.dim(45,0.2), pmethod = "PMO", c = 3, n = 20)
> l3$validity
[1] 1.0000000 0.8933333 0.9495238

What happened? The results are quite different. This is because the k-means clustering algorithm strongly depends on the initial conditions, and we may obtain different results in different runs. If you repeat these experiments it is likely that you will obtain yet other results.
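The same sensitivity can be seen with the standard kmeans function from base R (a toy example, unrelated to the package functions): with a single random start, different runs may converge to different local minima, while many restarts stabilize the result.

```r
# Base-R illustration of k-means' dependence on initial conditions
# (toy data; not the package's projected clusterings).
set.seed(1)
x <- rbind(matrix(rnorm(60, mean = 0),  ncol = 2),
           matrix(rnorm(60, mean = 5),  ncol = 2),
           matrix(rnorm(60, mean = 10), ncol = 2))
fit.a <- kmeans(x, centers = 4, nstart = 1)   # one random start
fit.b <- kmeans(x, centers = 4, nstart = 1)   # another random start
fit.c <- kmeans(x, centers = 4, nstart = 50)  # best of 50 starts
# fit.a and fit.b may reach different local minima; fit.c keeps the
# best of 50 restarts, so its within-cluster sum of squares is
# typically the lowest of the three.
c(fit.a$tot.withinss, fit.b$tot.withinss, fit.c$tot.withinss)
```

Setting nstart high is the usual base-R remedy; the package's replicated clusterings over random projections address the same instability from a different angle.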
Consider now a second synthetic data set:

> M <- generate.sample4(n = 10, sigma = 0.1)

The obtained 5 clusters (each with 10 examples) are quite well separated, as shown by plotting the points projected along the two main principal components via PCA (Fig. 4).
> r2 <- Random.fuzzy.kmeans.validity(M, dim=JL.predict.dim(50,0.2), pmethod = "PMO", c = 2, n = 20)
> r2$overall.validity
[1] 1
> r2$validity
[1] 1 1

Hence two clusters are considered reliable. But we know that there are 5 clusters. Taking a look at the clusters, we see that the first 4 "true" clusters are grouped together against the fifth (the first 10 examples belong to the first cluster, the next ten to the second, and so on):
> r2$orig.cluster
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
[21] 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40

[[2]]
 [1] 41 42 43 44 45 46 47 48 49 50

Do the stability indices fail, or do the results depend on how the clusters are defined? Indeed the fuzzy k-means sees 2 clusters in the data, and the fifth cluster is clearly separated from the rest of the data.
Consider now c=3 clusters:
> r3 <- Random.fuzzy.kmeans.validity(M, dim=JL.predict.dim(50,0.2), pmethod = "PMO", c = 3, n = 20)
> r3$overall.validity
[1] 1
> r3$validity
[1] 1 1 1
> r3$orig.cluster
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
[21] 21 22 23 24 25 26 27 28 29 30

[[2]]
 [1] 31 32 33 34 35 36 37 38 39 40

[[3]]
 [1] 41 42 43 44 45 46 47 48 49 50

In this case 3 clusters are also considered reliable, because the first cluster is really composed of the three "true" clusters in the center (see Fig. 4), while the other two clusters, on the left and on the right, are more separated.
And with 5 clusters?
> r5 <- Random.fuzzy.kmeans.validity(M, dim=JL.predict.dim(50,0.2), pmethod = "PMO", c = 5, n = 20)
> r5$overall.validity
[1] 1
> r5$validity
[1] 1 1 1 1 1
> r5$orig.cluster
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10

[[2]]
 [1] 11 12 13 14 15 16 17 18 19 20

[[3]]
 [1] 21 22 23 24 25 26 27 28 29 30

[[4]]
 [1] 31 32 33 34 35 36 37 38 39 40

[[5]]
 [1] 41 42 43 44 45 46 47 48 49 50

As expected, the stability indices consider the five "true" clusters very reliable.
With 4 clusters, only the two "extreme" clusters are considered very reliable, as the first big cluster loses example 5, which is assigned to the second cluster:
> r4 <- Random.fuzzy.kmeans.validity(M, dim=JL.predict.dim(50,0.2), pmethod = "PMO", c = 4, n = 20)
> r4$overall.validity
[1] 0.8899123
> r4$validity
[1] 0.6596491 0.9000000 1.0000000 1.0000000
> r4$orig.cluster
[[1]]
 [1]  1  2  3  4  6  7  8  9 10 21 22 23 24 25 26 27 28 29 30

[[2]]
 [1]  5 11 12 13 14 15 16 17 18 19 20

[[3]]
 [1] 31 32 33 34 35 36 37 38 39 40

[[4]]
 [1] 41 42 43 44 45 46 47 48 49 50
If we use an unnatural number of clusters (e.g. 10), we obtain low values of the stability indices:
> r10 <- Random.fuzzy.kmeans.validity(M, dim=JL.predict.dim(50,0.2), pmethod = "PMO", c = 10, n = 20)
> r10$overall.validity
[1] 0.573125
> r10$validity
[1] 0.5972222 0.0500000 0.7366667 0.7411111 0.5500000 0.6300000
[7] 0.6300000 0.6500000
Consider now the PAM (Partitioning Around Medoids) algorithm:

> p2 <- Random.PAM.validity(M, dim=JL.predict.dim(50,0.2), pmethod = "PMO", c = 2, n = 20)
> p2$overall.validity
[1] 0.8846154
> p2$validity
[1] 0.7692308 1.0000000
> p2$orig.cluster
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
[21] 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40

[[2]]
 [1] 41 42 43 44 45 46 47 48 49 50

With two clusters the overall validity is in this case lower, as the first big cluster is considered less reliable. This fact suggests that the PAM clustering algorithm may be better suited to this data set. These results also show that the validity indices depend on the clustering algorithm used: indeed the similarity matrix used to compute the stability indices is obtained through multiple clusterings of the randomly projected data, and hence depends on the applied clustering algorithm.
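The similarity matrix mentioned above can be illustrated with a toy computation. This is a sketch of the general idea, not the package's exact procedure: each entry estimates how often two examples fall in the same cluster across the replicated clusterings.

```r
# Toy sketch of a pairwise similarity matrix built from multiple
# clusterings (not the package's exact implementation).  Each row of
# labels.runs is one replicated clustering; entry (i, j) of S is the
# fraction of runs in which examples i and j end up in the same cluster.
labels.runs <- rbind(c(1, 1, 2, 2),
                     c(1, 1, 2, 2),
                     c(1, 2, 2, 2),
                     c(1, 1, 2, 2))
n <- ncol(labels.runs)
S <- matrix(1, n, n)
for (i in 1:(n - 1)) {
  for (j in (i + 1):n) {
    S[i, j] <- S[j, i] <- mean(labels.runs[, i] == labels.runs[, j])
  }
}
S  # examples 1 and 2 co-cluster in 3 of 4 runs, so S[1, 2] = 0.75
```

Stability indices summarize such a matrix per cluster: a cluster whose members are consistently grouped together across runs (entries near 1) is stable, while entries well below 1 signal an unreliable cluster.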
With partitions composed of 3 clusters, we obtain, as expected, very reliable clusters:
> p3 <- Random.PAM.validity(M, dim=JL.predict.dim(50,0.2), pmethod = "PMO", c = 3, n = 20)
> p3$overall.validity
[1] 1
> p3$validity
[1] 1 1 1
> p3$orig.cluster
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
[20] 20 21 22 23 24 25 26 27 28 29 30

[[2]]
 [1] 31 32 33 34 35 36 37 38 39 40

[[3]]
 [1] 41 42 43 44 45 46 47 48 49 50
And, of course, also with 5 clusters:
> p5 <- Random.PAM.validity(M, dim=JL.predict.dim(50,0.2), pmethod = "PMO", c = 5, n = 20)
> p5$overall.validity
[1] 1
> p5$validity
[1] 1 1 1 1 1
With 4 or, e.g., 8 clusters the reliability of the obtained clusters is lower, as the "true" clusters are fragmented:
> p4 <- Random.PAM.validity(M, dim=JL.predict.dim(50,0.2), pmethod = "PMO", c = 4, n = 20)
> p4$overall.validity
[1] 0.9047581
> p4$validity
[1] 0.7099415 0.9090909 1.0000000 1.0000000
> p8 <- Random.PAM.validity(M, dim=JL.predict.dim(50,0.2), pmethod = "PMO", c = 8, n = 20)
> p8$overall.validity
[1] 0.6332639
> p8$validity
[1] 0.7450000 0.7900000 0.6633333 0.7000000 0.7166667
[6] 0.6333333 0.0500000 0.7677778