In this section we analyze the reliability of clusters generated by the application of a hierarchical clustering algorithm to high dimensional synthetic data.
As a first step, we need to load the clusterv library:
> library(clusterv)
Loading required package: MASS
Loading required package: cluster
Library clusterv loaded.

The clusterv library requires two packages, cluster and MASS, that are usually available in all R environments. In the unlikely case that these packages are not installed in your R environment, it is straightforward to download them from the R web site.
Then we generate a synthetic data set that we will use for our reliability analysis:
> M <- generate.sample0(n=10, m=1, sigma=1, dim=6000)

The function generate.sample0 generates a 6000-dimensional data set with 3 clusters, each composed of 10 examples. The data are distributed according to a multivariate spherical Gaussian distribution with a covariance matrix equal to the identity matrix. The three clusters are centered, respectively, in the 0 vector (that is, a 6000-dimensional vector with all components equal to 0), in the 1 vector and in the -1 vector.
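To see what such a data set looks like, here is a base-R sketch of the same construction: three spherical Gaussian clusters centered in the 0, 1 and -1 vectors. Note that the layout (one example per column) and the variable names are assumptions for illustration, not the clusterv internals of generate.sample0.

```r
# Base-R sketch of a data set like the one generate.sample0 produces.
# Assumption: one example per column (the actual clusterv layout may differ).
set.seed(1)
d <- 6000; n <- 10; sigma <- 1
centers <- list(rep(0, d), rep(1, d), rep(-1, d))
# each column is drawn from a spherical Gaussian N(mu, sigma^2 * I);
# rnorm recycles the mean vector mu down each column of the matrix
M.sketch <- do.call(cbind, lapply(centers, function(mu)
  matrix(rnorm(d * n, mean = mu, sd = sigma), nrow = d)))
dim(M.sketch)   # 6000 features (rows), 30 examples (columns)
```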
Then we want to perform a reliability analysis of the clustering obtained with hierarchical clustering using Ward's method, choosing a cut of the dendrogram corresponding to 2 clusters. To this end we choose an Achlioptas random projection and a subspace dimension such that the maximum distortion will be less than 1.2 (see Background on random projections in euclidean spaces).
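The Achlioptas projection mentioned above has a well-known, simple form: the projection matrix has sparse entries drawn from {+sqrt(3), 0, -sqrt(3)}. The sketch below shows this standard construction for illustration only; it is not the internal implementation used by clusterv.

```r
# Standard Achlioptas random projection matrix (a sketch, not clusterv internals):
# entries are +sqrt(3), 0, -sqrt(3) with probabilities 1/6, 2/3, 1/6,
# scaled by 1/sqrt(dim.sub) so that squared distances are preserved in expectation.
achlioptas.matrix <- function(dim.sub, dim.orig) {
  vals <- sqrt(3) * sample(c(-1, 0, 1), dim.sub * dim.orig,
                           replace = TRUE, prob = c(1/6, 2/3, 1/6))
  matrix(vals, nrow = dim.sub, ncol = dim.orig) / sqrt(dim.sub)
}
P <- achlioptas.matrix(341, 6000)
# with one example per column, the projected data would be P %*% M
```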
Hence we first need to compute the subspace dimension according to the JL lemma with 1+epsilon distortion:

> subspace.dim <- ceiling(JL.predict.dim(30, epsilon=0.2))
> subspace.dim
[1] 341

That is, we will perform random projections from the original 6000-dimensional space to 341-dimensional subspaces. Then we perform the clustering in the original space and, to evaluate its reliability, we perform 20 Achlioptas random projections into 341-dimensional subspaces, performing 20 hierarchical clusterings in those subspaces to compute the stability indices:
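The value 341 is consistent with one common form of the Johnson-Lindenstrauss bound, d' >= 4 log(n) / epsilon^2; we assume here that JL.predict.dim implements something equivalent. A minimal sketch that reproduces the number:

```r
# One common form of the JL bound: projecting n points into
# d' >= 4*log(n)/eps^2 dimensions keeps the distortion below 1+eps.
# (Assumption: JL.predict.dim uses an equivalent bound.)
jl.dim <- function(n, eps) ceiling(4 * log(n) / eps^2)
jl.dim(30, eps = 0.2)   # 341, matching subspace.dim above
```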
> l2 <- Random.hclustering.validity(M, dim=subspace.dim, hmethod="ward", pmethod="Achlioptas", c=2, n=20, scale=TRUE, seed=100, AC=TRUE)

The list l2 is composed of different elements that store the computed stability indices and other information:
> l2$overall.validity
[1] 0.9210526
> l2$validity
[1] 1.0000000 0.8421053

These results show that the reliability (overall validity) of the clustering is high (0.9210) and that the validities of the 2 individual clusters are respectively 1.0000 and 0.8421.
We can repeat the same test, this time choosing 3 clusters for the partition (we only need to change the parameter to c=3, indicating that we test a 3-cluster clustering):
> l3 <- Random.hclustering.validity(M, dim=subspace.dim, c=3, n=20, pmethod="Achlioptas", hmethod="ward", scale=TRUE, seed=100, AC=TRUE)
> l3$overall.validity
[1] 1
> l3$validity
[1] 1 1 1

In this case we achieve the maximum reliability, both for the overall clustering and for the individual clusters.
We now repeat the same test with c=4, 5, 10 clusters:
4 clusters partition:
> l4 <- Random.hclustering.validity(M, dim=subspace.dim, c=4, n=20, pmethod="Achlioptas", hmethod="ward", scale=TRUE, seed=100, AC=TRUE)
> l4$overall.validity
[1] 0.8245833
> l4$validity
[1] 0.8911111 0.7755556 0.8250000 0.8066667

5 clusters partition:
> l5 <- Random.hclustering.validity(M, dim=subspace.dim, c=5, n=20, pmethod="Achlioptas", hmethod="ward", scale=TRUE, seed=100, AC=TRUE)
> l5$overall.validity
[1] 0.7097778
> l5$validity
[1] 0.7500000 0.7350000 0.6055556 0.7416667 0.7166667

10 clusters partition:
> l10 <- Random.hclustering.validity(M, dim=subspace.dim, c=10, n=20, pmethod="Achlioptas", hmethod="ward", scale=TRUE, seed=100, AC=TRUE)
> l10$overall.validity
[1] 0.3213333
> l10$validity
 [1] 0.3800000 0.1500000 0.3250000 0.2750000 0.4500000 0.2500000
 [7] 0.2833333 0.3166667 0.4000000 0.3833333

We know in advance that the correct number of clusters is 3. The stability indices correctly detect that the most likely clustering is composed of 3 clusters, and each cluster is highly reliable. Note that with the 2, 4 and 5-cluster partitions we obtain lower values of the stability indices, and with the 10-cluster partition the unnatural fragmentation of the clusters leads to very low values of the stability indices.
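This model-selection step can be automated by collecting the overall.validity values and picking the maximum. The snippet below uses the values obtained in the runs above:

```r
# Selecting the number of clusters by maximum overall stability,
# using the overall.validity values from the runs above:
overall <- c("2" = 0.9210526, "3" = 1.0000000, "4" = 0.8245833,
             "5" = 0.7097778, "10" = 0.3213333)
best.c <- as.integer(names(which.max(overall)))
best.c   # 3: the partition into 3 clusters is the most reliable
```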
The element l$AC of the list returned by Random.hclustering.validity is a matrix that reports the "confidence" with which we can assign an example to a cluster:
> l3$AC
      [,1] [,2] [,3]
 [1,]    0    1    0
 [2,]    0    1    0
 [3,]    0    1    0
 [4,]    0    1    0
 [5,]    0    1    0
 [6,]    0    1    0
 [7,]    0    1    0
 [8,]    0    1    0
 [9,]    0    1    0
[10,]    0    1    0
[11,]    1    0    0
[12,]    1    0    0
[13,]    1    0    0
[14,]    1    0    0
[15,]    1    0    0
[16,]    1    0    0
[17,]    1    0    0
[18,]    1    0    0
[19,]    1    0    0
[20,]    1    0    0
[21,]    0    0    1
[22,]    0    0    1
[23,]    0    0    1
[24,]    0    0    1
[25,]    0    0    1
[26,]    0    0    1
[27,]    0    0    1
[28,]    0    0    1
[29,]    0    0    1
[30,]    0    0    1

The rows refer to the examples and the columns to the clusters: in this case the assignment is highly reliable for every example, but in general the AC-index may take values between 0 (no reliable assignment) and 1 (highly reliable assignment).
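The AC matrix lends itself to simple post-processing, for instance assigning each example to its most confident cluster and flagging unreliable assignments. The toy matrix below stands in for l3$AC (where, in our run, every confidence is 1); the values and the 0.5 threshold are illustrative choices, not clusterv defaults.

```r
# Post-processing an AC matrix (toy values standing in for l3$AC):
AC <- matrix(c(0.90, 0.10, 0.00,
               0.05, 0.15, 0.80,
               0.40, 0.35, 0.25), nrow = 3, byrow = TRUE)
assignment <- apply(AC, 1, which.max)  # most confident cluster per example
confidence <- apply(AC, 1, max)        # the corresponding AC value
uncertain  <- which(confidence < 0.5)  # examples with unreliable assignment
```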
However, to perform a deeper and more systematic analysis it is preferable to use R scripts that automatically launch multiple instances of the function Random.hclustering.validity and automatically store the corresponding results for further analysis and visualization.
An example of such a script is downloadable from:
http://homes.dsi.unimi.it/valenti/SW/clusterv/examples/sample0.RSvalidity.R. This script performs a reliability analysis on a data set generated with the same generator used in our example, but using random subspace projections instead of Achlioptas projections.
The results are summarized in the following figure (Fig. 1), which represents the dendrogram of the clustering, and table (Tab. 1), where the corresponding validity indices are shown. Values marked S refer to the overall stability index, while the other values in the table represent the stability indices of the individual clusters. Each column reports the stability measures computed using random projections into subspaces of different dimensions, corresponding to different 1 + epsilon (eps) distortions according to the JL lemma (see Background on random projections in euclidean spaces).
Tab. 1. Stability index s.

| Clusters    | eps=0.5    | eps=0.4    | eps=0.3    | eps=0.2    | eps=0.1    |
|-------------|------------|------------|------------|------------|------------|
| 2 clusters  | S = 0.8631 | S = 0.8684 | S = 0.8684 | S = 0.9157 | S = 0.9421 |
| 1           | 1.0000     | 1.0000     | 1.0000     | 1.0000     | 1.0000     |
| 2           | 0.7263     | 0.7368     | 0.7368     | 0.8314     | 0.8842     |
| 3 clusters  | S = 1.0000 | S = 1.0000 | S = 1.0000 | S = 1.0000 | S = 1.0000 |
| 1           | 1.0000     | 1.0000     | 1.0000     | 1.0000     | 1.0000     |
| 2           | 1.0000     | 1.0000     | 1.0000     | 1.0000     | 1.0000     |
| 3           | 1.0000     | 1.0000     | 1.0000     | 1.0000     | 1.0000     |
| 5 clusters  | S = 0.7059 | S = 0.6843 | S = 0.7044 | S = 0.7004 | S = 0.7472 |
| 1           | 0.6973     | 0.7346     | 0.7293     | 0.6506     | 0.7560     |
| 2           | 0.6666     | 0.7066     | 0.6866     | 0.6466     | 0.7133     |
| 3           | 0.7155     | 0.7582     | 0.7448     | 0.7591     | 0.8364     |
| 4           | 0.7600     | 0.5600     | 0.6800     | 0.7400     | 0.7800     |
| 5           | 0.6900     | 0.6621     | 0.6814     | 0.7057     | 0.6507     |
| 10 clusters | S = 0.3093 | S = 0.3043 | S = 0.2651 | S = 0.3286 | S = 0.3936 |
| 1           | 0.0600     | 0.1200     | 0.0600     | 0.2000     | 0.2400     |
| 2           | 0.4260     | 0.3520     | 0.2900     | 0.3360     | 0.4560     |
| 3           | 0.1400     | 0.1600     | 0.1600     | 0.2000     | 0.1400     |
| 4           | 0.4066     | 0.3533     | 0.3200     | 0.3800     | 0.4200     |
| 5           | 0.3733     | 0.3000     | 0.2866     | 0.3600     | 0.4200     |
| 6           | 0.3276     | 0.3419     | 0.3285     | 0.3866     | 0.3933     |
| 7           | 0.3600     | 0.2800     | 0.3000     | 0.3600     | 0.3800     |
| 8           | 0.3000     | 0.3366     | 0.3066     | 0.3433     | 0.3866     |
| 9           | 0.3400     | 0.4000     | 0.2600     | 0.4200     | 0.5000     |
| 10          | 0.3600     | 0.4000     | 0.3400     | 0.3000     | 0.6000     |