

Getting started with Clusterv

In this section we analyze the reliability of clusters generated by applying a hierarchical clustering algorithm to high-dimensional synthetic data.

As a first step, we need to load the clusterv library:

> library(clusterv)
Loading required package: MASS
Loading required package: cluster
Library clusterv loaded.
The clusterv library requires two packages, cluster and MASS, which are usually available in all R environments. In the unlikely case that these packages are not installed in your R environment, it is straightforward to download them from the R web site.
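If needed, both packages can be installed from CRAN in the usual way:

> install.packages(c("cluster", "MASS"))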

Then we generate a synthetic data set that we will use for our reliability analysis:

> M <- generate.sample0(n=10, m=1, sigma=1, dim=6000)
The function generate.sample0 generates a 6000-dimensional data set with 3 clusters, each one composed of 10 examples. The data are distributed according to a multivariate spherical Gaussian distribution with a covariance matrix equal to the identity matrix. The three clusters are centered respectively at the 0 vector (that is, a 6000-dimensional vector with all components equal to 0), at the 1 vector and at the -1 vector.
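For illustration only, the following base-R sketch produces data with the same structure (this is not the implementation of generate.sample0, and the orientation of the matrix, with examples as rows, is our assumption):

> # Hypothetical sketch: three spherical Gaussian clusters of 10 examples each
> # in 6000 dimensions, centered at the 0, 1 and -1 vectors, with sigma = 1
> n <- 10; d <- 6000; sigma <- 1
> c1 <- matrix(rnorm(n * d, mean =  0, sd = sigma), nrow = n)  # centered at 0
> c2 <- matrix(rnorm(n * d, mean =  1, sd = sigma), nrow = n)  # centered at 1
> c3 <- matrix(rnorm(n * d, mean = -1, sd = sigma), nrow = n)  # centered at -1
> M.sketch <- rbind(c1, c2, c3)  # 30 examples x 6000 dimensions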

Then we want to perform a reliability analysis of the clustering obtained with Ward's hierarchical clustering method, choosing a cut (number of clusters) equal to 2. To this end we choose an Achlioptas random projection and a subspace dimension such that the maximum distortion is less than 1.2 (see Background on random projections in euclidean spaces).

Hence we first need to compute the subspace dimension according to the JL lemma for a 1+epsilon distortion:

> subspace.dim <- ceiling(JL.predict.dim(30, epsilon=0.2));
> subspace.dim
[1] 341
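The returned value agrees with the simple Johnson-Lindenstrauss bound d >= 4 ln(n) / epsilon^2, where n = 30 is the overall number of examples (that this is the exact formula used by JL.predict.dim is an assumption on our part):

> ceiling(4 * log(30) / 0.2^2)   # assumed JL bound: gives 341, matching the value above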
That is, we will perform random projections from the original 6000-dimensional space into 341-dimensional subspaces. Then we perform the clustering in the original space and, to evaluate its reliability, we perform 20 Achlioptas random projections into 341-dimensional subspaces, computing 20 hierarchical clusterings in those subspaces to obtain the stability indices:
> l2 <- Random.hclustering.validity (M, dim=subspace.dim, hmethod="ward",
pmethod="Achlioptas", c=2, n=20, scale=TRUE, seed=100, AC=TRUE);
The list l2 is composed of several elements that store the computed stability indices and other information:
> l2$overall.validity
[1] 0.9210526
> l2$validity
[1] 1.0000000 0.8421053
These results show that the reliability (overall validity) of the clustering is high (0.9210) and that the validities of the two individual clusters are respectively 1.0000 and 0.8421.

We can repeat the same test, this time choosing 3 clusters for the partition (we only need to change the parameter to c=3, indicating that we test a 3-cluster partition):

> l3 <- Random.hclustering.validity (M, dim=subspace.dim, c=3, n=20, 
pmethod="Achlioptas", hmethod="ward", scale=TRUE, seed=100, AC=TRUE);
> l3$overall.validity
[1] 1
> l3$validity
[1] 1 1 1
In this case we achieve maximum reliability, both for the overall clustering and for the individual clusters.

We now repeat the same test with c=4, 5 and 10 clusters.
4-cluster partition:

> l4 <- Random.hclustering.validity (M, dim=subspace.dim, c=4,  n=20,
pmethod="Achlioptas", hmethod="ward", scale=TRUE, seed=100, AC=TRUE);
> l4$overall.validity
[1] 0.8245833
> l4$validity
[1] 0.8911111 0.7755556 0.8250000 0.8066667
5-cluster partition:
> l5 <- Random.hclustering.validity (M, dim=subspace.dim, c=5, n=20,
pmethod="Achlioptas", hmethod="ward",  scale=TRUE, seed=100, AC=TRUE);
> l5$overall.validity
[1] 0.7097778
> l5$validity
[1] 0.7500000 0.7350000 0.6055556 0.7416667 0.7166667
10-cluster partition:
> l10 <- Random.hclustering.validity (M, dim=subspace.dim, c=10, n=20,
pmethod="Achlioptas", hmethod="ward", scale=TRUE, seed=100, AC=TRUE);
> l10$overall.validity
[1] 0.3213333
> l10$validity
 [1] 0.3800000 0.1500000 0.3250000 0.2750000 0.4500000 0.2500000 
 [7] 0.2833333 0.3166667 0.4000000 0.3833333
We know in advance that the correct number of clusters is 3. The stability indices correctly detect that the most likely clustering is composed of 3 clusters, and each cluster is highly reliable. Note that with the 2-, 4- and 5-cluster partitions we obtain lower values of the stability indices, while with the 10-cluster partition the unnatural fragmentation of the clusters leads to very low values of the stability indices.
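As a quick programmatic check, we can gather the overall validity indices computed above and select the most reliable partition (a minimal sketch reusing the lists l2, l3, l4, l5 and l10):

> overall <- c("2" = l2$overall.validity, "3" = l3$overall.validity,
+              "4" = l4$overall.validity, "5" = l5$overall.validity,
+              "10" = l10$overall.validity)
> names(which.max(overall))   # "3": the 3-cluster partition is the most reliable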

The element l$AC of the list returned by Random.hclustering.validity is a matrix storing the "confidence" with which we can assign each example to a cluster:

> l3$AC
      [,1] [,2] [,3]
 [1,]    0    1    0
 [2,]    0    1    0
 [3,]    0    1    0
 [4,]    0    1    0
 [5,]    0    1    0
 [6,]    0    1    0
 [7,]    0    1    0
 [8,]    0    1    0
 [9,]    0    1    0
[10,]    0    1    0
[11,]    1    0    0
[12,]    1    0    0
[13,]    1    0    0
[14,]    1    0    0
[15,]    1    0    0
[16,]    1    0    0
[17,]    1    0    0
[18,]    1    0    0
[19,]    1    0    0
[20,]    1    0    0
[21,]    0    0    1
[22,]    0    0    1
[23,]    0    0    1
[24,]    0    0    1
[25,]    0    0    1
[26,]    0    0    1
[27,]    0    0    1
[28,]    0    0    1
[29,]    0    0    1
[30,]    0    0    1
The rows refer to the examples and the columns to the clusters: in this case the assignment is highly reliable for every example, but in general the AC index can take values between 0 (no reliable assignment) and 1 (highly reliable assignment).
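For instance, the crisp cluster assignment of each example and its confidence can be extracted from the AC matrix with plain base R (a minimal sketch):

> assignment <- apply(l3$AC, 1, which.max)  # index of the cluster with largest AC
> confidence <- apply(l3$AC, 1, max)        # the corresponding AC value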

However, to perform a deeper and more systematic analysis it is preferable to use R scripts that automatically launch multiple instances of the function Random.hclustering.validity and store the corresponding results for further analysis and visualization. An example of such a script can be downloaded from:
http://homes.dsi.unimi.it/valenti/SW/clusterv/examples/sample0.RSvalidity.R. This script performs a reliability analysis on a data set generated with the same generator used in our example, but using random subspace projections instead.
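A minimal sketch of such a driver script follows; for simplicity it reuses the Achlioptas projections adopted above (the downloadable script instead uses random subspace projections, whose exact pmethod identifier we do not assume here):

> # Hypothetical driver: test several numbers of clusters and store the results
> results <- list()
> for (k in c(2, 3, 4, 5, 10)) {
+   results[[as.character(k)]] <- Random.hclustering.validity(M, dim=subspace.dim,
+     c=k, n=20, pmethod="Achlioptas", hmethod="ward", scale=TRUE, seed=100, AC=TRUE)
+ }
> sapply(results, function(r) r$overall.validity)  # overall index for each k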

The results are summarized in the following figure (Fig. 1), which shows the dendrogram of the clustering, and table (Tab. 1), where the corresponding validity indices are reported. Values prefixed with S refer to the overall stability index, while the other values in the table represent the stability indices of the individual clusters. Each column reports the stability measures computed using random projections into subspaces of different dimensions, corresponding to different 1 + epsilon (eps) distortions according to the JL lemma (see Background on random projections in euclidean spaces).

Figure 1: Hierarchical clustering of examples obtained with the Ward method. Gray dotted lines cut the dendrogram such that exactly k clusters are produced, for k=2,3,5.
[Image: ps/tree.s0.ward.eps]


Table 1: Estimate of cluster stability with random subspace projections

Clusters       Stability index s
               eps=0.5   eps=0.4   eps=0.3   eps=0.2   eps=0.1
2 clusters     S=0.8631  S=0.8684  S=0.8684  S=0.9157  S=0.9421
  cluster 1     1.0000    1.0000    1.0000    1.0000    1.0000
  cluster 2     0.7263    0.7368    0.7368    0.8314    0.8842
3 clusters     S=1.0000  S=1.0000  S=1.0000  S=1.0000  S=1.0000
  cluster 1     1.0000    1.0000    1.0000    1.0000    1.0000
  cluster 2     1.0000    1.0000    1.0000    1.0000    1.0000
  cluster 3     1.0000    1.0000    1.0000    1.0000    1.0000
5 clusters     S=0.7059  S=0.6843  S=0.7044  S=0.7004  S=0.7472
  cluster 1     0.6973    0.7346    0.7293    0.6506    0.7560
  cluster 2     0.6666    0.7066    0.6866    0.6466    0.7133
  cluster 3     0.7155    0.7582    0.7448    0.7591    0.8364
  cluster 4     0.7600    0.5600    0.6800    0.7400    0.7800
  cluster 5     0.6900    0.6621    0.6814    0.7057    0.6507
10 clusters    S=0.3093  S=0.3043  S=0.2651  S=0.3286  S=0.3936
  cluster 1     0.0600    0.1200    0.0600    0.2000    0.2400
  cluster 2     0.4260    0.3520    0.2900    0.3360    0.4560
  cluster 3     0.1400    0.1600    0.1600    0.2000    0.1400
  cluster 4     0.4066    0.3533    0.3200    0.3800    0.4200
  cluster 5     0.3733    0.3000    0.2866    0.3600    0.4200
  cluster 6     0.3276    0.3419    0.3285    0.3866    0.3933
  cluster 7     0.3600    0.2800    0.3000    0.3600    0.3800
  cluster 8     0.3000    0.3366    0.3066    0.3433    0.3866
  cluster 9     0.3400    0.4000    0.2600    0.4200    0.5000
  cluster 10    0.3600    0.4000    0.3400    0.3000    0.6000
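The subspace dimensions used in each column of Tab. 1 can be computed with JL.predict.dim, exactly as done above for eps=0.2:

> eps <- c(0.5, 0.4, 0.3, 0.2, 0.1)
> sapply(eps, function(e) ceiling(JL.predict.dim(30, epsilon = e)))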

