In this section we analyze the reliability of clusters generated by the application of a hierarchical clustering algorithm to high dimensional synthetic data.
As a first step, we need to load the clusterv library:
> library(clusterv)
Loading required package: MASS
Loading required package: cluster
Library clusterv loaded.

The clusterv library requires two packages, cluster and MASS, that are usually available in all R environments. In the unlikely case that these packages are not installed in your R environment, it is straightforward to download them from the R web site.
Then we generate a synthetic data set that we will use for our reliability analysis:
> M <- generate.sample0(n=10, m=1, sigma=1, dim=6000)

The function generate.sample0 generates a 6000-dimensional data set with 3 clusters, each composed of 10 examples. The data are distributed according to a multivariate spherical Gaussian distribution with a covariance matrix equal to the identity matrix. The three clusters are centered, respectively, in the 0 vector (that is, a 6000-dimensional vector with all components equal to 0), in the 1 vector and in the -1 vector.
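To see what such a data set looks like, here is a base-R sketch of the same construction: three spherical Gaussian clusters centered in the 0, 1 and -1 vectors. Note that the layout (one example per column) and the variable names are assumptions for illustration, not the clusterv internals of generate.sample0.

```r
# Base-R sketch of a data set like the one generate.sample0 produces.
# Assumption: one example per column (the actual clusterv layout may differ).
set.seed(1)
d <- 6000; n <- 10; sigma <- 1
centers <- list(rep(0, d), rep(1, d), rep(-1, d))
# each column is drawn from a spherical Gaussian N(mu, sigma^2 * I);
# rnorm recycles the mean vector mu down each column of the matrix
M.sketch <- do.call(cbind, lapply(centers, function(mu)
  matrix(rnorm(d * n, mean = mu, sd = sigma), nrow = d)))
dim(M.sketch)   # 6000 features (rows), 30 examples (columns)
```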
Then we want to perform a reliability analysis of the clustering obtained with hierarchical clustering using Ward's method, choosing a cut of the dendrogram corresponding to 2 clusters. To this end we choose an Achlioptas random projection and a subspace dimension such that the maximum distortion will be less than 1.2 (see Background on random projections in euclidean spaces).
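The Achlioptas projection mentioned above has a well-known, simple form: the projection matrix has sparse entries drawn from {+sqrt(3), 0, -sqrt(3)}. The sketch below shows this standard construction for illustration only; it is not the internal implementation used by clusterv.

```r
# Standard Achlioptas random projection matrix (a sketch, not clusterv internals):
# entries are +sqrt(3), 0, -sqrt(3) with probabilities 1/6, 2/3, 1/6,
# scaled by 1/sqrt(dim.sub) so that squared distances are preserved in expectation.
achlioptas.matrix <- function(dim.sub, dim.orig) {
  vals <- sqrt(3) * sample(c(-1, 0, 1), dim.sub * dim.orig,
                           replace = TRUE, prob = c(1/6, 2/3, 1/6))
  matrix(vals, nrow = dim.sub, ncol = dim.orig) / sqrt(dim.sub)
}
P <- achlioptas.matrix(341, 6000)
# with one example per column, the projected data would be P %*% M
```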
Hence we first need to compute the subspace dimension according to the JL lemma with 1+epsilon distortion:

> subspace.dim <- ceiling(JL.predict.dim(30, epsilon=0.2))
> subspace.dim
[1] 341

That is, we will perform random projections from the original 6000-dimensional space to 341-dimensional subspaces. Then we perform the clustering in the original space and, to evaluate its reliability, we perform 20 Achlioptas random projections into 341-dimensional subspaces, performing 20 hierarchical clusterings in those subspaces to compute the stability indices:
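The value 341 is consistent with one common form of the Johnson-Lindenstrauss bound, d' >= 4 log(n) / epsilon^2; we assume here that JL.predict.dim implements something equivalent. A minimal sketch that reproduces the number:

```r
# One common form of the JL bound: projecting n points into
# d' >= 4*log(n)/eps^2 dimensions keeps the distortion below 1+eps.
# (Assumption: JL.predict.dim uses an equivalent bound.)
jl.dim <- function(n, eps) ceiling(4 * log(n) / eps^2)
jl.dim(30, eps = 0.2)   # 341, matching subspace.dim above
```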
> l2 <- Random.hclustering.validity(M, dim=subspace.dim, hmethod="ward", pmethod="Achlioptas", c=2, n=20, scale=TRUE, seed=100, AC=TRUE)

The list l2 is composed of different elements that store the computed stability indices and other information:
> l2$overall.validity
[1] 0.9210526
> l2$validity
[1] 1.0000000 0.8421053

These results show that the reliability (overall validity) of the clustering is high (0.9210) and that the validities of the 2 individual clusters are respectively 1.0000 and 0.8421.
We can repeat the same test, this time choosing 3 clusters for the partition (we only need to change the parameter to c=3, indicating that we test a 3-cluster clustering):
> l3 <- Random.hclustering.validity(M, dim=subspace.dim, c=3, n=20, pmethod="Achlioptas", hmethod="ward", scale=TRUE, seed=100, AC=TRUE)
> l3$overall.validity
[1] 1
> l3$validity
[1] 1 1 1

In this case we achieve the maximum reliability, both for the overall clustering and for the individual clusters.
We now repeat the same test with c=4, 5, 10 clusters:
4 clusters partition:
> l4 <- Random.hclustering.validity(M, dim=subspace.dim, c=4, n=20, pmethod="Achlioptas", hmethod="ward", scale=TRUE, seed=100, AC=TRUE)
> l4$overall.validity
[1] 0.8245833
> l4$validity
[1] 0.8911111 0.7755556 0.8250000 0.8066667

5 clusters partition:
> l5 <- Random.hclustering.validity(M, dim=subspace.dim, c=5, n=20, pmethod="Achlioptas", hmethod="ward", scale=TRUE, seed=100, AC=TRUE)
> l5$overall.validity
[1] 0.7097778
> l5$validity
[1] 0.7500000 0.7350000 0.6055556 0.7416667 0.7166667

10 clusters partition:
> l10 <- Random.hclustering.validity(M, dim=subspace.dim, c=10, n=20, pmethod="Achlioptas", hmethod="ward", scale=TRUE, seed=100, AC=TRUE)
> l10$overall.validity
[1] 0.3213333
> l10$validity
 [1] 0.3800000 0.1500000 0.3250000 0.2750000 0.4500000 0.2500000
 [7] 0.2833333 0.3166667 0.4000000 0.3833333

We know in advance that the correct number of clusters is 3. The stability indices correctly detect that the most likely clustering is composed of 3 clusters, and each cluster is highly reliable. Note that with the 2, 4 and 5-cluster partitions we obtain lower values of the stability indices, and with the 10-cluster partition the unnatural fragmentation of the clusters leads to very low values of the stability indices.
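This model-selection step can be automated by collecting the overall.validity values and picking the maximum. The snippet below uses the values obtained in the runs above:

```r
# Selecting the number of clusters by maximum overall stability,
# using the overall.validity values from the runs above:
overall <- c("2" = 0.9210526, "3" = 1.0000000, "4" = 0.8245833,
             "5" = 0.7097778, "10" = 0.3213333)
best.c <- as.integer(names(which.max(overall)))
best.c   # 3: the partition into 3 clusters is the most reliable
```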
The element l$AC of the list returned by Random.hclustering.validity is a matrix that reports the "confidence" with which we can assign an example to a cluster:
> l3$AC
      [,1] [,2] [,3]
 [1,]    0    1    0
 [2,]    0    1    0
 [3,]    0    1    0
 [4,]    0    1    0
 [5,]    0    1    0
 [6,]    0    1    0
 [7,]    0    1    0
 [8,]    0    1    0
 [9,]    0    1    0
[10,]    0    1    0
[11,]    1    0    0
[12,]    1    0    0
[13,]    1    0    0
[14,]    1    0    0
[15,]    1    0    0
[16,]    1    0    0
[17,]    1    0    0
[18,]    1    0    0
[19,]    1    0    0
[20,]    1    0    0
[21,]    0    0    1
[22,]    0    0    1
[23,]    0    0    1
[24,]    0    0    1
[25,]    0    0    1
[26,]    0    0    1
[27,]    0    0    1
[28,]    0    0    1
[29,]    0    0    1
[30,]    0    0    1

The rows refer to the examples and the columns to the clusters: in this case the assignment is highly reliable for every example, but in general the AC-index may take values between 0 (no reliable assignment) and 1 (highly reliable assignment).
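The AC matrix lends itself to simple post-processing, for instance assigning each example to its most confident cluster and flagging unreliable assignments. The toy matrix below stands in for l3$AC (where, in our run, every confidence is 1); the values and the 0.5 threshold are illustrative choices, not clusterv defaults.

```r
# Post-processing an AC matrix (toy values standing in for l3$AC):
AC <- matrix(c(0.90, 0.10, 0.00,
               0.05, 0.15, 0.80,
               0.40, 0.35, 0.25), nrow = 3, byrow = TRUE)
assignment <- apply(AC, 1, which.max)  # most confident cluster per example
confidence <- apply(AC, 1, max)        # the corresponding AC value
uncertain  <- which(confidence < 0.5)  # examples with unreliable assignment
```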
However, to perform a deeper and more systematic analysis it is preferable to use R scripts that automatically launch multiple instances of the function Random.hclustering.validity and automatically store the corresponding results for further analysis and visualization.
An example of such a script is downloadable from:
http://homes.dsi.unimi.it/valenti/SW/clusterv/examples/sample0.RSvalidity.R. This script performs a reliability analysis on a data set generated with the same generator used in our example, but using random subspace projections instead of Achlioptas projections.
The results are summarized in the following figure (Fig. 1), which represents the dendrogram of the clustering, and table (Tab. 1), where the corresponding validity indices are shown. Values marked S refer to the overall stability index, while the other values in the table represent the stability indices of the individual clusters. Each column reports the stability measures computed using random projections into subspaces of different dimensions, corresponding to different 1 + epsilon (eps) distortions according to the JL lemma (see Background on random projections in euclidean spaces).
Tab. 1. Stability index s.

| Clusters    | eps=0.5    | eps=0.4    | eps=0.3    | eps=0.2    | eps=0.1    |
|-------------|------------|------------|------------|------------|------------|
| 2 clusters  | S = 0.8631 | S = 0.8684 | S = 0.8684 | S = 0.9157 | S = 0.9421 |
| 1           | 1.0000     | 1.0000     | 1.0000     | 1.0000     | 1.0000     |
| 2           | 0.7263     | 0.7368     | 0.7368     | 0.8314     | 0.8842     |
| 3 clusters  | S = 1.0000 | S = 1.0000 | S = 1.0000 | S = 1.0000 | S = 1.0000 |
| 1           | 1.0000     | 1.0000     | 1.0000     | 1.0000     | 1.0000     |
| 2           | 1.0000     | 1.0000     | 1.0000     | 1.0000     | 1.0000     |
| 3           | 1.0000     | 1.0000     | 1.0000     | 1.0000     | 1.0000     |
| 5 clusters  | S = 0.7059 | S = 0.6843 | S = 0.7044 | S = 0.7004 | S = 0.7472 |
| 1           | 0.6973     | 0.7346     | 0.7293     | 0.6506     | 0.7560     |
| 2           | 0.6666     | 0.7066     | 0.6866     | 0.6466     | 0.7133     |
| 3           | 0.7155     | 0.7582     | 0.7448     | 0.7591     | 0.8364     |
| 4           | 0.7600     | 0.5600     | 0.6800     | 0.7400     | 0.7800     |
| 5           | 0.6900     | 0.6621     | 0.6814     | 0.7057     | 0.6507     |
| 10 clusters | S = 0.3093 | S = 0.3043 | S = 0.2651 | S = 0.3286 | S = 0.3936 |
| 1           | 0.0600     | 0.1200     | 0.0600     | 0.2000     | 0.2400     |
| 2           | 0.4260     | 0.3520     | 0.2900     | 0.3360     | 0.4560     |
| 3           | 0.1400     | 0.1600     | 0.1600     | 0.2000     | 0.1400     |
| 4           | 0.4066     | 0.3533     | 0.3200     | 0.3800     | 0.4200     |
| 5           | 0.3733     | 0.3000     | 0.2866     | 0.3600     | 0.4200     |
| 6           | 0.3276     | 0.3419     | 0.3285     | 0.3866     | 0.3933     |
| 7           | 0.3600     | 0.2800     | 0.3000     | 0.3600     | 0.3800     |
| 8           | 0.3000     | 0.3366     | 0.3066     | 0.3433     | 0.3866     |
| 9           | 0.3400     | 0.4000     | 0.2600     | 0.4200     | 0.5000     |
| 10          | 0.3600     | 0.4000     | 0.3400     | 0.3000     | 0.6000     |