It is worth noting that, even though the proposed stability indices may be applied to any clustering algorithm, their computation is based on clusterings of subspaces of Euclidean spaces, where the low-distortion condition is guaranteed only if Euclidean distances are used.
With this limitation in mind, the logical steps needed to compute the stability indices for a generic clustering algorithm Clust applied to a d-dimensional data set D are:
For each logical step, the clusterv functions to be used are listed below:
From an implementation point of view, Clust must return the clustering represented either as a vector v or as a list l. More precisely, the vector v should be an integer vector whose indices represent the labels of the examples and whose entries are the (integer) labels of the clusters to which the examples belong (such as the clustering component of the partition.object list of the cluster package).
The elements of the list l must be vectors. Each vector represents a different cluster of the clustering; the elements of the vector are the (integer) labels of the examples belonging to that cluster (such as the orig.cluster component of the list returned by, e.g., the functions Random.hclustering.validity or Random.PAM.validity of the clusterv package).
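To make the two representations concrete, here is a small hand-built example in plain R (not clusterv code) showing the same clustering of six examples in both forms:

```r
# The same clustering of 6 examples into 2 clusters, in both representations
v <- c(1, 1, 2, 2, 1, 2)          # vector form: v[i] is the cluster label of example i
l <- list(c(1, 2, 5), c(3, 4, 6)) # list form: l[[k]] holds the labels of the examples in cluster k
```

Here examples 1, 2 and 5 belong to cluster 1 and examples 3, 4 and 6 to cluster 2, in both representations.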
Note that the functions implemented in clusterv require clusterings represented as lists. However, if your clustering algorithm outputs a vector v, you can use the function Transform.vector.to.list to convert it. Consider, for instance, that the kmeans function of the stats package returns a vector for the computed k-means clustering. To convert the vector to a list:
> M <- generate.sample0(n=10, m=2, sigma=1, dim=500)
> # transforming a clustering vector obtained with kmeans to a list
> r <- kmeans(t(M), 3, 100);
> clustering.list.kmeans <- Transform.vector.to.list(r$cluster);
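For reference, the conversion performed by Transform.vector.to.list can be sketched in a few lines of plain R (an illustrative re-implementation, not the clusterv source):

```r
# Illustrative sketch (not the clusterv implementation): convert a clustering
# vector into the list representation required by the clusterv functions
vector.to.list.sketch <- function(v) {
  # one vector per cluster label, holding the indices of the examples in that cluster
  lapply(sort(unique(v)), function(k) which(v == k))
}
```

For example, vector.to.list.sketch(c(1, 2, 1, 3)) returns a list of three vectors: c(1, 3), 2 and 4.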
As an example of how to write a function that computes the stability indices for a generic clustering algorithm, consider the code of the function Random.kmeans.validity, which computes the stability indices for the clusterings obtained with the k-means algorithm (note that Step 1, Step 2, etc. refer to the logical steps listed above):
Random.kmeans.validity <- function(M, dim, pmethod="PMO", c=3, it.max=1000,
                                   n=50, scale=TRUE, seed=-1, AC=TRUE) {
  dim.Sim.M <- ncol(M);
  if (seed == -1)
    seed <- round(runif(1, 1, 10000));
  # Step 1. Computing the clusters in the original space
  r <- kmeans(t(M), c, iter.max = it.max);
  cl.orig <- Transform.vector.to.list(r$cluster);
  # Steps 2 and 3. Perform multiple random projections and multiple
  # clusterings on the resulting projected data
  cl <- Multiple.Random.kmeans(M=M, dim=dim, pmethod=pmethod, c=c, n=n,
                               it.max=it.max, scale=scale, seed=seed);
  # Step 4. Compute the similarity matrix
  Sim.M <- Do.similarity.matrix(cl, dim.Sim.M);
  # Step 5. Computing the validity indices vi
  c <- length(cl.orig);  # it corrects for empty clusters
  vi <- Validity.indices(cl.orig, c, Sim.M);
  # Step 6. Computing the overall validity of the clustering
  ov.vi <- sum(vi)/c;
  # Steps 7 and 8. Computing the AC indices and storing the results in a list
  if (AC == TRUE) {
    ac <- AC.index(cl.orig, c, Sim.M);
    res <- list(validity=vi, overall.validity=ov.vi, similarity.matrix=Sim.M,
                dimension=dim, cluster=cl, orig.cluster=cl.orig, AC=ac);
  } else
    res <- list(validity=vi, overall.validity=ov.vi, similarity.matrix=Sim.M,
                dimension=dim, cluster=cl, orig.cluster=cl.orig);
  return(res);
}

For the meaning of the input parameters of the above function, please see the Reference manual. Finally, the code of the function Multiple.Random.kmeans, used for Steps 2 and 3 inside Random.kmeans.validity, is the following (see the Reference manual for the meaning of the input parameters):
Multiple.Random.kmeans <- function(M, dim, pmethod="PMO", c=3, n=50,
                                   it.max=1000, scale=TRUE, seed=100) {
  set.seed(seed);
  cl <- list();
  for (i in 1:n) {
    # A. Selection of the randomized map
    P.M <- switch(pmethod,
      RS = random.subspace(d=dim, M, scaling=scale),
      PMO = Plus.Minus.One.random.projection(d=dim, M, scaling=scale),
      Norm = norm.random.projection(d=dim, M, scaling=scale),
      Achlioptas = Achlioptas.random.projection(d=dim, M, scaling=scale),
      stop("Multiple.Random.kmeans: not supported random projection.", call.=FALSE));
    # B. Clustering on the projected data
    r <- kmeans(t(P.M), c, iter.max = it.max);
    cl[[i]] <- Transform.vector.to.list(r$cluster);
  }
  return(cl);
}
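The loop above depends on the clusterv projection functions. The following self-contained sketch reproduces its structure for the RS (random subspace) case only, replacing random.subspace with a naive feature-sampling step (an assumption for illustration; it omits the scaling that the clusterv projections can apply):

```r
set.seed(100)
M <- matrix(rnorm(500 * 30), nrow = 500)  # 500 features (rows) x 30 examples (columns)
n <- 5; dim <- 50; c <- 3                 # n projections, projected dimension, number of clusters

cl <- vector("list", n)
for (i in 1:n) {
  # Naive random subspace: keep dim randomly chosen features (rows of M);
  # clusterv's random.subspace can additionally rescale the projected data
  P.M <- M[sample(nrow(M), dim), , drop = FALSE]
  r <- kmeans(t(P.M), c, iter.max = 100)
  # vector -> list representation, as Transform.vector.to.list would do
  cl[[i]] <- split(seq_along(r$cluster), r$cluster)
}
```

The resulting cl, a list of n clusterings in the list representation, is exactly the shape expected by Do.similarity.matrix in Step 4.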