
Overview of the clusterv R package

The clusterv R package implements a set of functions to assess the reliability of clusters discovered by clustering algorithms [19]. The package is tailored to the analysis of high dimensional data, and in particular it is designed for assessing the reliability of clusters discovered in DNA microarray data.

Cluster analysis has been used to investigate structure in microarray data, for example in the search for new tumor taxonomies [2],[9],[16]. It provides a way to validate groups of patients against prior biological knowledge or to discover new "natural groups" in the data. However, clustering algorithms always find structure in the data, even when no structure is actually present. Hence we need methods for assessing the validity of the discovered clusters, in order to test whether biologically meaningful clusters exist.

To assess the reliability of the discovered classes, clusterv provides a set of measures that estimate the stability of the clusters obtained by perturbing the original data set. The perturbation is achieved through random projections of the original high dimensional data to lower dimensional subspaces. These projections approximately preserve the distances between examples, so that the data are not distorted too much. The random projections are repeated many times, and each time a new clustering is performed. The resulting clusterings are then compared with the clustering whose reliability we want to evaluate. Intuitively, a cluster is reliable if it is maintained across the multiple clusterings performed in the lower dimensional subspaces. The measures provided by clusterv are based on the evaluation of the stability of the clusters across multiple random projections. With these measures we can assess:

the reliability of individual clusters within a clustering
the reliability of the overall clustering (that is, an estimate of the "optimal" number of clusters)
the confidence with which each example may be assigned to each cluster
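
The procedure described above (project, re-cluster, compare) can be illustrated with a minimal sketch. It is written in plain Python and does not use the clusterv API. The Gaussian projection matrix, the k-means clustering, the synthetic data, and all function names are assumptions made for illustration only. The score computed here is a simplified pairwise measure in the spirit of the stability measures, not the exact definitions implemented in the package.

```python
import math
import random

def random_projection(data, target_dim, rng):
    """Project each row of `data` to `target_dim` dimensions (assumed Gaussian matrix)."""
    dim = len(data[0])
    scale = 1.0 / math.sqrt(target_dim)
    # Entries drawn from N(0, 1/target_dim): squared distances are preserved in expectation.
    proj = [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(target_dim)]
    return [[sum(p * x for p, x in zip(row, point)) for row in proj] for point in data]

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(data, k, iters=20):
    """Plain Lloyd k-means with farthest-point initialisation; returns one label per row."""
    centers = [data[0]]
    while len(centers) < k:
        centers.append(max(data, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = [0] * len(data)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in data]
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:  # keep the old centre if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def cluster_stability(data, labels, k, target_dim, n_proj, rng):
    """For each original cluster, the fraction of its member pairs that are
    still clustered together, averaged over `n_proj` random projections."""
    runs = [kmeans(random_projection(data, target_dim, rng), k) for _ in range(n_proj)]
    scores = []
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        pairs = [(i, j) for i in members for j in members if i < j]
        kept = sum(run[i] == run[j] for run in runs for i, j in pairs)
        scores.append(kept / (len(pairs) * n_proj) if pairs else float("nan"))
    return scores

rng = random.Random(0)
# Hypothetical synthetic data: two groups of 15 examples in 100 dimensions.
data = [[rng.gauss(shift, 1.0) for _ in range(100)]
        for shift in (0.0, 3.0) for _ in range(15)]
labels = kmeans(data, 2)  # the clustering whose reliability we want to assess
stability = cluster_stability(data, labels, k=2, target_dim=20, n_proj=10, rng=rng)
print(stability)  # one score in [0, 1] per original cluster
```

A score close to 1 means that the members of a cluster stay together in almost all projected clusterings. A score well below 1 suggests that the cluster is not maintained under perturbation.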

Our approach is based on random projections in Euclidean spaces; the next section provides a brief overview of this topic. To learn more about our approach, please see [4]. A clusterv tutorial introduces the usage of the package and provides some examples of applications of the stability measures to synthetic and real DNA microarray data. To download the R software and documentation (comprising the tutorial and the reference manual in PDF format), go to the section Download software and documentation.

The stability measures based on random projections implemented in the clusterv package have been jointly designed by Alberto Bertoni (DSI, Università degli Studi di Milano) and Giorgio Valentini. The author of the clusterv package thanks Alberto Bertoni for his fundamental theoretical and methodological contributions.


Giorgio 2006-08-16