Indeed, cluster analysis has been used to investigate structure in microarray data, for example in the search for new tumor taxonomies [2],[9],[16]. It provides a way to validate groups of patients against prior biological knowledge or to discover new "natural groups" in the data. However, clustering algorithms always find structure in the data, even when none is present. Hence we need methods that assess the validity of the discovered clusters, in order to test whether biologically meaningful clusters actually exist.
To assess the reliability of the discovered classes, clusterv provides a set of measures that estimate the stability of the clusters obtained by perturbing the original data set. The perturbation is achieved through random projections of the original high-dimensional data onto lower-dimensional subspaces. These projections approximately preserve the distances between examples, so that the data are not distorted too much. The random projections are repeated many times, and each time a new clustering is performed. The resulting clusterings are then compared with the clustering whose reliability we want to evaluate. Intuitively, a cluster is reliable if it is maintained across the multiple clusterings performed in the lower-dimensional subspaces. The measures provided by clusterv therefore evaluate the stability of the clusters across multiple random projections. By these measures we can assess:
Our approach is based on random projections in Euclidean spaces; the next section provides a brief overview of this topic. To learn more about our approach, please see [4]. A clusterv tutorial introduces the usage of the package and provides some examples of applying the stability measures to synthetic and real DNA microarray data. To download the R software and documentation (the tutorial and the reference manual in PDF format), go to the section Download software and documentation.
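The perturb-and-recluster idea described above can be illustrated with a small sketch. It is written in plain Python rather than R and does not use the clusterv package or its API. All function names (random_projection, two_means, cluster_stability, make_two_blobs) are hypothetical. The stability score used here is a simple pairwise co-membership average, which is an assumption made for illustration and not necessarily the measure implemented in clusterv. The Gaussian projection matrix and the tiny 2-means routine are likewise stand-ins chosen to keep the example self-contained.

```python
import math
import random


def random_projection(data, target_dim, rng):
    """Project each row of `data` to `target_dim` dimensions.

    Matrix entries are drawn from N(0, 1/target_dim), so squared Euclidean
    distances are preserved in expectation.
    """
    source_dim = len(data[0])
    scale = 1.0 / math.sqrt(target_dim)
    matrix = [[rng.gauss(0.0, scale) for _ in range(source_dim)]
              for _ in range(target_dim)]
    return [[sum(w * x for w, x in zip(row, point)) for row in matrix]
            for point in data]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_means(data, iterations=20):
    """Tiny 2-means (Lloyd's algorithm) with a deterministic farthest-point start."""
    first = data[0]
    second = max(data, key=lambda p: squared_distance(p, first))
    centroids = [first, second]
    labels = [0] * len(data)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        labels = [min((0, 1), key=lambda c: squared_distance(p, centroids[c]))
                  for p in data]
        # Recompute each centroid as the mean of its members.
        for c in (0, 1):
            members = [p for p, lab in zip(data, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_stability(data, labels, target_dim, n_projections, seed=0):
    """Illustrative score (an assumption, not clusterv's definition): for each
    original cluster, average over its pairs of points the fraction of
    projected clusterings that keep the pair in the same cluster."""
    rng = random.Random(seed)
    n = len(data)
    together = [[0] * n for _ in range(n)]
    for _ in range(n_projections):
        projected_labels = two_means(random_projection(data, target_dim, rng))
        for i in range(n):
            for j in range(i + 1, n):
                if projected_labels[i] == projected_labels[j]:
                    together[i][j] += 1
    scores = {}
    for cluster in sorted(set(labels)):
        members = [i for i in range(n) if labels[i] == cluster]
        pairs = [(i, j) for k, i in enumerate(members) for j in members[k + 1:]]
        if pairs:
            total = sum(together[i][j] for i, j in pairs)
            scores[cluster] = total / (len(pairs) * n_projections)
        else:
            scores[cluster] = 1.0  # convention chosen here for singleton clusters
    return scores


def make_two_blobs(n_per_cluster=20, dim=50, offset=5.0, seed=1):
    """Synthetic data: two well-separated Gaussian blobs in `dim` dimensions."""
    rng = random.Random(seed)
    blob_a = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
              for _ in range(n_per_cluster)]
    blob_b = [[rng.gauss(offset, 1.0) for _ in range(dim)]
              for _ in range(n_per_cluster)]
    return blob_a + blob_b


data = make_two_blobs()
labels = two_means(data)  # the clustering whose reliability we want to assess
scores = cluster_stability(data, labels, target_dim=10, n_projections=30)
print(scores)  # one stability value in [0, 1] per original cluster
```

With clearly separated synthetic clusters such as these, the scores should be close to 1. Clusters that exist only as an artifact of the algorithm would be expected to break apart under projection and score lower.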
The stability measures based on random projections implemented in the clusterv package were jointly designed by Alberto Bertoni (DSI, Università degli Studi di Milano) and Giorgio Valentini. The author of the clusterv package thanks Alberto Bertoni for his fundamental theoretical and methodological contributions.