Recently, several methods based on the concept of stability have been proposed to estimate the "optimal" number of clusters in complex bio-molecular data [8,18,16,9,21]. In this conceptual framework multiple clusterings are obtained by introducing perturbations into the original data, and a clustering is considered reliable if it is approximately maintained across multiple perturbations.
Several perturbation techniques have been proposed, ranging from bootstrap techniques [15,4,18], to random projections onto lower-dimensional subspaces [20,7], to noise injection procedures [17]. All these perturbation techniques are implemented in mosclust.
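The three families of perturbation can be illustrated with a minimal sketch. It is written in plain Python for illustration only and does not reproduce the mosclust R code: the function names (`bootstrap_sample`, `random_projection`, `inject_noise`), the Gaussian projection matrix, and the Gaussian noise model are all assumptions made here.

```python
import random


def bootstrap_sample(data, rng):
    """Resample the rows of `data` with replacement (same number of rows).

    Returns the drawn row indices together with the resampled rows.
    """
    n = len(data)
    idx = [rng.randrange(n) for _ in range(n)]
    return idx, [data[i] for i in idx]


def random_projection(data, new_dim, rng):
    """Project each row onto `new_dim` random Gaussian directions.

    Assumption: entries of the projection matrix are N(0, 1) scaled by
    1/sqrt(new_dim); other random projection schemes exist.
    """
    old_dim = len(data[0])
    scale = 1.0 / new_dim ** 0.5
    proj = [[rng.gauss(0.0, 1.0) * scale for _ in range(old_dim)]
            for _ in range(new_dim)]
    return [[sum(p * x for p, x in zip(direction, row)) for direction in proj]
            for row in data]


def inject_noise(data, sigma, rng):
    """Add independent Gaussian noise with standard deviation `sigma`."""
    return [[x + rng.gauss(0.0, sigma) for x in row] for row in data]


# Hypothetical usage: three perturbed versions of a toy 10 x 3 data set.
rng = random.Random(0)
toy = [[float(i), float(i + 1), float(i + 2)] for i in range(10)]
_, resampled = bootstrap_sample(toy, rng)
projected = random_projection(toy, 2, rng)
noisy = inject_noise(toy, 0.1, rng)
```

Each perturbed data set would then be clustered separately, and the resulting clusterings compared with one another.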
The library implements indices of the stability (reliability) of clusterings. Each index is based on the distribution of similarity measures between multiple clusterings, where each clustering is computed on a different instance of the data obtained through a given random perturbation of the original data.
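As a rough illustration of how a stability score can be derived from similarities between clusterings, the sketch below averages a pairwise similarity over several label vectors. The choice of the Jaccard coefficient on co-membership pairs, the use of the plain mean as the score, and the function names are assumptions made for this example; the indices actually implemented in mosclust are described in [6,5].

```python
from itertools import combinations


def comembership_pairs(labels):
    """Set of item pairs (i, j), i < j, placed in the same cluster."""
    return {(i, j) for i, j in combinations(range(len(labels)), 2)
            if labels[i] == labels[j]}


def jaccard_similarity(labels_a, labels_b):
    """Jaccard coefficient between two clusterings of the same items.

    Invariant to renaming of cluster labels, since only co-membership
    of item pairs is compared.
    """
    pairs_a = comembership_pairs(labels_a)
    pairs_b = comembership_pairs(labels_b)
    union = pairs_a | pairs_b
    if not union:  # both clusterings consist of singletons only
        return 1.0
    return len(pairs_a & pairs_b) / len(union)


def stability_score(clusterings):
    """Mean pairwise similarity over a list of clusterings (assumed index)."""
    sims = [jaccard_similarity(a, b) for a, b in combinations(clusterings, 2)]
    return sum(sims) / len(sims)


# Hypothetical label vectors obtained from three perturbed data sets.
runs = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 1, 1]]
score = stability_score(runs)
```

A score close to 1 indicates that the same grouping is recovered under perturbation; scores computed for different numbers of clusters can then be compared.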
These indices provide a "score" that can be used to compare the reliability of different clusterings. Moreover, statistical tests, including one based on the classical Bernstein inequality [12], are implemented to assess the statistical significance of the discovered clustering solutions. With this approach, multiple structures simultaneously present in the data can also be found. For instance, the data may exhibit a hierarchical structure, with subclusters inside other clusters; using the indices and the statistical tests implemented in mosclust, such structures may be detected at a given significance level.
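For reference, the classical Bernstein inequality bounds the probability that the mean of n independent, zero-mean random variables, each bounded in absolute value by M and with average variance s2, exceeds a threshold eps: the probability is at most exp(-n * eps^2 / (2 * s2 + 2 * M * eps / 3)). The sketch below only evaluates this bound; how mosclust turns it into a test on the stability indices is not detailed in this section, and the function name and parameters are hypothetical.

```python
import math


def bernstein_bound(n, eps, variance, bound):
    """Upper bound on P(mean of n variables >= eps) from Bernstein's inequality.

    Assumes independent, zero-mean variables with |X_i| <= bound and
    average variance `variance`.
    """
    if n <= 0 or eps <= 0:
        raise ValueError("n and eps must be positive")
    exponent = -n * eps ** 2 / (2.0 * variance + 2.0 * bound * eps / 3.0)
    return math.exp(exponent)


# Hypothetical usage: the bound tightens as the number of variables grows.
loose = bernstein_bound(n=100, eps=0.1, variance=0.25, bound=1.0)
tight = bernstein_bound(n=1000, eps=0.1, variance=0.25, bound=1.0)
```

Because the bound holds without assumptions on the shape of the distribution, it is conservative: it may fail to reject where a test with stronger distributional assumptions would.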
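Summarizing, this package may be used to assess the reliability of a given clustering, to estimate the "optimal" number of clusters, to assess the statistical significance of a clustering solution, and to discover multiple structures underlying the data.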
Note that this package cannot be used to assess the reliability of an individual cluster inside a given clustering (to this end you may use the clusterv R package).
The next section provides a background on stability methods, with a brief description of the stability indices and the statistical tests implemented in the package. For more details, please see [6,5].
Then a brief introduction to the functionalities and the usage of mosclust is given.
To download the R software and documentation (comprising the tutorial and the reference manual in pdf format) go to the section Download software and documentation.
The statistical tests implemented in the package have been designed with the theoretical and methodological contribution of Alberto Bertoni (DSI, Università degli Studi di Milano).