

Introduction to the functionalities and the usage of mosclust

In this section we summarize the main functionalities provided by the package, then provide some examples of R scripts that introduce the usage of the mosclust R package in practical problems, using synthetic and DNA microarray data. For details about the individual functions implemented in the library, please see the Reference manual.

The R package mosclust implements stability methods for unsupervised structure discovery in bio-molecular data through a set of functionalities that may be summarized as follows:

  1. Functions to compute similarity measures between pairs of perturbed clusterings.
  2. Functions to compute similarity matrices using different data perturbation methods.
  3. Functions to compute similarity matrices with specific clustering algorithms.
  4. Functions to compute stability indices and p-values according to different statistical tests.
  5. Functions to perform tests of hypothesis to select k-clustering solutions significant at a given significance level.
  6. Graphical functions to plot histograms and empirical cumulative distribution functions of the similarity measures for different numbers of clusters, and to plot the p-values for different tests of hypothesis.
  7. Miscellaneous utility functions.

From a very general standpoint, to discover significant structures in a given data set, we first need to choose one of the functions at point 2. These functions perturb the data set multiple times with a specific perturbation procedure (resampling, random projection or noise injection). The do.similarity.xxx functions use one of the functions listed at item 1 (similarity measures) and one of those listed at item 3 (computation of similarities with a given clustering algorithm) to compute a matrix of similarity measures between pairs of perturbed k-clusterings for different numbers k of clusters.
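The perturb-and-compare step described above can be sketched in base R. This is only an illustration of the general idea, not the mosclust implementation: the helper names pair_agreement and similarity_by_resampling, the choice of k-means, the 80% subsampling fraction and the pair-agreement similarity are all assumptions made for this sketch, and no mosclust function is called.

```r
# Illustrative sketch in base R only; none of these helpers are mosclust functions.
set.seed(1)

# Synthetic data: three Gaussian clusters in two dimensions (rows = examples).
centers <- rbind(c(0, 0), c(6, 0), c(3, 6))
X <- do.call(rbind, lapply(1:3, function(i)
  matrix(rnorm(40 * 2, sd = 0.7), ncol = 2) +
    matrix(centers[i, ], nrow = 40, ncol = 2, byrow = TRUE)))

# Hypothetical similarity measure: fraction of example pairs on which two
# labelings agree (same cluster in both, or different clusters in both).
pair_agreement <- function(a, b) {
  same_a <- outer(a, a, "==")
  same_b <- outer(b, b, "==")
  up <- upper.tri(same_a)
  mean(same_a[up] == same_b[up])
}

# Hypothetical helper: returns a matrix with one row per number of clusters k
# and one column per pair of perturbed (subsampled) data sets.
similarity_by_resampling <- function(X, ks = 2:6, n_pairs = 20, frac = 0.8) {
  n <- nrow(X)
  m <- round(frac * n)
  S <- matrix(NA_real_, nrow = length(ks), ncol = n_pairs,
              dimnames = list(paste0("k=", ks), NULL))
  for (i in seq_along(ks)) {
    for (j in seq_len(n_pairs)) {
      # Two independent subsamples drawn without replacement.
      idx1 <- sample(n, m)
      idx2 <- sample(n, m)
      common <- intersect(idx1, idx2)
      c1 <- kmeans(X[idx1, , drop = FALSE], centers = ks[i], nstart = 5)$cluster
      c2 <- kmeans(X[idx2, , drop = FALSE], centers = ks[i], nstart = 5)$cluster
      # Compare the two clusterings on the examples shared by both subsamples.
      S[i, j] <- pair_agreement(c1[match(common, idx1)],
                                c2[match(common, idx2)])
    }
  }
  S
}

S <- similarity_by_resampling(X)
print(round(rowMeans(S), 3))
```

Each row of S collects the similarity values obtained for one value of k, which is the kind of matrix the subsequent steps operate on.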

Then we need to choose one of the functions at point 4 to compute the stability indices and the p-values associated with each specific k-clustering. At this point we can sort the k-clusterings from the most reliable to the least reliable according to the values of the stability indices.
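As a minimal sketch of the ranking step, assume a similarity matrix with one row per k and take the mean similarity of each row as the stability index. Both the toy values and the choice of the mean as index are assumptions made here for illustration; the indices actually computed by the package are described in the Reference manual.

```r
# Toy similarity matrix (made-up values, for illustration only):
# one row per number of clusters k, one column per pair of perturbed clusterings.
S <- rbind("k=2" = c(0.98, 0.99, 0.97, 1.00),
           "k=3" = c(0.91, 0.88, 0.95, 0.90),
           "k=4" = c(0.62, 0.70, 0.55, 0.66))

# Assumed stability index: the mean similarity observed for each k.
stability <- rowMeans(S)

# Sort the k-clusterings from the most reliable to the least reliable.
ranking <- names(sort(stability, decreasing = TRUE))
print(ranking)
```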

The next step is to assess the significance of the discovered structures. To this end we may choose one of the functions listed at item 5, which implement statistical tests to select k-clustering solutions significant at a given significance level. Note that with this approach more than one solution can be found, revealing multiple structures simultaneously present in the data (at a given significance level).
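The selection of significant k-clusterings can be illustrated with a small base R sketch. It assumes one possible test, chosen here only for illustration: for each k, count the similarity values above a threshold and compare these proportions with a chi-square test (prop.test), removing the least stable k until the remaining proportions are no longer significantly different. The function select_significant, the threshold 0.9, the level 0.01 and the simulated similarity values are all hypothetical and are not part of mosclust.

```r
set.seed(2)

# Simulated similarity values (illustrative only): 50 values per k.
S <- rbind("k=2" = rbeta(50, 40, 1),
           "k=3" = rbeta(50, 38, 1),
           "k=4" = rbeta(50, 8, 3),
           "k=5" = rbeta(50, 6, 4))

# Hypothetical helper: returns the names of the k-clusterings whose
# proportions of "high" similarity values are not significantly different
# from that of the most stable one, at level alpha.
select_significant <- function(S, threshold = 0.9, alpha = 0.01) {
  n <- ncol(S)
  above <- rowSums(S > threshold)            # successes per k
  keep <- order(rowMeans(S), decreasing = TRUE)
  while (length(keep) > 1) {
    p <- suppressWarnings(
      prop.test(above[keep], rep(n, length(keep)))$p.value)
    # Stop when the proportions are compatible (or the test is degenerate).
    if (is.na(p) || p >= alpha) break
    keep <- keep[-length(keep)]              # drop the least stable k
  }
  rownames(S)[keep]
}

selected <- select_significant(S)
print(selected)
```

When more than one name is returned, the corresponding k-clusterings are not distinguishable at the chosen level, which is how several structures can be reported at once.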

Finally, we can use the functions listed at item 6 to plot our experimental results.

Of course, this is only one of the possible ways to use the package; to make good use of all its functionalities, read the Reference manual and experiment with the functions yourself.

In the rest of this section we provide some examples of R scripts to introduce the usage of the mosclust R package in practical problems, using synthetic and DNA microarray data. The source code of the scripts and the related data are downloadable from: http://homes.dsi.unimi.it/valenti/SW/mosclust/examples.


