next up previous
Next: Software download and documentation Up: Introduction to the functionalities Previous: An example of the

Subsections


Examples of the usage of mosclust with DNA microarray data

In order to exemplify the usage of the mosclust package with DNA icroarray data, we consider two classical gene expression data sets, the Leukemia data set [10] and the Lymphoma gene expression data set [2].


Analysis of the Leukemia data set

As a first example of the application of mosclust to the analysis of bio-molecular data we consider the classical Golub's et al. Leukemia data set [10]. The Leukemia data set is composed by a group of 25 acute myeloid leukemia (AML) samples and another group of 47 acute lymphoblastic leukemia (ALL) samples, that can be subdivided into 38 B-Cell and 9 T-Cell subgroups, resulting in a two-level hierarchical structure.

After applying the same pre-processing steps performed by the authors of the original Leukemia study, in this example we selected only the first 100 genes with the highest variance for further analysis.

The leukemia data set Leukemia.filtered.norm.hv (in R readable binary format) is downloadable from: http://homes.dsi.unimi.it/valenti/SW/mosclust/examples.

The example source code of the R script to analyze the example DNA microarray data is the following (note that the execution of the script may require several minutes on a desktop computer):

##############################################################################################################
# doLeukemia.small.R
# September 2006
# 
# Script to evaluate the number of cluster in the classical Golub's ALL-AML Leukemia data set (using 100 genes
# with the highest variance) through the r package mosclust.
# PAM, k-means and hierarchical clustering algorithms are used using PMO random projections and subsampling
# techniques to perturb the data.
# The chi-square based, Bernstein and the hybrid two-steps statistical test are applied to assess the 
# significance of the clustering solutions
##############################################################################################################
library(Biobase);
library(mosclust);


# parameters
nsubsamples <- 100;  # number of pairs od clusterings to be evaluated
max.num.clust <- 10; # maximum number of cluster to be evaluated
fract.resampled <- 0.8; # fraction of samples to be subsampled
dim.projection <- 80; # dimension of te projected suspace


# Loading Golub's Leukemia data set (limited to the 100 genes with the highest variance across samples)
load("Leukemia.filtered.norm.hv");
M <- Leukemia.filtered.norm.hv@exprs;



##############################################
#### Kmeans clustering               #########
##############################################

## Resampling
Sr.kmeans.Leukemia.small <- do.similarity.resampling(M, c=max.num.clust, nsub=nsubsamples, f=fract.resampled, s=sFM, 
                                      alg.clust.sim=Kmeans.sim.resampling);
pr.kmeans.Leukemia.small.Bernstein <- Bernstein.compute.pvalues(Sr.kmeans.Leukemia.small);
pr.kmeans.Leukemia.small.ind.Bernstein <- Bernstein.ind.compute.pvalues(Sr.kmeans.Leukemia.small);
pr07.kmeans.Leukemia.small.chi.sq <- Chi.square.compute.pvalues(Sr.kmeans.Leukemia.small,s0=0.7);
pr09.kmeans.Leukemia.small.chi.sq <- Chi.square.compute.pvalues(Sr.kmeans.Leukemia.small,s0=0.9);
lr07.kmeans.Leukemia.small <- Hybrid.testing(Sr.kmeans.Leukemia.small, alpha=0.01, s0=0.7);
lr09.kmeans.Leukemia.small <- Hybrid.testing(Sr.kmeans.Leukemia.small, alpha=0.01, s0=0.9);

 
## PMO Random projections
Sp.kmeans.Leukemia.small <- do.similarity.projection(M, c=max.num.clust, nprojections=nsubsamples, dim=dim.projection, pmethod="PMO",  
                                      alg.clust.sim=Kmeans.sim.projection);
pp.kmeans.Leukemia.small.Bernstein <- Bernstein.compute.pvalues(Sp.kmeans.Leukemia.small);
pp.kmeans.Leukemia.small.ind.Bernstein <- Bernstein.ind.compute.pvalues(Sp.kmeans.Leukemia.small);
pp07.kmeans.Leukemia.small.chi.sq <- Chi.square.compute.pvalues(Sp.kmeans.Leukemia.small,s0=0.7);
pp09.kmeans.Leukemia.small.chi.sq <- Chi.square.compute.pvalues(Sp.kmeans.Leukemia.small,s0=0.9);
lp07.kmeans.Leukemia.small <- Hybrid.testing(Sp.kmeans.Leukemia.small, alpha=0.01, s0=0.7);
lp09.kmeans.Leukemia.small <- Hybrid.testing(Sp.kmeans.Leukemia.small, alpha=0.01, s0=0.9);


# saving objects 
save(Sr.kmeans.Leukemia.small,pr.kmeans.Leukemia.small.Bernstein,pr.kmeans.Leukemia.small.ind.Bernstein,pr07.kmeans.Leukemia.small.chi.sq,
pr09.kmeans.Leukemia.small.chi.sq,lr07.kmeans.Leukemia.small,lr09.kmeans.Leukemia.small,
Sp.kmeans.Leukemia.small,pp.kmeans.Leukemia.small.Bernstein,pp.kmeans.Leukemia.small.ind.Bernstein,pp07.kmeans.Leukemia.small.chi.sq,
pp09.kmeans.Leukemia.small.chi.sq,lp07.kmeans.Leukemia.small,lp09.kmeans.Leukemia.small,
file="Leukemia.small.kmeans.objects");

##############################################
## Hierarchical clustering            ########
##############################################

## Resampling
Sr.HC.Leukemia.small <- do.similarity.resampling(M, c=max.num.clust, nsub=nsubsamples, f=fract.resampled, s=sFM, 
                                      alg.clust.sim=Hierarchical.sim.resampling);
pr.HC.Leukemia.small.Bernstein <- Bernstein.compute.pvalues(Sr.HC.Leukemia.small);
pr.HC.Leukemia.small.ind.Bernstein <- Bernstein.ind.compute.pvalues(Sr.HC.Leukemia.small);
pr07.HC.Leukemia.small.chi.sq <- Chi.square.compute.pvalues(Sr.HC.Leukemia.small,s0=0.7);
pr09.HC.Leukemia.small.chi.sq <- Chi.square.compute.pvalues(Sr.HC.Leukemia.small,s0=0.9);
lr07.HC.Leukemia.small <- Hybrid.testing(Sr.HC.Leukemia.small, alpha=0.01, s0=0.7);
lr09.HC.Leukemia.small <- Hybrid.testing(Sr.HC.Leukemia.small, alpha=0.01, s0=0.9);

 
## PMO Random projections
Sp.HC.Leukemia.small <- do.similarity.projection(M, c=max.num.clust, nprojections=nsubsamples, dim=dim.projection, pmethod="PMO",  
                                      alg.clust.sim=Hierarchical.sim.projection);
pp.HC.Leukemia.small.Bernstein <- Bernstein.compute.pvalues(Sp.HC.Leukemia.small);
pp.HC.Leukemia.small.ind.Bernstein <- Bernstein.ind.compute.pvalues(Sp.HC.Leukemia.small);
pp07.HC.Leukemia.small.chi.sq <- Chi.square.compute.pvalues(Sp.HC.Leukemia.small,s0=0.7);
pp09.HC.Leukemia.small.chi.sq <- Chi.square.compute.pvalues(Sp.HC.Leukemia.small,s0=0.9);
lp07.HC.Leukemia.small <- Hybrid.testing(Sp.HC.Leukemia.small, alpha=0.01, s0=0.7);
lp09.HC.Leukemia.small <- Hybrid.testing(Sp.HC.Leukemia.small, alpha=0.01, s0=0.9);

# saving objects 
save(Sr.HC.Leukemia.small,pr.HC.Leukemia.small.Bernstein,pr.HC.Leukemia.small.ind.Bernstein,pr07.HC.Leukemia.small.chi.sq,
pr09.HC.Leukemia.small.chi.sq,lr07.HC.Leukemia.small,lr09.HC.Leukemia.small,
Sp.HC.Leukemia.small,pp.HC.Leukemia.small.Bernstein,pp.HC.Leukemia.small.ind.Bernstein,pp07.HC.Leukemia.small.chi.sq,
pp09.HC.Leukemia.small.chi.sq,lp07.HC.Leukemia.small,lp09.HC.Leukemia.small,
file="Leukemia.small.HC.objects");

##############################################
## PAM clustering                     ########
##############################################

## Resampling
Sr.PAM.Leukemia.small <- do.similarity.resampling(M, c=max.num.clust, nsub=nsubsamples, f=fract.resampled, s=sFM, 
                                      alg.clust.sim=PAM.sim.resampling);
pr.PAM.Leukemia.small.Bernstein <- Bernstein.compute.pvalues(Sr.PAM.Leukemia.small);
pr.PAM.Leukemia.small.ind.Bernstein <- Bernstein.ind.compute.pvalues(Sr.PAM.Leukemia.small);
pr07.PAM.Leukemia.small.chi.sq <- Chi.square.compute.pvalues(Sr.PAM.Leukemia.small,s0=0.7);
pr09.PAM.Leukemia.small.chi.sq <- Chi.square.compute.pvalues(Sr.PAM.Leukemia.small,s0=0.9);
lr07.PAM.Leukemia.small <- Hybrid.testing(Sr.PAM.Leukemia.small, alpha=0.01, s0=0.7);
lr09.PAM.Leukemia.small <- Hybrid.testing(Sr.PAM.Leukemia.small, alpha=0.01, s0=0.9);

 
## PMO Random projections
Sp.PAM.Leukemia.small <- do.similarity.projection(M, c=max.num.clust, nprojections=nsubsamples, dim=dim.projection, pmethod="PMO",  
                                      alg.clust.sim=PAM.sim.projection);
pp.PAM.Leukemia.small.Bernstein <- Bernstein.compute.pvalues(Sp.PAM.Leukemia.small);
pp.PAM.Leukemia.small.ind.Bernstein <- Bernstein.ind.compute.pvalues(Sp.PAM.Leukemia.small);
pp07.PAM.Leukemia.small.chi.sq <- Chi.square.compute.pvalues(Sp.PAM.Leukemia.small,s0=0.7);
pp09.PAM.Leukemia.small.chi.sq <- Chi.square.compute.pvalues(Sp.PAM.Leukemia.small,s0=0.9);
lp07.PAM.Leukemia.small <- Hybrid.testing(Sp.PAM.Leukemia.small, alpha=0.01, s0=0.7);
lp09.PAM.Leukemia.small <- Hybrid.testing(Sp.PAM.Leukemia.small, alpha=0.01, s0=0.9);


# saving objects 
save(Sr.PAM.Leukemia.small,pr.PAM.Leukemia.small.Bernstein,pr.PAM.Leukemia.small.ind.Bernstein,pr07.PAM.Leukemia.small.chi.sq,
pr09.PAM.Leukemia.small.chi.sq,lr07.PAM.Leukemia.small,lr09.PAM.Leukemia.small,
Sp.PAM.Leukemia.small,pp.PAM.Leukemia.small.Bernstein,pp.PAM.Leukemia.small.ind.Bernstein,pp07.PAM.Leukemia.small.chi.sq,
pp09.PAM.Leukemia.small.chi.sq,lp07.PAM.Leukemia.small,lp09.PAM.Leukemia.small,
file="Leukemia.small.PAM.objects");


###################################
##### Writing results on text file
###################################
sink("Leukemia.small.res");
cat("** Experiments with Leukemia data **\n\n");
######## Kmeans
cat("------------------------\n")
cat("---- Kmeans clustering subsampling\n\n");
cat("Leukemia.small kmeans subsampling Bernstein p-values: \n");
print(pr.kmeans.Leukemia.small.Bernstein);  
cat("Leukemia.small kmeans subsampling Bernstein independent p-values: \n");
print(pr.kmeans.Leukemia.small.ind.Bernstein);
cat("Leukemia.small kmeans subsampling Chi square p-values s=0.7: \n");
print(pr07.kmeans.Leukemia.small.chi.sq);     
cat("Leukemia.small kmeans subsampling Chi square p-values s=0.9: \n");
print(pr09.kmeans.Leukemia.small.chi.sq);    
cat("Leukemia.small kmeans subsampling clusterings selected by the Hybrid test s=0.7: \n");
print(lr07.kmeans.Leukemia.small$selected.res);    
cat("Leukemia.small kmeans subsampling clusterings selected by the Hybrid test s=0.9: \n");
print(lr09.kmeans.Leukemia.small$selected.res);    

cat("------------------------\n")
cat("---- Kmeans clustering projections\n\n");
cat("Leukemia.small kmeans projections Bernstein p-values: \n");
print(pp.kmeans.Leukemia.small.Bernstein);  
cat("Leukemia.small kmeans projections Bernstein independent p-values: \n");
print(pp.kmeans.Leukemia.small.ind.Bernstein);
cat("Leukemia.small kmeans projections Chi square p-values s=0.7: \n");
print(pp07.kmeans.Leukemia.small.chi.sq);     
cat("Leukemia.small kmeans projections Chi square p-values s=0.9: \n");
print(pp09.kmeans.Leukemia.small.chi.sq);    
cat("Leukemia.small kmeans projections clusterings selected by the Hybrid test s=0.7: \n");
print(lp07.kmeans.Leukemia.small$selected.res);    
cat("Leukemia.small kmeans projections clusterings selected by the Hybrid test s=0.9: \n");
print(lp09.kmeans.Leukemia.small$selected.res);    

######## HC
cat("------------------------\n")
cat("---- HC clustering subsampling\n\n");
cat("Leukemia.small HC subsampling Bernstein p-values: \n");
print(pr.HC.Leukemia.small.Bernstein);  
cat("Leukemia.small HC subsampling Bernstein independent p-values: \n");
print(pr.HC.Leukemia.small.ind.Bernstein);
cat("Leukemia.small HC subsampling Chi square p-values s=0.7: \n");
print(pr07.HC.Leukemia.small.chi.sq);     
cat("Leukemia.small HC subsampling Chi square p-values s=0.9: \n");
print(pr09.HC.Leukemia.small.chi.sq);    
cat("Leukemia.small HC subsampling clusterings selected by the Hybrid test s=0.7: \n");
print(lr07.HC.Leukemia.small$selected.res);    
cat("Leukemia.small HC subsampling clusterings selected by the Hybrid test s=0.9: \n");
print(lr09.HC.Leukemia.small$selected.res);    

cat("------------------------\n")
cat("---- HC clustering projections\n\n");
cat("Leukemia.small HC projections Bernstein p-values: \n");
print(pp.HC.Leukemia.small.Bernstein);  
cat("Leukemia.small HC projections Bernstein independent p-values: \n");
print(pp.HC.Leukemia.small.ind.Bernstein);
cat("Leukemia.small HC projections Chi square p-values s=0.7: \n");
print(pp07.HC.Leukemia.small.chi.sq);     
cat("Leukemia.small HC projections Chi square p-values s=0.9: \n");
print(pp09.HC.Leukemia.small.chi.sq);    
cat("Leukemia.small HC projections clusterings selected by the Hybrid test s=0.7: \n");
print(lp07.HC.Leukemia.small$selected.res);    
cat("Leukemia.small HC projections clusterings selected by the Hybrid test s=0.9: \n");
print(lp09.HC.Leukemia.small$selected.res);      

 
######## PAM
cat("------------------------\n")
cat("---- PAM clustering subsampling\n\n");
cat("Leukemia.small PAM subsampling Bernstein p-values: \n");
print(pr.PAM.Leukemia.small.Bernstein);  
cat("Leukemia.small PAM subsampling Bernstein independent p-values: \n");
print(pr.PAM.Leukemia.small.ind.Bernstein);
cat("Leukemia.small PAM subsampling Chi square p-values s=0.7: \n");
print(pr07.PAM.Leukemia.small.chi.sq);     
cat("Leukemia.small PAM subsampling Chi square p-values s=0.9: \n");
print(pr09.PAM.Leukemia.small.chi.sq);    
cat("Leukemia.small PAM subsampling clusterings selected by the Hybrid test s=0.7: \n");
print(lr07.PAM.Leukemia.small$selected.res);    
cat("Leukemia.small PAM subsampling clusterings selected by the Hybrid test s=0.9: \n");
print(lr09.PAM.Leukemia.small$selected.res);    

cat("------------------------\n")
cat("---- PAM clustering projections\n\n");
cat("Leukemia.small PAM projections Bernstein p-values: \n");
print(pp.PAM.Leukemia.small.Bernstein);  
cat("Leukemia.small PAM projections Bernstein independent p-values: \n");
print(pp.PAM.Leukemia.small.ind.Bernstein);
cat("Leukemia.small PAM projections Chi square p-values s=0.7: \n");
print(pp07.PAM.Leukemia.small.chi.sq);     
cat("Leukemia.small PAM projections Chi square p-values s=0.9: \n");
print(pp09.PAM.Leukemia.small.chi.sq);    
cat("Leukemia.small PAM projections clusterings selected by the Hybrid test s=0.7: \n");
print(lp07.PAM.Leukemia.small$selected.res);    
cat("Leukemia.small PAM projections clusterings selected by the Hybrid test s=0.9: \n");
print(lp09.PAM.Leukemia.small$selected.res);      

sink();
########################################################################################

The functions do.similarity.xxx perform multiple random pertubation of the data by resampling (do.similarity.resampling) or random projections (do.similarity.projection) and compute multiple similarity values for different number of clusters, storing the results in a matrix (e.g. Sr.kmeans.Leukemia.small or Sp.kmeans.Leukemia.small). Then these matrices with the computed similarity values between pairs of clusterings are used by the functions that compute stability indices and p-values according to different statistical tests. Each of these functions compute the stability index for each k-clustering and the associated p-value between the given k-clustering and the other clusterings with a larger stability index, according to the Bernstein-based statistical tests (Bernstein.compute.pvalues, Bernstein.ind.compute.pvalues) and the $\chi^2$-based statistical test (Chi.square.compute.pvalues) (see Statistical tests to assess the significance of clustering solutions for more details). Finally the function Hybrid.testing selects the k-clusterings significant at a given significance level, using at first the Bernstein based test to single out the k-clustering that significantly differ from the top ranked k-clustering (making no assumptions between the distribution of the similarity indices) and then applying the $\chi^2$-based statistical test on the remaining top-ranked clusterings to refine the subset of statistically significant k-clusterings (in this case implicitly assuming a normal distribution of the similarity indices).

The detailed results of the analysis are stored in the lists returned by the functions that computes the stability indices and the p-values (see the reference manuals for details). The results may be also represented through the graphical functions provided the package. For instance, Fig. 2 represents the empirical cumulative distribution functions of the similarity measures for different number of clusters obtained through resampling techniques. These graphs can be obtained through the function plot.cumulative.multiple.

The statistical tests select as significant 2 and 3-clusterings, but also looking at Fig. 2 we may deduce that 2 and 3 clusters are more reliable than others. Indeed the empirical cumulative distributions for 2 and 3 clusters are clearly below the others, showing that 2 and 3 clusters are more reliable. Moreover the corresponding graphs cross several times, while the other ecdf are clearly apart from them (Fig. 2).

Figure 2: Leukemia data set. Empirical cumulative distribution functions of the similarity measures for different number of clusters $k$ using resampling techniques.
\includegraphics[width = 14.0cm]{ecdf.kmeans.sub.Leukemia-4.small.eps}

The p.values computed according to the different proposed statistical tests may be plotted for a visual comparison using the function plot.pvalues.

For instance, to plot the p-values previously computed through Bernstein and $\chi^2$-based statistical tests, using random projections and k-means, PAM and hierarchical clustering algorithms, we may use the following R code:

 
test.lab <- c("Bernstein", "Bernstein ind.", "chi sq. t=0.7", "chi sq. t=0.9");
# kmeans 
l <- list(pp.kmeans.Leukemia.small.Bernstein, pp.kmeans.Leukemia.small.ind.Bernstein, pp07.kmeans.Leukemia.small.chi.sq, pp09.kmeans.Leukemia.small.chi.sq);
x11(); plot.pvalues(l,alpha=1e-03,legendy=1e-80, leg_label=test.lab);

# PAM 
l <- list(pp.PAM.Leukemia.small.Bernstein, pp.PAM.Leukemia.small.ind.Bernstein, pp07.PAM.Leukemia.small.chi.sq, pp09.PAM.Leukemia.small.chi.sq);
x11(); plot.pvalues(l,alpha=1e-03,legendy=1e-80, leg_label=test.lab);

# hierarchical clustering
l <- list(pp.HC.Leukemia.small.Bernstein, pp.HC.Leukemia.small.ind.Bernstein, pp07.HC.Leukemia.small.chi.sq, pp09.HC.Leukemia.small.chi.sq);
x11(); plot.pvalues(l,alpha=1e-03,legendy=1e-50, leg_label=test.lab);

The resulting graphs are plotted below:

Figure 3: Plot of the p-values computed according to different statistical tests. (a) K-means (b) PAM (c) hierarchical clustering.
\includegraphics[width = 14.0cm]{p-values.kmeans.eps}
(a)
\includegraphics[width = 14.0cm]{p-values.PAM.eps}
(b)
\includegraphics[width = 14.0cm]{p-values.HC.eps}
(c)

In Fig. 3 in abscissa is represented the number of clusters sorted from the most reliable to the least reliable (with respect to the stability indices computed by random projections and using a specific clustering algorithm). In ordinate are represented the p-values in log scale. The results relative to two variants of the Bernstein-based test and the $\chi^2$-based statistical test using as threshold t=0.7 and t=0.9 are plotted. Moreover a straight horizontal dashed line, representing a given significance level (in this case alpha=0.001) is plotted. K-clusterings above the dashed line are significant, that is their reliability significantly differ from the k-clusterings below the dashed horizontal line. Note that the three clustering algorithms provide a slighthly different ranking of the stability indices, but in most cases 2 and 3 clusterings are considered significantly more reliable than the others, according to the biological meaning of the data: two classes ALL and AML of leukemia with ALL splittable in two other B-cell and T-cell subclusters, resulting in a two-level hierarchical structure. Note the the Bernstein test, due to its more general assumptions (no particular distributions and no independence are assumed for the random variables that represent the similarity between clusterings) is less selective (in the sense that it may consider reliable a larger number of k-clusterings) than the $\chi^2$-based test that make assumptions about the distribution of the random variables. Assuming independence between the random variables, we obtain a more selective Bernstein inequality-based test (red lines in Fig. 3).


Analysis of the Lymphoma data set

The Lymphoma gene expression data set [2] comprises three different lymphoid malignancies: Diffuse Large B-Cell Lymphoma (DLBCL), Follicular Lymphoma (FL) and Chronic Lymphocytic Leukemia (CLL). The data are obtained from a cDNA microarray specialized for genes related to lymphoid diseases, the Lymphochip, which provides expression levels for 4026 genes [3]. The 62 available samples are subdivided in 42 DLBCL, 11 CLL and 9 FL. We performed pre-processing of the data according to [2], replacing missing values with $0$ and then normalizing the data to zero mean and unit variance across genes. We considered both resampling techniques and random projections to perturb the data. In particular, data have been resampled by randomly drawing 80% of the available data without replacement, and data have been projected using PMO projections with $\epsilon = 0.2$ corresponding to $413$-dimensional subspaces. The k-means, hierarchical and PAM clustering algorithm has been applied.

The Lymphoma data set AlizadehLange.norm (in R readable binary format) is downloadable from: http://homes.dsi.unimi.it/valenti/SW/mosclust/examples.

The example source code of the R script to analyze the example DNA microarray data is the following (note that the execution of the script may require several minutes on a desktop computer):

 
##############################################################################################################
# doLymphoma.R
# November 2006
# 
# Script to evaluate the number of cluster in the classical Alizadeh et al. Lymphoma data set (using 4026 genes
# from the specialized cDNA microarray Lymphochip) through the R package mosclust.
# PAM, k-means and hierarchical clustering algorithms are used using PMO random projections and subsampling
# techniques to perturb the data.
# The chi-square based, Bernstein and the hybrid two-steps statistical test are applied to assess the 
# significance of the clustering solutions
##############################################################################################################
library(Biobase);
library(mosclust);

# parameters
nsubsamples <- 100;  # number of pairs od clusterings to be evaluated
max.num.clust <- 10; # maximum number of cluster to be evaluated
fract.resampled <- 0.8; # fraction of samples to subsampled
dim.projection <- JL.predict.dim(62,epsilon=0.2);


# Data set preparation
load("AlizadehLange.norm");
M <- AlizadehLange.norm@exprs;



##############################################
#### Kmeans clustering               #########
##############################################

## Resampling
Sr.kmeans.AlizadehLange <- do.similarity.resampling(M, c=max.num.clust, nsub=nsubsamples, f=fract.resampled, s=sFM, 
                                      alg.clust.sim=Kmeans.sim.resampling);
pr.kmeans.AlizadehLange.Bernstein <- Bernstein.compute.pvalues(Sr.kmeans.AlizadehLange);
pr.kmeans.AlizadehLange.ind.Bernstein <- Bernstein.ind.compute.pvalues(Sr.kmeans.AlizadehLange);
pr07.kmeans.AlizadehLange.chi.sq <- Chi.square.compute.pvalues(Sr.kmeans.AlizadehLange,s0=0.7);
pr09.kmeans.AlizadehLange.chi.sq <- Chi.square.compute.pvalues(Sr.kmeans.AlizadehLange,s0=0.9);
lr07.kmeans.AlizadehLange <- Hybrid.testing(Sr.kmeans.AlizadehLange, alpha=0.01, s0=0.7);
lr09.kmeans.AlizadehLange <- Hybrid.testing(Sr.kmeans.AlizadehLange, alpha=0.01, s0=0.9);

 
## PMO Random projections
Sp.kmeans.AlizadehLange <- do.similarity.projection(M, c=max.num.clust, nprojections=nsubsamples, dim=dim.projection, pmethod="PMO",  
                                      alg.clust.sim=Kmeans.sim.projection);
pp.kmeans.AlizadehLange.Bernstein <- Bernstein.compute.pvalues(Sp.kmeans.AlizadehLange);
pp.kmeans.AlizadehLange.ind.Bernstein <- Bernstein.ind.compute.pvalues(Sp.kmeans.AlizadehLange);
pp07.kmeans.AlizadehLange.chi.sq <- Chi.square.compute.pvalues(Sp.kmeans.AlizadehLange,s0=0.7);
pp09.kmeans.AlizadehLange.chi.sq <- Chi.square.compute.pvalues(Sp.kmeans.AlizadehLange,s0=0.9);
lp07.kmeans.AlizadehLange <- Hybrid.testing(Sp.kmeans.AlizadehLange, alpha=0.01, s0=0.7);
lp09.kmeans.AlizadehLange <- Hybrid.testing(Sp.kmeans.AlizadehLange, alpha=0.01, s0=0.9);


# saving objects 
save(Sr.kmeans.AlizadehLange,pr.kmeans.AlizadehLange.Bernstein,pr.kmeans.AlizadehLange.ind.Bernstein,pr07.kmeans.AlizadehLange.chi.sq,
pr09.kmeans.AlizadehLange.chi.sq,lr07.kmeans.AlizadehLange,lr09.kmeans.AlizadehLange,
Sp.kmeans.AlizadehLange,pp.kmeans.AlizadehLange.Bernstein,pp.kmeans.AlizadehLange.ind.Bernstein,pp07.kmeans.AlizadehLange.chi.sq,
pp09.kmeans.AlizadehLange.chi.sq,lp07.kmeans.AlizadehLange,lp09.kmeans.AlizadehLange,
file="AlizadehLange.kmeans.objects");

##############################################
## Hierarchical clustering            ########
##############################################

## Resampling
Sr.HC.AlizadehLange <- do.similarity.resampling(M, c=max.num.clust, nsub=nsubsamples, f=fract.resampled, s=sFM, 
                                      alg.clust.sim=Hierarchical.sim.resampling);
pr.HC.AlizadehLange.Bernstein <- Bernstein.compute.pvalues(Sr.HC.AlizadehLange);
pr.HC.AlizadehLange.ind.Bernstein <- Bernstein.ind.compute.pvalues(Sr.HC.AlizadehLange);
pr07.HC.AlizadehLange.chi.sq <- Chi.square.compute.pvalues(Sr.HC.AlizadehLange,s0=0.7);
pr09.HC.AlizadehLange.chi.sq <- Chi.square.compute.pvalues(Sr.HC.AlizadehLange,s0=0.9);
lr07.HC.AlizadehLange <- Hybrid.testing(Sr.HC.AlizadehLange, alpha=0.01, s0=0.7);
lr09.HC.AlizadehLange <- Hybrid.testing(Sr.HC.AlizadehLange, alpha=0.01, s0=0.9);

 
## PMO Random projections
Sp.HC.AlizadehLange <- do.similarity.projection(M, c=max.num.clust, nprojections=nsubsamples, dim=dim.projection, pmethod="PMO",  
                                      alg.clust.sim=Hierarchical.sim.projection);
pp.HC.AlizadehLange.Bernstein <- Bernstein.compute.pvalues(Sp.HC.AlizadehLange);
pp.HC.AlizadehLange.ind.Bernstein <- Bernstein.ind.compute.pvalues(Sp.HC.AlizadehLange);
pp07.HC.AlizadehLange.chi.sq <- Chi.square.compute.pvalues(Sp.HC.AlizadehLange,s0=0.7);
pp09.HC.AlizadehLange.chi.sq <- Chi.square.compute.pvalues(Sp.HC.AlizadehLange,s0=0.9);
lp07.HC.AlizadehLange <- Hybrid.testing(Sp.HC.AlizadehLange, alpha=0.01, s0=0.7);
lp09.HC.AlizadehLange <- Hybrid.testing(Sp.HC.AlizadehLange, alpha=0.01, s0=0.9);

# saving objects 
save(Sr.HC.AlizadehLange,pr.HC.AlizadehLange.Bernstein,pr.HC.AlizadehLange.ind.Bernstein,pr07.HC.AlizadehLange.chi.sq,
pr09.HC.AlizadehLange.chi.sq,lr07.HC.AlizadehLange,lr09.HC.AlizadehLange,
Sp.HC.AlizadehLange,pp.HC.AlizadehLange.Bernstein,pp.HC.AlizadehLange.ind.Bernstein,pp07.HC.AlizadehLange.chi.sq,
pp09.HC.AlizadehLange.chi.sq,lp07.HC.AlizadehLange,lp09.HC.AlizadehLange,
file="AlizadehLange.HC.objects");

##############################################
## PAM clustering                     ########
##############################################

## Resampling
Sr.PAM.AlizadehLange <- do.similarity.resampling(M, c=max.num.clust, nsub=nsubsamples, f=fract.resampled, s=sFM, 
                                      alg.clust.sim=PAM.sim.resampling);
pr.PAM.AlizadehLange.Bernstein <- Bernstein.compute.pvalues(Sr.PAM.AlizadehLange);
pr.PAM.AlizadehLange.ind.Bernstein <- Bernstein.ind.compute.pvalues(Sr.PAM.AlizadehLange);
pr07.PAM.AlizadehLange.chi.sq <- Chi.square.compute.pvalues(Sr.PAM.AlizadehLange,s0=0.7);
pr09.PAM.AlizadehLange.chi.sq <- Chi.square.compute.pvalues(Sr.PAM.AlizadehLange,s0=0.9);
lr07.PAM.AlizadehLange <- Hybrid.testing(Sr.PAM.AlizadehLange, alpha=0.01, s0=0.7);
lr09.PAM.AlizadehLange <- Hybrid.testing(Sr.PAM.AlizadehLange, alpha=0.01, s0=0.9);

 
## PMO Random projections
Sp.PAM.AlizadehLange <- do.similarity.projection(M, c=max.num.clust, nprojections=nsubsamples, dim=dim.projection, pmethod="PMO",  
                                      alg.clust.sim=PAM.sim.projection);
pp.PAM.AlizadehLange.Bernstein <- Bernstein.compute.pvalues(Sp.PAM.AlizadehLange);
pp.PAM.AlizadehLange.ind.Bernstein <- Bernstein.ind.compute.pvalues(Sp.PAM.AlizadehLange);
pp07.PAM.AlizadehLange.chi.sq <- Chi.square.compute.pvalues(Sp.PAM.AlizadehLange,s0=0.7);
pp09.PAM.AlizadehLange.chi.sq <- Chi.square.compute.pvalues(Sp.PAM.AlizadehLange,s0=0.9);
lp07.PAM.AlizadehLange <- Hybrid.testing(Sp.PAM.AlizadehLange, alpha=0.01, s0=0.7);
lp09.PAM.AlizadehLange <- Hybrid.testing(Sp.PAM.AlizadehLange, alpha=0.01, s0=0.9);


# saving objects 
save(Sr.PAM.AlizadehLange,pr.PAM.AlizadehLange.Bernstein,pr.PAM.AlizadehLange.ind.Bernstein,pr07.PAM.AlizadehLange.chi.sq,
pr09.PAM.AlizadehLange.chi.sq,lr07.PAM.AlizadehLange,lr09.PAM.AlizadehLange,
Sp.PAM.AlizadehLange,pp.PAM.AlizadehLange.Bernstein,pp.PAM.AlizadehLange.ind.Bernstein,pp07.PAM.AlizadehLange.chi.sq,
pp09.PAM.AlizadehLange.chi.sq,lp07.PAM.AlizadehLange,lp09.PAM.AlizadehLange,
file="AlizadehLange.PAM.objects");


###################################
##### Writing results on text file
###################################
sink("AlizadehLange.res");
cat("** Experiments with AlizadehLange data **\n\n");
######## Kmeans
cat("------------------------\n")
cat("---- Kmeans clustering subsampling\n\n");
cat("AlizadehLange kmeans subsampling Bernstein p-values: \n");
print(pr.kmeans.AlizadehLange.Bernstein);  
cat("AlizadehLange kmeans subsampling Bernstein independent p-values: \n");
print(pr.kmeans.AlizadehLange.ind.Bernstein);
cat("AlizadehLange kmeans subsampling Chi square p-values s=0.7: \n");
print(pr07.kmeans.AlizadehLange.chi.sq);     
cat("AlizadehLange kmeans subsampling Chi square p-values s=0.9: \n");
print(pr09.kmeans.AlizadehLange.chi.sq);    
cat("AlizadehLange kmeans subsampling clusterings selected by the Hybrid test s=0.7: \n");
print(lr07.kmeans.AlizadehLange$selected.res);    
cat("AlizadehLange kmeans subsampling clusterings selected by the Hybrid test s=0.9: \n");
print(lr09.kmeans.AlizadehLange$selected.res);    

cat("------------------------\n")
cat("---- Kmeans clustering projections\n\n");
cat("AlizadehLange kmeans projections Bernstein p-values: \n");
print(pp.kmeans.AlizadehLange.Bernstein);  
cat("AlizadehLange kmeans projections Bernstein independent p-values: \n");
print(pp.kmeans.AlizadehLange.ind.Bernstein);
cat("AlizadehLange kmeans projections Chi square p-values s=0.7: \n");
print(pp07.kmeans.AlizadehLange.chi.sq);     
cat("AlizadehLange kmeans projections Chi square p-values s=0.9: \n");
print(pp09.kmeans.AlizadehLange.chi.sq);    
cat("AlizadehLange kmeans projections clusterings selected by the Hybrid test s=0.7: \n");
print(lp07.kmeans.AlizadehLange$selected.res);    
cat("AlizadehLange kmeans projections clusterings selected by the Hybrid test s=0.9: \n");
print(lp09.kmeans.AlizadehLange$selected.res);    

######## HC
cat("------------------------\n")
cat("---- HC clustering subsampling\n\n");
cat("AlizadehLange HC subsampling Bernstein p-values: \n");
print(pr.HC.AlizadehLange.Bernstein);  
cat("AlizadehLange HC subsampling Bernstein independent p-values: \n");
print(pr.HC.AlizadehLange.ind.Bernstein);
cat("AlizadehLange HC subsampling Chi square p-values s=0.7: \n");
print(pr07.HC.AlizadehLange.chi.sq);     
cat("AlizadehLange HC subsampling Chi square p-values s=0.9: \n");
print(pr09.HC.AlizadehLange.chi.sq);    
cat("AlizadehLange HC subsampling clusterings selected by the Hybrid test s=0.7: \n");
print(lr07.HC.AlizadehLange$selected.res);    
cat("AlizadehLange HC subsampling clusterings selected by the Hybrid test s=0.9: \n");
print(lr09.HC.AlizadehLange$selected.res);    

cat("------------------------\n")
cat("---- HC clustering projections\n\n");
cat("AlizadehLange HC projections Bernstein p-values: \n");
print(pp.HC.AlizadehLange.Bernstein);  
cat("AlizadehLange HC projections Bernstein independent p-values: \n");
print(pp.HC.AlizadehLange.ind.Bernstein);
cat("AlizadehLange HC projections Chi square p-values s=0.7: \n");
print(pp07.HC.AlizadehLange.chi.sq);     
cat("AlizadehLange HC projections Chi square p-values s=0.9: \n");
print(pp09.HC.AlizadehLange.chi.sq);    
cat("AlizadehLange HC projections clusterings selected by the Hybrid test s=0.7: \n");
print(lp07.HC.AlizadehLange$selected.res);    
cat("AlizadehLange HC projections clusterings selected by the Hybrid test s=0.9: \n");
print(lp09.HC.AlizadehLange$selected.res);      

 
######## PAM
cat("------------------------\n")
cat("---- PAM clustering subsampling\n\n");
cat("AlizadehLange PAM subsampling Bernstein p-values: \n");
print(pr.PAM.AlizadehLange.Bernstein);  
cat("AlizadehLange PAM subsampling Bernstein independent p-values: \n");
print(pr.PAM.AlizadehLange.ind.Bernstein);
cat("AlizadehLange PAM subsampling Chi square p-values s=0.7: \n");
print(pr07.PAM.AlizadehLange.chi.sq);     
cat("AlizadehLange PAM subsampling Chi square p-values s=0.9: \n");
print(pr09.PAM.AlizadehLange.chi.sq);    
cat("AlizadehLange PAM subsampling clusterings selected by the Hybrid test s=0.7: \n");
print(lr07.PAM.AlizadehLange$selected.res);    
cat("AlizadehLange PAM subsampling clusterings selected by the Hybrid test s=0.9: \n");
print(lr09.PAM.AlizadehLange$selected.res);    

cat("------------------------\n")
cat("---- PAM clustering projections\n\n");
cat("AlizadehLange PAM projections Bernstein p-values: \n");
print(pp.PAM.AlizadehLange.Bernstein);  
cat("AlizadehLange PAM projections Bernstein independent p-values: \n");
print(pp.PAM.AlizadehLange.ind.Bernstein);
cat("AlizadehLange PAM projections Chi square p-values s=0.7: \n");
print(pp07.PAM.AlizadehLange.chi.sq);     
cat("AlizadehLange PAM projections Chi square p-values s=0.9: \n");
print(pp09.PAM.AlizadehLange.chi.sq);    
cat("AlizadehLange PAM projections clusterings selected by the Hybrid test s=0.7: \n");
print(lp07.PAM.AlizadehLange$selected.res);    
cat("AlizadehLange PAM projections clusterings selected by the Hybrid test s=0.9: \n");
print(lp09.PAM.AlizadehLange$selected.res);      


sink();
#########################################################################################

The functions do.similarity.resampling and do.similarity.projection, compute multiple similarity values for different number of clusters, storing the results in the matrices Sr.***.AlizadehLange (subsampling techniques) or Sp.***.AlizadehLange (random projections), in a way similar to the previous example. Then these matrices with the computed similarity values between pairs of clusterings are used by the functions that compute stability indices and p-values according to different statistical tests, using both Bernstein-based statistical tests, and the $\chi^2$-based statistical test (functions Bernstein.compute.pvalues, Bernstein.ind.compute.pvalues and Chi.square.compute.pvalues. Finally the function Hybrid.testing selects the k-clusterings significant at a given significance level, using at first the Bernstein based test to single out the k-clustering that significantly differ from the top ranked k-clustering (making no assumptions between the distribution of the similarity indices) and then applying the $\chi^2$-based statistical test on the remaining top-ranked clusterings to refine the subset of statistically significant k-clusterings (in this case implicitly assuming a normal distribution of the similarity indices, see the reference manual for details).

Detailed results about the computed p-values and the detected significant clusterings (printed by the last part of the script, starting from "Writing results on text file") using k-means, hierarchical and PAM clusterings and resampling and random projections techniques to perturb the data, are reported below. For each pair clustering algorithm - type of data perturbation, results relative to the Bernstein test, the Bernstein test with assumptiom of independence, the $\chi^2$-based statistical test with threshold set at 0.7 and 0.9 are shown (see Sect. 2.3 for details). For each statistical test and for each pair clustering algorithm - type of data perturbation the p-value, the mean of the computed stability measure and the corresponding variance are reported, and the clusterings are ranked from the most to the least significant; at the end of each pair clustering algorithm - type of data the clusterings detected significant at 0.01 significance level by the Hybrid test are reported. Note that in all cases the 2-clustering is considered significant. Only using the hierarchical clustering algorithm (Ward's method) and perturbation through random projections we can detect the 2 and 3-clustering as equally significant at 0.01 significance level.

 

****************************************************************
** p-values computed according to different statistical tests **

------------------------
---- Kmeans clustering subsampling

AlizadehLange kmeans subsampling Bernstein p-values: 
  ordered.clusterings      p.value     means    variance
1                   2 1.000000e+00 0.9908277 0.004175165
2                   3 8.666348e-13 0.7605658 0.014520040
3                  10 6.696925e-17 0.7028087 0.010882106
4                   4 6.578817e-18 0.6760998 0.013932829
5                   9 3.303310e-18 0.6729627 0.013726663
6                   7 1.371911e-18 0.6629595 0.013233187
7                   5 9.952097e-19 0.6628799 0.014514355
8                   8 4.258909e-19 0.6612229 0.011356406
9                   6 2.718664e-19 0.6522468 0.017046131
AlizadehLange kmeans subsampling Bernstein independent p-values: 
  ordered.clusterings       p.value     means    variance
1                   2  1.000000e+00 0.9908277 0.004175165
2                   3  8.665678e-13 0.7605658 0.014520040
3                  10  5.233241e-29 0.7028087 0.010882106
4                   4  1.714151e-46 0.6760998 0.013932829
5                   9  3.310710e-64 0.6729627 0.013726663
6                   7  1.247150e-82 0.6629595 0.013233187
7                   5 7.100259e-101 0.6628799 0.014514355
8                   8 1.093614e-119 0.6612229 0.011356406
9                   6 2.973169e-138 0.6522468 0.017046131
AlizadehLange kmeans subsampling Chi square p-values s=0.7: 
  ordered.clusterings      p.value     means    variance
1                   2 1.000000e+00 0.9908277 0.004175165
2                   3 7.437384e-13 0.7605658 0.014520040
3                  10 2.808864e-14 0.7028087 0.010882106
4                   4 0.000000e+00 0.6760998 0.013932829
5                   9 0.000000e+00 0.6729627 0.013726663
6                   7 0.000000e+00 0.6629595 0.013233187
7                   5 0.000000e+00 0.6628799 0.014514355
8                   8 0.000000e+00 0.6612229 0.011356406
9                   6 0.000000e+00 0.6522468 0.017046131
AlizadehLange kmeans subsampling Chi square p-values s=0.9: 
  ordered.clusterings p.value     means    variance
1                   2       1 0.9908277 0.004175165
2                   3       0 0.7605658 0.014520040
3                  10       0 0.7028087 0.010882106
4                   4       0 0.6760998 0.013932829
5                   9       0 0.6729627 0.013726663
6                   7       0 0.6629595 0.013233187
7                   5       0 0.6628799 0.014514355
8                   8       0 0.6612229 0.011356406
9                   6       0 0.6522468 0.017046131
AlizadehLange kmeans subsampling clusterings selected by the Hybrid test s=0.7: 
  ordered.clusterings p.value     means    variance
1                   2       1 0.9908277 0.004175165
AlizadehLange kmeans subsampling clusterings selected by the Hybrid test s=0.9: 
  ordered.clusterings p.value     means    variance
1                   2       1 0.9908277 0.004175165

------------------------
---- Kmeans clustering projections

AlizadehLange kmeans projections Bernstein p-values: 
  ordered.clusterings      p.value     means     variance
1                   2 1.000000e+00 0.9814441 0.0003408922
2                   3 1.473335e-12 0.7641770 0.0138719640
3                   4 4.912472e-17 0.6965926 0.0125162004
4                   5 3.770498e-18 0.6732710 0.0152785300
5                  10 2.094206e-20 0.6513283 0.0088911097
6                   6 6.505633e-21 0.6423731 0.0086476607
7                   7 3.048242e-21 0.6366460 0.0088810577
8                   8 1.438292e-21 0.6312703 0.0099562656
9                   9 3.481603e-22 0.6311055 0.0070838350
AlizadehLange kmeans projections Bernstein independent p-values: 
  ordered.clusterings       p.value     means     variance
1                   2  1.000000e+00 0.9814441 0.0003408922
2                   3  1.473286e-12 0.7641770 0.0138719640
3                   4  6.681975e-29 0.6965926 0.0125162004
4                   5  2.505444e-46 0.6732710 0.0152785300
5                  10  3.616967e-66 0.6513283 0.0088911097
6                   6  1.250527e-86 0.6423731 0.0086476607
7                   7 2.013285e-107 0.6366460 0.0088810577
8                   8 2.194746e-128 0.6312703 0.0099562656
9                   9 7.641234e-150 0.6311055 0.0070838350
AlizadehLange kmeans projections Chi square p-values s=0.7: 
  ordered.clusterings      p.value     means     variance
1                   2 1.000000e+00 0.9814441 0.0003408922
2                   3 1.887379e-15 0.7641770 0.0138719640
3                   4 0.000000e+00 0.6965926 0.0125162004
4                   5 0.000000e+00 0.6732710 0.0152785300
5                  10 0.000000e+00 0.6513283 0.0088911097
6                   6 0.000000e+00 0.6423731 0.0086476607
7                   7 0.000000e+00 0.6366460 0.0088810577
8                   8 0.000000e+00 0.6312703 0.0099562656
9                   9 0.000000e+00 0.6311055 0.0070838350
AlizadehLange kmeans projections Chi square p-values s=0.9: 
  ordered.clusterings p.value     means     variance
1                   2       1 0.9814441 0.0003408922
2                   3       0 0.7641770 0.0138719640
3                   4       0 0.6965926 0.0125162004
4                   5       0 0.6732710 0.0152785300
5                  10       0 0.6513283 0.0088911097
6                   6       0 0.6423731 0.0086476607
7                   7       0 0.6366460 0.0088810577
8                   8       0 0.6312703 0.0099562656
9                   9       0 0.6311055 0.0070838350
AlizadehLange kmeans projections clusterings selected by the Hybrid test s=0.7: 
  ordered.clusterings p.value    means     variance
1                   2       1 0.981444 0.0003408922
AlizadehLange kmeans projections clusterings selected by the Hybrid test s=0.9: 
  ordered.clusterings p.value    means     variance
1                   2       1 0.981444 0.0003408922

------------------------
---- HC clustering subsampling

AlizadehLange HC subsampling Bernstein p-values: 
  ordered.clusterings      p.value     means     variance
1                   2 1.000000e+00 0.9848977 0.0006425551
2                   3 6.905720e-03 0.9232085 0.0170380900
3                   5 4.870046e-10 0.8083010 0.0129431910
4                   7 3.724811e-11 0.7789919 0.0093123798
5                   4 3.531829e-11 0.7724449 0.0222737279
6                  10 3.567320e-13 0.7693848 0.0063627506
7                   6 1.952906e-13 0.7592265 0.0096774674
8                   9 7.715196e-14 0.7580952 0.0087278023
9                   8 5.626276e-15 0.7455932 0.0068558439
AlizadehLange HC subsampling Bernstein independent p-values: 
  ordered.clusterings      p.value     means     variance
1                   2 1.000000e+00 0.9848977 0.0006425551
2                   3 6.905720e-03 0.9232085 0.0170380900
3                   5 3.105892e-12 0.8083010 0.0129431910
4                   7 5.993822e-24 0.7789919 0.0093123798
5                   4 2.095533e-34 0.7724449 0.0222737279
6                  10 3.383059e-47 0.7693848 0.0063627506
7                   6 3.996699e-60 0.7592265 0.0096774674
8                   9 2.858666e-73 0.7580952 0.0087278023
9                   8 1.608364e-87 0.7455932 0.0068558439
AlizadehLange HC subsampling Chi square p-values s=0.7: 
  ordered.clusterings      p.value     means     variance
1                   2 1.000000e+00 0.9848977 0.0006425551
2                   3 8.687712e-06 0.9232085 0.0170380900
3                   5 1.724031e-05 0.8083010 0.0129431910
4                   7 1.009175e-05 0.7789919 0.0093123798
5                   4 1.297296e-11 0.7724449 0.0222737279
6                  10 2.621936e-11 0.7693848 0.0063627506
7                   6 8.403667e-11 0.7592265 0.0096774674
8                   9 2.101905e-10 0.7580952 0.0087278023
9                   8 4.273791e-10 0.7455932 0.0068558439
AlizadehLange HC subsampling Chi square p-values s=0.9: 
  ordered.clusterings      p.value     means     variance
1                   2 1.000000e+00 0.9848977 0.0006425551
2                   3 2.428467e-06 0.9232085 0.0170380900
3                   5 0.000000e+00 0.8083010 0.0129431910
4                   7 0.000000e+00 0.7789919 0.0093123798
5                   4 0.000000e+00 0.7724449 0.0222737279
6                  10 0.000000e+00 0.7693848 0.0063627506
7                   6 0.000000e+00 0.7592265 0.0096774674
8                   9 0.000000e+00 0.7580952 0.0087278023
9                   8 0.000000e+00 0.7455932 0.0068558439
AlizadehLange HC subsampling clusterings selected by the Hybrid test s=0.7: 
  ordered.clusterings p.value     means     variance
1                   2       1 0.9848977 0.0006425551
AlizadehLange HC subsampling clusterings selected by the Hybrid test s=0.9: 
  ordered.clusterings p.value     means     variance
1                   2       1 0.9848977 0.0006425551

------------------------
---- HC clustering projections

AlizadehLange HC projections Bernstein p-values: 
  ordered.clusterings      p.value     means     variance
1                   2 1.000000e+00 0.9799573 0.0004290850
2                   3 9.934397e-01 0.9796000 0.0004217074
3                   4 9.469176e-14 0.7418705 0.0144147301
4                   5 8.811393e-15 0.7312696 0.0122120630
5                   7 4.038489e-17 0.7017409 0.0055530857
6                   6 3.096409e-17 0.7015686 0.0082543686
7                  10 4.877904e-18 0.6948260 0.0052640030
8                   8 1.891878e-18 0.6876439 0.0051231205
9                   9 9.278911e-19 0.6872879 0.0051605767
AlizadehLange HC projections Bernstein independent p-values: 
  ordered.clusterings       p.value     means     variance
1                   2  1.000000e+00 0.9799573 0.0004290850
2                   3  9.934397e-01 0.9796000 0.0004217074
3                   4  8.531697e-14 0.7418705 0.0144147301
4                   5  7.483158e-28 0.7312696 0.0122120630
5                   7  7.049728e-45 0.7017409 0.0055530857
6                   6  1.839006e-61 0.7015686 0.0082543686
7                  10  5.491318e-79 0.6948260 0.0052640030
8                   8  5.293558e-97 0.6876439 0.0051231205
9                   9 4.911845e-115 0.6872879 0.0051605767
AlizadehLange HC projections Chi square p-values s=0.7: 
  ordered.clusterings p.value     means     variance
1                   2       1 0.9799573 0.0004290850
2                   3       1 0.9796000 0.0004217074
3                   4       0 0.7418705 0.0144147301
4                   5       0 0.7312696 0.0122120630
5                   7       0 0.7017409 0.0055530857
6                   6       0 0.7015686 0.0082543686
7                  10       0 0.6948260 0.0052640030
8                   8       0 0.6876439 0.0051231205
9                   9       0 0.6872879 0.0051605767
AlizadehLange HC projections Chi square p-values s=0.9: 
  ordered.clusterings p.value     means     variance
1                   2       1 0.9799573 0.0004290850
2                   3       1 0.9796000 0.0004217074
3                   4       0 0.7418705 0.0144147301
4                   5       0 0.7312696 0.0122120630
5                   7       0 0.7017409 0.0055530857
6                   6       0 0.7015686 0.0082543686
7                  10       0 0.6948260 0.0052640030
8                   8       0 0.6876439 0.0051231205
9                   9       0 0.6872879 0.0051605767
AlizadehLange HC projections clusterings selected by the Hybrid test s=0.7: 
  ordered.clusterings p.value     means     variance
1                   2       1 0.9799573 0.0004290850
2                   3       1 0.9796000 0.0004217074
AlizadehLange HC projections clusterings selected by the Hybrid test s=0.9: 
  ordered.clusterings p.value     means     variance
1                   2       1 0.9799573 0.0004290850
2                   3       1 0.9796000 0.0004217074

------------------------
---- PAM clustering subsampling

AlizadehLange PAM subsampling Bernstein p-values: 
  ordered.clusterings      p.value     means     variance
1                   2 1.000000e+00 0.9878272 0.0007288978
2                   3 1.698901e-07 0.8289074 0.0273066512
3                  10 9.185600e-13 0.7717029 0.0107885005
4                   7 1.918046e-13 0.7430427 0.0111158863
5                   4 1.799158e-13 0.7303123 0.0263896611
6                   8 1.319592e-15 0.7271536 0.0105294084
7                   9 3.938336e-16 0.7229678 0.0090813439
8                   6 9.771977e-17 0.7053756 0.0131500390
9                   5 5.706574e-18 0.6868620 0.0130158837
AlizadehLange PAM subsampling Bernstein independent p-values: 
  ordered.clusterings       p.value     means     variance
1                   2  1.000000e+00 0.9878272 0.0007288978
2                   3  1.698892e-07 0.8289074 0.0273066512
3                  10  1.234679e-19 0.7717029 0.0107885005
4                   7  1.467876e-33 0.7430427 0.0111158863
5                   4  2.621572e-46 0.7303123 0.0263896611
6                   8  2.426944e-61 0.7271536 0.0105294084
7                   9  7.186515e-77 0.7229678 0.0090813439
8                   6  6.612542e-93 0.7053756 0.0131500390
9                   5 3.773496e-110 0.6868620 0.0130158837
AlizadehLange PAM subsampling Chi square p-values s=0.7: 
  ordered.clusterings      p.value     means     variance
1                   2 1.000000e+00 0.9878272 0.0007288978
2                   3 1.076916e-14 0.8289074 0.0273066512
3                  10 2.152722e-13 0.7717029 0.0107885005
4                   7 1.225686e-13 0.7430427 0.0111158863
5                   4 0.000000e+00 0.7303123 0.0263896611
6                   8 1.110223e-16 0.7271536 0.0105294084
7                   9 1.110223e-16 0.7229678 0.0090813439
8                   6 1.110223e-16 0.7053756 0.0131500390
9                   5 0.000000e+00 0.6868620 0.0130158837
AlizadehLange PAM subsampling Chi square p-values s=0.9: 
  ordered.clusterings      p.value     means     variance
1                   2 1.000000e+00 0.9878272 0.0007288978
2                   3 5.828671e-14 0.8289074 0.0273066512
3                  10 0.000000e+00 0.7717029 0.0107885005
4                   7 0.000000e+00 0.7430427 0.0111158863
5                   4 0.000000e+00 0.7303123 0.0263896611
6                   8 0.000000e+00 0.7271536 0.0105294084
7                   9 0.000000e+00 0.7229678 0.0090813439
8                   6 0.000000e+00 0.7053756 0.0131500390
9                   5 0.000000e+00 0.6868620 0.0130158837
AlizadehLange PAM subsampling clusterings selected by the Hybrid test s=0.7: 
  ordered.clusterings p.value     means     variance
1                   2       1 0.9878272 0.0007288978
AlizadehLange PAM subsampling clusterings selected by the Hybrid test s=0.9: 
  ordered.clusterings p.value     means     variance
1                   2       1 0.9878272 0.0007288978

------------------------
---- PAM clustering projections

AlizadehLange PAM projections Bernstein p-values: 
  ordered.clusterings      p.value     means     variance
1                   2 1.000000e+00 0.9823205 0.0004814611
2                   3 3.390246e-08 0.8184958 0.0229302928
3                  10 3.999673e-18 0.6995391 0.0040424940
4                   7 1.356796e-18 0.6876611 0.0056988548
5                   9 4.838407e-19 0.6863448 0.0036930730
6                   8 1.664215e-19 0.6786969 0.0047684830
7                   4 9.345933e-21 0.6543197 0.0054680662
8                   6 3.908605e-21 0.6504018 0.0054379736
9                   5 9.195783e-22 0.6399589 0.0063888996
AlizadehLange PAM projections Bernstein independent p-values: 
  ordered.clusterings       p.value     means     variance
1                   2  1.000000e+00 0.9823205 0.0004814611
2                   3  3.390246e-08 0.8184958 0.0229302928
3                  10  8.960006e-26 0.6995391 0.0040424940
4                   7  7.821681e-44 0.6876611 0.0056988548
5                   9  2.482752e-62 0.6863448 0.0036930730
6                   8  3.899796e-81 0.6786969 0.0047684830
7                   4 2.120447e-101 0.6543197 0.0054680662
8                   6 6.338073e-122 0.6504018 0.0054379736
9                   5 5.828354e-143 0.6399589 0.0063888996
AlizadehLange PAM projections Chi square p-values s=0.7: 
  ordered.clusterings      p.value     means     variance
1                   2 1.000000e+00 0.9823205 0.0004814611
2                   3 1.350031e-13 0.8184958 0.0229302928
3                  10 2.553513e-15 0.6995391 0.0040424940
4                   7 1.110223e-16 0.6876611 0.0056988548
5                   9 0.000000e+00 0.6863448 0.0036930730
6                   8 0.000000e+00 0.6786969 0.0047684830
7                   4 0.000000e+00 0.6543197 0.0054680662
8                   6 0.000000e+00 0.6504018 0.0054379736
9                   5 0.000000e+00 0.6399589 0.0063888996
AlizadehLange PAM projections Chi square p-values s=0.9: 
  ordered.clusterings p.value     means     variance
1                   2       1 0.9823205 0.0004814611
2                   3       0 0.8184958 0.0229302928
3                  10       0 0.6995391 0.0040424940
4                   7       0 0.6876611 0.0056988548
5                   9       0 0.6863448 0.0036930730
6                   8       0 0.6786969 0.0047684830
7                   4       0 0.6543197 0.0054680662
8                   6       0 0.6504018 0.0054379736
9                   5       0 0.6399589 0.0063888996
AlizadehLange PAM projections clusterings selected by the Hybrid test s=0.7: 
  ordered.clusterings p.value     means     variance
1                   2       1 0.9823205 0.0004814611
AlizadehLange PAM projections clusterings selected by the Hybrid test s=0.9: 
  ordered.clusterings p.value     means     variance
1                   2       1 0.9823205 0.0004814611
****************************************************************************

The following figures represent the empirical cumulative distributions and the histograms of the similarity measures computed for different numbers of clusters (from 2 to 10). Fig. 4 and 5 show that only 2-clusterings are significant with respectively k-means and PAM algorithms, independently of the choiche of the perturbation method: indeed in both case the black curve corresponding to the 2-clustering is largely below all the other ecdf curves. This is confirmed also by the histograms (see Fig. 7 and 8): in both cases the similarity values are accumulated near 1 only with the 2-clustering, denoting high stability only when 2 clusters are detected. On the contrary, using the hierachical clustering algorithm and random projections, 2 and 3-clusterings are detected as significant at 0.01 significance level, as shown by the corresponding ecdf (Fig. 6) where the black curve (2-clustering) and the red curve (3-clustering) are quite overlapping, and by the corresponding histograms (Fig. 9).




Figure 4: Lymphoma data set. Clusterings obtained with the k-means algorithm. Empirical cumulative distribution functions of the similarity measures for different number of clusters $k$ using (a) resampling techniques (b) random projections.
\includegraphics[width = 14.0cm]{ecdf.kmeans.sub.AlizadehLange.eps}
(a)
\includegraphics[width = 14.0cm]{ecdf.kmeans.proj.AlizadehLange.eps}
(b)




Figure 5: Lymphoma data set. Clusterings obtained with the PAM clustering algorithm. Empirical cumulative distribution functions of the similarity measures for different number of clusters $k$ using (a) resampling techniques (b) random projections.
\includegraphics[width = 14.0cm]{ecdf.PAM.sub.AlizadehLange.eps}
(a)
\includegraphics[width = 14.0cm]{ecdf.PAM.proj.AlizadehLange.eps}
(b)




Figure 6: Lymphoma data set. Clusterings obtained with the hierarchical clustering algorithm. Empirical cumulative distribution functions of the similarity measures for different number of clusters $k$ using (a) resampling techniques (b) random projections.
\includegraphics[width = 14.0cm]{ecdf.HC.sub.AlizadehLange.eps}
(a)
\includegraphics[width = 14.0cm]{ecdf.HC.proj.AlizadehLange.eps}
(b)




Figure 7: Lymphoma data set. Clusterings obtained with the k-means algorithm. Histograms of the similarity measures for different number of clusters $k$ using (a) resampling techniques (b) random projections.
\includegraphics[width = 14.0cm]{hist.kmeans.sub.AlizadehLange.eps}
(a)
\includegraphics[width = 14.0cm]{hist.kmeans.proj.AlizadehLange.eps}
(b)




Figure 8: Lymphoma data set. Clusterings obtained with the PAM algorithm. Histograms of the similarity measures for different number of clusters $k$ using (a) resampling techniques (b) random projections.
\includegraphics[width = 14.0cm]{hist.PAM.sub.AlizadehLange.eps}
(a)
\includegraphics[width = 14.0cm]{hist.PAM.proj.AlizadehLange.eps}
(b)




Figure 9: Lymphoma data set. Clusterings obtained with the hierarchical clustering algorithm. Histograms of the similarity measures for different number of clusters $k$ using (a) resampling techniques (b) random projections.
\includegraphics[width = 14.0cm]{hist.HC.sub.AlizadehLange.eps}
(a)
\includegraphics[width = 14.0cm]{hist.HC.proj.AlizadehLange.eps}
(b)


next up previous
Next: Software download and documentation Up: Introduction to the functionalities Previous: An example of the