Computation of the distortion induced by random projections.

Our goal is now to use random projections to reduce the dimensionality of the data, but without introduce too large distortions into the projected data. In this way if the distances between the examples were approximately maintained we could safely apply clustering to lower dimensional data, without destroying the metric properties of the data.

To do this in a principled way we need to experimentally evaluate the distortion introduced into the projected data. The clusterv package implements several functions to evaluate the distortion. The main functions are:

The function Average.Expansion computes, of course the average ratio of the distances between corresponding examples in the projected space and in the original space:

There exist also function that estimate the distortion (that is the maximum expansion) theoretically predicted by the Johnson Lindenstrauss lemma (see random projections in euclidean spaces for more details). In the previous case the 1 + epsilon distortion predicted for the examples stored in the matrix M.PMO is:

Another useful function provides the dimension of the subspace for which we could expect a given epsilon distortion, given the cardinality of the sample to be projected:

These functions, together with the functions to perform random projections described in the previous section, allow us to compare the empirical distortions induced by the different randomized projections, as well as to compare the observed empirical distortions with the theoretical distortion predicted through the JL lemma.

In the rest of this section we present the results of an experimental analysis of the distortion induced in DNA microarray data by different types of random projections. An example of the R scripts that we used to implement this empirical analysis is downloadable from:
http://homes.dsi.unimi.it/valenti/SW/clusterv/examples/Shipp.DLBCL.filtered.norm-PMO.R and the DNA microarray data set that we used is downloadable in gzipped format from:
http://homes.dsi.unimi.it/valenti/SW/clusterv/examples/Shipp.DLBCL.filtered.norm.nolabels.txt.gz. This R script refers to an analysis of the distortion induced by PMO random projections, but it is straightforward to modify it for using the other random projections implemented in the package.

The data we used have been downloaded from the MIT Whitehead Institute. The samples are tumor specimens from 8 Diffuse large B-cell lymphoma (DLBCL) and 19 Follicular lymphoma (FL) patients [17]. For each patient, expression levels of 7129 genes or EST sequences are provided from Affymetrix HU6800 oligonucleotide arrays. Raw data have been pre-processed and re-scaled according to the procedures described in [17], obtaining a final set of 6286 genes

We performed Achlioptas, Normal, PMO and RS random projections and then we experimentally evaluated the expectation of the maximum, minimum and average expansion, averaging their values over 50 repeated random projections. Then we compared the results with the estimated theoretical distortion predicted by the JL lemma. We performed random projections into subspace whose dimensions correspond to predicted distortions 1+epsilon, with epsilon values ranging from 0.1 to 0.5 (Fig. 3)

Figure 3: Comparing theoretical and empirical distortion with DLBCL samples using (a) Achlioptas (b) Normal (c) PMO and (d) RS projections. Continuous lines represent the bounds of the maximum and minimum distortion (expansion) according to the JL lemma. Dashed lines represent the average maximum and minimum expansion empirically computed and averaged over 50 random projections. The pairs of dotted lines just above and below the the dashed lines represent the confidence interval at 99 % confidence level. The dash-dotted line represents the average expansion.

$\includegraphics[width = 6.5cm]{ps/DLBCL.norm-Achlioptas.eps}$	$\includegraphics[width = 6.5cm]{ps/DLBCL.norm-N.eps}$
(a)	(b)
$\includegraphics[width = 6.5cm]{ps/DLBCL.norm-PMO.eps}$	$\includegraphics[width = 6.5cm]{ps/DLBCL.norm-RS.eps}$
(c)	(d)

In abscissa are represented the dimensions of the projected subspace and in ordinate the corresponding distortion. Continuous lines represent the bounds of the maximum and minimum expansion according to the JL lemma; dashed lines represent the empirical average maximum and minimum expansion computed and averaged over 50 random projections. The pairs of dotted lines just above and below the dashed lines represent the confidence interval at

% confidence level. The dash-dotted line represents the average distortion. The dashed and dot-dashed lines represent the corresponding estimated empirical values (the lines are simply computed by linear interpolation between points). We recall that distortion equal to 1 means no distortion.

For the Achlioptas, Normal and PMO random projections, the empirical bounds of the maximum and minimum distortions are largely inside the theoretical bounds predicted by the JL lemma (Fig. 3 a, b and c) On the contrary, with RS random projections in both cases the maximum and minimum distortions are largely outside the theoretical bounds (Fig.]3 (d)). Note also that these results are significant at 99 % confidence level, as the dashed lines of the confidence intervals do not intersect the continuous lines of the theoretical bounds. In any case the average empirical distortion is very close to 1 (that is we have no distortion on the average), even if with RS projections for large values of epsilon (that corresponds to very low dimensional subspaces) the average empirical distortions moves slightly from 1.

These results show that using random subspace (RS) projections with DNA microarray data we may introduce large distortions in the data, especially when large values of epsilon are used. We strongly suggest to use one of the other random projections (Achlioptas, Normal or PMO).