

Generation of randomly projected data

Different types of random projections are available with Clusterv: Plus-Minus-One (PMO) projections, Achlioptas projections, normal (Gaussian) projections, and random subspace projections (see the examples below).

Before looking at some examples, we will see how to generate the synthetic data used throughout this tutorial. Different synthetic data generators named generate.sampleX are available, where X is between 0 and 5. They generate clusters of data distributed according to multivariate Gaussian distributions. Each generator provides from 2 to 5 clusters, each one characterized by its mean and covariance matrix. Usually the mean (center of each cluster) and the covariance matrix are input parameters of the functions (see the Clusterv reference manual for more details).

For instance:

> M <- generate.sample1(n = 20, m = 6, sigma = 1, dim = 2000)
generates a data matrix M (examples in columns, variables in rows) with 3 clusters, each composed of n=20 examples of dimension dim=2000. All clusters have their last dim-500 variables centered at 0. The first class (first n examples) also has its first 500 features centered at 0. The parameter m selects the centers for the second and third clusters: the second class (second n examples) has its first 500 features centered at 6, and the third (last n examples) has its first 500 features centered at -6. All the clusters are distributed according to a "spherical" Gaussian with sigma=1. The resulting matrix M has 2000 rows and 60 columns:
> dim(M)
[1] 2000   60
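
As a quick sanity check (using only base R, not part of the Clusterv API), you can verify this structure by looking at the mean of the first 500 features within each group of 20 columns; the three values should be close to 0, 6 and -6 respectively:

> mean(M[1:500, 1:20])     # first cluster: expected close to 0
> mean(M[1:500, 21:40])    # second cluster: expected close to 6
> mean(M[1:500, 41:60])    # third cluster: expected close to -6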

As another, slightly more complex example, consider:

> M5 <- generate.sample5(n = 10, m = 2, ratio.noisy = 0.9, dim = 1000)
This generates a 1000 x 40 data matrix M5: 4 clusters, each with n = 10 examples of dimension 1000, are randomly generated. The parameter ratio.noisy sets the proportion of "noisy" features, where by "noisy" feature we mean a feature that is equally distributed in all the classes (these variables are centered at 0), while "non-noisy" features are centered at different points (set by the m parameter) in the different classes. In this case we have 1000*0.9=900 "noisy" variables and 100 "non-noisy" variables, centered at 0 for the first cluster, at 2 for the second, at -2 for the third, and alternately at 2 and -2 for the fourth. The covariance matrix Sigma is the same for all the clusters: Sigma = (B, Zero; Zero', I), where B is a (dim*(1-ratio.noisy)) x (dim*(1-ratio.noisy)) matrix (in this case a 100 x 100 matrix) such that B[i,i]=1, B[i,i+1]=B[i,i-1]=0.5 and B[i,j]=0.1 if j != i-1, i, i+1; Zero is a (dim*(1-ratio.noisy)) x (dim*ratio.noisy) zero matrix (in this case a 100 x 900 matrix) and Zero' is its transpose; I is a (dim*ratio.noisy) x (dim*ratio.noisy) identity matrix (in this case a 900 x 900 matrix).
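
The block structure of Sigma may be easier to see on a toy example. The following sketch (plain base R, not a Clusterv function) builds the covariance matrix described above for dim = 10 and ratio.noisy = 0.7, i.e. a 3 x 3 block B and a 7 x 7 identity block:

> d.signal <- 3; d.noise <- 7
> B <- matrix(0.1, d.signal, d.signal)       # B[i,j] = 0.1 if j != i-1, i, i+1
> B[abs(row(B) - col(B)) == 1] <- 0.5        # B[i,i+1] = B[i,i-1] = 0.5
> diag(B) <- 1                               # B[i,i] = 1
> Zero <- matrix(0, d.signal, d.noise)
> Sigma <- rbind(cbind(B, Zero), cbind(t(Zero), diag(d.noise)))
> dim(Sigma)
[1] 10 10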

Now we will apply different random projections to these two (quite) high-dimensional data matrices. For instance, we could apply a Plus-Minus-One (PMO) random projection:

> M.PMO <- Plus.Minus.One.random.projection(d = 50, M)
> dim(M.PMO)
[1] 50 60
This function performs a PMO random projection of the data matrix M into a 50-dimensional subspace.
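
Conceptually, a PMO projection multiplies the data matrix by a random matrix whose entries are +1 or -1 with equal probability, scaled so that distances are approximately preserved. The following sketch shows the idea in plain R; the scaling factor 1/sqrt(d) is an assumption made here for illustration, and the exact definition used by Plus.Minus.One.random.projection is given in the Clusterv reference manual:

> d <- 50
> P <- matrix(sample(c(-1, 1), d * nrow(M), replace = TRUE), nrow = d) / sqrt(d)
> M.sketch <- P %*% M     # 50 x 2000 times 2000 x 60 gives a 50 x 60 matrix
> dim(M.sketch)
[1] 50 60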

The other functions that implement random projections have a similar syntax:

> M.Achlioptas <- Achlioptas.random.projection(d = 50, M)
> M.Normal <- norm.random.projection(d = 50, M)
> M.RS <- random.subspace(d = 50, M)
In all cases the functions return a 50 x 60 matrix, using different random projections. You can take a look at the different projected matrices: they differ from one another not only because different random projections are performed, but also because each time a different random matrix is generated by the randomized map. For instance, if you now perform a second Plus-Minus-One (PMO) random projection and compare the result with the previously computed PMO data matrix, you will get different results:
> M.PMO.2 <- Plus.Minus.One.random.projection(d = 50, M)
> R <- M.PMO == M.PMO.2
Indeed, all the elements of the resulting boolean matrix R are FALSE, and this is not the effect of round-off errors or of a permutation of the columns, as shown by the plot of the first two principal components of the data (Fig. 2):
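
If you want to quantify this, you can count the number of equal entries and produce a figure similar to Fig. 2 with a standard PCA (using base R's prcomp); this is only a sketch of how such a plot can be obtained, not the code used to produce the original figure:

> sum(R)                                   # number of equal entries: 0
> pooled <- t(cbind(M.PMO, M.PMO.2))       # 120 examples (rows) in the 50-dimensional space
> pc <- prcomp(pooled)
> plot(pc$x[, 1], pc$x[, 2], pch = rep(c(19, 22), each = ncol(M.PMO)),
+      xlab = "PC1", ylab = "PC2")         # solid circles: M.PMO, squares: M.PMO.2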
Figure 2: Plot of the first two principal components of the data sets represented in matrices M.PMO (solid circles) and M.PMO.2 (squares) [figure: ps/PCA.eps]

We would like to use random projections to reduce the dimensionality of the data without introducing too large a distortion into the projected data, so that the projected data can still be used for clustering. How can we do this in a principled way? This is the subject of the next section of the tutorial.


Giorgio 2006-08-16