Background on random projections in Euclidean spaces
Dimensionality reduction may be obtained by mapping points from a high-dimensional to a low-dimensional space while approximately preserving some characteristics of the data, e.g. the distances between points.
In this context, randomized embeddings with low distortion are a key concept; they have been successfully applied to both combinatorial optimization and data compression [12].
A randomized embedding between normed metric spaces $\mathbb{R}^d$ and $\mathbb{R}^{d'}$ with distortion $1+\epsilon$, with $0 < \epsilon \le 1/2$, and failure probability $P$ is a probability distribution over mappings $f : \mathbb{R}^d \rightarrow \mathbb{R}^{d'}$ such that, for every pair $p, q \in \mathbb{R}^d$, the following property holds with probability $1-P$:

$$\frac{1}{1+\epsilon} \;\le\; \frac{\|f(p)-f(q)\|}{\|p-q\|} \;\le\; 1+\epsilon \qquad (1)$$

For instance, with $\epsilon = 0.1$, property (1) requires every pairwise distance to be preserved within a factor between $1/1.1 \simeq 0.91$ and $1.1$.
The main result on randomized embeddings is due to Johnson and Lindenstrauss [13], who proved the existence of a randomized embedding $\mu : \mathbb{R}^d \rightarrow \mathbb{R}^{d'}$ with distortion $1+\epsilon$ and failure probability $e^{-\Omega(d' \epsilon^2)}$, for every $0 < \epsilon \le 1/2$.
As a consequence, for a fixed data set $S \subseteq \mathbb{R}^d$ with $|S| = n$, by the union bound it holds, for all $p, q \in S$ simultaneously:

$$\Pr\left( \forall p, q \in S : \; \frac{1}{1+\epsilon} \le \frac{\|f(p)-f(q)\|}{\|p-q\|} \le 1+\epsilon \right) \;\ge\; 1 - n^2 e^{-\Omega(d' \epsilon^2)} \qquad (2)$$

Hence, by choosing $d'$ such that $n^2 e^{-\Omega(d' \epsilon^2)} < 1$, that is $d' = \Omega(\epsilon^{-2} \log n)$, the following is proved:

Johnson-Lindenstrauss (JL) lemma: Given a set $S$ with $|S| = n$, there exists a $(1+\epsilon)$-distortion embedding into $\mathbb{R}^{d'}$ with $d' = c \, \epsilon^{-2} \log n$, where $c$ is a suitable constant.
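As a concrete illustration, one explicit form of the bound is the one arising from Dasgupta and Gupta's elementary proof of the lemma (the constant used here is an assumption, not necessarily the constant intended in [13]):

$$d' \;\ge\; \frac{4 \ln n}{\epsilon^2/2 - \epsilon^3/3}$$

For instance, with $n = 100$ points and distortion $1+\epsilon = 1.5$, it suffices to take $d' \ge 4 \ln 100 / (1/8 - 1/24) \approx 222$, independently of the original dimension $d$.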
The embedding exhibited in [13] consists of random projections from $\mathbb{R}^d$ into $\mathbb{R}^{d'}$, represented by $d' \times d$ matrices whose rows are random orthonormal vectors.
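For illustration, such a matrix can be generated in R (the language of the clusterv package) via the QR decomposition of a Gaussian random matrix. This is only a minimal sketch under these assumptions, not the function actually used by clusterv:

    # Sketch: a d' x d projection matrix with orthonormal rows,
    # obtained from the QR decomposition of a Gaussian random matrix.
    random.orthonormal.projection <- function(d.prime, d) {
      G <- matrix(rnorm(d * d.prime), nrow = d, ncol = d.prime)
      Q <- qr.Q(qr(G))   # d x d' matrix with orthonormal columns
      t(Q)               # transpose: d' x d with orthonormal rows
    }
    P <- random.orthonormal.projection(10, 1000)
    max(abs(P %*% t(P) - diag(10)))   # numerically close to 0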
Similar results may be obtained using simpler embeddings, represented through random matrices $P = \frac{1}{\sqrt{d'}} R$, where $R = (r_{ij})$ and the $r_{ij}$ are random variables such that:

$$E[r_{ij}] = 0, \qquad Var[r_{ij}] = 1$$

For the sake of simplicity, we call these kinds of embeddings random projections as well. Random matrices yielding random projections that obey the Johnson-Lindenstrauss lemma may be built in different ways (an R sketch of these generators is given after the list):
- The Plus-Minus-One (PMO) projection: $R = (r_{ij})$, with the $r_{ij}$ randomly chosen such that:
$$\Pr(r_{ij} = 1) = \Pr(r_{ij} = -1) = 1/2$$
- The Achlioptas projection [1]:
$$\Pr(r_{ij} = \sqrt{3}) = \Pr(r_{ij} = -\sqrt{3}) = 1/6, \qquad \Pr(r_{ij} = 0) = 2/3$$
- Generalized Achlioptas projection:
$$\Pr(r_{ij} = \sqrt{a}) = \Pr(r_{ij} = -\sqrt{a}) = \frac{1}{2a}, \qquad \Pr(r_{ij} = 0) = 1 - \frac{1}{a}$$
with $a \ge 1$ (PMO and Achlioptas correspond to $a = 1$ and $a = 3$, respectively).
- A Normal projection: $r_{ij} \sim N(0, 1)$, chosen independently.
- Other projections such that the $r_{ij}$ are independent with $E[r_{ij}] = 0$ and $Var[r_{ij}] = 1$.
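The following minimal R sketch (function names are illustrative, not the clusterv API) generates the scaled matrices $P = \frac{1}{\sqrt{d'}} R$ for the projections above and empirically checks the near-isometry on a pair of random points:

    # Sketch: generators for the random projection matrices listed above.
    pmo.projection <- function(d.prime, d) {
      matrix(sample(c(-1, 1), d.prime * d, replace = TRUE),
             nrow = d.prime) / sqrt(d.prime)
    }
    achlioptas.projection <- function(d.prime, d, a = 3) {
      # generalized Achlioptas; a = 3 gives [1], a = 1 reduces to PMO
      v <- sample(c(-sqrt(a), 0, sqrt(a)), d.prime * d, replace = TRUE,
                  prob = c(1 / (2 * a), 1 - 1 / a, 1 / (2 * a)))
      matrix(v, nrow = d.prime) / sqrt(d.prime)
    }
    normal.projection <- function(d.prime, d) {
      matrix(rnorm(d.prime * d), nrow = d.prime) / sqrt(d.prime)
    }
    # Empirical check of the distortion on two random points of R^d:
    d <- 5000; d.prime <- 500
    p <- rnorm(d); q <- rnorm(d)
    P <- achlioptas.projection(d.prime, d)
    sqrt(sum((P %*% (p - q))^2)) / sqrt(sum((p - q)^2))   # ratio close to 1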
A particular case of randomized map is represented by Random Subspace (RS) projections [11]. These projections do not satisfy the JL lemma and may induce large distortions in the projected data. They are represented by matrices $P = \sqrt{\frac{d}{d'}} R$, where $R = (r_{ij})$ is uniformly chosen among the $d' \times d$ matrices with entries in $\{0, 1\}$ having exactly one $1$ per row and at most one $1$ per column.
It is worth noting that, in this case, the "compressed" data set $\{P p \mid p \in S\}$ can be quickly computed in time $O(n \cdot d')$, independently of $d$.
Unfortunately, it is easy to see that in this case the ratio $\|Pp - Pq\| / \|p - q\|$ may be as small as $0$ (when two distinct points coincide on the $d'$ selected coordinates) and as large as $\sqrt{d/d'}$ (when they differ only on the selected coordinates), and hence RS does not satisfy the JL lemma.
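A minimal R sketch (illustrative names, not the clusterv API) makes the collapse explicit: an RS projection simply selects $d'$ of the $d$ coordinates and rescales them, so two distinct points agreeing on the selected coordinates are mapped to the same point:

    # Sketch: RS projection = pick d' of the d coordinates and rescale.
    rs.indices <- function(d.prime, d) sample(d, d.prime)
    rs.project <- function(idx, x) sqrt(length(x) / length(idx)) * x[idx]

    p <- c(1, rep(0, 999))   # differs from q only in coordinate 1
    q <- rep(0, 1000)
    idx <- rs.indices(10, 1000)
    # TRUE unless coordinate 1 happens to be among the 10 selected ones:
    all(rs.project(idx, p) == rs.project(idx, q))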