next up previous
Next: Clusterv tutorial Up: The clusterv R package: Previous: Overview of the clusterv

Background on random projections in euclidean spaces

Dimensionality reduction may be obtained by mapping points from a high to a low-dimensional space, approximately preserving some characteristics, i.e. the distances between points. In this context randomized embeddings with low distortion represent a key concept. Randomized embeddings have been successfully applied both to combinatorial optimization and data compression [12].

A randomized embedding between $L_2$ normed metric spaces with distortion $1 + \epsilon$, with $\epsilon > 0$ and failure probability $P$ is a distribution probability over mappings $\mu : \mathbb{R}^d \rightarrow \mathbb{R}^{d'}$, such that for every pair $p,q \in \mathbb{R}^d$, the following property holds with probability $1 - P$:

\frac{1}{1+\epsilon} \leq \frac{\vert\vert \mu(p) - \mu(q) \vert\vert _2}{\vert\vert p - q \vert\vert _2} \leq 1+\epsilon
\end{displaymath} (1)

The main result on randomized embedding is due to Johnson and Lindenstrauss [13], who proved the existence of a randomized embedding $\mu : \mathbb{R}^d \rightarrow \mathbb{R}^{d'}$ with distortion $1 + \epsilon$ and failure probability $e^{\Omega(-d'\epsilon^2)}$, for every $0<\epsilon < 1/2$. As a consequence, for a fixed data set $S \subset \mathbb{R}^d$, with $\vert S\vert = n$, by union bound, for all $p,q \in S$, it holds:
Prob \left(\frac{1}{1+\epsilon} \leq \frac{\vert\vert \mu(p)...
...\leq 1+\epsilon \right) \geq
1 - n^2 e^{\Omega(-d'\epsilon^2)}
\end{displaymath} (2)

Hence, by choosing $d'$ such that $n^2 e^{\Omega(-d'\epsilon^2)} < 1/2$, it is proved the following:
Johnson-Lindenstrauss (JL) lemma: Given a set $S$ with $\vert S\vert = n$ there exists a $1 + \epsilon$-distortion embedding into $\mathbb{R}^{d'}$ with $ d' = c \; \log n / \epsilon^2$, where $c$ is a suitable constant.

The embedding exhibited in [13] consists in random projections from $\mathbb{R}^{d}$ into $\mathbb{R}^{d'}$, represented by matrices $d' \times d$ with random orthonormal vectors. Similar results may be obtained by using simpler embeddings, represented through random $d' \times d$ matrices $P = 1/\sqrt{d'} (r_{ij})$, where $r_{ij}$ are random variables such that:

E[r_{ij}] = 0, \qquad \qquad Var[r_{ij}] = 1

For sake of simplicity, we call random projections even this kind of embeddings.

We may build random matrices $P$ in different ways to obtain random projections obeying the Johnson-Lindenstrauss lemma:

A particular case of randomized map is represented by Random Subspace (RS) projections [11]. Indeed these projections do not satisfy the JL lemma and may induce large distortions into the projected data. They are represented by $d' \times d$ matrices $P =\sqrt{d/d'} (r_{ij})$, where $p_{ij}$ are uniformly chosen with entries in $\{0,1\}$, and with exactly one $1$ per row and at most one $1$ per column. It is worth noting that, in this case, the "compressed" data set $D_P = P \cdot D$ can be quickly computed in time $\mathcal{O}(n d')$, independently from $d$. Unfortunately, it easy to see that in this case $E[p_{ij}] \neq 0$ and $Var[p_{ij}] \neq 1$ and hence RS does not satisfy the JL lemma.

next up previous
Next: Clusterv tutorial Up: The clusterv R package: Previous: Overview of the clusterv
Giorgio 2006-08-16