Background on random projections in Euclidean spaces
Dimensionality reduction may be obtained by mapping points from a high-dimensional to a low-dimensional space while approximately preserving some characteristics of the data, e.g. the distances between points.
In this context, randomized embeddings with low distortion are a key concept; they have been successfully applied to both combinatorial optimization and data compression [12].
A randomized embedding between normed metric spaces $\mathbb{R}^d$ and $\mathbb{R}^{d'}$ with distortion $1+\epsilon$, with $0 < \epsilon \le 1/2$, and failure probability $P$ is a probability distribution over mappings $f : \mathbb{R}^d \rightarrow \mathbb{R}^{d'}$ such that, for every pair $p, q \in \mathbb{R}^d$, the following property holds with probability $1-P$:

$$\frac{1}{1+\epsilon} \;\le\; \frac{\|f(p)-f(q)\|}{\|p-q\|} \;\le\; 1+\epsilon \qquad (1)$$

For instance, with $\epsilon = 0.1$, property (1) requires every pairwise distance to be preserved within a factor between $1/1.1 \simeq 0.91$ and $1.1$.
The main result on randomized embeddings is due to Johnson and Lindenstrauss [13], who proved the existence of a randomized embedding $\mu : \mathbb{R}^d \rightarrow \mathbb{R}^{d'}$ with distortion $1+\epsilon$ and failure probability $e^{-\Omega(d' \epsilon^2)}$, for every $0 < \epsilon \le 1/2$.
As a consequence, for a fixed data set $S \subseteq \mathbb{R}^d$ with $|S| = n$, by the union bound it holds, for all $p, q \in S$ simultaneously:

$$\Pr\left( \forall p, q \in S : \; \frac{1}{1+\epsilon} \le \frac{\|f(p)-f(q)\|}{\|p-q\|} \le 1+\epsilon \right) \;\ge\; 1 - n^2 e^{-\Omega(d' \epsilon^2)} \qquad (2)$$

Hence, by choosing $d'$ such that $n^2 e^{-\Omega(d' \epsilon^2)} < 1$, that is $d' = \Omega(\epsilon^{-2} \log n)$, the following is proved:

Johnson-Lindenstrauss (JL) lemma: Given a set $S$ with $|S| = n$, there exists a $(1+\epsilon)$-distortion embedding into $\mathbb{R}^{d'}$ with $d' = c \, \epsilon^{-2} \log n$, where $c$ is a suitable constant.
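As a concrete illustration, one explicit form of the bound is the one arising from Dasgupta and Gupta's elementary proof of the lemma (the constant used here is an assumption, not necessarily the constant intended in [13]):

$$d' \;\ge\; \frac{4 \ln n}{\epsilon^2/2 - \epsilon^3/3}$$

For instance, with $n = 100$ points and distortion $1+\epsilon = 1.5$, it suffices to take $d' \ge 4 \ln 100 / (1/8 - 1/24) \approx 222$, independently of the original dimension $d$.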
The embedding exhibited in [13] consists of random projections from $\mathbb{R}^d$ into $\mathbb{R}^{d'}$, represented by $d' \times d$ matrices whose rows are random orthonormal vectors.
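For illustration, such a matrix can be generated in R (the language of the clusterv package) via the QR decomposition of a Gaussian random matrix. This is only a minimal sketch under these assumptions, not the function actually used by clusterv:

    # Sketch: a d' x d projection matrix with orthonormal rows,
    # obtained from the QR decomposition of a Gaussian random matrix.
    random.orthonormal.projection <- function(d.prime, d) {
      G <- matrix(rnorm(d * d.prime), nrow = d, ncol = d.prime)
      Q <- qr.Q(qr(G))   # d x d' matrix with orthonormal columns
      t(Q)               # transpose: d' x d with orthonormal rows
    }
    P <- random.orthonormal.projection(10, 1000)
    max(abs(P %*% t(P) - diag(10)))   # numerically close to 0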
Similar results may be obtained using simpler embeddings, represented through random matrices $P = \frac{1}{\sqrt{d'}} R$, where $R = (r_{ij})$ and the $r_{ij}$ are random variables such that:

$$E[r_{ij}] = 0, \qquad Var[r_{ij}] = 1$$

For the sake of simplicity, we call these kinds of embeddings random projections as well. Random matrices yielding random projections that obey the Johnson-Lindenstrauss lemma may be built in different ways (an R sketch of these generators is given after the list):
- The Plus-Minus-One (PMO) projection: $R = (r_{ij})$, with the $r_{ij}$ randomly chosen such that:
$$\Pr(r_{ij} = 1) = \Pr(r_{ij} = -1) = 1/2$$
- The Achlioptas projection [1]:
$$\Pr(r_{ij} = \sqrt{3}) = \Pr(r_{ij} = -\sqrt{3}) = 1/6, \qquad \Pr(r_{ij} = 0) = 2/3$$
- Generalized Achlioptas projection:
$$\Pr(r_{ij} = \sqrt{a}) = \Pr(r_{ij} = -\sqrt{a}) = \frac{1}{2a}, \qquad \Pr(r_{ij} = 0) = 1 - \frac{1}{a}$$
with $a \ge 1$ (PMO and Achlioptas correspond to $a = 1$ and $a = 3$, respectively).
- A Normal projection: $r_{ij} \sim N(0, 1)$, chosen independently.
- Other projections such that the $r_{ij}$ are independent with $E[r_{ij}] = 0$ and $Var[r_{ij}] = 1$.
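The following minimal R sketch (function names are illustrative, not the clusterv API) generates the scaled matrices $P = \frac{1}{\sqrt{d'}} R$ for the projections above and empirically checks the near-isometry on a pair of random points:

    # Sketch: generators for the random projection matrices listed above.
    pmo.projection <- function(d.prime, d) {
      matrix(sample(c(-1, 1), d.prime * d, replace = TRUE),
             nrow = d.prime) / sqrt(d.prime)
    }
    achlioptas.projection <- function(d.prime, d, a = 3) {
      # generalized Achlioptas; a = 3 gives [1], a = 1 reduces to PMO
      v <- sample(c(-sqrt(a), 0, sqrt(a)), d.prime * d, replace = TRUE,
                  prob = c(1 / (2 * a), 1 - 1 / a, 1 / (2 * a)))
      matrix(v, nrow = d.prime) / sqrt(d.prime)
    }
    normal.projection <- function(d.prime, d) {
      matrix(rnorm(d.prime * d), nrow = d.prime) / sqrt(d.prime)
    }
    # Empirical check of the distortion on two random points of R^d:
    d <- 5000; d.prime <- 500
    p <- rnorm(d); q <- rnorm(d)
    P <- achlioptas.projection(d.prime, d)
    sqrt(sum((P %*% (p - q))^2)) / sqrt(sum((p - q)^2))   # ratio close to 1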
A particular case of randomized map is represented by Random Subspace (RS) projections [11]. These projections do not satisfy the JL lemma and may induce large distortions in the projected data. They are represented by matrices $P = \sqrt{\frac{d}{d'}} R$, where $R = (r_{ij})$ is uniformly chosen among the $d' \times d$ matrices with entries in $\{0, 1\}$ having exactly one $1$ per row and at most one $1$ per column.
It is worth noting that, in this case, the "compressed" data set $\{P p \mid p \in S\}$ can be quickly computed in time $O(n \cdot d')$, independently of $d$.
Unfortunately, it is easy to see that in this case the ratio $\|Pp - Pq\| / \|p - q\|$ may be as small as $0$ (when two distinct points coincide on the $d'$ selected coordinates) and as large as $\sqrt{d/d'}$ (when they differ only on the selected coordinates), and hence RS does not satisfy the JL lemma.
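A minimal R sketch (illustrative names, not the clusterv API) makes the collapse explicit: an RS projection simply selects $d'$ of the $d$ coordinates and rescales them, so two distinct points agreeing on the selected coordinates are mapped to the same point:

    # Sketch: RS projection = pick d' of the d coordinates and rescale.
    rs.indices <- function(d.prime, d) sample(d, d.prime)
    rs.project <- function(idx, x) sqrt(length(x) / length(idx)) * x[idx]

    p <- c(1, rep(0, 999))   # differs from q only in coordinate 1
    q <- rep(0, 1000)
    idx <- rs.indices(10, 1000)
    # TRUE unless coordinate 1 happens to be among the 10 selected ones:
    all(rs.project(idx, p) == rs.project(idx, q))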