next up previous
Next: Statistical tests to assess Up: Background on stability based Previous: Stability based methods


Stability indices

Let be $\mathcal{C}$ a clustering algorithm, $\rho(D)$ a given random perturbation procedure applied to a data set $D$ and $sim$ a suitable similarity measure between two clusterings (e.g. the Fowlkes and Mallows similarity [13]). For instance $\rho$ may be a random projection from a high dimensional to a low dimensional subspace [1], or a bootstrap procedure to sample a random subset of data from the original data set $D$ [4].

We define $S_k \; (0 \leq S_k\leq 1)$ as the random variable given by the similarity between two k-clusterings obtained by applying a clustering algorithm $\mathcal{C}$ to pairs $D_1$ and $D_2$ of random independently perturbed data. The intuitive idea is that if $S_k$ is concentrated close to $1$, the corresponding clustering is stable with respect to a given controlled perturbation and hence it is reliable.

Let be $f_k(s)$ its density function, and

\begin{displaymath}
F_k(\bar{s}) = \int_{-\infty}^{\bar{s}} f_k(s) ds
\end{displaymath} (1)

its cumulative distribution function.

We define $g(k)$ as the integral of the cumulative distribution function:

\begin{displaymath}
g(k) = \int_0^1 F_k(s) ds
\end{displaymath} (2)

Intuitively $g(k)$ represents the "concentration" of the similarity values close to 1; that is, if $g(k) \simeq 0$ then the distribution of the values of $S_k$ is concentrated near 1, or, in other words, the k-clustering is stable. On the other hand, if $g(k) \simeq 1$ then the clusterings are totally unstable, while if the distribution is close to the uniform distribution, we have $g(k) \simeq 1/2$.

We may directly estimate eq. 2 by numerical integration, or we may more easily obtain $g(k)$ from the estimate of the expectation $E[S_k]$:

$\displaystyle E[S_k]$ $\textstyle =$ $\displaystyle \int_0^1 s f_k(s) ds = \int_0^1 s F'_k(s) ds$  
  $\textstyle =$ $\displaystyle s F_k(s)\left\vert _0^1 \right. - \int_0^1 F_k(s) ds = 1 - \int_0^1 F_k(s) ds$ (3)

Hence from eq. 3 we may easily compute $g(k)$:

\begin{displaymath}
g(k) = \int_0^1 F_k(s) ds = 1 - E[S_k]
\end{displaymath} (4)

Eq. 4 shows also more clearly that we have a very stable and reliable clustering ($E[S_k]$ close to 1), if and only if $g(k)$ is close to 0.


next up previous
Next: Statistical tests to assess Up: Background on stability based Previous: Stability based methods