Stability indices

Next: Statistical tests to assess Up: Background on stability based Previous: Stability based methods

Stability indices

Let be $\mathcal{C}$ a clustering algorithm, $\rho(D)$ a given random perturbation procedure applied to a data set

and

a suitable similarity measure between two clusterings (e.g. the Fowlkes and Mallows similarity [13]). For instance $\rho$ may be a random projection from a high dimensional to a low dimensional subspace [1], or a bootstrap procedure to sample a random subset of data from the original data set

[4].

We define $S_k \; (0 \leq S_k\leq 1)$ as the random variable given by the similarity between two k-clusterings obtained by applying a clustering algorithm $\mathcal{C}$ to pairs and of random independently perturbed data. The intuitive idea is that if is concentrated close to , the corresponding clustering is stable with respect to a given controlled perturbation and hence it is reliable.

Let be its density function, and

$\begin{displaymath} F_k(\bar{s}) = \int_{-\infty}^{\bar{s}} f_k(s) ds \end{displaymath}$

(1)

its cumulative distribution function.

We define as the integral of the cumulative distribution function:

$\begin{displaymath} g(k) = \int_0^1 F_k(s) ds \end{displaymath}$

(2)

Intuitively

represents the "concentration" of the similarity values close to 1; that is, if $g(k) \simeq 0$ then the distribution of the values of

is concentrated near 1, or, in other words, the k-clustering is stable. On the other hand, if $g(k) \simeq 1$ then the clusterings are totally unstable, while if the distribution is close to the uniform distribution, we have $g(k) \simeq 1/2$ .

We may directly estimate eq. 2 by numerical integration, or we may more easily obtain from the estimate of the expectation :

$\displaystyle E[S_k]$	$\textstyle =$	$\displaystyle \int_0^1 s f_k(s) ds = \int_0^1 s F'_k(s) ds$
	$\textstyle =$	$\displaystyle s F_k(s)\left\vert _0^1 \right. - \int_0^1 F_k(s) ds = 1 - \int_0^1 F_k(s) ds$	(3)

Hence from eq. 3 we may easily compute :

$\begin{displaymath} g(k) = \int_0^1 F_k(s) ds = 1 - E[S_k] \end{displaymath}$

(4)

Eq. 4 shows also more clearly that we have a very stable and reliable clustering (

close to 1), if and only if

is close to 0.

Next: Statistical tests to assess Up: Background on stability based Previous: Stability based methods