Next: Statistical tests to assess
Up: Background on stability based
Previous: Stability based methods
Stability indices
Let be a clustering algorithm, a given random perturbation procedure applied to a data set and a
suitable similarity measure between two clusterings (e.g. the Fowlkes and Mallows similarity [13]).
For instance may be a random projection from a high dimensional to a low dimensional subspace [1], or a bootstrap
procedure to sample a random subset of data from the original data set [4].
We define
as the random variable given by the similarity between two k-clusterings obtained by applying a
clustering algorithm to pairs and of random independently perturbed data.
The intuitive idea is that if is concentrated close to , the corresponding clustering is stable with respect to a given
controlled perturbation and hence it is reliable.
Let be its density function, and
|
(1) |
its cumulative distribution function.
We define as the integral of the cumulative distribution function:
|
(2) |
Intuitively represents the "concentration" of the similarity values close to 1; that is, if then
the distribution of the values of is concentrated near 1, or, in other words, the k-clustering is stable. On the
other hand, if then the clusterings are totally unstable, while if the distribution is close to the
uniform distribution, we have
.
We may directly estimate eq. 2 by numerical integration, or we may more easily obtain from the estimate of
the expectation :
Hence from eq. 3 we may easily compute :
|
(4) |
Eq. 4 shows also more clearly that we have a very stable and reliable clustering ( close to 1),
if and only if is close to 0.
Next: Statistical tests to assess
Up: Background on stability based
Previous: Stability based methods