## Abstract

In this work we study the problem of clustering with respect to the diameter and the radius costs: We say that a set *X* of points in $\mathbb{R}^d$ is (*k*,*b*)-*clusterable* with respect to the diameter cost if *X* can be partitioned into *k* subsets (*clusters*) so that the distance between every pair of points in each cluster is at most *b*. In the case of the radius cost we require that all points that belong to the same cluster be at a distance of at most *b* from some common central point.
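The two costs can be written out as follows. This is only a restatement of the definitions above; the symbols $X_1,\dots,X_k$ for the clusters, $c_1,\dots,c_k$ for the central points, and $\mathrm{dist}$ for the distance are notational assumptions introduced here, not notation taken from the paper.

$$
\begin{aligned}
\text{diameter cost:}\quad & X = X_1 \cup \dots \cup X_k \ \text{ with } \ \mathrm{dist}(x,y) \le b \ \text{ for all } x,y \in X_i,\ 1 \le i \le k,\\
\text{radius cost:}\quad & X = X_1 \cup \dots \cup X_k \ \text{ and } \ c_1,\dots,c_k \in \mathbb{R}^d \ \text{ with } \ \mathrm{dist}(x,c_i) \le b \ \text{ for all } x \in X_i,\ 1 \le i \le k.
\end{aligned}
$$

Under the radius cost any two points of a cluster are within $2b$ of each other by the triangle inequality, so a set that is (*k*,*b*)-clusterable for the radius cost is (*k*,2*b*)-clusterable for the diameter cost.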

Here we approach the problem of clustering from within the framework of *property testing*. In property testing, the goal is to determine whether a given object has a particular property or whether it must be modified significantly in order to obtain the property. In the context of clustering, testing takes on the following form: The algorithm is given parameters *k*, *b*, $\beta$, and $\epsilon$, and it can sample from the set of points *X*. The goal of the algorithm is to distinguish between the case when *X* is (*k*,*b*)-clusterable and the case when *X* is $\epsilon$-far from being $(k,(1+\beta)b)$-clusterable. By $\epsilon$-far from being $(k,(1+\beta)b)$-clusterable we mean that more than $\epsilon\cdot|X|$ points must be removed from *X* for it to become $(k,(1+\beta)b)$-clusterable. In this work we describe and analyze algorithms that use a sample of size polynomial in *k* and $1/\epsilon$ and *independent* of |*X*|. (The dependence on $\beta$ and on the dimension, *d*, of the points varies with the different algorithms.) Such algorithms may be especially useful when the set of points *X* is very large and it may not even be feasible to observe all of it.

Our algorithms can also be used to find *approximately good* clusterings. Namely, these are clusterings of all but an $\epsilon$-fraction of the points in *X* that have optimal (or close to optimal) cost. The benefit of our algorithms is that they construct an *implicit representation* of such clusterings in time independent of |*X*|. That is, without actually having to partition all points in *X*, the implicit representation can be used to answer queries concerning the cluster to which any given point belongs.
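The idea of an implicit representation can be sketched in the same one-dimensional diameter setting. The sketch below keeps only the left endpoints of the clusters found on a sample and answers membership queries from them, so its size depends on the sample and not on |*X*|. All names here (`build_implicit_clustering_1d`, `cluster_of`, `sample_size`) are hypothetical, and this is an illustration of the concept under those assumptions, not the construction used in the paper.

```python
import bisect
import random


def build_implicit_clustering_1d(points, b, sample_size, rng=None):
    """Return sorted left endpoints of clusters found on a random sample.

    Cluster i is the interval [starts[i], starts[i] + b]; consecutive
    starts are more than b apart, so the intervals are disjoint.
    """
    rng = rng or random.Random()
    sample = sorted(rng.choice(points) for _ in range(sample_size))
    starts = []
    for x in sample:
        if not starts or x - starts[-1] > b:
            starts.append(x)
    return starts


def cluster_of(x, starts, b):
    """Index of the cluster containing x, or None if x is left unassigned."""
    i = bisect.bisect_right(starts, x) - 1  # last start that is <= x
    if i >= 0 and x - starts[i] <= b:
        return i
    return None
```

A query costs one binary search over the stored endpoints. Points that fall outside every stored interval are returned as `None`; they play the role of the small fraction of points that such a clustering is allowed to leave out.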

Original language | American English |
---|---|
Pages (from-to) | 285–308 |
Number of pages | 23 |
Journal | SIAM Review |
Volume | 46 |
Issue number | 2 |
State | Published - 2004 |