In this work we study the problem of clustering with respect to the diameter and the radius costs: We say that a set X of points in $\Re^d$ is $(k,b)$-clusterable with respect to the diameter cost if X can be partitioned into k subsets (clusters) so that the distance between every pair of points in each cluster is at most b. In the case of the radius cost we require that all points that belong to the same cluster be at a distance of at most b from some common central point.
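As an illustration of the two cost definitions, the following minimal Python sketch checks whether a given partition respects the diameter bound or the radius bound b. The function names `within_diameter` and `within_radius` are hypothetical, and the Euclidean distance is assumed as the metric; the sketch only verifies a partition that is handed to it and does not search for one.

```python
import math
from itertools import combinations

def within_diameter(clusters, b):
    """Return True if every pair of points inside each cluster is at distance <= b."""
    return all(
        math.dist(p, q) <= b
        for cluster in clusters
        for p, q in combinations(cluster, 2)
    )

def within_radius(clusters, centers, b):
    """Return True if every point is at distance <= b from its cluster's center.

    centers[i] is the (assumed given) central point of clusters[i];
    it does not have to be one of the input points.
    """
    return all(
        math.dist(p, c) <= b
        for cluster, c in zip(clusters, centers)
        for p in cluster
    )
```

A set X is then (k,b)-clusterable with respect to a cost if some partition into k clusters passes the corresponding check.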
Here we approach the problem of clustering from within the framework of property testing. In property testing, the goal is to determine whether a given object has a particular property or whether it must be modified significantly before it obtains the property. In the context of clustering, testing takes on the following form: The algorithm is given parameters k, b, $\beta$, and $\epsilon$, and it can sample from the set of points X. The goal of the algorithm is to distinguish between the case when X is $(k,b)$-clusterable and the case when X is $\epsilon$-far from being $(k,(1+\beta)b)$-clusterable. By $\epsilon$-far from being $(k,(1+\beta)b)$-clusterable we mean that more than $\epsilon\cdot|X|$ points must be removed from X for it to become $(k,(1+\beta)b)$-clusterable. In this work we describe and analyze algorithms that use a sample of size polynomial in k and $1/\epsilon$ and independent of |X|. (The dependence on $\beta$ and on the dimension, d, of the points varies with the different algorithms.) Such algorithms may be especially useful when the set of points X is very large and it may not even be feasible to observe all of it.
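The general shape of such a tester can be sketched as follows, under strong simplifying assumptions: the points are one-dimensional, the cost is the diameter cost, and the tester accepts exactly when a uniform sample is clusterable into at most k clusters. The names `clusterable_1d` and `sample_tester` and the parameter `sample_size` are hypothetical; the sample size needed for a given $\epsilon$ is not derived here, and the relaxation parameter $\beta$ plays no role in this simplified sketch.

```python
import random

def clusterable_1d(points, k, b):
    """Exact greedy check: can 1-D points be split into at most k clusters of diameter <= b?"""
    used = 0
    start = None  # leftmost point of the cluster currently being filled
    for x in sorted(points):
        if start is None or x - start > b:
            # x does not fit in the current cluster, so open a new one at x
            used += 1
            start = x
            if used > k:
                return False
    return True

def sample_tester(X, k, b, sample_size, rng=random):
    """Accept iff a uniform sample (with replacement) of X is clusterable.

    sample_size is an assumed input; choosing it as a function of k and
    epsilon is the subject of the analysis and is not reproduced here.
    """
    sample = [rng.choice(X) for _ in range(sample_size)]
    return clusterable_1d(sample, k, b)
```

Because every subset of a clusterable set is itself clusterable, a tester of this form never rejects a clusterable X; the error can only occur on inputs that are far from clusterable.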
Our algorithms can also be used to find approximately good clusterings. Namely, these are clusterings of all but an $\epsilon$-fraction of the points in X that have optimal (or close to optimal) cost. The benefit of our algorithms is that they construct an implicit representation of such clusterings in time independent of |X|. That is, without actually having to partition all points in X, the implicit representation can be used to answer queries concerning the cluster to which any given point belongs.
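One way to picture an implicit representation for the radius cost is a short list of centers together with the bound b: a query point is assigned to a cluster by comparing it with the centers, without ever partitioning X. The sketch below is only an illustration under that assumption; the name `make_cluster_query` is hypothetical, the centers are taken as given, and the Euclidean distance is assumed.

```python
import math

def make_cluster_query(centers, b):
    """Build a query function from an assumed list of cluster centers and a radius b."""
    def query(point):
        # Return the index of the first center within distance b of the point,
        # or None if the point lies outside every cluster (an excluded point).
        for i, c in enumerate(centers):
            if math.dist(point, c) <= b:
                return i
        return None
    return query
```

For example, `make_cluster_query([(0, 0), (10, 10)], 1.5)` yields a function whose cost per query depends only on the number of centers, not on |X|.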