A law of large numbers for nearest neighbour statistics

In practical data analysis, methods based on proximity (near-neighbour) relationships between sample points are important because these relations can be computed in time $O(n \log n)$ as the number of points $n \to \infty$. Associated with such methods is a class of random variables defined as functions of a given point and its nearest neighbours in the sample. If the sample points are independent and identically distributed, the associated random variables are also identically distributed but not independent. Despite this, we show that random variables of this type satisfy a strong law of large numbers, in the sense that their sample means converge to their expected values almost surely as the number of sample points $n \to \infty$.


Introduction
Let $X = X_1, X_2, \ldots$ be a sequence of independent and identically distributed random vectors $X_i \in \mathbb{R}^d$, and let $\mathcal{X}_n = (X_1, \ldots, X_n)$ denote the first $n$ points of the sequence. The common distribution of the $X_i$ will be called the sampling distribution and $\mathcal{X}_n$ will be called a sample (of size $n$). Let $\nu_i(n, k)$ denote the index of the $k$-nearest neighbour of $X_i$ among the points of $\mathcal{X}_n = (X_1, \ldots, X_n)$, where the distance between any two points is defined with respect to the metric induced by any $l_p$-norm on $\mathbb{R}^d$,
$$\|a\|_p = \begin{cases} \left( \sum_{j=1}^{d} |a_j|^p \right)^{1/p}, & 1 \le p < \infty, \\[4pt] \max_{j=1,\ldots,d} |a_j|, & p = \infty, \end{cases}$$
and equidistant points are ordered by their indices. Let $h : \mathbb{R}^{d \times (k+1)} \to \mathbb{R}$ be a measurable function. We define the identically distributed random variables
$$h_{i,n} = h(X_i, X_{\nu_i(n,1)}, \ldots, X_{\nu_i(n,k)}), \quad i = 1, \ldots, n, \qquad (1.1)$$
and let $S_n$ and $H_n$, respectively, denote their sum and sample mean,
$$S_n = \sum_{i=1}^{n} h_{i,n} \quad \text{and} \quad H_n = \frac{1}{n} S_n. \qquad (1.2)$$
Let $\mu_n$, $\sigma_n$ and $\kappa_n$, respectively, denote the mean, standard deviation and kurtosis of the (identically distributed) random variables $h_{i,n}$. Our main result is that $H_n$ is a strongly consistent estimator for $\mu_n$ as $n \to \infty$.
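To make the setup concrete, the following Python sketch (our own illustration, not part of the paper) computes the statistics $S_n$ and $H_n$ for a random sample, using brute-force distance computation rather than an $O(n \log n)$ search structure, and with $h$ chosen arbitrarily as the distance from a point to its $k$th nearest neighbour:

```python
import numpy as np

def knn_indices(X, k):
    """nu_i(n, l), l = 1..k: indices of the k nearest neighbours of each row
    of X under the Euclidean (p = 2) metric, with the point itself excluded
    and equidistant points ordered by index (stable sort)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)                      # exclude the point itself
    return np.argsort(D, axis=1, kind="stable")[:, :k]

def nn_statistics(X, h, k):
    """Return S_n (sum) and H_n (sample mean) of h_{i,n} = h(X_i, neighbours)."""
    nu = knn_indices(X, k)
    vals = np.array([h(X[i], X[nu[i]]) for i in range(len(X))])
    return vals.sum(), vals.mean()

rng = np.random.default_rng(0)
X = rng.random((500, 2))                             # i.i.d. sample, uniform on [0,1]^2
h = lambda x, nbrs: np.linalg.norm(x - nbrs[-1])     # distance to the k-th neighbour
Sn, Hn = nn_statistics(X, h, k=2)
```

Replacing `h` with any measurable function of a point and its $k$ neighbours gives the general statistic considered above.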
Theorem 1.1. If the sampling distribution is continuous and if the kurtosis $\kappa_n$ is uniformly bounded for all $n \in \mathbb{N}$, then
$$\frac{H_n - \mu_n}{\sigma_n} \to 0 \quad \text{a.s. as } n \to \infty. \qquad (1.3)$$
Theorem 1.1 shows that the rate at which $H_n$ converges to $\mu_n$ is at least of the order of $O(\sigma_n)$ as $n \to \infty$. Let us define the normalized random variables
$$h^*_{i,n} = \frac{h_{i,n} - \mu_n}{\sigma_n},$$
and let $S^*_n$ and $H^*_n$, respectively, denote their sum and sample mean. If the sampling distribution is continuous, we show that the variance of $S^*_n$ is of the asymptotic order $O(n)$ as $n \to \infty$.

Theorem 1.2. If the sampling distribution is continuous, then for all $n \ge 16k$,
$$\operatorname{var}(S^*_n) \le 2(n+1)(3 + 8kb)(3 + 64k)(\kappa_n + \kappa_{n+1}), \qquad (1.6)$$
where $b(k, d, p) = k \lfloor d v_{d,p} \rfloor$ and $v_{d,p}$ is the volume of the unit ball in $(\mathbb{R}^d, \|\cdot\|_p)$.
To show that $|H^*_n| \to 0$ a.s. as $n \to \infty$, we first show that the expected squared difference $E(S^*_n - S^*_{n+1})^2$ between successive terms of the sequence $S^*_n$ is bounded independently of $n$. We then use the Efron–Stein inequality (Efron & Stein 1981; Steele 1986) to bound the variance $\operatorname{var}(S^*_n)$, which by Chebyshev's inequality yields the weak convergence of $H^*_n$ to zero as $n \to \infty$. Finally, we again use the bound on $E(S^*_n - S^*_{n+1})^2$ to prove strong convergence using a standard argument.

Similar results have previously appeared in the literature. In particular, Penrose & Yukich (2003) obtained a weak law of large numbers for functions of binomial point processes in $\mathbb{R}^d$, then applied the result to a number of different types of random proximity graphs and to a number of functions, including the total edge length (under arbitrary weighting) and the number of components in the graph. Wade (2007) applied the general result of Penrose & Yukich (2003) to the total (power-weighted) edge length of several types of nearest neighbour graphs on random point sets in $\mathbb{R}^d$, and gave explicit expressions for the limiting constants. In two recent papers, Penrose (2007) obtained a law of large numbers for a class of marked point processes and gave some applications to noise estimation, while Penrose & Wade (2008) investigated the asymptotic properties of the random online nearest neighbour graph.
The methods of Penrose & Yukich (2003) depend on the notion of stabilization, which demands that the neighbourhood of a particular vertex is unaffected by changes in the neighbourhood of another vertex located outside some sufficiently large ball (the radius of this ball is called the radius of stabilization). Because we consider only $k$-nearest neighbour graphs, we will instead rely on the standard geometric fact that there exists a finite number $b = b(k, d, p)$ such that, for any countable set of points in $(\mathbb{R}^d, \|\cdot\|_p)$, any point of the set can be among the first $k$-nearest neighbours of at most $b$ other points of the set.
Motivated by the need for a multidimensional goodness-of-fit test, Bickel & Breiman (1983) investigated the asymptotic properties of sums of bounded functions of the (first) nearest neighbour distances, and proved a central limit theorem for the random variables $f(X_i)\|X_i - X_{\nu_i(n,1)}\|$, where $f$ is the (unknown) sampling density. This was extended to $k$-nearest neighbour distances by Penrose (2000), with $k$ allowed to increase as a fractional power of the number of points $n$.
Functions defined in terms of proximity relations exhibit 'local' dependence, which can often be represented by dependency graphs, first described by Petrovskaya & Leontovitch (1982). For a set of random variables $\{X_i : i \in V\}$, the graph $G(V, E)$ is said to be a dependency graph for the set if, for any pair of disjoint sets $A_1, A_2 \subseteq V$ such that no edge has one endpoint in $A_1$ and the other in $A_2$, the $\sigma$-fields $\sigma\{X_i : i \in A_1\}$ and $\sigma\{X_i : i \in A_2\}$ are mutually independent. The following result is due to Baldi & Rinott (1989).

Theorem 1.3 (Baldi & Rinott 1989). Let $\{Z_i : i \in V\}$ be random variables having a dependency graph $G(V, E)$, and define $S = \sum_{i \in V} Z_i$. Let $D$ denote the maximal degree of $G$ and suppose $|Z_i| \le B$ almost surely. Then the distance between the distribution function of $(S - ES)/\sqrt{\operatorname{var}(S)}$ and $\Phi(x)$ is bounded by an explicit quantity depending on $D$, $B$, $|V|$ and $\operatorname{var}(S)$, where $|V|$ is the cardinality of $V$ and $\Phi(x)$ is the distribution function of the standard normal distribution $N(0, 1)$.
Avram & Bertsimas (1993) applied theorem 1.3 to the length of the k-nearest neighbour graph, the Delaunay triangulation and the Voronoi diagram of random point sets. More recently, Penrose & Yukich (2001) used stabilization properties to prove central limit theorems for functions of various types of (random) proximity graphs, while Chen & Shao (2004) obtained central limit theorems for various types of local dependence, and in particular improved on the result of Baldi & Rinott (1989).
In the present context, vertices in the dependency graph corresponding to the random variables $h_{i,n}$ and $h_{j,n}$ are connected by an edge only if they share a common nearest neighbour. By lemma 4.2, the maximal degree of $G$ is therefore bounded independently of the number of points $n$. Under the additional assumption that the $h_{i,n}$ are bounded a.s., if we could show that there exist constants $c, \epsilon > 0$ such that $\operatorname{var}(S_n) \ge c n^{2/3 + \epsilon}$, it would then follow that $S^*_n \to N(0, 1)$ in distribution as $n \to \infty$. This is beyond the scope of the present paper.

Applications
(a) Nearest neighbour distances

Let $X = X_1, X_2, \ldots$ be a sequence of independent and identically distributed random variables taking values in the unit cube $[0,1]^d$. Let $d_{i,n}(X)$ be the distance between the point $X_i$ and its $k$-nearest neighbour in the sample $\mathcal{X}_n$, and let $D_n(X)$ denote the sample mean of the $d_{i,n}(X)$,
$$d_{i,n}(X) = \|X_i - X_{\nu_i(n,k)}\| \quad \text{and} \quad D_n(X) = \frac{1}{n} \sum_{i=1}^{n} d_{i,n}(X). \qquad (2.1)$$
Let $\mu^{(d)}_n$, $\sigma^{(d)}_n$ and $\kappa^{(d)}_n$, respectively, denote the mean, standard deviation and kurtosis of the (identically distributed) random variables $d_{i,n}(X)$.
Let $F$ denote the sampling distribution, $B_x(r)$ denote the ball of radius $r$ centred at $x \in \mathbb{R}^d$, and $u_x(r)$ denote its probability measure,
$$u_x(r) = F(B_x(r)). \qquad (2.2)$$
Suppose that $F$ satisfies a positive density condition (Gruber 2004), in the sense that there exist constants $a > 1$ and $\rho > 0$ such that $a^{-1} r^d \le u_x(r) \le a r^d$ for all $0 \le r \le \rho$. Suppose also that $F$ satisfies a smooth density condition, in the sense that its second partial derivatives are bounded at every point of $[0,1]^d$. Under these conditions, it is known (Evans et al. 2002) that for all $\epsilon > 0$, the $\alpha$th moment of the $k$-nearest neighbour distance distribution is of order $O(n^{\epsilon - \alpha/d})$ as $n \to \infty$ (2.3).

For any set of vertices $\{x_1, \ldots, x_n\}$ in $\mathbb{R}^d$, the (undirected) Euclidean $k$-nearest neighbours graph is constructed by including an edge between every point and its $k$-nearest neighbours in the set, where the nearest neighbour relations are defined with respect to the Euclidean metric ($p = 2$). Let $L_n(X)$ denote the total length of the Euclidean $k$-nearest neighbours graph of the random sample $\mathcal{X}_n$; applying theorem 1.1 yields the asymptotic behaviour of $L_n(X)$, which agrees with theorem 2(b) of Wade (2007).
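As an illustration of this application (our own code, not from the paper), the following sketch computes the sample mean $D_n(X)$ of the $k$-nearest neighbour distances and the total length $L_n(X)$ of the Euclidean $k$-nearest neighbours graph for a uniform random sample:

```python
import numpy as np

def knn_distances(X, k):
    """d_{i,n}(X): distance from each X_i to its k-th nearest neighbour in X."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)
    return np.sort(D, axis=1)[:, k - 1]

def knn_graph_length(X, k):
    """L_n(X): total edge length of the undirected Euclidean k-NN graph
    (each edge counted once, even when the relation is mutual)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)
    nu = np.argsort(D, axis=1, kind="stable")[:, :k]
    edges = {(min(i, int(j)), max(i, int(j))) for i in range(len(X)) for j in nu[i]}
    return sum(np.linalg.norm(X[i] - X[j]) for i, j in edges)

rng = np.random.default_rng(1)
X = rng.random((400, 2))              # i.i.d. uniform sample in the unit square
Dn = knn_distances(X, k=1).mean()     # sample mean D_n of nearest neighbour distances
Ln = knn_graph_length(X, k=1)         # total length of the (undirected) 1-NN graph
```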

(b) Noise estimation
Non-parametric regression attempts to model the behaviour of some observable variable $Y \in \mathbb{R}$ in terms of another observable variable $X \in \mathbb{R}^d$, based only on a finite number of observations of the joint variable $(X, Y)$. The relationship between $X$ and $Y$ is often assumed to satisfy the additive hypothesis
$$Y = E(Y \mid X) + R,$$
where $E(Y \mid X)$ is the regression function and $R$ is the residual variable (noise). Let $Z = Z_1, Z_2, \ldots$ be a sequence of independent and identically distributed observations of the joint variable, let $X_1, X_2, \ldots$ and $Y_1, Y_2, \ldots$ denote the marginal sequences, and let $\nu_i(n, k)$ denote the index of the $k$-nearest neighbour of $X_i$ among the points of the marginal sample $\mathcal{X}_n = (X_1, \ldots, X_n)$. To estimate the $k$th moment $E(R^k)$ of the residual distribution ($k \in \mathbb{N}$), we consider the random variables $g_{i,n}(Z)$ and their sample mean $G_n(Z)$, and let $\mu^{(g)}_n$, $\sigma^{(g)}_n$ and $\kappa^{(g)}_n$, respectively, denote the mean, standard deviation and kurtosis of the (identically distributed) random variables $g_{i,n}(Z)$.
Perhaps the main contribution of this paper is that theorem 1.1 extends to possibly unbounded random variables, the only requirement being that their first four moments are bounded. This allows the residual variable R in (2.10) to take arbitrarily large values.
Suppose that the sampling distribution of the explanatory variables $X_i$ satisfies the smooth and positive density conditions, the regression function $E(Y \mid X)$ satisfies a Lipschitz condition on $\mathbb{R}^d$, and the residual variable $R$ is independent of $X$ (homoscedastic noise). Under these conditions, Evans & Jones (2008) show that
$$|\mu^{(g)}_n - E(R^k)| = O(\mu^{(d)}_n) \quad \text{as } n \to \infty, \qquad (2.12)$$
where $\mu^{(d)}_n$ is the expected $k$-nearest neighbour distance in the marginal sample $(X_1, \ldots, X_n)$ and the implied constant depends on the residual moments up to order $k-1$, the constant implied by the Lipschitz condition on the regression function, and the constant implied by the positive density condition on the sampling distribution.
If the first $4k$ moments of the residual distribution are bounded, it can be shown that $\sigma^{(g)}_n = O(1)$ and $\kappa^{(g)}_n = O(1)$ as $n \to \infty$. Hence, by (2.5), (2.12) and theorem 1.1, it follows that
$$|G_n - \mu^{(g)}_n| \to 0 \quad \text{a.s. as } n \to \infty. \qquad (2.13)$$
Thus, the sample mean $G_n$ is a strongly consistent estimator for the $k$th moment $E(R^k)$ of the residual distribution as $n \to \infty$.
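The statistic $g_{i,n}$ of Evans & Jones (2008) is not reproduced above, so the following Python sketch is an illustration only: it uses the simplest first-neighbour version, in which, for centred homoscedastic noise, $E(Y_i - Y_{\nu_i(n,1)})^2 \approx 2E(R^2)$ up to a bias of order $\mu^{(d)}_n$, to estimate the second moment of the residual. The regression function and noise level here are arbitrary choices.

```python
import numpy as np

def residual_second_moment(X, Y):
    """Estimate E(R^2) under Y = E(Y|X) + R with centred homoscedastic noise:
    if nu_i is the nearest neighbour of X_i, then Y_i - Y_nu ~ R_i - R_nu up to
    a regression-smoothness term, so E(Y_i - Y_nu)^2 / 2 ~ E(R^2)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)
    nu = np.argmin(D, axis=1)             # index of the first nearest neighbour
    return np.mean((Y - Y[nu]) ** 2) / 2.0

rng = np.random.default_rng(2)
n = 2000
X = rng.random((n, 2))
regress = np.sin(2 * np.pi * X[:, 0])     # the (in practice unknown) regression function
R = rng.normal(scale=0.1, size=n)         # residuals with E(R^2) = 0.01
Y = regress + R
est = residual_second_moment(X, Y)        # consistent up to an O(mu_n^(d)) bias
```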

The Efron-Stein inequality
Let $X_1, X_2, \ldots$ be a sequence of independent and identically distributed random vectors in $\mathbb{R}^d$, and let $g : \mathbb{R}^{d \times n} \to \mathbb{R}$ be a symmetric function of $n$ vectors in $\mathbb{R}^d$. The Efron–Stein inequality (Efron & Stein 1981) provides an upper bound for the variance of the statistic $Z = g(X_1, \ldots, X_n)$ in terms of the leave-one-out statistics
$$Z^{(i)} = g(X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_{n+1}), \quad i = 1, \ldots, n+1.$$

Theorem 3.1 (Efron & Stein 1981). Let $X_1, X_2, \ldots, X_{n+1}$ be independent and identically distributed random vectors in $\mathbb{R}^d$, let $g : \mathbb{R}^{d \times n} \to \mathbb{R}$ be a symmetric function, and define the random variable $Z = g(X_1, \ldots, X_n)$. Then
$$\operatorname{var}(Z) \le \sum_{i=1}^{n+1} E\big(Z^{(i)} - \bar{Z}\big)^2, \quad \text{where } \bar{Z} = \frac{1}{n+1} \sum_{i=1}^{n+1} Z^{(i)}.$$
Because $\bar{Z}$ minimizes the sum of squares, the same bound holds with $\bar{Z}$ replaced by any symmetric function of $X_1, \ldots, X_{n+1}$.
The Efron-Stein inequality has found a wide range of applications in statistics. Our approach is partly based on the work of Reitzner (2003), who uses the inequality to prove strong laws of large numbers for statistics related to random polytopes.
First we note that, because $X_1, \ldots, X_{n+1}$ are independent and identically distributed, it follows by symmetry on the indices that the leave-one-out statistics $Z^{(i)}$ are identically distributed, and that if $E(h^2_{i,n}) < \infty$ then $E(Z^2) < \infty$. Furthermore, because the points $X_1, \ldots, X_n$ are independent and identically distributed, the statistic $S_n = Z(X_1, \ldots, X_n)$ is invariant under permutations of its arguments. Thus, we may apply the Efron–Stein inequality to $S_n$. In this case $Z^{(n+1)} = S_n$, so by replacing $Z(\cdot)$ by $S_{n+1}$ in (3.3), we obtain
$$\operatorname{var}(S_n) \le (n+1) E(S_n - S_{n+1})^2. \qquad (3.8)$$
An identical expression also holds for the normalized sum $S^*_n$. Lemma 3.2 shows that the expected squared difference between successive values of $S^*_n$ is bounded independently of the total number of points $n$. The proof is given in §4.
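The inequality (3.8) is easy to check numerically. The Monte Carlo sketch below (our own code, not part of the paper, with $h$ taken as the first nearest neighbour distance) estimates both sides of the bound for samples in the unit square:

```python
import numpy as np

def knn_sum(X, k=1):
    """S_n = sum over i of the distance from X_i to its k-th nearest neighbour."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)
    return np.sort(D, axis=1)[:, k - 1].sum()

rng = np.random.default_rng(3)
n, trials = 50, 300
Sn = np.empty(trials)
jump2 = np.empty(trials)
for t in range(trials):
    X = rng.random((n + 1, 2))      # n + 1 i.i.d. points in the unit square
    s_n = knn_sum(X[:n])            # statistic on the first n points
    s_np1 = knn_sum(X)              # statistic after adding X_{n+1}
    Sn[t] = s_n
    jump2[t] = (s_n - s_np1) ** 2
lhs = Sn.var()                      # Monte Carlo estimate of var(S_n)
rhs = (n + 1) * jump2.mean()        # Monte Carlo estimate of (n+1) E(S_n - S_{n+1})^2
```

In this simulation the right-hand side exceeds the left by a comfortable margin, as the inequality predicts.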
Lemma 3.2. If the sampling distribution is continuous, then for all $n \ge 16k$,
$$E(S^*_n - S^*_{n+1})^2 \le 2(3 + 8kb)(3 + 64k)(\kappa_n + \kappa_{n+1}).$$

The following corollary of lemma 3.2 follows immediately by (3.8) and Chebyshev's inequality, and asserts the weak consistency of the sample mean $H_n$, defined in (1.2), as an estimator for the true mean $\mu_n$ as $n \to \infty$ (provided the kurtosis $\kappa_n$ is uniformly bounded for all $n \in \mathbb{N}$).
Corollary 3.3. If the sampling distribution is continuous, then for all $n \ge 16k$ and all $t > 0$,
$$P(|H_n - \mu_n| \ge t\sigma_n) \le \frac{2(n+1)(3 + 8kb)(3 + 64k)(\kappa_n + \kappa_{n+1})}{n^2 t^2}.$$

We prove lemma 3.2 using methods based on the approach of Bickel & Breiman (1983) and Hitczenko et al. (1999). Having established lemma 3.2, the proof of theorem 1.1 is then straightforward.

(a) Standard results
We need two standard geometric results. The first of these concerns the expected probability measure of $k$-nearest neighbour balls. For $X_i \in \mathcal{X}_n$, let $B_{i,n}(X) \subset \mathbb{R}^d$ denote the $k$-nearest neighbour ball of $X_i$ (with respect to the finite sample $\mathcal{X}_n$), defined to be the ball centred at $X_i$ of radius equal to the distance from $X_i$ to its $k$-nearest neighbour in $\mathcal{X}_n$, and let $u_{i,n}(X)$ denote its probability measure,
$$u_{i,n}(X) = F(B_{i,n}(X)),$$
where $F$ is the (common) distribution function of the sample points $X_1, X_2, \ldots$. It is well known, see for example Percus & Martin (1998) or Evans et al. (2002), that provided the sampling distribution is continuous, the expected probability measure of any $k$-nearest neighbour ball (over all sample realizations) is equal to $k/n$.

The second result concerns the maximum degree of any vertex in a $k$-nearest neighbour graph. It is well known that this number is bounded independently of the total number of vertices (e.g. Stone 1977; Bickel & Breiman 1983; Zeger & Gersho 1994; Yukich 1998).
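The $k/n$ identity is easy to verify by simulation. The sketch below (our own code, for the uniform distribution on $[0,1]$, where the probability measure of a ball is its length clipped to the unit interval) estimates $E(u_{i,n}(X))$:

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, trials = 40, 3, 3000
u = np.empty(trials)
for t in range(trials):
    X = rng.random(n)                    # uniform sample on [0, 1]
    r = np.sort(np.abs(X - X[0]))[k]     # k-th NN distance of X_1 (entry 0 is X_1 itself)
    # probability measure of the ball [X_1 - r, X_1 + r] under Uniform[0, 1]
    u[t] = min(X[0] + r, 1.0) - max(X[0] - r, 0.0)
mean_u = u.mean()                        # should be close to k / n = 0.075
```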
Lemma 4.2. For every countable set of points in $\mathbb{R}^d$, any point can be among the first $k$-nearest neighbours of at most $b(k, d, p) = k \lfloor d v_{d,p} \rfloor$ other points of the set, where $v_{d,p}$ is the volume of the unit ball in $(\mathbb{R}^d, \|\cdot\|_p)$.
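Lemma 4.2 can likewise be checked empirically. In the Euclidean plane $v_{2,2} = \pi$, so $b(k, 2, 2) = k\lfloor 2\pi \rfloor = 6k$; the sketch below (our own code) computes the in-degrees of the directed $k$-nearest neighbour graph of a random sample and checks them against this bound:

```python
import numpy as np

rng = np.random.default_rng(5)
d, k, n = 2, 2, 1000
X = rng.random((n, d))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
np.fill_diagonal(D, np.inf)
nu = np.argsort(D, axis=1, kind="stable")[:, :k]    # k nearest neighbours of each point
in_degree = np.bincount(nu.ravel(), minlength=n)    # how many points list i among their k-NN
b = k * int(np.floor(d * np.pi))                    # b(k, 2, 2) = k * floor(2*pi), v_{2,2} = pi
```

In practice the maximum in-degree observed for random points is well below the worst-case bound $b$.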

(b) Adding a sample point
Let $\lambda_n$ denote the second moment of the random variables $h_{i,n}(X)$,
$$\lambda_n = \sigma^2_n + \mu^2_n. \qquad (4.3)$$

Lemma 4.3. If the sampling distribution is continuous, then …

Proof. Let $B_{i,n}(X) \subset \mathbb{R}^d$ denote the $k$-nearest neighbour ball of $X_i$ with respect to the finite sample $\mathcal{X}_n = (X_1, \ldots, X_n)$, and suppose we add another (independent and identically distributed) point $X_{n+1}$ to the ensemble. If $X_{n+1} \notin B_{i,n}(X)$ (i.e. if the new point $X_{n+1}$ falls outside the current $k$-nearest neighbour ball of $X_i$), then $X_{\nu_i(n,\ell)} = X_{\nu_i(n+1,\ell)}$ for all $1 \le \ell \le k$, and hence
$$h_{i,n} = h(X_i, X_{\nu_i(n,1)}, \ldots, X_{\nu_i(n,k)}) = h_{i,n+1}. \qquad (4.5)$$
So that we can consider samples of sizes $n$ and $n+1$ separately, we apply the crude bound
$$(h_{i,n} - h_{i,n+1})^2 \le 2h^2_{i,n} + 2h^2_{i,n+1}. \qquad (4.12)$$
For samples of size $n$, because the random variable $h^2_{i,n}$ is determined only by the points $X_1, \ldots, X_n$, its conditional expectation given these points is unaffected by the event $X_{n+1} \in B_{i,n}$, and therefore
$$E\big(h^2_{i,n} \mathbf{1}\{X_{n+1} \in B_{i,n}\}\big) = E\big(h^2_{i,n}\, u_{i,n}(X)\big).$$
For samples of size $n+1$, the event $X_{n+1} \in B_{i,n}$ is simply the event that the 'last' point of the sample $X_1, \ldots, X_{n+1}$ is among the first $k$-nearest neighbours of $X_i$; that is, $X_{n+1} \in B_{i,n}$ if and only if $\nu_i(n+1, \ell) = n+1$ for some $1 \le \ell \le k$. Because the points $X_1, \ldots, X_{n+1}$ are independent and identically distributed, the order in which they are indexed is arbitrary. Hence, it follows by symmetry on the indices $j \ne i$ that the condition $\nu_i(n+1, k) = n+1$ cannot affect the expected value of $h^2_{i,n+1}$. By (4.5), $h_{i,n} \ne h_{i,n+1}$ only if $i \in M_{n+1}$, the set of indices $i$ for which $X_{n+1} \in B_{i,n}(X)$. Furthermore, by lemma 4.2, the new point $X_{n+1}$ can become one of the first $k$-nearest neighbours of at most $b$ of the existing points $X_1, \ldots, X_n$, so $|M_{n+1}| \le b$. Hence the result follows by the Cauchy inequality.

Lemma 4.6. If the sampling distribution is continuous, then for all $n \ge 16k$,
$$\text{(i)} \ \ E(S_n - S_{n+1})^2 \le 2(3 + 8kb)\lambda_n \quad \text{and} \quad \text{(ii)} \ \ \operatorname{var}(S_n) \le 2(n+1)(3 + 8kb)\lambda_n.$$

Proof. By definition, … To find an upper bound on $E(S^*_{n+1} - S^*_n)^2$, we first need to quantify the squared difference between successive values of $\mu_n$ and $\sigma_n$.
Lemma 4.7. If the sampling distribution is continuous, then for all $n \ge 16k$,
$$\text{(i)} \ \ (\mu_n - \mu_{n+1})^2 \le \frac{8kb}{n^2} \lambda_n \quad \text{and} \quad \text{(ii)} \ \ (\sigma_n - \sigma_{n+1})^2 \le \frac{16k}{n} \lambda_n.$$

Proof. To prove (i), because the $h_{i,n}$ are identically distributed, …, where the last inequality follows by lemma 4.5.
To prove (ii), let $\tilde{X} = (\tilde{X}_1, \tilde{X}_2, \ldots)$ be an independent copy of $X = (X_1, X_2, \ldots)$, and let $\tilde{\nu}_i(n, k)$ denote the index of the $k$-nearest neighbour of $\tilde{X}_i$ among the points of $\tilde{\mathcal{X}}_n = (\tilde{X}_1, \ldots, \tilde{X}_n)$. We consider the random variables …, from which it follows that … (4.50). Thus, by lemmas 4.4, 4.6 and 4.7, we obtain … (4.51).

D. Evans
Finally, because $\lambda_n / \sigma^2_n \le \kappa^{1/2}_n$ and $\kappa_{n+1} \ge 1$, … (4.56). Then $|H^*_n| \le |H^*_{m^2}| + W_m$ for all $m^2 \le n < (m+1)^2$, so by (4.54) it is sufficient to show that $W_m \to 0$ a.s. as $m \to \infty$. Writing …, we see that …. Hence, by the Cauchy inequality, and using the fact that the sum has exactly $2m$ terms, it follows by lemma 3.2 that ….

Lemma A.1. For every countable set of points in $(\mathbb{R}^d, \|\cdot\|_p)$, any point can be the (first) nearest neighbour of at most $\lfloor d v_{d,p} \rfloor$ other points of the set.

Proof. Let $t > d v_{d,p}$ be an integer and suppose that $x_0$ is the nearest neighbour of every point in the set $\{x_1, \ldots, x_t\}$. We project the points $x_i$ onto the surface of the unit ball in $\mathbb{R}^d$, writing …. Let $x_i$ and $x_j$ be two distinct points of $\{x_1, \ldots, x_t\}$ and suppose (without loss of generality) that $\|x_j - x_0\| \le \|x_i - x_0\|$. The vector $x_i - x_j$ can be expressed as …. Hence, because $r_i \le \|x_i - x_j\|$ and $\|x_i\| = 1$, we have … and therefore
$$1 \le \|x_i - x_j\|. \qquad (\text{A }12)$$
Thus, every point $x_i$ must be located within an otherwise empty region on the surface of the unit ball in $\mathbb{R}^d$, and the surface area of each of these regions must be at least equal to 1. By hypothesis, there are $t > d v_{d,p}$ points $x_1, \ldots, x_t$, so the total surface area covered by these disjoint regions must be greater than $d v_{d,p}$. However, because the total surface area of the unit ball is equal to $d v_{d,p}$, we have a contradiction, and thus we conclude that $t \le \lfloor d v_{d,p} \rfloor$. ∎

Lemma 8.4 of Yukich (1998) shows that lemma 4.2 follows from lemma A.1. An alternative proof is given here.
Proof of lemma 4.2. Let $t > k d v_{d,p}$ be an integer and suppose that $x_0$ is one of the $k$-nearest neighbours of every point in the set $\{x_1, \ldots, x_t\}$. First we choose $x_1$ to be the point of $\{x_1, \ldots, x_t\}$ furthest away from $x_0$, then eliminate this point $x_1$ along with all points that are closer to $x_1$ than $x_0$ is to $x_1$. At least one point is eliminated (namely the point $x_1$ itself), and because $x_0$ is one of the $k$-nearest neighbours of $x_1$, there can be at most $k - 1$ other points closer to $x_1$ than $x_0$ is to $x_1$, so at most $k$ points are eliminated in total.
Next we repeat the procedure on the remaining points, choosing $x_2$ to be the furthest point away from $x_0$, then eliminating $x_2$ along with all points that are closer to $x_2$ than $x_0$ is to $x_2$. Because $x_2$ was not eliminated in the first round, it must be closer to $x_0$ than it is to $x_1$,
$$\|x_2 - x_0\| \le \|x_2 - x_1\|. \qquad (\text{A }13)$$
We continue in this way, at each stage choosing $x_i$ to be the point (among those that remain) furthest away from $x_0$, and eliminating $x_i$ along with all points that are closer to $x_i$ than $x_0$ is to $x_i$. Because $x_i$ was not eliminated in the previous rounds, it must be closer to $x_0$ than to any of the points chosen previously,
$$\|x_i - x_0\| \le \|x_i - x_j\| \quad \text{for all } j = 1, \ldots, i-1. \qquad (\text{A }14)$$
At least one point is eliminated at each stage, so the process must eventually terminate, say after $T$ steps, and we obtain a set of points $\{x_1, \ldots, x_T\}$, each having $x_0$ as its nearest neighbour. At most $k$ points are eliminated at each step, so a minimum of $t/k$ steps must be performed before the process terminates. By hypothesis, $t > k d v_{d,p}$, so $T > d v_{d,p}$. However, by lemma A.1 we know that any point $x_0$ can be the nearest neighbour of at most $\lfloor d v_{d,p} \rfloor$ other points. Thus we have a contradiction, and conclude that $t \le k \lfloor d v_{d,p} \rfloor$. ∎

For Euclidean space, Zeger & Gersho (1994) establish an alternative bound in terms of kissing numbers.
Corollary A.2. If $p = 2$ then, for every countable set of points in $\mathbb{R}^d$, any point can be among the $k$-nearest neighbours of at most $kK(d)$ other points of the set, where $K(d)$ is the maximum kissing number in $\mathbb{R}^d$.
Proof. Following the proof of lemma A.1, by (A 12) we can place a set of $t$ non-overlapping spheres of radius $1/2$, one centred at each point $x_i$, each of which is tangent to the sphere of radius $1/2$ centred at the origin. This contradicts the fact that there can be at most $K(d)$ such tangent spheres. ∎
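The elimination argument in the proof of lemma 4.2 above is effectively an algorithm and can be run directly. The Python sketch below (our own code, in the Euclidean plane, where $\lfloor d v_{d,p} \rfloor = \lfloor 2\pi \rfloor = 6$) selects points greedily as in the proof and checks the separation property (A 14) together with the bounds of lemmas A.1 and 4.2:

```python
import numpy as np

def eliminate(points, x0):
    """Greedy elimination from the proof of lemma 4.2: repeatedly pick the
    remaining point furthest from x0, then discard it together with every
    remaining point strictly closer to it than x0 is to it."""
    pts = list(points)
    chosen = []
    while pts:
        pts.sort(key=lambda x: np.linalg.norm(x - x0))
        xi = pts.pop()                               # furthest remaining point
        chosen.append(xi)
        r = np.linalg.norm(xi - x0)                  # distance from x_i to x_0
        pts = [y for y in pts if np.linalg.norm(y - xi) >= r]
    return chosen

rng = np.random.default_rng(6)
k = 2
X = rng.random((500, 2))
x0 = np.array([0.5, 0.5])
allpts = np.vstack([X, x0])                          # sample plus the reference point
D = np.linalg.norm(X[:, None, :] - allpts[None, :, :], axis=2)
D[np.arange(500), np.arange(500)] = np.inf           # exclude each point itself
# points that have x0 among their k nearest neighbours (index 500 is x0)
has_x0 = [X[i] for i in range(500) if 500 in np.argsort(D[i], kind="stable")[:k]]
chosen = eliminate(has_x0, x0)
# by lemma 4.2, |has_x0| <= k * 6; by lemma A.1, |chosen| <= 6
```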