## Abstract

In practical data analysis, methods based on proximity (near-neighbour) relationships between sample points are important because these relations can be computed in time O(*n* log *n*) as the number of points *n*→∞. Associated with such methods is a class of random variables defined as functions of a given point and its nearest neighbours in the sample. If the sample points are independent and identically distributed, the associated random variables will also be identically distributed but not independent. Despite this, we show that random variables of this type satisfy a strong law of large numbers, in the sense that their sample means converge to their expected values almost surely as the number of sample points *n*→∞.

## 1. Introduction

Let 𝒳 = *X*_{1}, *X*_{2}, … be a sequence of independent and identically distributed random vectors in ℝ^{d}, and let 𝒳_{n} = (*X*_{1}, …, *X*_{n}) denote the first *n* points of the sequence. The common distribution of the *X*_{i} will be called the sampling distribution and 𝒳_{n} will be called a sample (of size *n*). Let *ν*_{i}(*n*, *k*) denote the index of the *k*-nearest neighbour of *X*_{i} among the points of 𝒳_{n}, where the distance between any two points is defined with respect to the metric induced by any *l*^{p}-norm on ℝ^{d}, and equidistant points are ordered by their indices. Let *h* : ℝ^{d(k+1)} → ℝ be a measurable function. We define the identically distributed random variables
$$h_{i,n}(\mathcal{X}) = h\bigl(X_i, X_{\nu_i(n,1)}, \ldots, X_{\nu_i(n,k)}\bigr), \tag{1.1}$$
so that *h*_{i,n}(𝒳) is a function of the sample point *X*_{i} and its *k*-nearest neighbours among the points of 𝒳_{n}. Let *μ*_{n}, *σ*_{n}^{2} and *κ*_{n}, respectively, denote the mean, variance and kurtosis of the (identically distributed) random variables *h*_{i,n}(𝒳), and let *S*_{n} and *H*_{n}, respectively, denote their sum and sample mean,
$$S_n(\mathcal{X}) = \sum_{i=1}^{n} h_{i,n}(\mathcal{X}), \qquad H_n(\mathcal{X}) = \frac{S_n(\mathcal{X})}{n}. \tag{1.2}$$
Our main result is that *H*_{n} is a strongly consistent estimator for *μ*_{n} as *n*→∞.
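To make the definitions concrete, the following sketch (an illustration added here, not taken from the paper) computes *H*_{n} directly for a small sample; the Euclidean metric (*p*=2) and the particular choice of function *h* are assumptions of the example.

```python
def k_nearest_indices(points, i, k):
    """Return nu_i(n, 1..k): the indices of the k nearest neighbours of
    points[i] under the Euclidean metric, ties broken by index."""
    dists = [(sum((a - b) ** 2 for a, b in zip(points[i], points[j])), j)
             for j in range(len(points)) if j != i]
    dists.sort()
    return [j for _, j in dists[:k]]

def sample_mean_H(points, h, k):
    """H_n = (1/n) * sum_i h(X_i, X_{nu_i(n,1)}, ..., X_{nu_i(n,k)})."""
    n = len(points)
    total = 0.0
    for i in range(n):
        neighbours = [points[j] for j in k_nearest_indices(points, i, k)]
        total += h(points[i], neighbours)
    return total / n

# Example h: distance from a point to its first nearest neighbour.
h_dist = lambda x, nbrs: sum((a - b) ** 2 for a, b in zip(x, nbrs[0])) ** 0.5
pts = [(0.0,), (1.0,), (3.0,)]
print(sample_mean_H(pts, h_dist, k=1))  # (1 + 1 + 2) / 3 = 1.333...
```

Any measurable `h` of a point and its `k` neighbours can be plugged in; the statistic `sample_mean_H` is the quantity whose almost-sure convergence the paper establishes.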

**Theorem 1.1.** *If the sampling distribution is continuous and if the kurtosis κ*_{n} *is uniformly bounded for all n*∈ℕ*, then*
$$\frac{H_n - \mu_n}{\sigma_n} \to 0 \quad \text{almost surely as } n \to \infty. \tag{1.3}$$

Theorem 1.1 shows that the rate at which *H*_{n} converges to *μ*_{n} is at least of the order O(*σ*_{n}) as *n*→∞. Let us define the normalized random variables,
$$h_{i,n}^{*}(\mathcal{X}) = \frac{h_{i,n}(\mathcal{X}) - \mu_n}{\sigma_n}, \tag{1.4}$$
along with their sum and sample mean,
$$S_n^{*}(\mathcal{X}) = \sum_{i=1}^{n} h_{i,n}^{*}(\mathcal{X}), \qquad H_n^{*}(\mathcal{X}) = \frac{S_n^{*}(\mathcal{X})}{n}. \tag{1.5}$$
If the sampling distribution is continuous, we show that the variance of *H*^{*}_{n} is of the asymptotic order O(*n*^{−1}) as *n*→∞.

**Theorem 1.2.** *If the sampling distribution is continuous,* *then for all n*≥16*k,*(1.6)*where β*=*β*(*k*, *d*, *p*) *is the constant of lemma 4.2 and v*_{d,p} *is the volume of the unit ball in* (ℝ^{d}, *l*^{p}).

To show that *H*^{*}_{n}→0 a.s. as *n*→∞, we first show that the expected squared difference between successive terms of the sequence *S*^{*}_{n} is bounded independently of *n*. We then use the Efron–Stein inequality (Efron & Stein 1981; Steele 1986) to bound the variance var(*S*^{*}_{n}), which by Chebyshev's inequality yields the weak convergence of *H*^{*}_{n} to zero as *n*→∞. Finally, we again use the bound on var(*S*^{*}_{n}) to prove strong convergence using a standard argument.

Similar results have previously appeared in the literature. In particular, Penrose & Yukich (2003) obtained a weak law of large numbers for functions of binomial point processes in ℝ^{d}, then applied the result to several types of random proximity graphs and to several graph functions, including the total edge length (under arbitrary weighting) and the number of components in the graph. Wade (2007) applied the general result of Penrose & Yukich (2003) to the total (power-weighted) edge length of several types of nearest neighbour graphs on random point sets in ℝ^{d}, and gave explicit expressions for the limiting constants. In two recent papers, Penrose (2007) obtained a law of large numbers for a class of marked point processes and gave some applications to noise estimation, while Penrose & Wade (2008) have investigated the asymptotic properties of the random online nearest neighbour graph.

The methods of Penrose & Yukich (2003) depend on the notion of *stabilization*, which demands that the neighbourhood of a particular vertex is unaffected by changes in the neighbourhood of another vertex located outside some sufficiently large ball (the radius of this ball is called the *radius of stabilization*). Because we consider only *k*-nearest neighbour graphs, we will instead rely on the standard geometric fact that there exists a finite number *β*=*β*(*k*, *d*, *p*) such that for any countable set of points in ℝ^{d}, any point of the set can be among the first *k*-nearest neighbours of at most *β* other points of the set.
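This geometric fact can be checked empirically. The sketch below (illustrative, not from the paper) computes, for a one-dimensional sample with *k*=1, the largest number of points that share a given point as their nearest neighbour; in one dimension only the two points adjacent in sorted order can have a given point as nearest neighbour, so the count never exceeds 2.

```python
import random

def nn_index(points, i):
    """Index of the nearest neighbour of points[i] (1D, ties by index)."""
    return min((abs(points[i] - points[j]), j)
               for j in range(len(points)) if j != i)[1]

def max_in_degree(points):
    """Largest number of sample points that have a given point as their
    nearest neighbour (the quantity bounded by beta in the text)."""
    indegree = [0] * len(points)
    for i in range(len(points)):
        indegree[nn_index(points, i)] += 1
    return max(indegree)

random.seed(0)
sample = [random.random() for _ in range(300)]
print(max_in_degree(sample))  # never exceeds 2 in one dimension
```

In higher dimensions the analogous bound *β*(*k*, *d*, *p*) still holds but grows with *d*, which is the source of the dimensionality concern discussed in §5.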

Motivated by the need for a multidimensional goodness-of-fit test, Bickel & Breiman (1983) investigated the asymptotic properties of sums of bounded functions of the (first) nearest neighbour distances, and proved a central limit theorem for the associated random variables, which are defined in terms of the (unknown) sampling density *f*. This was extended to *k*-nearest neighbour distances by Penrose (2000), with *k* being allowed to increase as a fractional power of the number of points *n*.

Functions defined in terms of proximity relations exhibit ‘local’ dependence, which can often be represented by dependency graphs, first described by Petrovskaya & Leontovitch (1982). For a set of random variables {*Z*_{i} : *i*∈*V*}, the graph *G*(*V*, *E*) is said to be a *dependency graph* for the set if for any pair of disjoint sets *A*_{1}, *A*_{2}⊆*V* such that no edge has one endpoint in *A*_{1} and the other in *A*_{2}, the *σ*-fields *σ*(*Z*_{i} : *i*∈*A*_{1}) and *σ*(*Z*_{i} : *i*∈*A*_{2}) are mutually independent. The following result is due to Baldi & Rinott (1989).

**Theorem 1.3.** *Let* {*Z*_{i} : *i*∈*V*} *be random variables having a dependency graph G*(*V*, *E*)*,* *and define Z*=∑_{i∈V}*Z*_{i} *and σ*^{2}=var(*Z*). *Let D denote the maximal degree of G and suppose* |*Z*_{i}|≤*B almost surely*. *Then*(1.7)*where* |*V*| *is the cardinality of V and Φ*(*x*) *is the standard normal distribution N*(0, 1).

Avram & Bertsimas (1993) applied theorem 1.3 to the length of the *k*-nearest neighbour graph, the Delaunay triangulation and the Voronoi diagram of random point sets. More recently, Penrose & Yukich (2001) used stabilization properties to prove central limit theorems for functions of various types of (random) proximity graphs, while Chen & Shao (2004) obtained central limit theorems for various types of local dependence, and in particular improved on the result of Baldi & Rinott (1989).

In the present context, vertices in the dependency graph corresponding to the random variables *h*_{i,n} and *h*_{j,n} are connected by an edge only if they share a common nearest neighbour. By lemma 4.2, the maximal degree of *G* is therefore bounded independently of the number of points *n*. Under the additional assumption that the *h*_{i,n} are bounded a.s., if we could show that there exist constants *c*, *ϵ*>0 such that var(*S*_{n}) grows at a sufficient polynomial rate in *n*, it would then follow from theorem 1.3 that the standardized sum converges in distribution to *N*(0, 1) as *n*→∞. This is beyond the scope of the present paper.

## 2. Applications

### (a) Nearest neighbour distances

Let 𝒳 = *X*_{1}, *X*_{2}, … be a sequence of independent and identically distributed random variables taking values in the unit cube [0,1]^{d}. Let *δ*_{i,n}(𝒳) be the distance between the point *X*_{i} and its *k*-nearest neighbour in the sample 𝒳_{n}, and let *Δ*_{n}(𝒳) denote the sample mean of the *δ*_{i,n}(𝒳),
$$\Delta_n(\mathcal{X}) = \frac{1}{n} \sum_{i=1}^{n} \delta_{i,n}(\mathcal{X}). \tag{2.1}$$
Let *μ*^{δ}_{n}, *σ*^{δ}_{n} and *κ*^{δ}_{n}, respectively, denote the mean, standard deviation and kurtosis of the (identically distributed) random variables *δ*_{i,n}(𝒳).

Let *F* denote the sampling distribution, *B*_{x}(*r*) denote the ball of radius *r* centred at *x*∈[0,1]^{d} and *ω*_{x}(*r*) denote its probability measure:
$$\omega_x(r) = \int_{B_x(r)} \mathrm{d}F. \tag{2.2}$$
Suppose that *F* satisfies a positive density condition (Gruber 2004), in the sense that there exist constants *a*>1 and *ρ*>0 such that *ω*_{x}(*r*) is bounded below by *r*^{d}/*a* for all 0≤*r*≤*ρ*. Suppose also that *F* satisfies a smooth density condition in the sense that its second partial derivatives are bounded at every point of [0,1]^{d}. Under these conditions, it is known (Evans *et al*. 2002) that for all *ϵ*>0, the *α*th moment of the *k*-nearest neighbour distance distribution satisfies (2.3) as *n*→∞, where *v*_{d,p} is the volume of the unit ball in (ℝ^{d}, *l*^{p}) and *f*(*x*)=*F*′(*x*) is the density of the sampling distribution. In particular, *σ*^{δ}_{n}→0 and the kurtosis *κ*^{δ}_{n} remains uniformly bounded as *n*→∞, so by theorem 1.1 it follows that:
$$\frac{\Delta_n - \mu^{\delta}_n}{\sigma^{\delta}_n} \to 0 \quad \text{a.s. as } n \to \infty. \tag{2.4}$$
Thus, *Δ*_{n} is a strongly consistent estimator for the expected *k*-nearest neighbour distance *μ*^{δ}_{n}, and its rate of convergence is of the asymptotic order O(*σ*^{δ}_{n}) as the number of sample points *n*→∞. In fact, taking *α*=1 in (2.3) we have (2.5), so by theorem 1.1, (2.6).
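As a numerical illustration of the statistic in (2.1) (a sketch assuming uniform sampling on [0,1] with *k*=1; an addition for illustration, not part of the paper), the sample mean of the nearest neighbour distances shrinks as the sample grows, consistent with the decay of *μ*^{δ}_{n}:

```python
import random

def kth_nn_distance(points, i, k):
    """delta_{i,n}: distance from points[i] to its k-th nearest neighbour (1D)."""
    dists = sorted(abs(points[i] - points[j])
                   for j in range(len(points)) if j != i)
    return dists[k - 1]

def mean_knn_distance(points, k=1):
    """Delta_n: sample mean of the k-nearest neighbour distances."""
    n = len(points)
    return sum(kth_nn_distance(points, i, k) for i in range(n)) / n

random.seed(1)
delta_50 = mean_knn_distance([random.random() for _ in range(50)])
delta_500 = mean_knn_distance([random.random() for _ in range(500)])
print(delta_50, delta_500)  # roughly 1/(2n) for uniform samples on [0, 1]
```

For uniform sampling in dimension *d* the mean distance decays like *n*^{−1/d}, so the shrinkage is much slower in high dimensions.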

#### (i) The Euclidean *k*-nearest neighbours graph

For any set of vertices {*x*_{1}, …, *x*_{n}} in ℝ^{d}, the (undirected) Euclidean *k*-nearest neighbours graph is constructed by including an edge between every point and its *k*-nearest neighbours in the set, where the nearest neighbour relations are defined with respect to the Euclidean metric (*p*=2). Let *L*_{n}(𝒳) denote the total length of the Euclidean *k*-nearest neighbours graph of the random sample 𝒳_{n}, (2.7). Because the volume of the unit ball in (ℝ^{d}, *l*^{2}) is equal to *v*_{d,2}=*π*^{d/2}/*Γ*(*d*/2+1), then subject to the above conditions on the sampling distribution it follows by (2.6) that (2.8), where (2.9), which agrees with theorem 2(b) of Wade (2007).
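A direct computation of the total length for a toy sample makes the construction explicit (an illustrative sketch; treating shared directed relations as a single undirected edge is the convention assumed here):

```python
def knn_edges(points, k):
    """Undirected edge set of the Euclidean k-nearest neighbours graph."""
    edges = set()
    for i in range(len(points)):
        dists = sorted((sum((a - b) ** 2 for a, b in zip(points[i], points[j])), j)
                       for j in range(len(points)) if j != i)
        for _, j in dists[:k]:
            edges.add(frozenset((i, j)))  # undirected: {i, j} stored once
    return edges

def knn_graph_length(points, k):
    """L_n: total Euclidean length over the undirected edge set."""
    total = 0.0
    for edge in knn_edges(points, k):
        i, j = tuple(edge)
        total += sum((a - b) ** 2 for a, b in zip(points[i], points[j])) ** 0.5
    return total

pts = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
print(knn_graph_length(pts, k=1))  # edges {0,1} and {1,2}: 1 + 2 = 3
```

Note that the mutual nearest-neighbour relation between the first two points contributes only one edge, which is why the undirected total length is not simply *n* times the mean neighbour distance.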

### (b) Noise estimation

Non-parametric regression attempts to model the behaviour of some observable variable *Y*∈ℝ in terms of another observable variable *X*∈ℝ^{d}, based only on a finite number of observations of the joint variable (*X*, *Y*). The relationship between *X* and *Y* is often assumed to satisfy the additive hypothesis
$$Y = \mathbb{E}(Y \mid X) + R, \tag{2.10}$$
where 𝔼(*Y*|*X*) is the regression function and *R* is the residual variable (noise). Let 𝒵 = *Z*_{1}, *Z*_{2}, … be a sequence of independent and identically distributed observations *Z*_{i}=(*X*_{i}, *Y*_{i}) of the joint variable *Z*=(*X*, *Y*), let 𝒳 = *X*_{1}, *X*_{2}, … and 𝒴 = *Y*_{1}, *Y*_{2}, … denote the marginal sequences, and let *ν*_{i}(*n*, *k*) denote the index of the *k*-nearest neighbour of *X*_{i} among the points of the marginal sample 𝒳_{n}. To estimate the *k*th moment 𝔼(*R*^{k}) of the residual distribution (*k*∈ℕ), we consider the random variables *γ*_{i,n}(𝒵) and their sample mean *Γ*_{n}(𝒵), defined by (2.11). Let *μ*^{γ}_{n}, *σ*^{γ}_{n} and *κ*^{γ}_{n}, respectively, denote the mean, standard deviation and kurtosis of the (identically distributed) random variables *γ*_{i,n}(𝒵).

Perhaps the main contribution of this paper is that theorem 1.1 extends to possibly unbounded random variables, the only requirement being that their first four moments are bounded. This allows the residual variable *R* in (2.10) to take arbitrarily large values.

Suppose that the sampling distribution of the explanatory variables *X*_{i} satisfies smooth and positive density conditions, the regression function 𝔼(*Y*|*X*) satisfies a Lipschitz condition on ℝ^{d} and the residual variable *R* is independent of *X* (homoscedastic noise). Under these conditions, Evans & Jones (2008) have shown that for *k*∈ℕ, (2.12), where *μ*^{δ}_{n} is the expected *k*-nearest neighbour distance in the marginal sample (*X*_{1}, …, *X*_{n}) and the implied constant depends on the residual moments up to order *k*−1, the constant implied by the Lipschitz condition on the regression function, and the constant implied by the positive density condition on the sampling distribution.

If the first 4*k* moments of the residual distribution are bounded, it can be shown that the standard deviation *σ*^{γ}_{n} and the kurtosis *κ*^{γ}_{n} remain uniformly bounded as *n*→∞. Hence, by (2.5), (2.12) and theorem 1.1, it follows that:
$$\Gamma_n \to \mathbb{E}(R^k) \quad \text{a.s. as } n \to \infty. \tag{2.13}$$
Thus, the sample mean *Γ*_{n} is a strongly consistent estimator for the *k*th moment 𝔼(*R*^{k}) of the residual distribution as *n*→∞.
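The following sketch illustrates the idea behind such difference-based estimators for the second residual moment, using the simplified statistic *γ*_{i,n}=(*Y*_{i}−*Y*_{ν_i(n,1)})²/2 (a first-difference simplification assumed here for illustration; it is not the exact estimator of Evans & Jones (2008)):

```python
import math
import random

def nn_variance_estimate(xs, ys):
    """Estimate E(R^2) via first nearest neighbour differences:
    gamma_i = (Y_i - Y_{nn(i)})^2 / 2, averaged over the sample.
    Works because Y_i - Y_{nn(i)} ~ R_i - R_{nn(i)} when the
    regression function varies slowly over nearest neighbour distances."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        j = min((abs(xs[i] - xs[j]), j) for j in range(n) if j != i)[1]
        total += (ys[i] - ys[j]) ** 2 / 2.0
    return total / n

random.seed(2)
xs = [random.random() for _ in range(800)]
# Y = m(X) + R with m Lipschitz and R uniform on [-0.5, 0.5]: E(R^2) = 1/12.
ys = [math.sin(2 * math.pi * x) + random.uniform(-0.5, 0.5) for x in xs]
print(nn_variance_estimate(xs, ys))  # close to 1/12 ~ 0.0833
```

No estimate of the regression function itself is needed; the nearest neighbour differencing removes the smooth trend automatically, which is what makes these estimators attractive in practice.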

## 3. The Efron–Stein inequality

Let *X*_{1}, *X*_{2}, … be a sequence of independent and identically distributed random vectors in ℝ^{d}, and let *Z*=*Z*(*x*_{1}, …, *x*_{n}) be a symmetric function of *n* vectors in ℝ^{d}. The Efron–Stein inequality (Efron & Stein 1981) provides an upper bound for the variance of the statistic *Z*(*X*_{1}, …, *X*_{n}) in terms of the quantities
$$Z_i = Z(X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_{n+1}), \qquad Z_{(\cdot)} = \frac{1}{n+1} \sum_{i=1}^{n+1} Z_i. \tag{3.1}$$

**Theorem 3.1** (Efron & Stein 1981)**.** *Let X*_{1}, *X*_{2}, …, *X*_{n+1} *be independent and identically distributed random vectors in* ℝ^{d}*, let Z be a symmetric function of n vectors,* *and define the random variables Z*_{i} *and Z*_{(·)} *as in* (3.1).

*If* 𝔼(*Z*^{2})<∞*, then*
$$\operatorname{var}(Z) \le \sum_{i=1}^{n+1} \mathbb{E}\bigl[(Z_i - Z_{(\cdot)})^2\bigr]. \tag{3.2}$$

The Efron–Stein inequality has found a wide range of applications in statistics. Our approach is partly based on the work of Reitzner (2003), who uses the inequality to prove strong laws of large numbers for statistics related to random polytopes.
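The inequality can be checked quickly by Monte Carlo simulation (an illustrative sketch; the choice *Z* = max of *n* uniform variables is an assumption of the example, not a statistic from the paper):

```python
import random

def efron_stein_sides(n, reps, rng):
    """Monte Carlo estimates of var(Z) and the Efron-Stein bound
    E[sum_i (Z_i - Z_dot)^2], for Z = max of n uniform variables."""
    zs, bounds = [], []
    for _ in range(reps):
        x = [rng.random() for _ in range(n + 1)]
        # Leave-one-out values Z_i: each is the max of the other n points.
        loo = [max(x[:i] + x[i + 1:]) for i in range(n + 1)]
        z_dot = sum(loo) / (n + 1)
        zs.append(max(x[:n]))  # Z computed from a sample of size n
        bounds.append(sum((zi - z_dot) ** 2 for zi in loo))
    mean_z = sum(zs) / reps
    var_z = sum((z - mean_z) ** 2 for z in zs) / reps
    return var_z, sum(bounds) / reps

rng = random.Random(3)
var_z, bound = efron_stein_sides(n=20, reps=4000, rng=rng)
print(var_z, bound)  # the variance estimate stays below the bound
```

For the maximum of 20 uniform variables the true variance is 20/(21² · 22) ≈ 0.0021, while the Efron–Stein bound evaluates to roughly 0.0038, so the inequality holds with room to spare in this case.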

First we note that, because *X*_{1}, …, *X*_{n+1} are independent and identically distributed, it follows by symmetry on the indices that:
$$\operatorname{var}(Z) \le (n+1)\,\mathbb{E}\bigl[(Z_{n+1} - Z_{(\cdot)})^2\bigr]. \tag{3.3}$$
Furthermore, for any *a*∈ℝ, we have that
$$\sum_{i=1}^{n+1} (Z_i - a)^2 = \sum_{i=1}^{n+1} (Z_i - Z_{(\cdot)})^2 + (n+1)(Z_{(\cdot)} - a)^2 \tag{3.4}$$
$$\ge \sum_{i=1}^{n+1} (Z_i - Z_{(\cdot)})^2, \tag{3.5}$$
so inequalities (3.2) and (3.3) are preserved if we replace *Z*_{(·)} by any other function of *X*_{1}, …, *X*_{n+1}. To prove theorem 1.1, we apply the Efron–Stein inequality to the sum *S*_{n}(𝒳), (3.6), (3.7). If the sampling distribution is continuous, then ties among the inter-point distances occur with probability zero. Furthermore, because the points *X*_{1}, …, *X*_{n} are independent and identically distributed, it follows that the distribution of *S*_{n} is invariant under permutations of its arguments. Thus, we may apply the Efron–Stein inequality to *S*_{n}. In this case *Z*_{n+1}=*S*_{n}, so by replacing *Z*_{(·)} by *S*_{n+1} in (3.3), we obtain
$$\operatorname{var}(S_n) \le (n+1)\,\mathbb{E}\bigl[(S_{n+1} - S_n)^2\bigr]. \tag{3.8}$$
An identical expression also holds for the normalized sum *S*^{*}_{n}. Lemma 3.2 below shows that the expected squared difference between successive values of *S*^{*}_{n} is bounded independently of the total number of points *n*. The proof is given in §4.

**Lemma 3.2.** *If the sampling distribution is continuous*, *then for all n*≥16*k,*(3.9)

The following corollary of lemma 3.2 follows immediately by (3.8) and Chebyshev's inequality, and asserts the weak consistency of the sample mean *H*_{n}, defined in (1.2), as an estimator for the true mean *μ*_{n} as *n*→∞ (provided the kurtosis *κ*_{n} is uniformly bounded for all *n*∈ℕ).

**Corollary 3.3.** *If the sampling distribution is continuous*, *then for all n*≥16*k,*(3.10)

We prove lemma 3.2 using methods based on the approach of Bickel & Breiman (1983) and Hitczenko *et al*. (1999). Having established lemma 3.2, the proof of theorem 1.1 is then straightforward.

## 4. Proofs

### (a) Standard results

We need two standard geometric results. The first of these concerns the expected probability measure of *k*-nearest neighbour balls. For *X*_{i}∈𝒳_{n}, let *B*_{i,n}(𝒳) denote the *k*-nearest neighbour ball of *X*_{i} (with respect to the finite sample 𝒳_{n}), defined to be the ball centred at *X*_{i} of radius equal to the distance from *X*_{i} to its *k*-nearest neighbour in 𝒳_{n}, and let *ω*_{i,n}(𝒳) denote its probability measure,
$$\omega_{i,n}(\mathcal{X}) = \int_{B_{i,n}(\mathcal{X})} \mathrm{d}F, \tag{4.1}$$
where *F* is the (common) distribution function of the sample points *X*_{1}, *X*_{2}, …. It is well known, see for example Percus & Martin (1998) or Evans *et al*. (2002), that provided the sampling distribution is continuous, the expected probability measure of any *k*-nearest neighbour ball (over all sample realizations) is equal to *k*/*n*.

**Lemma 4.1.** *If the sampling distribution is continuous,* *then*
$$\mathbb{E}\bigl[\omega_{i,n}(\mathcal{X})\bigr] = \frac{k}{n}. \tag{4.2}$$
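This result is easy to check by simulation (an illustrative sketch assuming *F* uniform on [0,1], so that the probability measure of a ball is the length of its intersection with [0,1]):

```python
import random

def mean_ball_measure(n, k, reps, rng):
    """Monte Carlo estimate of E[omega_{i,n}]: the F-measure of the
    k-nearest neighbour ball, for F uniform on [0, 1]."""
    total, count = 0.0, 0
    for _ in range(reps):
        x = [rng.random() for _ in range(n)]
        for i in range(n):
            r = sorted(abs(x[i] - x[j]) for j in range(n) if j != i)[k - 1]
            # Uniform measure of the ball [x_i - r, x_i + r] clipped to [0, 1].
            total += min(1.0, x[i] + r) - max(0.0, x[i] - r)
            count += 1
    return total / count

m = mean_ball_measure(n=40, k=1, reps=300, rng=random.Random(4))
print(m)  # approximately k/n = 1/40 = 0.025
```

Note that the individual measures *ω*_{i,n} fluctuate widely; only their expectation equals *k*/*n* exactly.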

The second result concerns the maximum degree of any vertex in a *k*-nearest neighbour graph. It is well known that this number is bounded independently of the total number of vertices (e.g. Stone 1977; Bickel & Breiman 1983; Zeger & Gersho 1994; Yukich 1998).

**Lemma 4.2.** *For every countable set of points in* ℝ^{d}*,* *any point can be among the first k*-*nearest neighbours of at most β*=*β*(*k*, *d*, *p*) *other points of the set,* *where the bound on β is expressed in terms of v*_{d,p}*, the volume of the unit ball in* (ℝ^{d}, *l*^{p}).

In particular, *β* is independent of the number of points in the set, although it grows exponentially with the dimension *d* (see §5).

### (b) Adding a sample point

Let *λ*_{n} denote the second moment of the random variables *h*_{i,n}(𝒳),
$$\lambda_n = \mathbb{E}\bigl[h_{i,n}(\mathcal{X})^2\bigr]. \tag{4.3}$$

**Lemma 4.3.** *If the sampling distribution is continuous,*
$$\mathbb{E}\bigl[(h_{i,n+1} - h_{i,n})^2\bigr] \le \frac{2k}{n}(\lambda_n + \lambda_{n+1}). \tag{4.4}$$

Let *B*_{i,n}(𝒳) denote the *k*-nearest neighbour ball of *X*_{i} with respect to the finite sample 𝒳_{n}, and suppose we add another (independent and identically distributed) point *X*_{n+1} to the ensemble. If *X*_{n+1}∉*B*_{i,n}(𝒳) (i.e. if the new point *X*_{n+1} falls outside the current *k*-nearest neighbour ball of *X*_{i}), then *ν*_{i}(*n*+1, *ℓ*)=*ν*_{i}(*n*, *ℓ*) for all 1≤*ℓ*≤*k*, and hence
$$h_{i,n+1}(\mathcal{X}) = h\bigl(X_i, X_{\nu_i(n+1,1)}, \ldots, X_{\nu_i(n+1,k)}\bigr) = h\bigl(X_i, X_{\nu_i(n,1)}, \ldots, X_{\nu_i(n,k)}\bigr) = h_{i,n}(\mathcal{X}). \tag{4.5–4.7}$$
Thus, we have *h*_{i,n}≠*h*_{i,n+1} only if *X*_{n+1}∈*B*_{i,n}(𝒳). Let *J*_{i,n+1} denote the indicator variable of the event *X*_{n+1}∈*B*_{i,n}(𝒳),
$$J_{i,n+1} = \mathbf{1}\{X_{n+1} \in B_{i,n}(\mathcal{X})\}. \tag{4.8}$$
Then
$$(h_{i,n+1} - h_{i,n})^2 = (h_{i,n+1} - h_{i,n})^2 J_{i,n+1}. \tag{4.9}$$
By lemma 4.1, the expected probability measure of *B*_{i,n}(𝒳) is equal to *k*/*n*. Hence, because the new point *X*_{n+1} is assumed to be independent of the previously selected points *X*_{1}, …, *X*_{n}, we have
$$\mathbb{P}(J_{i,n+1} = 1) = \mathbb{E}\bigl[\omega_{i,n}(\mathcal{X})\bigr] = \frac{k}{n}, \tag{4.10}$$
and thus
$$\mathbb{E}\bigl[(h_{i,n+1} - h_{i,n})^2\bigr] = \mathbb{E}\bigl[(h_{i,n+1} - h_{i,n})^2 J_{i,n+1}\bigr]. \tag{4.11}$$
So that we can consider samples of sizes *n* and *n*+1 separately, we apply the crude bound
$$(h_{i,n+1} - h_{i,n})^2 \le 2h_{i,n}^2 + 2h_{i,n+1}^2. \tag{4.12}$$
For samples of size *n*, because the random variable *h*_{i,n} is determined only by the points *X*_{1}, …, *X*_{n}, it follows that *h*_{i,n} is independent of the event *X*_{n+1}∈*B*_{i,n}, and therefore:
$$\mathbb{E}\bigl[h_{i,n}^2 J_{i,n+1}\bigr] = \lambda_n\,\mathbb{P}(J_{i,n+1} = 1) = \frac{k}{n}\lambda_n. \tag{4.13}$$
For samples of size *n*+1, the event *X*_{n+1}∈*B*_{i,n} is simply that the ‘last’ point of the sample *X*_{1}, …, *X*_{n+1} is among the first *k*-nearest neighbours of *X*_{i}, i.e. *X*_{n+1}∈*B*_{i,n} if and only if *ν*_{i}(*n*+1, *ℓ*)=*n*+1 for some 1≤*ℓ*≤*k*. Because the points *X*_{1}, …, *X*_{n+1} are independent and identically distributed, the order in which they are indexed is arbitrary. Hence, it follows by symmetry on the indices *j*≠*i* that the condition cannot affect the expected value of *h*^{2}_{i,n+1}, so
$$\mathbb{E}\bigl[h_{i,n+1}^2 J_{i,n+1}\bigr] = \lambda_{n+1}\,\mathbb{P}(J_{i,n+1} = 1) = \frac{k}{n}\lambda_{n+1}. \tag{4.14}$$
Thus, we have that
$$\mathbb{E}\bigl[(h_{i,n+1} - h_{i,n})^2 J_{i,n+1}\bigr] \le \frac{2k}{n}(\lambda_n + \lambda_{n+1}), \tag{4.15}$$
and the result follows by (4.11). ▪

**Lemma 4.4.** *If the sampling distribution is continuous,* *then for all n*≥16*k,*
$$\lambda_{n+1} \le 3\lambda_n. \tag{4.16}$$

Because the *h*_{i,n} are identically distributed,
$$\lambda_{n+1} = \mathbb{E}\bigl[h_{i,n+1}^2\bigr] \le \mathbb{E}\bigl[2(h_{i,n+1} - h_{i,n})^2 + 2h_{i,n}^2\bigr] = 2\,\mathbb{E}\bigl[(h_{i,n+1} - h_{i,n})^2\bigr] + 2\lambda_n. \tag{4.17–4.20}$$
Hence by lemma 4.3,
$$\lambda_{n+1} \le \frac{4k}{n}(\lambda_n + \lambda_{n+1}) + 2\lambda_n. \tag{4.21}$$
Thus, for *n*≥16*k* we have 4*k*/*n*≤1/4 and hence *λ*_{n+1}≤3*λ*_{n}. ▪

**Lemma 4.5.** *If the sampling distribution is continuous*, *then for all n*≥16*k*,(4.22)

Let *N*_{n+1} be the set containing the index of every point in 𝒳_{n} which has the new point *X*_{n+1} among its first *k*-nearest neighbours,
$$N_{n+1} = \{1 \le i \le n : \nu_i(n+1, \ell) = n+1 \text{ for some } 1 \le \ell \le k\}. \tag{4.23}$$
By construction, if *i*∉*N*_{n+1} then *ν*_{i}(*n*+1, *ℓ*)=*ν*_{i}(*n*, *ℓ*) for all 1≤*ℓ*≤*k*, so *h*_{i,n}≠*h*_{i,n+1} only if *i*∈*N*_{n+1}. Furthermore, by lemma 4.2, the new point *X*_{n+1} can become one of the first *k*-nearest neighbours of at most *β* of the existing points *X*_{1}, …, *X*_{n}, so |*N*_{n+1}|≤*β*. Hence, by the Cauchy inequality,
$$\Bigl(\sum_{i=1}^{n} (h_{i,n+1} - h_{i,n})\Bigr)^2 = \Bigl(\sum_{i \in N_{n+1}} (h_{i,n+1} - h_{i,n})\Bigr)^2 \le \beta \sum_{i \in N_{n+1}} (h_{i,n+1} - h_{i,n})^2, \tag{4.24–4.26}$$
and the result follows by lemmas 4.3 and 4.4. ▪

### (c) An upper bound on var(*S*_{n})

**Lemma 4.6.** *If the sampling distribution is continuous,* *then for all n*≥16*k,* *the expected squared difference* 𝔼[(*S*_{n+1}−*S*_{n})^{2}] *is bounded by a constant multiple of λ*_{n}*, and* var(*S*_{n}) *is bounded by a constant multiple of* (*n*+1)*λ*_{n}.

By definition,
$$S_{n+1} - S_n = h_{n+1,n+1} + \sum_{i=1}^{n} (h_{i,n+1} - h_{i,n}), \tag{4.27}$$
so
$$(S_{n+1} - S_n)^2 \le 2h_{n+1,n+1}^2 + 2\Bigl(\sum_{i=1}^{n} (h_{i,n+1} - h_{i,n})\Bigr)^2. \tag{4.28}$$
Taking expectations then applying lemmas 4.4 and 4.5, (4.29), and by (3.8), (4.30).

▪

### (d) Normalization

To find an upper bound on var(*S*^{*}_{n}), we first need to quantify the squared difference between successive values of *μ*_{n} and *σ*_{n}.

**Lemma 4.7.** *If the sampling distribution is continuous,* *then for all n*≥16*k,* *the squared differences* (*μ*_{n+1}−*μ*_{n})^{2} *and* (*σ*_{n+1}−*σ*_{n})^{2} *are each bounded by a constant multiple of λ*_{n}/*n*.

To prove (i), because the *h*_{i,n} are identically distributed,(4.31)(4.32)(4.33)where the last inequality follows by lemma 4.5.

To prove (ii), let 𝒳′ be an independent copy of 𝒳=(*X*_{1}, *X*_{2}, …), and let *ν*′_{i}(*n*, *k*) denote the index of the *k*-nearest neighbour of *X*′_{i} among the points of 𝒳′_{n}. We consider the random variables,(4.34)Because *h*_{i,n} and *h*′_{i,n} are independent and identically distributed,(4.35)(4.36)(4.37)(4.38)By the Cauchy–Schwarz inequality,(4.39)so(4.40)(4.41)Thus, we have(4.42)(4.43)and because *h*_{i,n+1} and *h*′_{i,n+1} are identically distributed,(4.44)Finally, because the *h*_{i,n} are identically distributed,(4.45)and (ii) follows by lemmas 4.3 and 4.4. ▪

### (e) An upper bound on var(*S*^{*}_{n})

We aim to show that for all *n*≥16*k*,(4.46)First, we write(4.47)(4.48)(4.49)from which it follows that:(4.50)Thus, by lemmas 4.4, 4.6 and 4.7 we obtain(4.51)Finally, because *σ*^{2}_{n+1}≤*λ*_{n+1} and *κ*_{n+1}≥1,(4.52)and the result follows by lemma 4.4. ▪

### (f) Strong convergence

We aim to show that *H*^{*}_{n}→0 a.s. as *n*→∞. Suppose that there exists a finite constant *c*>0 such that *κ*_{n}<*c* for all *n*∈ℕ. Then by corollary 3.3, the probabilities P(|*H*^{*}_{n_m}|>*ϵ*) are summable for the subsequence *n*_{m}=*m*^{2},(4.54)where *ϵ*>0. Hence, by the Borel–Cantelli lemma it follows that *H*^{*}_{n_m}→0 a.s. as *m*→∞. For *n*_{m}≤*n*<*n*_{m+1}, we write(4.55)Let(4.56)Then |*H*^{*}_{n}−*H*^{*}_{n_m}|≤*W*_{m} for all *n*_{m}≤*n*<*n*_{m+1}, so by (4.54) it is sufficient to show that *W*_{m}→0 a.s. as *m*→∞. Writing(4.57)we see that(4.58)Hence by the Cauchy inequality, and using the fact that the sum has exactly 2*m* terms, it follows by lemma 3.2 that:(4.59)(4.60)(4.61)Thus, by Markov's inequality and the Borel–Cantelli lemma, *W*_{m}→0 a.s. as *m*→∞, which concludes the proof of theorem 1.1. ▪

## 5. Conclusion

For a sequence 𝒳 = *X*_{1}, *X*_{2}, … of independent and identically distributed random vectors in ℝ^{d}, we have proved a strong law of large numbers for functions of a point and its nearest neighbours in the sample 𝒳_{n}. We have also used the result to show that certain non-parametric difference-based estimators of residual moments are strongly consistent as the number of points *n*→∞ (provided the distribution of the explanatory vectors satisfies smooth and positive density conditions). Perhaps the most significant advance put forward in this paper is that the random variables need not be bounded, the only requirement being that their first four moments are bounded.

In practical applications, one area of concern is that the bound on var(*H*^{*}_{n}) of theorem 1.2 increases exponentially with the dimension *d* of the underlying space ℝ^{d}, which negatively affects the rate of convergence when *d* is large. The exponential factor *β* is introduced in the proof of lemma 4.5, where we have used the relatively crude bound of lemma 4.2. By contrast, the proof of lemma 4.3 uses the bound of lemma 4.1, which avoids the exponential factor. If we attempt to prove lemma 4.5 along the same lines as those followed in the proof of lemma 4.3, we arrive at the expression(5.1)where *J*_{ij}(𝒳) is the indicator variable of the event that the *k*-nearest neighbour balls *B*_{i,n}(𝒳) and *B*_{j,n}(𝒳) of the points *X*_{i} and *X*_{j} in the sample 𝒳_{n} intersect. At this point, we would like to show that the expectation of *J*_{ij}(𝒳) is at most equal to *c*(*d*)/*n*, where *c*(*d*) does not increase exponentially with *d*. However, Dwyer (1995) showed that if *k*=1 and *d*≥7, then for *n*≫*d* the expected number of edges *n*_{e} in the random sphere of influence graph, or equivalently the expected number of pairs (*X*_{i}, *X*_{j}) whose nearest neighbour balls intersect, satisfies(5.2)This suggests that the curse of dimensionality will not be easily lifted.

## Acknowledgments

The author would like to thank the Royal Society for supporting his research through the University Research Fellowship scheme.

## Footnotes

- Received June 5, 2008.
- Accepted July 11, 2008.

This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

- Copyright © 2008 The Royal Society