## Abstract

The aim of non-parametric regression is to model the behaviour of a response vector *Y* in terms of an explanatory vector *X*, based only on a finite set of empirical observations. This is usually performed under the additive hypothesis *Y*=*f*(*X*)+*R*, where *f*(*X*)=𝔼(*Y*|*X*) is the true regression function and *R* is the true residual variable. Subject to a Lipschitz condition on *f*, we propose new estimators for the moments (scalar response) and covariance (vector response) of the residual distribution, derive their asymptotic properties and discuss their application in practical data analysis.

## 1. Introduction

In the analysis of complex systems, it is often the case that we have no prior information regarding the system under investigation, and the analysis must proceed based only on empirical observations of the system behaviour. Non-parametric regression methods attempt to model the behaviour of an observable random vector *Y*∈ℝ^{n} in terms of another observable random vector *X*∈ℝ^{m}, based only on a finite set of observations.

The relationship between the response vector *Y* and the explanatory vector *X* is often assumed to satisfy the additive hypothesis

*Y* = *f*(*X*) + *R*,  (1.1)

where *f*(*X*)=𝔼(*Y*|*X*) is the true regression function and *R* is the true residual vector, which we call *noise*.

For a scalar response (*n*=1) we present new non-parametric estimators for the moments 𝔼(*R*^{q}) of the noise distribution (*q*=2, 3, 4, …), and for a vector response (*n*>1) we describe a new estimator for the covariance matrix 𝔼(*RR*^{T}). Our estimators are computed from a finite sample of independent and identically distributed observations of the joint variable (*X*, *Y*) by exploiting the nearest neighbour structure of the explanatory observations. Using techniques first described in Bentley (1975), our estimates can be computed with asymptotic time complexity of order 𝒪(*N* log *N*) as the number of points *N*→∞.
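To illustrate the preprocessing cost, the following sketch finds the *q* nearest neighbours of every point in the scalar case *m*=1 by sorting and then scanning outward from each point, which already achieves the 𝒪(*N* log *N*) bound; for *m*>1 one would substitute a *k*-d tree (Bentley 1975). The function name and interface are our own, not taken from the paper.

```python
def nearest_neighbours_1d(xs, q):
    """Return, for each point x_i, the indices of its q nearest
    neighbours (scalar case m = 1, q < len(xs)).

    Sorting costs O(N log N); each query then scans outward from the
    point's sorted position, so total work is O(N (log N + q))."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    sx = [xs[i] for i in order]
    result = [None] * len(xs)
    for pos, i in enumerate(order):
        lo, hi = pos - 1, pos + 1
        neigh = []
        while len(neigh) < q:
            if lo < 0:                                  # no points to the left
                neigh.append(order[hi]); hi += 1
            elif hi >= len(sx):                         # no points to the right
                neigh.append(order[lo]); lo -= 1
            elif sx[pos] - sx[lo] <= sx[hi] - sx[pos]:  # left point is closer
                neigh.append(order[lo]); lo -= 1
            else:
                neigh.append(order[hi]); hi += 1
        result[i] = neigh
    return result
```
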

Our main assumptions are that the true regression function 𝔼(*Y*|*X*) is a smooth function of the explanatory vector *X*, and that the zero-mean noise vector *R* is independent of the explanatory vector (homoscedastic). The noise vector represents all variations in the response vector that are not accounted for by any smooth transformation of the explanatory vector, and includes what might be called deterministic variation due to non-random factors that influence the response vector but which are not included in the explanatory vector, as well as stochastic variation due to purely random factors such as measurement error.

Our estimators belong to the class of difference-based methods. These date back to von Neumann (1941), who, for the case *m*=1 and under the assumption of homoscedastic noise, proposed that the residual variance can be estimated by the average of squared successive differences of the observations,

(1/(2(*N*−1))) Σ_{i=2}^{N} (*Y*_{i}−*Y*_{i−1})^{2}.  (1.2)

This estimator was investigated by Rice (1984) and subsequently extended by Gasser *et al*. (1986) and Hall *et al*. (1990) by using interpolation to reduce the bias of the so-called ‘pseudo-residual’ *Y*_{i}−*Y*_{i−1}. For the case *m*=1, Levine (2006) and Brown & Levine (2007) considered heteroscedastic noise and proposed difference-based estimators for the residual variance function 𝔼(*R*^{2}|*X*) using weighted local averages of the squared pseudo-residuals. A related paper by Wang *et al*. (2008) investigated the effect that the mean function *f*(*X*)=𝔼(*Y*|*X*) has on estimators of the variance function. For the case *m*≥1, Munk *et al*. (2005) showed that the bias of simple difference-based estimators such as (1.2) has asymptotic order 𝒪(*N*^{−1/m}) as *N*→∞, so that such estimators are not √*N*-consistent for *m*≥4. They also describe estimators that allow better control of the bias and remain consistent when *m*≥4. The case *m*≥1 is also discussed in Evans & Jones (2002) and Bock *et al*. (2007), where the assumption that the explanatory vectors lie on a rectangular grid is relaxed by considering their nearest neighbour structure. For the case *m*≥1 and under the assumption of homoscedastic noise, Durrant (2001) presented difference-based estimators for the higher moments 𝔼(*R*^{q}) of symmetric noise distributions. To the best of our knowledge, this is the only previous attempt at estimating the higher moments of residual distributions.
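For concreteness, von Neumann's estimator (1.2) can be sketched in a few lines; this assumes a scalar response whose observations are already ordered by their explanatory values, and the function name is illustrative rather than from the paper.

```python
def von_neumann_variance(y):
    """Estimate var(R) by the mean squared successive difference (1.2).

    Assumes the observations y_1, ..., y_N are ordered by their
    explanatory values, so successive responses are near neighbours."""
    n = len(y)
    return sum((y[i] - y[i - 1]) ** 2 for i in range(1, n)) / (2.0 * (n - 1))
```
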

### (a) Notation and conditions

Let (*X*_{1}, *Y*_{1}), …, (*X*_{N}, *Y*_{N}) be a sample of independent and identically distributed observations of the joint variable (*X*, *Y*). To establish asymptotic bounds on the bias and variance of our estimators as the number of points *N*→∞, we impose some fairly strong conditions on the distribution of the explanatory observations *X*_{1}, …, *X*_{N}. To this end, let *Φ* : ℝ^{m}→[0, 1] be a probability distribution whose density *ϕ*(*x*)=*Φ*′(*x*) exists at every point *x*∈ℝ^{m}.

A probability distribution *Φ* satisfies a smooth density condition if its density has bounded partial derivatives at every point *x*∈ℝ^{m}.

For *x*∈ℝ^{m} let *B*_{x}(*s*) denote the ball of radius *s* centred at *x*, and *ω*_{x}(*s*) denote its probability measure,

*ω*_{x}(*s*) = ∫_{*B*_{x}(*s*)} *ϕ*(*u*) d*u*.  (1.3)

A probability distribution *Φ* satisfies a positive density condition if there exist constants *a*>1 and *δ*>0 such that

*a*^{−1}*s*^{m} ≤ *ω*_{x}(*s*) ≤ *a* *s*^{m} for all *x*∈*S*_{ϕ} and all 0<*s*≤*δ*.  (1.4)

The positive density condition (Gruber 2004) ensures that the probability measure of an arbitrary ball can be bounded in terms of its radius. If a distribution satisfies a smooth density condition, Evans *et al*. (2002) showed that the distribution also satisfies a positive density condition provided the support *S*_{ϕ} of the density is a compact convex body in ℝ^{m}, where

*S*_{ϕ} = cl{*x*∈ℝ^{m} : *ϕ*(*x*)>0}.  (1.5)

That a distribution satisfies the positive density condition essentially means that its density is well behaved in small neighbourhoods. If the distribution also satisfies a smooth density condition, then its density is approximately uniform in small neighbourhoods. The left-hand inequality in (1.4) ensures that the density is zero outside some bounded region. The positive density condition also ensures that the probability measure of small neighbourhoods scales with their Lebesgue measure, and thus excludes the case where the support of the density is a fractal set.

We state the following conditions.

(i) The true regression function *f*(*X*)=𝔼(*Y*|*X*) is Lipschitz continuous. Let *b* denote the associated Lipschitz constant, so that ‖*f*(*x*)−*f*(*x*′)‖ ≤ *b*‖*x*−*x*′‖ for all *x*, *x*′∈ℝ^{m}.

(ii) The explanatory observations *X*_{1}, …, *X*_{N} are independent and identically distributed random vectors in ℝ^{m}, and their common distribution *Φ* satisfies smooth and positive density conditions. Furthermore, the support *S*_{ϕ} of the associated density is a compact subset of ℝ^{m} such that ‖*x*−*x*′‖ ≤ 1 for all *x*, *x*′∈*S*_{ϕ}.

(iii) The (implicit) noise observations *R*_{1}, …, *R*_{N} are independent and identically distributed random vectors, and to estimate 𝔼(*R*^{q}) their common distribution *Ψ* must have bounded moments up to order 2*q*∈ℕ.

(iv) For every observation pair (*X*_{i}, *Y*_{i}), the explanatory vector *X*_{i} and the (implicit) noise vector *R*_{i} are independent (homoscedastic).

We call *Φ* the sampling distribution, and *Ψ* the noise distribution or the true residual distribution. The condition that ‖*x*−*x*′‖ ≤ 1 for all *x*, *x*′∈*S*_{ϕ} is imposed to ensure that ‖*x*−*x*′‖^{t} ≤ ‖*x*−*x*′‖ for all *t*≥1. In practice, this condition can be enforced by an appropriate scaling of the explanatory observations *X*_{1}, …, *X*_{N}, which might lead to a modification of the Lipschitz constant *b* defined in condition (i).

### (b) Optimal estimators

A number of algorithms for non-parametric regression exist in the literature, most of which require an estimate of the noise variance 𝔼(*R*^{2}) to control the amount of smoothing applied to the data. Too little smoothing means that noise is incorporated into the estimate, while too much smoothing leads to certain characteristics of the underlying regression function being lost. An estimate of 𝔼(*R*^{2}) can help to identify suitable bandwidths for kernel regression (Silverman 1986), second-derivative penalties for spline smoothing (Eubank 1999), stopping criteria for neural network training (Jones 2004) and thresholds for wavelet smoothing (Donoho & Johnstone 1995).

To evaluate an estimator *f̂* of the true regression function *f*(*X*), we consider the associated error function,

*e*(*X*) = *f̂*(*X*) − *f*(*X*).  (1.6)

In practice, *e*(*X*) is unknown and instead we must consider the *empirical* error function,

*ê*(*X*) = *Y* − *f̂*(*X*).  (1.7)

Since the explanatory vector *X* and the noise vector *R* are assumed to be independent, it follows by (1.1) that the empirical mean-squared error satisfies

𝔼(*ê*(*X*)^{2}|*X*) = *e*(*X*)^{2} + 𝔼(*R*^{2}).  (1.8)

Hence if 𝔼(*ê*(*X*)^{2}|*X*) is close to the noise variance 𝔼(*R*^{2}), the error function must be close to zero at *X*, and we might therefore define an optimal estimator to be one for which 𝔼(*ê*(*X*)^{2}|*X*) = 𝔼(*R*^{2}) for all *X*∈*S*_{ϕ}. In general, the *q*th moment of the empirical error function satisfies

𝔼(*ê*(*X*)^{q}|*X*) = Σ_{t=0}^{q} (*q* choose *t*) (−*e*(*X*))^{t} 𝔼(*R*^{q−t}),  (1.9)

and we might instead define an optimal estimator to be one for which 𝔼(*ê*(*X*)^{q}|*X*) = 𝔼(*R*^{q}) for all *q*=1, 2, … and *X*∈*S*_{ϕ}. Thus, we are motivated to estimate the moments of the noise distribution.

## 2. Products of differences

Let 𝒳=(*X*_{1}, …, *X*_{N}) and 𝒴=(*Y*_{1}, …, *Y*_{N}) denote the marginal samples. Under the additive hypothesis (1.1), the noise variance 𝔼(*R*^{2}) can be estimated by the mean-squared difference between pairs of response vectors *Y* and *Y*′ for which the corresponding explanatory vectors *X* and *X*′ are nearest neighbours in the marginal sample 𝒳. By (1.1), since *R* and *R*′ are independent and identically distributed zero-mean random variables, it follows that

𝔼((1/2)(*Y*−*Y*′)^{2}) = 𝔼(*R*^{2}) + (1/2)𝔼((*f*(*X*)−*f*(*X*′))^{2}),  (2.1)

where the second term on the r.h.s. represents the bias of (1/2)(*Y*−*Y*′)^{2} as an estimator for 𝔼(*R*^{2}). Rather than using squared differences, we propose an alternative estimator for 𝔼(*R*^{2}) based on *products* of differences (*Y*−*Y*′)(*Y*−*Y*″), where *Y*′ and *Y*″ are the response vectors for which the corresponding explanatory vectors *X*′ and *X*″ are, respectively, the first and second nearest neighbours of *X*. By (1.1), since *R*, *R*′ and *R*″ are independent and identically distributed zero-mean random variables, we have that

𝔼((*Y*−*Y*′)(*Y*−*Y*″)) = 𝔼(*R*^{2}) + 𝔼((*f*(*X*)−*f*(*X*′))(*f*(*X*)−*f*(*X*″))),  (2.2)

where the second term on the r.h.s. represents the bias of (*Y*−*Y*′)(*Y*−*Y*″) as an estimator for 𝔼(*R*^{2}). By the Lipschitz condition on *f*,

|𝔼((*f*(*X*)−*f*(*X*′))(*f*(*X*)−*f*(*X*″)))| ≤ *b*^{2}𝔼(‖*X*−*X*′‖ ‖*X*−*X*″‖).  (2.3)

Estimators for the higher order moments 𝔼(*R*^{q}) can be similarly constructed using *q*-fold products of differences over the first *q* nearest neighbours of *X*. Let *i*(*k*) denote the index of the *k*th nearest neighbour of *X*_{i} among the points *X*_{1}, …, *X*_{N}. We propose the following sample mean estimator for 𝔼(*R*^{q}):

*Γ*_{N}(𝒳, 𝒴) = (1/*N*) Σ_{i=1}^{N} *γ*_{i}(𝒳, 𝒴), where *γ*_{i}(𝒳, 𝒴) = Π_{k=1}^{q} (*Y*_{i}−*Y*_{i(k)}).  (2.4)
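A minimal sketch of the estimator (2.4) for the scalar case *m*=1 might look as follows; neighbours are found by brute force for clarity (𝒪(*N*^{2}) overall, rather than the 𝒪(*N* log *N*) achievable with a *k*-d tree), and all names are our own.

```python
import random

def moment_estimate(x, y, q):
    """Sample-mean estimator (2.4) for E(R^q), scalar case m = 1.

    For each i, form the q-fold product of differences between y_i and
    the responses of the q nearest neighbours of x_i, then average."""
    N = len(x)
    total = 0.0
    for i in range(N):
        # indices of the q nearest neighbours of x_i (excluding i itself)
        idx = sorted((j for j in range(N) if j != i),
                     key=lambda j: abs(x[i] - x[j]))[:q]
        prod = 1.0
        for k in idx:               # q-fold product of differences
            prod *= y[i] - y[k]
        total += prod
    return total / N

# illustrative check: pure noise (f = 0), so the second-moment estimate
# should be close to var(R) = 1
random.seed(0)
xs = [random.random() for _ in range(800)]
ys = [random.gauss(0.0, 1.0) for _ in range(800)]
est = moment_estimate(xs, ys, 2)
```
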

To compute the *q*th moment, we need not restrict ourselves to the first *q* nearest neighbours of *X*_{i}, since any set of *q* points in the neighbourhood of *X* can be used to compute a *q*-fold product of differences and hence an estimate for (*R*^{q}). Although using more distant neighbours is likely to increase the bias of the estimate, these products might be useful to reduce statistical error when the number of observations is small. We can also form products of differences over a set of neighbours that are not necessarily distinct. In this case, some differences are repeated in the product, and the method returns an estimate for a known polynomial function of the moments up to order *q*. The precise form of the polynomial depends on how many differences are repeated, and how many times.

### (a) Covariance

For the case where the response is a vector *Y*∈ℝ^{n}, since the noise vector has mean zero, it follows by (1.1) that

𝔼(*RR*^{T}) = 𝔼((*Y*−*f*(*X*))(*Y*−*f*(*X*))^{T}).  (2.5)

To estimate 𝔼(*RR*^{T}), let (*X*′, *Y*′) and (*X*″, *Y*″) denote the sample points for which *X*′ and *X*″ are, respectively, the first and second nearest neighbours of the point *X* in the marginal sample 𝒳. As above, by the Lipschitz condition on *f*, the absolute differences ‖*f*(*X*)−*f*(*X*′)‖ and ‖*f*(*X*)−*f*(*X*″)‖ are small provided the corresponding distances ‖*X*−*X*′‖ and ‖*X*−*X*″‖ are also small. Thus, because the noise components *R*, *R*′ and *R*″ are independent and identically distributed zero-mean random vectors, we have

𝔼((*Y*−*Y*′)(*Y*−*Y*″)^{T}) ≈ 𝔼(*RR*^{T}),  (2.6)

and we propose the following sample mean estimator for 𝔼(*RR*^{T}):

*C*_{N}(𝒳, 𝒴) = (1/*N*) Σ_{i=1}^{N} (*Y*_{i}−*Y*_{i(1)})(*Y*_{i}−*Y*_{i(2)})^{T}.  (2.7)
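The covariance estimator (2.7) admits a similarly direct sketch; this illustrative implementation uses brute-force neighbour search and plain lists, and the function name is our own.

```python
def covariance_estimate(xs, ys):
    """Estimate E(R R^T) via (2.7): average of the outer products
    (y_i - y_i(1)) (y_i - y_i(2))^T over all sample points.

    xs: list of length-m explanatory tuples; ys: list of length-n
    response tuples.  Neighbour search is brute force for clarity."""
    N, n = len(xs), len(ys[0])
    C = [[0.0] * n for _ in range(n)]
    for i in range(N):
        # sort other points by squared Euclidean distance to x_i
        d = sorted((j for j in range(N) if j != i),
                   key=lambda j: sum((a - b) ** 2 for a, b in zip(xs[i], xs[j])))
        j1, j2 = d[0], d[1]                      # first and second neighbours
        u = [ys[i][a] - ys[j1][a] for a in range(n)]
        v = [ys[i][a] - ys[j2][a] for a in range(n)]
        for a in range(n):
            for b in range(n):
                C[a][b] += u[a] * v[b] / N
    return C
```
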

## 3. Asymptotic results

We now proceed to establish an asymptotic upper bound on the bias of our moment estimator *Γ*_{N}, and verify its probabilistic convergence as the number of observations *N*→∞. To this end, we first consider the mean distance from a point to its *q*th nearest neighbour in the marginal sample 𝒳=(*X*_{1}, …, *X*_{N}), which we denote by

*Δ*_{N} = (1/*N*) Σ_{i=1}^{N} 𝔼(*δ*_{i}), where *δ*_{i} = ‖*X*_{i}−*X*_{i(q)}‖.  (3.1)

An almost identical argument to one found in Evans *et al*. (2002) leads to the following result, which we state without proof.

**Lemma 3.1.** *Subject to condition* (ii),

𝔼(*δ*_{i}) = 𝒪(*N*^{−1/m}) as *N*→∞,  (3.2)

*where the implied constant involves a, the constant implied by the positive density condition on the sampling distribution, and Γ, the Euler Gamma function*.

Theorem 3.2 shows that *Γ*_{N} is an asymptotically unbiased estimator for 𝔼(*R*^{q}) as the number of points *N*→∞.

**Theorem 3.2.** *Subject to conditions* (i)–(iv),

|𝔼(*Γ*_{N}) − 𝔼(*R*^{q})| ≤ *c* *Δ*_{N} = 𝒪(*N*^{−1/m}) as *N*→∞,  (3.3)

*where*

*c* = Σ_{t=1}^{q} (*q* choose *t*) *b*^{t} |𝔼(*R*^{q−t})|,  (3.4)

*where b is the constant implied by the Lipschitz condition on the regression function; the implied asymptotic constant also involves a, the constant implied by the positive density condition on the sampling distribution, and Γ, the Euler Gamma function*.

Let *A*={1, …, *q*} and *A*(*t*) denote the set of all subsets *B*⊂*A* containing exactly *t* elements. Applying the identity

Π_{k∈A} (*u*_{k}+*v*_{k}) = Σ_{t=0}^{q} Σ_{B∈A(t)} (Π_{k∈B} *v*_{k}) (Π_{k∈A\B} *u*_{k})  (3.5)

to *γ*_{i} = Π_{k=1}^{q} ((*R*_{i}−*R*_{i(k)}) + (*f*(*X*_{i})−*f*(*X*_{i(k)}))), we obtain

*γ*_{i}(𝒳, 𝒴) = Σ_{t=0}^{q} Σ_{B∈A(t)} Π_{k∈B} (*f*(*X*_{i})−*f*(*X*_{i(k)})) Π_{k∈A\B} (*R*_{i}−*R*_{i(k)}).  (3.6)

Hence the conditional expectation of *γ*_{i}(𝒳, 𝒴) given a fixed realization of the explanatory sample 𝒳 satisfies

𝔼(*γ*_{i}|𝒳) = Σ_{t=0}^{q} Σ_{B∈A(t)} Π_{k∈B} (*f*(*X*_{i})−*f*(*X*_{i(k)})) 𝔼(Π_{k∈A\B} (*R*_{i}−*R*_{i(k)})).  (3.7)

Expanding the product Π_{k∈A\B} (*R*_{i}−*R*_{i(k)}), since *A*\*B* contains *q*−*t* distinct integers, the first term of the expansion is *R*_{i}^{q−t}, while the remaining terms are products in which at least one of the *R*_{i(k)} occurs exactly once. Since the noise variables *R*_{j} are independent and identically distributed with 𝔼(*R*_{j})=0, and because they are also independent of the sample 𝒳, the expected value of the first term is equal to 𝔼(*R*^{q−t}) and the expected value of each remaining term is zero. Thus it follows that

𝔼(Π_{k∈A\B} (*R*_{i}−*R*_{i(k)})) = 𝔼(*R*^{q−t})  (3.8)

and

𝔼(*γ*_{i}|𝒳) = Σ_{t=0}^{q} 𝔼(*R*^{q−t}) Σ_{B∈A(t)} Π_{k∈B} (*f*(*X*_{i})−*f*(*X*_{i(k)})).  (3.9)

Hence the bias of *Γ*_{N} is given by

𝔼(*Γ*_{N}) − 𝔼(*R*^{q}) = Σ_{t=1}^{q} 𝔼(*R*^{q−t}) *β*_{t},  (3.10)

where

*β*_{t} = (1/*N*) Σ_{i=1}^{N} 𝔼(Σ_{B∈A(t)} Π_{k∈B} (*f*(*X*_{i})−*f*(*X*_{i(k)}))).  (3.11)

For *t*≥1,

|Π_{k∈B} (*f*(*X*_{i})−*f*(*X*_{i(k)}))| ≤ *b*^{t} Π_{k∈B} ‖*X*_{i}−*X*_{i(k)}‖ ≤ *b*^{t} *δ*_{i}^{t} ≤ *b*^{t} *δ*_{i},  (3.12)

where the first inequality follows by the Lipschitz condition on *f*(*x*) and the last by the assumption that ‖*x*−*x*′‖ ≤ 1 for all *x*, *x*′∈*S*_{ϕ}. Hence for *B*∈*A*(*t*) with *t*≥1, it follows that

𝔼(|Π_{k∈B} (*f*(*X*_{i})−*f*(*X*_{i(k)}))|) ≤ *b*^{t} 𝔼(*δ*_{i}).  (3.13)

Thus, because *A*(*t*) contains (*q* choose *t*) elements, it follows by (3.10) and (3.13) that

|𝔼(*Γ*_{N}) − 𝔼(*R*^{q})| ≤ Σ_{t=1}^{q} |𝔼(*R*^{q−t})| (*q* choose *t*) *b*^{t} (1/*N*) Σ_{i=1}^{N} 𝔼(*δ*_{i}) = *c* *Δ*_{N}.  (3.14)

Hence by lemma 3.1, we conclude that

|𝔼(*Γ*_{N}) − 𝔼(*R*^{q})| = 𝒪(*N*^{−1/m}) as *N*→∞,  (3.15)

where

*c* = Σ_{t=1}^{q} (*q* choose *t*) *b*^{t} |𝔼(*R*^{q−t})|,  (3.16)

as required. ▪

In another paper (Evans 2008), we proved the following result.

**Theorem 3.3.** *Subject to conditions* (i)–(iv),

var(*Γ*_{N}) = 𝒪(var(*γ*_{i}) *N*^{−1}) as *N*→∞.  (3.17)

If the first 2*q* moments of the noise distribution are absolutely bounded, var(*γ*_{i}) is also bounded. Chebyshev's inequality then yields the following corollary of theorems 3.2 and 3.3, which shows that *Γ*_{N} is a weakly consistent estimator for 𝔼(*R*^{q}) as the number of observations *N*→∞.

*Subject to conditions* (i)–(iv) *and the additional requirement that* |(*R*^{t})|<∞ *for all t*=1, …, 2*q*,(3.18)

The first and second error terms in (3.18) correspond to the bias and variance of *Γ*_{N}, respectively. Theorem 3.2 shows that the bias of *Γ*_{N} is dominated by the expected mean nearest neighbour distance *Δ*_{N}, which by lemma 3.1 is of asymptotic order 𝒪(*N*^{−1/m}) as *N*→∞. Hence the rate at which the bias of *Γ*_{N} converges to zero decreases as the dimension of the explanatory vectors increases. By contrast, the rate at which the second error term in (3.18) converges (in probability) to zero does not depend on the dimension, but only on the fact that var(*γ*_{i}) is bounded. An improved rate of convergence is obtained if there exists some *α*>0 such that var(*γ*_{i})=𝒪(*N*^{−α}) as *N*→∞, in which case the variance term of *Γ*_{N} will be of asymptotic order 𝒪_{P}(*N*^{−1/2−α/2}) as *N*→∞. However, subject to conditions (i)–(iv), it is easy to show that

lim_{N→∞} var(*γ*_{i}) > 0,  (3.19)

so we cannot improve on the rate at which the variance of *Γ*_{N} converges to zero. Instead of using (2.4), it is interesting to speculate whether *γ*_{i} can be defined to ensure that both 𝔼(*γ*_{i})→𝔼(*R*^{q}) and var(*γ*_{i})→0 as *N*→∞. In particular, if var(*γ*_{i}) can be bounded by some constant multiple of the expected nearest neighbour distance 𝔼(*δ*_{i}), it follows by lemma 3.1 that the variance term of *Γ*_{N} will be of asymptotic order 𝒪_{P}(*N*^{−1/2−1/(2m)}) as *N*→∞, which would ensure the √*N*-consistency of *Γ*_{N} for all *m*≥1 (Munk *et al*. 2005).

## 4. Practical considerations

In practical non-parametric data analysis, the implied constants in the asymptotic expression (3.18) are unknown, and thus we are not able to establish analytic confidence intervals on an estimate computed by *Γ*_{N}.

Let (𝐱, 𝐲) be a realization of the random sample (𝒳, 𝒴), with 𝐱=(*x*_{1}, …, *x*_{N}) and 𝐲=(*y*_{1}, …, *y*_{N}). For the estimate *Γ*_{N}(𝐱, 𝐲), we refer to the errors incurred due to the bias and variance of *Γ*_{N} as the *systematic error* and the *statistical error*, respectively.

The systematic error associated with *Γ*_{N}(𝐱, 𝐲) is due to the variability of the regression function *f*(*x*)=𝔼(*Y*|*X*=*x*) over the *q*-neighbourhood ball of each sample point (defined to be the ball centred at *x*_{i} and having the *q*th nearest neighbour *x*_{i(q)} of *x*_{i} on its boundary). By the Lipschitz condition on *f*, the extent to which *f*(*x*) varies over the *q*-neighbourhood ball of *x*_{i} can be bounded in terms of its radius *δ*_{i} = ‖*x*_{i}−*x*_{i(q)}‖. Hence for a particular realization of the random sample (𝒳, 𝒴), to reduce the systematic error of our estimate it makes sense to consider only those *γ*_{i} for which the associated value of *δ*_{i} is small.

Thus, we reorder the sequence *γ*_{1}, …, *γ*_{N} to be increasing with respect to *δ*_{i}, and define a sequence of estimates

*Γ*_{M} = (1/*M*) Σ_{i=1}^{M} *γ*_{i},  *M*=1, …, *N*.  (4.1)
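The reordering in (4.1) amounts to sorting the products *γ*_{i} by their neighbourhood radii and taking running means; a hypothetical helper (names our own) might read:

```python
def estimate_sequence(gammas, radii):
    """Sequence (4.1): reorder the gamma_i by increasing q-neighbourhood
    radius delta_i, then return the running means Gamma_M, M = 1..N,
    which would be plotted against M for visual inspection."""
    pairs = sorted(zip(radii, gammas))      # increasing radius
    seq, total = [], 0.0
    for M, (_, g) in enumerate(pairs, start=1):
        total += g
        seq.append(total / M)
    return seq
```
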

The question now becomes whether or not *M* can be chosen so that the total error is as small as possible. In the spirit of exploratory statistics, we plot the sequence *Γ*_{M} against *M* and visually assess the behaviour of *Γ*_{M} as *M*→*N*. If *Γ*_{M} is seen to approach a stable value as *M* increases, we conclude that there are sufficient data to ensure that the total error is small, and proceed to estimate the moment 𝔼(*R*^{q}) by inspection.

To reduce systematic error, it may be that the (*q*+1)-neighbourhood ball of a point *x*_{i} is significantly smaller than the *q*-neighbourhood ball of another point *x*_{j}, in which case we might expect that a *q*-fold product of differences formed over *q* of the *q*+1 nearest neighbours of *x*_{i} will incur less systematic error than *γ*_{j}. However, replacing *γ*_{j} by such a product does not necessarily lead to a better estimate, because the empirical noise realizations corresponding to *γ*_{i} and the new product are estimated by almost identical products of noise differences. Consequently, the estimator suffers a loss of information regarding the noise distribution, and statistical error increases accordingly. Systematic error might also be reduced by attempting to estimate the limit of *Γ*_{N} as *Δ*_{N}→0 (Evans & Jones 2002; Tong & Wang 2005). Thus for *k*=1, …, *p* we might define

*γ*_{i}(*k*) = Π_{j=k}^{q+k−1} (*y*_{i}−*y*_{i(j)}) and *δ*_{i}(*k*) = ‖*x*_{i}−*x*_{i(q+k−1)}‖,  (4.2)

along with the associated sample means *Γ*_{N}(*k*) and *Δ*_{N}(*k*). The limit of the *Γ*_{N}(*k*) as *Δ*_{N}(*k*)→0 can then be estimated using simple linear regression on the pairs (*Δ*_{N}(*k*), *Γ*_{N}(*k*)), which yields the intercept estimator(4.3)

The success of this approach depends on the nearest neighbour distances being sufficiently small to ensure that the systematic error scales approximately linearly with *Δ*_{N}(*k*). Empirical results indicate that the estimator (4.3) does not offer any significant advantages over *Γ*_{N}.

### (a) Covariance

Turning our attention to the covariance estimator (2.7), for a particular sample realization (𝐱, 𝐲) we reorder the summands to be increasing with respect to *δ*_{i} = ‖*x*_{i}−*x*_{i(2)}‖, and define a sequence of estimates

*C*_{M} = (1/*M*) Σ_{i=1}^{M} (*y*_{i}−*y*_{i(1)})(*y*_{i}−*y*_{i(2)})^{T},  *M*=1, …, *N*.  (4.4)

For each pair of coordinates 1≤*α*, *β*≤*n*, we plot the corresponding component sequence of (4.4) against *M*, then visually assess its behaviour as *M*→*N*. If the curve appears to approach a stable value as *M*→*N*, we proceed to estimate the corresponding component of 𝔼(*RR*^{T}) by inspection.

## 5. Experimental results

### (a) Moment estimation

To investigate our moment estimators, we generate samples of *N*=1000 observation pairs (*x*_{i}, *y*_{i})∈ℝ×ℝ, where

- the regression function is defined to be *f*(*x*)=20(*x*−1/2)^{2};
- the explanatory samples *x*_{i} are selected independently according to the uniform distribution over the unit interval [0, 1];
- the residual samples *r*_{i} are selected independently according to (i) the standard Gaussian distribution *N*(0, 1) and (ii) the Gamma distribution, normalized to have mean zero and variance 0.25; and
- the response samples *y*_{i} are set to equal *f*(*x*_{i})+*r*_{i}.

Figure 1*a* shows a single sample realization for the Gaussian noise experiment and figure 1*b* shows a single realization for the Gamma noise experiment, while table 1 shows the estimated moments and the empirical moments for the sample realizations shown in figure 1.

In figure 2, we plot the estimate sequences *Γ*_{M} for the sample realization of the Gaussian noise experiment illustrated in figure 1, with the empirical moments shown by green lines (*q*=1, 2, 3, 4). It is evident from the figures that the final values are likely to provide reasonable estimates for the corresponding moments of the noise distribution. Similarly, in figure 3 we plot the estimate sequences for the sample realization of the Gamma noise experiment illustrated in figure 1, with the empirical moments shown by green lines (*q*=1, 2, 3, 4). In this case, the plots are more ‘jagged’ than those for the Gaussian noise experiment shown in figure 2. This is due to the heavy tail of the Gamma distribution, which produces outliers that can be seen in figure 1. These jagged features become increasingly apparent as *q* increases, illustrating the well-known fact that sample estimates of higher order moments are negatively affected by outliers. In spite of this, it appears that the curves stabilize as *M*→*N*, so that the final values can be adopted as reasonable estimates for the corresponding moments of the noise distribution.

To investigate the variance of our moment estimators, we perform 100 repetitions of both the Gaussian noise and Gamma noise experiments. In figure 4, we plot the mean absolute estimation error of the sequences *Γ*_{M} for the Gaussian noise experiment, along with error bars representing one standard error of the mean over these repetitions. Figure 5 shows similar plots for the Gamma noise experiment.

### (b) Covariance estimation

To investigate our covariance estimators, we generate samples of *N*=1000 observation pairs (*x*_{i}, *y*_{i})∈ℝ^{2}×ℝ^{2}, where

- the regression function is defined to be *f*(*x*_{1}, *x*_{2})=20(*x*_{1}−1/2)^{2}+20(*x*_{2}−1/2)^{2};
- the explanatory samples *x*_{i} are selected independently according to the uniform distribution over the unit square [0, 1]^{2};
- the residual samples *r*_{i} are selected independently according to the multivariate Gaussian distribution *N*(0, Σ); and
- the response samples *y*_{i} are set to equal *f*(*x*_{i})+*r*_{i}.

For a single run of this experiment, figure 6 shows the estimate sequences for each covariance component, along with the associated empirical covariances shown by green lines (1≤*α*, *β*≤*n*). It is evident from the plots that the final values are likely to provide reasonable estimates for the corresponding residual covariances.

The empirical residual covariance matrix and the estimated residual covariance matrix are shown in (5.1). Because covariance matrices are symmetric by definition, we should replace each pair of symmetric components of the estimate by their average value, which in this case would lead to an improved estimator for 𝔼(*RR*^{T}),(5.1)
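The suggested symmetrization is a one-line averaging of mirror components; a sketch (with an invented helper name):

```python
def symmetrize(C):
    """Replace each pair of symmetric components C[a][b], C[b][a] of a
    covariance estimate by their average, since covariance matrices are
    symmetric by definition."""
    n = len(C)
    return [[(C[a][b] + C[b][a]) / 2.0 for b in range(n)] for a in range(n)]
```
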

### (c) Smoothing windows

For scalar data (*x*_{i}, *y*_{i})∈ℝ×ℝ, one of the simplest non-parametric modelling techniques is to apply a *smoothing window* to the data, where the regression function value *f*(*x*_{i}) at *x*_{i} is estimated simply by the average of the associated point *y*_{i} and a certain number of its nearest neighbours,

*f̂*_{W}(*x*_{i}) = (1/(*W*+1)) (*y*_{i} + Σ_{k=1}^{W} *y*_{i(k)}),  (5.2)

where the *window size W* controls the amount of smoothing applied to the data. Consider the empirical error function associated with a particular value of *W*,

*ê*_{W}(*x*_{i}) = *y*_{i} − *f̂*_{W}(*x*_{i}).  (5.3)

In view of (1.9), we define the optimal window size to be that for which the moments of the empirical error function best match the estimated moments of the noise distribution. In fact, for *q*=2, 3 and 4 we propose the following notions of what constitutes an optimal window size:

*W*_{q} = arg min_{W} Σ_{t=2}^{q} ((1/*N*) Σ_{i=1}^{N} *ê*_{W}(*x*_{i})^{t} − *Γ*_{N}^{(t)})^{2},  (5.4)

where *Γ*_{N}^{(t)} denotes the estimate (2.4) of the moment 𝔼(*R*^{t}).
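A sketch of the smoothing window and the moment-matching window selection might read as follows; the criterion implemented here matches a single *q*th error moment against a supplied noise-moment estimate, which is one simplified reading of (5.4), and all names are illustrative.

```python
def window_fit(x, y, W):
    """Smoothing-window estimate (5.2): f_hat(x_i) is the mean of y_i
    and its W nearest neighbours (scalar x, brute-force neighbours)."""
    n = len(x)
    fhat = []
    for i in range(n):
        # x_i itself sorts first (distance 0), so take W + 1 points
        idx = sorted(range(n), key=lambda j: abs(x[i] - x[j]))[:W + 1]
        fhat.append(sum(y[j] for j in idx) / (W + 1))
    return fhat

def optimal_window(x, y, q, moment_est, w_max):
    """Pick the window size whose empirical qth error moment (5.3) best
    matches moment_est, a supplied estimate of E(R^q)."""
    best_W, best_gap = None, float("inf")
    for W in range(1, w_max + 1):
        fhat = window_fit(x, y, W)
        emp = sum((yi - fi) ** q for yi, fi in zip(y, fhat)) / len(y)
        gap = abs(emp - moment_est)
        if gap < best_gap:
            best_W, best_gap = W, gap
    return best_W
```
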

In table 2, we show the average mean-squared estimation error of the models associated with the window sizes (5.4), computed over 100 repetitions of both the Gaussian noise and Gamma noise experiments. For Gaussian noise, the table shows that improved models can be obtained using *W*_{4} instead of *W*_{2}. The results also show that *W*_{3} offers no advantage over *W*_{2}, which is reasonable in view of the fact that the Gaussian distribution is symmetric and therefore has third moment equal to zero. By contrast, the table shows that *W*_{3} does offer an advantage over *W*_{2} for the Gamma noise distribution. This is also reasonable since the Gamma distribution is non-symmetric, and therefore has non-zero third moment. While the advantage over *W*_{2} of *W*_{4} for Gaussian noise and *W*_{3} for Gamma noise appears to be very slight, it must be remembered that smoothing windows are rather crude non-parametric regression methods, and it may be that more significant advantages can be achieved with more sophisticated modelling techniques.

## 6. Conclusion

We have proposed new estimators of residual moments and covariance for non-parametric data analysis. Standardized moments (skewness, kurtosis) and correlation can be estimated by normalizing with respect to the appropriate standard deviation estimates. The estimators are computationally efficient, with running time of order 𝒪(*N* log *N*) as the number of data points *N*→∞. We have shown that our moment estimators are asymptotically unbiased and weakly consistent as *N*→∞.

In practical applications, the sample mean sequence *Γ*_{M} is plotted to see whether it approaches a stable value as *M* increases. If this occurs, we conclude that there are sufficient data to ensure that the estimation error is small and proceed to estimate the moment or covariance component by inspection. If *Γ*_{M} does not appear to approach a stable value as *M*→*N*, we must conclude that we have insufficient data to allow accurate noise estimates using the methods described in this paper.

## Acknowledgments

D.E. would like to thank the Royal Society for supporting his research through its University Research Fellowship scheme.

## Footnotes

- Received August 1, 2007.
- Accepted May 6, 2008.

This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

- Copyright © 2008 The Royal Society