Non-parametric estimation of residual moments and covariance

The aim of non-parametric regression is to model the behaviour of a response vector $Y$ in terms of an explanatory vector $X$, based only on a finite set of empirical observations. This is usually performed under the additive hypothesis $Y = f(X) + R$, where $f(X) = E(Y \mid X)$ is the true regression function and $R$ is the true residual variable. Subject to a Lipschitz condition on $f$, we propose new estimators for the moments (scalar response) and covariance (vector response) of the residual distribution, derive their asymptotic properties and discuss their application in practical data analysis.


Introduction
In the analysis of complex systems, it is often the case that we have no prior information regarding the system under investigation, and the analysis must proceed based only on empirical observations of the system behaviour. Non-parametric regression methods attempt to model the behaviour of an observable random vector $Y \in \mathbb{R}^n$ in terms of another observable random vector $X \in \mathbb{R}^m$, based only on a finite set of observations.
The relationship between the response vector $Y$ and the explanatory vector $X$ is often assumed to satisfy the additive hypothesis
$$Y = f(X) + R, \qquad (1.1)$$
where $f(X) = E(Y \mid X)$ is the true regression function and $R$ is the true residual vector, which we call noise. For a scalar response ($n = 1$) we present new non-parametric estimators for the moments $E(R^q)$ of the noise distribution ($q = 2, 3, 4, \ldots$), and for a vector response ($n > 1$) we describe a new estimator for the covariance matrix $E(RR^T)$. Our estimators are computed from a finite sample of independent and identically distributed observations of the joint variable $(X, Y)$ by exploiting the nearest neighbour structure of the explanatory observations. Using techniques first described in Bentley (1975), our estimates can be computed with asymptotic time complexity of order $O(N \log N)$ as the number of points $N \to \infty$.
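Since all of the estimators below are driven by the nearest neighbour structure of the explanatory sample, it may help to fix ideas with a small sketch (ours, not the authors' code). For clarity it computes the full distance matrix, which costs $O(N^2)$; replacing this with a k-d tree query (Bentley 1975) gives the $O(N \log N)$ complexity quoted above.

```python
import numpy as np

def nn_indices(X, q):
    """Return an (N, q) array whose i-th row lists the indices of the q
    nearest neighbours of X[i] among the other sample points.

    Brute-force O(N^2) illustration; a k-d tree (Bentley 1975) reduces
    the overall cost to O(N log N)."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:                       # N scalar explanatory observations
        X = X[:, None]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)           # a point is not its own neighbour
    return np.argsort(D, axis=1)[:, :q]
```

The same index array serves every estimator in the paper, so in practice it is computed once per sample.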
Our main assumptions are that the true regression function $E(Y \mid X)$ is a smooth function of the explanatory vector $X$, and that the zero-mean noise vector $R$ is independent of the explanatory vector (homoscedastic). The noise vector represents all variations in the response vector that are not accounted for by any smooth transformation of the explanatory vector, and includes what might be called deterministic variation due to non-random factors that influence the response vector but which are not included in the explanatory vector, as well as stochastic variation due to purely random factors such as measurement error.
Our estimators belong to the class of difference-based methods. These date back to von Neumann (1941), who, for the case $m = 1$ and under the assumption of homoscedastic noise, proposed that the residual variance can be estimated by the average of squared successive differences of the observations,
$$\widehat{E(R^2)} = \frac{1}{2(N-1)} \sum_{i=2}^{N} (Y_i - Y_{i-1})^2. \qquad (1.2)$$
This estimator was investigated by Rice (1984) and subsequently extended by Gasser et al. (1986) and Hall et al. (1990), who used interpolation to reduce the bias of the so-called 'pseudo-residual' $Y_i - Y_{i-1}$. For the case $m = 1$, Levine (2006) and Brown & Levine (2007) considered heteroscedastic noise and proposed difference-based estimators for the residual variance function $E(R^2 \mid X)$ using weighted local averages of the squared pseudo-residuals. A related paper by Wang et al. (2008) investigated the effect that the mean function $f(X) = E(Y \mid X)$ has on estimators of the variance function. For the case $m \ge 1$, Munk et al. (2005) showed that the bias of simple difference-based estimators such as (1.2) has asymptotic order $O(N^{-1/m})$ as $N \to \infty$, so such estimators are not $\sqrt{N}$-consistent for $m \ge 4$. They also describe estimators that allow better control of the bias and remain consistent when $m \ge 4$. The case $m \ge 1$ is also discussed in Bock et al. (2007), where the assumption that the explanatory vectors lie on a rectangular grid is relaxed by considering their nearest neighbour structure. For the case $m \ge 1$ and under the assumption of homoscedastic noise, Durrant (2001) presented difference-based estimators for the higher moments $E(R^q)$ of symmetric noise distributions. To the best of our knowledge, this is the only previous attempt at estimating the higher moments of residual distributions.

(a) Notation and conditions
Let $(X_1, Y_1), \ldots, (X_N, Y_N)$ be a sample of independent and identically distributed observations of the joint variable $(X, Y)$. To establish asymptotic bounds on the bias and variance of our estimators as the number of points $N \to \infty$, we impose some fairly strong conditions on the distribution of the explanatory observations $X_1, \ldots, X_N$. To this end, let $F : \mathbb{R}^m \to [0, 1]$ be a probability distribution whose density $f(x) = F'(x)$ exists at every point $x \in \mathbb{R}^m$.

Definition 1.1. A probability distribution $F$ satisfies a smooth density condition if its density has bounded partial derivatives at every point $x \in \mathbb{R}^m$.
For $x \in \mathbb{R}^m$, let $B_x(s)$ denote the ball of radius $s$ centred at $x$, and let $\omega_x(s)$ denote its probability measure. The positive density condition (Gruber 2004) ensures that the probability measure of an arbitrary ball can be bounded in terms of its radius. If a distribution satisfies a smooth density condition, it has been shown that the distribution also satisfies a positive density condition provided the support $S_f$ of the density is a compact convex body in $\mathbb{R}^m$. That a distribution satisfies the positive density condition essentially means that its density is well behaved in small neighbourhoods. If the distribution also satisfies a smooth density condition, then its density is approximately uniform in small neighbourhoods. The l.h.s. inequality of (1.4) ensures that the density is zero outside some bounded region. The positive density condition also ensures that the probability measure of small neighbourhoods scales with their Lebesgue measure, and thus excludes the case where the support of the density is a fractal set.
We state the following conditions.
(i) The true regression function $f(X) = E(Y \mid X)$ is Lipschitz continuous. Let $b$ denote the associated Lipschitz constant, so that $\|f(x) - f(x')\| \le b \|x - x'\|$ for all $x, x' \in \mathbb{R}^m$.
(ii) The explanatory observations $X_1, \ldots, X_N$ are independent and identically distributed random vectors in $\mathbb{R}^m$, and their common distribution $F$ satisfies smooth and positive density conditions. Furthermore, the support $S_f$ of the associated density is a compact subset of $\mathbb{R}^m$ such that $\|x - x'\| \le 1$ for all $x, x' \in S_f$.
(iii) The (implicit) noise observations $R_1, \ldots, R_N$ are independent and identically distributed random vectors, and to estimate $E(R^q)$ their common distribution $J$ must have bounded moments up to order $2q$.
(iv) For every observation pair $(X_i, Y_i)$, the explanatory vector $X_i$ and the (implicit) noise vector $R_i$ are independent (homoscedasticity).
We call $F$ the sampling distribution, and $J$ the noise distribution or the true residual distribution. The condition that $\|x - x'\| \le 1$ for all $x, x' \in S_f$ is imposed to ensure that $\|x - x'\|^t \le \|x - x'\|$ for all $t \ge 1$. In practice, this condition can be enforced by an appropriate scaling of the explanatory observations $X_1, \ldots, X_N$, which might lead to a modification of the Lipschitz constant $b$ defined in condition (i).

(b) Optimal estimators
A number of algorithms for non-parametric regression exist in the literature, most of which require an estimate of the noise variance $E(R^2)$ to control the amount of smoothing applied to the data. Too little smoothing means that noise is incorporated into the estimate, while too much smoothing leads to certain characteristics of the underlying regression function being lost. An estimate of $E(R^2)$ can help to identify suitable bandwidths for kernel regression (Silverman 1986), second-derivative penalties for spline smoothing (Eubank 1999), stopping criteria for neural network training (Jones 2004) and thresholds for wavelet smoothing (Donoho & Johnstone 1995).
To evaluate an estimator $\hat{f}(X)$ of the true regression function $f(X)$, we consider the associated error function
$$e(X) = \hat{f}(X) - f(X). \qquad (1.6)$$
In practice, $e(X)$ is unknown and instead we must consider the empirical error function
$$\hat{e}(X, Y) = Y - \hat{f}(X). \qquad (1.7)$$
Since the explanatory vector $X$ and the noise vector $R$ are assumed to be independent, it follows by (1.1) that $\hat{e} = R - e(X)$, so the empirical mean-squared error satisfies
$$E(\hat{e}^2 \mid X) = E(R^2) + e(X)^2. \qquad (1.8)$$
Hence if $E(\hat{e}^2 \mid X)$ is close to the noise variance $E(R^2)$, the error function must be close to zero at $X$, and we might therefore define an optimal estimator to be one for which $E(\hat{e}^2 \mid X) = E(R^2)$ for all $X \in S_f$. More generally, the $q$th moment of the empirical error function satisfies
$$E(\hat{e}^q \mid X) = \sum_{t=0}^{q} \binom{q}{t} E(R^t) \, (-e(X))^{q-t}, \qquad (1.9)$$
and we might instead define an optimal estimator to be one for which $E(\hat{e}^q \mid X) = E(R^q)$ for all $q = 1, 2, \ldots$ and $X \in S_f$. Thus, we are motivated to estimate the moments of the noise distribution.

Products of differences
Let $\mathcal{X} = (X_1, \ldots, X_N)$ and $\mathcal{Y} = (Y_1, \ldots, Y_N)$ denote the marginal samples. Under the additive hypothesis (1.1), the noise variance $E(R^2)$ can be estimated by the mean-squared difference between pairs of response vectors $Y$ and $Y'$ for which the corresponding explanatory vectors $X$ and $X'$ are nearest neighbours in the marginal sample $\mathcal{X}$. By (1.1), since $R$ and $R'$ are independent and identically distributed zero-mean random variables, it follows that
$$\tfrac{1}{2} E\bigl((Y - Y')^2\bigr) = E(R^2) + \tfrac{1}{2} \bigl(f(X) - f(X')\bigr)^2,$$
where the second term on the r.h.s. represents the bias of $\tfrac{1}{2} E((Y - Y')^2)$ as an estimator for $E(R^2)$. Rather than using squared differences, we propose an alternative estimator for $E(R^2)$ based on products of differences $(Y - Y')(Y - Y'')$, where $Y'$ and $Y''$ are the response vectors for which the corresponding explanatory vectors $X'$ and $X''$ are, respectively, the first and second nearest neighbours of $X$. By (1.1), since $R$, $R'$ and $R''$ are independent and identically distributed zero-mean random variables, we have that
$$E\bigl((Y - Y')(Y - Y'')\bigr) = E(R^2) + \bigl(f(X) - f(X')\bigr)\bigl(f(X) - f(X'')\bigr),$$
where the second term on the r.h.s. represents the bias of $E((Y - Y')(Y - Y''))$ as an estimator for $E(R^2)$. By the Lipschitz condition on $f$,
$$E\bigl((Y - Y')(Y - Y'')\bigr) = E(R^2) + O\bigl(E(\|X - X''\|)\bigr) \quad \text{as } \|X - X''\| \to 0. \qquad (2.3)$$
Estimators for the higher order moments $E(R^q)$ can be similarly constructed using $q$-fold products of differences over the first $q$ nearest neighbours of $X$. Let $i(k)$ denote the index of the $k$th nearest neighbour of $X_i$ among the points $X_1, \ldots, X_N$. We propose the following sample mean estimator for $E(R^q)$:
$$G_N = \frac{1}{N} \sum_{i=1}^{N} g_i, \quad \text{where } g_i = \prod_{k=1}^{q} \bigl(Y_i - Y_{i(k)}\bigr). \qquad (2.4)$$
To compute the $q$th moment, we need not restrict ourselves to the first $q$ nearest neighbours of $X_i$, since any set of $q$ points in the neighbourhood of $X$ can be used to compute a $q$-fold product of differences and hence an estimate for $E(R^q)$. Although using more distant neighbours is likely to increase the bias of the estimate, these products might be useful to reduce statistical error when the number of observations is small.
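For scalar data, the sample mean of the $q$-fold products of differences just described can be sketched as follows (a brute-force illustration of the idea, not the authors' implementation; a k-d tree would replace the quadratic neighbour search in practice).

```python
import numpy as np

def residual_moment(x, y, q):
    """Sketch of the proposed moment estimator for scalar data: average
    the q-fold products g_i = prod_k (y_i - y_{i(k)}) taken over the
    first q nearest neighbours i(1), ..., i(q) of each x_i."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    D = np.abs(x[:, None] - x[None, :])   # brute-force neighbour search
    np.fill_diagonal(D, np.inf)
    nbrs = np.argsort(D, axis=1)[:, :q]   # indices i(1), ..., i(q)
    g = np.prod(y[:, None] - y[nbrs], axis=1)
    return g.mean()
```

For example, with a smooth regression function and homoscedastic noise, `residual_moment(x, y, 2)` should approach the noise variance $E(R^2)$ as the sample grows.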
We can also form products of differences over a set of neighbours that are not necessarily distinct. In this case, some differences are repeated in the product, and the method returns an estimate for a known polynomial function of the moments up to order q. The precise form of the polynomial depends on how many differences are repeated, and how many times.

(a) Covariance
For the case where the response is a vector $Y \in \mathbb{R}^n$, since the noise vector has mean zero, it follows by (1.1) that the covariance matrix of the noise distribution is $E(RR^T)$. To estimate $E(RR^T)$, let $(X', Y')$ and $(X'', Y'')$ denote the sample points for which $X'$ and $X''$ are, respectively, the first and second nearest neighbours of the point $X$ in the marginal sample $\mathcal{X}$. As above, by the Lipschitz condition on $f$, the differences $\|f(X) - f(X')\|$ and $\|f(X) - f(X'')\|$ are small provided the corresponding distances $\|X - X'\|$ and $\|X - X''\|$ are also small. Thus, because the noise components $R$, $R'$ and $R''$ are independent and identically distributed zero-mean random vectors, we have
$$E\bigl((Y - Y')(Y - Y'')^T\bigr) = E(RR^T) + O\bigl(E(\|X - X''\|)\bigr) \quad \text{as } \|X - X''\| \to 0, \qquad (2.6)$$
and we propose the following sample mean estimator for $E(RR^T)$:
$$G_N^{\mathrm{cov}} = \frac{1}{N} \sum_{i=1}^{N} g_i^{\mathrm{cov}}, \quad \text{where } g_i^{\mathrm{cov}} = \bigl(Y_i - Y_{i(1)}\bigr)\bigl(Y_i - Y_{i(2)}\bigr)^T. \qquad (2.7)$$
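A sketch of this covariance estimator might look as follows (ours, not the authors' code; brute-force neighbour search for clarity, and the returned matrix is symmetrized since covariance matrices are symmetric by definition).

```python
import numpy as np

def residual_covariance(X, Y):
    """Sketch of the proposed covariance estimator: average the outer
    products (Y_i - Y_{i(1)})(Y_i - Y_{i(2)})^T over the sample, where
    i(1), i(2) index the first and second nearest neighbours of X_i."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    order = np.argsort(D, axis=1)
    d1 = Y - Y[order[:, 0]]               # Y_i - Y_{i(1)}
    d2 = Y - Y[order[:, 1]]               # Y_i - Y_{i(2)}
    G = np.einsum('ia,ib->ab', d1, d2) / len(X)
    return 0.5 * (G + G.T)                # covariance matrices are symmetric
```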

Asymptotic results
We now proceed to establish an asymptotic upper bound on the bias of our moment estimator $G_N$, and verify its probabilistic convergence as the number of observations $N \to \infty$. To this end, we first consider the mean distance from a point to its $q$th nearest neighbour in the marginal sample $\mathcal{X} = (X_1, \ldots, X_N)$,
$$D_N = \frac{1}{N} \sum_{i=1}^{N} d_i, \quad \text{where } d_i = \|X_i - X_{i(q)}\|.$$
An argument almost identical to one found in the literature leads to the following result, which we state without proof.
Lemma 3.1. Subject to condition (ii), $E(D_N) = O(N^{-1/m})$ as $N \to \infty$, with implied constant $a \, \Gamma(q + 1/m)/\Gamma(q)$, where $a$ is the constant implied by the positive density condition on the sampling distribution and $\Gamma$ denotes the Euler Gamma function.
Theorem 3.2 shows that $G_N$ is an asymptotically unbiased estimator for $E(R^q)$ as the number of points $N \to \infty$.

Theorem 3.2. Subject to conditions (i)-(iv), the bias of $G_N$ satisfies
$$\bigl| E(G_N) - E(R^q) \bigr| \le c \, N^{-1/m} \quad \text{asymptotically as } N \to \infty,$$
where $c = c(m, q, a, b)$ involves the factor $a \, \Gamma(q + 1/m)/\Gamma(q)$; here $a$ is the constant implied by the positive density condition on the sampling distribution, $b$ is the constant implied by the Lipschitz condition on the regression function, and $\Gamma$ denotes the Euler Gamma function.
Proof. Let $A = \{1, \ldots, q\}$ and let $A(t)$ denote the set of all subsets $B \subset A$ containing exactly $t$ elements. Writing $Y_i - Y_{i(k)} = (f(X_i) - f(X_{i(k)})) + (R_i - R_{i(k)})$ and applying the identity
$$\prod_{k \in A} (u_k + v_k) = \sum_{t=0}^{q} \sum_{B \in A(t)} \prod_{k \in B} u_k \prod_{k \in A \setminus B} v_k,$$
we obtain
$$g_i = \sum_{t=0}^{q} \sum_{B \in A(t)} \prod_{k \in B} \bigl(f(X_i) - f(X_{i(k)})\bigr) \prod_{k \in A \setminus B} \bigl(R_i - R_{i(k)}\bigr).$$
Expanding the product $\prod_{k \in A \setminus B} (R_i - R_{i(k)})$, since $A \setminus B$ contains $q - t$ distinct integers, the first term of the expansion is $R_i^{q-t}$, while the remaining terms are products in which at least one of the $R_{i(k)}$ occurs exactly once. Since the noise variables $R_j$ are independent and identically distributed with $E(R_j) = 0$, and because they are also independent of the sample $\mathcal{X}$, the expected value of the first term is equal to $E(R_i^{q-t})$ and the expected value of each remaining term is zero. Thus the conditional expectation of $g_i(\mathcal{X}, \mathcal{Y})$ given a fixed realization of the explanatory sample $\mathcal{X}$ satisfies
$$E(g_i \mid \mathcal{X}) = \sum_{t=0}^{q} E(R^{q-t}) \sum_{B \in A(t)} \prod_{k \in B} \bigl(f(X_i) - f(X_{i(k)})\bigr). \qquad (3.9)$$
The $t = 0$ term is $E(R^q)$, so the bias of $G_N$ is governed by the terms with $t \ge 1$. For $B \in A(t)$ with $t \ge 1$,
$$\Bigl| \prod_{k \in B} \bigl(f(X_i) - f(X_{i(k)})\bigr) \Bigr| \le b^t \, \|X_i - X_{i(q)}\|^t \le b^t \, \|X_i - X_{i(q)}\| = b^t d_i,$$
where the first inequality follows by the Lipschitz condition on $f$ and the second by the assumption that $\|X_i - X_{i(q)}\| \le 1$. Taking expectations and averaging over $i$ yields the stated bound on the bias.

If the first $2q$ moments of the noise distribution are absolutely bounded, $\mathrm{var}(g_i)$ is also bounded. Chebyshev's inequality then yields the following corollary of theorems 3.2 and 3.3, which shows that $G_N$ is a weakly consistent estimator for $E(R^q)$ as the number of observations $N \to \infty$.

Corollary 3.4. Subject to conditions (i)-(iv) and the additional requirement that $|E(R^t)| < \infty$ for all $t = 1, \ldots, 2q$,
$$G_N = E(R^q) + O(N^{-1/m}) + O_P(N^{-1/2}) \quad \text{as } N \to \infty. \qquad (3.18)$$
The first and second error terms in (3.18) correspond to the bias and variance of $G_N$, respectively. Theorem 3.2 shows that the bias of $G_N$ is dominated by the expected mean nearest neighbour distance $E(D_N)$, which by lemma 3.1 is of asymptotic order $O(N^{-1/m})$ as $N \to \infty$. Hence the rate at which the bias of $G_N$ converges to zero decreases as the dimension of the explanatory vectors increases. By contrast, the rate at which the second error term in (3.18) converges (in probability) to zero does not depend on the dimension, but only on the fact that $\mathrm{var}(g_i)$ is bounded.
An improved rate of convergence is obtained if there exists some $\alpha > 0$ such that $\mathrm{var}(g_i) = O(N^{-\alpha})$ as $N \to \infty$, in which case the variance of $G_N$ will be of asymptotic order $O_P(N^{-1/2 - \alpha/2})$ as $N \to \infty$. However, subject to conditions (i)-(iv), it is easy to show that
$$\mathrm{var}(g_i) = E(R^{2q}) + q E(R^2) E(R^{2q-2}) + \cdots - E(R^q)^2 + O(d_i) \quad \text{as } N \to \infty, \qquad (3.19)$$
so we cannot improve on the rate at which the variance of $G_N$ converges to zero. Instead of using (2.4), it is interesting to speculate whether $g_i$ can be defined to ensure that both $E(g_i) \to E(R^q)$ and $\mathrm{var}(g_i) \to 0$ as $N \to \infty$. In particular, if $\mathrm{var}(g_i)$ can be bounded by some constant multiple of the expected nearest neighbour distance $E(d_i)$, it follows by lemma 3.1 that the variance of $G_N$ will be of asymptotic order $O_P(N^{-1/2 - 1/(2m)})$ as $N \to \infty$, which would ensure the $\sqrt{N}$-consistency of $G_N$ for all $m \ge 1$ (Munk et al. 2005).

Practical considerations
In practical non-parametric data analysis, the implied constants in the asymptotic expression (3.18) are unknown, and thus we are not able to establish analytic confidence intervals on an estimate computed by $G_N$. Let $(\tilde{x}, \tilde{y})$ be a realization of the random sample $(\mathcal{X}, \mathcal{Y})$, with $\tilde{x} = (x_1, \ldots, x_N)$ and $\tilde{y} = (y_1, \ldots, y_N)$. For the estimate $G_N(\tilde{x}, \tilde{y})$, we refer to the errors incurred due to the bias and variance of $G_N$ as the systematic error and the statistical error, respectively.
The systematic error associated with $G_N(\tilde{x}, \tilde{y})$ is due to the variability of the regression function $f(x) = E(Y \mid X = x)$ over the $q$-neighbourhood ball of each sample point $x_i \in \tilde{x}$ (defined to be the ball centred at $x_i$ and having the $q$th nearest neighbour $x_{i(q)}$ of $x_i$ on its boundary). By the Lipschitz condition on $f$, the extent to which $f(x)$ varies over the $q$-neighbourhood ball of $x_i$ can be bounded in terms of its radius $d_i(\tilde{x})$. Hence, for a particular realization $(\tilde{x}, \tilde{y})$ of the random sample $(\mathcal{X}, \mathcal{Y})$, to reduce the systematic error of our estimate it makes sense to consider only those $g_i(\tilde{x}, \tilde{y})$ for which the associated value of $d_i(\tilde{x})$ is small.
Thus, we reorder the sequence $(d_1(\tilde{x}), g_1(\tilde{x}, \tilde{y})), \ldots, (d_N(\tilde{x}), g_N(\tilde{x}, \tilde{y}))$ to be increasing with respect to $d_i(\tilde{x})$, and define a sequence of estimates
$$G_M = \frac{1}{M} \sum_{i=1}^{M} g_i(\tilde{x}, \tilde{y}), \qquad M = 1, \ldots, N,$$
where the sum runs over the reordered sequence. The question now becomes whether or not $M$ can be chosen so that the total error is as small as possible. In the spirit of exploratory statistics, we plot the sequence $G_M$ against $M$ and visually assess the behaviour of $G_M$ as $M \to N$. If $G_M$ is seen to approach a stable value as $M$ increases, we conclude that there are sufficient data to ensure that the total error is small, and proceed to estimate the moment $E(R^q)$ by inspection.
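The reordering and running-mean construction just described can be sketched in a few lines (an illustration of the procedure, not the authors' code):

```python
import numpy as np

def estimate_sequence(d, g):
    """Reorder the pairs (d_i, g_i) by increasing neighbourhood radius
    d_i and return the running means G_1, ..., G_N, i.e. the sequence
    that is plotted against M and inspected for stabilization."""
    d = np.asarray(d, dtype=float)
    g = np.asarray(g, dtype=float)[np.argsort(d)]
    return np.cumsum(g) / np.arange(1, len(g) + 1)
```

Plotting the returned array against $M$ reproduces the diagnostic curves used throughout the experiments.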
Concerning systematic error, it may be that the $(q+1)$-neighbourhood ball of a point $x_i$ is significantly smaller than the $q$-neighbourhood ball of another point $x_j$, in which case we might expect that the $q$-fold product $g_i' = \prod_{k=2}^{q+1} (y_i - y_{i(k)})$ will incur less systematic error than $g_j = \prod_{k=1}^{q} (y_j - y_{j(k)})$. However, replacing $g_j$ by $g_i'$ does not necessarily lead to a better estimate, because the empirical noise realizations corresponding to $g_i$ and $g_i'$ are estimated by almost identical products of noise differences. Consequently, the estimator suffers a loss of information regarding the noise distribution, and statistical error increases accordingly. Systematic error might also be reduced by attempting to estimate the limit of $G_N$ as $D_N \to 0$ (Tong & Wang 2005). Thus, for $k = 1, \ldots, p$ we might define analogous products $g_i(k)$ over more distant neighbours, with associated neighbourhood radii $d_{i,k}$, along with the corresponding sample means $G_N(k)$ and $D_N(k)$. The limit of the $G_N(k)$ as $D_N(k) \to 0$ can then be estimated using simple linear regression on the pairs $(D_N(k), G_N(k))$, which yields an extrapolated estimator $\tilde{G}_N$. The success of this approach depends on the nearest neighbour distances $d_{i,k}$ being sufficiently small to ensure that the systematic error scales approximately linearly with $D_N(k)$. Empirical results indicate that $\tilde{G}_N$ does not offer any significant advantages over $G_N$.
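The extrapolation step amounts to an ordinary least-squares fit and the intercept at zero mean neighbour distance; a minimal sketch (ours, under the linear-bias assumption stated in the text):

```python
import numpy as np

def extrapolated_estimate(D, G):
    """Fit G(k) against D(k) by least squares and return the intercept,
    i.e. the extrapolation of the moment estimates to zero mean
    neighbour distance (a sketch of the regression idea above)."""
    slope, intercept = np.polyfit(np.asarray(D, dtype=float),
                                  np.asarray(G, dtype=float), 1)
    return intercept
```

For instance, sample means lying exactly on a line $G(k) = 0.25 + D(k)$ extrapolate back to the value 0.25 at $D = 0$.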

(a) Covariance
Turning our attention to the covariance estimator $G_N^{\mathrm{cov}}$, for a particular sample realization $(\tilde{x}, \tilde{y})$ we reorder the sequence $(d_1(\tilde{x}), g_1^{\mathrm{cov}}(\tilde{x}, \tilde{y})), \ldots, (d_N(\tilde{x}), g_N^{\mathrm{cov}}(\tilde{x}, \tilde{y}))$ to be increasing with respect to $d_i(\tilde{x})$, and define a sequence of estimates
$$G_M^{\mathrm{cov}} = \frac{1}{M} \sum_{i=1}^{M} g_i^{\mathrm{cov}}(\tilde{x}, \tilde{y}), \qquad M = 1, \ldots, N.$$
For each pair of coordinates $1 \le a, b \le n$, we plot the corresponding component sequence $G_M^{\mathrm{cov}}(a, b)$ against $M$, then visually assess its behaviour as $M \to N$. If the curve appears to approach a stable value as $M \to N$, we proceed to estimate the component by inspection.

(a) Moment estimation
To investigate our moment estimators, we generate samples $(\tilde{x}, \tilde{y})$ of $N = 1000$ observation pairs $(x_i, y_i) \in \mathbb{R} \times \mathbb{R}$, where
- the regression function is defined to be $f(x) = 20(x - 1/2)^2$;
- the explanatory samples $x_i$ are selected independently;
- the residual samples $r_i$ are selected independently according to (i) the standard Gaussian distribution $N(0, 1)$ and (ii) the Gamma distribution, normalized to have mean zero and variance 0.25; and
- the response samples $y_i$ are set to equal $f(x_i) + r_i$.
Figure 1a shows a single sample realization for the Gaussian noise experiment and figure 1b shows a single realization for the Gamma noise experiment, while table 1 shows the estimated moments $G_N^{(q)}$ and the empirical moments $\langle r_i^q \rangle$ for the sample realizations shown in figure 1.
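A realization of the Gaussian noise experiment can be reproduced along the following lines. This is our sketch: the uniform design for the $x_i$ is an assumption, since the text does not record how the explanatory samples were drawn.

```python
import numpy as np

# One realization of the moment-estimation experiment described above.
rng = np.random.default_rng(42)
N = 1000
x = rng.uniform(0.0, 1.0, N)             # explanatory samples (assumed uniform)
r = rng.normal(0.0, 1.0, N)              # standard Gaussian residuals
y = 20.0 * (x - 0.5) ** 2 + r            # additive model y = f(x) + r

# Empirical moments <r_i^q> against which the estimates are compared.
empirical = {q: np.mean(r ** q) for q in (1, 2, 3, 4)}
```

For the Gamma noise experiment, `r` would instead be drawn from a Gamma distribution and shifted and scaled to have mean zero and variance 0.25.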
In figures 2 and 3, we plot the estimate sequences $G_M^{(q)}$ ($q = 1, 2, 3, 4$) for the Gaussian and Gamma noise experiments, respectively. For the Gamma noise experiment, the plots are more 'jagged' than those for the Gaussian noise experiment shown in figure 2. This is due to the heavy tail of the Gamma distribution, which produces outliers that can be seen in figure 1. These jagged features become increasingly apparent as $q$ increases, illustrating the well-known fact that sample estimates of higher order moments are negatively affected by outliers. In spite of this, it appears that the curves stabilize as $M \to N$, so that the final values $G_N^{(q)}$ can be adopted as reasonable estimates for the corresponding moments of the noise distribution.
To investigate the variance of our moment estimators, we perform 100 repetitions of both the Gaussian noise and Gamma noise experiments. In figure 4, we plot the mean absolute estimation error of the sequences $G_M^{(q)}$ for the Gaussian noise experiment, along with error bars representing one standard error of the mean over these repetitions. Figure 5 shows similar plots for the Gamma noise experiment.

(b) Covariance estimation
To investigate our covariance estimators, we generate samples $(\tilde{x}, \tilde{y})$ of $N = 1000$ observation pairs $(x_i, y_i) \in \mathbb{R}^2 \times \mathbb{R}^2$, where
- the regression function is defined to be $f(x_1, x_2) = 20(x_1 - 1/2)^2 + 20(x_2 - 1/2)^2$; and
- the explanatory samples $x_i$ are selected independently.
For a single run of this experiment, figure 6 shows the estimate sequences $G_M^{\mathrm{cov}}(a, b)$, along with the associated empirical covariances $\langle r_{ia} r_{ib} \rangle$ shown in green ($1 \le a, b \le n$). It is evident from the plots that the final values $G_N^{\mathrm{cov}}(a, b)$ are likely to provide reasonable estimates for the corresponding residual covariances.
The empirical residual covariance matrix and the estimated residual covariance matrix are shown in (5.1). Because covariance matrices are symmetric by definition, we should replace $G_N^{\mathrm{cov}}(1, 2)$ and $G_N^{\mathrm{cov}}(2, 1)$ by their average value, which in this case would lead to an improved estimate of $\langle r_{i1} r_{i2} \rangle$.

For scalar data $(x_i, y_i) \in \mathbb{R} \times \mathbb{R}$, one of the simplest non-parametric modelling techniques is to apply a smoothing window to the data, where the regression function value $f(x_i)$ at $x_i$ is estimated simply by the average of the associated point $y_i$ and a certain number of its nearest neighbours,
$$\hat{f}_W(x_i) = \frac{1}{W + 1} \Bigl( y_i + \sum_{k=1}^{W} y_{i(k)} \Bigr),$$
where the window size $W$ controls the amount of smoothing applied to the data. Consider the empirical error function associated with a particular value of $W$, $\hat{e}_W(x_i, y_i) = y_i - \hat{f}_W(x_i)$. In view of (1.9), we define the optimal window size to be that for which the moments of the empirical error function $\hat{e}_W$ best match the estimated moments $G_N^{(q)}$ of the noise distribution; for $q = 2, 3$ and $4$, let $\hat{W}_{\mathrm{opt}}^{(q)}$ denote the window size for which the first $q$ moments of $\hat{e}_W$ best match the corresponding estimated noise moments. For the Gaussian noise experiment, $\hat{W}_{\mathrm{opt}}^{(3)}$ offers no advantage over $\hat{W}_{\mathrm{opt}}^{(2)}$, which is reasonable in view of the fact that the Gaussian distribution is symmetric and therefore has third moment equal to zero. By contrast, the table shows that $\hat{W}_{\mathrm{opt}}^{(3)}$ does offer an advantage over $\hat{W}_{\mathrm{opt}}^{(2)}$ for the Gamma noise distribution. This is also reasonable since the Gamma distribution is non-symmetric, and therefore has non-zero third moment. While the advantage of $\hat{W}_{\mathrm{opt}}^{(3)}$ over $\hat{W}_{\mathrm{opt}}^{(2)}$ for Gamma noise appears to be very slight, it must be remembered that smoothing windows are rather crude non-parametric regression methods, and it may be that more significant advantages can be achieved with more sophisticated modelling techniques.
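The smoothing window and the moment-matching window selection can be sketched as follows. This is our illustration: the sum-of-absolute-discrepancies objective in `optimal_window` is an assumption, as the text does not reproduce the paper's exact matching criterion, and both helper names are ours.

```python
import numpy as np

def window_smooth(x, y, W):
    """Estimate f(x_i) by the average of y_i and its W nearest
    neighbours (the simple smoothing window described above)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    D = np.abs(x[:, None] - x[None, :])
    np.fill_diagonal(D, np.inf)
    nbrs = np.argsort(D, axis=1)[:, :W]
    return (y + y[nbrs].sum(axis=1)) / (W + 1)

def optimal_window(x, y, moment_estimates, W_grid):
    """Pick the window size whose empirical error moments best match
    the estimated noise moments; moment_estimates maps q to an
    estimate of E(R^q)."""
    y = np.asarray(y, dtype=float)
    def score(W):
        e = y - window_smooth(x, y, W)    # empirical errors for this W
        return sum(abs(np.mean(e ** q) - m) for q, m in moment_estimates.items())
    return min(W_grid, key=score)
```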

Conclusion
We have proposed new estimators of residual moments and covariance for non-parametric data analysis. Standardized moments (skewness, kurtosis) and correlation can be estimated by normalizing with respect to the appropriate standard deviation estimates. The estimators are computationally efficient, with running time of order $O(N \log N)$ as the number of data points $N \to \infty$. We have shown that our moment estimators are asymptotically unbiased and weakly consistent as $N \to \infty$.
In practical applications, the sample mean sequence $G_M$ is plotted to see whether it approaches a stable value as $M$ increases. If this occurs, we conclude that there are sufficient data to ensure that the estimation error is small, and proceed to estimate the moment or covariance component by inspection. If $G_M$ does not appear to approach a stable value as $M \to N$, we must conclude that we have insufficient data to allow accurate noise estimates using the methods described in this paper.
D.E. would like to thank the Royal Society for supporting his research through its University Research Fellowship scheme.