## Abstract

Mutual information quantifies the dependence in a relationship between random variables, and thus plays an important role in exploratory data analysis. We investigate a class of non-parametric estimators for mutual information, based on the nearest neighbour structure of observations in both the joint and marginal spaces. We demonstrate that, unless both marginal spaces are one-dimensional, a well-known estimator of this type can be computationally expensive under certain conditions, and we propose a computationally efficient alternative that has time complexity of order *O*(*N* log *N*) as the number of observations *N*→∞.

## 1. Introduction

Mutual information quantifies the dependence between two or more random vectors, and has many attractive properties (Cover & Thomas 1991). In particular, it provides a natural criterion for variable selection in regression problems, and has been used to identify appropriate time lags in nonlinear time-series analysis (Kantz & Schreiber 1997). The mutual information *I*(*X*, *Y*) between two random vectors *X* ∈ ℝ^{m} and *Y* ∈ ℝ^{n} is defined by

$$I(X,Y)=\int_{\mathbb{R}^{m}}\int_{\mathbb{R}^{n}}\phi_{z}(x,y)\log\frac{\phi_{z}(x,y)}{\phi_{x}(x)\,\phi_{y}(y)}\,\mathrm{d}y\,\mathrm{d}x,\tag{1.1}$$

where *ϕ*_{z} is the joint density of the pair *Z*=(*X*, *Y*), and *ϕ*_{x} and *ϕ*_{y} are the marginal densities of *X* and *Y*, respectively. Let *Z*_{1}, …, *Z*_{N} be a sample of independent and identically distributed observations *Z*_{i}=(*X*_{i}, *Y*_{i}) of the joint variable *Z*=(*X*, *Y*). In this paper, we investigate a class of estimators for mutual information based on the nearest neighbour structure of the observations in both the joint and marginal spaces. Unless both marginal spaces are one-dimensional (*m*=*n*=1), we demonstrate that, under certain conditions, an existing estimator of this type due to Kraskov *et al*. (2004) has computational complexity of order *O*(*N*^{1+α}) for some *α*>0 as the number of observations *N*→∞. We also propose an alternative estimator that, at the expense of increased estimation error, has computational complexity of order *O*(*N* log *N*) as *N*→∞, regardless of the dimensions of the marginal spaces.
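As a sanity check on the definition, (1.1) has a discrete analogue in which the integrals become sums over the joint probability mass function. The sketch below evaluates it for a hypothetical 2×2 joint distribution (the matrix `p_joint` is invented purely for illustration):

```python
import numpy as np

# Discrete analogue of (1.1): I(X, Y) = sum_{x,y} p(x,y) log[p(x,y) / (p(x) p(y))].
p_joint = np.array([[0.4, 0.1],
                    [0.1, 0.4]])    # hypothetical joint pmf of two binary variables
p_x = p_joint.sum(axis=1)           # marginal of X
p_y = p_joint.sum(axis=0)           # marginal of Y

def discrete_mi(p):
    """Mutual information (in nats) of a discrete joint distribution p."""
    px, py = p.sum(axis=1), p.sum(axis=0)
    return sum(p[i, j] * np.log(p[i, j] / (px[i] * py[j]))
               for i in range(p.shape[0]) for j in range(p.shape[1]) if p[i, j] > 0)

mi = discrete_mi(p_joint)                   # positive: the variables are dependent
mi_indep = discrete_mi(np.outer(p_x, p_y))  # product distribution: exactly zero
```

For the dependent example, `mi` ≈ 0.193 nats, while the product distribution gives zero, as independence requires.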

To model the behaviour of some response vector *Y*, the first step is to choose an appropriate explanatory vector *X*, consisting of other observable parameters that, as far as possible, combine to determine the behaviour of *Y*. In the literature, this is known as feature selection or model identification (Liu & Motoda 1998), and is the central problem in data mining (Witten & Frank 2005). For time-series analysis, a related problem is to determine optimal embeddings for time-delay reconstruction (Kantz & Schreiber 1997).

Suppose we have identified a set of *d* variables that we think might influence a given response vector. To construct an accurate model, we must choose a subset of these candidate variables whose elements combine to determine the behaviour of the response vector as far as possible. A candidate subset of explanatory variables can be evaluated by estimating the mutual information between it and the response vector. There are 2^{d}−1 non-trivial candidate subsets, so if *d* is large it is not feasible to evaluate them all. Instead, an optimization routine must be used to search through the set of candidate subsets, seeking those whose mutual information with respect to the observed response is a maximum. To simplify the model construction process, after choosing a suitable subset of explanatory variables, we should also remove any redundant variables in the subset, i.e. those that depend almost entirely on one or more of the other chosen variables. Mutual information can again be used to quantify this dependence, and this elimination step may require further optimization if the number of candidate variables is large. For successful optimization, it is essential that the evaluation metric used to assess candidate solutions is rapidly computable, in order that a comprehensive search of the solution space can be carried out. Thus, we are motivated to investigate computationally efficient estimators of mutual information.
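The greedy search strategy just described can be sketched as forward selection, scored here with a simple plug-in estimate of mutual information for integer-coded data; the helper functions and the binary test data are our own illustration, not part of the paper's method:

```python
import numpy as np

def plugin_mi(a, b):
    """Plug-in MI estimate for integer-coded samples (negatively biased for small N)."""
    mi = 0.0
    for u in np.unique(a):
        pa = np.mean(a == u)
        for v in np.unique(b):
            pab = np.mean((a == u) & (b == v))
            if pab > 0:
                mi += pab * np.log(pab / (pa * np.mean(b == v)))
    return mi

def greedy_select(features, response, d_max):
    """Forward selection: repeatedly add the candidate variable that maximises
    the estimated mutual information between the chosen subset and the response."""
    chosen, remaining = [], list(range(features.shape[1]))
    while remaining and len(chosen) < d_max:
        def score(j):
            # encode the candidate subset as a single discrete variable
            codes = np.unique(features[:, chosen + [j]], axis=0, return_inverse=True)[1]
            return plugin_mi(codes, response)
        best = max(remaining, key=score)
        chosen.append(best)
        remaining.remove(best)
    return chosen

# toy data: the response is an exact copy of feature 0
rng = np.random.default_rng(0)
feats = rng.integers(0, 2, size=(500, 3))
chosen = greedy_select(feats, feats[:, 0], d_max=1)
```

Greedy selection evaluates only *O*(*d*²) subsets rather than all 2^{d}−1, which is why a fast MI estimator matters.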

As we shall see, if both the marginal spaces are one-dimensional, the estimator of Kraskov *et al*. (2004) has the same computational complexity as our proposed alternative, namely of order *O*(*N* log *N*) as *N*→∞. Kraskov *et al*. (2004) also demonstrate that their estimator performs particularly well if the explanatory and response vectors are independent. Prior to the optimization, the estimator of Kraskov *et al*. (2004) could therefore be used to eliminate those candidate variables that appear to be (almost) independent of the response vector, and thus reduce the dimension of the search space.

## 2. Estimation

The estimators that we consider are based on the metric properties of nearest neighbour balls in both the joint and marginal spaces. First, we impose a condition on the probability distributions to ensure that the probability measure of a nearest neighbour ball can be bounded in terms of its radius.

### (a) The positive density condition

Let *F* be a distribution function on ℝ^{d} whose density function *ϕ* is smooth (i.e. has bounded partial derivatives) at every point *x* ∈ ℝ^{d}. For *r* > 0, let *B*_{x}(*r*) denote the ball of radius *r* centred at *x*, and let *ω*_{x}(*r*) and *v*_{x}(*r*) denote its probability measure and volume (Lebesgue measure), respectively,

$$\omega_{x}(r)=\int_{B_{x}(r)}\phi(u)\,\mathrm{d}u,\qquad v_{x}(r)=\int_{B_{x}(r)}\mathrm{d}u.\tag{2.1}$$

We restrict our attention to distributions that satisfy a *positive density* condition (Gruber 2004), which ensures that the probability measure of an arbitrary ball can be bounded in terms of its radius.

A probability distribution on ℝ^{d} satisfies a positive density condition if constants *β*>1 and *δ*>0 exist, such that

$$\frac{1}{\beta}\,v_{x}(r)\le\omega_{x}(r)\le\beta\,v_{x}(r)\quad\text{for all }x\in\operatorname{supp}(\phi)\text{ and all }0<r\le\delta.\tag{2.2}$$

For smooth densities, Evans *et al*. (2002) show that the positive density condition is satisfied if the support of the density is a compact convex body in ℝ^{d}. The positive density condition implies that the samples are drawn from a bounded region of ℝ^{d}, and thus excludes the possibility that a sample point can take arbitrarily large values. This is acceptable in practical non-parametric data analysis (where the analysis must be performed using only a finite number of empirical observations), because there is nothing to suggest that the variable in question can assume values significantly beyond, for example, the convex hull of the observed samples.

### (b) Nearest neighbours

Let *X*_{1}, …, *X*_{N} be a sample of independent and identically distributed observations in ℝ^{d}. For 1 ≤ *k* ≤ *N*−1, let *X*_{i(k)} denote the *k*th nearest neighbour of *X*_{i} among the sample points *X*_{1}, …, *X*_{N}, where proximity relations are defined relative to the *ℓ*^{∞} norm

$$\|x\|_{\infty}=\max_{1\le j\le d}|x^{(j)}|.$$

Let *B*_{x}(*i*, *k*) be the *k*th nearest neighbour ball of *X*_{i}, defined to be the ball (hypercube) centred at *X*_{i} and having the *k*th nearest neighbour of *X*_{i} on its boundary. Let *ω*_{x}(*i*, *k*) and *v*_{x}(*i*, *k*) denote the probability measure and volume (Lebesgue measure) of *B*_{x}(*i*, *k*), respectively,

$$\omega_{x}(i,k)=\int_{B_{x}(i,k)}\phi(u)\,\mathrm{d}u,\qquad v_{x}(i,k)=\int_{B_{x}(i,k)}\mathrm{d}u.\tag{2.3}$$

For a sampling distribution having smooth density, it is well known that the probability measure *ω*_{x}(*i*, *k*) of the *k*th nearest neighbour ball of any sample point has a beta distribution with parameters *k* and *N*−*k*. In particular,

$$\mathrm{E}[\log\omega_{x}(i,k)]=\psi(k)-\psi(N),\tag{2.4}$$

where *ψ* is the digamma function *ψ*(*x*)=*Γ*′(*x*)/*Γ*(*x*).
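A quick numerical check of the identity in (2.4): if *ω* has a Beta(*k*, *N*−*k*) distribution, the sample mean of log *ω* should match *ψ*(*k*)−*ψ*(*N*). A minimal Monte Carlo sketch:

```python
import numpy as np
from scipy.special import digamma

# If W ~ Beta(k, N-k), then E[log W] = psi(k) - psi(N).
k, N = 3, 100
rng = np.random.default_rng(0)
w = rng.beta(k, N - k, size=200_000)
mc_mean = np.log(w).mean()          # Monte Carlo estimate of E[log W]
exact = digamma(k) - digamma(N)     # right-hand side of (2.4)
```

The two quantities agree to within Monte Carlo error (a few times 10⁻³ at this sample size).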

### (c) Entropy estimation

The (differential) entropy *H*(*X*) of a random variable *X* is defined by

$$H(X)=-\int\phi_{x}(x)\log\phi_{x}(x)\,\mathrm{d}x,\tag{2.5}$$

where *ϕ*_{x} is the density of *X*. Non-parametric methods for entropy estimation are surveyed by Beirlant *et al*. (1997). One approach is to partition the space into a finite set of ‘boxes’, and approximate the integral of (2.5) by the sum

$$\hat{H}=-\sum_{j}p_{j}\log p_{j},\tag{2.6}$$

where *p*_{j} is the probability measure of the *j*th box and the index *j* runs over all boxes. The simplest method estimates *p*_{j} by the fraction of sample points falling in the *j*th box, which yields a negatively biased estimator. Grassberger (1988, submitted) presents a number of improved estimators of this type, and, in each case, provides an explicit analytic formula for the bias.

Another class of methods estimates log *ϕ*(*x*_{i}) at every sample point *x*_{i}, and proceeds to approximate the integral of (2.5) by the sample mean

$$\hat{H}(X)=-\frac{1}{N}\sum_{i=1}^{N}\log\hat{\phi}(x_{i}).\tag{2.7}$$

The simplest approach is to estimate *ϕ*(*x*_{i}) by the ratio *ω*_{x_{i}}(*r*)/*v*_{x_{i}}(*r*) for some fixed *r*>0, where *ω*_{x}(*r*) and *v*_{x}(*r*) are defined as in (2.1) and where *ω*_{x_{i}}(*r*) is estimated by the proportion of sample points contained in the ball *B*_{x_{i}}(*r*). If the radius *r* is small, the ball is likely to contain relatively few sample points, and the estimate is therefore prone to significant sampling error. As *r* increases, the sampling error decreases, but because the density becomes increasingly non-uniform over the ball, the estimate becomes increasingly affected by systematic error (bias).

Alternatively, we can estimate *ϕ*(*x*_{i}) by the ratio *ω*_{x}(*i*, *k*)/*v*_{x}(*i*, *k*), where *ω*_{x}(*i*, *k*) and *v*_{x}(*i*, *k*) are, respectively, the probability measure and volume (Lebesgue measure) of the *k*th nearest neighbour ball of *x*_{i}, which, by (2.4), yields the estimator

$$\widehat{\log\phi}(x_{i})=\psi(k)-\psi(N)-\log v_{x}(i,k).$$

For *k*=1, 2, …, we thus obtain the entropy estimator of Kozachenko & Leonenko (1987),

$$\hat{H}_{k}(X)=\psi(N)-\psi(k)+\frac{1}{N}\sum_{i=1}^{N}\log v_{x}(i,k).\tag{2.8}$$

For this estimator, which is also discussed by Wolsztynski *et al*. (2005), the trade-off between sampling error and bias is determined by the size of the *k*th nearest neighbour ball, and *k* must therefore be chosen appropriately to ensure that the total error is as small as possible. To determine the bias of *Ĥ*_{k}, we need an analytic expression for the expectation E[log *d*_{x}(*i*, *k*)], where *d*_{x}(*i*, *k*) is the distance from an arbitrary sample point to its *k*th nearest neighbour in the sample, and the expectation is taken over all realizations of the sample. For smooth sampling densities satisfying (2.2), Evans *et al*. (2002) obtain an asymptotic expression (with an explicit first-order term) for the moments E[*d*_{x}(*i*, *k*)^{γ}] as the number of points *N*→∞. It is a matter for conjecture whether a similar approach can yield a similar expression for E[log *d*_{x}(*i*, *k*)]. Furthermore, the way in which finite sample corrections, similar to those presented by Grassberger (submitted) for partition-based estimators, might be developed for sample mean estimators, such as (2.8), is also an open question.
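A minimal sketch of a Kozachenko–Leonenko-style estimate in one dimension, under the *ℓ*^{∞} convention above (so a nearest neighbour ball of radius *r* has volume 2*r*); the use of `scipy.spatial.cKDTree` is an implementation choice for this sketch, not part of the original method:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def kl_entropy(sample, k=5):
    """Entropy estimate in the form (2.8): psi(N) - psi(k) + mean(log v_x(i, k)),
    with l_inf nearest-neighbour balls; in one dimension v = 2r."""
    x = np.asarray(sample, dtype=float).reshape(-1, 1)
    n = len(x)
    # query k+1 neighbours because each point's nearest neighbour is itself
    dist, _ = cKDTree(x).query(x, k=k + 1, p=np.inf)
    r = dist[:, k]                       # k-th nearest neighbour distance
    return digamma(n) - digamma(k) + np.mean(np.log(2.0 * r))

rng = np.random.default_rng(0)
h_hat = kl_entropy(rng.standard_normal(4000), k=5)
h_true = 0.5 * np.log(2 * np.pi * np.e)  # exact entropy of a standard normal
```

For this sample size the estimate typically lies within a few hundredths of a nat of the true value.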

### (d) Mutual information estimation

Mutual information can be expressed as *I*(*X*, *Y*)=*H*(*X*)+*H*(*Y*)−*H*(*X*, *Y*), where *H*(*X*, *Y*) is the entropy of the joint variable *Z*=(*X*, *Y*). Thus, for *k*_{x}, *k*_{y}, *k*_{z} ∈ {1, 2, …}, it follows by (1.1) and (2.8) that

$$\hat{I}(X,Y)=\frac{1}{N}\sum_{i=1}^{N}\left[\log\frac{\omega_{z}(i,k_{z})}{\omega_{x}(i,k_{x})\,\omega_{y}(i,k_{y})}+\log\frac{v_{x}(i,k_{x})\,v_{y}(i,k_{y})}{v_{z}(i,k_{z})}\right].\tag{2.9}$$

Let 1 ≤ *k* ≤ *N*−1 be fixed and *Z*_{i(k)} be the *k*th nearest neighbour of *Z*_{i} in the joint sample (*Z*_{1}, …, *Z*_{N}). Following Kraskov *et al*. (2004), we take *k*_{x}=*k*_{x}(*i*, *k*) to be the number such that *X*_{i} is closer to exactly *k*_{x}−1 points of the marginal sample (*X*_{1}, …, *X*_{N}) than it is to *X*_{i(k)}, and *k*_{y}=*k*_{y}(*i*, *k*) to be the number such that *Y*_{i} is closer to exactly *k*_{y}−1 points of the marginal sample (*Y*_{1}, …, *Y*_{N}) than it is to *Y*_{i(k)}. Then *ω*_{x}(*i*, *k*_{x}) is the probability measure of the *k*_{x}th nearest neighbour ball of *X*_{i} in ℝ^{m}, and *ω*_{y}(*i*, *k*_{y}) is the probability measure of the *k*_{y}th nearest neighbour ball of *Y*_{i} in ℝ^{n}. Because the nearest neighbours are computed with respect to the *ℓ*^{∞} norm, we have that *v*_{z}(*i*, *k*) ≈ *v*_{x}(*i*, *k*_{x}) *v*_{y}(*i*, *k*_{y}), and hence

$$\hat{I}(X,Y)\approx\frac{1}{N}\sum_{i=1}^{N}\log\frac{\omega_{z}(i,k)}{\omega_{x}(i,k_{x})\,\omega_{y}(i,k_{y})}.\tag{2.10}$$

Thus, by (2.4), for each *k*=1, 2, … we obtain the following sample mean estimator for mutual information due to Kraskov *et al*. (2004), based on ratios of probability measures:

$$\hat{I}_{1}(X,Y)=\psi(N)+\psi(k)-\frac{1}{N}\sum_{i=1}^{N}\left[\psi(k_{x}(i,k))+\psi(k_{y}(i,k))\right].\tag{2.11}$$

As we shall see, under certain conditions the numbers *k*_{x}(*i*, *k*) and *k*_{y}(*i*, *k*) will rapidly increase as the number of data points increases, which can have a significant impact on the expected computation time required for this estimator.
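A naive *O*(*N*²) sketch of an estimator of this Kraskov-style form for one-dimensional marginals, written under the convention *k*_{x}=*n*_{x}+1, where *n*_{x} counts the points strictly closer to *X*_{i} than its *k*th joint neighbour; this is an illustration for clarity rather than an efficient implementation:

```python
import numpy as np
from scipy.special import digamma

def ksg_mi(x, y, k=3):
    """Kraskov-style MI estimate (naive O(N^2) distance matrices, l_inf norm)."""
    n = len(x)
    dx = np.abs(x[:, None] - x[None, :])
    dy = np.abs(y[:, None] - y[None, :])
    dz = np.maximum(dx, dy)                    # l_inf distance in the joint space
    r = np.partition(dz, k, axis=1)[:, k]      # k-th joint NN distance (self is 0)
    n_x = (dx < r[:, None]).sum(axis=1) - 1    # strictly closer in x, excluding self
    n_y = (dy < r[:, None]).sum(axis=1) - 1
    return digamma(k) + digamma(n) - np.mean(digamma(n_x + 1) + digamma(n_y + 1))

rng = np.random.default_rng(0)
x = rng.random(1000)
est_indep = ksg_mi(x, rng.random(1000), k=3)                     # true MI is 0
est_dep = ksg_mi(x, x + 0.05 * rng.standard_normal(1000), k=3)   # strongly dependent
```

On independent data the estimate is close to zero; on the strongly dependent pair it is large and positive.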

Instead of using ratios of probability measures, we can estimate *I*(*X*, *Y*) using volume ratios. In this case, rather than using balls of the same radius but different probability measure, we use balls of the same probability measure but different radius. Thus, we return to (2.9), setting *k*_{x}=*k*_{y}=*k*_{z}=*k*. By (2.4), each of log *ω*_{x}(*i*, *k*), log *ω*_{y}(*i*, *k*) and log *ω*_{z}(*i*, *k*) has an expected value *ψ*(*k*)−*ψ*(*N*), so the probability measure term of (2.9) contributes *ψ*(*N*)−*ψ*(*k*) in expectation.

Thus, for each *k*=1, 2, …, we propose the following sample mean estimator for mutual information, based on volume ratios:

$$\hat{I}_{2}(X,Y)=\psi(N)-\psi(k)+\frac{1}{N}\sum_{i=1}^{N}\log\frac{v_{x}(i,k)\,v_{y}(i,k)}{v_{z}(i,k)}.\tag{2.12}$$

Estimator *Î*_{2} ensures that the number of nearest neighbours to be computed remains fixed, and therefore avoids the possible computational overheads associated with *Î*_{1}.
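The volume-ratio estimator can be sketched in the same naive style (one-dimensional marginals, *ℓ*^{∞} balls, so marginal volumes are 2*r* and the joint volume is (2*r*)²); again an illustration rather than an efficient implementation:

```python
import numpy as np
from scipy.special import digamma

def volume_mi(x, y, k=5):
    """MI estimate from nearest-neighbour volume ratios (naive O(N^2) distances)."""
    n = len(x)
    dx = np.abs(x[:, None] - x[None, :])
    dy = np.abs(y[:, None] - y[None, :])
    dz = np.maximum(dx, dy)
    rx = np.partition(dx, k, axis=1)[:, k]   # k-th NN radius in each space
    ry = np.partition(dy, k, axis=1)[:, k]
    rz = np.partition(dz, k, axis=1)[:, k]
    # log[v_x * v_y / v_z] with v = 2r in 1-D and v = (2r)^2 in the joint space
    log_ratio = np.log(2 * rx) + np.log(2 * ry) - 2 * np.log(2 * rz)
    return digamma(n) - digamma(k) + np.mean(log_ratio)

rng = np.random.default_rng(0)
est = volume_mi(rng.random(2000), rng.random(2000), k=5)  # independent: true MI is 0
```

Only a fixed number *k* of neighbours per point is needed in each space, which is the source of the computational saving; the price, as the experiments below show, is a larger estimation error.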

## 3. Estimating probability measures can be expensive

For a set of *N* points, the naive approach to finding the *k* nearest neighbours of every point in the set is to first compute the distance between every pair of points, which has a computation time of asymptotic order *O*(*N*^{2}) as *N*→∞. However, for one-dimensional data we can simply sort the points, after which the *k* nearest neighbours of every point lie among its *k* immediate predecessors and successors in the sorted order, and can be found using only *O*(*k*) further comparisons per point. The computation time of this method is thus dominated by the sort algorithm, the most efficient of which (e.g. *quicksort*) scales as *O*(*N* log *N*) on average as *N*→∞.
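The sort-and-scan idea can be sketched as follows; the window-growing loop is one possible realization (our own, not taken from the paper) that finds each point's *k* nearest neighbours among its sorted-order neighbours:

```python
import numpy as np

def knn_1d(points, k):
    """k nearest neighbours of every 1-D point via sorting; requires k <= len(points) - 1."""
    order = np.argsort(points)
    s = points[order]
    n = len(s)
    neighbours = np.empty((n, k), dtype=int)
    for i in range(n):
        lo = hi = i                       # window [lo, hi] around point i in sorted order
        for j in range(k):                # grow the window one neighbour at a time
            take_left = hi == n - 1 or (lo > 0 and s[i] - s[lo - 1] <= s[hi + 1] - s[i])
            if take_left:
                lo -= 1
                neighbours[i, j] = order[lo]
            else:
                hi += 1
                neighbours[i, j] = order[hi]
    result = np.empty((n, k), dtype=int)  # row of result matches original point order
    result[order] = neighbours
    return result

pts = np.array([0.0, 1.0, 1.1, 3.0])
nn1 = knn_1d(pts, 1)
```

For these four points, the nearest neighbours are the points at indices 1, 2, 1 and 2, respectively.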

For multi-dimensional data, more sophisticated methods are required. For example, methods based on *k*-d trees (Bentley 1975) first recursively partition the space until each cell contains a small number of sample points, a procedure that has time complexity of order *O*(*N* log *N*) as *N*→∞. The cell containing an arbitrary point can then be found in time of order *O*(log *N*), and, having done this, the first *k* nearest neighbours of the point can be located in time of order *O*(*k*). To find the *k* nearest neighbours of every point therefore requires a computation time of order *O*(*Nk*+*N* log *N*) as *N*→∞. If *k* is constant with respect to *N*, this is simply of order *O*(*N* log *N*) as *N*→∞, but if *k* increases with *N*, say *k*=*N*^{α} for some *α*>0, the computational cost will be of order *O*(*N*^{1+α}) as *N*→∞.
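In practice, tree-based neighbour queries are available off the shelf; for instance, `scipy.spatial.cKDTree` supports *ℓ*^{∞} queries directly (this usage sketch is ours, not the paper's):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(1)
pts = rng.random((1000, 3))
tree = cKDTree(pts)                  # build the tree once: O(N log N)
# query k+1 neighbours because each point's nearest neighbour is itself
dist, idx = tree.query(pts, k=4, p=np.inf)
neighbours = idx[:, 1:]              # the 3 nearest neighbours of every point
```

The returned distances are sorted in increasing order, with each point itself appearing first at distance zero.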

In theorem 3.1, we show that if *X* and *Y* are independent and satisfy positive density conditions on their distributions, the expected value of *k*_{x}, and hence the expected number of near neighbours that must be found for each point in the marginal sample *X*_{1}, …, *X*_{N}, grows at least as quickly as *N*^{α} for some *α*>0 as *N*→∞. Hence, if either *m*>1 or *n*>1, the time required to compute estimator *Î*_{1} will be of asymptotic order Ω(*N*^{1+α}) as *N*→∞. In the following, we use the asymptotic notation *f*(*N*)=Ω(*g*(*N*)) as *N*→∞ to denote that *g*(*N*) is an asymptotic lower bound for *f*(*N*), in the sense that a constant *C*>0 and a number *N*_{0} exist, such that *f*(*N*) ≥ *Cg*(*N*) for all *N*>*N*_{0}.

**Theorem 3.1.** *If X* ∈ ℝ^{m} *and Y* ∈ ℝ^{n} *are independent random vectors whose distributions satisfy positive density conditions, then*

$$\mathrm{E}[k_{x}(i,k)]=\Omega\!\left(N^{n/(m+n)}\right)\quad\text{as }N\to\infty.\tag{3.1}$$

*Proof*. Let *z*=(*x*, *y*) ∈ ℝ^{m+n} be a fixed point, and consider all samples (*Z*_{1}, …, *Z*_{N}) for which *Z*_{i}=*z*. Let *i*(*k*) be the index of the *k*th nearest neighbour of *Z*_{i} among the sample points *Z*_{1}, …, *Z*_{N}, *r*_{z} be the distance from *Z*_{i}=*z* to its *k*th nearest neighbour in ℝ^{m+n} and *ω*_{z}(*r*_{z}) be the probability measure of the *k*th nearest neighbour ball in ℝ^{m+n}. Similarly, let *r*_{x} be the distance from *X*_{i}=*x* to the point *X*_{i(k)} in ℝ^{m}, *k*_{x} be the number of points *Z*_{j}=(*X*_{j}, *Y*_{j}) for which ‖*X*_{j}−*x*‖ ≤ *r*_{x} and *ω*_{x}(*r*_{x}) be the probability measure of the ball *B*_{x}(*r*_{x}) in ℝ^{m}. Then, *X*_{i(k)} is the *k*_{x}th nearest neighbour of *X*_{i}=*x* among the points *X*_{1}, …, *X*_{N} and *B*_{x}(*r*_{x}) is the *k*_{x}th nearest neighbour ball of *X*_{i}.

The expected value depends on the probability measure *ω*_{x} of the ball , computed with respect to the marginal density *ϕ*_{x}. In addition to the fixed point *Z*_{i}=*z*, the ball contains exactly *k* points (the *k* nearest neighbours of *Z*_{i}), while the remaining *N*−*k*−1 points are independently and identically distributed in its complement . Thus, because the points *Z*_{j} are independent and identically distributed, the expected number of points *k*_{x} lying in the region satisfies

By (2.4) we have , which means that and hence(3.2)

Let be the distribution of the *k*th nearest neighbour distance *r*_{z}, so that(3.3)where is the expected value of *ω*_{x} taken over all samples for which *Z*_{i}=*z* and . To compute a lower bound on , letand consider(3.4)

Given that a point *z*′=(*x*′, *y*′) lies in the ball *B*_{z}(*r*), the conditional probability *G*(*s*) is equal to the conditional probability that *z*′ lies in the region *B*_{x}(*s*)×*B*_{y}(*r*), given that *z*′ ∈ *B*_{z}(*r*). Furthermore, because *X* and *Y* are independent, and the proximity relations are determined with respect to the *ℓ*^{∞} norm, the probability measure of *B*_{x}(*s*)×*B*_{y}(*r*) is equal to *ω*_{x}(*s*)*ω*_{y}(*r*), and the probability measure of *B*_{z}(*r*) is equal to *ω*_{x}(*r*)*ω*_{y}(*r*). Hence, for 0≤*s*≤*r*, we have

$$G(s)=\frac{\omega_{x}(s)\,\omega_{y}(r)}{\omega_{x}(r)\,\omega_{y}(r)}=\frac{\omega_{x}(s)}{\omega_{x}(r)},\tag{3.5}$$

so, by (3.4), we obtain (3.6).

Because is an increasing function of *r*, this means that(3.7)so, by (3.3),(3.8)

The distribution function *F*_{z}(*r*) is the probability that at least *k* points *Z*_{j} (*j*≠*i*) lie inside the ball *B*_{z}(*r*). Using elementary probabilistic and combinatorial arguments (Evans *et al*. 2002), it can be shown that

$$F_{z}(r)=c_{k}\int_{0}^{\omega_{z}(r)}t^{k-1}(1-t)^{N-k-1}\,\mathrm{d}t,\tag{3.9}$$

where

$$c_{k}=\frac{\Gamma(N)}{\Gamma(k)\,\Gamma(N-k)}$$

and *Γ* denotes the Euler gamma function. Hence, by (3.8), we obtain (3.10).

To evaluate the integral, we note that for any fixed *δ*>0, the error incurred by neglecting all *k*th nearest neighbour balls *B*_{z}(*r*) of radius *r*>*δ* (or equivalently, of probability measure *ω*_{z}(*r*)>*ω*_{z}(*δ*)) becomes exponentially small as *N*→∞. Loosely speaking, this is because the probability that a region of fixed positive measure contains at most some fixed number of points must rapidly approach zero as the number of points increases without bound. To make this precise, let(3.11)

Because *ω*_{z}(*r*) is increasing with respect to *r*, we have for all . Hence, because and , and since , it follows that:

Let so that . Because *ω*(*δ*) is fixed, some constant *c*>0 exists, such that and hence . This means that and therefore

Thus, it follows that:(3.12)

The distributions of *X* and *Y* are assumed to satisfy positive density conditions, so by (2.2) constants *β*_{x}, *β*_{y}>1 and *δ*_{x}, *δ*_{y}>0 exist, such that

$$\frac{1}{\beta_{x}}v_{x}(r)\le\omega_{x}(r)\le\beta_{x}v_{x}(r)\quad(0<r\le\delta_{x}),\qquad\frac{1}{\beta_{y}}v_{y}(r)\le\omega_{y}(r)\le\beta_{y}v_{y}(r)\quad(0<r\le\delta_{y}),$$

and hence (3.13) follows.

From these relations, and using the fact that , we obtain

Hence, by (3.12),(3.14)where(3.15)

Changing the variable of integration and using a similar argument to that leading up to (3.12), we obtain

We recognize this as the beta integral

$$B(a,b)=\int_{0}^{1}t^{a-1}(1-t)^{b-1}\,\mathrm{d}t,$$

with parameters *a*=*k*+*m*/(*m*+*n*) and *b*=*N*−*k*. Thus, by (3.10), we obtain

Using the asymptotic expansion *Γ*(*N*+*α*)/*Γ*(*N*) ∼ *N*^{α} as *N*→∞ (Artin 1964), it follows that the bound is of order *N*^{−*m*/(*m*+*n*)} up to a finite constant *c*_{3}(*k*) (independent of *N*).

Finally, because this bound is independent of the fixed point *Z*_{i}=*z*, it follows by (3.2) that E[*k*_{x}(*i*, *k*)]=Ω(*N*^{*n*/(*m*+*n*)}) as *N*→∞, which completes the proof. ▪

**Corollary 3.2.** *Under the conditions of theorem 3.1, it follows by symmetry that*

$$\mathrm{E}[k_{y}(i,k)]=\Omega\!\left(N^{m/(m+n)}\right)\quad\text{as }N\to\infty.\tag{3.16}$$

## 4. Experimental results

For multi-variate Gaussian distributions, exact expressions for entropy and mutual information are known, which allow us to evaluate the estimators *Î*_{1} and *Î*_{2} described above. If *Z* has an (*m*+1)-dimensional Gaussian distribution with covariance matrix *Σ*_{z}, it is known (Cover & Thomas 1991) that

$$H(Z)=\tfrac{1}{2}\log\left((2\pi e)^{m+1}\det\Sigma_{z}\right).\tag{4.1}$$

Let *X* ∈ ℝ^{m} and *Y* ∈ ℝ. Then their mutual information satisfies

$$I(X,Y)=\tfrac{1}{2}\log\frac{\det\Sigma_{x}\,\det\Sigma_{y}}{\det\Sigma_{z}},\tag{4.2}$$

where *Σ*_{x} is the covariance matrix of *X* (and *Σ*_{y} that of *Y*).
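The Gaussian closed forms are straightforward to evaluate; a minimal sketch (the bivariate example with correlation 0.9 is our own illustration):

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy of a Gaussian with covariance matrix cov, as in (4.1)."""
    d = cov.shape[0]
    return 0.5 * np.log((2 * np.pi * np.e) ** d * np.linalg.det(cov))

def gaussian_mi(cov, m):
    """I(X, Y) for jointly Gaussian Z = (X, Y), X the first m coordinates, as in (4.2)."""
    sx, sy = cov[:m, :m], cov[m:, m:]
    return 0.5 * np.log(np.linalg.det(sx) * np.linalg.det(sy) / np.linalg.det(cov))

# bivariate example with correlation 0.9: I = -0.5 * log(1 - 0.9**2)
cov = np.array([[1.0, 0.9],
                [0.9, 1.0]])
mi = gaussian_mi(cov, 1)
```

Such exact values provide the ground truth against which the nearest neighbour estimates can be compared.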

We consider three cases: *independent* (A); *weakly dependent* (B); and *strongly dependent* (C) random vectors, each distributed according to the (*m*+1)-dimensional Gaussian distribution with mean zero, unit variance and respective covariance matrices *Σ*_{A}, *Σ*_{B} and *Σ*_{C} defined by(4.3)

### (a) The estimators *Î*_{1} and *Î*_{2}

For *m*=1, 2, 3, 4 and for each of the covariance matrices *Σ*_{A}, *Σ*_{B} and *Σ*_{C}, we independently select *N*=1000 points from ℝ^{m+1}, distributed according to the (*m*+1)-dimensional Gaussian distribution having zero mean, unit variance and the given covariance matrix. In each case, we record the values computed by the estimators *Î*_{1} and *Î*_{2} for every *k*=1, …, 20, repeating the process over 25 independent realizations of the sample. The results are shown in figure 1*a*–*l*.

In each plot, we show the mean estimate sequence computed by *Î*_{1} and *Î*_{2} for 1≤*k*≤20, taken over all 25 sample realizations. The error bars indicate the standard error of the mean taken over these repetitions. In addition, the solid line represents the true value as determined by (4.2), and the dotted line denotes the mean empirical value of *I*(*X*, *Y*) that is also computed according to (4.2), but using the empirical covariance matrix of each sample realization.

As expected, the figures show that both the bias and the variance (and hence the mean squared error) of the proposed estimator *Î*_{2} are greater than those of the Kraskov *et al*. estimator *Î*_{1}.

### (b) Scaling exponents

Table 1 shows the average values 〈*k*_{x}〉 and 〈*k*_{y}〉 for *k*_{z}=10 and 20, computed over 25 instances of the joint sample. As one might expect, 〈*k*_{x}〉 and 〈*k*_{y}〉 decrease as the random variables *X* and *Y* become increasingly dependent. It is also apparent that 〈*k*_{x}〉 decreases as *m* increases, but that this is balanced by a corresponding increase in 〈*k*_{y}〉.

Theorem 3.1 states that E[*k*_{x}]=Ω(*N*^{α}) for some *α*>0. To investigate the scaling exponent, we record the average value 〈*k*_{x}〉 for *N*=100, 200, … up to 2500, and then use least-squares regression to estimate the gradient of the (log *N*, log 〈*k*_{x}〉) curve. The empirical scaling exponents, shown in table 2, are close to the theoretical values *α*_{x}=*n*/(*m*+*n*) for *k*_{x} and *α*_{y}=*m*/(*m*+*n*) for *k*_{y} derived in theorem 3.1 and corollary 3.2, respectively.
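The regression step can be sketched as follows; the synthetic 〈*k*_{x}〉 values here are invented to follow an exact power law, purely to illustrate the fit:

```python
import numpy as np

# least-squares gradient of the (log N, log <k_x>) curve
N = np.array([100, 200, 400, 800, 1600, 2500])
k_x = 3.0 * N ** 0.5                  # synthetic averages following N^{1/2}
alpha = np.polyfit(np.log(N), np.log(k_x), 1)[0]
```

For data following an exact power law *k*_{x} ∝ *N*^{1/2}, the fitted gradient recovers the exponent 0.5.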

## 5. Conclusion

We have investigated two non-parametric estimators for mutual information, demonstrated that the method of Kraskov *et al*. (2004) can be computationally expensive under certain circumstances, and proposed a more computationally efficient alternative. Experimental results confirm that our estimator is more computationally efficient than that of Kraskov *et al*. (2004), albeit at the expense of increased estimation error. Other estimators for mutual information have appeared in the literature, for example, that of Darbellay & Vajda (1999), and there is considerable scope for an extensive numerical investigation into the estimation accuracy and computational cost associated with these techniques. There is also scope to extend the theoretical analysis of such estimators, for example, along the lines developed by Grassberger (submitted).

As a closing remark, in Evans (2007) we prove a weak law of large numbers for functions of a point and its nearest neighbours, where the number of nearest neighbours remains fixed relative to the sample size. This result extends previous research to include unbounded functions of a point and its nearest neighbours, and thus asserts that our sample mean estimator is weakly consistent as the number of observations *N*→∞.

## Acknowledgments

The author would like to thank the Royal Society for supporting his research through its University Research Fellowship scheme.

## Footnotes

- Received August 31, 2007.
- Accepted January 14, 2008.

This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

- Copyright © 2008 The Royal Society