## Abstract

This paper demonstrates the application of correlated Gaussian process based approximations to optimization where multiple levels of analysis are available, using an extension to the geostatistical method of *co-kriging*. An exchange algorithm is used to choose which points of the search space to sample within each level of analysis. The derivation of the co-kriging equations is presented in an intuitive manner, along with a new variance estimator to account for varying degrees of computational ‘noise’ in the multiple levels of analysis. A multi-fidelity wing optimization is used to demonstrate the methodology.

## 1. Introduction

Let *f*(**x**) denote an *objective function* that maps a vector **x** of length *k* to some scalar measure of merit. A vector **x**^{*} is sought, such that *f*(**x**^{*}) is the maximum or minimum of *f*(**x**). Such *optimization* problems (of which this is the simplest form) are encountered in most branches of science. Of interest to us in this paper is the class of optimization problems where *f*(**x**) is expensive to obtain or we know its value only at a limited number of **x** values. Thus, efficiency, in terms of the number of **x** values that must be mapped to their objective function values before we get to within a reasonable distance of **x**^{*}, is the key feature of the optimization algorithms considered.

While there is often no getting around the fact that a thorough search of a highly nonlinear, multi-dimensional *f* landscape will require sampling at a large number of sites, a useful shortcut is often employed. Based on a relatively small number of measurements, we can build a statistical approximation of the objective landscape, which, provided *f* is smooth and continuous and the measurements are reasonably uniformly spread, will be accurate enough to guide the search towards promising areas of the landscape.

If, at the same time, this *surrogate* of the expensive function is very cheap to evaluate, we can be as thorough as we like in searching it. Thus, most of the computational effort will be concentrated on what looks likely to be the neighbourhood of the global optimum. An approximation technique that has received much attention in recent years is *kriging*^1—we shall delve into the details of this later on, as our chosen multi-fidelity optimization technique is an extension of kriging. But what exactly do we mean by multi-fidelity optimization?

The objective function evaluation system may feature lower-fidelity, cheaper models (in addition to the main, high-fidelity calculation of *f*). In most cases, the fast (but less trustworthy) and the slow (but more accurate) objective values can be obtained independently. Thus, we can learn more about the objective function by additionally measuring the cheap function(s) on a large number of **x** values. This process is usually referred to as multi-fidelity optimization.

The use of secondary, correlated quantities to improve the accuracy of the model of the primary objective is not a new concept. For example, Hevesi *et al*. (1992) predict average annual precipitation values near a potential nuclear waste disposal site using a sparse set of precipitation measurements from the region along with the correlated and more easily obtainable elevation map of the area. Kennedy & O'Hagan (2000) approach the subject from the perspective of model building using objectives resulting from computational simulations of varying fidelities and costs. This is also the context of the work presented here; additionally, we introduce a method of optimization using correlated models of different costs and fidelities. We focus on combining co-kriging (the multi-response extension to kriging alluded to earlier) with a Bayesian model update criterion designed to balance the exploration of the search space and the quick exploitation of promising basins of attraction on the *f* landscape. The criterion is based on a newly derived error estimate that reflects the presence of noise in the observed data.

After a brief overview of kriging in §2, we discuss co-kriging in §3, followed by considerations on how to create sampling plans for multiple levels of analyses (§4). We put these pieces together in §5, where we describe an optimization strategy designed for multi-fidelity inputs. An aircraft wing design problem is used to demonstrate and compare the co-kriging approach and other multi-level strategies in §6. Section 7 discusses the remaining issues related to the use of correlated surrogates and summarizes our conclusions.

## 2. Kriging

We will consider co-kriging as a natural extension to the popular method of kriging and hence begin with a very brief introduction to the kriging method in order to introduce concepts and notation that follow through to co-kriging. Equations are shown without derivations, which are very similar to those presented later for co-kriging; the reader may wish to consult Jones (2001) or Forrester *et al*. (2006*b*) for more information on kriging.

As with all surrogate-based methods, to approximate a function *f* we start with a set of sample data—usually computed at a set of points in the domain of interest determined by a sampling plan (which will be discussed further in §4). The kriging prediction of *f* is built from a mean base term, $\hat{\mu}$ (the circumflex denotes a maximum likelihood estimate, MLE), plus a stationary Gaussian process, $Z(\mathbf{x})$, representing the local features of *f* around the *n* sample points $\mathbf{x}^{(i)}$, $i = 1, \dots, n$. $Z(\mathbf{x})$ has zero mean and covariance

$$\operatorname{cov}\left[Z(\mathbf{x}^{(i)}), Z(\mathbf{x}^{(l)})\right] = \sigma^{2}\psi^{(i,l)}, \tag{2.1}$$

where the $\psi^{(i)} = \operatorname{corr}\left[Y(\mathbf{x}^{(n+1)}), Y(\mathbf{x}^{(i)})\right]$ are correlations between a random variable $Y(\mathbf{x})$ at the point to be predicted ($\mathbf{x}^{(n+1)}$) and at the sample data points:

$$\psi^{(i)} = \exp\left(-\sum_{j=1}^{k}\theta_{j}\left|x_{j}^{(n+1)} - x_{j}^{(i)}\right|^{p_{j}}\right). \tag{2.2}$$

The *hyper-parameter* $p_{j}$ can be thought of as determining the smoothness of the function approximation. In many situations, we can assume that there will not be any discontinuities and use $p_{j} = 2$ rather than using an MLE. This means that the basis function is infinitely differentiable through a sample point (when $\mathbf{x}^{(n+1)} = \mathbf{x}^{(i)}$) and that the function is of the same form as a Gaussian distribution with variance $1/(2\theta_{j})$. $\theta_{j}$ can therefore be thought of as determining how quickly the function changes as $\mathbf{x}^{(n+1)}$ moves away from $\mathbf{x}^{(i)}$, with high and low $\theta_{j}$ values indicating an active or inactive function along dimension *j*, respectively. The kriging prediction is the value at $\mathbf{x}^{(n+1)}$ that maximizes the likelihood, given the sample data and MLEs of the hyper-parameters, and is given by

$$\hat{y}\left(\mathbf{x}^{(n+1)}\right) = \hat{\mu} + \sum_{i=1}^{n} b_{i}\psi^{(i)}. \tag{2.3}$$

The constants $b_{i}$ are given by the column vector $\mathbf{b} = \boldsymbol{\Psi}^{-1}(\mathbf{y} - \mathbf{1}\hat{\mu})$, where $\boldsymbol{\Psi}$ is an $n \times n$ symmetric matrix of correlations between the sample data; $\mathbf{y}$ is a column vector of responses, $\left(y^{(1)}, y^{(2)}, \dots, y^{(n)}\right)^{\mathrm{T}}$; $\mathbf{1}$ is an $n \times 1$ column vector of ones; and the MLE $\hat{\mu} = \mathbf{1}^{\mathrm{T}}\boldsymbol{\Psi}^{-1}\mathbf{y}\,/\,\mathbf{1}^{\mathrm{T}}\boldsymbol{\Psi}^{-1}\mathbf{1}$.
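To make the notation concrete, the predictor of equations (2.2) and (2.3) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: we fix $p_{j} = 2$ and take the $\theta_{j}$ as given rather than estimating them by MLE, and the function names are ours.

```python
import numpy as np

def corr_matrix(X, theta, p=2.0):
    """Correlation matrix Psi: psi(i,l) = exp(-sum_j theta_j |x_j^(i) - x_j^(l)|^p)."""
    d = np.abs(X[:, None, :] - X[None, :, :]) ** p
    return np.exp(-np.einsum('ijk,k->ij', d, theta))

def corr_vector(X, x, theta, p=2.0):
    """Correlations psi between the n sample points and a new point x."""
    d = np.abs(X - x) ** p
    return np.exp(-d @ theta)

def kriging_predict(X, y, theta, x, p=2.0):
    """Kriging prediction y_hat(x) = mu_hat + psi^T Psi^{-1} (y - 1 mu_hat)."""
    n = len(y)
    Psi = corr_matrix(X, theta, p) + 1e-10 * np.eye(n)   # small jitter for conditioning
    Pinv_y = np.linalg.solve(Psi, y)
    Pinv_1 = np.linalg.solve(Psi, np.ones(n))
    mu = (np.ones(n) @ Pinv_y) / (np.ones(n) @ Pinv_1)   # MLE of the mean
    psi = corr_vector(X, x, theta, p)
    return mu + psi @ np.linalg.solve(Psi, y - mu)
```

In practice, a maximization of the concentrated ln-likelihood over $\boldsymbol{\theta}$ (by a GA, as in the paper) would precede any call to `kriging_predict`.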

## 3. Co-kriging

We now consider how to build an approximation of a function that is expensive to evaluate, enhanced by data from cheaper analyses of the function. Combining multiple sets of data naturally leads to a complex notation and we will try to simplify this by limiting ourselves to two datasets:^2 the most accurate expensive data with values $\mathbf{y}_{\mathrm{e}}$ at points $X_{\mathrm{e}}$ and the less accurate cheap data $\mathbf{y}_{\mathrm{c}}$ at points $X_{\mathrm{c}}$ ($X_{\mathrm{e}} \subset X_{\mathrm{c}}$).^3 These data are concatenated to give the combined set of points

$$X = \begin{pmatrix} X_{\mathrm{c}} \\ X_{\mathrm{e}} \end{pmatrix}.$$

As with kriging, the value at a point in $X$ is treated as if it were the realization of a Gaussian random variable. For co-kriging, we therefore have the random field

$$\mathbf{Y} = \begin{pmatrix} \mathbf{Y}_{\mathrm{c}}(X_{\mathrm{c}}) \\ \mathbf{Y}_{\mathrm{e}}(X_{\mathrm{e}}) \end{pmatrix}.$$

We follow the auto-regressive model of Kennedy & O'Hagan (2000), which assumes that $\operatorname{cov}\left[Y_{\mathrm{e}}(\mathbf{x}^{(i)}), Y_{\mathrm{c}}(\mathbf{x}) \mid Y_{\mathrm{c}}(\mathbf{x}^{(i)})\right] = 0$, $\forall\,\mathbf{x} \neq \mathbf{x}^{(i)}$. This means that no more can be learnt about $Y_{\mathrm{e}}(\mathbf{x}^{(i)})$ from the cheaper code if the value of the expensive function at $\mathbf{x}^{(i)}$ is known—a Markov property (i.e. we assume that the expensive simulation is correct and any inaccuracies lie wholly in the cheaper simulation). Gaussian processes $Z_{\mathrm{c}}(.)$ and $Z_{\mathrm{e}}(.)$ represent the local features of the cheap and expensive codes. Using the auto-regressive model, we are essentially approximating the expensive code as the cheap code multiplied by a scaling factor $\rho$ plus a Gaussian process $Z_{\mathrm{d}}(.)$ that represents the difference between $\rho Z_{\mathrm{c}}(.)$ and $Z_{\mathrm{e}}(.)$:

$$Z_{\mathrm{e}}(\mathbf{x}) = \rho Z_{\mathrm{c}}(\mathbf{x}) + Z_{\mathrm{d}}(\mathbf{x}). \tag{3.1}$$

Whereas in kriging we have a covariance matrix $\sigma^{2}\boldsymbol{\Psi}$, we now have a more complex covariance matrix constructed block by block. The notation $\boldsymbol{\Psi}_{\mathrm{c}}(X_{\mathrm{e}}, X_{\mathrm{c}})$, for example, denotes a matrix of correlations of the form $\psi_{\mathrm{c}}$ between the data $X_{\mathrm{e}}$ and $X_{\mathrm{c}}$. Our complete covariance matrix is thus

$$C = \begin{pmatrix} \sigma_{\mathrm{c}}^{2}\boldsymbol{\Psi}_{\mathrm{c}}(X_{\mathrm{c}}, X_{\mathrm{c}}) & \rho\sigma_{\mathrm{c}}^{2}\boldsymbol{\Psi}_{\mathrm{c}}(X_{\mathrm{c}}, X_{\mathrm{e}}) \\ \rho\sigma_{\mathrm{c}}^{2}\boldsymbol{\Psi}_{\mathrm{c}}(X_{\mathrm{e}}, X_{\mathrm{c}}) & \rho^{2}\sigma_{\mathrm{c}}^{2}\boldsymbol{\Psi}_{\mathrm{c}}(X_{\mathrm{e}}, X_{\mathrm{e}}) + \sigma_{\mathrm{d}}^{2}\boldsymbol{\Psi}_{\mathrm{d}}(X_{\mathrm{e}}, X_{\mathrm{e}}) \end{pmatrix}. \tag{3.2}$$

The correlations are of the same form as equation (2.2), but there are two correlations, $\psi_{\mathrm{c}}$ and $\psi_{\mathrm{d}}$, and we therefore have more hyper-parameters to estimate: $\boldsymbol{\theta}_{\mathrm{c}}$, $\boldsymbol{\theta}_{\mathrm{d}}$, $\mathbf{p}_{\mathrm{c}}$, $\mathbf{p}_{\mathrm{d}}$ and the scaling parameter $\rho$. Our cheap data are considered to be independent of the expensive data and we can find MLEs for $\mu_{\mathrm{c}}$, $\sigma_{\mathrm{c}}^{2}$, $\boldsymbol{\theta}_{\mathrm{c}}$ and $\mathbf{p}_{\mathrm{c}}$ by maximizing the ln-likelihood (ignoring constant terms)

$$-\frac{n_{\mathrm{c}}}{2}\ln(\sigma_{\mathrm{c}}^{2}) - \frac{1}{2}\ln\left|\boldsymbol{\Psi}_{\mathrm{c}}(X_{\mathrm{c}}, X_{\mathrm{c}})\right| - \frac{(\mathbf{y}_{\mathrm{c}} - \mathbf{1}\mu_{\mathrm{c}})^{\mathrm{T}}\boldsymbol{\Psi}_{\mathrm{c}}(X_{\mathrm{c}}, X_{\mathrm{c}})^{-1}(\mathbf{y}_{\mathrm{c}} - \mathbf{1}\mu_{\mathrm{c}})}{2\sigma_{\mathrm{c}}^{2}}. \tag{3.3}$$

By setting the derivatives of equation (3.3) w.r.t. $\mu_{\mathrm{c}}$ and $\sigma_{\mathrm{c}}^{2}$ to 0 and solving, we find MLEs of

$$\hat{\mu}_{\mathrm{c}} = \frac{\mathbf{1}^{\mathrm{T}}\boldsymbol{\Psi}_{\mathrm{c}}(X_{\mathrm{c}}, X_{\mathrm{c}})^{-1}\mathbf{y}_{\mathrm{c}}}{\mathbf{1}^{\mathrm{T}}\boldsymbol{\Psi}_{\mathrm{c}}(X_{\mathrm{c}}, X_{\mathrm{c}})^{-1}\mathbf{1}} \tag{3.4}$$

and

$$\hat{\sigma}_{\mathrm{c}}^{2} = \frac{(\mathbf{y}_{\mathrm{c}} - \mathbf{1}\hat{\mu}_{\mathrm{c}})^{\mathrm{T}}\boldsymbol{\Psi}_{\mathrm{c}}(X_{\mathrm{c}}, X_{\mathrm{c}})^{-1}(\mathbf{y}_{\mathrm{c}} - \mathbf{1}\hat{\mu}_{\mathrm{c}})}{n_{\mathrm{c}}}. \tag{3.5}$$

Substituting equations (3.4) and (3.5) into (3.3) yields the concentrated ln-likelihood

$$-\frac{n_{\mathrm{c}}}{2}\ln(\hat{\sigma}_{\mathrm{c}}^{2}) - \frac{1}{2}\ln\left|\boldsymbol{\Psi}_{\mathrm{c}}(X_{\mathrm{c}}, X_{\mathrm{c}})\right|, \tag{3.6}$$

and $\hat{\boldsymbol{\theta}}_{\mathrm{c}}$ and $\hat{\mathbf{p}}_{\mathrm{c}}$ (if not set at **2**) are found by maximizing this equation.
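The MLE computations of equations (3.4) to (3.6) can be sketched as follows. This is an illustrative Python fragment under the same assumptions as before ($p_{j} = 2$, Gaussian correlation), with names of our own choosing.

```python
import numpy as np

def cheap_mles(Xc, yc, theta_c, p=2.0):
    """MLEs (3.4)-(3.5) and concentrated ln-likelihood (3.6) for the cheap model."""
    n = len(yc)
    d = np.abs(Xc[:, None, :] - Xc[None, :, :]) ** p
    Psi = np.exp(-np.einsum('ijk,k->ij', d, theta_c)) + 1e-10 * np.eye(n)
    one = np.ones(n)
    Pi_y, Pi_1 = np.linalg.solve(Psi, yc), np.linalg.solve(Psi, one)
    mu = (one @ Pi_y) / (one @ Pi_1)                  # equation (3.4)
    r = yc - mu
    sigma2 = (r @ np.linalg.solve(Psi, r)) / n       # equation (3.5)
    sign, logdet = np.linalg.slogdet(Psi)            # stable log-determinant
    cln = -0.5 * n * np.log(sigma2) - 0.5 * logdet   # equation (3.6)
    return mu, sigma2, cln
```

A global optimizer (the paper uses a GA; `scipy.optimize.differential_evolution` is one alternative) would then maximize `cln` over `theta_c`.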

To estimate the mean, variance, hyper-parameters and scaling parameter of the difference model ($\mu_{\mathrm{d}}$, $\sigma_{\mathrm{d}}^{2}$, $\boldsymbol{\theta}_{\mathrm{d}}$, $\mathbf{p}_{\mathrm{d}}$ and $\rho$), we first define

$$\mathbf{d} = \mathbf{y}_{\mathrm{e}} - \rho\,\mathbf{y}_{\mathrm{c}}(X_{\mathrm{e}}), \tag{3.7}$$

where $\mathbf{y}_{\mathrm{c}}(X_{\mathrm{e}})$ are the values of $\mathbf{y}_{\mathrm{c}}$ at locations common to those of $X_{\mathrm{e}}$ (the Markov property implies that we need to consider only this data). If $\mathbf{y}_{\mathrm{c}}$ is not available at $X_{\mathrm{e}}$, we may estimate $\rho$ at little additional cost by using kriging estimates $\hat{\mathbf{y}}_{\mathrm{c}}(X_{\mathrm{e}})$, found from equation (2.3), using the already determined hyper-parameters $\hat{\boldsymbol{\theta}}_{\mathrm{c}}$ and $\hat{\mathbf{p}}_{\mathrm{c}}$. The ln-likelihood of $\mathbf{d}$, given $\mathbf{y}_{\mathrm{c}}$, is now

$$-\frac{n_{\mathrm{e}}}{2}\ln(\sigma_{\mathrm{d}}^{2}) - \frac{1}{2}\ln\left|\boldsymbol{\Psi}_{\mathrm{d}}(X_{\mathrm{e}}, X_{\mathrm{e}})\right| - \frac{(\mathbf{d} - \mathbf{1}\mu_{\mathrm{d}})^{\mathrm{T}}\boldsymbol{\Psi}_{\mathrm{d}}(X_{\mathrm{e}}, X_{\mathrm{e}})^{-1}(\mathbf{d} - \mathbf{1}\mu_{\mathrm{d}})}{2\sigma_{\mathrm{d}}^{2}}, \tag{3.8}$$

yielding MLEs of

$$\hat{\mu}_{\mathrm{d}} = \frac{\mathbf{1}^{\mathrm{T}}\boldsymbol{\Psi}_{\mathrm{d}}(X_{\mathrm{e}}, X_{\mathrm{e}})^{-1}\mathbf{d}}{\mathbf{1}^{\mathrm{T}}\boldsymbol{\Psi}_{\mathrm{d}}(X_{\mathrm{e}}, X_{\mathrm{e}})^{-1}\mathbf{1}}$$

and

$$\hat{\sigma}_{\mathrm{d}}^{2} = \frac{(\mathbf{d} - \mathbf{1}\hat{\mu}_{\mathrm{d}})^{\mathrm{T}}\boldsymbol{\Psi}_{\mathrm{d}}(X_{\mathrm{e}}, X_{\mathrm{e}})^{-1}(\mathbf{d} - \mathbf{1}\hat{\mu}_{\mathrm{d}})}{n_{\mathrm{e}}}, \tag{3.9}$$

with $\hat{\boldsymbol{\theta}}_{\mathrm{d}}$, $\hat{\mathbf{p}}_{\mathrm{d}}$ (again, if not set at **2**) and $\hat{\rho}$ found by maximizing

$$-\frac{n_{\mathrm{e}}}{2}\ln(\hat{\sigma}_{\mathrm{d}}^{2}) - \frac{1}{2}\ln\left|\boldsymbol{\Psi}_{\mathrm{d}}(X_{\mathrm{e}}, X_{\mathrm{e}})\right|. \tag{3.10}$$

Equations (3.6) and (3.10) must be maximized numerically using a suitable global search routine such as a genetic algorithm (GA). Depending upon the cost of evaluating the cheap and expensive functions *f*_{c} and *f*_{e}, for very high dimensional problems, the multiple matrix inversions involved in the likelihood maximization may render the use of the co-kriging model impractical (the size of the matrices depends directly on the quantities of data available, and the number of search steps needed in the MLE process is linked to the number of hyper-parameters being tuned). Typically, a statistical model used as a surrogate will be tuned many fewer times than the number of evaluations of *f*_{e} required by a direct search. The cost of tuning the model can therefore be allowed to exceed the cost of computing *f*_{e} and still provide significant speed-up. For large *k* and *n*, the time required to find MLEs can be reduced by using a constant *θ*_{c,j} and *θ*_{d,j} for all elements of **θ**_{c} and **θ**_{d} to simplify the maximization, though this may affect the accuracy of the approximation.

With the hyper-parameters estimated, the co-kriging prediction of the expensive function is given by

$$\hat{y}_{\mathrm{e}}\left(\mathbf{x}^{(n+1)}\right) = \hat{\mu} + \mathbf{c}^{\mathrm{T}}C^{-1}(\mathbf{y} - \mathbf{1}\hat{\mu}), \tag{3.11}$$

where

$$\mathbf{c} = \begin{pmatrix} \hat{\rho}\hat{\sigma}_{\mathrm{c}}^{2}\boldsymbol{\psi}_{\mathrm{c}}\left(X_{\mathrm{c}}, \mathbf{x}^{(n+1)}\right) \\ \hat{\rho}^{2}\hat{\sigma}_{\mathrm{c}}^{2}\boldsymbol{\psi}_{\mathrm{c}}\left(X_{\mathrm{e}}, \mathbf{x}^{(n+1)}\right) + \hat{\sigma}_{\mathrm{d}}^{2}\boldsymbol{\psi}_{\mathrm{d}}\left(X_{\mathrm{e}}, \mathbf{x}^{(n+1)}\right) \end{pmatrix}$$

and $\hat{\mu} = \mathbf{1}^{\mathrm{T}}C^{-1}\mathbf{y}\,/\,\mathbf{1}^{\mathrm{T}}C^{-1}\mathbf{1}$. The notation $\boldsymbol{\psi}_{\mathrm{c}}(X_{\mathrm{e}}, \mathbf{x}^{(n+1)})$, for example, denotes a column vector of correlations of the form $\psi_{\mathrm{c}}$ between the data $X_{\mathrm{e}}$ and the new point $\mathbf{x}^{(n+1)}$. The derivation of equation (3.11) is given in appendix A. If we make a prediction at one of our expensive points, $\mathbf{x}^{(n+1)} = \mathbf{x}^{(i)} \in X_{\mathrm{e}}$, then $\mathbf{c}$ is the $(n_{\mathrm{c}} + i)$th column of $C$, $\mathbf{c}^{\mathrm{T}}C^{-1}$ is the $(n_{\mathrm{c}} + i)$th unit vector and $\hat{y}_{\mathrm{e}}(\mathbf{x}^{(i)}) = y_{\mathrm{e}}^{(i)}$. We see, therefore, that equation (3.11) is an interpolator of the expensive data. If we make a prediction at one of the cheap points, $\mathbf{x}^{(n+1)} \in X_{\mathrm{c}} \setminus X_{\mathrm{e}}$, $\mathbf{c}$ is not a column of $C$ and the predictor will in some sense regress $\mathbf{y}_{\mathrm{c}}$ unless it coincides with $\mathbf{y}_{\mathrm{e}}$.
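Putting the pieces of equations (3.2), (3.11) and (3.12) together, a co-kriging prediction can be sketched as below. This is our Python illustration, not the authors' code: the hyper-parameters are assumed to have been estimated already, `_psi` builds Gaussian correlation matrices with $p_{j} = 2$, and a small jitter is added for numerical conditioning.

```python
import numpy as np

def _psi(XA, XB, theta, p=2.0):
    """Matrix of correlations of the form (2.2) between point sets XA and XB."""
    d = np.abs(XA[:, None, :] - XB[None, :, :]) ** p
    return np.exp(-np.einsum('ijk,k->ij', d, theta))

def cokriging_predict(Xc, yc, Xe, ye, theta_c, theta_d, rho, sig2c, sig2d, x):
    """Co-kriging prediction (3.11) and error estimate (3.12), with C from (3.2)."""
    nc, ne = len(yc), len(ye)
    C = np.block([
        [sig2c * _psi(Xc, Xc, theta_c),        rho * sig2c * _psi(Xc, Xe, theta_c)],
        [rho * sig2c * _psi(Xe, Xc, theta_c),
         rho**2 * sig2c * _psi(Xe, Xe, theta_c) + sig2d * _psi(Xe, Xe, theta_d)],
    ]) + 1e-10 * np.eye(nc + ne)
    y = np.concatenate([yc, ye])
    one = np.ones(nc + ne)
    mu = (one @ np.linalg.solve(C, y)) / (one @ np.linalg.solve(C, one))
    x = np.atleast_2d(x)
    c = np.concatenate([
        (rho * sig2c * _psi(Xc, x, theta_c)).ravel(),
        (rho**2 * sig2c * _psi(Xe, x, theta_c)
         + sig2d * _psi(Xe, x, theta_d)).ravel(),
    ])
    yhat = mu + c @ np.linalg.solve(C, y - mu)               # equation (3.11)
    s2 = rho**2 * sig2c + sig2d - c @ np.linalg.solve(C, c)  # equation (3.12)
    return yhat, max(s2, 0.0)
```

At an expensive sample point, `c` coincides with a column of `C`, so the sketch reproduces the interpolation property discussed above.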

The estimated mean-squared error in the predictor is given by

$$s^{2}\left(\mathbf{x}^{(n+1)}\right) = \hat{\rho}^{2}\hat{\sigma}_{\mathrm{c}}^{2} + \hat{\sigma}_{\mathrm{d}}^{2} - \mathbf{c}^{\mathrm{T}}C^{-1}\mathbf{c}. \tag{3.12}$$

Again, a derivation is given in appendix A. Since the co-kriging model is an interpolator of $\mathbf{y}_{\mathrm{e}}$, we expect the error to be zero at the expensive sample points. For $\mathbf{x}^{(n+1)} = \mathbf{x}^{(i)} \in X_{\mathrm{e}}$, $\mathbf{c}^{\mathrm{T}}C^{-1}$ is the $(n_{\mathrm{c}} + i)$th unit vector, and hence $s^{2}(\mathbf{x})$ is indeed 0. For $\mathbf{x}^{(n+1)} \in X_{\mathrm{c}} \setminus X_{\mathrm{e}}$, $s^{2}(\mathbf{x}) \neq 0$ unless $\mathbf{y}_{\mathrm{e}} = \mathbf{y}_{\mathrm{c}}(X_{\mathrm{e}})$. The error at these points is determined by the character of $\mathbf{Y}_{\mathrm{d}}$. If the difference between $\rho\mathbf{Y}_{\mathrm{c}}(X_{\mathrm{e}})$ and $\mathbf{Y}_{\mathrm{e}}(X_{\mathrm{e}})$ is simple (characterized by low $\theta_{\mathrm{d},j}$ values), the error will be low, whereas a more complex difference (high $\theta_{\mathrm{d},j}$ values) will lead to high error estimates.

### (a) One-variable demonstration

We will now look at how co-kriging behaves using an example of a simple one-variable function. Imagine that our expensive-to-compute data are calculated by the function $f_{\mathrm{e}}(x) = (6x - 2)^{2}\sin(12x - 4)$, $x \in [0, 1]$, and a cheaper estimate of these data is given by $f_{\mathrm{c}}(x) = A f_{\mathrm{e}}(x) + B(x - 0.5) + C$. We sample the design space extensively using the cheap function at $X_{\mathrm{c}} = \{0, 0.1, 0.2, \dots, 1\}$, but run the expensive function only at four of these points, $X_{\mathrm{e}} = \{0, 0.4, 0.6, 1\}$.
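These test functions are easily reproduced. The following Python snippet (our sketch) generates the two datasets, using the parameter values of figure 1 ($A = 0.5$, $B = 10$, $C = -5$) and assuming the sample locations given above.

```python
import numpy as np

def f_e(x):
    """Expensive one-variable test function."""
    return (6.0 * x - 2.0) ** 2 * np.sin(12.0 * x - 4.0)

def f_c(x, A=0.5, B=10.0, C=-5.0):
    """Cheap estimate: a scaled, linearly shifted version of f_e."""
    return A * f_e(x) + B * (x - 0.5) + C

X_c = np.linspace(0.0, 1.0, 11)           # extensive cheap sample
X_e = np.array([0.0, 0.4, 0.6, 1.0])      # four expensive points (subset of X_c)
y_c, y_e = f_c(X_c), f_e(X_e)
```

Because $f_{\mathrm{c}}$ differs from $f_{\mathrm{e}}$ only by a scaling and a linear shift, the difference process $Z_{\mathrm{d}}(.)$ is deliberately simple.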

Figure 1 shows the functions *f*_{e} and *f*_{c} with *A*=0.5, *B*=10 and *C*=−5. A kriging prediction through *y*_{e} gives a poor approximation to the deliberately deceptive function, but the co-kriging prediction lies very close to *f*_{e}, being better than both the standard kriging model and the cheap data. Despite the considerable differences between *f*_{e} and *f*_{c}, a simple relationship has been found between the expensive and the cheap data and the estimated error reduces almost to 0 at *X*_{c} (figure 2).

We have chosen the relationship between our low- and high-fidelity test functions in order to show how the hyper-parameters of the co-kriging model behave. The hyper-parameters pertaining to the cheap data, $\theta_{\mathrm{c}}$ and $p_{\mathrm{c}}$, are affected only by this data and behave as described in §2. Moving on to the scaling parameter $\rho$: if our cheap model parameter $A$ (the multiplying term linking the cheap and expensive functions) is varied such that $1/A = 1, 2, \dots, 10$, we obtain the values for $\hat{\rho}$ shown in figure 3 and see that $\hat{\rho} \approx 1/A$. Similar trials show that the parameters $B$ and $C$ have no effect on $\rho$; thus we see that $\rho$ is purely a scaling parameter. Note that $\hat{\rho}$ is only an *indicator* of the scaling, since this value is estimated based on the data available. For the data in figure 1, $\hat{\rho} \approx 2$ (close to the true value of 2), but for small samples of $\mathbf{y}_{\mathrm{e}}$ the MLE may be misleading. The slight deviations of the data in figure 3 from $\hat{\rho} = 1/A$ (shown as a dashed line), the most significant of which is at $1/A = 6$, are where our GA search has not found the true MLE.

Recall that $\mathbf{d} = \mathbf{y}_{\mathrm{e}} - \rho\,\mathbf{y}_{\mathrm{c}}(X_{\mathrm{e}})$ (equation (3.7)) and hence, with $\rho$ estimated accurately, $\mathbf{d}$ represents the difference in trends between the cheap and the expensive data. Thus, for our one-variable example, $\mathbf{d} = \mathbf{0}$ for all values of $A$ when $B, C = 0$, if $\rho$ is estimated accurately. Figure 4 shows how $\hat{\sigma}_{\mathrm{d}}^{2}$ varies with $B$, and we see that as $B \to 0$, and therefore $\mathbf{d} \to \mathbf{0}$, $\hat{\sigma}_{\mathrm{d}}^{2}$ also approaches 0. Note that $\hat{\theta}_{\mathrm{d}}$ and $\hat{p}_{\mathrm{d}}$ will not be affected, since the correlation in equation (2.2) is unaffected by the scaling of the objective data (it is, however, affected by the scaling of $X$).

Our choice of cheap function for the above example is somewhat contrived, but this has allowed us to show that the co-kriging model and its parameters are behaving as we would expect. For our test function, the correction process $Z_{\mathrm{d}}(.)$ is linear. Co-kriging will work effectively for more complex correction processes, with the proviso that $Z_{\mathrm{d}}(.)$ must be simpler to model than $Z_{\mathrm{e}}(.)$. In §6 we demonstrate the benefits of co-kriging using an engineering design problem.

## 4. Sampling plans

In §3 we have shown how to build the co-kriging model based on a set of sample data. We now consider how to choose the points $X_{\mathrm{c}}$ and $X_{\mathrm{e}}$ to give us the best prediction $\hat{y}_{\mathrm{e}}$.

The results of experiments, whether physical or computational, are corrupted by experimental error, that is, they deviate from the ‘true’ result. This can be due to human error, systematic deviations (due to a flaw or an inevitable physical or numerical limitation of the experiment) and, *only in physical experiments*, random error (usually linked to the limited accuracy of the instruments). When deciding on the sampling plan, that is, when choosing **X**, there is nothing we can do about the first two error components and, while physical sampling plans often include replicates to reduce the random error, computer experiment plans are simply designed to cover the parameter space reasonably uniformly. The goal is, of course, to enable the model fitted to these points to give an accurate global approximation of the unknown objective function landscape.

The definition of ‘uniform coverage’ is by no means obvious and a substantial body of literature exists on the subject—here we work with the optimality criterion of Morris & Mitchell (1995). According to this, the plan with the best space-filling properties is that which maximizes the smallest distance between any pair of points within the sample (several additional ‘tie-breaker’ criteria are given in case of multiple optima). Additionally, we restrict our search to a class of plans known as Latin Hypercubes (LH; McKay *et al*. 1979), which are built to ensure uniformly spread projections of all points on all axes (figure 5*a* shows a random LH). We thus generate an initial Morris–Mitchell optimal LH plan *X*_{c}.

A more interesting question is how do we select an $n_{\mathrm{e}}$-element subset $X_{\mathrm{e}}$ of $X_{\mathrm{c}}$, where the expensive simulations are to be run? Again, we wish to cover the parameter space evenly and hence turn to the Morris–Mitchell criterion, but this time we are dealing with a limited, discrete parameter space and thus the problem becomes a combinatorial one. Since selecting the subset that satisfies this criterion is an *NP-complete* problem and an exhaustive search would have to examine $\binom{n_{\mathrm{c}}}{n_{\mathrm{e}}}$ subsets (clearly infeasible for all but very moderate cardinalities), here we use an exchange algorithm to select $X_{\mathrm{e}}$ (Cook & Nachtsheim 1980).

We start from a randomly selected subset $X_{\mathrm{e}}$ and calculate the Morris–Mitchell criterion. We then exchange the first point with each of the remaining points in $X_{\mathrm{c}} \setminus X_{\mathrm{e}}$ and retain the exchange that gives the best Morris–Mitchell criterion. This process is repeated for each remaining point of $X_{\mathrm{e}}$. A number of restarts from different initial subsets can be employed to avoid local optima. Figure 5*b* shows a Morris–Mitchell optimal LH with a subset chosen using this exchange algorithm.
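A minimal version of this exchange procedure can be sketched as follows (our Python illustration). We use only the primary Morris–Mitchell criterion, the smallest pairwise distance, and omit the tie-breaker criteria and restarts.

```python
import numpy as np
from itertools import combinations

def min_dist(X):
    """Smallest pairwise Euclidean distance: the primary Morris-Mitchell criterion."""
    return min(np.linalg.norm(a - b) for a, b in combinations(X, 2))

def exchange_subset(Xc, ne, seed=0):
    """Select a space-filling ne-point subset of Xc by pairwise exchanges."""
    rng = np.random.default_rng(seed)
    nc = len(Xc)
    idx = list(rng.choice(nc, ne, replace=False))     # random initial subset
    rest = [i for i in range(nc) if i not in idx]
    for pos in range(ne):                             # sweep over subset positions
        best = min_dist(Xc[idx])
        for j in range(len(rest)):
            trial = idx.copy()
            trial[pos] = rest[j]
            d = min_dist(Xc[trial])
            if d > best:                              # keep the improving exchange
                best = d
                rest[j], idx[pos] = idx[pos], rest[j]
    return sorted(idx)
```

Repeating the sweep from several random starts, as the text suggests, reduces the chance of stopping at a local optimum of this combinatorial search.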

A rule of thumb for the number of points that should be used in the sampling plan is *n*=10*k*. When using a particularly cheap analysis, *n*_{c} may be rather greater than this, allowing us to build a more accurate model, and if the relationship between *f*_{c} and *f*_{e} is simple, *n*_{e} may be somewhat fewer—the advantage of the co-kriging method. We will return to this subject in §7.

## 5. Co-kriging based optimization

In order to confirm and enhance the predictions of a surrogate model, it is usual to update the model with new evaluations in promising areas. Co-kriging (and kriging) treats the value of the function at $\mathbf{x}$ as if it were the realization of a Gaussian random variable $Y(\mathbf{x})$, with a probability density function

$$p(Y_{\mathrm{e}}(\mathbf{x})) = \frac{1}{\sqrt{2\pi}\,s(\mathbf{x})}\exp\left[-\frac{\left(Y_{\mathrm{e}}(\mathbf{x}) - \hat{y}_{\mathrm{e}}(\mathbf{x})\right)^{2}}{2s^{2}(\mathbf{x})}\right],$$

with mean given by the predictor $\hat{y}_{\mathrm{e}}(\mathbf{x})$ (equation (3.11)) and variance $s^{2}(\mathbf{x})$ (equation (3.12)). This allows us to model our uncertainty about the predictions we make. The most plausible value at $\mathbf{x}$ is $\hat{y}_{\mathrm{e}}(\mathbf{x})$, with the probability decreasing as $Y_{\mathrm{e}}(\mathbf{x})$ moves away from $\hat{y}_{\mathrm{e}}(\mathbf{x})$. Since there is uncertainty in the value of $Y_{\mathrm{e}}(\mathbf{x})$, we can calculate the expectation of it being an improvement, $I = y_{\min} - Y_{\mathrm{e}}(\mathbf{x})$, on the best value calculated so far, $y_{\min}$, as

$$E[I(\mathbf{x})] = \left(y_{\min} - \hat{y}_{\mathrm{e}}(\mathbf{x})\right)\varPhi\!\left(\frac{y_{\min} - \hat{y}_{\mathrm{e}}(\mathbf{x})}{s(\mathbf{x})}\right) + s(\mathbf{x})\,\phi\!\left(\frac{y_{\min} - \hat{y}_{\mathrm{e}}(\mathbf{x})}{s(\mathbf{x})}\right),$$

where $\varPhi(.)$ and $\phi(.)$ are the normal cumulative distribution function and probability density function, respectively. By maximizing $E[I(\mathbf{x})]$ we can find the best new point at which to sample the design space. Note that $E[I(\mathbf{x})] = 0$ when $s(\mathbf{x}) = 0$, so there is no expectation of improvement at a point that has already been sampled and therefore no possibility of re-sampling. This is a necessary characteristic of an updating criterion when using deterministic computer experiments, and it guarantees global convergence: given that there is no possibility of re-sampling, as the number of updates based on the maximum $E[I(\mathbf{x})]$ increases, the design space will become densely populated and so the global optimum will be found (Locatelli 1997).
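For a single point, the expectation above is straightforward to evaluate. Below is a sketch in Python using `scipy.stats.norm`; in practice $E[I(\mathbf{x})]$ would itself be maximized over the design space by a global search such as a GA.

```python
from scipy.stats import norm

def expected_improvement(y_min, y_hat, s):
    """E[I(x)] for minimization, given the prediction y_hat and RMSE s at x."""
    if s <= 0.0:
        return 0.0                  # no uncertainty, so no expected improvement
    u = (y_min - y_hat) / s
    return (y_min - y_hat) * norm.cdf(u) + s * norm.pdf(u)
```

The first term rewards predicted improvement (exploitation); the second rewards uncertainty (exploration), which is how the criterion balances the two.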

### (a) Co-kriging regression

When using multi-fidelity analyses, we can modify the interpolating co-kriging formulation in §3 such that each analysis can be regressed appropriately to filter any noise present in the data. Different levels of filtering may be required for each analysis. For example, data from an empirical low-fidelity computer code may be smooth and require no regression, but could be coupled to a discretized high-fidelity analysis that displays noise that must be filtered using regression. Conversely, a low-fidelity model may be discretized on a much coarser mesh, requiring a higher degree of regression than the high-fidelity code. In our regressing co-kriging formulation, we therefore employ two regression constants, $\lambda_{\mathrm{c}}$ and $\lambda_{\mathrm{e}}$. These are added to the leading diagonals of the correlation matrices to give the covariance matrix $C + \boldsymbol{\lambda}$, with

$$\boldsymbol{\lambda} = \begin{pmatrix} \lambda_{\mathrm{c}}\mathbf{I} & \mathbf{0} \\ \mathbf{0} & \lambda_{\mathrm{e}}\mathbf{I} \end{pmatrix}, \tag{5.1}$$

where $\mathbf{I}$ is an identity matrix and $\mathbf{0}$ a zero matrix. The values of $\lambda_{\mathrm{c}}$ and $\lambda_{\mathrm{e}}$ are found along with the other hyper-parameters by maximizing equations (3.3) and (3.8). The regressing predictor is now

$$\hat{y}_{\mathrm{e,r}}\left(\mathbf{x}^{(n+1)}\right) = \hat{\mu} + \mathbf{c}^{\mathrm{T}}(C + \boldsymbol{\lambda})^{-1}(\mathbf{y} - \mathbf{1}\hat{\mu}) \tag{5.2}$$

(subscript r denotes regression), with an estimated mean-squared error

$$s_{\mathrm{r}}^{2}\left(\mathbf{x}^{(n+1)}\right) = \hat{\rho}^{2}\hat{\sigma}_{\mathrm{c}}^{2} + \hat{\sigma}_{\mathrm{d}}^{2} - \mathbf{c}^{\mathrm{T}}(C + \boldsymbol{\lambda})^{-1}\mathbf{c}. \tag{5.3}$$

Unlike equation (3.12), the regression error does not fall to 0 at sample points, and therefore Locatelli's proof of convergence is invalid, because $\max\{E[I(\mathbf{x})]\}$ can occur at a previously sampled point. This is satisfactory when $f_{\mathrm{c}}$ and/or $f_{\mathrm{e}}$ are evaluated through physical experiments, since repeated experiments may be required to reduce measurement error. If $f_{\mathrm{c}}$ and $f_{\mathrm{e}}$ are found using deterministic computer experiments, re-sampling should be avoided, and the convergence properties of $\max\{E[I(\mathbf{x})]\}$ (or other error-based sampling strategies) can be restored by modifying the estimated error to eliminate the error due to the noise filtering, since this is the error associated with the data rather than with the quality of the model. This is achieved by fitting a kriging interpolation through the regressed data: *re-interpolation*.
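The effect of a regression constant is most easily seen in the single-fidelity setting. The following one-$\lambda$ Python sketch (ours, not the authors') shows that with $\lambda = 0$ the model interpolates the data, while $\lambda > 0$ smooths through the noise, so the prediction no longer passes through the sample values.

```python
import numpy as np

def kriging_fit_predict(X, y, theta, lam, xs):
    """1-D kriging with a regression constant lam added to the diagonal.

    With lam=0 the model interpolates; lam>0 filters noise and regresses.
    """
    n = len(y)
    Psi = np.exp(-theta * np.abs(X[:, None] - X[None, :]) ** 2)
    A = Psi + lam * np.eye(n)                       # Psi + lambda*I, as in (5.1)
    one = np.ones(n)
    mu = (one @ np.linalg.solve(A, y)) / (one @ np.linalg.solve(A, one))
    w = np.linalg.solve(A, y - mu)
    psi = np.exp(-theta * np.abs(xs[:, None] - X[None, :]) ** 2)
    return mu + psi @ w
```

The same mechanism, applied blockwise with $\lambda_{\mathrm{c}}$ and $\lambda_{\mathrm{e}}$, gives the regressing co-kriging predictor of equation (5.2).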

### (b) Co-kriging re-interpolation

We now show the derivation of a co-kriging error estimate that reflects the uncertainty in predicting the underlying trend of the data, rather than the overall uncertainty, which includes any error in noisy data. The derivation, although naturally complicated by multiple sets of data and regression constants, follows the same theme as for kriging re-interpolation (Forrester *et al*. 2006*b*). As mentioned above, to eliminate the error in the noisy data, we need to calculate the error of an interpolation through a regressed set of data. The column vectors of regressed cheap and expensive data are found from equations (2.3) (with the addition of the regression constant) and (5.2), and can be expressed as

$$\hat{\mathbf{y}}_{\mathrm{c,r}} = \mathbf{1}\hat{\mu}_{\mathrm{c}} + \boldsymbol{\Psi}_{\mathrm{c}}(X_{\mathrm{c}}, X_{\mathrm{c}})\left(\boldsymbol{\Psi}_{\mathrm{c}}(X_{\mathrm{c}}, X_{\mathrm{c}}) + \lambda_{\mathrm{c}}\mathbf{I}\right)^{-1}(\mathbf{y}_{\mathrm{c}} - \mathbf{1}\hat{\mu}_{\mathrm{c}})$$

and

$$\hat{\mathbf{y}}_{\mathrm{e,r}} = \mathbf{1}\hat{\mu} + C(X_{\mathrm{e}})\,(C + \boldsymbol{\lambda})^{-1}(\mathbf{y} - \mathbf{1}\hat{\mu}),$$

where $C(X_{\mathrm{e}})$ denotes the rows of $C$ corresponding to $X_{\mathrm{e}}$, and where the MLEs for the hyper-parameters are those found for the co-kriging model.

Note that only the $\sigma^{2}$ terms in equation (3.12) depend on $\mathbf{y}$. We therefore need to substitute $\hat{\mathbf{y}}_{\mathrm{c,r}}$ into equation (3.5) and $\hat{\mathbf{y}}_{\mathrm{e,r}}$ into equation (3.9). Beginning with $\hat{\sigma}_{\mathrm{c,ri}}^{2}$,

$$\hat{\sigma}_{\mathrm{c,ri}}^{2} = \frac{(\hat{\mathbf{y}}_{\mathrm{c,r}} - \mathbf{1}\hat{\mu}_{\mathrm{c}})^{\mathrm{T}}\boldsymbol{\Psi}_{\mathrm{c}}(X_{\mathrm{c}}, X_{\mathrm{c}})^{-1}(\hat{\mathbf{y}}_{\mathrm{c,r}} - \mathbf{1}\hat{\mu}_{\mathrm{c}})}{n_{\mathrm{c}}} \tag{5.4}$$

(subscript ri denotes re-interpolation). Note that if $\lambda_{\mathrm{c}} = 0$, then $\hat{\sigma}_{\mathrm{c,ri}}^{2} = \hat{\sigma}_{\mathrm{c}}^{2}$. To find $\hat{\sigma}_{\mathrm{d,ri}}^{2}$, we substitute $\hat{\mathbf{y}}_{\mathrm{e,r}}$ and $\hat{\mathbf{y}}_{\mathrm{c,r}}(X_{\mathrm{e}})$ into equation (3.7) to give

$$\mathbf{d}_{\mathrm{ri}} = \hat{\mathbf{y}}_{\mathrm{e,r}} - \hat{\rho}\,\hat{\mathbf{y}}_{\mathrm{c,r}}(X_{\mathrm{e}}).$$

This is substituted into equation (3.9) in place of $\mathbf{d}$ (again, note that if $\lambda_{\mathrm{e}} = 0$, then $\hat{\sigma}_{\mathrm{d,ri}}^{2} = \hat{\sigma}_{\mathrm{d}}^{2}$). The expressions for $\hat{\sigma}_{\mathrm{c,ri}}^{2}$ and $\hat{\sigma}_{\mathrm{d,ri}}^{2}$ can now be substituted into equation (3.12) to find the re-interpolation error estimate, which reduces to 0 at $X_{\mathrm{e}}$. With the errors due to noisy data removed, the estimated error of the predicted smooth trend is typically very small when compared with that of an interpolating model. This can lead to problems with floating-point underflow when calculating $E[I(\mathbf{x})]$. A scheme for avoiding this problem is presented in appendix B.

## 6. Example problem

We will demonstrate the benefits of co-kriging, and in particular regressing co-kriging, through the optimization of a generic transonic civil aircraft wing. The wing is defined by 11 variables: area; aspect ratio; kink position; sweep; inboard and outboard taper ratios; root, kink and tip thickness-to-chord ratios; tip washout; and kink washout. Our cheap analysis is in the form of an empirical drag estimation code, *Tadpole* (Cousin & Metcalf 1990). This returns a drag value based on curve fits to previously analysed wings in approximately 0.6 s. Our ‘expensive’ code is the linearized potential method *VSaero* (Maskew 1982) with viscous coupling (approx. 2 min per drag evaluation). We aim to minimize drag/dynamic pressure (*D*/*q*) for a fixed lift (determined by a wing weight calculated by Tadpole).

To allow visualization of the design landscape, we limit the search to the four variables that have the most impact on drag: area, $S$; aspect ratio, $AR$; sweep, $\Lambda$; and inboard taper ratio, $T_{\mathrm{in}}$ (these variables were shown to be the most dominant by Keane (2003)). The Tadpole and VSaero design landscapes are shown using hierarchical axis plots in figures 6 and 7. With the fast empirical drag estimation provided by Tadpole, we have been able to build a high-resolution plot from $11^{4} = 14\,641$ runs of the code in 2.3 hours, while using the slower physics-based VSaero we have produced a low-resolution plot using $3^{2} \times 11^{2} = 1089$ evaluations in 34.5 hours.^4 Each tile of the plots shows $D/q$ for $S \in \{150, 250\}\,\mathrm{m}^{2}$ and $AR \in \{6, 12\}$ for a fixed $\Lambda$ and $T_{\mathrm{in}}$. $\Lambda$ and $T_{\mathrm{in}}$ vary from tile to tile, with the value at the lower left corner of each tile representing the value for the entire tile. The blank portions of figure 7 are regions where VSaero has failed to return a result for unusual geometries that lead to extreme flow regimes.^5 It is seen that the two codes produce results that follow the same general trend of lower $D/q$ for higher $AR$ and $T_{\mathrm{in}}$, but the global optimum of the VSaero landscape goes against this general trend, with the lowest $D/q$ at $T_{\mathrm{in}} = 0.55$. $Z_{\mathrm{d}}$ will not, therefore, be a trivial function as in our one-variable example but, since the cheap and expensive landscapes have the same general trend, should be simpler than $Z_{\mathrm{e}}$.

For this problem, we start with an initial Morris–Mitchell sampling plan for $X_{\mathrm{c}}$ of 100 points (note $n_{\mathrm{c}} > 10k$), to which a kriging model is fitted. A comparison of 100 further Tadpole calculations with predictions from the kriging model yields a high correlation coefficient of 0.96, indicating that the cheap code is approximated well. A subset of $n_{\mathrm{e}} = 20$ points (note $n_{\mathrm{e}} < 10k$) is selected (using the exchange algorithm) from which to build the initial co-kriging model. $X_{\mathrm{c}}$ and $X_{\mathrm{e}}$ are shown in figure 6. Despite the apparent sparseness of the $X_{\mathrm{e}}$ data, a good initial co-kriging prediction of the VSaero landscape is obtained, with a correlation coefficient of 0.96 when compared with the $11^{2} \times 3^{2}$ VSaero dataset. We follow an iterative process of updating the co-kriging model with new VSaero and Tadpole data at $\max\{E[I(\mathbf{x})]\}$ and retuning the hyper-parameters until the optimum is found. Since $n_{\mathrm{c}}$ is large and the correlation with independent data is high, we do not expect the Tadpole landscape to be altered unduly by updates. The GA-based tuning of $\boldsymbol{\theta}_{\mathrm{c}}$ and $\lambda_{\mathrm{c}}$ requires a large number of computationally intensive inversions of the large $n_{\mathrm{c}} \times n_{\mathrm{c}}$ covariance matrix. We therefore save time by limiting our re-tuning of the cheap hyper-parameters $\boldsymbol{\theta}_{\mathrm{c}}$ and $\lambda_{\mathrm{c}}$ to every 10 updates, while we re-tune $\boldsymbol{\theta}_{\mathrm{e}}$ and $\lambda_{\mathrm{e}}$ at each step.

The co-kriging method is compared with a $\max\{E[I(\mathbf{x})]\}$ search of a kriging model built using only the VSaero data. A $D/q$ that improves on the minimum of the previously computed $3^{2} \times 11^{2}$ set of data plus one standard deviation of the noise exhibited by the VSaero results (found from 10 VSaero evaluations of the same design with small perturbations), $D/q = 2.57 + 0.027$, is used as a stopping criterion in both cases, i.e. answers of similar quality are required from both searches, allowing a direct comparison of the effort involved. We also limit the size of the kriging and co-kriging models to a maximum of 200 VSaero calculations. Beyond this, the time required for tuning the model becomes equivalent to the time required to run VSaero. Results, averaged over five searches from sampling plans produced using different random number seeds, are shown in table 1.

The co-kriging based search consistently outperforms the kriging-based search, finding better optima with fewer VSaero evaluations and fewer failed simulations. The initial prediction of the kriging model is almost as accurate as that of the co-kriging model (see the correlation and RMSE in table 1), but the greater coverage of data in the co-kriging model leads to a better selection of successful update points in promising regions, as seen in figure 7, which shows the distribution of update evaluations for a typical search. Moreover, the co-kriging updates are concentrated in regions of good designs, while the kriging updates are more widespread because, where data are sparse, the estimated error and therefore the expectation of improvement are high in many areas (i.e. there is an emphasis on exploration over exploitation).

Only discrete values of *Λ* and *T*_{in} can be displayed in figure 7 and these design variables are rounded to the nearest 2° and 0.03 respectively, for the purpose of visualization. It should be borne in mind that there is, in fact, a distribution of designs between tiles. The reader should also appreciate that, although the major axis is much larger than the axes of the individual tiles, they both represent the same variation across a variable. Thus, although there seems to be a wide distribution of update points across *Λ* and *T*_{in}, for the co-kriging updates these points represent tight clusters in regions of optimal feasible designs.

In §5 we discussed the use of the two regression constants $\lambda_{\mathrm{c}}$ and $\lambda_{\mathrm{e}}$ in the co-kriging formulation. The final MLEs of these hyper-parameters for the search in figure 7 agree with our intuition that data from an empirical code (Tadpole) will be smooth and require little regression, i.e. a very small $\hat{\lambda}_{\mathrm{c}}$, and that data from a discretized physics-based code (VSaero) will be noisy and require smoothing, i.e. a larger $\hat{\lambda}_{\mathrm{e}}$.

## 7. Discussion

Our results demonstrate that correlating analyses at multiple levels of fidelity can enhance the accuracy of a surrogate model of the highest level of analysis. This correlated model can be used to find optimal solutions more quickly. We have presented a global optimization strategy using a co-kriging based method that allows for varying levels of noise filtering across multi-fidelity analyses and which converges towards the global optimum using expected improvement maximization.

We have shown a co-kriging based optimization using two levels of CFD code fidelity. The methods are extendable to multiple levels and there are many other examples of analyses that may be coupled. Here, both codes are predicting the same quantity (drag) but, as long as there is a useful correlation, the analyses may relate to completely different quantities, as is the case with the work of Hevesi *et al*. (1992), which correlated precipitation with elevation: two different but obviously correlated quantities. Often only one form of analysis is available, although this may come in varying levels of fidelity. For example, Kennedy & O'Hagan (2000) used different finite-element mesh resolutions to produce cheap and expensive results. Forrester *et al*. (2006*a*) showed that partially converged CFD simulations correlate well with their converged counterparts. The convergence could be stopped at any number of points, thus providing many levels of fidelity.

In §4 we discussed the choice of *n*_{c} and *n*_{e} based on using more or fewer points than an *n*=10*k* rule of thumb. In the example problem, we checked that the model built using our choices of *n*_{c} and *n*_{e} was accurate by assessing the correlation with a second set of data. Naturally the number of points required is problem dependent and additional validation data may be too expensive to compute. A possible way to reduce computational effort in situations when the cost of *f*_{c} is considerable is to start with a very low *n*_{c} and add data at points where the estimated error is maximized until an accurate cheap model is obtained. Model accuracy may be assessed using a leave-one-out cross-validation procedure. This method will produce a space-filling design with *n*_{c} controlled by the desired model accuracy. Maximum expected improvement updates *could* then proceed starting from an initial sample as small as *n*_{e}=2, though it may be advantageous to follow a maximum error update strategy until an accurate model is produced.
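The leave-one-out procedure mentioned above is generic and can be sketched independently of the surrogate used. In this Python fragment, `predict` is any callable that fits the remaining data and predicts at the held-out point (the name and interface are ours, chosen for illustration).

```python
import numpy as np

def loo_errors(X, y, predict):
    """Leave-one-out cross-validation: refit with each point held out and
    record the prediction error at that point."""
    errs = []
    for i in range(len(y)):
        mask = np.arange(len(y)) != i          # hold out the ith point
        yhat = predict(X[mask], y[mask], X[i])
        errs.append(yhat - y[i])
    return np.array(errs)
```

A simple summary of these errors (e.g. their RMS relative to the spread of `y`) can then serve as the model-accuracy check that controls when to stop adding cheap points.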

In our example problem we limited ourselves to four variables. Increasing the number of variables naturally results in a more difficult optimization problem, requiring larger initial sampling plans and more update evaluations, and so the benefits of the co-kriging method are likely to increase. However, since we have noted that co-kriging relies on the relationship between analyses being simpler to model than the analyses themselves, we can assume that for additional variables which have little impact on the design objective the co-kriging model will not offer a significant speed-up. Thus, while we expect co-kriging to perform well in high dimensions, we should still try to isolate the most important variables, either through prior knowledge or through the analysis of elementary effects (Morris 1991).

We limited the total sample size (including updates) to *n*_{e}=200 in the example problem due to the expense of tuning the hyper-parameters for large quantities of data overtaking the expense of the CFD simulations. This effectively puts an upper limit on the number of variables that can be considered, since more variables require more points. However, in such cases where *f*_{e} is quick to evaluate when compared with the hyper-parameter tuning we can choose to simplify the correlation, and hence the tuning time, by using the same *θ*_{c,j} and *θ*_{e,j} across all dimensions. The extension of surrogate-based techniques to higher dimensions is a particularly active area of research and further developments in this area would be welcome.

## Acknowledgments

This work has been supported by EPSRC grant code GR/T19209/01.

## Footnotes

^1 Originally called *krigeage* by Matheron (1963) after D. G. Krige—a South African mining engineer who pioneered the method in the 1950s for determining ore-grade distributions based on core samples (Krige 1951).

^2 Our methods can be extended to multiple code levels following the notation used by Kennedy & O'Hagan (2000).

^3 We require partially collocated points in our estimation of a scaling parameter between the data. It is possible to construct the presented co-kriging model with completely non-collocated points by using kriging estimates $\hat{\mathbf{y}}_{\mathrm{c}}(X_{\mathrm{e}})$.

^4 Generating such plots would, of course, not be possible in most cases due to the high computational cost. Here we have computed this large quantity of data for illustrative purposes.

^5 In our co-kriging and kriging-based searches, we have directed the search away from regions of failure by imputing penalized values of $y_{\mathrm{e}}$ when VSaero fails to return a result (Forrester *et al*. 2006*c*).

- Received March 20, 2007.
- Accepted September 6, 2007.

- © 2007 The Royal Society