## Abstract

In this paper, Isomap and kernel Isomap are used to dramatically reduce the dimensionality of the output space to efficiently construct a Gaussian process emulator of parametrized partial differential equations. The output space consists of spatial or spatio-temporal fields that are functions of multiple input variables. For such problems, standard multi-output Gaussian process emulation strategies are computationally impractical and/or make restrictive assumptions regarding the correlation structure. The method we develop can be applied without modification to any problem involving vector-valued targets and vector-valued inputs. It also extends a method based on linear dimensionality reduction to response surfaces that cannot be described accurately by a linear subspace of the high dimensional output space. Comparisons to the linear method are made through examples that clearly demonstrate the advantages of nonlinear dimensionality reduction.

## 1. Introduction

Emulators (*surrogate models* or *metamodels*) [1,2] are approximations of high-fidelity computational models that are designed for rapid simulation (ideally in real time) in applications such as uncertainty analysis, design optimization and inverse parameter estimation, i.e. when repeated calls to the computational model (*simulator*) at numerous points in a parameter space are computationally impractical. In many applications, the simulator solves a system of parametrized partial differential equations (PDEs) and the outputs of interest are spatial or spatio-temporal fields, e.g. temperature or flow velocity, as functions of multiple input parameters. *Physics-based* emulation approaches to such problems typically involve a reduced-basis (e.g. via proper orthogonal decomposition) approximation of the original numerical formulation [3–11]. Such approaches are attractive since they establish a direct link to the underlying physical principles and provide rigorous estimates of the error. They can, on the other hand, be very difficult to implement, particularly when dealing with models involving highly complex nonlinearities. Another class of emulators can be termed *black-box*. They are based on function approximation strategies or machine learning algorithms applied to input–output data generated by the simulator [2,12]. The most common approach is to place a Gaussian process (GP) prior over the simulator outputs. Such GP emulators are very versatile and are readily applicable to highly nonlinear systems. They have been applied successfully to a broad range of problems, typically involving the approximation of a scalar simulator output as a function of multiple inputs [13,14]. There are, however, few approaches to GP emulation of vector-valued outputs in high dimensions (in the case of a spatial field, the simulator outputs would be vectorized values of the field variable at *d*≫1 points on a numerical grid).

The simplest approach was developed by Kennedy & O’Hagan [15], who treated the output index as an additional input. Since it involves *d* separate scalar GP emulations, this method is impractical for emulating whole fields. Conti & O’Hagan [16] extended the scalar GP method by placing a multi-dimensional GP prior over the simulator. The authors assumed a separable covariance structure, which is equivalent to assuming that the coordinates of the outputs are independent and identically distributed (i.i.d.) univariate GPs. This simplification leads to a potentially efficient emulation strategy but it is clearly restrictive [17]. Rougier [18] made the same assumption, in addition to several assumptions regarding the forms of the regression functions, and took advantage of factorizations of the covariance matrix to improve computational efficiency. Conti and O’Hagan's approach has recently been extended to non-stationary GPs [19] and to non-separable covariance structures for categorical outputs [17]. Fricker *et al.* [20] make use of the linear model of coregionalization, which permits a richer covariance structure but is restricted to low dimensions.

Higdon *et al.* [21] developed a method for treating multiple outputs in the GP emulation framework by applying linear principal component analysis (PCA) to the output space. This is a natural method for reducing the problem size since the coordinates in a PCA basis are uncorrelated. They can, therefore, be treated as independent GPs, but crucially they are not treated as i.i.d. In many cases, however, it will not be possible to use a linear subspace of the high-dimensional output space for faithful representations of the data, which motivates the introduction of nonlinear dimensionality reduction techniques. The focus of the present work is the development of a new approach for constructing GP emulators of spatial and spatio-temporal data sets using nonlinear dimensionality reduction, namely Isomap [22,23]. Isomap is a powerful approach for locating low-dimensional manifolds from data points in high-dimensional spaces, with a range of applications in areas such as image processing and pattern recognition.

In this paper, Isomap and kernel Isomap are used to reduce the dimensionality of the *output* space consisting of spatial or spatio-temporal fields as functions of multiple input variables. Via the connection between Isomap and kernel PCA (kPCA), it can be demonstrated that the method is equivalent to a spectral discretization of an embedding of the output space in a feature space (kernel Isomap guarantees the existence of the feature space). GP emulation can then be applied individually to the coefficients extracted from Isomap, which are naturally ordered in terms of decreasing variance. The method developed (termed *Iso-GPE*) is a computationally efficient approach to emulating spatial or spatio-temporal fields as functions of multiple inputs and/or time. It is applied to three test cases involving data from computer simulations, demonstrating superior performance compared with the linear method of Higdon *et al.* [21] (referred to as HH).

## 2. Gaussian process emulation of spatial and spatio-temporal datasets via nonlinear dimensionality reduction

In this section, we first provide details of GP emulation in the univariate case, introducing the basic terminology. We then define the types of problems under consideration and the construction of the datasets. The extension of univariate GP emulation (GPE) to such problems using Isomap is then described. With a general audience in mind, details of Isomap and kernel Isomap are provided. Example applications are presented in §3.

### (a) Univariate GP emulation

A stochastic process is a family of random variables indexed by *x*∈*X*, where the index set *X* can be discrete or continuous. A Gaussian process (GP) is a stochastic process for which the joint distribution over any *finite* subset of *X* is multivariate Gaussian. A GP is specified by its mean and covariance functions. Below, *f*(⋅) denotes a probability density function over the argument.

An *emulator* provides a probability distribution for the computer model outputs. The computer model is called a *simulator* and can be represented as a function *ψ*(⋅) mapping inputs to outputs. Given *m* simulator outputs *y*^{(i)}=*ψ*(*x*^{(i)}), referred to as *training points*, at *design points* *x*^{(i)}, *i*=1,…,*m*, *ψ*(⋅) is regarded as a random function that has a GP distribution

$$\psi(\mathbf{x}) = \boldsymbol{\beta}^{\mathrm{T}} \mathbf{h}(\mathbf{x}) + \epsilon(\mathbf{x}), \tag{2.1}$$

where **h**(** x**) is a vector of regression functions and ** β** is a vector of regression coefficients. *ϵ*(** x**) is assumed to be a zero-mean GP with constant variance *σ*^{2}, independent of the regression term. The mean of *ψ*(** x**) is therefore ** β**^{T}**h**(** x**). The regression functions are usually taken to be linear (in the input parameters) or constant, which has been found to be adequate in most applications. A zero constant value is frequently assumed after centring the data [12], as will be the case in this work. If prior knowledge of the input–output relationship is available, more complex forms of **h**(** x**) could be used.

The covariance function must generate a symmetric, positive semi-definite covariance matrix when evaluated at any finite collection of design points. A common choice of covariance function is the squared exponential (used in this work)

$$k(\mathbf{x}, \mathbf{x}') = \theta_1 \exp\left(-\sum_{l=1}^{L} \theta_{l+1} \left(x_l - x'_l\right)^2\right), \tag{2.2}$$

where ** x**=(*x*_{1},…,*x*_{L})^{T}. The ‘hyperparameters’ *θ*_{2},…,*θ*_{L+1} introduce different degrees of decay in each component of the input. The unknown variance of the noise (*σ*^{2}) is also treated as a hyperparameter. The notation ** θ**=(*θ*_{1},…,*θ*_{L+1},*σ*^{2})^{T} is used below.

The joint density of the targets *y*^{(i)} at the design points *x*^{(i)}, *i*=1,…,*m*, is Gaussian. Assuming that the data have been centred (** β**=**0** in equation (2.1)), the targets have a zero-mean joint Gaussian density with covariance matrix *K*, whose entries are

$$K_{ij} = k\left(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}\right) + \sigma^2 \delta_{ij}, \quad i,j=1,\dots,m.$$

Consider a test input ** x** for which the output *y* is sought. Defining ** y**′=(** y**^{T},*y*)^{T}, where ** y**=(*y*^{(1)},…,*y*^{(m)})^{T}, the distribution of ** y**′ is Gaussian and its covariance matrix *C*′ is given by

$$C' = \begin{pmatrix} K & \mathbf{k} \\ \mathbf{k}^{\mathrm{T}} & k(\mathbf{x},\mathbf{x}) + \sigma^2 \end{pmatrix},$$

where ** k**=(*k*(*x*^{(1)},** x**),…,*k*(*x*^{(m)},** x**))^{T}. The conditional distribution over a subset of the coordinates of ** y**′ is Gaussian and it can be shown that the mean and variance of *y*|** y**,** θ** are [12]

$$\mathbb{E}\left[y \,\middle|\, \mathbf{y}, \boldsymbol{\theta}\right] = \mathbf{k}^{\mathrm{T}} K^{-1} \mathbf{y}, \qquad \operatorname{var}\left[y \,\middle|\, \mathbf{y}, \boldsymbol{\theta}\right] = k(\mathbf{x},\mathbf{x}) + \sigma^2 - \mathbf{k}^{\mathrm{T}} K^{-1} \mathbf{k}. \tag{2.4}$$

These expressions require values of the hyperparameters *θ*_{i}, which must be deduced from the given data and the assumed model. The most rigorous method places a prior distribution (hyper-prior) *f*(** θ**) over the hyperparameters and marginalizes the predictive distribution to eliminate the dependence on ** θ**. Given the computational expense of such an approach, a point estimate of ** θ** is typically used in (2.4) to make predictions [15]. If a uniform distribution is placed on ** θ**, then by Bayes’ rule *f*(** θ**|** y**)∝*f*(** y**|** θ**), so that the maximum a posteriori estimate coincides with the maximum-likelihood estimate (MLE) of ** θ**, which is used in this paper.
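The univariate GPE just described can be sketched in a few lines. The following is a minimal illustration (not the authors' implementation): a squared-exponential kernel with per-dimension decay rates, an MLE point estimate of the hyperparameters via a log parametrization and L-BFGS-B, and the standard predictive mean and variance. All function names here are ours.

```python
import numpy as np
from scipy.optimize import minimize

def sq_exp_kernel(X1, X2, theta):
    """Squared-exponential kernel: theta[0] is the signal variance,
    theta[1:] are per-dimension decay rates."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2 * theta[1:]).sum(axis=2)
    return theta[0] * np.exp(-d2)

def neg_log_marginal_likelihood(log_p, X, y):
    p = np.exp(log_p)                 # log parametrization keeps parameters positive
    theta, noise = p[:-1], p[-1]
    K = sq_exp_kernel(X, X, theta) + noise * np.eye(len(y))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ alpha + np.log(np.diag(L)).sum() + 0.5 * len(y) * np.log(2 * np.pi)

def fit_predict(X, y, Xs):
    """MLE point estimate of the hyperparameters, then the GP predictive
    mean and variance at the test inputs Xs."""
    m, L_dim = X.shape
    x0 = np.concatenate([np.zeros(L_dim + 1), [np.log(1e-4)]])
    res = minimize(neg_log_marginal_likelihood, x0, args=(X, y),
                   method='L-BFGS-B', bounds=[(-10.0, 10.0)] * (L_dim + 2))
    p = np.exp(res.x)
    theta, noise = p[:-1], p[-1]
    K = sq_exp_kernel(X, X, theta) + noise * np.eye(m)
    ks = sq_exp_kernel(Xs, X, theta)
    Kinv_ks = np.linalg.solve(K, ks.T)
    mean = ks @ np.linalg.solve(K, y)
    var = theta[0] + noise - np.einsum('ij,ji->i', ks, Kinv_ks)
    return mean, var
```

A fully Bayesian treatment would marginalize over the hyperparameters instead of optimizing them; the point estimate above reflects the MLE choice made in the text.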

### (b) Construction of spatio-temporal datasets from simulators

The goal of this work is the development of a new approach based on Isomap for the efficient application of GPE to computer-simulated parametrized spatial and spatio-temporal datasets representing scalar fields (e.g. temperature, velocity magnitude or magnetic field strength). For vector fields or multiple scalar fields, a simple extension is to apply the method developed below individually to the field components or scalar fields. The types of problems under consideration are quite general. Consider a parametrized nonlinear, time-dependent system of *Q* order-*n* PDEs for dependent variables *u*_{i}(** q**,*t*;** x**), defined over a physical domain *Ω*,

$$\frac{\partial u_i}{\partial t} = \mathcal{N}_i\left(\mathbf{q}, t, \mathbf{x}, \mathbf{u}, D\mathbf{u}, \dots, D^n\mathbf{u}\right), \quad i=1,\dots,Q, \tag{2.6}$$

where ** u**=(*u*_{1},…,*u*_{Q})^{T}, *t*∈(0,*t*_{f}] denotes time and ** q**=(*ξ*,*η*)∈*Ω* is the spatial variable. The equations are parametrized by inputs ** x**, and the operators *N*_{i}, *i*=1,…,*Q*, are nonlinear operators (parametrized nonlinear functions of their arguments). For a given multi-index *α*=(*α*_{ξ},*α*_{η}), |*α*|=*α*_{ξ}+*α*_{η}. For a non-negative integer *k*, *D*^{k}*u*_{i}={*D*^{α}*u*_{i}:|*α*|=*k*} denotes the set of all partial derivatives of *u*_{i} of order *k*, and *D*^{k}** u**={*D*^{α}** u**:|*α*|=*k*}, where *D*^{α}** u**={*D*^{α}*u*_{1},…,*D*^{α}*u*_{Q}}, denotes the set of all partial derivatives of order *k* of all the *u*_{i}. The boundary and initial conditions are also allowed to be nonlinear and parametrized. It is assumed that the PDE model is well posed for any value of ** x** and that solutions can be computed numerically.

We seek to construct an emulator for numerical solutions (using, e.g. the finite-element or finite-volume method) to (2.6) with consistent initial/boundary conditions. The computational model is the simulator, the outputs of which are the fields *u*_{i}. The outputs of interest can include one or more of the *u*_{i}, or functions derived from the *u*_{i}. In the steady case, consider one such quantity, *u*(** q**;** x**), which is simulated or derived from the simulations at specified locations on a spatial grid ** q**=(*ξ*_{i},*η*_{j}), *i*=1,…,*N*_{ξ}, *j*=1,…,*N*_{η}. Provided that the same grid is used for each simulation (for different inputs ** x**), it is not a requirement that the grid points be uniformly spaced.

For different design points *x*^{(k)}, the simulator yields fields *u*^{(k)}(*ξ*_{i},*η*_{j};*x*^{(k)}), *i*=1,…,*N*_{ξ}, *j*=1,…,*N*_{η}, which are vectorized into data points *u*^{(k)}. To simplify the notation, we define *d*=*N*_{ξ}×*N*_{η}. In the case of time-dependent simulations with *N*_{t} time steps in the temporal discretization, *t* is treated as an additional input, ** x**↦(** x**,*t*). The values of *u*(** q**,*t*;** x**) at defined points ** q**=(*ξ*_{i},*η*_{j}) are again vectorized, with *d*=*N*_{ξ}×*N*_{η} and *k*=1,…,*m*×*N*_{t}. For spatially uniform, time-dependent quantities *u*(*t*;** x**), *d*=*N*_{t}.

In analogy with the univariate case, the simulator is regarded as a function ** ψ**(⋅) that yields outputs *u*^{(k)} corresponding to design points *x*^{(k)}, *k*=1,…,*m*. Even for relatively coarse discretizations, e.g. *N*_{ξ}=*N*_{η}=10, the value of *d* can be large. In problems involving complex geometries or fine-scale phenomena, much finer grids are required to fully resolve all features. There are few computationally feasible methods for GP emulation of such high-dimensional problems. Methods based on multivariate GP prior distributions over ** ψ** make restrictive assumptions regarding the forms of the regression functions and/or the correlation structure, notably separability of the between-output covariance and between-index correlation matrices [16,18].

An alternative approach was developed by Higdon *et al.* [21], who performed PCA on the training data matrix *X*=[*u*^{(1)} *u*^{(2)}⋯*u*^{(m)}] to extract a set of *d* orthogonal principal directions *p*_{i}, ordered in terms of the variance in the data that they capture. The coefficients of the training points in the PCA basis *p*_{i} are uncorrelated and, since they are assumed to be Gaussian distributed, they are independent. The data can be projected onto the lower-dimensional subspace spanned by the *r* most ‘dominant’ basis vectors. The simulator outputs can be taken as the coefficients (*i*=1,…,*m*, *j*=1,…,*d*) of the training points *u*^{(i)} in the PCA basis, where *j*≤*r* signifies the principal direction and *i* refers to the training point number. HH then proceeds as below:

Note that each of the coefficients is approximated without making an i.i.d. assumption, i.e. they come from independent GPs with *different* covariance parameters. For complex response surfaces, PCA (and other linear methods such as multi-dimensional scaling (MDS)) may lead to poor results, and in many cases it could fail altogether. A simple example is illustrated in figure 1, which shows low-dimensional representations of data lying on a ‘swiss roll’ in three-dimensional space. Figure 1*a* shows 2000 points randomly sampled from a swiss roll manifold, while figure 1*b* shows the approximation of the dataset using two principal components in PCA. Figure 1*c* shows the reconstruction using only one component (defined later) in Isomap. Clearly, this simple dataset cannot be approximated by a linear subspace of the ambient space.
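The linear reduction step of the HH approach can be sketched with a thin SVD of the centred snapshot matrix; the coefficients returned would then each be emulated by an independent univariate GP. This is a minimal numpy sketch under our own naming, not the implementation of [21].

```python
import numpy as np

def pca_reduce(U, r):
    """HH-style linear reduction of a d x m snapshot matrix U.
    Returns the basis P_r (d x r), the coefficients W (r x m) of each
    snapshot in that basis, and the sample mean."""
    mean = U.mean(axis=1, keepdims=True)
    Uc = U - mean
    # thin SVD: columns of P are principal directions ordered by variance
    P, s, Vt = np.linalg.svd(Uc, full_matrices=False)
    W = P[:, :r].T @ Uc
    return P[:, :r], W, mean

def pca_reconstruct(P_r, w, mean):
    # map a (predicted) coefficient vector back to the d-dimensional output space
    return mean[:, 0] + P_r @ w
```

For data that genuinely lie near an *r*-dimensional linear subspace this reconstruction is essentially exact; the swiss-roll example above is precisely a case where no such subspace exists.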

### (c) Dimensionality reduction via Isomap

#### (i) Multi-dimensional scaling

Classical MDS, which was motivated by the work of Torgerson [28], provides a low-dimensional Euclidean space representation of a dataset that lies in a high-dimensional ambient space. It does so by relating the distances *δ*_{ij} between data points in the low-dimensional Euclidean space to dissimilarities *d*_{ij} between the data points in the ambient space. In ‘classical scaling’, the dissimilarities are Euclidean distances and the Euclidean distances in the low-dimensional space satisfy *δ*_{ij}=*d*_{ij} (in a least-squares sense). This can be viewed as an approximate isometric embedding of the data in the low-dimensional space. Let *D*=[*d*_{ij}] (dissimilarity matrix) denote the matrix of dissimilarities between data points *u*^{(i)}, *i*=1,…,*m*. In classical scaling, we seek coordinates *z*^{(i)}, *i*=1,…,*m*, of the *u*^{(i)} in the low-dimensional Euclidean space. Applying the centering matrix *H*=*I*−(1/*m*)**1****1**^{T} to the matrix *Z* with rows corresponding to the *z*^{(i)} yields the centred coordinates *HZ*, whose Gram matrix can be written in terms of the dissimilarities as

$$K = (HZ)(HZ)^{\mathrm{T}} = -\tfrac{1}{2} H D^{(2)} H, \qquad D^{(2)} = \left[d_{ij}^2\right].$$

The eigendecomposition of *K* is *K*=*V* *ΛV*^{T}, where the eigenvalues *λ*_{i}, *i*=1,…,*m*, are arranged in non-increasing order with corresponding eigenvectors *v*_{i}. An *r*-dimensional representation is obtained from the *r* eigenvectors corresponding to the largest *r* eigenvalues, i.e. *V*_{r}=[*v*_{1},…,*v*_{r}] and *Λ*_{r}=diag(*λ*_{1},…,*λ*_{r}); the rows of *Z*_{r}=*V*_{r}*Λ*_{r}^{1/2} give the coordinates *z*^{(i)}. When the dissimilarities are Euclidean distances, classical scaling is equivalent to PCA, in which a matrix *P*_{r} is formed from the first *r* eigenvectors *p*_{i}, *i*=1,…,*d*, of the sample covariance matrix (as columns) corresponding to the largest *r* eigenvalues. These eigenvalues are identical to those of the kernel matrix. The equivalence follows from the singular value decomposition of the centred data matrix, which relates the *p*_{i} to the *v*_{i}: the coordinates *z*^{(i)}, *i*=1,…,*m*, computed by classical scaling are identical to the first *r* coordinates of the *u*^{(i)} using the principal directions *p*_{i} as a basis.
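Classical scaling as described above (double centring of the squared dissimilarities followed by an eigendecomposition) can be sketched as follows; the function name is illustrative.

```python
import numpy as np

def classical_mds(D, r):
    """Classical (Torgerson) scaling of an m x m dissimilarity matrix D.
    Returns an m x r configuration whose Euclidean distances approximate D."""
    m = D.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m        # centering matrix
    K = -0.5 * H @ (D ** 2) @ H                # doubly centred kernel matrix
    lam, V = np.linalg.eigh(K)                 # eigh returns ascending order
    lam, V = lam[::-1], V[:, ::-1]             # reorder to non-increasing
    lam_r = np.clip(lam[:r], 0.0, None)        # guard small negative eigenvalues
    return V[:, :r] * np.sqrt(lam_r)           # rows are the coordinates z^(i)
```

When *D* contains exact Euclidean distances from an *r*-dimensional configuration, the recovered configuration reproduces those distances exactly (up to rotation and translation).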

#### (ii) Isomap and kernel Isomap

Classical metric MDS is a linear method and, in common with PCA (see the example in §2*b*), it will fail to accurately represent many surfaces. To overcome this weakness, Tenenbaum *et al.* [22] developed the Isomap method, which uses geodesic distances rather than straight-line (e.g. Euclidean) distances as the dissimilarities. For neighbouring points, the Euclidean distance provides a good approximation to the geodesic distance. For distant points, the geodesic distance is approximated by the shortest-path distance through a graph connecting neighbouring points. Applying classical MDS to a dissimilarity matrix based on such geodesic distances leads to a low-dimensional (approximately isometric) embedding that captures the intrinsic geometry of the data. The main steps are given below [22]:
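These steps can be sketched compactly: build a k-nearest-neighbour graph weighted by Euclidean distances, approximate geodesics by graph shortest paths, and apply classical MDS to the resulting dissimilarities. This is an illustrative sketch (with a symmetrized neighbour graph) rather than the reference implementation of [22].

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import cdist

def isomap(U, n_neighbors, r):
    """Minimal Isomap: kNN graph -> geodesic (shortest-path) distances
    -> classical MDS. U is m x d, one row per output vector."""
    m = U.shape[0]
    E = cdist(U, U)                                   # Euclidean distances
    # keep only each point's n_neighbors nearest neighbours (inf = no edge)
    G = np.full((m, m), np.inf)
    idx = np.argsort(E, axis=1)[:, 1:n_neighbors + 1]
    rows = np.repeat(np.arange(m), n_neighbors)
    G[rows, idx.ravel()] = E[rows, idx.ravel()]
    G = np.minimum(G, G.T)                            # symmetrize the graph
    D = shortest_path(G, directed=False)              # geodesic approximation
    # classical MDS on the geodesic dissimilarities
    H = np.eye(m) - np.ones((m, m)) / m
    K = -0.5 * H @ (D ** 2) @ H
    lam, V = np.linalg.eigh(K)
    lam, V = lam[::-1], V[:, ::-1]
    return V[:, :r] * np.sqrt(np.clip(lam[:r], 0.0, None))
```

The neighbourhood size must be large enough for the graph to be connected, but small enough that neighbour edges do not ‘short-circuit’ the manifold.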

The connection between this method and kPCA is worth elucidating [23,29,30] to explain the significance of the Isomap coordinates. The dissimilarity matrix *D* can be considered to represent distances between points mapped into a feature space by some map ** ϕ**(⋅), with kernel values *k*′(*i*,*j*)=** ϕ**(*u*^{(i)})^{T}** ϕ**(*u*^{(j)}). For an isotropic kernel scaled so that *k*′(*i*,*i*)=1, *K*′=[*k*′(*i*,*j*)] is a kernel matrix. The centred kernel matrix *K* is obtained from *K*′ as *K*=*HK*′*H*, which coincides with −(1/2)*HD*^{(2)}*H* since *d*_{ij}^{2}=2−2*k*′(*i*,*j*). Performing classical scaling on such a dissimilarity matrix is therefore equivalent to performing PCA *in a feature space*, i.e. exactly as in kPCA.

In kPCA, the eigenvectors of the centred covariance matrix in feature space are generally not known since the feature map is not specified. This covariance matrix eigenproblem, in a possibly infinite-dimensional feature space, can be recast as the eigenproblem for the positive definite (centred) kernel matrix. Assuming for simplicity that both matrices have a full set of eigenvectors, and using only the eigenvectors *v*_{j}, *j*=1,…,*m*, of the centred kernel matrix and the specified kernel values, the projection of a data point *u*^{(i)} onto the *j*th feature-space principal direction *f*_{j} can be written

$$y_j\left(u^{(i)}\right) = \frac{1}{\sqrt{\lambda_j}} \sum_{l=1}^{m} v_{lj} K_{li}, \tag{2.11}$$

where *v*_{lj} is the *l*th component of the eigenvector *v*_{j} and *K*_{li} is the dot product (kernel value) between two points *u*^{(l)} and *u*^{(i)} centred in feature space.

There is no theoretical guarantee, however, that the kernel matrix is positive semi-definite (p.s.d.), which would guarantee that the low-dimensional manifold onto which the data are mapped is a Euclidean space (with the distances between mapped points equal to the dissimilarities) and that a feature space exists. Dissimilarity matrices that lead to p.s.d. kernel matrices are said to have an ‘exact Euclidean representation’. To overcome the problem of non-Euclidean dissimilarity matrices, Choi & Choi [23] developed a variant of Isomap termed kernel Isomap, which exhibits greater robustness. The algorithm is described below [30]:

Step 4 is related to the additive constant problem, which can be defined in the following way: find the smallest constant *c** such that the modified dissimilarities *d*_{ij}+*c*(1−*κ*_{ij}) have an exact Euclidean representation for all *c*≥*c** (*κ*_{ij} is the Kronecker delta). An exact solution to this problem was given by Cailliez [31] as the largest eigenvalue of the matrix

$$\begin{pmatrix} 0 & 2\hat{B} \\ -I & -4\tilde{B} \end{pmatrix}, \tag{2.12}$$

where $\hat{B}=-\tfrac{1}{2}HD^{(2)}H$ and $\tilde{B}=-\tfrac{1}{2}HDH$ are the doubly centred matrices of squared and raw dissimilarities, respectively.
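Cailliez's solution can be sketched directly from this definition. The block below assumes the standard linearization of the underlying quadratic eigenvalue problem, with the doubly centred matrices of squared and raw dissimilarities; the function name is ours.

```python
import numpy as np

def cailliez_constant(D):
    """Smallest constant c* such that adding c* to every off-diagonal
    dissimilarity yields a p.s.d. doubly centred kernel matrix
    (Cailliez's solution to the additive constant problem)."""
    m = D.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    B_sq = -0.5 * H @ (D ** 2) @ H     # centred matrix of squared dissimilarities
    B_lin = -0.5 * H @ D @ H           # centred matrix of raw dissimilarities
    M = np.block([[np.zeros((m, m)), 2.0 * B_sq],
                  [-np.eye(m),       -4.0 * B_lin]])
    return float(np.max(np.linalg.eigvals(M).real))
```

For any *c*≥*c**, the corrected dissimilarities admit an exact Euclidean representation, so the centred kernel matrix built from them is p.s.d.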

### (d) The Iso-GPE algorithm

For design points *x*^{(i)}, *i*=1,…,*m*, the simulator ** ψ**(⋅) (computer model of §2*b*) yields outputs *u*^{(i)}=** ψ**(*x*^{(i)}). Analogously to the HH method, we can view the simulator as instead giving realizations *z*^{(i)}, *i*=1,…,*m*, of a random vector. These realizations lie on or near an *r*-dimensional space (embedding) and are extracted from the simulator outputs *u*^{(i)} as described in §2*c*(ii). For fixed *j*, univariate GPE is performed on the set of coefficients *z*_{j}^{(i)} corresponding to the inputs *x*^{(i)}, *i*=1,…,*m*, to learn the mapping from ** x** to the *j*th Isomap coordinate. In the notation of §2*a*, the training points are *y*^{(i)}=*z*_{j}^{(i)}, *i*=1,…,*m*, and the test output is the *j*th coordinate at a test input ** x**. This is repeated for *j*=1,…,*r*, with *r*≪*d*, to yield the embedded image of ** u**=** ψ**(** x**). The predicted coefficients *z*_{j}, *j*=1,…,*r*, are precisely the coefficients in an expansion using the PCA basis in a feature space, as demonstrated in the preceding subsection (equation (2.11)). They are therefore uncorrelated and can be treated as independent given the GP-prior assumption. The final step is to find the inverse mapping from the Isomap representation back to the original *d*-dimensional output space.

*Constructing solutions from the predicted Isomap coordinates.* Given the predicted coordinates ** z** of a point ** x** (as described above), the distances between this point and the embedded training points *z*^{(i)}, *i*=1,…,*m*, can be computed. Let *d*_{i,*} denote the Euclidean distance between ** z** and *z*^{(i)}; *d*_{i,*} approximates the geodesic distance between ** u** (the inverse image of ** z**) and *u*^{(i)} by the nature of Isomap. For neighbourhood points of ** u**, these geodesic distances are approximately equal to Euclidean distances. Local linear interpolation can be used to approximate the coordinates of ** u** by using the geodesic distances as weights [32,33]. Let *w*_{i}=1/*d*_{i,*}; then ** u** is approximately given by

$$\mathbf{u} \approx \frac{\sum_{i \in \mathcal{N}} w_i \, u^{(i)}}{\sum_{i \in \mathcal{N}} w_i},$$

where the set 𝒩 indexes the *N*_{n} nearest neighbours selected for the reconstruction (mapping of ** z** to ** u**). The natural choice for *N*_{n} is the neighbourhood number *N* used in Isomap (recall that for neighbourhood points the geodesic distances are set equal to the Euclidean distances in step 1 of Algorithm 1). The accuracy of this procedure relies on (a) the design-of-experiment (DOE) for selecting the inputs (see §3) to provide ‘representative’ outputs in the window of input space that is of interest; and (b) the number of neighbours selected for the reconstruction. In some cases, e.g. precipitous changes in the outputs with small changes in the inputs and/or small sample sizes, a more informed (based on prior information) DOE than uniform sampling may be required, although this problem was not encountered with the examples considered in §3.
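The reconstruction step can be sketched as an inverse-distance-weighted average over the nearest embedded training points. This is a minimal sketch; the handling of an exact coincidence with a training point is our choice, not prescribed by the text.

```python
import numpy as np

def reconstruct_output(z_pred, Z_train, U_train, n_neighbors):
    """Map a predicted low-dimensional coordinate vector z_pred back to the
    d-dimensional output space by inverse-distance-weighted interpolation
    over the n_neighbors nearest embedded training points."""
    d_star = np.linalg.norm(Z_train - z_pred, axis=1)   # distances in the embedding
    nn = np.argsort(d_star)[:n_neighbors]
    if d_star[nn[0]] < 1e-12:                           # coincides with a training point
        return U_train[nn[0]]
    w = 1.0 / d_star[nn]                                # weights from the distances
    w /= w.sum()
    return w @ U_train[nn]                              # weighted combination of outputs
```

When the outputs vary approximately linearly over the neighbourhood, the interpolant reproduces them to first order, which is consistent with the local-isometry assumption underlying Isomap.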

## 3. Results and discussion

The methodology developed in the previous section (Iso-GPE) was applied to datasets from three problems: (i) a steady-state parametrized PDE model, (ii) an unsteady parametrized PDE model, and (iii) a parametrized ODE model in time. The results were compared in each case to HH.

*Training and testing.* In each case, the dataset consisted of 500 data points (vectors), with inputs selected using a Sobol sequence DOE [34], which is specifically designed to generate samples as uniformly as possible over the unit hypercube [35]; 400 points were reserved for testing and different numbers of the remaining 100 points were used for training. The relative square errors (the total square error divided by the number of grid points and by the squared mean magnitude of the test points) were used to measure the generalization error. Selection of the design points using DOE is vital for ensuring that an accurate response surface can be constructed from the data [2,36,37]. In the examples below, inputs were sampled uniformly. If prior knowledge of the input–output relationship is available, a more sophisticated approach could be used. Since our focus is on the output space dimensionality, and uniform sampling of the inputs suffices for the examples considered, we refer the reader to [2,35–37] for detailed discussions on DOE issues, which are common to all emulation methods.
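Under one plausible reading of the normalization described above, the error metric can be computed as follows (our naming):

```python
import numpy as np

def relative_square_error(u_pred, u_true):
    """Total square error normalized by the number of grid points and by the
    squared mean magnitude of the test field (one reading of the metric
    described in the text)."""
    d = u_true.size
    scale = np.mean(np.abs(u_true)) ** 2
    return np.sum((u_pred - u_true) ** 2) / (d * scale)
```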

Results are shown for different numbers of components in the two dimensionality reduction methods. In the case of PCA, the first *r* components are the *r* principal components corresponding to the *r* largest eigenvalues of the covariance matrix. In the case of Isomap (or kernel Isomap), the first *r* components are the *r* Isomap coordinates corresponding to the *r* largest eigenvalues of the kernel matrix (the same *r* as in the previous section and in the Iso-GPE algorithm).

### (a) Free convection in porous media

This example concerns subsurface flow in a porous medium driven by density variations that result from temperature changes [38]. The temperature varies from a high value *T*_{h} to a low value *T*_{c} along the outer edges in a two-dimensional domain (*ξ*,*η*)∈*Ω*=[0,10]×[0,10] (in cm), which is filled with water (figure 2). Temperature gradients alter the fluid density and buoyant flow is generated. A Boussinesq buoyancy term in Brinkman’s equation accounts for the lifting force due to thermal expansion. The model equations are Brinkman’s (momentum) equation with the Boussinesq buoyancy term, a heat equation for the temperature and the continuity equation ∇⋅** w**=0, where ** w** is the fluid velocity. *T* represents temperature, *p* is pressure, **g** denotes gravitational acceleration, *ρ* is the fluid density at the reference temperature *T*_{c}, *ϵ* and *κ* are the porosity and permeability of the medium, *β* is the coefficient of volumetric thermal expansion of the fluid, *μ* is the dynamic viscosity, *λ* is the volume-averaged thermal conductivity of the fluid–solid mixture and *C*_{p} is the specific heat capacity of the fluid at constant pressure. The Brinkman equations are subject to no-slip conditions on all boundaries.

The model was solved in COMSOL Multiphysics 4.3b (‘Free convection in porous media’ under the Subsurface Flow (Heat Transfer) module) without modification. The input parameters were ** x**=(*β*,*T*_{h})^{T}∈[10^{−11},10^{−8}]×[40,60] in units of K^{−1} and °C, respectively. A total of 500 numerical experiments were performed using a Sobol sequence for sampling the input space; 400 of the data points were reserved for testing. For each input *x*^{(i)}, *i*=1,…,500, the magnitude |** w**| of the velocity was recorded on a regular 100×100 square spatial grid. The 10 000 points in the two-dimensional spatial domain *Ω* were reordered into vector form (as described in §2*b*) to give data points with *d*=10 000. In the notation of §2*b*, |** w**|=*u*(** q**;** x**). To give some indication of the range of results, figure 3 shows the |** w**| fields in six different cases. The general trend was an increase in the average value of |** w**| with increases in *β* and *T*_{h}.

*Results.* For 40 training points, (Tukey) box plots of the relative errors are shown in figure 4*a*,*b*, up to five components for both HH and Iso-GPE (the horizontal axis is the number of components for the respective method, as previously defined). On each box, the central line is the median, the lower and upper edges signify the first (*Q*_{1}) and third (*Q*_{3}) quartiles, and the lower and upper lines (whiskers) define the errors within 1.5×(*Q*_{3}−*Q*_{1}) of the first and third quartiles. All other points (considered outliers) are plotted individually using a ‘+’ symbol. For this number of training points, the two methods exhibit similar accuracy, with little variation in the errors beyond the use of one component (linear PCA shows that the first component contains over 95% of the modal energy). For higher numbers of training points, the Iso-GPE method is superior, as shown in figure 4*c*,*d*, which correspond to 80 training points. For HH, the MLE for the hyperparameters was very sensitive to the algorithm used. The steepest descent algorithm (used for figure 4) gave the most stable results, while the other methods tested (conjugate gradient, L-BFGS, Hessian-free Newton, and Barzilai–Borwein, all implemented using the Matlab library minFunc, available from http://www.di.ens.fr/mschmidt/Software/minFunc.html) typically led to complete failure, irrespective of the initial guess for the hyperparameters. The MLE problem for Iso-GPE was found to be much more robust, in terms of both the method used and the initial guess. Representative examples in the case of 80 training samples are shown in figure 5. In both cases, five components were used. Iso-GPE performs well in both cases, while HH exhibits notable quantitative and qualitative differences from the test output in both cases.

### (b) Continuously stirred tank reactor

We consider the temperature profile of the start-up phase of a continuously stirred tank reactor (CSTR) used to produce propylene glycol (PrOH) from the reaction of propylene oxide (PrO) with water in the presence of an acid catalyst (H_{2}SO_{4}). The species balances take the standard CSTR form, in which *c*_{i} is the concentration, *ν*_{i} is the stoichiometric coefficient and *c*_{f,i} is the feed stream concentration of species *i*, respectively; *v*_{f} is the volumetric flow rate, and the reaction rate follows an Arrhenius law with activation energy *E* and pre-exponential factor *A* (*R* is the universal gas constant and *T* is the temperature of the reactor). The energy balance is expressed as follows:
where the heat capacity of the reactor contents is computed from the species-specific heat capacities (*C*_{p,i}) and *H* is the enthalpy of reaction. The first term on the rhs represents the heat of reaction. The second term on the rhs represents the heat loss to the heat exchanger, where *F*_{x} is the flow rate, *C*_{p,x} is the specific heat capacity and *T*_{x} is the inlet temperature of the heat exchanger medium. *U* (J s^{−1} K^{−1}) is the heat exchange parameter. The third term on the rhs is the enthalpy change owing to the flow of species through the reactor. The molar enthalpy of species *i* is given by *h*_{i}=*C*_{p,i}(*T*−*T*_{ref})+*h*_{i,ref}, in which *h*_{i,ref} is the standard heat of formation of *i* at the reference temperature *T*_{ref}. The feed stream molar enthalpies *h*_{f,i} are calculated similarly. The model is completed by initial values for *T* and *c*_{i}.

Solutions were obtained using the ‘cstr startup’ model in the ‘Chemical Reaction Engineering Module’ of COMSOL Multiphysics 4.3b without modification. A total of 500 numerical experiments were performed using a Sobol sequence for sampling the input space; 400 of the data points were reserved for testing. The inputs were defined as ** x**=(*T*_{0},*c*_{0},*U*)^{T}∈[297,360]×[100,1500]×[1000,10 000], where *T*_{0} (K) is the initial temperature and *c*_{0} (mol m^{−3}) is the initial concentration of PrO. For each input *x*^{(i)}, *i*=1,…,500, the temperature was recorded at intervals of 14 s up to 7000 s, yielding 501 values for each of the 500 cases. The 501 values were reordered into vector form (as described in §2*b*) to give data points with *d*=501. In the notation of §2*b*, *T*=*u*(*t*;** x**).
*Results.* Figure 6 shows the boxplots of the relative errors for HH and Iso-GPE up to five components for both 40 and 80 training points. The best performance is seen with Iso-GPE. In fact, HH failed to provide satisfactory predictions for any number of training points: despite appearing to give results similar to Iso-GPE for 40 training points, based on figure 6*a*,*b*, HH exhibited spurious oscillations. A clear difference in the errors is seen in figure 6*c*,*d*, which correspond to 80 training points. Examples of the predictions for each method are shown in figure 7 (for 80 training points using five components), demonstrating the failure of HH. Iso-GPE provided a good fit to the trends and, although the quantitative fit was only reasonably good, the method did not suffer from spurious oscillations. For higher numbers of training points up to 140, Iso-GPE exhibited slight improvements in the accuracy of the predictions, while HH continued to fail.

### (c) Metal melting front

A square cavity, (*ξ*,*η*)∈*Ω*=[0,10]×[0,10] (in units of cm), containing both solid and liquid is submitted to a temperature difference between the left and right boundaries. The fluid and solid phases are treated as separate domains sharing a moving melting front. The position of the melting front is calculated according to a Stefan condition [39]. The liquid domain is governed by the Navier–Stokes equations using a Boussinesq approximation
where ** w** is the liquid velocity, *T*_{l} (*T*_{s}) is the liquid (solid) temperature, *μ* is the dynamic viscosity, *λ* is the (common) thermal conductivity of the liquid and solid, *C*_{p} is the heat capacity of the liquid at constant pressure, **g** denotes gravitational acceleration, *ρ*_{0} is the reference density of the fluid, *ρ*=*ρ*_{0}*β*(*T*_{l}−*T*_{f}) is the linearized density variation appearing in the buoyancy term, *β* is the metal’s coefficient of thermal expansion, and *T*_{f} denotes the fusion temperature of the solid. The liquid–solid front is defined by *ξ*=*a*(*η*,*t*), along which *T*=*T*_{f}. An energy balance at the melting front is given by a Stefan condition [39], in which *h*_{f} is the latent heat of fusion. A no-slip condition is applied at the other boundaries.

The model was solved in COMSOL Multiphysics 4.3b (‘Tin Melting Front’ under the Heat Transfer (Phase change) module) without modification. A total of 100 numerical experiments were performed by selecting parameter pairs (*h*_{f}, *λ*)_{i}, *i*=1,…,100, using a Sobol sequence (the units are kJ kg^{−1} and W m^{−1} K^{−1}, respectively). For each parameter pair, snapshots were recorded at times *t*_{k}=50+50(*k*−1), *k*=1,…,5 (in seconds). For each of the 500 snapshots, the magnitude |**w**| of the velocity was recorded on a regular 100×100 square spatial grid. The 10 000 points in the two-dimensional snapshots were reordered into vector form (as described in §2*b*) to give data points of dimension *d*=10 000. In the notation of §2*b*, |**w**|=*u*(**q**, *t*; **x**). This gave a total of 500 data points (400 reserved for testing), with corresponding inputs **x**_{i,k}, *i*=1,…,100, *k*=1,…,5.
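The snapshot vectorization used here can be sketched as follows: each 100×100 field of |**w**| is flattened into a *d*=10 000 vector, and the snapshot time is appended to the sampled parameters to form the emulator input. The field values and parameter values below are random placeholders (the real ones come from the simulator and the Sobol design).

```python
import numpy as np

n_params, n_times, grid = 100, 5, (100, 100)
times = 50.0 + 50.0 * np.arange(n_times)            # 50, 100, ..., 250 s

rng = np.random.default_rng(0)
snapshots = rng.random((n_params, n_times, *grid))  # placeholder |w| fields
params = rng.random((n_params, 2))                  # placeholder (h_f, lambda)

# Outputs: one row per snapshot, d = 10 000 columns.
Y = snapshots.reshape(n_params * n_times, -1)       # shape (500, 10000)

# Inputs: (parameter pair, time) per snapshot.
X = np.array([[*params[i], times[k]]
              for i in range(n_params) for k in range(n_times)])  # (500, 3)
```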

*Results.* For 40 training points, boxplots of the relative square errors are shown in figure 8*a*,*b*, up to eight components for both Iso-GPE and HH. Iso-GPE clearly exhibits the lower errors of the two methods. The differences in the errors are significant, as demonstrated by the two representative examples shown in figure 9, corresponding to cases 145 and 301 of the 400 reserved for testing. In both cases, five components were used. The performance of Iso-GPE in the first example is very good, while it performs poorly in the second. HH is notably worse than Iso-GPE, particularly in the second test, in which the fit is qualitatively poor and negative values are predicted. For 80 training examples, the boxplots of the relative square errors are shown in figure 8*c*,*d*, again up to eight components for both Iso-GPE and HH. Both methods show improvement. Iso-GPE exhibited very good performance with 80 training points in almost all cases, while the fit with HH was still poor in many of the test cases. In fact, significant improvements were seen with Iso-GPE using just 60 training points.
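A natural definition of the per-case relative square error reported in these comparisons is the squared prediction error normalized by the squared magnitude of the true field; this is one common choice and is not necessarily the paper's exact normalization.

```python
import numpy as np

def relative_square_error(y_true, y_pred):
    # ||y_pred - y_true||^2 / ||y_true||^2 for one vectorized output field.
    return np.sum((y_pred - y_true) ** 2) / np.sum(y_true ** 2)
```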

### (d) Computational details

The results of HH were highly dependent on the MLE algorithm and the initial guess for the hyperparameters. Iso-GPE also showed sensitivity to the MLE algorithm and the initial guess, but to a much lesser degree; reasonable results were obtained with all algorithms tested. At present, we can provide no clear reasons for these observations, which will require further investigation. The computational times for training the algorithms using 80 training points and nine components in the examples considered (on a 2.7 GHz Intel Core i7 laptop with 16 GB RAM) were all below 100 s when using the interior point method for the MLE (the most computationally expensive). These times are determined primarily by the solution of the MLE problem, and thus vary from method to method. The main limitation of the MLE (or a fully Bayesian approach) is the inversion of the covariance matrix *C* and the calculation of |*C*| in equation (2.5), both of which are *O*(*m*^{3}) calculations, where *m* is the number of samples. As the input space dimensionality grows, *m* will generally increase. The examples we considered have relatively low numbers of inputs *L*, but much higher numbers can be used provided the number of training points does not grow too large. A main consideration will be the DOE [2,35–37], as discussed earlier. Since the limitations on the input space dimensionality and sample number are common to all GP emulation methods (including univariate), a more detailed discussion is beyond the scope of this work, which is focused on reducing the dimensionality of the output space.
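The *O*(*m*^{3}) cost discussed above can be made concrete with a minimal sketch of the GP negative log marginal likelihood. A Cholesky factorization supplies both *C*^{−1}**y** and log|*C*| in one *O*(*m*^{3}) step; the squared-exponential covariance below is illustrative, not necessarily the paper's covariance function or equation (2.5) verbatim.

```python
import numpy as np

def gp_neg_log_marginal_likelihood(X, y, length_scale, signal_var, noise_var):
    """Negative log marginal likelihood of a zero-mean GP with a
    squared-exponential covariance (illustrative sketch)."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    C = signal_var * np.exp(-0.5 * sq_dists / length_scale**2)
    C += noise_var * np.eye(len(X))
    L = np.linalg.cholesky(C)                  # the O(m^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # C^{-1} y
    log_det = 2.0 * np.log(np.diag(L)).sum()   # log|C| from the factor
    m = len(X)
    return 0.5 * (y @ alpha + log_det + m * np.log(2.0 * np.pi))
```

In practice this function would be passed to a numerical optimizer (e.g. `scipy.optimize.minimize`) to estimate the hyperparameters, with one Cholesky factorization per objective evaluation.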

The method used to determine the neighbouring points for Isomap (step 1 in Algorithms 1 and 2 of §2*c*(ii)) was found not to influence the results significantly. The results discussed above were generated using the neighbourhood number method (*N*=10 in all cases). Increasing the neighbourhood number above 10 did not notably alter the results. Several parameters and choices in the Iso-GPE algorithm will need to be optimized on a case-by-case basis. These include the covariance function, the MLE algorithm and the neighbourhood number (or ball radius) for Isomap.
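The neighbourhood-number reduction step can be sketched with scikit-learn's Isomap (an assumed implementation; the paper's Algorithms 1 and 2 also include the GP regression and back-mapping stages, which are not shown). Here the *m*×*d* matrix of vectorized output fields is replaced by synthetic points on a curve.

```python
import numpy as np
from sklearn.manifold import Isomap

# Synthetic stand-in for the m x d matrix of vectorized output fields:
# 200 points on a helix embedded in 3 dimensions.
rng = np.random.default_rng(0)
theta = np.sort(rng.uniform(0.0, 3.0 * np.pi, 200))
Y = np.column_stack([np.cos(theta), np.sin(theta), theta])   # shape (200, 3)

# Step 1 uses the N nearest neighbours (N = 10, as in the results above);
# the embedding dimension plays the role of the number of components.
embedding = Isomap(n_neighbors=10, n_components=2)
Z = embedding.fit_transform(Y)               # reduced coordinates, (200, 2)
```

A GP emulator would then be trained from the inputs **x** to the reduced coordinates, with predictions mapped back to the ambient output space.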

## 4. Concluding remarks

An efficient approach for constructing GP emulators of parameter-dependent spatial and spatio-temporal fields was developed by using nonlinear dimensionality reduction to reduce the dimensionality of the output space. The full, naïve GPE [15] would be computationally prohibitive for such datasets, while other methods rely on restrictive assumptions regarding the correlation structure [16,18]. In the present approach, correlation information in the output data is exploited to reduce the dimensionality of the permissible output space (after implicitly mapping to a feature space). The method also extends that of Higdon *et al.* [21] (HH), in which the GP emulator is based on linear PCA. The generalization to nonlinear dimensionality reduction is significant because many response surfaces will not admit a low-dimensional approximation using a linear subspace of the ambient space. The examples demonstrate that the Iso-GPE algorithm is generally more accurate than HH, with similar time costs. Although the motivation was parametrized nonlinear PDE models, the method developed can be applied to any problem involving vector-valued (steady-state or time-dependent) targets and vector-valued inputs. It can be used to dramatically reduce the time costs of numerical algorithms for uncertainty quantification, optimal design, control and inverse parameter estimation (any application that involves repeated solutions of high-fidelity computational models).

## Funding statement

The authors would like to acknowledge the Chinese Scholarship Council for a scholarship provided to W.X. to fund his doctoral studies.

- Received September 15, 2014.
- Accepted December 1, 2014.

- © 2014 The Author(s). Published by the Royal Society. All rights reserved.