Geostatistics of extremes

We describe a prototype approach to flexible modelling for maxima observed at sites in a spatial domain, based on fitting of max-stable processes derived from underlying Gaussian random fields. The models we propose have generalized extreme-value marginal distributions throughout the spatial domain, consistent with statistical theory for maxima in simpler cases, and can incorporate both geostatistical correlation functions and random set components. Parameter estimation and fitting are performed through composite likelihood inference applied to observations from pairs of sites, with occurrence times of maxima taken into account if desired, and competing models are compared using appropriate information criteria. Diagnostics for lack of model fit are based on maxima from groups of sites. The approach is illustrated using annual maximum temperatures in Switzerland, with risk analysis proposed using simulations from the fitted max-stable model. Drawbacks and possible developments of the approach are discussed.


(a) Spatial extremes
The prospect of climatic change and its impacts have brought spatial statistics of extreme events into sharper focus. For example, the European summer heatwave of 2003 cost the lives of many vulnerable individuals, and affected crops and water supply. Heatwaves and heavy rainfall are predicted to become more frequent in some regions of a warming Europe (Beniston et al. 2007), and it is important to have a mathematically sound and statistically efficient basis for modelling them and assessing their possible consequences. This paper represents a step towards this goal, based on data from a number of sites. The key idea is to use a so-called pairwise likelihood to fit models for spatial extremal processes, for which standard likelihood or Bayesian inference seem to be out of reach. By incorporating elements of classical Gaussian geostatistics, we are able to find flexible formulations for spatial extremes that extend univariate extremal models in a natural way. *Author for correspondence (anthony.davison@epfl.ch). Statistics of extremes have over the past two decades developed remarkably, with books now available at various levels of sophistication and a journal, Extremes, devoted to recent research. The underlying mathematical basis is now thoroughly established (Leadbetter et al. 1983;Resnick 1987;Embrechts et al. 1997;de Haan & Ferreira 2006) and statistical tools and methods for use with a single time series of data, or with a few series, are well developed (Coles 2001;Beirlant et al. 2004) and widely used. The main approaches to univariate extremal analysis are the fitting of the generalized extreme-value distribution to maxima, and of the generalized Pareto distribution or a related point process model to exceedances over high thresholds, though many elaborations appear in applications.
Multi-variate extreme-value models, the springboard for modelling of spatial aspects of rare events, are surveyed by Beirlant et al. (2004). Key mathematical elements in this classical theory, which is based on the limiting behaviour of tails of distributions, are regular variation and point processes, and these also play a central role in the theory of spatial extremes, as summarized in ch. 9 of de Haan & Ferreira (2006). There are many published applications of multi-variate extremes, but the literature on applications of the spatial theory is limited owing to a lack of flexible statistical models and corresponding inferential tools.
Four main approaches to modelling spatial extremes have been proposed. The first of these employs underlying latent processes, conditional on which standard extremal models are applied (Coles & Casson 1998;Casson & Coles 1999;Cooley et al. 2006Cooley et al. , 2007Fawcett & Walshaw 2006;Sang & Gelfand 2009a). This approach has the advantages of fitting naturally into Bayesian computational set-ups using Markov chain Monte Carlo simulation, and being applicable to large datasets, but because most such models postulate independence of extremes conditional on the latent process, the spatial modelling may yield unrealistic estimates of certain risk measures because the spatial clustering of rare events is not properly accounted for. An exception to this is Sang & Gelfand (2009b), who use a Gaussian copula model to mitigate the independence assumption, but this does not admit a max-stable model for extremes (see the next paragraph). A further difficulty with such models is that the marginal distributions thus generated are mixtures of standard extremal distributions and so do not fall naturally into the usual extremal paradigm.  0 2000 altitude (m) 4000 Oeschberg-Koppigen (483) Engelberg (1035) Bern-Liebefeld (565) Chateau d'Oex (985) Montreux-Clarens (405) Jungfraujoch (3580) Locarno-Monti (366) Lugano (273) Montana (1508) Gd-St-Bernard (2472) Neuchatel (485) Zurich-MeteoSchweiz (556) Santis (2490) Bad Ragaz (496) Davos-Dorf (1590) Arosa ( applied to the modelling of rainfall data by Smith & Stephenson (2009), Padoan et al. (2010) and Davison et al. (in press); rainfall modelling with max-stable processes was also described by Buishand et al. (2008), and modelling of maxima of annual snow heights is described by Blanchet & Davison (in press). The approach we describe has the advantages of a full integration of classical extremal results, so that the marginal distributions for maxima are appropriately modelled, and a treatment of spatial dependence suitable for rare events. Before describing the methodological content of the paper, we give an example of the type of data and application that our work is intended to address.

(b) Data
Below, we shall consider annual maximum temperatures from the 17 sites in Switzerland shown in figure 1. The data are abstracted from daily maximum temperature time series for the years 1961-2005, part of which are shown in figure 2 as temperature anomalies, obtained as residuals after subtraction of the site-wise average summer temperatures for the five years shown. There are evidently very strong correlations between the different series: many of the maxima occur simultaneously or as part of the same sequence of a few successive hot days. If we let Y (x) denote the temperature at site x of the set X , then the quantity of ultimate interest might be an integral such as where r(x) might represent a population or a crop at risk at site x if Y (x) exceeds some level y danger , and {u} + = max(u, 0). Of course, many other functions might be of interest; the key point is that joint spatial modelling of Y (x) is required for useful inference on R X . This entails extrapolation from the points at which data are available to the whole of X , and estimation of the joint distribution of {Y (x)} x∈X . At each individual site x ∈ X , standard arguments imply that if a limiting law for the maximum exists, then it must be a generalized extreme-value distribution where h and t are, respectively, real location and positive scale parameters and the shape parameter x determines the weight of the upper tail of the density. These parameters may depend on the site x or on covariates indexed by x. For example, temperature depends strongly on altitude a(x). In many applications, it would be necessary to take semi-parametric formulations for dependence on x, but as we have data at relatively few sites, we use parametric forms. Davison & Ramesh (2000), Hall & Tajvidi (2000), Pauli & Coles (2001), Chavez-Demoulin & Davison (2005 and Padoan & Wand (2008) describe approaches to more flexible modelling, applied by Ramesh & Davison (2002) and Butler et al. (2007).
Whatever the chosen regression structure, we shall suppose for simplicity of exposition that it has been used to transform the maxima to have the unit Fréchet distribution, exp(−1/z), for z > 0. IfĜ d (y) denotes the distribution at site x d , estimated from the annual maxima, then the variables Z dj = −1/ logĜ d (Y dj ), for the sites d = 1, . . . , 17 and the years j = 1, . . . , n, should possess approximately the unit Fréchet distribution. The quality of the approximation may be verified by quantile-quantile plots.
(c) Layout The following section brings together elements from the statistics of extremes and from Gaussian geostatistics, and uses them to construct flexible models for spatial extremes for which pairwise joint density functions are available. Section 3 outlines how these may be used to perform exploratory analysis, and how information on the times of the maxima may be incorporated. It appears impossible to perform full likelihood inference for such models, and in §4, we discuss how composite marginal likelihood inference may be used for parameter estimation and model comparison. Section 5 describes its application to the Swiss temperature data, and some discussion is given in §6.

(a) Max-stable processes
The generalization of the classical multi-variate extreme-value distributions to the spatial case is a max-stable process, discussion of which is simplified under the assumption that all of its univariate marginal distributions are unit Fréchet. There is no loss of generality in making this assumption, which may be achieved by the marginal transformation described in §1b.
A max-stable process {Z (x)} for x ∈ X ⊂ R 2 with unit Fréchet margins, also called a simple max-stable process, is a random field, all of whose marginal distributions satisfy the property of max-stability. That is, pr{Z (x) ≤ z} = exp(−1/z) for z > 0 and every x ∈ X , and if D = {x 1 , . . . , x D } is a finite disjoint subset of X , then for k = 1, 2, . . . (de Haan 1984), implying that the exponent measure function (Resnick 1987, ch. 5) say, is homogeneous of order −1 and that for each d ∈ {1, . . . , D}, Max-stable processes provide the natural generalization of models for the limiting distributions of multi-variate extremes to the process setting, in that the key property of max-stability is satisfied for all finite-dimensional marginal distributions, and the appropriate univariate marginal distributions, in this case unit Fréchet, are obtained. There are several canonical representations of such processes (Schlather 2002;de Haan & Ferreira 2006, ch. 9). For our purposes, the most useful has two elements, a Poisson process P of intensity ds/s 2 on R + and a set of independent random processes {W s (x) : x ∈ X , s ∈ R + }, the latter being replicates of a non-negative process {W (x) : x ∈ X } with measure n on W = R X + that satisfies E{W (x)} = 1 for all x. We then define (2.2) Following Schlather (2002), it is straightforward to verify that Z (x) is maxstable, with unit Fréchet marginal distributions, and that for suitable functions z(x) on X , The very general expression (2.2) suggests several specific types of model, and in particular 'random storm' and 'random process' formulations. For the first, suppose that X = R 2 and that W s (x) = f (x − X s ), where f is a density function on X , and X s is a point of a Poisson process of unit rate in X . Then, if we regard sW s (x) as the impact at x of a storm of magnitude s, centred at X s , and of shape f , we interpret Z (x) as the impact of the largest storm experienced at x. This formulation has been exploited by R. L. Smith (1990, unpublished data), Coles (1993), Coles & Tawn (1996), Coles & Walshaw (1994), Smith & Stephenson (2009) and Padoan et al. (2010) for the modelling of rainfall and maximum wind speeds. R. L. Smith (1990, unpublished data) obtained the marginal distribution for pairs Z (x 1 ), Z (x 2 ) with f taken to be bivariate normal, and de Haan & Pereira (2006) extended this to Student t and Laplace densities.
A second type of max-stable model arises on taking W (x) to be a stationary random process on X . When modelling heatwaves, for example, we might take W s (x) to represent the temperature over X on the sth independent occasion of severity s, so that Z (x) represents the overall maximum at x. We discuss this in §2c.
Difficulties with parametric inference on max-stable processes are immediately apparent on contemplating equation (2.3). In practice, data will be available at D = {x 1 , . . . , x D } ⊂ X , and the joint distribution function for maxima observed on this set will be given by and  (Cameron 1994). Direct computation thus seems infeasible. Second, even if a suitable algorithm for computation of the necessary derivatives was available, the complexity of analytical computation of equation (2.5) rapidly mounts with D: it is feasible for a variety of simple random processes W (x) when D = 2, but is possible only for special processes for larger D (Genton et al. 2011). Thus, standard likelihood-based inference appears to be out of reach.

(b) Extremal coefficients
A natural way to measure dependence among spatial maxima stems from considering the distribution of the largest value that might be observed on X . If this distribution is close to unit Fréchet, then the maxima observed on X will be close to fully dependent, whereas lower degrees of dependence would yield stochastically larger distributions. To see the implications of this, we set z(x) ≡ z in equation (2.3) and thus obtain is called the extremal coefficient corresponding to the set X . It is more useful to consider the extremal coefficient q D for a finite subset D = {x 1 , . . . , x D } of X , obtained by taking z(x) to equal z for x ∈ D and infinity elsewhere; thus, If q D = 1, then the maxima at D are perfectly dependent, whereas if q D = D, they are independent. Thus, q D may be interpreted as the effective number of independent Frechét variables in the set indexed by D. Schlather & Tawn (2003) consider the constraints that must be satisfied by the extremal coefficients corresponding to different subsets of D, and show how the coefficients may be estimated both with and without these constraints. Extremal coefficients are useful summaries of the joint tail behaviour of multi-variate extreme-value distributions, though they do not fully characterize dependence. We let q(h) denote the extremal coefficient of Z (x), Z (x + h); if the field is isotropic, then this depends only on the length of h. If replicate data are available, then the fact that max{Z (x), Z (x + h)} has distribution exp{−q(h)/z} enables maximum-likelihood estimation of q(h) for each distinct pair of spatial sites (Schlather & Tawn 2003, §4.2). We use this below to check the fit of our models.
There is a close relation between the extremal coefficient and the madogram proposed by Poncet et al. (2006) and further developed by Naveau et al. (2009), an extreme-value analogue of the variogram (Cressie 1993, p. 58). The construction of a madogram requires a relatively large number of sites, and we do not pursue it here.
(c) Gaussian process models A special case of equation (2.2) is to suppose, following Schlather (2002), that W s (x) = max{0, (2p) 1/2 3 s (x)} is proportional to the non-negative part of a stationary Gaussian process {3 s (x)} with zero mean, unit variance and correlation function r(x). In this case, the process {Z (x)} is called an extremal Gaussian process, and its marginal bivariate distributions are determined by the exponent measure Moreover, as a positive definite isotropic correlation function on R 2 can given correlations no smaller than −0.403, it follows that q(h) ≤ 1.838 for any h in the case of a spatial process. Thus, it is impossible to produce independent extremes using such a process, irrespective of the distance of the sites. This drawback may be overcome by taking where I B is the indicator function of a compact random set B ⊂ X , of which the {B s } are independent replicates, and the {X s } are the points of a homogeneous Poisson process of unit rate on X , independent of the 3 s . Then modulo minor edge effects, if c −1 = E[max{W (x), 0}]E(|B|), then the process (2.2) is again max-stable on X . If W s (x) is an extremal Gaussian process, as above, then equation (2.8) becomes can take any value in the interval [1, 2]. Independent extremes can be generated if the distribution of B is chosen so that B ∩ (h + B) = ∅ with probability one for large enough h, as in that case a(h) = 0. Thus, equation (2.9) has the appealing interpretation that the effects of the event corresponding to sW s (x) are felt only over the set B + X s . In §2d,e, we discuss some possible choices of r(h) and a(h). Below, we merely sketch some generic possibilities for modelling: more plausible forms will be prompted by consideration of particular applications.

(d) Geostatistics
Geostatistics is based largely on the theory of Gaussian random processes (see Cressie 1993;Stein 1999;Wackernagel 2003;Banerjee et al. 2004;Schabenberger & Gotway 2005;Diggle & Ribeiro 2007;Cressie & Wikle 2011). For a stationary Gaussian random field {S (x)} defined on x ∈ R 2 and having standard normal marginal distributions, the joint distribution of the field at any subset of points D = {x 1 , . . . , x D } must be multi-variate normal. The key element is the correlation function r(x 2 − x 1 ) = cov{S (x 1 ), S (x 2 )}, which is constrained to be positive definite. If the field is also isotropic, then the correlation depends only on the distance h = x 2 − x 1 , and for the rest of this subsection we assume this and write r( Standard families of correlation functions include the powered exponential family the Whittle-Matérn family in which K k denotes the modified Bessel function of order k, and the Cauchy family (2.14) In all three families, l represents a scale parameter with the dimensions of distance, and in the first two families, k is a shape parameter that controls the properties of the random field and in particular the roughness of its realizations. Visually smooth realizations arise only when r(h) is sufficiently well behaved at the origin. Rather than trying to estimate the shape parameter based on limited data, it may be more practicable to choose among a small set of values of k that represent qualitatively different behaviour of the Gaussian process. The Matérn correlation family is the more flexible and is strongly advocated by Stein (1999), but in practice equations (2.12)-(2.14) may be hard to distinguish. The correlation functions above impose r(h) → 1 as h → 0, but this is unrealistic in many applications. One way to relax this condition is to include a so-called nugget effect by adding to S (x) an independent Gaussian white noise random field. The resulting correlation function is where d(h) is the Kronecker delta function, r(h) is a correlation function and the nugget effect 1 − n is the proportion of the variance stemming from the local variation represented by {e(x)}.
As these correlation functions have r(h) > 0, the extremal coefficient based on equation (2.8) satisfies 1 ≤ q(h) < 1.71; as mentioned previously, independence of the extremal Gaussian processes resulting from using equations (2.12) and (2.13), either as such or in equation (2.15), cannot be achieved.
The simplest departure from isotropy, geometric anisotropy, presumes that there is an invertible matrix V such that the process Z (Vx) is isotropic, and corresponds to an affine transformation of the plane containing the sites.

(e) The set B
Detailed modelling of the random set B that appears in equation (2.9) will require data at many sites and appears to be difficult. In practice, it may be preferable to use flexible parametric shapes that generate simple forms for the parameter a(h) appearing in equation (2.10). An obvious possibility is to take B to be a disc of fixed radius r. Then, the area of intersection of B and h + B is where |h| is the length of the vector h. This function is close to linear in |h| over much of its range, and a crude approximation is to replace it by = {1 − |h|/(2r)} + , a model with one unknown parameter r that ensures the independence of extremes at sites for which |h| > 2r.
It may be thought to be more realistic to take sets of random size. A simple possibility is to take the above model, but to suppose that the radius of the disc is random with a generalized Pareto density function Thus, g is positive only for 0 < r < r max , where r max = ∞ if g ≥ 0 and r max = −d/g otherwise. For g > 0, the density g is heavy-tailed, for g = 0, it is exponential, and for g < 0, it has a finite upper support point r max , with g = −1 corresponding to the uniform distribution, and discs of constant radius r max appearing as g → −∞.
With the linear approximation to the overlap area mentioned above and with this choice of g, after some further integration, we find that provided g < 1/2, (2.17) When g ≥ 1/2, the expectation is infinite, corresponding to a(h) ≡ 1. A natural generalization is to suppose that the set B is the interior of an ellipse having area pr 2 and eccentricity e, and whose major axis subtends an angle u with the horizontal axis. If |h| and 4 are the polar coordinates of h, then the elliptical sets B are accounted for by replacing |h| in equation (2.17) by This provides a four-parameter family of sets, whose shapes are determined by e and u, and the distribution of whose sizes is determined by g and d.

Pairwise analysis (a) Direct analysis
Before discussing the fitting of the models described above to data from a number of sites x 1 , . . . , x D , we consider exploratory data analysis based on data from pairs of sites.
We suppose that the data have been transformed to have unit Fréchet marginal distributions. We suppose that data pairs (z 1j , z 2j ) (j = 1, . . . , n) are available, and take them to be independent realizations of variables (Z (x 1 ), Z (x 2 )) from a maxstable random field. In an environmental context, such data might be annual maxima recorded at the sites x 1 and x 2 in n successive years, and for ease of exposition, we shall use corresponding language.
The marginal pairwise density of variables Z 1 , Z 2 is obtained from equation (2.4) as (3.1) Given a parametric form for V such as equations (2.8) or (2.10), the log likelihood is then readily computed, and standard methods may be used to obtain estimates and confidence intervals (CIs). Figure 3a,b shows the results of such fits for the annual maximum temperature data described in §2b. The parameter estimates when model (2.10) is fitted to the standardized Fréchet observations from all 136 distinct pairs of sites are plotted against the inter-site distances, with pairs involving the two southern sites Locarno-Monti and Lugano in canton Ticino, and the highest, the Jungfraujoch, indicated differently. The approximate 95% CIs are obtained using profile likelihood. There is a clear general decline of r with distance, but the values of a mostly equal unity; the latter is not surprising, as hot days tend to result from large-scale climatic events. Aberrant pairs involving the Jungfraujoch and the sites in Ticino are visible in both graphs, though the uncertainty is large.

(b) Inclusion of event times
The approach to estimation of the parameters for each pair sketched in §3a and based on forming a log likelihood by adding contribution (3.1) may be improved if the times of maxima are available. In this case, arguments of Stephenson & Tawn (2005) show that if the maxima at the two sites are known to occur simultaneously, then the likelihood contribution should be whereas if they are known to occur at different times, then equation (3.2) should be replaced by f sep (z 1 , z 2 ) = vV (z 1 , z 2 ) vz 1 vV (z 1 , z 2 ) vz 2 exp{−V (z 1 , z 2 )}, z 1 , z 2 > 0. When the times are used, there is a general shrinking of the CIs, and it becomes clearer that the Ticino sites and Jungfraujoch are somewhat disconnected from the others, though the CIs remain too wide to draw firm conclusions. However, some degree of disconnection is entirely plausible on general grounds. At an altitude of 3580 m, the Jungfraujoch is above the snow line all year round, relatively little influenced by human activities, and more than a kilometre higher than the next highest site, Santis. The Ticino is more strongly influenced by meterological conditions in northern Italy and the Mediterranean basin than by those within and north of the Alps.

(c) Model checking
A check on the quality of a parametric model such as equation (2.10) is to compare its fitted extremal coefficients, obtained by replacing a and r in equation (2.11) by their maximum-likelihood estimates, with the naive estimator of Schlather & Tawn (2003) or the madogram-based estimator of Poncet et al.  (2006). The latter two are not based on a specific parametric model. Figure 4, which shows these comparisons for the maximum temperature data, reveals nothing to cast doubt on the model (2.10).
One check on the fit of the pairwise model stems from noting from equation (2.6) that the maximum max(Z 1 , Z 2 ) of two related variables, each with the unit Fréchet marginal distribution, has a scaled Fréchet distribution exp(−q 1,2 /z), for z > 0, where q 1,2 = V (1, 1). This suggests inspecting an exponential quantile plot of the sampleq 1,2 / max(z 1j , z 2j ) (j = 1, . . . , n), wherê q 1,2 =V (1, 1) is an estimate of q 1,2 . We have found that the maximum-likelihood estimate based on the data pairs (Z 1 , Z 2 ) does not work well; better results are obtained with the Schlather-Tawn or madogram-based estimators mentioned in the previous paragraph. In the present case, such plots reveal little untoward.

Composite marginal likelihood (a) Introduction
Statistical inference for parametric models is ideally performed using the likelihood function, but for multi-variate extremal models, this is generally difficult for the reasons mentioned at the end of §2a. When the pairwise marginal distributions are available and identify the model parameters, it therefore seems natural to use a composite likelihood function (Lindsay 1988;Cox & Reid 2004;Varin et al. 2011).
Suppose that the distribution of the responses depends on a parameter vector w, let z Y = {z j : j ∈ Y} denote a subset of the overall data, and suppose that the responses may be split into mutually independent subsets z Y 1 , . . . , z Y n . Also, let Y j,1 , . . . , Y j,m j denote distinct subsets of Y j . In the temperature data of §1b, the Y j might denote maxima for different summers j = 1, . . . , n, which may be supposed independent, and the Y j,1 , . . . , Y j,m j might denote all distinct pairs of sites for which data are available in summer j. The corresponding composite marginal log likelihood is (4.1) in a natural notation. Under regularity conditions like those needed for the limiting normality of the standard maximum-likelihood estimator, and if w is identifiable from the marginal densities contributing to Y , the maximum pairwise likelihood estimatorw has a limiting normal distribution as n → ∞, with mean w and covariance matrix of sandwich form estimable by J (w) −1 K (w)J (w) −1 , where are the observed information and squared score statistic corresponding to Y .
Corresponding results for comparison of nested models fitted by maximization of equation (4.1) involve sums of independent c 2 variables weighted by eigenvalues of matrices derived from K (w) and J (w) (Kent 1982), and so are awkward to apply in practice. In a similar situation, Chandler & Bate (2007) suggest modifications which ensure that likelihood ratio statistics retain their usual large sample distributions. The marginal composite likelihood (4.1) is useful only for inference on model parameters that are estimable from the terms appearing therein. Thus, in the case of a pairwise likelihood, in which each of the subsets Y j,i corresponds to pairs of observations, parameters that appear only in the joint densities for triples, quartets, etc., cannot be estimated. This is not a difficulty with the models discussed below or in other spatial contexts based on Gaussian distributions, for which the distribution is specified in terms of its first two moments, but it could be a drawback in other contexts; in that case, the use of contributions from, e.g. triplets of observations, might be indicated. There is also a potential gain of efficiency in using higher order terms (Genton et al. 2011). Varin & Vidoni (2005) and Varin (2008) discuss model selection based on composite likelihoods. It is appropriate to select models that minimize the composite likelihood information criterion, CLIC = −2[ Y (w) + tr{K (w)J (w) −1 }], an adaptation of the Takeuchi information criterion (Takeuchi 1976) to the present setting. One minor difficulty with this, and indeed in comparing models based on Y , is that even for moderate datasets, the values of Y tend to be very large because of the large numbers of terms summed in equation (4.1). It is therefore helpful to replace Y by * Y = c Y , where the scaling constant c is chosen to give the correct log likelihood, at least approximately, for independent observations, and to use the corresponding scaled information criterion, CLIC * .
Composite likelihood may be seen as the basis for estimating the equation v Y (w)/vw = 0, but has the advantage over arbitrarily defined estimating equations of stemming from an objective function. As Varin et al. (2011) describe, composite likelihood is increasingly being used in settings where the likelihood itself is computationally intractable. In cases where the composite maximumlikelihood estimatorw can be compared with the usual maximum-likelihood estimator, the former is often surprisingly efficient, though if n is fixed, care must be taken not to allow m j to become too large; if so, the efficiency can drop, and in extreme cases,w may become inconsistent (Cox & Reid 2004). In our applications, the number n of independent blocks is typically fairly large, so even if the m j are large, inconsistency should not usually be an issue.

(b) Efficiency comparison
In the present context, it is natural to use the bivariate marginal densities corresponding to equations (2.8), (2.10), (3.2) and (3.3), thus forming the pairwise log likelihood in which the Y j,i comprise all distinct pairs of sites. One natural concern is the efficiency of the maximum pairwise likelihood estimatorw, but it is impossible to assess this directly, because the full log likelihood is unavailable for such models. Instead, we simulated data from a D-dimensional logistic model for multi-variate extremes (Tawn 1988 Table 1 shows the efficiencies for pairwise likelihood relative to full likelihood estimation for the four parameters, as the dimension D varies from 3 to 40. When there is appreciable dependence, the loss of information seems to be slight, but even for weak dependence, it is not catastrophic. We conclude that pairwise likelihood estimation is a reasonable option in the present context; in any case, there seems to be no other.

(c) Computation
Composite marginal likelihood fitting of certain of the geostatistical models for extremes described above and of the Gaussian random storm model of R. L. Smith (1990, unpublished data) may be performed using Mathieu Ribatet's R package Table 1. Relative efficiency (%) of the pairwise likelihood estimator based on 500 datasets of size 200 simulated from a symmetric logistic distribution of dimension D with dependence parameter a = 0.1, 0.4, 0.7, 0.9. The parameters h, t and x = 1 are the location, scale and shape parameters of the common marginal distribution. SPATIALEXTREMES, which allows flexible formulations of the location, scale and shape parameters of the marginal generalized extreme-value distributions, and provides appropriate plots and diagnostics.

Temperature data (a) Model fitting
We now discuss the fitting of model (2.8) to the temperature data described in §1b. The pairwise analysis in §3a suggests that the sites at Locarno-Monti, Lugano and the Jungfraujoch have different meteorological behaviour than the others, and consequently we exclude them from the fitting, thus giving D = 14 sites (see below). Exploratory analysis for the marginal parameters shows relations between the location parameter of the generalized extreme-value distribution (1.2) and latitude, altitude and time. We first fitted various different models with the powered exponential covariance (2.12) using the scaled pairwise likelihood * {g(y d 1 ,j ; w), g(y d 2 ,j ; w); w}, (5.1) where y d,j represents the maximum at site d in year j, g(y; w) = {1 + x(y − h)/t} 1/x and the marginal parameters h, t and x are functions of the generic parameter vector w. Some of the many models that we fitted and the values of their information criteria are shown in table 2. The dependence models incorporate the dependence among the maxima by fitting the log likelihood based on equation (3.1), whereas the independence model incorrectly treats the spatial data as independent series. There is a very large reduction in the CLIC * owing to inclusion of quadratic terms in altitude and a smaller one owing to time. The effect of latitude appears stronger than that of longitude, and this seems natural in view of the geography of the Alps. The overall best model uses quadratic terms in longitude, latitude and altitude of the sites, plus a linear term in time, to explain variation in all three marginal parameters of equation (1.2), but this seems very complex for a fit to 14 sites, and we prefer a simpler model with just eight marginal parameters. The other models vary the shape parameter k of the exponential covariance function used for these fits. Our preferred model, denoted by superscript 'b', has constant scale and shape parameters, and where alt(x) and lat(x) are altitude above mean sea level (in kilometres), and latitude, measured in a coordinate system centred on the former national observatory in Bern. Time t is measured in centuries since the year 2000. The corresponding independence model has an appreciably higher information criterion value, suggesting that, as one might anticipate, the spatial dependence model better accounts for the data. Table 3 shows the estimates and standard errors for the model with margins (5.2), when the data from the 14 sites are treated as independent, using the max-stable model with likelihood contributions (3.1) not allowing for Table 3. Parameter estimates (s.e.) for annual maximum temperature data from 14 sites, treated both as independent and as dependent. In the latter case, we fit a geostatistical max-stable model using the powered exponential covariance with k = 1.5, both ignoring event times and taking them into account, using pairwise likelihood estimation based on likelihood contributions ( the timing of maxima, and allowing for their timing. In the last case, we used likelihood contributions (3.2) and (3.3), with maxima considered to form part of the same episode, only if they occurred on the same day; the effect of varying this by considering that a hot episode could take place over several days was slight. Most of the estimates are similar for the independence and dependence models, with the latter giving larger standard errors for the marginal intercept parameters h 0 , t, x and smaller standard errors for most of the marginal regression effects. The exception is the parameter h 5 representing dependence on time, the estimate of which increases and becomes more significant when spatial dependence is taken into account, though none of the estimates differs significantly from zero. There is some reduction in uncertainty when information on the synchronicity of extreme events is included, particularly for the scale parameter l of the correlation function.
The table provides plausible physical values for the effect of altitude, which is known to reduce temperatures by around 7 • C for each kilometre of climb; the gradients implied by the quadratic curve fitted here range from −6 • C to −9 • C at altitudes of 1-2 km; both the linear and quadratic terms in altitude are highly significant, but the addition of a cubic term is not. The effect of climatic change is also consistent with estimates from other sources, suggesting an increase of around 2.5 • C over the next century. The large uncertainty of this estimate is not surprising, as the data encompass a small geographical region and are strongly correlated. The effect of latitude is most likely a surrogate for complex effects of topography beyond that of altitude.
We also fitted equation (2.8) with correlation function (2.15) and the Whittle-Matérn correlation family, with results very similar to those described above. Figure 5 shows the fitted extremal coefficient curves for various models; there is little to choose between them, though the model with individual marginal parameters at every site is somewhat distinct from the others. This is unsurprising, but as this model does not allow extrapolation to other sites, it is of little practical use in risk analysis.

(b) Model checking
The max-stable models are fitted using data from pairs of sites, and it is natural to assess the quality of the fit by using data from larger subsets. To do so, we perform a simulation-based test, as follows. We first note that the annual maxima Z A = max d∈A Z d are available for any subset A ⊂ D of the sites at which data are available, and that the distribution of Z A under the max-stable model is Fréchet with parameter q A . The empirical values of Z A are available for n independent years of data; call these z A,1 , . . . , z A,n . They may be compared either with the Fréchet distribution with parameter q A or with maxima for datasets simulated from the fitted model. In the second case, we simulate R independent sets of data from the fitted max-stable model, and thus obtain R replicates z * A,1,r , . . . , z * A,n,r of the original data. Simple graphical tests of fit may be based on the concordance of the ordered values of z A,1 , . . . , z A,n with the ordered values of z * A,1,r , . . . , z * A,n,r . Figure 6 shows the use of R = 5000 simulated max-stable processes to form simulation envelopes (Davison & Hinkley 1997, section 4.2.4)  (d) Southern Plateau; (e) Northern Plateau; and (f) Eastern Alps. Since they were not used to fit the model, the poor fit for the Jungfraujoch and for the Ticino is not surprising. The fit is broadly satisfactory for the other groups, except perhaps group (f). Arosa, Davos-Dorf and Santis have altitudes of 1840, 1590 and 2490 m, respectively, but Bad Ragaz is much lower, at 496 m, and is unusual in this group because of its low altitude. This suggests that a more complex model accounting better for dependence on altitude would be preferable, but since the plotted data do not lie outside the overall 95% confidence band, this does not appear to be a critical issue. Figure 7 shows simulation envelopes with five other groups of sites not in close proximity. The sites found to be badly fitting have been put together into group (a), for which the fit is very poor, but for the other groups, it is quite satisfactory. Similar computations for other groupings showed no other systematic model failures, even when Bad Ragaz was treated as an ordinary station, and we conclude that the model appears satisfactory in this respect.

(c) More complex models
We attempted to fit the random set model (2.10) by maximizing the pairwise likelihood with the best model above and a(h) = {1 − |h|/(2r)} + , corresponding to taking B to be a disc of constant radius r. Fitting this model is delicate, but a profile likelihood for r is maximized as r → ∞. This is not surprising in view of the relatively small area that contains the sites. Quite different results would be expected with other types of data, such as annual maximum rainfall.
We also allowed for geometric anisotropy of the covariance function by fitting models in which the plane (x, y) is affinely transformed to (z −1 (x cos a + y sin a), z(y cos a − x sin a)), but found no evidence that a = 0 or that z = 1. Probably, a much larger number of sites would be needed to detect such effects.

(d) Risk analysis
A traditional approach to risk analysis is based on the pointwise estimation of high quantiles of a distribution, the so-called return levels or return values. In the univariate case and when the generalized extreme-value distribution (1.2) has been fitted to annual maxima, such quantiles are obtained as solutions to the equationĜ(y p ) = p, with p = 0. 95, 0.98, 0.99 and 0.999 corresponding, respectively, to 20, 50, 100 and 1000 year return levels. Similar arguments apply when exceedances over high thresholds are modelled, rather than maxima (Davison & Smith 1990;Coles 2001). When the extremal data are spatially indexed, it is now common to make the assumption of an underlying surface, and thus to obtain a smooth map of the necessary quantiles over the region of interest (see Cooley et al. (2007) or Sang & Gelfand (2009a)). A cruder and less satisfactory approach would be kriging of the relevant quantiles at different sites.
One important benefit of spatial modelling of rare events is the possibility of simulating unusual episodes, here years with high maximum temperatures, rather than merely producing pointwise maps of high quantiles, and thus allowing a more powerful risk analysis. To illustrate this, figure 8 shows four max-stable processes simulated from our fitted model, ordered by sorting 1000 simulated    processes according to their overall averages on the data scale. Figure 8a-d shows the processes on the unit Fréchet scale, and clearly indicates the extent of the spatial dependence. Figure 8e-h shows the corresponding processes on the scale of the original summer maximum temperature data, for the year 2020. The effect of altitude is very strong, but close examination shows clear annual differences in the possible maximum temperatures, from which a risk analysis could be extracted using a functional such as equation (1.1). In obtaining these simulations, we made some allowance for parameter uncertainty for the parameters in table 3 by generating parameter values randomly from a multivariate normal distribution centred on the parameter estimates and with the same covariance matrix. Simulations can also be used to estimate the extremal coefficient (2.7) for the region X , by averaging values of max x∈X ∩G W (x), where G is a grid of points on which W (x) is simulated. To estimate q X for Switzerland, we used the fitted model to generate 10 000 random fields W (x) on a grid of side d = 0.25 of the pixels used in figure 1 and then subsampled on grids at 0.25, 0.5, . . . , 2.5 of the original pixels, giving estimates of q X that were extrapolated to d = 0. The resulting q X ≈ 9.5 shows the expected high degree of dependence, corresponding to around 9.5 independent temperature maxima across the entire country. The same computations gave q D ≈ 5.5 for the 14 sites used for the analysis.
We also used simulations to estimate the frequencies with which the temperatures observed for the exceptionally hot

Discussion
We regard the models and methods described above as a step towards realistic and flexible modelling of spatial maxima, consistent with the classical paradigm of max-stable processes and distributions. Our approach involves the following steps: (a) construction of a marginal generalized extreme-value model for sitewise maxima that allows extrapolation to maxima across the entire spatial domain, and enables the maxima to be transformed from their original scale to a common unit Fréchet distribution; (b) construction of a max-stable model for the transformed maxima, taking into account their spatial dependence through an appropriate covariance function and, perhaps, inclusion of a random set B and corresponding overlap probability a(h); (c) pairwise likelihood fitting of models given by (a) and (b), with model selection using information criteria; (d) assessment of model fit using data from pairs and from larger groups of sites; and (e) use of the chosen model for risk analysis through simulation, entailing both extrapolation into the tails of the infinite-dimensional distribution of the random process, and perhaps extrapolation outside the period or domain for which data are available.
Below, we discuss some aspects of the approach and outline some further elements that seem to us particularly important.
Although it arises naturally from the probabilistic construction of maxstable processes, the transformation in step (a) may appear awkward because interpretation on the unit Fréchet scale is only indirectly linked to the underlying physical process. The same might be said of step (b), though the random set element may have a clear interpretation in cases such as extreme convective rainfall, where it might correspond to an exceptionally active storm cell.
Our application involved the use of a polynomial response surface for the marginal parameters in (a), but in principle, we see no difficulty in using broader classes of functions such as splines, provided the data are rich enough.
For (b), we have used a relatively simple Gaussian process model suggested by Schlather (2002), but there is wide scope for developing further classes of model for which the pairwise exponent measures can be computed and whose parameters are identifiable from the pairwise densities. In separate work, we have found that such measures for other models both stationary and non-stationary, and the work of R. L. Smith (1990, unpublished data), de Haan & Pereira (2006, Smith & Stephenson (2009) and Padoan et al. (2010) on the random storm formulation are also relevant. Although not an issue for our example, it is a nuisance that the pairwise extremal coefficient for the simple Gaussian process model cannot attain its maximal value of 2, and in other applications, more flexible models are needed (see Davison et al. in press; Wadsworth & Tawn in press).
The fitting at step (c) is sometimes delicate, and it would be useful to have fast robust algorithms. We have fitted the marginal and joint models simultaneously, but in some cases alternating algorithms seem to more be stable (Blanchet & Davison in press). Perhaps, the fitting would be improved if higher order densities were available, though there is clearly a trade-off between computational and statistical efficiency.
Steps (d) and (e) rely on simulation from the fitted model, and therefore rule out the use of exponent measures that do not correspond to processes that can be constructed.
The use of max-stable models stems from the stability postulate underlying the derivation of the basic distributions used for univariate extremes, here generalized to equation (2.1), which enables mathematically consistent extrapolation beyond the range of the data. However, such extrapolation is fraught in several dimensions and great care should be exercised in the present infinite-dimensional situation. Even for multi-variate extremes, the classical theory has limits in modelling how dependence changes with events of increasing rarity. Important further developments to deal with these near-independence cases are owing to Ledford & Tawn (1996, 2003, who have, in an important series of papers, described diagnostics for more subtle dependencies in the joint tails for several variables than are allowed by the classical theory, and models that accommodate such dependencies. More recent work on these lines is by Heffernan & Resnick (2005, 2007. It seems important to extend our approach to cater for these more flexible and realistic forms of asymptotic behaviour (see Wadsworth & Tawn in press). Under asymptotic independence, extremely rare events would have smaller spatial extent than less rare ones; thus, our asymptotically dependent approach would tend to overstate risk if applied to data from an asymptotically independent spatial process. In order to detect such cases, it is helpful to plot measures of extremal dependence as in Blanchet & Davison (in press). It would also be helpful to find an analogue of the spatial madogram (Naveau et al. 2009) suitable for near-independence cases.