## Abstract

Epidemic models have become a routinely used tool to inform policy on infectious disease. A particular interest at the moment is the use of computationally intensive inference to parametrize these models. In this context, numerical efficiency is critically important. We consider methods for evaluating the probability mass function of the total number of infections over the course of a stochastic epidemic, with a focus on homogeneous finite populations, but also considering heterogeneous and large populations. Relevant methods are reviewed critically, with existing and novel extensions also presented. We provide code in Matlab and a systematic comparison of numerical efficiency.

## 1. Introduction

### (a) Motivation

Epidemic models are now widely used to inform policy on a range of issues, from childhood diseases [1] to bioterrorist smallpox [2], SARS [3], foot-and-mouth disease [4] and pandemic influenza [5]. These models have typically involved either numerical integration of ordinary differential equations that do not explicitly account for the underlying chance events in transmission, or Monte Carlo simulation of models that aspire to a high level of realism [1,6].

There has been growing interest, however, in consideration of the parsimonious stochastic epidemic models that were described during the earliest phase of mathematical epidemiology [7]. This is partly because of the impressive array of formal results that have accumulated over the years in this field [8], but also because of the role that these models can play in modern, computationally intensive inference of epidemiological parameters [9–11] and optimization problems.

The final size of an epidemic can be defined informally as the total number of people experiencing infection during the outbreak. This quantity is typically called the *attack rate* by applied epidemiologists, and expressed as a percentage of the population in question. The probability distribution of final sizes is of particular interest to statistical epidemiologists, owing to its use *inter alia* in the analysis of household data [12,13].

It turns out that there is a particularly large number of approaches to calculation of the probability mass function (PMF) for the final size of an epidemic (often called the *final-size distribution*). This paper aims to summarize these approaches, paying particular attention to: (i) numerical implementation, including a novel application for iterative methods; (ii) critical comparison of different methods; and (iii) consideration of how these methods can be applied to calculation of other epidemiological quantities of interest.

### (b) Model definition

While we consider various generalizations, our starting point is the susceptible-infectious-recovered (SIR) epidemic model in a closed finite population. We consider a population of integer size *N*, and the state of the system is given by non-negative integer-valued stochastic variables *S*(*t*) and *I*(*t*), obeying *S*+*I*≤*N*, which represent the number of individuals who are susceptible or infectious at time *t*, respectively. The number of recovered individuals *R* is given by *R*=*N*−*S*−*I* due to our assumption that the population is closed (i.e. there are no births, deaths or migrations).

We assume that any pair of individuals makes contact at the points of a homogeneous Poisson process of rate *β* and that contacts between different pairs of individuals are mutually independent. Contact between an infectious and a susceptible individual results in the immediate infection of the susceptible individual, who experiences a random duration of infectiousness drawn from the *infectious period distribution* (also often called the recovery time distribution) and then recovers. Note that in these conventions, the overall rate of infection is *βSI*, so when comparing results across different values of *N*, it is often most instructive to hold the quantity *β*/(*N*−1) constant.

The epidemic process starts in state (*S*_{0},*I*_{0}) and must end when there are no more infectious individuals. Writing the final state in the form (*N*−*Z*,0) allows us to define the integer-valued stochastic variable *Z* as the *final size* of the epidemic. We are particularly interested in the PMF of this quantity; since this function has finite integer support, it is often conveniently represented as a vector of probabilities
1.1In the rest of this paper, we will be interested in either the calculation of the components of the final-size distribution (often in vector representation) at machine precision or sampling from this distribution using Monte Carlo methods.

## 2. Material and methods

We now consider different ways to calculate the final-size distribution, starting with Monte Carlo methods that are simply described in §2*a*, before moving on to methods that run at machine precision in §2*b*,*c*. We have tried to use notation that is consistent within this manuscript, but in doing so, we may deviate from the conventions originally used for each technique.

### (a) Monte Carlo methods

#### (i) Direct continuous-time simulation

In the event that the infectious period distribution is exponential with rate parameter *γ* (i.e. mean *γ*^{−1}) the system dynamics take the form of a continuous-time Markov chain. The events and their rates of occurrence for this Markov chain are
2.1This system can then be simulated directly using Monte Carlo methods, once initial conditions are specified. Stigler's Law [14] states that scientific discoveries are not named after their originators (and was, of course, discovered earlier by Robert K. Merton). It is therefore as would be expected that while the most common method for Monte Carlo simulation of a continuous-time Markov chain is typically called Gillespie's algorithm [15], this method was actually developed by probabilists some decades earlier [16–18].

The results of simulating the Markov chain defined by (2.1) are shown in figure 1*a*,*b*, for a relatively large and small outbreak, respectively. We provide the function `gil_mc.m` in the electronic supplementary material, (S1.1), as an implementation of this, based on the code from Keeling & Rohani [6]. Obviously, if one is interested in temporal dynamics of the epidemic, all of the information created by this method is useful. But for the purposes of sampling from the final-size distribution, there is a lot of unnecessary computational effort expended by this method.

*Extension: waning immunity*. Some of the methods presented in this paper rely on long-lasting immunity following recovery from natural infection—the SIR paradigm. Many diseases, for example respiratory synctial virus (RSV) [19] and rotavirus [20], exhibit significant waning immunity, which is readily incorporated in approaches based on direct simulation by adding the transition
2.2to (2.1), giving SIRS epidemic dynamics. The system as defined always ends in state (*N*,0) and so what to calculate depends on the epidemiological problem in hand. In the context of SIRS epidemics with household structure, the total number of infection events is of interest due to its role in the calculation of epidemic thresholds [21,22], and this can be readily extracted from a realization of the full SIRS epidemic.

*Extension: phase-type infectious periods*. An assumption behind (2.1) is that the infectious period durations are exponentially distributed—this is justifiable for some diseases, but certainly not for others. Suppose we now introduce an integer index *a* between 1 and *k* for infectious individuals, and modify (2.1) to
2.3so that the distribution of infectious periods is a phase-type distribution. Since phase-type distributions are dense in positive-valued distributions [23], any reasonable infectious period distribution can be approximated by a model of similar form to (2.3); however, this comes at potentially large computational cost. The most commonly used phase-type distribution is the hypo-exponential, where individuals pass from linearly, which has the effect of reducing variability compared with an exponential distribution, and leads to a *k*-Erlang infectious period distribution when the transition rates are all equal. Another frequent use of phase-type distributions is the hyper-exponential distribution, where all *q*_{a,b} in (2.3) are zero meaning that each infective passes through a unique *I*_{a} state, which is often interpreted as individuals experiencing different disease severities, and which has the effect of increasing variability compared to an exponential distribution.

#### (ii) Sellke's method

Sellke's method provides a way to simulate final sizes of a stochastic epidemic with arbitrary infectious period distribution [24]. We suppose that each individual *i* has a stochastic variable *T*_{i} for its infectious period, picked from the infectious period distribution, and that susceptible individuals have a random *threshold* *Q*_{i} picked from an exponential distribution of mean 1 (in this work, we take the initial infectives to have zero threshold parameter). We arrange the labelling of individuals, without loss of generality, so that *Q*_{i}≤*Q*_{j} when *i*<*j*. Then
2.4The intuition behind this equation is that *Q*_{i} is essentially how ‘resistant’ individual *i* is to infection, and *βT*_{i} is the force of infection that individual *i* will contribute *if* ultimately infected. In this non-rigorous picture, the epidemic ‘stops’ when the total infectious pressure drops below the resistance/threshold of all remaining susceptibles.

To see more mathematically why this construction is equivalent to the model defined, we consider the argument of Andersson & Britton [8], §2.2. If we let *I*(*t*) be the number of infective individuals at time *t*, then the rate at which a given susceptible becomes infective is *βI*(*t*) in the original model. We define the cumulative force of infection as
2.5Then considering a small unit of time *δt*, and a unit-mean exponentially distributed threshold *Q* we have that
2.6Therefore, the rate at which *Λ* exceeds *Q* is *βI*(*t*). Putting this together with (2.5) gives (2.4), showing that Sellke's method gives the same final size as direct simulation.

Realizations of Sellke's method for a relatively large and small outbreak are shown in figure 1*c*,*d*. Numerically, this is much more efficient and general than Gillespie's algorithm for sampling from the final-size distribution, but the function `sel_mc.m` given in electronic supplementary material, S1.2.1, does not provide the temporal dynamics of the relevant epidemic process, although such dynamics can be readily calculated. The fundamental approach of the Sellke construction is to keep track of total infection pressure and thresholds separately, which is quite generally applicable.

*Extension: temporal dynamics*. Once a realization of the Sellke construction has been made as described above, it is possible to construct the temporal dynamics of that realization without any further picking of (pseudo-)random numbers. This follows from (2.6) and is practically implemented by maintaining a list of recovery times for currently infectious individuals, infecting each individual in the order specified by the *Q* thresholds and interspersing recoveries at the appropriate times. We provide an implementation of this method in electronic supplementary material, S1.2.2, which has the major advantage that an arbitrary distribution of infectious period can be simulated efficiently without any approximation.

#### (iii) Ludwig's method

Another possibility for simulation comes from the construction of Ludwig [25]. This procedure does not provide any temporal information, and involves a set of discrete stages called ‘ranks’. Ludwig's method starts with *S*_{0} susceptibles, and places the *I*_{0} initial infectives in rank 0. At a given rank *g*, we label the *I*_{g} infectives with a set of integers . Next, each infective is cycled through in turn, picking an infectious period *T*_{i} from the relevant distribution, which then leads to an independent probability
2.7of infecting each of the remaining susceptibles. The number of susceptibles *S* is therefore reduced due to infective *i* by an integer
2.8Once each of the susceptibles in the current rank has been considered, all infectives in the current rank are removed, and the number of infectives in the new rank *g*+1 is . This process is continued until an empty rank is generated. Figure 1*e*,*f* shows a realization of this process for large and small outbreaks, respectively. We provide code `lud_mc.m` in electronic supplementary material, S1.3 to implement this algorithm.

Pellis *et al.* [26] have argued that while Ludwig's argument is intuitive, it is not always obvious for which generalizations of simple epidemic models it will still hold, leading these authors to provide additional detail about the approach. Indeed, the insights built up from Ludwig's argument can be applied to very complex populations incorporating several layers of structure—including households and networks—and for intrinsic varying severity, while the inclusion of infector-dependent varying severity invalidates it [27–29].

Perhaps the clearest way to confirm the validity of Ludwig's method is to make use of network theory. Using standard modern terminology [30], consider a random directed network where a link starting on individual *i* and ending on individual *j* is present with probability *π*_{i,j} corresponding to the probability of infectious contact being made from *i* to *j* if *i* becomes infective, before *i* recovers. The key property required for use of Ludwig's method is that the probabilities of infectious contacts emanating from an individual must depend only on quantities that can be determined in advance, but cannot depend on (for example) the temporal behaviour of the epidemic before that individual is infected. Equation (2.7) is a simple example, but all that is needed is the calculability of probabilities of infectious contact in advance of the epidemic. A node *j* will be infected eventually if there is a path through the network from an individual *i* in the set of initially infective individuals to *j*. Letting **D**=(*D*_{i,j}) be the adjacency matrix for the random directed network (i.e. *D*_{i,j} is 1 if *i* makes contact with *j* and is 0 otherwise) we can see that
2.9Ludwig's method therefore involves finding the smallest *n* satisfying the right-hand condition of (2.9), which becomes *j*'s rank. The assumption of independence of the links means that this can be simulated iteratively as described.

### (b) Machine-precision, Markov chain methods

A continuous-time Markov chain such as that defined by the events and rates given in (2.1) can have its dynamics fully specified by a solution **p**(*t*) to the Kolmogorov forward equations
2.10for appropriate initial probability vector **p**(0) and rate matrix **Q**=(*Q*_{i,j}). One way to define this matrix explicitly is by defining *σ*(*i*) and *ι*(*i*) as the number of susceptibles and infectives associated with state *i*, respectively. Then
2.11where *δ*_{i,j} is the Kronecker delta. We write the jump matrix for this Markov chain
2.12We use *p*_{S,I}(*t*) to stand for the element of **p**(*t*) corresponding to the probability of *S* susceptibles and *I* infectives at time *t*. We also use the notation for the state space of the Markov chain, which is composed of a set of absorbing states and an irreducible transient class , and which has dimension .

#### (iv) ‘Brute force’ methods

In this framework, there are two ‘brute force’ methods for matrix-based calculation. Firstly, for a Markov chain with finite state space, equation (2.10) has a matrix exponential solution
2.13which could be evaluated using, e.g. EXPOKIT [31] at a sufficiently large *t*. Alternatively, the probability vector after *n* events is, by definition of the jump matrix, **p**(0)**P**^{n}. For SIR dynamics, a maximum of *I*_{0}+2*S*_{0} events can take place, and so the relation
2.14can be used to calculate the final-size distribution. Code for both of these methods is provided in electronic supplementary material, S2.1.

#### (v) Bailey's method, with Neuts & Li's implementation

Bailey [7] notes that an integral representation is available for any final size probability:
2.15Then by evaluating the Laplace transform of (2.10) at non-negative *s*, we see that
2.16It is possible to solve for **q**(*s*) algebraically, and thus obtain
2.17as a closed form solution. While Bailey's original text suggests many different forms for the algebraic solution of (2.16), Neuts & Li [32] considered the form most suitable for numerical implementation. We provide `bailey_fs.m` in electronic supplementary material, S2.2.1, as an implementation of their algorithm.

*Extension: generalized transmission rates*. The method as outlined is efficient due to the sparse, triangular structure of **Q** and does not depend sensitively on the actual transition rates. Neuts & Li [32] in fact considered models that had rates that have general functional dependence on *S* and *I* (but not, for example, time)
2.18which could be useful in several contexts, and does not significantly alter the algorithmic procedure or the computational time.

*Extension: maximum size*. Neuts & Li's paper also considered calculation of the distribution for the maximum size of a stochastic epidemic, defined as the maximum value of *I*(*t*) over the course of the epidemic and denoted *I**. This quantity would typically be called *peak prevalence* by epidemiologists. Neuts & Li consider implementation of the method of Daniels [33] and provide an iterative procedure for calculation of *Pr*(*I**≥*y*). We implement this scheme as `NL_Imax.m` of electronic supplementary material, S2.2.2, which can be readily compared with the output of Gillespie or transient Sellke results.

#### (vi) Path integral/sum method

This approach has recently been used to evaluate the PMF of the number of secondary infections caused by an initially infected individual [34]. The method works by appending to the state of the Markov chain an indicator variable that takes the value one when the most recent transition corresponds to an infection event, and takes the value zero otherwise. In the SIR case, calling the indicator variable *J* gives the augmented Markov chain
2.19We then consider the jump chain of this modified Markov chain, a discrete-time Markov chain specifying the probabilities of jumping between states at the time of a jump; we label the transition probability matrix of the jump chain **P**. The total number of infections over the course of an epidemic, *Z*, is then equal to the number of jumps which result in being in a state where *J*=1 up until there are no infectives remaining in the population.

We can evaluate the distribution of *Z* using the following result [34]. Let be a discrete-time Markov chain taking values in with transition probability matrix **P**=(*P*_{i,j}). Define
2.20where is called the cost per visit, obeying and is an irreducible transient class. Let . Then *ϕ*(*z*) is the (maximal) solution to
2.21with *ϕ*_{k}(*z*)=1 for . This result is directly derived by conditioning on the state first jumped to.

The result above allows us to evaluate the probability generating function (PGF) *ϕ*(*z*) of the distribution of *Z*. To evaluate the PMF, we can numerically invert the PGF considered as a general Laplace transform [35]. However, a more efficient procedure can be obtained by differentiating the equation (2.21) *k* times with respect to *z* and evaluating at *z*=0, resulting in a system of linear equations for . Solving this system and using the relation
2.22allows computation of the *k*th element of **p**. This method was employed in Ross [34], where code was also made available.

In addition to being applicable to the SIR model with exponential infectious period distribution, this method can be applied to any Markovian model, including SIRS models and with any phase-type infectious period distribution; of course, more complicated infectious period distributions and model structures will result in an increase in the size of the matrix **P** required, and hence the methodology will eventually become computationally unfeasible.

*Extension: hitting times*. For a discrete random variable such as the final size, the path sum is most appropriate. For continuous random variables, the path integral method can be used [36]. Using notation as above, we consider the Laplace transform of the total cost (rather than the PGF):
2.23This is the maximal solution to the equations
2.24where *Φ*_{k}(*s*)=1 for . For the hitting times of a stochastic epidemic (i.e. the first time at which there are zero infective individuals), the costs are simply *c*_{i}=1 for .

While moments of the hitting-time distribution can be simply obtained by differentiation of (2.24), calculation of the probability density at a given point in time requires numerical Laplace transform inversion, which is inherently numerically unstable. We recommend the Euler method of Abate & Whitt [35], with the roundoff error control proposed by Sakurai [37]. Our code for calculation of the Laplace transform is given in electronic supplementary material, S2.3.2.

#### (vii) Null space method

The null space method is based upon the spectral theory of matrices, in particular as appropriate for finite-state Markov chains and was used by Keeling & Ross [38]. We start with the Kolmogorov forward equations (2.10). Being a square matrix, **Q** is always diagonalizable in the sense of Jordan canonical form. The SIR model has *N*+1 absorbing states, and hence *N*+1 repeated eigenvalues of 0. We may write
2.25for some **T**, where **J** is block diagonal with each *Jordan block* **J**_{i} of size *n*_{i} equal to the algebraic multiplicity (i.e. number of repetitions) of its eigenvalue *λ*^{(i)}. As the geometric multiplicity (i.e. the dimension of the null space) is equal to the algebraic multiplicity, each Jordan block **J**_{i} is simply diagonal with *λ*^{(i)} on the diagonal. Perron–Frobenius theory applied to the stochastic (uniformized) matrix , where , ensures that the eigenvalues lie within a disc in the negative half plane. Throughout this paper, we let **I**_{n} be the identity matrix of dimension *n*. For the SIR model, ). Now,
2.26since for the *i*th Jordan block, we have the form
2.27Hence as , we have that each Jordan block tends to a zero matrix with the exception of *e*^{J1}=**I**_{N+1}. Hence, we have
2.28where **T**=[**T**_{1};**T**_{2};…;**T**_{d}] and **T**^{−1}=[**U**_{1},**U**_{2},…,**U**_{d}] with *d* equal to the number of distinct eigenvalues. We use the semi-colon (;) for vertical concatenation and comma (,) for horizontal concatenation of matrices, following the conventions of Matlab. Let us write **T**_{1}=[**v**_{1};**v**_{2};…;**v**_{N+1}], of dimensions , with rows the left eigenvectors of **Q** (associated with the eigenvalue zero); and also write **U**_{1}=[**u**_{1},**u**_{2},…,**u**_{N+1}], of dimensions , with columns the right eigenvectors of **Q** associated with the eigenvalue zero. Now, considering the left eigenvectors of **Q** (associated with the eigenvalue zero), we may set **v**_{i}=**e**_{i}, the vector with a one in the *i*th position and zeros elsewhere. Hence, we have
2.29where **0**_{d1×d2} is the matrix of size *d*_{1}×*d*_{2} consisting entirely of zeros. Now, by noting that if we start in an absorbing state (with probability one) then we must remain in that state, the columns of the matrix **U**_{1} span the null space of **Q** so that **v**_{i}**u**_{j}=*δ*_{ij}. This is essentially a computationally efficient version of the jump-matrix-based equation (2.14), provided we know the null space. We use the `spspaces.m` code of Kowal [39] to compute the null space of **Q**, and provide code in electronic supplementary material, S2.4, for application of this code to the SIR model.

*Extension: maximum size*. Markov chain-based methods, such as the null space and path integral can also be used to calculate maximum epidemic sizes by augmentation of the state space:
2.30We do not provide code for this, since the Neuts & Li algorithm is certainly much more efficient, but note that Markov chain-based methods are extremely versatile, at the cost of state space expansion.

### (c) Machine-precision, arbitrary infectious period methods

Analytic traction can be gained on the Sellke construction by the derivation of a Wald-type identity. This was done by Ball [40], who obtained a triangular system of linear equations for the probability *p*_{k} of observing *k* additional cases in a population of *S*_{0} initial susceptibles and *I*_{0} initial infectives:
2.31where *Φ* is the Laplace transform of the infectious period distribution, so for Markovian dynamics *Φ*(*x*)=*γ*/(*x*+*γ*). Note that *p*_{k}=Pr(*Z*=(*k*+*I*_{0})) in our conventions. Clearly, (2.31) can be written in the form
2.32where **1** is a column vector with all entries equal to one, and we call **B**=(*B*_{kl}) the Ball matrix. We now consider three methods for numerical solution of (2.32).

#### (viii) Direct substitution

The equation (2.32) can be solved directly, since the Ball matrix **B** is left triangular
2.33Unfortunately, this procedure becomes numerically divergent for large population sizes. The essential reason for this is that the binomial coefficients in (2.31) become extremely large for large *l* and *k*≈*l*/2. Figure 2*a* shows the large mass in the middle of the Ball matrix, which drives this numerical instability.

*Extension: multiple-precision arithmetic*. Demiris & O’Neill [41] showed how the problem of numerical divergence could be overcome through the use of multiple-precision arithmetic. We provide code in the electronic supplementary material, S3.1, to implement this using Matlab's `vpa()` function, which can be easily modified to work at standard machine precision. A limitation of this method, however, is the large computational cost involved, since multiple-precision algorithms are typically extremely costly.

#### (ix) Iterative methods

A large number of numerical methods are available for the solution of equations of type (2.32) [42]. We find that the use of the conjugate gradient method on the equation
2.34is stable for population sizes up to around 10^{2}. This problem is illustrated in figure 2*b*, which shows a concentration of the mass of **A** in one area of the matrix that is similar to that of **B** noted above. In contrast, iterative schemes that can be applied directly to non-symmetric equations as (2.32) such as biconjugate methods (e.g. Matlab's `bicstab()`) or minimum-residual methods (e.g. Matlab's `gmres()`) do not out-perform direct substitution.

*Extension: preconditioners*. A major benefit of iterative methods is the potential to use preconditioners [42]. We find that the use of a Jacobi preconditioner together with an initial probability vector based on asymptotic results [8,41] can give accurate results even for *N*=10^{3}. This preconditioner is formed through the matrix **D**=(*D*_{kl}) for
2.35Then the preconditioned conjugate gradients method (PCG) effectively solves
2.36This problem is visualized in figure 2*c*, which shows how this preconditioner evens out the density of mass in the matrix involved. Code is provided in electronic supplementary material, S3.2, to implement this method. There remains the possibility of further preconditioning based on the properties of **E** to enhance convergence further without significant computational cost, but we were unable to find an appropriate second preconditioner.

While we could not find a preconditioner that significantly improved biconjugate methods, the minimal residual method given by Matlab's `gmres()` can be improved through the use of the preconditioner **F**=(*F*_{kl}) where
2.37This improved GMRES method effectively solves
2.38and has comparable performance to direct substitution. This problem is visualized in figure 2*d*. As for PCG, the possibility of finding a second preconditioner is still open and could yield a significant improvement over direct substitution if an appropriate one is found.

#### (x) Gontcharoff polynomials

Another way of computing the mass function of the final-size distribution is via the expression for the PGF of this distribution given slightly indirectly in theorem 2.6 of Ball [40] (Ball gives the PGF for *N*−*Z*, the number of susceptibles remaining at the end of the epidemic). We can use the fact that a mass function can be recovered from its generating function (*z*∈[0,1]) using the relationship (2.22) to find that, for *k*=0,1,…,*n*,
2.39Here *q*_{i}=*Φ*(*iβ*) (*i*=0,1,…) (the probability that an infective fails to infect any of a given set of *i* susceptibles), *U*=(*u*_{i}=*q*_{i}, *i*=0,1,…), *E*^{j}*U*=(*u*_{j+i}, *i*=0,1,…) and the Gontcharoff polynomials *G*_{k}(*x*|*U*) (*k*=0,1,…) are defined by
and
See eqn (3.11) of Picard & Lefèvre [43]. Ball does not use Gontcharoff polynomials in his results, but rather a collection of polynomials (*α*_{k}(*s*), *k*=0,1,…) defined as the solution to a triangular system of linear equations. Modulo some scaling, this system of equations is equivalent to the definition of the Gontcharoff polynomials given above: it is not hard to show that *α*_{k}(*s*)=*k*!*G*_{k}(*s*). Gontcharoff polynomials are useful in formal proofs, and do have something of a probabilistic interpretation when evaluated at 1 [44], §3.1. It turns out that the scaling *α*_{k}(*s*)=*k*!*G*_{k}(*s*) can be better for computational purposes and is used in our implementation in electronic supplementary material, S3.3.

*Extension: heterogeneous populations and generalized transmission*. The approach of Picard & Lefèvre [43] generalizes the work of Ludwig [25] and Bahr & Martin-Löf [45], considering the very general case where ‘each infective during [their infectious period] fails to transmit the infection within any given set of susceptibles with a probability depending only on the size of that set’ [43: p. 269]. Polynomial-based equations of similar form to (2.39) can therefore be derived for more general models of transmission, and also for heterogeneous populations. These equation sets typically have similar numerical behaviour to each other.

### (d) Asymptotic results

There is an extensive literature on asymptotic results for epidemic final sizes in large populations. This goes back to the earliest mathematical representations of the mean behaviour of epidemics [46], and also limiting distributions of simple epidemics [47], but more recently involves formal convergence proofs [45,48] and results obtained for quite complex population structures, including multiple types [49], households [50] and networks [28].

At the heart of most asymptotic results is a transcendental equation for the probability *π* that an individual avoids ‘global’ infection (defined in a model-specific way) of the form
2.40Here, *π*=1 (i.e. asymptotically vanishing levels of global infection) will always be a solution. If *F*′(1)≤1, then *π*=1 is the desired result, but if *F*′(1)>1 then we wish to find the largest *π*<1 that satisfies the equation (2.40). For complex models, there may be many non-maximal solutions, rendering general root-finding algorithms, such as bisection, unreliable. Considering
2.41should, however, provide an accurate estimate for *π* for sufficiently small *ε* and a sufficiently large number *m* of iterations of *F*. In the case of household-structured models, there is then the question of derivation of the distribution of final-size proportions, which can often be done using the methods outlined elsewhere in this paper. Sample code for reproduction of fig. 2 of Ball *et al.* [51] using a Jump chain method and (2.41) is given in electronic supplementary material, S4.

### (e) Epidemics on networks

There has been much recent interest in epidemic models where the population is connected on a network [52,53]. In these models, the population is made up of *N* individuals indexed by integers *i*,*j*,… and the contacts from *i* to *j* happen at the points of a Poisson process of rate *β*_{i,j}. If *i* is infectious and *j* susceptible at the point of contact, then *j* becomes infectious. Recovery happens, as before, after a time drawn from the infectious period distribution. This very general formulation can be somewhat simplified when there are several individuals with the same epidemiological characteristics (i.e. if there is a large discrete symmetry group for ** β**=(

*β*

_{i,j})). Alternatively, one may wish to preserve individual identity but make the simplifying assumption that

**=**

*β**τ*

**G**, where

**G**=(

*G*

_{i,j}) is the adjacency matrix of an undirected, symmetrical, topological network without self-edges. Both simplifications are more commonly considered than the most general case [52,53].

#### (xi) Monte Carlo methods

All Monte Carlo methods discussed so far are relatively simply adapted to populations with network structure. For the case of Gillespie's method, this involves direct simulation of the Markov Chain 2.42Code to perform direct network epidemic simulation is already available in a form that is readily adapted [6], program 7.7.

For the Sellke method on networks, thresholds can be generated at the start of the process, however, the ordering of thresholds to obtain an expression similar to (2.4) is not straightforward. Instead, we provide `sel_net.m` in electronic supplementary material, S5.1.1, in which new generations of infectives are created iteratively. This is somewhat similar to Ludwig's approach, but with the difference that the random numbers are generated for the susceptible nodes rather than the infectious contacts produced by infectives.

Ludwig's method is, given its links with random directed networks, particularly natural in the context of epidemics on networks. In this case, the ranks are constructed iteratively as before, but the picking with probability given by (2.7) is applied only to remaining susceptible neighbours of the individual concerned. We provide the function `lud_net.m` in electronic supplementary material, S5.1.2, that implements this algorithm.

#### (xii) Machine precision methods

There are two distinct meanings to the final-size distribution for network models. The first, as considered by Neal [54], is the final-size PMF averaged over realizations of a given random graph model. This has, to our knowledge, only been done so far for the Bernoulli/Erdös-Rényi random graph where each link is present with probability *π*, independently of the presence or the absence of all other links. This work generates a set of equations very similar to (2.31):
2.43which can then be solved using similar methods to those discussed for the Ball equations above. Code that creates the matrix **N** is provided in electronic supplementary material, S5.2.1.

The second broad class of distributions concerns the probabilities of different outcomes for a given ** β**—typically the marginal probability for each node that it is ultimately infected. Here there is a ‘multitype’ formula also derived by Ball [40] that can be adapted. If we have a vector

**n**=(

*n*

_{i}) whose elements are the initial number of susceptibles of type

*i*, and also

**m**=(

*m*

_{i}) with elements corresponding to the initial number of infectives of each type, then the relevant equations are 2.44where operations on vectors are defined in the natural way [8,40]. Code to generate the relevant matrix is given in electronic supplementary material, S5.2.2. It is also possible to consider a Markovian model as defined by (2.42). We provide code in electronic supplementary material, S5.2.3, to return an appropriate generator

**Q**for this Markov chain when

**=**

*β**τ*

**G**, which can then be analysed using methods for Markov chains already discussed.

An interesting recent development is the approach of Sharkey [55]. This method makes a ‘closure’ assumption that has been commonly used in pairwise epidemic network models [53] to derive a set of *O*(*N*^{2}) differential equations describing the temporal behaviour of a network epidemic that is provably exact for loop-less networks (I. Z. Kiss 2012, personal communication). Since the non-network homogeneous models considered are essentially of fully connected cliques this makes the Sharkey model inappropriate for these applications, however, it can be compared with other methods when the network structure is loop-less, and code to evaluate the model is already available [55], appendix D.

## 3. Results and discussion

We compare and benchmark each method systematically in table 1. These were carried out on a Mac Pro with a dual quad-core 3.2 GHz chipset and 16 GB of RAM running Matlab V. 7.9, 64-bit version. The machine precision *ϵ*≈2.2×10^{−16}, meaning that numbers close to 1.0 differing by this amount may be evaluated as equal by the machine. For variable-precision algorithms, the minimum number of digits (considering only powers of 2) needed to obtain accurate results is reported. For iterative matrix methods solving **A****x**=**b**, and using the vector norm , the standard diagnostic ‘residual’ is recorded, and is equal to |**b**−**A****x**|/|**b**|. For machine precision methods, we quote times for evaluation of the entire PMF, while for Monte Carlo methods we quote times per sample. We also display the behaviour of each algorithm as a function of population size graphically in figure 3.

The exact times given in seconds will of course vary significantly between machines, and may be subject to significant modification with more optimized code—in addition, the exact part of the code that should be timed to make a fair comparison is not clear, for example, where sparse matrices need to be generated, this is not included in our time measurement because the sparse structure of these matrices can be stored in a parameter-independent manner. Despite these caveats, however, some strong signatures show up in the benchmarking that can be interpreted and generalized.

### (a) Numerical efficiency of Monte Carlo methods

If a Monte Carlo calculation is necessary, then Gillespie's method is typically the least efficient method. The Sellke construction is much faster for both final-size and temporal dynamics (in fact, it is the only approach that can generate temporal dynamics for an arbitrary infectious period distribution). Gillespie's method is, however, the only Monte Carlo approach that can deal straightforwardly with waning immunity. There is also the question of efficiency at different parameter values. For subcritical epidemics where the expected final size is much smaller than the population size, Ludwig's method can involve generation of many fewer pseudo-random numbers than Sellke's, for example.

When considering the usefulness of any Monte Carlo method, it is useful to know how many samples are required. Figure 4 shows three measures of convergence. Suppose the probability of final size *z*, *p*_{z}, and the associated cumulative probability are known (in our examples through Bailey's method). Then if a proportion *q*_{z} of simulations have final size *z* and we define the empirical cumulative probability , the measures used are: (i) the *Kullback–Leibler (KL) divergence*
3.1using the convention that 0*ln*(0)=0; (ii) the *Kolmogorov–Smirnov (KS) D-statistic*
3.2(iii) the *summed absolute error*
3.3Which of these measures is most relevant depends on the precise application. For example, the KL divergence is often used in estimation, while the KS D-statistic is used to assess model adequacy. Figure 4 indicates that the convergence with number of samples *n* is, respectively: (i) *O*(*n*^{−1}); (ii) *O*(*n*^{−1/2}); (iii) *O*(*n*^{−1/2}).

While we have considered ‘exact’ Monte Carlo methods, approximate simulation algorithms also exist, such as the *τ*-leaping method introduced by Gillespie [56]. This was motivated by the desire to improve the numerical efficiency of the ‘Gillespie algorithm’, and assumes that the transition rates of the model are held fixed for a period of time *τ*, implying that the number of events of each type that occur during that time period have independent Poisson distributions. The algorithm is appropriate to use and provides most benefit, when (i) the changes to the state of the system have no or minimal impact on the transition rates of *all* event types, and (ii) it is unlikely for the number of events to lead to a state which is inconsistent with the state space of the model. The *τ*-leaping methods therefore seem unlikely to provide substantial benefits for epidemic models: for example, requirement (i) is unlikely to be satisfied for homogeneously mixing models and (ii) will not be satisfied for network models. Nevertheless, a thorough analysis of this question with more sophisticated approximations [57] could be of significant interest.

### (b) Numerical efficiency of Markov chain-based methods

For models based on Markov chains, multiple methods are available to calculate various quantities at machine precision. Of these, Bailey's method is the fastest, and is robust even for system sizes over 10^{4} with the limitation on system size related to the resources available to store and process a dense *N*×*N* matrix, but does rely on special properties of the SIR model. Of the methods that apply to more general Markov chains, the exact calculation considered will determine the appropriate method: matrix exponentials can capture temporal dynamics; which of the null space and path sum/integral methods is faster is likely to be machine- and implementation-dependent, but both are efficient for calculation of a wide range of quantities. The system size limitations for these approaches are related to the resources available to store and process large sparse matrices, but as for Bailey's method there is no inherent numerical instability involved.

### (c) Numerical efficiency of arbitrary infectious period methods

If a complex infectious period distribution is required then it may be impractical to use phase-type infectious distributions and Markovian approaches are unsuitable. Of the machine precision methods this leaves the Ball matrix equations and the use of Goncharoff polynomials. While these approaches are often much lower-dimensional than Markovian models, they are both inherently numerically unstable: the former because some elements of the Ball matrix are much larger than others; and the latter because solution involves summing many different positive and negative terms, leading to ‘catastrophic cancellation’. Where possible, however, direct substitution of the Ball equations is numerically efficient; also Gontcharoff polynomials are comparable in numerical efficiency and stability to direct substitution.

When the system size becomes large enough to generate numerical instability, Jacobi-preconditioned conjugate gradients can be used to reach system sizes of 10^{3} while retaining numerical efficiency. An important point about all iterative methods, however, is that small negative probabilities can be returned for any value of the PMF that is close to zero. While these do not lead to practical problems or major numerical instability, variable-precision should still be viewed as a gold-standard method, albeit one that can be prohibitively computationally costly. On the other hand, in the context of inference schemes such as those based on random-walk Metropolis–Hastings sampling over parameters *θ*, if a probability vector **p**(*θ*) has been evaluated at one step and a change in parameters *δθ* is proposed, then **p**(*θ*) can be used as the starting vector for iterative calculation of **p**(*θ*+*δθ*), and would be expected to converge more quickly than an initial vector based on asymptotic results if *δθ* is small.

### (d) Network methods

We also performed benchmarking of network methods, as shown in table 2. Here the conclusions are very similar to those for homogeneous models, but we note that for special topologies (averages over Bernoulli graphs or loop-less fixed networks) there are particularly efficient methods available. This is likely to be a general feature: for certain special networks, there will be techniques available that exploit the restricted topology; but in the most general case less efficient but more versatile approaches will be necessary. In this context, we note the efficiency of the Sellke construction, coupled with the possibility of reconstructing temporal dynamics as for the homogeneous case.

### (e) Implications for inference

Since one of the main motivations for our study is the relevance of final-size calculations for statistical work, we now consider the implications of our results for estimation of epidemiological parameters. A theoretically attractive approach to inference is via exact evaluation of the likelihood. We have considered here machine-precision methods which facilitate achieving this goal. The inference itself is typically performed within a Bayesian framework, using Markov chain Monte Carlo (MCMC) methods [41,58]. However, all of the machine-precision methods we have considered become practically infeasible to implement as the population size, or the complexity of the epidemic model, increases, owing to growth in the number of equations to be solved and numerical instabilities. We have demonstrated in this work that deployment of appropriate numerical methods such as preconditioning or use of path sums can significantly extend the reach of machine-precision methods.

The MCMC inference framework is very flexible and so difficulties such as infeasible evaluation of the exact likelihood can often be overcome by some form of imputation or data augmentation [58,59]. In the context of final-size data for epidemics, an example is the random graph approach developed by Demiris & O’Neill [60] for stochastic multi-type epidemics in structured populations. Often such approaches are heavily reliant on being able to restrict the search space, or at least to be able to sample over it efficiently, by good choice of prior. This is typically not a trivial task. Another approach which is gaining in popularity is to use simulation-based methods, such as Approximate Bayesian Computation or pseudo-marginal methods [58,61–63]. This approach relies on the ability to simulate data efficiently, and hence our comparison of three Monte Carlo methods will facilitate the use of such simulation-based inference methods.

For small population sizes, there is a variety of methods which may be used to perform inference. The advantage of machine precision methods is that the PMF is computed without Monte Carlo errors, and hence inference based upon such calculations is reliable. However, these methods may be time-consuming, and it may be more efficient to evaluate a Monte Carlo estimate. The difficulty is in determining how many simulations are required in order to estimate the mass function to *sufficient* accuracy for the problem in hand. To assist in determining this we have considered three measures of ‘distance’ between the machine-precision evaluation and the Monte Carlo estimate (figure 4). This informs us that to achieve very close to the same precision as the machine-precision methods, based upon the Benchmark cases running times in tables 1 and 2, that often the machine-precision methods are more computationally favourable. However, if one is happy to forgive some inaccuracy—possibly in bias and confidence—in parameter estimates, then some speed-up is possible using Monte Carlo methods. This trade-off will often be also dependent upon the application, in terms of the total running time required and available. In any case, as the population size increases Monte Carlo estimation methods become more appealing, and eventually become necessary. Additionally, whether the infectious period distribution can be well approximated by a sufficiently low-dimensional phase-type distribution can lead to additional questions about speed and accuracy. Nevertheless, certain general statements as discussed earlier will generally hold: for SIR epidemics, Sellke's method typically outperforms the Gillespie algorithm, while only Markovian models can deal with waning immunity. Temporal data will also restrict the use of Sellke's method to Monte Carlo simulation, while Markovian models can have temporal quantities evaluated at machine precision through the use of, for example, matrix exponentials.

## 4. Conclusions

Since the work of Bailey [7], much effort on stochastic epidemic models has focused on analytic results to enhance understanding [8]. Modern computational resources, however, mean that there are three particularly strong reasons to consider numerical algorithms. First, there is the possibility of making a fast sweep over a large region of parameter space to aid intuitive understanding. Secondly, there is improving the performance of computationally intensive inference. Thirdly, there is enhancement of the performance of other ‘inverse problems’ such as optimization of public health intervention strategies.

In this work, we have reviewed a fairly comprehensive selection of the existing methods for generation of the epidemic final-size distribution, and compared their numerical performance. We have shown how Jacobi-preconditioned conjugate gradients can be used to help alleviate the reported limitations of the Ball method; however, we consider it likely that problem-specific preconditioners and other numerical techniques can be developed with the aim of addressing the epidemiological questions above. We would furthermore encourage anyone reading this paper to contribute to such developments.

## Acknowledgements

T.H. is supported by the UK Engineering and Physical Science Research Council. J.V.R. was supported under Australian Research Council's Discovery Projects funding scheme (project no. DP110102893). We are grateful to Lorenzo Pellis and three anonymous referees for helpful comments on this manuscript.

- Received July 17, 2012.
- Accepted November 8, 2012.

© 2012 The Authors. Published by the Royal Society under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/3.0/, which permits unrestricted use, provided the original author and source are credited.