## Abstract

In this paper, we consider a simulation technique for stochastic trees. One of the most important areas in computational genetics is the calculation and subsequent maximization of the likelihood function associated with such models. This typically consists of using importance sampling and sequential Monte Carlo techniques. The approach proceeds by simulating the tree, backward in time from observed data, to a most recent common ancestor. However, in many cases, the computational time and variance of estimators are often too high to make standard approaches useful. In this paper, we propose to stop the simulation, subsequently yielding biased estimates of the likelihood surface. The bias is investigated from a theoretical point of view. Results from simulation studies are also given to investigate the balance between loss of accuracy, saving in computing time and variance reduction.

## 1. Introduction

There is currently much interest in performing ancestral inference from molecular population genetic data. To facilitate this inference, there has been an explosion of research in developing computationally efficient methods. These techniques are designed either to compute the likelihood, for maximum-likelihood estimation, of a sample of genes or for deriving the posterior distribution on parameters in coalescent models, which describe the ancestry of the genes. Broadly speaking, there are three main approaches to inference in molecular population genetics: (i) importance sampling (IS) for likelihood evaluation, whose application in population genetics was pioneered by Griffiths & Tavaré (1994*a*,*b*,*c*), (ii) Markov chain Monte Carlo methods (e.g. Kuhner *et al.* 1995; Wilson & Balding 1998), and (iii) Approximate Bayesian Computation (ABC); Marjoram *et al.* (2003), Del Moral *et al.* (2009). See Stephens (2004) for a review.

In this paper, we concentrate on likelihood-based methods. Molecular data have a sampling distribution which is a mixture over possible ancestries. The state space of the ancestries is huge and closed-form expressions are available only in the simplest cases. The objective is to estimate a parameter () using the criterion
1.1
for some observed genetic data *y*_{n}∈*E*, *n*≥2, parameter *θ*∈*Θ*, probability density *π*_{θ} on *F* and an integrable function. Note that *y*_{n} is typically the genetic types of a random sample of genes. In addition, *z*=(*z*_{0},…,*z*_{k}) denotes the coalescent history, i.e. the set of ancestral configurations at the embedded events in a Markov process where coalescence, mutations or other events take place. *z*_{k} denotes the current state, while *z*_{0} is the state when a singleton ancestor is reached.

Statistical inference associated with *l*(*y*_{n};*θ*) can be regarded as a missing data problem and could, in principle, be tackled by the expectation maximization (EM) algorithm (Dempster *et al.* 1977) and its Monte Carlo extensions (e.g. Fort & Moulines 2003). However, *z*, the stochastic tree, can be computationally expensive to simulate and such techniques are typically avoided. For example, for the coalescent Kingman (1982) and ancestral recombination graphs (e.g. Fearnhead & Donelly 2001), the standard approach is to use IS (Griffiths & Tavaré 1994*a*; Stephens & Donelly 2000; De Iorio & Griffiths 2004*a*) and sequential Monte Carlo (SMC) methods (Chen *et al.* 2005) to approximate equation (1.1). These approximations are usually computed on a discrete grid Δ_{θ}⊆*Θ* and the estimate of *θ** corresponds to the largest approximated likelihood on Δ_{θ}. See also Olsson *et al.* (2008) for an alternative procedure for state-space models.

Techniques such as ABC and composite likelihood (Wiuf 2006) do not give solutions which are exact with respect to the original model while, when possible, exact inference is of interest. This is because, given a reasonable stochastic model, the approach allows investigators to exactly (up to a numerical error) average over the uncertainty in the tree structure when estimating genetic parameters of interest. One of the main drawbacks of existing exact IS/SMC schemes is the simulation of the tree backward in time, from observed data, until the tree coalesces. In many scenarios, especially for large data sets, when getting close to the top/root of the tree, it often takes a long time to coalesce. This is owing to genetic parameters (e.g. mutation rates) that can be very large relative to the size of the data. Consequently, it can take a very long time to simulate the tree back to the most recent common ancestor (MRCA). As a result, the variance of the estimate of the likelihood can be higher than is desirable, along with long CPU times. It should be noted that the calculation of the likelihood at these points, *θ*∈*Θ*, can be inferentially important. In addition, it is seldom possible to speed up the simulation via IS as the variance of the weights can become too large. That is, by adapting the parameter of the proposal to lead to a fast coalescence, the discrepancy between the true process and the proposal leads to a very inefficient algorithm w.r.t. variance.

### (a) The time machine

The approach proposed in this paper is based on IS. Stephens & Donelly (2000) proposed a way to use IS efficiently to simulate ancestral trees by characterizing an optimal proposal distribution and similar methods have since been developed for a variety of genetic scenarios (e.g. De Iorio & Griffiths 2004*a*,*b*). The basic idea is to define an efficient proposal distribution on ancestral histories which allows us to reconstruct Markov histories backwards in time from the sample *y* to an MRCA.

We introduce a stopping time in the IS proposal, backward in time, to stop the simulation before the MRCA is reached. Then using a simple stopped identity, forward in time we are able to characterize the bias introduced in the evaluation of the likelihood owing to stopping the simulation of the stochastic tree. The bias can be understood by considering two aspects:

The underlying mixing of the evolutionary process.

A specific distribution on the process (the last exit time).

In the context of (i), the idea is that for many models, close to the root node of the tree, the process is able to forget its initial condition. As a result, stopping the simulation is reasonable, because the place where it is stopped is forgotten by the process forward in time; we formalize these ideas later on. In reference to (ii), the more information there is on the true marginal distributions of the process, the more it is possible to reduce the bias. Ideas from the theory of population genetics models (Ewens 1972; Ethier & Griffiths 1987) will be used to achieve the latter.

In reference to a comment of Edwards (2000), our method is termed the ‘time machine’. This is because, estimation is performed saving the simulation time of going all the way back in time to the MRCA. A similar idea, in the context of filtering, can be found in the work of Olsson *et al.* (2008) and also in option pricing Avramidis & L’Ecuyer (2006). In our context, we have a simpler underlying process than in filtering, but the ergodicity conditions considered there do not apply here. The mixing conditions that they require only apply locally and thus the proofs have to be modified. Recall that approximate tools for inference from stochastic trees (e.g. Tavaré *et al.* 2000; Meligkotsidou & Fearnhead 2007; Del Moral *et al.* 2009) are available. However, our approach is ‘less approximate’, in that our point-wise estimate of the likelihood is significantly less-biased, but costing more in computational time.

This paper is structured as follows: In §2, we introduce a motivating example, the coalescent model, which will help to illustrate our ideas. In §3, our approximation is described; §4 features an analysis of the bias of the approach; §5 presents a simulation study to demonstrate the performance of our algorithm and we conclude the paper in §6. The appendix contains some proofs. A description of the algorithm and an extension to the infinite sites model can be found in the technical report of Jasra *et al.* (2011).

## 2. Motivating example

The coalescent model is used as a motivating example for our work. Some notations are first introduced. In particular, we consider the case in which the type space *E*={1,…,*d*}^{n} for the collection of the genes/genes is finite and the only genetic process of interest is mutation.

### (a) Notation

Denote by a measurable space. For two *σ*-finite measures *λ*_{1} and *λ*_{2} mutual absolute continuity is written *λ*_{1}∼*λ*_{2} and the Radon–Nikodym derivative as d*λ*_{1}/d*λ*_{2}. Given a Markov kernel , let *P*^{0}(*x*,⋅)=*δ*_{x}(⋅), (the Dirac measure) and write the composition for *j*≥1 as , with a corresponding composition of inhomogeneous kernels as *P*_{1:j}. Write as the indicator of a set. The collection of bounded and measurable functions is denoted . The supremum norm is written . The total variation distance between two probability measures *λ*_{1} and *λ*_{2} on is . Given a probability measure *λ*, and a , the product measure is written *λ*^{⊗j}:=*λ*^{⊗j−1}⊗*λ*, *λ*^{⊗−1}:=1. The vector notation *x*=(*x*_{1},…,*x*_{j})=*x*_{1:j} is adopted. In addition, let the *d*-dimensional vector *e*_{i}=(0,…,0,1,0,…,0) where the 1 is in the *i*th position. The norm of a vector is written |*x*_{1:j}|_{1}:=|*x*_{1}|+⋯+|*x*_{j}|. For , .

### (b) Identity of interest

Define the tree model on the measurable space . Let *n*≥2. The basic idea is to maximize, w.r.t. *θ*∈*Θ*, the quantity
2.1
where the observed data is *y*_{1:n}∈*E*, , normally the identity, for
and for some *E*⊂*F*_{n}, some () to be defined specific to a model and depending upon the model under study. In all of our examples, *π*_{θ}(*k*,*z*_{1:k}) corresponds to the density of a birth-type (in some sense) Markov process in discrete time, stopped at a random time ; that is
Throughout the article it is assumed that , i.e. that the stopping time is a.s. finite w.r.t. *π*_{θ}. The stopping time will be determined by the first time that the tree is of ‘size’ *n*+1.

Introduce an absolutely continuous distribution *Q*_{θ} on *F* and sample according to *Q*_{θ}, then the IS estimator of *l*(*y*_{n};*θ*) is
where
and
the empirical measure of the simulated samples.

### (c) The coalescent model

The coalescent model plays a central role in population genetics (Kingman 1982; Hudson 1990). It describes the ancestry of a sample of genes/chromosomes back in time until the time to the MRCA. The genealogical history of a site on the DNA sequence can be represented by a tree (figure 1), with time going back in the past. For a sample of *n* chromosomes, there are initially *n* distinct branches, which represent the ancestry of one of the sampled chromosomes. As we go back in time, branches of the tree coalesce, i.e. join together when two chromosomes in the sample share a common ancestor. Note that coalescences occur only between pairs of individuals. The genealogy stops when all sampled chromosomes are traced back to the MRCA. The time *T*_{k} during which the sample has *k* distinct ancestors has an exponential distribution with parameter *k*(*k*−1)/2. Moreover, *T*_{n},…,*T*_{2} are mutually independent random variables. The tree has *k* edges of length *T*_{k}, *k*=*n*,…,2. Measuring time backwards, the number of ancestors of the sample is a death process with a quadratic death rate. Conditionally on the genealogical tree and the type of the MRCA, mutations occur along the branches of the tree as a Poisson process with rate *μ*/2 (*μ* is the mutation rate per gene per generation) and transition matrix *P*. Mutations in different edges are conditionally independent given the length of the edges. For example, when there are *d* possible allele types 1,…,*d* and when a mutation occurs, a transition is made from type *i* to *j* according to the elements *p*_{ij} of *P* (irreducible and positive recurrent Markov transition). Hence, the configuration of types in the sample is determined by the mutations occurring on the edges of the tree from the root to the leaves. Denote the number of genes of type *i* at event *j* of the process as , with . The objective is to find the genetic parameters *θ*=(*μ*,*P*).

The various components of the identity (2.1) for the coalescent model are defined as:
with *t* the identity function,
and finally,
where
and, with *ψ*_{θ} the stationary distribution of *P*,
Write (here d*z* is counting measure).

Note that for any fixed *n*≥2, *θ*∈*Θ*, . Here, we write the data as to denote the *d*-types and the fact that there are *n*-data. We remark, when there is only mutation, the stopped-process can be integrated out (see Gorur & Teh 2009). However, this does not hold in general, for example when there is migration; our results can be easily extended to the case of migration as well (e.g. De Iorio & Griffiths 2004*b*).

### (d) Likelihood computation

To compute the likelihood, for a given *θ*∈*Θ*, IS is adopted. An importance distribution, *Q*_{θ}, is introduced to simulate the tree backward in time to the MRCA; this ensures that the data is hit.

In detail, let *x* denote the reverse chain backward in time and write *x*∈*F*_{n} instead of *z* (this convention is used throughout the article, see also figure 1). Let:
for some Markov transition *q*_{θ}; see Stephens & Donelly (2000) for the optimal *Q*_{θ}. Then the likelihood is
The simulation proceeds by sampling from and computing the weight
Simulations backward in time are carried out until we reach the MRCA, i.e. when there is only one individual in the sample. This procedure is repeated *N* times to provide a Monte Carlo estimator for the likelihood
where are the simulated samples, for every and
This can be repeated for many *θ* using a driving value (Griffiths & Tavaré 1994*a*) or bridge sampling ideas (e.g. Fearnhead & Donelly 2001). In addition, to deal with the problem of weight degeneracy (e.g. Doucet *et al.* 2001) resampling steps can be added (e.g. Chen *et al.* 2005).

## 3. Stopping the simulation

It is now detailed how we stop the simulation of the stochastic tree back in time before the MRCA is reached. In the next section, we provide theoretical results and connections to the theory of SMC are established.

### (a) Notations and result

For the purpose of stopping the simulation, introduce two stopping times (forwards in time): the first hitting time of the set *B*_{n+1}
and some stopping time *α* associated with the hitting of a set
such that
where is the *π*_{θ}-probability. Let denote the expectations w.r.t. the process {*Z*_{k}}.

Consider the coalescent model, define, for *n*>*m*≥3 the stopping time *α*
which is the first time the forward process has *m*+1 individuals.

### Lemma 3.1

*The likelihood (2.1) can be written
*
*that is,
*
3.1
*where
*
*Furthermore, in the case of the coalescent
*
*where
*

### Remark 3.2

For the coalescent, one can write
That is, given *α*, the distribution of the gene counts at the first entrance time of *B*_{m+1} can be written as the composition of:

— the distribution of the counts at the last exit time from

*B*_{m}— and the Markov transition.

This helps to provide some intuition for the subsequent analysis.

### (b) Biased approximation for the coalescent

Set, for the coalescent
In other words the simulation is stopped the first time there are *m* genes. On the basis of lemma 3.1, our approximation of the likelihood is then
It is then clear that if
3.2
then the approximation of the likelihood is exact. That is, to minimize the bias an approximation of the true distribution of the counts at the last time there are *m* genes should be used. The ideas and notation are clarified in figure 1.

## 4. Results on the bias

In our biased simulation, using the decomposition (3.1), the procedure will approximate where our notation is such that:

— (

*X*_{k})_{k≥1}is the time reversed process—

*ρ*is a first hitting time associated with (*X*_{k})_{k≥1}—

*h*_{θ}an approximation of equation (3.2).

### (a) Error bounds

We begin by giving a simple result on the error bounds for two type of algorithms. The result applies to the standard IS algorithms, for example, in Stephens & Donelly (2000) and De Iorio & Griffiths (2004*b*), and for the SMC algorithms as in Chen *et al.* (2005). The simulation is to be performed backward in time, as in §2*d*. The ideas here are adapted from the theory of Del Moral (2004).

The biased estimates are denoted as , *N*≥1, where depends upon whether IS or SMC is implemented. For example, in the IS case:
where
is such that *A* is the set associated with *ρ*≥3 (*Q*_{θ}-a.s.), *x*_{1},*x*_{2}∉*A* and . Below expectations w.r.t. the stochastic process that is simulated by the algorithm are written as and it is assumed

### Proposition 4.1

*For any n*≥2, *p*≥1, *θ*∈*Θ*, *y*_{1:n}, *there exists a* , such that:

### Remark 4.2

The result shows the standard variance-bias type decomposition. That is, can be thought of as a bound on the variance and |*l*_{b}(*y*_{n};*θ*)−*l*(*y*_{n};*θ*)| is the bias. Our estimate converges to *l*_{b}(*y*_{1:n};*θ*), and it is sought to control the bias term. In the case of the coalescent and with abuse of notation, it is
4.1
where *γ* is any probability on {*m*−1,…} and . Hence, it is of interest to control terms of the form
for *λ*_{1},*λ*_{2} two probability measures and {*P*_{n}} a sequence of non-homogenous Markov kernels and *f* a bounded measurable function (*θ* is suppressed on the right-hand side).

### (b) Controlling the bias

A simple technical result is now given which shows how to control the bias term (4.1).

Some assumptions are now made, that can be satisfied by many stochastic tree models. Introduce a sequence of time inhomogeneous Markov kernels {*P*_{n}}, on space and a sequence of sets .

### Assumption 1

*Stability of* {*P*_{n}}.

.**Initial probability measures***λ*_{1},*λ*_{2}*are concentrated on**C*_{0}.{**Absorption of***P*_{n}}.*For every**n*≥1,*x*∈*C*_{n−1}*we have*4.2{**Local mixing of***P*_{n}}.*For every**n*≥1,*there exist**ϵ*_{n}∈(0,1),*ν*_{n}*concentrated on**C*_{n},*such that for all**x*∈*C*_{n−1}4.3

*The assumption* A1 (*which is comprised of* (i)–(iii)) *will refer to the fast mixing of the process close to the root node of the tree. The absorption type assumption refers to the birth process associated with coalescent type chains.*

### Proposition 4.3

*Assume* (*A*1). *Then, for any k*≥1, *define*:
*and we have*

### Remark 4.4

The result helps to bound the bias as .

### (c) Verifying the assumptions

Assumption A1 is now discussed in the context of the coalescent. Note that the results follow, with some extra work, for coalescent processes with migration. Readers interested in how the method may be applied can skip to §5, with no loss in continuity.

Suppose that the transition matrix *P* satisfies, for any *i*,*j*∈{1,…,*d*}, *ϵ*_{φ}∈(0,1) and probability *φ*, *φ*_{j}>0 . This condition implies that *P* mixes extremely quickly. Let *C*_{0}={*z*_{1:d}:|*z*_{1:d}|_{1}=3}; this corresponds to the space of . Also let *C*_{1}={*z*_{1:d}:3≤|*z*_{1:d}|_{1}≤6}. It is clear that : since we start with at most 3 genes and the most possible after three steps is 6. Now it can be seen that, for any *z*∈*C*_{0}
with

Here, the minorizing probability *ν*=*φ*^{⊗3} puts all its probability on having 3 genes. Then it can be subsequently seen that satisfies condition (4.3), with *C*_{2}={*z*_{1:d}:3≤|*z*_{1:d}|_{1}≤12,*n*_{i}≥0} and so forth. In effect the condition (4.3) holds with ; that is, the closer to the root node of the tree we stop, the faster the process will mix forward in time.

As a result and on inspection of equation (4.1), to bound the bias we can write it, approximately, in the form, for *r*>*n*−*m*+1
with , *l*∈{1,…,*k*} is associated with the fact that we need to iterate the kernels to satisfy equation (4.3) and *ξ*∈(0,1). *r* is an integer large enough so that *R*(*r*) is negligible. If one chooses the arbitrary *γ* as , then the bound can be upper-bounded by
Thus, approximately, the bound shows that the bias falls geometrically as we stop closer to the root of the tree. Note, however, it cannot go to zero unless *h*_{θ} is exactly (3.2) (one need not consider *γ* here). To an extent, finding good approximations is more difficult than being able to stop the tree, which is why we focus on this.

## 5. Simulations

### (a) Experimental setup

To illustrate our approach, we consider three simulation scenarios: two parent independent mutation (PIM) models and one parent-dependent mutation model (PDM). The two PIM models, denoted PIM 0.5–0.5 and PIM 0.1–0.9 are based on the following per-locus transition matrices:
while the per-locus mutation probability matrix underlying the PDM model is
In all three scenarios, the initial population was set to 100 sequences and we considered a single-locus case (i.e. with two possible types). For the PDM model only, we also considered the case of 10 loci (i.e. 2^{10}=1024 different types). Irrespective of the number of loci considered, the distribution of the 100 initial sequences among the different types was sampled from a multinomial distribution with a probability vector *P* defined as the invariant point, solution of equation, *P*_{(2n×1)}=*P*_{(2n×1)}*T*_{(2n×2n)}, where *n* denotes the number of loci considered and *T* is the full mutation probability matrix.

The algorithm description is in the original technical report of Jasra *et al.* (2011). For the function *h*_{θ}, we use the distribution of an unordered sample from a PIM model (which, even for the PIM cases, is not the correct distribution in the bias term).

### (b) Simulation results

Simulations were carried out until there were *TM*=1,2,5,10,25,50 sequences left in the population. *TM*=1 exactly corresponds the approach in Stephens & Donelly (2000) and is subsequently referred to as *SD*. For each simulation, we examined 60 values for *μ* ranging from 0.1 to 30.1. We report in figure 2 the estimated log-likelihood distribution, based on 100 000 samples for all four simulations scenarios (presented in rows) and three values of *μ* (presented in columns).

In figure 2 it is clear that as expected, uniformly across the values of *μ*, the closer to the MRCA the algorithm is stopped, the more accurate the distribution of the likelihood is estimated. However, up to *TM* 10 per cent (and even *TM* 25% for *μ*>20) our results suggest that the time machine approximation and correction provides an accurate estimate of the distribution of the likelihood. Conversely, when the algorithm is stopped too early (*TM*≥25%) the biased estimator underlying the time machine approach leads to very inaccurate estimates of the likelihood. For even more extreme cases (*TM* 50% for *μ*=0.1), this results in a highly shifted estimated distribution of the likelihood.

The above observations are also reflected in the mean likelihood (figure 3). For every model considered here, the simulations of the time machine up to *TM* 25 per cent seem to provide estimates of the mean likelihood that are similar to the SD approach, although for larger values of *μ*, *TM* 25 per cent seems to overestimate the mean likelihood. Furthermore, the time machine approach seems to accurately locate the value of *μ* maximizing the likelihood for *TM*≤10 per cent, and to provide acceptable approximations for this when *TM*>10 per cent, regardless of the simulation scenario.

In figure 4, the average computation time per iteration is plotted as a function of *μ* for the PDM-10 loci simulations. Results for all other models led to the same conclusions and are therefore not shown. From this figure, the computation time appears to be a linearly increasing function of *μ*: increasing the mutation rate naturally decreases the probability of simulating a coalescent event, and therefore, tends to increase the time to reach the MRCA (or any population size). However, it seems that stopping the simulation when there are only more than five sequences left in the population drastically reduces the computation time: for *TM* 5 per cent the simulation run is on average more than twice as fast as the SD simulation, and for *TM* 25 per cent, the time machine is three times more time-efficient than the SD algorithm. It should also be noted that ‘large’ values of *μ* (around 10), for which the time savings are most significant, also seem to be inferentially important (figure 3*d*).

In figure 5, the relative standard deviation across our 100 repeats of the algorithm, of the time machine to SD are plotted for all the scenarios considered. It can be seen, as expected, that there is some variance reduction and, for example, for the *TM* 5 per cent PDM, the variance reduction is of the order 1.5.

On the basis of our experiments, combining both computational efficiency and the numerical accuracy, the use of the time machine with TM 5 per cent is an efficient alternative to the SD algorithm. The C++ code is available upon request from M.C.-H.

## 6. Summary

In this paper, we have considered a new approach for simulation of stochastic trees and likelihood calculation of sample probabilities in population genetics models. The approach consists in stopping the backward simulations before the root of the tree is reached. We have provided theoretical results on the bias introduced in the estimation of the likelihood. Some extensions to our work are described below.

Firstly, to extend our analysis to different models. The paper has been written to facilitate such analysis and we believe it is rather simple to deal with other stochastic tree models. Also, some further empirical investigations would help support the simulations and theoretical analyses presented here. Our methodology would be further enhanced with graphical processing unit technology (e.g. Lee *et al.* 2010), and this is one area that we are currently investigating.

Secondly, to look at the consistency (in a likelihood sense) of our biased Monte Carlo estimator. As we observed in §5, it appears that the time machine seems to recover the maximum-likelihood estimator. Therefore, consistency or potential asymptotic bias is of genuine interest. There are very few results in the context of consistency, owing to the dependency in the data, after integrating out the tree. That is, it is difficult to apply uniform laws of large numbers to complex dependency structures. None-the-less, we suggest the work of Fearnhead (2003), Douc *et al.* (2004), Olsson *et al.* (2008) and Olsson & Rydén (2008) as possible starting points for a proof.

## Acknowledgements

We thank Arnaud Doucet for some valuable conversations. We also thank three reviewers for some very useful comments which greatly improved the article.

## Appendix: proofs

We give the proof of lemma 3.1. The proofs of propositions 4.1 and 4.3 are in the technical report of Jasra *et al.* (2011).

### Proof of lemma 3.1.

Then the likelihood (2.1) can be written as and applying tower-property, we have which establishes the first part of the lemma.

For the second, using equation (3.1) (recalling the data are ), we have
A1
In words this means that to have *m*+1 genes, we need a minimum of *m* steps in the process and *τ* has to be at least *n*−*m*+*α* steps.

In this case, write
where *ψ*_{θ} is as in §2*c*. Note this is well-defined owing to the fact that the size of the population is non-decreasing, and then, for any

Returning to the likelihood (A1) and making the substitutions, *α*′=*α*−1, *η*=*τ*−*α*′+1, it thus follows that
Here, *η* is the time from the last time there are *m* genes to the first time there are *n*+1 genes. ■

- Received September 27, 2010.
- Accepted January 31, 2011.

- This journal is © 2011 The Royal Society