## Abstract

In this paper, we discuss statistical families 𝒫 with the property that if the distribution of a random variable *X* is in 𝒫, then so is the distribution of *Z*∼Bi(*X*, *p*) for 0≤*p*≤1. (Here we take *Z*∼Bi(*X*, *p*) to mean that given *X*=*x*, *Z* is a draw from the binomial distribution Bi(*x*, *p*).) Such a family is said to be closed under binomial subsampling. We characterize such families in terms of probability generating functions, and for families with finite moments of all orders we give a necessary and sufficient condition for the family to be closed under binomial subsampling. The results are illustrated with power series and other examples, and related to examples from mathematical biology. Finally, some issues concerning inference are discussed.

## 1. Introduction

Many examples in statistics relate to the problem of ‘thinning’ a random variable *X*, in the sense that *X* itself is not observed; rather, a ‘distorted’ (or sampled) version *Z* of *X* is observed. *Z* is influenced by the nature of *X*, the means of observation and stochastic errors. *Z* might have a functional relationship to *X*, *Z*=*g*(*X*), or a stochastic relationship, *Z*∼*F*(*X*), where *F* is a distribution function depending on the value of *X*. Statistically, *X* can be thought of as missing information. In our case, we assume *X* and *Z* are stochastically related through *Z*∼Bi(*X*, *p*). Here, and elsewhere in the article, we take *Z*∼Bi(*X*, *p*) to mean that given *X*=*x*, *Z* is a draw from the binomial distribution Bi(*x*, *p*).

We have noticed several examples of this in our own field of mathematical biology, and below we list a number of them. We start with the example that motivated this paper. In the biological world and elsewhere, data are often well represented by a stochastic graph. For example, interactions between genes, proteins or species in a food web can all be represented by a graph, where each gene, protein or species is a node in the network or graph, and interactions are connections between nodes. Perhaps the best-known and best-understood stochastic graph is the Erdős–Rényi random graph (e.g. Bollobás 2001). Each node, *i*, in the network has *X*_{i}∼Bi(*M*−1, *q*) connections to other nodes, where *M* is the total number of nodes in the network. Other examples that have been studied extensively are small-world networks and scale-free graphs.

However, it is often the case that our experimental devices, or observational strategies, only allow us to observe each connection in a network with a certain probability. Perhaps the most parsimonious sampling scheme is one where each node is sampled with the same probability *p*, independently of its connections to other nodes or other features of the network. For an Erdős–Rényi random graph, where node *i* has *X*_{i}∼Bi(*M*−1, *q*) true connections, only *Z*_{i}∼Bi(*M*−1, *pq*) are observed. Interestingly, the sampled distribution has the same form as the true distribution. In other examples, such as small-world networks and scale-free graphs, the sampled distribution does not take the same form as the true distribution (Stumpf *et al*. 2005; Stumpf & Wiuf 2005). Nonetheless, data are analysed under the implicit, though generally unacknowledged, assumption that the distribution of *Z*_{i} has the same form as that of *X*_{i} (see Stumpf *et al*. 2005).
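The claim that a thinned Bi(*M*−1, *q*) degree is again binomial, with parameter *pq*, can be checked numerically. The following is a minimal sketch, not part of the original paper: the helper names `binom_pmf` and `thin_pmf` and the values of `M`, `q` and `p` are illustrative assumptions.

```python
from math import comb

def binom_pmf(n, q):
    """Probabilities P(X = x), x = 0..n, for X ~ Bi(n, q)."""
    return [comb(n, x) * q**x * (1 - q) ** (n - x) for x in range(n + 1)]

def thin_pmf(pmf, p):
    """Distribution of Z ~ Bi(X, p): mix Bi(x, p) over pmf[x] = P(X = x)."""
    top = len(pmf)
    return [
        sum(comb(x, z) * p**z * (1 - p) ** (x - z) * pmf[x] for x in range(z, top))
        for z in range(top)
    ]

# Illustrative values only: M nodes, connection probability q, sampling probability p.
M, q, p = 12, 0.4, 0.25
observed = thin_pmf(binom_pmf(M - 1, q), p)   # degree distribution after sampling
expected = binom_pmf(M - 1, p * q)            # Bi(M - 1, pq)
max_gap = max(abs(a - b) for a, b in zip(observed, expected))
print(f"largest pmf difference: {max_gap:.2e}")
```

The two probability vectors agree up to floating-point rounding, which is the sense in which the sampled distribution keeps the form of the true one.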

Other examples include the following. (i) Assume *X*_{t} is drawn according to a counting process, where *t* denotes time. If each event (*X*_{t} in total) can only be observed with probability *p*, then the number of observable events, *Z*_{t}, is *Z*_{t}∼Bi(*X*_{t}, *p*). For example, let *X*_{t} be the number of mutations in a DNA sequence over time and *Z*_{t} the number of mutations that change the DNA sequence. If mutations arrive at rate *λ*, then *X*_{t}∼Po(*λt*) and *Z*_{t}∼Po(*pλt*), where *p* is the probability of a DNA changing mutation and 1−*p* the probability of a mutation that does not alter the DNA sequence (e.g. Felsenstein 2004).

(ii) Another example comes from cancer research (Koed *et al*. 2005). Cancer genomes are unstable and undergo frequent chromosomal alterations. Here we focus on loss of chromosomal regions. A normal genome has two copies of all chromosomes, whereas a cancer genome might have lost one of these copies, or lost regions of a chromosome (theoretically, loss of both copies is possible, but this is very rare). Loss can be inferred experimentally from SNP (single nucleotide polymorphism) or bi-allelic DNA markers typed in a normal genome and a cancer genome from the same patient: if two different alleles are observed in the normal genome (i.e. the marker is heterozygous), but only one allele is observed in the cancer genome, then one allele must have been lost. If both are observed, then nothing has been lost. Conversely, if only one allele is observed in the normal cell (i.e. the marker is homozygous), and also one is observed in the cancer cell, it is impossible to tell whether an allele has been lost or not, because the cancer genome could still have one or two copies of that allele. If the number of losses in a region with *n* markers is *X*_{n} and each marker is heterozygous with probability *p*, then the number of observable losses *Z*_{n} is *Z*_{n}∼Bi(*X*_{n}, *p*). *X*_{n} might be modelled by a *k*-order Markov chain in *n*.

(iii) Lastly, in field ecology capture–release experiments can be used in order to measure species abundance (Southwood & Henderson 2000). Quite generally, there is considerable interest in ecological time series (e.g. Powell & Steele 1994). However, at each time-point, it is impossible to sample all individuals of all species present in a given area. When averaged over the duration of the observation each individual is observed with probability *p*. Note, however, that some individuals may be sampled more than once by chance. If the true species abundance is given by a process *X*_{t}, then the observed number of individuals at each time-point, *Z*_{t}, is given by *Z*_{t}∼Bi(*X*_{t}, *p*). There are many other related biological processes which can be modelled in this way. These include, for example, all processes where a source generates particles (e.g. hepatitis C virus (HCV) produced in the liver), which leave the source vicinity with probability *p*, and where the measurements of the number of particles occur at some distance from the source (e.g. measuring HCV abundance in the blood).

In some of these examples, the distribution of *Z* will have the same functional form as the distribution of *X*. This property, of course, relies on the distribution of *X*. Our own interest was mainly motivated by the analysis of biological network data, where we noticed that many models other than Erdős–Rényi random graphs, e.g. scale-free graphs, do not have the property. In the analysis of real data, it is often assumed that the data conform to the distribution of the entire network, and that sampling does not change this distribution (apart, perhaps, from a change in parameters); see Stumpf *et al*. (2005) for a discussion. This may be erroneous and consequently lead to inferential mistakes. In other examples, e.g. modelling the evolution of a DNA sequence, the process of DNA changing mutations has the same functional form as the process of all mutations, and the same model can be used for both (with a change in parameter to *pλt* instead of *λt*). However, this comes at the cost that *p* and *λ* cannot be separated easily.

In this paper, we discuss families with the property of being *closed under binomial subsampling*, i.e. families that possess the property that the distribution of *Z*∼Bi(*X*, *p*) is in the same family of probability distributions as *X*. We characterize such families in terms of probability generating functions (pgfs) and in terms of moments (if these exist). We also determine the power series families with this property. The results are illustrated by a number of examples. Although some of the results are relatively straightforward to derive, they and their implications have, to our knowledge, not been discussed together in the literature. Finally, we make a few comments in relation to statistical inference.

## 2. Closure under binomial subsampling

Let 𝒫 be a family of probability distributions on ℕ₀={0, 1, 2, …}. It is convenient to think of 𝒫 as a family of distributions of random variables *X* given by *P*(*x*)=ℙ(*X*=*x*) for *P*∈𝒫, and we write *X*∼*P* if *X* has distribution *P*. The degenerate distribution at zero is denoted by *P*_{0}.

Further, let *G*_{X}(*s*)=𝔼(*s*^{X}) denote the pgf of *X*, and similarly let *G*_{P}(*s*) denote the pgf of *P*. If *X*∼*P*, then *G*_{X}(*s*)=*G*_{P}(*s*). Any open interval *I* on which *G*_{X}(*s*), *s*∈*I*, is finite determines the distribution of *X* uniquely; i.e. if *G*_{X}(*s*)=*G*_{Y}(*s*) for *s*∈*I*, then *X* and *Y* have the same distribution. Therefore, for convenience, *G*_{X}(*s*), *s*∈*I*, is also called the pgf of *X*.

We use the following notation throughout the paper. Let *x*_{[k]} denote the *k*th descending factorial of *x*, *x*_{[k]}=*x*(*x*−1)…(*x*−*k*+1), and *x*_{(k)} the *k*th ascending factorial of *x*, *x*_{(k)}=*x*(*x*+1)…(*x*+*k*−1).
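For readers who want to experiment with the notation, the two factorials can be written as small helper functions. This is an illustrative sketch only; the function names `falling` and `rising` are assumptions, and the empty product (*k*=0) is taken to be 1.

```python
def falling(x, k):
    """k-th descending factorial x_[k] = x (x-1) ... (x-k+1); equals 1 when k = 0."""
    out = 1
    for j in range(k):
        out *= x - j
    return out

def rising(x, k):
    """k-th ascending factorial x_(k) = x (x+1) ... (x+k-1); equals 1 when k = 0."""
    out = 1
    for j in range(k):
        out *= x + j
    return out

print(falling(5, 3))  # 5 * 4 * 3
print(rising(5, 3))   # 5 * 6 * 7
print(falling(2, 3))  # contains the factor (2 - 2), so the product vanishes
```

Note that *x*_{[k]} vanishes for integer *x*<*k*, which is why factorial moments of a variable bounded by *n* are zero for *k*>*n*.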

Definition 2.1. Let 𝒫 be a family of distributions on ℕ₀, *X* an ℕ₀-valued random variable, and define *Z* by *Z*∼Bi(*X*, *p*) for *p*∈[0,1]. If *Z* has distribution in 𝒫 whenever *X* has distribution in 𝒫, then 𝒫 is said to be closed under binomial subsampling.

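In definition 2.1, the dependence of *Z* on *p* is suppressed. The distribution of *Z* is given by
$$\mathbb{P}(Z=z)=\sum_{x\ge z}\binom{x}{z}p^{z}(1-p)^{x-z}\,\mathbb{P}(X=x). \tag{2.1}$$
Definition 2.1 is equivalent to the requirement that
$$Z=\sum_{j=1}^{X}Z_{j} \tag{2.2}$$
has distribution in 𝒫 whenever *X* has distribution in 𝒫 and the *Z*_{j} are independent Bernoulli variables with parameter *p*, independent of *X*.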
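The distribution of *Z* is a mixture of binomial probabilities over the distribution of *X*, and this can be evaluated directly for a truncated support. The sketch below is an illustration under stated assumptions: the helper names, the truncation point `n_max` and the parameter values are not from the paper. It thins a Poisson distribution and compares the result with Po(*pλ*).

```python
from math import comb, exp

def poisson_pmf(lam, n_max):
    """P(X = x), x = 0..n_max, for X ~ Po(lam), via p(x) = p(x-1) * lam / x."""
    pmf = [exp(-lam)]
    for x in range(1, n_max + 1):
        pmf.append(pmf[-1] * lam / x)
    return pmf

def thin_pmf(pmf, p):
    """P(Z = z) = sum over x >= z of C(x, z) p^z (1-p)^(x-z) P(X = x)."""
    top = len(pmf)
    return [
        sum(comb(x, z) * p**z * (1 - p) ** (x - z) * pmf[x] for x in range(z, top))
        for z in range(top)
    ]

# Illustrative values; n_max is chosen so that the neglected Poisson tail is negligible.
lam, p, n_max = 3.0, 0.4, 60
thinned = thin_pmf(poisson_pmf(lam, n_max), p)
target = poisson_pmf(p * lam, n_max)
max_gap = max(abs(a - b) for a, b in zip(thinned[:20], target[:20]))
print(f"largest difference over z = 0..19: {max_gap:.2e}")
```

The same `thin_pmf` helper can be applied to any probability vector, which makes it a convenient way to test whether a candidate family is closed.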

Further, *p*=0 implies that *P*_{0} always belongs to a family closed under binomial subsampling. Definition 2.1 also implies that if 𝒫_{i}, *i*∈*I*, are families closed under binomial subsampling, then so are ∪_{i∈I}𝒫_{i} and ∩_{i∈I}𝒫_{i}.

Definition 2.2. Let 𝒫 be closed under binomial subsampling, and let *Q*, *P*∈𝒫. Define a relation ◊ on 𝒫 in the following way: *Q*◊*P* if and only if there exists a random vector (*Z*, *X*) with *Z*∼*Q* and *X*∼*P*, such that either *Z*∼Bi(*X*, *p*) or *X*∼Bi(*Z*, *p*) for some *p*∈(0,1]. Note that *P*_{0} is only related to itself.

*Q*◊*P* is a relation between univariate distributions, but holds if a suitable bivariate distribution, which links *Q* and *P* together, exists.

*The relation* ◊ *defined in definition 2.2 is an equivalence relation on* 𝒫.
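One elementary fact that is useful when chaining two subsampling steps is that thinning with *p*₁ and then with *p*₂ has the same effect as a single thinning with *p*₁*p*₂. The sketch below checks this numerically; it is an illustration rather than the paper's argument, and the probability vector and helper name are arbitrary assumptions.

```python
from math import comb

def thin_pmf(pmf, p):
    """Distribution of Z ~ Bi(X, p) given pmf[x] = P(X = x) on {0, ..., len(pmf) - 1}."""
    top = len(pmf)
    return [
        sum(comb(x, z) * p**z * (1 - p) ** (x - z) * pmf[x] for x in range(z, top))
        for z in range(top)
    ]

# An arbitrary distribution on {0, ..., 5}, used only for illustration.
pmf_x = [0.10, 0.20, 0.05, 0.30, 0.15, 0.20]
p1, p2 = 0.6, 0.5

two_steps = thin_pmf(thin_pmf(pmf_x, p1), p2)   # Bi(Bi(X, p1), p2)
one_step = thin_pmf(pmf_x, p1 * p2)             # Bi(X, p1 * p2)
max_gap = max(abs(a - b) for a, b in zip(two_steps, one_step))
print(f"largest pmf difference: {max_gap:.2e}")
```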

The equivalence class consisting of only *P*_{0} is called the *trivial* class.

Theorem 2.4. *Let* 𝒫 *be closed under binomial subsampling and consider a non-trivial class of* 𝒫. *If P belongs to the class, then there is an interval J*_{P} *with left endpoint zero but not containing it, such that Q belongs to the class if and only if*
$$G_{Q}(s)=G_{P}(1-r+rs) \tag{2.3}$$
*for some r*∈*J*_{P}.

In theorem 2.4, *P* is called a *class representative*. Theorem 2.4 provides an alternative characterization of the relation ◊ in terms of generating functions. This characterization is perhaps more natural than the one given in definition 2.2, in that it does not involve bivariate distributions. However, definition 2.2 is closer to how we originally perceived the problem.
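In generating-function terms, binomial subsampling with probability *p* corresponds to the substitution *s*→1−*p*+*ps*, the same form that appears later in the text as *G*_{Zi}(*s*)=*G*_{Xi}(1−*r*+*rs*). The following sketch checks the identity *G*_{Z}(*s*)=*G*_{X}(1−*p*+*ps*) for an arbitrary finite distribution; the helper names and numbers are illustrative assumptions.

```python
from math import comb

def pgf(pmf, s):
    """Probability generating function G(s) = sum over x of P(X = x) s^x."""
    return sum(px * s**x for x, px in enumerate(pmf))

def thin_pmf(pmf, p):
    """Distribution of Z ~ Bi(X, p) given pmf[x] = P(X = x)."""
    top = len(pmf)
    return [
        sum(comb(x, z) * p**z * (1 - p) ** (x - z) * pmf[x] for x in range(z, top))
        for z in range(top)
    ]

pmf_x = [0.10, 0.20, 0.05, 0.30, 0.15, 0.20]   # arbitrary example distribution
p = 0.35
pmf_z = thin_pmf(pmf_x, p)

# Compare G_Z(s) with G_X(1 - p + p s) at a few points, including s > 1.
gaps = [abs(pgf(pmf_z, s) - pgf(pmf_x, 1 - p + p * s)) for s in (0.0, 0.3, 0.8, 1.0, 1.5)]
print(max(gaps))
```

Working with the substitution is often simpler than working with the mixture sum, because closure can then be read off from how the parameters enter the generating function.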

Let 𝒫 be closed under binomial subsampling and let *X*_{i}∼*P*∈𝒫 for *i*=1, 2, …. Further let *N* be a random variable with distribution on ℕ₀, and assume all *X*_{i} and *N* are mutually independent. Then the family of distributions defined by *X*=*X*_{1}+…+*X*_{N} is closed under binomial subsampling. The pgf of *X* is given by
$$G_{X}(s)=G_{N}\bigl(G_{P}(s)\bigr). \tag{2.4}$$
If *Z*∼Bi(*X*, *p*), then *Z*=*Z*_{1}+…+*Z*_{N}, with *Z*_{i}∼Bi(*X*_{i}, *p*); hence
$$G_{Z}(s)=G_{N}\bigl(G_{P}(1-p+ps)\bigr), \tag{2.5}$$
and the results follow from theorem 2.4.

*Let X be an* *-valued random variable and define a family of functions by*(2.6)*for* *and s*≥max(0, 1−1/*r*). *Then there exists* 1≤*ρ*≤∞*, such that G*_{r}(*s*) *is the pgf of a random variable with values in* *if and only if r*≤*ρ, r*≠∞. *The family of distributions defined by G*_{r}(*s*)*, r*≤*ρ, r*≠∞*, is closed under binomial subsampling*.

Let *X*_{1} and *X*_{2} be independent random variables. Further, let 𝒫_{i}, *i*=1, 2, be the family of distributions generated by *X*_{i}, *i*=1, 2. Let *Z*_{i} be defined by *G*_{Zi}(*s*)=*G*_{Xi}(1−*r*+*rs*) for *r*≤min(*ρ*_{1}, *ρ*_{2}) and *ρ*_{i} given as in theorem 2.6. Then also the family of distributions defined by *Z*_{1}+*Z*_{2} (assuming *Z*_{1} and *Z*_{2} are independent) is closed under binomial subsampling.

Let be a family closed under binomial subsampling and let *ψ*:*Ψ*→ be a parametrization of the classes that maps *ψ* to a unique member *P*_{ψ} of the class . For convenience, *ψ* is called a *class parameter* (which may be vector-valued). Then is parameterized by (*ψ*, *r*), where *ψ*∈*Ψ*, , and is an interval with boundary zero (theorem 2.4). According to theorem 2.6, for some *ρ*_{ψ}. If equality holds, then the class is said to be *full*; and if not, the class can be extended to a full class accordingly. A full class is said to be generated by the class representative *P*_{ψ}. Note that the class is generated by any member of the class.

*Let* *be the family generated by X (with r*≤*ρ as in* *theorem 2.6**). Assume* *, X*_{i} *iid* *-valued random variables and let* *be the family generated by X*_{1} *(with r*≤*ρ*_{1}*). Then* *, Z*_{i} *iid, has distribution in* *, whenever Z*_{1} *has distribution in* . *It follows that ρ*_{1}≤*ρ*.

## 3. Finite moments of all orders

We will now impose some further constraints on the family 𝒫. Explicitly, we will assume that all moments of a random variable *X* with distribution *P* in 𝒫 exist, and that *P* is determined by these moments.

Abusing notation slightly, we say that *X* is in 𝒫 whenever *X* has a distribution in 𝒫. Further, if 𝒫 is closed under binomial subsampling, then we let *Z* denote an element in the class of *X*, and assume that (*Z*, *X*) is constructed such that definition 2.2 is fulfilled, i.e. either *Z*∼Bi(*X*, *p*) or *X*∼Bi(*Z*, *p*).

Note that for *Z* in the class of *X*,(3.1)and is equivalently parameterized by (*ψ*, *τ*), where *τ* denotes expectation and *ψ* is a class parameter.

Henceforth, we assume a family of distributions is parameterized by *ω*=(*ϕ*, *τ*)∈*Ω*, where *τ* denotes expectation of the distribution and *ϕ* is an additional parameter. The parameterization is assumed to be one-to-one. No topological constraints are put on the space *Ω*.

*A parameterized family* 𝒫 *is closed under binomial subsampling with ϕ as a class parameter if and only if*
$$\mathbb{E}\bigl(X_{[k]}\bigr)=a_{k}(\phi)\,\tau^{k},\qquad k\ge 1, \tag{3.2}$$
*where a*_{k}(*ϕ*) *is a constant depending on k and ϕ only, a*_{1}(*ϕ*)=1, *τ*∈*T*_{ϕ}*, and T*_{ϕ} *is an interval containing 0, either T*_{ϕ}=[0,*t*_{ϕ}]*, or T*_{ϕ}=[0,*t*_{ϕ}) *(here t*_{ϕ} *is potentially infinity)*.

Corollary 3.2. *Any series of positive numbers a*_{k}*, k*=1, 2, …*, such that a*_{1}=1*, defines a family of distributions closed under binomial subsampling, by the requirement*
$$\mathbb{E}\bigl(X_{[k]}\bigr)=a_{k}\,\tau^{k},\qquad k\ge 1, \tag{3.3}$$
*for τ in some interval T containing 0*.
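The moment characterization rests on the fact that subsampling multiplies the *k*th factorial moment by *p*^{k}, so the ratios *a*_{k}=𝔼(*X*_{[k]})/*τ*^{k} are left unchanged while the mean *τ* is scaled by *p*. The sketch below illustrates this for an arbitrary finite distribution; the helper names and the example values are assumptions made for illustration.

```python
from math import comb

def falling(x, k):
    """Descending factorial x_[k] = x (x-1) ... (x-k+1)."""
    out = 1
    for j in range(k):
        out *= x - j
    return out

def factorial_moment(pmf, k):
    """E(X_[k]) for a distribution given as pmf[x] = P(X = x)."""
    return sum(falling(x, k) * px for x, px in enumerate(pmf))

def a_coeffs(pmf, k_max):
    """Ratios a_k = E(X_[k]) / tau^k, k = 1..k_max, where tau = E(X)."""
    tau = factorial_moment(pmf, 1)
    return [factorial_moment(pmf, k) / tau**k for k in range(1, k_max + 1)]

def thin_pmf(pmf, p):
    """Distribution of Z ~ Bi(X, p) given pmf[x] = P(X = x)."""
    top = len(pmf)
    return [
        sum(comb(x, z) * p**z * (1 - p) ** (x - z) * pmf[x] for x in range(z, top))
        for z in range(top)
    ]

pmf_x = [0.10, 0.20, 0.05, 0.30, 0.15, 0.20]   # arbitrary example distribution
p = 0.35
pmf_z = thin_pmf(pmf_x, p)
a_x, a_z = a_coeffs(pmf_x, 4), a_coeffs(pmf_z, 4)
print("a_k before thinning:", a_x)
print("a_k after thinning: ", a_z)
print("means:", factorial_moment(pmf_x, 1), factorial_moment(pmf_z, 1))
```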

The family defined in corollary 3.2 is said to be generated by {*a*_{k}}_{k} and *T* is said to be the range of {*a*_{k}}_{k}. Lemma 3.3 establishes how ℙ(*X*=*x*) can be found from {*a*_{k}}_{k} and *τ* for *X* in the family generated by {*a*_{k}}_{k}.

*Let X be an* *-valued random variable with finite moments. If*(3.4)*for all i, then*(3.5)

_{k} *a*_{k} <+∞.

Let *a*_{k}=(*k*+1)2^{−k} for *k*=1, 2, …. Then(3.6)for *x*=0, 1, … and *τ*∈[0,2], defines a family closed under binomial subsampling with . For *τ*=2, *X*−1 is Poisson distributed with intensity 1.
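The example can be explored numerically. The sketch below uses one probability function whose factorial moments are (*k*+1)(*τ*/2)^{k}, namely exp(−*τ*/2)(*τ*/2)^{x}(1+*x*−*τ*/2)/*x*!. This expression was derived here from the stated moments and should be read as an assumption about the displayed formula, not a quotation of it. The code checks the second factorial moment and that thinning with *p* maps the parameter *τ* to *pτ*; helper names and values are illustrative.

```python
from math import comb, exp

def example_pmf(tau, n_max):
    """Candidate pmf exp(-tau/2) (tau/2)^x (1 + x - tau/2) / x!, x = 0..n_max (derived, see text)."""
    half = tau / 2
    weights = [exp(-half)]                      # exp(-tau/2) (tau/2)^x / x!
    for x in range(1, n_max + 1):
        weights.append(weights[-1] * half / x)
    return [w * (1 + x - half) for x, w in enumerate(weights)]

def thin_pmf(pmf, p):
    """Distribution of Z ~ Bi(X, p) given pmf[x] = P(X = x)."""
    top = len(pmf)
    return [
        sum(comb(x, z) * p**z * (1 - p) ** (x - z) * pmf[x] for x in range(z, top))
        for z in range(top)
    ]

tau, p, n_max = 1.5, 0.4, 60                    # illustrative values with tau in [0, 2]
pmf_x = example_pmf(tau, n_max)
total = sum(pmf_x)
second_factorial_moment = sum(x * (x - 1) * px for x, px in enumerate(pmf_x))

thinned = thin_pmf(pmf_x, p)
target = example_pmf(p * tau, n_max)
max_gap = max(abs(a - b) for a, b in zip(thinned[:20], target[:20]))
print(total, second_factorial_moment, max_gap)
```

For *τ*=2 the factor (1+*x*−*τ*/2) reduces to *x*, so the candidate probabilities become e^{−1}/(*x*−1)! for *x*≥1, in agreement with the statement that *X*−1 is then Poisson with intensity 1.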

If the state space is {0,1}, then the family generated by is the binomial family with distributions Bi(1, *τ*).

If the state space is {0,1,2}, then the family generated by and , *a*_{2}>0, has(3.7)(3.8)and(3.9)with *τ*≤1/*a*_{2} for 0.5≤*a*_{2} and for 0<*a*_{2}<0.5. If *a*_{2}≠0.5, then the family is not a binomial family and the family does not contain any binomial distributions, apart from the degenerated distribution *P*_{0}.

## 4. Power series families

We start with a number of well-known examples.

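Example 4.1. The binomial distribution, Bi(*n*, *q*). The members of 𝒫 have distribution given by
$$P(x)=\binom{n}{x}q^{x}(1-q)^{n-x} \tag{4.1}$$
for *x*=0, 1, …, *n*, and *q*∈[0,1]. Here *τ*=*nq*, *ψ* is zero-dimensional, and *a*_{k}=*n*_{[k]}/*n*^{k} for *k*≥1.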

Example 4.2. The Poisson distribution, Po(*λ*). The members of 𝒫 have distribution given by
$$P(x)=\frac{\lambda^{x}}{x!}\,e^{-\lambda} \tag{4.2}$$
for *x*∈ℕ₀ and *λ*≥0. Here *τ*=*λ*, *ψ* is zero-dimensional and *a*_{k}=1 for all *k*.

Example 4.3. The negative binomial distribution, NB(*q*, *ψ*). The members of 𝒫 have distribution given by
$$P(x)=\frac{\alpha_{(x)}}{x!}\,q^{x}(1-q)^{\alpha} \tag{4.3}$$
for *x*∈ℕ₀, *q*∈(0,1) and *α*>0. Here *τ*=*αq*/(1−*q*), *ψ*=*α* and *a*_{k}(*α*)=*α*_{(k)}/*α*^{k} for *k*≥1.
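The coefficients *a*_{k} stated for the binomial, Poisson and negative binomial families can be verified by computing factorial moments directly from the standard probability functions. The sketch below does this under stated assumptions: the probability functions are the textbook forms (with the negative binomial written so that its mean is *αq*/(1−*q*)), the infinite supports are truncated, and all names and parameter values are illustrative.

```python
from math import comb, exp

def falling(x, k):
    """Descending factorial x_[k]."""
    out = 1
    for j in range(k):
        out *= x - j
    return out

def rising(x, k):
    """Ascending factorial x_(k)."""
    out = 1
    for j in range(k):
        out *= x + j
    return out

def a_coeffs(pmf, k_max):
    """a_k = E(X_[k]) / tau^k for k = 1..k_max, with tau = E(X)."""
    moments = [sum(falling(x, k) * px for x, px in enumerate(pmf)) for k in range(1, k_max + 1)]
    tau = moments[0]
    return [m / tau**k for k, m in enumerate(moments, start=1)]

def binom_pmf(n, q):
    return [comb(n, x) * q**x * (1 - q) ** (n - x) for x in range(n + 1)]

def poisson_pmf(lam, n_max):
    pmf = [exp(-lam)]
    for x in range(1, n_max + 1):
        pmf.append(pmf[-1] * lam / x)
    return pmf

def negbin_pmf(alpha, q, n_max):
    """alpha_(x) / x! * q^x * (1-q)^alpha, built from p(x) / p(x-1) = (alpha + x - 1) q / x."""
    pmf = [(1 - q) ** alpha]
    for x in range(1, n_max + 1):
        pmf.append(pmf[-1] * (alpha + x - 1) * q / x)
    return pmf

n, alpha, K = 6, 2.5, 4                          # illustrative values
a_binom = a_coeffs(binom_pmf(n, 0.3), K)
a_pois = a_coeffs(poisson_pmf(2.0, 100), K)
a_negbin = a_coeffs(negbin_pmf(alpha, 0.4, 400), K)
print(a_binom, [falling(n, k) / n**k for k in range(1, K + 1)])
print(a_pois)
print(a_negbin, [rising(alpha, k) / alpha**k for k in range(1, K + 1)])
```

In each case the computed ratios do not depend on the value chosen for *q* or *λ*, only on *n* or *α*, which is what makes these families closed.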

A *k*-order power series family is a family of the form(4.4)for , *b*(*k*)>0, and(4.5)for 0≤*x*<*k*, . Here *g*(*λ*) is a normalizing constant, such that(4.6)In particular, a *k*-order power series is an *m*-order power series for all *m*≥*k*.

Examples of 0-order power series families are given in examples 4.1–4.3, whereas example 3.4 is not a power series family for any *k*. However, according to definition 4.4, the family in example 3.6 is a 2-order power series family.

*A k-order power series family closed under binomial subsampling fulfills, for some suitable parameterization λ*∈*Λ and choice of g*(*λ*)*, one of the following conditions: for x*≥*k, either (1),*(4.7)*for fixed* *; or (2),*(4.8)*for fixed α*>*0; or (3),*(4.9)

*The only 0-order power series families closed under binomial subsampling are the binomial family, the Poisson family and the negative binomial family for fixed α*.

The (modified) logarithmic distributions with *c*∈(0,1] and *ψ*∈(0,∞),(4.10)for \{0} and(4.11)form a 1-order power series family closed under binomial subsampling for fixed *c* and *ψ*. Here(4.12)thus the range of depends on *ψ*.

## 5. Mixing

The construction of *Z*∼Bi(*X*, *p*) can naturally be regarded as a mixture of binomial distributions Bi(*n*, *p*) over *n* with prior distribution . The resulting family of distributions for *p*∈[0,1] is closed under binomial subsampling, simply because the binomial families Bi(*n*, *p*), *p*∈[0,1], are closed under binomial subsampling.

We will give some further examples of mixing. For simplicity, in examples 5.1 and 5.3, assume all moments exist and that the parameter space of a family closed under binomial subsampling is a product space, *Ω*=*Ψ*×*T*. This assumption can easily be relaxed at the cost of a more complex notation.

Mixing over *Ψ*. Let *g*(*x*; *ϕ*) be a prior on *ψ*∈*Ψ*, depending on the parameter *ϕ*. Then(5.1)assuming the integrals exist. Let *υ*=*β*_{1}(*ϕ*)*τ*, then(5.2)defines a family _{mix} closed under binomial subsampling with class parameter *ϕ*. A subcase is finite mixtures(5.3)where *q*=(*q*_{1}, …, *q*_{m}) and .

Let be a *k*-order power series family closed under binomial subsampling, and let . Define *δ*=1−*c*(*λ*_{0}) and by for *x*≥*k* and otherwise. Let be the family generated by . It follows that ={*Q*_{λ}|*λ*≤*λ*_{0}} is a *k*-order power series family, such that *δQ*_{λ}(*x*)=*P*_{λ} for all *x*≥*k* and *λ*≤*λ*_{0}.

Define *R*_{λ}(*x*) by *R*_{λ}(*x*)=[*P*_{λ}(*x*)−*δQ*_{λ}(*x*)]/(1−*δ*), *λ*≤*λ*_{0}. Note that *R*_{λ}(*x*)=0 for *x*≥*k* and *R*_{λ}(*x*)≥0 for 0≤*x*<*k*. Also *R*_{λ} is a probability measure for all *λ*≤*λ*_{0}, and further the family ={*R*_{λ}|*λ*≤*λ*_{0}} is closed under binomial subsampling.

In consequence, any *P*_{λ}∈, with *λ*≤*λ*_{0} can be written as a mixture of two measures *Q*_{λ}∈ and *R*_{λ}∈,(5.4)such that and are closed under binomial subsampling, and such that is generated by a member of a 0-order power series family with *b*(*x*)=0 for *x*≤*k*, and has support in {0, 1, …, *k*}. The two families cannot be extended beyond *λ*_{0}, because for *x*≤*k*.

Example 4.7 provides one example: can here be given as the probability measure with *c*=1 and *λ*_{0}=(e^{ψ}−1)/*ψ*.

Example 5.3. Mixing over *T*. Let a prior distribution, *f*(*x*; *ν*, *ϕ*), be given on *T*, depending on the parameter (*ν*, *ϕ*). If the moments of the prior fulfill
$$\int_{T}x^{k}f(x;\nu,\phi)\,dx=c_{k}(\phi)\,\nu^{k} \tag{5.5}$$
for *k*≥1, then the mixture of 𝒫 with *f*(*x*; *ν*, *ϕ*) is also closed under binomial subsampling, because
$$\mathbb{E}\bigl(X_{[k]}\bigr)=\int_{T}a_{k}(\psi)\,x^{k}f(x;\nu,\phi)\,dx=a_{k}(\psi)\,c_{k}(\phi)\,\nu^{k}. \tag{5.6}$$
Here the mean of the mixture is *μ*=*c*_{1}(*ϕ*)*ν*, hence
$$\mathbb{E}\bigl(X_{[k]}\bigr)=\alpha_{k}(\psi,\phi)\,\mu^{k} \tag{5.7}$$
for *k*≥1, and *α*_{k}(*ψ*, *ϕ*)=*a*_{k}(*ψ*)*c*_{k}(*ϕ*)/*c*_{1}(*ϕ*)^{k} defines a family 𝒫_{mix} closed under binomial subsampling with class parameter (*ψ*, *ϕ*). Prior distributions fulfilling equation (5.5) include the Gamma distributions, *Γ*(*ν*, *ϕ*), with moments (5.8) and the folded normal distribution (i.e. the absolute value of a normal) with moments (5.9). If 𝒫 is the family of Poisson distributions with *a*_{k}=1, then mixing over *λ* with the folded normal yields (5.10) and from lemma 3.3, (5.11). The expectation of *X* is .
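As a worked instance of the coefficient formula *α*_{k}(*ψ*, *ϕ*)=*a*_{k}(*ψ*)*c*_{k}(*ϕ*)/*c*_{1}(*ϕ*)^{k}, consider the Poisson family (*a*_{k}=1) mixed over its mean with a Gamma prior. The parameterization used below, with shape *ϕ* and scale *ν*, is an assumption chosen so that the prior moments have the form *c*_{k}(*ϕ*)*ν*^{k}; the paper's own convention for *Γ*(*ν*, *ϕ*) may differ.

```latex
\begin{aligned}
\mathbb{E}(\tau^{k}) &= \frac{\Gamma(\phi+k)}{\Gamma(\phi)}\,\nu^{k} = \phi_{(k)}\,\nu^{k},
  \qquad\text{so } c_{k}(\phi)=\phi_{(k)},\\
\mu &= c_{1}(\phi)\,\nu = \phi\,\nu,\\
\alpha_{k}(\phi) &= \frac{a_{k}\,c_{k}(\phi)}{c_{1}(\phi)^{k}} = \frac{\phi_{(k)}}{\phi^{k}} .
\end{aligned}
```

Under this assumption the coefficients coincide with the negative binomial coefficients *a*_{k}(*α*)=*α*_{(k)}/*α*^{k} at *α*=*ϕ*, in line with the classical fact that a Gamma mixture of Poisson distributions is negative binomial.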

## 6. Inference

Assuming *Z*∼Bi(*X*, *p*), then the conditional distribution of *Z* given *X*=*x*, *Z*|*X*=*x*, is *S*-ancillary about inference on *p*, and *X* is *S*-sufficient for inference on parameters, *ω*∈*Ω*, describing the distribution of *X*. The distribution of *Z* is generally not sufficient for inference on either of the parameters *p* and *ω*.

Let us now consider a one-dimensional family of distributions closed under binomial subsampling. Denote the compound parameter by *θ*=*pτ*. Then the distribution of *Z* depends only on *θ*, and the conditional distribution of *X*|*Z*=*z* can be found from those of *X*, *Z* and *Z*|*X*=*x*. For some of the families discussed in this paper, the distribution of *X*|*Z*=*z* is in fact *L*-nonformative about *θ* (Barndorff-Nielsen 1999). Hence, it can be argued that *Z* contains all the information available about *θ* and *Z* is said to be *L*-sufficient for inference on *θ*. Families for which *Z* is *L*-sufficient include the 0-order and 1-order power series families. However, this is not a general feature of families closed under binomial subsampling. For instance, *X*|*Z*=*z* is not *L*-nonformative about *θ* in examples 3.4 and 3.6, despite the family in example 3.6 being a 2-order power series family.

To provide an example, consider the Poisson family. The conditional distribution of *X*|*Z*=*z* is given by(6.1)where *X*∼Po(*τ*)=Po(*θ*/*p*). Let be the profile likelihood estimate of *p* for fixed *θ*. Then(6.2)and the relative conditional profile likelihood becomes(6.3)for two values of *θ*, *θ*_{1} and *θ*_{2}. Since it only depends on *θ*_{1}, *θ*_{2} and *z*, *not* *x*, it is *L*-nonformative about *θ* (Barndorff-Nielsen 1999) and *Z* is *L*-sufficient for inference on *θ*, the compound parameter.
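The Poisson case can be worked through numerically. The sketch below relies on the standard Poisson thinning result that, given *Z*=*z*, the unobserved part *X*−*z* is Poisson with mean *τ*(1−*p*)=*θ*(1−*p*)/*p*. The closed form for the maximizing value of *p* at fixed *θ*, *θ*/(*θ*+*x*−*z*), is derived here and is an assumption about what the displayed equations state; function names and numbers are illustrative.

```python
from math import comb, exp, factorial

def poisson(x, mean):
    """Poisson probability of the value x for the given mean."""
    return exp(-mean) * mean**x / factorial(x)

def cond_by_bayes(x, z, tau, p):
    """P(X = x | Z = z) from the joint law: X ~ Po(tau), Z | X = x ~ Bi(x, p)."""
    joint = poisson(x, tau) * comb(x, z) * p**z * (1 - p) ** (x - z)
    return joint / poisson(z, p * tau)          # Z ~ Po(p * tau) by Poisson thinning

def cond_closed_form(x, z, theta, p):
    """Same probability in terms of theta = p * tau: X - z is Po(theta (1 - p) / p)."""
    return poisson(x - z, theta * (1 - p) / p)

def p_hat(x, z, theta):
    """Maximizer of cond_closed_form over p for fixed theta (sets the Poisson mean to x - z)."""
    return theta / (theta + x - z)

x, z = 9, 4
bayes_gap = abs(cond_by_bayes(x, z, 5.0, 0.4) - cond_closed_form(x, z, 2.0, 0.4))

# Conditional profile likelihood of theta: the same value for every theta tried.
profile = [cond_closed_form(x, z, theta, p_hat(x, z, theta)) for theta in (0.5, 2.0, 7.0)]
print(bayes_gap, profile)
```

In this sketch the profiled conditional likelihood takes the same value for every *θ*, so ratios between two values *θ*_{1} and *θ*_{2} carry no information, which is the behaviour described as *L*-nonformative.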

Let us return to the general setting. The likelihood function of *z* is(6.4)If the family is closed under binomial subsampling, simulation from the distribution of *X* (for arbitrary parameter *ω*) and simulation from the distribution of *Z* are computationally the same, as *Z* and *X* belong to the same family of distributions. Thus, the likelihood function for *Z*=*z* has the same computational tractability as the likelihood function for *X*=*x*. Conversely, if the family is not closed under binomial subsampling, then simulation from the distribution of *Z* might be a considerably harder task than simulation from the distribution of *X*, because the obvious, or straightforward, way to proceed is to simulate *X*, then sample *Z* from *X*. The upside here is, of course, that (given sufficient data) we can separate *p* and *ω* in the likelihood of *Z*=*z*. This is not possible if we consider *Z* with distribution in a family closed under binomial subsampling.
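The two simulation routes discussed above can be compared in a small experiment. The sketch below uses the Poisson family as an example of a closed family: *Z* is simulated either in two stages (draw *X*, then subsample it) or directly from Po(*pλ*). The sampler is Knuth's multiplication method, chosen here only to keep the example within the standard library; the seed, sample size and parameter values are arbitrary assumptions.

```python
import random
from math import exp

def draw_poisson(rng, mean):
    """Knuth's multiplication method for a Poisson draw; adequate for small means."""
    limit, k, prod = exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def draw_thinned(rng, x, p):
    """Z | X = x ~ Bi(x, p), generated as a sum of x Bernoulli(p) indicators."""
    return sum(rng.random() < p for _ in range(x))

rng = random.Random(2006)                       # arbitrary seed for reproducibility
lam, p, n = 4.0, 0.3, 20000

# Route 1: simulate X, then sample Z from X (always available).
two_stage = [draw_thinned(rng, draw_poisson(rng, lam), p) for _ in range(n)]
# Route 2: simulate Z directly from the same family (available because the family is closed).
direct = [draw_poisson(rng, p * lam) for _ in range(n)]

mean_two_stage = sum(two_stage) / n
mean_direct = sum(direct) / n
print(mean_two_stage, mean_direct, p * lam)
```

Both sample means should be close to *pλ*; the direct route needs only the compound parameter, which also illustrates why *p* and *λ* cannot be separated from observations of *Z* alone.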

## Acknowledgments

C.W. is supported by a grant from the Danish Cancer Society. M.P.H.S. is a Wellcome Trust Research Fellow. Financial support from the Carlsberg Foundation and the Royal Society is gratefully acknowledged. Oskar Hagberg is thanked for several useful suggestions that improved the presentation.

## Footnotes

- Received May 18, 2005.
- Accepted November 23, 2005.

- © 2006 The Royal Society