## Abstract

In this paper, the *r*th-order univariate Hermite distributions are proposed to model the number of dicentrics in biological dosimetry. These families of distributions are introduced from compound Poisson process modelling. Regression models appropriate for analysing the number of dicentrics as a function of doses of radiation are presented, and an example of application is also given.

## 1. Introduction

Almost 100 years ago, A. K. Erlang published his famous paper ‘*The Theory of Probabilities and Telephone Conversations*’ (1909), which was the beginning of the branch of Science known as Queueing Theory. In this innovative paper, Erlang proved that the number of calls at a call-centre originated during a certain interval of time follows a Poisson distribution. His arguments are applicable to many other different events, such as the number of particles emitted from a radioactive mass during a fixed period of time or, replacing the time by area or volume, the number of pine trees per unit area of forest or the number of stars in a given volume of space. This stochastic process, that keeps count of the number of events that have occurred up to time *t*, is called the *Poisson process*.

Let *N*(*t*) denote the number of events (calls, particles, etc.) in the time interval (0,*t*]. We assume that there are zero events at the starting time, *N*(0) = 0. For a Poisson process, the probability distribution of the number of events in a time interval is,

The parameter *λ* is called the *intensity* of the process.

The Poisson process is used to model the appearance of dicentric chromosomes when samples of blood are irradiated. It can be a reasonable assumption for low linear energy transfer (low-LET) and small doses. In fact, it is well known that the Poisson process also describes the number of ionizing particles hitting a photodetector, cell, device, etc. However, after exposure to high-LET radiation types, the counts of dicentrics often exhibit overdispersion and consequently, the Poisson process is not a good model.

Since early radiobiology it was accepted that cellular damage induced by radiation is generally not a simple Poisson process. It was Neyman & Puri (1976) who combined the concept of primary radiation particles and secondary loss events to give a *compound Poisson process*. This compound Poisson process was later adapted by Nelson (1984) for a comprehensive model of radiation effects in mammalian cells *in vitro*. Independently, and for the damage on DNA and the induction of exchange type aberrations, as dicentrics are, a compound Poisson process, called batch-Poisson, was also proposed by Sachs *et al.* (1992).

A stochastic process *X*(*t*) is called a compound Poisson process if it can be represented by
1.1
where *N*(*t*) is a Poisson process and *Y*_{1}, *Y*_{2},… are independent, identically distributed random variables that are also independent of *N*(*t*) (see Daley & Vere-Jones 2003). Virsik & Harder (1981) consider that the particles traversing cell nucleus follow a Poisson process, and for each particle there is a probability *h*(*x*) to produce *x* aberrations. Then, the process for describing the formation of dicentrics is just a compound Poisson process, as is illustrated in figure 1.

Except for Virsik & Harder (1981), the other papers offer no methods for how to fit appropriate compound Poisson distributions to the observed number of dicentrics.

In §2, we will see some families of distributions generated by compound Poisson processes that can be useful for several practical applications. In §3, these distributions will be applied to biological dosimetry. A naive method to select the order of the model will be described, and some regression models that relate the mean number of dicentrics with the doses of radiation received will also be introduced.

## 2. The *r*th-order univariate Hermite distribution

Let us first recall an important tool used to describe any count random variable. Given a count random variable *X*, its *probability-generating function* (pgf) *Φ*_{X}(*s*) is defined as , where the coefficients of this power series are the probabilities *p*_{k} = *P*(*X* = *k*). The pgf characterizes the distribution of *X* and it also summarizes a large amount of information. The derivatives of *Φ*_{X}(*s*) at *s* = 0 provide the probability mass function of *X* and the derivatives at *s* = 1 provide the moments if they exist. For instance, *Φ*′_{X}(1) = *E*(*X*) = *μ* and *Φ*′′_{X}(1) = *E*(*X*(*X* − 1)) = *σ*^{2} + *μ*^{2} − *μ*, where *σ*^{2} is the variance of *X*. A Poisson random variable *X*, with *E*(*X*) = *λ*, has a very simple pgf: *Φ*_{X}(*s*) = e^{λ(s − 1)}.

If *N*(*t*) is a Poisson process with intensity *λ*, it can be shown that the pgf of *X*(*t*), the compound Poisson process given in equation (1.1), is
2.1
where *Φ*_{Y}(*s*) is the common pgf of the *Y*_{i}. Because the pgf is a power series, the distributions described in equation (2.1) can be called Poisson–Power Series (Kemp & Kemp 1965). It is interesting to point out that when the distribution of *Y*_{i} is Poisson, the distribution of the number of dicentrics *X*(*t*), for a fixed time *t*, is a Neyman type A or Poisson–Poisson distribution. This is the model introduced by Virsik & Harder (1981). When *Y*_{i} follows a logarithmic distribution, *X*(*t*) is negative binomial distributed, and this is another distribution widely used to describe the counts of dicentrics.

It is a reasonable assumption to consider that *Y*_{i} follows a distribution taking a finite range of values, 0,1,…,*r*, with probabilities *p*_{0},*p*_{1},…,*p*_{r}. In our context, it means that each particle produces a maximum number of dicentrics *r*. Then, , replacing this pgf in equation (2.1), taking into account that *p*_{0} = 1 − *p*_{1} − ⋯ − *p*_{r} and making *λ*_{i} = *λp*_{i}, *i* = 1,2,…,*r*, we obtain
2.2

For *r* = 2, the process described in equation (2.2) is known as the Gauss–Poisson process. It was introduced by Newman (1970) using a different approach (see also Daley & Vere-Jones 2003). Gauss–Poisson processes are widely used in statistical mechanics, queuing theory, ecology, etc. When the *Y*_{i} in equation (1.1) is distributed as a binomial(*n*,*π*), then *X*(*t*) follows a Poisson–Binomial distribution (see Douglas 1980). In the particular case when *n* = 2, we obtain again the Gauss–Poisson process (*r* = 2).

Note that the multiplicative effect of *t* acts directly over the parameters *λ*_{i}, *i* = 1,…,*r*. In future, without loss of generality, we will consider that *t* = 1. Notice that when *r* = 1, the pgf of the Poisson distribution is obviously obtained. When *r* = 2, the pgf in equation (2.2) corresponds to the Hermite distribution (Kemp & Kemp 1965, 1966). For *r* > 2, Milne & Westcott (1993) call these distributions *Generalized Hermite* or *r*th-order univariate Hermite. The latter name is preferable because Gupta & Jain (1974) used the name ‘generalized Hermite distribution’ previously, related to the specific case in equation (2.2), where only *λ*_{1} and *λ*_{r} are non-zero.

The parameters *λ*_{i}, *i* = 1,…,*r* are the intensities or mean arrival rates for *i*-simultaneous arrivals. Then, the domain of the parameters is *λ*_{i} ≥ 0, *i* = 1,…,*r*. Many arrival processes are modelled using a Poisson process but it does not allow *i*-simultaneous arrivals, with *i* > 1. However, in the real world, simultaneous arrivals can happen. For instance, this is the situation of the customers arriving to the queue or waiting list of restaurants or cinemas. Consequently, the *r*th-order univariate Hermite distribution can be a good alternative to describe the resulting counts of these arrival processes.

The probabilities for an *r*th-order univariate Hermite distribution can be calculated using the following lemma (Puig & Valero 2007):

### Lemma 2.1

*Let X be a count random variable with a pgf as in equation (2.2). Then its probabilities p*_{k} *= P(X = k) satisfy the recurrence relation,*
2.3
*with* *and p*_{−1} *= p*_{−2} *= ⋯ = p*_{1 − r} *= 0.*

Note that any count random variable *X* with a pgf like in equation (2.2) can be interpreted as a linear combination of independent Poisson random variables , with *E*(*X*_{i}) = *λ*_{i}. This is a direct consequence of the fact that the pgf of a sum of independent random variables is the product of the pgfs of its components. From this linear combination, direct calculations show that the cumulants are .

In particular, the population mean and variance have the simple expressions, 2.4

The dispersion coefficient *δ* = *σ*^{2}/*μ* can be calculated from this, and it satisfies 1 ≤ *δ* ≤ *r*. The extreme value *δ* = 1 corresponds to the case, where only *λ*_{1} is different from zero, that is, the Poisson distribution. On the other hand, *δ* = *r* corresponds to the situation where only *λ*_{r} is different from zero, that is, *r* times a Poisson distribution. Consequently, these models can be used in practice to analyse datasets presenting overdispersion. In practice, the calculation of the sample coefficient of dispersion can be helpful to choose the appropriate order *r* of the model. For instance, if the sample coefficient of dispersion is *d* = 2.5, the Hermite distribution would not be appropriate (*r* = 2) but the third, fourth or a higher order model could be.

An interesting property of the *r*th-order univariate Hermite distribution can be stated in terms of its factorial cumulants. Given any random variable *X* with finite moments of all the orders, the logarithm of its pgf can be expanded in a power series at *s* = 1, in the form,
2.5

The coefficient of (*s* − 1)^{i}/*i*! is the *i*th factorial cumulant *κ*_{(i)}. Moments, cumulants and factorial cumulants are closely related, and any *κ*_{(i)} can be written in terms of the moments or the cumulants and vice versa. Practical conversion tables can be found, for instance, in Douglas (1980). In particular, *κ*_{(1)} = *μ* and *κ*_{(2)} = *σ*^{2} − *μ*. Note that the log-pgf of an *r*th-order univariate Hermite random variable is an *r*th degree polynomial, which can be expressed as,
2.6

Comparing in equation (2.6) with (2.5), the *r*th-order univariate Hermite distribution can be understood as a kind of non-parametric approach in order to approximate any general count distribution.

Comparing equations (2.6) with (2.2), linear relations between the *λ*’s and the factorial cumulants can be established. For the Poisson distribution *κ*_{(i)} = 0 for *i* ≥ 2. In fact, this property characterizes the Poisson distribution. This vanishing property of the factorial cumulants is similar to that of the ordinary cumulants for the Gaussian distribution. In general, for the *r*th-order univariate Hermite distribution, *κ*_{(i)} = 0 for *i* ≥ *r* + 1 and this is also a characterizing property.

In order to apply the *r*th-order univariate Hermite distribution to regression models, it is more convenient to use an alternative parameterization including *μ* and *δ* because these two parameters are particularly meaningful. Moreover, in regression models, the parameter *μ* is important because it is usually related to the explanatory variables or covariates. Note that the coefficient of dispersion *δ* is essentially the mean-scaled second-order factorial cumulant. Similarly, the mean-scaled factorial cumulants , *i* > 2, can be defined as . For the regression models that will be analysed in the next section, the parameterization will be adopted, because it has been very useful in our applications. For the models with *r* = 2 (Hermite distribution), 3 and 4, the relations needed for this change of parameters are shown in table 1.

To calculate probabilities with this parameterization, it is only necessary to replace the *λ*’s in equation (2.3) by the expressions shown in table 1. In §3, model selection and parameter estimation will be studied in the context of specific applications to radioprotection.

## 3. Applications to biological dosimetry

An example in which the Poisson distribution is widely used as a model is in biological dosimetry. The main objective of biological dosimetry in radioprotection is to quantify the dose received in individuals who have been exposed to ionizing radiation (IAEA 2001). This is essential for predicting the derived health consequences; for both early effects, such as vomiting, skin injuries or haematopoietic depletion; and late effects like the development of cancer (Waselenko *et al.* 2004). Nowadays, the most widely used and accepted method is the analysis of the induced chromosome aberrations, in particular the analysis of the frequency of dicentrics observed in metaphases from peripheral blood lymphocytes. To quantify the dose of a possible exposure to radiation, it is necessary to establish previous dose–effect calibration curves irradiating peripheral blood lymphocytes to a certain number of doses.

Ionizing radiation is characterized by the production of discrete and independent energy deposition events in time and space that, when impacting with a cell nucleus, can induce a wide range of DNA lesions. To respond, a cell has complex signal transduction, cell-cycle checkpoint and repair pathways. Among the different lesions induced, double-stand breaks (DSBs) are critical lesions and their misrepair or lack of repair leads to the formation of chromosome aberrations. At molecular level, the two major DSB repair mechanisms are the non-homologous end-joining (NHEJ) and homologous recombination repair. When peripheral blood lymphocytes are irradiated in the G0 phase, where cells are in a quiescent non-dividing status, each DSB induced in a chromosome makes two free ends. In G0, NHEJ is the one which contributes mostly, and because NHEJ is an error-prone mechanism, the chromosome can be restituted, restoring the DNA sequence to its pre-DSB configuration, or occasionally erroneous repair results in the free ends of different DSB joining. It must be noted that the repair–misrepair process is another type of random process, with decreasing intensity in time proportional to the amount of remaining unrepaired DSB. The misrepair results in an exchange chromosome aberration, as dicentrics are. This repair mechanism corresponds to the classical breakage-and-reunion pathway of aberration formation (Savage 1998), in which aberrations such as dicentrics result from the misrepair of two independent DSBs. The role of the Poisson distribution in describing the dispersion of the number of dicentrics per cell has been discussed by several authors (Edwards *et al.* 1979; Merkle 1983). Deviation of the dispersion index from 1 is commonly used to check this assumption. This is more probable for high-LET radiation, such as α-particles or heavy nuclei, that deposit more rapidly the energy along their track, producing a multiplicity of aberrations per particle traversal following a compound Poisson process. Departures from Poisson distribution, after high-LET radiation exposures, have been observed in Barquinero *et al.* (2004).

Table 2 illustrates the results of one of these experiments. Peripheral blood samples have been exposed to 10 different doses of 1480 MeV oxygen ions (Di Giorgio *et al.* 2004). Irradiated lymphocytes were stimulated to enter mitosis with phytohaemagglutinin and were cultured for 48 h. To obtain metaphase chromosome spreads, colcemid was added 2 h before harvesting. Then, for each dose, the number of dicentric chromosomes (figure 2) were counted. For instance, for dose = 0.120, we have 1438 cells with 0 dicentrics, 55 with 1, five with 2 and two with 3 dicentrics. Note that the sample sizes used by Di Giorgio *et al.* (2004) are not the same for each dose because these experiments were carried out according the IAEA recommendations (IAEA 2001). The statistics *z*, and will be defined in §3*a*.

Different explanations can be proposed to interpret the overdispersion, dispersion index exceeding unity, observed after high-LET radiation exposures. We are going to assume that the particles traversing cell nucleus are Poisson events, and for each particle there is a probability distribution to produce a number of dicentrics between 0 and *r*. As it has been shown before, these assumptions lead to the *r*th-order univariate Hermite models.

### (a) Model selection

As explained above, the empirical dispersion coefficient , the moment-based estimator of *δ*, can be used for model selection. This coefficient has been widely used in the literature for detecting departures from the Poisson distribution and it is the basis of the classical Fisher overdispersion test. It is shown in Puig (2003) that this test is equivalent to the score test to contrast the Poisson distribution with a Hermite alternative.

Another measure of the departure from the Poisson distribution is the zero-inflation index (*θ*), which is very appropriate for datasets containing many zeros. According to Puig & Valero (2006), this index, for a count random variable *X*, is defined as , with *μ* = *E*(*X*) and *p*_{0} = *P*(*X* = 0). The sample version of this index is , where *f*_{0} is the observed relative frequency of zeros. If *X* is Poisson-distributed, *θ* = 0, and *θ* > 0 if *X* is ‘zero-inflated’, that is, its proportion of zeros is greater than the proportion of zeros of a Poisson random variable having the same mean. It immediately follows from equation (2.6) that,
and consequently,
3.1

This expression suggests a simple method for model selection by using the moment-based estimates of *θ*, *δ* and the mean-scaled factorial cumulants. Notice that the moment-based estimators are robust and can always be used independently of the real distribution of the data. Puig & Valero (2006) introduced the idea of plotting *z*_{i} against *d*_{i} to identify appropriate count distributions, when we have several samples coming from similar experiments and want to look for a common distribution for all of them, although possibly with different values for the corresponding parameters. The method presented here is a special arrangement adapted to identify *r*th-order univariate Hermite distributions:

— For each one of the

*m*samples, estimate the indexes*θ*,*δ*and ,*r*= 3,4,…, by means of the sample moments and the observed proportion of zeros. These values will be denoted as*z*_{i},*d*_{i}and ,*i*= 1,…,*m*.— Plot

*y*_{i}=*z*_{i}against*d*_{i}and check for a linear trend with a slope about 0.5 and intercept about − 0.5 (because, according to equation (3.1),*θ*∼(*δ*− 1)/2), and also calculate the determination coefficient*R*^{2}as a measure of goodness of fit. If there is no trend and all the values of*d*_{i}are around 1 and the values of*z*_{i}are around 0, then stop the process and choose a Poisson model (*r*= 1). If there is no trend, and the values of*d*_{i}and*z*_{i}are not around 1 and 0, then the*r*th-order Hermite model is doubtful.— Plot against

*d*_{i}and check again for the specific kind of linear trend mentioned above. Calculate the determination coefficient and compare it with the preceding one. Evaluate if an increment of*R*^{2}occurs and if this increment is sufficiently relevant. If not, stop the process and choose the Hermite distribution (*r*= 2).— Plot against

*d*_{i}, and continue the same procedure until an appropriate model is chosen.

We have applied this method to the 10 datasets of table 2 and the results are shown in figure 3. Note how the linear trend between *z*_{i} and *d*_{i} appears in the first plot (top) with a good determination coefficient (*R*^{2} = 0.9256). In the second plot (middle), the slope and intercept are closer to the theoretical values of 0.5 and − 0.5, respectively, and the *R*^{2} has been incremented with respect to the preceding plot. In the third plot, although *R*^{2} has also been incremented, the slope and intercept are not so close to the theoretical values. Consequently, a third-order univariate Hermite distribution could be appropriate.

This naive method of model selection presented here is very easy to implement using a worksheet like Excel. However, similar to other graphical methods, it also has limitations. The method graphs various cumulant-based estimators against each other. Consequently, it is essential that the estimators vary with the sample, or else the graphs become unhelpful. In §4, certain likelihood-based model selection methods will be explored.

### (b) Fitting the model

Historically, most of the quantitative dicentric studies have been based on the Linear-Quadratic model, in which it is assumed that the mean (*μ*) of the number of dicentrics is a function of the dose (*x*), following the relation *μ*(*x*) = *αx* + *βx*^{2}. The datasets that we want to analyse by using *r*th-order univariate Hermite models have a structure similar to those in table 2. The observed number of dicentrics will be denoted as *y*_{i}, its frequency as *f*_{i}, the mean as *μ*_{i} and the corresponding doses as *x*_{i}. What about the other possible parameters ? In principle, they could also depend on the doses of radiation. However, figure 4 shows the plots of the empirical values against the doses, and no trend becomes apparent. Consequently, we shall consider models where the parameters *δ*, and are constant along the different doses. The log-likelihood function for these models is the following:
where is calculated using lemma 2.1, changing the *λ*’s by the parameters according to table 1.

Moreover, . The maximization of the log-likelihood function can be done by using a numerical method.

To do this, we have used the package ‘nlm’ of R and the code of the program is available upon request. The parameter estimates of the regression models considered are shown in table 3. The Akaike information criterion (AIC) shows that the model with *r* = 3 is the best because it corresponds to the lowest value of the AIC. Note that it agrees with the results of the naive method explained in §3*a*.

However, note that for this model is close to zero and its s.e. 0.0374 is moderately high. Consequently, it is interesting to test the hypothesis because, if this null hypothesis was not rejected, we could choose the model with *r* = 2 (Hermite distribution) instead of that with *r* = 3. To do this, the likelihood-ratio test can be used but taking into account that, under the null hypothesis, the likelihood-ratio test statistic does not have the usual asymptotic distribution. This is because of the fact that belongs to the boundary of the domain of parameters and this is a non-standard condition. Then, according to Self & Liang (1987), the asymptotic distribution is a 50:50 mixture of the constant zero and the distribution. For our example, the likelihood-ratio test statistic has a value of 4.78. It is evident that the *p*-value calculated from this mixture is the same as the *p*-value calculated from a distribution divided by 2, that is, 0.014. Consequently, the null hypothesis is rejected and the model with *r* = 3 is assumed as good. Figure 5 shows the observed and the predicted means stating the goodness of fit.

More complicated models can be studied in a similar way. For instance, one might argue that the highest doses could follow a different pattern and *r* might increase with dose. Then, an appropriate model could have the parameters, *α*, *β*, *δ*, , as a model with *r* = 3, and in addition would be replaced by a linear relation of the form *cx*, where *c* is a new parameter and *x* is the dose. Another model could be considered replacing by *I*_{x0}(*x*)*c*, where *I*_{x0}(*x*) is an indicator function that is 1, if the dose *x* is greater than a threshold *x*_{0}, and 0 otherwise. For both models, the hypothesis that *r* increases with dose could be tested by means of *H*_{0}:*c* = 0, using the likelihood-ratio test.

## 4. Conclusions

We have shown that the *r*th-order univariate Hermite distributions are suitable alternatives to the Poisson distribution to describe the number of dicentrics in biological dosimetry. These distributions can be constructed from a specific compound Poisson process related with the dicentric formation.

These distributions have been reported by several authors, the case *r* = 2 by Kemp & Kemp (1965), the two-parameter generalization by Gupta & Jain (1974) and the generalized *r*th-order univariate used in this paper by Milne & Westcott (1993). However, in practice, these distributions have not often been used. In our opinion, this is the result of a lack of awareness of their good properties. Some properties and characterizations that justify the use of these distributions in more general contexts can be found in Puig & Valero (2007).

We have also presented a naive graphical method to choose the order of the univariate Hermite distribution when there are several independent samples. This method can be easily implemented by using a worksheet like Excel and can be considered as a preliminary quick analysis.

Regression models based on the *r*th-order univariate Hermite distribution have been introduced in order to relate the number of dicentrics observed and the dose of radiation received. The parameters are estimated by maximum likelihood and a hierarchical approach based on the likelihood-ratio test can be performed in order to choose the order of the distribution. For practical applications, further research must be done in inverse regression methods.

## Acknowledgements

This research has been partially supported by grant MTM2009-10893 from the Ministry of Education of Spain, and by grant CSN-11280 from the Spanish Nuclear Safety Council. The authors thank the three referees for their valuable comments and suggestions.

- Received July 21, 2010.
- Accepted August 31, 2010.

- This journal is © 2010 The Royal Society