## Abstract

Prime numbers seem to be distributed among the natural numbers with no law other than that of chance; however, their global distribution presents a quite remarkable smoothness. Such interplay between randomness and regularity has motivated scientists across the ages to search for local and global patterns in this distribution that could eventually shed light on the ultimate nature of primes. In this paper, we show that a generalization of the well-known first-digit Benford's law, which addresses the rate of appearance of a given leading digit *d* in datasets, describes with astonishing precision the statistical distribution of leading digits in the prime number sequence. Moreover, a reciprocal version of this pattern also takes place in the sequence of the non-trivial Riemann zeta zeros. We prove that the prime number theorem is, in the final analysis, responsible for these patterns.

## 1. Introduction

The individual location of prime numbers within the integers seems to be random; however, their global distribution exhibits a remarkable regularity (Zagier 1977). Certainly, this tension between local randomness and global order has led the distribution of primes to be, since antiquity, a fascinating problem for mathematicians (Dickson 2005) and, more recently, for physicists (Berry & Keating 1999; Kriecherbauer *et al*. 2001; Watkins, M. *Number theory & physics archive*, http://www.secamlocal.ex.ac.uk/people/staff/mrwatkin/zeta/physics.htm). The prime number theorem, which addresses the global smoothness of the counting function *π*(*n*) providing the number of primes less or equal to integer *n*, was the first hint of such regularity (Tenenbaum & France 2000). Some other prime patterns have been advanced so far, from the visual Ulam spiral (Stein *et al*. 1964) to the arithmetic progression of primes (Green & Tao in press), while some others remain conjectures, such as the global gap distribution between primes or the twin primes distribution (Tenenbaum & France 2000), enhancing the mysterious interplay between apparent randomness and hidden regularity. There are indeed many open problems still to be solved, and the prime number distribution is yet to be understood (Guy 2004; Ribenboim 2004; Caldwell, C. *The prime pages*, http://primes.utm.edu/). For instance, deep connections exist between the prime number sequence and the non-trivial zeros of the Riemann zeta function (Edwards 1974; Watkins, M. *Number theory & physics archive*, http://www.secamlocal.ex.ac.uk/people/staff/mrwatkin/zeta/physics.htm). 
The celebrated Riemann hypothesis, one of the most important open problems in mathematics, states that the non-trivial zeros of the complex-valued Riemann zeta function (as a matter of fact, the meromorphic continuation of the function to the entire complex plane) are all complex numbers with real part 1/2, the location of these being intimately connected with the prime number distribution (Edwards 1964; Chernoff 2000).

Here, we address the statistics of the first significant or leading digit of both the sequence of primes and the sequence of the non-trivial Riemann zeta zeros. We show that while the first-digit distribution is asymptotically uniform in both sequences (that is to say, the integers 1, …, 9 tend to be equally likely as first digits when the whole infinite sequences are considered), this asymptotic uniformity is approached along a very precise trend, namely by following a size-dependent generalized Benford's law (GBL), which constitutes an as yet unnoticed pattern in both sequences.

The rest of the paper is organized as follows. In §2, we introduce the most famous first-digit distribution: Benford's law. In §3, we introduce a generalization of Benford's law, and we show that both the sequences of the prime numbers and Riemann zeta zeros follow what we call a size-dependent GBL, introducing two unnoticed patterns of statistical regularity. In §4, we point out that the mean local density of both sequences is responsible for these latter patterns. We provide statistical arguments (statistical conformance between distributions) that support our claim. In §5, we provide some analytical arguments that confirm it. Specifically, making use of asymptotic expansion methods we prove that the prime number distribution is equivalent (within a margin of error) to a distribution that strictly follows a size-dependent GBL. At this point, we come up with new expressions for both the primes and the zeta zeros counting functions, precisely based on the pattern's structure previously found. In §6, we conclude and discuss possible applications.

## 2. Benford's law

The leading digit of a number is its leftmost non-zero digit. For instance, the leading digits of the prime 7703 and the zeta zero 21.022… are 7 and 2, respectively. The most celebrated leading digit distribution is the so-called Benford's law (Hill 1996), after physicist Benford (1938), who empirically found that in many disparate natural datasets and mathematical sequences, the leading digit *d* was not uniformly distributed as might be expected, but instead had a biased probability as follows:

$$P(d)=\log_{10}\left(1+\frac{1}{d}\right),\tag{2.1}$$

where *d*=1, 2, …, 9. While this empirical law was indeed first discovered by astronomer Newcomb (1881), it is popularly known as Benford's law or, alternatively, as the law of anomalous numbers. Several disparate datasets such as stock prices, freezing points of chemical compounds or physical constants exhibit this pattern at least empirically. While originally being only a curious pattern (Raimi 1976), practical implications began to emerge in the 1960s in the design of efficient computers (see, for instance, Knuth 1997). In recent years, goodness-of-fit tests against Benford's law have been used to detect possibly fraudulent data, by analysing the deviations of accounting data, corporation incomes, tax returns or scientific experimental data from theoretical Benford predictions (Nigrini 2000). Indeed, digit pattern analysis can produce valuable findings not revealed at first glance, as in the case of recent election results (Nigrini 2000; Mebane 2006).
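As a quick numerical illustration of Benford's law, the sketch below tabulates log10(1 + 1/*d*) for *d* = 1, …, 9 and extracts the leading digit of a positive number. The helper names `benford_probabilities` and `leading_digit` are illustrative choices, not part of the original text.

```python
import math


def benford_probabilities():
    """Benford probabilities log10(1 + 1/d) for leading digits d = 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(x):
    """Leftmost non-zero digit of a positive number.

    Scientific notation always starts with the leading digit,
    e.g. 21.022 -> '2.102200000000000e+01'.
    """
    return int(f"{x:.15e}"[0])


probs = benford_probabilities()
for d, p in probs.items():
    print(f"d={d}: P(d)={p:.4f}")

# The two examples quoted in the text.
print(leading_digit(7703), leading_digit(21.022))
```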

Many mathematical sequences such as $(n^{n})_{n\in\mathbb{N}}$ and $(n!)_{n\in\mathbb{N}}$ (Benford 1938), binomial arrays (Diaconis 1977), geometric sequences or sequences generated by recurrence relations (Raimi 1976; Miller & Takloo-Bighash 2006), to cite a few, have been proved to conform to Benford. Therefore, one may wonder if this is the case for the primes. In figure 1, we have plotted the leading digit *d* rate of appearance for the prime numbers placed in the interval [1,*N*] (black bars), for different sizes *N*. Note that the intervals [1,*N*] have been chosen such that $N=10^{D}$, $D\in\mathbb{N}$, in order to assure an unbiased sample where all possible first digits are equiprobable *a priori* (see appendix A*a* for a discussion on natural densities and prime numbers). Benford's law states that the first digit of a datum extracted at random is 1 with a frequency of 30.1 per cent, and is 9 only approximately 4.6 per cent of the time. In figure 1, note that primes seem, however, to approach uniformity in their first digit. Indeed, the more we increase the interval under study, the more we approach uniformity (in the sense that all integers 1, …, 9 tend to be equally likely as a first digit). As a matter of fact, Diaconis (1977) proved that primes are not Benford distributed, since their first significant digit is asymptotically uniformly distributed. A direct question arises: how does the prime sequence reach this uniform behaviour in the infinite limit? Is there any pattern in its trend towards uniformity, or, to the contrary, does the first-digit distribution lack any structure for finite sets?
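The first-digit frequencies discussed above can be reproduced on a small scale. The following is a minimal sketch, assuming a plain sieve of Eratosthenes and the modest interval [1,10^5] (far smaller than the intervals used in figure 1); the function names are illustrative.

```python
import math
from collections import Counter


def primes_up_to(n):
    """Sieve of Eratosthenes: list of all primes <= n."""
    sieve = bytearray([1]) * (n + 1)
    sieve[0:2] = b"\x00\x00"
    for p in range(2, math.isqrt(n) + 1):
        if sieve[p]:
            # Cross out multiples of p starting at p*p.
            sieve[p * p :: p] = bytearray(len(range(p * p, n + 1, p)))
    return [i for i, flag in enumerate(sieve) if flag]


def first_digit_frequencies(numbers):
    """Relative frequency of each leading digit 1..9 in a list of integers."""
    counts = Counter(int(str(m)[0]) for m in numbers)
    total = sum(counts.values())
    return {d: counts[d] / total for d in range(1, 10)}


N = 10 ** 5  # interval [1, 10^D] with D = 5
primes = primes_up_to(N)
freqs = first_digit_frequencies(primes)
for d in range(1, 10):
    print(f"d={d}: {freqs[d]:.4f}")
```

The printed frequencies decrease slowly with *d*: they are biased, yet much closer to 1/9 than to the Benford values.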

## 3. Generalized Benford's law

Several mathematical insights regarding Benford's law have also been put forward so far (Pinkham 1961; Raimi 1976; Hill 1995*a*; Miller & Takloo-Bighash 2006), and Hill (1995*b*) proved a central limit-like theorem which states that random entries picked from random distributions form a sequence whose first-digit distribution tends towards Benford's law, explaining thereby its ubiquity. Practically, this law has for a long time been the only distribution that could explain the presence of skewed first-digit frequencies in generic datasets. Recently, Pietronero *et al*. (2001) proposed a generalization of Benford's law based on multiplicative processes (see also Nigrini & Miller 2007). It is well known that a stochastic process with probability density 1/*x* generates data that are Benford; therefore, series generated by power-law distributions $P(x)\sim x^{-\alpha}$, with *α*≠1, would have a first-digit distribution that follows a so-called GBL,

$$P(d)=C\int_{d}^{d+1}x^{-\alpha}\,dx=\frac{1}{10^{1-\alpha}-1}\left[(d+1)^{1-\alpha}-d^{1-\alpha}\right],\tag{3.1}$$

where the prefactor is fixed for normalization to hold and *α* is the exponent of the original power-law distribution (observe that in the limit *α*→1 the GBL reduces to Benford's law, while, for *α*=0, it reduces to the uniform distribution).
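A minimal sketch of the GBL first-digit probabilities follows, assuming the power-law form described above, i.e. P(*d*) proportional to (*d*+1)^{1−α} − *d*^{1−α} with normalization 10^{1−α} − 1. The function name `gbl_probabilities` is an illustrative choice, and the *α*=1 branch simply returns the Benford limit.

```python
import math


def gbl_probabilities(alpha):
    """First-digit probabilities of a generalized Benford law with exponent alpha."""
    if abs(alpha - 1.0) < 1e-9:
        # Limit alpha -> 1: standard Benford's law.
        return {d: math.log10(1 + 1 / d) for d in range(1, 10)}
    e = 1.0 - alpha
    norm = 10 ** e - 1  # fixes sum_d P(d) = 1
    return {d: ((d + 1) ** e - d ** e) / norm for d in range(1, 10)}


uniform = gbl_probabilities(0.0)       # alpha = 0: every digit has probability 1/9
near_benford = gbl_probabilities(0.999999)
print(uniform[1], near_benford[1])
```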

### (a) The first-digit frequencies of prime numbers

Although Diaconis showed that the leading digit of primes distributes uniformly in the infinite limit, there exists a clear bias from uniformity for finite sets (figure 1). In this figure, we have also plotted (grey bars) the fitting to a GBL. Note that in each of the four intervals that we present, there is a particular value of exponent *α* for which an excellent agreement holds (see appendix A for fitting methods and statistical tests). More specifically, given an interval [1,*N*], there exists a particular value *α*(*N*) for which a GBL fits with extremely good accuracy the first-digit distribution of the primes appearing in that interval. Observe at this point that *α* depends only on the interval's upper bound; once this bound is fixed, *α* is constant in that interval. Interestingly, the value of the fitting parameter *α* decreases as the interval's upper bound *N*, hence the number of primes, increases. In figure 2*a*, we have plotted this size dependence, showing that a functional relation between *α* and *N* seems to take place,

$$\alpha(N)=\frac{1}{\ln N-a},\tag{3.2}$$

where $a\approx 1$ is the best fit. Note that $\lim_{N\to\infty}\alpha(N)=0$, so this size-dependent GBL reduces asymptotically to the uniform distribution, which is consistent with previous theory (Diaconis 1977). Despite the local randomness of the prime number sequence, it seems that its first-digit distribution converges smoothly to uniformity in a very precise trend: as a GBL with a size-dependent exponent *α*(*N*).

At this point and as in the case of Benford's law (Hill 1995*b*), an extension of the GBL to include not only the first significant digit but also the first *k* significant ones can be performed. Given a number *n*, we can consider its *k* first significant digits *d*_{1}, *d*_{2}, …, *d*_{k} through its decimal representation: $D=\sum_{i=1}^{k}d_{i}\,10^{k-i}$, where $d_{1}\in\{1,\ldots,9\}$ and $d_{i}\in\{0,1,\ldots,9\}$ for *i*≥2. Hence, the *extended* GBL providing the probability of starting with number *D* is

$$P(d_{1},d_{2},\ldots,d_{k})=P(D)=\frac{1}{(10^{k})^{1-\alpha}-(10^{k-1})^{1-\alpha}}\left[(D+1)^{1-\alpha}-D^{1-\alpha}\right].\tag{3.3}$$

Figure 3 represents the fit of the 4118054813 primes appearing in the interval [1,10^{11}] to an *extended* GBL for *k*=2,3,4 and 5: interestingly, the pattern still holds.
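The extension to the first *k* significant digits can be sketched as follows, assuming that the probability of starting with the *k*-digit number *D* is proportional to (*D*+1)^{1−α} − *D*^{1−α}, normalized over 10^{k−1} ≤ *D* < 10^{k}. The name `extended_gbl` and the value *α* = 0.08 are illustrative assumptions.

```python
def extended_gbl(D, k, alpha):
    """Probability that a number starts with the k-digit block D (alpha != 1)."""
    e = 1.0 - alpha
    # Normalization: the sum over D = 10^(k-1) .. 10^k - 1 telescopes to this.
    norm = (10 ** k) ** e - (10 ** (k - 1)) ** e
    return ((D + 1) ** e - D ** e) / norm


alpha = 0.08  # hypothetical exponent, for illustration only
two_digit = {D: extended_gbl(D, 2, alpha) for D in range(10, 100)}

# Summing the two-digit law over 10..19 recovers the one-digit probability of d = 1.
p_first_digit_1 = sum(two_digit[D] for D in range(10, 20))
print(p_first_digit_1, extended_gbl(1, 1, alpha))
```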

### (b) The ‘mirror’ pattern in the Riemann zeta zeros sequence

Since prime numbers are strongly related to the non-trivial Riemann zeta zeros, one may wonder if a similar pattern holds in this latter sequence (zeros sequence from now on). This sequence is composed of the imaginary part of the non-trivial zeros of $\zeta(s)$ (actually only those with a positive imaginary part are taken into account for reasons of symmetry, since the zeros are symmetrically distributed about the central point). This sequence is not Benford distributed according to a theorem by Rademacher and Hlawka (Hlawka 1984) which proves that it is *asymptotically* uniform. Nevertheless, will it follow a size-dependent GBL as in the case of the primes?

In figure 4, we have plotted, in the interval [1,*N*] and for different values of *N*, the relative frequencies of leading digit *d* in the zeros sequence (black bars), and in grey bars a fit to a GBL with density *x*^{α}, i.e.

$$P(d)=\frac{1}{10^{1+\alpha}-1}\left[(d+1)^{1+\alpha}-d^{1+\alpha}\right]\tag{3.4}$$

(this reciprocity is clarified later in the text). Note that a very good agreement holds again for particular size-dependent values of *α*, and the same functional relation as equation (3.2) holds, with a different best-fit value of the constant *a*. As in the case of the primes, this size-dependent GBL tends to uniformity for *N*→∞, as it should (Hlawka 1984). Moreover, the *extended* version of equation (3.4) for the *k* first significant digits is

$$P(D)=\frac{1}{(10^{k})^{1+\alpha}-(10^{k-1})^{1+\alpha}}\left[(D+1)^{1+\alpha}-D^{1+\alpha}\right].\tag{3.5}$$

As can be seen in figure 5, the pattern also holds in this case.

## 4. Statistical conformance of prime number distribution to GBL

Why do these two sequences exhibit this unexpected pattern in the leading digit distribution? What is causing it to take place? While the prime number distribution is deterministic in the sense that precise rules determine whether an integer is prime or not, its apparent local randomness has suggested several stochastic interpretations. In particular, Cramér (1935, see also Tenenbaum & France 2000) defined the following model: assume that we have a sequence of urns *U*(*n*), where *n*=1, 2, …, and put black and white balls in each urn such that the probability of drawing a white ball in the *k*^{th}-urn goes as 1/log *k*. Then, in order to generate a sequence of pseudo-random prime numbers, we need only to draw a ball from each urn: if the drawing from the *k*^{th}-urn is white, then *k* will be labelled as a pseudo-random prime. The prime number sequence can indeed be understood as a concrete realization of this stochastic process, where the chance of a given integer *x* to be prime is 1/log *x*.

We have repeated all the statistical tests within the stochastic Cramér model, and have found that a statistical sample of pseudo-random prime numbers in [1,10^{11}] is also GBL distributed and reproduces all the statistical analyses previously found in the actual primes (see appendix A for an in-depth analysis). This result strongly suggests that a density 1/log *x*, which is nothing but the mean local prime density by virtue of the prime number theorem, is likely to be responsible for the GBL pattern. In the following we provide further statistical and analytical arguments that support this fact.
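The Cramér model lends itself to a direct simulation. The sketch below is a small-scale illustration under stated assumptions: the interval is [1,10^5] rather than [1,10^{11}], the urns start at *k* = 2, the probability 1/log *k* is capped at 1 (it exceeds 1 for *k* = 2), and the seed and function names are arbitrary choices.

```python
import math
import random
from collections import Counter


def cramer_pseudo_primes(n_max, seed=0):
    """Draw one ball per urn k = 2..n_max; k is kept with probability 1/log k."""
    rng = random.Random(seed)
    return [k for k in range(2, n_max + 1)
            if rng.random() < min(1.0, 1.0 / math.log(k))]


def first_digit_frequencies(numbers):
    """Relative frequency of each leading digit 1..9."""
    counts = Counter(int(str(m)[0]) for m in numbers)
    total = sum(counts.values())
    return {d: counts[d] / total for d in range(1, 10)}


N = 10 ** 5
pseudo_primes = cramer_pseudo_primes(N)
expected_count = sum(min(1.0, 1.0 / math.log(k)) for k in range(2, N + 1))
freqs = first_digit_frequencies(pseudo_primes)
print(len(pseudo_primes), round(expected_count))
print({d: round(f, 4) for d, f in freqs.items()})
```

The pseudo-random sample shows the same mild bias towards small leading digits as the actual primes in the same interval.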

Recently, it has been shown that disparate distributions such as the lognormal, the Weibull or the exponential distribution can generate standard Benford behaviour (Leemis *et al*. 2000) for particular values of their parameters. In this sense, a similar phenomenon could be taking place with GBL: can different distributions generate GBL behaviour? One should thus switch the emphasis from the examination of datasets that obey GBL to *probability distributions* that do so, other than power laws.

### (a) *Χ*^{2}-test for conformance between distributions

The prime counting function *π*(*N*) provides the number of primes in the interval [1,*N*] (Tenenbaum & France 2000) and, up to normalization, stands as the cumulative distribution function of primes. While *π*(*N*) is a stepped function, a nice asymptotic approximation is the offset logarithmic integral

$$\mathrm{Li}(N)=\int_{2}^{N}\frac{dx}{\log x}\tag{4.1}$$

(one of the formulations of the Riemann hypothesis actually states that $|\mathrm{Li}(n)-\pi(n)|\le c\sqrt{n}\,\log n$, for some constant *c*; Edwards 1974). We can interpret 1/log *x* as an average prime density, and the lower bound of the integral is set to be 2 for singularity reasons. Following Leemis *et al*. (2000), we can calculate a Χ^{2} goodness-of-fit test of the conformance between the first-digit distribution generated by Li(*N*) and a GBL with exponent *α*(*N*). The test statistic is in this case

$$c=\sum_{d=1}^{9}\frac{\left[\Pr(Y=d)-\Pr(X=d)\right]^{2}}{\Pr(X=d)},\tag{4.2}$$

where Pr(*X*) is the first-digit probability (equation (3.1)) for a GBL associated to a probability distribution with exponent *α*(*N*), and Pr(*Y*) is the tested probability. In table 1, for a fixed interval [1,*N*], we have computed the *Χ*^{2}-statistic *c* for two different scenarios, namely the normalized logarithmic integral Li(*n*)/Li(*N*) and the normalized prime counting function *π*(*n*)/*π*(*N*), with *n*∈[1,*N*]. In both cases, there is a remarkably good agreement and we cannot reject the hypothesis that primes are size-dependent GBL.
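The conformance statistic between the first-digit distribution generated by Li and a GBL can be sketched numerically. The block below makes several assumptions that are not taken from the text: Li is evaluated with Simpson's rule and set to 0 below *x* = 2, the exponent is taken as *α*(*N*) = 1/(ln *N* − *a*) with *a* = 1, and the statistic is the sum over digits of (tested − GBL)² / GBL.

```python
import math


def offset_li(x, steps=2000):
    """Li(x) = integral from 2 to x of dt/ln t (Simpson's rule in u = ln t); 0 for x <= 2."""
    if x <= 2:
        return 0.0
    lo, hi = math.log(2), math.log(x)
    h = (hi - lo) / steps  # steps must be even for Simpson's rule
    f = lambda u: math.exp(u) / u
    total = f(lo) + f(hi)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3


def li_first_digit_probs(D):
    """First-digit probabilities generated by the normalized Li on [1, 10^D]."""
    total = offset_li(10 ** D)
    return {d: sum(offset_li((d + 1) * 10 ** j) - offset_li(d * 10 ** j)
                   for j in range(D)) / total
            for d in range(1, 10)}


def gbl_probs(alpha):
    e = 1.0 - alpha
    return {d: ((d + 1) ** e - d ** e) / (10 ** e - 1) for d in range(1, 10)}


def conformance_statistic(tested, reference):
    return sum((tested[d] - reference[d]) ** 2 / reference[d] for d in range(1, 10))


D = 6
alpha = 1.0 / (math.log(10 ** D) - 1.0)  # assumed form with a = 1
tested = li_first_digit_probs(D)
c = conformance_statistic(tested, gbl_probs(alpha))
print(alpha, c)
```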

### (b) Conditions for conformance to GBL

Hill (1995*b*) wondered about which common distributions (or mixtures thereof) satisfy Benford's law. Leemis *et al*. (2000) tackled this problem and quantified the agreement to Benford's law of several standard distributions. They concluded that the ubiquity of Benford behaviour could be related to the fact that many distributions follow Benford's law for particular values of their parameters. Here, following the philosophy of that work (Leemis *et al*. 2000), we develop a mathematical framework that provides conditions for conformance to a GBL.

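The probability density function of a discrete GB random variable *Y* is

$$f_{Y}(y)=\Pr(Y=y)=\frac{1}{10^{1-\alpha}-1}\left[(y+1)^{1-\alpha}-y^{1-\alpha}\right],\qquad y=1,2,\ldots,9.\tag{4.3}$$

The associated cumulative distribution function is therefore

$$F_{Y}(y)=\Pr(Y\le y)=\frac{1}{10^{1-\alpha}-1}\left[(y+1)^{1-\alpha}-1\right].\tag{4.4}$$

How can we prove that a random variable *T* extracted from a probability density has an associated (discrete) random variable *Y* that follows equation (4.3)? We can readily find a relation between both random variables. Suppose, without loss of generality, that the random variable *T* is defined in the interval [1,10^{D+1}). Let the discrete random variable *D* fulfil

$$10^{D}\le T<10^{D+1}.\tag{4.5}$$

This definition allows us to express the first significant digit *Y* in terms of *D* and *T*,

$$Y=\left\lfloor T\cdot 10^{-D}\right\rfloor,\tag{4.6}$$

where from now on the floor brackets stand for the integer part function. Now, let *U* be a random variable uniformly distributed in (0,1), *U*∼*U*(0,1). Then, inverting the cumulative distribution function (4.4), we obtain

$$Y=\left\lfloor\left[(10^{1-\alpha}-1)\,U+1\right]^{1/(1-\alpha)}\right\rfloor.\tag{4.7}$$

This latter relation is useful to generate a discrete GB random variable *Y* from a uniformly distributed one *U*(0,1). Note also that for *α*=0, we have $Y=\lfloor 9U+1\rfloor$, that is, a first-digit distribution which is uniform, $\Pr(Y=d)=1/9$, as expected. Hence, every discrete random variable *Y* that distributes as a GB should fulfil equation (4.7), and, consequently, if a random variable *T* has an associated random variable *Y*, the following identity should hold:

$$\left\lfloor T\cdot 10^{-D}\right\rfloor=\left\lfloor\left[(10^{1-\alpha}-1)\,U+1\right]^{1/(1-\alpha)}\right\rfloor,\tag{4.8}$$

and then,

$$Z=\frac{(T\cdot 10^{-D})^{1-\alpha}-1}{10^{1-\alpha}-1}\sim U(0,1).\tag{4.9}$$

In other words, in order for the random variable *T* to generate a GB, the random variable *Z* defined in the preceding transformation should distribute as *U*(0,1). The cumulative distribution function of *Z* is thus given by

$$F_{Z}(z)=\Pr(Z\le z)=\sum_{d=0}^{D}\Pr(10^{d}\le T<10^{d+1})\,\Pr(Z\le z\mid 10^{d}\le T<10^{d+1}),\tag{4.10}$$

that in terms of the cumulative distribution function of *T* becomes

$$F_{Z}(z)=\sum_{d=0}^{D}\left[F_{T}(v\,10^{d})-F_{T}(10^{d})\right],\tag{4.11}$$

where $v\equiv\left[(10^{1-\alpha}-1)\,z+1\right]^{1/(1-\alpha)}$.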
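The inversion step described above gives a simple recipe for sampling a discrete GB variable from a uniform one. A minimal sketch, assuming the inverse-transform form *Y* = ⌊[(10^{1−α} − 1)*U* + 1]^{1/(1−α)}⌋ with *α* ≠ 1; the function names, the seed and *α* = 0.5 are illustrative choices.

```python
import math
import random
from collections import Counter


def gb_from_uniform(u, alpha):
    """Map u in [0, 1) to a GB-distributed first digit (alpha != 1)."""
    e = 1.0 - alpha
    y = math.floor(((10 ** e - 1) * u + 1) ** (1 / e))
    return min(y, 9)  # guard against floating-point rounding when u is extremely close to 1


def gbl_probability(d, alpha):
    e = 1.0 - alpha
    return ((d + 1) ** e - d ** e) / (10 ** e - 1)


rng = random.Random(1)
alpha = 0.5
n_samples = 20000
samples = Counter(gb_from_uniform(rng.random(), alpha) for _ in range(n_samples))
for d in range(1, 10):
    print(d, samples[d] / n_samples, round(gbl_probability(d, alpha), 4))
```

For *α* = 0 the map reduces to ⌊9*u* + 1⌋, so `gb_from_uniform(0.5, 0.0)` lands on digit 5.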

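We may consider now the power-law density *x*^{−α} proposed by Pietronero *et al*. (2001) in order to show that this distribution exactly generates generalized Benford behaviour,

$$f_{T}(t)=\frac{1-\alpha}{10^{(D+1)(1-\alpha)}-1}\,t^{-\alpha},\qquad t\in[1,10^{D+1}).\tag{4.12}$$

Its cumulative distribution function will be

$$F_{T}(t)=\frac{t^{1-\alpha}-1}{10^{(D+1)(1-\alpha)}-1},\tag{4.13}$$

and thereby equation (4.11) reduces to

$$F_{Z}(z)=\frac{v^{1-\alpha}-1}{10^{(D+1)(1-\alpha)}-1}\sum_{d=0}^{D}10^{d(1-\alpha)}=\frac{v^{1-\alpha}-1}{10^{1-\alpha}-1}=z,\tag{4.14}$$

as expected.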
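The claim that a power-law density generates exactly uniform *Z* can be checked numerically. The sketch below assumes the normalized power-law cumulative distribution on [1,10^{D+1}), namely *F*_T(*t*) = (*t*^{1−α} − 1)/(10^{(D+1)(1−α)} − 1), and evaluates the sum over decades of *F*_T(*v*·10^d) − *F*_T(10^d) with *v* = [(10^{1−α} − 1)*z* + 1]^{1/(1−α)}; the parameter values are arbitrary.

```python
def power_law_cdf(t, alpha, D):
    """CDF of the density proportional to t^(-alpha) on [1, 10^(D+1)), alpha != 1."""
    e = 1.0 - alpha
    return (t ** e - 1) / (10 ** ((D + 1) * e) - 1)


def cdf_of_z(z, alpha, D, cdf):
    """F_Z(z) as a sum over decades d = 0..D of F_T(v * 10^d) - F_T(10^d)."""
    e = 1.0 - alpha
    v = ((10 ** e - 1) * z + 1) ** (1 / e)
    return sum(cdf(v * 10 ** d, alpha, D) - cdf(10 ** d, alpha, D)
               for d in range(D + 1))


alpha, D = 0.3, 4  # arbitrary illustrative values
for z in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(z, cdf_of_z(z, alpha, D, power_law_cdf))
```

The printed pairs coincide up to rounding, which is the statement *F*_Z(*z*) = *z*.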

### (c) GBL holds for prime number distribution

While the preceding development is in itself interesting in order to check for the conformance of several distributions to GBL, we will restrict our analysis to the cumulative distribution function of the prime numbers, conveniently normalized in the interval [1,10^{D}],

$$F_{T}(t)=\frac{\pi(t)}{\pi(10^{D})},\qquad t\in[1,10^{D}].\tag{4.15}$$

Note that previous numerical analysis showed that

$$\alpha(10^{D})=\frac{1}{\ln(10^{D})-a},\tag{4.16}$$

where $a\approx 1$. Since *π*(*t*) is a stepped function that does not possess a closed form, the relation (4.11) cannot be analytically checked. However, a numerical exploration can indicate to what extent primes conform to a GBL. Relation (4.11) reduces in this case to checking whether

$$\sum_{d=0}^{D-1}\left[\pi(v\,10^{d})-\pi(10^{d})\right]=\pi(10^{D})\,z,\tag{4.17}$$

where $v=\left[(10^{1-\alpha}-1)\,z+1\right]^{1/(1-\alpha)}$ and $\alpha=\alpha(10^{D})$. First, this latter relation is trivially fulfilled for the extremal values *z*=0 and 1. For other values $z\in(0,1)$, we have numerically tested this equation for different values of *D*, and have found that it is satisfied with negligible error (we have performed a scatterplot of equation (4.17) and have found a correlation coefficient *r*=1.0).
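A small-scale version of this numerical check can be sketched as follows. The assumptions are labelled in the code: *D* = 5, the exponent *α*(10^D) = 1/(ln 10^D − *a*) with *a* = 1, and the tested relation taken as the sum over decades of *π*(*v*·10^d) − *π*(10^d), divided by *π*(10^D), compared with *z*.

```python
import math


def prime_counts(n):
    """counts[m] = number of primes <= m, for m = 0..n (sieve of Eratosthenes)."""
    sieve = bytearray([1]) * (n + 1)
    sieve[0:2] = b"\x00\x00"
    for p in range(2, math.isqrt(n) + 1):
        if sieve[p]:
            sieve[p * p :: p] = bytearray(len(range(p * p, n + 1, p)))
    counts, running = [0] * (n + 1), 0
    for i, flag in enumerate(sieve):
        running += flag
        counts[i] = running
    return counts


def normalized_decade_sum(z, D, counts, a=1.0):
    """Sum over decades of pi(v 10^d) - pi(10^d), divided by pi(10^D)."""
    alpha = 1.0 / (D * math.log(10) - a)  # assumed size dependence, a = 1
    e = 1.0 - alpha
    v = ((10 ** e - 1) * z + 1) ** (1 / e)
    pi = lambda t: counts[min(int(t), len(counts) - 1)]
    return sum(pi(v * 10 ** d) - pi(10 ** d) for d in range(D)) / counts[-1]


D = 5
counts = prime_counts(10 ** D)
grid = [i / 10 for i in range(1, 10)]
errors = [abs(normalized_decade_sum(z, D, counts) - z) for z in grid]
print(max(errors))
```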

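The same numerical analysis has been performed for the logarithmic integral Li. In this case, the relation that must be fulfilled is

$$\sum_{d=0}^{D-1}\left[\mathrm{Li}(v\,10^{d})-\mathrm{Li}(10^{d})\right]=\mathrm{Li}(10^{D})\,z,\tag{4.18}$$

and is indeed satisfied with similar remarkable results provided that we fix $\mathrm{Li}(1)\equiv 0$ for singularity reasons.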

## 5. Counting functions for prime numbers and zeta zeros

Hitherto, we have provided statistical arguments which indicate that other distributions than *x*^{−α} such as 1/log *x* can generate GBL behaviour. In the following we provide analytical arguments that support this fact.

### (a) The primes counting function *L*(*N*)

Suppose that a given sequence has a power-law-like density *x*^{−α} (and whose first significant digits are consequently GBL). One can derive from this density a counting function *L*(*N*) that provides the number of elements of that sequence appearing in the interval [1,*N*]. A first option is to assume a local density of the shape *x*^{−α(x)}, such that $L(N)\sim\int_{2}^{N}x^{-\alpha(x)}\,dx$. Note that this option implicitly assumes that *α* varies smoothly in [1,*N*], which is not the case in the light of the numerical relation (3.2), which implies that *α* depends only on the upper bound of the interval. Indeed, *x*^{−α(x)} is not a good approximation to 1/ln *x* for any given interval. This drawback can be overcome by defining *L*(*N*) as follows:

$$L(N)=e\,\alpha(N)\int_{2}^{N}x^{-\alpha(N)}\,dx,\tag{5.1}$$

where the prefactor is fixed for *L*(*N*) to fulfil the prime number theorem and, consequently,

$$\lim_{N\to\infty}\frac{L(N)}{N/\ln N}=1\tag{5.2}$$

(see table 2 for a numerical exploration of this new approximation to *π*(*N*)). Observe that what we are claiming is that, in the fixed interval [1,*N*], *x*^{−α(N)} acts as a good approximation to the primes mean local density 1/ln *x* in that interval. In order to prove it, let us compare the counting functions derived from both densities. First, Li(*N*) possesses the following asymptotic expansion:

$$\mathrm{Li}(N)=\frac{N}{\ln N}\left[1+\frac{1}{\ln N}+\frac{2}{\ln^{2}N}+O\!\left(\frac{1}{\ln^{3}N}\right)\right].\tag{5.3}$$

On the other hand, using $\alpha(N)=1/(\ln N-a)$, we can asymptotically expand *L*(*N*) as follows:

$$L(N)=\frac{N}{\ln N}\left[1+\frac{1}{\ln N}+\frac{1+a-a^{2}/2}{\ln^{2}N}+O\!\left(\frac{1}{\ln^{3}N}\right)\right].\tag{5.4}$$

Comparing equations (5.3) and (5.4), we conclude that Li(*N*) and *L*(*N*) are compatible cumulative distributions within an error

$$E(N)=\frac{N}{\ln^{3}N}\left(\frac{a^{2}}{2}-a+1\right),\tag{5.5}$$

which is indeed minimum for *a*=1, consistent with our previous numerical results obtained for the fitting value of *a* (equation (3.2)). Hence, within that error, we can conclude that primes obey a GBL with *α*(*N*) following equation (3.2): primes follow a size-dependent GBL.
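The power-law counting function can be explored numerically against known values of *π*(*N*). The sketch below assumes *L*(*N*) = *e*·*α*(*N*)·∫₂^N *x*^{−α(N)} d*x* with *α*(*N*) = 1/(ln *N* − *a*) and *a* = 1, and evaluates the integral in closed form; the three reference values of *π*(*N*) are standard, while the function names are illustrative.

```python
import math


def alpha_of(N, a=1.0):
    """Assumed size-dependent exponent alpha(N) = 1 / (ln N - a)."""
    return 1.0 / (math.log(N) - a)


def power_law_prime_count(N, a=1.0):
    """L(N) = e * alpha * integral_2^N x^(-alpha) dx, integral done in closed form."""
    al = alpha_of(N, a)
    e = 1.0 - al
    return math.e * al * (N ** e - 2 ** e) / e


# Known values of the prime counting function.
known_pi = {10 ** 4: 1229, 10 ** 6: 78498, 10 ** 8: 5761455}
for N, pi_N in known_pi.items():
    L = power_law_prime_count(N)
    print(N, pi_N, round(L), round(L / pi_N, 4))
```

For these three values of *N* the ratio *L*(*N*)/*π*(*N*) stays within one per cent of unity.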

### (b) The zeta zeros counting function *S*(*N*)

What about the Riemann zeros? Von Mangoldt proved (Edwards 1974) that, on average, the number of non-trivial zeros *R*(*N*) up to *N* (zeros counting function) is

$$R(N)=\frac{N}{2\pi}\log\left(\frac{N}{2\pi}\right)-\frac{N}{2\pi}+O(\log N).\tag{5.6}$$

*R*(*N*) is nothing but the cumulative distribution of the zeros (up to normalization), which satisfies

$$R(N)\approx\frac{1}{2\pi}\int_{2}^{N}\log\left(\frac{x}{2\pi}\right)dx.\tag{5.7}$$

The non-trivial Riemann zeros average density is thus log(*x*/2*π*), which is essentially the reciprocal of the prime numbers mean local density (see equation (4.1)). One can thus straightforwardly deduce a power-law approximation *S*(*N*) to the cumulative distribution *R*(*N*) of the non-trivial zeros similar to equation (5.1),

$$S(N)=\frac{1}{2\pi e\,\alpha(N/2\pi)}\int_{2}^{N}\left(\frac{x}{2\pi}\right)^{\alpha(N/2\pi)}dx.\tag{5.8}$$

We conclude that zeros are also GBL for *α*(*N*) satisfying the following change of scale:

$$\alpha(N)=\frac{1}{\log(N/2\pi)-a}=\frac{1}{\log N-\left(\log(2\pi)+a\right)}.\tag{5.9}$$

Hence, since $a\approx 1$ for the primes (equation (5.5)), one should expect the following value for the constant *a* associated to the zeros sequence: $a=\log(2\pi)+1\approx 2.84$, which is in good agreement with our previous numerical analysis.
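A numerical comparison between a power-law counting function for the zeros and the leading terms of the mean zero count can be sketched as follows. Everything here is an assumption made for illustration: the mean count is taken as (*N*/2*π*)(ln(*N*/2*π*) − 1), the power-law approximation as (2*πeα*)^{−1}∫₂^N (*x*/2*π*)^α d*x*, and the exponent as *α* = 1/(ln(*N*/2*π*) − *a*) with *a* = 1.

```python
import math

TWO_PI = 2 * math.pi


def mean_zero_count(N):
    """Leading terms of the mean number of non-trivial zeros with height up to N."""
    return (N / TWO_PI) * (math.log(N / TWO_PI) - 1)


def power_law_zero_count(N, a=1.0):
    """Assumed power-law approximation with density proportional to (x / 2 pi)^alpha."""
    alpha = 1.0 / (math.log(N / TWO_PI) - a)
    # Closed form of the integral of (x / 2 pi)^alpha between 2 and N.
    integral = TWO_PI * ((N / TWO_PI) ** (1 + alpha)
                         - (2 / TWO_PI) ** (1 + alpha)) / (1 + alpha)
    return integral / (TWO_PI * math.e * alpha)


for N in (10 ** 4, 10 ** 6, 10 ** 8):
    R, S = mean_zero_count(N), power_law_zero_count(N)
    print(N, round(R), round(S), round(S / R, 4))

# Rescaled constant expected for the zeros when a = 1 holds for the primes.
print(math.log(TWO_PI) + 1)
```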

## 6. Discussion

To conclude, we have unveiled a statistical pattern in the sequences of the prime numbers and the non-trivial Riemann zeta zeros that has surprisingly gone unnoticed until now. According to several statistical and analytical arguments, we can conclude that, for a fixed interval [1,*N*], we can approximate the mean local density of both sequences to a power-law distribution with good accuracy, and this is indeed responsible for these patterns. Along with this finding, some relations concerning the statistical conformance of any given distribution to the generalized Benford law have also been derived.

Several applications and future work can be depicted: first, since the Riemann zeros seem to have the same statistical properties as the eigenvalues of a concrete type of random matrices called the Gaussian unitary ensemble (Berry & Keating 1999; Bogomolny 2007), the relation between GBL and random matrix theory should be investigated in depth (Miller & Kontorovich 2005). Second, this finding may also apply to several other sequences that, while not being strictly Benford distributed, can be GBL, and, in this sense, much work recently developed for Benford distributions (Hürlimann 2006) could be readily generalized. Finally, it has not escaped our notice that several applications recently advanced in the context of Benford's law, such as fraud detection or stock market analysis (Nigrini 2000), could eventually be generalized to the wider context of GBL formalism. This generalization also extends to stochastic sieve theory (Hawkins 1957), dynamical systems that follow Benford's law (Berger *et al*. 2005; Miller & Takloo-Bighash 2006) and their relation to stochastic multiplicative processes (Manrubia & Zanette 1999).

## Acknowledgments

We thank I. Parra for helpful suggestions and K. McCourt, O. Miramontes, J. Bascompte, D. H. Zanette and S. C. Manrubia for their comments. This work was supported by grant FIS2006-08607 from the Spanish Ministry of Science.

## Appendix A. Statistical methods and technical digressions

### (a) How to pick an integer at random?

#### (i) Visualizing the generalized Benford law pattern in prime numbers as a biased random walk

In order for the pattern already captured in figure 1 to become more evident, we have built the following two-dimensional random walk:

$$x(t+1)=x(t)+\xi_{x},\qquad y(t+1)=y(t)+\xi_{y},\tag{A1}$$

where *x* and *y* are Cartesian variables with $x(0)=y(0)=0$, and both *ξ*_{x} and *ξ*_{y} are discrete random variables that have values depending on the first digit *d* of the numbers randomly chosen at each time step, according to the rules depicted in figure 6. Thereby, in each iteration, we pick at random a positive integer (grey random walk) or a prime (black random walk) from the interval [1,10^{6}], and depending on its first significant digit *d*, the random walker moves accordingly (for instance, if we pick prime 13, we have *d*=1 and the random walker rules provide *ξ*_{x}=1 and *ξ*_{y}=1: the random walker moves up-right). We have plotted the results of this two-dimensional random walk in figure 6 for random picking of integers (grey random walk) and random picking of primes (black random walk). Note that while the grey random walk seems to be a typical uncorrelated Brownian motion (reflecting the fact that the first digit of the integers is uniformly distributed), the black random walk is clearly biased: this is indeed a visual characterization of the pattern. Observe that if the interval in which we randomly pick either the integers or the primes were not of the shape [1,10^{D}], there would be a systematic bias present in the pool and, consequently, both integer and prime random walks would be biased; it is therefore necessary to define the intervals under study in that way.

#### (ii) Natural density

If primes were, for instance, Benford distributed, one should expect that if we pick a prime at random, this one should start with the number 1 around 30 per cent of the time. But what does the sentence ‘pick a prime at random’ mean? Note that in the previous experiment (the two-dimensional biased random walk), we have drawn numbers, whether integers or primes, at random from the pool [1,10^{6}]. Throughout this paper, the intervals [1,*N*] have been chosen so that *N*=10^{D}, $D\in\mathbb{N}$. This choice is not arbitrary; on the contrary, it relies on the fact that whenever studying infinite integer sequences, the results strongly depend on the interval under study. For instance, everyone would agree that intuitively the set of positive integers is an infinite sequence whose first digit is uniformly distributed: there exist as many naturals starting with 1 as naturals starting with 9. However, there exist subtle difficulties at this point arising from the fact that the first-digit natural density is not well defined. Since there exist infinitely many integers in $\mathbb{N}$ and, consequently, it is not straightforward to quantify the phrase ‘pick an integer at random’ in a way which satisfies the laws of probability, we have to consider finite intervals [1,*N*] in order to check whether integers have a uniformly distributed first significant digit. Note that uniformity *a priori* is only respected when *N*=10^{D}. For instance, if we choose the interval to be [1,2000] and we randomly draw a number, this one will start with 1 with rather large probability, as there are obviously more numbers starting with 1 in that interval. If we increase the interval to say [1,3000], then the probability of drawing a number starting with 1 or 2 will be larger than the probability of any other.
We can easily come to the conclusion that the first-digit density will oscillate repeatedly by decades as *N* increases without reaching convergence, and it is thereby said that the set of positive integers with leading digit *d* (*d*=1, 2, …, 9) does not possess a natural density among the integers. Note that the same phenomenon is likely to take place for the primes (see Chris Caldwell's *The prime pages* for an introductory discussion of natural density and Benford's law for prime numbers, http://primes.utm.edu/ and references therein).
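The oscillation of the first-digit density of the integers is easy to reproduce. A minimal sketch follows; the function name and the chosen upper bounds are illustrative. Note that the interval [1,10^D] contains 10^D itself, whose leading digit is 1, so the exact value 1/9 is attained on [1,10^D − 1].

```python
def fraction_with_leading_digit(N, d=1):
    """Fraction of the integers in [1, N] whose leading digit is d."""
    hits = sum(1 for m in range(1, N + 1) if str(m)[0] == str(d))
    return hits / N


# The fraction starting with 1 keeps oscillating as N grows through the decades.
for N in (999, 1999, 2999, 9999, 19999, 99999):
    print(N, round(fraction_with_leading_digit(N), 4))
```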

In order to overcome this subtle point, one can either (i) choose intervals of the shape [1,10^{D}], where every leading digit has equal probability *a priori* of being picked (in this situation, positive integers have a uniform first-digit distribution, and, in this sense, Diaconis (1977) showed that primes do not obey Benford's law as their first-digit distribution is asymptotically uniform), or (ii) use averaging and summability methods such as the Cesàro or the logarithm matrix method ℓ (Raimi 1976) in order to define a proper first-digit density that holds in the infinite limit. Some authors have shown that, in this latter case, both the primes and the integers are *weak* Benford sequences (Flehinger 1966; Whitney 1972; Raimi 1976).

As we are dealing with finite subsets and in order to check if a pattern *really* takes place for the primes, in this work, we have chosen intervals of the shape [1,10^{D}] to assure that samples are unbiased and that all first digits are equiprobable *a priori*.

### (b) Statistical methods

#### (i) Method of moments

In order to estimate the best fit between a GBL with parameter *α* and a dataset, we have employed the method of moments. If GBL fits the empirical data, then both distributions have the same first moments, and the following relation holds:

$$\sum_{d=1}^{9}d\,P(d)=\sum_{d=1}^{9}d\,P^{e}(d),\tag{A2}$$

where *P*(*d*) and *P*^{e}(*d*) are the observed normalized frequencies and GB expected probabilities for digit *d*, respectively. Using a Newton–Raphson method and iterating equation (A2) until convergence, we have calculated *α* for each sample [1,*N*].
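The moment-matching fit can be sketched in a few lines. The text uses a Newton–Raphson iteration; the sketch below swaps in plain bisection instead, which is simpler to make robust because the mean leading digit of a GBL decreases monotonically with *α*. The bracketing interval [−2, 2] and the function names are assumptions.

```python
import math


def gbl(alpha):
    """GBL first-digit probabilities for d = 1..9 (alpha = 1 is the Benford limit)."""
    if abs(alpha - 1.0) < 1e-9:
        return [math.log10(1 + 1 / d) for d in range(1, 10)]
    e = 1.0 - alpha
    norm = 10 ** e - 1
    return [((d + 1) ** e - d ** e) / norm for d in range(1, 10)]


def mean_digit(probs):
    """First moment of a first-digit distribution given as a list for d = 1..9."""
    return sum(d * p for d, p in zip(range(1, 10), probs))


def fit_alpha(observed, lo=-2.0, hi=2.0, tol=1e-10):
    """Find alpha such that the GBL mean digit equals the observed mean digit."""
    target = mean_digit(observed)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_digit(gbl(mid)) > target:
            lo = mid  # model mean too large: a larger alpha is needed
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Sanity check on synthetic frequencies generated by a known exponent.
print(fit_alpha(gbl(0.3)))
```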

#### (ii) Statistical tests

Typically, the Χ^{2} goodness-of-fit test has been used in association with Benford's law (Nigrini 2000). Our null hypothesis here is that the sequence of primes follows a GBL. The test statistic is

$$\chi^{2}=M\sum_{d=1}^{9}\frac{\left[P(d)-P^{e}(d)\right]^{2}}{P^{e}(d)},\tag{A3}$$

where *M* denotes the number of primes in [1,*N*]. Since we are computing parameter *α*(*N*) using the mean of the distribution, the test statistic follows a *Χ*^{2}-distribution with 9−2=7 degrees of freedom, so the null hypothesis is rejected if $\chi^{2}>\chi^{2}_{a,7}$, where *a* is the level of significance. The critical values for the 10, 5 and 1 per cent significance levels are 12.02, 14.07 and 18.47, respectively. As we can see in table 3, despite the excellent visual agreement (figure 1), the *Χ*^{2}-statistic goes up with sample size and, consequently, the null hypothesis cannot be rejected only for relatively small sample sizes *N*<10^{9}. As a matter of fact, the *Χ*^{2}-statistic suffers from the excess power problem on the basis that it is size sensitive: for huge datasets, even quite small differences are statistically significant (Nigrini 2000). A second alternative is to use the standard *Z*-statistics to test significant differences. However, this test is also size dependent and hence suffers from the same problems as *Χ*^{2} for large samples. Owing to these facts, Nigrini (2000) recommends for Benford analysis a distance measure test called mean absolute deviation (MAD). This test computes the average of the nine absolute differences between the empirical proportions of a digit and the ones expected by the GBL. That is,

$$\mathrm{MAD}=\frac{1}{9}\sum_{d=1}^{9}\left|P(d)-P^{e}(d)\right|.\tag{A4}$$

This test overcomes the excess power problem of *Χ*^{2} since it is not influenced by the size of the dataset.
While MAD lacks a cut-off level, Nigrini (2000) suggests the following guidelines for measuring conformity of the first digits to Benford's law: MAD between 0 and 0.4×10^{−2} implies close conformity; from 0.4×10^{−2} to 0.8×10^{−2} acceptable conformity; from 0.8×10^{−2} to 0.12×10^{−1} marginally acceptable conformity; and, finally, greater than 0.12×10^{−1}, non-conformity. Under these cut-off levels, we cannot reject the hypothesis that the first-digit frequency of the prime number sequence follows a GBL. In addition, the maximum absolute deviation *m*, defined as the largest term of MAD, is also shown in each case.

As a final approach to testing for a similarity between the two histograms, we can check the correlation between the empirical and theoretical proportions by the simple regression correlation coefficient *r* in a scatterplot. As we can see in table 3, the empirical data are highly correlated with a generalized Benford distribution.
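The conformity measures discussed in this subsection (the *χ*^{2}-statistic, MAD, the maximum absolute deviation *m* and the correlation coefficient *r*) can be computed along the following lines. This is a hedged sketch: the function names are hypothetical, the classical Benford probabilities stand in for the GBL expected probabilities, and the `observed` frequencies are made-up toy numbers rather than data from the tables.

```python
import math

def chi_square(observed, expected, m):
    """Chi-square statistic for m samples, from observed and expected proportions."""
    return m * sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def mad(observed, expected):
    """Mean absolute deviation over the nine leading digits."""
    return sum(abs(o - e) for o, e in zip(observed, expected)) / len(expected)

def max_abs_dev(observed, expected):
    """Largest single absolute deviation (the quantity m in the text)."""
    return max(abs(o - e) for o, e in zip(observed, expected))

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def conformity_label(mad_value):
    """Nigrini's guideline bands for MAD, as quoted in the text."""
    if mad_value <= 0.004:
        return "close conformity"
    if mad_value <= 0.008:
        return "acceptable conformity"
    if mad_value <= 0.012:
        return "marginally acceptable conformity"
    return "non-conformity"

# Stand-in expected probabilities (classical Benford, i.e. alpha = 1).
expected = [math.log10(1 + 1 / d) for d in range(1, 10)]
# Toy observed frequencies: shift 0.002 of probability from digit 9 to digit 1.
observed = list(expected)
observed[0] += 0.002
observed[8] -= 0.002

print(chi_square(observed, expected, 10_000))
print(mad(observed, expected), conformity_label(mad(observed, expected)))
print(max_abs_dev(observed, expected), pearson_r(observed, expected))
```

Because `chi_square` scales linearly with `m` while `mad` does not depend on it, the same fixed deviation in proportions eventually exceeds any critical value as the sample grows, which is the size sensitivity described above.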

The same statistical tests have been performed for the case of the non-trivial Riemann zeta zeros sequence (table 4), with similar results.

### (c) Cramér's model

The prime number distribution is deterministic in the sense that primes are determined by precise arithmetic rules. However, its apparent local randomness has suggested several stochastic interpretations. Concretely, Cramér (1935; see also Tenenbaum & France 2000) defined the following model: assume that we have a sequence of urns *U*(*n*), where *n*=1, 2, …, and put black and white balls in each urn such that the probability of drawing a white ball from the *k*th urn goes as 1/log *k*. Then, in order to generate a sequence of pseudo-random prime numbers, we need only draw a ball from each urn: if the ball drawn from the *k*th urn is white, then *k* is labelled as a pseudo-random prime. The prime number sequence can indeed be understood as a concrete realization of this stochastic process. With such a model, Cramér studied, among others, the distribution of gaps between primes and the distribution of twin primes, on the grounds that, statistically speaking, these distributions should be similar to the pseudo-random ones generated by his model. Quoting Cramér: ‘With respect to the ordinary prime numbers, it is well known that, roughly speaking, we may say that the chance that a given integer *n* should be a prime is approximately 1/log *n*. This suggests that by considering the following series of independent trials we should obtain sequences of integers presenting a certain analogy with the sequence of ordinary prime numbers.’

In this work, we have simulated a Cramér process in order to obtain a sample of pseudo-random primes in [1,10^{11}]. The same statistics computed for the prime number sequence have then been computed for this sample. The results are summarized in table 5. We observe that Cramér's model reproduces the same behaviour, namely: (i) the first-digit distribution of the pseudo-random prime sequence follows a GBL with a size-dependent exponent that follows equation (3.2); (ii) the number of pseudo-random primes found in each decade matches, statistically speaking, the actual number of primes; (iii) the *χ*^{2}-test shows the same excess power problem for large datasets: bearing in mind that the sample elements in this model are independent (which is not the case in the actual prime sequence), we can confirm that the rejection of the null hypothesis by the *χ*^{2}-test for huge datasets is not related to a lack of data independence but more likely to the test's size sensitivity; and (iv) the rest of the statistical analysis gives results similar to those previously obtained for the prime number sequence.
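A small-scale version of such a Cramér process can be sketched as below. This is only an illustration under stated assumptions: the range is cut to 10^{5} instead of 10^{11} so that it runs quickly, the urns start at *k*=3 (since 1/log *k* exceeds 1 at *k*=2 and is undefined at *k*=1), and the function names and the fixed seed are hypothetical choices, not details given in the text.

```python
import math
import random

def cramer_pseudo_primes(n_max, seed=0):
    """Draw once from each urn k = 3..n_max; keep k with probability 1/log k."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [k for k in range(3, n_max + 1)
            if rng.random() < 1.0 / math.log(k)]

def first_digit_freqs(numbers):
    """Normalized frequencies of leading digits 1..9."""
    counts = [0] * 9
    for x in numbers:
        counts[int(str(x)[0]) - 1] += 1
    total = len(numbers)
    return [c / total for c in counts]

# Assumed toy range; the text uses [1, 10^11].
sample = cramer_pseudo_primes(100_000)
freqs = first_digit_freqs(sample)

print(len(sample))  # number of pseudo-random primes drawn
print(freqs)        # leading-digit frequencies for digits 1..9
```

The resulting `freqs` list can then be compared with the GBL expected probabilities using the same goodness-of-fit statistics as for the actual primes.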

## Footnotes

- Received March 6, 2009.
- Accepted March 23, 2009.

- © 2009 The Royal Society