# The Shannon entropy of Sudoku matrices

Paul K. Newton, Stephen A. DeSalvo

## Abstract

We study properties of an ensemble of Sudoku matrices (a special type of doubly stochastic matrix when normalized) using their statistically averaged singular values. The determinants are very nearly Cauchy distributed about the origin. The largest singular value is , while the others decrease approximately linearly. The normalized singular values (obtained by dividing each singular value by the sum of all nine singular values) are then used to calculate the average Shannon entropy of the ensemble, a measure of the distribution of ‘energy’ among the singular modes and interpreted as a measure of the disorder of a typical matrix. We show the Shannon entropy of the ensemble to be 1.7331±0.0002, which is slightly lower than an ensemble of 9×9 Latin squares, but higher than a certain collection of 9×9 random matrices used for comparison. Using the notion of relative entropy or Kullback–Leibler divergence, which gives a measure of how one distribution differs from another, we show that the relative entropy between the ensemble of Sudoku matrices and Latin squares is of the order of 10−5. By contrast, the relative entropy between Sudoku matrices and the collection of random matrices has the much higher value, being of the order of 10−3, with the Shannon entropy of the Sudoku matrices having better distribution among the modes. We finish by ‘reconstituting’ the ‘average’ Sudoku matrix from its averaged singular components.

## 1. Introduction

A Sudoku matrix A is a 9×9 real valued matrix with an integer between 1 and 9 in each entry, so long as the following constraints are obeyed:

• — each integer can appear only once along any row,

• — each integer can appear only once down any column, and

• — each integer can appear only once in each of the nine 3×3 sub-blocks.

Without the last ‘regional’ constraint, the matrix is usually called a 9×9 Latin square, whereas, without any of the constraints, if the integer in each entry is chosen with equal probability of , we call it a random matrix. It was recently proven by Felgenhauer & Jarvis (2006) that there are exactly 6, 670, 903, 752, 021, 072, 936, 960≡9!×722×27×27,704,267,971∼6.67×1021 Sudoku matrices. Their enumeration strategy begins by analysing the permutations of the top 3×3 sub-block used in valid Sudoku grids. Once these sub-block symmetries and equivalence classes for the partial solutions are identified, the other sub-blocks are constructed and counted for each equivalence class. Russell & Jarvis (2006) showed that, if symmetries are taken into account (using Burnside’s lemma), there are significantly fewer—only 5,472,730,538 of them. Both of these numbers are small compared with 981, the total number of random matrices, or the total number of 9×9 Latin squares, which is known to be O(1027) (Bammel & Rothstein 1975). In fact, taking the ratio of Felgenhauer and Jarvis’ number over 981 gives the minuscule empirical probability of 3.3927×10−56 of producing a Sudoku matrix by using a random number generator to select each integer entry independently, without explicitly building in the constraints. An example of a Sudoku matrix is shown in figure 1 and can be found in many major daily newspapers as a popular number game, whose modern version was apparently invented by the American architect Howard Garns in 1979, but whose ancestry dates back at least to Euler. A ‘Sudoku game’ starts with a few squares filled in (for example, the darkened numbers in figure 1), and the reader must fill in the rest (the lighter numbers in figure 1), obeying the above constraints to create a Sudoku matrix, a classic problem in constrained combinatorial optimization theory (Simonis 2005). If all but one of the squares are filled in initially, it is easy to see that there is a unique value for the remaining square, and hence the final Sudoku matrix is unique and the problem is said to be well posed. However, if none of the squares are filled in initially, there are many ways to fill them in (ill posed) to arrive at a final Sudoku matrix, namely all of the ones counted by Felgenhauer & Jarvis (2006). The fewest number of filled in squares that renders the solution unique is unknown, although the lowest number yet found is 17. The general problem of solving Sudoku puzzles on n2×n2 boards of n×n blocks is known to be NP-complete (Yato & Seta 2005). A rapidly evolving discussion of the current state-of-the-art regarding the ‘mathematics of Sudoku’ can be found at the http://en.wikipedia.org/wiki/mathematics_of_sudoku.

Figure 1.

An arbitrary Sudoku matrix. Each of the integers 1–9 appears only once down each column, along any row, and in each of the 3×3 sub-blocks. With only the 27 darkened numbers showing, a Sudoku game consists of filling in all the lighter coloured numbers to produce a complete Sudoku matrix.

In this paper, we focus on the statistical properties of Sudoku matrices, not on strategies for solving Sudoku puzzles. Progress on computer algorithms to solve Sudoku puzzles (using ‘Sinkhorn balancing’) are discussed in Moon et al. (2009) and comprehensive strategies are described in Davis (2009). Here, we mostly discuss average properties of ensembles of Sudoku matrices as opposed to specific properties of individual matrices. Of course, we first highlight the main properties that all Sudoku matrices making up an ensemble must share (discussed in the next section). Given the constraints imposed on the entries, we address the question of how random is a Sudoku matrix? We answer this question by creating an ensemble of Sudoku matrices from which we study the statistically averaged singular value distributions and Shannon entropy of the ensemble, which gives us a scalar measure of how evenly distributed are the rank-1 matrices whose linear combination constitutes an ‘average’ matrix, as detailed in the text. In §3, we describe the algorithm we use to generate Sudoku matrices, one of which is shown in figure 1. Section 4 discusses the singular value decomposition of Sudoku matrices, which we then use to calculate their (average) Shannon entropy. For comparison purposes, we calculate the Kullback–Leibler divergence between the ensemble of Sudoku matrices and a corresponding ensemble of Latin squares and random matrices.

## 2. Basic properties

The following properties hold for any Sudoku matrix:

• — Because of constraint 1, every Sudoku matrix A has eigenvalue λ=45, with corresponding eigenvector η=(1,1,1,…,1)T.

• — Because of constraint 2, the transpose of every Sudoku matrix AT has eigenvalue λ=45, with corresponding eigenvector η=(1,1,1,…,1)T.

• — Because of constraint 3, we have ATA.

• — Because of constraint 3, we have ATAAAT, and λ=452 is an eigenvalue of the covariance matrices AAT and ATA.

• — Because of constraint 3, we have 18≤trace(A)≤72. This is because the smallest value for the trace in any 3×3 sub-block is 1+2+3=6, while the largest value is 7+8+9=24.

Sudoku matrices can be singular, as the following example shows.

### Example 2.1.

The following Sudoku matrix has one eigenvalue that is zero, with corresponding eigenvector : 2.1

Sudoku matrices can be highly structured, as in the following example.

### Example 2.2.

The following Sudoku matrix is produced with an initialized top row containing values 1→9. The second row is obtained from the top row by applying a left-shift of three entries (to accommodate the block constraint). The third row is obtained from the second row by applying a left-shift of three entries. The fourth row starts a new 3×3 block; hence, we obtain it from the top row by applying a left-shift of one entry. Rows 5 and 6 are produced from the row above by a left-shift of three entries. Row 7 starts a new block and is obtained from row 4 by applying a left-shift of one entry. The final two rows are obtained from the rows above them by applying the left-shift of three entries. The result is a highly structured yet valid Sudoku matrix, 2.2

### Remark 2.3.

Every Sudoku matrix can be made into a doubly stochastic matrix (row and column sums equal to 1) by dividing each entry by 45. This normalized matrix M can then be viewed as a finite Markov chain with nine states, each of which is recurrent and aperiodic (Grimmett & Stirzaker 2005). The stationary distribution, a vector π such that π*M=π, is a left eigenvector of M with eigenvalue 1, and it follows immediately from basic property 2 that the vector of all 1s is a left eigenvector of M; hence, normalizing to obtain a probability distribution, we have . Note that Latin squares have the same stationary distribution, as they satisfy basic condition 2 as well.

As an indication of the distribution of the eigenvalues (specifically, their products), a histogram of the determinants of an ensemble of 10 000 Sudoku matrices (see §3 for the details of construction) is shown in figure 2 along with the sample mean, standard deviation and smallest and largest values from the sample. Plotted together with the histogram is a Cauchy distribution 2.3 which appears to closely model the data (if the entries are not selected uniformly, this would not be the case). Recall that the Cauchy distribution has no defined mean, variance or higher order moments, while its mode and median are c (the peak of the distribution), with s being the scale parameter specifying the half-width at half-maximum. Interestingly, the central limit theorem predicting convergence to a Gaussian distribution does not apply because the variance is not finite. Generically, the Cauchy distribution arises when we take the ratio U/V of two independent Gaussian distributed random variables U, V , with expected values 0 and unit variances. We use the parameter estimation techniques described in Nagy (2006) to determine the parameter values s=2.155×107 and c=−2.902×105. The determinants from the sample range from ±108, with a sample median value −98 820. All of the members of the ensemble had either rank 8 (somewhat special) or rank 9 (more typical) and we believe it is not possible for a Sudoku matrix to have a zero eigenvalue with algebraic multiplicity more than 1, but we have not been able to prove this. The recent article by Dahl (2009) discusses other general properties of Sudoku matrices.

Figure 2.

Histogram of determinants of 10 000 Sudoku matrices which appears to closely follow a standard Cauchy distribution, plotted as a solid curve. The sample median is −98820, and the sample mean is −570352.2660, with standard deviation 0.56719630860983×108. The smallest determinant from the sample is −551 886 210 while the largest is 355 398 840.

## 3. The Sudoku algorithm

### (a) Algorithm description

An ideal (unbiased) random Sudoku generator would index each of the approximately 1021 Sudoku matrices by an integer, and a uniformly distributed random integer between 1 and 1021 would be chosen corresponding to one of the matrices. The task of pre-assembling and working with 1021 matrices, however, has obvious drawbacks.

Instead, our algorithm for producing Sudoku matrices proceeds one row at a time, starting with the first row and working down to the bottom. First, a selection array consisting of the admissible integers 1–9 is created and an index integer j between 1 and 9 is chosen at random, with uniform distribution. The jth number in the selection array is placed in row/column/block=r/c/b=1/1/1 of the Sudoku matrix. This entry is then deleted from the selection array, leaving eight remaining entries. To select the next entry of the Sudoku matrix, a new index integer is picked at random, this time between 1 and 8. This new index is then used to choose the next member of the selection array and placed in r/c/b=1/2/1 of the Sudoku matrix. The process continues until the top row of the Sudoku matrix is filled and all entries of the selection array have been used. The remaining rows are generated in a similar manner, one entry at a time, with the additional preliminary step of reducing the set of admissible values based on the constraints in the corresponding r/c/b. In the event of encountering an entry with no possible admissible values, that entire row and the row preceding it are deleted (called the back-stepping procedure) and the process continues as before, starting with the first empty row. As an alternative to the back-stepping procedure, we could simply empty the matrix and re-start the algorithm from scratch each time the algorithm runs into a dead end, but it is far faster and more efficient to implement the back-stepping procedure, clearing only two rows. This algorithm seems to be fast and efficient and can easily generate 10 000 Sudoku matrices in a matter of seconds (using C code on a laptop computer).

### (b) Ensemble bias

The question is whether the algorithm produces an ensemble that is unbiased. Namely, we need the smaller sample produced by the algorithm to have the same statistical properties that a full ensemble of all 1021 matrices would have. For this, we describe a necessary condition that any unbiased sample ensemble of Sudoku matrices must have. The following simple facts are used. First, note that, given any Sudoku matrix, a new Sudoku matrix can be produced by globally switching any two numbers. For instance, we can switch the numbers 1 and 2 everywhere in the Sudoku matrix (2.2), and the result of the switch will produce a new, less structured, but valid Sudoku matrix. This gives us the following.

### Lemma 3.1.

Given a valid Sudoku matrix, we may construct an additional 9!−1 distinct Sudoku matrices.

### Proof.

Consider the set of all permutations of the numbers 1–9, and create a 1–1 mapping from the set {1,…,9} to each of the permutations by matching the nth entry in {1,…,9} to the nth entry in the permutation. For each permutation, a distinct Sudoku matrix may be obtained by replacing the number in each corresponding entry of the given Sudoku matrix with the corresponding entry in the permutation. ■

An ensemble of Sudoku matrices produced this way is an equivalence class.

### Lemma 3.2.

If the values of all Sudoku matrices in an equivalence class are averaged together component-wise, the resulting matrix is the rank-1 matrix with 5 in each entry.

### Proof.

The proof is clear by symmetry. Every entry will have each of the numbers 1–9 in it an equal number of times in the ensemble of 9!−1 matrices. ■

Thus, as in equation (2.2), we can initialize a given Sudoku matrix by choosing the matrix in the equivalence class that has the first row {1,2,3,4,5,6,7,8,9}. Proceeding this way will give a standardized way of looking at the distribution of matrices generated. Finally, since the matrices in each equivalence class average to the rank-1 matrix with 5 in each entry, we can average these averages and obtain the same matrix. This gives us the following.

### Lemma 3.3.

If the values of all Sudoku matrices are averaged together component-wise, the resulting matrix is the rank-1 matrix with 5 in each entry.

Therefore, we would like any subset of the full set of Sudoku matrices to inherit this property.

### Theorem 3.4.

A necessary condition that a set of Sudoku matrices is unbiased is that the component-wise average of the ensemble produce the rank-1 matrix with 5 in each entry.

The extent to which the sample average deviates from this rank-1 average is a measure of possible bias of the ensemble. Equation (3.1) shows the component-wise average of 108 Sudoku matrices generated by the algorithm which is acceptable for our purposes. Of course, more detailed and stringent tests could be developed, as in those discussed in Jacobson & Matthews (1996) for Latin squares, but we regard our relatively weak test sufficient for our purposes. 3.1

## 4. Ensemble analysis

Our main tool in the analysis of Sudoku matrices is the singular value decomposition and the resulting calculation of Shannon (1948) entropy for the ensemble of matrices which we now describe.

### (a) Singular value decomposition and Shannon entropy

To obtain the singular value decomposition of A, we find the nine eigenvalues λi, (i=1,…,9) of the associated covariance matrix ATA 4.1 where λi≥0, .1 The singular values σi are defined as . Equivalently, one can define the singular values and singular vectors directly via the system of linear equations 4.2 where , and the unit vectors and are called the right and left singular vectors, respectively. We use them as columns to construct the 9×9 orthogonal matrices U, V defined as 4.3 which produces the singular value decomposition of A (Kirby 2001; Trefethen & Embree 2005) 4.4 Σ is the diagonal matrix with singular values down the diagonal, ordered from largest (top left) to smallest (bottom right), with corresponding right and left singular vectors filling in the columns of V and U from left to right, 4.5 From equation (4.4), one can see that the singular value decomposition expresses A as the sum of rank-1 matrices, and does so in an optimal way. Namely, if we define the rank-k approximation to A, where k<9, by forming the sum 4.6 then ∥AAk2σk+1, where ∥⋅∥2 represents the 2-norm. It is a standard theorem in linear algebra (Kirby 2001) that any matrix B that is not the rank-k approximation (4.6) has greater error 4.7 where B is any 9×9 matrix of rank k. Each of the matrices in equation (4.6) are rank 1 and should be thought of as the ‘normal modes’ whose weighted linear combination produces the matrix Ak.

It is useful to normalize the singular values by dividing each by the sum of all nine, so that the normalized values lie in the range between 0 and 1 and can be used to determine the distribution of ‘energy’ over all the singular modes. The normalized values are defined as 4.8 From these, we define the Shannon entropy to be 4.9 which provides us with a scalar measure of the distribution of ‘energy’ among the modes, which is typically interpreted as the level of disorder, or randomness, associated with the matrices as well as a measure of how rapidly the normalized singular values decay from the peak (i=1). The next two examples give further insight into the relation between the Shannon entropy of a matrix and its singular value distribution.

### Example 4.1.

All orthogonal matrices have the property AT=A−1, hence AAT= I. Therefore, each eigenvalue of the covariance matrix λi=1, which when normalized gives equal singular values , (i=1,…,9). Therefore, the Shannon entropy for any 9×9 orthogonal matrix is , which is the maximal value that H can achieve for the 9×9 case. Normalized singular value distributions that are evenly distributed among all the available modes correspond to maximum entropy states.

### Example 4.2.

The 9×9 matrix with 1s in every entry has the property that its covariance matrix has 9s in each entry. Therefore, the covariance matrix has rank 1 and nullspace dimension 8, which means there are eight normalized singular values that are zero, and one that is exactly 1. The Shannon entropy for this case is 0, which is the minimal value that H can achieve. Singular value distributions that have all the energy clustered into a single mode are minimum entropy states.

From these two examples, one sees that distributions of singular values that drop off rapidly from the peak value correspond to lower entropy matrices than those that are flat. The flattest distribution, where all normalized singular values are equal in height, corresponds to the matrix with maximum entropy.

It is also useful to define the notion of ‘percentage of compression’ a given matrix achieves with respect to the maximal entropy matrix. Hence, given a Sudoku matrix A, define 4.10 From the previous examples, orthogonal matrices of example 4.1 have % compression, while minimum entropy matrices of example 4.2 have % compression.

### Example 4.3.

The singular value decomposition of the Sudoku matrix shown in example 2.1 gives rise to matrices U, Σ and V ,

Note that the first term in the partial sum representation (4.6) gives rise to the rank-1 matrix (minimal Shannon entropy as in example 4.2) 4.11 where each entry is the arithmetic average of the allowable digits 1–9 in each column, row, or 3×3 sub-block.2 If we normalize the singular values down the diagonal of Σ, 4.12 we obtain a value of the Shannon entropy H=1.7137 and a compression factor of 22 per cent.

In the next section, we calculate the Shannon entropy and percentage of compression of an ensemble of Sudoku matrices generated by our algorithm.

### (b) Ensemble averages

To generate an ensemble of Sudoku matrices, we run the algorithm described above N times, denoting each realization of a Sudoku matrix A(j). The singular values for the jth realization are denoted and their corresponding left and right singular vectors are denoted and (i=1,…,9), respectively. We define the ensemble average of the collection of matrices 4.13 as well as the ensemble averages of the singular components 4.14 4.15and 4.16 The standard deviation of each quantity is denoted with double brackets 〈〈⋅〉〉N. If we first normalize the singular values from each member of the ensemble as in equation (4.8), i.e. 4.17 and then perform the ensemble averages, we denote the averaged normalized values 4.18 with standard deviations .

The averaged normalized singular values (4.18) are shown in figure 3 for N=1,…,10 000. Each is seen to converge (the law of large numbers) to a sample mean value denoted by . The distribution of the nine values is shown in figure 4 (data points), along with error bars (black) showing one standard deviation and the spread of all the data about the mean (grey). The singular values decrease very nearly linearly. Histograms of each of the averaged normalized values are shown in figure 5, along with their Gaussian fits. Table 1 shows the nine sample mean values , their standard deviations and variances for N=108.

Figure 3.

Ensemble average of each of the nine singular values for N Sudoku matrices, where N=1,…,10 000. The horizontal (dashed) line in each plot is , with value shown.

Figure 4.

Distribution of the normalized singular values of the ensemble of Sudoku matrices in descending order. The dots denote the sample mean, with error bars showing one standard deviation about the mean and the spread of all the data with a least squares line (dashed) to the first eight singular values (the scale of the x-axis is arbitrary). The computed correlation coefficient is r2=0.9993. The unfilled circles denote the singular values of a typical Sudoku matrix selected randomly, while the unfilled squares denote the singular values from equation (2.2).

Figure 5.

Histograms of the nine singular values for the ensemble of N= 10 000 Sudoku matrices. The spread around the mean appears to follow a Gaussian distribution reasonably well. The smallest singular value has an approximate half-Gaussian spread (centred at 0).

View this table:
Table 1.

Minimum value, maximum value, sample mean and standard deviation for the nine ensemble-averaged and normalized singular values based on a sample size of N=108.

We denote the Shannon entropy of the jth member of the ensemble to be 4.19 with ensemble average 4.20 and standard deviation 〈〈H〉〉N. A histogram of the averaged Shannon entropy is shown in figure 6, along with a Gaussian fit to the data. The ensemble-averaged Shannon entropy has sample mean value 〈H108=1.73312 and standard deviation 0.000173. The smallest entropy achieved in the sample was 1.512975, whereas the largest was 1.881064. A convergence plot of the Shannon entropy is shown in figure 7 for N=1,2,…,10 000. If we use this sample mean value in equation (4.10) to calculate the ensemble-averaged percentage of compression of a collection of Sudoku matrices, we arrive at 4.21 This value is slightly lower than the corresponding value for a collection of 108 9×9 Latin squares, which has Shannon entropy 1.73544±0.0001735. On the other hand, a collection of random matrices has the much lower entropy value of 1.65128±0.0001656, giving a 25 per cent compression factor.

Figure 6.

Ensemble-averaged Shannon entropy with N=10 000, with sample mean value 1.73221822064708 and standard deviation 0.03996073536463. A histogram is shown along with the Gaussian distribution.

Figure 7.

The Shannon entropy of the ensemble of N=1,…,10 000 Sudoku matrices. The horizontal line (dashed) shows the approximate converged sample average.

### Remark 4.4.

When comparing two distributions, the relative entropy or Kullback–Leibler divergence gives a measure of how likely the first distribution will resemble the second (Kullback & Leibler 1951). This value is defined as 4.22 where a=(a1,…,an) and p=(p1,…,pn) are the two distributions being compared. Note that h(a,p)≠h(p,a). It has already been established that the entropy for Sudoku matrices and Latin squares is similar, so naturally their relative entropy should be small. Indeed, this is the case, as this value (if ai corresponds to the Sudoku distributions and pi corresponds to Latin squares) turns out to be h(a,p)=9.74107032×10−6≈10−5. To put this in perspective, the relative entropy between Sudoku matrices and random matrices is 0.004044≈5×10−3, two orders of magnitude higher, and nearly the same as the relative entropy between Latin squares and random matrices, which is 0.0042812.

## 5. Matrix reconstitution

One might ask what the ‘typical’ Sudoku matrix looks like in terms of its decomposition (4.4). For this, we use the ensemble-averaged singular vectors , to construct the matrices along with the averaged singular values 〈σiN. The ‘reconstituted’ matrix is then defined to be 5.1 which yields an optimal rank-k ensemble-averaged Sudoku matrix. The matrices with N=108 for k=3,6 and 9 are as follows: 5.2 5.3 5.4

Given the discussion at the end of §3 with respect to the extent to which our sample is unbiased, the convergence to the rank-1 matrix (4.11) is expected. By subtracting the matrix (4.11) from Ak, k=3,6,9, the singular value spectrum should go to zero. We show this in figure 8. Each shows one dominant non-zero singular value with the eight others close to zero.

Figure 8.

Singular values for the rank-k=3,6,9 matrices of the ensemble average. Each shows a spectrum similar to the rank-1 matrix of all 5’s. (a) Singular values for the rank-3 matrix, (b) singular values for the rank-6 matrix and (c) singular values for the rank-9 matrix.

## 6. Summary

The normalized distribution of singular values and corresponding Shannon entropy for an ensemble of Sudoku matrices provides a quantitative measure of the level of ‘disorder’ inherent in the collection by characterizing the weighting of the nine rank-1 ‘modes’ used in reconstituting the ‘average’ Sudoku matrix and allows for comparisons with the related ensembles of Latin squares and random matrices obtained by dropping the regional constraint 3, and all three constraints, respectively. Our conclusions based on these comparisons are that the ensemble-averaged Shannon entropy of the collection of Sudoku matrices is slightly lower than a collection of Latin squares, but higher than a collection of appropriately chosen random matrices. Thus, the extra constraints imposed on the Latin squares and Sudoku matrices serve to increase their Shannon entropy as compared with what it would be without the constraints imposed (random matrices), forcing a more even distribution among the singular modes. This is consistent with the fact that the set of random matrices includes matrices of all possible rank, including matrices with rank 1. Low-rank matrices have lower Shannon entropy than high-rank ones, pulling the average of the ensemble down. Why the more constrained Sudoku matrices have a lower average Shannon entropy than the less constrained ensemble of Latin squares is not obvious, nor is it clear that the differences in these averages are statistically significant, or that our methods are sufficiently sensitive to distinguish between such subtle differences.

## Acknowledgements

P.K.N. would like to acknowledge financial support through NSF-DMS0504308. S.A.D. would like to thank Chris DeSalvo for help with the development of the Sudoku algorithm.

## Footnotes

• 1 We note that there are similarities here to the well-studied Wishart matrices, whose distribution arises from the sample covariance matrix obtained from a multivariate normal distribution. These arise frequently in the study of spectral theory of random matrices (Sengupta & Mitra 1999). Also note that sample covariance matrices are well known to be sensitive to statistical outliers.

• 2 In fact, this will always be the case since the constraints imply that 45 is an eigenvalue of A and AT, with corresponding eigenvectors (1,1,…,1) and (1,1,…,1)T.