Pictish symbols revealed as a written language through application of Shannon entropy

Many prehistoric societies have left a wealth of inscribed symbols for which the meanings are lost. For example, the Picts, a Scottish, Iron Age culture, left a few hundred stones expertly carved with highly stylized petroglyph symbols. Although the symbol scripts are assumed to convey information, owing to the short (one to three symbols), small (less than 1000 symbols) and often fragmented nature of many symbol sets, it has been impossible to conclude whether they represent forms of written language. This paper reports on a two-parameter decision-tree technique that distinguishes between the different character sets of human communication systems when sample sizes are small, thus enabling the type of communication expressed by these small symbol corpuses to be determined. Using the technique on the Pictish symbols established that it is unlikely that they are random or sematographic (heraldic) characters, but that they exhibit the characteristics of written languages.


Introduction
Among the durable artefacts left by prehistoric societies, there are many instances of enigmatic scripts. These scripts typically consist of very short sequences of regularly placed symbols (or single symbols) and range from the inscribed pottery of the Chinese Neolithic (Li et al. 2002), through the inscribed clay tablets and seals of the Indus Valley culture (Rao et al. 2009) to the inscribed stones of Late Iron Age Scotland (Wainwright et al. 1955; Mack 1997). A longstanding conundrum has been to determine whether any of the symbol sets might be an example of a written language. A number of problems have impeded progress in this area: the non-availability of reliable corpuses describing the specific symbols, a lack of agreement on the definition of individual symbol types, small corpus sizes ranging from a couple of hundred to a few thousand symbols, the often short nature of individual inscriptions (one to three symbols in length) and the lack of [...] information is communicated without reference to verbal language forms (such as heraldic characters that have no lexigraphic value in themselves but identify a person, position and place) or (b) lexigraphic scripts, where the characters embody the form of verbal language (e.g. logograms representing words and syllables (non-phonetically), syllabograms representing syllables (phonetically), alphabetic signs representing letters (parts of syllables) and code characters (e.g. Morse code) representing parts of letters) (Powell 2009)?

A fundamental characteristic of any communication system is that there is a degree of uncertainty (also known as entropy or information) over the particular character or message that may be transmitted (Shannon 1993a). A measure of the average uncertainty of character occurrence is the uni-gram (single character) entropy, $F_1$ (Shannon 1993b).

*Author for correspondence (r.lee@exeter.ac.uk).
In a set of $N_u$ different characters, the first-order entropy ($F_1$) is given by

$$F_1 = -\sum_{i=1}^{N_u} p_i \log_2 p_i,$$

where $p_i$ is the relative frequency of occurrence of a character calculated from the dataset. In a large dataset of random characters (i.e. sampled with equal probability from a finite lexicon), all uni-grams appear with the same frequency, so $p_i = 1/N_u$, thus $F_1 = \log_2 N_u$. However, small sample sets of random characters will deviate from this since the incompleteness of the sample available will lead to unequal relative frequencies being observed. Thus, in small sample sets of random characters, $p_i \sim 1/N_u$ when estimated from the sample. Figure 2 confirms that $F_1 \sim \log_2 N_u$ for 40 sets of random data of small sample size ranging from 15 to 1000 characters. Systems for which $F_1$ is different from $\log_2 N_u$ (with respect to the confidence ellipse for prediction) can be identified as non-random and characteristic of writing.

The simplest gauge of character-to-character information in written communications is the di-gram entropy, $F_2$, the measurement of the average uncertainty of the next character when the preceding character is known. Shannon defined $F_2$ as (Shannon 1993b)

$$F_2 = -\sum_{i=1}^{N_u}\sum_{j=1}^{N_u} p(b_i, j) \log_2 p(b_i, j) - F_1,$$

in which $b_i$ is a uni-gram (single character), $j$ is an arbitrary character following $b_i$, $p(b_i, j)$ is the relative frequency of the di-gram (pair of characters) $b_i, j$ and $F_1$ is the uni-gram entropy, where the summation is from 1 to $N_u$ for a set of $N_u$ uni-gram characters. $F_2$ is at a maximum when all the possible di-grams appear with the same frequency (Yaglom & Yaglom 1983). Thus, as the ability to predict the next character increases, the di-gram entropy decreases.

Figure 3 shows the di-gram entropy for over 400 datasets of scripts containing small samples of characters. Each dataset contains between 30 and 10 000 word equivalents for a wide variety of character types and scripts. The scripts analysed cover sematograms (heraldic characters), logograms (Chinese), syllabaries (Linear B and Egyptian hieroglyphs), alphabetic systems (analysed at letter, syllable and word level) of different modern languages (English, Irish, Welsh, Norse, Turkish, Basque, Finnish, Korean) and ancient languages (Latin, Anglo-Saxon, Old Norse, Ancient Irish, Old Irish, Old Welsh). The texts cover prose, poetry, monumental inscriptions and genealogical lists (King lists, marriage and birth lists). Full details are given in §5. Unfortunately, figure 3 shows that, for systems containing only small samples of characters, the Shannon di-gram entropy as a function of text size (as given by the total number of uni-grams, $T_u$) cannot be used to differentiate between the different character types or even between the types of writing (semasiographic or lexigraphic). The reason for this failure to differentiate between character types is that, at these small sample sizes, there are insufficient data to properly characterize the character lexicon, which affects the observed N-gram distributions and hence the entropy. In this paper, the term incomplete is used to describe text samples that are insufficiently representative to characterize the underlying character lexicons.

Figure 3. Plot of $F_2$ (di-gram entropy) versus $T_u$ (text size based on the total number of uni-gram characters) for a wide range of texts and character types. The di-gram entropy is similar for different types of characters in datasets with small sample size owing to the incomplete nature of the di-gram lexicons. Dashes, sematograms (heraldry); filled diamonds, letters (prose, poetry and inscriptions); grey filled triangles, syllables (prose, poetry and inscriptions); open squares, words (genealogical lists); crosses, code characters; open diamonds, letters (genealogical lists); filled squares, words (prose, poetry and inscriptions).
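As a rough illustration of the two entropies defined above, the following Python sketch estimates $F_1$ and $F_2$ from the relative frequencies observed in a sequence of characters. The function names are illustrative and do not come from the paper. The di-gram entropy is computed in its conditional form, with the frequency of each preceding character taken from the di-grams themselves; this is an assumption made here so that the estimate stays non-negative in very short texts.

```python
from collections import Counter
from math import log2


def unigram_entropy(chars):
    """Estimate F_1 = -sum_i p_i log2 p_i from observed relative frequencies."""
    counts = Counter(chars)
    total = sum(counts.values())
    return -sum((n / total) * log2(n / total) for n in counts.values())


def digram_entropy(chars):
    """Estimate F_2: average uncertainty of the next character given the previous one."""
    digrams = Counter(zip(chars, chars[1:]))
    # How often each character occurs as the first member of a di-gram
    # (assumption: prefix frequencies are taken from the di-grams, not
    # from the full uni-gram count).
    prefixes = Counter()
    for (first, _), n in digrams.items():
        prefixes[first] += n
    total = sum(digrams.values())
    # -sum p(b_i, j) log2 [p(b_i, j) / p(b_i)]: the joint di-gram entropy
    # minus the entropy of the preceding characters.
    return -sum((n / total) * log2(n / prefixes[first])
                for (first, _), n in digrams.items())
```

For a fully predictable sequence such as "abababab", the uni-gram entropy is 1 bit while the di-gram entropy is zero, since each character determines its successor.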
For a text of a given size, the degree to which the N-gram lexicon is incomplete will be strongly affected by a number of linguistic phenomena, including
- type of character used to code for the communication,
- size of the character lexicon used (e.g. texts with constrained (or limited) vocabularies pull from a pool of available words that is limited to a fraction of a normal vocabulary),
- grammar of the unknown language (e.g. the system of inflection within the language),
- syntax of the unknown language (i.e. the word order), and
- degree of standardized spelling (i.e. many inscriptions do not use standardized spelling).
In very large datasets (100+ million words), these phenomena reduce the prediction ability of N-gram-based statistical language models used in such areas as speech and optical character recognition, document classification and machine translation (Rosenfeld 2000). As a consequence, linguistically derived functions are used to make up some of the predictive deficiency (Rosenfeld 2000). Unfortunately, these functions are not appropriate for unknown systems with small sample size datasets and very incomplete character lexicons. In order to be able to compare the di-gram entropies of such datasets, a measure of the degree of 'incompleteness' or 'completeness' of the di-gram lexicons is needed. This paper proposes that a measure of the completeness of the di-gram lexicon can be derived from the number of different uni-grams and di-grams in the dataset.
For a text with a given number of different uni-grams, $N_u$, the number of different di-grams, $N_d$, will depend upon the incompleteness of the di-gram lexicon (which in turn is dependent upon the linguistic phenomena outlined above). Depending upon the degree of di-gram lexicon completeness, $N_d$ will range between $N_u$ (very incomplete) and $N_u^2$ (complete, but note that this is the theoretical maximum: actual lexicons will, in practice, always be smaller than $N_u^2$ since the rules of syntax and spelling will only allow a value less than $N_u^2$). Thus, a measure of the completeness in the di-gram lexicon for small samples can be obtained by calculating ($N_d/N_u$), where $N_d$ is the number of different di-grams and $N_u$ is the number of different uni-grams. Figure 4 shows that the di-gram entropy is dependent upon this measure of the degree of completeness in the di-gram lexicon and shows differentiation between three types of lexigraphic character types (words, syllables and letters). Thus, this paper proposes normalizing $F_2$ by $\log_2 (N_d/N_u)$ in order to define, for any text, a second-order function ($U_r$) adjusted for the di-gram lexicon completeness in a small sample size,

$$U_r = \frac{F_2}{\log_2 (N_d/N_u)}.$$

As figure 4 shows, systems written in sematograms (heraldry) and lexigraphic code characters can have similar di-gram entropies to those of standard lexigraphic characters. However, while words, syllables and letters generally correspond to a fixed unit of language, sematograms and code characters have no fixed lexigraphic value in themselves, but are combined to produce a lexigraphic character. Thus, heraldic and code characters are, by their nature, intrinsically more repetitive than standard lexigraphic characters. For example, Morse code uses two characters in combination (with a space) and thus the four di-grams (dash-dash, dash-dot, dot-dash and dot-dot) are used repeatedly in order to build the 26 letters of the alphabet.
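The normalization described above can be sketched in a few lines of Python. The function names are illustrative, the di-gram entropy is passed in as a number computed elsewhere, and treating texts with $N_d \le N_u$ as undefined is an assumption made here to avoid dividing by zero or by a negative logarithm.

```python
from math import log2


def completeness(chars):
    """Return (N_u, N_d): the numbers of different uni-grams and di-grams."""
    n_u = len(set(chars))
    n_d = len(set(zip(chars, chars[1:])))
    return n_u, n_d


def u_r(chars, f2):
    """U_r = F_2 / log2(N_d / N_u) for a text with di-gram entropy f2."""
    n_u, n_d = completeness(chars)
    if n_d <= n_u:
        # Assumption: the ratio is only meaningful when N_d exceeds N_u.
        raise ValueError("U_r is undefined unless N_d > N_u")
    return f2 / log2(n_d / n_u)
```

For the toy sequence "abacabad", $N_u = 4$ and $N_d = 5$, so the di-gram entropy is divided by $\log_2 1.25$.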
By definition, repetitive di-grams are ones that appear more than once, thus a measure based on the degree of di-gram repetition in a text can be given by ($S_d/T_d$), where $T_d$ is the total number of di-grams and $S_d$ is the number of di-grams that appear only once in the text. For texts with a high degree of di-gram repetition, $S_d$ approaches 0. The degree of di-gram repetition will have some dependency upon the degree of completeness in the di-gram lexicon, since texts with a relatively 'complete' di-gram lexicon will be more likely to have greater repetition of the di-grams and a lower value of $S_d$. Figure 5 shows that the degree of di-gram repetition ($S_d/T_d$) is dependent upon the degree of completeness in the di-gram lexicon ($N_d/N_u$) for texts using standard lexigraphic characters. Figure 5 also shows that heraldic and code characters do not follow the same dependency as standard lexigraphic characters. Thus, this paper proposes a di-gram repetition factor, $C_r$, defined as a linear combination of the two quantities

$$C_r = \frac{N_d}{N_u} + a\,\frac{S_d}{T_d}, \tag{2.4}$$

where $a$ is a constant estimated using cross-validation techniques in order to maximize the performance of a decision tree. Thus, the structure variables, $U_r$ and $C_r$, are combinations of underlying linguistic variables that elucidate the key characteristics and structure of the data. Both $U_r$ and $C_r$ can be calculated for any type of communication system without any prior knowledge of the meaning of a system and have been used in a two-parameter decision tree to classify the following character types: (i) words, (ii) syllables, (iii) letters, and (iv) other characters such as heraldic sematograms and lexigraphic code characters.

Figure 2 shows the 99.9 per cent confidence ellipse for prediction around 40 sets of random data. The datasets plotted in figure 2 were generated as follows: characters were sampled from a uniform distribution (i.e. with equal relative frequencies) into small units of text similar to the small units of glyphs seen on the stones. The key properties (total number of uni-grams, number of different uni-grams and the subsequent fraction of uni-grams appearing only once) bracketed the corresponding properties observed in the stones. Figure 2 therefore tests whether the stones correspond to similar-sized samples from a finite alphabet of equal relative frequency of uni-gram occurrence.

Figure 6. Two-parameter decision tree that separates repetitive text from non-repetitive text. This figure classifies the character types found in non-repetitive text into the three main lexigraphic character units (words, syllables, letters). Repetitive text consists of two main categories of characters: non-lexigraphic heraldic characters and lexigraphic code characters, as well as non-concordant letter, syllable and word character texts that are repetitive.

Results
Texts based on written communication have an uneven distribution of characters that generally results in a lower $F_1$ for any value of $N_u$ when compared with random sets. Figure 2 shows that the observed uni-gram entropy values for the Pictish symbols fall outside the 99.9 per cent confidence ellipse for prediction surrounding the random uni-gram dataset. Hence, it is extremely unlikely that the observed values for the Pictish stones would occur by chance were they indeed a random dataset.

The structure variables $U_r$ and $C_r$ have been used in a decision-tree analysis to differentiate the majority of the character types found in written communication. Figure 6 shows a decision tree, cross validated using the Gini diversity index, with a successful allocation rate of 99.1 per cent. (The performance of the classifier remains constant provided two decimal places of the partition values of the variables are retained, thus the classifier estimates are optimal to two decimal places. An estimate of the value of the parameter $a$ in equation (2.4) was obtained by cross validation. Full details of the validation are given in §5.) Texts with $C_r < 4.89$, where $C_r = N_d/N_u + 7(S_d/T_d)$, are repetitive in nature and are consistent with heraldic character systems (sematograms) and code-character systems, both of which are characterized by highly repetitive character sequences. Unfortunately, some repetitive lexigraphic texts also fall in this group and so if a text has $C_r < 4.89$, we cannot determine what character type is present using this tree. If, however, a text has $C_r \geq 4.89$, then we classify the character types as lexigraphic and, depending upon the value of $U_r$, determine whether the characters represent words, syllables or letters.

It is generally easier to predict the next letter than the next word because of: (i) the spelling rules (e.g. q is usually followed by u in English) and (ii) the constrained nature of the letter lexicon compared with the word lexicon (26 letters in English versus a word vocabulary of hundreds for even the most constrained texts). This means that for a given value of ($N_d/N_u$), we should expect $F_2$ for words to be larger than for letters and thus $U_r$ to be larger, and figures 3 and 6 show this to be the case. The separation of the lexigraphic character types is independent of language or sign type (i.e. alphabet, syllabogram and logogram scripts).

Figure 7. The effect on the empirical cumulative distributions of $U_r$ ($F_2/\log_2 (N_d/N_u)$) of increasing the character vocabulary constraint for letters. As the vocabulary becomes constrained, the distribution of $U_r$ becomes narrower and the mean value decreases. Short-dashed line, empirical cumulative distribution for letter characters for all prose, poetry and inscriptions; long-dashed line, empirical cumulative distribution for letter characters for constrained genealogical lists; solid line, empirical cumulative distribution for letter characters from very constrained lists.
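The repetition factor and the two-parameter decision rule described in this section can be sketched as follows. This is a minimal illustration, not the fitted tree: the value $a = 7$ and the partition value 4.89 are quoted in the text, whereas the two cut points on $U_r$ are not stated there, so they are left as required arguments with hypothetical names. The only ordering assumed is the one the text gives, namely that $U_r$ is smaller for letters than for words, with syllables in between.

```python
from collections import Counter

REPETITION_CUT = 4.89  # partition value of C_r quoted in the text


def c_r(chars, a=7.0):
    """C_r = N_d/N_u + a * (S_d/T_d), with a = 7 as quoted in the text."""
    digrams = Counter(zip(chars, chars[1:]))
    n_u = len(set(chars))                              # different uni-grams
    n_d = len(digrams)                                 # different di-grams
    t_d = sum(digrams.values())                        # total di-grams
    s_d = sum(1 for n in digrams.values() if n == 1)   # di-grams seen once
    return n_d / n_u + a * s_d / t_d


def classify(c_r_value, u_r_value, letter_syllable_cut, syllable_word_cut):
    """Apply the two-parameter rule; both U_r cut points are hypothetical inputs."""
    if c_r_value < REPETITION_CUT:
        return "repetitive"  # heraldic, code or repetitive lexigraphic text
    if u_r_value < letter_syllable_cut:
        return "letters"
    if u_r_value < syllable_word_cut:
        return "syllables"
    return "words"
```

A text shorter than two characters has no di-grams, so `c_r` is only meaningful for longer sequences.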
As a character vocabulary is constrained, it becomes easier to predict the next character, decreasing $F_2$ and $U_r$. The effect of constraining the character vocabulary upon the distribution of $U_r$ is shown in figure 7. Within normal texts, there is a wide variety of vocabulary constraints. Constraining the character vocabulary (e.g. King lists and genealogical lists that are constrained to a vocabulary of names or genealogical lists using an even smaller vocabulary of familiar, diminutive names) gives a narrower distribution and a decreasing mean value of $U_r$.
The tree classifier developed suggests that the Pictish symbols are lexigraphic in nature because they have values of $C_r$ in the interval [5.6, 6.2] (table 1). In particular, we infer that the Pictish symbols are not drawn from a distribution of heraldic characters. Table 1 shows that Mack's symbol categorization gives values of $U_r$ that fall on the syllable side of the syllable/word boundary. However, Mack's categorization of the symbol types is much narrower than that of other workers (Allen & Anderson 1903; Diack 1944; Forsyth 1997). If Mack's categorization [...]

Table 1. Values of $C_r$ and $U_r$ calculated for the Class I and Class II Pictish symbol stones using the symbol types given by Mack (1997) and by Allen & Anderson (1903).

Discussion
Since there are many complete stones inscribed only with a single symbol, it seems unlikely (although not impossible) that the symbols are single syllables.
In order to answer the question of whether the symbols are words or syllables, and thus define a system from which a decipherment can be initiated, a complete visual catalogue of the stones and the symbols will need to be created and the effect of widening the symbol set investigated. However, demonstrating that the Pictish symbols are writing, with the symbols probably corresponding to words, opens a unique line of further research for historians and linguists investigating the Picts and how they viewed themselves. Having shown that it is possible to use an entropic technique to investigate the degree of communication in very small and incomplete written systems, it may be possible to extend this to other areas with similar problems. For example, animal language studies using Shannon entropies are often hampered by small sample datasets (McCowan et al. 1999). By building a similar set of data for spoken or verbal human communication, it should be possible to make similar comparisons of the level of information communicated by animal languages.

(a) Entropy calculations
For all texts, a 'start/end' character was inserted at commas or full stops, otherwise all punctuation was removed and all spaces ignored (since many old inscriptions have little or no punctuation). $F_0$, $F_1$ and $F_2$ were calculated at the character levels appropriate for the text, e.g. alphabetic texts were mainly analysed at the letter and word level, syllabogram texts at the syllable and word level.
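The preprocessing step described above might look like the following letter-level sketch. The marker symbol `#` and the function name are hypothetical, since the text does not say which symbol was used, and keeping only alphanumeric characters is a simplifying assumption standing in for "all punctuation removed and all spaces ignored".

```python
BOUNDARY = "#"  # hypothetical 'start/end' marker; the symbol is not named in the text


def to_characters(text):
    """Turn raw text into a list of letter-level characters with boundary markers."""
    chars = []
    for ch in text:
        if ch in ",.":
            chars.append(BOUNDARY)   # commas and full stops become start/end markers
        elif ch.isalnum():
            chars.append(ch)         # letters (and digits) are kept
        # spaces and all other punctuation are dropped
    return chars
```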

(b) English texts
Prose fiction texts were written under varying degrees of word constraint (normal texts have ca 4.3 letters/word, lightly constrained texts have between 3.6 and 4.0 letters/word and highly constrained texts have 2.5-3.0 letters/word). Graveyard texts from Kelsall Church of England graveyard were used. Text sizes varied from 35 to 10 000 words and the texts were analysed at the letter, syllable and word level.

(c) Chinese texts
Prose and poetry texts from Yu Xuan Ji, Hong Lou Meng and Shijing were analysed (Lung 2009). Texts of 50 to 3000 word characters were analysed at the word level.

(d) Universal Declaration of Human Rights text
Languages analysed were English, Irish, Welsh, Norse, Turkish, Basque, Finnish and Korean at the word and letter level (UDHR 2008).

(e) Ancient inscriptions from the British Isles
Languages analysed were Latin, Anglo-Saxon, Old Norse, Ancient Irish, Old Irish and Old Welsh. Only whole words from translatable inscriptions were included. Each inscription was bracketed with a start/end character or a 'missing' character for incomplete inscriptions. All punctuation (if present) was removed. Ligated letters were separated into their constituent letters. Alphabet-specific characters were retained. Each corpus of specific inscription types was run as a single set. Irish inscriptions were split into an early tradition (ogam) and later tradition (uncial) with two different authors being used for the early tradition (Macalister 1945, 1949; McManus 1991). Welsh inscriptions were split into an early tradition (Class I) and later tradition (Class II and III) (Nash-Williams 1950). Roman memorials were split into two groups, those found at Hadrian's Wall and the rest (Collingwood & Wright 1965). Inscribed stones, slabs, crosses and personal items from the Anglo-Saxon period were used (Okasha 1971). Isle of Man Norse runic inscriptions (Page 1995) and Southern Scottish inscriptions (Thomas 1991-1992) were used. The text sizes ranged from 50 to 2200 words and were analysed at the letter, syllable and word level.

(f) Egyptian monumental texts
These were transcribed in two ways: using the standard modern spelling (which removes superfluous hieroglyphs and applies a standard spelling) and an 'as observed' reading of the hieroglyphs (Zauzich 2004). The Egyptian hieroglyphs in these texts are primarily syllabic in nature, being predominantly a mix of single and biliteral glyphs. Text size was 250 words analysed at the word and syllable level.

(g) Mycenaean lists (Linear B)
These were split into two groups: military lists and others (Palmer 1998). Text size was 450-600 words analysed at the word and syllable level.

(h) King lists
These contain only the names of child and parent(s) for the Pictish, Anglo-Saxon, Scottish, English, Cashel and Munster lineages (Anderson 1973; Byrne 1973; Montague-Smith 1992). Text size was 60-175 words analysed at the word and letter level.

(i) Genealogical lists
English baronial genealogies containing: (i) Christian names of the child and parent(s), (ii) Christian names of bride and groom, and (iii) surnames of bride and groom were used (Sanders 1960). A second set of lists was created using familiar, diminutive names instead (e.g. 'Al' for Alan, Alfred and Albert). Text size was 250-1500 words analysed at the word and letter level.

(j) Sematogram heraldic
A normal distribution of arms from the Heraldic Arms of British Extinct peerages (1086-1400) was used (Burke 1962). The charges (symbols) on the shield were used as characters for analysis. The colour of the charge was also used for analysis. A simplified set of characters was also generated using only the base symbols, e.g. (i) all the different lion charges such as rampant or passant are classified as a 'lion' character and (ii) all different cross charges such as bourdonny and fleuretty are classified as 'cross' in the base-symbol categorization. Each coat of arms was read as observed symbols from bottom to top. Text size was 400-1200 symbols.

(k) Subletter coded systems
A range of English texts was transposed using Morse code and a three-character code for the letters. Text size was 400-75 000 characters.

(l) Random
Randomly generated character texts, ranging from sets of two to 100 different characters, were used with text sizes of 15-1000 characters. The texts bracketed the values observed in the stones for the total number of uni-grams ($T_u$), the number of different uni-grams ($N_u$) and the fraction of uni-grams appearing only once.
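A random comparison text of this kind could be generated along the following lines. This is only a sketch: the function names and the use of integer labels for characters are illustrative, and the 'fraction of uni-grams appearing only once' is taken here to mean the fraction of different uni-grams, which is an assumption about how that quantity was defined.

```python
import random
from collections import Counter
from math import log2


def random_text(lexicon_size, length, rng):
    """Sample `length` characters with equal probability from a finite lexicon."""
    return [rng.randrange(lexicon_size) for _ in range(length)]


def summarize(chars):
    """Return T_u, N_u, the fraction of uni-grams seen once, F_1 and log2 N_u."""
    counts = Counter(chars)
    t_u = len(chars)
    n_u = len(counts)
    once = sum(1 for n in counts.values() if n == 1) / n_u
    f1 = -sum((n / t_u) * log2(n / t_u) for n in counts.values())
    return {"T_u": t_u, "N_u": n_u, "fraction_once": once,
            "F_1": f1, "log2_N_u": log2(n_u)}


# Example: one random text of 100 characters drawn from a 20-character lexicon.
example = summarize(random_text(20, 100, random.Random(0)))
```

Because the observed frequencies in a small sample are never exactly equal, the estimated $F_1$ falls at or slightly below $\log_2 N_u$.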

(m) Pictish symbols
These were split into Class I and Class II symbols. The symbols were read as observed from top to bottom, left to right, using Mack's symbol set and the symbol set given in Early Christian Monuments of Scotland (Allen & Anderson 1903;Mack 1997). The symbol data were taken only from complete stones, which form the majority of the stones.

(n) Statistical analysis
The 99.9 per cent confidence ellipse for prediction was calculated from the random character data assuming a bi-variate normal distribution for $F_1$ and $\log_2 N_u$. (Histograms and normal probability plots of the marginal distributions show no obvious departure from normality.) The confidence ellipse is centred on the mean of the random data (Mardia et al. 1979).
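A test of whether a point lies inside such an ellipse can be sketched as below. The sketch uses the large-sample chi-square approximation for a bivariate normal distribution; the exact scaling used for the prediction ellipse in this paper is not stated, so the approximation and the function name are assumptions.

```python
from math import log


def inside_ellipse(sample, point, level=0.999):
    """Return True if `point` lies within the `level` ellipse fitted to `sample`.

    `sample` is a list of (x, y) pairs, e.g. (log2 N_u, F_1) for random texts.
    """
    n = len(sample)
    mx = sum(x for x, _ in sample) / n
    my = sum(y for _, y in sample) / n
    # Unbiased sample variances and covariance.
    sxx = sum((x - mx) ** 2 for x, _ in sample) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in sample) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in sample) / (n - 1)
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mx, point[1] - my
    # Squared Mahalanobis distance via the inverse of the 2x2 covariance matrix.
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    # The chi-square quantile with 2 degrees of freedom is -2 ln(1 - level).
    return d2 <= -2 * log(1 - level)
```

A point for which this returns False at the 99.9 per cent level would be judged unlikely to have come from the same distribution as the random texts.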
Classification trees, constructed using the classification-tree methodology of Breiman et al. (1984), are non-parametric models to describe the variation in a response variable (the categorical character-class variable here) as a function of a number of explanatory variables (the continuous structure variables $U_r$ and $C_r$ here) for a sample of data in a two-step approach. Firstly, the sample is partitioned, by means of successive binary partitions, such that the subsets eventually formed are as homogeneous as possible with respect to the response variable, quantified using one of a number of criteria (the Gini diversity index here). To avoid over-fitting, subsets of the partition can then be recombined if the resulting loss of homogeneity is not large (assessed using a 10-fold cross-validation strategy) in the second stage, known as pruning.
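The partitioning step can be illustrated for a single explanatory variable. The sketch below computes the Gini diversity index of a set of class labels and searches for the binary partition value that minimizes the size-weighted index of the two resulting subsets. It is a simplified stand-in for the full methodology of Breiman et al. (1984), with illustrative function names, and it omits the pruning stage.

```python
from collections import Counter


def gini(labels):
    """Gini diversity index: 1 - sum_k p_k^2 over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(values, labels):
    """Return (weighted Gini, threshold) for the best binary partition of `values`."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (float("inf"), None)
    for k in range(1, n):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no threshold can separate two equal values
        left = [lab for _, lab in pairs[:k]]
        right = [lab for _, lab in pairs[k:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[0]:
            # Place the partition value midway between neighbouring observations.
            best = (score, (pairs[k - 1][0] + pairs[k][0]) / 2)
    return best
```

A weighted index of zero means that the partition separates the classes perfectly on that variable.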
Cross validation is a well-known data resampling method to estimate model predictive performance, and possibly thereby the optimal values of one or more tuning parameters (Stone 1974; Picard & Cook 1984); for example, the value of parameter $a$ in structure variable $C_r$, or the optimal extent of pruning. In cross validation, the sample set is partitioned into two or more subsets. One subset is typically withheld, while the remaining subsets are used to construct the model, in this case a classification tree. The withheld subset is then treated as an independent test set with which to estimate model performance, possibly as a function of one or more tuning parameters. The withheld subset is then reinstated, another subset withheld and the procedure repeated until each subset has been withheld exactly once. Overall model performance is then calculated by summing performance over all withheld subsets and the corresponding optimal value of the tuning parameter selected. There are many possible refinements to the cross-validation procedure. In general, it is necessary to explore the sensitivity of cross-validation-based inferences as a function of the parameters of the cross-validation strategy. We found that, for the current application, inferences were generally insensitive to these choices; for instance, any value between 6 and 9 of the parameter $a$ in structure variable $C_r$ gives a cross-validation performance of greater than 99 per cent. The performance of the classifier remains constant provided two decimal places of the partition values of the variables are retained, and thus the classifier estimates are optimal to two decimal places.
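The withhold-and-test loop described above can be sketched generically. The `fit` and `predict` arguments are hypothetical callables standing in for whatever model is being assessed (a classification tree in this paper), and the folds are formed without shuffling purely to keep the example deterministic.

```python
def k_fold_indices(n, k):
    """Partition the indices 0..n-1 into k folds of near-equal size."""
    return [list(range(i, n, k)) for i in range(k)]


def cross_validated_accuracy(items, labels, fit, predict, k=10):
    """Fraction of items classified correctly when each fold is withheld once."""
    correct = 0
    for fold in k_fold_indices(len(items), k):
        held_out = set(fold)
        # Build the model on everything except the withheld subset.
        train_x = [x for i, x in enumerate(items) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        model = fit(train_x, train_y)
        # Score the model on the withheld subset only.
        correct += sum(predict(model, items[i]) == labels[i] for i in fold)
    return correct / len(items)
```

A tuning parameter such as $a$ could then be chosen by repeating this calculation for each candidate value and keeping the one with the highest accuracy.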

Glossary
$T_u$: total number of characters (uni-grams) in a text. $T_u$ is the text size for that character type, thus a text of 200 words may have a letter text size of 900 letters and a syllable text size of 520 syllables.