Pictish symbols revealed as a written language through application of Shannon entropy

Rob Lee, Philip Jonathan, Pauline Ziman

Abstract

Many prehistoric societies have left a wealth of inscribed symbols for which the meanings are lost. For example, the Picts, a Scottish, Iron Age culture, left a few hundred stones expertly carved with highly stylized petroglyph symbols. Although the symbol scripts are assumed to convey information, owing to the short (one to three symbols), small (less than 1000 symbols) and often fragmented nature of many symbol sets, it has been impossible to conclude whether they represent forms of written language. This paper reports on a two-parameter decision-tree technique that distinguishes between the different character sets of human communication systems when sample sizes are small, thus enabling the type of communication expressed by these small symbol corpuses to be determined. Using the technique on the Pictish symbols established that it is unlikely that they are random or sematographic (heraldic) characters, but that they exhibit the characteristics of written languages.

1. Introduction

Among the durable artefacts left by prehistoric societies, there are many instances of enigmatic scripts. These scripts typically consist of very short sequences of regularly placed symbols (or single symbols) and range from the inscribed pottery of the Chinese Neolithic (Li et al. 2002), through the inscribed clay tablets and seals of the Indus Valley culture (Rao et al. 2009) to the inscribed stones of Late Iron Age Scotland (Wainwright et al. 1955; Mack 1997). A longstanding conundrum has been to determine whether any of the symbol sets might be an example of a written language. A number of problems have impeded progress in this area: the non-availability of reliable corpuses describing the specific symbols, a lack of agreement on the definition of individual symbol types, small corpus sizes ranging from a couple of hundred to a few thousand symbols, the often short nature of individual inscriptions (one to three symbols in length) and the lack of a technique to establish the level of communication of the symbols when sample sizes are small (Bouissac 1997). For known languages, statistical techniques such as phylogenetic methods have been used to aid in the reconstruction of ancient language histories (Warnow 1997; Foster & Toth 2003; Dunn et al. 2005) and to estimate the rates of linguistic evolution (Pagel et al. 2007; Atkinson et al. 2008). Recently, conditional entropies have been used to investigate the Indus script, but the conclusions were not definitive owing to the use of small, smoothed datasets, the comparative nature of the technique and its inability to differentiate between different lexigraphic systems (Rao et al. 2009).
This paper describes a technique that incorporates linguistic functions in order to quantify the level of communication in these small, ‘incomplete’ symbol datasets and thus differentiate between the different possible character types of writing (the term incomplete is used here to describe text samples that have insufficient data to properly characterize the character lexicons).

The Picts were an Iron Age society that existed in Scotland from ca AD 300–843 when the Dalriadic Scot, Kenneth, son of Alpin, took the Pictish Kingship. The Picts are recorded in the writings of their contemporaries—the Romans, the Anglo-Saxons and the Irish but, other than a copy of their King list, they left no written record of themselves (Wainwright et al. 1955; Anderson 1973). The Picts did, however, leave a range of finely carved stones inscribed with glyphs of unknown meaning, known as ‘Pictish Symbol Stones’. The Pictish Symbol Stones are categorized into two types as shown in figure 1: (i) Class I stones, numbering between 180 and 195, consist of undressed stones with the symbols inscribed onto the rock and (ii) Class II stones, numbering between 60 and 65 stones, contain the depiction of a cross, use dressed stones and relief carving for the symbols and may have other, often Christian, imagery. Class I stones are taken to be the earlier tradition of the two types of Symbol Stones. The stones contain between one and eight symbols, with the commonest syntax being one or two symbols. Over a century ago, Allen and Anderson visually catalogued the then known Pictish Symbol Stones and categorized their symbols (Allen & Anderson 1903). While no visual categorization catalogue of the possible different symbol types exists, the Pictish Symbol Stones have recently been completely categorized by Mack (1997), although he uses a smaller set of 43 symbol types than do earlier workers (Allen & Anderson 1903; Diack 1944; Forsyth 1997). Over the last century, a wide variety of ‘meanings’ for the symbols have been proposed, from pagan religious imagery to heraldic arms (Allen & Anderson 1903; Diack 1944; Wainwright et al. 1955; Mack 1997), but it is only recently that the question as to whether they might be a written language has been asked (Samson 1992; Forsyth 1997). 
However, in the absence of a suitable technique, the call for an analysis to establish whether the symbols were a script, and whether the stones might be memorial in character, remains unanswered (Samson 1992; Forsyth 1997).

Figure 1.

Pictish Symbol Stones. (a) Class I stone, ‘Grantown’, with two symbols—stag and rectangle. (b) ‘Aberlemno 2’, a Class II stone with two symbols—divided rectangle with a Z rod and triple disc, as well as other imagery (a battle, the cross is on the other face).

2. Theory

The problem that the Pictish symbols pose can be broken into a couple of questions. (i) Are they random in nature (admittedly unlikely since they appear to have been carved for a purpose)? (ii) If it is unlikely that they are random, then what type of communication do they convey: (a) semasiography, where information is communicated without reference to verbal language forms (such as heraldic characters that have no lexigraphic value in themselves but identify a person, position and place) or (b) lexigraphic scripts, where the characters embody the form of verbal language (e.g. logograms representing words and syllables (non-phonetically), syllabograms representing syllables (phonetically), alphabetic signs representing letters (parts of syllables) and code characters (e.g. Morse code) representing parts of letters) (Powell 2009)?

A fundamental characteristic of any communication system is that there is a degree of uncertainty (also known as entropy or information) over the particular character or message that may be transmitted (Shannon 1993a). A measure of the average uncertainty of character occurrence is the uni-gram (single character) entropy, F1 (Shannon 1993b). In a set of Nu different characters, the first-order entropy (F1) is given by

$$F_1 = -\sum_{i=1}^{N_u} p_i \log_2 p_i \qquad (2.1)$$

where pi is the relative frequency of occurrence of the ith character, calculated from the dataset. In a large dataset of random characters (i.e. sampled with equal probability from a finite lexicon), all uni-grams appear with the same frequency, so pi=1/Nu, thus $F_1 = \log_2 N_u$. However, small sample sets of random characters will deviate from this since the incompleteness of the sample available will lead to unequal relative frequencies being observed. Thus, in small sample sets of random characters, pi∼1/Nu when estimated from the sample. Figure 2 confirms that $F_1 \approx \log_2 N_u$ for 40 sets of random data of small sample size ranging from 15 to 1000 characters. Systems for which F1 is different from $\log_2 N_u$ (with respect to the confidence ellipse for prediction) can be identified as non-random and characteristic of writing.
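As an illustration of the uni-gram entropy defined above, the following minimal sketch computes F1 from relative frequencies and compares it with log2(Nu). The function names and the two toy texts are illustrative assumptions and are not taken from the paper.

```python
import math
from collections import Counter

def unigram_entropy(chars):
    """First-order entropy F1 (bits per character) from relative frequencies."""
    counts = Counter(chars)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def max_entropy(chars):
    """log2(Nu): the value F1 takes when all Nu observed characters are equally frequent."""
    return math.log2(len(set(chars)))

# Toy texts (made up for illustration only).
uniform = list("abcd" * 25)                              # equal frequencies
skewed = list("a" * 70 + "b" * 20 + "c" * 5 + "d" * 5)   # uneven frequencies

f1_uniform = unigram_entropy(uniform)
f1_skewed = unigram_entropy(skewed)
print(f1_uniform, max_entropy(uniform))
print(f1_skewed, max_entropy(skewed))
```

With equal frequencies F1 coincides with log2(Nu); an uneven distribution over the same four characters gives a lower value.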

Figure 2.

Plot of F1 (uni-gram entropy) versus $\log_2 N_u$ (where Nu is the number of different uni-grams) showing the 99.9% confidence ellipse for prediction of the random data. This figure tests whether the stones correspond to similar-sized samples from a finite alphabet of equal relative frequency of unigram occurrence. It is extremely unlikely that the observed values for the Pictish Stones would occur by chance were they indeed a random dataset. Open squares, random data; filled triangles, Pictish Symbol Stones; dotted line, upper 99.9% confidence ellipse for prediction; solid line, lower 99.9% confidence ellipse for prediction.

The simplest gauge of character-to-character information in written communications is the di-gram entropy, F2, the measurement of the average uncertainty of the next character when the preceding character is known. Shannon defined F2 as (Shannon 1993b)

$$F_2 = -\sum_{i=1}^{N_u} \sum_{j=1}^{N_u} p(b_i, j) \log_2 p(b_i, j) - F_1 \qquad (2.2)$$

in which bi is a uni-gram (single character), j is an arbitrary character following bi, p(bi, j) is the relative frequency of the di-gram (pair of characters) bi, j and F1 is the uni-gram entropy, where the summation is from 1 to Nu for a set of Nu uni-gram characters. F2 is at a maximum when all the possible di-grams appear with the same frequency (Yaglom & Yaglom 1983). Thus, as the ability to predict the next character increases, the di-gram entropy decreases.
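The di-gram entropy can be sketched in a few lines. The version below assumes, following the wording of the text, that F2 is the entropy of consecutive character pairs minus the uni-gram entropy F1; the helper names and toy sequences are illustrative only. Because F1 is estimated from all characters while the pairs omit the final character, very short or strictly periodic samples can give values marginally below zero.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (bits) of the relative frequencies in a Counter."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def digram_entropy(chars):
    """F2 = (entropy of consecutive pairs) - F1, in bits per character."""
    pairs = Counter(zip(chars, chars[1:]))
    return entropy(pairs) - entropy(Counter(chars))

# Toy sequences (made up): one fully predictable, one where each
# of the four possible pairs over {a, b} occurs exactly once.
predictable = list("ab" * 50)
varied = list("aabba")

f2_predictable = digram_entropy(predictable)
f2_varied = digram_entropy(varied)
print(f2_predictable, f2_varied)
```

The predictable sequence gives F2 close to zero, while the varied sequence gives roughly one bit, in line with the statement that F2 falls as the next character becomes easier to predict.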

 Figure 3 shows the di-gram entropy for over 400 datasets of scripts containing small samples of characters. Each dataset contains between 30 and 10 000 word equivalents for a wide variety of character types and scripts. The scripts analysed cover sematograms (Heraldic characters), logograms (Chinese), syllabaries (Linear B and Egyptian hieroglyphs), alphabetic systems (analysed at letter, syllable and word level) of different modern languages (English, Irish, Welsh, Norse, Turkish, Basque, Finnish, Korean) and ancient languages (Latin, Anglo-Saxon, Old Norse, Ancient Irish, Old Irish, Old Welsh). The texts cover prose, poetry, monumental inscriptions and genealogical lists (King lists, marriage and birth lists). Full details are given in §5. Unfortunately, figure 3 shows that, for systems containing only small samples of characters, the Shannon di-gram entropy as a function of text size (as given by the total number of uni-grams, Tu) cannot be used to differentiate between the different character types or even between the types of writing (semasiography or lexigraphic). The reason for this failure to differentiate between character type is that, at these small sample sizes, there are insufficient data to properly characterize the character lexicon, which affects the observed N-gram distributions and hence entropy. In this paper, the term incomplete is used to describe text samples that are insufficiently representative to characterize the underlying character lexicons. For a text of a given size, the degree to which the N-gram lexicon is incomplete will be strongly affected by a number of linguistic phenomena, including

  • type of character used to code for the communication,

  • size of the character lexicon used (e.g. texts with constrained (or limited) vocabularies pull from a pool of available words that is limited to a fraction of a normal vocabulary),

  • grammar of the unknown language (e.g. the system of inflection within the language),

  • syntax of the unknown language (i.e. the word order), and

  • degree of standardized spelling (i.e. many inscriptions do not use standardized spelling).

Figure 3.

Plot of F2 (di-gram entropy) versus Tu (text size based on the total number of uni-gram characters) for a wide range of texts and character types. The di-gram entropy is similar for different types of characters in datasets with small sample size owing to the incomplete nature of the di-gram lexicons. Dashes, sematograms—heraldry; filled diamonds, letters—prose, poetry and inscriptions; grey filled triangles, syllables—prose, poetry, inscriptions; open squares, words—genealogical lists; crosses, code characters; open diamonds, letters—genealogical lists; filled squares, words—prose, poetry and inscriptions.

At very large datasets (100+ million words), these phenomena reduce the prediction ability of N-gram-based statistical language models used in such areas as speech and optical character recognition, document classification and machine translation (Rosenfeld 2000). As a consequence, linguistic-derived functions are used to make up some of the predictive deficiency (Rosenfeld 2000). Unfortunately, these functions are not appropriate for unknown systems with small sample size datasets and very incomplete character lexicons. In order to be able to compare the di-gram entropies of such datasets, a measure of the degree of ‘incompleteness’ or ‘completeness’ of the di-gram lexicons is needed. This paper proposes that a measure of the completeness of the di-gram lexicon (or its lack of incompleteness) can be derived from the number of different uni-grams and di-grams in the dataset.

For a text with a given number of different uni-grams, Nu, the number of different di-grams, Nd, will depend upon the incompleteness of the di-gram lexicon (which in turn is dependent upon the linguistic phenomena outlined above). Depending upon the degree of di-gram lexicon completeness, Nd will range between Nu (very incomplete) and (Nu)2 (complete, but note that this is the theoretical maximum: actual lexicons will, in practice, always be less than (Nu)2 since the rules of syntax and spelling will only allow a value less than (Nu)2). Thus, a measure of the completeness in the di-gram lexicon for small samples can be obtained by calculating (Nd/Nu), where Nd is the number of different di-grams and Nu is the number of different uni-grams.
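The completeness measure Nd/Nu is straightforward to compute from a character sequence. The sketch below is illustrative only (the function name and toy strings are assumptions): it counts distinct uni-grams and distinct consecutive pairs and returns their ratio.

```python
def digram_completeness(chars):
    """Return Nd/Nu: distinct consecutive pairs divided by distinct characters."""
    nu = len(set(chars))
    nd = len(set(zip(chars, chars[1:])))
    return nd / nu

# Toy sequences over the three characters a, b, c (made up for illustration).
cyclic = list("abcabcabc")        # only ab, bc, ca occur: Nd = Nu
exhaustive = list("aabacbbcca")   # all nine possible pairs occur: Nd = Nu squared

print(digram_completeness(cyclic))
print(digram_completeness(exhaustive))
```

The first sequence sits at the very incomplete end of the range (ratio 1), the second at the theoretical maximum (ratio Nu).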

Figure 4 shows that the di-gram entropy is dependent upon this measure of the degree of completeness in the di-gram lexicon and shows differentiation between three types of lexigraphic character types (words, syllables and letters). Thus, this paper proposes normalizing F2 by $\log_2(N_d/N_u)$ in order to define, for any text, a second-order function (Ur) adjusted for the di-gram lexicon completeness in a small sample size,

$$U_r = \frac{F_2}{\log_2(N_d/N_u)} \qquad (2.3)$$

As figure 4 shows, systems written in sematograms (heraldry) and lexigraphic code characters can have similar di-gram entropies to those of standard lexigraphic characters. However, while words, syllables and letters generally correspond to a fixed unit of language, sematograms and code characters have no fixed lexigraphic value in themselves, but are combined to produce a lexigraphic character. Thus, heraldic and code characters are, by their nature, intrinsically more repetitive than standard lexigraphic characters. For example, Morse code uses two characters in combination (with a space) and thus the four di-grams (dash–dash, dash–dot, dot–dash and dot–dot) are used repeatedly in order to build the 26 letters of the alphabet. By definition, repetitive di-grams are ones that appear more than once, thus a measure based on the degree of di-gram repetition in a text can be given by (Sd/Td), where Td is the total number of di-grams and Sd is the number of di-grams that appear only once in the text. For texts with a high degree of di-gram repetition, Sd approaches 0. The degree of di-gram repetition will have some dependency upon the degree of completeness in the di-gram lexicon, since texts with a relatively ‘complete’ di-gram lexicon will be more likely to have greater repetition of the di-grams and a lower value of Sd. Figure 5 shows that the degree of di-gram repetition (Sd/Td) is dependent upon the degree of completeness in the di-gram lexicon (Nd/Nu) for texts using standard lexigraphic characters. Figure 5 also shows that heraldic and code characters do not follow the same dependency as standard lexigraphic characters. Thus, this paper proposes a di-gram repetition factor, Cr, defined as a linear combination of the two quantities

$$C_r = \frac{N_d}{N_u} + a\,\frac{S_d}{T_d} \qquad (2.4)$$

where a is a constant estimated using cross-validation techniques in order to maximize the performance of a decision tree.
Thus, the structure variables, Ur and Cr, are combinations of underlying linguistic variables that elucidate the key characteristics and structure of the data. Both Ur and Cr can be calculated for any type of communication system without any prior knowledge of the meaning of a system and have been used in a two-parameter decision tree to classify the following character types: (i) words, (ii) syllables, (iii) letters, and (iv) other characters such as heraldic sematograms and lexigraphic code characters.
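A minimal sketch of how the two structure variables might be computed from a character sequence is given below. It rests on two assumptions that should be checked against the published equations: that the quantity normalizing F2 in Ur is log2(Nd/Nu), and that F2 is the pair entropy minus F1. The default a = 7 is the value quoted in the Results (Cr=Nd/Nu+7(Sd/Td)); the function names and toy sequences are hypothetical.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (bits) of the relative frequencies in a Counter."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def structure_variables(chars, a=7.0):
    """Return (Ur, Cr) for a sequence of characters.

    Ur is None when Nd/Nu <= 1, where the assumed normalizer log2(Nd/Nu)
    is zero or negative and the ratio is not meaningful.
    """
    unigrams = Counter(chars)
    digrams = Counter(zip(chars, chars[1:]))
    nu, nd = len(unigrams), len(digrams)
    td = sum(digrams.values())                        # total di-grams
    sd = sum(1 for c in digrams.values() if c == 1)   # di-grams seen once
    f2 = entropy(digrams) - entropy(unigrams)
    completeness = nd / nu
    ur = f2 / math.log2(completeness) if completeness > 1 else None
    cr = completeness + a * (sd / td)
    return ur, cr

# Toy sequences (made up for illustration only).
ur_varied, cr_varied = structure_variables(list("aabacbbcca"))
ur_cyclic, cr_cyclic = structure_variables(list("ab" * 50))
print(ur_varied, cr_varied)
print(ur_cyclic, cr_cyclic)
```

In the strictly alternating sequence no di-gram appears only once, so Sd/Td is zero and Cr collapses to Nd/Nu, which is the behaviour the repetition factor is meant to capture.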

Figure 4.

Plot of F2 (di-gram entropy) versus Nd/Nu (degree of di-gram lexicon completeness) using a log-linear scale. The di-gram entropy for different types of characters is dependent upon the level of completeness of the di-gram lexicon. Dashes, sematograms—heraldry; filled diamonds, letters—prose, poetry and inscriptions; grey filled triangles, syllables—prose, poetry, inscriptions; open squares, words—genealogical lists; crosses, code characters; open diamonds, letters—genealogical lists; filled squares, words—prose, poetry and inscriptions.

Figure 5.

Plot of Sd/Td (degree of di-gram repetition) versus Nd/Nu (degree of di-gram lexicon completeness). The degree of di-gram repetition is also dependent upon the level of completeness of the di-gram lexicon, and this dependency is different for standard lexigraphic characters compared with heraldic sematogram characters. Dashes, sematograms—heraldry; filled diamonds, letters—prose, poetry and inscriptions; grey filled triangles, syllables—prose, poetry, inscriptions; open squares, words—genealogical lists; crosses, code characters; open diamonds, letters—genealogical lists; filled squares, words—prose, poetry and inscriptions.

3. Results

 Figure 2 shows the 99.9 per cent confidence ellipse for prediction around 40 sets of random data. The datasets plotted in figure 2 were generated as follows: characters were sampled from a uniform distribution (i.e. with equal relative frequencies) into small units of text similar to the small units of glyphs seen on the stones. The key properties (total number of unigrams, number of different unigrams and the subsequent fraction of unigrams appearing only once) bracketed the corresponding properties observed in the stones. Figure 2 therefore tests whether the stones correspond to similar-sized samples from a finite alphabet of equal relative frequency of unigram occurrence. Texts based on written communication have an uneven distribution of characters that generally results in a lower F1 for any value of Nu when compared with random sets. Figure 2 shows that the observed uni-gram entropy values for the Pictish symbols fall outside the 99.9 per cent confidence ellipse for prediction surrounding the random uni-gram dataset. Hence, it is extremely unlikely that the observed values for the Pictish stones would occur by chance were they indeed a random dataset.
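The randomness test can be approximated as follows. This is a sketch under stated assumptions, not the authors' procedure: random texts are drawn uniformly with lexicon sizes of 2 to 100 and lengths of 15 to 1000 (the ranges given in the methods), each text is reduced to the point (log2(Nu), F1), and a test point is judged against a bivariate-normal ellipse using the chi-square quantile with two degrees of freedom. The exact scaling of the paper's "confidence ellipse for prediction" may differ from this large-sample approximation, and all function names are hypothetical.

```python
import math
import random
from collections import Counter

def f1(chars):
    """Uni-gram entropy (bits) of a character sequence."""
    n = len(chars)
    return -sum((c / n) * math.log2(c / n) for c in Counter(chars).values())

def random_points(n_sets=40, seed=0):
    """(log2(Nu), F1) for uniformly sampled random texts."""
    rng = random.Random(seed)
    points = []
    for _ in range(n_sets):
        lexicon = range(rng.randint(2, 100))
        text = [rng.choice(lexicon) for _ in range(rng.randint(15, 1000))]
        points.append((math.log2(len(set(text))), f1(text)))
    return points

def mahalanobis_sq(point, points):
    """Squared Mahalanobis distance of point from the sample mean of points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mx, point[1] - my
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

# 99.9% quantile of the chi-square distribution with 2 degrees of freedom,
# which has the closed form -2 ln(1 - p).
THRESHOLD = -2.0 * math.log(1.0 - 0.999)

def inside_ellipse(point, points):
    return mahalanobis_sq(point, points) <= THRESHOLD

pts = random_points()
centre = (sum(x for x, _ in pts) / len(pts), sum(y for _, y in pts) / len(pts))
low_entropy_point = (math.log2(40), 1.0)   # made-up: 40 symbols, very uneven use
print(inside_ellipse(centre, pts), inside_ellipse(low_entropy_point, pts))
```

A dataset whose F1 lies well below log2(Nu), like the made-up point above, falls outside the ellipse and would be treated as non-random.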

The structure variables Ur and Cr have been used in a decision-tree analysis to differentiate the majority of the character types found in written communication. Figure 6 shows a decision tree, cross validated using the Gini diversity index, with a successful allocation rate of 99.1 per cent. (The performance of the classifier remains constant provided two decimal places of the partition values of the variables are retained, thus the classifier estimates are optimal to two decimal places. An estimate of the value of the parameter a in equation (2.4) was obtained by cross validation. Full details of the validation are given in §5.) Texts with Cr<4.89, where Cr=Nd/Nu+7(Sd/Td), are repetitive in nature and are consistent with Heraldic character systems (sematograms) and code-character systems, both of which are characterized by highly repetitive character sequences. Unfortunately, some repetitive lexigraphic texts also fall in this group and so if a text has a Cr<4.89, we cannot determine what character type is present using this tree. If, however, the texts have a Cr≥4.89, then we classify the character types as lexigraphic and, depending upon the value of Ur, determine whether the characters represent words, syllables or letters. It is generally easier to predict the next letter than the next word because of: (i) the spelling rules (e.g. q is usually followed by u in English) and (ii) the constrained nature of the letter lexicon compared with word lexicon (26 letters in English versus word vocabulary of hundreds for even the most constrained texts). This means that for a given value of (Nd/Nu), we should expect F2 for words to be larger than letters and thus Ur to be larger, and figures 3 and 6 show this to be the case. The separation of the lexigraphic character types is independent of language or sign type (i.e. alphabet, syllabogram and logogram scripts).
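The decision rule described above can be written as a small function. The Cr partition value of 4.89 comes from the text; the two Ur partition values are given only in figure 6, which is not reproduced here, so they are left as required arguments rather than guessed. The function name and the numbers in the usage lines are placeholders for illustration, not values from the paper.

```python
def classify(ur, cr, letter_syllable_cut, syllable_word_cut, cr_cut=4.89):
    """Two-parameter rule: repetition test on Cr first, then character type by Ur.

    letter_syllable_cut and syllable_word_cut are the Ur partition values
    (to be read from figure 6); they are deliberately not defaulted here.
    """
    if cr < cr_cut:
        # Heraldic, code or repetitive lexigraphic text: type not determinable.
        return "repetitive (undetermined)"
    if ur < letter_syllable_cut:
        return "letters"
    if ur < syllable_word_cut:
        return "syllables"
    return "words"

# Placeholder cut points, chosen only to exercise each branch.
print(classify(ur=1.2, cr=3.0, letter_syllable_cut=1.0, syllable_word_cut=1.5))
print(classify(ur=0.8, cr=6.0, letter_syllable_cut=1.0, syllable_word_cut=1.5))
print(classify(ur=1.2, cr=6.0, letter_syllable_cut=1.0, syllable_word_cut=1.5))
print(classify(ur=1.8, cr=6.0, letter_syllable_cut=1.0, syllable_word_cut=1.5))
```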

Figure 6.

Two-parameter decision tree that separates repetitive text from non-repetitive text. This figure classifies the character types found in non-repetitive text into the three main lexigraphic character units (words, syllables, letters). Repetitive text consists of two main categories of characters: non-lexigraphic heraldic characters and lexigraphic code characters, as well as non-concordant letter, syllable and word character texts that are repetitive.

As a character vocabulary is constrained, it becomes easier to predict the next character, decreasing F2 and Ur. The effect of constraining the character vocabulary upon the distribution of Ur is shown in figure 7. Within normal texts, there is a wide variety of vocabulary constraints. Constraining the character vocabulary (e.g. King lists and genealogical lists that are constrained to a vocabulary of names or genealogical lists using an even smaller vocabulary of familiar, diminutive names) gives a narrower distribution and a decreasing mean value of Ur.

Figure 7.

The effect on the empirical cumulative distributions of Ur ($F_2/\log_2(N_d/N_u)$) of increasing the character vocabulary constraint for letters. As the vocabulary becomes constrained, the distribution of Ur becomes narrower and the mean value decreases. Short-dashed line, empirical cumulative distribution for letter characters for all prose, poetry and inscriptions; long-dashed line, empirical cumulative distribution for letter characters for constrained genealogical lists; solid line, empirical cumulative distribution for letter characters from very constrained lists.

Table 1.

Values of Cr and Ur calculated for the Class I and Class II Pictish symbol stones using the symbol types given by Mack (1997) and by Allen & Anderson (1903).

The tree classifier developed suggests that the Pictish symbols are lexigraphic in nature because they have values of Cr in the interval [5.6, 6.2] (table 1). In particular, we infer that the Pictish symbols are not drawn from a distribution of heraldic characters. Table 1 shows that Mack’s symbol categorization gives values of Ur that fall in the syllable side of the syllable/word boundary. However, Mack’s categorization of the symbol types is much narrower than that of other workers (Allen & Anderson 1903; Diack 1944; Forsyth 1997). If Mack’s categorization is incorrect, then this will have the effect of artificially constraining the symbol lexicon, lowering F2 and Ur. The larger symbol categorization proposed by Allen and Anderson in Early Christian Monuments of Scotland implies that the Pictish symbols are very constrained words, similar in constraint to the genealogical name lists. Thus, it is likely that the symbols are actually words, but that Mack’s categorization has lowered the symbol di-gram entropy such that the data fall in the syllable band.

4. Discussion

Since there are many complete stones inscribed only with a single symbol, it seems unlikely (although not impossible) that the symbols are single syllables. In order to answer the question of whether the symbols are words or syllables, and thus define a system from which a decipherment can be initiated, a complete visual catalogue of the stones and the symbols will need to be created and the effect of widening the symbol set investigated. However, demonstrating that the Pictish symbols are writing, with the symbols probably corresponding to words, opens a unique line of further research for historians and linguists investigating the Picts and how they viewed themselves.

Having shown that it is possible to use an entropic technique to investigate the degree of communication in very small and incomplete written systems, it may be possible to extend this to other areas with similar problems. For example, animal language studies using Shannon entropies are often hampered by small sample datasets (McCowan et al. 1999). By building a similar set of data for spoken or verbal human communication, it should be possible to make similar comparisons of the level of information communicated by animal languages.

5. Material and methods

(a) Entropy calculations

For all texts, a ‘start/end’ character was inserted at commas or full stops, otherwise all punctuation was removed and all spaces ignored (since many old inscriptions have little or no punctuation). F0, F1 and F2 were calculated at the character levels appropriate for the text, e.g. alphabetic texts were mainly analysed at the letter and word level, syllabogram texts at the syllable and word level.
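The letter-level preparation described above might look like the following sketch. Several details are assumptions not stated in the text: the marker symbol `#`, lower-casing, dropping digits, bracketing the whole text with markers and collapsing adjacent markers into one.

```python
BOUNDARY = "#"  # hypothetical symbol standing for the 'start/end' character

def letter_tokens(text):
    """Convert raw text to a letter sequence with start/end markers.

    Commas and full stops become a boundary marker; spaces and all other
    punctuation are dropped.
    """
    tokens = [BOUNDARY]
    for ch in text.lower():
        if ch in ",.":
            if tokens[-1] != BOUNDARY:   # avoid doubled markers
                tokens.append(BOUNDARY)
        elif ch.isalpha():
            tokens.append(ch)
    if tokens[-1] != BOUNDARY:
        tokens.append(BOUNDARY)
    return tokens

sample = letter_tokens("In memory of Ann, wife of Tom.")
print("".join(sample))
```

The resulting sequence can then be passed to the entropy calculations at the letter level; word-level or syllable-level analysis would need a different tokenizer.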

(b) English texts

Prose fiction texts were written under varying degrees of word constraint (normal texts have ca 4.3 letters/word, lightly constrained texts have between 3.6 and 4.0 letters/word and highly constrained texts have 2.5–3.0 letters/word). Graveyard texts from Kelsall Church of England graveyard were used. Texts ranged in size from 35 to 10 000 words and were analysed at the letter, syllable and word level.

(c) Chinese texts

Prose and poetry texts from Yu Xuan Ji, Hong Lou Meng and Shijing texts were analysed (Lung 2009). Texts ranged in size from 50 to 3000 word characters and were analysed at the word level.

(d) Universal Declaration of Human Rights text

Languages analysed were English, Irish, Welsh, Norse, Turkish, Basque, Finnish and Korean at the word and letter level (UDHR 2008).

(e) Ancient inscriptions from the British Isles

Languages analysed were Latin, Anglo-Saxon, Old Norse, Ancient Irish, Old Irish and Old Welsh. Only whole words from translatable inscriptions were included. Each inscription was bracketed with a start/end character or a ‘missing’ character for incomplete inscriptions. All punctuation (if present) was removed. Ligated letters were separated into their constituent letters. Alphabet-specific characters were retained. Each corpus of specific inscription types was run as a single set. Irish inscriptions were split into an early tradition (ogam) and later tradition (uncial) with two different authors being used for the early tradition (Macalister 1945, 1949; McManus 1991). Welsh inscriptions were split into an early tradition (Class I) and later tradition (Class II and III) (Nash-Williams 1950). Roman memorials were split into two groups, those found at Hadrian’s Wall and the rest (Collingwood & Wright 1965). Inscribed stones, slabs, crosses and personal items from the Anglo-Saxon period were used (Okasha 1971). Isle of Man Norse runic inscriptions (Page 1995) and Southern Scottish inscriptions (Thomas 1991–1992) were used. The text sizes ranged from 50 to 2200 words and were analysed at the letter, syllable and word level.

(f) Egyptian monumental texts

These were transcribed in two ways: using the standard modern spelling (which removes superfluous hieroglyphs and applies a standard spelling) and an ‘as observed’ reading of the hieroglyphs (Zauzich 2004). The Egyptian hieroglyphs in these texts are primarily syllabic in nature, being predominantly a mix of single and biliteral glyphs. Text size was 250 words analysed at the word and syllable level.

(g) Mycenaean lists (Linear B)

These were split into two groups: military lists and others (Palmer 1998). Text size was 450–600 words analysed at the word and syllable level.

(h) King lists

These contain only the names of child and parent(s) for the Pictish, Anglo-Saxon, Scottish, English, Cashel and Munster lineages (Anderson 1973; Byrne 1973; Montague-Smith 1992). Text size was 60–175 words analysed at the word and letter level.

(i) Genealogical lists

English baronial genealogies containing: (i) Christian names of the child and parent(s), (ii) Christian names of bride and groom, and (iii) surnames of bride and groom were used (Sanders 1960). A second set of lists was created using familiar, diminutive names instead (e.g. ‘Al for Alan, Alfred and Albert’). Text size was 250–1500 words analysed at the word and letter level.

(j) Sematogram heraldic

A normal distribution of arms from the Heraldic Arms of British Extinct peerages (1086–1400) was used (Burke 1962). The charges (symbols) on the shield were used as characters for analysis. The colour of the charge was also used for analysis. A simplified set of characters was also generated using only the base symbols, e.g. (i) all the different lion charges such as rampant or passant are classified as a ‘lion’ character and (ii) all different cross charges such as bourdonny and fleuretty are classified as ‘cross’ in the base-symbol categorization. Each arms was read as observed symbols from bottom to top. Text size was 400–1200 symbols.

(k) Subletter coded systems

A range of English texts was transposed using Morse code and a three-character code for the letters. Text size was 400–75 000 characters.

(l) Random

Randomly generated character texts, ranging from sets of two to 100 different characters, were used with text sizes of 15–1000 characters. The texts bracketed the values observed in the stones for the total number of uni-grams (Tu), the number of different uni-grams (Nu) and the fraction of uni-grams appearing only once.

(m) Pictish symbols

These were split into Class I and Class II symbols. The symbols were read as observed from top to bottom, left to right, using Mack’s symbol set and the symbol set given in Early Christian Monuments of Scotland (Allen & Anderson 1903; Mack 1997). The symbol data were taken only from complete stones, which form the majority of the stones.

(n) Statistical analysis

The 99.9 per cent confidence ellipse for prediction was calculated from the random character data assuming a bi-variate normal distribution for F1 and $\log_2 N_u$. (Histograms and normal probability plots of the marginal distributions show no obvious departure from normality.) The confidence ellipse is centred on the mean of the random data (Mardia et al. 1979).

Classification trees, constructed using the classification-tree methodology of Breiman et al. (1984), are non-parametric models to describe the variation in a response variable (the categorical character-class variable here) as a function of a number of explanatory variables (the continuous-structure variables Ur and Cr here) for a sample of data in a two-step approach. Firstly, the sample is partitioned, by means of successive binary partitions, such that the subsets eventually formed are as homogeneous as possible with respect to the response variable, quantified using one of a number of criteria (the Gini diversity index here). To avoid over-fitting, subsets of the partition can then be recombined if the resulting loss of homogeneity is not large (assessed using a 10-fold cross-validation strategy) in the second stage, known as pruning.
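The partitioning step can be illustrated with a single-variable example. The sketch below computes the Gini diversity index of a set of labels and searches for the binary partition of one explanatory variable that minimizes the weighted index of the two subsets. It is a simplified illustration only: it handles one variable and one split, omits pruning, and uses made-up values and labels rather than data from the paper.

```python
from collections import Counter

def gini(labels):
    """Gini diversity index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted Gini) of the best partition value < threshold.

    Candidate thresholds are midpoints between consecutive distinct values.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for k in range(1, n):
        lo, hi = pairs[k - 1][0], pairs[k][0]
        if lo == hi:
            continue
        left = [lab for _, lab in pairs[:k]]
        right = [lab for _, lab in pairs[k:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = ((lo + hi) / 2, score)
    return best

# Made-up Ur-like values with two classes, for illustration only.
values = [0.9, 1.0, 1.05, 1.4, 1.5, 1.6]
labels = ["letters", "letters", "letters", "words", "words", "words"]
threshold, score = best_split(values, labels)
print(threshold, score)
```

A full classification tree repeats this search over every explanatory variable within each subset and then prunes the result, as described in the paragraph above.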

Cross validation is a well-known data resampling method to estimate model predictive performance, and possibly thereby the optimal values of one or more tuning parameters (Stone 1974; Picard & Cook 1984); for example, the value of parameter a in structure variable Cr, or the optimal extent of pruning. In cross validation, the sample set is partitioned into two or more subsets. One subset is typically withheld, while the remaining subsets are used to construct the model, in this case a classification tree. The withheld subset is then treated as an independent test set with which to estimate model performance, possibly as a function of one or more tuning parameters. The withheld subset is then reinstated, another subset withheld and the procedure repeated until each subset has been withheld exactly once. Overall model performance is then calculated by summing performance over all withheld subsets and the corresponding optimal value of tuning parameter selected. There are many possible refinements to the cross-validation procedure. In general, it is necessary to explore the sensitivity of cross-validation-based inferences as a function of the parameters of the cross-validation strategy. We found that, for the current application, inferences were generally insensitive to these choices—for instance, any value between 6 and 9 of the parameter a in structure variable Cr gives a cross-validation performance of greater than 99 per cent. The performance of the classifier remains constant provided two decimal places of the partition values of the variables are retained, and thus the classifier estimates are optimal to two decimal places.
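The withhold-and-test cycle described above can be sketched for the tuning parameter a. In this illustration a one-threshold rule on Cr stands in for the full classification tree (a deliberate simplification), the ten rows of (Nd/Nu, Sd/Td, label) are made up, and the helper names are hypothetical.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle row indices and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_stump(xs, ys):
    """Threshold t maximizing training accuracy of: x < t means 'repetitive'."""
    vals = sorted(set(xs))
    candidates = [(lo + hi) / 2 for lo, hi in zip(vals, vals[1:])]
    best_t, best_correct = vals[0], -1
    for t in candidates:
        correct = sum((x < t) == (y == "repetitive") for x, y in zip(xs, ys))
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t

def cv_accuracy(rows, a, k=5):
    """Fraction of withheld rows classified correctly for a given value of a."""
    correct = 0
    for held in k_fold_indices(len(rows), k):
        held_set = set(held)
        train = [r for i, r in enumerate(rows) if i not in held_set]
        t = fit_stump([c + a * s for c, s, _ in train],
                      [lab for _, _, lab in train])
        for i in held:
            c, s, lab = rows[i]
            correct += ((c + a * s) < t) == (lab == "repetitive")
    return correct / len(rows)

# Made-up rows: (Nd/Nu, Sd/Td, label).
rows = [
    (1.2, 0.05, "repetitive"), (1.5, 0.10, "repetitive"),
    (1.1, 0.02, "repetitive"), (1.8, 0.08, "repetitive"),
    (1.4, 0.12, "repetitive"), (2.5, 0.45, "lexigraphic"),
    (3.0, 0.50, "lexigraphic"), (1.6, 0.60, "lexigraphic"),
    (2.8, 0.40, "lexigraphic"), (1.7, 0.55, "lexigraphic"),
]
for a in (0.0, 7.0):
    print(a, cv_accuracy(rows, a))
```

In this toy set the classes overlap on Nd/Nu alone (a = 0) but separate once the repetition term is weighted in, which is the kind of comparison used to select a.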

Glossary

Tu: total number of characters (uni-grams) in a text. Tu is the text size for that character type, thus a text of 200 words may have a letter text size of 900 letters and a syllable text size of 520 syllables.

Nu: the number of different characters (uni-grams) in a text. Thus, a 200 word text might have 25 different letters, 100 different syllables and 130 different words.

Td: total number of character pairs (di-grams) in a text. Td is the character pair text size for that character type, thus a text of 200 words may have 199 word pairs and have a letter-pair text size of 899 letter pairs and a syllable-pair text size of 519 syllable pairs.

Nd: the number of different character pairs (di-grams) in a text. Thus, a 200 word text might have 270 different letter pairs, 390 different syllable pairs and 190 different word pairs.

Sd: the number of different character pairs that appear only once in a text.

Acknowledgements

We thank Nigel Tait, Clive McDonald, Richard Price and John Love for critical discussions and reading of the manuscript; Nigel Tait for technical help with the coding of the macros; and the referees for their help in improving the paper.

Footnotes

    • Received January 27, 2010.
    • Accepted February 26, 2010.

This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
