## Abstract

Numbers and numerical vectors account for a large portion of data. However, the amount of string data generated has increased dramatically in recent years. Consequently, classifying string data is a common problem in many fields. The most widely used approach to this problem is to convert strings into numerical vectors using string kernels and subsequently apply a support vector machine that works in a numerical vector space. However, this non-one-to-one conversion involves a loss of information and makes it impossible to evaluate, using probability theory, the generalization error of a learning machine, considering that the given data to train and test the machine are strings generated according to probability laws. In this study, we approach this classification problem by constructing a classifier that works in a set of strings. To evaluate the generalization error of such a classifier theoretically, probability theory for strings is required. Therefore, we first extend a limit theorem for a consensus sequence of strings demonstrated by one of the authors and co-workers in a previous study. Using the obtained result, we then demonstrate that our learning machine classifies strings in an asymptotically optimal manner. Furthermore, we demonstrate the usefulness of our machine in practical data analysis by applying it to predicting protein–protein interactions using amino acid sequences and classifying RNAs by their secondary structures using nucleotide sequences.

## 1. Introduction

Mathematicians have conducted detailed examinations of a large number of objects, such as numbers, manifolds, equations, functions and operators, throughout the long history of mathematics, but they have not studied strings in detail. A string is an object that computer scientists have addressed in depth. Stringology, a field of computer science, has thoroughly investigated algorithms and data structures for string processing [1,2]. However, computer scientists have not studied strings using a mathematical approach; for example, functions, operators and probabilities on a set of strings provided with topological and algebraic structures have not been investigated.

Numbers and numerical vectors account for a large portion of data. However, in recent years, the amount of string data generated has increased dramatically. For example, large amounts of text data have been produced on the web. In the life sciences, large amounts of data regarding genes, RNAs and proteins have been generated. These data are nucleotide or amino acid sequences and can be represented as strings. Consequently, the notion of a random string, an object that generates strings according to a probability law, is necessary for string data analysis, much as random variables that randomly generate numbers and stochastic processes that randomly generate functions are essential in various fields. Statistical methods for numerical data were rigorously constructed based on probability theory. Similarly, the development and systematization of methods on the basis of probability theory on a set of strings will be required for text mining techniques and methods for analysing biological sequences.

Let *A** be a set of strings on an alphabet *A*={*a*_{1},…,*a*_{z}}. From the viewpoint of mathematics, *A** is a monoid with respect to concatenation (the operation of appending one string to the end of another) and a metric space with the Levenshtein distance (the minimal number of deletions, insertions or substitutions required to transform one string into another), and it forms a non-commutative topological monoid. Koyano and co-workers [3,4] were the first to construct a theory for treating strings that are randomly generated according to a probability law by developing a probability theory on the space *A** with the above-mentioned mathematical structures and to construct statistical methods for analysing string data using this probability theory. They used the developed statistical methods to address problems that involved biological sequence data. However, those researchers developed specific methods for estimating the *α* and *β* diversities of biological communities using gene sequences rather than general-purpose methods for statistical analysis of string data. In this study, we first extend a limit theorem demonstrated in [3] on the asymptotic behaviour of a consensus sequence of strings. This theorem is an analogue in *A** of the strong law of large numbers for a mean in a *p*-dimensional real vector space.

Classifying string data is a common problem in many fields, including computer science and the life sciences, because of recent significant increases in the prevalence of string data. The most widely used approach to this problem is to convert strings into numerical vectors using a string kernel and subsequently apply a support vector machine (SVM) [5–9] to the vectors. The earliest string kernels were developed by Haussler [10], Watkins [11] and Lodhi *et al*. [12]. These papers proposed that the similarity between strings should be defined based on the number of subsequences common to them. Leslie and Paaß and co-workers [13,14] used the spectrum kernel, a string kernel that quantifies the similarity between strings based on the number of common substrings, without considering common subsequences for which gaps are allowed. The spectrum kernel was subsequently extended by Leslie and Vishwanathan and co-workers [15–17]. In addition to these kernels, a number of novel string kernels have been developed and applied to problems in bioinformatics in recent studies [18–21]. The spectrum kernel has become the most widely used of these various string kernels, although this kernel discards considerable amounts of the information concerning the order of the letters that compose the strings.
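The spectrum kernel mentioned above can be sketched in a few lines. The following is a minimal illustration (the function name is our own) of a *k*-spectrum kernel as the dot product of *k*-mer count vectors; it also shows the loss of letter-order information, since strings with the same *k*-mer multiset are indistinguishable to the kernel.

```python
from collections import Counter

def spectrum_kernel(s: str, t: str, k: int = 2) -> int:
    """Dot product of the k-mer count vectors of s and t."""
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return sum(cs[w] * ct[w] for w in cs)

# "abab" contains the 2-mers {ab: 2, ba: 1}, so its self-similarity
# is 2*2 + 1*1 = 5.
print(spectrum_kernel("abab", "abab", k=2))  # 5
```

Mapping strings to such count vectors is not injective, which is the information loss discussed in the text.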

Converting strings into numerical vectors using a string kernel is not bijective and involves information loss. A more serious problem is that this conversion makes it impossible to evaluate, using probability theory, the generalization error of a learning machine, considering that the given data to train and test the machine are strings generated according to probability laws. Consequently, the performance of a learning machine has been evaluated based on whether the machine yields better results compared with other machines in a certain simulation experiment or in the application to a certain real dataset, and the fundamental evaluation of a learning machine by theoretically evaluating its generalization error has been abandoned. In this study, we develop a learning machine that classifies strings without converting them into numerical vectors by constructing a direct sum decomposition of *A** under the principle of margin maximization, as an SVM does in a *p*-dimensional real vector space.

## 2. Specification of the problem

In the following, we refer to a classifier that decomposes a space into two disjoint subsets by choosing a hyperplane under the principle of margin maximization as an SVM, although an SVM also has other characteristics, such as (i) learning on the dual of a vector space, (ii) extracting features from input vectors, and (iii) using kernel functions. We denote a set of strings on an alphabet *A*={*a*_{1},…,*a*_{z}} by *A**. The intrinsic operation and distance on *A** are concatenation (hereafter denoted by ⋅) and the Levenshtein distance (hereafter denoted by *d*_{L}), respectively. Therefore, we provide *A** with algebraic and topological structures using ⋅ and *d*_{L}. *A** forms a non-commutative topological monoid, but it does not form a vector space or field. Therefore, ‘a line’ cannot be defined in *A** in the usual manner using these two structures. However, this does not mean that a line cannot be defined in *A**. Thus, we consider the following two questions: (i) Can ‘a line’ be defined in *A** in some way? (ii) If so, can *A** be decomposed into two disjoint subsets by using ‘the line’? The answer to the first question is ‘Yes’, whereas the answer to the second question is ‘No’.

By considering a curve in a space to be a subset of the space that is obtained by repeating the operation of connecting a point in the space to one of its contiguous points, we can roughly define ‘a curve’ in *A**, for example, in the following manner: if *d*_{L}(*s*_{i},*s*_{i+1})=1, *i*=1,…,*n*−1 holds for *s*_{1},…,*s*_{n}∈*A**, we call {*s*_{1},…,*s*_{n}} ‘a curve’ in *A**. Furthermore, considering a segment between two points in a space to be the shortest curve that connects the two points, we can define ‘a segment’ in *A** as follows: we suppose that *d*_{L}(*s*,*s*′)=*n* for *s*,*s*′∈*A**. *s* can be transformed into *s*′ by performing one of three types of operation, insertion, deletion and substitution, *n* times. We denote a string obtained by performing the first *i* operations of the *n* operations on *s* by *s*_{(i)} for each *i*=1,…,*n*−1. For the uniqueness of ‘a segment’ that connects two given strings, we suppose that the order of priority is given among insertion, deletion and substitution and that a series of operations is performed on *s* in ascending order with respect to the letter number in *s* and according to the order of priority among the three operations. We call {*s*,*s*_{(1)},…,*s*_{(n−1)},*s*′} ‘a segment’ in *A** that connects *s* and *s*′.
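The Levenshtein distance underlying the curves and segments above can be computed by standard dynamic programming. The following sketch (our own implementation, using a rolling one-dimensional table) returns the minimal number of deletions, insertions and substitutions transforming one string into another; strings at distance 1 are exactly the ‘contiguous points’ used in the definition of a curve.

```python
def levenshtein(s: str, t: str) -> int:
    """Minimal number of deletions, insertions and substitutions
    transforming s into t (dynamic programming, O(len(s)*len(t)))."""
    n = len(t)
    d = list(range(n + 1))          # distances from "" to prefixes of t
    for i in range(1, len(s) + 1):
        prev, d[0] = d[0], i        # prev holds d[i-1][j-1]
        for j in range(1, n + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                       # deletion
                       d[j - 1] + 1,                   # insertion
                       prev + (s[i - 1] != t[j - 1]))  # substitution
            prev = cur
    return d[n]

print(levenshtein("kitten", "sitting"))  # 3
```

A ‘segment’ between *s* and *s*′ is then a chain of strings in which consecutive elements are at Levenshtein distance 1 from each other.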

Therefore, by choosing a sufficiently long string *s* and drawing a segment between *s* and the empty string (a string composed of zero letters), we consider the decomposition of a subset of *A** that is sufficiently large for applications, namely the subset composed of strings whose length is less than or equal to that of *s*, although not the entire space *A**. The alphabet *A*={*a*_{1},…,*a*_{z}} forms a metric space with the Hamming distance *d*_{H}(*a*_{i},*a*_{j})=0 (if *i*=*j*) or 1 (if *i*≠*j*). By comparison, the set of real numbers forms a metric space with the distance *d*(*x*,*y*)=|*x*−*y*| as well as a totally ordered set with respect to the usual less-than-or-equal relation ≤. The distance *d* and the total order ≤ are consistent in the sense that if *x*≤*y* and *y*≤*z*, then *d*(*x*,*y*)≤*d*(*x*,*z*) holds. By defining a total order that is consistent with the Hamming distance *d*_{H} in the sense mentioned above, can we make *A* form a totally ordered set without destroying its structure as a metric space? This task is impossible owing to the definition of *d*_{H}. Consequently, we have the following problem: whereas the upper and lower areas *H*^{+} and *H*^{−} of a hyperplane are defined using the total order ≤ on the real numbers, ‘a segment’ in *A** is defined in the above manner using the Levenshtein distance, but the concepts of upper and lower areas of a segment cannot make sense without destroying the structure of *A* as a metric space, because a total order that is consistent with the Hamming distance cannot be defined on *A*. Consequently, *A** cannot be divided by determining such a non-closed subset as a line, in contrast to a real vector space.

However, the above discussion does not indicate that *A** cannot be divided into two disjoint subsets in any manner. As the Jordan curve theorem [22] and the Jordan–Brouwer separation theorem [23] of topology state, the plane and a *p*-dimensional real vector space (*p*≥3) can be divided into two disjoint subsets by choosing a closed curve and a hypersphere, respectively, without determining a line or hyperplane. Can we decompose *A** into two disjoint subsets by using a method other than by drawing a line? We set *U*(*s*,*r*)={*t*∈*A**:*d*_{L}(*t*,*s*)≤*r*} for *s*∈*A** and a positive integer *r*, and consider decomposing *A** into *U*(*s*,*r*) and *U*(*s*,*r*)^{c}=*A**−*U*(*s*,*r*). In other words, we examine a method of drawing a sphere in *A** and subsequently decomposing *A** into its interior and exterior. In this manner, the decomposition does not require the concepts of upper and lower areas. In the following, we refer to ∂*U*(*s*,*r*)={*t*∈*A**:*d*_{L}(*t*,*s*)=*r*} as a discriminant sphere and the number of strings in *U*(*s*,*r*) as the size of ∂*U*(*s*,*r*). We set the convention that strings that lie on ∂*U*(*s*,*r*) are classified into the class of positive examples.
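The resulting decomposition of *A** yields a two-class rule directly on strings. A minimal sketch (function names are our own; the convention that boundary strings are positive follows the text):

```python
def lev(s, t):
    # Levenshtein distance via dynamic programming (rolling table).
    d = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        prev, d[0] = d[0], i
        for j, ct in enumerate(t, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (cs != ct))
    return d[-1]

def classify(u, centre, radius):
    """Decompose A* into U(centre, radius) and its complement.
    Strings on the boundary are classified as positive, per the
    convention stated in the text."""
    return +1 if lev(u, centre) <= radius else -1

print(classify("acgt", "acgg", 1))  # 1: inside or on the sphere
print(classify("tttt", "acgg", 1))  # -1: in the exterior
```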

## 3. Learning machine working in *A**

To decompose *A** in the manner described in §2, it is necessary to specify the centre *s*∈*A** and the radius *r* of a sphere *U*(*s*,*r*) given positive and negative examples. We say that the positive examples *X*_{m}={*s*_{1},…,*s*_{m}} and negative examples *Y*_{n}={*t*_{1},…,*t*_{n}} are spherically separable if there exists *s*_{0}∈*A** such that every positive example is strictly closer to *s*_{0} than every negative example; *X*_{m} and *Y*_{n} are spherically inseparable if they are not spherically separable. We denote a set of *m*-tuples of strings for which a consensus sequence is uniquely determined by [(*A**)^{m}]. A formal definition of a consensus sequence is provided in §S2 of the electronic supplementary material for this paper. We suppose (*s*_{1},…,*s*_{m})∈[(*A**)^{m}] in the following and choose the consensus sequence of *s*_{1},…,*s*_{m} as the centre of a discriminant sphere.
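A simplified picture of the consensus sequence chosen as the centre is a positionwise majority vote, treating positions beyond a string's end as the empty letter. This is a sketch only; the formal definition is given in §S2 of the electronic supplementary material, and we assume here that the tuple of strings has a uniquely determined consensus (no ties).

```python
from collections import Counter

def consensus(strings):
    """Simplified consensus sequence: positionwise majority vote over
    the input strings, with the empty letter '' standing in for
    positions past a string's end. Assumes no voting ties, i.e. the
    tuple lies in [(A*)^m]."""
    length = max(map(len, strings))
    out = []
    for j in range(length):
        votes = Counter(s[j] if j < len(s) else '' for s in strings)
        letter, _ = votes.most_common(1)[0]
        out.append(letter)       # '' contributes nothing when it wins
    return ''.join(out)

print(consensus(["acgt", "aagt", "acct"]))  # "acgt"
```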

We first consider the problem of choosing the radius of a discriminant sphere for the case in which the positive and negative examples are spherically separable. Similar to a discriminant hyperplane of an SVM in a real vector space, we choose the radius under the principle of margin maximization: if the positive examples *X*_{m}={*s*_{1},…,*s*_{m}} and negative examples *Y*_{n}={*t*_{1},…,*t*_{n}} are spherically separable with respect to the chosen centre, we take *r** to be the midpoint of the largest distance from the centre to a positive example and the smallest distance from the centre to a negative example. If *r** is not an integer, then we arbitrarily choose one of the integers closest to *r**.
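The separable case can be sketched as follows (a minimal sketch under the midpoint characterization above; the function name and the tie-breaking rule via `round` are our own choices).

```python
def lev(s, t):
    # Levenshtein distance via dynamic programming (rolling table).
    d = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        prev, d[0] = d[0], i
        for j, ct in enumerate(t, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (cs != ct))
    return d[-1]

def separable_radius(centre, positives, negatives):
    """Margin-maximizing radius in the spherically separable case:
    the midpoint of the farthest positive-example distance and the
    nearest negative-example distance, rounded to an integer."""
    r_pos = max(lev(s, centre) for s in positives)
    r_neg = min(lev(t, centre) for t in negatives)
    if r_pos >= r_neg:
        raise ValueError("not spherically separable for this centre")
    return round((r_pos + r_neg) / 2)
```

For example, with a farthest positive at distance 1 and a nearest negative at distance 4, the returned radius is an integer closest to 2.5.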

Next, we consider the case in which the positive examples *X*_{m} and negative examples *Y*_{n} are spherically inseparable. We denote the subsamples of the positive and negative examples that a discriminant sphere of radius *r* correctly classifies by *X*_{m}(*r*) and *Y*_{n}(*r*), respectively, and the number of elements of a set *S* by ♯*S*. The numbers of correctly classified strings in *X*_{m} and in *Y*_{n} are thus ♯*X*_{m}(*r*) and ♯*Y*_{n}(*r*). When *X*_{m} and *Y*_{n} are spherically inseparable, we choose the radius of a discriminant sphere based on the principle of minimizing the number of misclassified inputs and maximizing the margin, which is a modification of the principle used by an ordinary SVM in a real vector space.

*Step 1 (minimizing the number of misclassified inputs).* Search for the set of radii that minimize the number of misclassified inputs, i.e. the set of positive integers at which the number of correctly classified positive and negative examples is maximized.

*Step 2 (maximizing the margin).* Among the radii obtained in step 1, choose the radius *r** that maximizes the margin (if *r** is not uniquely determined, then we arbitrarily choose one of them). The strings whose distances from the centre determine the margin are support strings, in contrast to support vectors for an ordinary SVM in a real vector space.
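The two steps can be sketched as follows, given the Levenshtein distances of the examples to the centre. The margin measure used here, the smallest gap between the sphere boundary and a correctly classified example, is our own simplification and stands in for the paper's formal expression.

```python
def choose_radius(pos_d, neg_d):
    """Two-step radius selection from distances of positive (pos_d)
    and negative (neg_d) examples to the centre.
    Step 1: keep the radii minimizing the number of misclassified
    inputs. Step 2: among those, pick a radius maximizing the gap to
    the closest correctly classified example on either side."""
    candidates = range(max(pos_d + neg_d) + 1)

    def errors(r):
        # Positives outside the sphere and negatives inside (or on) it.
        return sum(d > r for d in pos_d) + sum(d <= r for d in neg_d)

    best = min(errors(r) for r in candidates)
    step1 = [r for r in candidates if errors(r) == best]

    def margin(r):
        inside = [r - d for d in pos_d if d <= r]
        outside = [d - r for d in neg_d if d > r]
        return min(inside + outside) if inside + outside else 0

    return max(step1, key=margin)   # first maximizer on ties
```

With separable distances such as `pos_d=[1, 2]` and `neg_d=[5, 6]`, step 1 already yields zero errors and step 2 reduces to margin maximization, consistent with the separable case.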

Let *N* and ℓ be the number and the maximum length of input strings, respectively. The time complexity is *O*(*N*ℓ) for constructing a centre string of a discriminant sphere, *O*(*N*ℓ^{2}) for computing Levenshtein distances from all input strings to the centre string, and *O*(*N*ℓ) for counting misclassified inputs over the candidate radii and searching for *r**, because there exist *O*(ℓ) candidates for *r*. Therefore, the total time complexity for constructing the classifier is equal to *O*(*N*ℓ^{2}), dominated by the distance computations.

## 4. Limit theorems in *A**

In §3, we constructed a statistical learning machine that classifies string data by forming a direct sum decomposition of *A**. In this section, we demonstrate a generalization of theorem 2 described in the electronic supplementary material of [3], which will be applied to an asymptotic analysis of the generalization error of our learning machine in §5. We also derive several corollaries from the generalization. The notation and definitions of terms used in this section are provided in §S2 of the electronic supplementary material for this paper. In the following sections, the alphabet is *A*={*a*_{1},…,*a*_{z−1}}. We set *a*_{z}=*e* for the empty letter *e* and refer to {*a*_{1},…,*a*_{z}} as the extended alphabet. Roughly speaking, *μ*(*α*_{1},…,*α*_{n}) denotes the consensus letter of random letters *α*_{1},…,*α*_{n}, and **μ**(*σ*_{1},…,*σ*_{n}) and **κ**(*σ*_{1},…,*σ*_{n}) represent the consensus sequence and variance of random strings *σ*_{1},…,*σ*_{n}, respectively. We also use the set of *n*-tuples of random letters the consensus letter of which is uniquely determined and the set of *n*-tuples of random strings the consensus sequence of which is uniquely determined.

Let *h*=1,…,*z*. *p*(*i*,*h*) represents the probability that the *i*th random letter realizes the *h*th letter in the extended alphabet; we also count the number of times that the *h*th letter is realized when *n* observations are made. For a statement S, S a.s. represents that S holds with probability one. First, theorem 1 from the electronic supplementary material of [3] can be generalized as follows.

### Theorem 4.1

*We suppose that* *(i) satisfies* *for each* *, and (ii) is independent. If (iii)* *is uniquely determined independent of n, there exists* *such that if n≥n*_{0}*, then*
*holds.*

Proofs of all results described in this paper are provided in §S1 of the electronic supplementary material. In theorem 4.1, the independence of random letters *α*_{1},…,*α*_{n} is assumed, but the identical distribution is not. Therefore, the consensus letters can vary among the distributions of *α*_{1},…,*α*_{n}. Theorem 4.1 states that even if the distributions of *α*_{1},…,*α*_{n} do not have an identical consensus letter, the consensus letter *μ*(*α*_{1},…,*α*_{n})(*ω*) of the observed letters converges to a letter *a*_{ι} under the above conditions. Even if *α*_{1},…,*α*_{n} do not have an identical distribution, by the definition of *a*_{ι}, the same limit is obtained as in the case in which *α*_{1},…,*α*_{n} have an identical consensus letter. Thus, theorem 1 from the electronic supplementary material of [3] is a special case of theorem 4.1.

Let *h*=1,…,*z*.

### Corollary 4.2

*If* (*i*) *satisfies* *for each* *ii*) *is independent for each* *and* (*iii*) *is uniquely determined independent of* *n*, *there exists* *such that if* *n*≥*n*_{0}, *then*
*holds*.

Theorem 2 from the electronic supplementary material of [3] can be obtained as a special case of corollary 4.2. We denote almost sure convergence by →_{a.s.} and the expectation of a random variable *X* by *E*[*X*]. Using corollary 4.2, theorem 3 from the electronic supplementary material of [3] is extended as follows.

### Corollary 4.3

*We suppose that* (*i*) *satisfies* *for each* *and that* (*ii*) *is independent. If* (*iii*) *is uniquely determined independent of* *n*, *then we have*

*as* *More strictly than under condition* (*iii*), *if* (*iv*) *holds and* (*v*) {*σ*_{i}} *have an identical family of finite dimensional distributions, we have*
*as*

Because a consensus sequence of strings is a majority vote, unlike a mean of real vectors, the distributions of random letters that compose a consensus sequence converge to the Dirac measures for letters of the extended alphabet, rather than to distributions such as the normal distribution, under the conditions of corollary 4.2.

### Corollary 4.4

*We suppose that* *satisfies the conditions of corollary* 4.2. *We denote the Dirac measure for a letter* *such that* *by* *δ*_{ι(j)} (*ι*(*j*) *is independent of* *n* *and unique*). *Then, there exists* *such that if* *n*≥*n*_{0}, *then the sequence of distributions of the random letters that compose* **μ**(*σ*_{1},…,*σ*_{n}) *is equal to the corresponding sequence of Dirac measures* *δ*_{ι(j)}.

Let *p*(*i*,*j*,*h*) denote the probability that the *j*th random letter of the *i*th random string realizes the *h*th letter of the extended alphabet. If *p*(*i*,*j*,*h*) is not independent of *i* for some *j*, a consensus letter of *α*_{1j0},…,*α*_{nj0}, and consequently a consensus sequence of *σ*_{1},…,*σ*_{n}, is not necessarily determined with probability one as *n* becomes large.

## 5. Asymptotic optimality of the proposed learning machine

As described in §1, in the conventional framework of converting strings into numerical vectors using a string kernel and subsequently applying a classifier working in a numerical vector space to the vectors, it is impossible to theoretically evaluate the generalization error of the classifier for string classification, because conversion using a string kernel is not bijective. Consequently, to evaluate the performance of a classifier for string data, we have no option but to apply the classifier to certain datasets and repeat the cross-validation. In this section, by applying corollary 4.2 demonstrated in §4 on the asymptotic behaviour of a consensus sequence of random strings, we theoretically consider whether our statistical learning machine working in *A** constructed in §3 is optimal in terms of its generalization error. Generally, in classical statistics, the problems of estimation and hypothesis testing are considered under the assumption that the data are generated according to an unknown but unique population distribution. However, machine learning is typically applied to mining datasets that are considerably larger than the datasets analysed using traditional statistical methods. Therefore, it is unrealistic to suppose that such data are generated according to an unknown but unique distribution. In this section, we theoretically analyse the generalization error of our learning machine in a setting in which both positive and negative examples are generated according to an unknown number of unknown distributions.

For each *i*=1,…,*m*, we suppose that the *i*th positive example *s*_{i} is generated according to one of *f*_{1} unknown distributions with probability functions on *A**, and, for each *i*=1,…,*n*, that the *i*th negative example *t*_{i} is generated according to one of *f*_{2} unknown distributions with probability functions on *A**. *f*_{1} and *f*_{2} are also unknown. Thus, the models that generate positive and negative examples are finite mixture models
for *i*=1,2. (For *i*=1,2 and *k*=1,…,*f*_{i}, the component distributions are probability functions of the type *p*_{σ} introduced in §S2 of the electronic supplementary material for this paper.) *D*_{1} and *D*_{2} represent the supports of *p*_{1} and *p*_{2}, respectively, i.e. *D*_{i}={*s*∈*A**:*p*_{i}(*s*)>0} for *i*=1,2. We assume that *D*_{1}−*D*_{2}≠∅ and *D*_{2}−*D*_{1}≠∅. If *D*_{1}∩*D*_{2}=∅, then the probability that the generalization error becomes zero after a finite number of learning steps is equal to one. Therefore, we consider the case of *D*_{1}∩*D*_{2}≠∅ in the following. The generalization error *E*_{0}(*s*,*r*) of a discriminant sphere ∂*U*(*s*,*r*) is written as *E*_{0}(*s*,*r*)=*E*_{1}(*s*,*r*)+*E*_{2}(*s*,*r*) for
*s*_{0}∈*A**. *r*^{†}(*s*_{0}) is the radius of a discriminant sphere that is optimal in terms of the generalization error given a centre. We denote the relative frequencies of *t* in *X*_{m} and in *Y* _{n} by *t*∈*A**. We set
*s*∈*A** and

We examine whether *r** converges to an optimal radius in terms of the generalization error as our learning machine updates *r** through a learning process. Note that if positive examples are generated in a manner in which the conditions of corollary 4.2 described in the previous section are satisfied, the optimal radius is *r*^{†}({*a*_{ι(j)}}), because the centre of the discriminant sphere converges to {*a*_{ι(j)}} with probability one, given a sufficient number of positive examples.
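The decomposition *E*_{0}(*s*,*r*)=*E*_{1}(*s*,*r*)+*E*_{2}(*s*,*r*) stated above admits one natural formalization consistent with the convention that boundary strings are classified as positive (the summation sets below are our reconstruction, not the paper's display): the two terms are the positive-class mass falling outside the sphere and the negative-class mass falling inside or on it,

```latex
E_1(s, r) = \sum_{t \in A^* :\, d_L(t, s) > r} p_1(t), \qquad
E_2(s, r) = \sum_{t \in A^* :\, d_L(t, s) \le r} p_2(t).
```

Under this reading, minimizing *E*_{0} trades false negatives (first sum) against false positives (second sum) as the radius *r* grows.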

### Theorem 5.1 (Asymptotic optimality of *r**)

*If (i) the positive examples s*_{1}*,…,s*_{m} *are realizations of random strings σ*_{1}*,…,σ*_{m} *that are independent and distributed according to the mixture model* *p*_{1} *and the negative examples t*_{1}*,…,t*_{n} *are realizations of random strings τ*_{1}*,…,τ*_{n} *that are independent and distributed according to* *p*_{2}*, (ii) σ*_{1}*,…,σ*_{m} *satisfy the conditions of corollary *4.2 *for each* *and (iii) there uniquely exists r*^{†}*({a*_{ι(j)}*}), then we have*
*as* *. In other words, r** *converges to a radius that is asymptotically optimal given* *as the centre of a discriminant sphere with probability one.*

In the proof of theorem 5.1, only the principle of minimizing the number of misclassified inputs was used, and the principle of maximizing the margin was not required (see §S1 of the electronic supplementary material), because the samples of the positive and negative examples accurately reflected their population distributions in the asymptotic setting. This suggests that the reason an ordinary SVM working in a real vector space requires the principle of maximizing the margin is that finite samples do not accurately reflect their population distributions.

We next consider the optimality of the centre of the discriminant sphere. We suppose that *D*_{1}∩*D*_{2}≠∅ holds and that *p*_{1} and *p*_{2} satisfy the conditions that (i) *p*_{1}(*s*) is monotonically non-increasing with respect to *d*_{L}(*s*,{*a*_{ι(j)}}) on *D*_{1} and (ii) there exists *d*_{0} such that *p*_{2}(*s*) is monotonically non-decreasing with respect to *d*_{L}(*s*,{*a*_{ι(j)}}) on the set of strings within distance *d*_{0} of {*a*_{ι(j)}}. We suppose that *d*_{0} is sufficiently large and consider only discriminant spheres contained in this region when *r** is chosen as the radius of a discriminant sphere. Note that ♯*U*(*s*,*r*) increases monotonically with respect to the length of *s* and *r*. We denote the sets of pairs of a centre and a radius whose discriminant spheres are equal to, equal to or smaller than, and equal to or larger than ∂*U*({*a*_{ι(j)}},*r*) in size by *B*_{0}(*r*), *B*_{1}(*r*) and *B*_{2}(*r*), respectively, for any *r*.

### Theorem 5.2 (Asymptotic optimality of *s̄*_{m})

*In the setting described above, if random strings σ*_{1}*,…,σ*_{m} *that generate positive examples satisfy the conditions of corollary *4.2 *for each* *we have*
*as* *for any* *(s*′*,r*′*)∈B*_{j}*(r), and j∈{0,1,2}. In other words, for any radius r, a discriminant sphere with a centre* *is asymptotically optimal in a class of discriminant spheres that are equal to it in size and has asymptotically minimum probabilities of false-negatives of discriminant spheres of equal or smaller size and of false-positives of discriminant spheres of equal or larger size.*

Let *α*_{ij} denote the *j*th random letter of the random string *σ*_{i} that generates the *i*th positive example for each *i*=1,…,*m*. If *σ*_{1},…,*σ*_{m} are independent, *α*_{1j},…,*α*_{mj} are also independent for each *j*. The identical distribution of *α*_{1j},…,*α*_{mj} is assumed for each *j*, but that of *σ*_{1},…,*σ*_{m} is not. Furthermore, the independence of the random strings that generate negative examples is also not assumed. If we choose two different points **x**, **x**′ in a *p*-dimensional real vector space and then make two hyperspheres with centres **x** and **x**′ and equal radii, the measures (volumes for *p*=3) of the hyperspheres are equal. However, choosing two different strings *s*, *s*′ in *A** and then making two spheres with centres *s* and *s*′ and equal radii with respect to the Levenshtein distance, the numbers of strings that the spheres contain are not necessarily equal, which is an essential reason why the statement of theorem 5.2 is divided into the cases *j*=0,1,2.

## 6. Applications to biological sequence analysis

Our statistical learning machine classifies strings in an almost optimal manner under the conditions of theorems 5.1 and 5.2 in §5 when the training samples are sufficiently large. However, large training samples are not necessarily obtainable in all problems of classifying strings. How accurately does our machine classify strings in such cases? In this section, we examine the usefulness of our learning machine in practical data analysis by applying it to analysing amino acid sequences of proteins and nucleotide sequences of RNAs, including cases in which sufficiently large samples cannot be obtained.

### (a) Application to predicting protein–protein interactions

A protein is a polymer of 20 types of amino acids and can be represented as a string on an alphabet *A*={`a`,…,`z`}−{`b`,`j`,`o`,`u`,`x`,`z`} composed of 20 letters. Predicting protein–protein interactions is one of the most important problems in bioinformatics, because most proteins fulfill their functions after forming a complex with other proteins. A domain of a protein generally interacts with multiple domains of other proteins. However, only a few proteins have domains that interact with a number of domains of other proteins and function as a hub in a protein–protein interaction network [24,25]. Thus, large numbers of positive examples cannot necessarily be obtained in the problem of predicting protein–protein interactions. We formulated this prediction problem as the problem of classifying domains of proteins into two classes of domains that interact and do not interact with a given domain and applied our machine working in *A** to this problem. We examined the classification accuracy of our machine through comparisons with the SVM with the 2-spectrum kernel (the dot product of spectral representations of two strings [13]), the SVM with the spatial kernel with *t*=2, *k*=1 and *d*=2 (see [26] for these parameters), and the one-nearest-neighbour method that employs outputs from the spatial kernel as dissimilarities.

We first prepared positive examples by using the three-dimensional interacting domains (3did) database [27], which contains high-resolution, three-dimensional structural data for domain–domain interactions obtained from the Protein Data Bank (PDB) [28]. The PDB includes three-dimensional structures of proteins and protein complexes obtained from experiments such as X-ray crystal structural analysis and nuclear magnetic resonance spectroscopy. We randomly selected amino acid sequences of 20 protein domains, ‘1il1 A:134–215’ (which denotes the amino acid subsequence from residue 134 to 215 in chain A of PDB ID 1il1), ‘2bnq E:127–220’, ‘1inq B:1011–1092’, ‘1it9 H:133–216’, ‘1it9 L:123–210’, ‘1ikv A:317–419’, ‘1cff A:84–145’, ‘1iza A:1–20’, ‘1ifh L:119–206’, ‘1p2c F:1501–1627’, ‘1j5o B:238–307’, ‘1mcz A:190–325’, ‘1at1 B:101–152’, ‘1ikv B:1238–1307’, ‘1lw6 E:27–275’, ‘3tu4 F:24–93’, ‘1jgw H:11–137’, ‘1z8u B:6–106’, ‘4e5x A:1–179’ and ‘4e5x B:11–92’. For each of the amino acid sequences of these 20 domains, we collected sequences of interacting domains from 3did without any redundancy in sequences. Sequences having extremely different lengths relative to the sequences collected from the database were not included in the samples of positive examples. It should be noted that the same amino acid sequence can be included in different entries of the PDB.

We next consider the procedure for preparing negative examples. Both real sequences that are unlikely to be positive examples and artificial, randomly generated sequences have been used as negative examples in the development and validation of classifiers for string and sequence data. For example, in promoter prediction in *Escherichia coli*, Horton & Kanehisa [29] used sequences randomly chosen from coding regions as the negative examples. However, as Larsen *et al.* [30] indicated, the use of such negative examples makes the task very different from a real biological discrimination problem. For the problem of predicting the interaction between miRNA and mRNA, recent studies [31–33] used randomly generated artificial sequences as the negative examples. However, as other recent studies [34–36] demonstrated experimentally, such sequences often interact with miRNA, and therefore, it is uncertain whether they are true negative examples. Even if the randomly generated sequences are real negative examples, they may be unrealistically different from positive examples. In this case, as Bandyopadhyay & Mitra [37] indicated, the positive and negative examples are easily distinguishable, and a classifier that yields poor performance on other independent datasets may be produced.

Thus, in this study, we considered a procedure for preparing negative examples that were somewhat similar to the positive examples but were unlikely to interact based on biophysical chemistry data. Bogan & Thorn [38] compiled a database of 2325 alanine mutants of heterodimeric protein–protein complexes and examined which amino acids are located in the interfaces of protein–protein complexes with a relatively high frequency. Ma *et al*. [39] identified amino acids that are located in the interfaces with a high frequency on the basis of 1629 two-chain interface entries in the PDB. Based on the results of these studies, Arg, Asp, Trp and Tyr are located at the interfaces of protein–protein interactions with high frequency, whereas Lys and Glu are located at interfaces with low frequency. Arg and Lys are both positively charged, but Arg tends to be located at interfaces, whereas Lys does not. Similarly, Asp and Glu are both negatively charged, but Asp tends to be located at interfaces, whereas Glu does not. These observations imply that amino acids in which the side chain has relatively low entropy tend to be located at interfaces. *π* electrons in the side chains of aromatic amino acids interact strongly with positively charged amino acids [40], and, for this reason, aromatic amino acids such as Trp and Tyr are located at interfaces with high frequency. Therefore, we generated a negative example from each positive example using the following procedure.

Step (i). We first substituted Glu for Arg and Lys for Asp, Trp and Tyr in a positive example with a probability of 0.5 (because if all the Arg, Asp, Trp and Tyr were replaced, the resulting negative examples would not contain the four types of letters that represent these amino acids). According to van Holde *et al*. [41], the frequencies of Arg, Asp, Trp and Tyr in common proteins are 5.1%, 5.3%, 1.4% and 3.2%, respectively. Hence, in this step, approximately 7.5% of the letters in the positive example that represented amino acids that tend to be located in interfaces were replaced with letters representing amino acids that do not tend to be located in interfaces.

Step (ii). Next, we trisected the positive example obtained after the substitutions in step (i), chose a letter in the middle third at random, and transposed the two substrings that this choice defines: the prefix running from the first letter of the substituted positive example to the chosen letter, and the suffix composed of the remaining letters.
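Steps (i) and (ii) can be sketched as follows. This is a minimal illustration in Python, assuming sequences are given as strings of one-letter amino acid codes (R = Arg, D = Asp, W = Trp, Y = Tyr, K = Lys, E = Glu); the function name is ours.

```python
import random

# Interface-prone residues and their replacements, as in step (i):
# Arg -> Glu; Asp, Trp and Tyr -> Lys.
SUBSTITUTIONS = {"R": "E", "D": "K", "W": "K", "Y": "K"}

def make_negative(seq, rng=random):
    # Step (i): replace each interface-prone residue with probability 0.5.
    subst = "".join(
        SUBSTITUTIONS[c] if c in SUBSTITUTIONS and rng.random() < 0.5 else c
        for c in seq
    )
    # Step (ii): choose a cut point in the middle third of the sequence and
    # transpose the prefix (up to the chosen letter) with the remaining suffix.
    n = len(subst)
    cut = rng.randrange(n // 3, 2 * n // 3)
    return subst[cut + 1:] + subst[:cut + 1]
```

Note that step (ii) amounts to a cyclic rotation of the substituted sequence, so the residue composition produced by step (i) is preserved.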

The total number of positive and negative examples is given as *N* in table 1 (by the construction of the negative examples described above, the number of positive examples equals the number of negative examples, namely *N*/2). Our examination ranged from an insufficiently small sample size of *N* = 20 to a large sample size of *N* = 126. We constructed the above-mentioned four machines for each of the 20 domains. We evaluated the performance of the machines using four indices: accuracy = (TP + TN)/*N*, precision = TP/(TP + FP), recall = TP/(TP + FN) and F-measure = 2 × precision × recall/(precision + recall), where TP, FP, TN and FN denote the numbers of true positives, false positives, true negatives and false negatives, respectively. After randomly dividing each of the samples of positive and negative examples into three subsamples of equal size, we used the union of two subsamples as a training sample and the remaining subsample as a test sample to compute these indices.
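For concreteness, the four indices can be computed from the confusion-matrix counts as follows (a straightforward sketch; the function name is ours):

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F-measure from the counts
    of true positives, false positives, true negatives and false negatives."""
    n = tp + fp + tn + fn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure
```

For example, `classification_metrics(40, 10, 45, 5)` gives an accuracy of 0.85 and a precision of 0.8.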

We calculated the mean accuracy, precision, recall and F-measure by repeating this process 50 times. The mean values obtained in this procedure are shown in table 1. Furthermore, the averages of these mean accuracies, precisions, recalls and F-measures over the 20 protein domains, together with their standard deviations, are provided in table S1 in §S3 of the electronic supplementary material. Moreover, we tested the significance of the differences between the mean accuracies, precisions, recalls and F-measures for each pair of the four methods using the Wilcoxon signed-rank test; the results are shown in table S2. The differences are statistically significant, except for those between the mean precisions of the developed learning machine and the SVM with the spectrum kernel, and between the mean recalls of the SVM with the spectrum kernel and the nearest-neighbour method. The mean precision of the SVM with the spectrum kernel is higher than that of the SVM with the spatial kernel, whereas the SVM with the spatial kernel scores higher on the other three indices. Overall, the developed learning machine has the highest predictive power, the nearest-neighbour method has the lowest, and the predictive powers of the SVMs with the spectrum and spatial kernels are almost equal.
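The threefold cross-validation scheme described above can be sketched as follows. This is a minimal illustration, assuming each example is an arbitrary Python object and splitting each class separately so that the class balance of the paper's samples is preserved; the function name is ours.

```python
import random

def threefold_partitions(pos, neg, rng=random):
    # Randomly divide each class into three subsamples of equal size, then
    # use the union of two subsamples (per class) as the training sample and
    # the remaining subsample as the test sample, letting each fold serve
    # as the test set in turn.
    def folds(sample):
        s = list(sample)
        rng.shuffle(s)
        k = len(s) // 3
        return [s[:k], s[k:2 * k], s[2 * k:3 * k]]

    pf, nf = folds(pos), folds(neg)
    for i in range(3):
        test = pf[i] + nf[i]
        train = [x for j in range(3) if j != i for x in pf[j] + nf[j]]
        yield train, test
```

Repeating this partitioning 50 times and averaging the four indices over all resulting train/test pairs yields mean values of the kind reported in table 1.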

### (b) Application to classifying RNAs by the secondary structure

Next, we applied the learning machine developed in this study, the SVM with the spectrum kernel, the SVM with the spatial kernel and the nearest-neighbour method that uses outputs of the spatial kernel as dissimilarities to the problem of predicting, from nucleotide sequences, whether RNAs belong to a family with similar secondary structures. We chose 10 RNA families registered in the Rfam database [42] at random; each family is composed of RNAs that have similar secondary structures. As positive examples, we used the sequence data of the RNAs in each family after deleting the alignment gaps from the sequences. As negative examples, we required sequences that are unlikely to form secondary structures similar to those of the RNAs in each family. Therefore, we prepared a negative sequence from each positive sequence according to step (ii) of the procedure described in §6a, which changes the positions of complementary pairs of nucleotides in the RNA sequences. For each family, a sample of negative sequences was prepared in this manner; sequences from other families were not used as negative examples.

As in §6a, we calculated the mean accuracy, precision, recall and F-measure by repeating the threefold cross-validation 50 times. The obtained mean values are shown in table 2. Furthermore, the means and standard deviations of the mean accuracies, precisions, recalls and F-measures of the four methods over the 10 RNA families are provided in table S1. Moreover, table S2 shows the results of testing the significance of the differences in the mean accuracies, precisions, recalls and F-measures between each pair of the four methods using the Wilcoxon signed-rank test. The differences in all four indices between the SVMs with the spectrum kernel and with the spatial kernel are not statistically significant. The differences between the other pairs of methods are statistically significant, except for the differences in mean recall between the SVM with the spectrum kernel and the nearest-neighbour method and between the SVM with the spatial kernel and the nearest-neighbour method. Therefore, we conclude from these tables that the developed learning machine has the highest predictive power, followed by the SVMs with the spectrum and spatial kernels, and that the nearest-neighbour method has the lowest predictive power, a result similar to that obtained in §6a.

## 7. Conclusion

String data analysis comprises a wide range of tasks, such as string comparison, quantification of string similarity, string clustering and identification of string features. In this study, we addressed one of these problems: the classification of strings. As described in §1, it is impossible to theoretically evaluate, using probability theory, the performance of learning machines that convert strings into numerical vectors via string kernels, which are not bijective, and subsequently analyse the vectors. Therefore, to evaluate the performance of such learning machines, we have no option but to rely on numerical evaluation by repeated cross-validation. However, as is often pointed out, the results of performance evaluations conducted in this manner vary greatly depending on the datasets used. Addressing these problems by applying the probability theory on *A** developed in [3] is the main purpose of this study, and we conducted it with special emphasis on the theoretical aspects of string data analysis.

Many methods for the above-mentioned tasks of string data analysis are based on the string representations used in kernel-based methods. Many previous studies have shown kernel-based methods for string representation to be useful in practical string data analysis; see, for example, [26,43–52] for important recent studies on kernel-based methods for string data. However, it is impossible to evaluate the performance of methods based on string kernels by applying traditional probability theory, which has been constructed on spaces such as the Euclidean space and *L*^{2}, when a sample of observed strings is regarded as part of a population generated according to a probability law. In this study, we addressed the problem of supervised classification of strings and obtained theoretical results on the performance of a learning machine that works in *A** by applying the limit theorem in probability theory on *A** demonstrated in this paper. In [53], we approached the problem of unsupervised clustering of strings by introducing a parametric probability distribution on *A** and developing a theory of the EM algorithm for a mixture model of such distributions, and we demonstrated the asymptotic optimality of the constructed clustering procedure by combining it with the limit theorems obtained in this paper. It remains a future challenge to develop a method for theoretically evaluating the performance of kernel-based methods of string data analysis by extending probability theory on *A** and inspecting the correspondence between observed strings and their representations by string kernels.

## Ethics

There are no ethical concerns with regard to this manuscript.

## Data accessibility

All data used in this study are openly available from the 3did and Rfam databases. The source code for classifying strings using the learning machine that works in *A** constructed in this study is available on request.

## Authors' contributions

H.K. conceived of the study, constructed the theory, performed the simulation experiments and drafted the manuscript. M.H. helped to perform the simulation experiments. T.A. helped to construct the theory.

## Competing interests

The authors declare that they have no competing interests.

## Funding

This work was supported in part by a Grant-in-Aid for Challenging Exploratory Research from the Japan Society for the Promotion of Science (26610037).

## Acknowledgements

We are grateful to the anonymous referees for their valuable comments on the manuscript.

- Received August 10, 2015.
- Accepted February 2, 2016.

- © 2016 The Author(s)

Published by the Royal Society. All rights reserved.