## Abstract

This paper is concerned with exploiting uncertainty in order to develop a robust regression algorithm for a pre-sliding friction process based on a Nonlinear Auto-Regressive with eXogenous inputs neural network. Essentially, it is shown that using an interval-valued neural network allows a trade-off between the model error and the interval width of the network weights or a ‘degree of uncertainty’ parameter. The neural network weights are replaced by interval variables and cannot therefore be derived from a conventional optimization algorithm; in this case, the problem is solved by using differential evolution. The paper also shows how to implement the idea of ‘opportunity’ as used in Ben-Haim's information-gap theory.

## 1. Introduction

The aim of this work was to provide a computational model of a real-life engineering system. The system in question was a single-input, single-output system, where the input was a displacement and the output was a pre-sliding friction force. These data were obtained experimentally from a specially constructed test rig, a schematic of which is shown in figure 1. Both the input and output values were normalized. Figure 2 shows that the system is highly hysteretic in its input–output relationship, and thus more difficult to model than a standard static system.

This work followed on from a previous study of this problem conducted by Parlitz *et al*. (2004) involving artificial neural networks. That work showed that by using a standard Nonlinear Auto-Regressive with eXogenous inputs (NARX) multi-layer perceptron (MLP) network, a predictive model for the data could be obtained with a mean squared error (MSE) value of 6.5%. The principal aim of this work was to show how an MLP network using interval-valued connection weights could give an improved model prediction. Several different methods were used to train this interval network for comparison purposes, the most significant being the evolutionary strategy, differential evolution (Storn & Price 1997).

When trained and operated correctly, neural networks can be very effective at modelling complex relationships between sets of data. They are principally used when the system in question is nonlinear, and may often have several inputs and outputs. As with all real-life measured values, those in this particular dataset contained some amount of measurement noise. A neural network was required here to find the underlying input–output relationship of the system using the measured values provided.

An autoregressive input structure is often used for dynamic systems such as this one. This model structure involves feeding previous system input and output values back into the system as inputs, in order to model the dependency of the system on its past states and values. The number of these past input and output values that are used depends on how much ‘memory’ the system is adjudged to have, and this is one of the things that needs selecting in order to get an accurate system model.

The training and operation phases for an NARX MLP network are significantly different, in particular regarding the previous time output values (or ‘output lags’) used by the network. In the training phase, these previous outputs are taken directly from the dataset to give what is termed the *one-step-ahead prediction*, and training proceeds until a suitably well-trained network model is obtained. In the operation phase, however, the previous output values are fed back from the network itself, so this is truly testing the predictive capabilities of the network. It is on this set of predictions that the performance of the neural network is judged, known as the *long-range output*.
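The distinction between the two prediction modes can be sketched as follows for a NARX model with two input lags and three output lags, matching the lag structure used later in this paper. The `model` callable and function names here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def one_step_ahead(model, u, y, nu=2, ny=3):
    """One-step-ahead prediction: lagged outputs are taken from the measured data."""
    start = max(nu, ny)
    preds = []
    for t in range(start, len(u)):
        # current input, nu previous inputs, ny previous measured outputs
        x = np.concatenate([u[t - nu:t + 1], y[t - ny:t]])
        preds.append(model(x))
    return np.array(preds)

def free_run(model, u, y0, nu=2, ny=3):
    """Long-range (free-run) prediction: lagged outputs are fed back from the model."""
    start = max(nu, ny)
    y_hat = list(y0[:start])  # seed the recursion with measured outputs
    for t in range(start, len(u)):
        x = np.concatenate([u[t - nu:t + 1], y_hat[t - ny:t]])
        y_hat.append(model(x))
    return np.array(y_hat)
```

In the free-run case any prediction error re-enters the model as an input, which is why the long-range output is the harder test of the network.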

Interval-valued neural networks are to be used in this work so that a range of system outputs can be provided for an arbitrary input vector. It will be shown that accepting a small amount of uncertainty on the result enables the prediction error of a system model to be reduced. In most applications, there is some degree of tolerance that can be accepted upon an output, so it is beneficial to use this tolerance to best advantage. This technique of deliberately introducing uncertainty into a crisp-valued model has also been used by the authors for radial basis function neural networks, ARX (Auto-Regressive with eXogenous inputs) models and NARX models (Chetwynd *et al*. 2005*a*,*b*, 2006).

The layout of this paper is as follows. Section 2 gives background information, both in terms of the literature and the previous paper that initiated this work. Section 3 gives a brief outline of the uncertainty technique interval arithmetic, considers how it would function for an interval MLP network and then describes the differential evolution algorithm. Section 4 describes the network training strategies to be used. Section 5 gives the results obtained along with figures and finally §6 provides the discussion and conclusions.

## 2. Background

The dataset used here contained 24 000 separate samples of data and was initially split into three equal sets of 8000 samples each. To limit computation time, only the first 2000 samples of each set were used, and the sets were labelled as the training, validation and test sets, respectively. This was in line with the best practice method of training neural networks as defined by Tarassenko (1998). In selecting a neural network to represent a certain set of data, both the network architecture and the values of all the connection weights are important. As a NARX neural network was used here, the number of previous input and output values into the network also had to be selected.

In the background work conducted on this same set of data, the optimum network arrangement was found to be a two-layer MLP with two processing neurons in the hidden row. This work also showed that the optimum number of lags to use on the network was two previous inputs and three previous outputs. Along with the current input value this gave six network inputs, so the network structure was 6:2:1. A hyperbolic tangent was used as the activation function of the hidden nodes, and the single-output node was linear.

If the optimum network architecture and set of weights for an interval MLP had been sought in tandem, this would have taken an excessive amount of training time. The decision was therefore made to use the same network architecture as that in the background work, this being a two-layer 6:2:1 NARX network, using two previous input and three previous output values at each time-step. The principal difference was that here the network weights were now interval numbers as opposed to scalars.

## 3. Theory

### (a) Interval arithmetic

Interval arithmetic is a technique that was proposed by Moore (1966), and it operates by denoting every uncertain quantity as a range of possible values. The size of this range (or interval) denotes the uncertainty associated with the quantity. These intervals are then used in all calculations involving the variable, and there exist rules governing the standard mathematical operations for interval numbers. In this way, the total uncertainty associated with a sequence of calculations can be measured.

If an interval number takes the form $[\underline{a}, \overline{a}]$, with $\underline{a} \le \overline{a}$, then addition of two interval numbers is

$$[\underline{a}, \overline{a}] + [\underline{b}, \overline{b}] = [\underline{a} + \underline{b},\ \overline{a} + \overline{b}]. \tag{3.1}$$

Multiplication of two interval quantities is given by

$$[\underline{a}, \overline{a}] \times [\underline{b}, \overline{b}] = \bigl[\min(\underline{a}\underline{b}, \underline{a}\overline{b}, \overline{a}\underline{b}, \overline{a}\overline{b}),\ \max(\underline{a}\underline{b}, \underline{a}\overline{b}, \overline{a}\underline{b}, \overline{a}\overline{b})\bigr], \tag{3.2}$$

which takes into account the effect of negative values. Thirdly, multiplication of an interval number by a positive scalar *c* is given by

$$c[\underline{a}, \overline{a}] = [c\underline{a},\ c\overline{a}]. \tag{3.3}$$

Finally, if an interval number is the input to a monotonically increasing function *F* then this can be written as

$$F([\underline{a}, \overline{a}]) = [F(\underline{a}),\ F(\overline{a})]. \tag{3.4}$$

In this work, however, the interval quantities were stored as a central value $\omega$ and a half-width/radius value $\delta$, i.e.

$$[\underline{a}, \overline{a}] = [\omega - \delta,\ \omega + \delta], \qquad \omega = \tfrac{1}{2}(\underline{a} + \overline{a}), \quad \delta = \tfrac{1}{2}(\overline{a} - \underline{a}), \tag{3.5}$$

although all of the calculations were still done in the upper and lower bound form of (3.1)–(3.4). In storage terms, this modification is exactly the same as the standard method, as two scalar quantities are still required to define each interval used.
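The rules (3.1)–(3.5) translate directly into code. The following is a minimal sketch, representing an interval as a `(lower, upper)` tuple; the function names are illustrative only.

```python
def iadd(a, b):
    # (3.1): endpoints add independently
    return (a[0] + b[0], a[1] + b[1])

def imul(a, b):
    # (3.2): min/max over all cross products handles negative endpoints
    p = (a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1])
    return (min(p), max(p))

def iscale(c, a):
    # (3.3): multiplication by a positive scalar c
    return (c * a[0], c * a[1])

def imono(F, a):
    # (3.4): a monotonically increasing function maps endpoints to endpoints
    return (F(a[0]), F(a[1]))

def to_centre_radius(a):
    # (3.5): equivalent centre/half-width storage
    return ((a[0] + a[1]) / 2, (a[1] - a[0]) / 2)
```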

### (b) An interval arithmetic multi-layer perceptron

The standard multi-layer perceptron (MLP) neural network has been thoroughly researched and is well understood. For this reason, the concept of introducing interval arithmetic into an MLP is relatively straightforward. All of the network connection weights become interval quantities, and since the two nodal activation functions to be used, the hyperbolic tangent and the linear function, are both monotonically increasing, relationship (3.4) applies. The output from an interval MLP network will thus be an interval range, i.e. a predicted range of values within which the true output value is expected to lie. The other relationships (3.1)–(3.3) will also be used in the interval MLP. Interval arithmetic has previously been used within neural network research by Drago & Ridella (1998, 1999).
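As a minimal sketch of such a forward pass, the following propagates a crisp input vector through a 6:2:1 network with interval weights stored in centre/radius form, using interval addition for the weighted sums, the monotone tanh rule for the hidden layer, and full interval multiplication at the output. Biases are omitted for brevity and all names are hypothetical, not the authors' implementation.

```python
import numpy as np

def imul_scalar(w_lo, w_hi, x):
    """Interval weight times crisp input: bounds swap when x is negative."""
    lo, hi = w_lo * x, w_hi * x
    return min(lo, hi), max(lo, hi)

def interval_mlp(x, Wc, Wr, Vc, Vr):
    """Wc/Wr: hidden weight centres/radii (2x6); Vc/Vr: output weights (1x2)."""
    h_lo, h_hi = [], []
    for j in range(Wc.shape[0]):
        lo = hi = 0.0
        for i in range(Wc.shape[1]):
            a, b = imul_scalar(Wc[j, i] - Wr[j, i], Wc[j, i] + Wr[j, i], x[i])
            lo += a; hi += b                                 # interval addition (3.1)
        h_lo.append(np.tanh(lo)); h_hi.append(np.tanh(hi))   # monotone tanh (3.4)
    y_lo = y_hi = 0.0
    for j in range(len(h_lo)):
        cands = [(Vc[0, j] - Vr[0, j]) * h_lo[j], (Vc[0, j] - Vr[0, j]) * h_hi[j],
                 (Vc[0, j] + Vr[0, j]) * h_lo[j], (Vc[0, j] + Vr[0, j]) * h_hi[j]]
        y_lo += min(cands); y_hi += max(cands)               # interval product (3.2)
    return y_lo, y_hi
```

Setting all radii to zero recovers the crisp MLP output, while widening the weights widens the predicted output region.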

In interval arithmetic terms, a ‘crisp’ interval is one that has zero width, i.e. a real number. A crisp neural network is therefore one that uses real number weights, i.e. a standard MLP. As this dataset has only one output value, a graphical plot of a crisp network prediction will be a single line. A similar plot for an interval MLP will be a pair of bounds forming a closed region between them. This is the prediction interval of the network, and the target values from the dataset are estimated to lie inside it. Superimposing the target outputs over both of these two curves will give a visual indication of the quality of the network prediction. This can be done for either the one-step-ahead or long-range outputs of the NARX network, although the plots for the two cases are likely to differ.

An ideal interval solution is one in which the target values of any training pattern always lie within the predicted output bounds. This alone is not sufficient though, as the width of the prediction interval is also of key importance. In theory, this interval could be deliberately created very wide so that the true target values would always be contained within it. This, however, would lead to a set of output predictions that were so vague as to be of no real use, so the network output intervals have to be minimized effectively.

However, minimizing the width of the output intervals also reduces their likelihood of containing the target values, especially if some of the data samples can be classed as ‘outliers’. This in turn increases the prediction error of the interval MLP. Reducing the network output error and reducing the predicted interval widths are clearly two aims in conflict, with the best interval solution being a trade-off between these two factors. This decision is in many ways situation dependent, as some applications may require lower errors than others.

During a one-step-ahead prediction, the network error from one data sample is not passed onto the following sample, so there is no possibility of errors propagating over the dataset. The reverse is true for the long-range output case though, and thus it is the more difficult of the two prediction methods to minimize. This is why, as mentioned in §1, the predictive capabilities of the interval network can only be assessed upon its long-range outputs. In real-life operation, there will be no training data outputs to refer to, so it needs to be shown that any trained interval network can still predict outputs with a low error when used in a free-run operation.

Figures 3 and 4 compare the one-step-ahead and long-range outputs of the same interval MLP network, using the same dataset. For the one-step-ahead case in figure 3, the network prediction bounds follow the path of the target data and do not increase in width. Figure 4, however, shows that the network bounds using the long-range outputs increase in width exponentially, until they reach the limits set by the hidden node hyperbolic tangents. This clearly shows that due to its conservatism, interval arithmetic is unsuitable for use in a recursive model such as this one, so an alternative approach is needed.

### (c) Operation of the interval multi-layer perceptron

As with all neural networks, the prediction error is a function of the network weights used, and the aim is to find the optimum set of weights to minimize this error value. The predicted outputs of crisp and interval multi-layer perceptrons (MLPs) are different though, so a modified error function is needed for the interval case. One of the most common error measures used in a crisp network is the MSE function, where the error contribution from one data sample is given by

$$e_i = (y_i - \hat{y}_i)^2, \tag{3.6}$$

with $\hat{y}_i$ as the network predicted output and $y_i$ as the target output.

The total MSE value in percentage terms is given by

$$\text{MSE} = \frac{100}{N\sigma_y^2}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2, \tag{3.7}$$

where *N* is the number of samples and $\sigma_y^2$ is the variance of the target outputs.

This can now be modified for an interval arithmetic MLP, so that the error from one sample is given by

$$e_i = \begin{cases} (y_i - \overline{y}_i)^2, & y_i > \overline{y}_i,\\ 0, & \underline{y}_i \le y_i \le \overline{y}_i,\\ (\underline{y}_i - y_i)^2, & y_i < \underline{y}_i, \end{cases} \tag{3.8}$$

where $\overline{y}_i$ and $\underline{y}_i$ are the predicted upper and lower output bounds for that data sample. Function (3.8) is similar to the *ϵ*-insensitive quadratic error function used in support vector regression applications (Cristianini & Shawe-Taylor 2000).

As mentioned in §3*b* though, it is also important to minimize the width of the network output intervals. In order to accommodate this, an extra term was added to (3.7) to give

$$J = \frac{100}{N\sigma_y^2}\sum_{i=1}^{N} e_i + \beta \sum_j \left|\frac{\delta_j}{\omega_j}\right|, \tag{3.9}$$

where *ω* and *δ* are the respective centre and radius values of each interval network weight, and *β* is a user-selected width penalty coefficient. Note that here it is the absolute ratios between the interval weight radii and centres that are being considered, as opposed to merely summing the modulus values of the radii.

The width of the predicted output region for an interval MLP is directly related to the radii of the weights used. If the interval weights are wide then this allows a large range of intermediate values to propagate through the network and the interval outputs will thus be wide. Conversely, if the weight radii are kept small in relation to their centres then the output region will be narrow.

The second term in (3.9) can thus be thought of as an output width penalty term. The coefficient *β* is selected by the user at the start of the test and indicates the emphasis to be placed on minimizing the interval output widths. A large value of *β* means that there will be a large penalty on the width of the interval weights so they will be kept narrow. A small *β* value assigns only a small width penalty to the weights so they can be set much wider in comparison. A *β* value of zero means that no width penalty is applied, so the interval network prediction is assessed on its MSE characteristics only.
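The per-sample error (3.8) and the penalized cost (3.9) can be sketched as follows, assuming all weight centres are nonzero so the ratio term is defined; the function names are illustrative only.

```python
import numpy as np

def interval_sample_error(y, y_lo, y_hi):
    """Per-sample error (3.8): zero inside the predicted bounds, quadratic outside."""
    if y > y_hi:
        return (y - y_hi) ** 2
    if y < y_lo:
        return (y_lo - y) ** 2
    return 0.0

def interval_cost(y, y_lo, y_hi, centres, radii, beta):
    """Total cost (3.9): percentage MSE plus the width penalty beta * sum |radius/centre|."""
    mse = 100.0 * sum(interval_sample_error(t, lo, hi)
                      for t, lo, hi in zip(y, y_lo, y_hi)) / (len(y) * np.var(y))
    penalty = beta * np.sum(np.abs(radii / centres))  # assumes nonzero centres
    return mse + penalty
```

With `beta=0.0` the cost reduces to the MSE term alone, matching the case where no width penalty is applied.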

The interval network error function (3.9) is therefore a trade-off between the ability of a network to predict accurate output intervals, and its ability to minimize their range as much as possible, as discussed in §3*b*. One of the aims of this work is to assess how the interval network error varies with the width penalty coefficient *β*. This is an example of a multi-objective problem. In graphical terms, the results arising from the use of (3.9) could potentially be represented in the form of a Pareto frontier, where the two axes would be interval prediction error and width penalty coefficient *β*.

### (d) The vertex method

As mentioned in §3*b*, interval arithmetic is ineffective at calculating the long-range outputs of an MLP with interval weights. An alternative approach is therefore to use the vertex method of Dong & Shah (1987), which considers the vertices of the weight space region in use. This method is analogous to a two-level full factorial design-of-experiments process. For an interval MLP with *n* weight parameters, this means a problem in *n*-dimensional hyperspace.

This hypercube would have 2^{n} vertices, and considering each individual vertex separately could be viewed as an ‘enforced’ Monte Carlo simulation. The assumption can be made that the upper and lower bound values for each sample output will always occur at one of these weight space vertices. By using the values of every vertex in turn as the interval network weights, several sets of predictions could be obtained for the curve of output values. Taking the maximum and minimum values of these predictions for each individual data sample would give an output range within which the true output is forecasted to lie.

This method does take longer than using interval arithmetic, but is not subject to the ‘bound explosion’ effect. It is therefore an effective replacement for interval arithmetic when calculating the long-range outputs of an interval-valued MLP network.
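A minimal sketch of the vertex method, assuming a crisp `predict` callable (all names hypothetical): evaluate the model at all 2^n corners of the weight hypercube and take the envelope of the predictions. The exponential number of vertices is the source of the extra computation time mentioned above.

```python
import itertools
import numpy as np

def vertex_bounds(predict, centres, radii, x):
    """Envelope of crisp predictions over all 2^n vertices of the interval weight box."""
    n = len(centres)
    lo, hi = np.inf, -np.inf
    for signs in itertools.product((-1.0, 1.0), repeat=n):
        w = centres + np.array(signs) * radii  # one vertex of the weight hypercube
        y = predict(w, x)
        lo, hi = min(lo, y), max(hi, y)
    return lo, hi
```

For long-range prediction this would be applied recursively, feeding each vertex network's own outputs back and taking the per-sample envelope, which avoids the bound explosion of direct interval arithmetic.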

### (e) Differential evolution

Now that the operation of an interval MLP network has been described, it needs to be considered how such a network could be trained. As each network weight needs two scalar values to define it, this leads to a more complicated training phase, and thus the standard error back-propagation algorithm (Rumelhart *et al*. 1986) cannot be used here. If both parameters of each interval weight are to be trained simultaneously then this leads to a problem located in 34-dimensional space (the 6:2:1 network has 17 weights, each defined by a centre and a radius). The network prediction error surface is expected to be very nonlinear with several local minima, and so any gradient descent training method is likely to give poor results.

A global search method is a much more desirable choice. Genetic algorithms and evolutionary strategies are, in general, effective examples of such methods, and the differential evolution algorithm was chosen for this work. Differential evolution was first proposed in 1997 as a new type of evolutionary strategy, so it is still a relatively new idea. It can be used to optimize a set of parameters for a given system, does not rely on gradient descent techniques or the gradients of the error surface, and is population-based. It is therefore very effective at avoiding local minima, and can be used for problems with a multi-modal and/or non-differentiable error surface. The objective function to be minimized in this case is the error value given by a set of parameters when used as the weights of the neural network.

The algorithm is an ideal candidate to train this interval MLP for two principal reasons. Firstly, as already mentioned, differential evolution is a global search optimization method, and this problem has a high-dimensional and most likely nonlinear error surface. Secondly, differential evolution stores all of the relevant parameters under training as floating point numbers, as opposed to the binary encoding used by many other genetic-based algorithms. The network weights being trained here are real number quantities, and so this eliminates the need for them to be encoded into binary format and then decoded at each algorithm time-step. It also gives the solutions to a high precision.

A thorough explanation of the algorithm operation is not given here, although it can be found in one of the early publications on the subject (Storn & Price 1997). The algorithm is defined by three principal parameters: the population size NP, the crossover constant CR and the weighting factor *F*. Differential evolution is a multi-point search method, where NP denotes the number of solution space points searched simultaneously. The solution space is explored by the mutation and crossover operations, these being defined by *F* and CR, respectively. These two operations are equally important and it is the *combination* of these two that causes the algorithm to function successfully.
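For illustration, a minimal DE/rand/1/bin sketch using the parameter notation above (NP, CR, *F*) is given below. The initialization bounds, generation budget and convergence handling are simplified assumptions, not the authors' implementation.

```python
import numpy as np

def differential_evolution(cost, dim, NP=85, F=0.5, CR=0.1, max_gen=200, seed=0):
    """Minimise `cost` over R^dim with a basic DE/rand/1/bin scheme."""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(-1.0, 1.0, size=(NP, dim))  # assumed initialization range
    costs = np.array([cost(p) for p in pop])
    for _ in range(max_gen):
        for i in range(NP):
            a, b, c = rng.choice([j for j in range(NP) if j != i], size=3, replace=False)
            mutant = pop[a] + F * (pop[b] - pop[c])   # mutation, weighted by F
            cross = rng.random(dim) < CR
            cross[rng.integers(dim)] = True           # at least one gene from the mutant
            trial = np.where(cross, mutant, pop[i])   # binomial crossover
            tc = cost(trial)
            if tc <= costs[i]:                        # greedy one-to-one selection
                pop[i], costs[i] = trial, tc
    best = int(np.argmin(costs))
    return pop[best], costs[best]
```

The floating point representation of each population member is what makes the method a natural fit for real-valued network weights, as noted above.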

The first use of genetic algorithms in neural network training was by Montana & Davis (1989), and there have been countless other examples of this since then. Differential evolution has been used to train neural networks by, among others, Masters & Land (1997), Plagianakos *et al*. (2001) and Abbas (2001).

## 4. Method

### (a) Introduction

It was stated in §3*c* that the width of an interval MLP output region is directly linked to the proportional radii of the weights it uses. If every weight radius were set to zero then the output region would have zero width, i.e. a crisp network prediction. The same result could be obtained by using a crisp MLP with those central values for its weights. It can therefore be assumed that the interval output region is centred about this crisp approximation, with its width determined by the weight radii. It is for this reason that it was chosen to use the centre and radius notation (3.5) for the interval weights, as opposed to the standard upper and lower bound terminology.

In order to train the interval MLP network proposed here, all of the weight parameters need to be optimized, but there is no stipulation for the centres and radii to be trained simultaneously. Initially training a crisp neural network upon the data would give a set of values to use for the interval weight centres. The width of any subsequent interval MLP prediction is then dependent on the quality of that crisp solution.

The predicted output region should be minimized as much as possible, and in graphical terms it only needs to be wide enough to contain all of the target values. If an initial crisp solution fits the target data closely then the interval weight radii only need to be small to give the required output MSE. Conversely, if the initial crisp approximation is poor then large weight radii will be required. An initial crisp solution could be found using either differential evolution or by standard error back-propagation. In order to then find suitable values for the interval weight radii there are two methods that can be used.

#### (i) Proportional interval weight expansion

One method is simply to set the interval weight radii as some proportion of the centre values found from the crisp network approximation. This proportion can be incrementally raised, thus increasing all the weight radii together. The width of the interval output region will therefore increase and the network output error will fall. The interval weights can be expanded until either some arbitrary error value is reached (e.g. 1% MSE) or until the network error has fallen to zero.
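A sketch of this proportional expansion loop, assuming a user-supplied `mse_fn` that returns the network MSE for a given set of weight centres and radii; the step size, cap and all names are hypothetical.

```python
import numpy as np

def expand_until(mse_fn, centres, step=1e-4, target=1.0, max_prop=0.1):
    """Raise the radii as a common proportion of the centres until the MSE
    falls to the target value (e.g. 1%), or until a proportion cap is hit."""
    prop = 0.0
    while prop <= max_prop:
        radii = prop * np.abs(centres)  # all radii expand together
        mse = mse_fn(centres, radii)
        if mse <= target:
            return prop, mse
        prop += step
    return prop, mse
```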

However, this technique is only assured to work for one-step-ahead errors. If a crisp one-step-ahead prediction is used, then the expanded interval solution is also defined only in regard to one-step-ahead predictions. There is, therefore, no guarantee that it will regress accurately for long-range outputs, so in some cases it may not suffice. This could only be established by testing the long-range output of the interval MLP afterwards.

This weight expansion method is simple and computationally inexpensive as no training algorithms are required, although this simplicity can affect the quality of the results obtained. For example, a crisp solution with a certain MSE value gives no indication of the ‘contribution’ of each of its weights towards that MSE. Even if the MSE is low, one individual weight in the solution may be responsible for the majority of it.

In using this method, all of the network weights would need expanding together by a suitable proportion to overcome the contribution from this ‘bad’ weight. The remaining weights would therefore be expanded by far more than is necessary for the optimum solution, giving a wider output region. An optimization-based method of training the interval weight radii is therefore likely to give a better solution, although the low computational expense of this proportional expansion method does justify attempting it first.

#### (ii) Differential evolution weight radii training

The best optimization training method to use for the interval weight radii is the differential evolution algorithm. Proportional weight expansion is not strictly a training method and does not use the width penalty term described in §3*c*, but this new approach does.

Starting with an initial crisp network solution the interval weight centres are held fixed at those particular values. A variety of different width penalty coefficients are each tested 10 times from randomly initialized populations, to allow for the stochastic nature of the differential evolution algorithm. The algorithm parameters to be used are weighting factor *F*=0.5, crossover constant CR=0.1 and population size NP=85 (five times the number of weights in the crisp network). These particular values of the algorithm parameters were chosen based on previous experience of the authors with differential evolution (Kyprianou *et al*. 2000, 2001). The convergence criterion is when the decrease in total population costs over 10 iterations falls to less than 0.01%. All three of the datasets are to be used for the testing in line with Tarassenko's recommended training method. Each training run operates solely on the training dataset, and only calculates one-step-ahead predictions of the interval network error in (3.9). Upon convergence, the best weight radii solution is taken from the final vector population. For each different value of the width penalty coefficient there are thus 10 possible solutions provided.

These 10 solutions are then tested within an interval network on the validation dataset. The long-range output errors are calculated on their MSE contributions only, using the vertex method described in §3*d*. This is because the final output errors of crisp and interval MLPs trained on the same data need to be in equivalent units for comparison purposes. It can be assumed that the use of the width penalty term throughout the training phase has effectively narrowed the output intervals by this point.

For each possible interval network solution, this simultaneously tests both its generalization and free-run predictive capabilities. By looking at the 10 validation set costs for each coefficient, the best network of the 10 is selected and then used on the test dataset. An interval network solution is therefore provided for each value of the width penalty coefficient *β*.

### (b) Simultaneous weight centre and radius training

Both of the two previous methods in §4*a* operate by fixing the interval weight centres at some optimum crisp solution and then only training the weight radii. A third training strategy though is to train the weight centres and radii together. The training could once again be done by using the differential evolution algorithm. Either the initial populations could be initialized completely randomly, or the parameters corresponding to the interval weight centres could be initialized by perturbations of the optimum crisp solution.

The differential evolution algorithm here would operate in the same way as in §4*a*(ii), with the centre and radius values trained until convergence is met. Ten different algorithm runs could be undertaken for each value of the width penalty coefficient *β* in (3.9), then these 10 interval weight solutions would be tested on the validation dataset. The solution with the lowest associated cost of these 10 would be selected as the optimum interval MLP weights solution for that *β* value, and then used on the test set. The main problem with this approach is that it requires twice as many parameters to be trained, so it may not be practical in relation to computational time.

## 5. Results

In the background work, a crisp solution had been found for the data with a long-range output MSE of 6.5% on the test set. It was achieved using the standard error back-propagation technique with momentum, and prior to any interval network training, an improvement upon this was sought by adjusting the MLP network weights while keeping the network architecture fixed. Using differential evolution with the algorithm parameters described in §4*a*(ii), a crisp solution was found with a long-range output error of 4.1% on the test dataset. The interval model used here can be viewed as modelling the uncertainty between the initial crisp MLP network and the real-life system. With any crisp predictive model of a real-life application, there will always be some discrepancy between the prediction from this model and the actual system being modelled, due to restrictions imposed by the selected structure of the model.

### (a) Proportional interval weight expansion

The relative computational ease of this method enabled several different initial crisp solutions to be considered. Twenty-three different crisp networks were studied, including the 4.1% MSE network, and the error values of these networks ranged up to 9.4% MSE. Figure 5 shows the long-range output MSE of an interval network expanded about the 4.1% crisp solution in relation to the expansion proportion of the weights.

Figure 6 shows an expanded interval network prediction about this crisp solution with a MSE value of 1%. Figure 7 shows an interval network prediction about a crisp solution of 9.4% MSE, also expanded until the long-range MSE is reduced to 1%. In both these figures, the interval bounds are the two dashed curves and the solid line is the true target outputs. For ease of viewing, only the first 500 samples of the test dataset are shown.

### (b) Differential evolution weight radii training

Figure 8 shows how the long-range output MSE varies with the width penalty coefficient *β*. Ten training runs were completed for each different coefficient value to allow for the effect of varying initial conditions. The 4.1% MSE crisp solution was used throughout this as the centres of the various interval network weights and this crisp error is the dashed horizontal line on the graph. Increments of 0.25 were used for the width penalty coefficient and this gave sufficient results for a graph to be constructed with confidence.

Figure 9 shows an interval network prediction trained by this method with a long-range MSE value of 1%. This can be compared to the 1% MSE interval network in figure 6 that was trained by the proportional weight expansion method. Again, the interval bounds are given by the dashed curves and only the first 500 samples of the test set are shown.

As mentioned in §4*b*, training both the interval weight centres and radii simultaneously was much more demanding in relation to computational expense. This approach was briefly investigated for its suitability, but available computational time simply prohibited anything more substantial than this. It is likely that the simultaneous training of MLP weight centres and radii would have given an alternative differential evolution error curve to that of figure 8. This error curve would be necessary for a direct comparison between weight radii training alone and centre and radii training with differential evolution.

Figure 10 shows an interval network prediction obtained from differential evolution training of both the interval weight centres and radii, where the weight centre parameters were initialized as Gaussian perturbations of the 4.1% MSE crisp solution. This network prediction gives a long-range MSE of 0.9% on the test dataset.

## 6. Discussion and conclusions

From figure 5, it can be seen that the long-range MSE of the interval MLP falls monotonically as the weight radii are proportionally increased. This corresponds with the intuitive concept that as this particular output region becomes wider, it therefore contains more of the target data. The decision can then be made by the user as to what value of error is acceptable for their particular application of the interval network.

This relates to the opportunity function in Ben-Haim's information-gap theory (2001). In this case, the uncertainty parameter *α* is the percentage expansion of the interval weights. The desired windfall reward *r*_{w} is the long-range MSE value required by the user, e.g. 1% MSE. The task is therefore to find the minimum expansion percentage that can give that particular MSE value. For instance, from figure 5 it can be seen that in order to obtain an MSE value of 1%, the proportional expansion of the interval weights about the crisp solution needs to be approximately 0.5%.

Figure 6 shows that a 1% MSE interval network can be obtained from the initial 4.1% crisp solution with the interval bounds kept reasonably narrow. It would need to be decided by the user whether that particular prediction width is suitable for the application in question. The precision of the interval bounds obtained depends on the quality of the initial crisp solution, as can be seen by comparing figure 6 with figure 7.

In figure 7, an inferior crisp solution of 9.4% MSE was used as a starting point. In this case, it can be seen that to reach the same long-range MSE value as figure 6, the output bounds need to be noticeably wider. For this starting point, an interval network returning output bounds of similar width to figure 6 would have a significantly larger MSE.

Figure 8 shows the variation of the long-range interval network MSE under differential evolution training of the weight radii. A zero MSE value can be obtained when there is no interval width penalty; the network MSE then increases monotonically with the penalty coefficient *ρ*. This coefficient has an inverse relationship with the width of the output intervals, since increasing it forces the intervals to become narrower during training. As the output intervals become narrower, they contain less of the target data, and so the network MSE increases.
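The penalized objective can be sketched with a minimal hand-rolled differential evolution (rand/1/bin) over the radii of a toy linear interval model; the model, bounds and DE settings below are illustrative assumptions, not the configuration used in the paper:

```python
import numpy as np

def interval_mse(w, r, X, t):
    """MSE of the targets against the interval output of a toy linear
    model with interval weights [w - r, w + r], radii r >= 0."""
    centre, half = X @ w, np.abs(X) @ r
    d = np.maximum((centre - half) - t, 0.0) + np.maximum(t - (centre + half), 0.0)
    return np.mean(d ** 2)

def train_radii_de(w, X, t, rho, pop=20, gens=150, F=0.7, CR=0.9, seed=1):
    """Minimal differential evolution (rand/1/bin, slightly simplified:
    the parent index is not excluded from the donor draw) over the
    weight radii, minimising  interval MSE + rho * mean(radius)."""
    rng = np.random.default_rng(seed)
    dim = len(w)
    cost = lambda r: interval_mse(w, r, X, t) + rho * r.mean()
    P = rng.uniform(0.0, 2.0, size=(pop, dim))       # initial radii
    f = np.array([cost(p) for p in P])
    for _ in range(gens):
        for i in range(pop):
            a, b, c = P[rng.choice(pop, 3, replace=False)]
            trial = np.where(rng.random(dim) < CR, a + F * (b - c), P[i])
            trial = np.clip(trial, 0.0, 2.0)         # keep radii in bounds
            fc = cost(trial)
            if fc <= f[i]:                           # greedy selection
                P[i], f[i] = trial, fc
    best = P[np.argmin(f)]
    return best, interval_mse(w, best, X, t)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w = np.array([0.7, -1.2, 0.4])
t = X @ w + 0.1 * rng.normal(size=100)

r0, mse0 = train_radii_de(w, X, t, rho=0.0)   # no width penalty
r5, mse5 = train_radii_de(w, X, t, rho=5.0)   # heavy width penalty
```

With no penalty the radii are free to widen until the intervals cover the data (near-zero MSE); a heavy penalty drives the radii towards zero and the MSE back up, reproducing the monotone trend of figure 8.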

Ten tests were conducted for each different penalty coefficient and the one with the lowest error was selected, although the variance over the 10 tests was in every case very small. For a given application, the value of the width penalty coefficient can be selected from figure 8 by a similar method to that used in figure 5. If again a 1% interval network MSE is required, figure 8 shows that a width penalty coefficient of approximately 2.75 should be used.

Figure 9 shows an interval network with an MSE value of 1% obtained from the initial 4.1% MSE crisp solution by differential evolution training of the weight radii. Comparison with figure 6 shows that there is only a slight difference between the widths of the interval bounds for the two networks. For this particular initial crisp solution, the differential evolution algorithm offers only a small improvement over the proportional weight expansion approach.

This will not always be the case, though. Both networks were generated from a crisp solution with a low MSE value, which was already very close to the target data. The proportional expansion method was therefore able to give narrow output bounds that could not be significantly improved by differential evolution. For an inferior crisp solution, however (such as that used in figure 7), the differential evolution algorithm is likely to be the more effective of the two.

Figure 10 was included solely to show that simultaneous differential evolution training of the interval weight centres and radii is able to give a good solution, although computational time was not available to investigate this as intensively as in §5*b*. This is suggested as further work.

For the interval network error function in (3.9), the following alternative version could potentially be used:

(6.1)

This, however, does not take into account any variation in magnitude between elements of the crisp solution. A particular interval weight may have a radius that is small in absolute terms, but large as a proportion of its central value (for instance, a radius of 0.01 on a weight of 0.02 is half of its centre, whereas the same radius on a weight of 2.0 is negligible). The interval error function in (6.1) would not capture this, and thus the function given in (3.9) is the better choice.

The differential evolution approach involved training interval networks on their one-step-ahead predictions and then testing the converged radii solutions on their long-range outputs. Changing the objective function of the algorithm to train directly on long-range outputs would most likely give more effective results, since the long-range error measure differs from the one-step-ahead one.

This strategy was considered during the work here, but had to be abandoned owing to time constraints. The vertex method took approximately 10 min to calculate the long-range outputs of a set with 2000 samples, which would equate to over a day for a single generation of the differential evolution algorithm used here (e.g. a population of 150 candidates would require 150 × 10 min = 25 h). Clearly this was impractical, but if a much faster computational method became available it would provide an opportunity to extend this work and obtain improved results.

If the extreme values of the network output occur at the vertices of the weight-space hypercube, which is true in many situations, then the vertex method returns the true network bounds. If, however, the extreme values lie within the hypercube, then the true bounds will be wider than those predicted, and the true error will thus be lower than the computed value. The vertex method can hence be viewed as giving an upper bound on the long-range MSE value of the interval MLP network.
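The vertex method, and the way it can under-estimate the bounds for a non-monotone map, can be sketched as follows (a generic illustration, not the paper's implementation):

```python
from itertools import product
import numpy as np

def vertex_bounds(f, lo, hi):
    """Vertex-method estimate of the range of f over the hypercube
    [lo, hi]: evaluate f at all 2^n vertices and take the extremes.
    Exact whenever f attains its extremes at the vertices."""
    vals = [f(np.array(v)) for v in product(*zip(lo, hi))]
    return min(vals), max(vals)

# monotone map: the vertex method is exact
lo_m, hi_m = vertex_bounds(lambda v: v.sum(), [0.0, -1.0], [1.0, 2.0])

# non-monotone map: the interior maximum (0 at v = 0.5) is missed,
# so the returned upper bound is too low -- the true range is wider
lo_n, hi_n = vertex_bounds(lambda v: -(v[0] - 0.5) ** 2, [0.0], [1.0])
```

For the monotone sum the bounds are exact ((−1, 3) here), while for the quadratic the vertex estimate of the maximum is −0.25 even though the true maximum is 0, illustrating why the method yields an upper bound on the interval MSE.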

The NARX MLP network used here was a relatively simple system, although the optimization of 17 parameters is not insignificant. If this technique of deliberate uncertainty introduction were to be used for a more complicated real-life system, a slight change in approach would be needed. Sensitivity analysis could initially be used to find the parameters most susceptible to uncertainty, and only these parameters would then be set as interval numbers. A previous paper by the authors on radial basis function networks (Chetwynd *et al*. 2005*b*) involved introducing uncertainty into only some of the network parameters rather than all of them, and the overall behaviour was similar to that observed here.
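The suggested sensitivity screening could be sketched as a simple finite-difference ranking; the `model` below and the central-difference measure are illustrative assumptions rather than a prescribed method:

```python
import numpy as np

def sensitivity_ranking(model, w, X, eps=1e-4):
    """Rank parameters by a central-difference sensitivity of the model
    output, averaged over the inputs X; most sensitive index first."""
    s = np.empty(len(w))
    for j in range(len(w)):
        dw = np.zeros_like(w)
        dw[j] = eps
        s[j] = np.mean(np.abs(model(X, w + dw) - model(X, w - dw))) / (2 * eps)
    return np.argsort(s)[::-1]

# toy model in which the middle parameter has no influence at all
model = lambda X, w: w[0] * X + w[2] * X ** 3   # w[1] is unused
X = np.linspace(-1.0, 1.0, 50)
order = sensitivity_ranking(model, np.array([0.5, 0.3, 0.8]), X)
```

Here the unused parameter is ranked last, so under the suggested scheme only the leading parameters would be given interval values.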

## Footnotes

- Received November 9, 2005.
- Accepted March 16, 2006.

- © 2006 The Royal Society