## Abstract

Recently, Cox and Battey (2017 *Proc. Natl Acad. Sci. USA* **114**, 8592–8595 (doi:10.1073/pnas.1703764114)) outlined a procedure for regression analysis when there are a small number of study individuals and a large number of potential explanatory variables, but relatively few of the latter have a real effect. The present paper reports more formal statistical properties. The results are intended primarily to guide the choice of key tuning parameters.

## 1. Introduction

We consider studies of dependence in which there are relatively few individuals on each of which a large number of potential explanatory variables are measured. Genetic microarray experiments are a prime example. For progress, an implicit or explicit assumption of sparsity is required, namely that the great majority of the explanatory variables have no effect. We call explanatory variables with an effect on response signal variables, and those with no effect noise variables.

Powerful procedures for analysis emanate from the lasso [1] (for a discussion of the mathematical aspects, see [2]). They aim to uncover a single small set of signal variables effective for prediction. It is possible, however, that many different choices of explanatory variables are essentially equally effective and have quite different subject-matter implications. Cox & Battey [3] outline a different approach aiming to specify those small sets of potential signal variables that give essentially the same fit; choice between different sets requires either more data or specific subject-matter information. Their procedure also allows informal checks standard in much statistical work, such as those for nonlinearity, interaction or anomalous individuals.

In the present paper, we outline a probabilistic basis for such an analysis. The emphasis is on how the procedure would perform under various idealized scenarios, thereby providing guidance on the choice of key tuning parameters. The focus is on two key questions, arising at each stage and in the overall process of analysis. What is the probability that a signal variable is falsely discarded? How many of the variables ultimately suggested as potentially important are in fact noise variables? Our objective is essentially to calibrate the analytical procedure by specifying its behaviour under idealized conditions, not to set up a procedure to achieve preassigned error rates.

The ideas involved apply rather generally, for example, to likelihood-based fitting of logistic models for binary data.

## 2. Broad outline of procedure

Suppose that *v* variables are to be assessed for their explanatory power for a response variable. It is assumed that *v* is roughly *k*^{d} for *k* ≤ 15 and *d* a small positive integer. In the method as outlined by Cox & Battey [3], the indices of these variables are arranged in a *k* × *k* × *k* cube for *d* = 3 or a *k* × *k* × *k* × *k* hypercube for *d* = 4, etc. Sets of variables are then selected for test *k* at a time by traversing the cube by rows, columns, etc. Every variable thus has associated with it *d* sets of (*k* − 1) companion variables, one set from each time the traversal of the cube passes through that variable. The explanatory power of each variable is assessed alongside its *d* sets of companions and, on the basis of those analyses, variables are either discarded or retained according to a suitable decision rule. For instance, a variable might be retained if it is among the two most significant in at least *d*/2 of the *d* analyses in which it appears.
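As a concrete illustration of the arrangement, the following sketch builds the *d* = 3 case for *v* = 1000 and *k* = 10 and confirms that each variable appears in exactly *d* sets. The helper `cube_sets` is hypothetical, written here in Python for illustration; it is not the implementation in the 'HCmodelSets' package.

```python
from collections import Counter

import numpy as np

def cube_sets(indices, k, d=3):
    """Arrange v = k**d variable indices in a k x ... x k hypercube and
    return the d * k**(d-1) sets of k indices obtained by traversing
    the cube along each of its d dimensions."""
    cube = np.asarray(indices).reshape((k,) * d)
    sets = []
    for axis in range(d):
        # lines of the cube along `axis`: move that axis last, flatten the rest
        sets.extend(np.moveaxis(cube, axis, -1).reshape(-1, k).tolist())
    return sets

sets_ = cube_sets(range(1000), k=10)
counts = Counter(i for s in sets_ for i in s)
print(len(sets_), all(c == 3 for c in counts.values()))  # 300 True
```

Each variable lies on exactly one line of the cube per dimension, so it acquires *d* = 3 sets of *k* − 1 = 9 companions, and the traversal produces 3*k*^{2} = 300 sets in all.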

This is repeated with successively lower-dimensional hypercubes until, ideally, fewer than 20 variables remain, on which informal diagnostic checks are then performed.

The final phase of the analysis is to find small subsets of variables among the augmented set of retained variables and any square or interaction terms suggested at the informal phase. See Cox & Battey [3] for a more detailed description of the exploratory and subset selection phases.

## 3. Specification

At the *i*th stage of the process, let there be respectively *v*_{Si} signal variables and *v*_{Ni} noise variables. In the specific example below, *i* = 0, 1, 2. Thus of the *v*_{S0} signal variables initially present, *v*_{S2} are chosen at the end for detailed study. Given *v*_{S0}, the numbers of signal variables chosen at the first and second stages have binomial distributions of index *v*_{S0} and probabilities *θ*_{S1} and *θ*_{S2} = *θ*_{S1}*θ*_{S2.1}, where, in particular, *θ*_{S2.1} is the conditional probability, given that a specific signal variable is chosen at the first stage, that it is chosen again at the second stage. A different specification is needed for the noise variables because, initially at least, there are a large number of them. In particular, the *v*_{Ni} have Poisson distributions with means *μ*_{Ni}. We write *ϕ*_{N2.1} = *μ*_{N2}/*μ*_{N1} for the probability that a noise variable retained after step 1 is still retained at step 2.
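The two-stage specification can be illustrated by direct simulation; the numerical values below are illustrative only, not taken from the paper. In particular, the thinned Poisson counts for the noise variables remain Poisson, with mean *μ*_{N2} = *ϕ*_{N2.1}*μ*_{N1}.

```python
import numpy as np

rng = np.random.default_rng(0)
v_S0, theta_S1, theta_S21 = 5, 0.9, 0.8   # illustrative values
mu_N1, phi_N21 = 10.0, 0.3                # illustrative values
reps = 200_000

s1 = rng.binomial(v_S0, theta_S1, reps)   # signal variables surviving stage 1
s2 = rng.binomial(s1, theta_S21)          # ... and stage 2: Binomial(v_S0, theta_S1 * theta_S21)
n1 = rng.poisson(mu_N1, reps)             # noise variables surviving stage 1
n2 = rng.binomial(n1, phi_N21)            # thinned Poisson: Poisson(mu_N1 * phi_N21)

print(s2.mean())   # ~ v_S0 * theta_S1 * theta_S21 = 3.6
print(n2.mean())   # ~ mu_N1 * phi_N21 = 3.0
```

The simulated means agree with *v*_{S0}*θ*_{S1}*θ*_{S2.1} and *μ*_{N1}*ϕ*_{N2.1}, and the noise count retains the Poisson property that its variance equals its mean.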

A summary of the properties of a two-stage procedure is thus provided by *θ*_{S2} and *μ*_{N2}, the probability that a signal variable is retained and the expected number of noise variables not rejected.

We now study these in order to compare different strategies of analysis.

## 4. Some reduction strategies

We compare three possible approaches to each analysis of the first stage reduction. Thus, we may

— take the single variable with highest score (most significant);

— take the two variables with highest scores; and

— take all those variables, if any, whose scores exceed a threshold.

Significance tests based on the Wald, score or likelihood ratio statistics are natural choices. The Wald statistic does, however, have the disadvantage of not being parameterization invariant.

In a set of *k* variables chosen at random for test, the number of signal variables has approximately a Poisson distribution of mean *kv*_{S0}/*v*, where *v* = *v*_{S0} + *v*_{N0} and *v*_{S0} is assumed modest; the set hence contains two or more signal variables with probability approximately *k*^{2}*v*^{2}_{S0}/(2*v*^{2}). This is small for the situations to be considered and does not affect the qualitative comparison of procedures. It is, therefore, ignored in later calculations.
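The adequacy of ignoring this event is easily checked by a small computation; the helper below is illustrative. For *k* = 10, *v*_{S0} = 5 and *v* = 1000, the exact (hypergeometric) probability of two or more signal variables in a randomly chosen set lies below the Poisson-based value *k*^{2}*v*^{2}_{S0}/(2*v*^{2}) = 1.25 × 10^{−3}.

```python
from math import comb

def p_two_or_more(k, v_S0, v):
    """Exact P(>= 2 signal variables in a random set of k out of v),
    with v_S0 signal variables in total (hypergeometric sampling)."""
    v_N = v - v_S0
    p0 = comb(v_N, k) / comb(v, k)             # no signal variable in the set
    p1 = v_S0 * comb(v_N, k - 1) / comb(v, k)  # exactly one signal variable
    return 1 - p0 - p1

k, v_S0, v = 10, 5, 1000
exact = p_two_or_more(k, v_S0, v)
approx = k**2 * v_S0**2 / (2 * v**2)
print(exact, approx)   # both of order 1e-3
```

Both quantities are of the order 10^{−3}, confirming that the event is rare enough to neglect in the comparisons that follow.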

It is convenient to phrase the initial discussion in terms of significance tests and their associated *p*-values. For noise variables, these are uniformly distributed on (0, 1), and for signal variables their density can be modelled as (1 − *γ*)*x*^{−γ}, where 0 < *γ* < 1.

In the following calculations of the probability *𝜗*^{(j)} that the *j*th procedure listed above chooses the signal variable, it is convenient to use Stirling's formula in the form that for large *k* and fixed *a* the ratio *Γ*(*k* + *a*)/*Γ*(*k*) is close to *k*^{a}.

If only the most significant out of *k* is chosen, the probability that the signal variable is taken is

*𝜗*^{(1)} = ∫_{0}^{1}(1 − *γ*)*x*^{−γ}(1 − *x*)^{k−1} d*x* = *Γ*(2 − *γ*)*Γ*(*k*)/*Γ*(*k* + 1 − *γ*) ≈ *Γ*(2 − *γ*)*k*^{γ−1}.   (4.1)

At *γ* = 0, there is no distinction between signal and noise variables and the notional signal variable is selected with probability 1/*k*, that is, essentially at random.

If now we take the two most significant values out of *k*, then *𝜗*^{(2)} = *𝜗*^{(1)} + *𝜗*^{(2.1)}, where

*𝜗*^{(2.1)} = (*k* − 1)∫_{0}^{1}(1 − *γ*)*x*^{1−γ}(1 − *x*)^{k−2} d*x* = (1 − *γ*)*Γ*(2 − *γ*)*Γ*(*k*)/*Γ*(*k* + 1 − *γ*) ≈ (1 − *γ*)*Γ*(2 − *γ*)*k*^{γ−1},   (4.2)

so that *𝜗*^{(2.1)}/*𝜗*^{(1)} = 1 − *γ*. The difference between *𝜗*^{(2)} and *𝜗*^{(1)} is maximized at approximately *γ*_{0} = (log*k* − 1)/log*k*, at which *𝜗*^{(2.1)} is approximately *Γ*{1 + (log*k*)^{−1}}/(*e*log*k*) (figure 1).

If we were to set a critical level at *α*, then a noise variable is retained with probability *α* and the corresponding probability for a signal is

*𝜗*^{(3)} = ∫_{0}^{α}(1 − *γ*)*x*^{−γ} d*x* = *α*^{1−γ}.

The equalities *𝜗*^{(3)} = *𝜗*^{(1)} and *𝜗*^{(3)} = *𝜗*^{(2)} are achieved for any *γ* by setting, respectively, *α* = *k*^{−1}*Γ*(2 − *γ*)^{1/(1−γ)} ≈ *k*^{−1} and *α* = *k*^{−1}{(2 − *γ*)*Γ*(2 − *γ*)}^{1/(1−γ)} ≈ (2 − *γ*)^{1/(1−γ)}*k*^{−1}.

This shows that strategy 3 with *α* = 1/*k* and strategy 1 are equivalent, as this choice also makes *μ*^{(3)} = *μ*^{(1)}, where *μ*^{(j)} denotes the expected number of noise variables retained in a single analysis under the *j*th strategy. By contrast, (2 − *γ*)^{1/(1−γ)} > 2 for all 0 < *γ* < 1, meaning that, for the same probability of selecting a signal variable, strategy 3 necessarily selects more noise variables than strategy 2 and, therefore, is strictly inferior.
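The quantities above are easy to evaluate exactly, which gives a check on the Stirling-type approximation; a sketch with illustrative values (*k* = 10, *γ* = 0.5) and hypothetical helper names:

```python
from math import gamma

def vartheta1(k, g):
    """Exact P(the signal p-value is smallest of k): the Beta integral
    (1 - g) B(1 - g, k) = Gamma(2 - g) Gamma(k) / Gamma(k + 1 - g)."""
    return gamma(2 - g) * gamma(k) / gamma(k + 1 - g)

k, g = 10, 0.5
v1 = vartheta1(k, g)
v21 = (1 - g) * v1                       # P(signal ranks second), so v2 = (2 - g) * v1
stirling = gamma(2 - g) * k ** (g - 1)   # uses Gamma(k)/Gamma(k + 1 - g) ~ k**(g - 1)
alpha1 = v1 ** (1 / (1 - g))             # critical level equating strategies 3 and 1
print(v1, stirling)                      # agreement within about 2%
```

At *γ* = 0 the exact value reduces to 1/*k*, as it must, and the Stirling-based approximation is already accurate to a few per cent at *k* = 10.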

In replacing the second strategy by the first, *μ*_{N1} and consequently *μ*_{N2} are reduced by roughly a factor of four independently of *γ*. However, the penalties of including noise variables are mainly incidental in view of the exhaustive search over models that must later take place. The associated computational burdens are reasonable unless *μ*_{N2} is, for example, 20 or more, and so strategy 2 would normally be preferred for its higher *θ*_{S1}.

The probability *θ*^{(j)}_{S1} that the *j*th procedure retains a signal variable through the first stage is *θ*^{(j)}_{S1} = *𝜗*^{(j)3} + 3*𝜗*^{(j)2}(1 − *𝜗*^{(j)}) and so, approximately,

*θ*^{(2)}_{S1} − *θ*^{(1)}_{S1} ≈ 3{*𝜗*^{(2)2} − *𝜗*^{(1)2}} = 3(1 − *γ*)(3 − *γ*)*Γ*(2 − *γ*)^{2}*k*^{2(γ−1)},   (4.3)

which, vanishing only at *γ* = 1, is the reduction in survival probability of signal variables from replacing the second procedure by the first. Here we have approximated by treating analyses in which the same variable appears as independent. The difference (4.3) is approximately

3(3 − *γ*_{0})*Γ*(2 − *γ*_{0})^{2}/(*e*^{2}log*k*)   (4.4)

at *γ* = *γ*_{0}, and as the slowest decaying term in (4.4) is of the order of 1/(*e*^{2}log*k*) this shows the potential advantages of strategy 2 over strategy 1 to decrease only very slowly with *k*.
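The retention probability used here, that of being selected in at least two of three independent analyses, can be checked by direct enumeration; `retain_prob` is an illustrative helper:

```python
from itertools import product

def retain_prob(vt, d=3, need=2):
    """P(variable kept in at least `need` of `d` independent analyses,
    each retaining it with probability vt), by direct enumeration of
    the 2**d selection patterns."""
    return sum(
        vt ** sum(b) * (1 - vt) ** (d - sum(b))
        for b in product([0, 1], repeat=d)
        if sum(b) >= need
    )

vt = 0.3
closed_form = vt**3 + 3 * vt**2 * (1 - vt)
print(retain_prob(vt), closed_form)   # both ~ 0.216
```

The enumeration reproduces the closed form *𝜗*^{3} + 3*𝜗*^{2}(1 − *𝜗*) exactly.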

A less general but more conventional formulation is to assume a normal theory linear model. Define Δ to be the signal strength multiplied by the square root of the sample size. The values of *𝜗*^{(j)} are then

*𝜗*^{(1)} = ∫*ϕ*(*x* − Δ)*Φ*(*x*)^{k−1} d*x*,  *𝜗*^{(2.1)} = (*k* − 1)∫*ϕ*(*x* − Δ)*Φ*(*x*)^{k−2}*Φ*( − *x*) d*x*,  *𝜗*^{(3)} = *Φ*(Δ − *κ**),

where *ϕ*(*x*) and *Φ*(*x*) are the standard normal probability density and cumulative distribution function at *x*, respectively. The threshold in the third strategy is, by convention, calibrated as the upper quantile *κ** of the standard normal distribution at level *α*, so that *𝜗*^{(3)} is available in closed form. We thereby approximate *𝜗*^{(1)} by numerical integration and *𝜗*^{(2.1)} as follows.

The integral in *𝜗*^{(2.1)} is of the form ∫*g*(*x*)exp{*νh*(*x*, *ν*)} d*x*, in which *g*(*x*) is uniformly bounded in *x* and *h* has a single maximum, *x*_{0}(*ν*), which varies with *ν*. Specifically we take *g*(*x*) = *ϕ*(*x* − Δ), *ν* = *k* − 2 and *h*(*x*, *ν*) = log*Φ*(*x*) + log*Φ*( − *x*)/*ν*. The integral *𝜗*^{(2.1)} is thus approximable by the method of Laplace (e.g. [4, pp. 60–65]). The idea in outline is that, for large *ν*, by far the greatest contribution to the integral comes from a neighbourhood of *x*_{0}. The integral is evaluated by expanding *g* and *h* in a neighbourhood of *x* = *x*_{0}, leading to

∫*g*(*x*)exp{*νh*(*x*, *ν*)} d*x* ≈ *g*(*x*_{0})exp{*νh*(*x*_{0}, *ν*)}[2π/{ − *νh*^{′′}(*x*_{0}, *ν*)}]^{1/2},   (4.5)

where ^{′} denotes the partial derivative with respect to the first argument. It is customary when constructing Laplace integrals to define *h* such that its maximum is independent of *ν*. This is not possible here as, without the contribution of the *Φ*( − *x*) term, *h* has no sharp maximum. That the form in (4.5) is permissible is discussed by Copson [5, pp. 42–47].

The second derivative at *x* is

*h*^{′′}(*x*, *ν*) = − *ϕ*(*x*){*xΦ*(*x*) + *ϕ*(*x*)}/*Φ*(*x*)^{2} − *ϕ*(*x*){*ϕ*(*x*) − *xΦ*( − *x*)}/{*νΦ*( − *x*)^{2}},

and the maximum *x*_{0} of *h* satisfies

*ϕ*(*x*_{0})/*Φ*(*x*_{0}) = *ϕ*(*x*_{0})/{*νΦ*( − *x*_{0})},  that is,  *Φ*( − *x*_{0}) = *Φ*(*x*_{0})/*ν*.

The resulting approximation to *𝜗*^{(2.1)} is

*𝜗*^{(2.1)} ≈ (*k* − 1)*ϕ*(*x*_{0} − Δ)*Φ*(*x*_{0})^{k−2}*Φ*( − *x*_{0})[2π/{ − (*k* − 2)*h*^{′′}(*x*_{0}, *k* − 2)}]^{1/2},   (4.9)

which, as a function of Δ, is maximized in the neighbourhood of Δ_{0}(*k*) = − *Φ*^{−1}(1/*k*). Figure 2*a* shows the accuracy of the approximation (4.9) for *k* = 10.
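The normal-theory value of *𝜗*^{(2.1)} is also easy to evaluate by direct quadrature, which provides an independent check: at Δ = 0 the signal variable is exchangeable with the noise variables, so the probability that it ranks second among *k* is exactly 1/*k*. The following sketch uses illustrative helper names and a simple trapezoidal rule rather than the Laplace approximation.

```python
from math import erf, exp, pi, sqrt

def Phi(x):  # standard normal cumulative distribution function
    return 0.5 * (1 + erf(x / sqrt(2)))

def phi(x):  # standard normal density
    return exp(-x * x / 2) / sqrt(2 * pi)

def vartheta21_normal(k, Delta, a=-10.0, b=10.0, n=4000):
    """(k - 1) * integral of phi(x - Delta) Phi(x)**(k - 2) Phi(-x) dx,
    by the trapezoidal rule; the integrand is negligible outside (a, b)."""
    h = (b - a) / n
    total = 0.0
    for i in range(n + 1):
        x = a + i * h
        w = 0.5 if i in (0, n) else 1.0
        total += w * phi(x - Delta) * Phi(x) ** (k - 2) * Phi(-x)
    return (k - 1) * h * total

print(vartheta21_normal(10, 0.0))   # ~ 0.1 = 1/k
```

The quadrature also confirms that *𝜗*^{(2.1)} increases with Δ up to the neighbourhood of Δ_{0}(*k*).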

The qualitative implications of the normal theory formulation are the same as those of the formulation in terms of *p*-values. For instance, at Δ = 0, which corresponds to *γ* = 0, equation (4.9) is 1/*k* to two decimal places for *k* sufficiently large. If *γ* is redefined as *γ* = {cosh(Δ) − 1}/cosh(Δ), then equation (4.2) has a unique maximum at sech^{−1}(1/log*k*), which, like Δ_{0}(*k*), is a very slowly growing function of *k*. Figure 2*b* shows the approximation in equation (4.2) as a function of Δ with *γ* redefined as described. Any mapping Δ↦*g*(Δ) such that *g*(0) = 0 and *g*(*x*) ≈ 1 for *x* > 4 gives essentially the same qualitative conclusions.

## 5. Relative severity of each stage

We now consider the relative advantages of

— a mild first stage and severe second stage

— a severe first stage and a mild second stage,

where, for clarity of exposition, ‘mild’ and ‘severe’ are associated with the same quantitative values in both versions. In particular, let 1 > *𝜗*_{H} > *𝜗*_{L} > 0, where *𝜗*_{H} and *𝜗*_{L} are the probabilities that a signal variable is retained in a single analysis of a mild and severe reduction, respectively. Let *θ*^{HL}_{S2} and *θ*^{LH}_{S2} denote the overall survival probabilities of the first and the second of these procedures, respectively. Under the retention rules of §2, a variable survives the cube stage if selected in at least two of its three analyses and the square stage if selected in at least one of its two analyses, so that

*θ*^{HL}_{S2} − *θ*^{LH}_{S2} = *𝜗*_{H}^{2}(3 − 2*𝜗*_{H})*𝜗*_{L}(2 − *𝜗*_{L}) − *𝜗*_{L}^{2}(3 − 2*𝜗*_{L})*𝜗*_{H}(2 − *𝜗*_{H}) = 2*𝜗*_{H}*𝜗*_{L}(*𝜗*_{H} − *𝜗*_{L}){3 − 2(*𝜗*_{H} + *𝜗*_{L}) + *𝜗*_{H}*𝜗*_{L}}.   (5.1)

The supremum of 2(*𝜗*_{H} + *𝜗*_{L}) − *𝜗*_{H}*𝜗*_{L} is 3, attained only in the limit as *𝜗*_{H}, *𝜗*_{L} → 1, which is excluded because *𝜗*_{H}, *𝜗*_{L} < 1, and so equation (5.1) implies *θ*^{HL}_{S2} > *θ*^{LH}_{S2}. That is, the first procedure gives a higher probability of a signal variable being retained than the second. The same argument shows that the first procedure also results in a higher probability of noise variables being retained.
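Under the retention rules of §2 (at least two of three analyses at the cube stage, at least one of two at the square stage), the ordering of the two procedures can also be confirmed numerically over a grid of (*𝜗*_{H}, *𝜗*_{L}) values; the helper names below are illustrative.

```python
def stage_cube(vt):
    """P(retained at the cube stage: selected in >= 2 of 3 analyses)."""
    return vt**3 + 3 * vt**2 * (1 - vt)

def stage_square(vt):
    """P(retained at the square stage: selected in >= 1 of 2 analyses)."""
    return 1 - (1 - vt)**2

grid = [i / 100 for i in range(1, 100)]
mild_first_wins = all(
    stage_cube(h) * stage_square(l) > stage_cube(l) * stage_square(h)
    for h in grid for l in grid if h > l
)
print(mild_first_wins)   # True
```

The inequality holds strictly at every grid point, in agreement with the sign of the factor 3 − 2(*𝜗*_{H} + *𝜗*_{L}) + *𝜗*_{H}*𝜗*_{L}.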

## 6. A simple example

The performance of various stages of the procedure may be investigated as follows. Generate *n* = 10^{2} observations of *v* = 10^{3} variables from a normal distribution of zero mean and covariance matrix *PΣP*^{−1}, where *P* is a permutation matrix and *Σ* is an identity matrix with one diagonal block replaced by a correlation matrix of dimension *v*_{S0} + *v*_{C0} with equal correlation *ρ*. For each replicate, *v*_{S0} of the *v*_{S0} + *v*_{C0} correlated variables are multiplied by constant coefficients and added, together with standard normal noise, to form the response. Conditional on a realization *X*_{S} of the *v*_{S0} signal variables, the response variable is thus normally distributed with mean *X*_{S}*γ*, where *γ* is a *v*_{S0}-dimensional vector of constants. Arrange the indices of the *v* variables in a 10 × 10 × 10 cube. The rows, the columns, etc. of the cube form 3*k*^{2} = 300 sets, each of *k* = 10 variables. Reduction is performed by taking the top two scoring variables in each analysis in the first stage and by taking all those exceeding the threshold of a 0.1% level test in the second stage. Finally, small subsets of variables that give an adequate fit are found using a likelihood ratio test against the comprehensive model.
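The data-generating scheme can be sketched as follows (in Python rather than R; the variable names and the numerical values of *v*_{S0}, *v*_{C0}, *ρ* and the signal coefficients are illustrative, not those used for table 1):

```python
import numpy as np

rng = np.random.default_rng(1)
n, v = 100, 1000
v_S0, v_C0, rho = 5, 10, 0.5            # illustrative values
coef = np.full(v_S0, 1.0)               # constant signal coefficients (illustrative)

# identity covariance with one equicorrelated diagonal block
b = v_S0 + v_C0
Sigma = np.eye(v)
Sigma[:b, :b] = rho + (1 - rho) * np.eye(b)

perm = rng.permutation(v)               # P Sigma P^T scatters the block among the v indices
X = rng.multivariate_normal(np.zeros(v), Sigma, size=n)[:, perm]
signal_cols = np.argsort(perm)[:v_S0]   # new positions of the signal variables
y = X[:, signal_cols] @ coef + rng.standard_normal(n)
```

The reduction and subset-selection stages would then be applied to `X` and `y` exactly as described above.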

Summaries estimated from 500 Monte Carlo replications are reported in table 1. The general conclusion is that such correlation slightly degrades the probability of a signal variable being retained and increases the number of false models not rejected. The comprehensive model in the likelihood ratio test is taken as the model with all variables retained through the reduction phase. Having been selected in the light of the data, it achieves a better fit to the data than an arbitrary model embedding the one to be tested and so the probability of the true model being accepted, conditional on it having been retained, is lower than the nominal coverage probability of the likelihood ratio test. A simple way to recover the conditional nominal coverage is to split the sample. Table 1 reports the improvement in coverage probability and the associated increase in the number of false models not rejected from using 70 observations for reduction and the remaining 30 to construct the final set of low-dimensional models.

## 7. Discussion

### (a) On the choice of *k*

The combinatorial arrangements used are essentially the partially balanced incomplete block designs introduced 80 years ago in the context of plant breeding trials [6]. Our treatment has thus taken *k* as determined *a priori*, preferably below 15. An alternative would be to draw random sets of size *k* for each variable, leaving the choice of *k* undetermined. The argument against too large a value of *k* is that the precision of the resulting estimates is degraded by the correlations induced when there are many correlated variables among those selected for test.

The following is a simplistic formulation outlining the effect of *k* on the estimated standard errors and ignoring other aspects.

Suppose *Z*_{1}, …, *Z*_{k} have zero mean, unit variance and are weakly correlated with average correlation *ρ̄*. The ratio of the variance of *l*^{T}*Z*, an estimated effect, to that assuming independence is approximately

1 − *ρ̄* + *ρ̄*(∑_{i}*l*_{i})^{2}/∑_{i}*l*_{i}^{2}.

The worst case arises when the correlations are all of the same sign and the *l*_{i} are approximately equal, for which the ratio becomes approximately 1 + (*k* − 1)*ρ̄*. If correlations of the order 0.05–0.1 are likely to be present, *k* of the order 10–20 is a reasonable choice, justifying the choice described here and in the previous paper.
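For equicorrelated *Z* the ratio is exact and easily checked; `variance_ratio` is an illustrative helper, contrasting the worst case of equal *l*_{i} with a contrast of mixed signs:

```python
import numpy as np

def variance_ratio(l, rho_bar):
    """var(l^T Z) for equicorrelated Z (unit variances, common
    correlation rho_bar), divided by its value sum(l_i**2) under
    independence."""
    l = np.asarray(l, dtype=float)
    R = rho_bar + (1 - rho_bar) * np.eye(len(l))   # equicorrelation matrix
    return float(l @ R @ l / (l @ l))

k, rho_bar = 10, 0.1
print(variance_ratio(np.ones(k), rho_bar))    # worst case: 1 + (k - 1) * rho_bar = 1.9
print(variance_ratio([1, -1] * 5, rho_bar))   # mixed signs deflate: 1 - rho_bar = 0.9
```

With *k* = 10 and *ρ̄* = 0.1 the standard error of an equally weighted contrast is thus inflated by a factor of about √1.9 ≈ 1.4 relative to the independence calculation.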

### (b) Arrangement randomization

Strong reassurance of the security of one's conclusions is given if, upon re-randomization of the arrangement of the variable indices in the cube, the outcome is relatively stable. An unstable outcome would most likely indicate that too severe a reduction has been used. The probabilistic properties set forth in §4 guarantee robustness to re-randomization under idealized conditions provided that the decision rules used are not too severe, and empirical experience with microarray data supports this for continuous responses. Some applications with binary outcome require caution. In particular, when *v* is very large, there may be an appreciable number of relatively low-dimensional sets of variables perfectly predicting the outcome. If a set of variables separates the outcomes in this way, then so does any larger set containing it. The maximized likelihood would, in such a case, be approached only as the parameter estimates diverge, and consequently all sets of *k* variables containing a primitive set of separating variables would be retained, leading to too mild a reduction.

The special features of the procedure with binary responses will not be explored in the present paper. Note, however, that agreement of the outcome with the lasso solution should not be expected with binary responses for the reasons outlined by Cox & Battey [3].

### (c) Further remarks

The analysis discussed in the paper is intended to be largely exploratory. The object of the formal probabilistic discussion is to calibrate the procedure to show how it performs under idealized conditions. It is not aimed to justify an explicit probabilistically based assessment, such as a confidence coefficient, attached to a specific analysis. To develop such an assessment, a Bayesian analysis might be considered. For this, meaningful prior probabilities have to be attached to key unknown features, such as the true number of signal variables. A flat or so-called indifference prior would in this case be inappropriate in that it would put overwhelming probability on large values; a Poisson prior distribution of modest mean might be more suitable. Aspects describing the correlation structure of errors also need explicit formulation. At this stage of the work, we have not followed that route.

## Data accessibility

This article has no additional data. The procedure as described in §2 and in the previous paper is implemented in the R package 'HCmodelSets', written by H. H. Hoeltgebaum and available on the Comprehensive R Archive Network.

## Authors' contributions

H.S.B. and D.R.C. designed the research, performed the research and wrote the paper.

## Competing interests

We declare we have no competing interests.

## Funding

The work was supported by a UK Engineering and Physical Sciences Research Council fellowship (EP/P002757/1) to H.S.B.

- Received September 7, 2017.
- Accepted June 4, 2018.

- © 2018 The Authors.

Published by the Royal Society under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/4.0/, which permits unrestricted use, provided the original author and source are credited.