## Abstract

Traditional quantum state tomography requires a number of measurements that grows exponentially with the number of qubits *n*. But using ideas from computational learning theory, we show that one can do exponentially better in a statistical setting. In particular, to predict the outcomes of *most* measurements drawn from an arbitrary probability distribution, one needs only a number of sample measurements that grows linearly with *n*. This theorem has the conceptual implication that quantum states, despite being exponentially long vectors, are nevertheless ‘reasonable’ in a learning theory sense. The theorem also has two applications to quantum computing: first, a new simulation of quantum one-way communication protocols and second, the use of trusted classical advice to verify untrusted quantum advice.

## 1. Introduction

Suppose we have a physical process that produces a quantum state. By applying the process repeatedly, we can prepare as many copies of the state as we want, and can then measure each copy in a basis of our choice. The goal is to learn an approximate description of the state by combining the various measurement outcomes.

This problem is called *quantum state tomography*, and it is already an important task in experimental physics. To give some examples, tomography has been used to obtain a detailed picture of a chemical reaction (namely, the dissociation of I_{2} molecules; Skovsen *et al*. 2003), to confirm the preparation of three-photon (Resch *et al*. 2005) and eight-ion (Häffner *et al*. 2005) entangled states, to test controlled-NOT gates (O'Brien *et al*. 2003) and to characterize optical devices (D'Ariano *et al*. 2002).

Physicists would like to scale up tomography to larger systems in order to study the many-particle entangled states that arise (for example) in chemistry, condensed-matter physics and quantum information. But there is a fundamental obstacle in doing so. This is that, to reconstruct an *n*-qubit state, one needs to measure a number of observables that grows exponentially in *n*: in particular like 4^{n}, the number of parameters in a 2^{n}×2^{n} density matrix. This exponentiality is certainly a practical problem—Häffner *et al*. (2005) report that to reconstruct an entangled state of eight calcium ions, they needed to perform 656 100 experiments! But to us it is a theoretical problem as well. For it suggests that learning an arbitrary state of (say) a thousand particles would take longer than the age of the Universe, even for a being with unlimited computational power. This, in turn, raises the question of what one even *means* when talking about such a state. For whatever else a quantum state might be, *at the least* it ought to be a hypothesis that encapsulates previous observations of a physical system, and thereby lets us predict future observations!

Our purpose here is to propose a new resolution for this conundrum. We will show that to predict the outcomes of ‘most’ measurements on a quantum state, where ‘most’ means with respect to any probability distribution of one's choice, it suffices to perform a number of sample measurements that grows only *linearly* with the number of qubits *n*. To be clear, this is not a replacement for standard quantum state tomography, since the hypothesis state that is output could be arbitrarily far from the true state in the usual trace distance metric. All we ask is that the hypothesis state is hard to distinguish from the true state with respect to a given distribution over measurements. This is a more modest goal—but even so, it might be surprising that changing the goal in this way gives an *exponential* improvement in the number of measurements required.

As a bonus, we will be able to use our learning theorem to prove two new results in quantum computing and information. The first result is a new relationship between randomized and quantum one-way communication complexities: namely that *R*^{1}(*f*)=*O*(*MQ*^{1}(*f*)) for any partial or total Boolean function *f*, where *R*^{1}(*f*) is the randomized one-way communication complexity of *f*, *Q*^{1}(*f*) is the quantum one-way communication complexity and *M* is the length of the recipient's input. The second result says that *trusted classical advice can be used to verify untrusted quantum advice on most inputs*—or in terms of complexity classes, that HeurBQP/qpoly ⊆ HeurQMA/poly. Both of these results follow from our learning theorem in intuitively appealing ways; on the other hand, we would have no idea how to prove these results without the theorem.

We wish to stress that the main contribution of this paper is conceptual rather than technical. All of the ‘heavy mathematical lifting’ needed to prove the learning theorem has already been done: once one has the appropriate set-up, the theorem follows readily by combining previous results due to Bartlett & Long (1998) and Ambainis *et al*. (2002). Indeed, what is surprising to us is precisely that such a basic theorem was not discovered earlier.

The paper is organized as follows. We first give a formal statement of our learning theorem in §1*a*, then answer objections to it in §1*b*, situate it in the context of earlier work in §1*c* and discuss its implications in §1*d*. In §2 we review some necessary results from computational learning theory and quantum information theory, and then prove our main theorem. Section 3 applies the learning theorem to communication complexity, while §4 applies it to quantum computational complexity and untrusted quantum advice. We conclude in §5 with some open problems.

### (a) Statement of result

Let *ρ* be an *n*-qubit mixed state: that is, a 2^{n}×2^{n} Hermitian positive semidefinite matrix with Tr(*ρ*)=1. By a *measurement* of *ρ*, we will mean a ‘two-outcome POVM’: that is, a 2^{n}×2^{n} Hermitian matrix *E* with eigenvalues in [0,1]. Such a measurement *E accepts ρ* with probability Tr(*Eρ*) and *rejects ρ* with probability 1−Tr(*Eρ*).
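For concreteness, the acceptance probability Tr(*Eρ*) is a one-line computation; the following NumPy sketch (our own illustration, not part of the formal development) checks it on a single qubit.

```python
import numpy as np

def acceptance_probability(E, rho):
    # A two-outcome POVM element E accepts the state rho with probability Tr(E rho).
    return float(np.real(np.trace(E @ rho)))

# Example: the projector onto |0> applied to the maximally mixed qubit.
E0 = np.array([[1.0, 0.0], [0.0, 0.0]])
rho_mixed = np.eye(2) / 2
p = acceptance_probability(E0, rho_mixed)   # 0.5
```

Here the identity matrix is a valid measurement that accepts every state with probability 1, and any Hermitian *E* with eigenvalues in [0, 1] may be substituted.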

Our goal will be to learn *ρ*. Our notion of ‘learning’ here is purely operational: we want a procedure that, given a measurement *E*, estimates the acceptance probability Tr(*Eρ*). Of course, estimating Tr(*Eρ*) for *every E* is the same as estimating *ρ* itself, and we know this requires exponentially many measurements. So if we want to learn *ρ* using fewer measurements, then we will have to settle for some weaker success criterion. The criterion we adopt is that we should be able to estimate Tr(*Eρ*) for *most* measurements *E*. In other words, we assume there is some (possibly unknown) probability distribution 𝒟 from which the measurements are drawn.1 We are given a ‘training set’ of measurements *E*_{1}, …, *E*_{m} drawn independently from 𝒟, as well as the approximate values of Tr(*E*_{i}*ρ*) for *i*∈{1, …, *m*}. Our goal is to estimate Tr(*Eρ*) for most *E*s drawn from 𝒟, with high probability over the choice of training set.

We will show that this can be done using a number of training measurements *m* that grows only *linearly* with the number of qubits *n*, and inverse polynomially with the relevant error parameters. Furthermore, the learning procedure that achieves this bound is the simplest one imaginable: it suffices to find any ‘hypothesis state’ *σ* such that Tr(*E*_{i}*σ*)≈Tr(*E*_{i}*ρ*) for all *i*. Then with high probability that hypothesis will ‘generalize’, in the sense that Tr(*Eσ*)≈Tr(*Eρ*) for most *E*s drawn from 𝒟. More precisely,

**Theorem 1.1.** *Let ρ be an n-qubit mixed state, let* 𝒟 *be a distribution over two-outcome measurements of ρ and let* ℰ=(*E*_{1}, …, *E*_{m}) *be a training set consisting of m measurements drawn independently from* 𝒟. *In addition, fix error parameters ϵ, η, γ*>0 *with γϵ*≥7*η. Call* ℰ *a ‘good’ training set if any hypothesis σ that satisfies*

|Tr(*E*_{i}*σ*)−Tr(*E*_{i}*ρ*)| ≤ *η* *for all E*_{i}∈ℰ

*also satisfies*

Pr_{*E*∼𝒟}[ |Tr(*Eσ*)−Tr(*Eρ*)| > *γ* ] ≤ *ϵ*.

*Then there exists a constant K>*0 *such that* ℰ *is a good training set with probability at least* 1*−δ, provided that*

*m* ≥ (*K*/(*γ*^{2}*ϵ*^{2})) ( (*n*/(*γ*^{2}*ϵ*^{2})) log^{2}(1/(*γϵ*)) + log(1/*δ*) ).

A proof will be given in §2.
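To illustrate the learning procedure on the smallest possible example, the following sketch (a toy of our own devising; the crude random search over Bloch vectors is merely illustrative and is not the paper's algorithm) fits a single-qubit hypothesis state to a training set of Haar-random projective measurements, and then checks that the fit generalizes to fresh measurements from the same distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def qubit_state(r):
    """Density matrix of the qubit with Bloch vector r (requires |r| <= 1)."""
    sx = np.array([[0, 1], [1, 0]], dtype=complex)
    sy = np.array([[0, -1j], [1j, 0]], dtype=complex)
    sz = np.array([[1, 0], [0, -1]], dtype=complex)
    return 0.5 * (np.eye(2) + r[0] * sx + r[1] * sy + r[2] * sz)

def random_projector():
    """Rank-1 projector onto a Haar-random qubit state: our measurement distribution."""
    v = rng.normal(size=2) + 1j * rng.normal(size=2)
    v /= np.linalg.norm(v)
    return np.outer(v, v.conj())

def prob(E, rho):
    # Acceptance probability Tr(E rho).
    return float(np.trace(E @ rho).real)

# Unknown "true" state, plus a training set of measurements and their statistics.
rho = qubit_state([0.3, -0.5, 0.6])
train = [random_projector() for _ in range(30)]
train_probs = [prob(E, rho) for E in train]

# Crude hypothesis search: any sigma that fits the training data will do.
best_r, best_err = None, np.inf
for _ in range(4000):
    r = rng.uniform(-1, 1, size=3)
    if np.linalg.norm(r) <= 1:
        err = max(abs(prob(E, qubit_state(r)) - p) for E, p in zip(train, train_probs))
        if err < best_err:
            best_r, best_err = r, err

# Generalization: the fitted hypothesis also agrees on fresh measurements.
sigma = qubit_state(best_r)
test_err = max(abs(prob(E, sigma) - prob(E, rho))
               for E in (random_projector() for _ in range(200)))
```

With the seeded run above, both the training and the hold-out errors should come out small: fitting ∼30 sample measurements already pins down the single-qubit state for ‘most’ Haar-random projectors, in the spirit of the theorem.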

### (b) Objections and variations

Before proceeding further, it will be helpful to answer various objections that might be raised against theorem 1.1. Along the way, we will also state two variations of the theorem.

*By changing the goal to a statistical one*, *theorem 1.1* *dodges much of the quantum state tomography problem as ordinarily understood*.

Yes, that is exactly what it does! The motivating idea is that one does not need to know the expectation values for *all* observables, only for most of the observables that will actually be measured. As an example, if we can apply only one- and two-qubit measurements, then the outcomes of three-qubit measurements are irrelevant by assumption. As a less trivial example, suppose the measurement distribution is uniformly random (i.e. the Haar measure). Then even if our quantum system is ‘really’ in some pure state |*ψ*〉, for reasonably large *n* it will be billions of years before we happen upon a measurement that distinguishes |*ψ*〉 from the maximally mixed state. Hence the maximally mixed state is perfectly adequate as an explanatory hypothesis, despite being far from |*ψ*〉 in the usual metrics, such as trace distance.

*But to apply* *theorem 1.1*, *one needs the measurements to be drawn independently from some probability distribution* . *Is this not a strange assumption? Should one also not allow adaptive measurements?*

If all of our training data involved measurements in the {|0〉, |1〉} basis, then regardless of how much data we had, clearly we could not hope to simulate a measurement in the {|+〉, |−〉} basis. Therefore, as usual in learning theory, to get anywhere we need to make *some* assumption to the effect that the future will resemble the past. Such an assumption does not strike us as unreasonable in the context of quantum state estimation. For example, suppose that (as is often the case) the measurement process was itself stochastic, so that the experimenter did not know which observable was going to be measured until after it *was* measured. Or suppose the state was a ‘quantum programme’, which had to succeed only on typical inputs drawn from some probability distribution.2

However, with regard to the power of adaptive measurements, it is possible to ask somewhat more sophisticated questions. For example, suppose we perform a binary measurement *E*_{1} (drawn from some distribution 𝒟_{1}) on one copy of an *n*-qubit state *ρ*. Then, based on the outcome *z*_{1}∈{0, 1} of that measurement, suppose we perform another binary measurement *E*_{2} (drawn from a new distribution 𝒟_{2}, which may depend on *z*_{1}) on a second copy of *ρ*, and so on for *r* copies of *ρ*. Finally, suppose we compute some Boolean function *f*(*z*_{1}, …, *z*_{r}) of the *r* measurement outcomes.

Now, how many times will we need to repeat this adaptive procedure before, given *E*_{1}, …, *E*_{r} drawn as above, we can estimate (with high probability) the conditional probability that *f*(*z*_{1}, …, *z*_{r})=1? If we simply apply theorem 1.1 to the tensor product of all *r* registers, then it is easy to see that *O*(*nr*) samples suffice. Furthermore, using ideas in the electronic supplementary material, one can show that this is optimal: in other words, no improvement to (say) *O*(*n*+*r*) samples is possible.

Indeed, even if we want to estimate the probabilities of all *r* of the measurement outcomes *simultaneously*, it follows from the union bound that we could do this with high probability, after a number of samples linear in *n* and polynomial in *r*.

We hope this illustrates how our learning theorem can be applied to more general settings than that for which it is explicitly stated. Naturally, there is a great deal of scope here for further research.

*Theorem 1.1* *is purely information theoretic; as such, it says nothing about the computational complexity of finding a hypothesis state σ.*

This is correct. Using semidefinite and convex programming techniques, one can implement any of our learning algorithms to run in time polynomial in the Hilbert space dimension, *N*=2^{n}. This might be fine if *n* is at most 12 or so; note that ‘measurement complexity’, and not computational complexity, has almost always been the limiting factor in real experiments. But of course such a running time is prohibitive for larger *n*.

Let us stress that exactly the same problem arises even in classical learning theory. For it follows from a celebrated result of Goldreich *et al*. (1984) that if there exists a polynomial-time algorithm to find a Boolean circuit of size *n* consistent with observed data (whenever such a circuit exists), then there are no cryptographic one-way functions. Using the same techniques, one can show that if there exists a polynomial-time quantum algorithm to prepare a state of *n*^{k} qubits consistent with observed data (whenever such a state exists), then there are no (classical) one-way functions secure against quantum attack. The only difference is that, while finding a classical hypothesis consistent with data is an NP search problem,3 finding a quantum hypothesis is a QMA search problem.

A fundamental question left open by this paper is whether there are non-trivial special cases of the quantum learning problem that can be solved, not only with a linear number of measurements, but also with a polynomial amount of quantum computation.

*The dependence on the error parameters γ and ϵ in* *theorem 1.1* *looks terrible*.

Indeed, no one would pretend that performing ∼(1/*γ*^{4}*ϵ*^{4}) measurements is practical for reasonable *γ* and *ϵ*. Fortunately, we can improve the dependence on *γ* and *ϵ* quite substantially, at the cost of increasing the dependence on *n* from linear to *n* log^{2}*n*.

**Theorem 1.2.** *The bound in* *theorem 1.1* *can be replaced by*

*m* ≥ (*K*/*ϵ*) ( (*n*/*γ*^{2}) log^{2}(*n*/(*γϵ*)) + log(1/*δ*) )

*for all ϵ, η, γ>0 with γ>η*.

In the electronic supplementary material, we will show that the dependence on *γ* and *ϵ* in theorem 1.2 is close to optimal.

*To estimate the measurement probabilities* Tr(*E*_{i}*ρ), one needs the ability to prepare multiple copies of ρ*.

This is less of an objection to theorem 1.1 than to quantum mechanics itself! With only one copy of *ρ*, the uncertainty principle immediately implies that not even statistical tomography is possible.

*One could never be certain that the condition of* *theorem 1.1* *was satisfied (in other words, that* |Tr(*E*_{i}*σ*)−Tr(*E*_{i}*ρ*)|≤*η* *for every i)*.

This is correct, but there is no need for certainty. For suppose we apply each measurement *E*_{i} to *Θ*((log *m*)/*η*^{2}) copies of *ρ*. Then by a large deviation bound, with overwhelming probability, we will obtain real numbers *p*_{1}, …, *p*_{m} such that |*p*_{i}−Tr(*E*_{i}*ρ*)|≤*η*/2 for every *i*. So if we want to find a hypothesis state *σ* such that |Tr(*E*_{i}*σ*)−Tr(*E*_{i}*ρ*)|≤*η* for every *i*, then it suffices to find a *σ* such that |Tr(*E*_{i}*σ*)−*p*_{i}|≤*η*/2 for every *i*. Certainly such a *σ* exists, for take *σ*=*ρ*.
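The estimation step can be simulated directly. In this sketch (our own illustration; the constant 40 standing in for the Θ is an arbitrary choice), each of *m* acceptance probabilities is estimated from *Θ*((log *m*)/*η*^{2}) simulated copies, and all *m* estimates should then land within *η*/2 of the truth.

```python
import numpy as np

rng = np.random.default_rng(1)

m, eta = 100, 0.1
# Theta((log m)/eta^2) copies per measurement; the constant 40 is chosen loosely.
copies = int(np.ceil(40 * np.log(m) / eta**2))

p_true = rng.uniform(size=m)                     # the unknown values Tr(E_i rho)
# Measuring `copies` independent copies gives a binomial frequency estimate.
estimates = rng.binomial(copies, p_true) / copies
max_dev = float(np.abs(estimates - p_true).max())
```

By a union bound over the *m* measurements, the worst-case deviation `max_dev` stays below *η*/2 with overwhelming probability.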

*But what if one can apply each measurement only once, rather than multiple times? In that case, the above estimation strategy no longer works*.

In the electronic supplementary material, we prove a learning theorem that applies directly to this ‘measure-once’ scenario. The disadvantage is that the upper bound on the number of measurements increases from ∼1/(*γ*^{4}*ϵ*^{4}) to ∼1/(*γ*^{8}*ϵ*^{4}).

**Theorem 1.3.** *Let ρ be an n-qubit state, let* 𝒟 *be a distribution over two-outcome measurements and let* ℰ=(*E*_{1}, …, *E*_{m}) *consist of m measurements drawn independently from* 𝒟. *Suppose we are given bits B=*(*b*_{1}, …, *b*_{m})*, where each b*_{i} *is* 1 *with independent probability* Tr(*E*_{i}*ρ*) *and* 0 *with probability* 1−Tr(*E*_{i}*ρ*). *Suppose also that we choose a hypothesis state σ to minimize the quadratic functional* Σ_{i=1}^{m} (Tr(*E*_{i}*σ*)−*b*_{i})^{2}. *Then there exists a positive constant K such that*

Pr_{*E*∼𝒟}[ |Tr(*Eσ*)−Tr(*Eρ*)| > *γ* ] ≤ *ϵ*

*with probability at least* 1−*δ over* ℰ *and B, provided that*

*m* ≥ (*K*/(*γ*^{4}*ϵ*^{2})) ( (*n*/(*γ*^{4}*ϵ*^{2})) log^{2}(1/(*γϵ*)) + log(1/*δ*) ).

*What if, instead of applying the ‘ideal’ measurement E, the experimenter can only apply a noisy version E′?*

If the noise that corrupts *E* to *E*′ is governed by a known probability distribution such as a Gaussian, then *E*′ is still just a POVM, so theorem 1.1 applies directly. If the noise is adversarial, then we can also apply theorem 1.1 directly, provided we have an upper bound on ‖*E*′−*E*‖ (which simply gets absorbed into *η*).

*What if the measurements have k>*2 *possible outcomes?*

Here is a simple reduction to the two-outcome case. Before applying the *k*-outcome POVM (*E*_{1}, …, *E*_{k}), first choose an integer *j*∈{1, …, *k*} uniformly at random and then pretend that the POVM being applied is the two-outcome POVM (*E*_{j}, *I*−*E*_{j}) (i.e. ignore the other *k*−1 outcomes). By the union bound, if our goal is to ensure that

|Tr(*E*_{j}*σ*)−Tr(*E*_{j}*ρ*)| ≤ *γ* for every *j* simultaneously

with probability at least 1−*δ*, then in our upper bounds it suffices to replace every occurrence of *γ* by *γ/k*, and every occurrence of *ϵ* by *ϵ/k*. We believe that one could do better than this by analysing the *k*-outcome case directly; we leave this as an open problem.4
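As a sanity check on the reduction, this small sketch (our own; the 4-outcome computational-basis POVM is an arbitrary example) selects a random outcome and treats it as a two-outcome measurement.

```python
import numpy as np

rng = np.random.default_rng(2)

def reduce_to_two_outcome(povm):
    """Pick one element E_j of a k-outcome POVM uniformly at random and view it
    as the two-outcome POVM (E_j, I - E_j), ignoring the other k - 1 outcomes."""
    j = int(rng.integers(len(povm)))
    return j, povm[j]

# The 4-outcome computational-basis measurement on two qubits.
k = 4
povm = [np.diag([1.0 if i == j else 0.0 for i in range(k)]) for j in range(k)]
j, E = reduce_to_two_outcome(povm)

rho_mixed = np.eye(k) / k
p_accept = float(np.trace(E @ rho_mixed).real)   # 1/4, whichever j is chosen
```

On the maximally mixed state every outcome is equally likely, so the reduced two-outcome measurement accepts with probability 1/*k* regardless of the random choice of *j*.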

### (c) Related work

This paper builds on two research areas—computational learning theory and quantum information theory—in order to say something about a third area, quantum state estimation. Since many readers are probably unfamiliar with at least one of these areas, let us discuss them in turn.

#### (i) Computational learning theory

Computational learning theory can be understood as a modern response to David Hume's problem of induction: ‘if an ornithologist sees 500 ravens and all of them are black, why does that provide any grounds at all for expecting the 501st raven to be black? After all, the hypothesis that the 501st raven will be white seems equally compatible with the evidence.’ The answer, from a learning theory perspective, is that in practice one always restricts attention to some hypothesis class 𝒞 that is vastly smaller than the class of logically conceivable hypotheses. So the real question is not ‘is induction possible?’, but rather ‘what properties does 𝒞 have to satisfy for induction to be possible?’

In a seminal paper, Blumer *et al.* (1989) showed that if 𝒞 is finite, then any hypothesis that agrees with *O*(log |𝒞|) randomly chosen data points will probably agree with most future data points as well. Indeed, even if 𝒞 is infinite, one can upper-bound the number of data points needed for learning in terms of a combinatorial parameter of 𝒞 called the Vapnik–Chervonenkis (VC) dimension. Unfortunately, these results apply only to Boolean hypothesis classes. So to prove our learning theorem, we will need a more powerful result due to Bartlett & Long (1998), which upper-bounds the number of data points needed to learn *real*-valued hypothesis classes.

#### (ii) Quantum information theory

Besides results from classical learning theory, we will also need a result of Ambainis *et al*. (2002) in quantum information theory. Ambainis *et al*. (2002) showed that if we want to encode *k* bits into an *n*-qubit quantum state, in such a way that any one bit can later be retrieved with error probability at most *p*, then we need *n*≥(1−*H*(*p*))*k*, where *H* is the binary entropy function.
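Numerically, the bound of Ambainis *et al*. (2002) is easy to evaluate; the helper below (our own, purely illustrative) computes the minimum number of qubits (1−*H*(*p*))*k* for a given *k* and error probability *p*.

```python
import math

def binary_entropy(p):
    # H(p) = -p log2(p) - (1-p) log2(1-p), with H(0) = H(1) = 0.
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def min_qubits(k, p):
    """Ambainis et al. (2002): encoding k bits so that each can be recovered
    with error probability at most p requires n >= (1 - H(p)) * k qubits."""
    return (1 - binary_entropy(p)) * k

n_bound = min_qubits(1000, 1 / 3)   # about 81.7 qubits
```

Note that as *p*→1/2 the bound degrades gracefully to zero, while perfect recovery (*p*=0) forces *n*≥*k*.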

Perhaps the central idea of this paper is to turn the result of Ambainis and colleagues on its head, and see it not as lower-bounding the number of qubits needed for coding and communication tasks, but instead as *upper*-bounding the ‘effective dimension’ of a quantum state to be learned. (In theoretical computer science, this is hardly the first time that a negative result has been turned into a positive one. A similar ‘lemons-into-lemonade’ conceptual shift was made by Linial *et al*. (1993), when they used a limitation of constant-depth circuits to give an efficient algorithm for learning those circuits.)

#### (iii) Quantum state estimation

Physicists have been interested in quantum state estimation since at least the 1950s (see Paris & Řeháček (2004) for a good overview). For practical reasons, they have been particularly concerned with minimizing the number of measurements. However, most literature on the subject restricts attention to low-dimensional Hilbert spaces (say, 2 or 3 qubits), taking for granted that the number of measurements will increase exponentially with the number of qubits.

There *is* a substantial body of work on how to estimate a quantum state given incomplete measurement results—see Bužek *et al*. (1999) for a good introduction to the subject, or Bužek (2004) for estimation algorithms that are similar in spirit to ours. But there are at least two differences between the previous work and ours. First, while some of the previous work offers numerical evidence that few measurements seem to suffice in practice, so far as we know none of it considers asymptotic complexity. Second, the previous work almost always assumes that an experimenter starts with a prior probability distribution over quantum states (often the uniform distribution), and then either updates the distribution using Bayes' rule, or else applies a maximum-likelihood principle. By contrast, our learning approach requires no assumptions about a distribution over states; it instead requires only a (possibly unknown) distribution over *measurements*. The advantage of the latter approach, in our view, is that an experimenter has much more control over which measurements to apply than the nature of the state to be learned.

### (d) Implications

The main implication of our learning theorem is conceptual: it shows that quantum states, considered as a hypothesis class, are ‘reasonable’ in the sense of computational learning theory. Were this *not* the case, it would presumably strengthen the view of quantum computing sceptics (Levin 2003; Goldreich 2004) that quantum states are ‘inherently extravagant’ objects, which will need to be discarded as our knowledge of physics expands. (Or at least, it would suggest that the ‘operationally meaningful’ quantum states comprise only a tiny portion of Hilbert space.) Instead we have shown that, while the effective dimension of an *n*-qubit Hilbert space *appears* to be exponential in *n*, in the sense that is relevant for approximate learning and prediction, this appearance is illusory.

Beyond establishing this conceptual point, we believe that our learning theorem could be of practical use in quantum state estimation, since it provides an explicit upper bound on the number of measurements needed to ‘learn’ a quantum state with respect to any probability measure over observables. Even if our actual result is not directly applicable, we hope the mere *fact* that this sort of learning is possible will serve as a spur to further research. As an analogy, classical computational learning theory has had a large influence on neural networks, computer vision and other fields,5 but this influence might have had less to do with the results themselves than with their philosophical moral.

We turn now to a more immediate application of our learning theorem: solving open problems in quantum computing and information.

The first problem concerns *quantum one-way communication complexity*. In this subject we consider a sender, Alice, and a receiver, Bob, who hold inputs *x* and *y,* respectively. We then ask the following question: assuming the best communication protocol and the worst (*x*, *y*) pair, how many bits must Alice send to Bob, for Bob to be able to evaluate some joint function *f*(*x*, *y*) with high probability? Note that there is no back communication from Bob to Alice.

Let *R*^{1}(*f*) and *Q*^{1}(*f*) be the number of bits that Alice needs to send, if her message to Bob is randomized or quantum, respectively.6 Then improving an earlier result of Aaronson (2004), in §3 we are able to show the following.

**Theorem 1.4.** *For any Boolean function f (partial or total)*, *R*^{1}(*f*)=*O*(*MQ*^{1}(*f*)), *where M is the length of Bob's input*.

Intuitively, this means that if Bob's input is small, then quantum communication provides at most a small advantage over classical communication.

The proof of theorem 1.4 will rely on our learning theorem in an intuitively appealing way. Basically, Alice will send some randomly chosen ‘training inputs’ which Bob will then use to learn a ‘pretty good description’ of the quantum state that Alice would have sent him in the quantum protocol.

The second problem concerns *approximate verification of quantum software*. Suppose you want to evaluate some Boolean function *f*:{0, 1}^{n}→{0, 1} on typical inputs drawn from a probability distribution . So you go to the quantum software store and purchase |*ψ*_{f}〉, a *q*-qubit piece of quantum software. The software vendor tells you that to evaluate *f*(*x*) on any given input *x*∈{0, 1}^{n}, you simply need to apply a fixed measurement *E* to the state |*ψ*_{f}〉|*x*〉. However, you do not trust |*ψ*_{f}〉 to work as expected. Thus, the following question arises: is there a fixed, polynomial-size set of ‘benchmark inputs’ *x*_{1}, …, *x*_{T}, such that for *any* quantum programme |*ψ*_{f}〉, if |*ψ*_{f}〉 works on the benchmark inputs then it will also work on most inputs drawn from ?

Using our learning theorem, we will show in §4 that the answer is yes. Indeed, we will actually go further than that and give an *efficient procedure* to test |*ψ*_{f}〉 against the benchmark inputs. The central difficulty here is that the measurements intended to test |*ψ*_{f}〉 might also destroy it. We will resolve this difficulty by means of a ‘Witness Protection Lemma’, which might have applications elsewhere.

In terms of complexity classes, we can state our verification theorem as follows.

**Theorem 1.5.** HeurBQP/qpoly ⊆ HeurQMA/poly.

Here BQP/qpoly is the class of problems solvable in quantum polynomial time, with help from a polynomial-size ‘quantum advice state’ |*ψ*_{n}〉 that depends only on the input length *n*; while Quantum Merlin–Arthur (QMA) is the class of problems for which a ‘yes’ answer admits a polynomial-size quantum proof. Then HeurBQP/qpoly and HeurQMA/poly are the *heuristic* versions of BQP/qpoly and QMA/poly, respectively—that is, the versions where we want only to succeed on most inputs rather than all of them.

## 2. The measurement complexity of quantum learning

We now prove theorems 1.1 and 1.2. To do so, we first review results from computational learning theory, which upper-bound the number of data points needed to learn a hypothesis in terms of the ‘dimension’ of the underlying hypothesis class. We then use a result of Ambainis *et al*. (2002) to upper-bound the dimension of the class of *n*-qubit mixed states.

### (a) Learning probabilistic concepts

The prototype of the sort of learning theory result we need is the ‘Occam's razor theorem’ of Blumer *et al*. (1989), which is stated in terms of a parameter called VC dimension. However, this result does not suffice for our purpose, since it deals with *Boolean* concepts that map each element of an underlying sample space to {0, 1}. By contrast, we are interested in probabilistic concepts—called *p-concepts* by Kearns & Schapire (1994)—that map each measurement *E* to a real number Tr(*Eρ*)∈[0, 1].

Generalizing from Boolean concepts to *p*-concepts is not as straightforward as one might hope. Fortunately, various authors (Kearns & Schapire 1994; Bartlett *et al*. 1996; Alon *et al*. 1997; Bartlett & Long 1998; Anthony & Bartlett 2000) have already done most of the work for us, with the results of Anthony & Bartlett (2000) and Bartlett & Long (1998) being particularly relevant. To state their results, we need some definitions. Let 𝒮 be a finite or infinite set called the *sample space*. Then a *p-concept over* 𝒮 is a function *F*:𝒮→[0, 1], and a *p-concept class over* 𝒮 is a set of *p*-concepts over 𝒮. Kearns & Schapire (1994) proposed a measure of the complexity of *p*-concept classes, called the *fat-shattering dimension*.

**Definition 2.1.** Let 𝒮 be a sample space, 𝒞 a *p*-concept class over 𝒮 and *γ*>0 a real number. We say a set {*x*_{1}, …, *x*_{k}}⊆𝒮 is *γ-fat-shattered* by 𝒞 if there exist real numbers *α*_{1}, …, *α*_{k} such that for all subsets *B*⊆{1, …, *k*}, there exists a *p*-concept *F*∈𝒞 such that for all *i*∈{1, …, *k*}:

- if *i*∉*B* then *F*(*x*_{i})≤*α*_{i}−*γ*, and
- if *i*∈*B* then *F*(*x*_{i})≥*α*_{i}+*γ*.

Then the *γ-fat-shattering dimension* of 𝒞, or fat_{γ}(𝒞), is the maximum *k* such that some {*x*_{1}, …, *x*_{k}}⊆𝒮 is *γ*-fat-shattered by 𝒞. (If there is no such finite maximum, then fat_{γ}(𝒞)=∞.)
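The definition can be unwound mechanically. The following brute-force sketch (our own toy; real *p*-concept classes are of course infinite, and the threshold grid is a crude discretization) computes the *γ*-fat-shattering dimension of a tiny finite class.

```python
from itertools import combinations, product

def shatters(subset, alphas, concepts, gamma):
    """Do the concepts realize every above/below pattern on `subset`,
    with margin gamma around the thresholds `alphas`?"""
    for pattern in product([0, 1], repeat=len(subset)):
        realized = any(
            all((F[x] >= a + gamma) if bit else (F[x] <= a - gamma)
                for x, a, bit in zip(subset, alphas, pattern))
            for F in concepts)
        if not realized:
            return False
    return True

def fat_dim(samples, concepts, gamma, grid=11):
    """Brute-force gamma-fat-shattering dimension, searching thresholds on a grid."""
    alpha_grid = [i / (grid - 1) for i in range(grid)]
    best = 0
    for k in range(1, len(samples) + 1):
        for subset in combinations(samples, k):
            if any(shatters(subset, alphas, concepts, gamma)
                   for alphas in product(alpha_grid, repeat=k)):
                best = k
    return best

# Four "states" evaluated on two "measurements": all four answer patterns occur,
# so both points are 0.3-fat-shattered (take alpha_1 = alpha_2 = 0.5).
concepts = [{"E1": 0.9, "E2": 0.1}, {"E1": 0.1, "E2": 0.9},
            {"E1": 0.9, "E2": 0.9}, {"E1": 0.1, "E2": 0.1}]
d = fat_dim(["E1", "E2"], concepts, gamma=0.3)   # d == 2
```

Dropping the two anticorrelated concepts leaves a class whose two values always agree, and the dimension falls to 1.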

We can now state the result of Anthony & Bartlett (2000).

**Theorem 2.2.** *Let* 𝒮 *be a sample space*, 𝒞 *a p-concept class over* 𝒮 *and* 𝒟 *a probability measure over* 𝒮. *Fix a p-concept F*∈𝒞, *as well as error parameters ϵ, η, γ>*0 *with γ>η. Suppose we draw m samples X=*(*x*_{1}, …, *x*_{m}) *independently according to* 𝒟 *and then choose any hypothesis H*∈𝒞 *such that* |*H*(*x*)−*F*(*x*)|≤*η* *for all x*∈*X. Then there exists a positive constant K such that*

Pr_{*x*∼𝒟}[ |*H*(*x*)−*F*(*x*)| > *γ* ] ≤ *ϵ*

*with probability at least* 1*−δ over X, provided that*

*m* ≥ (*K*/*ϵ*) ( *d* log^{2}(*d*/*ϵ*) + log(1/*δ*) ), *where d*=fat_{(*γ*−*η*)/8}(𝒞).

Note that in theorem 2.2, the dependence on the fat-shattering dimension is superlinear. We would like to reduce the dependence to linear, at least when *η* is sufficiently small. We can do so using the following result of Bartlett & Long (1998).7

**Theorem 2.3.** *Let* 𝒮 *be a sample space*, 𝒞 *a p-concept class over* 𝒮 *and* 𝒟 *a probability measure over* 𝒮. *Fix a p-concept F*:𝒮→[0, 1] *(not necessarily in* 𝒞*), as well as an error parameter α>*0. *Suppose we draw m samples X=*(*x*_{1}, …, *x*_{m}) *independently according to* 𝒟 *and then choose any hypothesis H∈*𝒞 *such that* Σ_{i=1}^{m} |*H*(*x*_{i})−*F*(*x*_{i})| *is minimized. Then there exists a positive constant K such that*

E_{*x*∼𝒟}[ |*H*(*x*)−*F*(*x*)| ] ≤ inf_{*G*∈𝒞} E_{*x*∼𝒟}[ |*G*(*x*)−*F*(*x*)| ] + *α*

*with probability at least* 1−*δ over X, provided that*

*m* ≥ (*K*/*α*^{2}) ( fat_{*α*/5}(𝒞) log^{2}(1/*α*) + log(1/*δ*) ).

Theorem 2.3 has the following corollary.

**Corollary 2.4.** *In the statement of* *theorem* 2.2, *suppose γϵ≥7η. Then the bound on m can be replaced by*

*m* ≥ (*K*/(*γ*^{2}*ϵ*^{2})) ( fat_{*γϵ*/35}(𝒞) log^{2}(1/(*γϵ*)) + log(1/*δ*) ).

*Proof.* Let 𝒮 be a sample space, 𝒞 a *p*-concept class over 𝒮 and 𝒟 a probability measure over 𝒮. Then let 𝒞^{*} be the class of *p*-concepts *G*:𝒮→[0, 1] for which there exists an *F*∈𝒞 such that |*G*(*x*)−*F*(*x*)|≤*η* for all *x*∈𝒮. In addition, fix a *p*-concept *F*∈𝒞. Suppose we draw *m* samples *X*=(*x*_{1}, …, *x*_{m}) independently according to 𝒟 and then choose any hypothesis *H*∈𝒞 such that |*H*(*x*)−*F*(*x*)|≤*η* for all *x*∈*X*. Then there exists a *G*∈𝒞^{*} such that *G*(*x*)=*H*(*x*) for all *x*∈*X*. This *G* is simply obtained by setting *G*(*x*)≔*H*(*x*) if *x*∈*X* and *G*(*x*)≔*F*(*x*) otherwise.

So by theorem 2.3, provided that

*m* ≥ (*K*/*α*^{2}) ( fat_{*α*/5}(𝒞^{*}) log^{2}(1/*α*) + log(1/*δ*) ),

we have

E_{*x*∼𝒟}[ |*H*(*x*)−*F*(*x*)| ] ≤ *η*+*α*

with probability at least 1−*δ* over *X*. Here we have used the fact that *G*∈𝒞^{*} and hence

inf_{*G*^{*}∈𝒞^{*}} (1/*m*) Σ_{i=1}^{m} |*G*^{*}(*x*_{i})−*F*(*x*_{i})| ≤ *η*.

Setting *α*≔6*γϵ*/7, this implies by Markov's inequality that

Pr_{*x*∼𝒟}[ |*H*(*x*)−*F*(*x*)| > *γ* ] ≤ (*η*+*α*)/*γ*

and therefore

Pr_{*x*∼𝒟}[ |*H*(*x*)−*F*(*x*)| > *γ* ] ≤ (*η*+6*γϵ*/7)/*γ*.

Since *γϵ*≥7*η*, the above inequality implies that

Pr_{*x*∼𝒟}[ |*H*(*x*)−*F*(*x*)| > *γ* ] ≤ (*γϵ*/7+6*γϵ*/7)/*γ* = *ϵ*,

as desired.

Next we claim that fat_{α}(𝒞^{*}) ≤ fat_{α−η}(𝒞). The reason is simply that, if a given set is *α*-fat-shattered by 𝒞^{*}, then by the triangle inequality it must also be (*α*−*η*)-fat-shattered by 𝒞.

Putting it all together, we have

fat_{*α*/5}(𝒞^{*}) ≤ fat_{*α*/5−*η*}(𝒞) ≤ fat_{*γϵ*/35}(𝒞),

and hence

*m* ≥ (*K*/(*γ*^{2}*ϵ*^{2})) ( fat_{*γϵ*/35}(𝒞) log^{2}(1/(*γϵ*)) + log(1/*δ*) )

samples suffice. ▪

### (b) Learning quantum states

We now turn to the problem of learning a quantum state. Let 𝒮_{n} be the set of two-outcome measurements on *n* qubits. In addition, given an *n*-qubit mixed state *ρ*, let *F*_{ρ}:𝒮_{n}→[0, 1] be the *p*-concept defined by *F*_{ρ}(*E*)=Tr(*Eρ*), and let 𝒞_{n} be the class of all such *F*_{ρ}s. Then to apply theorems 2.2 and 2.3, all we need to do is upper-bound fat_{γ}(𝒞_{n}) in terms of *n* and *γ*. We will do so using a result of Ambainis *et al*. (2002), which upper-bounds the number of classical bits that can be ‘encoded’ into *n* qubits.

**Theorem 2.5.** *Let k and n be positive integers with k>n, and fix an error probability p<*1/2. *For all k-bit strings y=y*_{1}⋯*y*_{k}*, let ρ*_{y} *be an n-qubit mixed state that ‘encodes’ y. Suppose there exist two-outcome measurements E*_{1}, …, *E*_{k} *such that for all y*∈{0, 1}^{k} *and i*∈{1, …, *k}:*

- *if y*_{i}=0 *then* Tr(*E*_{i}*ρ*_{y})≤*p*, *and*
- *if y*_{i}=1 *then* Tr(*E*_{i}*ρ*_{y})≥1−*p*.

*Then n*≥(1−*H*(*p*))*k, where H is the binary entropy function*.

Theorem 2.5 has the following easy generalization.

**Theorem 2.6.** *Let k, n and {ρ*_{y}*} be as in* *theorem 2.5*. *Suppose there exist measurements E*_{1}, …, *E*_{k}*, as well as real numbers α*_{1}, …, *α*_{k} *and γ>*0*, such that for all y∈{*0, 1*}*^{k} *and i∈{*1, …, *k}:*

- *if y*_{i}=0 *then* Tr(*E*_{i}*ρ*_{y})≤*α*_{i}−*γ*, *and*
- *if y*_{i}=1 *then* Tr(*E*_{i}*ρ*_{y})≥*α*_{i}+*γ*.

*Then n*=*Ω*(*γ*^{2}*k*).

*Proof.* Suppose there exists such an encoding scheme. Then consider an amplified scheme, where each string *y*∈{0, 1}^{k} is encoded by the tensor product state *ρ*_{y}^{⊗ℓ}. Here we set ℓ≔⌈*c*/*γ*^{2}⌉ for some constant *c*>0. In addition, for all *i*∈{1, …, *k*}, let *E*_{i}^{*} be an amplified measurement that applies *E*_{i} to each of the ℓ copies of *ρ*_{y}, and accepts if, and only if, at least *α*_{i}ℓ of the *E*_{i}'s do. Then provided we choose *c* to be sufficiently large, it is easy to show by a Chernoff bound that for all *y* and *i*:

- if *y*_{i}=0 then Tr(*E*_{i}^{*}*ρ*_{y}^{⊗ℓ})≤1/3, and
- if *y*_{i}=1 then Tr(*E*_{i}^{*}*ρ*_{y}^{⊗ℓ})≥2/3.

So to avoid contradicting theorem 2.5, we need *n*ℓ≥(1−*H*(1/3))*k*. But this implies that *n*=*Ω*(*γ*^{2}*k*).8 ▪
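The Chernoff-bound step is easy to check numerically. In this sketch (ours; the constant *c*=4 in ℓ is an arbitrary choice), applying a base measurement to ℓ≈*c*/*γ*^{2} copies and thresholding at *α*ℓ drives acceptance probabilities *α*−*γ* and *α*+*γ* below 1/3 and above 2/3, respectively.

```python
import numpy as np

rng = np.random.default_rng(3)

def amplified_accept(p, alpha, ell, trials=20000):
    """Empirical acceptance probability of the amplified measurement E_i^*:
    measure ell independent copies and accept iff at least alpha * ell accept."""
    counts = rng.binomial(ell, p, size=trials)
    return float(np.mean(counts >= alpha * ell))

gamma, alpha = 0.1, 0.5
ell = int(np.ceil(4 / gamma**2))                       # ell = c / gamma^2 with c = 4
p_low = amplified_accept(alpha - gamma, alpha, ell)    # driven below 1/3
p_high = amplified_accept(alpha + gamma, alpha, ell)   # driven above 2/3
```

Increasing *c* pushes the two probabilities exponentially closer to 0 and 1, which is all the proof requires.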

If we interpret *k* as the size of a fat-shattered subset of 𝒮_{n}, then theorem 2.6 immediately yields the following upper bound on the fat-shattering dimension.

*For all γ>*0 *and n, we have* fat_{γ}(*C*_{n})=*O*(*n*/*γ*^{2}).

Combining corollary 2.4 with corollary 2.7, we find that if *γϵ*≥7*η* then a number of measurements linear in *n* suffices. Likewise, combining theorem 2.2 with corollary 2.7, we find that the same holds if *γ*>*η*. This completes the proofs of theorems 1.1 and 1.2, respectively.

## 3. Application to quantum communication

In this section we use our quantum learning theorem to prove a new result about *one-way communication complexity*. Here we consider two players, Alice and Bob, who hold inputs *x* and *y*, respectively. For concreteness, let *x* be an *N*-bit string and *y* be an *M*-bit string. In addition, let *f*:*Z*→{0, 1} be a Boolean function, where *Z* is some subset of {0, 1}^{N}×{0, 1}^{M}. We call *f* *total* if *Z*={0, 1}^{N}×{0, 1}^{M}, and *partial* otherwise.

We are interested in the minimum number of bits *k* that Alice needs to send to Bob, for Bob to be able to evaluate *f*(*x*, *y*) for any input pair (*x*, *y*)∈*Z*. We consider three models of communication: deterministic; randomized; and quantum. In the deterministic model, Alice sends Bob a *k*-bit string *a*_{x} depending only on *x*. Then Bob, using only *a*_{x} and *y*, must output *f*(*x*, *y*) with certainty. In the randomized model, Alice sends Bob a *k*-bit string *a* drawn from a probability distribution *D*_{x}. Then Bob must output *f*(*x*, *y*) with probability at least 2/3 over *a* drawn from *D*_{x}.9 In the quantum model, Alice sends Bob a *k*-qubit mixed state *ρ*_{x}. Then Bob, after measuring *ρ*_{x} in a basis depending on *y*, must output *f*(*x*, *y*) with probability at least 2/3. We use *D*^{1}(*f*), *R*^{1}(*f*) and *Q*^{1}(*f*) to denote the minimum value of *k* for which Bob can succeed in the deterministic, randomized and quantum models, respectively. Clearly, *D*^{1}(*f*)≥*R*^{1}(*f*)≥*Q*^{1}(*f*) for all *f*.

The question that interests us is how small the quantum communication complexity *Q*^{1}(*f*) can be when compared with the classical complexities *D*^{1}(*f*) and *R*^{1}(*f*). We know that there exists a total function *f* for which *D*^{1}(*f*)=*N* but *R*^{1}(*f*)=*Q*^{1}(*f*)=*O*(log *N*).10 Furthermore, Gavinsky *et al*. (2007) have recently shown that there exists a partial function *f* for which *Q*^{1}(*f*)=*O*(log *N*) but *R*^{1}(*f*)=*Ω*(√*N*).

On the other hand, it follows from a result of Klauck (2000) that *D*^{1}(*f*)=*O*(*MQ*^{1}(*f*)) for all total *f*. Intuitively, if Bob's input is small, then quantum communication provides at most a limited savings over classical communication. But does the bound hold for partial *f* as well? Aaronson (2004) proved a slightly weaker result: for all *f* (partial or total), *D*^{1}(*f*)=*O*(*MQ*^{1}(*f*)log *Q*^{1}(*f*)). Whether the log *Q*^{1}(*f*) factor can be removed has remained an open problem for several years.

Using our quantum learning theorem, we are able to resolve this problem at the cost of replacing *D*^{1}(*f*) by *R*^{1}(*f*). We now prove theorem 1.4: *R*^{1}(*f*)=*O*(*MQ*^{1}(*f*)) for any Boolean function *f*.

Let *f*:*Z*→{0, 1} be a Boolean function with *Z*⊆{0, 1}^{N}×{0, 1}^{M}. Fix Alice's input *x*, and let *Z*_{x} be the set of all *y* such that (*x*, *y*)∈*Z*. By Yao's minimax principle, to give a randomized protocol that errs with probability at most 1/3 for all *y*∈*Z*_{x}, it is enough, for any fixed probability distribution *D* over *Z*_{x}, to give a randomized protocol that errs with probability at most 1/3 over *y* drawn from *D*.11

So let *D* be such a distribution; then the randomized protocol is as follows. First Alice chooses *k* inputs *y*_{1}, …, *y*_{k} independently from *D*, where *k*=*O*(*Q*^{1}(*f*)). She then sends Bob *y*_{1}, …, *y*_{k}, together with *f*(*x*, *y*_{i}) for all *i*∈{1, …, *k*}. Clearly, this message requires only *O*(*MQ*^{1}(*f*)) classical bits. We need to show that it lets Bob evaluate *f*(*x*, *y*), with high probability over *y* drawn from *D*.

By amplification, we can assume Bob errs with probability at most *η* for any fixed constant *η*>0. We will take *η*=(1/100). In addition, in the quantum protocol for *f*, let *ρ*_{x} be the *Q*^{1}(*f*)-qubit mixed state that Alice would send given input *x* and *E*_{y} the measurement that Bob would apply given input *y*. Then Tr(*E*_{y}*ρ*_{x})≥1−*η* if *f*(*x*, *y*)=1, while Tr(*E*_{y}*ρ*_{x})≤*η* if *f*(*x*, *y*)=0.

Given Alice's classical message, first Bob finds a *Q*^{1}(*f*)-qubit state *σ* such that |Tr(*E*_{y_i}*σ*)−*f*(*x*, *y*_{i})|≤*η* for all *i*∈{1, …, *k*}. Certainly such a state exists (for take *σ*=*ρ*_{x}) and Bob can find it by searching exhaustively for its classical description. If there are multiple such states, then Bob chooses one in some arbitrary deterministic way (e.g. by lexicographic ordering). Note that we then have |Tr(*E*_{y_i}*σ*)−Tr(*E*_{y_i}*ρ*_{x})|≤2*η* for all *i*∈{1, …, *k*} as well. Finally Bob outputs *f*(*x*, *y*)=1 if Tr(*E*_{y}*σ*)≥1/2 or *f*(*x*, *y*)=0 if Tr(*E*_{y}*σ*)<1/2.
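Bob's exhaustive search admits a simple classical caricature. In the sketch below (all names hypothetical), a candidate 'state' is summarized only by its vector of acceptance probabilities on a finite set of measurements, and we search a coarse grid for any vector consistent with Alice's labeled samples; in the actual protocol Bob would instead enumerate a discretized net over *Q*^{1}(*f*)-qubit density matrices.

```python
import itertools

def find_consistent(samples, d, eta=0.01, step=0.25):
    """Return any probability vector sigma over d measurements such that
    |sigma[i] - label| <= eta for every labeled sample (i, label), or None.
    Stand-in for Bob's exhaustive search over classical descriptions."""
    grid = [k * step for k in range(int(1 / step) + 1)]
    for sigma in itertools.product(grid, repeat=d):
        if all(abs(sigma[i] - label) <= eta for i, label in samples):
            return sigma
    return None

# Alice's message: sample measurement indices with the values f(x, y_i).
samples = [(0, 1), (2, 0)]
sigma = find_consistent(samples, d=4)
assert sigma is not None
# Bob then predicts f(x, y) = 1 exactly when sigma[y] >= 1/2.
predict = lambda y: 1 if sigma[y] >= 0.5 else 0
assert predict(0) == 1 and predict(2) == 0
```

The learning theorem is what guarantees that any *σ* passing the consistency check also predicts well on *most* unsampled measurements.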

Set *ϵ*=*δ*=(1/6) and *γ*=0.42, so that *γϵ*=7*η*. Then by theorem 1.1, |Tr(*E*_{y}*σ*)−Tr(*E*_{y}*ρ*_{x})|≤*γ* with probability at least 1−*ϵ* over *y* drawn from *D*, with probability at least 1−*δ* over Alice's classical message, provided *k* is large enough. So in particular, there exist constants *A* and *B* such that if *k*≥*AQ*^{1}(*f*)+*B*, then the above holds with probability at least 1−*δ*. Since *γ*+*η*<(1/2), it follows that Bob's classical strategy will fail with probability at most *ϵ*+*δ*=(1/3) over *y* drawn from *D*. ▪

It is easy to see that, in theorem 1.4, the upper bound on *R*^{1}(*f*) needs to depend on both *M* and *Q*^{1}(*f*). For the index function12 yields a total *f* for which *R*^{1}(*f*) is exponentially larger than *M*, while the recent results of Gavinsky *et al*. (2007) yield a partial *f* for which *R*^{1}(*f*) is exponentially larger than *Q*^{1}(*f*). However, is it possible that theorem 1.4 could be improved to *R*^{1}(*f*)=*O*(*M*+*Q*^{1}(*f*))?

Using a slight generalization of Gavinsky *et al*.'s (2007) result, we are able to rule out this possibility. Gavinsky and colleagues consider the following one-way communication problem, called the *Boolean Hidden Matching Problem*. Alice is given a string *x*∈{0, 1}^{N}. For some parameter *α*>0, Bob is given *αN* disjoint edges (*i*_{1}, *j*_{1}), …, (*i*_{αN}, *j*_{αN}) in {1, …, *N*}^{2}, together with a string *w*∈{0, 1}^{αN}. (Thus Bob's input length is *M*=*O*(*αN* log *N*).) Alice and Bob are promised that either

(i) *x*_{i_ℓ}⊕*x*_{j_ℓ}=*w*_{ℓ} for all ℓ∈{1, …, *αN*} or (ii) *x*_{i_ℓ}⊕*x*_{j_ℓ}=1−*w*_{ℓ} for all ℓ∈{1, …, *αN*}.

Bob's goal is to output *f*=0 in case (i) or *f*=1 in case (ii).

It is not hard to see that *Q*^{1}(*f*)=*O*((log *N*)/*α*) for all *α*>0.13 What Gavinsky *et al.* (2007) showed is that if *α*=*Θ*(1), then *R*^{1}(*f*)=*Ω*(√*N*). By tweaking their proof a bit, one can generalize their result to *R*^{1}(*f*)=*Ω*(√(*N*/*α*)) for *all* *α* (R. de Wolf 2006, personal communication). So in particular, set *α*≔1/√*N*. Then we obtain a partial Boolean function *f* for which *Q*^{1}(*f*)=*O*(√*N* log *N*) and *M*=*O*(√*N* log *N*) but *R*^{1}(*f*)=*Ω*(*N*^{3/4}), thereby refuting the conjecture that *R*^{1}(*f*)=*O*(*M*+*Q*^{1}(*f*)).

As a final remark, the Boolean Hidden Matching Problem clearly satisfies *D*^{1}(*f*)=*O*(*N*) for all *α*>0. So by varying *α*, we immediately get that not only is *R*^{1}(*f*)=*O*(*M*+*Q*^{1}(*f*)) false, but also Aaronson's bound *D*^{1}(*f*)=*O*(*MQ*^{1}(*f*)log *Q*^{1}(*f*)) (Aaronson 2004) is *tight* up to a polylogarithmic term. This answers one of the open questions in Aaronson (2004).

## 4. Application to quantum advice

Having applied our quantum learning theorem to communication complexity, in this section we apply the theorem to computational complexity. In particular, we will show how to use a trusted classical string to perform approximate verification of an untrusted quantum state.

The following conventions will be helpful throughout the section. We identify a language *L*⊆{0, 1}^{*} with the Boolean function *L*:{0, 1}^{*}→{0, 1} such that *L*(*x*)=1 if and only if *x*∈*L*. Given a quantum algorithm *A*, we let *P*_{acc}(|*ψ*〉) be the probability that *A* accepts and *P*_{rej}(|*ψ*〉) the probability that *A* rejects if given the state |*ψ*〉 as input. Note that *A* might neither accept nor reject (in other words, output 'do not know'), in which case *P*_{acc}(|*ψ*〉)+*P*_{rej}(|*ψ*〉)<1. Finally, we use *H*_{k} to denote a Hilbert space of *k* qubits, and poly(*n*) to denote an arbitrary polynomial in *n*.

### (a) Quantum advice and proofs

BQP, or Bounded-Error Quantum Polynomial Time, is the class of problems efficiently solvable by a quantum computer. Then BQP/qpoly is a generalization of BQP, in which the quantum computer is given a polynomial-size quantum advice state that depends only on the input length *n*, but could otherwise be arbitrarily hard to prepare. More formally:

A language *L*⊆{0, 1}^{*} is in BQP/qpoly if there exists a polynomial-time quantum algorithm *A* such that for all input lengths *n*, there exists a quantum advice state |*ψ*_{n}〉 of poly(*n*) qubits such that *A*(*x*, |*ψ*_{n}〉) outputs *L*(*x*) with probability at least 2/3 for all *x*∈{0, 1}^{n}.

How powerful is this class? Aaronson (2004) proved the first limitation on BQP/qpoly, by showing that BQP/qpoly⊆PostBQP/poly. Here PostBQP is a generalization of BQP, in which we can 'postselect' on the outcomes of measurements,14 and /poly means 'with polynomial-size classical advice'. Intuitively, this result means that anything we can do with quantum advice, we can also do with classical advice, provided we are willing to use exponentially more computation time to extract what the advice is telling us.

In addition to quantum advice, we will also be interested in quantum proofs. Compared to advice, a proof has the advantage that it can be tailored to a particular input *x*, but the disadvantage that it cannot be trusted. In other words, while an advisor's only goal is to help the algorithm *A* decide whether *x*∈*L*, a prover wants to *convince A* that *x*∈*L*. The class of problems that admit polynomial-size quantum proofs is called QMA.

A language *L* is in QMA if there exists a polynomial-time quantum algorithm *A* such that for all *x*∈{0, 1}^{n}:

- if *x*∈*L*, then there exists a quantum witness |*φ*〉 of poly(*n*) qubits such that *A*(*x*, |*φ*〉) accepts with probability at least 2/3, and
- if *x*∉*L*, then *A*(*x*, |*φ*〉) accepts with probability at most 1/3 for all |*φ*〉.

One can think of QMA as a quantum analogue of NP.

### (b) Untrusted advice

To state our result in the strongest possible way, we need to define a new notion called *untrusted advice*, which might be of independent interest for complexity theory. Intuitively, untrusted advice is a 'hybrid' of proof and advice: it is like a proof in that it cannot be trusted, but like advice in that it depends only on the input length *n*. More concretely, let us define the complexity class YP, or Yoda Polynomial Time, to consist of all problems solvable in classical polynomial time with the help of polynomial-size untrusted advice.15

A language *L* is in YP if there exists a polynomial-time algorithm *A* such that for all *n*:

- there exists a string *y*_{n} of size poly(*n*) such that *A*(*x*, *y*_{n}) outputs *L*(*x*) for all *x*∈{0, 1}^{n}, and
- *A*(*x*, *y*) outputs either *L*(*x*) or 'do not know' for all *x*∈{0, 1}^{n} and all *y*.

From the definition, it is clear that YP is contained in both NP∩coNP and P/poly. Indeed, while we are at it, let us initiate the study of YP by mentioning four simple facts that relate it to standard complexity classes.

(i) ZPP⊆YP.

(ii) YE=NE∩coNE, *where* YE *is the exponential-time analogue of* YP (*i.e. both the advice size and the verifier's running time are* 2^{O(n)}).

(iii) *If* P=NP *then* YE=E.

(iv) *If* E=NE^{NP} *then* YP=P.

(i) Similar to the proof that BPP⊂P/poly. Given a ZPP machine *M*, first amplify *M* so that its failure probability on any input of length *n* is at most 2^{−2n}. Then by a counting argument, there exists a single random string *r*_{n} that causes *M* to succeed on all 2^{n} inputs simultaneously. Use *r*_{n} as the machine's advice.

(ii) YE⊆NE∩coNE is immediate. For NE∩coNE⊆YE, first concatenate the NE and the coNE witnesses for all 2^{n} inputs of length *n*, then use the resulting string (of length 2^{O(n)}) as the machine's advice.

(iii) If P=NP then E=NE=coNE by padding. Hence YE=NE∩coNE=E by part (ii).

(iv) Let *M* be a YP machine and *y*_{n} be the lexicographically first advice string that causes *M* to succeed on all 2^{n} inputs of length *n*. Consider the following computational problem: *given integers n and i encoded in binary, compute the ith bit of y*_{n}. We claim that this problem is in NE^{NP}. For an NE^{NP} machine can first guess *y*_{n}, then check that it works for all *x*∈{0, 1}^{n} using NP queries, then check that no lexicographically earlier string *also* works using NP queries and finally return the *i*th bit of *y*_{n}. So if E=NE^{NP}, then the problem is in E, which means that an E machine can recover *y*_{n} itself by simply looping over all *i*. So since *n* and *i* take only logarithmically many bits to specify, a polynomial-time machine can recover *y*_{n}. Hence YP=P. ▪
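The counting argument in part (i) is just a union bound, which can be checked exactly with a short computation (a sketch; the machine model is abstracted away):

```python
from fractions import Fraction

def good_string_fraction(n):
    """If the amplified machine fails on any fixed input of length n with
    probability at most 2^(-2n), then by the union bound the fraction of
    random strings failing on at least one of the 2^n inputs is at most
    2^n * 2^(-2n) = 2^(-n) < 1; so a good r_n exists for every n >= 1."""
    bad = Fraction(2**n) * Fraction(1, 2**(2 * n))
    return 1 - bad

for n in range(1, 30):
    assert good_string_fraction(n) >= Fraction(1, 2)   # in fact 1 - 2^(-n)
```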

Naturally one can also define YPP and YQP, the (bounded-error) probabilistic and quantum analogues of YP. For brevity, we give only the definition of YQP.

A language *L* is in YQP if there exists a polynomial-time quantum algorithm *A* such that for all *n*:

- there exists a state |*ψ*_{n}〉 of poly(*n*) qubits such that *A*(*x*, |*ψ*_{n}〉) outputs *L*(*x*) with probability at least 2/3 for all *x*∈{0, 1}^{n}, and
- *A*(*x*, |*φ*〉) outputs 1−*L*(*x*) with probability at most 1/3 for all *x*∈{0, 1}^{n} and all |*φ*〉.

In analogy to the classical case, YQP is contained in both QMA∩coQMA and BQP/qpoly. We also have YQP/qpoly=BQP/qpoly, since the untrusted advice can be tacked onto the trusted BQP/qpoly advice. Figure 1 shows the known containments among various classes involving quantum advice and proofs.

### (c) Heuristic complexity

Ideally, we would like to show that BQP/qpoly=YQP/poly; in other words, that trusted quantum advice can be replaced by trusted classical advice together with untrusted quantum advice. However, we will be able to prove this only for the *heuristic* versions of these classes: that is, the versions where we allow algorithms that can err on some fraction of inputs.16 We now explain what this means (for details, see the excellent survey by Bogdanov & Trevisan (2006)).

A *distributional problem* is a pair (*L*, {*D*_{n}}), where *L*⊆{0, 1}^{*} is a language and *D*_{n} is a probability distribution over {0, 1}^{n}. Intuitively, for each input length *n*, the goal will be to decide whether *x*∈*L* with high probability over *x* drawn from *D*_{n}. In particular, the class HeurP, or Heuristic-P, consists (roughly speaking) of all distributional problems that can be solved in polynomial time on a 1−*ϵ* fraction of inputs.

A distributional problem (*L*, {*D*_{n}}) is in HeurP if there exists a polynomial-time algorithm *A* such that for all *n* and *ϵ*>0, *A*(*x*, 0^{⌈1/ϵ⌉}) outputs *L*(*x*) for at least a 1−*ϵ* fraction of *x* drawn from *D*_{n}.

One can also define HeurP/poly, or HeurP with polynomial-size advice. (Note that in this context, 'polynomial-size' means polynomial not just in *n* but in 1/*ϵ* as well.) Finally, let us define the heuristic analogues of BQP/qpoly and YQP/poly.

A distributional problem (*L*, {*D*_{n}}) is in HeurBQP/qpoly if there exists a polynomial-time quantum algorithm *A* such that for all *n* and *ϵ*>0, there exists a quantum advice state |*ψ*_{n,ϵ}〉 of poly(*n*, 1/*ϵ*) qubits such that *A*(*x*, |*ψ*_{n,ϵ}〉) outputs *L*(*x*) with probability at least 2/3, for at least a 1−*ϵ* fraction of *x* drawn from *D*_{n}.

A distributional problem (*L*, {*D*_{n}}) is in HeurYQP/poly if there exists a polynomial-time quantum algorithm *A*, together with classical advice strings {*a*_{n,ϵ}} of size poly(*n*, 1/*ϵ*), such that for all *n* and *ϵ*>0:

- there exists a state |*ψ*_{n,ϵ}〉 of poly(*n*, 1/*ϵ*) qubits such that *A*(*x*, |*ψ*_{n,ϵ}〉, *a*_{n,ϵ}) outputs *L*(*x*) with probability at least 2/3, for at least a 1−*ϵ* fraction of *x* drawn from *D*_{n}, and
- the probability over *x* drawn from *D*_{n} that there exists a |*φ*〉 such that *A*(*x*, |*φ*〉, *a*_{n,ϵ}) outputs 1−*L*(*x*) with probability greater than 1/3 is at most *ϵ*.

It is clear that HeurYQP/poly⊆HeurBQP/qpoly.

The following table summarizes the most important complexity classes, prefixes and suffixes defined in §4*a*–*c*.

### (d) Proof

Our goal is to show that HeurBQP/qpoly⊆HeurYQP/poly: in the heuristic setting, trusted classical advice can be used to verify untrusted quantum advice. The intuition behind this result is simple: the classical advice to the verifier *V* will consist of a polynomial number of randomly chosen 'test inputs' *x*_{1}, …, *x*_{m}, as well as whether each *x*_{i} belongs to the language *L*. Then given an untrusted quantum advice state |*φ*〉, first *V* will check that |*φ*〉 yields the correct answers on *x*_{1}, …, *x*_{m}; only if |*φ*〉 passes this initial test will *V* use it on the input *x* of interest. By appealing to our quantum learning theorem, we will argue that any |*φ*〉 that passes the initial test must yield the correct answers for *most x* with high probability.

But there is a problem: what if a dishonest prover sends a state |*φ*〉 such that, while *V*'s measurements succeed in ‘verifying’ |*φ*〉, they also *corrupt* it? Indeed, even if *V* repeats the verification procedure many times, conceivably |*φ*〉 could be corrupted by the very last repetition without *V* ever realizing it. Intuitively, the easiest way to avoid this problem is just to repeat the verification procedure a random number of times. To formalize this intuition, we need the following ‘quantum union bound’, which was proved by Aaronson (2006) based on a result of Ambainis *et al*. (2002).

*Let E*_{1}, …, *E*_{m} *be two-outcome measurements, and suppose* Tr(*E*_{i}*ρ*)≥1−*ϵ* *for all i*∈{1, …, *m*}. *Then if we apply E*_{1}, …, *E*_{m} *in sequence to the initial state ρ, the probability that any of the E*_{i}'*s reject is at most* *m*√*ϵ*.

Using proposition 4.9, we can prove the following Witness Protection Lemma.

*Let* *E*={*E*_{1}, …, *E*_{m}} *be a set of two-outcome measurements and T a positive integer. Then there exists a test procedure Q with the following properties*.

(i) *Q takes a state ρ*_{0} *as input, applies at most T measurements from* *E* *and then returns either* '*success*' *or* '*failure*'.

(ii) *If* Tr(*E*_{i}*ρ*_{0})≥1−*ϵ* *for all i, then Q succeeds with probability at least* 1−*T*√*ϵ*.

(iii) *If Q succeeds with probability at least λ, then conditioned on succeeding, Q outputs a state σ such that* Tr(*E*_{i}*σ*)≥1−2√(*m*/(*λT*)) *for all i*.

The procedure *Q* is given by the following pseudocode.

```
Let ρ := ρ_0
Choose t ∈ {1, …, T} uniformly at random
For u := 1 to t
    Choose i ∈ {1, …, m} uniformly at random
    Apply E_i to ρ
    If E_i rejects, return "FAILURE" and halt
Next u
Return "SUCCESS" and output σ := ρ
```

Property (ii) follows immediately from proposition 4.9. For property (iii), let *ρ*_{u} be the state of *ρ* immediately after the *u*th iteration, conditioned on iterations 1, …, *u* all succeeding. In addition, let *β*_{u}≔max_{i}(1−Tr(*E*_{i}*ρ*_{u})). Then *Q* fails in the (*u*+1)st iteration with probability at least *β*_{u}/*m*, conditioned on succeeding in iterations 1, …, *u*. So letting *p*_{t} be the probability that *Q* completes all *t* iterations, we have *p*_{t}≤∏_{u<t}(1−*β*_{u}/*m*). Hence, letting *z*>0 be a parameter to be determined later, the total probability mass of iterations *t* with *β*_{t}≥*z* satisfies ∑_{t:β_t≥z}*p*_{t}≤*m*/*z*. In addition, by the assumption that *Q* succeeds with probability at least *λ*, we have ∑_{t}*p*_{t}≥*λT*. So for all *i*, 1−Tr(*E*_{i}*σ*)≤*z*+*m*/(*zλT*). The last step is to set *z*≔√(*m*/(*λT*)), thereby obtaining the optimal lower bound Tr(*E*_{i}*σ*)≥1−2√(*m*/(*λT*)). ▪
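As a crude sanity check on *Q*, the following Python sketch runs the pseudocode in a purely classical toy model (an assumption of the model: each measurement independently rejects a fixed state with some small probability and never disturbs it, so this illustrates only the union-bound flavour of property (ii), not the disturbance analysis):

```python
import random

def run_Q(reject_probs, T, rng):
    """One run of the test procedure Q in the classical toy model:
    measurement i independently rejects with probability reject_probs[i]."""
    t = rng.randint(1, T)                 # random number of iterations
    for _ in range(t):
        i = rng.randrange(len(reject_probs))
        if rng.random() < reject_probs[i]:
            return False                  # "FAILURE"
    return True                           # "SUCCESS"

def success_rate(reject_probs, T, trials=20000, seed=1):
    rng = random.Random(seed)
    return sum(run_Q(reject_probs, T, rng) for _ in range(trials)) / trials

eps, m, T = 1e-4, 5, 50
rate = success_rate([eps] * m, T)
# When every measurement accepts with probability 1 - eps, the union
# bound predicts a failure probability of at most T*eps = 0.005 here.
assert rate >= 0.99
```

Randomizing the stopping time *t* is the key design choice: it prevents a dishonest state from being corrupted precisely on the last repetition without detection.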

Finally, by using lemma 4.10, we can prove theorem 1.5: HeurBQP/qpoly=HeurYQP/poly.

Fix a distributional problem (*L*, {*D*_{n}})∈HeurBQP/qpoly. Then there exists a polynomial-time quantum algorithm *A* such that for all *n* and *ϵ*>0, there exists a state |*ψ*_{n,ϵ}〉 of size poly(*n*, 1/*ϵ*) such that *A*(*x*, |*ψ*_{n,ϵ}〉) outputs *L*(*x*) with probability at least 2/3, for at least a 1−*ϵ* fraction of *x* drawn from *D*_{n}. Let *D*′_{n} be the distribution obtained by starting from *D*_{n} and then conditioning on *x* being such that *A*(*x*, |*ψ*_{n,ϵ}〉) outputs *L*(*x*) with probability at least 2/3. Then our goal will be to construct a polynomial-time verification procedure *V* such that, for all *n* and *ϵ*>0, there exists an advice string *a*_{n,ϵ} for which the following holds.

- There exists a state |*φ*〉 such that, with probability at least 1−*ϵ* over *x* drawn from *D*′_{n}, *V*(*x*, |*φ*〉, *a*_{n,ϵ}) outputs *L*(*x*) with probability at least 2/3.
- The probability over *x* drawn from *D*′_{n} that there exists a |*φ*〉 such that *V*(*x*, |*φ*〉, *a*_{n,ϵ}) outputs 1−*L*(*x*) with probability greater than 1/3 is at most *ϵ*.

If *V* succeeds with probability at least 1−*ϵ* over *D*′_{n}, then by the union bound it succeeds with probability at least 1−2*ϵ* over *x* drawn from *D*_{n}. Clearly, this suffices to prove the theorem.

As a preliminary step, let us replace *A* by an amplified algorithm *A*^{*}, which takes |*ψ*_{n,ϵ}〉^{⊗ℓ} as advice and returns the majority answer among ℓ invocations of *A*. Here ℓ is a parameter to be determined later. By a Chernoff bound, if *A*(*x*, |*ψ*_{n,ϵ}〉) outputs *L*(*x*) with probability at least 2/3, then *A*^{*}(*x*, |*ψ*_{n,ϵ}〉^{⊗ℓ}) outputs *L*(*x*) with probability at least 1−2^{−Ω(ℓ)}. We now describe the verifier *V*. The verifier receives three objects as input.

- *An input x*∈{0, 1}^{n}.
- *An untrusted quantum advice state* |*φ*_{0}〉. This |*φ*_{0}〉 is divided into ℓ registers, each with *q* qubits. The state that the verifier *expects* to receive is |*ψ*_{n,ϵ}〉^{⊗ℓ}.
- *A trusted classical advice string a*_{n,ϵ}. This *a*_{n,ϵ} consists of *m* test inputs *x*_{1}, …, *x*_{m}∈{0, 1}^{n}, together with *L*(*x*_{i}) for *i*∈{1, …, *m*}. Here *m* is a parameter to be determined later.

Given these objects, *V* does the following, where *T* is another parameter to be determined later.

```
Phase 1: Verify |φ_0⟩
    Let |φ⟩ := |φ_0⟩
    Choose t ∈ {1, …, T} uniformly at random
    For u := 1 to t
        Choose i ∈ {1, …, m} uniformly at random
        Simulate A*(x_i, |φ⟩)
        If A* outputs 1 − L(x_i), output "don't know" and halt
    Next u

Phase 2: Decide whether x ∈ L
    Simulate A*(x, |φ⟩)
    Accept if A* outputs 1; reject otherwise
```

It suffices to show that there exists a choice of test inputs *x*_{1}, …, *x*_{m}, as well as parameters ℓ, *m* and *T*, for which the following holds.

(i) If |*φ*_{0}〉=|*ψ*_{n,ϵ}〉^{⊗ℓ}, then phase 1 succeeds with probability at least 5/6.

(ii) If phase 1 succeeds with probability at least 1/3, then conditioned on its succeeding, the state *σ* remaining after phase 1 satisfies Pr[*A*^{*}(*x*_{i}, *σ*) outputs *L*(*x*_{i})]≥17/18 for all *i*∈{1, …, *m*}.

(iii) If Pr[*A*^{*}(*x*_{i}, *σ*) outputs *L*(*x*_{i})]≥17/18 for all *i*∈{1, …, *m*}, then Pr[*A*^{*}(*x*, *σ*) outputs *L*(*x*)]≥5/6 for at least a 1−*ϵ* fraction of *x* drawn from *D*′_{n}.

For conditions (i)–(iii) ensure that the following holds with probability at least 1−*ϵ* over *x* drawn from *D*′_{n}. First, if |*φ*_{0}〉=|*ψ*_{n,ϵ}〉^{⊗ℓ}, then *V* outputs *L*(*x*) with probability at least 1−1/6−1/6=2/3 by the union bound. Here 1/6 is the maximum probability of failure in phase 1, while 5/6 is the minimum probability of success in phase 2. Second, for all |*φ*_{0}〉, either phase 1 succeeds with probability less than 1/3, or else phase 2 succeeds with probability at least 5/6. Hence *V* outputs 1−*L*(*x*) with probability at most 1/3. Therefore, *V* is a valid verifier as desired.

Set the parameters ℓ, *m* and *T*, where *K*>0 is a sufficiently large constant and *q* is the number of qubits of |*ψ*_{n,ϵ}〉. In addition, form the advice string *a*_{n,ϵ} by choosing *x*_{1}, …, *x*_{m} independently from *D*′_{n}. We will show that conditions (i)–(iii) all hold with high probability over the choice of *x*_{1}, …, *x*_{m}—and hence, that there certainly *exists* a choice of *x*_{1}, …, *x*_{m} for which they hold.

To prove (i), we appeal to part (ii) of lemma 4.10. Setting *ϵ*′≔1/(36*T*^{2}), we can take ℓ large enough that *A*^{*}(*x*_{i}, |*ψ*_{n,ϵ}〉^{⊗ℓ}) outputs *L*(*x*_{i}) with probability at least 1−*ϵ*′ for all *i*∈{1, …, *m*}. Therefore phase 1 succeeds with probability at least 1−*T*√*ϵ*′=5/6.

To prove (ii), we appeal to part (iii) of lemma 4.10. Set *λ*≔(1/3). Then if phase 1 succeeds with probability at least *λ*, conditioned on its succeeding the remaining state *σ* satisfies Pr[*A*^{*}(*x*_{i}, *σ*) outputs *L*(*x*_{i})]≥1−2√(3*m*/*T*)≥17/18 for all *i*, where the last inequality holds provided *T* is a sufficiently large multiple of *m*.

Finally, to prove (iii), we appeal to theorem 1.2. Set *η*≔(1/18). Then for all *i* we have Pr[*A*^{*}(*x*_{i}, *σ*) outputs *L*(*x*_{i})]≥1−*η*, and also the test inputs *x*_{1}, …, *x*_{m} were drawn independently from *D*′_{n}. Now set *γ*≔(1/9) and *δ*≔(1/3). Then *γ*>*η* and *γ*+*η*=1/6. So theorem 1.2 implies that Pr[*A*^{*}(*x*, *σ*) outputs *L*(*x*)]≥1−*γ*−*η*=5/6 for at least a 1−*ϵ* fraction of *x* drawn from *D*′_{n}, with probability at least 1−*δ* over the choice of *a*_{n,ϵ}. Here we have used the fact that the amplified advice has *q*ℓ=poly(*n*, 1/*ϵ*) qubits, and that *m* was chosen sufficiently large compared with *q*ℓ. ▪

## 5. Summary and open problems

Perhaps the central question left open by this paper is which classes of states and measurements can be learned, not only with a linear number of measurements but also with a reasonable amount of computation. To give two examples, what is the situation for stabilizer states (Aaronson & Gottesman 2004) or non-interacting fermion states (Terhal & DiVincenzo 2002)?17

On the experimental side, it would be interesting to demonstrate our statistical quantum-state learning approach in photonics, ion traps, NMR or any other technology that allows the preparation and measurement of multi-qubit entangled states. Already for three or four qubits, complete tomography requires hundreds of measurements, and depending on what accuracy is needed, it seems probable that our learning approach could yield an efficiency improvement. How much of an improvement partly depends on how far our learning results can be improved, as well as on what the constant factors are. A related issue is that, while one can always reduce noisy, *k*-outcome measurements to the noiseless, two-outcome measurements that we consider, one could almost certainly prove better upper bounds by analysing realistic measurements more directly.

One might hope for a far-reaching generalization of our learning theorem to what is known as *quantum process tomography*. Here the goal is to learn an unknown quantum *operation* on *n* qubits by feeding it inputs and examining the outputs. But for process tomography, it is not hard to show that exponentially many measurements really are needed; in other words, the analogue of our learning theorem is false.18 Still, it would be interesting to know if there is anything to say about statistical process tomography for restricted classes of operations.

Finally, our quantum information results immediately suggest several problems. First, does BQP/qpoly=YQP/poly? In other words, can we use classical advice to verify quantum advice even in the worst-case setting? Alternatively, can we give a 'quantum oracle' (see Aaronson & Kuperberg 2007) relative to which BQP/qpoly≠YQP/poly? Second, can the relation *R*^{1}(*f*)=*O*(*MQ*^{1}(*f*)) be improved to *D*^{1}(*f*)=*O*(*MQ*^{1}(*f*)) for all *f*? Perhaps learning theory techniques could even shed light on the old problem of whether *R*^{1}(*f*)=*O*(*Q*^{1}(*f*)) for all total *f*.

## Acknowledgments

I thank Noga Alon, Dave Bacon, Peter Bartlett, Robin Blume-Kohout, Andy Drucker, Aram Harrow, Tony Leggett, Peter Shor, Luca Trevisan, Umesh Vazirani, Ronald de Wolf and the anonymous reviewers for their helpful discussions and correspondence.

## Footnotes

Electronic supplementary material is available at http://dx.doi.org/10.1098/rspa.2007.0113 or via http://www.journals.royalsoc.ac.uk.

This work was done while the author was a postdoc at the University of Waterloo, supported by CIAR through the Institute for Quantum Computing.

↵*D* can also be a continuous probability measure; this will not affect any of our results.

↵At this point we should remind the reader that the distribution *D* over measurements only has to exist; it does not have to be known. All of our learning algorithms will be 'distribution-free', in the sense that a single algorithm will work for any choice of *D*.

↵Interestingly, in the 'representation-independent' setting (where the output hypothesis can be an arbitrary Boolean circuit), this problem is not known to be NP-complete.

↵Note that any sample complexity bound must have at least a linear dependence on *k*. Here is a proof sketch: given a subset *S*⊆{1, …, *k*} with |*S*|=*k*/2, let |*S*〉 be a uniform superposition over the elements of *S*. Now consider simulating a measurement of |*S*〉 in the computational basis, {|1〉, …, |*k*〉}. It is clear that *Ω*(*k*) sample measurements are needed to do this even approximately.

↵According to Google Scholar, Valiant's original paper on the subject (Valiant 1984) has been cited 1918 times as of this writing, with a large fraction of the citations coming from fields other than theoretical computer science.

↵Here the superscript ‘1’ denotes one-way communication.

↵The result we state is a special case of Bartlett & Long's (1998) theorem 20, where the function *F* to be learned is itself a member of the hypothesis class.

↵If we care about optimizing the constant under the *Ω*(*k*), then we are better off avoiding amplification and instead proving theorem 2.6 directly using the techniques of Ambainis *et al*. (2002). Doing so, we obtain *n*/*γ*^{2}≥2*k*/ln 2.

↵We can assume without loss of generality that Bob is deterministic, i.e. that his output is a function of *a* and *y*.

↵This *f* is the equality function: *f*(*x*, *y*)=1 if *x*=*y*, and *f*(*x*, *y*)=0 otherwise.

↵Indeed, it suffices to give a deterministic protocol that errs with probability at most 1/3 over *y* drawn from *D*, a fact we will not need.

↵This is the function *f*:{0, 1}^{N}×{1, …, *N*}→{0, 1} defined by *f*(*x*_{1}, …, *x*_{N}, *i*)=*x*_{i}.

↵The protocol is as follows: first Alice sends the (log *N*)-qubit quantum message (1/√*N*)∑_{i}(−1)^{x_i}|*i*〉. Then Bob measures in a basis corresponding to (*i*_{1}, *j*_{1}), …, (*i*_{αN}, *j*_{αN}). With probability 2*α*, Bob will learn *x*_{i_ℓ}⊕*x*_{j_ℓ} for some edge (*i*_{ℓ}, *j*_{ℓ}). So it suffices to amplify the protocol *O*(1/*α*) times.

↵See Aaronson (2005) for a detailed definition, as well as a proof that PostBQP coincides with the classical complexity class PP.

↵Here Yoda, from Star Wars, is intended to evoke a sage whose messages are highly generic (“Do or do not... there is no try”). One motivation for the name YP is that, to our knowledge, there had previously been no complexity class starting with a ‘Y’.

↵Closely related to heuristic complexity is the better-known average-case complexity. In average-case complexity one considers algorithms that can never err, but that are allowed to output ‘don't know’ on some fraction of inputs.

↵Note that we can only hope to learn such states efficiently for restricted classes of measurements. Otherwise, even if the state to be learned were a classical basis state |*x*〉, a 'measurement' of |*x*〉 might be an arbitrary polynomial-time computation that fed *x* as input to a pseudorandom function.

↵Here is a proof sketch: let *U* be an *n*-qubit unitary that maps |*x*〉|*b*〉 to |*x*〉|*b*⊕*f*(*x*)〉, for some Boolean function *f*:{0, 1}^{n−1}→{0, 1}. Then to predict *U* on a 1−*ϵ* fraction of basis states, we need to know (1−*ϵ*)2^{n−1} bits of the truth table of *f*. But Holevo's theorem implies that by examining *U*|*ψ*_{i}〉 for *T* input states |*ψ*_{1}〉, …, |*ψ*_{T}〉, we can learn at most *Tn* bits about *f*.

- Received July 2, 2007.
- Accepted August 16, 2007.
- Accepted August 16, 2007.

- © 2007 The Royal Society