Quantum secure learning with classical samples



I. INTRODUCTION
The hybridization of machine learning and quantum theory has been intensively studied, especially to explore the possibility of quantum learning speedups. Very recently, the incorporation of useful quantum-algorithm kernels (e.g., quantum linear solvers [1]) into data-processing tasks in machine learning has yielded encouraging results [2-5]. Within a span of a few years, such approaches have become increasingly important in quantum computation, leading to the advent of quantum machine learning [6,7].
In parallel, the issue of security has been of considerable interest to the machine learning community. The term "secure learning" usually indicates that learning is allowed only for the legitimate learner, who wants to rule out adversarial learners. The main objective of these adversaries is to acquire the ability to become the equal of the legitimate learner or to render the legitimate learner's learning counterproductive. In this context, one open issue is how to define a secure learning condition for detecting and preventing such adversaries. While this problem has been widely studied in classical learning [8,9], only a few quantum mechanical studies have been conducted so far [10-12].
We note that the legitimate learning parties can communicate a (classically) encrypted dataset after generating a secret key via a well-established quantum-key-distribution (QKD) scheme. In that case, it would be impractical for the adversarial learner(s) to extract critical learning information once the QKD is completed. However, the adversarial learner(s) may still want to spoil the learning by disrupting the communication. This can be achieved simply by disturbing the encrypted data after the key is distributed, which is one of the distinctive aspects of learning security [8]. Thus, learning security can neither be fully achieved nor even defined by QKD alone.
* The first two authors contributed equally to this work.
† jbang@etri.re.kr
With the above in mind, in this paper we construct a secure learning condition with favorable quantum properties. To this end, we first design a protocol for secure sampling that runs between two legitimate learning parties, casting a classical-quantum hybrid oracle that allows large-size classical inputs with a small-scale quantum system [13]. As the main result, we derive a secure learning condition such that only the original legitimate learner is guaranteed success in learning; we designate this the secure probably-approximately-correct (PAC) learning condition. The beauty of this condition is that the security is derived only from the number of learning samples the legitimate learner requires, and it stems from the quantum no-broadcasting principle [14,15]; therefore, such a condition cannot be defined in any classical regime. Our paper also leads to an intriguing classical-quantum interplay, namely, one in which the (large) input data remain classical while the useful quantum properties are exploited in a small quantum system [16,17]. Such an architecture avoids the use of largely superposed samples and is well suited to noisy intermediate-scale quantum (NISQ) technologies [18].

II. PROBLEM
Given a (Boolean) function c ∈ C that maps the input x = x_0 x_1 ··· x_{n−1} to a binary value c(x) ∈ {0, 1}, learning is defined as the process of identifying a hypothesis h ∈ H close to c. Each binary digit x_j ∈ {0, 1} (j = 0, 1, ..., n − 1) can be considered a "feature," and the size |H| of the hypothesis set, called the "model complexity," is assumed to be finite. Such a problem covers a wide variety of learning tasks; in particular, this binary setting can, in principle, be extended to more general situations such as multi-class tasks [19]. For this reason, the binary classification framework has generally been used in computational learning theory [20,21].
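As a concrete toy illustration of this setup, the sketch below instantiates an arbitrarily chosen concept and a small finite hypothesis set, and measures the error between h and c as the fraction of disagreements; the concept, the hypothesis set, and all names here are illustrative assumptions, not from the paper.

```python
from itertools import product

n = 3  # number of binary features (illustrative)
inputs = list(product([0, 1], repeat=n))

# Hypothetical target concept c (not from the paper): the first feature.
def c(x):
    return x[0]

# A small, finite hypothesis set H: each feature and its negation.
H = [lambda x, j=j: x[j] for j in range(n)] + \
    [lambda x, j=j: 1 - x[j] for j in range(n)]

# Error E(h, c): fraction of inputs on which h and c disagree.
def error(h):
    return sum(h(x) != c(x) for x in inputs) / len(inputs)

best = min(H, key=error)
print(error(best))  # 0.0 -- here H happens to contain a hypothesis identical to c
```

Learning then amounts to finding such a low-error h from samples alone, without inspecting c directly.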
In such a problem, the learner, say Alice (A), should first sample a set T = {(x, c(x))} of input-target pairs. To accomplish this sampling, A employs a black box called the oracle, which is responsible for accessing the critical information, namely c(x) for a given x.
Here, we assume that the oracle is owned by A 's distant partner, say Bob (B). Such an assumption, namely, of the two learning parties being located far apart, is commonly invoked in secure learning [8]. The issue is then how A can sample a clean dataset T with B in a manner that is secure against any malicious attack; in other words, how can A learn c securely?

III. SECURE SAMPLING PROTOCOL
We introduce a classical-quantum hybrid oracle O(c), which consists of input and output channels for the n-bit classical data x and for a single qubit, denoted by C_AB and Q_AB, respectively. This oracle O(c) implements (x, |α⟩) → (x, |c(x) ⊕ α⟩) for α ∈ {0, 1} and (r, |α⟩) → (r, |α⟩) for α ∈ {+, −}, where |c(x)⟩ is the oracle answer for a given x. Here, r is a random input that is cast for the purpose of testing for any malicious intruder who disturbs the communication; thus, r is chosen such that (r, y) ∉ T for any y ∈ {0, 1}. The construction of such an operation is fairly common, e.g., in QKD or quantum secure direct communication schemes [22,23]. Note that it is not permissible to extract any information by looking into O(c). A useful hybrid oracle architecture is presented in Appendix A.
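The oracle's action on computational-basis and Hadamard-basis inputs can be mimicked classically; the sketch below is a minimal simulation of O(c) for an assumed parity concept (the concept and the string encoding of the qubit states are illustrative assumptions, not specified by the paper).

```python
def c(x):
    """Hypothetical target concept (illustrative): parity of the input bits."""
    return sum(int(b) for b in x) % 2

def hybrid_oracle(x, qubit):
    """Toy classical simulation of O(c).

    `qubit` is '0'/'1' (computational basis) for learning queries, or
    '+'/'-' (Hadamard basis) for test queries, which pass through unchanged.
    """
    if qubit in ('0', '1'):
        alpha = int(qubit)
        return x, str(c(x) ^ alpha)   # (x, |c(x) XOR alpha>)
    if qubit in ('+', '-'):
        return x, qubit               # (r, |alpha>) unchanged
    raise ValueError("qubit must be one of '0', '1', '+', '-'")

print(hybrid_oracle('101', '0'))  # ('101', '0'): c('101') = 0
print(hybrid_oracle('110', '+'))  # test query passes through unchanged
```

The two branches correspond exactly to the two mappings above: learning queries pick up c(x) via an XOR, and test queries are left untouched when the oracle is undisturbed.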

IV. NO-BROADCASTING OF LEARNING SAMPLES
With the protocol described above, we present our first result.
Theorem 1. In our protocol, for any given c ∈ C, B cannot distribute the full set of learning samples, namely, T = {(x, c(x))}, to A and other (external) learners. Therefore, the condition
T^(k) = T for all k, (1)
where T^(k) is the set of samples that the kth learner (i.e., A or E) finally gets under strategy E, cannot be satisfied.
For the proof of this theorem, we let ρ̂_0 = |c(x)⟩⟨c(x)| and ρ̂_1 = |α⟩⟨α| (α ∈ {+, −}), each defined as the state of the ideal oracle output in a trial for a given input (x or r, respectively).
FIG. 2. General attack by adversarial learners. Here, we consider L − 1 adversarial learners who can freely access C_AB and Q_AB. Each adversarial learner has his or her own (in principle, infinite-size) ancillary system and is assumed to be an expert in quantum theory. We further assume that the adversarial learners can team up to process an optimal strategy E for their own or for the group's benefit.
Suppose B adopts a strategy E to distribute the samples in T among the L ≥ 2 learners (including A). In general, E can be represented as a completely positive and trace-preserving map with an overall unitary Û_E and an arbitrary ancilla state Ξ̂ (see Fig. 2). The state distributed to the kth learner can be written as
ρ̂_E^(k) = Tr_{S\(k)} [Û_E (ρ̂_s ⊗ Ξ̂ ⊗ Γ̂) Û_E†], (2)
where Tr_{S\(k)} denotes the partial trace with respect to all systems S except the one labeled by the kth learner, and Γ̂ represents a state of L − 1 qubits, each of which is distributed to the corresponding learner other than A (here, k = 1 denotes A). Then, B cannot broadcast the states ρ̂_s (s = 0, 1) to the (k-indexed) learners: the states ρ̂_0 and ρ̂_1 are nonorthogonal and hence not perfectly distinguishable, so they cannot be broadcast [14,24,25]. Therefore, a sample pair (x, c(x)) cannot be faithfully shared for a given x, the full set T cannot be distributed in complete form, and Theorem 1 holds.
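The key fact used here, that ρ̂_0 and ρ̂_1 are nonorthogonal, can be checked numerically; for pure states the fidelity reduces to the squared overlap, shown below for the representative pair |0⟩ and |+⟩.

```python
import numpy as np

ket0 = np.array([1.0, 0.0])                 # computational-basis oracle output
ket_plus = np.array([1.0, 1.0]) / np.sqrt(2)  # Hadamard-basis test state

# For pure states, F(rho, sigma) reduces to the squared overlap |<psi|phi>|^2.
fidelity = abs(ket0 @ ket_plus) ** 2
print(fidelity)  # ~0.5: nonzero overlap, so the two outputs are nonorthogonal
```

A nonzero overlap between the two state families is exactly what triggers the no-broadcasting theorem invoked in the proof.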

V. SECURE PROBABLY-APPROXIMATELY-CORRECT LEARNING
Suppose that A is the only legitimate learner and the other L − 1 learners are malicious intruders. Without loss of generality, we let k ∈ {A, E} with L = 2; equivalently, we assume that all L − 1 intruders team up as a single E. In this setting, we take E to be the general attack strategy adopted by E. Theorem 1 then describes the following situation: if E disturbs the protocol, the samples prepared by A (and also by E) must be noisy; specifically, a portion η_A (η_E) of contaminated samples, for example, (x, c(x) ⊕ 1), is included in A's (E's) sample set. Note that neither A nor E can identify these contaminations. Here, η^(k) ≤ 1/2 (k ∈ {A, E}) is determined by E's strategy E. It can be written as (for |T| ≫ 1)
η^(k) = 1 − |T_S^(k)|/|T^(k)| ≥ 1 − F(ρ̂_s, ρ̂_E^(k)), (3)
where T_S^(k) denotes the set of uncontaminated samples in T^(k) and F(ρ̂, σ̂) is the fidelity between the states ρ̂ and σ̂ [26]. The inequality on the rightmost side is introduced because E may produce contaminated samples even in cases where ρ̂_s is cloned as well as possible [?]. The equality is always saturated for A. We then assume that our protocol forbids any strategy E that allows the condition
η_A ≤ η_c and η_E ≤ η_c (4)
with a critical factor η_c. This assumption is valid when η_c is chosen such that η_c = 1 − F_opt, where F_opt is the optimal fidelity achievable by a (1 → 2) cloner of ρ̂_s [?]. Then, Eq. (4) can be rewritten by using Eq. (3) as
F(ρ̂_s, ρ̂_E^(k)) ≥ F_opt for both k ∈ {A, E}, (5)
which immediately contradicts the quantum no-cloning principle [15]. We note that if Alice could acquire information about Eve's attack scenario (if any), a more useful setting of η_c might be possible. If η_c = 0, Eq. (4) becomes equivalent to the condition of Eq. (1) and we recover Theorem 1.
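The choice η_c = 1 − F_opt can be made concrete; as an illustration (our assumption, since the paper's F_opt depends on the relevant state set), take F_opt to be the optimal universal symmetric (1 → 2) qubit-cloning fidelity, 5/6.

```python
from fractions import Fraction

# Optimal fidelity of a universal symmetric 1 -> 2 qubit cloner: 5/6
# (illustrative choice of F_opt; the appropriate value depends on the state set).
F_opt = Fraction(5, 6)

# Critical contamination factor from the main text: eta_c = 1 - F_opt.
eta_c = 1 - F_opt
print(eta_c)  # 1/6
```

Any attack that kept both η_A and η_E at or below 1/6 would amount to cloning ρ̂_s above the optimal fidelity, which is forbidden.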
We now discuss secure learning in the framework of so-called PAC learning [21,27]. In PAC learning, the concept class C is said to be (ε, δ)-PAC learnable [and we call the learner an (ε, δ)-PAC learner] if an ε-approximately correct solution (i.e., hypothesis) h ∈ H can be found with probability at least 1 − δ; here, the degree of approximation is quantified by an error function E(h, c) that indicates how h and c differ [21]. The theory of PAC learning indicates that if a learner is allowed to use a certain number, say M_b(ε, δ), of samples contaminated with a portion η, he or she is guaranteed to be an (ε, δ)-PAC learner. In this case, η is defined as the fraction of contaminated samples in the entire sample set [refer to Eq. (3)]. Usually, M_b(ε, δ) is referred to as the "sample complexity" [20,28]. Here, M_b(ε, δ) falls into two categories depending on whether the samples are ideal (i.e., η = 0) or noisy (i.e., η ∈ (0, 1/2]) (for more details, see Appendix B, Refs. [21,27], and the informative summary in Chap. 5 of Ref. [29]). The latter, namely the noisy PAC learning model, provides a useful framework and is the suitable one for our paper, because contaminations, whether from E or from imperfections intrinsic to the channels, can be included in the expression for η^(k).
It is noteworthy that a fully quantum model of PAC learning, namely quantum PAC learning, has also been developed by using a quantum oracle that allows (large) superpositions of the inputs x [29]. However, secure learning has not yet been studied within either the classical or the quantum PAC learning framework.
We now present our second result.
Theorem 2. For any given c ∈ C, let M_b^A(ε, δ) and M_b^E(ε, δ) denote the "optimal" sample complexities of A and E, respectively [?]. Then, during the running of our protocol, if A becomes an (ε, δ)-PAC learner by identifying a number of samples smaller than M_b^E(ε, δ), E cannot become an (ε, δ)-PAC learner for the same ε and δ.
The proof of this theorem is as follows. First, consider the case η_A ≥ η_E, which leads to M_b^A(ε, δ) ≥ M_b^E(ε, δ). In this case, it is impossible for A to be an (ε, δ)-PAC learner with a number M of samples smaller than M_b^E(ε, δ). Second, in the case η_A < η_E, if A completes the learning with M samples and becomes an (ε, δ)-PAC learner satisfying M < M_b^E(ε, δ), then E cannot simultaneously be an (ε, δ)-PAC learner, because the protocol will be terminated before E obtains a sufficient number of samples [i.e., larger than M_b^E(ε, δ)] to be an (ε, δ)-PAC learner. This proves Theorem 2.
On the basis of the above analysis, we present a definition of a secure learner.
Definition 1. For any given c ∈ C, A is a secure (ε, δ)-PAC learner if A becomes an (ε, δ)-PAC learner by identifying a number M of samples satisfying
M_b(ε, δ) ≤ M < M_c(ε, δ). (6)
Here, M_b(ε, δ) and M_c(ε, δ) are defined from A's and E's optimal sample complexities M_b^A(ε, δ) and M_b^E(ε, δ), respectively. For wide applicability of Theorems 1 and 2 and Definition 1, we apply two additional rules:
(R.1) When the number of trials for (r, |α⟩) reaches M_b(ε, δ) − Γ, A tests whether η_A ≥ η_c − ∆; if so, A suspends the process, having confirmed that the state changes, namely |±⟩ → |∓⟩, were caused by E; otherwise, A continues the process. Here, we use the approximation M_{c(x)→c(x)⊕1} = M_{c(r)≠α}, where M_{c(x)→c(x)⊕1} denotes the number of contaminated pairs in A's sample set after a certain number of trials and M_{c(r)≠α} denotes the number of flipped test states. This approximation is reasonable because A generates (r, |α⟩ ∈ {|+⟩, |−⟩}) or (x, |α⟩ ∈ {|0⟩, |1⟩}) with probability 1/2 each, which cannot be discriminated by E.
(R.2) If the learning is not completed by the time the number of trials for (x, |α⟩) reaches M_c(ε, δ), A quits the process.
It is to be noted that the factors Γ and ∆ in (R.1) are introduced to limit the quality of E's learning.
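Rules (R.1) and (R.2) can be sketched as a toy classical simulation in which A estimates the contamination from the test rounds and halts once the estimate reaches η_c − ∆; the attack model (independent flips), the numerical values, and the halting logic below are illustrative assumptions, not the paper's protocol parameters.

```python
import random

ETA_C, DELTA = 1 / 6, 0.02   # critical factor and margin (illustrative values)

def run_protocol(num_trials, p_attack):
    """Toy sketch of rule (R.1): estimate contamination from test rounds.

    Each round is a test query (r, |+/->) with probability 1/2, which Eve
    cannot distinguish from a learning query; an attack flips any
    transmitted state independently with probability p_attack.
    """
    tests = flips = 0
    for _ in range(num_trials):
        is_test = random.random() < 0.5
        disturbed = random.random() < p_attack
        if is_test:
            tests += 1
            if disturbed:            # |+> -> |-> observed by A
                flips += 1
    eta_est = flips / max(tests, 1)  # proxy for eta_A, per the (R.1) approximation
    return 'halt' if eta_est >= ETA_C - DELTA else 'continue'

print(run_protocol(10_000, p_attack=0.0))   # no attack: 'continue'
print(run_protocol(10_000, p_attack=0.5))   # heavy attack: 'halt'
```

Because test and learning rounds look identical to Eve, the flip rate she induces on test states estimates the contamination of the learning samples, which is the content of the (R.1) approximation.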
We can now analyze the possible situations. First, consider case (i) η_A ≥ η_E, in which two subcases, (i-a) and (i-b), can be considered. Neither actually occurs, because (R.1) halts the process when η_A ≥ η_c − ∆; hence E is not allowed to become an (ε, δ)-PAC learner. Second, for case (ii) η_A < η_E, we can also consider two subcases, (ii-a) and (ii-b). In case (ii-a), if A can learn a hypothesis h close to c (for any given ε and δ) with M samples satisfying Eq. (6), A becomes a secure (ε, δ)-PAC learner according to Definition 1, while E cannot. At least in theory, it is not impossible for E to obtain samples of a size identical to A's after the completion of A's learning; nevertheless, E cannot be an (ε, δ)-PAC learner at the same level as A, since η_E cannot be smaller than η_A + ∆. The condition η_A ≥ η_c − ∆ in (ii-b) will also halt the protocol via rule (R.1). Thus, our results (Theorems 1 and 2 and Definition 1) can be practically applied to the protocol against any E. Further, by using Γ and ∆, we can set a minimum gap between the levels of A's and E's PAC learning in the worst case, which prevents E from becoming even a slightly weaker PAC learner than A. The subcases η_c − ∆ ≥ η_A ≥ η_E and η_c − ∆ ≥ η_E > η_A are not expected to occur, since they contradict the prohibition of Eq. (4).

VI. MULTI-CLASS CLASSIFICATION
We also consider the multi-class problem by assuming that the input x belongs to one of 2^m different classes (m ≥ 2). Here, we briefly sketch two strategies. (i) First, the multi-class classification problem is commonly solved by decomposing it into several binary problems. For instance, the "one-vs-all (OVA)" reduction is often used [19], in which the problem is decomposed into 2^m decisions h_i (i ∈ {0, 1, ..., 2^m − 1}), each separating the learning data of the ith class from those of the other classes. A datum x is then classified via arg max_i h_i(x). Here, the condition for secure PAC learning in Eq. (6) can be applied to each decision h_i. However, a long learning time is required, because the condition in Eq. (6) must be satisfied for all 2^m decisions.
(ii) Alternatively, we can consider a single-machine approach, in which the oracle answers for all 2^m labels, that is, y ∈ {0, 1}^m, by allowing m qubits conditioned on the same x-input channel. In this generalization, our theorems and the condition in Eq. (6) also apply consistently to states of an arbitrary number of qubits. However, in this case, the region |M_c(ε, δ) − M_b(ε, δ)| that satisfies the secure PAC learning condition narrows; in other words, the security condition becomes more stringent. For a detailed analysis, see Appendix C.

VII. REMARKS
We have presented a concept of secure learning that safeguards against any malicious manipulation of learning samples. In contrast to other studies on secure learning, we constructed an analytic framework based on a computational model of learning theory, called PAC learning. This allowed us to establish the link between sample complexity and the condition for learning security. Our approach is appealing because the security condition is defined solely by the sample size; in particular, it is independent of A 's (or E 's) learning algorithms.
Our derivations of Theorems 1 and 2 were based on the quantum principle of no-broadcasting of states, and using these theorems, we introduced the concept of secure PAC learning. Such a security condition cannot exist in the classical regime, where E can create as many copies of the learning samples as he or she wishes.
It is noteworthy that our protocol was designed based on a classical-quantum hybridization, where the input data remain classical but only a single-qubit system is employed. Such a hybridization differs considerably from those of other hybrid models. This architecture renders our protocol suitable for NISQ implementation, without requiring an excessively large superposition of samples or access to a novel quantum gadget called quantum random-access memory [30,31].
We finally point out that determining a more practical form of M_c(ε, δ) in Eq. (6) remains an open problem and will be considered in a follow-up study. Notably, it is related to the determination of the optimal sample complexity, which has been a long-standing interest in computational learning theory, especially in the case where the samples are noisy. We believe that our paper will contribute to expanding the frontiers of quantum secure machine learning.

Appendix A: Hybrid Oracle Architecture

FIG. 3. Schematic of a hybrid oracle. The oracle consists of two different input and output channels: the classical input data x = x_1 x_2 ··· x_n (x_j ∈ {0, 1} for all j = 1, ..., n) and a single qubit that produces the oracle output states. The oracle applies 2^n unitary gates â_k ∈ {σ̂_z, iσ̂_y} (k = 0, 1, ..., 2^n − 1), conditioned on the values of the classical bits x_j in x, to the qubit channel. In a purely classical case, these gates are either identity or logical-NOT gates.

Here, we present an example of a classical-quantum hybrid oracle, which can be applied to our study of secure learning. This oracle accepts the classical input x together with a single qubit |α⟩ and performs the mappings
(x, |α⟩) → (x, |c(x) ⊕ α⟩) for α ∈ {0, 1}, (A1)
(r, |α⟩) → (r, |α⟩) for α ∈ {+, −}, (A2)
where r is a random datum that is used for performing a security check. Note that x remains unaltered during and after the sampling process. This hybrid oracle can be implemented by a circuit with a specific architecture, such as that shown in Fig. 3. The circuit contains 2^n gates acting on the ancilla qubit: the single-qubit gate â_0 and the 2^n − 1 gates â_k (k = 1, 2, ..., 2^n − 1) conditioned on the classical-bit values x_1, x_2, ..., x_n in x. The gates â_k are given by
â_k ∈ {σ̂_z, iσ̂_y} for all k = 0, 1, ..., 2^n − 1, (A3)
where σ̂_x, σ̂_y, and σ̂_z are the Pauli operators. This architecture of the oracle is inspired by the general (Reed-Muller) expression of a Boolean function [32],
c(x) = a_0 ⊕ a_1 x_1 ⊕ a_2 x_2 ⊕ ··· ⊕ a_{2^n−1} x_1 x_2 ··· x_n, (A4)
where a_k ∈ {0, 1} (k = 0, 1, ..., 2^n − 1) are known as the Reed-Muller coefficients.
Here, each coefficient a_k has a corresponding gate operation â_k. More specifically, a_k = 0 implies that â_k leaves the bit signal unchanged (identity), while a_k = 1 indicates that â_k flips the bit signal (logical-NOT) [33]. The oracle is thus characterized by a fixed set of gates â_k for a given c. Information on the gates â_k and how they act is not provided; it must be learned. Such an oracle architecture indeed differs from other hybrid schemes. It has been argued that such hybridization can offer the advantages of being NISQ-implementable and of achieving speedups [16,17].
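The Reed-Muller coefficients mentioned above can be computed with the binary Möbius transform over the truth table; the sketch below recovers the coefficients of a small arbitrary concept and re-evaluates c from its algebraic normal form (the helper names are ours, and the AND concept is only an example).

```python
from itertools import product

n = 2
# Truth table of a hypothetical concept c over n bits (here: AND).
truth = {x: x[0] & x[1] for x in product([0, 1], repeat=n)}

def anf_coeffs(truth, n):
    """Reed-Muller coefficients a_k via the binary Mobius transform:
    a_k is the XOR of c(x) over all x covered by the monomial mask k."""
    a = {}
    for k in product([0, 1], repeat=n):
        acc = 0
        for x in product([0, 1], repeat=n):
            if all(x[j] <= k[j] for j in range(n)):  # x lies "below" mask k
                acc ^= truth[x]
        a[k] = acc
    return a

def eval_anf(a, x):
    """Evaluate c(x) as the XOR, over masks k with a_k = 1, of prod_{j: k_j=1} x_j."""
    out = 0
    for k, ak in a.items():
        if ak and all(x[j] == 1 for j in range(len(x)) if k[j] == 1):
            out ^= 1
    return out

a = anf_coeffs(truth, n)
assert all(eval_anf(a, x) == truth[x] for x in truth)
print(a[(1, 1)])  # 1 -- for AND, only the x1*x2 monomial survives
```

Each nonzero a_k corresponds to one conditioned gate â_k in the circuit of Fig. 3, which is why learning c amounts to learning the gate set.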

Appendix B: Probably-Approximately-Correct (PAC) Learning Model
In the PAC learning model [27], a learner samples a finite set of training data {(x_i, c(x_i))} (i = 1, 2, ..., M) by accessing an oracle. Here, x_i is typically assumed to be drawn uniformly. For any c ∈ C, a learning algorithm is an (ε, δ)-PAC learner (under the uniform distribution) if it can obtain an ε-approximately correct h ∈ H with probability at least 1 − δ. More specifically, a learning algorithm is an (ε, δ)-PAC learner if it satisfies the condition
Pr[E(h, c) ≤ ε] ≥ 1 − δ, (B1)
where E(h, c) denotes the error, for example, the distance between h and c. If the obtained h agrees with a number
M ≥ (1/ε)(ln |H| + ln(1/δ)) (B2)
of samples constructed from the oracle, then Eq. (B1) holds. Here, |H| denotes the cardinality of H, often called the model complexity. In the standard context, Eq. (B2) is known as the "sample complexity" [21,27]; in other words, it yields the minimum number of training samples required to successfully learn an h ∈ H satisfying Eq. (B1). Such a sample complexity derived from previous classical studies can be used directly in our scenario: in our classical-quantum hybrid query scheme, the same sample complexity applies, since the x_i and c(x_i) identified by the measurements performed by Alice are classical. The beauty of this theorem is that the condition for being a PAC learner depends only on the number of samples, not on any specific learning algorithm.
In the case where the oracle outputs are contaminated, the sample complexity in Eq. (B2) is modified as follows. First, we draw a sequence of training data {(x_i, m_i)}, where m_i ∈ {c(x_i), c(x_i) ⊕ 1} denotes the outcome of the measurement performed by Alice. Subsequently, if sampling is performed with a number of samples enlarged by an additional factor ξ relative to Eq. (B2), we can verify that Eq. (B1) holds for the algorithm that obtains h ∈ H. It has been proven that the additional factor ξ is given by [28]
ξ = 1/(1 − 2η)². (B3)
Such a noisy PAC learning model provides a useful framework for our study of secure learning. It is noteworthy that in our scenario, the contamination of the output due to an attack by Eve and that resulting from imperfections intrinsic to the oracle can be incorporated together into the factor η.
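The clean and noisy sample-complexity bounds can be evaluated directly; the sketch below uses the standard finite-|H| bound together with a noise-inflation factor of the form ξ = 1/(1 − 2η)², a common form in noisy PAC analyses (the paper's exact ξ may differ), and all parameter values are illustrative.

```python
import math

def m_clean(eps, delta, H_size):
    """Standard finite-|H| PAC bound: M >= (1/eps) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

def m_noisy(eps, delta, H_size, eta):
    """Noisy-sample sketch: the clean bound inflated by xi = 1/(1 - 2*eta)^2
    (an assumed form of the noise factor; eta < 1/2 is required)."""
    xi = 1 / (1 - 2 * eta) ** 2
    return math.ceil(xi * (math.log(H_size) + math.log(1 / delta)) / eps)

print(m_clean(0.1, 0.05, 1024))        # 100
print(m_noisy(0.1, 0.05, 1024, 0.25))  # 398: xi = 4 at eta = 1/4
```

As η approaches 1/2 the factor ξ diverges, reflecting that samples flipped half the time carry no information about c.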

Appendix C: Extension to multi-class classification
Each training datum can be considered to belong to one of 2 m different classes (m ≥ 2), and the goal is to learn a hypothesis that, given a (new) data point, can correctly decide the class to which the data point belongs. This problem is called the multi-class classification problem.

One-vs-All (OVA) reduction
The conventional approach used to solve the multi-class classification problem is to decompose the problem into several binary classification problems. The simplest, yet powerful, method is the so-called OVA reduction [19], where each binary classifier (e.g., RLSC, SVM) is trained to distinguish the examples in a single class from those in all remaining classes. More specifically, in this strategy, the problem is decomposed into 2^m decisions h_i (i ∈ {0, 1, ..., 2^m − 1}), each of which separates the training data of the ith class from those of the other classes (see Fig. 4), and (new) data are classified using
h(x) = arg max_i h_i(x), (C1)
where h_i(x) is the hypothesis identified in each trial and h(x) is the decision for the classification of the input x.
Here, h_i(x) is interpreted as the probability of a given input belonging to the ith class, which is well suited to our PAC learning framework. To achieve the OVA reduction, we can apply the condition for secure PAC learning [Eq. (6) of the main text], as it is, to each trial performed for identifying h_i(x). However, in this case, the learning time increases, as we must prepare datasets to train 2^m classifiers and the secure PAC learning condition must be satisfied for all 2^m trials.
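The OVA decision rule of Eq. (C1) can be sketched in a few lines; the per-class scorers below are stand-ins (fixed random linear functions) for trained binary classifiers, and all names and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, n = 4, 6  # 2^m classes with m = 2, over n features (illustrative)

# Hypothetical per-class scorers h_i: random linear functions stand in for
# trained binary classifiers (e.g., one SVM per class).
W = rng.normal(size=(num_classes, n))

def ova_predict(x):
    """OVA decision: h(x) = argmax_i h_i(x)."""
    scores = W @ x                 # h_i(x) for each class i
    return int(np.argmax(scores))

x = rng.normal(size=n)
label = ova_predict(x)
print(0 <= label < num_classes)    # True
```

In the secure-learning setting, each of the 2^m scorers would have to be learned under the sample-size window of Eq. (6) separately, which is the cost of this reduction.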

A single-machine strategy
Another useful approach is to solve a single optimization problem that trains many binary classifiers simultaneously; this is akin to the so-called "single-machine approach" [19]. To apply this approach, we consider an oracle that, given an input x ∈ {0, 1}^n, outputs the corresponding label y ∈ {0, 1}^m for all 2^m classes, for example, as realized by a function h : {0, 1}^n → {0, 1}^m. This is possible by allowing m qubits conditioned on the same x-input channel (see Fig. 5). More specifically, in this generalization, the oracle performs the mapping
(x, |α_0 α_1 ··· α_{m−1}⟩) → (x, |c_0(x) c_1(x) ··· c_{m−1}(x)⟩) (C2)
for the learning (i.e., for α_0 α_1 ··· α_{m−1} ∈ {0, 1}^m) and the mapping
(r, |α_0 α_1 ··· α_{m−1}⟩) → (r, |α_0 α_1 ··· α_{m−1}⟩) (C3)
for the security check (i.e., for α_0 α_1 ··· α_{m−1} ∈ {+, −}^m). The learner (here, Alice) can identify the oracle's output by measuring each returning qubit and construct the training samples for the learning. In this strategy, our theorems and the secure PAC learning condition can be applied to states of an arbitrary number of qubits; note that in our analysis, the states ρ̂_s and ρ̂_E^(k) may comprise an arbitrary number of qubits. The rules (R.1) and (R.2) derived for practical use of our protocol are applicable to each qubit measurement outcome. However, in this case, M_b(ε, δ) is expected to increase, as a higher model complexity |H| is imposed for large m. Furthermore, M_c(ε, δ) decreases, since η_c increases for large m; specifically, we have [34]
η_c = 1 − max F(ρ̂_0(x)^⊗m, ρ̂_E^⊗m) = m/(2m + 4). (C4)
Consequently, the region |M_c(ε, δ) − M_b(ε, δ)| that satisfies the secure PAC learning condition narrows as m increases; in other words, the security condition becomes more stringent. Therefore, there exists a tradeoff between the two aforementioned approaches.
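The m-qubit single-machine oracle of Eqs. (C2) and (C3) admits the same kind of toy classical simulation as the single-qubit oracle of the main text; the per-bit label functions and the string encoding below are illustrative assumptions.

```python
def multi_oracle(x, alphas, concepts):
    """Toy simulation of the m-qubit single-machine oracle.

    Learning query: alphas in {'0','1'}^m -> the labels c_0(x)...c_{m-1}(x).
    Test query:     alphas in {'+','-'}^m -> passed through unchanged.
    """
    if all(a in '01' for a in alphas):
        return x, ''.join(str(c(x)) for c in concepts)
    if all(a in '+-' for a in alphas):
        return x, alphas
    raise ValueError("mixed or invalid qubit states")

# Hypothetical per-bit label functions c_i for m = 2 (illustrative).
concepts = [lambda x: int(x[0]),
            lambda x: (int(x[0]) + int(x[1])) % 2]

print(multi_oracle('10', '00', concepts))  # ('10', '11')
print(multi_oracle('10', '+-', concepts))  # ('10', '+-')
```

A single query thus yields the full m-bit label, at the cost of the tighter security window discussed above.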
Note that |M_c(ε, δ) − M_b(ε, δ)| ≥ 0 is always satisfied, along with the no-broadcasting theorem, under the condition η_c ≥ η_A ∧ η_c ≥ η_E of Eq. (4) in the main text.