9.1 Quantum nearest neighbour routine in basis encoding
Figure 9.1: Illustration of all-nearest neighbour, where the neighbours are weighted by the Euclidean distance to the new input. The symbols show the 2-dimensional inputs, each of which has a class attribute 'circle' or 'rectangle'. The new input is located at $\tilde{x} = (\tilde{x}_1, \tilde{x}_2)$ and, as in the $k$-nearest neighbour illustration, it will be classified as a circle.
proximity to the new input. Although not necessarily introducing a speed-up over the classical method, these quantum machine learning algorithms are a fruitful demonstration of how to represent a distance measure in the amplitudes, and they are designed for a comparably simple implementation on current-day hardware.
and 1 with probability
$$\sum_{m|y^m=1} \kappa(|\tilde{x} - x^m|).$$
Alternatively, one can turn the algorithm into a 'stochastic' or 'sampling' version of $k$-nearest neighbour by measuring the entire state repeatedly, effectively taking samples from the superposition. Training inputs that are closer to the new input have a higher probability of being the outcome, and the state of their class qubit is recorded.
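To make the two readout strategies concrete, the following is a minimal classical sketch. It assumes a hypothetical toy dataset and, purely for illustration, a Gaussian kernel; the quantum routine derived below produces a cosine kernel instead. It contrasts measuring only the class qubit with sampling whole training vectors and taking a majority vote.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (hypothetical): 2-d training inputs with labels in {-1, +1} and a new input.
X = np.array([[0.1, 0.2], [0.3, 0.1], [0.9, 0.8], [0.8, 0.9]])
y = np.array([-1, -1, 1, 1])
x_new = np.array([0.85, 0.75])

# Kernel weights kappa(|x_new - x^m|); a Gaussian is used here only as a stand-in.
kappa = np.exp(-np.linalg.norm(X - x_new, axis=1) ** 2)
probs = kappa / kappa.sum()            # probability of sampling training vector m

# Version 1: 'measure the class qubit' -> class probabilities are kernel-weighted sums.
p_plus = probs[y == 1].sum()
p_minus = probs[y == -1].sum()
print("p(y=+1) =", p_plus, " p(y=-1) =", p_minus)

# Version 2: 'sampling' k-nearest neighbour -> draw whole training vectors repeatedly,
# record the class of each sample, and classify by majority vote.
samples = rng.choice(len(X), size=25, p=probs)
votes = y[samples]
print("majority vote:", 1 if votes.sum() > 0 else -1)
```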
Of course, in both versions the measurement destroys the state, and for each classification it has to be re-prepared, which requires storing and accessing the entire dataset just as the classical method does. The runtime of the routine depends on the resources needed for state preparation, which is in general linear in $M$ and $N$, and therefore similar to classical $k$-nearest neighbour. If faster state preparation is available, or if the dataset superposition is the result of a previous quantum computation, the routine becomes independent of $M$, since the weighting step is applied fully in parallel (although in the $k < M$ case the number of samples required for a certain accuracy may depend on $M$).
To start with, a state preparation procedure (such as presented in Section 4.1) is used to encode the training set into a uniform superposition $\frac{1}{\sqrt{M}} \sum_{m=1}^{M} |x^m_1, \dots, x^m_N; y^m\rangle$, joined with a register encoding the new input as well as an ancilla in superposition:
$$\frac{1}{\sqrt{M}} \sum_{m=1}^{M} |x^m_1, \dots, x^m_N; y^m\rangle \, |\tilde{x}_1, \dots, \tilde{x}_N\rangle \, (|0\rangle + |1\rangle).$$
Here, and unless stated otherwise, global normalisation constants will be ignored, since the desired result is a relative value. An XOR gate applied to every pair of qubits $x^m_j$ and $\tilde{x}_j$ for $j = 1, \dots, N$ compares the two vectors and writes the result
$$d^m_j = \begin{cases} 1, & \text{if } x^m_j \neq \tilde{x}_j, \\ 0, & \text{else,} \end{cases}$$
into the register previously containing the new input,
$$\frac{1}{\sqrt{M}} \sum_{m=1}^{M} |x^m_1, \dots, x^m_N; y^m\rangle \, |d^m_1, \dots, d^m_N\rangle \, (|0\rangle + |1\rangle).$$
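The XOR step can be mimicked classically with bitwise operations. The sketch below (hypothetical toy binary data) computes the registers $d^m$ and their sums, which are exactly the Hamming distances used in the next step.

```python
import numpy as np

# Toy binary training inputs (M=4, N=4) and a new binary input (hypothetical values).
X = np.array([[0, 1, 1, 0],
              [0, 1, 0, 0],
              [1, 0, 1, 1],
              [1, 1, 1, 1]], dtype=np.uint8)
x_new = np.array([0, 1, 1, 1], dtype=np.uint8)

# XOR of each training qubit with the corresponding qubit of the new input:
# d[m, j] = 1 exactly where the bits differ.
d = np.bitwise_xor(X, x_new)

# Summing d[m, j] over j gives the Hamming distance of x^m to the new input.
hamming = d.sum(axis=1)
print(d)
print("Hamming distances:", hamming)
```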
A unitary $e^{-i\frac{\pi}{2N} H}$ with the Hamiltonian
$$H = \mathbb{1} \otimes \mathbb{1} \otimes \sum_j \frac{\mathbb{1} - (\sigma_z)_j}{2} \otimes \sigma_z,$$
in which $(\sigma_z)_j$ acts on $d_j$, has the effect of writing the sum of the $d_j$, in other words the Hamming distance, into the phase with a sign conditioned on the last ancilla qubit,
$$\frac{1}{\sqrt{M}} \sum_{m=1}^{M} e^{i\frac{\pi}{2N} \sum_j d^m_j} |x^m_1, \dots, x^m_N; y^m\rangle \, |d^m_1, \dots, d^m_N\rangle \, |0\rangle + \frac{1}{\sqrt{M}} \sum_{m=1}^{M} e^{-i\frac{\pi}{2N} \sum_j d^m_j} |x^m_1, \dots, x^m_N; y^m\rangle \, |d^m_1, \dots, d^m_N\rangle \, |1\rangle.$$
A Hadamard gate on the last ancilla qubit then interferes the two terms and results in
$$\frac{1}{\sqrt{M}} \sum_{m=1}^{M} \cos\!\Big(\frac{\pi}{2N} \sum_j d^m_j\Big) |x^m_1, \dots, x^m_N; y^m\rangle \, |d^m_1, \dots, d^m_N\rangle \, |0\rangle + \frac{1}{\sqrt{M}} \sum_{m=1}^{M} \sin\!\Big(\frac{\pi}{2N} \sum_j d^m_j\Big) |x^m_1, \dots, x^m_N; y^m\rangle \, |d^m_1, \dots, d^m_N\rangle \, |1\rangle.$$
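The combined effect of the conditional phase and the Hadamard can be checked on the ancilla amplitudes alone. The sketch below is a minimal numpy verification under the assumption that, per training branch, the ancilla starts in $(|0\rangle + |1\rangle)/\sqrt{2}$ (normalisation included here even though the text drops it); the Hamming distances are hypothetical toy values.

```python
import numpy as np

N = 4                                   # number of binary features
hamming = np.array([1, 2, 0, 3])        # toy values of sum_j d^m_j (hypothetical)
phi = np.pi / (2 * N) * hamming

# Per training branch: the conditional phase multiplies the |0> part of the ancilla
# by e^{+i phi} and the |1> part by e^{-i phi}.
anc = np.stack([np.exp(1j * phi), np.exp(-1j * phi)], axis=1) / np.sqrt(2)

# Hadamard on the ancilla interferes the two branches.
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
anc_after = anc @ H.T

# The |0> amplitude is cos(phi); the |1> amplitude is sin(phi) up to a global phase.
print(np.allclose(anc_after[:, 0], np.cos(phi)))                   # True
print(np.allclose(np.abs(anc_after[:, 1]), np.abs(np.sin(phi))))   # True
```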
A conditional measurement on the ancilla selects the cosine branch. The probability of acceptance is given by
$$p_{\mathrm{acc}} = \frac{1}{M} \sum_{m=1}^{M} \cos^2\!\Big(\frac{\pi}{2N} \sum_j d^m_j\Big).$$
This probability is also a measure of how close the data is to the new input: if the collective Hamming distance is large, the sine branch of the superposition has a larger probability of being measured. In the worst case, all training vectors have a Hamming distance close to $N$ and the probability of the conditional measurement succeeding is close to zero. However, in this case the data would not reveal much information for the classification of the new input anyway, and the probability of acceptance can therefore be seen as a measure of how well-posed the classification problem is in the first place.
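The acceptance probability is straightforward to evaluate classically. The sketch below (hypothetical toy data) compares $p_{\mathrm{acc}}$ for a dataset close to the new input with one that is almost maximally far away, illustrating its role as a rough indicator of how well-posed the problem is.

```python
import numpy as np

def p_acc(X, x_new):
    """Probability of the conditional measurement succeeding (cosine branch)."""
    N = X.shape[1]
    hamming = np.bitwise_xor(X, x_new).sum(axis=1)
    return np.mean(np.cos(np.pi / (2 * N) * hamming) ** 2)

x_new = np.array([0, 1, 1, 1], dtype=np.uint8)
X_close = np.array([[0, 1, 1, 1], [0, 1, 1, 0], [1, 1, 1, 1]], dtype=np.uint8)
X_far   = np.array([[1, 0, 0, 0], [1, 0, 0, 1], [0, 0, 0, 0]], dtype=np.uint8)

print("p_acc (close data):", p_acc(X_close, x_new))   # close to 1
print("p_acc (far data):  ", p_acc(X_far, x_new))     # close to 0
```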
After a successful conditional measurement, the state becomes proportional to
$$\sum_{m=1}^{M} \cos\!\Big(\frac{\pi}{2N} \sum_j d^m_j\Big) |x^m_1, \dots, x^m_N; y^m\rangle,$$
where the $|d^m_1, \dots, d^m_N\rangle |0\rangle$ registers were omitted. This is precisely the desired outcome defined in Equation (9.1). The kernel function used is a cosine, and the factor of $\frac{\pi}{2N}$ normalises the argument of the cosine to the interval $[0, \pi/2]$, where it is a monotonically decreasing function, similar to a Gaussian. As required, training vectors with a larger Hamming distance to the new input receive a smaller weight than those closer to it.
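A few lines of numpy confirm the claimed behaviour of this cosine kernel (a sketch with an arbitrarily chosen $N$): the weight falls monotonically from 1 at Hamming distance 0 to 0 at the maximal distance $N$.

```python
import numpy as np

N = 8
d = np.arange(N + 1)                       # possible Hamming distances 0..N
weights = np.cos(np.pi / (2 * N) * d)      # kernel weight per distance
print(np.round(weights, 3))                # 1.0 ... 0.0
assert np.all(np.diff(weights) < 0)        # strictly decreasing
```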
The two versions of the nearest neighbour algorithm can now be implemented by choosing the measurement. A measurement on the class qubit $y^m$ will have a probability of
$$p(\tilde{y} = -1) = p(|y^m\rangle = |0\rangle) = \sum_{m|y^m = 0} \cos^2\!\Big(\frac{\pi}{2N} \sum_j d^m_j\Big),$$
to predict the class $-1$, and a complementary probability of
$$p(\tilde{y} = 1) = p(|y^m\rangle = |1\rangle) = \sum_{m|y^m = 1} \cos^2\!\Big(\frac{\pi}{2N} \sum_j d^m_j\Big),$$
to predict class $1$. Alternatively, measuring the entire basis state $|x^m_1, \dots, x^m_N; y^m\rangle$ has a probability of
$$p(x^m) = \cos^2\!\Big(\frac{\pi}{2N} \sum_j d^m_j\Big),$$
to pick the $m$th training vector, and closer training vectors are thus preferred. Doing this repeatedly can be understood as a stochastic procedure to select the closest vectors and to assign the new class via majority vote.

Figure 9.2: The two kernel functions of the quantum nearest neighbour algorithm. In basis encoding a cosine kernel is used (left), while in amplitude encoding a square kernel appears (right).
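Putting the pieces together, a minimal classical simulation of the readout statistics (hypothetical toy data) computes the $\cos^2$ weights from the Hamming distances and then realises both readout modes: the class-qubit probabilities and the stochastic selection of training vectors with a majority vote.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy binary training set with class-qubit labels in {0, 1} (hypothetical values).
X = np.array([[0, 1, 1, 1],
              [0, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 1]], dtype=np.uint8)
y = np.array([1, 1, 0, 0])
x_new = np.array([0, 1, 1, 1], dtype=np.uint8)
N = X.shape[1]

# cos^2 weights after the successful conditional measurement, renormalised.
hamming = np.bitwise_xor(X, x_new).sum(axis=1)
weights = np.cos(np.pi / (2 * N) * hamming) ** 2
probs = weights / weights.sum()

# Readout 1: measure only the class qubit.
print("p(y=1):", probs[y == 1].sum(), " p(y=0):", probs[y == 0].sum())

# Readout 2: repeatedly measure the whole basis state, i.e. sample training vectors
# with probability proportional to cos^2, and classify by majority vote.
samples = rng.choice(len(X), size=21, p=probs)
votes = y[samples]
print("majority vote class:", int(votes.sum() > len(votes) / 2))
```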
Representing the dataset in basis encoding is rather expensive in terms of spatial resources, especially when dealing with continuous features: the number of qubits needed if every real entry of the input vector is encoded in $\tau$ qubits is of the order $N\tau$. For example, if the features are encoded as binary fractions, a precision of $\tau$ bits implies an error of less than $2^{-\tau}$. To have an error of less than $0.01$, we need at least $\tau = 7$ qubits per feature. Considering the technological challenges of scaling quantum computing to a large number of qubits, this quickly becomes prohibitive for large input spaces. It is therefore interesting to develop a similar routine based on the much more compact amplitude encoding.
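As a quick back-of-the-envelope check of this qubit count, the following sketch (under the binary-fraction assumption stated above) computes the smallest per-feature precision $\tau$ for a given error budget and the resulting total register size.

```python
import math

def qubits_needed(n_features: int, error: float) -> int:
    """Qubits for basis encoding when each feature is stored as a binary
    fraction with tau bits, chosen as the smallest tau with 2**(-tau) <= error."""
    tau = math.ceil(math.log2(1.0 / error))
    return n_features * tau

print(qubits_needed(n_features=1, error=0.01))    # tau = 7 per feature
print(qubits_needed(n_features=50, error=0.01))   # 350 qubits for 50 features
```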