[Figure 7.1: two histograms of condition numbers (horizontal axis: condition number, vertical axis: relative frequency); left panel for square matrices of size 2x2, 5x5, 10x10 and 20x20, right panel for rectangular matrices of size 2x3, 4x7, 10x15 and 19x22.]

Figure 7.1: Distribution of condition numbers for matrices of different dimensions with entries drawn uniformly at random from [−1, 1]. For square matrices (left) the mean of the distribution increases with the dimension, while for rectangular matrices (right) the increase is much slower and, curiously enough, in all simulations performed sublinear in the smallest dimension. The growth of the condition number with the dataset has to be taken into account if the “super-efficient” speedup is to be maintained. Note that the distributions remain the same if sparse or symmetric matrices are generated.
and contain the desired result (7.4) up to a known normalisation factor $\frac{c}{2\sqrt{p(1)}}$, as well as the rescaling factor $|X|^{-1}|\tilde{x}|^{-1}|y|^{-1}$ stemming from the initial normalisation of the data that allowed us to encode it into quantum states. Whichever way is chosen, the prediction step is linear in the number of qubits used.
In summary, the upper bound for the runtime of the quantum linear regression algorithm can be roughly estimated as $O(\log N \, \kappa^2 \epsilon^{-3})$ with the error $\epsilon$, condition number $\kappa$ and input dimension $N$. For the swap test, a factor $\log M$ for the swap operator has to be considered. Remember that this does not include the costs of quantum state preparation in case the algorithm processes classical information. The quantum algorithm is “super-efficient” if the condition number scales logarithmically in the dimensions of the dataset, which, as Figure 7.1 suggests, is not always the case.
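As an illustration, the kind of simulation behind Figure 7.1 can be sketched in a few lines of NumPy; the sample size, the random seed and all variable names below are illustrative choices, not the exact settings used for the figure.

import numpy as np

def condition_numbers(n_rows, n_cols, n_samples=1000, seed=42):
    """Condition numbers of n_samples random matrices with entries uniform in [-1, 1]."""
    rng = np.random.default_rng(seed)
    return np.array([np.linalg.cond(rng.uniform(-1.0, 1.0, (n_rows, n_cols)))
                     for _ in range(n_samples)])

# Square matrices: the mean condition number grows noticeably with the dimension ...
for n in (2, 5, 10, 20):
    print(f"{n}x{n}: mean condition number {condition_numbers(n, n).mean():.1f}")

# ... while for rectangular matrices the growth is much slower.
for rows, cols in ((2, 3), (4, 7), (10, 15), (19, 22)):
    print(f"{rows}x{cols}: mean condition number {condition_numbers(rows, cols).mean():.1f}")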
Compared to the previous result by Wiebe, Braun and Lloyd, this version boasts an improvement by a factor of $\kappa^{4}$ in the condition number, whereas the dependence on the accuracy $\epsilon$ is worse by a factor of $\epsilon^{-2}$. The algorithm can efficiently be applied to non-sparse but low-rank approximations of the matrix $X^\dagger X$.
7.3 Extending the model

A first extension of the routine is regularisation, which adds a penalty term $\alpha\|w\|$ (with $\alpha > 0$) to the least squares objective function in order to prevent overfitting. The norm $\|w\|$ can refer to the $L_2$ norm $\sum_i w_i^2$ (ridge regression) or the $L_1$ norm $\sum_i |w_i|$ (the LASSO procedure, which prefers sparse parameter vectors), in general leading to different optimal solutions. I will focus on ridge regression here.
The extra term changes the pseudoinverse to
$$ X^{+} = (X^\dagger X + \alpha I)^{-1} X^\dagger, $$
and inserting again the singular value decomposition of $X$ leads to the slightly different estimator of the parameter vector,
$$ \hat{w} = \sum_{r=1}^{R} \frac{\sigma_r}{\sigma_r^2 + \alpha}\, (u_r^T y)\, v_r. \qquad (7.9) $$
It is not difficult to amend the original quantum linear regression routine to cater for the additional term in the scalar factor. In Step 3, when rotating the ancilla conditioned on the eigenvalue register, rotate the ancilla according to the eigenvalue plus the regularisation factor instead:
$$ \sum_{r=1}^{R} \sigma_r\, |v_r\rangle |u_r\rangle |\lambda_r\rangle \left( \sqrt{1 - \left|\frac{c}{\lambda_r + \alpha}\right|^2}\, |0\rangle + \frac{c}{\lambda_r + \alpha}\, |1\rangle \right). $$
Since $\alpha$ is positive, this increases the probability of accepting the conditional measurement, $p_{\mathrm{reg}}(1) = \sum_r \left|\frac{c}{\lambda_r + \alpha}\right|^2$, which is an effect similar to the increased numerical stability that regularisation brings to the classical computation. Both lead to the algorithm being less dependent on the condition number.
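For reference, a minimal classical sketch of the regularised estimator of Eq. (7.9) and of the improved conditioning brought by the term $\alpha$; the dataset, the regularisation strength and all variable names are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
M, N, alpha = 50, 5, 0.1                     # illustrative dataset size and regularisation strength
X = rng.uniform(-1, 1, (M, N))
y = rng.uniform(-1, 1, M)

# Regularised estimator via the singular value decomposition, Eq. (7.9):
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
w_svd = sum(s / (s**2 + alpha) * (U[:, r] @ y) * Vt[r] for r, s in enumerate(sigma))

# The same estimator from the regularised pseudoinverse (X^T X + alpha*I)^{-1} X^T y:
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(N), X.T @ y)
assert np.allclose(w_svd, w_ridge)

# The regularised matrix is also better conditioned, mirroring the improved
# numerical stability mentioned in the text:
print(np.linalg.cond(X.T @ X), np.linalg.cond(X.T @ X + alpha * np.eye(N)))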
7.3.2 How to implement feature maps
As the name suggests, linear regression is based on a linear model with a likewise linear decision boundary, and it derives its true power from the kernel trick. As a reminder, the kernel trick refers to the procedure of mapping the input vectors to a higher-dimensional space through a nonlinear feature map and performing the classification in this space, where the data becomes linearly separable. This is how a linear model can fit nonlinear data.
In the routine outlined above, applying a feature map means to replace the kernel matrix $\rho_{xx^\dagger}$ with other Gram matrices.
7.3.2.1 Linear kernels
A linear or simple dot kernel is a function $\kappa(x, x') = x^T x'$. The corresponding kernel (Gram) matrix for input data vectors $x^{(m)}$, $m = 1, \dots, M$, has entries $K_{mm'} = (x^{(m)})^T x^{(m')}$. The procedure to prepare a density matrix $\rho_K$ that entrywise represents the kernel matrix from a quantum state encoding the training inputs in its amplitudes has been the one used above. To recapitulate, assume the (normalised) training data is given in amplitude encoding,
$$ |\mathcal{D}\rangle = \sum_{m=1}^{M} \sum_{k=1}^{N} x^{(m)}_k\, |k\rangle |m\rangle, $$
the corresponding density matrix is
$$ |\mathcal{D}\rangle\langle\mathcal{D}| = \sum_{m,m'=1}^{M} \sum_{k,k'=1}^{N} x^{(m)}_k \left(x^{(m')}_{k'}\right)^* |k\rangle\langle k'| \otimes |m\rangle\langle m'|. $$
Taking the trace over the $|k\rangle$ register yields the desired density matrix $\rho_K = \mathrm{tr}_k\, |\mathcal{D}\rangle\langle\mathcal{D}|$,
$$ \rho_K = \sum_{m,m'=1}^{M} \sum_{k=1}^{N} x^{(m)}_k \left(x^{(m')}_k\right)^* |m\rangle\langle m'|. $$
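This construction can be checked numerically with a small NumPy sketch; the dataset, the overall normalisation convention for $|\mathcal{D}\rangle$ (here the Frobenius norm of the data matrix) and all variable names are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
M, N = 4, 3                                    # illustrative numbers of training inputs and features
X = rng.uniform(-1, 1, (M, N))                 # rows are the training vectors x^(m)

# Amplitude-encode the whole dataset, |D> = sum_{m,k} x^(m)_k |k>|m>,
# normalised so that |D> is a unit vector.
D = (X.T / np.linalg.norm(X)).reshape(N * M)   # amplitude index ordered as (k, m)
rho = np.outer(D, D).reshape(N, M, N, M)       # |D><D| with indices (k, m, k', m')

# Tracing out the |k> register leaves the (trace-normalised) linear kernel matrix.
rho_K = np.einsum('kmkn->mn', rho)
K = X @ X.T                                    # classical Gram matrix K_mm' = x^(m) . x^(m')
assert np.allclose(rho_K, K / np.trace(K))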
7.3.2.2 Homogeneous polynomial kernels
A polynomial kernel of order $d$ is a function $\kappa(x, x') = (x^T x')^d$. The corresponding kernel (Gram) matrix has entries $K_{mm'} = \left((x^{(m)})^T x^{(m')}\right)^d$ and can be constructed by using $d$ copies of $|\psi_{x^{(m)}}\rangle = \sum_{k=1}^{N} x^{(m)}_k |k\rangle$ [126]. Given
$$ |\mathcal{D}\rangle = \sum_{m=1}^{M} \left( |\psi_{x^{(m)}}\rangle \otimes |\psi_{x^{(m)}}\rangle \otimes \dots \otimes |\psi_{x^{(m)}}\rangle \right) \otimes |m\rangle, $$
(where the tensor products are made explicit for now), the corresponding density matrix becomes
$$ |\mathcal{D}\rangle\langle\mathcal{D}| = \sum_{m,m'=1}^{M} \prod_{i=1}^{d} \sum_{k_i, k_i'=1}^{N} x^{(m)}_{k_i} \left(x^{(m')}_{k_i'}\right)^* \; |k_1\rangle\langle k_1'| \otimes \dots \otimes |k_d\rangle\langle k_d'| \otimes |m\rangle\langle m'|. $$
The trace over all $|k_i\rangle$ registers now yields
$$ \rho_K = \sum_{m,m'=1}^{M} \left( \sum_{k=1}^{N} x^{(m)}_k \left(x^{(m')}_k\right)^* \right)^{d} |m\rangle\langle m'|. $$
Replacing the density matrix $\rho_{X^T X}$ above with this kernel matrix implicitly implements a feature map into a space of polynomials of degree $d$.
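Continuing the sketch above, the same partial-trace construction with $d$ copies indeed reproduces the entrywise $d$-th power of the linear kernel; again an illustrative NumPy check, assuming unit-norm input vectors so that the copies remain normalised.

import numpy as np

rng = np.random.default_rng(2)
M, N, d = 3, 2, 3                              # illustrative sizes and kernel order
X = rng.uniform(-1, 1, (M, N))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm inputs keep the d copies normalised

def d_copies(x, d):
    """Amplitude vector of |psi_x> tensored with itself d times."""
    psi = x.copy()
    for _ in range(d - 1):
        psi = np.kron(psi, x)
    return psi

states = np.stack([d_copies(X[m], d) for m in range(M)])   # shape (M, N**d)
D = states.T.reshape(-1) / np.sqrt(M)                      # |D>, amplitude index ordered as (k_1..k_d, m)
rho = np.outer(D, D).reshape(N**d, M, N**d, M)

# Tracing out all |k_i> registers gives the entrywise d-th power of the linear kernel.
rho_K = np.einsum('kmkn->mn', rho)
assert np.allclose(rho_K, (X @ X.T) ** d / M)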
7.3.2.3 Translation invariant kernels
Although the dot-product kernel functions above are in use, a lot of theoretical analysis considers translation invariant kernels where $\kappa(x, x') = \kappa(|x - x'|)$. Important examples are exponential versions such as the Gaussian or radial basis function kernel $\exp(-\tfrac{1}{2}|x - x'|^2)$, and rational functions such as the triangular kernel $1 - |x - x'|$ and the Epanechnikov kernel $\tfrac{3}{4}(1 - |x - x'|^2)$, both being zero outside of the interval $[-1, 1]$. Other covariance functions are discussed in Rasmussen & Williams’ textbook on Gaussian Processes [30].
It is an interesting question whether we can also implement translation invariant kernels with density matrices. This question has very recently been addressed by Chatterjee and Yu [216], who encode features into canonical coherent states whose inner product naturally implements a radial kernel function. It has also been shown here, in the context of the swap test (Routine 3.4.4), that there is a close relationship between the inner product and the distance of two unit vectors $x$ and $x'$:
$$ |x - x'|^2 = \sum_i (x_i - x'_i)^2 = 2 - 2\sum_i x_i x'_i \quad\Leftrightarrow\quad x^T x' = 1 - \tfrac{1}{2}|x - x'|^2. $$
This implies that for unit vectors the dot product kernel is essentially an Epanechnikov-type kernel in the distance $|x - x'|$, whose graph looks very much like a semi-circle. Which other kernels are natural for the density matrix of a quantum system? Can we combine the theory of so-called reproducing kernel Hilbert spaces with quantum mechanics to our avail? This is a very promising question for further research.
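The relation above is easy to verify numerically, for instance with the following short NumPy check (two random unit vectors as an illustrative test case).

import numpy as np

rng = np.random.default_rng(3)
x, xp = rng.normal(size=5), rng.normal(size=5)
x, xp = x / np.linalg.norm(x), xp / np.linalg.norm(xp)   # two random unit vectors

# For unit vectors the inner product is a function of the distance alone:
assert np.isclose(x @ xp, 1 - 0.5 * np.sum((x - xp) ** 2))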
Quantum gradient descent algorithm
This chapter is based on work done in collaboration with Dr Patrick Rebentrost and Prof Seth Lloyd from the Massachusetts Institute of Technology that is available as a pre-print (Reference [217]). The idea was developed by Patrick Rebentrost, Seth Lloyd and myself, while Patrick Rebentrost and I were responsible in equal shares for working out and formulating the quantum algorithm in detail. Patrick Rebentrost was particularly involved in the error analysis in the Appendix, while I wrote the computer simulations for the analysis and plots. The presentation in this chapter was written by myself.
The previous chapter on least squares regression implicitly dealt with an unconstrained quadratic programme whose objective function can be expressed as $x^T A^T A x - 2 x^T A^T y + y^T y$, and a closed-form equation for the solution was given. Quadratic programmes of the form $x^T B x$ can be very difficult to solve (i.e., proven to be NP-hard [218]) if $B$ is not positive semi-definite (which it always is for $B = A^T A$), and the objective function therefore non-convex. Other machine learning methods can reveal even more complicated optimisation landscapes, for example the notorious deep neural networks. In such cases iterative methods that perform a stepwise search for the optimum can be the only possible solution method. The most common are gradient descent methods, which use local information on gradients in each step. Unfortunately there is only little theory about overall convergence times and numerical stability, and in practice benchmark studies are usually done by comparing the performance of a machine learning algorithm on several standard datasets.
The quantum machine learning community has so far widely avoided the topic of iterative optimisation methods such as gradient descent. For example, proposals of quantum neural networks are mostly limited to proposing a feed-forward mechanism, leaving the central question of training open or suggesting classical parameter fitting for the quantum setup. An important reason for this reluctance is the difficulty of coming up with a clever quantum routine that extracts information on the best direction for a local search and updates a quantum state accordingly, and which exploits the laws of quantum mechanics to achieve a speed-up of some sort. The attempt itself is not new; the research field of quantum optimisation has produced a number of ‘quantisations’ of classical algorithms, some of which have been mentioned before.
[Figure 8.1: two panels showing optimisation trajectories in the plane (both axes from −2 to 2), the left panel for gradient descent and the right panel for Newton’s method, each with the true minimum marked.]

Figure 8.1: Gradient descent and Newton’s method for a quadratic objective function $\tfrac{1}{2}\theta^T A \theta + b^T \theta$ with $A = [[0.2, 0.2], [0.2, 0.6]]$ and $b = (0.3\;\; 0.2)^T$. The stepsize is $\eta = 0.2$ and 100 steps are performed. It is obvious that while gradient descent follows the gradients, Newton’s method finds a more direct route to the minimum.
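For comparison, the two classical trajectories of Figure 8.1 can be reproduced in a few lines of NumPy; the step size and number of iterations are taken from the caption, while the starting point and the damping of the Newton update are illustrative assumptions.

import numpy as np

A = np.array([[0.2, 0.2], [0.2, 0.6]])
b = np.array([0.3, 0.2])
eta, steps = 0.2, 100                       # step size and number of steps as in the caption
theta0 = np.array([-2.0, 1.5])              # starting point (not given in the caption; assumed here)

def gradient(theta):
    """Gradient of the objective 0.5 * theta^T A theta + b^T theta."""
    return A @ theta + b

theta_gd, theta_nt = theta0.copy(), theta0.copy()
for _ in range(steps):
    theta_gd = theta_gd - eta * gradient(theta_gd)                       # gradient descent step
    theta_nt = theta_nt - eta * np.linalg.solve(A, gradient(theta_nt))   # (damped) Newton step

print(theta_gd, theta_nt, np.linalg.solve(A, -b))  # both approach the true minimum -A^{-1} b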
This chapter is (to my knowledge) a first approach to find a quantum version of iterative optimisation algorithms such as gradient descent and Newton’s method. The objective function considered here is a homogeneous polynomial under unit sphere constraints, but the principle may be applicable to other problems as well. The approach is based on the amplitude encoding strategy, and a vector encoded in a quantum state is updated in every step according to a ‘normalised’ variation of the classical method (i.e., it reproduces the classical result but harvests the potential advantages of exponentially compact information encoding). A central caveat is that the algorithm ‘uses’ a large number of copies of a quantum state in order to find the next point, which is infeasible for the long iteration cycles of common gradient descent methods. Newton’s method has more favourable convergence properties and is known to converge quadratically close to a minimum. The scheme could therefore be interesting for local optimisation, such as fine-tuning after pre-training.