

8.1 Classical problem formulation


Figure 8.1: Gradient descent and Newton's method for a quadratic objective function $0.5\,\theta^T A\theta + b^T\theta$ with $A = [[0.2, 0.2], [0.2, 0.6]]$ and $b = (0.3, 0.2)^T$. The stepsize is $\eta = 0.2$ and 100 steps are performed. While gradient descent follows the gradients, Newton's method finds a more direct route to the minimum.

This chapter is (to my knowledge) a first approach to find a quantum version of iterative optimisation algorithms such as gradient descent and Newton's method. The objective function considered here is a homogeneous polynomial under unit sphere constraints, but the principle may be applicable to other problems as well. It is based on the amplitude encoding strategy, and a vector encoded in a quantum state is updated in every step according to a ‘normalised’ variation of the classical method (i.e., it reproduces the classical result but harvests the potential advantages of exponentially compact information encoding). A central caveat is that the algorithm ‘uses’ a large number of copies of a quantum state in order to find the next point, which is infeasible for the long iteration cycles of common gradient descent methods. Newton's method has more favourable convergence properties and is known to converge quadratically close to a minimum. The scheme could therefore be interesting for local optimisation, such as fine-tuning after pre-training.

In its basic form, gradient descent iteratively updates a candidate point $\theta^{(t)}$ in the direction of the negative gradient of the objective function $o(\theta)$,

$$\theta^{(t+1)} = \theta^{(t)} - \eta\,\nabla o(\theta^{(t)}), \qquad (8.1)$$

where $\eta > 0$ is a hyper-parameter called the learning rate that may change with the number of steps $t$. A (local) minimum is found if the gradient vanishes and $\theta^{(t+1)} = \theta^{(t)}$.

Newton's method adapts this procedure by multiplying the gradient with the inverse Hessian matrix,

$$\theta^{(t+1)} = \theta^{(t)} - \eta\, H^{-1}\nabla o(\theta^{(t)}). \qquad (8.2)$$

The Hessian matrix has entries $H_{ij} = \frac{\partial^2 o(\theta)}{\partial\theta_i\,\partial\theta_j}$, where $o(\theta)$ is evaluated at the current candidate point $\theta^{(t)}$. By taking the second derivative into account, Newton's method uses curvature information on top of the gradients and therefore tends to follow a more direct path to the minimum (see Figure 8.1).
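To make the two update rules concrete, here is a minimal sketch (not code from the thesis; the starting point $\theta^{(0)} = (2, 1)^T$ is an arbitrary assumption) that reproduces the setting of Figure 8.1 in plain NumPy:

```python
# Gradient descent vs. Newton's method for o(theta) = 0.5 theta^T A theta + b^T theta,
# with A, b, eta and the number of steps taken from the caption of Figure 8.1.
import numpy as np

A = np.array([[0.2, 0.2], [0.2, 0.6]])
b = np.array([0.3, 0.2])
eta, T = 0.2, 100

def grad(theta):
    return A @ theta + b          # gradient of the quadratic objective (A is symmetric)

theta_gd = theta_nt = np.array([2.0, 1.0])   # assumed starting point
H_inv = np.linalg.inv(A)                     # the Hessian of the quadratic is simply A

for _ in range(T):
    theta_gd = theta_gd - eta * grad(theta_gd)             # Eq. (8.1)
    theta_nt = theta_nt - eta * H_inv @ grad(theta_nt)     # Eq. (8.2)

print("gradient descent:", theta_gd)
print("Newton's method :", theta_nt)
print("true minimum    :", -H_inv @ b)
```

Both methods converge to the minimum $-A^{-1}b$; Newton's method simply takes the more direct route, as Figure 8.1 illustrates.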

8.1.2 The objective function

I will only consider homogeneous polynomial optimisation functions here, which are generalisations of linear and quadratic programmes. To translate smoothly to the quantum algorithm, the chosen notation is based on tensor products and therefore a little unusual.

Consider an unconstrained linear programme minimising the objective function

$$o_1(\theta) = q^T\theta + \theta_0,$$

with $\theta, q \in \mathbb{R}^N$ and $\theta_0 \in \mathbb{R}$. One can increase the order of the polynomial to an unconstrained quadratic programme,

$$o_2(\theta) = \frac{1}{2}\theta^T A\theta + q^T\theta + \theta_0.$$

The next higher order would require $A$ to be a 3-tensor,

$$o_3(\theta) = \frac{1}{3}\sum_{i,j,k=1}^{N} b_{ijk}\,\theta_i\theta_j\theta_k + \frac{1}{2}\sum_{i,j=1}^{N} a_{ij}\,\theta_i\theta_j + \sum_{i=1}^{N} q_i\,\theta_i + \theta_0.$$

If we only consider the highest-order term and require that $k = 2p$ is an even number, we get the family of even homogeneous polynomials, which can be written as

$$o_k(\theta) = \frac{1}{2p}\sum_{i_1,\dots,i_{2p}=1}^{N} a_{i_1\dots i_{2p}}\,\theta_{i_1}\cdots\theta_{i_{2p}}.$$

This is the objective function considered here. In machine learning one often turns non-homogeneous polynomials into this form by demanding that the first parameter is clamped to one (which has been mentioned for linear models when the parameter $w_0$ was included in the $w^T x$ term). This is unfortunately not possible when iteratively renormalising the vector, as done in the projected and later in the quantum methods. One can however think of tricks to introduce minor ‘non-homogeneities’.

Now consider a slightly different formulation. The even orders $k = 2p$ can be written in the form of a quadratic programme if we go into a space that is created by taking $p$ tensor products of $\mathbb{R}^N$,

$$o_{2p}(\theta) = \frac{1}{2}\,\underbrace{\theta^T \otimes \dots \otimes \theta^T}_{p\ \text{times}}\; A\; \underbrace{\theta \otimes \dots \otimes \theta}_{p\ \text{times}}, \qquad (8.3)$$

with $\theta \in \mathbb{R}^N$ and $A$ containing the coefficients. The tensor $A \in \mathbb{R}^N \otimes \dots \otimes \mathbb{R}^N$ can always be written this way, i.e. as a sum of tensor products of $N \times N$ matrices $A_i^{(\alpha)}$, $i = 1, \dots, p$,

$$A = \sum_{\alpha} A_1^{(\alpha)} \otimes \dots \otimes A_p^{(\alpha)}. \qquad (8.4)$$
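Equations (8.3) and (8.4) can be sanity-checked numerically: for randomly assumed factors $A_i^{(\alpha)}$ (an illustrative sketch, not data from the thesis), the quadratic-programme form evaluated on $\theta \otimes \dots \otimes \theta$ equals the product of quadratic forms $\frac{1}{2}\sum_\alpha\prod_i \theta^T A_i^{(\alpha)}\theta$.

```python
# Check that (theta x theta)^T A (theta x theta), with A = sum_alpha A_1 x A_2 as in Eq. (8.4),
# equals 1/2 sum_alpha prod_i theta^T A_i^(alpha) theta (illustrative, random data).
from functools import reduce
import numpy as np

rng = np.random.default_rng(0)
K, p, N = 3, 2, 2
A_factors = rng.standard_normal((K, p, N, N))   # A_i^(alpha)

A_big = sum(reduce(np.kron, A_factors[a]) for a in range(K))   # Eq. (8.4): an N^p x N^p matrix
theta = rng.standard_normal(N)
theta_kron = reduce(np.kron, [theta] * p)                      # theta x ... x theta (p times)

lhs = 0.5 * theta_kron @ A_big @ theta_kron                    # Eq. (8.3)
rhs = 0.5 * sum(np.prod([theta @ A_factors[a, i] @ theta for i in range(p)]) for a in range(K))
print(np.isclose(lhs, rhs))                                    # True
```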

To illustrate this notation and its scope, consider the following example:

Example 8.1.1: Polynomial objective function for p = 2, N = 2

For a 2-dimensional input domain, a 4th-order polynomial can be represented in the form of Equation (8.3) as

$$
o_4(\theta) = \frac{1}{2}\,(\theta_1, \theta_2) \otimes (\theta_1, \theta_2)
\begin{pmatrix}
a_{1111} & a_{1112} & a_{1211} & a_{1212}\\
a_{1121} & a_{1122} & a_{1221} & a_{1222}\\
a_{2111} & a_{2112} & a_{2211} & a_{2212}\\
a_{2121} & a_{2122} & a_{2221} & a_{2222}
\end{pmatrix}
\begin{pmatrix}\theta_1\\ \theta_2\end{pmatrix} \otimes \begin{pmatrix}\theta_1\\ \theta_2\end{pmatrix}
$$

$$
= \frac{1}{2}\,(\theta_1\theta_1, \theta_1\theta_2, \theta_2\theta_1, \theta_2\theta_2)
\begin{pmatrix}
a_{1111} & a_{1112} & a_{1211} & a_{1212}\\
a_{1121} & a_{1122} & a_{1221} & a_{1222}\\
a_{2111} & a_{2112} & a_{2211} & a_{2212}\\
a_{2121} & a_{2122} & a_{2221} & a_{2222}
\end{pmatrix}
\begin{pmatrix}\theta_1\theta_1\\ \theta_1\theta_2\\ \theta_2\theta_1\\ \theta_2\theta_2\end{pmatrix}
$$

$$
\begin{aligned}
= \;& \tfrac{1}{2}\,a_{1111}\,\theta_1^4\\
&+ \tfrac{1}{2}\,(a_{1112}+a_{1121}+a_{1211}+a_{2111})\,\theta_1^3\theta_2\\
&+ \tfrac{1}{2}\,(a_{1122}+a_{1212}+a_{1221}+a_{2112}+a_{2121}+a_{2211})\,\theta_1^2\theta_2^2\\
&+ \tfrac{1}{2}\,(a_{1222}+a_{2122}+a_{2212}+a_{2221})\,\theta_1\theta_2^3\\
&+ \tfrac{1}{2}\,a_{2222}\,\theta_2^4.
\end{aligned}
$$

If we could clamp $\theta_2 = 1$ we would get a polynomial over the $N-1 = 1$-dimensional input space with all orders up to fourth, namely $1, \theta_1, \theta_1^2, \theta_1^3, \theta_1^4$.
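The expansion can be verified with random coefficients $a_{ijkl}$ (a hypothetical check, not part of the thesis); the matrix below is filled exactly in the row/column order written in the example.

```python
# Numerical check of the expansion in Example 8.1.1 (random coefficients, indices shifted to 0/1).
import numpy as np

rng = np.random.default_rng(1)
a = rng.standard_normal((2, 2, 2, 2))            # a_{ijkl}
theta = rng.standard_normal(2)
t1, t2 = theta

# rows indexed by (i, k), columns by (j, l), as in the matrix of Example 8.1.1
A = np.array([[a[i, j, k, l] for j in range(2) for l in range(2)]
              for i in range(2) for k in range(2)])
lhs = 0.5 * np.kron(theta, theta) @ A @ np.kron(theta, theta)

rhs = 0.5 * (a[0, 0, 0, 0] * t1**4
             + (a[0, 0, 0, 1] + a[0, 0, 1, 0] + a[0, 1, 0, 0] + a[1, 0, 0, 0]) * t1**3 * t2
             + (a[0, 0, 1, 1] + a[0, 1, 0, 1] + a[0, 1, 1, 0]
                + a[1, 0, 0, 1] + a[1, 0, 1, 0] + a[1, 1, 0, 0]) * t1**2 * t2**2
             + (a[0, 1, 1, 1] + a[1, 0, 1, 1] + a[1, 1, 0, 1] + a[1, 1, 1, 0]) * t1 * t2**3
             + a[1, 1, 1, 1] * t2**4)
print(np.isclose(lhs, rhs))                      # True
```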

Equation (8.3) constitutes the type of objective function for which the quantum gradient descent algorithm will be formulated.

8.1.3 Normalisation constraints

When encoding the classical vectors $\theta$ as quantum states in the next section, an additional constraint appears:

$$\sum_{i=1}^{N} \theta_i^2 = 1,$$

or $\theta^T\theta = 1$. In the optimisation literature this is known as a spherical constraint or orthogonality constraint.


Figure 8.2: Embedding of a quadratic objective function over the interval $[-1, 1]$ onto the unit circle of a 2-dimensional space. It is an interesting question whether such embeddings can transform an $L_2$ regularisation into a ‘normalisation condition’.

Applications of such optimisation problems appear in image and signal processing, biomedical engineering, speech recognition and even quantum mechanics [219].¹ For the purpose of machine learning, spherical optimisation is not very common, while $L_2$-ball optimisation with the constraint $\theta^T\theta \le 1$ appears frequently as a regularisation method. The constraint ensures that the length of the parameter vector does not exceed a certain size. It needs to be established in future research what effects replacing $\theta^T\theta \le 1$ with $\theta^T\theta = 1$ will have on regularisation. Intuitively, this reduces the risk that the parameters vanish and might be interesting for recurrent neural networks, where length-preserving training methods have recently been proposed [222].

Furthermore, in cases where the local minimum of the overall objective function lies outside of the unit ball, $L_2$ regularisation would find the constrained minimum on the unit sphere in any case, which coincides with the case of unit sphere constraints. For cases in which unit sphere constraints have unfavourable properties, it would be interesting to investigate whether the spherical constraints can be mapped to ball constraints in a lower dimension (as depicted in Figure 8.2). These open questions are left for future investigations.

In general, constrained optimisation problems can be solved with gradient projection methods [223, 224], in which after each gradient descent update the result is projected onto the feasible set, here given by $S = \{\theta \in \mathbb{R}^N \mid \theta^T\theta = 1\}$. A proof of convergence in convex basins around a minimum is given in Appendix B.3. The quantum method below will recover the exact behaviour of a projected method, but the normalisation step will be generically performed by measurements on the quantum system.
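A minimal sketch of such a projected step (assuming a simple quadratic form as objective and reusing the matrix and starting point of Figure 8.3; step size and number of steps are arbitrary assumptions):

```python
# Projected gradient descent on the unit sphere: an ordinary gradient step followed by
# renormalisation, i.e. projection onto S = {theta : theta^T theta = 1}.
import numpy as np

def projected_gradient_descent(grad, theta0, eta=0.1, steps=50):
    theta = theta0 / np.linalg.norm(theta0)      # start on the feasible set
    for _ in range(steps):
        theta = theta - eta * grad(theta)        # unconstrained gradient step
        theta = theta / np.linalg.norm(theta)    # projection back onto the unit sphere
    return theta

# example: o(theta) = theta^T A theta, whose constrained minimiser is the eigenvector
# of A + A^T with the smallest eigenvalue
A = np.array([[0.6363, -0.7031], [0.0, -0.8796]])
grad_o = lambda th: (A + A.T) @ th
print(projected_gradient_descent(grad_o, np.array([-0.83, -0.55])))
```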

The objective function defined above, together with the normalisation constraint, shapes a very specific type of optimisation problem, and it is important to have a look at typical problem instances here. Homogeneous polynomials have the property of a vanishing gradient in the origin, which means that the origin is either a minimum, a maximum or a saddle point. For quadratic forms $o(\theta) = \theta^T A\theta$ the projected Newton's method is not expected to perform well, since $H^{-1}\nabla o(\theta) = \theta$: the search direction is consequently parallel to the current vector and no steps are taken, as seen in Figure 8.3. In comparison, the projected gradient descent method performs as desired. This problem can be rectified by including a bias term, $o(\theta) = \theta^T A\theta + b^T\theta$, using a further vector addition in the quantum algorithm below.
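That the projected Newton's method stalls for quadratic forms is easy to check numerically (an illustrative snippet, reusing the matrix of Figure 8.3):

```python
# For o(theta) = theta^T A theta the Newton direction H^{-1} grad o(theta) equals theta itself,
# so the update only rescales theta and the subsequent projection undoes it.
import numpy as np

A = np.array([[0.6363, -0.7031], [0.0, -0.8796]])
theta = np.array([-0.83, -0.55])
theta = theta / np.linalg.norm(theta)

grad = (A + A.T) @ theta                               # gradient of the quadratic form
H = A + A.T                                            # Hessian of the quadratic form
print(np.allclose(np.linalg.solve(H, grad), theta))    # True: direction parallel to theta
```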

¹ The solution of optimising homogeneous polynomials under spherical constraints has been shown to be NP-hard for $d = 3$, and very difficult to approximate [220]. Beyond this case, only little is known about the complexity of spherical optimisation [221].


Figure 8.3: Projected gradient descent and projected Newton's method for quadratic optimisation under unit sphere constraints (i.e., the solution is constrained to the circle). The lower panels show the objective function on the feasible set only, with the angle starting from position $(1, 0)^T$. The parameters are chosen as $K = 1$, $p = 1$, $N = 2$, the objective function is $o(\theta) = \theta^T A\theta$ with $A = [[0.6363, -0.7031], [0.0, -0.8796]]$, and the initial point (red dot) is chosen as $\theta_{\mathrm{init}} \approx (-0.83, -0.55)^T$. The number of steps is $T = 20$. One can see that Newton's method struggles to find the minimum on the feasible set, since it would naturally update the current vector towards the origin, which is not possible under the normalisation constraints.

With this trick, we can reproduce the example of Figure 8.1, as shown in Figure 8.4. The figure also reveals that the objective function on the unit sphere still exhibits an interesting landscape for optimisation, and the problem is therefore non-trivial. However, it would be interesting to open the method to other objective functions, and this chapter shall serve as a starting point for future investigations.

8.1.4 Gradient and Hessian of the objective function

For the gradient descent method we need to determine the first derivative of the objective function.

Using the decomposition of $A$ as in Eq. (8.4), the derivative reads

$$\nabla o(\theta) = \frac{1}{2}\sum_{\alpha=1}^{K}\sum_{j=1}^{p}\Big(\prod_{i \neq j} \theta^T A_i^{(\alpha)}\theta\Big)\big((A_j^{(\alpha)})^T + A_j^{(\alpha)}\big)\,\theta. \qquad (8.5)$$


Figure 8.4: Projected descent methods for an inhomogeneous polynomial. The parameters are chosen as in Figure 8.1, and the figure shows how an inhomogeneity can make Newton's method perform well for quadratic objective functions.

Proof. Express $A$ according to Equation (8.4):

$$
\begin{aligned}
\frac{\partial}{\partial\theta_b}\,\frac{1}{2}\,\theta^T\otimes\dots\otimes\theta^T\, A\,\theta\otimes\dots\otimes\theta
&= \frac{\partial}{\partial\theta_b}\,\frac{1}{2}\sum_{\alpha}\prod_{i=1}^{p}\theta^T A_i^{(\alpha)}\theta\\
&= \frac{1}{2}\sum_{\alpha}\sum_{j}\Big(\prod_{i\neq j}\sum_{st}(a_{st})_i^{\alpha}\,\theta_s\theta_t\Big)\sum_{mn}(a_{mn})_j^{\alpha}\,\frac{\partial}{\partial\theta_b}(\theta_m\theta_n)\\
&= \sum_{\alpha}\sum_{j}\Big(\prod_{i\neq j}\underbrace{\sum_{st}(a_{st})_i^{\alpha}\,\theta_s\theta_t}_{\theta^T A_i^{(\alpha)}\theta}\Big)\sum_{m}\frac{1}{2}\big((a_{mb})_j^{\alpha} + (a_{bm})_j^{\alpha}\big)\theta_m.
\end{aligned}
$$

Hence, to calculate the derivative at a given point $\theta$ one has to apply the matrix

$$
D = \frac{1}{2}\sum_{\alpha}\sum_{j}\Big(\prod_{i\neq j}\theta^T A_i^{(\alpha)}\theta\Big)\big((A_j^{(\alpha)})^T + A_j^{(\alpha)}\big) \qquad (8.6)
$$

to the vector $\theta$. This is crucial for the quantum algorithm, in which we will apply a corresponding quantum operator $D$ to a quantum state vector representing the current point.
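As an illustration of how $D$ acts (a sketch with randomly assumed factors $A_i^{(\alpha)}$, not data from the thesis), the following snippet builds $D$ from Equation (8.6) and confirms that $D\theta$ agrees with a finite-difference gradient of the objective:

```python
# Build D(theta) from the factors A_i^(alpha) and check D(theta) theta against a central
# finite-difference gradient of o(theta) = 1/2 sum_alpha prod_i theta^T A_i^(alpha) theta.
import numpy as np

rng = np.random.default_rng(2)
K, p, N = 2, 3, 4
A = rng.standard_normal((K, p, N, N))                  # A_i^(alpha)
theta = rng.standard_normal(N)

def objective(th):
    return 0.5 * sum(np.prod([th @ A[a, i] @ th for i in range(p)]) for a in range(K))

def D_matrix(th):
    D = np.zeros((N, N))
    for a in range(K):
        forms = [th @ A[a, i] @ th for i in range(p)]          # theta^T A_i^(alpha) theta
        for j in range(p):
            rest = np.prod([forms[i] for i in range(p) if i != j])
            D += 0.5 * rest * (A[a, j].T + A[a, j])            # Eq. (8.6)
    return D

eps = 1e-6
fd_grad = np.array([(objective(theta + eps * e) - objective(theta - eps * e)) / (2 * eps)
                    for e in np.eye(N)])
print(np.allclose(D_matrix(theta) @ theta, fd_grad, atol=1e-5))   # True
```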

The Hessian matrix can be expressed as

H= 1 2

K

X

α=1 p

X

j=1

" p X

i=1i6=j

p

Y

k=1k6=i

θA(α)k θ

(A(α)i )T +A(α)i θ θT

(A(α)j )T +A(α)j +

p

Y

i=1i6=j

θA(α)i θ

(A(α)j )T +A(α)j

# , (8.7)

and will likewise be translated into a quantum operatorH. Proof. Similar to above.
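Equation (8.7) can again be checked numerically (an illustrative sketch with randomly assumed factors, not data from the thesis): assembling $H$ from the factors $A_i^{(\alpha)}$ and comparing it with a finite-difference Jacobian of the gradient of Equation (8.5).

```python
# Assemble the Hessian of o(theta) = 1/2 sum_alpha prod_i theta^T A_i^(alpha) theta via Eq. (8.7)
# and compare it with finite differences of the gradient, Eq. (8.5).
import numpy as np

rng = np.random.default_rng(3)
K, p, N = 2, 3, 4
A = rng.standard_normal((K, p, N, N))
theta = rng.standard_normal(N)

def gradient(th):
    g = np.zeros(N)
    for a in range(K):
        forms = [th @ A[a, i] @ th for i in range(p)]
        for j in range(p):
            g += 0.5 * np.prod([forms[i] for i in range(p) if i != j]) * (A[a, j].T + A[a, j]) @ th
    return g

def hessian(th):
    H = np.zeros((N, N))
    for a in range(K):
        forms = [th @ A[a, i] @ th for i in range(p)]
        sym = [A[a, i].T + A[a, i] for i in range(p)]
        for j in range(p):
            for i in range(p):
                if i != j:
                    rest = np.prod([forms[k] for k in range(p) if k not in (i, j)])
                    H += 0.5 * rest * np.outer(sym[i] @ th, sym[j] @ th)
            H += 0.5 * np.prod([forms[i] for i in range(p) if i != j]) * sym[j]
    return H

eps = 1e-6
fd_H = np.array([(gradient(theta + eps * e) - gradient(theta - eps * e)) / (2 * eps)
                 for e in np.eye(N)])
print(np.allclose(hessian(theta), fd_H, atol=1e-4))    # True
```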