Quantum Computing for Machine Learning and Physics Simulation


A quantum algorithm for training deep neural networks

Introduction

Due to the good conditioning of the deep neural network's NTK, we show that the output of the NTK can also be approximated by removing all off-diagonal elements of the NTK matrix. Following [26], we define the nonlinearity of the activation function and note the effect of normalization on the resulting constant.

Sparsified neural tangent kernel

Deep neural networks have recently been shown to speed up optimization via gradient descent due to the well-conditioned NTK [26]. The key property of the NTK that enables this is a lower bound on the NTK matrix elements.

Diagonal neural tangent kernel

We use Theorem 1.2.5 in the same way as in steps 2–3 to identify O(log n) non-zero elements of the NTK, selecting larger elements with higher probability; to ensure that the sampled non-zero elements are distinct, the QRAM is modified in O(polylog n) time after each measurement. Since the product (k_NTK)* K̃_NTK^{-1} y is proportional to ⟨k*| K̃_NTK^{-1} |y⟩ up to a positive normalization factor, and Theorem 1.2.3 ensures convergence to the exact NTK output, the final classification can be performed with O(1/n) error in the wide and deep limit of the neural network.
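As a minimal classical illustration of this sparsification (not the quantum routine itself; the kernel here is a stand-in polynomial kernel rather than the true NTK, and all names are our own), the sketch below keeps the diagonal, samples s = O(log n) off-diagonal entries per column with probability proportional to magnitude, and classifies with the sparsified kernel:

import numpy as np

def sparsify_ntk(K, s, rng):
    """Keep the diagonal of K and, per column, sample s distinct
    off-diagonal entries with probability proportional to |K_ij|."""
    n = K.shape[0]
    K_sparse = np.diag(np.diag(K)).astype(float)
    for j in range(n):
        weights = np.abs(K[:, j]).copy()
        weights[j] = 0.0                      # exclude the diagonal
        probs = weights / weights.sum()
        rows = rng.choice(n, size=s, replace=False, p=probs)
        K_sparse[rows, j] = K[rows, j]
        K_sparse[j, rows] = K[rows, j]        # keep the matrix symmetric
    return K_sparse

rng = np.random.default_rng(0)
n = 256
X = rng.standard_normal((n, 8))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.sign(X[:, 0])
K = (X @ X.T + 1.0) ** 2                      # stand-in kernel, not the true NTK
K_tilde = sparsify_ntk(K, s=int(np.log(n)), rng=rng)
k_star = K[0]                                 # kernel row for a "test" point
pred = np.sign(k_star @ np.linalg.solve(K_tilde + 1e-6 * np.eye(n), y))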

Numerical experiments

Finally, we evaluate the performance of the exact and approximate NTKs on both a toy dataset and the MNIST dataset. As expected, the performance of the NTK approximations rapidly approaches that of the exact NTK, and the sparsified NTK outperforms the diagonal NTK approximation (Fig. 1.3).

Figure 1.1: Left: true data distribution on a 3-dimensional sphere, where labels are assigned by sign Σ_d

Discussion

Suppose we want to find the non-zero elements of the j-th column of the NTK matrix. As in the case of a uniform distribution on the sphere, both datasets clearly satisfy δ = Ω(1/poly n) (Figure A.3).

Quantum generative adversarial networks

Overview

Numerical experiments confirm that EQ-GAN converges on problem instances where QuGAN failed. We show that the EQ-GAN adversarial approach is more robust to such errors than the simpler supervised learning approach.

Prior work

When ρ is close to σ, the optimal Helstrom measurement operator T = P_+(σ − ρ), the projector onto the positive eigenspace of σ − ρ, is nearly orthogonal to the true quantum data σ and nearly opposite to ρ. The next step of training will try to align the generator state ρ with T to optimize the cost function in Eq.
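A small numerical illustration of this operator (a sketch under the definition above, using dense matrices and toy single-qubit states of our own choosing):

import numpy as np

def helstrom_operator(sigma, rho):
    """Projector onto the positive eigenspace of sigma - rho,
    i.e. the optimal measurement for distinguishing the two states."""
    evals, evecs = np.linalg.eigh(sigma - rho)
    pos = evecs[:, evals > 0]
    return pos @ pos.conj().T

# Two nearby single-qubit states: sigma = |0><0|, rho slightly rotated.
sigma = np.array([[1, 0], [0, 0]], dtype=complex)
eps = 0.1
v = np.array([np.cos(eps), np.sin(eps)])
rho = np.outer(v, v).astype(complex)

T = helstrom_operator(sigma, rho)
# Helstrom success probability for equal priors: 1/2 + ||sigma - rho||_1 / 4.
p_success = 0.5 + 0.25 * np.abs(np.linalg.eigvalsh(sigma - rho)).sum()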

Mode collapse example of QuGAN

This leads to oscillation of the generator and discriminator, possibly preventing convergence; we exhibit a case of indefinite oscillation below. Mode collapse manifests as oscillation in the generator and discriminator losses without convergence to a global optimum.

Convergence of EQ-GAN

The EQ-GAN architecture adversarially optimizes the state generator ρ(θ_g) and a learned fidelity measure D_σ (Fig. 2.2). The measurement of the ancilla qubit supplies the cost function, in the same manner as in the EQ-GAN objective defined above.
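For concreteness, a minimal ancilla-based swap test in Cirq (a toy sketch of the fidelity measurement, not the exact discriminator circuit of this work): the ancilla measurement probability estimates the overlap |⟨ψ|φ⟩|², and a parameterized discriminator would deform this circuit.

import cirq
import numpy as np

def swap_test_circuit(q_data, q_gen, ancilla):
    """Ancilla-based swap test: P(ancilla = 0) = (1 + |<psi|phi>|^2) / 2."""
    return cirq.Circuit([
        cirq.H(ancilla),
        cirq.CSWAP(ancilla, q_data, q_gen),
        cirq.H(ancilla),
        cirq.measure(ancilla, key='m'),
    ])

anc, qd, qg = cirq.LineQubit.range(3)
theta = 0.3  # generator parameter; the data state is fixed at |0>
circuit = cirq.Circuit([cirq.ry(theta).on(qg)]) + swap_test_circuit(qd, qg, anc)
result = cirq.Simulator().run(circuit, repetitions=2000)
p0 = 1 - result.measurements['m'].mean()
overlap = 2 * p0 - 1  # estimate of |<data|gen>|^2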

Figure 2.2: EQ-GAN architecture. The generator varies θ_g to fool the discriminator; the discriminator varies θ_d to distinguish the state.

Learning to suppress errors

Therefore, the EQ-GAN obtains a state overlap that is significantly better than that of the perfect swap test (Fig. 2.6). The “converged” EQ-GAN (dotted line) coincides with the iteration where the discriminator loss is minimized.

Figure 2.5: EQ-GAN experiment for learning a single-qubit state. The discriminator U(θ_d) is constructed with free Z rotation angles to suppress CZ gate errors, allowing the generator ρ(θ_g) to converge closer to the true data state σ by varying θ_g.

Training EQ-GAN

Instead of training both the generator and the discriminator from scratch, we pretrain EQ-GAN in a supervised setting. While the supervised learner cannot be trained successfully by gradient descent, EQ-GAN achieves a state overlap of 0.97.

Figure 2.7: Comparison of QuGAN [10, 11] and EQ-GAN learning the state given by Eq. 2.4.

Application to QRAM

On the other hand, by allowing the discriminator to vary, the vanishing gradient problem is circumvented and the generator learns the data state (Fig. 2.10). A readout qubit (orange) performs the parameterized two-qubit interactions shown in the figure: (a) exponential peak ansatz; (b) Gaussian peak ansatz.
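As a toy illustration of the target of such a variational QRAM (the encoding goal, not the circuit ansatz itself; bin count and names are our own choices): encode an empirical histogram of the two-peak data as amplitudes, so that measuring the register samples approximately from the training distribution.

import numpy as np

rng = np.random.default_rng(1)
# Two-peak dataset sampled from normal distributions, as in Fig. 2.11.
data = np.concatenate([rng.normal(-2, 0.5, 30), rng.normal(2, 0.5, 30)])

n_qubits = 3
bins = 2 ** n_qubits
hist, _ = np.histogram(data, bins=bins, range=(-4, 4))
amplitudes = np.sqrt(hist / hist.sum())  # |psi> with P(i) = empirical frequency

# Sampling this state reproduces the histogram:
probs = amplitudes ** 2
samples = rng.choice(bins, size=1000, p=probs)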

Figure 2.11: Two-peak total dataset (sampled from normal distributions, N = 120) and variational QRAM of the training dataset (N = 60).

Discussion

To write down the NTK elements, we define the dual activation function σ̂ corresponding to the activation function σ [5]. We briefly comment on the output distribution of a trained NTK in the limit t → ∞ (Lemma 1.1.1 of the main text).

Near-term quantum simulation of wormhole teleportation

Wormhole teleportation

At this point in time, we apply a coupling term between the left and right sides of the wormhole; if chosen appropriately, this produces negative null energy in the bulk. Without the iµV coupling between the left and right systems at t = 0 (green), there would be no shock wave and the message would continue to propagate in a straight line into the singularity.

Figure 3.1: The wormhole teleportation protocol shown in the quantum information picture (left, quantum circuit) and gravity picture (right, Penrose diagram).

Dirac SYK model

To simulate the protocol on a quantum device, we perform the following steps, where the times t0 ≈ t1 are chosen to be approximately the scrambling time. Note that to send a negative energy shock wave (Fig. 3.1), the sign of µ must be chosen appropriately. At the end of the protocol, register T will be maximally entangled with register Q.
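A schematic numerical sketch of this unitary sequence with stand-in operators (random Hermitian matrices rather than the true SYK Hamiltonian and left-right coupling; all parameter values are illustrative):

import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
d = 8                                     # dimension of each side (toy scale)
rand_herm = lambda n: (lambda A: (A + A.conj().T) / 2)(
    rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)))

H = rand_herm(d)                          # stand-in for the Dirac SYK Hamiltonian
V = rand_herm(d * d)                      # stand-in for the left-right coupling
U = lambda t: expm(-1j * H * t)
I = np.eye(d)

# Infinite-temperature thermofield double: maximally entangled L-R state.
tfd = np.eye(d).reshape(-1) / np.sqrt(d)

mu, t0, t1 = 12.0, 2.0, 2.0               # t0 ~ t1 ~ scrambling time (illustrative)
O_msg = np.diag(rng.choice([-1.0, 1.0], d))  # toy message-insertion operator

psi = np.kron(U(t0) @ O_msg @ U(-t0), I) @ tfd  # insert the message at -t0 on L
psi = expm(1j * mu * V) @ psi                   # apply the e^{i mu V} coupling
psi = np.kron(I, U(t1).conj()) @ psi            # right side evolves with the conjugate H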

To achieve successful wormhole teleportation, we expect the energy spectrum to be approximately continuous, i.e. with small gaps between energy levels.

Low-rank SYK

Since the coefficients of the original SYK model are random, numerical experiments can instead randomly select the coefficients of this factorized Hamiltonian (Eq. 3.6). By taking a single term f†f, we recover the original Dirac SYK model form to determine the appropriate distribution of g_{pq,k}. This is numerically verified to produce a Gaussian distribution of coefficients when reverting to the original SYK model form (Eq. 3.3).

The Dirac SYK model is now approximated by the rank-L Hamiltonian given by Eq. 3.6, which may be more conveniently decomposed into Givens rotation circuits U_k of linear depth on linear connectivity [24]. Although a low-rank Hamiltonian may reduce circuit depth and gate count, it may require a wider circuit due to the degraded spectrum.
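A toy sketch of such a low-rank construction (our own illustration at small scale, using the f†f structure mentioned above; the rank, coefficient scale, and names are arbitrary): build dense Jordan-Wigner fermion operators, draw Gaussian factors g_{pq,k}, and assemble H = Σ_k f_k† f_k.

import numpy as np

def jw_annihilation(n_modes):
    """Dense Jordan-Wigner annihilation operators for n_modes fermions."""
    I2 = np.eye(2)
    Z = np.diag([1.0, -1.0])
    a = np.array([[0.0, 1.0], [0.0, 0.0]])   # |0><1|, single-mode annihilation
    ops = []
    for j in range(n_modes):
        mats = [Z] * j + [a] + [I2] * (n_modes - j - 1)
        M = np.array([[1.0]])
        for m in mats:
            M = np.kron(M, m)
        ops.append(M)
    return ops

N, L = 4, 3                                  # modes and rank (toy scale)
rng = np.random.default_rng(0)
c = jw_annihilation(N)

# Rank-L Hamiltonian: H = sum_k f_k^dag f_k, f_k = sum_pq g_{pq,k} c_p^dag c_q.
H = np.zeros((2**N, 2**N), dtype=complex)
for k in range(L):
    g = rng.standard_normal((N, N)) / N      # Gaussian factors (illustrative scale)
    f = sum(g[p, q] * c[p].conj().T @ c[q] for p in range(N) for q in range(N))
    H += f.conj().T @ f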

Shallow circuit time evolution

As an ansatz for the variational circuit V(θ), we use a single Trotter step, but with each gate fully parameterized. In the proposed approach, we append a single Trotter step over a small time interval (i.e. T/N is small for large N) to the previous fit V(θ_k) ≈ e^{−iHkT/N}. While a single Trotter step gives a fidelity of 2% at t = 1, the learned approximation of 20 Trotter steps maintains a fidelity of 85% with a circuit of the same depth.

A single Trotter step is compared to learned time evolution, which has the same circuit depth as a single Trotter step but learns gate parameters to compress 20 Trotter steps. Circuits with one and two Trotter steps are compared with learned time evolution that compresses 16 Trotter steps into a circuit with the depth of one Trotter step.
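A minimal classical sketch of this compression idea (a stand-in two-qubit Hamiltonian and a generic Trotter-step-shaped ansatz of our own, not the thesis circuit): fit the ansatz parameters so its output state matches the exact evolution that many Trotter steps approximate.

import numpy as np
from scipy.linalg import expm
from scipy.optimize import minimize

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.diag([1.0, -1.0]).astype(complex)
XX = np.kron(X, X)
Z1 = np.kron(Z, np.eye(2))
Z2 = np.kron(np.eye(2), Z)
H = XX + 0.5 * (Z1 + Z2)                     # toy Hamiltonian

def ansatz(theta):
    """One Trotter-step-shaped layer with free angles."""
    a, b, c = theta
    return expm(-1j * a * XX) @ expm(-1j * b * Z1) @ expm(-1j * c * Z2)

t = 1.0
psi0 = np.zeros(4, dtype=complex)
psi0[0] = 1.0
target = expm(-1j * H * t) @ psi0            # exact evolution to be compressed

def infidelity(theta):
    return 1 - abs(np.vdot(target, ansatz(theta) @ psi0)) ** 2

res = minimize(infidelity, x0=np.zeros(3), method='Nelder-Mead')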

Figure 3.4: Schematic for learning a shallow approximation to a Trotterization, where the blue shaded region is optimized to maximize the value of the swap test.

Classical simulation of wormhole behavior

Note that although we compute the Dirac SYK model numerically, the behavior is expected to be largely the same, since Majorana fermions can be combined into Dirac fermions via c = ψ1 + iψ2, so the Dirac SYK model corresponds to the Majorana case. In order to study the causal propagator in the Dirac SYK model for larger N with a continuous spectrum, we need to change our approach to coupling the left and right systems. In the Dirac SYK case, we take the interaction V = (i/N) Σ over paired left-right modes. Using the Jordan-Wigner transformation described above, the unitary e^{iµV} is the exponential of a sum of single-qubit Pauli operators.

This fact allows us to efficiently prepare the thermofield double state under the Dirac SYK interaction V. Finally, we can evaluate the causal propagator K in the Dirac SYK teleportation protocol (Eq. 3.19). Although we have not derived the exact form of K(t0, t1) for the Dirac SYK model, we obtain the expected qualitative behavior from the Majorana causal propagator analysis (Fig. 3.8).
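Because these single-qubit Pauli terms commute, e^{iµV} factorizes into one single-qubit rotation per qubit. A minimal Cirq sketch (assuming, purely for illustration, that each term maps to a Pauli Z with a 1/N coefficient):

import cirq
import numpy as np

def coupling_unitary(qubits, mu, coeffs):
    """e^{i mu V} for V = sum_j coeffs[j] * Z_j: commuting single-qubit
    terms factor into independent Z rotations."""
    return cirq.Circuit(
        cirq.rz(-2 * mu * c).on(q)       # rz(t) = e^{-i t Z / 2}
        for q, c in zip(qubits, coeffs))

qubits = cirq.LineQubit.range(4)
mu = 0.8
coeffs = np.full(len(qubits), 1 / len(qubits))  # illustrative 1/N normalization
circuit = coupling_unitary(qubits, mu, coeffs)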

Figure 3.7: Mutual information of the teleportation protocol (left) and the asymmetry of mutual information between positive and negative µ (right).

Near-term quantum simulation

Following [4], we define the non-linearity of the activation function and note the effect of normalization on the resulting constant. Finally, we note some important properties of the conjugate activation function (Def. A.1.6) of Daniely et al. As a consequence of Lemma A.1.1, Theorem A.1.3 on the matrix elements of the NTK is valid for all L ≥ L_conv.

For each 0 < ξ < 1, the derivative of the dual activation function with nonlinearity coefficient µ satisfies σ̂′(1 − ξ) ≥ 1 − µ. We can now apply the definition of the NTK to compute a lower bound on a given matrix element.

Computing the diagonal NTK approximation

As given by Theorem A.1.3, the diagonal elements of the NTK are equal and larger than the off-diagonal elements, since Lemma A.1.1 guarantees a sufficient depth L_conv ≥ 2L0(δ, µ) of the neural network. From these bounds, we conclude that the NTK is well-conditioned when it represents a sufficiently deep neural network, consistent with the result of Agarwal et al. In addition to the above properties of the NTK matrix, data separability ensures that a single element of the NTK can be estimated efficiently.

Thus, an element of the NTK matrix can be computed in O(polylog(n)/µ) time, given the inner product between the data points. Applying the results of Sec. A.2 (Eq. A.26): since K_NTK = (K_NTK)_{11} A, this gives the required relationship for the NTK itself.

Quantum algorithm

For any positive integer P, amplitude estimation yields p̃ ∈ [0, 1] such that |p̃ − p| ≤ 2π√(p(1−p))/P + π²/P² (Eq. A.30) with probability at least 8/π², using P iterations of the algorithm A. Since the NTK is only a function of the inner product x* · x_i, we can calculate the kernel elements after estimating the inner product between the test datum and each training datum, following an approach similar to Kerenidis et al. Because δ = Ω(1/poly n), evaluating the NTK between two data points takes time O(polylog(n)/µ) given their inner product.

To estimate the sign of ⟨k*|y⟩, we encode the relative phase of the states and perform an inner-product estimation procedure such that m measurements of the state yield an error scaling as 1/√m. Since at least one of the terms has Ω(1/polylog n) magnitude, the probability of completely canceling the non-negligible terms is vanishingly small when O(n) labels are assigned to both +1 and −1.

Computing the sparsified NTK approximation

The function ν : [n]×[s] → [n] computes the row index of the ℓ-th non-zero entry of the j-th column. Due to the efficient kernel estimation, we can prepare the state |k*⟩ in time polylogarithmic in n. When restoring s = O(log n) off-diagonal elements to the identity-matrix approximation of the NTK, we need only minor modifications to the eigenvalue bounds shown previously.
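A classical analogue of this sparse-access function (hypothetical names, using a compressed-sparse-column layout):

import numpy as np
from scipy.sparse import csc_matrix

def nu(K_sparse, j, ell):
    """Row index of the ell-th non-zero entry of column j (0-indexed),
    mirroring the oracle nu : [n] x [s] -> [n]."""
    start, end = K_sparse.indptr[j], K_sparse.indptr[j + 1]
    return int(K_sparse.indices[start:end][ell])

K = csc_matrix(np.array([[2.0, 0.0, 1.0],
                         [0.0, 3.0, 0.0],
                         [1.0, 0.0, 4.0]]))
row = nu(K, j=2, ell=0)   # first non-zero entry of column 2 -> row 0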

Let M = K̃_NTK be a sparsification of the exact NTK K_NTK that retains the entire diagonal and a subset of s = O(log n) off-diagonal elements; as before, Theorem A.1.3 ensures the diagonal elements are equal and larger than the off-diagonal elements.

Datasets

Since the matrix error is the same as in Theorem A.2.6, the result of Theorem A.2.8 applies directly to the normalization of the initial state |k*⟩ and the consequent convergence of the NTK approximation. We can define δ in terms of an n×n matrix G defined analogously to the Gram matrix, but with magnitudes of inner products. Let A be a symmetric n×n matrix with elements A_ij sampled from the distribution of inner-product magnitudes |x_i · x_j|.

Since the distribution is uniform on the surface of the sphere, symmetry under orthogonal transformations implies that we can take x_j = u/||u|| for u ∼ N_d(0, 1), u = (u1, . . . , ud). Suppose we sample m times from the distribution, corresponding to the m ≈ n²/2 randomly chosen elements of the symmetric matrix A.
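A quick numerical check of this construction (our own sketch; we take δ(n) to be the minimal gap of |x_i · x_j| from 1, an assumed reading of the separability parameter):

import numpy as np

def sphere_delta(n, d, rng):
    """Empirical delta(n) = 1 - max_{i<j} |x_i . x_j| for n points
    uniform on the (d-1)-sphere. (Assumed definition of delta.)"""
    u = rng.standard_normal((n, d))
    x = u / np.linalg.norm(u, axis=1, keepdims=True)  # uniform on the sphere
    G = np.abs(x @ x.T)
    np.fill_diagonal(G, 0.0)
    return 1.0 - G.max()

rng = np.random.default_rng(0)
ns = [64, 128, 256, 512]
deltas = [sphere_delta(n, d=10, rng=rng) for n in ns]
# A log-log fit of deltas vs ns estimates the exponent a in delta(n) ~ n^-a.
slope = np.polyfit(np.log(ns), np.log(deltas), 1)[0]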

Figure A.2: Empirical fit of δ(n) for the sphere dataset with d = 10. We find δ(n) ≈ 0.90 n^−0.46 (R² = 0.96), in good agreement with the prediction δ(n) ∝ n^−0.44 of Lemma A.5.1.

Numerical evaluation of the NTK

qgan_g_model = tf.keras.Model(inputs=[signal_input, swap_test_input],
                              outputs=[expectation_output, log_output])

Figure A.5: Separability of the dataset on a sphere with dimension d = 3, where the dashed line indicates 0.

Figures

Figure 1.1: Left: true data distribution on a 3-dimensional sphere, where labels are assigned by sign Σ_d
Figure 1.2: Scaling of the number of measurements required for state preparation (left) and of state overlap for efficient readout (right)
Figure 1.3: The exact NTK, sparsified NTK, and diagonal NTK on the 3-D spherical classification problem (left, Eq.
Figure 2.1: Performance of QuGAN [10, 11] learning the state defined in Eq. (2.4) with initialization given by Eq.
