

11.1 Classification with asymptotically large ensembles


The results of this chapter have been written up for a manuscript with the title “Quantum parallelism for exponentially large ensemble classifiers” and will be submitted for publication soon.

I was responsible for the idea, development, analysis and write-up of the content.

This last chapter of Part III investigates how quantum parallelism can be used to construct ensembles of quantum classifiers. Ensemble methods train a number of models in a specific training regime and use their combined decision to improve on each individual classifier. Quantum parallelism refers to the fact that a function f(h) can be evaluated in superposition on a quantum device. More precisely, the operation |h⟩ ⊗ |0⟩ → |h⟩ ⊗ |f(h)⟩ on the qubit registers |h⟩, |0⟩ can be applied to a superposition of the first register, ∑_h |h⟩ ⊗ |0⟩ → ∑_h |h⟩ ⊗ |f(h)⟩. I will use this property in order to evaluate a ‘superposition of quantum models’ in parallel and extract a decision via a single-qubit measurement. As a particular case, I will analyse a collective decision procedure where each model is weighted by its accuracy on the dataset. A quantum algorithm for the implementation of a quantum ensemble classifier is given, but the results from this analysis extend to applications with classical ensembles as well.
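To make this property concrete, the following is a minimal classical state-vector sketch (in Python, with an invented stand-in for the binary model function f): a register in uniform superposition over model indices h is entangled with an output qubit carrying f(h), and the measurement statistics of that single output qubit reproduce the average of f(h) over all branches. It is an illustration of the principle only, not the quantum routine itself.

```python
import numpy as np

# Invented stand-in for the model function f(h) in {-1, +1},
# one value per basis state |h> of the model register.
n_models = 8
rng = np.random.default_rng(0)
f = rng.choice([-1, 1], size=n_models)

# Uniform superposition over |h>, output qubit initialised to |0>.  After the
# parallel evaluation |h>|0> -> |h>|f(h)>, each branch h carries amplitude
# 1/sqrt(N) and an output qubit in |0> (for f(h) = +1) or |1> (for f(h) = -1).
amplitudes = np.full(n_models, 1.0 / np.sqrt(n_models))
output_bit = (f == -1).astype(int)

# Measurement statistics of the output qubit alone:
p0 = np.sum(amplitudes[output_bit == 0] ** 2)
p1 = np.sum(amplitudes[output_bit == 1] ** 2)

# In the +/-1 convention, the expectation value of this single-qubit measurement
# equals the uniform average of f(h) over the whole superposition.
print(p0 - p1, f.mean())  # identical up to floating-point error
```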

Figure 11.1: The principle of ensemble methods in machine learning is to select a set of classifiers and combine their predictions to obtain a better performance in generalising from the data than the best single model. Here, the N classifiers are considered to be deterministic parametrised model functions from a family {f(x; θ)}, where the parameter θ solely defines the individual model. The dataset D is consulted in the selection procedure and sometimes also plays a role in the decision mechanism.

predicting the rest of the inputs. This ‘expert knowledge’ is lost if only one winner is selected. The idea is to allow for an ensemble or committee E of trained models (sometimes called ‘experts’ or simply ‘classifiers’) that take the decision for a new prediction together. Considering how familiar this principle is in our societies, it is surprising that this thought only gained widespread attention as late as the 1990s.

Many different proposals have been put forward for how to use more than one model for prediction.¹ The proposals can be categorised along two dimensions [233]: first, the selection procedure they apply to obtain the ensemble members, and second, the decision procedure defined to compute the final output (see Figure 11.1). The simplest strategy is arguably to train several models and decide according to their majority vote. More intricate variations are popular in practice and have interesting theoretical foundations. Bagging [234] trains classifiers on different subsamples of the training set, thereby reducing the variance of the prediction. AdaBoost [235, 236] trains subsequent models on the part of the training set that was misclassified previously and applies a given weighting scheme, which reduces the bias of the prediction. Mixtures of experts train a number of classifiers using a specific error function and in a second step train a ‘gating network’ that defines the weighting scheme [237]. For all these methods, the ensemble classifier can be written as

$$\tilde{y} = \mathrm{sgn}\left(\sum_{\theta \in E} w_\theta\, f(\tilde{x};\theta)\right). \qquad (11.1)$$

The coefficients w_θ weight the decisions f(x̃; θ) ∈ {−1, 1} of the models specified by θ ∈ E, while the sign function assigns class 1 to the new input if the weighted sum is positive and −1 otherwise. It is important for the following to rewrite this as a sum over all E possible parameters. Here I will use a finite representation of numbers and limit the parameters to a certain interval to get the discrete sum

$$\tilde{y} = \mathrm{sgn}\left(\sum_{\theta=0}^{E-1} w_\theta\, f(\tilde{x};\theta)\right). \qquad (11.2)$$

¹ Here I will focus on the idea of using a parametrised model family with fixed hyperparameters. More generally, ensembles can be composed of models with different hyperparameters (for example neural networks with different architectures), or even of different model types altogether (for example neural networks, random forests and linear regression).

In order to obtain the ensemble classifier of Equation (11.1), the weights w_θ which correspond to models that are not part of the ensemble E are set to zero. Given a model family f, an interval for the parameters as well as a precision to which they are represented, an ensemble is specified by the set of weights {w_1, …, w_E}.
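To fix ideas, the following is a minimal sketch of Equations (11.1) and (11.2), assuming an invented one-parameter threshold model f(x; θ) and an arbitrary, purely illustrative selection of ensemble members: the prediction is simply the sign of a weighted sum over a discretised parameter grid, with zero weights for models outside the ensemble.

```python
import numpy as np

def f(x, theta):
    """Invented one-parameter model family: a threshold classifier with outputs in {-1, +1}."""
    return 1 if x >= theta else -1

# Discretise the parameter interval [-1, 1] into E possible models (Equation 11.2).
E = 64
thetas = np.linspace(-1.0, 1.0, E)

# Hypothetical selection procedure: keep every third model; all other weights are
# zero, which recovers the restricted sum of Equation (11.1).
weights = np.zeros(E)
weights[::3] = 1.0

def ensemble_predict(x_new):
    """Sign of the weighted sum of all model decisions (Equations 11.1/11.2)."""
    decisions = np.array([f(x_new, t) for t in thetas])
    return 1 if np.dot(weights, decisions) >= 0 else -1

print(ensemble_predict(0.2), ensemble_predict(-0.7))
```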

Writing a sum over all possible models provides a framework to think about asymptotically large ensembles, which can for example be realised by quantum parallelism. Interestingly enough, this formulation is also very close to another paradigm of classification, the Bayesian learning approach presented in Section 2.2. This method uses the knowledge represented by the data to estimate which hypothesis θ is the ‘true’ one. If we replace f(x̃; θ) in Equation (11.2) with a probability distribution p(ỹ|x̃, θ) and interpret w_θ as the likelihood p(θ|D) of θ being the true model given the observed data, we arrive at a Bayesian classification rule,

$$p(\tilde{y}|\tilde{x}, D) = \int d\theta\; p(\tilde{y}|\tilde{x}, \theta)\, p(\theta|D). \qquad (11.3)$$

The probability of the new input x̃ being in class ỹ = 1 given the data can be expressed as the integral over the probability of a prediction of 1 given the model, times the probability of this particular model given the data. If we also consider different model families specified by hyperparameters, this method turns into Bayesian Model Averaging, which is sometimes considered as an ensemble method (although not necessarily leading to optimal generalisation results [238]).
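As a purely illustrative rendering of Equation (11.3), the sketch below replaces the integral by a sum over the same discretised parameter grid, assuming an invented logistic likelihood and a flat prior: each hypothesis contributes its predictive probability, weighted by its posterior given the data.

```python
import numpy as np

thetas = np.linspace(-1.0, 1.0, 64)            # discretised hypotheses theta

def p_y_given_x_theta(y, x, theta, steepness=5.0):
    """Invented probabilistic model p(y|x, theta): a logistic threshold classifier."""
    p_plus = 1.0 / (1.0 + np.exp(-steepness * (x - theta)))
    return p_plus if y == 1 else 1.0 - p_plus

def posterior(thetas, data):
    """p(theta|D) under a flat prior: product of likelihoods, normalised over the grid."""
    log_post = np.array([sum(np.log(p_y_given_x_theta(y, x, t)) for x, y in data)
                         for t in thetas])
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

data = [(-0.8, -1), (-0.2, -1), (0.3, 1), (0.9, 1)]   # toy dataset D of (x, y) pairs
w = posterior(thetas, data)

# Equation (11.3), with the integral replaced by a sum over the discretised grid:
x_new = 0.1
p_plus = sum(w_t * p_y_given_x_theta(1, x_new, t) for w_t, t in zip(w, thetas))
print(p_plus)                                          # probability that y_tilde = +1
```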

Beyond the Bayesian framework, increasing the size of the ensemble to include all possible parameters has been studied in different contexts. In some cases, adding accurate classifiers with a classification error (the number of misclassified test examples divided by their total) of less than 0.5 increases the performance of the ensemble decision. This means that each member is better than random guessing and has learned the pattern of the training set to at least a small extent. The most well-known case was found by Schapire [235], leading to the aforementioned AdaBoost algorithm, where a collection of such weak classifiers can be turned into a strong classifier that has a high accuracy on the test set. The advantage here is that weak classifiers are comparatively easy to train and combine. But people thought about the power of weak learners long before AdaBoost. The Condorcet Jury Theorem from 1785 states that, considering a committee of judges where each judge has a probability p with p > 0.5 to reach a correct decision, the probability of a correct collective decision by majority vote will converge to 1 as the number of judges approaches infinity. This idea has been applied to ensembles of neural networks by Hansen and Salamon [239].

If all ensemble members have a likelihood of p to classify a new instance correctly (which can be estimated by their accuracy a on the test set), and their errors are uncorrelated, the probability that the majority rule classifies the new instance incorrectly is given by

$$\sum_{k > E/2}^{E} \binom{E}{k}\, p^{E-k} (1-p)^{k},$$

where E is again the ensemble size. The convergence behaviour is plotted in Figure 11.2 (left) for different values of p. The assumption of uncorrelated errors is idealistic, since some data points will be more difficult to classify than others and therefore tend to be misclassified by a large proportion of the ensemble members.

Figure 11.2: Left: Prediction error when increasing the size of an ensemble of classifiers, each of which has an accuracy p. Asymptotically, the error converges to zero if p > 0.5. Right: For p > 0.5, the odds ratio p/(1−p) grows slower than its square. Together with the results from Lam et al. described in the text, this is an indication that adding accurate classifiers to an ensemble increases its predictive power.

Hansen and Salamon argue that for the highly overparametrised neural network models they consider as base classifiers, training will get stuck in different local minima, so that the ensemble members will be diverse and their errors sufficiently uncorrelated.
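The convergence behaviour can be checked directly from the sum above; the short sketch below evaluates the majority-rule error exactly for the values of p used in Figure 11.2 (left) and a few illustrative ensemble sizes.

```python
from math import comb

def majority_error(E, p):
    """Probability that more than half of E independent members, each correct with
    probability p, are wrong: the sum over k > E/2 of C(E, k) p^(E-k) (1-p)^k."""
    return sum(comb(E, k) * p ** (E - k) * (1 - p) ** k
               for k in range(E // 2 + 1, E + 1))

# Odd ensemble sizes avoid ties; the error vanishes with growing E whenever p > 0.5.
for p in (0.49, 0.51, 0.7, 0.9):
    print(p, [round(majority_error(E, p), 4) for E in (1, 11, 51, 101)])
```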

A more realistic setting would also assume that each model has a different prediction probability p, measured by the accuracy a, which has been investigated by Lam and Suen [240]. The change in prediction power with the growth of the ensemble obviously depends on the predictive power of the new ensemble member, but its sign can be determined. Roughly stated, adding two classifiers with accuracies a_1, a_2 to an ensemble of size 2n will increase the prediction power if the value of [a_1/(1−a_1)]·[a_2/(1−a_2)] is not less than the ratio a_i/(1−a_i) of any ensemble member, i = 1, …, E. When plotting the ratio and its square in Figure 11.2 (right), it becomes apparent that for all a_i > 0.5 the chances are high to increase the predictive power of the ensemble. All these results suggest that constructing large ensembles of accurate classifiers can lead to a strong combined classifier.
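Stated in code, this condition amounts to a comparison of odds ratios; the following sketch uses invented accuracies for the current members and the two candidates.

```python
def odds(a):
    """Odds ratio a / (1 - a) of a classifier with accuracy a."""
    return a / (1.0 - a)

# Invented accuracies of the current ensemble members and of two candidate classifiers.
ensemble_accuracies = [0.62, 0.70, 0.55, 0.66]
a1, a2 = 0.64, 0.68

# Rough form of the condition: the product of the candidates' odds ratios should
# not be smaller than the odds ratio of any current member.
adds_predictive_power = odds(a1) * odds(a2) >= max(odds(a) for a in ensemble_accuracies)
print(adds_predictive_power)
```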

Before proceeding to quantum models, another result is important to mention. If we consider all possible parameters θ in the sum of Equation (11.2) and assume that the model defined by θ has an accuracy a_θ on the training set, the optimal weighting scheme is given by

$$w(\theta) = \log \frac{a_\theta}{1 - a_\theta}. \qquad (11.4)$$

It is interesting to note that this weighting scheme corresponds to the weights chosen in AdaBoost for each trained model, where they are derived from what seems to be a different theoretical objective.
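A short sketch plugging the log-odds weights of Equation (11.4) into the decision rule of Equation (11.2), with invented per-model accuracies and decisions:

```python
import numpy as np

def log_odds_weight(accuracy):
    """Equation (11.4): w(theta) = log( a_theta / (1 - a_theta) ).
    Members with a_theta > 0.5 receive positive weight, the rest negative weight."""
    return np.log(accuracy / (1.0 - accuracy))

# Invented accuracies a_theta and decisions f(x_tilde; theta) for E possible models.
rng = np.random.default_rng(1)
E = 16
accuracies = rng.uniform(0.3, 0.9, size=E)
decisions = rng.choice([-1, 1], size=E)

weights = log_odds_weight(accuracies)
y_tilde = 1 if np.dot(weights, decisions) >= 0 else -1  # Equation (11.2) with the weights of (11.4)
print(y_tilde)
```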