
2.2 Different types of models

2.2.1 Deterministic approach

The deterministic approach solves the supervised pattern classification or regression problem outlined in Definition 1 by using the dataset D to find a model function f : X → Y. A prediction is the evaluation of the function for a certain input, y = f(x).

2.2.1.1 Parametric model functions

In parametric methods, f is a model family that depends on some free parameters θ ∈ Θ. The model function relating inputs to outputs is therefore of the form

y = f(x; θ). (2.1)

Note that in the following these parameters will sometimes be expressed by a weight vector w = (w_1, ..., w_N)^T ∈ R^N.

Learning translates to estimating the ‘best’ parameters θ̂ given the data. The notion of ‘best’ ultimately refers to a low generalisation error of the model (2.1), and as explained in the last section, this means lowering the empirical error on the training data set while not overfitting. To quantify this, an objective function or loss function mapping from the parameter space to the real

6 Again, such distinctions are not sharp: parametric models in the so-called dual formulation can become nonparametric, and nonparametric models can have a fixed number of hyper-parameters that can be learned from the data; deterministic and probabilistic procedures can be mixed, for example by introducing noise into the former or taking mean values of the latter.

Figure 2.8: Schematic illustration of the learning (left) and prediction (right) phase of parametric machine learning methods. Given a model M(θ) with free parameters θ and a dataset D, learning means to estimate an optimal set of parameters θ̂ that generalises from the data. The trained model M(θ̂) maps a new input x̃ to the predicted output ỹ.

numbers, o : Θ → R, is defined. The training procedure minimises this function, thereby solving an optimisation problem. Regularisation can be included in the optimisation problem via constraints (e.g., forcing the parameters to be sparse or small), or as part of the solution method (e.g., stopping before the global minimum is found, or setting small weights in the solution to zero).

A very common choice for the objective function o(θ) is the least-squares error, which compares the outputs f(x_m; θ) produced by the model when fed with the inputs of the training data set with the target outputs y_m in the data set,

o(θ) = Σ_{m=1}^{M} |f(x_m; θ) − y_m|². (2.2)

The resulting optimisation problem is known in statistics as least-squares optimisation:

Definition 3. Least-Squares Optimisation. Given a model f : X × Θ → Y with suitable input, output and parameter domains X, Y, Θ, as well as a dataset D with tuples (x_m, y_m) ∈ X × Y, find

θ̂ = arg min_θ Σ_{m=1}^{M} |f(x_m; θ) − y_m|². (2.3)

Regularisation constraints can be included by adding a ‘penalty’ term λ||w||, with strength controlled by λ, to the right side. Different norms || · || favour different solutions, for example sparse or short weight vectors.
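To make Definition 3 concrete, the following sketch fits a model that is linear in the parameters by solving the least-squares problem in closed form, with and without an L2 penalty; the toy data, the regularisation strength and all variable names are illustrative assumptions, not taken from the text.

```python
import numpy as np

# Minimal sketch of least-squares fitting (Definition 3) for a model that is
# linear in the parameters, f(x; w) = w^T x. The dataset below is illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))            # 20 training inputs x_m in R^3
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                          # noiseless targets y_m

# Closed-form solution of eq. (2.3): minimise sum_m |w^T x_m - y_m|^2.
w_hat = np.linalg.lstsq(X, y, rcond=None)[0]

# With an L2 penalty lam*||w||^2 (ridge regularisation), the normal
# equations become (X^T X + lam*I) w = X^T y.
lam = 0.1
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

print(np.allclose(w_hat, w_true))       # unregularised fit recovers w_true
```

Since the targets here are noiseless and the model family contains the true function, the unregularised solution reproduces w_true exactly; the ridge solution is slightly shrunk towards zero.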

Mathematical optimisation theory has developed an extensive framework to classify and solve optimisation problems [26], which are called programmes, and there are important distinctions between types of programmes that roughly define how difficult it is to find a global solution with a computer (for some problems, even local or approximate solutions are hard to compute). The most important distinction is between convex problems, for which a number of algorithms and extensive theory exist, and nonconvex problems, which are a lot harder to treat [27]. Convexity thereby refers to the objective function and possible constraint functions. Roughly speaking, a set is convex if a straight line connecting any two points in that set lies inside the set. A function f : X → R is convex if X is a convex domain and if a straight line connecting any two points of the function lies

‘above’ the function (for more details see [26]).7 To give an example, least-squares optimisation in

7 Two reasons why convex optimisation is relatively well understood are that a) convex problems only have global optima, and b) local information about the function also contains global information. An optimisation algorithm can therefore determine the search direction more reliably and does not risk getting stuck in local minima.

Figure 2.9: Schematic illustration of similarity-based machine learning methods. The model (or often a kernel as part of the model, here indicated by k) is ‘directly’ constructed from the dataset D and used for prediction.

Definition 3 based on a model function that is linear in the parameters is a rather simple convex quadratic optimisation problem with a closed-form solution. For general nonconvex problems much less is known, and many machine learning problems fall into this category. Popular methods are therefore iterative searches such as gradient or steepest descent, which perform a stepwise search for the minimum.

Box 2.2.1: Gradient descent method

In gradient descent methods the parametersθ of an objective function o(θ) are successively updated according to

θ^{(t+1)} = θ^{(t)} − η ∇o(θ^{(t)}), (2.4)

where η is an external parameter called the learning rate. The gradient ∇o(θ^{(t)}) always points in the direction of ascent in the landscape of o, and following its negative means descending into valleys. As one can imagine, this method can get stuck in local minima if they exist, and convergence to a minimum can take a long time. The advantage is its simplicity and applicability in many settings. Note that there are many variations of gradient descent that improve on the basic update. An important variation is stochastic gradient descent, where small alternating batches of the training set are evaluated in the objective function. Another improvement comes from a step-dependent learning rate η(t).
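The update rule (2.4) can be sketched for the least-squares objective (2.2); the toy data, learning rate and iteration count below are illustrative choices, not prescriptions from the text.

```python
import numpy as np

# Minimal sketch of the gradient descent update (2.4) applied to the
# least-squares objective o(w) = sum_m |w^T x_m - y_m|^2.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))        # training inputs x_m
w_true = np.array([0.5, -1.0])
y = X @ w_true                      # targets y_m

w = np.zeros(2)                     # initial parameters theta^(0)
eta = 0.005                         # learning rate
for t in range(500):
    grad = 2 * X.T @ (X @ w - y)    # gradient of the squared error
    w = w - eta * grad              # theta^(t+1) = theta^(t) - eta * grad

print(np.allclose(w, w_true))       # iterates converge to the minimiser
```

Because this objective is convex and quadratic, the iterates converge to the global minimum for a small enough learning rate; for a nonconvex objective the same loop could instead settle in a local minimum.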

2.2.1.2 Nonparametric model functions

Nonparametric model functions do not depend on a fixed number of variables. I will particularly look at so-called “kernel” methods, which use the data points themselves for classification. The more data, the more flexible the model becomes. An important idea behind kernel methods is that “similar inputs have similar outputs” [18], and at their heart is usually a similarity measure on the input space X. A simple example is the k-nearest neighbour method: given a new input, the prediction is the average or majority output amongst the k closest training inputs, where similarity can be defined by the Euclidean distance.
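A minimal sketch of the k-nearest neighbour method just described, with Euclidean distance as the similarity measure; the two toy clusters and the choice k = 3 are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of k-nearest-neighbour classification with Euclidean
# distance; the toy data below (two well-separated clusters) is illustrative.
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

def knn_predict(x_new, k=3):
    # Distances from the new input to every training input.
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]        # indices of the k closest points
    votes = y_train[nearest]
    return np.bincount(votes).argmax()     # majority vote

print(knn_predict(np.array([0.5, 0.5])))   # near the class-0 cluster
print(knn_predict(np.array([5.5, 5.5])))   # near the class-1 cluster
```

Note that no parameters are fitted: the training set itself is the model, which is what makes the method nonparametric.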

Parametric models can sometimes be turned into nonparametric ones. For example, for a (parametric) linear model of the form f(x; w) = w^T x one can assume the weight vector to be a linear combination of training inputs, w = Σ_m α_m x_m, and obtain the (nonparametric) classifier f(x; α) = Σ_m α_m (x_m)^T x. Typically, inner products between the training inputs and the new input appear. The so-called kernel trick allows us to give such models a lot of power and flexibility.
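The equivalence between the primal form w^T x and the dual form Σ_m α_m (x_m)^T x can be checked numerically; the expansion coefficients α_m below are arbitrary rather than learned, since only the algebraic identity is being illustrated.

```python
import numpy as np

# Sketch of the dual formulation: a linear model f(x; w) = w^T x with the
# weights expanded over training inputs, w = sum_m alpha_m x_m.
rng = np.random.default_rng(2)
X_train = rng.normal(size=(5, 3))      # training inputs x_m
alpha = rng.normal(size=5)             # expansion coefficients alpha_m
x_new = rng.normal(size=3)             # a new input x

# Primal evaluation: build w explicitly, then take w^T x.
w = X_train.T @ alpha
f_primal = w @ x_new

# Dual evaluation: only inner products (x_m)^T x appear.
f_dual = alpha @ (X_train @ x_new)

print(np.allclose(f_primal, f_dual))   # both formulations agree
```

The dual evaluation never forms w; it only needs inner products between the new input and the training inputs, which is exactly the place where the kernel trick can later be applied.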

Box 2.2.2: The kernel trick

In machine learning, kernels are defined as follows:

Definition 4. Kernel. A kernel is a bivariate function κ : X × X → R such that for any set {x_1, ..., x_N} ⊂ X the matrix K, called the kernel or Gram matrix, with entries

K_ij = κ(x_i, x_j), x_i, x_j ∈ X, (2.5)

is positive (semi-)definite. As a consequence, κ(x_i, x_i) ≥ 0 and κ(x_i, x_j) = κ(x_j, x_i).

The scalar product between two real vectors is a kernel.
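This claim can be checked numerically: for the scalar-product kernel, the Gram matrix of Definition 4 is symmetric and positive semi-definite for any choice of points. The sample points below are arbitrary.

```python
import numpy as np

# Sketch illustrating Definition 4 for the scalar-product kernel
# kappa(x_i, x_j) = x_i^T x_j. The sample points are arbitrary.
rng = np.random.default_rng(3)
X = rng.normal(size=(6, 4))          # six inputs x_i in R^4

K = X @ X.T                          # Gram matrix K_ij = kappa(x_i, x_j)

eigvals = np.linalg.eigvalsh(K)
print(np.allclose(K, K.T))           # kappa(x_i, x_j) = kappa(x_j, x_i)
print(np.all(eigvals >= -1e-10))     # eigenvalues non-negative, i.e. PSD
```

The small negative tolerance absorbs floating-point round-off; analytically K = XX^T is positive semi-definite because v^T K v = |X^T v|² ≥ 0 for every v.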

The kernel trick can be expressed as follows:

Given an algorithm which is formulated in terms of a positive definite kernel κ, one can construct an alternative algorithm by replacing κ by another positive definite kernel κ′. [28]

This becomes interesting in light of Mercer’s theorem, which states that every kernel can be expressed as a scalar product,

κ(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩,

where φ : X → X′ maps the input vectors into a (usually higher-dimensional) feature space on which a scalar product is defined [29].

The consequences are important. For every positive semi-definite kernel κ(x_i, x_j) there is a feature map (see Definition 2) so that κ(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩. In other words, given a machine learning model in which the inputs appear in the form of a kernel (for example a simple dot product between the input and the training inputs), replacing the kernel by another kernel can effectively implement a nonlinear feature map into a higher-dimensional space. In this higher-dimensional space the data can often be processed with much simpler models: while a dataset of two concentric circles is impossible to separate by a linear decision boundary, the feature map φ((x_1, x_2)^T) = (x_1, x_2, 0.5(x_1² + x_2²))^T transforms it into a linearly separable dataset.

[Figure: the concentric-circles dataset plotted in the (x1, x2) plane, and its image under the feature map φ in (x1, x2, x3) space, where it becomes linearly separable.]
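The concentric-circles example can be verified numerically; the radii, the number of points and the separating threshold below are illustrative assumptions.

```python
import numpy as np

# Sketch of the feature map phi((x1, x2)^T) = (x1, x2, 0.5*(x1^2 + x2^2))^T
# applied to two concentric circles; the radii are illustrative.
angles = np.linspace(0, 2 * np.pi, 40, endpoint=False)
inner = 0.5 * np.stack([np.cos(angles), np.sin(angles)], axis=1)  # radius 0.5
outer = 1.5 * np.stack([np.cos(angles), np.sin(angles)], axis=1)  # radius 1.5

def phi(X):
    # Append the third feature x3 = 0.5*(x1^2 + x2^2) to each point.
    x3 = 0.5 * (X[:, 0] ** 2 + X[:, 1] ** 2)
    return np.column_stack([X, x3])

# In feature space the third coordinate equals 0.5*r^2, so the plane
# x3 = 0.5 separates the two circles linearly.
print(np.all(phi(inner)[:, 2] < 0.5) and np.all(phi(outer)[:, 2] > 0.5))
```

For the inner circle x3 = 0.5 · 0.5² = 0.125 and for the outer circle x3 = 0.5 · 1.5² = 1.125, so a single hyperplane in feature space classifies every point correctly.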

Note that the feature map can also map into spaces of infinite dimension. The crux is that one never has to calculate the scalar product in this space, but simply evaluates the kernel function on the original inputs.

As an example, take the squared exponential kernel function with x_i, x_j ∈ R^N and use the series expansion of the exponential function to get

κ(x_i, x_j) = e^{−½|x_i − x_j|²} = Σ_{k=0}^{∞} (x_i^T x_j)^k / k! · e^{−½|x_i|²} e^{−½|x_j|²} = ⟨φ(x_i), φ(x_j)⟩.

The squared exponential kernel effectively implements a feature map φ(x) = (e^{−½|x|²}, x e^{−½|x|²}, x² e^{−½|x|²}, ...)^T into an infinite-dimensional space.
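The series identity above can be checked numerically by truncating the infinite sum; the sample points and the truncation order are illustrative choices.

```python
import math

import numpy as np

# Numerical check of the series expansion of the squared exponential kernel:
# exp(-|xi-xj|^2/2) = sum_k (xi.xj)^k / k! * exp(-|xi|^2/2) * exp(-|xj|^2/2).
xi = np.array([0.3, -0.2, 0.1])
xj = np.array([0.1, 0.4, -0.3])

kernel = np.exp(-0.5 * np.sum((xi - xj) ** 2))

# Truncate the series after 20 terms; for small xi.xj this already
# converges far below machine precision.
dot = xi @ xj
series = sum(dot ** k / math.factorial(k) for k in range(20))
approx = series * np.exp(-0.5 * xi @ xi) * np.exp(-0.5 * xj @ xj)

print(np.allclose(kernel, approx))   # truncated series matches the kernel
```

The check works because |x_i − x_j|² = |x_i|² + |x_j|² − 2 x_i^T x_j, so the kernel factorises into the two norm terms times e^{x_i^T x_j}, whose Taylor series the loop truncates.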

Excellent introductions to kernel methods are given by Refs. [30, 28].

A number of kernel methods will be introduced in the next Section.