2.2 Different types of models
2.2.2 Probabilistic approach
Note that the feature map can also map into spaces of infinite dimension. The crux is that one never has to evaluate the scalar product in that space explicitly; one simply evaluates the kernel function on the original inputs.
As an example, take the squared exponential kernel function with $x_i, x_j \in \mathbb{R}^N$ and use the series expansion of the exponential function to get
$$\kappa(x_i, x_j) = e^{-\frac{1}{2}|x_i - x_j|^2} = \sum_{k=0}^{\infty} \frac{(x_i^T x_j)^k}{k!}\, e^{-\frac{1}{2}|x_i|^2} e^{-\frac{1}{2}|x_j|^2} = \langle \phi(x_i), \phi(x_j) \rangle.$$
The squared exponential kernel effectively implements a feature map into an infinite-dimensional space; for one-dimensional inputs it reads $\phi(x) = e^{-\frac{1}{2}|x|^2}\, \left(1, \tfrac{x}{\sqrt{1!}}, \tfrac{x^2}{\sqrt{2!}}, \dots\right)^T$.
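To make the kernel trick tangible, the following sketch (a minimal Python example under the assumption of one-dimensional inputs; the truncation of the infinite feature map after 30 terms is chosen purely for illustration) compares the kernel evaluated directly on the inputs with the explicit inner product of the truncated feature vectors:

```python
import numpy as np
from math import factorial

def sq_exp_kernel(x1, x2):
    # kappa(x1, x2) = exp(-|x1 - x2|^2 / 2), evaluated on the original inputs
    return np.exp(-0.5 * (x1 - x2) ** 2)

def feature_map(x, dim=30):
    # truncated phi(x) = exp(-x^2 / 2) * (1, x/sqrt(1!), x^2/sqrt(2!), ...)
    return np.array([np.exp(-0.5 * x ** 2) * x ** k / np.sqrt(factorial(k))
                     for k in range(dim)])

x1, x2 = 0.7, -1.2
print(sq_exp_kernel(x1, x2))              # implicit: one kernel evaluation
print(feature_map(x1) @ feature_map(x2))  # explicit: inner product in feature space
```

Both numbers agree up to truncation error, while the kernel evaluation never touches the (in principle infinite-dimensional) feature space.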
Excellent introductions to kernel methods are given by Refs. [30, 28].
A number of kernel methods will be introduced in the next Section.
Figure 2.10: Probabilistic models treat inputs and outputs as random variables drawn from a probability distribution. To use the distribution for prediction, one derives the marginalised distribution over the outputs given the input (here indicated by the black line intersecting with the distribution) and estimates its maximum. This is of course also possible for discrete inputs and outputs, where the distribution can be displayed in a table listing all possible combinations of the random variables.
the class-conditional distribution via Bayes' theorem. Even though generative models contain much more information, and it may seem that they require far more resources to be derived from data, it is by no means clear which type of model is better; as so often, this depends on the task at hand [32].
2.2.2.1 Parametric probability distributions
Similar to the deterministic case, probability distributions can of course be parametrised (indicated here by $p(x, y; \theta)$ for the joint model distribution and $p(y|x; \theta)$ for the class-conditional model distribution). Examples of parametrised distributions in one dimension are the normal or Gaussian distribution, with $\theta$ consisting of the mean $\mu$ and variance $\sigma^2$,
$$p(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}},$$
as well as the Bernoulli distribution for binary numbers, with $\theta = q$, $0 \leq q \leq 1$,
$$p(x; q) = q^x (1-q)^{1-x}.$$
Like in the deterministic case, learning reduces to estimating a set of parameters $\theta$ which fully determine the model distribution.
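For illustration, a short sketch (in Python; the parameter values are arbitrary and chosen only for this example) shows that once $\theta$ is fixed, the two densities above can be evaluated at any point:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    # p(x; mu, sigma) = 1 / (sqrt(2*pi) * sigma) * exp(-(x - mu)^2 / (2 * sigma^2))
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def bernoulli_pmf(x, q):
    # p(x; q) = q^x * (1 - q)^(1 - x), for x in {0, 1}
    return q ** x * (1 - q) ** (1 - x)

print(gaussian_pdf(0.5, mu=0.0, sigma=1.0))  # density of a standard normal at x = 0.5
print(bernoulli_pmf(1, q=0.3))               # probability of x = 1 under a Bernoulli(0.3)
```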
In statistics, estimating the parameters of a probabilistic mathematical model given a dataset is often done through maximum likelihood estimation. In fact, least squares can be shown to be the maximum likelihood solution under the assumption of Gaussian noise [12]. The underlying idea is to find parameters $\theta$ so that there is a high probability that the data have been drawn from the distribution $p(x, y; \theta)$. This logic has already been encountered in Example 2.1.4.
Definition 5. Maximum likelihood estimation
Given a probability distribution $p: \mathcal{X} \times \mathcal{Y} \times \Theta \to [0,1]$ with suitable input, output and parameter domains $\mathcal{X}, \mathcal{Y}, \Theta$, as well as a dataset $\mathcal{D}$ of tuples $(x^{(m)}, y^{(m)}) \in \mathcal{X} \times \mathcal{Y}$. Find
$$\hat{\theta} = \arg\max_{\theta} \prod_{m=1}^{M} p(x^{(m)}, y^{(m)}; \theta). \qquad (2.8)$$
It is standard practice to insert a logarithm into Eq. (2.8): since the logarithm is monotonic it does not change the solution of the optimisation task, but it turns the product into a sum, which has favourable numerical and analytical properties.
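As a concrete illustration of Definition 5 (a minimal Python sketch, not taken from the text: the synthetic dataset, the choice of a Gaussian model distribution and the use of SciPy's general-purpose optimiser are assumptions made only for this example), the parameters of a Gaussian can be estimated by minimising the negative log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=0.5, size=200)   # synthetic one-dimensional data

def neg_log_likelihood(theta, x):
    # theta = (mu, log_sigma); optimising log(sigma) keeps sigma positive
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)
    # minus the logarithm of prod_m p(x_m; mu, sigma) for the Gaussian density
    return np.sum(0.5 * ((x - mu) / sigma) ** 2 + np.log(sigma) + 0.5 * np.log(2 * np.pi))

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]), args=(data,))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)   # close to the sample mean and standard deviation
```

For the Gaussian the maximum likelihood solution is of course available in closed form (the sample mean and variance); the numerical route merely spells out the general recipe.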
The rule for maximum likelihood estimation can be derived beautifully using the framework of Bayesian probability calculus. Bayes’ famous rule derived from probability theory is given by
$$p(a|b) = \frac{p(b|a)\, p(a)}{p(b)}, \qquad (2.9)$$
where $a, b$ are values of (sets of) random variables $A, B$, and $p(a|b)$ is the conditional probability of $a$ given $b$, defined as
$$p(a|b) = \frac{p(a, b)}{p(b)}.$$
The term p(b|a) in Eq. (2.9) is called the likelihood, p(a) is the prior and p(a|b) the posterior.
The application of Bayesian probability theory to machine learning problems is called ‘Bayesian learning’ [33, 13]. Box 2.2.3 briefly introduces the logic behind Bayesian learning, not only to show how maximum likelihood optimisation arises, but also because it is an important foundation for later chapters.
Box 2.2.3: Bayesian Learning
Given a training dataset $\mathcal{D}$ as before and understanding $x = (x_1, \dots, x_N)$ as well as $y$ as random variables, we want to find the probability density $p(x, y|\mathcal{D})$. In parametrised methods, the distribution $p(x, y)$ is fully determined by a set of parameters $\theta$, and formally one can write
$$p(x, y|\mathcal{D}) = \int p(x, y; \theta|\mathcal{D})\, d\theta = \int p(x, y|\theta)\, p(\theta|\mathcal{D})\, d\theta. \qquad (2.10)$$
The first part of the integrand, $p(x, y|\theta)$, is given by the model distribution $p(x, y; \theta)$ that one assumes. We therefore need to find $p(\theta|\mathcal{D})$. Using Bayes’ formula reveals
$$p(\theta|\mathcal{D}) = \frac{p(\mathcal{D}|\theta)\, p(\theta)}{\int p(\mathcal{D}|\theta)\, p(\theta)\, d\theta}. \qquad (2.11)$$
The prior $p(\theta)$ describes our knowledge of which parameters lead to the best model before seeing the data. For example, one could find it more likely that a coin is fair than that it always lands on one side. If nothing is known, the prior can be chosen to be uniform. The likelihood $p(\mathcal{D}|\theta)$ is the probability of seeing the data given certain parameters. Together with the normalisation factor $p(\mathcal{D}) = \int p(\mathcal{D}|\theta)\, p(\theta)\, d\theta$ we get the posterior $p(\theta|\mathcal{D})$, which is the probability of the parameters $\theta$ after seeing the data.

Figure 2.11: A histogram derived from one-dimensional inputs of two different classes (blue triangles and red circles) can be used to derive a coarse-grained class-conditional probability distribution over the data.
Equation (2.11) shifts the problem to finding the distribution over the data given the parameters.
Under the assumption that the data samples are drawn independently, one can factorise the distribution and write
$$p(\mathcal{D}|\theta) = \prod_{m} p(x^{(m)}, y^{(m)}|\theta).$$
To choose the most likely parameters given the data, we have to maximise this expression over $\theta$ (for a uniform prior this is equivalent to maximising the posterior), which is exactly the maximum likelihood optimisation problem in Definition 5.
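To make the box concrete, the following sketch (a minimal Python example, not from the text; the coin-flip data and the grid discretisation of the Bernoulli parameter are assumptions made purely for illustration) evaluates prior times likelihood on a grid and normalises it to obtain the posterior of Eq. (2.11):

```python
import numpy as np

# observed coin flips (1 = heads); the data are made up for the example
data = np.array([1, 1, 0, 1, 1, 1, 0, 1])

# discretise the Bernoulli parameter q on a grid
q = np.linspace(0.001, 0.999, 999)

prior = np.ones_like(q) / len(q)   # uniform prior p(q)
# the likelihood p(D|q) factorises over the independent coin flips
likelihood = np.prod(q[:, None] ** data * (1 - q[:, None]) ** (1 - data), axis=1)

# Bayes' rule: posterior proportional to likelihood times prior, then normalise
posterior = likelihood * prior
posterior /= posterior.sum()

print(q[np.argmax(likelihood)])    # maximum likelihood estimate of q (here 6/8)
print(np.sum(q * posterior))       # posterior mean of q
```

With the uniform prior, the maximum of the posterior coincides with the maximum likelihood estimate, as stated above.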
An interesting consideration about the computational complexity arises from Bayesian learning.
If one could compute the full integral in Equation (2.10), one would not have to do maximum likelihood optimisation [33]. Computing the integral can easily become prohibitive. It seems that optimisation and integration are two sides of the same coin, and both are hard problems to solve when no additional structure is given.
2.2.2.2 Nonparametric probability distributions
Nonparametric statistical models are probability distributions that do not depend on a fixed number of parameters, but on the data themselves. A very simple way to construct nonparametric probability distributions from data is the histogram. As shown in Figure 2.11, one simply counts the number of data points falling into a certain interval $\Delta x_i$ of the input space. The probability of that interval is then the number of data points it contains, $M_i$, divided by the total number of data points,
$$p(\Delta x_i) = \frac{M_i}{M}.$$
The smooth version of this is called a kernel density estimator and will be introduced below. Of course, kernel tricks are also applicable to statistical models.
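As a small illustration of the histogram estimate (a minimal Python sketch, not from the text; the synthetic two-class data and the choice of 20 bins are assumptions made for this example, loosely mimicking Figure 2.11):

```python
import numpy as np

rng = np.random.default_rng(1)
# synthetic one-dimensional inputs for two classes, similar in spirit to Figure 2.11
x_class_a = rng.normal(-2.0, 1.5, size=300)
x_class_b = rng.normal(3.0, 1.0, size=300)

bins = np.linspace(-10, 10, 21)   # interval boundaries Delta x_i of the input space

def histogram_distribution(x, bins):
    # p(Delta x_i) = M_i / M: counts per interval divided by the total number of samples
    counts, _ = np.histogram(x, bins=bins)
    return counts / counts.sum()

p_a = histogram_distribution(x_class_a, bins)
p_b = histogram_distribution(x_class_b, bins)
print(p_a.sum(), p_b.sum())       # both class-conditional estimates normalise to 1
```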