3.3 Classification – Probabilistic Discriminative Model with Maximum Likelihood Estimate (MLE)
The advantage of the probabilistic generative model is that we can create (generate) synthetic input values $\mathbf{x}$ by sampling from the marginal distribution $p(\mathbf{x})$. However, the predictive performance may decrease, especially when the Gaussian form we used to model the class-conditional densities does not give a good representation. In this section, we compute the parameter values in a more direct way, by maximizing the likelihood function or the posterior probability density function (PDF). By not modeling the class-conditional densities explicitly, we have fewer parameters to determine, and this may lead to an increase in predictive performance. Directly determining the parameters in this way is an example of a probabilistic discriminative approach.
The likelihood function we want to maximize to determine the parameters consists of the conditional distributions introduced earlier, $p(C_k \mid \mathbf{x})$. We first relabel the variables and then simplify as much as possible to avoid clutter in the mathematical expressions. In the previous section, we obtained the functional form of the posterior class probability, conditional on an input vector, as (3.4). Using this definition, let us define

$$y_k(\mathbf{x}) = p(C_k \mid \mathbf{x}, \tilde{\mathbf{w}}_k) = \frac{\exp\!\big(a_k(\tilde{\mathbf{x}}, \tilde{\mathbf{w}}_k)\big)}{\sum_j \exp\!\big(a_j(\tilde{\mathbf{x}}, \tilde{\mathbf{w}}_j)\big)} \qquad (3.29)$$

where the $a_k$ are called activations and are given by
$$a_k(\tilde{\mathbf{x}}, \tilde{\mathbf{w}}_k) = \tilde{\mathbf{w}}_k^T \tilde{\mathbf{x}} \qquad (3.30)$$

Let us now clarify the terms that involve the tilde: in the expressions above, $\tilde{\mathbf{w}}_k = (w_{k0}, \mathbf{w}_k^T)^T$ and $\tilde{\mathbf{x}} = (1, \mathbf{x}^T)^T$; that is, we augment the input vector with a dummy input $x_0 = 1$, similar to what we did in the least squares classification in Appendix C. To reduce the clutter in the mathematical notation, let us redefine the parameters ($\tilde{\mathbf{w}} \to \mathbf{w}$ and $\tilde{\mathbf{x}} \to \mathbf{x}$) such that $\mathbf{w}_k = (w_{k0}, w_{k1}, \ldots, w_{kD})^T$ and $\mathbf{x} = (1, x_1, x_2, \ldots, x_D)^T$. Then, we consider maximization of the likelihood function to determine the parameters $\{\mathbf{w}_k\}$ directly.
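To make (3.29)–(3.30) concrete, the following sketch (in Python/NumPy, purely illustrative and not part of the original analysis) evaluates the class posteriors for a batch of inputs after augmenting them with the dummy input $x_0 = 1$; the function and variable names are assumptions chosen for the example.

```python
import numpy as np

def class_posteriors(X, W):
    """Evaluate y_k(x) in (3.29) for every row of X.

    X : (N, D) array of raw inputs (before augmentation).
    W : (D + 1, K) array whose k-th column is w_k = (w_k0, w_k1, ..., w_kD)^T.
    Returns an (N, K) array of posterior class probabilities.
    """
    N = X.shape[0]
    X_aug = np.hstack([np.ones((N, 1)), X])        # prepend the dummy input x0 = 1
    A = X_aug @ W                                  # activations a_k = w_k^T x, eq. (3.30)
    A -= A.max(axis=1, keepdims=True)              # shift by the row max; softmax is unchanged
    expA = np.exp(A)
    return expA / expA.sum(axis=1, keepdims=True)  # softmax over the K classes
```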
Now, we need the likelihood function. As we mentioned above, it consists of the posterior class probabilities $p(C_k \mid \mathbf{x})$ if the prior on the $C_k$'s is uniform. We will follow the same 1-of-$K$ coding scheme as we did above for the target vectors: the target vector $\mathbf{t}_n$ associated with the input vector $\mathbf{x}_n$, which is assigned to class $C_k$, will be a unit vector of dimension $K = 3$ with each of its elements being zero except the $k$th element, which is one. Then, we obtain the likelihood function as
$$p(\mathbf{T} \mid \mathbf{X}, \mathbf{W}) = \prod_{n=1}^{N} \prod_{k=1}^{K} p(C_k \mid \mathbf{x}_n, \mathbf{w}_k)^{t_{nk}} = \prod_{n=1}^{N} \prod_{k=1}^{K} y_{nk}^{t_{nk}} \qquad (3.31)$$

where the elements $t_{nk}$ form the matrix $\mathbf{T}$, whose dimension is $N \times K$ with $N$ the number of data points and $K$ the number of classes; $\mathbf{W}$ is formed by the $(D+1)$-dimensional vector $\mathbf{w}_k$ as its $k$th column; and $\mathbf{X}$ is formed by the $(D+1)$-dimensional vector $\mathbf{x}_n^T$ as its $n$th row. So, $\mathbf{W}$ and $\mathbf{X}$ are matrices with dimensions $(D+1) \times K$ and $N \times (D+1)$, respectively. We also have
$$y_{nk} = y_k(\mathbf{x}_n) = p(C_k \mid \mathbf{x}_n, \mathbf{w}_k) = \frac{\exp(\mathbf{w}_k^T \mathbf{x}_n)}{\sum_j \exp(\mathbf{w}_j^T \mathbf{x}_n)} \in [0, 1] \qquad (3.32)$$

Before we evaluate the probabilistic discriminative model from a Bayesian perspective, let us use the maximum likelihood method to find $\mathbf{W}_{MLE}$ by maximizing the likelihood function given by (3.31). Note that the value we will find is in fact the Bayesian maximum a posteriori (MAP) value, but since we take a flat (non-informative) prior for $\mathbf{W}$, MAP $\equiv$ MLE. We solved this optimization problem with an algorithm provided by Matlab.
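The optimization itself was carried out in Matlab; as an illustrative sketch only (not the routine actually used), the Python code below performs the same maximization by minimizing the negative log-likelihood of (3.31), $E(\mathbf{W}) = -\sum_n \sum_k t_{nk} \ln y_{nk}$, with `scipy.optimize.minimize`, supplying the standard cross-entropy gradient $\nabla_{\mathbf{w}_j} E = \sum_n (y_{nj} - t_{nj})\,\mathbf{x}_n$. The function name and the choice of BFGS are assumptions for this sketch.

```python
import numpy as np
from scipy.optimize import minimize

def fit_softmax_mle(X_aug, T):
    """Maximize the likelihood (3.31) for W given augmented inputs and 1-of-K targets.

    X_aug : (N, D + 1) array, each row is (1, x1, ..., xD).
    T     : (N, K) 1-of-K target matrix.
    Returns W_mle as a (D + 1, K) array whose k-th column is w_k.
    """
    N, D1 = X_aug.shape
    K = T.shape[1]

    def neg_log_likelihood_and_grad(w_flat):
        W = w_flat.reshape(D1, K)
        A = X_aug @ W
        A -= A.max(axis=1, keepdims=True)           # numerical stability
        Y = np.exp(A)
        Y /= Y.sum(axis=1, keepdims=True)           # y_nk, eq. (3.32)
        nll = -np.sum(T * np.log(Y + 1e-12))        # -ln p(T | X, W), from eq. (3.31)
        grad = X_aug.T @ (Y - T)                    # cross-entropy gradient w.r.t. W
        return nll, grad.ravel()

    w0 = np.zeros(D1 * K)                           # start from W = 0
    res = minimize(neg_log_likelihood_and_grad, w0, jac=True, method="BFGS")
    return res.x.reshape(D1, K)
```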
We maximized (3.31) with respect to $\mathbf{W}$ separately for the acceleration, velocity, and displacement inputs and obtained the following confusion matrices by assigning an input vector $\mathbf{x}$ to the class $C_k$ for which $p(C_k \mid \mathbf{x}, \mathbf{w}_k)$ is maximum over $k = 1, 2, 3$. As with the previous confusion matrices, we show the predictive performance of our models by cross-validation: we divide each data set into two, a training set and a validation set; we then swap these data sets and average the predictive performances in the form of confusion matrices (a short illustrative sketch of this procedure follows Table 3.4). Let us start with the acceleration results. The maximum likelihood estimate (MLE) of the parameter matrix, $\mathbf{W}_{MLE}^{\text{acceleration}}$, computed using the entire acceleration data set for training, is given by
$$\mathbf{W}_{MLE}^{\text{acceleration}} = \begin{bmatrix} 2.899 & -1.263 & -1.606 \\ -0.003 & 0.018 & 0.003 \\ 0.002 & 0.013 & -0.005 \\ -0.002 & -0.118 & 0.075 \\ -0.008 & -0.067 & 0.117 \end{bmatrix} \qquad (3.33)$$
Table 3.4: Confusion matrices for probabilistic discriminative classification with MLE using acceleration data (entries in %; rows are actual classes, columns are predicted classes).

ALL DATA – Acceleration
Actual \ Predicted     Okay    Over    Under
Okay                   95.2     2.0      2.8
Over                   16.8    83.2      0.0
Under                  21.2     0.0     78.8

FIRST HALF – Acceleration
Actual \ Predicted     Okay    Over    Under
Okay                   88.4     1.4     10.2
Over                   12.0    88.0      0.0
Under                   9.6     0.0     90.4

SECOND HALF – Acceleration
Actual \ Predicted     Okay    Over    Under
Okay                   97.2     1.8      1.0
Over                   37.6    62.4      0.0
Under                  39.2     0.0     60.8

AVERAGE OF CROSS VALIDATIONS – Acceleration
Actual \ Predicted     Okay    Over    Under
Okay                   92.8     1.6      5.6
Over                   24.8    75.2      0.0
Under                  24.4     0.0     75.6
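As a brief illustration of the class-assignment rule and the two-fold cross-validation described above (train on one half, validate on the other, swap, and average), the hypothetical helpers below compute a row-normalized confusion matrix in percent; the names, and the reuse of `fit_softmax_mle` from the earlier sketch, are assumptions for illustration only.

```python
import numpy as np

def predict_class(X_aug, W):
    """Assign each input to the class C_k with the largest posterior p(C_k | x, w_k)."""
    return np.argmax(X_aug @ W, axis=1)            # argmax of activations = argmax of softmax

def confusion_matrix_percent(labels_true, labels_pred, K=3):
    """Row-normalized confusion matrix in percent (rows: actual class, columns: predicted class)."""
    C = np.zeros((K, K))
    for t, p in zip(labels_true, labels_pred):
        C[t, p] += 1.0
    return 100.0 * C / C.sum(axis=1, keepdims=True)

# Two-fold cross-validation as in the tables (hypothetical arrays X1_aug, T1, X2_aug, T2):
#   W_12 = fit_softmax_mle(X1_aug, T1)   # train on one half
#   C_12 = confusion_matrix_percent(T2.argmax(axis=1), predict_class(X2_aug, W_12))
#   W_21 = fit_softmax_mle(X2_aug, T2)   # train on the other half
#   C_21 = confusion_matrix_percent(T1.argmax(axis=1), predict_class(X1_aug, W_21))
#   C_avg = 0.5 * (C_12 + C_21)          # the "average of cross validations" block
```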
$\mathbf{W}_{MLE}^{\text{velocity}}$, computed using the entire velocity data set for training, is given by

$$\mathbf{W}_{MLE}^{\text{velocity}} = \begin{bmatrix} 2.743 & -1.150 & -1.563 \\ -0.001 & 0.028 & 0.003 \\ 0.001 & 0.011 & -0.006 \\ 0.022 & -0.184 & 0.114 \\ -0.024 & -0.039 & 0.091 \end{bmatrix} \qquad (3.34)$$
Table 3.5: Confusion matrices for probabilistic discriminative classification with MLE using velocity data (entries in %; rows are actual classes, columns are predicted classes).

ALL DATA – Velocity
Actual \ Predicted     Okay    Over    Under
Okay                   95.3     1.9      2.8
Over                   19.2    80.8      0.0
Under                  22.8     0.0     77.2

FIRST HALF – Velocity
Actual \ Predicted     Okay    Over    Under
Okay                   87.6     1.8     10.6
Over                   14.4    85.6      0.0
Under                   9.6     0.0     90.4

SECOND HALF – Velocity
Actual \ Predicted     Okay    Over    Under
Okay                   96.0     1.6      2.4
Over                   46.4    53.6      0.0
Under                  44.0     0.0     56.0

AVERAGE OF CROSS VALIDATIONS – Velocity
Actual \ Predicted     Okay    Over    Under
Okay                   91.8     1.7      6.5
Over                   30.4    69.6      0.0
Under                  26.8     0.0     73.2
$\mathbf{W}_{MLE}^{\text{displacement}}$, computed using the entire displacement data set for training, is given by

$$\mathbf{W}_{MLE}^{\text{displacement}} = \begin{bmatrix} 2.680 & -0.910 & -1.757 \\ 0.0002 & 0.052 & -0.015 \\ 0.0003 & 0.012 & -0.020 \\ 0.024 & -0.294 & 0.264 \\ -0.056 & -0.059 & 0.130 \end{bmatrix} \qquad (3.35)$$
Table 3.6: Confusion matrices for probabilistic discriminative classification with MLE using displacement data (entries in %; rows are actual classes, columns are predicted classes).

ALL DATA – Displacement
Actual \ Predicted     Okay    Over    Under
Okay                   94.3     2.8      2.9
Over                   30.8    69.2      0.0
Under                  22.4     0.0     77.6

FIRST HALF – Displacement
Actual \ Predicted     Okay    Over    Under
Okay                   91.6     2.6      5.8
Over                   24.8    75.2      0.0
Under                  10.4     0.8     88.8

SECOND HALF – Displacement
Actual \ Predicted     Okay    Over    Under
Okay                   94.4     2.2      3.4
Over                   50.4    49.6      0.0
Under                  49.6     0.0     50.4

AVERAGE OF CROSS VALIDATIONS – Displacement
Actual \ Predicted     Okay    Over    Under
Okay                   93.0     2.4      4.6
Over                   37.6    62.4      0.0
Under                  30.0     0.4     69.6