A Tutorial on Data Reduction
Linear Discriminant Analysis (LDA)
Shireen Elhabian and Aly A. Farag
University of Louisville, CVIP Lab
September 2009
Outline
• LDA objective
• Recall … PCA
• Now … LDA
• LDA … Two Classes
– Counter example
• LDA … C Classes
– Illustrative Example
• LDA vs PCA Example
• Limitations of LDA
LDA Objective
• The objective of LDA is to perform dimensionality reduction …
– So what, PCA does this …
• However, we want to preserve as much of the class discriminatory information as possible.
– OK, that’s new, let’s dwell deeper …
Recall … PCA
• In PCA, the main idea is to re-express the available dataset so as to extract the relevant information, by reducing the redundancy and minimizing the noise.
• We didn’t care whether this dataset represents features from one or more classes, i.e. the discrimination power was not taken into consideration while we were talking about PCA.
• In PCA, we had a dataset matrix X with dimensions m×n, where columns represent different data samples.
• We first started by subtracting the mean to have a zero-mean dataset, then we computed the covariance matrix $S_x = XX^T$.
• Eigenvalues and eigenvectors were then computed for $S_x$. The new basis vectors are those eigenvectors with the highest eigenvalues, where the number of those vectors was our choice.
• Thus, using the new basis, we can project the dataset onto a lower-dimensional space with a more powerful data representation (a minimal code sketch of these steps follows below).
[Figure: the data matrix X; its n columns are the feature vectors (data samples), each an m-dimensional data vector.]
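As a quick reference, here is a minimal NumPy sketch of the PCA steps recalled above (an illustrative sketch, not code from the original slides; variable names are ours):

```python
import numpy as np

def pca(X, k):
    """PCA as recalled above: X is m x n, columns are data samples."""
    mu = X.mean(axis=1, keepdims=True)   # m x 1 mean vector
    Xc = X - mu                          # zero-mean dataset
    Sx = Xc @ Xc.T                       # S_x = X X^T (covariance up to a scale factor)
    vals, vecs = np.linalg.eigh(Sx)      # eigenvalues/eigenvectors of the symmetric S_x
    order = np.argsort(vals)[::-1]       # sort by decreasing eigenvalue
    W = vecs[:, order[:k]]               # keep the k eigenvectors with highest eigenvalues
    return W.T @ Xc                      # project onto the k-dimensional basis (k x n)
```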
Now … LDA
• Consider a pattern classification problem, where we have C classes, e.g. seabass, tuna, salmon …
• Each class has $N_i$ m-dimensional samples, where i = 1, 2, …, C.
• Hence we have a set of m-dimensional samples $\{x_1, x_2, \dots, x_{N_i}\}$ belonging to class $\omega_i$.
• We stack these samples from the different classes into one big fat matrix X such that each column represents one sample.
• We seek to obtain a transformation of X to Y through projecting the samples in X onto a hyperplane with dimension C-1.
• Let’s see what this means.
LDA … Two Classes
• Assume we have m-dimensional samples $\{x_1, x_2, \dots, x_N\}$, $N_1$ of which belong to $\omega_1$ and $N_2$ of which belong to $\omega_2$.
• We seek to obtain a scalar y by projecting the samples x onto a line (C-1 space, with C = 2):
$$y = w^T x, \qquad \text{where} \quad x = \begin{bmatrix}x_1\\ \vdots\\ x_m\end{bmatrix} \quad \text{and} \quad w = \begin{bmatrix}w_1\\ \vdots\\ w_m\end{bmatrix}$$
– where w is the projection vector used to project x to y.
• Of all the possible lines we would like to select the one that maximizes the separability of the scalars.
[Figure: two candidate projection lines. The two classes are not well separated when projected onto the first line; the second line succeeds in separating the two classes and, in the meantime, reduces the dimensionality of our problem from two features (x1, x2) to only a scalar value y.]
LDA … Two Classes
• In order to find a good projection vector, we need to define a measure of separation between the projections.
• The mean vector of each class in the x-space and the y-space is:
$$\mu_i = \frac{1}{N_i}\sum_{x\in\omega_i} x \qquad \text{and} \qquad \tilde{\mu}_i = \frac{1}{N_i}\sum_{y\in\omega_i} y = \frac{1}{N_i}\sum_{x\in\omega_i} w^T x = w^T \mu_i$$
– i.e. projecting x to y will lead to projecting the mean of x to the mean of y.
• We could then choose the distance between the projected means as our objective function:
$$J(w) = \left|\tilde{\mu}_1 - \tilde{\mu}_2\right| = \left|w^T\mu_1 - w^T\mu_2\right| = \left|w^T\left(\mu_1 - \mu_2\right)\right|$$
LDA … Two Classes
• However, the distance between the projected means is not a very good measure since it does not take into account the standard deviation within the classes.
[Figure: two candidate projection axes: one axis has a larger distance between the projected means, but the other axis yields better class separability.]
LDA … Two Classes
• The solution proposed by Fisher is to maximize a function that represents the difference between the means, normalized by a measure of the within-class variability, or the so-called scatter.
• For each class we define the scatter, an equivalent of the variance, as
$$\tilde{s}_i^2 = \sum_{y\in\omega_i}\left(y - \tilde{\mu}_i\right)^2$$
(the sum of squared differences between the projected samples and their class mean).
• $\tilde{s}_i^2$ measures the variability within class $\omega_i$ after projecting it onto the y-space.
• Thus $\tilde{s}_1^2 + \tilde{s}_2^2$ measures the variability within the two classes at hand after projection; hence it is called the within-class scatter of the projected samples.
LDA … Two Classes
• The Fisher linear discriminant is defined as the linear function $w^T x$ that maximizes the criterion function (the distance between the projected means normalized by the within-class scatter of the projected samples):
$$J(w) = \frac{\left(\tilde{\mu}_1 - \tilde{\mu}_2\right)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$$
• Therefore, we will be looking for a projection where examples from the same class are projected very close to each other and, at the same time, the projected means are as far apart as possible.
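As a concrete illustration (a sketch of ours, not from the slides), the criterion can be evaluated directly from the projected scalars: project every sample with a candidate w, then form the projected means and scatters.

```python
import numpy as np

def fisher_score_from_projections(X1, X2, w):
    """Evaluate J(w) from the projected scalars y = w^T x.
    X1, X2: arrays of shape (N1, m) and (N2, m); w: 1-D array of length m."""
    y1, y2 = X1 @ w, X2 @ w                  # project every sample onto the line
    mu1_t, mu2_t = y1.mean(), y2.mean()      # projected class means
    s1_t = ((y1 - mu1_t) ** 2).sum()         # scatter of class 1 after projection
    s2_t = ((y2 - mu2_t) ** 2).sum()         # scatter of class 2 after projection
    return (mu1_t - mu2_t) ** 2 / (s1_t + s2_t)
```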
LDA … Two Classes
• In order to find the optimum projection w*, we need to express J(w) as an explicit function of w.
• We first define a measure of the scatter in the multivariate feature space x, the so-called scatter matrices:
$$S_i = \sum_{x\in\omega_i}\left(x - \mu_i\right)\left(x - \mu_i\right)^T \qquad \text{and} \qquad S_W = S_1 + S_2$$
• where $S_i$ is the covariance matrix of class $\omega_i$, and $S_W$ is called the within-class scatter matrix.
LDA … Two Classes
• Now, the scatter of the projection y can be expressed as a function of the scatter matrices in the feature space x:
$$\tilde{s}_i^2 = \sum_{y\in\omega_i}\left(y - \tilde{\mu}_i\right)^2 = \sum_{x\in\omega_i}\left(w^T x - w^T\mu_i\right)^2 = \sum_{x\in\omega_i} w^T\left(x - \mu_i\right)\left(x - \mu_i\right)^T w = w^T S_i\, w$$
$$\tilde{s}_1^2 + \tilde{s}_2^2 = w^T S_1 w + w^T S_2 w = w^T\left(S_1 + S_2\right) w = w^T S_W\, w = \tilde{S}_W$$
where $\tilde{S}_W$ is the within-class scatter matrix of the projected samples y.
LDA … Two Classes
• Similarly, the difference between the projected means (in y-space) can be expressed in terms of the means in the original feature space (x-space):
$$\left(\tilde{\mu}_1 - \tilde{\mu}_2\right)^2 = \left(w^T\mu_1 - w^T\mu_2\right)^2 = w^T\underbrace{\left(\mu_1 - \mu_2\right)\left(\mu_1 - \mu_2\right)^T}_{S_B}\, w = w^T S_B\, w = \tilde{S}_B$$
• The matrix $S_B$ is called the between-class scatter of the original samples/feature vectors, while $\tilde{S}_B$ is the between-class scatter of the projected samples y.
• Since $S_B$ is the outer product of two vectors, its rank is at most one.
LDA … Two Classes
• We can finally express the Fisher criterion in terms of $S_W$ and $S_B$ as:
• Hence J(w) is a measure of the difference between class means (encoded in the between-class scatter matrix) normalized by a measure of the within-class scatter matrix.
$$J(w) = \frac{\left(\tilde{\mu}_1 - \tilde{\mu}_2\right)^2}{\tilde{s}_1^2 + \tilde{s}_2^2} = \frac{w^T S_B\, w}{w^T S_W\, w}$$
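To make the matrix form concrete, here is a small NumPy sketch (ours, with illustrative names) that builds $S_W$ and $S_B$ from two labelled sample sets and evaluates $J(w)$; for any w it returns the same value as the scalar form above.

```python
import numpy as np

def fisher_criterion(X1, X2, w):
    """J(w) = (w^T S_B w) / (w^T S_W w) for two classes.
    X1, X2: arrays of shape (N1, m) and (N2, m), one sample per row."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # within-class scatter S_W = S_1 + S_2 (sums of outer products of deviations)
    S1 = (X1 - mu1).T @ (X1 - mu1)
    S2 = (X2 - mu2).T @ (X2 - mu2)
    SW = S1 + S2
    # between-class scatter S_B: rank-one outer product of the mean difference
    d = (mu1 - mu2).reshape(-1, 1)
    SB = d @ d.T
    return float(w @ SB @ w) / float(w @ SW @ w)
```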
LDA … Two Classes
• To find the maximum of J(w), we differentiate and equate to zero.
$$\frac{d}{dw}\,J(w) = \frac{d}{dw}\left(\frac{w^T S_B w}{w^T S_W w}\right) = 0$$
$$\Rightarrow\; \left(w^T S_W w\right)\frac{d\left(w^T S_B w\right)}{dw} - \left(w^T S_B w\right)\frac{d\left(w^T S_W w\right)}{dw} = 0$$
$$\Rightarrow\; \left(w^T S_W w\right) 2 S_B w - \left(w^T S_B w\right) 2 S_W w = 0$$
Dividing by $2\, w^T S_W w$:
$$\Rightarrow\; \left(\frac{w^T S_W w}{w^T S_W w}\right) S_B w - \left(\frac{w^T S_B w}{w^T S_W w}\right) S_W w = 0$$
$$\Rightarrow\; S_B w - J(w)\, S_W w = 0 \;\Rightarrow\; S_W^{-1} S_B\, w - J(w)\, w = 0$$
LDA … Two Classes
• Solving the generalized eigenvalue problem
$$S_W^{-1} S_B\, w = \lambda w \qquad \text{where} \qquad \lambda = J(w) = \text{scalar}$$
yields
$$w^* = \arg\max_w J(w) = \arg\max_w\left(\frac{w^T S_B w}{w^T S_W w}\right) = S_W^{-1}\left(\mu_1 - \mu_2\right)$$
• This is known as Fisher’s Linear Discriminant, although it is not a discriminant but rather a specific choice of direction for the projection of the data down to one dimension.
• Using the same notation as PCA, the solution will be the eigenvector(s) of $S_X = S_W^{-1} S_B$.
LDA … Two Classes - Example
• Compute the Linear Discriminant projection for the following two- dimensional dataset.
– Samples for class ω1 : X1=(x1,x2)={(4,2),(2,4),(2,3),(3,6),(4,4)}
– Samples for class ω2 : X2=(x1,x2)={(9,10),(6,8),(9,5),(8,7),(10,8)}
[Figure: scatter plot of the two classes in the (x1, x2) plane.]
LDA … Two Classes - Example
• The class means are:
$$\mu_1 = \frac{1}{N_1}\sum_{x\in\omega_1} x = \frac{1}{5}\left[\begin{pmatrix}4\\2\end{pmatrix} + \begin{pmatrix}2\\4\end{pmatrix} + \begin{pmatrix}2\\3\end{pmatrix} + \begin{pmatrix}3\\6\end{pmatrix} + \begin{pmatrix}4\\4\end{pmatrix}\right] = \begin{pmatrix}3\\3.8\end{pmatrix}$$
$$\mu_2 = \frac{1}{N_2}\sum_{x\in\omega_2} x = \frac{1}{5}\left[\begin{pmatrix}9\\10\end{pmatrix} + \begin{pmatrix}6\\8\end{pmatrix} + \begin{pmatrix}9\\5\end{pmatrix} + \begin{pmatrix}8\\7\end{pmatrix} + \begin{pmatrix}10\\8\end{pmatrix}\right] = \begin{pmatrix}8.4\\7.6\end{pmatrix}$$
LDA … Two Classes - Example
• Covariance matrix of the first class:
$$S_1 = \frac{1}{N_1 - 1}\sum_{x\in\omega_1}\left(x - \mu_1\right)\left(x - \mu_1\right)^T = \begin{pmatrix}1 & -0.25\\ -0.25 & 2.2\end{pmatrix}$$
(in this example the scatter is normalized by $N_1 - 1 = 4$, i.e. the sample covariance is used; a common scale factor does not change the resulting LDA direction).
LDA … Two Classes - Example
• Covariance matrix of the second class:
$$S_2 = \frac{1}{N_2 - 1}\sum_{x\in\omega_2}\left(x - \mu_2\right)\left(x - \mu_2\right)^T = \begin{pmatrix}2.3 & -0.05\\ -0.05 & 3.3\end{pmatrix}$$
LDA … Two Classes - Example
• Within-class scatter matrix:
$$S_W = S_1 + S_2 = \begin{pmatrix}1 & -0.25\\ -0.25 & 2.2\end{pmatrix} + \begin{pmatrix}2.3 & -0.05\\ -0.05 & 3.3\end{pmatrix} = \begin{pmatrix}3.3 & -0.3\\ -0.3 & 5.5\end{pmatrix}$$
LDA … Two Classes - Example
• Between-class scatter matrix:
$$S_B = \left(\mu_1 - \mu_2\right)\left(\mu_1 - \mu_2\right)^T = \begin{pmatrix}3-8.4\\ 3.8-7.6\end{pmatrix}\begin{pmatrix}3-8.4 & 3.8-7.6\end{pmatrix} = \begin{pmatrix}-5.4\\ -3.8\end{pmatrix}\begin{pmatrix}-5.4 & -3.8\end{pmatrix} = \begin{pmatrix}29.16 & 20.52\\ 20.52 & 14.44\end{pmatrix}$$
LDA … Two Classes - Example
• The LDA projection is then obtained as the solution of the generalized eigenvalue problem
$$S_W^{-1} S_B\, w = \lambda w \;\Rightarrow\; \left|S_W^{-1} S_B - \lambda I\right| = 0$$
$$S_W^{-1} S_B = \begin{pmatrix}3.3 & -0.3\\ -0.3 & 5.5\end{pmatrix}^{-1}\begin{pmatrix}29.16 & 20.52\\ 20.52 & 14.44\end{pmatrix} = \begin{pmatrix}0.3045 & 0.0166\\ 0.0166 & 0.1827\end{pmatrix}\begin{pmatrix}29.16 & 20.52\\ 20.52 & 14.44\end{pmatrix} = \begin{pmatrix}9.2213 & 6.489\\ 4.2339 & 2.9794\end{pmatrix}$$
$$\left|\begin{matrix}9.2213-\lambda & 6.489\\ 4.2339 & 2.9794-\lambda\end{matrix}\right| = 0 \;\Rightarrow\; \lambda^2 - 12.2007\,\lambda = 0 \;\Rightarrow\; \lambda_1 = 0,\quad \lambda_2 = 12.2007$$
LDA … Two Classes - Example
• Hence
$$\begin{pmatrix}9.2213 & 6.489\\ 4.2339 & 2.9794\end{pmatrix} w_1 = \lambda_1 w_1 = 0\, w_1 \;\Rightarrow\; w_1 = \begin{pmatrix}-0.5755\\ 0.8178\end{pmatrix}$$
and
$$\begin{pmatrix}9.2213 & 6.489\\ 4.2339 & 2.9794\end{pmatrix} w_2 = \lambda_2 w_2 = 12.2007\, w_2 \;\Rightarrow\; w_2 = \begin{pmatrix}0.9088\\ 0.4173\end{pmatrix}$$
• The optimal projection is the one that gives the maximum λ = J(w). Thus
$$w^* = w_2 = \begin{pmatrix}0.9088\\ 0.4173\end{pmatrix}$$
LDA … Two Classes - Example
Or directly:
$$w^* = S_W^{-1}\left(\mu_1 - \mu_2\right) = \begin{pmatrix}0.3045 & 0.0166\\ 0.0166 & 0.1827\end{pmatrix}\begin{pmatrix}3 - 8.4\\ 3.8 - 7.6\end{pmatrix} \;\propto\; \begin{pmatrix}0.9088\\ 0.4173\end{pmatrix}$$
(the scale and sign of $w^*$ are immaterial; only the direction of the projection line matters).
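For readers who want to reproduce the numbers of this example, the following NumPy sketch (ours, not part of the original slides) carries out the same computation; the class scatters are normalized by $N_i - 1$, which matches the covariance values used above and does not change the resulting direction.

```python
import numpy as np

X1 = np.array([(4, 2), (2, 4), (2, 3), (3, 6), (4, 4)], dtype=float)
X2 = np.array([(9, 10), (6, 8), (9, 5), (8, 7), (10, 8)], dtype=float)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)        # class means (3, 3.8) and (8.4, 7.6)

# within-class scatter: sample covariances (normalized by N_i - 1) summed
S1 = (X1 - mu1).T @ (X1 - mu1) / (len(X1) - 1)
S2 = (X2 - mu2).T @ (X2 - mu2) / (len(X2) - 1)
SW = S1 + S2                                       # approx [[3.3, -0.3], [-0.3, 5.5]]

# between-class scatter: rank-one outer product of the mean difference
d = (mu1 - mu2).reshape(-1, 1)
SB = d @ d.T                                       # approx [[29.16, 20.52], [20.52, 14.44]]

# generalized eigenvalue problem  S_W^{-1} S_B w = lambda w
vals, vecs = np.linalg.eig(np.linalg.inv(SW) @ SB)
w_star = vecs[:, np.argmax(vals)]                  # eigenvector with the largest eigenvalue

print(vals.max())   # approx 12.2007
print(w_star)       # approx +/- [0.9088, 0.4173]
```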
LDA - Projection
[Figure: class-conditional PDFs p(y|ωi) of the projected data, and the data scatter in the (x1, x2) plane, using the LDA projection vector with the other eigenvalue = 8.8818e-016.]
The projection vector corresponding to the smallest eigenvalue: using this vector leads to bad separability between the two classes.
LDA - Projection
[Figure: class-conditional PDFs p(y|ωi) of the projected data, and the data scatter in the (x1, x2) plane, using the LDA projection vector with the highest eigenvalue = 12.2007.]
The projection vector corresponding to the highest eigenvalue: using this vector leads to good separability between the two classes.
LDA … C-Classes
• Now, we have C-classes instead of just two.
• We are now seeking (C-1) projections $[y_1, y_2, \dots, y_{C-1}]$ by means of (C-1) projection vectors $w_i$.
• The $w_i$ can be arranged by columns into a projection matrix $W = \left[\,w_1 \,|\, w_2 \,|\, \dots \,|\, w_{C-1}\,\right]$ such that:
$$y_i = w_i^T x \;\Rightarrow\; y = W^T x$$
where
$$x = \begin{bmatrix}x_1\\ \vdots\\ x_m\end{bmatrix}_{m\times 1}, \qquad y = \begin{bmatrix}y_1\\ \vdots\\ y_{C-1}\end{bmatrix}_{(C-1)\times 1}, \qquad W = \left[\,w_1 \,|\, w_2 \,|\, \dots \,|\, w_{C-1}\,\right]_{m\times(C-1)}$$
LDA … C-Classes
• If we have n feature vectors (data samples), we can stack them into one matrix as follows:
$$Y = W^T X$$
where
$$X = \left[\,x^1 \,|\, x^2 \,|\, \dots \,|\, x^n\,\right]_{m\times n}, \qquad Y = \left[\,y^1 \,|\, y^2 \,|\, \dots \,|\, y^n\,\right]_{(C-1)\times n}, \qquad W = \left[\,w_1 \,|\, w_2 \,|\, \dots \,|\, w_{C-1}\,\right]_{m\times(C-1)}$$
i.e. each column $x^j$ is an m-dimensional sample and each column $y^j = W^T x^j$ is its (C-1)-dimensional projection.
LDA – C-Classes
• Recall that in the two-class case, the within-class scatter was computed as $S_W = S_1 + S_2$.
• This can be generalized to the C-class case as:
$$S_W = \sum_{i=1}^{C} S_i, \qquad \text{where} \quad S_i = \sum_{x\in\omega_i}\left(x - \mu_i\right)\left(x - \mu_i\right)^T \quad \text{and} \quad \mu_i = \frac{1}{N_i}\sum_{x\in\omega_i} x$$
[Figure: example with two-dimensional features (m = 2) and three classes (C = 3), showing the class means μ1, μ2, μ3. Ni is the number of data samples in class ωi.]
LDA – C-Classes
• Recall that in the two-class case, the between-class scatter was computed as $S_B = \left(\mu_1 - \mu_2\right)\left(\mu_1 - \mu_2\right)^T$.
• For the C-class case, we will measure the between-class scatter with respect to the mean of all classes as follows:
$$S_B = \sum_{i=1}^{C} N_i\left(\mu_i - \mu\right)\left(\mu_i - \mu\right)^T, \qquad \text{where} \quad \mu = \frac{1}{N}\sum_{\forall x} x = \frac{1}{N}\sum_{i=1}^{C} N_i\,\mu_i \quad \text{and} \quad \mu_i = \frac{1}{N_i}\sum_{x\in\omega_i} x$$
[Figure: the same two-dimensional, three-class example, now also showing the overall mean μ. Ni is the number of data samples in class ωi; N is the total number of data samples.]
LDA – C-Classes
• Similarly,
– We can define the mean vectors for the projected samples y as:
$$\tilde{\mu}_i = \frac{1}{N_i}\sum_{y\in\omega_i} y \qquad \text{and} \qquad \tilde{\mu} = \frac{1}{N}\sum_{\forall y} y$$
– While the scatter matrices for the projected samples y will be:
$$\tilde{S}_W = \sum_{i=1}^{C}\;\sum_{y\in\omega_i}\left(y - \tilde{\mu}_i\right)\left(y - \tilde{\mu}_i\right)^T \qquad \text{and} \qquad \tilde{S}_B = \sum_{i=1}^{C} N_i\left(\tilde{\mu}_i - \tilde{\mu}\right)\left(\tilde{\mu}_i - \tilde{\mu}\right)^T$$
LDA – C-Classes
• Recall that in the two-class case we expressed the scatter matrices of the projected samples in terms of those of the original samples as
$$\tilde{S}_W = W^T S_W\, W \qquad \text{and} \qquad \tilde{S}_B = W^T S_B\, W$$
This still holds in the C-class case.
• Recall that we are looking for a projection that maximizes the ratio of between-class to within-class scatter.
• Since the projection is no longer a scalar (it has C-1 dimensions), we use the determinant of the scatter matrices to obtain a scalar objective function:
$$J(W) = \frac{\left|\tilde{S}_B\right|}{\left|\tilde{S}_W\right|} = \frac{\left|W^T S_B\, W\right|}{\left|W^T S_W\, W\right|}$$
• And we will seek the projection W* that maximizes this ratio.
LDA – C-Classes
• To find the maximum of J(W), we differentiate with respect to W and equate to zero.
• Recall that in the two-class case we solved the eigenvalue problem $S_W^{-1} S_B\, w = \lambda w$, where $\lambda = J(w)$ is a scalar.
• In the C-class case we have C-1 projection vectors, so the eigenvalue problem is generalized as:
$$S_W^{-1} S_B\, w_i = \lambda_i\, w_i \qquad \text{where} \quad \lambda_i = J(w_i) = \text{scalar}, \quad i = 1, 2, \dots, C-1$$
• Thus, it can be shown that the optimal projection matrix W* is the one whose columns are the eigenvectors corresponding to the largest eigenvalues of the following generalized eigenvalue problem:
$$S_W^{-1} S_B\, W^* = \lambda\, W^* \qquad \text{where} \quad \lambda = J(W^*) = \text{scalar} \quad \text{and} \quad W^* = \left[\,w_1^* \,|\, w_2^* \,|\, \dots \,|\, w_{C-1}^*\,\right]$$
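A minimal NumPy sketch of the C-class procedure just described (an illustrative implementation of ours, assuming samples are stored row-wise): accumulate $S_W$ and $S_B$ class by class, then keep the C-1 leading eigenvectors of $S_W^{-1} S_B$.

```python
import numpy as np

def lda_projection(X, labels):
    """Multi-class LDA: X has shape (n_samples, m); labels gives the class of each row.
    Returns W whose C-1 columns are the leading eigenvectors of S_W^{-1} S_B."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    m = X.shape[1]
    mu = X.mean(axis=0)                            # overall mean
    SW = np.zeros((m, m))
    SB = np.zeros((m, m))
    for c in classes:
        Xc = X[labels == c]
        mu_c = Xc.mean(axis=0)
        SW += (Xc - mu_c).T @ (Xc - mu_c)          # within-class scatter S_W
        d = (mu_c - mu).reshape(-1, 1)
        SB += len(Xc) * (d @ d.T)                  # between-class scatter S_B
    vals, vecs = np.linalg.eig(np.linalg.inv(SW) @ SB)
    order = np.argsort(vals.real)[::-1]            # sort eigenvalues, largest first
    W = vecs[:, order[:len(classes) - 1]].real     # keep the C-1 leading eigenvectors
    return W                                       # project the data with Y = X @ W
```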
Illustration – 3 Classes
• Let’s generate a dataset for each class to simulate the three classes shown
• For each class do the following,
– Use the random number generator to generate a uniform stream of 500 samples that follows U(0,1).
– Using the Box-Muller approach, convert the generated uniform stream to N(0,1).
[Figure: the three classes in the (x1, x2) plane with class means μ1, μ2, μ3 and overall mean μ.]
– Then use the method of eigenvalues and eigenvectors to manipulate the standard normal samples so that they have the required mean vector and covariance matrix.
– Estimate the mean and covariance matrix of the resulting dataset.
• By visual inspection of the figure, the class parameters (means and covariance matrices) can be given as follows:
Dataset Generation
[Figure: the three simulated classes in the (x1, x2) plane, annotated with the class means μ1, μ2, μ3, the overall mean μ, and the covariance matrices Σ1, Σ2, Σ3.]
The covariance matrices are chosen as follows:
Zero covariance to lead to data samples distributed horizontally.
Positive covariance to lead to data samples distributed along the y = x line.
Negative covariance to lead to data samples distributed along the y = -x line.
In Matlab …
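The original slides show MATLAB code for this step (not reproduced here); below is a rough NumPy equivalent of the described recipe, using Box-Muller to obtain N(0,1) samples and an eigendecomposition of the desired covariance to shape them. The function name and the example parameters in the comment are illustrative, not the slide’s values.

```python
import numpy as np

def generate_class(mu, Sigma, n=500, seed=0):
    """Generate n 2-D samples with (approximately) mean mu and covariance Sigma."""
    rng = np.random.default_rng(seed)
    # uniform streams U(0,1); use 1 - u to avoid log(0) in Box-Muller
    u1 = 1.0 - rng.random(n)
    u2 = rng.random(n)
    # Box-Muller: convert the uniform streams to N(0,1)
    z1 = np.sqrt(-2.0 * np.log(u1)) * np.cos(2.0 * np.pi * u2)
    z2 = np.sqrt(-2.0 * np.log(u1)) * np.sin(2.0 * np.pi * u2)
    Z = np.vstack([z1, z2])                       # 2 x n standard-normal samples
    # eigenvalues/eigenvectors of Sigma give a square root: Sigma = A A^T
    vals, vecs = np.linalg.eigh(np.asarray(Sigma, dtype=float))
    A = vecs @ np.diag(np.sqrt(vals))
    X = A @ Z + np.asarray(mu, dtype=float).reshape(-1, 1)
    return X.T                                    # n x 2 dataset for this class

# e.g. a class with zero covariance between the two features (illustrative values):
# X1 = generate_class(mu=[5, 5], Sigma=[[3, 0], [0, 1]], n=500)
```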
It’s Working …
[Figure: scatter plot of the generated data, X1 (the first feature) vs. X2 (the second feature), showing the three classes with their means μ1, μ2, μ3 and the overall mean μ.]
Computing LDA Projection Vectors
Recall:
$$S_B = \sum_{i=1}^{C} N_i\left(\mu_i - \mu\right)\left(\mu_i - \mu\right)^T, \qquad \mu = \frac{1}{N}\sum_{\forall x} x = \frac{1}{N}\sum_{i=1}^{C} N_i\,\mu_i$$
$$S_W = \sum_{i=1}^{C} S_i, \qquad S_i = \sum_{x\in\omega_i}\left(x - \mu_i\right)\left(x - \mu_i\right)^T, \qquad \mu_i = \frac{1}{N_i}\sum_{x\in\omega_i} x$$
The LDA projection vectors are the eigenvectors of $S_W^{-1} S_B$.
Let’s visualize the projection vectors W:
[Figure: the generated data with the two LDA projection vectors overlaid; axes X1 (the first feature) and X2 (the second feature).]
Projection … $y = W^T x$
Along first projection vector:
[Figure: class-conditional PDFs p(y|ωi) of the projected data using the first projection vector, with eigenvalue = 4508.2089.]
Projection … $y = W^T x$
Along second projection vector:
[Figure: class-conditional PDFs p(y|ωi) of the projected data using the second projection vector, with eigenvalue = 1878.8511.]
Which is Better?!!!
• Apparently, the projection vector that has the highest eigenvalue provides higher discrimination power between classes.
[Figure: side-by-side comparison of the class-conditional PDFs p(y|ωi) under the second projection vector (eigenvalue = 1878.8511) and the first projection vector (eigenvalue = 4508.2089).]
PCA vs LDA
Limitations of LDA
• LDA produces at most C-1 feature projections
– If the classification error estimates establish that more features are needed, some other method must be employed to provide those additional features
• LDA is a parametric method since it assumes unimodal Gaussian likelihoods
– If the distributions are significantly non-Gaussian, the LDA projections will not be able to preserve any complex structure of the data, which may be needed for classification.
Limitations of LDA
• LDA will fail when the discriminatory information is not in the mean but rather in the variance of the data