1 3
5
0 0.02 0.04 0.06 0.08 0.1
1 2 3 4 5
Nonlinear Mapping
𝑇: 𝑋 → 𝑌, 𝑦 = 𝜎 𝑊𝑥 + 𝑏 , 𝑓 𝑥 = 𝑎𝑇𝑦
1 𝑋 ∋ 𝑥
𝑌 ∋ 𝑦 = 𝑦1
⋮ 𝑦𝑛 𝑓 𝑥 = 𝑎𝑇𝑦
Geometric Deep Learning on graph and manifolds, Michael Bronstein, SIAM 2018, Imperial College London
𝑓 𝑥
1 𝑏
1 3
5
0 0.02 0.04 0.06 0.08 0.1
1 2 3 4 5
Nonlinear Mapping
𝑇: 𝑋 → 𝑌, 𝑦 = 𝜎 𝑊𝑥 + 𝑏 , 𝑓𝑖 𝑥 = 𝑎𝑖𝑇𝑦
2 𝑋 ∋ 𝑥
𝑌 ∋ 𝑦 = 𝑦1
⋮ 𝑦𝑛
𝑓𝑖 𝑥 = 𝑝(𝑥|𝜔𝑖)
1 𝑏
𝑓𝑛 𝑥 𝑓1 𝑥
𝑓𝑖 𝑥
𝑖𝑓𝑖(𝑥)
1 3
5
0 0.02 0.04 0.06 0.08 0.1
1 2 3 4 5
Nonlinear Mapping
𝑇: 𝑋 → 𝑌, 𝑦 = 𝜎 𝑊𝑥 + 𝑏 , 𝑓
𝑖𝑥 =
𝑎𝑖𝑇𝑦൘
σ𝑗𝑎𝑗𝑇𝑦
(softmax)
3 𝑋 ∋ 𝑥
𝑌 ∋ 𝑦 = 𝑦1
⋮ 𝑦𝑛
𝑓𝑖 𝑥 = 𝑝(𝜔𝑖|𝑥)
1 𝑏
𝑓𝑛 𝑥 𝑓1 𝑥
𝑓𝑖 𝑥
𝑖𝑓𝑖(𝑥)
Feature Dimension Reduction:
PCA & LDA (I)
Jin Young Choi
Seoul National University
4
Outline
Feature Extraction
Introduction of PCA & LDA
Principal Component Analysis (PCA)
Linear Discriminant Analysis (FLDA)
Multiple Discriminant Analysis (MDA)
Simple Enhancement of PCA/LDA
Feature Extraction
Features
Weight, Height, Width, Volume, Head size, … Edge, Shape, Geometric Relations …
RGB Color for each pixel SIFT, SURF, HOG, …
Feature Extraction from Raw Data
Pixel Valued Vector is raw data vector Raw data vector is redundant
The dimension should be reduced
Component Analysis and Discriminants
How to reduce excessive dimensionality?
• Answer: Combine features highly dependent to each other.
Linear methods project high-dimensional data onto lower dimensional space.
Principal Components Analysis (PCA)
• seeks the projection which best represents the data in a least- square error sense.
Linear Discriminant Analysis (LDA) or Fisher Linear Discriminant
• seeks the projection that best separates the data in a least-square discrimination error sense.
Principal Component Analysis
𝑛2
𝑛1
𝑥 = 𝑥1𝑛1 + 𝑥2𝑛2 = 𝑛1 𝑛2 𝑥 = 𝑁𝑥
*
Principal Component Analysis
𝑛2
𝑛1
𝑥 = 𝑥1𝑛1 + 𝑥2𝑛2 = 𝑛1 𝑛2 𝑥 = 𝑁𝑥
*
Principal Component Analysis
𝑛2
𝑛1
𝑥 = 𝑥1𝑛1 + 𝑥2𝑛2 = 𝑛1 𝑛2 𝑥 = 𝑁𝑥
Principal Component Analysis
𝑛2
𝑛1
𝑥 = 𝑥1𝑛1 + 𝑥2𝑛2 = 𝑛1 𝑛2 𝑥 = 𝑁𝑥 𝑥 = 𝑎1𝑒1 + 𝑎2𝑒2 = 𝑒1 𝑒2 𝑎 = 𝐸𝑎 𝑒1𝑡𝑥 = 𝑎1𝑒1𝑡𝑒1 + 𝑎2𝑒1𝑡𝑒2 = 𝑎1
𝑒2𝑡𝑥 = 𝑎1𝑒2𝑡𝑒1 + 𝑎2𝑒2𝑡𝑒2 = 𝑎2 𝑒1
𝑒2
𝑎1
𝑎2 = 𝑒1𝑡 𝑒2𝑡
𝑥1
𝑥2 𝑎=𝐸𝑡𝑥 𝑎2
𝑎1
Linear Discriminant Analysis
𝑛2
𝑛1 𝑒1
𝑎1 = 𝑒1𝑡 𝑥
Linear Discriminant Analysis
Maximize Between-Class Distance Minimize Within-Class Distance
𝑎2 = 𝑒2𝑡 𝑥 𝑛2
𝑛1 𝑒2
PCA & LDA
PCA == LDA
𝑛2
𝑛1 𝑒2
𝑒1
𝑎1 = 𝑒1𝑡 𝑥
Principal Components Analysis (PCA)
How to represent n d-dimensional vector samples {x1,.., xn} by a single vector x0 ?
• Find x0 that minimizes squared error correction function
2
0 0 0
1
( ) || || .
n
k k
J
x x x
Principal Components Analysis (PCA)
How to represent n d-dimensional vector samples {x1,.., xn} by a single vector x0 ?
• Find x0 that minimizes squared error correction function
The solution is sample mean
This is zero-dimensional representation of the data set.
One-dimensional representation by projecting the data onto a line through the sample mean reveals variability in the data.
2
0 0 0
1
( ) || || .
n
k k
J
x x x
n
0 k
k=1
x = m = 1 x
n
Principal Components Analysis (PCA)
This is zero-dimensional representation of the data set.
One-dimensional representation by projecting the data onto a line through the sample mean reveals variability in the data.
n
0 k
k=1
x = m = 1 x n
𝒙𝑘
PCA ; Projection
Let e be a unit vector in a direction of the line. The equation of the line
Representing xk by find “optimal”
set minimizing criterion function :
2
1 1
1
( ,..., , ) || || .
n
n k k
k
J a a a
e m e x
a
x m e
ak
m e ak
𝒙𝑘
PCA ; Projection
Representing xk by find “optimal”
set minimizing criterion function :
from
we find
2
1 1
1
( ,..., , ) || || .
n
n k k
k
J a a a
e m e x
ak
m e ak
1 / k 0
J a
𝒙𝑘
( )
t
k k
a e x m
PCA ; Projection
Representing xk by find “optimal”
How to find the best direction for e ?
The least square solution: project the vector xk onto the line in the direction of e, passing through the sample mean.
Minimize 𝐽w.r.t e.
ak
m e ak
( )
t
k k
a e x m
𝒙𝑘
2 1 1
1
( ,..., , ) || || .
n
n k k
k
J a a a
e m e x
( )
t
k k
a e x m
PCA ; Scatter matrix
Substituting 𝑎𝑘 into 𝐽1(𝑎, 𝐞) we find
where a scatter matrix S which is (𝑛 − 1) times of sample covariance matrix
1
( )( ) .
n
t
k k
k
S x m x m
2 2 2
1
1 1 1
2 2 2 2 2
1 1 1 1 1
2
1 1
1
( , ) || || 2 ( ) || ||
2 || || [ ( )] || ||
( )( ) || ||
|| ||
n n n
t
k k k k
k k k
n n n n n
t
k k k k k
k k k k k
n n
t t
k k k
k k
n t
k k
J a a a
a a
e e e x m x m
x m e x m x m
e x m x m e x m
e Se x m 2
PCA ; Scatter matrix
Vector e that minimizes J1 also maximizes .
So we find e, which maximize
subject to constraint 𝒆 =1
Let be Lagrange multiplier.
Differentiating 𝐿 with respect to e:
By setting to zero we see that e is an eigenvector of S:
So to maximize takes maximal
/ 2 2
L
e Se e
L e Set (e et 1) e Set
Se e e Set
e Set
e Set 2
1
1
( , ) || ||
n t
k k
J a
e e Se x m
PCA ; Scatter matrix
The result is easily extended to d’ dimensional projection:
The criterion function
is minimized when vectors are the eigenvectors having the largest eigenvalues.
The coefficients are principal components.
'
'
2
1 1
n d
i
k i k
d
k i
J a
m
e x( )
i t
k i k
a e x - m
1 2 d'
e , e , ...,e
'
' 1
where
d i
k k k i
i
a d d
x‘ m e
Error function
If d’ < d error which is made by dropping the last terms is
This is a sum of lowest eigenvalues.
'
2
1 ' 1
' 1 1
' 1 ' 1
( )( )
n d
i k i d
k i d
d n
t t
i k k i
i d k
d d
t
i i i
i d i d
J a
e
e x m x m e e Se
( )
i t
k i k
a e x - m
'
1 d
i
k k k i
i
a
x‘ m e
PCA – the algorithm
Input:
Take
Output:
Algorithm:
• Compute the mean of the training set
• Compute the scatter matrix S.
• Find eigenvectors of S and corresponding eigenvalues:
• Choose eigenvectors, and for each sample point compute d' d
( )
1 1
{ ,.., }, ,...,
n k k
n k d
X x x x x x
'
( )
1 { ,..,1 }
n k k
n k d
A a a a a a
1 1 2
{ , }i i id , i i , d
S e
i Se
e
d' xk
'
{ (t ) d1
k i k i
a e x m
1
1 .
n k
n k
m x
Interim Summary
'
'
'
1
1 2
2
( ) , 1,...,
( )
... ...
E ( )
) y W ( )
i t
k i k
t k
t k
k d t
k d
t
k k
t
k k
a i d
a a
a
cf
e x m
e
e x m
e
a x m
x m
Principal Component Analysis
Feature Extraction
Dimension Reduction
1
( )( ) .
n
t
k k
k
S x m x m
Se e
t
e Se
'
'
2
1 1
n d
i
k i k
d
k i
J a
m
e xFeature Dimension Reduction:
PCA & LDA (II)
Jin Young Choi
Seoul National University
1
Outline
Feature Extraction
Introduction of PCA & LDA
Principal Component Analysis (PCA) Linear Discriminant Analysis (FLDA) Multiple Discriminant Analysis (MDA) Simple Enhancement of PCA/LDA
2
Linear Discriminant Analysis: LDA
We have n d-dimensional samples x1,.., xn, n1 in a subset , labeled w1 and n2 in a subset , labeled w2 .
Find direction of line w , that maximally separate the data.
Let a difference between sample means be a measure of separation of projected points
3
D1
D2
Maximize Between-Class Distance Minimize Within-Class Distance
Fisher Linear Discriminant cont.
Project samples onto w.
n samples are divided into the subsets and
Let be the sample mean
The sample mean for projected points
Distance between the projected means is
4
yk w xt k
1
i
i
i D
n
x
m x
y Y
1 1
y
i i
t t
i i
i i D
m n
n
x
w x w m
%
1 2 1 2
| m%m% | | w mt( m ) | xk
yk mi
Y1 Y2
Fisher Linear Discriminant cont.
A scatter for projected samples labeled 𝜔𝑖
is an estimate of the variance of the pooled data.
is called total within-class scatter of the projected samples.
The Fisher discriminant employs for which criterion
is maximum
5
2 2
1 2
(1 / )(n s% % s )
2 2
1 2
s s
% %
w xt
2
1 2
2 2
1 2
| |
( ) m m
J s s
w % %
% %
2 2 y
(y )
i
i i
Y
s m
% %
Fisher Linear Discriminant cont.
Define scatter matrices and by
and
Then
Thus
6
Si Sw
( )( )
i
t
i i i
x D
S
x m x m1 2
Sw S S
2 2
( ) ( )( )
i i
t t t t t
i i i i i
D D
s m m m
x x
w x w w x x w w S w
%
2 2
1 2
t
s s w S ww
% %
Fisher Linear Discriminant cont.
Similarly,
is called within-class scatter matrix (proportional to sample covariance matrix )
is called between-class scatter matrix.
This gives the equivalent expression for Fisher’s discriminant
Which vector w maximizes it?
7
2 2
1 2 1 2 1 2 1 2
(m% %m ) (w mt w mt ) w mt( m )(m m )tw w S wt B Sw
1 2 1 2
( )( )t
B
S m m m m
( )
t B t
W
J w S w w w S w
2
( ) 2 0
t
W
B B
t t t
W W W
wJ S w w S w S w w w S w w S w w S w
Fisher Linear Discriminant cont.
Hence one gets or equivalently
Since for any w, is always in the direction of m1-m2:
It is not necessary to determine the eigenvalues of
One simply gets
Scale factor for w is unimportant (why?).
FLDA is one-dimensional projection
8
, ,
t B
B W t
W
w S w S w S w
w S w
1 ,
W B
S S w w
1 2 1 2 1 2
( )( )t ( )
B
S w m m m m w m m S wB
1 .
W B
S S
1
1 2
( )
SW
w m m
Fisher Linear Discriminant cont.
9
( |
i) p x
Σi
with
0 0
t w
w x
w
1
1 2 0 10 20
( ), w w w
w Σ μ μ
1 0
1 ln ( )
2
t
i i i i
w μ μ P
μ1 μ2
Matrix Norm
10
Induced Norm
Spectral (maximum singular value) norm
2 max( ) ( max( T ))1/ 2 f A A A A A 𝐴 𝑝 = sup𝑥 𝑝=1 𝐴𝑥 𝑝 𝐴 1 = max
1≤𝑗≤𝑛
𝑖=1 𝑚
𝑎𝑖𝑗 𝐴 ∞ = max
1≤𝑖≤𝑚
𝑗=1 𝑛
𝑎𝑖𝑗
2
2
2
2
2
2 2
1
1/ 2 1
1/ 2 1
1/ 2 1
2 1/ 2 1
1/ 2 max
sup
sup( )
sup( )
sup ( ) 1
sup ( )
( ( ))
x
T T
x
T T
x
T T T T
y
i i y
T
A Ax
x A Ax x U Ux
y y y y x U Ux y
X X
Schatten norm
nuclear norm
Frobenius Norm
Multiple Discriminant Analysis
11
d c
1
( )( )
c
t
B i i i
i
n
S m m m m
1
( )( )
i
c
t
w i i
i D
x
S x m x m
1
( )( )
i
c t
w i i
i Y
y
S% y m% y m%
1
( )( )
c
t
B i i i
i
n
S% m% m m% % m%
m %
1m%2
m%3
m%
1 2 1 2
) B ( )( )t
cf S m m m m
Multiple Discriminant Analysis
For the c -class problem we have c-1 discriminant functions.
The projection from a d-dimensional space to a (c-1) dimension is accomplished by (c-1) discriminant functions (we assume ).
Within-class scatter matrix is:
where and
Define a total mean vector
12
d c
1 c
w i
i
S S
( )( )
i
t
i i i
D
x
S x m x m
1
i
i
i D
n
x
m x
1
1 1 c
i i i
n n n
x
m x m
Multiple Discriminant Analysis
And total scatter matrix
It can be transformed to
The between-class scatter is:
13
( )( )t
T
x
S x m x m
1
1 1
1
( )( )
( )( ) ( )( )
( )( )
i
i i
c
t
T i i i i
i D
c c
t t
i i i i
i D i D
c
t
W i i i W B
i
n
x
x x
S x m m m x m m m
x m x m m m m m
S m m m m S S
1
( )( )
c
t
B i i i
i
n
S m m m m
Multiple Discriminant Analysis
For the c -class problem we have (c-1) discriminant functions. The projection from a d-dimensional space to a (c-1) dimensional space is accomplished by (c-1) discriminant functions:
Taking d-by-(c-1) matrix which columns are vectors , we’ll get in matrix form:
Samples are projected to
We define
14
y
i w x
ti i 1,..., ( c 1)
t
y W x
W
w
i1
,...,
nx x y
1,..., y
n.
1 1
i i
i i i
Y Y
i
n n n
y y
m% y m% m%
Multiple Discriminant Analysis cont.
It’s easy to show that
The criterion function which should be maximized is:
Every column of W we should be solution of generalized eigenvalue problem
15
1
( )( )
i
c t
W i i
i Y
y
S% y m% y m%
1
( )( )
c
t
B i i i
i
n
S% m% m m% % m%
t t
W W and B B
S% W S W S% W S W
| | | | ( )
( )
| | | | ( )
t t t
i B i
i
B B B
t t t
W W W i i W i
J tr
tr
w S w S W S W W S W
W S W S W W S W w S w
%
%
w
i1
W B i
i iS S w w
Multiple Discriminant Analysis cont.
The criterion function which should be maximized is:
Every column of W we should be solution of generalized eigenvalue problem
16
| | | | ( )
( )
| | | | ( )
t t t
i B i
i
B B B
t t t
W W W i i W i
J tr
tr
w S w S W S W W S W
W S W S W W S W w S w
%
%
w
i 1W B i i i
S S w w
( ) ( ) ( )
( )
( ) ( )
B i i W i
B W
t t
B W
t t
B W
t B
i t
i
W
tr tr tr
tr tr
tr
S w S w S W S W
W S W W S W
W S W W S W W S W W S W
Multiple Discriminant Analysis cont.
The MDA provides the way of reducing the dimensionality of the problem.
The technique for finding probability density might not be feasible in the original space.
The technique for finding probability density may work well after reducing the dimension of feature space.
MDA may improve the separability of classes.
17
Simple Enhancement for PCA/LDA
Significant pairs for between-class scatter matrix
• Non-boundary patterns with the different class labels
Significant pairs for within-class scatter matrix
• Non-boundary patterns with the same class labels
18
between-class
non-boundary region non-boundary region boundary region
within-class
non-boundary region non-boundary region boundary region
Non-boundary Pattern Selection Algorithm
Step 1. For each ,
Find the neighborhood defined as follows.
where is the set of nearest samples to by L2-norm.
Calculate voting probabilities of to each class .
where is 1 if the class of neighbor is , otherwise 0.
Calculate the neighborhood entropy of .
Step 2. Obtain boundary patterns and non-boundary patterns
PCA using NPS (LDA in the same manner)
Select non-boundary patterns via BNPS.
Non-boundary patterns make up significant pairs.
Emphasize the significant pairs.
Whole data
Non-boundary pattern
Toy Example
UCI Machine Learning Repository
Name # of data # of attributes # of classes Missing
attributes
Haberman 306 3 2 No
Hepatitis 155 19 2 Yes
Pima 768 8 2 No
Balance-scale 625 4 3 No
Liver-disorders 345 6 2 No
Iris 150 4 3 No
Glass 214 9 6 No
Wisconsin 699 9 2 Yes
Sonar 208 60 2 No
Dermatology 366 34 6 Yes
PCA vs. NPS+PCA
Leave-one-out
NN Classifier 10-fold cross validation NN Classifier
LDA vs. NPS+LDA
Leave-one-out
NN Classifier 10-fold cross validation
NN Classifier
Interim Summary
Fisher Linear Discriminant Analysis
Multiple Discriminant Analysis
Simple Enhancement for PCA/LDA ( )
t B t
W
J w S w w w S w
1
,
W B
S S w w w S
W1( m
1 m
2)
1
W B i
i iS S w w
1
( )( )
c
t
B i i i
i
n
S m m m m