Nonlinear Mapping

(1)

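Nonlinear Mapping

$$T: X \to Y, \qquad y = \sigma(Wx + b), \qquad f(x) = a^T y$$

[Figure: network diagram mapping an input $x \in X$ (with a constant input 1 for the bias $b$) to the feature vector $y = (y_1, \dots, y_n)^T \in Y$ and the output $f(x) = a^T y$.]

Geometric Deep Learning on Graphs and Manifolds, Michael Bronstein, SIAM 2018, Imperial College London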

(2)

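Nonlinear Mapping

$$T: X \to Y, \qquad y = \sigma(Wx + b), \qquad f_i(x) = a_i^T y$$

[Figure: network diagram mapping $x \in X$ to $y = (y_1, \dots, y_n)^T \in Y$ and to the outputs $f_1(x), \dots, f_n(x)$, where $f_i(x) = p(x \mid \omega_i)$; diagram labels $f_i(x)$ and $\sum_i f_i(x)$.]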

(3)

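Nonlinear Mapping

$$T: X \to Y, \qquad y = \sigma(Wx + b), \qquad f_i(x) = \frac{a_i^T y}{\sum_j a_j^T y} \quad \text{(softmax)}$$

[Figure: network diagram mapping $x \in X$ to $y = (y_1, \dots, y_n)^T \in Y$ and to the outputs $f_1(x), \dots, f_n(x)$, where $f_i(x) = p(\omega_i \mid x)$; diagram labels $f_i(x)$ and $\sum_i f_i(x)$.]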
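A minimal sketch of the mapping on this slide, under stated assumptions: $\sigma$ is taken to be the logistic sigmoid, the sizes and random weights are placeholders, and the rows of `A` (the vectors $a_i$) are kept positive so that the normalized outputs behave like probabilities. The normalization follows the ratio written on the slide; the usual softmax would exponentiate the scores before normalizing.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, used here as the nonlinearity sigma (an assumption)."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(9)
d, n, c = 4, 6, 3                        # input size, feature size, number of classes
W = rng.normal(size=(n, d))              # placeholder weights
b = rng.normal(size=n)                   # placeholder bias
A = rng.uniform(0.1, 1.0, size=(c, n))   # rows are a_i, kept positive (an assumption)

x = rng.normal(size=d)
y = sigmoid(W @ x + b)                   # T: X -> Y, y = sigma(W x + b)
scores = A @ y                           # a_i^T y for each class i
f = scores / scores.sum()                # normalized outputs f_i(x)
```

With positive scores, the entries of `f` are positive and sum to one, which is what allows them to be read as $p(\omega_i \mid x)$.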

(4)

Feature Dimension Reduction:

PCA & LDA (I)

Jin Young Choi

Seoul National University


(5)

Outline

Feature Extraction

Introduction of PCA & LDA

Principal Component Analysis (PCA)

Linear Discriminant Analysis (FLDA)

Multiple Discriminant Analysis (MDA)

Simple Enhancement of PCA/LDA

(6)

Feature Extraction

 Features

• Weight, Height, Width, Volume, Head size, …
• Edge, Shape, Geometric Relations, …
• RGB color for each pixel
• SIFT, SURF, HOG, …

 Feature Extraction from Raw Data

• A pixel-valued vector is a raw data vector.
• The raw data vector is redundant.
• Its dimension should be reduced.

(7)

Component Analysis and Discriminants

 How to reduce excessive dimensionality?

• Answer: Combine features that are highly dependent on each other.

 Linear methods project high-dimensional data onto a lower-dimensional space.

 Principal Components Analysis (PCA)

• seeks the projection that best represents the data in a least-squares error sense.

 Linear Discriminant Analysis (LDA) or Fisher Linear Discriminant

• seeks the projection that best separates the data in a least-squares discrimination error sense.

(8)

Principal Component Analysis

[Figure: a point (marked *) in the plane with basis vectors $n_1$ and $n_2$.]

$$x = x_1 n_1 + x_2 n_2 = [\, n_1 \;\; n_2 \,]\, x = N x$$


(11)

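[Figure: a point shown in the basis $n_1, n_2$ and in a second basis $e_1, e_2$, with coordinates $a_1, a_2$.]

$$x = x_1 n_1 + x_2 n_2 = [\, n_1 \;\; n_2 \,]\, x = N x$$

$$x = a_1 e_1 + a_2 e_2 = [\, e_1 \;\; e_2 \,]\, a = E a$$

$$e_1^t x = a_1 e_1^t e_1 + a_2 e_1^t e_2 = a_1$$

$$e_2^t x = a_1 e_2^t e_1 + a_2 e_2^t e_2 = a_2$$

$$\begin{bmatrix} a_1 \\ a_2 \end{bmatrix} = \begin{bmatrix} e_1^t \\ e_2^t \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \qquad a = E^t x$$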
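The change of coordinates on this slide can be checked numerically. The sketch below assumes an orthonormal basis $e_1, e_2$ (needed for $e_1^t e_1 = 1$ and $e_1^t e_2 = 0$); the rotation angle and the sample point are arbitrary illustrative values.

```python
import numpy as np

# Hypothetical orthonormal basis: the standard axes rotated by 30 degrees.
theta = np.deg2rad(30.0)
E = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # columns are e_1 and e_2

x = np.array([2.0, 1.0])   # coordinates in the basis n_1, n_2

a = E.T @ x                # a = E^t x: coordinates in the basis e_1, e_2
x_back = E @ a             # x = E a: back to the original coordinates
```

Because $E^t E = I$ for an orthonormal basis, `x_back` reproduces `x`, and each entry of `a` is the inner product of `x` with one basis vector.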

(12)

Linear Discriminant Analysis

[Figure: data in the $(n_1, n_2)$ plane projected onto the direction $e_1$.]

$$a_1 = e_1^t x$$

(13)

Linear Discriminant Analysis

Maximize Between-Class Distance
Minimize Within-Class Distance

[Figure: data in the $(n_1, n_2)$ plane projected onto the direction $e_2$.]

$$a_2 = e_2^t x$$

(14)

PCA & LDA

PCA == LDA

[Figure: data in the $(n_1, n_2)$ plane with the directions $e_1$ and $e_2$.]

$$a_1 = e_1^t x$$

(15)

Principal Components Analysis (PCA)

 How to represent n d-dimensional vector samples {x₁,.., x_n} by a single vector x₀ ?

• Find $x_0$ that minimizes the squared-error criterion function

$$J_0(x_0) = \sum_{k=1}^{n} \| x_0 - x_k \|^2 .$$

(16)

Principal Components Analysis (PCA)

 How to represent n d-dimensional vector samples {x₁,.., x_n} by a single vector x₀ ?

• Find $x_0$ that minimizes the squared-error criterion function

$$J_0(x_0) = \sum_{k=1}^{n} \| x_0 - x_k \|^2 .$$

 The solution is the sample mean:

$$x_0 = m = \frac{1}{n} \sum_{k=1}^{n} x_k .$$

 This is a zero-dimensional representation of the data set.

 A one-dimensional representation, obtained by projecting the data onto a line through the sample mean, reveals variability in the data.
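A short derivation shows why the sample mean is the minimizer. It uses only the criterion $J_0(x_0) = \sum_k \| x_0 - x_k \|^2$ and the definition $m = \frac{1}{n} \sum_k x_k$, so that $\sum_k (x_k - m) = 0$ and the cross term vanishes:

```latex
\begin{aligned}
J_0(x_0) &= \sum_{k=1}^{n} \| (x_0 - m) - (x_k - m) \|^2 \\
         &= n \, \| x_0 - m \|^2
            - 2 \, (x_0 - m)^t \sum_{k=1}^{n} (x_k - m)
            + \sum_{k=1}^{n} \| x_k - m \|^2 \\
         &= n \, \| x_0 - m \|^2 + \sum_{k=1}^{n} \| x_k - m \|^2 .
\end{aligned}
```

The second term does not depend on $x_0$, so $J_0$ is smallest when $x_0 = m$.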

(17)

Principal Components Analysis (PCA)

 The sample mean is a zero-dimensional representation of the data set:

$$x_0 = m = \frac{1}{n} \sum_{k=1}^{n} x_k .$$

 A one-dimensional representation, obtained by projecting the data onto a line through the sample mean, reveals variability in the data.

[Figure: samples $x_k$ around the sample mean.]

(18)

PCA: Projection

 Let $e$ be a unit vector in the direction of the line. The equation of the line is

$$x = m + a e .$$

 Representing $x_k$ by $m + a_k e$, find the "optimal" set of coefficients $a_k$ by minimizing the criterion function

$$J_1(a_1, \dots, a_n, e) = \sum_{k=1}^{n} \| (m + a_k e) - x_k \|^2 .$$

[Figure: a sample $x_k$ and its projection $m + a_k e$ onto the line through $m$ with direction $e$.]

(19)

PCA: Projection

Minimizing the criterion function

$$J_1(a_1, \dots, a_n, e) = \sum_{k=1}^{n} \| (m + a_k e) - x_k \|^2$$

with respect to $a_k$: from

$$\frac{\partial J_1}{\partial a_k} = 0$$

we find

$$a_k = e^t (x_k - m) .$$

[Figure: a sample $x_k$ and its projection $m + a_k e$ onto the line through $m$ with direction $e$.]

(20)

PCA: Projection

 How do we find the best direction for $e$?

 The least-squares solution projects the vector $x_k$ onto the line in the direction of $e$ passing through the sample mean:

$$a_k = e^t (x_k - m) .$$

 Minimize $J_1$ with respect to $e$, where

$$J_1(a_1, \dots, a_n, e) = \sum_{k=1}^{n} \| (m + a_k e) - x_k \|^2 .$$

[Figure: a sample $x_k$ and its projection $m + a_k e$ onto the line through $m$ with direction $e$.]

(21)

PCA: Scatter matrix

 Substituting $a_k = e^t (x_k - m)$ into $J_1(a, e)$, and using $\| e \| = 1$, we find

$$\begin{aligned}
J_1(a, e) &= \sum_{k=1}^{n} a_k^2 \| e \|^2 - 2 \sum_{k=1}^{n} a_k \, e^t (x_k - m) + \sum_{k=1}^{n} \| x_k - m \|^2 \\
&= \sum_{k=1}^{n} a_k^2 - 2 \sum_{k=1}^{n} a_k^2 + \sum_{k=1}^{n} \| x_k - m \|^2
 = - \sum_{k=1}^{n} [ e^t (x_k - m) ]^2 + \sum_{k=1}^{n} \| x_k - m \|^2 \\
&= - \sum_{k=1}^{n} e^t (x_k - m)(x_k - m)^t e + \sum_{k=1}^{n} \| x_k - m \|^2 \\
&= - e^t S e + \sum_{k=1}^{n} \| x_k - m \|^2 ,
\end{aligned}$$

 where $S$ is the scatter matrix, which is $(n - 1)$ times the sample covariance matrix:

$$S = \sum_{k=1}^{n} (x_k - m)(x_k - m)^t .$$
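The identity $J_1 = -e^t S e + \sum_k \| x_k - m \|^2$ and the relation between the scatter matrix and the sample covariance can be verified on synthetic data. This is only an illustrative sketch; the data and the direction `e` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 4))               # synthetic samples, one per row
n = X.shape[0]
m = X.mean(axis=0)
Xc = X - m                                 # centered samples x_k - m

S = Xc.T @ Xc                              # scatter matrix: sum of outer products
cov = np.cov(X, rowvar=False)              # sample covariance (divides by n - 1)

e = rng.normal(size=4)
e /= np.linalg.norm(e)                     # arbitrary unit direction
a = Xc @ e                                 # coefficients a_k = e^t (x_k - m)

J1_direct = np.sum((np.outer(a, e) - Xc) ** 2)   # sum of ||a_k e - (x_k - m)||^2
J1_formula = -e @ S @ e + np.sum(Xc ** 2)        # -e^t S e + sum of ||x_k - m||^2
```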

(22)

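PCA: Scatter matrix

 The vector $e$ that minimizes $J_1$ also maximizes $e^t S e$, since

$$J_1(a, e) = - e^t S e + \sum_{k=1}^{n} \| x_k - m \|^2 .$$

 So we find the $e$ that maximizes $e^t S e$ subject to the constraint $\| e \| = 1$.

 Let $\lambda$ be a Lagrange multiplier:

$$L = e^t S e - \lambda (e^t e - 1) .$$

 Differentiating $L$ with respect to $e$:

$$\frac{\partial L}{\partial e} = 2 S e - 2 \lambda e .$$

 Setting this to zero, we see that $e$ is an eigenvector of $S$:

$$S e = \lambda e, \qquad e^t S e = \lambda e^t e = \lambda .$$

 So to maximize $e^t S e$, $\lambda$ must be the maximal eigenvalue: $e$ is the eigenvector of $S$ with the largest eigenvalue.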
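The eigenvector result can be illustrated numerically. The sketch below uses synthetic data and NumPy's `numpy.linalg.eigh` (intended for symmetric matrices); the comparison against random unit vectors is an illustration, not a proof.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic data stretched along one axis so that a dominant direction exists.
X = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.5])
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc                              # scatter matrix

# eigh returns the eigenvalues of a symmetric matrix in ascending order.
eigvals, eigvecs = np.linalg.eigh(S)
lam_max = eigvals[-1]
e_best = eigvecs[:, -1]                    # eigenvector with the largest eigenvalue

quad_best = e_best @ S @ e_best            # e^t S e for the top eigenvector

# Evaluate e^t S e for many random unit vectors.
trials = rng.normal(size=(1000, 3))
trials /= np.linalg.norm(trials, axis=1, keepdims=True)
quad_trials = np.einsum("ij,jk,ik->i", trials, S, trials)
```

`quad_best` equals the largest eigenvalue, and none of the random unit vectors exceeds it.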

(23)

PCA: Scatter matrix

 The result is easily extended to a $d'$-dimensional projection:

$$x'_k = m + \sum_{i=1}^{d'} a_k^i e_i , \qquad \text{where } d' \le d .$$

 The criterion function

$$J_{d'} = \sum_{k=1}^{n} \left\| \left( m + \sum_{i=1}^{d'} a_k^i e_i \right) - x_k \right\|^2$$

is minimized when the vectors $e_1, e_2, \dots, e_{d'}$ are the eigenvectors of $S$ having the largest eigenvalues.

 The coefficients $a_k^i = e_i^t (x_k - m)$ are the principal components.

(24)

Error function

 If $d' < d$, the error made by dropping the last $d - d'$ terms is

$$\begin{aligned}
J_{d'} &= \sum_{k=1}^{n} \left\| \sum_{i=d'+1}^{d} a_k^i e_i \right\|^2 \\
&= \sum_{i=d'+1}^{d} e_i^t \left[ \sum_{k=1}^{n} (x_k - m)(x_k - m)^t \right] e_i \\
&= \sum_{i=d'+1}^{d} e_i^t S e_i = \sum_{i=d'+1}^{d} \lambda_i ,
\end{aligned}$$

where $a_k^i = e_i^t (x_k - m)$ and $x'_k = m + \sum_{i=1}^{d'} a_k^i e_i$.

 This is the sum of the $d - d'$ lowest eigenvalues.
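The claim that the truncation error equals the sum of the discarded eigenvalues can be checked directly. The sketch below uses synthetic correlated data and an arbitrary choice of $d' = 2$.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))   # correlated synthetic data
m = X.mean(axis=0)
Xc = X - m
S = Xc.T @ Xc                              # scatter matrix

eigvals, eigvecs = np.linalg.eigh(S)       # ascending order
order = np.argsort(eigvals)[::-1]          # re-sort to descending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

d_prime = 2
E = eigvecs[:, :d_prime]                   # d x d' matrix of leading eigenvectors
A = Xc @ E                                 # principal components a_k^i
X_approx = m + A @ E.T                     # x'_k = m + sum_i a_k^i e_i

error = np.sum((X_approx - X) ** 2)        # J_{d'}
discarded = eigvals[d_prime:].sum()        # sum of the d - d' smallest eigenvalues
```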

(25)

PCA – the algorithm

 Input: $X = \{ x_1, \dots, x_n \}$, $x_k = (x_k^1, \dots, x_k^d)$

 Take $d' < d$

 Output: $A = \{ a_1, \dots, a_n \}$, $a_k = (a_k^1, \dots, a_k^{d'})$

 Algorithm:

• Compute the mean of the training set: $m = \frac{1}{n} \sum_{k=1}^{n} x_k$.

• Compute the scatter matrix $S$.

• Find the eigenvectors of $S$ and the corresponding eigenvalues: $\{ e_i, \lambda_i \}_{i=1}^{d}$, with $S e_i = \lambda_i e_i$ and $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d$.

• Choose the $d'$ eigenvectors with the largest eigenvalues, and for each sample point $x_k$ compute $a_k = \{ e_i^t (x_k - m) \}_{i=1}^{d'}$.
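The steps of the algorithm can be sketched in NumPy as follows. The function name `pca`, its return values, and the synthetic input are illustrative choices, not part of the slides; samples are stored one per row.

```python
import numpy as np

def pca(X, d_prime):
    """Project the rows of X (n samples, d features) onto d_prime principal axes.

    Returns the coefficients A (n x d_prime), the sample mean m, the chosen
    eigenvectors E (d x d_prime) and all eigenvalues in descending order.
    """
    m = X.mean(axis=0)                       # step 1: sample mean
    Xc = X - m
    S = Xc.T @ Xc                            # step 2: scatter matrix
    eigvals, eigvecs = np.linalg.eigh(S)     # step 3: eigen-decomposition (ascending)
    order = np.argsort(eigvals)[::-1]        #         re-sort to descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    E = eigvecs[:, :d_prime]                 # step 4: keep the d_prime leading eigenvectors
    A = Xc @ E                               #         a_k = E^t (x_k - m), one row per sample
    return A, m, E, eigvals

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 4))                 # synthetic input
A, m, E, eigvals = pca(X, d_prime=2)
```

The coefficient columns in `A` are centered and mutually uncorrelated, and their scatter equals the retained eigenvalues.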

(26)

Interim Summary

 Principal Component Analysis

Feature Extraction

Dimension Reduction

$$S = \sum_{k=1}^{n} (x_k - m)(x_k - m)^t , \qquad S e = \lambda e , \qquad e^t S e = \lambda$$

$$J_{d'} = \sum_{k=1}^{n} \left\| \left( m + \sum_{i=1}^{d'} a_k^i e_i \right) - x_k \right\|^2$$

$$a_k^i = e_i^t (x_k - m) , \qquad i = 1, \dots, d'$$

$$a_k = \begin{bmatrix} a_k^1 \\ a_k^2 \\ \vdots \\ a_k^{d'} \end{bmatrix}
     = \begin{bmatrix} e_1^t (x_k - m) \\ e_2^t (x_k - m) \\ \vdots \\ e_{d'}^t (x_k - m) \end{bmatrix}
     = E^t (x_k - m) , \qquad \text{cf. } y_k = W^t (x_k - m)$$

(27)

Feature Dimension Reduction:

PCA & LDA (II)

Jin Young Choi

Seoul National University


(28)

Outline

Feature Extraction

Introduction of PCA & LDA

Principal Component Analysis (PCA)

Linear Discriminant Analysis (FLDA)

Multiple Discriminant Analysis (MDA)

Simple Enhancement of PCA/LDA

(29)

Linear Discriminant Analysis: LDA

 We have $n$ $d$-dimensional samples $x_1, \dots, x_n$: $n_1$ in a subset $D_1$ labeled $\omega_1$, and $n_2$ in a subset $D_2$ labeled $\omega_2$.

 Find the direction of a line $w$ that maximally separates the data.

 Let the difference between the sample means be a measure of the separation of the projected points.

Maximize Between-Class Distance
Minimize Within-Class Distance

(30)

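Fisher Linear Discriminant cont.

 Project the samples onto $w$:

$$y_k = w^t x_k .$$

 The $n$ projected samples are divided into the subsets $Y_1$ and $Y_2$.

 Let $m_i$ be the sample mean of class $i$:

$$m_i = \frac{1}{n_i} \sum_{x \in D_i} x .$$

 The sample mean for the projected points is

$$\tilde{m}_i = \frac{1}{n_i} \sum_{y \in Y_i} y = \frac{1}{n_i} \sum_{x \in D_i} w^t x = w^t m_i .$$

 The distance between the projected means is

$$| \tilde{m}_1 - \tilde{m}_2 | = | w^t (m_1 - m_2) | .$$

[Figure: samples $x_k$ and class means $m_i$ projected onto the line, giving $y_k$ in the subsets $Y_1$ and $Y_2$.]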

(31)

Fisher Linear Discriminant cont.

 The scatter for the projected samples labeled $\omega_i$ is

$$\tilde{s}_i^2 = \sum_{y \in Y_i} (y - \tilde{m}_i)^2 .$$

$(1/n)(\tilde{s}_1^2 + \tilde{s}_2^2)$ is an estimate of the variance of the pooled data.

$\tilde{s}_1^2 + \tilde{s}_2^2$ is called the total within-class scatter of the projected samples.

 The Fisher discriminant employs the linear function $w^t x$ for which the criterion

$$J(w) = \frac{| \tilde{m}_1 - \tilde{m}_2 |^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$$

is maximum.

(32)

Fisher Linear Discriminant cont.

 Define the scatter matrices $S_i$ and $S_W$ by

$$S_i = \sum_{x \in D_i} (x - m_i)(x - m_i)^t$$

and

$$S_W = S_1 + S_2 .$$

 Then

$$\tilde{s}_i^2 = \sum_{x \in D_i} (w^t x - w^t m_i)^2 = \sum_{x \in D_i} w^t (x - m_i)(x - m_i)^t w = w^t S_i w .$$

 Thus

$$\tilde{s}_1^2 + \tilde{s}_2^2 = w^t S_W w .$$

(33)

 Similarly,

$$(\tilde{m}_1 - \tilde{m}_2)^2 = (w^t m_1 - w^t m_2)^2 = w^t (m_1 - m_2)(m_1 - m_2)^t w = w^t S_B w .$$

$S_W$ is called the within-class scatter matrix (proportional to the sample covariance matrix).

$S_B = (m_1 - m_2)(m_1 - m_2)^t$ is called the between-class scatter matrix.

 This gives the equivalent expression for Fisher's discriminant:

$$J(w) = \frac{w^t S_B w}{w^t S_W w} .$$

 Which vector $w$ maximizes it?

$$\nabla_w J = \frac{2 S_B w \, (w^t S_W w) - 2 (w^t S_B w) \, S_W w}{(w^t S_W w)^2} = 0$$

(34)

Fisher Linear Discriminant cont.

 Hence one gets

$$S_B w = \lambda S_W w , \qquad \lambda = \frac{w^t S_B w}{w^t S_W w} ,$$

or equivalently

$$S_W^{-1} S_B w = \lambda w .$$

 For any $w$, $S_B w$ is always in the direction of $m_1 - m_2$:

$$S_B w = (m_1 - m_2)(m_1 - m_2)^t w \propto (m_1 - m_2) .$$

 It is not necessary to determine the eigenvalues of $S_W^{-1} S_B$.

 One simply gets

$$w = S_W^{-1} (m_1 - m_2) .$$

 The scale factor for $w$ is unimportant (why?).

 FLDA is a one-dimensional projection.
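A minimal two-class sketch of the closed-form solution $w = S_W^{-1} (m_1 - m_2)$ follows. The two Gaussian classes, their means and the shared covariance are synthetic assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
# Two synthetic Gaussian classes sharing one covariance (an assumption).
cov = np.array([[2.0, 1.2], [1.2, 1.0]])
D1 = rng.multivariate_normal([0.0, 0.0], cov, size=100)
D2 = rng.multivariate_normal([3.0, 1.0], cov, size=100)

m1, m2 = D1.mean(axis=0), D2.mean(axis=0)
S1 = (D1 - m1).T @ (D1 - m1)
S2 = (D2 - m2).T @ (D2 - m2)
S_W = S1 + S2                                  # within-class scatter matrix
S_B = np.outer(m1 - m2, m1 - m2)               # between-class scatter matrix

w = np.linalg.solve(S_W, m1 - m2)              # w = S_W^{-1} (m1 - m2), no explicit inverse

def fisher_J(v):
    """Fisher criterion J(v) = (v^t S_B v) / (v^t S_W v)."""
    return (v @ S_B @ v) / (v @ S_W @ v)

J_w = fisher_J(w)
J_scaled = fisher_J(5.0 * w)                   # rescaling w leaves J unchanged
J_random = max(fisher_J(v) for v in rng.normal(size=(500, 2)))
```

Since $J$ is a ratio of two quadratic forms in $w$, any rescaling of $w$ cancels, which is why `J_scaled` equals `J_w`; none of the random directions achieves a larger value.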

(35)

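[Figure: two class-conditional densities $p(x \mid \omega_i)$ with means $\mu_1$, $\mu_2$ and covariance $\Sigma_i$.]

$$w^t x + w_0 = 0$$

with

$$w = \Sigma^{-1} (\mu_1 - \mu_2) , \qquad w_0 = w_{10} - w_{20} , \qquad w_{i0} = -\frac{1}{2} \mu_i^t \Sigma^{-1} \mu_i + \ln P(\omega_i) .$$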

(36)

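Matrix Norm

 Induced Norm

$$\| A \|_p = \sup_{\| x \|_p = 1} \| A x \|_p , \qquad
\| A \|_1 = \max_{1 \le j \le n} \sum_{i=1}^{m} | a_{ij} | , \qquad
\| A \|_\infty = \max_{1 \le i \le m} \sum_{j=1}^{n} | a_{ij} |$$

 Spectral (maximum singular value) norm

$$\| A \|_2 = \sigma_{\max}(A) = ( \lambda_{\max}(A^T A) )^{1/2}$$

With the eigendecomposition $A^T A = U^T \Lambda U$ and $y = U x$:

$$\begin{aligned}
\| A \|_2 &= \sup_{\| x \|_2 = 1} \| A x \|_2 \\
&= \sup_{\| x \|_2 = 1} ( x^T A^T A x )^{1/2} = \sup_{\| x \|_2 = 1} ( x^T U^T \Lambda U x )^{1/2} \\
&= \sup_{\| y \|_2 = 1} ( y^T \Lambda y )^{1/2} , \qquad y^T y = x^T U^T U x = 1 \\
&= \sup_{\| y \|_2 = 1} \Big( \sum_i \lambda_i y_i^2 \Big)^{1/2} \\
&= ( \lambda_{\max}(A^T A) )^{1/2}
\end{aligned}$$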

 Schatten norm

 nuclear norm

 Frobenius Norm
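The norms listed on this slide can be computed with `numpy.linalg.norm`. The sketch below uses an arbitrary example matrix and the standard textbook definitions: the Schatten $p$-norm is the vector $p$-norm of the singular values, the nuclear norm is their sum ($p = 1$), and the Frobenius norm is the square root of the sum of squared entries ($p = 2$). The helper `schatten` is an illustrative name, not a NumPy function.

```python
import numpy as np

A = np.array([[1.0, -2.0, 3.0],
              [4.0,  0.0, -1.0]])            # arbitrary 2 x 3 example matrix

sigma = np.linalg.svd(A, compute_uv=False)   # singular values, largest first

norm_1 = np.linalg.norm(A, 1)                # induced 1-norm: max absolute column sum
norm_inf = np.linalg.norm(A, np.inf)         # induced inf-norm: max absolute row sum
norm_2 = np.linalg.norm(A, 2)                # spectral norm: largest singular value
norm_nuc = np.linalg.norm(A, "nuc")          # nuclear norm: sum of singular values
norm_fro = np.linalg.norm(A, "fro")          # Frobenius norm

def schatten(M, p):
    """Schatten p-norm: the vector p-norm of the singular values of M."""
    s = np.linalg.svd(M, compute_uv=False)
    return np.sum(s ** p) ** (1.0 / p)
```

For this matrix the column sums of absolute values are 5, 2 and 4, and the row sums are 6 and 5, so `norm_1` is 5 and `norm_inf` is 6.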

(37)

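Multiple Discriminant Analysis

[Figure: samples of three classes projected to a lower-dimensional space, with projected class means $\tilde{m}_1$, $\tilde{m}_2$, $\tilde{m}_3$ and projected total mean $\tilde{m}$; $d \ge c$.]

$$S_B = \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^t , \qquad
S_W = \sum_{i=1}^{c} \sum_{x \in D_i} (x - m_i)(x - m_i)^t$$

$$\tilde{S}_B = \sum_{i=1}^{c} n_i (\tilde{m}_i - \tilde{m})(\tilde{m}_i - \tilde{m})^t , \qquad
\tilde{S}_W = \sum_{i=1}^{c} \sum_{y \in Y_i} (y - \tilde{m}_i)(y - \tilde{m}_i)^t$$

cf. the two-class case: $S_B = (m_1 - m_2)(m_1 - m_2)^t$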

(38)

 For the $c$-class problem we have $c - 1$ discriminant functions.

 The projection from a $d$-dimensional space to a $(c - 1)$-dimensional space is accomplished by $c - 1$ discriminant functions (we assume $d \ge c$).

 The within-class scatter matrix is

$$S_W = \sum_{i=1}^{c} S_i ,$$

where

$$S_i = \sum_{x \in D_i} (x - m_i)(x - m_i)^t \qquad \text{and} \qquad m_i = \frac{1}{n_i} \sum_{x \in D_i} x .$$

 Define a total mean vector

$$m = \frac{1}{n} \sum_{x} x = \frac{1}{n} \sum_{i=1}^{c} n_i m_i .$$

(39)

Multiple Discriminant Analysis

 Define the total scatter matrix

$$S_T = \sum_{x} (x - m)(x - m)^t .$$

 It can be transformed to

$$\begin{aligned}
S_T &= \sum_{i=1}^{c} \sum_{x \in D_i} (x - m_i + m_i - m)(x - m_i + m_i - m)^t \\
&= \sum_{i=1}^{c} \sum_{x \in D_i} (x - m_i)(x - m_i)^t + \sum_{i=1}^{c} \sum_{x \in D_i} (m_i - m)(m_i - m)^t \\
&= S_W + \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^t = S_W + S_B .
\end{aligned}$$

 The between-class scatter is

$$S_B = \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^t .$$
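The decomposition $S_T = S_W + S_B$ can be confirmed numerically. The cross terms drop out because $\sum_{x \in D_i} (x - m_i) = 0$ within each class. The three classes below are synthetic, with arbitrary sizes and means.

```python
import numpy as np

rng = np.random.default_rng(6)
# Three synthetic classes with different sizes and means (illustrative values).
classes = [rng.normal(loc=mu, size=(n_i, 3))
           for mu, n_i in [(0.0, 30), (2.0, 50), (-1.0, 20)]]

X = np.vstack(classes)
m = X.mean(axis=0)                                   # total mean vector

S_T = (X - m).T @ (X - m)                            # total scatter matrix
S_W = np.zeros((3, 3))
S_B = np.zeros((3, 3))
for D_i in classes:
    m_i = D_i.mean(axis=0)                           # class mean
    S_W += (D_i - m_i).T @ (D_i - m_i)               # within-class contribution S_i
    S_B += len(D_i) * np.outer(m_i - m, m_i - m)     # n_i (m_i - m)(m_i - m)^t
```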

(40)

 For the $c$-class problem we have $c - 1$ discriminant functions. The projection from a $d$-dimensional space to a $(c - 1)$-dimensional space is accomplished by $c - 1$ discriminant functions:

$$y_i = w_i^t x , \qquad i = 1, \dots, c - 1 .$$

 Taking the $d$-by-$(c - 1)$ matrix $W$ whose columns are the vectors $w_i$, we get in matrix form:

$$y = W^t x .$$

 The samples $x_1, \dots, x_n$ are projected to $y_1, \dots, y_n$.

 We define

$$\tilde{m}_i = \frac{1}{n_i} \sum_{y \in Y_i} y , \qquad \tilde{m} = \frac{1}{n} \sum_{i=1}^{c} n_i \tilde{m}_i .$$

(41)

Multiple Discriminant Analysis cont.

 With

$$\tilde{S}_W = \sum_{i=1}^{c} \sum_{y \in Y_i} (y - \tilde{m}_i)(y - \tilde{m}_i)^t , \qquad
\tilde{S}_B = \sum_{i=1}^{c} n_i (\tilde{m}_i - \tilde{m})(\tilde{m}_i - \tilde{m})^t ,$$

it is easy to show that

$$\tilde{S}_W = W^t S_W W \qquad \text{and} \qquad \tilde{S}_B = W^t S_B W .$$

 The criterion function to be maximized is

$$J(W) = \frac{| \tilde{S}_B |}{| \tilde{S}_W |} = \frac{| W^t S_B W |}{| W^t S_W W |} ,$$

with the trace-based counterpart

$$\frac{\operatorname{tr}(W^t S_B W)}{\operatorname{tr}(W^t S_W W)} = \frac{\sum_i w_i^t S_B w_i}{\sum_i w_i^t S_W w_i} .$$

 Every column $w_i$ of $W$ should be a solution of the generalized eigenvalue problem

$$S_W^{-1} S_B w_i = \lambda_i w_i .$$

(42)

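Multiple Discriminant Analysis cont.

 The criterion function to be maximized is

$$J(W) = \frac{| \tilde{S}_B |}{| \tilde{S}_W |} = \frac{| W^t S_B W |}{| W^t S_W W |} ,$$

with the trace-based counterpart

$$\frac{\operatorname{tr}(W^t S_B W)}{\operatorname{tr}(W^t S_W W)} = \frac{\sum_i w_i^t S_B w_i}{\sum_i w_i^t S_W w_i} .$$

 Every column $w_i$ of $W$ should be a solution of the generalized eigenvalue problem

$$S_W^{-1} S_B w_i = \lambda_i w_i .$$

In matrix form, with $\Lambda = \operatorname{diag}(\lambda_i)$:

$$S_B w_i = \lambda_i S_W w_i \;\Rightarrow\; S_B W = S_W W \Lambda \;\Rightarrow\; W^t S_B W = W^t S_W W \Lambda .$$

[Derivation: the remaining steps on the slide relate $\operatorname{tr}(W^t S_B W)$ and $\operatorname{tr}(W^t S_W W)$; they are not recoverable from the extracted text.]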
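The generalized eigenvalue problem $S_W^{-1} S_B w_i = \lambda_i w_i$ can be solved numerically as sketched below. The three synthetic classes, their means and the use of `numpy.linalg.eig` on $S_W^{-1} S_B$ are illustrative choices; a dedicated generalized symmetric eigensolver would be an alternative.

```python
import numpy as np

rng = np.random.default_rng(7)
means = np.array([[0.0, 0.0, 0.0, 0.0],
                  [3.0, 0.0, 1.0, 0.0],
                  [0.0, 3.0, 0.0, 1.0]])           # c = 3 synthetic classes in d = 4
classes = [rng.normal(loc=mu, size=(40, 4)) for mu in means]

c, d = means.shape
m = np.vstack(classes).mean(axis=0)                # total mean vector
S_W = sum((D - D.mean(axis=0)).T @ (D - D.mean(axis=0)) for D in classes)
S_B = sum(len(D) * np.outer(D.mean(axis=0) - m, D.mean(axis=0) - m) for D in classes)

# S_W^{-1} S_B is not symmetric in general, so use eig rather than eigh.
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
eigvals, eigvecs = eigvals.real, eigvecs.real      # eigenvalues are real in theory
order = np.argsort(eigvals)[::-1]
W = eigvecs[:, order[:c - 1]]                      # d x (c - 1) projection matrix
lam = eigvals[order[:c - 1]]                       # the c - 1 largest eigenvalues

Y = [D @ W for D in classes]                       # projected samples y = W^t x
```

Only $c - 1$ eigenvalues are nonzero because $S_B$ has rank at most $c - 1$. For this choice of $W$, the determinant criterion $|W^t S_B W| / |W^t S_W W|$ equals the product of the retained eigenvalues.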

(43)

Multiple Discriminant Analysis cont.

 MDA provides a way of reducing the dimensionality of the problem.

 Techniques for estimating probability densities might not be feasible in the original space.

 The same techniques may work well after the dimension of the feature space has been reduced.

 MDA may improve the separability of the classes.

(44)

Simple Enhancement for PCA/LDA

 Significant pairs for the between-class scatter matrix

• Non-boundary patterns with different class labels

 Significant pairs for the within-class scatter matrix

• Non-boundary patterns with the same class label

[Figure: two panels, "between-class" and "within-class", each showing two non-boundary regions separated by a boundary region.]

(45)

Non-boundary Pattern Selection Algorithm

 Step 1. For each sample:

 Find its neighborhood, the set of nearest samples under the L2 norm.

 Calculate the voting probability of the sample for each class, using an indicator that is 1 if a neighbor belongs to that class and 0 otherwise.

 Calculate the neighborhood entropy of the sample.

 Step 2. Obtain the boundary patterns and the non-boundary patterns.
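The two steps can be sketched as follows. The slide's formulas did not survive extraction, so several details here are assumptions rather than the authors' definitions: the neighborhood size `k`, excluding a sample from its own neighborhood, the natural-log entropy, and the rule that a sample is non-boundary when its neighborhood entropy does not exceed a hypothetical `threshold`.

```python
import numpy as np

def neighborhood_entropy(X, labels, k=5):
    """Entropy of the class votes among the k nearest neighbors of each sample."""
    classes = np.unique(labels)
    # Pairwise squared L2 distances; a sample is not counted as its own neighbor.
    dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    np.fill_diagonal(dist, np.inf)
    nearest = np.argsort(dist, axis=1)[:, :k]        # indices of the k nearest samples
    neighbor_labels = labels[nearest]                # shape (n, k)
    # Voting probability of each sample for each class.
    votes = np.stack([(neighbor_labels == c).mean(axis=1) for c in classes], axis=1)
    # Entropy with the convention 0 * log 0 = 0.
    safe = np.where(votes > 0, votes, 1.0)
    return -np.sum(votes * np.log(safe), axis=1)

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(-2.0, 1.0, size=(50, 2)),
               rng.normal(+2.0, 1.0, size=(50, 2))])  # two synthetic classes
labels = np.repeat([0, 1], 50)

H = neighborhood_entropy(X, labels, k=5)
threshold = 0.0                                       # hypothetical cut-off
non_boundary = H <= threshold                         # all k neighbors agree on one class
boundary = ~non_boundary
```

With a zero threshold, a sample counts as non-boundary only when all of its neighbors vote for a single class; a larger threshold would tolerate some disagreement.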

(46)

PCA using NPS (LDA in the same manner)

 Select non-boundary patterns via BNPS.

 Non-boundary patterns make up significant pairs.

 Emphasize the significant pairs.

[Figure: two panels, "Whole data" and "Non-boundary pattern".]

(47)

Toy Example

(48)

UCI Machine Learning Repository

Name              # of data   # of attributes   # of classes   Missing attributes
Haberman          306         3                 2              No
Hepatitis         155         19                2              Yes
Pima              768         8                 2              No
Balance-scale     625         4                 3              No
Liver-disorders   345         6                 2              No
Iris              150         4                 3              No
Glass             214         9                 6              No
Wisconsin         699         9                 2              Yes
Sonar             208         60                2              No
Dermatology       366         34                6              Yes

(49)

PCA vs. NPS+PCA

[Figure: two panels, "Leave-one-out, NN Classifier" and "10-fold cross validation, NN Classifier".]

(50)

LDA vs. NPS+LDA

[Figure: two panels, "Leave-one-out, NN Classifier" and "10-fold cross validation, NN Classifier".]

(51)

Interim Summary

 Fisher Linear Discriminant Analysis

$$J(w) = \frac{w^t S_B w}{w^t S_W w} , \qquad S_W^{-1} S_B w = \lambda w , \qquad w = S_W^{-1} (m_1 - m_2)$$

 Multiple Discriminant Analysis

$$S_W^{-1} S_B w_i = \lambda_i w_i , \qquad S_B = \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^t$$

 Simple Enhancement for PCA/LDA