• Tidak ada hasil yang ditemukan

Nonlinear Mapping

N/A
N/A
Protected

Academic year: 2024

Membagikan "Nonlinear Mapping"

Copied!
51
0
0

Teks penuh

(1)

1 3

5

0 0.02 0.04 0.06 0.08 0.1

1 2 3 4 5

Nonlinear Mapping

𝑇: 𝑋 → 𝑌, 𝑦 = 𝜎 𝑊𝑥 + 𝑏 , 𝑓 𝑥 = 𝑎𝑇𝑦

1 𝑋 ∋ 𝑥

𝑌 ∋ 𝑦 = 𝑦1

𝑦𝑛 𝑓 𝑥 = 𝑎𝑇𝑦

Geometric Deep Learning on graph and manifolds, Michael Bronstein, SIAM 2018, Imperial College London

𝑓 𝑥

1 𝑏

(2)

1 3

5

0 0.02 0.04 0.06 0.08 0.1

1 2 3 4 5

Nonlinear Mapping

𝑇: 𝑋 → 𝑌, 𝑦 = 𝜎 𝑊𝑥 + 𝑏 , 𝑓𝑖 𝑥 = 𝑎𝑖𝑇𝑦

2 𝑋 ∋ 𝑥

𝑌 ∋ 𝑦 = 𝑦1

𝑦𝑛

𝑓𝑖 𝑥 = 𝑝(𝑥|𝜔𝑖)

1 𝑏

𝑓𝑛 𝑥 𝑓1 𝑥

𝑓𝑖 𝑥

𝑖𝑓𝑖(𝑥)

(3)

1 3

5

0 0.02 0.04 0.06 0.08 0.1

1 2 3 4 5

Nonlinear Mapping

𝑇: 𝑋 → 𝑌, 𝑦 = 𝜎 𝑊𝑥 + 𝑏 , 𝑓

𝑖

𝑥 =

𝑎𝑖𝑇𝑦

σ

𝑗𝑎𝑗𝑇𝑦

(softmax)

3 𝑋 ∋ 𝑥

𝑌 ∋ 𝑦 = 𝑦1

𝑦𝑛

𝑓𝑖 𝑥 = 𝑝(𝜔𝑖|𝑥)

1 𝑏

𝑓𝑛 𝑥 𝑓1 𝑥

𝑓𝑖 𝑥

𝑖𝑓𝑖(𝑥)

(4)

Feature Dimension Reduction:

PCA & LDA (I)

Jin Young Choi

Seoul National University

4

(5)

Outline

Feature Extraction

Introduction of PCA & LDA

Principal Component Analysis (PCA)

Linear Discriminant Analysis (FLDA)

Multiple Discriminant Analysis (MDA)

Simple Enhancement of PCA/LDA

(6)

Feature Extraction

 Features

Weight, Height, Width, Volume, Head size, … Edge, Shape, Geometric Relations …

RGB Color for each pixel SIFT, SURF, HOG, …

 Feature Extraction from Raw Data

Pixel Valued Vector is raw data vector Raw data vector is redundant

The dimension should be reduced

(7)

Component Analysis and Discriminants

 How to reduce excessive dimensionality?

• Answer: Combine features highly dependent to each other.

 Linear methods project high-dimensional data onto lower dimensional space.

 Principal Components Analysis (PCA)

• seeks the projection which best represents the data in a least- square error sense.

 Linear Discriminant Analysis (LDA) or Fisher Linear Discriminant

• seeks the projection that best separates the data in a least-square discrimination error sense.

(8)

Principal Component Analysis

𝑛2

𝑛1

𝑥 = 𝑥1𝑛1 + 𝑥2𝑛2 = 𝑛1 𝑛2 𝑥 = 𝑁𝑥

*

(9)

Principal Component Analysis

𝑛2

𝑛1

𝑥 = 𝑥1𝑛1 + 𝑥2𝑛2 = 𝑛1 𝑛2 𝑥 = 𝑁𝑥

*

(10)

Principal Component Analysis

𝑛2

𝑛1

𝑥 = 𝑥1𝑛1 + 𝑥2𝑛2 = 𝑛1 𝑛2 𝑥 = 𝑁𝑥

(11)

Principal Component Analysis

𝑛2

𝑛1

𝑥 = 𝑥1𝑛1 + 𝑥2𝑛2 = 𝑛1 𝑛2 𝑥 = 𝑁𝑥 𝑥 = 𝑎1𝑒1 + 𝑎2𝑒2 = 𝑒1 𝑒2 𝑎 = 𝐸𝑎 𝑒1𝑡𝑥 = 𝑎1𝑒1𝑡𝑒1 + 𝑎2𝑒1𝑡𝑒2 = 𝑎1

𝑒2𝑡𝑥 = 𝑎1𝑒2𝑡𝑒1 + 𝑎2𝑒2𝑡𝑒2 = 𝑎2 𝑒1

𝑒2

𝑎1

𝑎2 = 𝑒1𝑡 𝑒2𝑡

𝑥1

𝑥2 𝑎=𝐸𝑡𝑥 𝑎2

𝑎1

(12)

Linear Discriminant Analysis

𝑛2

𝑛1 𝑒1

𝑎1 = 𝑒1𝑡 𝑥

(13)

Linear Discriminant Analysis

Maximize Between-Class Distance Minimize Within-Class Distance

𝑎2 = 𝑒2𝑡 𝑥 𝑛2

𝑛1 𝑒2

(14)

PCA & LDA

PCA == LDA

𝑛2

𝑛1 𝑒2

𝑒1

𝑎1 = 𝑒1𝑡 𝑥

(15)

Principal Components Analysis (PCA)

 How to represent n d-dimensional vector samples {x1,.., xn} by a single vector x0 ?

• Find x0 that minimizes squared error correction function

2

0 0 0

1

( ) || || .

n

k k

J

  

x x x

(16)

Principal Components Analysis (PCA)

 How to represent n d-dimensional vector samples {x1,.., xn} by a single vector x0 ?

• Find x0 that minimizes squared error correction function

 The solution is sample mean

 This is zero-dimensional representation of the data set.

 One-dimensional representation by projecting the data onto a line through the sample mean reveals variability in the data.

2

0 0 0

1

( ) || || .

n

k k

J

  

x x x

n

0 k

k=1

x = m = 1 x

n 

(17)

Principal Components Analysis (PCA)

 This is zero-dimensional representation of the data set.

 One-dimensional representation by projecting the data onto a line through the sample mean reveals variability in the data.

n

0 k

k=1

x = m = 1 x n 

𝒙𝑘

(18)

PCA ; Projection

 Let e be a unit vector in a direction of the line. The equation of the line

 Representing xk by find “optimal”

set minimizing criterion function :

2

1 1

1

( ,..., , ) || || .

n

n k k

k

J a a a

  

e m e x

  a 

x m e

ak

m e ak

𝒙𝑘

(19)

PCA ; Projection

 Representing xk by find “optimal”

set minimizing criterion function :

from

we find

2

1 1

1

( ,..., , ) || || .

n

n k k

k

J a a a

  

e m e x

ak

m e ak

1 / k 0

J a

  

  

𝒙𝑘

( )

t

k k

ae xm 

(20)

PCA ; Projection

 Representing xk by find “optimal”

 How to find the best direction for e ?

 The least square solution: project the vector xk onto the line in the direction of e, passing through the sample mean.

 Minimize 𝐽w.r.t e.

ak

m e ak

( )

t

k k

ae xm 

𝒙𝑘

2 1 1

1

( ,..., , ) || || .

n

n k k

k

J a a a

  

e m e x

( )

t

k k

ae xm 

(21)

PCA ; Scatter matrix

 Substituting 𝑎𝑘 into 𝐽1(𝑎, 𝐞) we find

 where a scatter matrix S which is (𝑛 − 1) times of sample covariance matrix

1

( )( ) .

n

t

k k

k

S x m x m

2 2 2

1

1 1 1

2 2 2 2 2

1 1 1 1 1

2

1 1

1

( , ) || || 2 ( ) || ||

2 || || [ ( )] || ||

( )( ) || ||

|| ||

n n n

t

k k k k

k k k

n n n n n

t

k k k k k

k k k k k

n n

t t

k k k

k k

n t

k k

J a a a

a a

  

 

 

  

    

 

e e e x m x m

x m e x m x m

e x m x m e x m

e Se x m 2

(22)

PCA ; Scatter matrix

 Vector e that minimizes J1 also maximizes .

 So we find e, which maximize

subject to constraint 𝒆 =1

 Let be Lagrange multiplier.

 Differentiating 𝐿 with respect to e:

 By setting to zero we see that e is an eigenvector of S:

 So to maximize takes maximal

/ 2 2

L

  e Se e

L e Set (e et 1) e Set

See e Set

e Set

e Set 2

1

1

( , ) || ||

n t

k k

J a

 

e e Se x m

(23)

PCA ; Scatter matrix

 The result is easily extended to d’ dimensional projection:

 The criterion function

is minimized when vectors are the eigenvectors having the largest eigenvalues.

 The coefficients are principal components.

'

'

2

1 1

n d

i

k i k

d

k i

J a

m

e x

( )

i t

k i k

ae x - m

1 2 d'

e , e , ...,e

'

' 1

where

d i

k k k i

i

a d d

 

  

x m e

(24)

Error function

 If d’ < d error which is made by dropping the last terms is

 This is a sum of lowest eigenvalues.

'

2

1 ' 1

' 1 1

' 1 ' 1

( )( )

n d

i k i d

k i d

d n

t t

i k k i

i d k

d d

t

i i i

i d i d

J a

 

 

   

  

 

 

 

 

e

e x m x m e e Se

( )

i t

k i k

ae x - m

'

1 d

i

k k k i

i

a

 



x m e

(25)

PCA – the algorithm

 Input:

 Take

 Output:

 Algorithm:

• Compute the mean of the training set

• Compute the scatter matrix S.

• Find eigenvectors of S and corresponding eigenvalues:

• Choose eigenvectors, and for each sample point compute d'd

( )

1 1

{ ,.., }, ,...,

n k k

n k d

X x x  x x x 

'

( )

1 { ,..,1 }

n k k

n k d

A    a a a a a

1 1 2

{ , }i i id , i i , d

S e

  i Se

  

e     

d' xk

'

{ (t ) d1

k i k  i

a e x m

1

1 .

n k

n k

m x

(26)

Interim Summary

'

'

'

1

1 2

2

( ) , 1,...,

( )

... ...

E ( )

) y W ( )

i t

k i k

t k

t k

k d t

k d

t

k k

t

k k

a i d

a a

a

cf

    

   

   

     

   

   

   

 

 

 

e x m

e

e x m

e

a x m

x m

 Principal Component Analysis

Feature Extraction

Dimension Reduction

1

( )( ) .

n

t

k k

k

 

S x m x m

Se e

t

e Se

'

'

2

1 1

n d

i

k i k

d

k i

J a

 

   

 

m

e x
(27)

Feature Dimension Reduction:

PCA & LDA (II)

Jin Young Choi

Seoul National University

1

(28)

Outline

Feature Extraction

Introduction of PCA & LDA

Principal Component Analysis (PCA) Linear Discriminant Analysis (FLDA) Multiple Discriminant Analysis (MDA) Simple Enhancement of PCA/LDA

2

(29)

Linear Discriminant Analysis: LDA

 We have n d-dimensional samples x1,.., xn, n1 in a subset , labeled w1 and n2 in a subset , labeled w2 .

 Find direction of line w , that maximally separate the data.

 Let a difference between sample means be a measure of separation of projected points

3

D1

D2

Maximize Between-Class Distance Minimize Within-Class Distance

(30)

Fisher Linear Discriminant cont.

 Project samples onto w.

 n samples are divided into the subsets and

 Let be the sample mean

 The sample mean for projected points

 Distance between the projected means is

4

ykw xt k

1

i

i

i D

n

x

m x

y Y

1 1

y

i i

t t

i i

i i D

m n

n

x

w x w m

%

1 2 1 2

| m%m% | | w mt( m ) | xk

yk mi

Y1 Y2

(31)

Fisher Linear Discriminant cont.

 A scatter for projected samples labeled 𝜔𝑖

is an estimate of the variance of the pooled data.

is called total within-class scatter of the projected samples.

 The Fisher discriminant employs for which criterion

is maximum

5

2 2

1 2

(1 / )(n s% % s )

2 2

1 2

ss

% %

w xt

2

1 2

2 2

1 2

| |

( ) m m

J s s

 

w % %

% %

2 2 y

(y )

i

i i

Y

s m

% %

(32)

Fisher Linear Discriminant cont.

 Define scatter matrices and by

and

 Then

 Thus

6

Si Sw

( )( )

i

t

i i i

x D

S

x mx m

1 2

Sw  S S

2 2

( ) ( )( )

i i

t t t t t

i i i i i

D D

s m m m

 

  

x x

w x w w x x w w S w

%

2 2

1 2

t

ssw S ww

% %

(33)

Fisher Linear Discriminant cont.

 Similarly,

is called within-class scatter matrix (proportional to sample covariance matrix )

is called between-class scatter matrix.

 This gives the equivalent expression for Fisher’s discriminant

 Which vector w maximizes it?

7

2 2

1 2 1 2 1 2 1 2

(m% %m ) (w mt w mt ) w mt( m )(m m )tw w S wt B Sw

1 2 1 2

( )( )t

B

S m m m m

( )

t B t

W

Jw S w w w S w

2

( ) 2 0

t

W

B B

t t t

W W W

wJ S w w S w S w w w S w w S w w S w

(34)

Fisher Linear Discriminant cont.

 Hence one gets or equivalently

 Since for any w, is always in the direction of m1-m2:

 It is not necessary to determine the eigenvalues of

 One simply gets

 Scale factor for w is unimportant (why?).

 FLDA is one-dimensional projection

8

, ,

t B

B W t

W

 

  w S w S w S w

w S w

1 ,

W B  

S S w w

1 2 1 2 1 2

( )( )t ( )

B

S w m m m m w m m S wB

1 .

W B

S S

1

1 2

( )

SW

w m m

(35)

Fisher Linear Discriminant cont.

9

( |

i

) p x 

Σi

with

0 0

tw

w x

w

1

1 2 0 10 20

( ), w w w

    w Σ μ μ

1 0

1 ln ( )

2

t

i i i i

w μ μ P

μ1 μ2

(36)

Matrix Norm

10

Induced Norm

 Spectral (maximum singular value) norm

 

2 max( ) ( max( T ))1/ 2 f A A A A A 𝐴 𝑝 = sup

𝑥 𝑝=1 𝐴𝑥 𝑝 𝐴 1 = max

1≤𝑗≤𝑛

𝑖=1 𝑚

𝑎𝑖𝑗 𝐴 = max

1≤𝑖≤𝑚

𝑗=1 𝑛

𝑎𝑖𝑗

2

2

2

2

2

2 2

1

1/ 2 1

1/ 2 1

1/ 2 1

2 1/ 2 1

1/ 2 max

sup

sup( )

sup( )

sup ( ) 1

sup ( )

( ( ))

x

T T

x

T T

x

T T T T

y

i i y

T

A Ax

x A Ax x U Ux

y y y y x U Ux y

X X

Schatten norm

nuclear norm

Frobenius Norm

(37)

Multiple Discriminant Analysis

11

d  c

1

( )( )

c

t

B i i i

i

n

 

S m m m m

1

( )( )

i

c

t

w i i

i D

 

 

x

S x m x m

1

( )( )

i

c t

w i i

i Y



 

y

S% y m% y m%

1

( )( )

c

t

B i i i

i

n

 

S% m% m m% % m%

m %

1

m%2

m%3

m%

1 2 1 2

) B ( )( )t

cf Smm mm

(38)

Multiple Discriminant Analysis

 For the c -class problem we have c-1 discriminant functions.

 The projection from a d-dimensional space to a (c-1) dimension is accomplished by (c-1) discriminant functions (we assume ).

 Within-class scatter matrix is:

where and

 Define a total mean vector

12

d  c

1 c

w i

i

S S

( )( )

i

t

i i i

D

x

S x m x m

1

i

i

i D

n

x

m x

1

1 1 c

i i i

n n n

x

m x m

(39)

Multiple Discriminant Analysis

 And total scatter matrix

 It can be transformed to

 The between-class scatter is:

13

( )( )t

T

x

S x m x m

1

1 1

1

( )( )

( )( ) ( )( )

( )( )

i

i i

c

t

T i i i i

i D

c c

t t

i i i i

i D i D

c

t

W i i i W B

i

n





 

   

x

x x

S x m m m x m m m

x m x m m m m m

S m m m m S S

1

( )( )

c

t

B i i i

i

n

S m m m m

(40)

Multiple Discriminant Analysis

 For the c -class problem we have (c-1) discriminant functions. The projection from a d-dimensional space to a (c-1) dimensional space is accomplished by (c-1) discriminant functions:

 Taking d-by-(c-1) matrix which columns are vectors , we’ll get in matrix form:

 Samples are projected to

 We define

14

y

i

 w x

ti

  i 1,..., ( c  1)

t 

y W x

W

w

i

1

,...,

n

x x y

1

,..., y

n

.

1 1

i i

i i i

Y Y

i

n n n

 



y y

m% y m% m%

(41)

Multiple Discriminant Analysis cont.

 It’s easy to show that

 The criterion function which should be maximized is:

 Every column of W we should be solution of generalized eigenvalue problem

15

1

( )( )

i

c t

W i i

i Y



 

y

S% y m% y m%

1

( )( )

c

t

B i i i

i

n

 

S% m% m m% % m%

t t

WW and BB

S% W S W S% W S W

| | | | ( )

( )

| | | | ( )

t t t

i B i

i

B B B

t t t

W W W i i W i

J tr

   tr

w S w S W S W W S W

W S W S W W S W w S w

%

%

w

i

1

W B i

 

i i

S S w w

(42)

Multiple Discriminant Analysis cont.

 The criterion function which should be maximized is:

 Every column of W we should be solution of generalized eigenvalue problem

16

| | | | ( )

( )

| | | | ( )

t t t

i B i

i

B B B

t t t

W W W i i W i

J tr

   tr

w S w S W S W W S W

W S W S W W S W w S w

%

%

w

i 1

W B i  i i

S S w w

( ) ( ) ( )

( )

( ) ( )

B i i W i

B W

t t

B W

t t

B W

t B

i t

i

W

tr tr tr

tr tr

tr

 

 

 

 

S w S w S W S W

W S W W S W

W S W W S W W S W W S W

(43)

Multiple Discriminant Analysis cont.

 The MDA provides the way of reducing the dimensionality of the problem.

 The technique for finding probability density might not be feasible in the original space.

 The technique for finding probability density may work well after reducing the dimension of feature space.

 MDA may improve the separability of classes.

17

(44)

Simple Enhancement for PCA/LDA

 Significant pairs for between-class scatter matrix

• Non-boundary patterns with the different class labels

 Significant pairs for within-class scatter matrix

• Non-boundary patterns with the same class labels

18

between-class

non-boundary region non-boundary region boundary region

within-class

non-boundary region non-boundary region boundary region

(45)

Non-boundary Pattern Selection Algorithm

Step 1. For each ,

Find the neighborhood defined as follows.

where is the set of nearest samples to by L2-norm.

Calculate voting probabilities of to each class .

where is 1 if the class of neighbor is , otherwise 0.

Calculate the neighborhood entropy of .

Step 2. Obtain boundary patterns and non-boundary patterns

(46)

PCA using NPS (LDA in the same manner)

 Select non-boundary patterns via BNPS.

 Non-boundary patterns make up significant pairs.

 Emphasize the significant pairs.

Whole data

Non-boundary pattern

(47)

Toy Example

(48)

UCI Machine Learning Repository

Name # of data # of attributes # of classes Missing

attributes

Haberman 306 3 2 No

Hepatitis 155 19 2 Yes

Pima 768 8 2 No

Balance-scale 625 4 3 No

Liver-disorders 345 6 2 No

Iris 150 4 3 No

Glass 214 9 6 No

Wisconsin 699 9 2 Yes

Sonar 208 60 2 No

Dermatology 366 34 6 Yes

(49)

PCA vs. NPS+PCA

Leave-one-out

NN Classifier 10-fold cross validation NN Classifier

(50)

LDA vs. NPS+LDA

Leave-one-out

NN Classifier 10-fold cross validation

NN Classifier

(51)

Interim Summary

 Fisher Linear Discriminant Analysis

 Multiple Discriminant Analysis

 Simple Enhancement for PCA/LDA ( )

t B t

W

Jw S w w w S w

1

,

W B

  

S S w w w  S

W1

( m

1

 m

2

)

1

W B i

 

i i

S S w w

1

( )( )

c

t

B i i i

i

n

S m m m m

Referensi

Dokumen terkait