
(1)


Advanced Deep Learning

Linear Factor Models

U Kang

Seoul National University

(2)

In This Lecture

Linear Factor Model

Probabilistic PCA and factor analysis

ICA

Slow Feature Analysis

Sparse Coding

Manifold Interpretation of PCA

(3)


Outline

Linear Factor Model

Probabilistic PCA and factor analysis

ICA

Slow Feature Analysis

Sparse Coding

Manifold Interpretation of PCA

(4)

Linear Factor Models

Linear factor models: the simplest probabilistic models with latent variables

Probabilistic inference

Many of the research frontiers in deep learning involve building a probabilistic model

Such a model can, in principle, use probabilistic inference to predict any of the variables in its environment given any of the other variables

(5)


Latent Variables

Latent variables

Many probabilistic models have latent variables h

Latent variables represent abstract concepts or theoretical constructs that cannot be directly measured

Latent variables provide another means of representing the data

Representing the data

Localist representations

Distributed representations

(6)

Latent Variables

Localist representations

The simplest way to represent things with neural networks is to dedicate one neuron to each thing

Easy to understand

Easy to code by hand

Easy to learn

Easy to associate with other representations or responses

But localist models are very inefficient whenever the data has componential structure

(7)


Latent Variables

Examples of componential structure

A big, yellow Volkswagen

Do we have a neuron for this combination?

Consider a visual scene

It contains many different objects

Each object has many properties like shape, color, size, motion

Objects have spatial relationships to each other

(8)

Latent Variables

Distributed representations

Distributed representation means a many-to-many relationship between two types of representation (such as concepts and neurons)

Each concept is represented by many neurons

Each neuron participates in the representation of many concepts

(9)


Latent Variable Models

Offer a lower-dimensional representation of the data and their dependencies

Latent variable model

x: observed variables (d-dimensional)

h: latent variables (q-dimensional)

q < d

(10)

Linear Factor Models

A linear factor model describes a data generating process for x that includes latent variables h,

where x is a linear function of h

(11)


Linear Factor Models

Data-generation process

1) Sample the explanatory factors h from a distribution

h ~ p(h)

p(h) = ∏_i p(h_i) (a factorial distribution, i.e., the factors are independent)

2) Sample the real-valued observable variables given the factors

x = W h + b + noise

The noise is typically Gaussian and diagonal (independent across dimensions)
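
As a concrete illustration, here is a minimal NumPy sketch (not from the slides; the dimensions and parameter values are arbitrary assumptions) of this two-step generative process with a factorial Gaussian prior over h and diagonal Gaussian noise.

```python
import numpy as np

rng = np.random.default_rng(0)
d, q, n = 5, 2, 1000             # observed dim, latent dim, number of samples (illustrative)

W = rng.normal(size=(d, q))      # factor loading matrix
b = rng.normal(size=d)           # offset
psi = 0.1 * np.ones(d)           # diagonal noise variances

# 1) Sample the explanatory factors h ~ p(h) = prod_i p(h_i)
h = rng.normal(size=(n, q))
# 2) Sample the observations x = W h + b + noise (diagonal Gaussian noise)
noise = rng.normal(scale=np.sqrt(psi), size=(n, d))
x = h @ W.T + b + noise
```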

(12)

Special Cases of Linear Factor Models

There are special cases of linear factor models

Probabilistic PCA

Factor Analysis

Independent Component Analysis

Slow Feature Analysis

Sparse Coding

They only differ in the choices made for the noise distribution and the model’s prior over the latent variables h before observing x

(13)


Outline

Linear Factor Model

Probabilistic PCA and factor analysis

ICA

Slow Feature Analysis

Sparse Coding

Manifold Interpretation of PCA

(14)

Principal Component Analysis (PCA)

With a large number of variables, the covariance matrix may be too large to study and interpret properly

There would be too many pairwise correlations between the variables to consider

To interpret the data in a more meaningful form, it is therefore necessary to reduce the number of variables to a few interpretable linear combinations of the data

Each linear combination will correspond to a principal component.

(15)


Examples of PCA

(16)

Examples of PCA

First, consider a dataset in two dimensions, like (height, weight)

This dataset can be plotted as points in a plane

(17)


Examples of PCA

But if we want to tease out variation, PCA finds a new coordinate system in which every point has a new (x,y) value


(19)


PCA Procedure

Used to transform observed data matrix x into h (find the q principal components)

Fairly simple solution

1. Center the data x (subtract the mean)

2. Calculate the covariance matrix C of x

3. Calculate the eigenvectors of C

4. Select the eigenvectors that correspond to the q largest eigenvalues
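
A minimal NumPy sketch of this 4-step procedure (the function name and the use of np.linalg.eigh are my own choices, not from the slides):

```python
import numpy as np

def pca(x, q):
    """Project the (n, d) data matrix x onto its q principal components."""
    x_centered = x - x.mean(axis=0)           # 1. center x
    C = np.cov(x_centered, rowvar=False)      # 2. covariance matrix C
    eigvals, eigvecs = np.linalg.eigh(C)      # 3. eigenvectors of C
    top = np.argsort(eigvals)[::-1][:q]       # 4. q largest eigenvalues
    components = eigvecs[:, top]              # corresponding eigenvectors
    return x_centered @ components            # h: reduced representation
```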

(20)

Limitations of PCA

PCA is a simple linear algebra transformation; it does not produce a probabilistic model for the observed data

The covariance matrix needs to be calculated

Can be very computationally intensive for large datasets with a high number of dimensions

Does not deal properly with missing data

Outlying data observations can unduly affect the analysis

(21)


Probabilistic PCA model

Enables comparison with other probabilistic techniques

Maximum-likelihood estimates can be computed for the elements associated with the principal components

Extends the scope of PCA

Multiple PCA models can be combined as a probabilistic mixture

PCA projections can be obtained when some data values are missing

(22)

Factor Analysis

Latent variable model with a linear relationship

x ~ Wh + b + ε

W is a matrix that relates observed variables x to the latent variables h

Latent variables: h ~ N(0, I)

Error (or noise): ε ~ N(0, ψ) – Gaussian noise

Location term (mean): b

(23)


Aside: Gaussian Distribution

Linear combination of Gaussian

Let x be a multivariate Gaussian with mean 𝜇 and covariance Σ

Consider a new variable y = Wx + b

Then, y is a multivariate Gaussian with mean Wμ + b and covariance WΣW^T

Sum of multivariate Gaussian

Suppose y ~ N(𝜇, Σ) and z ~ N(𝜇′, Σ′) are independent multivariate Gaussian.

Then, y + z ~ N(μ + μ′, Σ + Σ′)
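
A quick Monte Carlo sanity check of these two facts (a throwaway sketch; the particular μ, Σ, W, b values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
W = np.array([[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]])
b = np.array([0.5, 0.0, 1.0])

x = rng.multivariate_normal(mu, Sigma, size=200_000)
y = x @ W.T + b                                   # y = W x + b

# Empirical mean/covariance should approach W mu + b and W Sigma W^T
print(np.abs(y.mean(axis=0) - (W @ mu + b)).max())              # ~0
print(np.abs(np.cov(y, rowvar=False) - W @ Sigma @ W.T).max())  # ~0
```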

(24)

Factor Analysis

Then,

x ~ N(b, C)

where C = WW^T + ψ is the covariance matrix of the observed variables x.

The model’s parameters W, b, and ψ can be found by maximum likelihood estimation.
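
For illustration, a small sketch (toy dimensions; the parameters are assumed, and the maximum-likelihood fitting itself, e.g. via EM, is omitted) of evaluating the marginal likelihood x ~ N(b, C) with C = WW^T + ψ:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
d, q = 4, 2
W = rng.normal(size=(d, q))                      # factor loadings
b = np.zeros(d)                                  # mean / location term
Psi = np.diag(rng.uniform(0.1, 0.5, size=d))     # diagonal noise covariance

C = W @ W.T + Psi                                # marginal covariance of x
x = rng.multivariate_normal(b, C, size=5)        # a few observations
log_lik = multivariate_normal(mean=b, cov=C).logpdf(x).sum()
print(log_lik)   # the quantity maximized over W, b, Psi in maximum likelihood
```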

(25)


Probabilistic PCA

A special case of the factor analysis model

We can make a slight modification to the factor analysis model, making the conditional variances σ_i^2 equal to each other

Noise variances constrained to be equal: ψ = σ^2 I

x ~ Wh + b + ε

Latent variables: h ~ N(0, I)

Error (or noise): ε ~ N(0, σ^2 I)

Location term (mean): b

(26)

Probabilistic PCA

Then,

x ~ N(b, C)

where C = WW^T + σ^2 I is the covariance matrix of x

Normal PCA is a limiting case of probabilistic PCA, obtained in the limit as the covariance of the noise becomes infinitesimally small: ψ = lim_{σ^2 → 0} σ^2 I

(27)


Outline

Linear Factor Model

Probabilistic PCA and factor analysis

ICA

Slow Feature Analysis

Sparse Coding

Manifold Interpretation of PCA

(28)

Independent Component Analysis (ICA)

ICA is among the oldest representation learning algorithms

It is an approach to modeling linear factors that seeks to separate an observed signal into many underlying signals that are scaled and added together to form the observed data.

These signals are intended to be fully independent, rather than merely decorrelated from each other

(29)


Independent Component Analysis (ICA)

What distinguishes ICA from other methods is that it looks for components that are both statistically independent and non-Gaussian.

In practical situations, we cannot in general find a representation where the components are truly independent, but we can at least find components that are as independent as possible.
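
As a hedged illustration (the slides do not prescribe an implementation), here is a toy blind-source-separation example using scikit-learn's FastICA, one common ICA algorithm; the source signals and mixing matrix are arbitrary assumptions.

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
s = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]   # independent, non-Gaussian sources
A = np.array([[1.0, 0.5], [0.5, 2.0]])             # unknown mixing matrix
x = s @ A.T                                        # observed mixed signals

ica = FastICA(n_components=2, random_state=0)
s_hat = ica.fit_transform(x)                       # recovered sources (up to scale/permutation/sign)
```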

(30)

What is the difference between PCA and ICA?

(31)


Outline

Linear Factor Model

Probabilistic PCA and factor analysis

ICA

Slow Feature Analysis

Sparse Coding

Manifold Interpretation of PCA

(32)

Slow Feature Analysis

Slow Feature Analysis

A linear factor model that uses information from time signals to learn invariant features

Extract slowly varying features from a quickly varying input signal

A particularly efficient application of the slowness principle

SFA is not quite a generative model per se

It defines a linear map between input space and feature space

But it does not define a prior over feature space

Thus it does not impose a distribution p(x) on input space.

(33)


Motivation

Slowness principle

Physical entities in real life are subject to slow and continuous changes.

The important characteristics of scenes change very slowly compared to the individual measurements that make up a description of a scene

Example

In computer vision, individual pixel values can change very rapidly

If a zebra moves from left to right across the image, an individual pixel will rapidly change from black to white and back again as the zebra’s stripes pass over the pixel

By comparison, the feature indicating whether a zebra is in the image will not change at all, and the feature describing the zebra’s position will change slowly

(34)

Example of Slowness Principle

Illustration of the slowness principle

[reference - link]

(35)


Slow Feature Analysis

Task of Slow Feature Analysis (SFA)

Given: a multi-dimensional input signal x(t)

Find: functions f_j(x) such that the output signals y_j(t) := f_j(x(t)) minimize the variation of y_j over time

(36)

Slow Feature Analysis

Objective function

min_θ  E_t [ f(x^(t+1))_i − f(x^(t))_i ]^2

Constraints

E_t [ f(x^(t))_i ] = 0

E_t [ f(x^(t))_i^2 ] = 1

∀ i < j:  E_t [ f(x^(t))_i f(x^(t))_j ] = 0

(37)


Slow Feature Analysis

Constraints

E_t [ f(x^(t))_i ] = 0

Avoid the trivial constant solution

E_t [ f(x^(t))_i^2 ] = 1

Prevent the pathological solution where all features collapse to 0

The SFA features are ordered, with the first feature being the slowest

∀ i < j:  E_t [ f(x^(t))_i f(x^(t))_j ] = 0

The features are linearly decorrelated from each other

Without this constraint, all of the learned features would simply capture the one slowest signal
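
A minimal NumPy sketch of linear SFA under these constraints (my own illustrative implementation, not from the slides): whiten x(t) to satisfy the zero-mean, unit-variance, decorrelation constraints, then keep the directions in which the time-differenced signal varies least.

```python
import numpy as np

def linear_sfa(x, n_features):
    """x: array of shape (T, d), one row per time step; returns the slowest features."""
    x = x - x.mean(axis=0)                         # zero-mean constraint
    evals, evecs = np.linalg.eigh(np.cov(x, rowvar=False))
    whiten = evecs / np.sqrt(evals)                # whitening: unit variance, decorrelated
    z = x @ whiten
    dz = np.diff(z, axis=0)                        # temporal differences z(t+1) - z(t)
    d_evals, d_evecs = np.linalg.eigh(np.cov(dz, rowvar=False))
    slow = d_evecs[:, :n_features]                 # smallest eigenvalues = slowest directions
    return z @ slow                                # ordered: first output is the slowest
```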

(38)

Outline

Linear Factor Model

Probabilistic PCA and factor analysis

ICA

Slow Feature Analysis

Sparse Coding

Manifold Interpretation of PCA

(39)


Sparse coding

Sparse coding

An unsupervised feature learning and feature extraction mechanism

Motivation

The observation that most sensory data, such as natural images, may be described as the superposition of a small number of atomic elements such as surfaces or edges.

(40)

Sparse coding

2 tasks of sparse coding

The dimension of 𝐡 can be larger than that of x

Task 1 (encoding)

Given: test data 𝐱, dictionary D

Extract : sparse code 𝐡

Strictly speaking, “sparse coding”

Task 2 (learning)

Given: training data 𝐱

Learn: dictionary D and sparse code 𝐡

Strictly speaking, “sparse modeling”

(41)


Sparse coding

Objective function

h = argmin_h  λ‖h‖_1 + β‖x − Dh‖_2^2

x is the given data, D is a given dictionary

λ‖h‖_1 is a regularization term for sparsity

β‖x − Dh‖_2^2 is the reconstruction error, where x̂ = Dh is the reconstruction

Minimize the reconstruction error and make h sparse!

(42)

Task 1 (1)

How to compute 𝐡

We could use a gradient descent method

Algorithm

1. Initialize h

2. While h has not converged:

3.   h ← h − α D^T (Dh − x)   (update from reconstruction)

4.   h ← shrink(h, αλ)   (update from sparsity)

5. Return h
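
A runnable sketch of this encoding loop (an ISTA-style update; the step size, iteration count, and stopping rule are my own assumptions):

```python
import numpy as np

def shrink(h, t):
    """Soft-thresholding: entries whose sign would flip are set to 0, others move toward 0 by t."""
    return np.sign(h) * np.maximum(np.abs(h) - t, 0.0)

def sparse_code(x, D, lam=0.1, alpha=None, n_iters=500, tol=1e-6):
    """Encode x (shape (d,)) against dictionary D (shape (d, k)); return sparse code h (shape (k,))."""
    if alpha is None:
        alpha = 1.0 / np.linalg.norm(D, 2) ** 2    # step size from the spectral norm of D
    h = np.zeros(D.shape[1])                       # 1. initialize h
    for _ in range(n_iters):                       # 2. while h has not converged
        h_new = h - alpha * D.T @ (D @ h - x)      # 3. update from reconstruction
        h_new = shrink(h_new, alpha * lam)         # 4. update from sparsity
        if np.linalg.norm(h_new - h) < tol:
            return h_new
        h = h_new
    return h                                       # 5. return h
```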

(43)


Task 1 (2)

Details of the algorithm

l(x) = λ‖h‖_1 + (1/2)‖x − Dh‖_2^2

∇_h l(x) = λ sign(h) + D^T (Dh − x)

⇒ ∂l(x)/∂h_k = λ sign(h_k) + D_{·,k}^T (Dh − x)

Line 3 of the algorithm

h_k is updated by h_k ← h_k − α D_{·,k}^T (Dh − x)

(44)

Task 1 (3)

Details of the algorithm

l(x) = λ‖h‖_1 + (1/2)‖x − Dh‖_2^2

∇_h l(x) = λ sign(h) + D^T (Dh − x)

⇒ ∂l(x)/∂h_k = λ sign(h_k) + D_{·,k}^T (Dh − x)

Line 4 of the algorithm (shrink)

If sign(h_k) ≠ sign(h_k − αλ sign(h_k)), then h_k ← 0

else h_k ← h_k − αλ sign(h_k)

(45)


Task 2 (1)

Objective function

D, h = argmin_{D,h}  λ‖h‖_1 + β‖x − Dh‖_2^2

x is the given training data

λ‖h‖_1 is a regularization term for sparsity

β‖x − Dh‖_2^2 is the reconstruction error, where x̂ = Dh is the reconstruction

Learn the dictionary D and the sparse code h

(46)

Task 2 (2)

Alternating optimization to compute D and h

Fix h, optimize D in objective function

Fix D, optimize h

There are several techniques for learning dictionary D

Stochastic gradient descent
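
A hedged sketch of this alternating scheme (the learning rate, epoch count, and unit-norm renormalization of the dictionary atoms are my own assumptions; it reuses the sparse_code function from the earlier sketch):

```python
import numpy as np

def learn_dictionary(X, k, lam=0.1, lr=0.01, n_epochs=50):
    """X: (n, d) training data; k: number of dictionary atoms. Returns D of shape (d, k)."""
    rng = np.random.default_rng(0)
    D = rng.normal(size=(X.shape[1], k))
    D /= np.linalg.norm(D, axis=0)                 # unit-norm atoms
    for _ in range(n_epochs):
        for x in X:
            h = sparse_code(x, D, lam=lam)         # fix D, optimize h (encoding step)
            D -= lr * np.outer(D @ h - x, h)       # fix h, SGD step on the reconstruction term
            D /= np.linalg.norm(D, axis=0)         # keep atoms unit-norm
    return D
```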

(47)


Application

Image Denoising

Image Restoration

(48)

Outline

Linear Factor Model

Probabilistic PCA and factor analysis

ICA

Slow Feature Analysis

Sparse Coding

Manifold Interpretation of PCA

(49)


Manifold Interpretation of PCA

A manifold is a connected region: a set of points associated with a neighborhood around each point

The surface of the earth is a 2-D manifold in 3-D space

(50)

Manifold Interpretation of PCA

Linear factor models can be interpreted as learning a manifold.

PCA can be interpreted as aligning this pancake-shaped region of high probability with a linear manifold in a higher-dimensional space

The variance in the direction orthogonal to the manifold is very small (the arrow pointing out of the plane in the figure) and can be considered “noise”, while the other variances are large (the arrows in the plane) and correspond to “signal” and to a coordinate system for the reduced-dimension data.

(51)


What You Need to Know

Linear Factor Model

Probabilistic PCA and factor analysis

ICA

Slow Feature Analysis

Sparse Coding

Manifold Interpretation of PCA

(52)

Questions?
