Advanced Deep Learning
Linear Factor Models
U Kang
Seoul National University
In This Lecture
Linear Factor Model
Probabilistic PCA and factor analysis
ICA
Slow Feature Analysis
Sparse Coding
Manifold Interpretation of PCA
Outline
Linear Factor Model
Probabilistic PCA and factor analysis
ICA
Slow Feature Analysis
Sparse Coding
Manifold Interpretation of PCA
Linear Factor Models
linear factor models: the simplest probabilistic models with latent variables
Probabilistic inference
Many of the research frontiers in deep learning involve building a probabilistic model
Such a model can, in principle, use probabilistic inference to predict any of the variables in its environment given any of the other variables
Latent Variables
Latent variables
Many probabilistic models have latent variables h
Latent variables represent abstract concepts or theoretical constructs that cannot be directly measured
Latent variables provide another means of representing the data
Representing the data
Localist representations
Distributed representation
Latent Variables
Localist representations
The simplest way to represent things with neural networks is to dedicate one neuron to each thing
Easy to understand
Easy to code by hand
Easy to learn
Easy to associate with other representations or responses
But localist models are very inefficient whenever the data has componential structure
Latent Variables
Examples of componential structure
Big, yellow, Volkswagen
Do we have a neuron for this combination?
Consider a visual scene
It contains many different objects
Each object has many properties like shape, color, size, motion
Objects have spatial relationships to each other
Latent Variables
Distributed representations
Distributed representation means a many-to-many relationship between two types of representation (such as concepts and neurons)
Each concept is represented by many neurons
Each neuron participates in the representation of many concepts
Latent Variable Models
Offer a lower dimensional representation of the data and their dependencies
Latent variable model
x: observed variables (d-dimensions)
h: latent variables (q-dimensions)
q < d
Linear Factor Models
A linear factor model describes a data-generating process for x that includes latent variables h, where x is a linear function of h
Linear Factor Models
Data-generation process
1) Sample the explanatory factors h from a distribution
h ~ p(h)
p(h) = ∏ᵢ p(hᵢ) (factorial distribution)
2) Sample the real-valued observable variables given the factors
x = W h + b + noise
The noise is typically Gaussian and diagonal (independent across dimensions)
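A minimal NumPy sketch of this two-step generative process (the dimensions, W, b, and noise scale below are made-up illustration values):

```python
import numpy as np

rng = np.random.default_rng(0)
d, q = 5, 2                       # observed and latent dimensions (illustrative)

W = rng.normal(size=(d, q))       # factor loading matrix
b = rng.normal(size=d)            # location (bias) term
noise_std = 0.1 * np.ones(d)      # diagonal noise => independent across dimensions

# 1) sample the explanatory factors from a factorial prior p(h) = prod_i p(h_i)
h = rng.normal(size=q)

# 2) sample the real-valued observation: x = W h + b + noise
x = W @ h + b + noise_std * rng.normal(size=d)
```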
Special Cases of Linear Factor Models
There are special cases of linear factor models
Probabilistic PCA
Factor Analysis
Independent Component Analysis
Slow Feature Analysis
Sparse Coding
They only differ in the choices made for the noise distribution and the model's prior over latent variables h before observing x
Outline
Linear Factor Model
Probabilistic PCA and factor analysis
ICA
Slow Feature Analysis
Sparse Coding
Manifold Interpretation of PCA
Principal Component Analysis (PCA)
With a large number of variables, a matrix may be too large to study and interpret properly
There would be too many pairwise correlations between the variables to consider
To interpret the data in a more meaningful form, it is therefore necessary to reduce the number of variables to a few interpretable linear combinations of the data
Each linear combination will correspond to a principal component.
Examples of PCA
Examples of PCA
First, consider a dataset in two dimensions, like (height, weight)
This dataset can be plotted as points in a plane
Examples of PCA
But if we want to tease out variation, PCA finds a new coordinate system in which every point has a new (x,y) value
PCA Procedure
Used to transform observed data matrix x into h (find the q principal components)
Fairly simple solution
1. Center the data x
2. Calculate the covariance matrix C of x
3. Calculate the eigenvectors of C
4. Select the q eigenvectors corresponding to the highest eigenvalues and project x onto them
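A direct NumPy translation of this procedure; X is assumed to be an n × d data matrix and q the number of components to keep:

```python
import numpy as np

def pca(X, q):
    """X: (n, d) data matrix; returns the (n, q) representation h."""
    Xc = X - X.mean(axis=0)                  # 1. center the data
    C = np.cov(Xc, rowvar=False)             # 2. covariance matrix of x
    eigvals, eigvecs = np.linalg.eigh(C)     # 3. eigenvectors of C (C is symmetric)
    top = np.argsort(eigvals)[::-1][:q]      # 4. q directions with highest eigenvalues
    return Xc @ eigvecs[:, top]              # project onto the principal components
```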
Limitations of PCA
PCA is a simple linear-algebra transformation; it does not produce a probabilistic model for the observed data
The covariance matrix needs to be calculated
Can be very computation-intensive for large datasets with a high # of dimensions
Does not deal properly with missing data
Outlying data observations can unduly affect the analysis
Probabilistic PCA model
Enables comparison with other probabilistic techniques
Maximum-likelihood estimates can be computed for elements associated with principal components
Extends the scope of PCA
Multiple PCA models can be combined as a probabilistic mixture
PCA projections can be obtained when some data values are missing
Factor Analysis
Latent variable model with a linear relationship
x = Wh + b + ε
W is a matrix that relates observed variables x to the latent variables h
Latent variables: h ~ N(0, I)
Error (or noise): ε ~ N(0, ψ) – Gaussian noise
Location term (mean): b
Aside: Gaussian Distribution
Linear combination of Gaussian
Let x be a multivariate Gaussian with mean 𝜇 and covariance Σ
Consider a new variable y = Wx + b
Then, y is a multivariate Gaussian with mean Wμ + b and covariance WΣWᵀ
Sum of multivariate Gaussian
Suppose y ~ N(𝜇, Σ) and z ~ N(𝜇′, Σ′) are independent multivariate Gaussian.
Then, y+z ~ N(𝜇 + 𝜇′, Σ + Σ′)
Factor Analysis
Then,
x ~ N(b, C)
where C = WWᵀ + ψ is the covariance matrix of the observed variables x
The model's parameters W, b, and ψ can be found by maximum likelihood estimation
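A quick Monte Carlo sanity check of this result (W, b, and ψ below are arbitrary illustration values): data sampled from the factor-analysis model should have an empirical covariance close to WWᵀ + ψ.

```python
import numpy as np

rng = np.random.default_rng(0)
d, q, n = 4, 2, 200_000

W = rng.normal(size=(d, q))
b = rng.normal(size=d)
psi = np.diag(rng.uniform(0.1, 0.5, size=d))             # diagonal noise covariance

H = rng.normal(size=(n, q))                               # h ~ N(0, I)
E = rng.multivariate_normal(np.zeros(d), psi, size=n)     # eps ~ N(0, psi)
X = H @ W.T + b + E                                       # x = W h + b + eps

# empirical covariance of x should be close to W W^T + psi
print(np.abs(np.cov(X, rowvar=False) - (W @ W.T + psi)).max())   # small (sampling noise)
```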
Probabilistic PCA
A special case of the factor analysis model
We can make a slight modification to the factor analysis model, making the conditional variances σᵢ² equal to each other
Noise variances constrained to be equal (ψ = σ²I)
x = Wh + b + ε
Latent variables: h ~ N(0, I)
Error (or noise): ε ~ N(0, σ²I)
Location term (mean): b
Probabilistic PCA
Then,
x ~ N(b, C)
where C = WWᵀ + σ²I is the covariance matrix of x
Normal PCA is a limiting case of probabilistic PCA, taken as the limit as the covariance of the noise becomes infinitesimally small (ψ = lim_{σ²→0} σ²I)
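For reference, the probabilistic PCA parameters have a closed-form maximum-likelihood solution in terms of the eigen-decomposition of the sample covariance (the Tipping & Bishop result); a minimal sketch, with W recovered only up to a rotation:

```python
import numpy as np

def ppca_mle(X, q):
    """Closed-form ML fit of probabilistic PCA; W is recovered up to a rotation."""
    b = X.mean(axis=0)
    C = np.cov(X - b, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    sigma2 = eigvals[q:].mean()                            # average discarded variance
    W = eigvecs[:, :q] * np.sqrt(np.maximum(eigvals[:q] - sigma2, 0.0))
    return W, sigma2, b
```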
Outline
Linear Factor Model
Probabilistic PCA and factor analysis
ICA
Slow Feature Analysis
Sparse Coding
Manifold Interpretation of PCA
Independent Component Analysis (ICA)
ICA is among the oldest representation learning algorithms
It is an approach to modeling linear factors that seeks to separate an observed signal into many underlying signals that are scaled and added together to form the observed data
These signals are intended to be fully independent, rather than merely decorrelated from each other
Independent Component Analysis (ICA)
What distinguishes ICA from other methods is that it looks for components that are both statistically independent and non-Gaussian
In practical situations, we cannot in general find a representation where the components are truly independent, but we can at least find components that are as independent as possible
What is the difference between PCA and ICA?
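One way to see the contrast: a minimal blind-source-separation sketch using scikit-learn's FastICA (the synthetic sources and mixing matrix below are made up). PCA merely decorrelates the mixtures, while ICA recovers the independent, non-Gaussian sources.

```python
import numpy as np
from sklearn.decomposition import FastICA, PCA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.cos(3 * t))]   # two independent, non-Gaussian sources
A = np.array([[1.0, 0.5],
              [0.4, 1.0]])                          # mixing matrix
X = S @ A.T                                         # observed mixed signals

S_ica = FastICA(n_components=2, random_state=0).fit_transform(X)  # ~ original sources
S_pca = PCA(n_components=2).fit_transform(X)        # only decorrelated, not independent
```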
Outline
Linear Factor Model
Probabilistic PCA and factor analysis
ICA
Slow Feature Analysis
Sparse Coding
Manifold Interpretation of PCA
Slow Feature Analysis
Slow Feature Analysis
A linear factor model that uses information from time signals to learn invariant features
Extract slowly varying features from a quickly varying input signal
A particularly efficient application of the slowness principle
SFA is not quite a generative model per se
It defines a linear map between input space and feature space
But it does not define a prior over feature space
Thus it does not impose a distribution p(x) on input space.
Motivation
Slowness principle
Physical entities in real life are subject to slow and continuous changes.
The important characteristics of scenes change very slowly compared to the individual measurements that make up a description of a scene
Example
In computer vision, individual pixel values can change very rapidly
If a zebra moves from left to right across the image, an individual pixel will rapidly change from black to white and back again as the zebra’s stripes pass over the pixel
By comparison, the feature indicating whether a zebra is in the image will not change at all, and the feature describing the zebra’s position will change slowly
Example of Slowness Principle
Illustration of the slowness principle
Slow Feature Analysis
Task of Slow Feature Analysis (SFA)
Given: a multi-dimensional input signal x(t)
Find: functions fⱼ(x) such that the output signals yⱼ(t) := fⱼ(x(t)) minimize the variation of yⱼ over time
Slow Feature Analysis
Objective function
min_θ E_t [ f(x^(t+1))_i − f(x^(t))_i ]²
Constraints
E_t [ f(x^(t))_i ] = 0
E_t [ f(x^(t))_i² ] = 1
∀ i < j: E_t [ f(x^(t))_i f(x^(t))_j ] = 0
Slow Feature Analysis
Constraints
E_t [ f(x^(t))_i ] = 0
  Avoids the trivial constant solution
E_t [ f(x^(t))_i² ] = 1
  Prevents the pathological solution where all features collapse to 0
  The SFA features are ordered, with the first feature being the slowest
∀ i < j: E_t [ f(x^(t))_i f(x^(t))_j ] = 0
  Features are linearly decorrelated from each other
  Without this constraint, all of the learned features would simply capture the one slowest signal (a sketch of linear SFA follows below)
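A minimal sketch of linear SFA that satisfies these constraints by construction: whiten the input (zero mean, unit variance, decorrelated), then take the directions in which the whitened signal changes most slowly.

```python
import numpy as np

def linear_sfa(X, q):
    """X: (T, d) time series; returns the q slowest linear features, shape (T, q)."""
    # whitening enforces the constraints: zero mean, unit variance, decorrelated
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    Z = Xc @ (eigvecs / np.sqrt(np.maximum(eigvals, 1e-12)))
    # minimize E_t [ z(t+1) - z(t) ]^2: eigen-directions of the difference covariance
    dZ = np.diff(Z, axis=0)
    dvals, dvecs = np.linalg.eigh(np.cov(dZ, rowvar=False))
    return Z @ dvecs[:, :q]        # smallest eigenvalues = slowest features
```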
Outline
Linear Factor Model
Probabilistic PCA and factor analysis
ICA
Slow Feature Analysis
Sparse Coding
Manifold Interpretation of PCA
Sparse coding
Sparse coding
An unsupervised feature learning and feature extraction mechanism
Motivation
The observation that most sensory data, such as natural images, may be described as the superposition of a small number of atomic elements such as surfaces or edges
Sparse coding
2 tasks of sparse coding
The dimension of 𝐡 can be larger than that of x
Task 1 (encoding)
Given: test data 𝐱, dictionary D
Extract : sparse code 𝐡
Strictly speaking, “sparse coding”
Task 2 (learning)
Given: training data 𝐱
Learn: dictionary D and sparse code 𝐡
Strictly speaking, “sparse modeling”
Sparse coding
Objective function
ĥ = arg min_h λ‖h‖₁ + β‖x − Dh‖₂²
x is the given data, D is a given dictionary
λ‖h‖₁ is a regularization term for sparsity
Minimize the reconstruction error and make 𝐡 sparse!
Dh is the reconstruction x̂, and ‖x − Dh‖₂² is the reconstruction error
Task 1 (1)
How to compute 𝐡
We could use a gradient descent method
Algorithm
1. Initialize h
2. While h has not converged:
3. h ← h − αDᵀ(Dh − x) (update from reconstruction)
4. h ← shrink(h, αλ) (update from sparsity)
5. Return h
Task 1 (2)
Details of Algorithm
l(x) = λ‖h‖₁ + ½‖x − Dh‖₂²
∇_h l(x) = λ sign(h) + Dᵀ(Dh − x)
⇒ ∂l(x)/∂h_k = λ sign(h_k) + D·,kᵀ(Dh − x)
Line 3 in Algorithm
h_k is updated by h_k − αD·,kᵀ(Dh − x)
Task 1 (3)
Details of Algorithm
l(x) = λ‖h‖₁ + ½‖x − Dh‖₂²
∇_h l(x) = λ sign(h) + Dᵀ(Dh − x)
⇒ ∂l(x)/∂h_k = λ sign(h_k) + D·,kᵀ(Dh − x)
Line 4 in Algorithm
If sign(h_k) ≠ sign(h_k − αλ sign(h_k)) then h_k ← 0
else h_k ← h_k − αλ sign(h_k)
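Putting the two updates together, a minimal sketch of the encoding algorithm; D and x are assumed given, and alpha and lam are illustrative step-size and sparsity parameters.

```python
import numpy as np

def shrink(h, threshold):
    # soft-thresholding: equivalent to the Line 4 rule (zero out h_k if it would change sign)
    return np.sign(h) * np.maximum(np.abs(h) - threshold, 0.0)

def sparse_encode(x, D, lam=0.1, alpha=0.01, n_iters=500):
    """Compute a sparse code h for x given dictionary D (columns = atoms)."""
    h = np.zeros(D.shape[1])                 # 1. initialize h
    for _ in range(n_iters):                 # 2. until convergence (fixed iterations here)
        h = h - alpha * D.T @ (D @ h - x)    # 3. update from reconstruction
        h = shrink(h, alpha * lam)           # 4. update from sparsity
    return h                                 # 5. return h
```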
Task 2 (1)
Objective function
min_{D, h} λ‖h‖₁ + β‖x − Dh‖₂²
x is the given data
λ‖h‖₁ is a regularization term for sparsity
Learn dictionary D and sparse code 𝐡
Dh is the reconstruction x̂, and ‖x − Dh‖₂² is the reconstruction error
Task 2 (2)
Alternating optimization to compute D and h
Fix h, optimize D in objective function
Fix D, optimize h
There are several techniques for learning dictionary D
Stochastic gradient descent
…
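A minimal sketch of this alternating scheme for a batch of training examples (columns of X); it reuses sparse_encode from the Task 1 sketch above, and the dictionary step is a plain gradient update with column renormalization (the step sizes are illustrative).

```python
import numpy as np

def learn_dictionary(X, n_atoms, lam=0.1, lr=0.01, n_epochs=50):
    """X: (d, n) training data, one example per column; returns D (d, n_atoms), H (n_atoms, n)."""
    rng = np.random.default_rng(0)
    d, n = X.shape
    D = rng.normal(size=(d, n_atoms))
    D /= np.linalg.norm(D, axis=0)                  # unit-norm atoms
    H = np.zeros((n_atoms, n))
    for _ in range(n_epochs):
        for i in range(n):                          # fix D, optimize each sparse code h
            H[:, i] = sparse_encode(X[:, i], D, lam=lam)
        D -= lr * (D @ H - X) @ H.T                 # fix H, gradient step on reconstruction
        D /= np.linalg.norm(D, axis=0)              # re-normalize the atoms
    return D, H
```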
Application
Image Denoising
Image Restoration
Outline
Linear Factor Model
Probabilistic PCA and factor analysis
ICA
Slow Feature Analysis
Sparse Coding
Manifold Interpretation of PCA
Manifold Interpretation of PCA
A manifold is a connected region: a set of points, associated with a neighborhood around each point
The surface of the earth is a 2-D manifold in 3-D space
Manifold Interpretation of PCA
Linear factor models can be interpreted as learning a manifold.
PCA can be interpreted as aligning a thin "pancake" of probability mass (the data distribution) with a linear manifold in a higher-dimensional space
The variance in the direction orthogonal to the manifold is very small (arrow pointing out of the plane) and can be considered "noise", while the other variances are large (arrows in the plane) and correspond to "signal" and to a coordinate system for the reduced-dimension data
What You Need to Know
Linear Factor Model
Probabilistic PCA and factor analysis
ICA
Slow Feature Analysis
Sparse Coding
Manifold Interpretation of PCA
Questions?