Advanced Deep Learning
Linear Factor Models
U Kang
Seoul National University
In This Lecture
Linear Factor Model
Probabilistic PCA and factor analysis
ICA
Slow Feature Analysis
Sparse Coding
Manifold Interpretation of PCA
Outline
Linear Factor Model
Probabilistic PCA and factor analysis
ICA
Slow Feature Analysis
Sparse Coding
Manifold Interpretation of PCA
Linear Factor Models
linear factor models: the simplest probabilistic models with latent variables
Probabilistic inference
Many of the research frontiers in deep learning involve building a probabilistic model
Such a model can, in principle, use probabilistic inference to predict any of the variables in its environment given any of the other variables
Latent Variables
Latent variables
Many probabilistic models have latent variables h
Latent variables represent abstract concepts or theoretical constructs that cannot be directly measured
Latent variables provide another means of representing the data
Representing the data
Localist representations
Distributed representation
Latent Variables
Localist representations
The simplest way to represent things with neural networks is to dedicate one neuron to each thing
Easy to understand
Easy to code by hand
Easy to learn
Easy to associate with other representations or responses
But localist models are very inefficient whenever the data has componential structure
Latent Variables
Examples of componential structure
Big, yellow, Volkswagen
Do we have a neuron for this combination?
Consider a visual scene
It contains many different objects
Each object has many properties like shape, color, size, motion
Objects have spatial relationships to each other
Latent Variables
Distributed representations
Distributed representation means a many-to-many relationship between two types of representation (such as concepts and neurons)
Each concept is represented by many neurons
Each neuron participates in the representation of many concepts
Latent Variable Models
Offer a lower dimensional representation of the data and their dependencies
Latent variable model
x: observed variables (d-dimensions)
h: latent variables (q-dimensions)
q < d
Linear Factor Models
A linear factor model describes a data-generating process for x that includes latent variables h, where x is a linear function of h
Linear Factor Models
Data-generation process
1) Sample the explanatory factors h from a distribution
h ~ p(h)
p(h) = ∏ᵢ p(hᵢ) (factorial distribution)
2) Sample the real-valued observable variables given the factors
x = W h + b + noise
The noise is typically Gaussian and diagonal (independent across dimensions)
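A minimal NumPy sketch of this two-step generative process (the dimensions, W, b, and noise scale below are made-up illustration values):

```python
import numpy as np

rng = np.random.default_rng(0)
d, q = 5, 2                       # observed and latent dimensions (illustrative)

W = rng.normal(size=(d, q))       # factor loading matrix
b = rng.normal(size=d)            # location (bias) term
noise_std = 0.1 * np.ones(d)      # diagonal noise => independent across dimensions

# 1) sample the explanatory factors from a factorial prior p(h) = prod_i p(h_i)
h = rng.normal(size=q)

# 2) sample the real-valued observation: x = W h + b + noise
x = W @ h + b + noise_std * rng.normal(size=d)
```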
Special Cases of Linear Factor Models
There are special cases of linear factor models
Probabilistic PCA
Factor Analysis
Independent Component Analysis
Slow Feature Analysis
Sparse Coding
They only differ in the choices made for the noise distribution and the model's prior over latent variables h before observing x
Outline
Linear Factor Model
Probabilistic PCA and factor analysis
ICA
Slow Feature Analysis
Sparse Coding
Manifold Interpretation of PCA
Principal Component Analysis (PCA)
With a large number of variables, a matrix may be too large to study and interpret properly
There would be too many pairwise correlations between the variables to consider
To interpret the data in a more meaningful form, it is therefore necessary to reduce the number of variables to a few interpretable linear combinations of the data
Each linear combination will correspond to a principal component.
Examples of PCA
Examples of PCA
First, consider a dataset in two dimensions, like (height, weight)
This dataset can be plotted as points in a plane
Examples of PCA
But if we want to tease out variation, PCA finds a new coordinate system in which every point has a new (x,y) value
PCA Procedure
Used to transform observed data matrix x into h (find the q principal components)
Fairly simple solution
1. Center the data x
2. Calculate the covariance matrix C of x
3. Calculate the eigenvectors of C
4. Select the q eigenvectors corresponding to the highest eigenvalues and project x onto them
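A direct NumPy translation of this procedure; X is assumed to be an n × d data matrix and q the number of components to keep:

```python
import numpy as np

def pca(X, q):
    """X: (n, d) data matrix; returns the (n, q) representation h."""
    Xc = X - X.mean(axis=0)                  # 1. center the data
    C = np.cov(Xc, rowvar=False)             # 2. covariance matrix of x
    eigvals, eigvecs = np.linalg.eigh(C)     # 3. eigenvectors of C (C is symmetric)
    top = np.argsort(eigvals)[::-1][:q]      # 4. q directions with highest eigenvalues
    return Xc @ eigvecs[:, top]              # project onto the principal components
```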
Limitations of PCA
PCA is a simple linear-algebra transformation; it does not produce a probabilistic model for the observed data
The covariance matrix needs to be calculated
Can be very computation-intensive for large datasets with a high # of dimensions
Does not deal properly with missing data
Outlying data observations can unduly affect the analysis
Probabilistic PCA model
Enables comparison with other probabilistic techniques
Maximum-likelihood estimates can be computed for elements associated with principal components
Extends the scope of PCA
Multiple PCA models can be combined as a probabilistic mixture
PCA projections can be obtained when some data values are missing
Factor Analysis
Latent variable model with a linear relationship
x = Wh + b + ε
W is a matrix that relates observed variables x to the latent variables h
Latent variables: h ~ N(0, I)
Error (or noise): ε ~ N(0, ψ) – Gaussian noise
Location term (mean): b
Aside: Gaussian Distribution
Linear combination of Gaussian
Let x be a multivariate Gaussian with mean 𝜇 and covariance Σ
Consider a new variable y = Wx + b
Then, y is a multivariate Gaussian with mean Wμ + b and covariance WΣWᵀ
Sum of multivariate Gaussian
Suppose y ~ N(𝜇, Σ) and z ~ N(𝜇′, Σ′) are independent multivariate Gaussian.
Then, y+z ~ N(𝜇 + 𝜇′, Σ + Σ′)
Factor Analysis
Then,
x ~ N(b, C)
where C = WWᵀ + ψ is the covariance matrix of the observed variables x
The model's parameters W, b, and ψ can be found by maximum likelihood estimation
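A quick Monte Carlo sanity check of this result (W, b, and ψ below are arbitrary illustration values): data sampled from the factor-analysis model should have an empirical covariance close to WWᵀ + ψ.

```python
import numpy as np

rng = np.random.default_rng(0)
d, q, n = 4, 2, 200_000

W = rng.normal(size=(d, q))
b = rng.normal(size=d)
psi = np.diag(rng.uniform(0.1, 0.5, size=d))             # diagonal noise covariance

H = rng.normal(size=(n, q))                               # h ~ N(0, I)
E = rng.multivariate_normal(np.zeros(d), psi, size=n)     # eps ~ N(0, psi)
X = H @ W.T + b + E                                       # x = W h + b + eps

# empirical covariance of x should be close to W W^T + psi
print(np.abs(np.cov(X, rowvar=False) - (W @ W.T + psi)).max())   # small (sampling noise)
```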
Probabilistic PCA
A special case of the factor analysis model
We can make a slight modification to the factor analysis model, making the conditional variances σᵢ² equal to each other
Noise variances constrained to be equal (ψ = σ²I)
x = Wh + b + ε
Latent variables: h ~ N(0, I)
Error (or noise): ε ~ N(0, σ²I)
Location term (mean): b
Probabilistic PCA
Then,
x ~ N(b, C)
where C = WWᵀ + σ²I is the covariance matrix of x
Normal PCA is a limiting case of probabilistic PCA, taken as the limit as the covariance of the noise becomes infinitesimally small (ψ = lim_{σ²→0} σ²I)
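For reference, the probabilistic PCA parameters have a closed-form maximum-likelihood solution in terms of the eigen-decomposition of the sample covariance (the Tipping & Bishop result); a minimal sketch, with W recovered only up to a rotation:

```python
import numpy as np

def ppca_mle(X, q):
    """Closed-form ML fit of probabilistic PCA; W is recovered up to a rotation."""
    b = X.mean(axis=0)
    C = np.cov(X - b, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    sigma2 = eigvals[q:].mean()                            # average discarded variance
    W = eigvecs[:, :q] * np.sqrt(np.maximum(eigvals[:q] - sigma2, 0.0))
    return W, sigma2, b
```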
Outline
Linear Factor Model
Probabilistic PCA and factor analysis
ICA
Slow Feature Analysis
Sparse Coding
Manifold Interpretation of PCA
Independent Component Analysis (ICA)
ICA is among the oldest representation learning algorithms
It is an approach to modeling linear factors that seeks to separate an observed signal into many underlying signals that are scaled and added together to form the observed data
These signals are intended to be fully independent, rather than merely decorrelated from each other
Independent Component Analysis (ICA)
What distinguishes ICA from other methods is that it looks for components that are both statistically independent and non-Gaussian
In practical situations, we cannot in general find a representation where the components are truly independent, but we can at least find components that are as independent as possible
What is the difference between PCA and ICA?
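One way to see the contrast: a minimal blind-source-separation sketch using scikit-learn's FastICA (the synthetic sources and mixing matrix below are made up). PCA merely decorrelates the mixtures, while ICA recovers the independent, non-Gaussian sources.

```python
import numpy as np
from sklearn.decomposition import FastICA, PCA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.cos(3 * t))]   # two independent, non-Gaussian sources
A = np.array([[1.0, 0.5],
              [0.4, 1.0]])                          # mixing matrix
X = S @ A.T                                         # observed mixed signals

S_ica = FastICA(n_components=2, random_state=0).fit_transform(X)  # ~ original sources
S_pca = PCA(n_components=2).fit_transform(X)        # only decorrelated, not independent
```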
Outline
Linear Factor Model
Probabilistic PCA and factor analysis
ICA
Slow Feature Analysis
Sparse Coding
Manifold Interpretation of PCA
Slow Feature Analysis
Slow Feature Analysis
A linear factor model that uses information from time signals to learn invariant features
Extract slowly varying features from a quickly varying input signal
A particularly efficient application of the slowness principle
SFA is not quite a generative model per se
It defines a linear map between input space and feature space
But it does not define a prior over feature space
Thus it does not impose a distribution p(x) on input space.
Motivation
Slowness principle
Physical entities in real life are subject to slow and continuous changes.
The important characteristics of scenes change very slowly compared to the individual measurements that make up a description of a scene
Example
In computer vision, individual pixel values can change very rapidly
If a zebra moves from left to right across the image, an individual pixel will rapidly change from black to white and back again as the zebra’s stripes pass over the pixel
By comparison, the feature indicating whether a zebra is in the image will not change at all, and the feature describing the zebra’s position will change slowly
Example of Slowness Principle
Illustration of the slowness principle
Slow Feature Analysis
Task of Slow Feature Analysis (SFA)
Given: a multi-dimensional input signal x(t)
Find: functions fⱼ(x) such that the output signals yⱼ(t) := fⱼ(x(t)) minimize the variation of yⱼ over time
Slow Feature Analysis
Objective function
min_θ E_t [ f(x^(t+1))_i − f(x^(t))_i ]²
Constraints
E_t [ f(x^(t))_i ] = 0
E_t [ f(x^(t))_i² ] = 1
∀ i < j: E_t [ f(x^(t))_i f(x^(t))_j ] = 0
Slow Feature Analysis
Constraints
E_t [ f(x^(t))_i ] = 0
  Avoids the trivial constant solution
E_t [ f(x^(t))_i² ] = 1
  Prevents the pathological solution where all features collapse to 0
  The SFA features are ordered, with the first feature being the slowest
∀ i < j: E_t [ f(x^(t))_i f(x^(t))_j ] = 0
  Features are linearly decorrelated from each other
  Without this constraint, all of the learned features would simply capture the one slowest signal (a sketch of linear SFA follows below)
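A minimal sketch of linear SFA that satisfies these constraints by construction: whiten the input (zero mean, unit variance, decorrelated), then take the directions in which the whitened signal changes most slowly.

```python
import numpy as np

def linear_sfa(X, q):
    """X: (T, d) time series; returns the q slowest linear features, shape (T, q)."""
    # whitening enforces the constraints: zero mean, unit variance, decorrelated
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    Z = Xc @ (eigvecs / np.sqrt(np.maximum(eigvals, 1e-12)))
    # minimize E_t [ z(t+1) - z(t) ]^2: eigen-directions of the difference covariance
    dZ = np.diff(Z, axis=0)
    dvals, dvecs = np.linalg.eigh(np.cov(dZ, rowvar=False))
    return Z @ dvecs[:, :q]        # smallest eigenvalues = slowest features
```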
Outline
Linear Factor Model
Probabilistic PCA and factor analysis
ICA
Slow Feature Analysis
Sparse Coding
Manifold Interpretation of PCA
Sparse coding
Sparse coding
An unsupervised feature learning and feature extraction mechanism
Motivation
The observation that most sensory data, such as natural images, may be described as the superposition of a small number of atomic elements such as surfaces or edges
Sparse coding
2 tasks of sparse coding
The dimension of 𝐡 can be larger than that of x
Task 1 (encoding)
Given: test data 𝐱, dictionary D
Extract : sparse code 𝐡
Strictly speaking, “sparse coding”
Task 2 (learning)
Given: training data 𝐱
Learn: dictionary D and sparse code 𝐡
Strictly speaking, “sparse modeling”
Sparse coding
Objective function
ĥ = arg min_h λ‖h‖₁ + β‖x − Dh‖₂²
x is the given data, D is a given dictionary
λ‖h‖₁ is a regularization term for sparsity
Minimize the reconstruction error and make 𝐡 sparse!
Dh is the reconstruction x̂, and ‖x − Dh‖₂² is the reconstruction error
Task 1 (1)
How to compute 𝐡
We could use a gradient descent method
Algorithm
1. Initialize h
2. While h has not converged:
3. h ← h − αDᵀ(Dh − x) (update from reconstruction)
4. h ← shrink(h, αλ) (update from sparsity)
5. Return h
Task 1 (2)
Details of Algorithm
l(x) = λ‖h‖₁ + ½‖x − Dh‖₂²
∇_h l(x) = λ sign(h) + Dᵀ(Dh − x)
⇒ ∂l(x)/∂h_k = λ sign(h_k) + D·,kᵀ(Dh − x)
Line 3 in Algorithm
h_k is updated by h_k − αD·,kᵀ(Dh − x)
Task 1 (3)
Details of Algorithm
l(x) = λ‖h‖₁ + ½‖x − Dh‖₂²
∇_h l(x) = λ sign(h) + Dᵀ(Dh − x)
⇒ ∂l(x)/∂h_k = λ sign(h_k) + D·,kᵀ(Dh − x)
Line 4 in Algorithm
If sign(h_k) ≠ sign(h_k − αλ sign(h_k)) then h_k ← 0
else h_k ← h_k − αλ sign(h_k)
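Putting the two updates together, a minimal sketch of the encoding algorithm; D and x are assumed given, and alpha and lam are illustrative step-size and sparsity parameters.

```python
import numpy as np

def shrink(h, threshold):
    # soft-thresholding: equivalent to the Line 4 rule (zero out h_k if it would change sign)
    return np.sign(h) * np.maximum(np.abs(h) - threshold, 0.0)

def sparse_encode(x, D, lam=0.1, alpha=0.01, n_iters=500):
    """Compute a sparse code h for x given dictionary D (columns = atoms)."""
    h = np.zeros(D.shape[1])                 # 1. initialize h
    for _ in range(n_iters):                 # 2. until convergence (fixed iterations here)
        h = h - alpha * D.T @ (D @ h - x)    # 3. update from reconstruction
        h = shrink(h, alpha * lam)           # 4. update from sparsity
    return h                                 # 5. return h
```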
Task 2 (1)
Objective function
min_{D, h} λ‖h‖₁ + β‖x − Dh‖₂²
x is the given data
λ‖h‖₁ is a regularization term for sparsity
Learn dictionary D and sparse code 𝐡
Dh is the reconstruction x̂, and ‖x − Dh‖₂² is the reconstruction error
Task 2 (2)
Alternating optimization to compute D and h
Fix h, optimize D in objective function
Fix D, optimize h
There are several techniques for learning dictionary D
Stochastic gradient descent
…
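A minimal sketch of this alternating scheme for a batch of training examples (columns of X); it reuses sparse_encode from the Task 1 sketch above, and the dictionary step is a plain gradient update with column renormalization (the step sizes are illustrative).

```python
import numpy as np

def learn_dictionary(X, n_atoms, lam=0.1, lr=0.01, n_epochs=50):
    """X: (d, n) training data, one example per column; returns D (d, n_atoms), H (n_atoms, n)."""
    rng = np.random.default_rng(0)
    d, n = X.shape
    D = rng.normal(size=(d, n_atoms))
    D /= np.linalg.norm(D, axis=0)                  # unit-norm atoms
    H = np.zeros((n_atoms, n))
    for _ in range(n_epochs):
        for i in range(n):                          # fix D, optimize each sparse code h
            H[:, i] = sparse_encode(X[:, i], D, lam=lam)
        D -= lr * (D @ H - X) @ H.T                 # fix H, gradient step on reconstruction
        D /= np.linalg.norm(D, axis=0)              # re-normalize the atoms
    return D, H
```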
Application
Image Denoising
Image Restoration
Outline
Linear Factor Model
Probabilistic PCA and factor analysis
ICA
Slow Feature Analysis
Sparse Coding
Manifold Interpretation of PCA
Manifold Interpretation of PCA
A manifold is a connected region: a set of points, associated with a neighborhood around each point
The surface of the earth is a 2-D manifold in 3-D space
Manifold Interpretation of PCA
Linear factor models can be interpreted as learning a manifold.
PCA can be interpreted as aligning a thin "pancake" of probability mass (the data distribution) with a linear manifold in a higher-dimensional space
The variance in the direction orthogonal to the manifold is very small (arrow pointing out of the plane) and can be considered "noise", while the other variances are large (arrows in the plane) and correspond to "signal" and to a coordinate system for the reduced-dimension data
What You Need to Know
Linear Factor Model
Probabilistic PCA and factor analysis
ICA
Slow Feature Analysis
Sparse Coding
Manifold Interpretation of PCA
Questions?