Large Scale Data Analysis Using Deep Learning
Deep Feedforward Networks - 2
U Kang
In This Lecture
Back propagation
Motivation
Main idea
Procedure
Computational Graphs
Formalizes computation
Each node indicates a variable
Each edge indicates an operation
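For illustration, here is a minimal sketch (not from the slides) of how such a graph can be represented in Python: each node keeps its parents and the operation that combines them; the class and variable names are invented for this example.

```python
# A minimal sketch of a computational graph: a node stores its parents and
# the operation applied to them (all names are illustrative).

class Node:
    def __init__(self, name, parents=(), op=None, value=None):
        self.name = name          # variable name
        self.parents = parents    # nodes this node is computed from
        self.op = op              # operation applied to the parents
        self.value = value        # numeric value (set directly for inputs)

    def compute(self):
        # Inputs already hold a value; internal nodes apply their operation.
        if self.op is not None:
            self.value = self.op(*[p.compute() for p in self.parents])
        return self.value

# Graph for z = (x + y) * y
x = Node("x", value=2.0)
y = Node("y", value=3.0)
s = Node("s", parents=(x, y), op=lambda a, b: a + b)
z = Node("z", parents=(s, y), op=lambda a, b: a * b)

print(z.compute())  # (2 + 3) * 3 = 15.0
```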
Chain Rule of Calculus
Let x be a real number, and f and g be functions from R to R. Suppose y = g(x), and z = f(g(x)) = f(y). Then the chain rule states that
$\frac{dz}{dx} = \frac{dz}{dy}\,\frac{dy}{dx}$
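As a quick check of this rule, the sketch below uses the assumed functions $y = g(x) = x^2$ and $z = f(y) = \sin(y)$, so that $\frac{dz}{dx} = \cos(x^2) \cdot 2x$, and compares it against a finite-difference approximation.

```python
import math

# Hypothetical example functions: y = g(x) = x**2, z = f(y) = sin(y)
g = lambda x: x ** 2
f = lambda y: math.sin(y)

def dz_dx_chain(x):
    # dz/dx = (dz/dy) * (dy/dx) = cos(x**2) * 2x
    return math.cos(g(x)) * 2 * x

def dz_dx_numeric(x, eps=1e-6):
    # central finite difference of the composition f(g(x))
    return (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)

x = 1.3
print(dz_dx_chain(x), dz_dx_numeric(x))  # the two values should nearly match
```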
Suppose $\mathbf{x} \in \mathbb{R}^m$, $\mathbf{y} \in \mathbb{R}^n$, $g: \mathbb{R}^m \to \mathbb{R}^n$, and $f: \mathbb{R}^n \to \mathbb{R}$. If $\mathbf{y} = g(\mathbf{x})$ and $z = f(\mathbf{y})$, then
$\frac{\partial z}{\partial x_i} = \sum_j \frac{\partial z}{\partial y_j} \frac{\partial y_j}{\partial x_i}$
In vector notation: $\nabla_{\mathbf{x}} z = \left(\frac{\partial \mathbf{y}}{\partial \mathbf{x}}\right)^T \nabla_{\mathbf{y}} z$, where $\frac{\partial \mathbf{y}}{\partial \mathbf{x}}$ is the $n \times m$ Jacobian matrix of $g$
E.g., suppose $z = \mathbf{w}^T \mathbf{y}$ and $\mathbf{y} = \nabla_{\mathbf{x}} g(\mathbf{x})$. Then $\nabla_{\mathbf{x}} z = \left(\frac{\partial \mathbf{y}}{\partial \mathbf{x}}\right)^T \nabla_{\mathbf{y}} z = H^T \mathbf{w}$, where $H$ is the Hessian matrix of $g$
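To make the vector form concrete, the following sketch picks an arbitrary $g: \mathbb{R}^2 \to \mathbb{R}^3$ and $f: \mathbb{R}^3 \to \mathbb{R}$ (both invented for illustration, not from the slides) and numerically verifies $\nabla_{\mathbf{x}} z = (\partial\mathbf{y}/\partial\mathbf{x})^T \nabla_{\mathbf{y}} z$.

```python
import numpy as np

# Assumed example: g maps R^2 -> R^3, f maps R^3 -> R (chosen arbitrarily).
def g(x):
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

def f(y):
    return y[0] + 2.0 * y[1] * y[2]

def jacobian_g(x):
    # 3 x 2 Jacobian dy/dx, derived by hand for the g above
    return np.array([[x[1], x[0]],
                     [np.cos(x[0]), 0.0],
                     [0.0, 2.0 * x[1]]])

def grad_f(y):
    # gradient of f with respect to y, derived by hand
    return np.array([1.0, 2.0 * y[2], 2.0 * y[1]])

x = np.array([0.7, -1.2])
grad_chain = jacobian_g(x).T @ grad_f(g(x))      # (dy/dx)^T nabla_y z

# Finite-difference check of nabla_x z
eps = 1e-6
grad_fd = np.array([(f(g(x + eps * e)) - f(g(x - eps * e))) / (2 * eps)
                    for e in np.eye(2)])
print(grad_chain, grad_fd)  # should nearly match
```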
Repeated Subexpressions
When the chain rule is applied naively, the same subexpression appears many times; computing it repeatedly from scratch is wasteful
Backpropagation - Overview
Backpropagation is an algorithm to compute partial derivatives efficiently in neural networks
The main idea is dynamic programming
Many gradient computations share common subexpressions
Store these common subexpressions in memory and read them when needed, instead of re-computing them from scratch
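A toy illustration of this idea, with made-up local derivatives along a chain $x \to y \to z \to t$: the naive version re-multiplies the same path for every gradient, while the cached version stores each derivative of $t$ once and reuses it.

```python
# Toy sketch: local derivatives along a chain x -> y -> z -> t
# (the numbers are made up just to illustrate reuse of shared products).
local = {("y", "x"): 2.0, ("z", "y"): 3.0, ("t", "z"): 4.0}
parents = {"y": "x", "z": "y", "t": "z"}

def naive(node):
    # dt/d(node) by multiplying the whole path every time (repeated work)
    d, cur = 1.0, node
    while cur != "t":
        child = next(c for c, p in parents.items() if p == cur)
        d *= local[(child, cur)]
        cur = child
    return d

cache = {"t": 1.0}            # dt/dt = 1
for node in ["z", "y", "x"]:  # reverse topological order
    child = next(c for c, p in parents.items() if p == node)
    # reuse the already stored dt/d(child) instead of recomputing the path
    cache[node] = cache[child] * local[(child, node)]

print([naive(n) for n in ["z", "y", "x"]])  # [4.0, 12.0, 24.0]
print(cache)                                # same values, each product computed once
```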
Backpropagation - Overview
Assume a computational graph that computes a single scalar $u^{(n)}$
E.g., this can be the loss
Our goal is to compute the partial derivatives $\frac{\partial u^{(n)}}{\partial u^{(i)}}$ for all $i \in \{1, 2, \ldots, n_i\}$, where $u^{(1)}$ to $u^{(n_i)}$ are the parameters of the model
We assume that nodes are ordered such that we can compute their outputs one after the other, starting at $u^{(n_i + 1)}$ and going up to $u^{(n)}$
Each node $u^{(i)}$ is associated with an operation $f^{(i)}$ and is computed by $u^{(i)} = f^{(i)}(\mathbb{A}^{(i)})$, where $\mathbb{A}^{(i)}$ is the set of all parents of $u^{(i)}$
Forward Propagation
This procedure performs the computations mapping the $n_i$ inputs $u^{(1)}, \ldots, u^{(n_i)}$ to the output $u^{(n)}$
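A minimal sketch of this forward pass, assuming the graph is given as a topologically ordered list of operations together with parent indices (the representation and names are assumptions made for this sketch).

```python
import numpy as np

# Forward propagation over a computational graph whose nodes are already
# topologically ordered: the first n_i entries are inputs, later nodes
# apply their operation f to the values of their parents.
def forward(inputs, ops):
    """inputs: list of input values u^(1)..u^(n_i)
       ops: list of (f, parent_indices) for the remaining nodes, in order"""
    u = list(inputs)
    for f, parent_idx in ops:
        u.append(f(*[u[p] for p in parent_idx]))
    return u  # u[-1] is the scalar output u^(n)

# Example graph: u3 = u1 * u2, u4 = sin(u3), u5 = u3 + u4  (made-up operations)
ops = [(lambda a, b: a * b, (0, 1)),
       (np.sin, (2,)),
       (lambda a, b: a + b, (2, 3))]
u = forward([2.0, 3.0], ops)
print(u[-1])  # value of the final node
```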
Back-Propagation
This procedure computes the partial derivatives of $u^{(n)}$ with respect to the variables $u^{(1)}, \ldots, u^{(n_i)}$
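A matching sketch of the backward pass under the same assumed graph representation: nodes are visited in reverse order, each edge contributes one local partial derivative, one multiplication, and one addition, and grad[i] accumulates $\frac{\partial u^{(n)}}{\partial u^{(i)}}$.

```python
import numpy as np

# Back-propagation over a topologically ordered graph: visit nodes in reverse
# order and accumulate d u^(n) / d u^(i), reusing already-stored gradients.
def backward(u, local_grads, n_inputs):
    """u: all forward values; local_grads: list of (grad_fn, parent_indices)
    for each non-input node, where grad_fn returns the partial derivatives
    of that node with respect to each of its parents."""
    grad = [0.0] * len(u)
    grad[-1] = 1.0                                   # d u^(n) / d u^(n) = 1
    for k in range(len(local_grads) - 1, -1, -1):
        node = n_inputs + k                          # node computed at step k
        grad_fn, parent_idx = local_grads[k]
        for dfdp, p in zip(grad_fn(*[u[q] for q in parent_idx]), parent_idx):
            grad[p] += grad[node] * dfdp             # one multiply, one add per edge
    return grad[:n_inputs]                           # gradients w.r.t. the inputs

# Local derivatives for the example graph from the forward-propagation sketch
local_grads = [(lambda a, b: (b, a), (0, 1)),        # d(a*b)/da = b, d(a*b)/db = a
               (lambda a: (np.cos(a),), (2,)),       # d(sin a)/da = cos a
               (lambda a, b: (1.0, 1.0), (2, 3))]    # d(a+b) = (1, 1)
u = [2.0, 3.0, 6.0, np.sin(6.0), 6.0 + np.sin(6.0)]  # forward values
print(backward(u, local_grads, n_inputs=2))          # gradients w.r.t. u1 and u2
```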
Back-Propagation Example
Back-propagation procedure (reuse the shared derivatives instead of recomputing them)
Compute $\frac{\partial z}{\partial y}$ and $\frac{\partial z}{\partial x}$
$\frac{\partial z}{\partial w} \leftarrow \frac{\partial z}{\partial y}\frac{\partial y}{\partial w} + \frac{\partial z}{\partial x}\frac{\partial x}{\partial w}$
$\frac{\partial z}{\partial v} \leftarrow \frac{\partial z}{\partial w}\frac{\partial w}{\partial v}$
[Figure: computational graph with edges $v \to w$, $w \to y$, $w \to x$, $y \to z$, $x \to z$]
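To ground this example numerically, the sketch below assigns made-up operations to the graph ($w = v^2$, $y = \sin(w)$, $x = e^w$, $z = yx$; these functions are assumptions, not from the slides), evaluates the two update rules, and checks the result with finite differences.

```python
import math

# Assumed operations for the example graph: w = v**2, y = sin(w), x = exp(w), z = y * x
v = 0.5
w = v ** 2
y, x = math.sin(w), math.exp(w)

dz_dy, dz_dx = x, y                      # first compute dz/dy and dz/dx
dy_dw, dx_dw = math.cos(w), math.exp(w)
dz_dw = dz_dy * dy_dw + dz_dx * dx_dw    # reuse dz/dy and dz/dx
dw_dv = 2 * v
dz_dv = dz_dw * dw_dv                    # reuse dz/dw

# Finite-difference check of dz/dv
eps = 1e-6
z_of = lambda v: math.sin(v ** 2) * math.exp(v ** 2)
print(dz_dv, (z_of(v + eps) - z_of(v - eps)) / (2 * eps))  # should nearly match
```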
Cost of Back-Propagation
The amount of computation scales linearly with the number of edges in the computational graph
Computation for each edge: computing a partial derivative, one multiplication, and one addition
[Figure: computational graph with nodes $w$, $y$, $x$, $z$, annotating the per-edge work]
Compute $\frac{\partial z}{\partial y}$ and $\frac{\partial z}{\partial x}$
$\frac{\partial z}{\partial w} \leftarrow \frac{\partial z}{\partial y}\frac{\partial y}{\partial w} + \frac{\partial z}{\partial x}\frac{\partial x}{\partial w}$
Back-Propagation in Fully Connected MLP
Forward propagation
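A minimal sketch of forward propagation through a fully connected network, assuming weight matrices W[k], bias vectors b[k], and an elementwise activation (tanh here); the intermediate values are cached because the backward pass will need them.

```python
import numpy as np

def mlp_forward(x, Ws, bs, activation=np.tanh):
    """Forward propagation through a fully connected network.
    Ws, bs: lists of weight matrices and bias vectors, one per layer."""
    h = x
    pre_acts, acts = [], [x]       # cache values needed later by back-propagation
    for W, b in zip(Ws, bs):
        a = W @ h + b              # affine transformation
        h = activation(a)          # elementwise nonlinearity
        pre_acts.append(a)
        acts.append(h)
    return h, pre_acts, acts

# Tiny usage example with random parameters
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
bs = [np.zeros(4), np.zeros(2)]
y_hat, pre_acts, acts = mlp_forward(rng.normal(size=3), Ws, bs)
print(y_hat)
```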
Back-Propagation in Fully Connected MLP
Backward computation
⨀: element-wise product
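A matching sketch of the backward computation, assuming tanh activations and a squared-error loss: the gradient g is propagated layer by layer, multiplied elementwise (⊙) by the activation derivative, and turned into per-layer weight and bias gradients.

```python
import numpy as np

def mlp_backward(y, y_hat, Ws, pre_acts, acts):
    """Back-propagation for a fully connected net with tanh activations and
    squared-error loss. Returns gradients w.r.t. each W[k] and b[k]."""
    g = 2.0 * (y_hat - y)                           # dL/d(output)
    grads_W, grads_b = [], []
    for k in range(len(Ws) - 1, -1, -1):
        g = g * (1.0 - np.tanh(pre_acts[k]) ** 2)   # g <- g ⊙ f'(a^(k))
        grads_W.append(np.outer(g, acts[k]))        # dL/dW^(k) = g (h^(k-1))^T
        grads_b.append(g.copy())                    # dL/db^(k) = g
        g = Ws[k].T @ g                             # propagate to the layer below
    return grads_W[::-1], grads_b[::-1]

# Self-contained usage: a tiny forward pass with tanh activations and biases
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
bs = [np.zeros(4), np.zeros(2)]
h = rng.normal(size=3)
acts, pre_acts = [h], []
for W, b in zip(Ws, bs):
    a = W @ h + b
    h = np.tanh(a)
    pre_acts.append(a)
    acts.append(h)

gW, gb = mlp_backward(np.ones(2), h, Ws, pre_acts, acts)
print([w.shape for w in gW], [b.shape for b in gb])  # [(4, 3), (2, 4)] [(4,), (2,)]
```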
Minibatch Processing
SGD (recap)
We sample a minibatch of examples $\mathbb{B} = \{x^{(1)}, \ldots, x^{(m')}\}$ drawn uniformly from the training set
The estimate of the gradient is $g = \frac{1}{m'} \sum_{i=1}^{m'} \nabla_\theta L(x^{(i)}, y^{(i)}, \theta)$
Then the gradient descent update is $\theta \leftarrow \theta - \epsilon g$
SGD using back propagation
For each instance $(x^{(i)}, y^{(i)})$, where $i = 1, \ldots, m'$, we compute the gradient $\nabla_\theta J^{(i)}$ using back propagation, where $J^{(i)} = L(x^{(i)}, y^{(i)}, \theta)$ is the $i$-th loss
The final gradient is $g = \frac{1}{m'} \sum_{i=1}^{m'} \nabla_\theta J^{(i)}$
Update $\theta \leftarrow \theta - \epsilon g$
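A minimal sketch of one such SGD step; to keep the block self-contained it uses a linear model whose per-example gradient is written in closed form, standing in for the $\nabla_\theta J^{(i)}$ that back propagation would produce in a deep network.

```python
import numpy as np

def sgd_step(theta, batch_x, batch_y, lr):
    """One SGD update for a linear model y_hat = theta @ x with squared error.
    The per-example gradient plays the role of nabla_theta J^(i) from back-prop."""
    m = len(batch_x)
    g = np.zeros_like(theta)
    for x, y in zip(batch_x, batch_y):
        y_hat = theta @ x
        g += 2.0 * (y_hat - y) * x      # nabla_theta J^(i) for this example
    g /= m                              # average over the minibatch
    return theta - lr * g               # theta <- theta - epsilon * g

rng = np.random.default_rng(0)
theta = np.zeros(3)
X, Y = rng.normal(size=(8, 3)), rng.normal(size=8)
theta = sgd_step(theta, X, Y, lr=0.1)
print(theta)
```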
Example
A feedforward neural network with one hidden layer
Forward propagation
$\mathbf{a} \leftarrow \mathbf{W}\mathbf{x}$
$\mathbf{h} \leftarrow \sigma(\mathbf{a})$ (elementwise)
$\hat{y} \leftarrow \mathbf{v}^T \mathbf{h}$
$J \leftarrow L(\hat{y}, y) + \lambda \Omega(\mathbf{W}, \mathbf{v}) = (\hat{y} - y)^2 + \lambda(\|\mathbf{W}\|_F^2 + \|\mathbf{v}\|_2^2)$
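This forward pass transcribed into code; the logistic sigmoid for $\sigma$ and the sample values are assumptions made for the sketch.

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))   # assumed elementwise sigma

def forward(x, y, W, v, lam):
    a = W @ x                                  # a <- W x
    h = sigmoid(a)                             # h <- sigma(a), elementwise
    y_hat = v @ h                              # y_hat <- v^T h
    J = (y_hat - y) ** 2 + lam * (np.sum(W ** 2) + np.sum(v ** 2))
    return a, h, y_hat, J

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), 1.0
W, v, lam = rng.normal(size=(4, 3)), rng.normal(size=4), 0.01
print(forward(x, y, W, v, lam)[-1])            # value of the cost J
```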
Example
Back propagation
$g \leftarrow \nabla_{\hat{y}} J = 2(\hat{y} - y)$
$\nabla_{\mathbf{v}} J \leftarrow \nabla_{\mathbf{v}}[(\hat{y} - y)^2 + \lambda(\|\mathbf{W}\|_F^2 + \|\mathbf{v}\|_2^2)] = g\mathbf{h} + 2\lambda\mathbf{v}$
$\mathbf{g} \leftarrow \nabla_{\mathbf{h}} J = \nabla_{\mathbf{h}}[(\hat{y} - y)^2 + \lambda(\|\mathbf{W}\|_F^2 + \|\mathbf{v}\|_2^2)] = g\mathbf{v}$
$\mathbf{g} \leftarrow \nabla_{\mathbf{a}} J = \mathbf{g} \odot \sigma'(\mathbf{a})$ (elementwise)
$\nabla_{\mathbf{W}} J \leftarrow \nabla_{\mathbf{W}}[(\hat{y} - y)^2 + \lambda(\|\mathbf{W}\|_F^2 + \|\mathbf{v}\|_2^2)] = \mathbf{g}\mathbf{x}^T + 2\lambda\mathbf{W}$
[Figure: computation graph of the network with input $\mathbf{x}$, pre-activation $\mathbf{a}$, hidden layer $\mathbf{h}$, and parameters $\mathbf{W}$, $\mathbf{v}$]
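The same backward steps in code, with the forward pass repeated inline so the block runs on its own; $\sigma'(\mathbf{a}) = \sigma(\mathbf{a})(1 - \sigma(\mathbf{a}))$ is specific to the assumed sigmoid, and a finite-difference check on one entry of $\mathbf{W}$ confirms the gradient.

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def grads(x, y, W, v, lam):
    a = W @ x; h = sigmoid(a); y_hat = v @ h        # forward pass (as above)
    g = 2.0 * (y_hat - y)                           # g <- dJ/dy_hat
    grad_v = g * h + 2.0 * lam * v                  # nabla_v J = g h + 2 lambda v
    g_h = g * v                                     # nabla_h J = g v
    g_a = g_h * h * (1.0 - h)                       # nabla_a J = g_h ⊙ sigma'(a)
    grad_W = np.outer(g_a, x) + 2.0 * lam * W       # nabla_W J = g_a x^T + 2 lambda W
    return grad_v, grad_W

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), 1.0
W, v, lam = rng.normal(size=(4, 3)), rng.normal(size=4), 0.01
grad_v, grad_W = grads(x, y, W, v, lam)

# Finite-difference check on W[0, 0]
def J(W):
    h = sigmoid(W @ x); y_hat = v @ h
    return (y_hat - y) ** 2 + lam * (np.sum(W ** 2) + np.sum(v ** 2))

eps = 1e-6
Wp, Wm = W.copy(), W.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
print(grad_W[0, 0], (J(Wp) - J(Wm)) / (2 * eps))    # should nearly match
```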
Higher-Order Derivatives
Some software frameworks (e.g., Theano and TensorFlow) support the use of higher-order derivatives
The Hessian is useful for optimizing parameters
For a function $f: \mathbb{R}^n \to \mathbb{R}$, the Hessian is an $n \times n$ matrix
In deep learning, $n$ is the number of parameters, which can be millions or billions; thus constructing the entire Hessian is intractable
The typical deep learning approach to computing a function of the Hessian is to use Krylov methods: a set of iterative techniques for inverting a matrix or computing eigenvectors using only matrix-vector products
Hessian-vector products can be computed efficiently using back-propagation: $H\mathbf{v} = \nabla_{\mathbf{x}}\left[ (\nabla_{\mathbf{x}} f(\mathbf{x}))^T \mathbf{v} \right]$, so only two gradient computations are needed rather than the full Hessian
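A matrix-free sketch of a Hessian-vector product for an assumed test function whose gradient is known in closed form; here the directional derivative of the gradient is approximated by finite differences (an autodiff framework would instead apply back-propagation twice via the identity above), which is all a Krylov method needs.

```python
import numpy as np

# Assumed test function f(x) = 0.5 x^T A x + sum(sin(x)), with known gradient.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
A = A + A.T                                          # symmetric matrix

def grad_f(x):
    return A @ x + np.cos(x)

def hvp(x, v, eps=1e-5):
    # Matrix-free Hessian-vector product: differentiate the gradient along v.
    # (Autodiff frameworks instead compute Hv = nabla_x [ (nabla_x f(x))^T v ]
    #  by applying back-propagation twice.)
    return (grad_f(x + eps * v) - grad_f(x - eps * v)) / (2 * eps)

x, v = rng.normal(size=4), rng.normal(size=4)
H_exact = A - np.diag(np.sin(x))                     # exact Hessian for this f
print(hvp(x, v))
print(H_exact @ v)                                   # should nearly match
```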
What you need to know
Backpropagation is an algorithm to compute partial derivatives efficiently in neural networks
Reuse partial derivatives
Procedure of backpropagation
Use of backpropagation in SGD
Questions?