Large Scale Data Analysis Using Deep Learning
Deep Feedforward Networks - 2
U Kang
In This Lecture
Back propagation
Motivation
Main idea
Procedure
Computational Graphs
Formalizes computation
Each node indicates a variable
Each edge indicates an operation
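For illustration, here is a minimal sketch (not from the slides) of how such a graph can be represented in Python: each node keeps its parents and the operation that combines them; the class and variable names are invented for this example.

```python
# A minimal sketch of a computational graph: a node stores its parents and
# the operation applied to them (all names are illustrative).

class Node:
    def __init__(self, name, parents=(), op=None, value=None):
        self.name = name          # variable name
        self.parents = parents    # nodes this node is computed from
        self.op = op              # operation applied to the parents
        self.value = value        # numeric value (set directly for inputs)

    def compute(self):
        # Inputs already hold a value; internal nodes apply their operation.
        if self.op is not None:
            self.value = self.op(*[p.compute() for p in self.parents])
        return self.value

# Graph for z = (x + y) * y
x = Node("x", value=2.0)
y = Node("y", value=3.0)
s = Node("s", parents=(x, y), op=lambda a, b: a + b)
z = Node("z", parents=(s, y), op=lambda a, b: a * b)

print(z.compute())  # (2 + 3) * 3 = 15.0
```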
Chain Rule of Calculus
Let x be a real number, and f and g be functions from R to R. Suppose y = g(x), and z = f(g(x)) = f(y). Then the chain rule states that
$\frac{dz}{dx} = \frac{dz}{dy}\,\frac{dy}{dx}$
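As a quick check of this rule, the sketch below uses the assumed functions $y = g(x) = x^2$ and $z = f(y) = \sin(y)$, so that $\frac{dz}{dx} = \cos(x^2) \cdot 2x$, and compares it against a finite-difference approximation.

```python
import math

# Hypothetical example functions: y = g(x) = x**2, z = f(y) = sin(y)
g = lambda x: x ** 2
f = lambda y: math.sin(y)

def dz_dx_chain(x):
    # dz/dx = (dz/dy) * (dy/dx) = cos(x**2) * 2x
    return math.cos(g(x)) * 2 * x

def dz_dx_numeric(x, eps=1e-6):
    # central finite difference of the composition f(g(x))
    return (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)

x = 1.3
print(dz_dx_chain(x), dz_dx_numeric(x))  # the two values should nearly match
```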
Suppose $\mathbf{x} \in \mathbb{R}^m$, $\mathbf{y} \in \mathbb{R}^n$, $g: \mathbb{R}^m \to \mathbb{R}^n$, and $f: \mathbb{R}^n \to \mathbb{R}$. If $\mathbf{y} = g(\mathbf{x})$ and $z = f(\mathbf{y})$, then
$\frac{\partial z}{\partial x_i} = \sum_j \frac{\partial z}{\partial y_j} \frac{\partial y_j}{\partial x_i}$
In vector notation: $\nabla_{\mathbf{x}} z = \left(\frac{\partial \mathbf{y}}{\partial \mathbf{x}}\right)^T \nabla_{\mathbf{y}} z$, where $\frac{\partial \mathbf{y}}{\partial \mathbf{x}}$ is the $n \times m$ Jacobian matrix of $g$
E.g., suppose $z = \mathbf{w}^T \mathbf{y}$ and $\mathbf{y} = \nabla_{\mathbf{x}} g(\mathbf{x})$. Then $\nabla_{\mathbf{x}} z = \left(\frac{\partial \mathbf{y}}{\partial \mathbf{x}}\right)^T \nabla_{\mathbf{y}} z = H^T \mathbf{w}$, where $H$ is the Hessian matrix of $g$
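To make the vector form concrete, the following sketch picks an arbitrary $g: \mathbb{R}^2 \to \mathbb{R}^3$ and $f: \mathbb{R}^3 \to \mathbb{R}$ (both invented for illustration, not from the slides) and numerically verifies $\nabla_{\mathbf{x}} z = (\partial\mathbf{y}/\partial\mathbf{x})^T \nabla_{\mathbf{y}} z$.

```python
import numpy as np

# Assumed example: g maps R^2 -> R^3, f maps R^3 -> R (chosen arbitrarily).
def g(x):
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

def f(y):
    return y[0] + 2.0 * y[1] * y[2]

def jacobian_g(x):
    # 3 x 2 Jacobian dy/dx, derived by hand for the g above
    return np.array([[x[1], x[0]],
                     [np.cos(x[0]), 0.0],
                     [0.0, 2.0 * x[1]]])

def grad_f(y):
    # gradient of f with respect to y, derived by hand
    return np.array([1.0, 2.0 * y[2], 2.0 * y[1]])

x = np.array([0.7, -1.2])
grad_chain = jacobian_g(x).T @ grad_f(g(x))      # (dy/dx)^T nabla_y z

# Finite-difference check of nabla_x z
eps = 1e-6
grad_fd = np.array([(f(g(x + eps * e)) - f(g(x - eps * e))) / (2 * eps)
                    for e in np.eye(2)])
print(grad_chain, grad_fd)  # should nearly match
```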
Repeated Subexpressions
When the chain rule is applied naively, the same subexpression appears many times; computing it repeatedly from scratch is wasteful
Backpropagation - Overview
Backpropagation is an algorithm to compute partial derivatives efficiently in neural networks
The main idea is dynamic programming
Many gradient computations share common subexpressions
Store these common subexpressions in memory and read them when needed, instead of re-computing them from scratch
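A toy illustration of this idea, with made-up local derivatives along a chain $x \to y \to z \to t$: the naive version re-multiplies the same path for every gradient, while the cached version stores each derivative of $t$ once and reuses it.

```python
# Toy sketch: local derivatives along a chain x -> y -> z -> t
# (the numbers are made up just to illustrate reuse of shared products).
local = {("y", "x"): 2.0, ("z", "y"): 3.0, ("t", "z"): 4.0}
parents = {"y": "x", "z": "y", "t": "z"}

def naive(node):
    # dt/d(node) by multiplying the whole path every time (repeated work)
    d, cur = 1.0, node
    while cur != "t":
        child = next(c for c, p in parents.items() if p == cur)
        d *= local[(child, cur)]
        cur = child
    return d

cache = {"t": 1.0}            # dt/dt = 1
for node in ["z", "y", "x"]:  # reverse topological order
    child = next(c for c, p in parents.items() if p == node)
    # reuse the already stored dt/d(child) instead of recomputing the path
    cache[node] = cache[child] * local[(child, node)]

print([naive(n) for n in ["z", "y", "x"]])  # [4.0, 12.0, 24.0]
print(cache)                                # same values, each product computed once
```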
Backpropagation - Overview
Assume a computational graph that computes a single scalar $u^{(n)}$
E.g., this can be the loss
Our goal is to compute the partial derivatives $\frac{\partial u^{(n)}}{\partial u^{(i)}}$ for all $i \in \{1, 2, \ldots, n_i\}$, where $u^{(1)}$ to $u^{(n_i)}$ are the parameters of the model
We assume that nodes are ordered such that we can compute their outputs one after the other, starting at $u^{(n_i + 1)}$ and going up to $u^{(n)}$
Each node $u^{(i)}$ is associated with an operation $f^{(i)}$ and is computed by $u^{(i)} = f^{(i)}(\mathbb{A}^{(i)})$, where $\mathbb{A}^{(i)}$ is the set of all parents of $u^{(i)}$
Forward Propagation
This procedure performs the computations mapping the $n_i$ inputs $u^{(1)}, \ldots, u^{(n_i)}$ to the output $u^{(n)}$
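A minimal sketch of this forward pass, assuming the graph is given as a topologically ordered list of operations together with parent indices (the representation and names are assumptions made for this sketch).

```python
import numpy as np

# Forward propagation over a computational graph whose nodes are already
# topologically ordered: the first n_i entries are inputs, later nodes
# apply their operation f to the values of their parents.
def forward(inputs, ops):
    """inputs: list of input values u^(1)..u^(n_i)
       ops: list of (f, parent_indices) for the remaining nodes, in order"""
    u = list(inputs)
    for f, parent_idx in ops:
        u.append(f(*[u[p] for p in parent_idx]))
    return u  # u[-1] is the scalar output u^(n)

# Example graph: u3 = u1 * u2, u4 = sin(u3), u5 = u3 + u4  (made-up operations)
ops = [(lambda a, b: a * b, (0, 1)),
       (np.sin, (2,)),
       (lambda a, b: a + b, (2, 3))]
u = forward([2.0, 3.0], ops)
print(u[-1])  # value of the final node
```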
Back-Propagation
This procedure computes the partial derivatives of $u^{(n)}$ with respect to the variables $u^{(1)}, \ldots, u^{(n_i)}$
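A matching sketch of the backward pass under the same assumed graph representation: nodes are visited in reverse order, each edge contributes one local partial derivative, one multiplication, and one addition, and grad[i] accumulates $\frac{\partial u^{(n)}}{\partial u^{(i)}}$.

```python
import numpy as np

# Back-propagation over a topologically ordered graph: visit nodes in reverse
# order and accumulate d u^(n) / d u^(i), reusing already-stored gradients.
def backward(u, local_grads, n_inputs):
    """u: all forward values; local_grads: list of (grad_fn, parent_indices)
    for each non-input node, where grad_fn returns the partial derivatives
    of that node with respect to each of its parents."""
    grad = [0.0] * len(u)
    grad[-1] = 1.0                                   # d u^(n) / d u^(n) = 1
    for k in range(len(local_grads) - 1, -1, -1):
        node = n_inputs + k                          # node computed at step k
        grad_fn, parent_idx = local_grads[k]
        for dfdp, p in zip(grad_fn(*[u[q] for q in parent_idx]), parent_idx):
            grad[p] += grad[node] * dfdp             # one multiply, one add per edge
    return grad[:n_inputs]                           # gradients w.r.t. the inputs

# Local derivatives for the example graph from the forward-propagation sketch
local_grads = [(lambda a, b: (b, a), (0, 1)),        # d(a*b)/da = b, d(a*b)/db = a
               (lambda a: (np.cos(a),), (2,)),       # d(sin a)/da = cos a
               (lambda a, b: (1.0, 1.0), (2, 3))]    # d(a+b) = (1, 1)
u = [2.0, 3.0, 6.0, np.sin(6.0), 6.0 + np.sin(6.0)]  # forward values
print(backward(u, local_grads, n_inputs=2))          # gradients w.r.t. u1 and u2
```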
Back-Propagation Example
Back-propagation procedure (reuse the shared derivatives instead of recomputing them)
Compute $\frac{\partial z}{\partial y}$ and $\frac{\partial z}{\partial x}$
$\frac{\partial z}{\partial w} \leftarrow \frac{\partial z}{\partial y}\frac{\partial y}{\partial w} + \frac{\partial z}{\partial x}\frac{\partial x}{\partial w}$
$\frac{\partial z}{\partial v} \leftarrow \frac{\partial z}{\partial w}\frac{\partial w}{\partial v}$
[Figure: computational graph with edges $v \to w$, $w \to y$, $w \to x$, $y \to z$, $x \to z$]
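To ground this example numerically, the sketch below assigns made-up operations to the graph ($w = v^2$, $y = \sin(w)$, $x = e^w$, $z = yx$; these functions are assumptions, not from the slides), evaluates the two update rules, and checks the result with finite differences.

```python
import math

# Assumed operations for the example graph: w = v**2, y = sin(w), x = exp(w), z = y * x
v = 0.5
w = v ** 2
y, x = math.sin(w), math.exp(w)

dz_dy, dz_dx = x, y                      # first compute dz/dy and dz/dx
dy_dw, dx_dw = math.cos(w), math.exp(w)
dz_dw = dz_dy * dy_dw + dz_dx * dx_dw    # reuse dz/dy and dz/dx
dw_dv = 2 * v
dz_dv = dz_dw * dw_dv                    # reuse dz/dw

# Finite-difference check of dz/dv
eps = 1e-6
z_of = lambda v: math.sin(v ** 2) * math.exp(v ** 2)
print(dz_dv, (z_of(v + eps) - z_of(v - eps)) / (2 * eps))  # should nearly match
```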
Cost of Back-Propagation
The amount of computation scales linearly with the number of edges in the computational graph
Computation for each edge: computing a partial derivative, one multiplication, and one addition
[Figure: computational graph with nodes $w$, $y$, $x$, $z$, annotating the per-edge work]
Compute $\frac{\partial z}{\partial y}$ and $\frac{\partial z}{\partial x}$
$\frac{\partial z}{\partial w} \leftarrow \frac{\partial z}{\partial y}\frac{\partial y}{\partial w} + \frac{\partial z}{\partial x}\frac{\partial x}{\partial w}$
Back-Propagation in Fully Connected MLP
Forward propagation
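A minimal sketch of forward propagation through a fully connected network, assuming weight matrices W[k], bias vectors b[k], and an elementwise activation (tanh here); the intermediate values are cached because the backward pass will need them.

```python
import numpy as np

def mlp_forward(x, Ws, bs, activation=np.tanh):
    """Forward propagation through a fully connected network.
    Ws, bs: lists of weight matrices and bias vectors, one per layer."""
    h = x
    pre_acts, acts = [], [x]       # cache values needed later by back-propagation
    for W, b in zip(Ws, bs):
        a = W @ h + b              # affine transformation
        h = activation(a)          # elementwise nonlinearity
        pre_acts.append(a)
        acts.append(h)
    return h, pre_acts, acts

# Tiny usage example with random parameters
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
bs = [np.zeros(4), np.zeros(2)]
y_hat, pre_acts, acts = mlp_forward(rng.normal(size=3), Ws, bs)
print(y_hat)
```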
Back-Propagation in Fully Connected MLP
Backward computation
⨀: element-wise product
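A matching sketch of the backward computation, assuming tanh activations and a squared-error loss: the gradient g is propagated layer by layer, multiplied elementwise (⊙) by the activation derivative, and turned into per-layer weight and bias gradients.

```python
import numpy as np

def mlp_backward(y, y_hat, Ws, pre_acts, acts):
    """Back-propagation for a fully connected net with tanh activations and
    squared-error loss. Returns gradients w.r.t. each W[k] and b[k]."""
    g = 2.0 * (y_hat - y)                           # dL/d(output)
    grads_W, grads_b = [], []
    for k in range(len(Ws) - 1, -1, -1):
        g = g * (1.0 - np.tanh(pre_acts[k]) ** 2)   # g <- g ⊙ f'(a^(k))
        grads_W.append(np.outer(g, acts[k]))        # dL/dW^(k) = g (h^(k-1))^T
        grads_b.append(g.copy())                    # dL/db^(k) = g
        g = Ws[k].T @ g                             # propagate to the layer below
    return grads_W[::-1], grads_b[::-1]

# Self-contained usage: a tiny forward pass with tanh activations and biases
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
bs = [np.zeros(4), np.zeros(2)]
h = rng.normal(size=3)
acts, pre_acts = [h], []
for W, b in zip(Ws, bs):
    a = W @ h + b
    h = np.tanh(a)
    pre_acts.append(a)
    acts.append(h)

gW, gb = mlp_backward(np.ones(2), h, Ws, pre_acts, acts)
print([w.shape for w in gW], [b.shape for b in gb])  # [(4, 3), (2, 4)] [(4,), (2,)]
```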
Minibatch Processing
SGD (recap)
We sample a minibatch of examples $\mathbb{B} = \{x^{(1)}, \ldots, x^{(m')}\}$ drawn uniformly from the training set
The estimate of the gradient is $g = \frac{1}{m'} \sum_{i=1}^{m'} \nabla_\theta L(x^{(i)}, y^{(i)}, \theta)$
Then the gradient descent update is $\theta \leftarrow \theta - \epsilon g$
SGD using back propagation
For each instance $(x^{(i)}, y^{(i)})$, where $i = 1, \ldots, m'$, we compute the gradient $\nabla_\theta J^{(i)}$ using back propagation, where $J^{(i)} = L(x^{(i)}, y^{(i)}, \theta)$ is the $i$-th loss
The final gradient is $g = \frac{1}{m'} \sum_{i=1}^{m'} \nabla_\theta J^{(i)}$
Update $\theta \leftarrow \theta - \epsilon g$
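A minimal sketch of one such SGD step; to keep the block self-contained it uses a linear model whose per-example gradient is written in closed form, standing in for the $\nabla_\theta J^{(i)}$ that back propagation would produce in a deep network.

```python
import numpy as np

def sgd_step(theta, batch_x, batch_y, lr):
    """One SGD update for a linear model y_hat = theta @ x with squared error.
    The per-example gradient plays the role of nabla_theta J^(i) from back-prop."""
    m = len(batch_x)
    g = np.zeros_like(theta)
    for x, y in zip(batch_x, batch_y):
        y_hat = theta @ x
        g += 2.0 * (y_hat - y) * x      # nabla_theta J^(i) for this example
    g /= m                              # average over the minibatch
    return theta - lr * g               # theta <- theta - epsilon * g

rng = np.random.default_rng(0)
theta = np.zeros(3)
X, Y = rng.normal(size=(8, 3)), rng.normal(size=8)
theta = sgd_step(theta, X, Y, lr=0.1)
print(theta)
```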
Example
A feedforward neural network with one hidden layer
Forward propagation
$\mathbf{a} \leftarrow \mathbf{W}\mathbf{x}$
$\mathbf{h} \leftarrow \sigma(\mathbf{a})$ (elementwise)
$\hat{y} \leftarrow \mathbf{v}^T \mathbf{h}$
$J \leftarrow L(\hat{y}, y) + \lambda \Omega(\mathbf{W}, \mathbf{v}) = (\hat{y} - y)^2 + \lambda(\|\mathbf{W}\|_F^2 + \|\mathbf{v}\|_2^2)$
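This forward pass transcribed into code; the logistic sigmoid for $\sigma$ and the sample values are assumptions made for the sketch.

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))   # assumed elementwise sigma

def forward(x, y, W, v, lam):
    a = W @ x                                  # a <- W x
    h = sigmoid(a)                             # h <- sigma(a), elementwise
    y_hat = v @ h                              # y_hat <- v^T h
    J = (y_hat - y) ** 2 + lam * (np.sum(W ** 2) + np.sum(v ** 2))
    return a, h, y_hat, J

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), 1.0
W, v, lam = rng.normal(size=(4, 3)), rng.normal(size=4), 0.01
print(forward(x, y, W, v, lam)[-1])            # value of the cost J
```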
Example
Back propagation
$g \leftarrow \nabla_{\hat{y}} J = 2(\hat{y} - y)$
$\nabla_{\mathbf{v}} J \leftarrow \nabla_{\mathbf{v}}[(\hat{y} - y)^2 + \lambda(\|\mathbf{W}\|_F^2 + \|\mathbf{v}\|_2^2)] = g\mathbf{h} + 2\lambda\mathbf{v}$
$\mathbf{g} \leftarrow \nabla_{\mathbf{h}} J = \nabla_{\mathbf{h}}[(\hat{y} - y)^2 + \lambda(\|\mathbf{W}\|_F^2 + \|\mathbf{v}\|_2^2)] = g\mathbf{v}$
$\mathbf{g} \leftarrow \nabla_{\mathbf{a}} J = \mathbf{g} \odot \sigma'(\mathbf{a})$ (elementwise)
$\nabla_{\mathbf{W}} J \leftarrow \nabla_{\mathbf{W}}[(\hat{y} - y)^2 + \lambda(\|\mathbf{W}\|_F^2 + \|\mathbf{v}\|_2^2)] = \mathbf{g}\mathbf{x}^T + 2\lambda\mathbf{W}$
[Figure: computation graph of the network with input $\mathbf{x}$, pre-activation $\mathbf{a}$, hidden layer $\mathbf{h}$, and parameters $\mathbf{W}$, $\mathbf{v}$]
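The same backward steps in code, with the forward pass repeated inline so the block runs on its own; $\sigma'(\mathbf{a}) = \sigma(\mathbf{a})(1 - \sigma(\mathbf{a}))$ is specific to the assumed sigmoid, and a finite-difference check on one entry of $\mathbf{W}$ confirms the gradient.

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def grads(x, y, W, v, lam):
    a = W @ x; h = sigmoid(a); y_hat = v @ h        # forward pass (as above)
    g = 2.0 * (y_hat - y)                           # g <- dJ/dy_hat
    grad_v = g * h + 2.0 * lam * v                  # nabla_v J = g h + 2 lambda v
    g_h = g * v                                     # nabla_h J = g v
    g_a = g_h * h * (1.0 - h)                       # nabla_a J = g_h ⊙ sigma'(a)
    grad_W = np.outer(g_a, x) + 2.0 * lam * W       # nabla_W J = g_a x^T + 2 lambda W
    return grad_v, grad_W

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), 1.0
W, v, lam = rng.normal(size=(4, 3)), rng.normal(size=4), 0.01
grad_v, grad_W = grads(x, y, W, v, lam)

# Finite-difference check on W[0, 0]
def J(W):
    h = sigmoid(W @ x); y_hat = v @ h
    return (y_hat - y) ** 2 + lam * (np.sum(W ** 2) + np.sum(v ** 2))

eps = 1e-6
Wp, Wm = W.copy(), W.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
print(grad_W[0, 0], (J(Wp) - J(Wm)) / (2 * eps))    # should nearly match
```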
Higher-Order Derivatives
Some software frameworks (e.g., Theano and TensorFlow) support the use of higher-order derivatives
The Hessian is useful for optimizing parameters
For a function $f: \mathbb{R}^n \to \mathbb{R}$, the Hessian is an $n \times n$ matrix
In deep learning, $n$ is the number of parameters, which can be millions or billions; thus constructing the entire Hessian is intractable
The typical deep learning approach to computing a function of the Hessian is to use Krylov methods: a set of iterative techniques for inverting a matrix or computing eigenvectors using only matrix-vector products
Hessian-vector products can be computed efficiently using back-propagation: $H\mathbf{v} = \nabla_{\mathbf{x}}\left[ (\nabla_{\mathbf{x}} f(\mathbf{x}))^T \mathbf{v} \right]$, so only two gradient computations are needed rather than the full Hessian
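A matrix-free sketch of a Hessian-vector product for an assumed test function whose gradient is known in closed form; here the directional derivative of the gradient is approximated by finite differences (an autodiff framework would instead apply back-propagation twice via the identity above), which is all a Krylov method needs.

```python
import numpy as np

# Assumed test function f(x) = 0.5 x^T A x + sum(sin(x)), with known gradient.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
A = A + A.T                                          # symmetric matrix

def grad_f(x):
    return A @ x + np.cos(x)

def hvp(x, v, eps=1e-5):
    # Matrix-free Hessian-vector product: differentiate the gradient along v.
    # (Autodiff frameworks instead compute Hv = nabla_x [ (nabla_x f(x))^T v ]
    #  by applying back-propagation twice.)
    return (grad_f(x + eps * v) - grad_f(x - eps * v)) / (2 * eps)

x, v = rng.normal(size=4), rng.normal(size=4)
H_exact = A - np.diag(np.sin(x))                     # exact Hessian for this f
print(hvp(x, v))
print(H_exact @ v)                                   # should nearly match
```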
What you need to know
Backpropagation is an algorithm to compute partial derivatives efficiently in neural networks
Reuse partial derivatives
Procedure of backpropagation
Use of backpropagation in SGD
Questions?