
When we “transpose” a tensor, we mean swapping the first two dimensions. Specifically, in (5.99) through (5.102), we require tensor-related computations when we work with multivariate functions f(·) and compute derivatives with respect to matrices (and choose not to vectorize them as discussed in Section 5.4). ♢

5.6 Backpropagation and Automatic Differentiation

A good discussion about backpropagation and the chain rule is available in a blog post by Tim Vieira at https://tinyurl.com/ycfm2yrw.

In many machine learning applications, we find good model parameters by performing gradient descent (Section 7.1), which relies on the fact that we can compute the gradient of a learning objective with respect to the parameters of the model. For a given objective function, we can obtain the gradient with respect to the model parameters using calculus and applying the chain rule; see Section 5.2.2. We already had a taste in Section 5.3 when we looked at the gradient of a squared loss with respect to the parameters of a linear regression model.

Consider the function

f(x) = \sqrt{x^2 + \exp(x^2)} + \cos\big(x^2 + \exp(x^2)\big).    (5.109)

By application of the chain rule, and noting that differentiation is linear, we compute the gradient

\frac{df}{dx} = \frac{2x + 2x\exp(x^2)}{2\sqrt{x^2 + \exp(x^2)}} - \sin\big(x^2 + \exp(x^2)\big)\big(2x + 2x\exp(x^2)\big)
             = 2x\left(\frac{1}{2\sqrt{x^2 + \exp(x^2)}} - \sin\big(x^2 + \exp(x^2)\big)\right)\big(1 + \exp(x^2)\big).    (5.110)

Writing out the gradient in this explicit way is often impractical since it results in a very lengthy expression for the derivative. In practice, it means that, if we are not careful, the implementation of the gradient could be significantly more expensive than computing the function, which imposes unnecessary overhead. For training deep neural network models, the backpropagation algorithm (Kelley, 1960; Bryson, 1961; Dreyfus, 1962; Rumelhart et al., 1986) is an efficient way to compute the gradient of an error function with respect to the parameters of the model.
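To see that (5.110) is indeed correct, one can compare it against a finite-difference approximation of the derivative. The following Python sketch is our own illustration (the function names f and df_dx are not from the text):

```python
import numpy as np

def f(x):
    # f(x) = sqrt(x^2 + exp(x^2)) + cos(x^2 + exp(x^2)), cf. (5.109)
    u = x**2 + np.exp(x**2)
    return np.sqrt(u) + np.cos(u)

def df_dx(x):
    # Closed-form gradient from (5.110)
    u = x**2 + np.exp(x**2)
    return 2 * x * (1 / (2 * np.sqrt(u)) - np.sin(u)) * (1 + np.exp(x**2))

x, eps = 0.7, 1e-6
finite_difference = (f(x + eps) - f(x - eps)) / (2 * eps)
print(df_dx(x), finite_difference)  # the two values agree to several decimal places
```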

5.6.1 Gradients in a Deep Network

An area where the chain rule is used to an extreme is deep learning, where the function value y is computed as a many-level function composition

y = (f_K \circ f_{K-1} \circ \cdots \circ f_1)(x) = f_K(f_{K-1}(\cdots(f_1(x))\cdots)),    (5.111)

where x are the inputs (e.g., images), y are the observations (e.g., class labels), and every function f_i, i = 1, \ldots, K, possesses its own parameters.

Figure 5.8: Forward pass in a multi-layer neural network to compute the loss L as a function of the inputs x and the parameters A_i, b_i.

In neural networks with multiple layers, we have functions f_i(x_{i-1}) = \sigma(A_{i-1} x_{i-1} + b_{i-1}) in the ith layer. Here x_{i-1} is the output of layer i-1 and \sigma an activation function, such as the logistic sigmoid \frac{1}{1 + e^{-x}}, \tanh, or a rectified linear unit (ReLU). We discuss the case where the activation functions are identical in each layer to unclutter the notation. In order to train these models, we require the gradient of a loss function L with respect to all model parameters A_j, b_j for j = 1, \ldots, K. This also requires us to compute the gradient of L with respect to the inputs of each layer. For example, if we have inputs x and observations y and a network structure defined by

f_0 := x    (5.112)
f_i := \sigma_i(A_{i-1} f_{i-1} + b_{i-1}), \quad i = 1, \ldots, K,    (5.113)

see also Figure 5.8 for a visualization, we may be interested in finding A_j, b_j for j = 0, \ldots, K-1, such that the squared loss

L(\theta) = \|y - f_K(\theta, x)\|^2    (5.114)

is minimized, where \theta = \{A_0, b_0, \ldots, A_{K-1}, b_{K-1}\}.
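To make (5.112)–(5.114) concrete, here is a minimal NumPy sketch of the forward pass; the layer widths, the choice of tanh as the activation function σ, and the random parameters are illustrative assumptions rather than anything prescribed by the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative layer widths: 3 inputs, two hidden layers, 2 outputs.
sizes = [3, 4, 4, 2]
K = len(sizes) - 1  # number of layers

# Parameters A_0, b_0, ..., A_{K-1}, b_{K-1}, i.e., theta in (5.114).
A = [rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(K)]
b = [rng.standard_normal(sizes[i + 1]) for i in range(K)]

def loss(x, y):
    """Forward pass (5.112)-(5.113) followed by the squared loss (5.114)."""
    f = x                              # f_0 := x
    for i in range(K):
        f = np.tanh(A[i] @ f + b[i])   # f_{i+1} := sigma(A_i f_i + b_i)
    return np.sum((y - f) ** 2)        # L(theta) = ||y - f_K||^2

x = rng.standard_normal(sizes[0])
y = rng.standard_normal(sizes[-1])
print(loss(x, y))
```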

To obtain the gradients with respect to the parameter set \theta, we require the partial derivatives of L with respect to the parameters \theta_j = \{A_j, b_j\} of each layer j = 0, \ldots, K-1. (A more in-depth discussion about gradients of neural networks can be found in Justin Domke's lecture notes at https://tinyurl.com/yalcxgtv.) The chain rule allows us to determine the partial derivatives as

\frac{\partial L}{\partial \theta_{K-1}} = \frac{\partial L}{\partial f_K} \frac{\partial f_K}{\partial \theta_{K-1}}    (5.115)

\frac{\partial L}{\partial \theta_{K-2}} = \frac{\partial L}{\partial f_K} \frac{\partial f_K}{\partial f_{K-1}} \frac{\partial f_{K-1}}{\partial \theta_{K-2}}    (5.116)

\frac{\partial L}{\partial \theta_{K-3}} = \frac{\partial L}{\partial f_K} \frac{\partial f_K}{\partial f_{K-1}} \frac{\partial f_{K-1}}{\partial f_{K-2}} \frac{\partial f_{K-2}}{\partial \theta_{K-3}}    (5.117)

\frac{\partial L}{\partial \theta_{i}} = \frac{\partial L}{\partial f_K} \frac{\partial f_K}{\partial f_{K-1}} \cdots \frac{\partial f_{i+2}}{\partial f_{i+1}} \frac{\partial f_{i+1}}{\partial \theta_{i}}    (5.118)

The orange terms are partial derivatives of the output of a layer with respect to its inputs, whereas the blue terms are partial derivatives of the output of a layer with respect to its parameters. Assuming we have already computed the partial derivatives \partial L/\partial \theta_{i+1}, most of the computation can be reused to compute \partial L/\partial \theta_i. The additional terms that we need to compute are indicated by the boxes.


Figure 5.9: Backward pass in a multi-layer neural network to compute the gradients of the loss function.

Figure 5.10: Simple graph illustrating the flow of data from x to y via some intermediate variables a, b.

Figure 5.9 visualizes that the gradients are passed backward through the network.
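The backward pass of Figure 5.9, and in particular the reuse of \partial L/\partial f_{i+1} when moving from layer i+1 to layer i, can be made explicit in code. The following NumPy sketch backpropagates the squared loss (5.114) through a small network; tanh activations and the layer widths are again our own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [3, 4, 4, 2]                 # illustrative layer widths
K = len(sizes) - 1
A = [rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(K)]
b = [rng.standard_normal(sizes[i + 1]) for i in range(K)]

def loss_and_gradients(x, y):
    # Forward pass: store all layer outputs f_0, ..., f_K, cf. (5.112)-(5.113).
    fs = [x]
    for i in range(K):
        fs.append(np.tanh(A[i] @ fs[-1] + b[i]))
    L = np.sum((y - fs[-1]) ** 2)          # squared loss (5.114)

    # Backward pass: start with dL/df_K and walk through the layers in reverse.
    dL_df = -2 * (y - fs[-1])              # dL/df_K
    dA, db = [None] * K, [None] * K
    for i in reversed(range(K)):
        # Chain rule through f_{i+1} = tanh(A_i f_i + b_i); tanh'(z) = 1 - tanh(z)^2.
        dL_dz = dL_df * (1 - fs[i + 1] ** 2)
        dA[i] = np.outer(dL_dz, fs[i])     # gradient w.r.t. the layer parameters A_i
        db[i] = dL_dz                      # gradient w.r.t. the layer parameters b_i
        dL_df = A[i].T @ dL_dz             # gradient w.r.t. the layer input f_i,
                                           # reused in the next iteration (layer i-1)
    return L, dA, db

x = rng.standard_normal(sizes[0])
y = rng.standard_normal(sizes[-1])
L, dA, db = loss_and_gradients(x, y)
print(L, dA[0].shape, db[K - 1].shape)
```

The quantity dL_df computed in one iteration of the loop is exactly the shared factor that is reused in the next iteration, mirroring the common terms in (5.115)–(5.118).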

5.6.2 Automatic Differentiation

It turns out that backpropagation is a special case of a general technique in numerical analysis called automatic differentiation. We can think of automatic differentiation as a set of techniques to numerically (in contrast to symbolically) evaluate the exact (up to machine precision) gradient of a function by working with intermediate variables and applying the chain rule. Automatic differentiation is different from symbolic differentiation and from numerical approximations of the gradient, e.g., by using finite differences. Automatic differentiation applies a series of elementary arithmetic operations, e.g., addition and multiplication, and elementary functions, e.g., \sin, \cos, \exp, \log. By applying the chain rule to these operations, the gradient of quite complicated functions can be computed automatically. Automatic differentiation applies to general computer programs and has forward and reverse modes. Baydin et al. (2018) give a great overview of automatic differentiation in machine learning.

Figure 5.10 shows a simple graph representing the data flow from inputs x to outputs y via some intermediate variables a, b. If we were to compute the derivative dy/dx, we would apply the chain rule and obtain

\frac{dy}{dx} = \frac{dy}{db} \frac{db}{da} \frac{da}{dx}.    (5.119)

Intuitively, the forward and reverse mode differ in the order of multiplication. (In the general case, we work with Jacobians, which can be vectors, matrices, or tensors.) Due to the associativity of matrix multiplication, we can choose between

\frac{dy}{dx} = \left(\frac{dy}{db} \frac{db}{da}\right) \frac{da}{dx},    (5.120)

\frac{dy}{dx} = \frac{dy}{db} \left(\frac{db}{da} \frac{da}{dx}\right).    (5.121)

Equation (5.120) would be the reverse mode because gradients are propagated backward through the graph, i.e., reverse to the data flow. Equation (5.121) would be the forward mode, where the gradients flow with the data from left to right through the graph.
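The two orderings in (5.120) and (5.121) produce the same Jacobian but intermediate results of very different sizes. A small NumPy sketch with made-up dimensions (D inputs, M outputs, an intermediate size N; all values are illustrative) shows this:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, M = 1000, 500, 1                    # many inputs, one output (illustrative)

da_dx = rng.standard_normal((N, D))       # Jacobian of a w.r.t. x
db_da = rng.standard_normal((N, N))       # Jacobian of b w.r.t. a
dy_db = rng.standard_normal((M, N))       # Jacobian of y w.r.t. b

# Reverse mode (5.120): the intermediate product dy/db * db/da is only M x N.
reverse = (dy_db @ db_da) @ da_dx

# Forward mode (5.121): the intermediate product db/da * da/dx is N x D.
forward = dy_db @ (db_da @ da_dx)

print(np.allclose(reverse, forward))      # same Jacobian dy/dx, different cost
```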

In the following, we will focus on reverse mode automatic differentiation, which is backpropagation. In the context of neural networks, where the input dimensionality is often much higher than the dimensionality of the labels, the reverse mode is computationally significantly cheaper than the forward mode. Let us start with an instructive example.

Example 5.14

Consider the function

f(x) = \sqrt{x^2 + \exp(x^2)} + \cos\big(x^2 + \exp(x^2)\big)    (5.122)

from (5.109). If we were to implement the function f on a computer, we would be able to save some computation by using intermediate variables:

a = x^2,    (5.123)
b = \exp(a),    (5.124)
c = a + b,    (5.125)
d = \sqrt{c},    (5.126)
e = \cos(c),    (5.127)
f = d + e.    (5.128)

Figure 5.11: Computation graph with inputs x, function values f, and intermediate variables a, b, c, d, e.

This is the same kind of thinking process that occurs when applying the chain rule. Note that the preceding set of equations requires fewer operations than a direct implementation of the function f(x) as defined in (5.109). The corresponding computation graph in Figure 5.11 shows the flow of data and computations required to obtain the function value f.

The set of equations that include intermediate variables can be thought of as a computation graph, a representation that is widely used in imple- mentations of neural network software libraries. We can directly compute the derivatives of the intermediate variables with respect to their corre- sponding inputs by recalling the definition of the derivative of elementary functions. We obtain the following:

\frac{\partial a}{\partial x} = 2x    (5.129)

\frac{\partial b}{\partial a} = \exp(a)    (5.130)

\frac{\partial c}{\partial a} = 1 = \frac{\partial c}{\partial b}    (5.131)

\frac{\partial d}{\partial c} = \frac{1}{2\sqrt{c}}    (5.132)

\frac{\partial e}{\partial c} = -\sin(c)    (5.133)

\frac{\partial f}{\partial d} = 1 = \frac{\partial f}{\partial e}.    (5.134)

By looking at the computation graph in Figure 5.11, we can compute \partial f/\partial x by working backward from the output and obtain

\frac{\partial f}{\partial c} = \frac{\partial f}{\partial d} \frac{\partial d}{\partial c} + \frac{\partial f}{\partial e} \frac{\partial e}{\partial c}    (5.135)

\frac{\partial f}{\partial b} = \frac{\partial f}{\partial c} \frac{\partial c}{\partial b}    (5.136)

\frac{\partial f}{\partial a} = \frac{\partial f}{\partial b} \frac{\partial b}{\partial a} + \frac{\partial f}{\partial c} \frac{\partial c}{\partial a}    (5.137)

\frac{\partial f}{\partial x} = \frac{\partial f}{\partial a} \frac{\partial a}{\partial x}.    (5.138)

Note that we implicitly applied the chain rule to obtain \partial f/\partial x. By substituting the results of the derivatives of the elementary functions, we get

\frac{\partial f}{\partial c} = 1 \cdot \frac{1}{2\sqrt{c}} + 1 \cdot (-\sin(c))    (5.139)

\frac{\partial f}{\partial b} = \frac{\partial f}{\partial c} \cdot 1    (5.140)

\frac{\partial f}{\partial a} = \frac{\partial f}{\partial b} \exp(a) + \frac{\partial f}{\partial c} \cdot 1    (5.141)

\frac{\partial f}{\partial x} = \frac{\partial f}{\partial a} \cdot 2x.    (5.142)

By thinking of each of the derivatives above as a variable, we observe that the computation required for calculating the derivative is of similar complexity as the computation of the function itself. This is quite counterintuitive since the mathematical expression for the derivative \partial f/\partial x in (5.110) is significantly more complicated than the mathematical expression of the function f(x) in (5.109).
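The backward pass (5.135)–(5.142) of Example 5.14 can be transcribed into code almost line by line. The following Python sketch is such a transcription; the variable names mirror the intermediate variables a, …, e, and the comparison with the closed-form gradient (5.110) at the end is our own addition:

```python
import numpy as np

def f_and_gradient(x):
    # Forward pass with intermediate variables (5.123)-(5.128).
    a = x ** 2
    b = np.exp(a)
    c = a + b
    d = np.sqrt(c)
    e = np.cos(c)
    f = d + e

    # Reverse pass (5.139)-(5.142), read off the computation graph backward.
    df_dc = 1.0 / (2 * np.sqrt(c)) - np.sin(c)       # (5.139)
    df_db = df_dc * 1.0                              # (5.140)
    df_da = df_db * np.exp(a) + df_dc * 1.0          # (5.141)
    df_dx = df_da * 2 * x                            # (5.142)
    return f, df_dx

x = 1.3
f_val, grad = f_and_gradient(x)

# Comparison with the closed-form gradient (5.110): the two values coincide.
u = x ** 2 + np.exp(x ** 2)
closed_form = 2 * x * (1 / (2 * np.sqrt(u)) - np.sin(u)) * (1 + np.exp(x ** 2))
print(f_val, grad, closed_form)
```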

Automatic differentiation is a formalization of Example 5.14. Let x_1, \ldots, x_d be the input variables to the function, x_{d+1}, \ldots, x_{D-1} be the intermediate variables, and x_D the output variable. Then the computation graph can be expressed as follows:

\text{For } i = d+1, \ldots, D: \quad x_i = g_i(x_{\text{Pa}(x_i)}),    (5.143)

where the g_i(\cdot) are elementary functions and x_{\text{Pa}(x_i)} are the parent nodes of the variable x_i in the graph. Given a function defined in this way, we can use the chain rule to compute the derivative of the function in a step-by-step fashion. Recall that by definition f = x_D and hence

\frac{\partial f}{\partial x_D} = 1.    (5.144)

For other variables x_i, we apply the chain rule

\frac{\partial f}{\partial x_i} = \sum_{x_j : x_i \in \text{Pa}(x_j)} \frac{\partial f}{\partial x_j} \frac{\partial x_j}{\partial x_i} = \sum_{x_j : x_i \in \text{Pa}(x_j)} \frac{\partial f}{\partial x_j} \frac{\partial g_j}{\partial x_i},    (5.145)

where \text{Pa}(x_j) is the set of parent nodes of x_j in the computation graph. Equation (5.143) is the forward propagation of a function, whereas (5.145) is the backpropagation of the gradient through the computation graph. Automatic differentiation in reverse mode requires a parse tree. For neural network training, we backpropagate the error of the prediction with respect to the label.
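A generic reverse-mode implementation of (5.143)–(5.145) only needs, for every node x_i, its parent nodes, the elementary function g_i, and the local derivatives of g_i with respect to each parent. The sketch below encodes the computation graph of Example 5.14 in such a generic form; the dictionary-based graph representation is our own illustrative choice, not a prescribed data structure:

```python
import math

# Each node stores its parents, its value, and the local derivatives of its
# elementary function g_i with respect to every parent, cf. (5.143).
def build_graph(x):
    nodes = {"x": {"parents": [], "value": x}}

    def add(name, parents, g, dg):
        vals = [nodes[p]["value"] for p in parents]
        nodes[name] = {"parents": parents, "value": g(*vals), "dg": dg}

    # Graph of Example 5.14: x -> a -> (b, c) -> (d, e) -> f.
    add("a", ["x"], lambda x: x ** 2,        lambda x: [2 * x])
    add("b", ["a"], math.exp,                lambda a: [math.exp(a)])
    add("c", ["a", "b"], lambda a, b: a + b, lambda a, b: [1.0, 1.0])
    add("d", ["c"], math.sqrt,               lambda c: [1 / (2 * math.sqrt(c))])
    add("e", ["c"], math.cos,                lambda c: [-math.sin(c)])
    add("f", ["d", "e"], lambda d, e: d + e, lambda d, e: [1.0, 1.0])
    return nodes

def reverse_mode(nodes, output="f"):
    # (5.144): df/dx_D = 1; (5.145): accumulate df/dx_i over all children x_j.
    grads = {name: 0.0 for name in nodes}
    grads[output] = 1.0
    # Insertion order is topological here, so we traverse it in reverse.
    for name in reversed(list(nodes)):
        node = nodes[name]
        if not node["parents"]:
            continue
        vals = [nodes[p]["value"] for p in node["parents"]]
        local = node["dg"](*vals)
        for parent, dg in zip(node["parents"], local):
            grads[parent] += grads[name] * dg   # df/dx_p += df/dx_j * dg_j/dx_p
    return grads

nodes = build_graph(1.3)
grads = reverse_mode(nodes)
print(nodes["f"]["value"], grads["x"])          # function value and df/dx
```

Running the sketch reproduces the function value and the gradient of Example 5.14; supporting a new elementary operation only requires registering its function and its local derivatives.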

The automatic differentiation approach above works whenever we have a function that can be expressed as a computation graph, where the elementary functions are differentiable. In fact, the function may not even be a mathematical function but a computer program. However, not all computer programs can be automatically differentiated, e.g., if we cannot find differentiable elementary functions. Programming structures, such as for loops and if statements, require more care as well.
