5.3 Gradients of Vector-Valued Functions

Therefore, the partial derivative of a vector-valued function $f: \mathbb{R}^n \to \mathbb{R}^m$ with respect to $x_i \in \mathbb{R}$, $i = 1, \ldots, n$, is given as the vector

$$
\frac{\partial f}{\partial x_i}
= \begin{bmatrix}
\frac{\partial f_1}{\partial x_i} \\ \vdots \\ \frac{\partial f_m}{\partial x_i}
\end{bmatrix}
= \begin{bmatrix}
\lim_{h\to 0} \frac{f_1(x_1, \ldots, x_{i-1}, x_i + h, x_{i+1}, \ldots, x_n) - f_1(x)}{h} \\
\vdots \\
\lim_{h\to 0} \frac{f_m(x_1, \ldots, x_{i-1}, x_i + h, x_{i+1}, \ldots, x_n) - f_m(x)}{h}
\end{bmatrix}
\in \mathbb{R}^m. \tag{5.55}
$$

From (5.40), we know that the gradient of $f$ with respect to a vector is the row vector of the partial derivatives. In (5.55), every partial derivative $\partial f/\partial x_i$ is itself a column vector. Therefore, we obtain the gradient of $f: \mathbb{R}^n \to \mathbb{R}^m$ with respect to $x \in \mathbb{R}^n$ by collecting these partial derivatives:

$$
\frac{\mathrm{d}f(x)}{\mathrm{d}x}
= \begin{bmatrix} \dfrac{\partial f(x)}{\partial x_1} & \cdots & \dfrac{\partial f(x)}{\partial x_n} \end{bmatrix} \tag{5.56a}
$$
$$
= \begin{bmatrix}
\dfrac{\partial f_1(x)}{\partial x_1} & \cdots & \dfrac{\partial f_1(x)}{\partial x_n} \\
\vdots & & \vdots \\
\dfrac{\partial f_m(x)}{\partial x_1} & \cdots & \dfrac{\partial f_m(x)}{\partial x_n}
\end{bmatrix} \in \mathbb{R}^{m \times n}. \tag{5.56b}
$$

Definition 5.6 (Jacobian). The collection of all first-order partial derivatives of a vector-valued function $f: \mathbb{R}^n \to \mathbb{R}^m$ is called the Jacobian. The Jacobian $J$ is an $m \times n$ matrix, which we define and arrange as follows:

$$
J = \nabla_x f = \frac{\mathrm{d}f(x)}{\mathrm{d}x}
= \begin{bmatrix} \dfrac{\partial f(x)}{\partial x_1} & \cdots & \dfrac{\partial f(x)}{\partial x_n} \end{bmatrix} \tag{5.57}
$$
$$
= \begin{bmatrix}
\dfrac{\partial f_1(x)}{\partial x_1} & \cdots & \dfrac{\partial f_1(x)}{\partial x_n} \\
\vdots & & \vdots \\
\dfrac{\partial f_m(x)}{\partial x_1} & \cdots & \dfrac{\partial f_m(x)}{\partial x_n}
\end{bmatrix}, \tag{5.58}
$$
$$
x = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}, \qquad J(i, j) = \frac{\partial f_i}{\partial x_j}. \tag{5.59}
$$

In particular, the gradient of a function $f: \mathbb{R}^n \to \mathbb{R}^m$ is a matrix of size $m \times n$.

As a special case of (5.58), a function $f: \mathbb{R}^n \to \mathbb{R}^1$, which maps a vector $x \in \mathbb{R}^n$ onto a scalar (e.g., $f(x) = \sum_{i=1}^n x_i$), possesses a Jacobian that is a row vector (matrix of dimension $1 \times n$); see (5.40).
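To make the arrangement in (5.55) and (5.58) concrete, here is a minimal NumPy sketch (not part of the book text); the helper name `jacobian_fd`, the step size, and the example function are illustrative choices, and forward differences stand in for the limits in (5.55).

```python
import numpy as np

def jacobian_fd(f, x, h=1e-6):
    """Forward-difference approximation of the Jacobian of f: R^n -> R^m,
    arranged in numerator layout (m x n), cf. (5.55) and (5.58)."""
    x = np.asarray(x, dtype=float)
    fx = np.asarray(f(x), dtype=float)
    J = np.zeros((fx.size, x.size))
    for i in range(x.size):
        x_step = x.copy()
        x_step[i] += h                                # perturb x_i
        J[:, i] = (np.asarray(f(x_step)) - fx) / h    # column i is ∂f/∂x_i
    return J

# Example: f(x) = [x1*x2, sin(x1)] has Jacobian [[x2, x1], [cos(x1), 0]].
f = lambda x: np.array([x[0] * x[1], np.sin(x[0])])
print(jacobian_fd(f, [1.0, 2.0]))
```

The column-by-column construction mirrors (5.56a): the $i$th column of the Jacobian is the partial derivative $\partial f/\partial x_i$.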

Remark. In this book, we use the numerator layout of the derivative, i.e., the derivative $\mathrm{d}f/\mathrm{d}x$ of $f \in \mathbb{R}^m$ with respect to $x \in \mathbb{R}^n$ is an $m \times n$ matrix, where the elements of $f$ define the rows and the elements of $x$ define the columns of the corresponding Jacobian; see (5.58). There exists also the denominator layout, which is the transpose of the numerator layout. In this book, we will use the numerator layout. ♢

We will see how the Jacobian is used in the change-of-variable method for probability distributions in Section 6.7. The amount of scaling due to the transformation of a variable is provided by the determinant.

Figure 5.5 The determinant of the Jacobian of $f$ can be used to compute the scaling factor between the blue and orange areas.

In Section 4.1, we saw that the determinant can be used to compute the area of a parallelogram. If we are given two vectors $b_1 = [1, 0]^\top$, $b_2 = [0, 1]^\top$ as the sides of the unit square (blue; see Figure 5.5), the area of this square is

$$
\det\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = 1. \tag{5.60}
$$

If we take a parallelogram with the sides $c_1 = [-2, 1]^\top$, $c_2 = [1, 1]^\top$ (orange in Figure 5.5), its area is given as the absolute value of the determinant (see Section 4.1)

$$
\left|\det\begin{bmatrix} -2 & 1 \\ 1 & 1 \end{bmatrix}\right| = |-3| = 3, \tag{5.61}
$$

i.e., the area of this parallelogram is exactly three times the area of the unit square.

We can find this scaling factor by finding a mapping that transforms the unit square into the parallelogram. In linear algebra terms, we effectively perform a variable transformation from (b1,b2) to (c1,c2). In our case, the mapping is linear and the absolute value of the determinant of this mapping gives us exactly the scaling factor we are looking for.

We will describe two approaches to identify this mapping. First, we exploit the fact that the mapping is linear, so that we can use the tools from Chapter 2 to identify it. Second, we will find the mapping using partial derivatives, the tools we have been discussing in this chapter.

Approach 1 To get started with the linear algebra approach, we identify both $\{b_1, b_2\}$ and $\{c_1, c_2\}$ as bases of $\mathbb{R}^2$ (see Section 2.6.1 for a recap). What we effectively perform is a change of basis from $(b_1, b_2)$ to $(c_1, c_2)$, and we are looking for the transformation matrix that implements the basis change. Using results from Section 2.7.2, we identify the desired basis change matrix as

$$
J = \begin{bmatrix} -2 & 1 \\ 1 & 1 \end{bmatrix}, \tag{5.62}
$$

such that $Jb_1 = c_1$ and $Jb_2 = c_2$. The absolute value of the determinant of $J$, which yields the scaling factor we are looking for, is given as $|\det(J)| = 3$, i.e., the area of the parallelogram spanned by $(c_1, c_2)$ is three times the area spanned by $(b_1, b_2)$.
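As a quick numerical sanity check (an added sketch, not taken from the book), the basis change matrix in (5.62) indeed maps the sides of the unit square onto the sides of the parallelogram, and its determinant yields the scaling factor from (5.61):

```python
import numpy as np

# The basis change matrix J from (5.62) maps b1 -> c1 and b2 -> c2,
# and |det(J)| gives the area scaling factor.
J = np.array([[-2.0, 1.0],
              [1.0, 1.0]])
b1, b2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
c1, c2 = np.array([-2.0, 1.0]), np.array([1.0, 1.0])

assert np.allclose(J @ b1, c1) and np.allclose(J @ b2, c2)
print(abs(np.linalg.det(J)))   # ≈ 3.0, the scaling factor from (5.61)
```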

Approach 2 The linear algebra approach works for linear transformations; for nonlinear transformations (which become relevant in Section 6.7), we follow a more general approach using partial derivatives.

For this approach, we consider a function $f: \mathbb{R}^2 \to \mathbb{R}^2$ that performs a variable transformation. In our example, $f$ maps the coordinate representation of any vector $x \in \mathbb{R}^2$ with respect to $(b_1, b_2)$ onto the coordinate representation $y \in \mathbb{R}^2$ with respect to $(c_1, c_2)$. We want to identify the mapping so that we can compute how an area (or volume) changes when it is being transformed by $f$. For this, we need to find out how $f(x)$ changes if we modify $x$ a bit. This question is exactly answered by the Jacobian matrix $\frac{\mathrm{d}f}{\mathrm{d}x} \in \mathbb{R}^{2 \times 2}$. Since we can write

$$
y_1 = -2x_1 + x_2, \tag{5.63}
$$
$$
y_2 = x_1 + x_2, \tag{5.64}
$$

we obtain the functional relationship between $x$ and $y$, which allows us to get the partial derivatives

$$
\frac{\partial y_1}{\partial x_1} = -2, \quad \frac{\partial y_1}{\partial x_2} = 1, \quad \frac{\partial y_2}{\partial x_1} = 1, \quad \frac{\partial y_2}{\partial x_2} = 1 \tag{5.65}
$$

and compose the Jacobian as

$$
J = \begin{bmatrix}
\dfrac{\partial y_1}{\partial x_1} & \dfrac{\partial y_1}{\partial x_2} \\
\dfrac{\partial y_2}{\partial x_1} & \dfrac{\partial y_2}{\partial x_2}
\end{bmatrix}
= \begin{bmatrix} -2 & 1 \\ 1 & 1 \end{bmatrix}. \tag{5.66}
$$

The Jacobian represents the coordinate transformation we are looking for. It is exact if the coordinate transformation is linear (as in our case), and (5.66) recovers exactly the basis change matrix in (5.62). If the coordinate transformation is nonlinear, the Jacobian approximates this nonlinear transformation locally with a linear one. The absolute value of the Jacobian determinant $|\det(J)|$ is the factor by which areas or volumes are scaled when coordinates are transformed; geometrically, the Jacobian determinant gives the magnification/scaling factor when we transform an area or volume. Our case yields $|\det(J)| = 3$.

The Jacobian determinant and variable transformations will become relevant in Section 6.7 when we transform random variables and probability distributions. These transformations are extremely relevant in machine learning in the context of training deep neural networks using the reparametrization trick, also called infinite perturbation analysis.
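The same Jacobian can be recovered numerically. The sketch below (not from the book; the helper `jacobian_fd` and the evaluation point are illustrative) approximates (5.66) by finite differences and confirms the scaling factor $|\det(J)| = 3$; since the mapping is linear, the result does not depend on where the Jacobian is evaluated.

```python
import numpy as np

# The variable transformation from (5.63)-(5.64).
f = lambda x: np.array([-2.0 * x[0] + x[1], x[0] + x[1]])

def jacobian_fd(f, x, h=1e-6):
    """Finite-difference Jacobian in numerator layout (output dim x input dim)."""
    fx = f(x)
    return np.column_stack([(f(x + h * e) - fx) / h for e in np.eye(x.size)])

J = jacobian_fd(f, np.array([0.3, -0.7]))   # any point works: the map is linear
print(np.round(J, 3))                       # [[-2.  1.] [ 1.  1.]], matching (5.66)
print(abs(np.linalg.det(J)))                # ≈ 3.0, the area scaling factor
```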

Figure 5.6 Dimensionality of (partial) derivatives.

In this chapter, we encountered derivatives of functions. Figure 5.6 summarizes the dimensions of those derivatives. If $f: \mathbb{R} \to \mathbb{R}$, the gradient is simply a scalar (top-left entry). For $f: \mathbb{R}^D \to \mathbb{R}$, the gradient is a $1 \times D$ row vector (top-right entry). For $f: \mathbb{R} \to \mathbb{R}^E$, the gradient is an $E \times 1$ column vector, and for $f: \mathbb{R}^D \to \mathbb{R}^E$ the gradient is an $E \times D$ matrix.

Example 5.9 (Gradient of a Vector-Valued Function)

We are given

$$
f(x) = Ax, \qquad f(x) \in \mathbb{R}^M, \quad A \in \mathbb{R}^{M \times N}, \quad x \in \mathbb{R}^N.
$$

To compute the gradient $\mathrm{d}f/\mathrm{d}x$, we first determine the dimension of $\mathrm{d}f/\mathrm{d}x$: Since $f: \mathbb{R}^N \to \mathbb{R}^M$, it follows that $\mathrm{d}f/\mathrm{d}x \in \mathbb{R}^{M \times N}$. Second, to compute the gradient we determine the partial derivatives of $f$ with respect to every $x_j$:

$$
f_i(x) = \sum_{j=1}^{N} A_{ij} x_j \implies \frac{\partial f_i}{\partial x_j} = A_{ij}. \tag{5.67}
$$

We collect the partial derivatives in the Jacobian and obtain the gradient

$$
\frac{\mathrm{d}f}{\mathrm{d}x}
= \begin{bmatrix}
\dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_N} \\
\vdots & & \vdots \\
\dfrac{\partial f_M}{\partial x_1} & \cdots & \dfrac{\partial f_M}{\partial x_N}
\end{bmatrix}
= \begin{bmatrix}
A_{11} & \cdots & A_{1N} \\
\vdots & & \vdots \\
A_{M1} & \cdots & A_{MN}
\end{bmatrix}
= A \in \mathbb{R}^{M \times N}. \tag{5.68}
$$
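A short numerical check (an added sketch, not part of the book): for randomly chosen $A$ and $x$, a finite-difference Jacobian of $f(x) = Ax$ recovers $A$, in agreement with (5.68). The dimensions, seed, and tolerance are arbitrary choices.

```python
import numpy as np

# For f(x) = A x, the Jacobian df/dx equals A; verify by finite differences.
rng = np.random.default_rng(0)
M, N = 3, 4
A = rng.standard_normal((M, N))
x = rng.standard_normal(N)
f = lambda z: A @ z

h = 1e-6
J = np.column_stack([(f(x + h * e) - f(x)) / h for e in np.eye(N)])
print(np.allclose(J, A, atol=1e-4))   # True: the numerical Jacobian recovers A
```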

Example 5.10 (Chain Rule)

Consider the function $h: \mathbb{R} \to \mathbb{R}$, $h(t) = (f \circ g)(t)$ with

$$
f: \mathbb{R}^2 \to \mathbb{R}, \tag{5.69}
$$
$$
g: \mathbb{R} \to \mathbb{R}^2, \tag{5.70}
$$
$$
f(x) = \exp(x_1 x_2^2), \tag{5.71}
$$
$$
x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = g(t) = \begin{bmatrix} t\cos t \\ t\sin t \end{bmatrix} \tag{5.72}
$$

and compute the gradient of $h$ with respect to $t$. Since $f: \mathbb{R}^2 \to \mathbb{R}$ and $g: \mathbb{R} \to \mathbb{R}^2$, we note that

$$
\frac{\partial f}{\partial x} \in \mathbb{R}^{1 \times 2}, \qquad \frac{\partial g}{\partial t} \in \mathbb{R}^{2 \times 1}. \tag{5.73}
$$

The desired gradient is computed by applying the chain rule:

$$
\frac{\mathrm{d}h}{\mathrm{d}t}
= \frac{\partial f}{\partial x}\frac{\partial x}{\partial t}
= \begin{bmatrix} \dfrac{\partial f}{\partial x_1} & \dfrac{\partial f}{\partial x_2} \end{bmatrix}
\begin{bmatrix} \dfrac{\partial x_1}{\partial t} \\ \dfrac{\partial x_2}{\partial t} \end{bmatrix} \tag{5.74a}
$$
$$
= \begin{bmatrix} \exp(x_1 x_2^2)\, x_2^2 & 2\exp(x_1 x_2^2)\, x_1 x_2 \end{bmatrix}
\begin{bmatrix} \cos t - t\sin t \\ \sin t + t\cos t \end{bmatrix} \tag{5.74b}
$$
$$
= \exp(x_1 x_2^2)\bigl(x_2^2(\cos t - t\sin t) + 2 x_1 x_2(\sin t + t\cos t)\bigr), \tag{5.74c}
$$

where $x_1 = t\cos t$ and $x_2 = t\sin t$; see (5.72).
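As an added sanity check (not from the book), the closed-form derivative (5.74c) can be compared against a central finite difference of $h(t)$; the evaluation point $t = 1.3$ and the step size are arbitrary.

```python
import numpy as np

# Compare (5.74c) with a finite-difference derivative of
# h(t) = exp(x1 * x2^2), where x1 = t*cos(t) and x2 = t*sin(t).
def h(t):
    x1, x2 = t * np.cos(t), t * np.sin(t)
    return np.exp(x1 * x2**2)

def dh_dt(t):                               # closed form from (5.74c)
    x1, x2 = t * np.cos(t), t * np.sin(t)
    return np.exp(x1 * x2**2) * (x2**2 * (np.cos(t) - t * np.sin(t))
                                 + 2 * x1 * x2 * (np.sin(t) + t * np.cos(t)))

t, eps = 1.3, 1e-6
print(dh_dt(t), (h(t + eps) - h(t - eps)) / (2 * eps))   # nearly identical values
```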

Example 5.11 (Gradient of a Least-Squares Loss in a Linear Model)

Let us consider the linear model

$$
y = \Phi\theta, \tag{5.75}
$$

where $\theta \in \mathbb{R}^D$ is a parameter vector, $\Phi \in \mathbb{R}^{N \times D}$ are input features, and $y \in \mathbb{R}^N$ are the corresponding observations. We will discuss this model in much more detail in Chapter 9 in the context of linear regression, where we need derivatives of the least-squares loss $L$ with respect to the parameters $\theta$. We define the functions

$$
L(e) := \|e\|^2, \tag{5.76}
$$
$$
e(\theta) := y - \Phi\theta. \tag{5.77}
$$

We seek $\frac{\partial L}{\partial \theta}$, and we will use the chain rule for this purpose. $L$ is called a least-squares loss function.

Before we start our calculation, we determine the dimensionality of the gradient as

$$
\frac{\partial L}{\partial \theta} \in \mathbb{R}^{1 \times D}. \tag{5.78}
$$

The chain rule allows us to compute the gradient as

$$
\frac{\partial L}{\partial \theta} = \frac{\partial L}{\partial e}\frac{\partial e}{\partial \theta}, \tag{5.79}
$$

where the $d$th element is given by

$$
\frac{\partial L}{\partial \theta}[1, d] = \sum_{n=1}^{N} \frac{\partial L}{\partial e}[n]\,\frac{\partial e}{\partial \theta}[n, d]. \tag{5.80}
$$

(In NumPy, this sum can be written as dLdtheta = np.einsum('n,nd', dLde, dedtheta).)

We know that $\|e\|^2 = e^\top e$ (see Section 3.2) and determine

$$
\frac{\partial L}{\partial e} = 2e^\top \in \mathbb{R}^{1 \times N}. \tag{5.81}
$$

Furthermore, we obtain

$$
\frac{\partial e}{\partial \theta} = -\Phi \in \mathbb{R}^{N \times D}, \tag{5.82}
$$

such that our desired derivative is

$$
\frac{\partial L}{\partial \theta} = -2e^\top\Phi
\overset{(5.77)}{=} -\underbrace{2(y - \Phi\theta)^\top}_{1 \times N}\,\underbrace{\Phi}_{N \times D} \in \mathbb{R}^{1 \times D}. \tag{5.83}
$$

Remark. We would have obtained the same result without using the chain rule by immediately looking at the function

$$
L_2(\theta) := \|y - \Phi\theta\|^2 = (y - \Phi\theta)^\top(y - \Phi\theta). \tag{5.84}
$$

This approach is still practical for simple functions like $L_2$ but becomes impractical for deep function compositions. ♢
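As an added numerical check (not part of the book), the analytic gradient (5.83) agrees with a finite-difference gradient of $L(\theta) = \|y - \Phi\theta\|^2$ for randomly generated data; the dimensions, seed, and tolerance are arbitrary choices.

```python
import numpy as np

# Check dL/dtheta = -2 (y - Phi @ theta)^T Phi from (5.83) against
# a central finite-difference gradient of L(theta) = ||y - Phi theta||^2.
rng = np.random.default_rng(1)
N, D = 6, 3
Phi = rng.standard_normal((N, D))
y = rng.standard_normal(N)
theta = rng.standard_normal(D)

L = lambda th: np.sum((y - Phi @ th) ** 2)
grad_analytic = -2.0 * (y - Phi @ theta) @ Phi        # row vector, shape (D,)

eps = 1e-6
grad_fd = np.array([(L(theta + eps * e) - L(theta - eps * e)) / (2 * eps)
                    for e in np.eye(D)])
print(np.allclose(grad_analytic, grad_fd, atol=1e-4))  # True
```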
