Single and Multivariable Calculus
Text Book (Calculus)
Mathematics for Machine Learning
• https://mml-book.github.io/
• https://github.com/vbartle/MML-Companion
Table of Contents:
5 Vector Calculus
5.1 Differentiation of Univariate Functions
5.2 Partial Differentiation and Gradients
5.3 Gradients of Vector-Valued Functions
5.4 Gradients of Matrices
5.5 Useful Identities for Computing Gradients
5.6 Backpropagation and Automatic Differentiation
5.7 Higher-Order Derivatives
5.8 Linearization and Multivariate Taylor Series
5.9 Further Reading
Table of Contents
Part I: Mathematical Foundations
1. Introduction and Motivation
2. Linear Algebra
3. Analytic Geometry
4. Matrix Decompositions
5. Vector Calculus
6. Probability and Distributions
7. Continuous Optimization
Part II: Central Machine Learning Problems
8. When Models Meet Data
9. Linear Regression
Many algorithms in machine learning optimize an objective function with respect to a set of desired model parameters that control how well a model explains the data:
Finding good parameters can be phrased as an optimization problem. Examples include: (i) linear regression; (ii) neural-network auto-encoders for dimensionality reduction and data compression;
and (iii) Gaussian mixture models for modeling data distributions
(a) Regression problem: Find parameters, such that the curve explains the observations (crosses) well.
(b) Density estimation with a Gaussian mixture model: Find means and covariances, such that
Outline
• Function
• 5.1 Differentiation of Univariate Functions
• 5.1.2 Differentiation Rules
• Linear Approximations and Differentials
• 5.1.1 Taylor Series
• Newton’s Method
• What Derivatives Tell Us about the Shape of a Graph
Function
A function f is a quantity that relates two quantities to each other. In this book, these quantities are typically inputs $x \in \mathbb{R}^D$ and targets (function values) f(x), which we assume are real-valued if not stated otherwise. Here $\mathbb{R}^D$ is the domain of f, and the function values f(x) are the image/codomain of f.
We often write
$f : \mathbb{R}^D \to \mathbb{R}$ (5.1a)
$x \mapsto f(x)$ (5.1b)
to specify a function, where (5.1a) specifies that f is a mapping from $\mathbb{R}^D$ to $\mathbb{R}$ and (5.1b) specifies the explicit assignment of an input x to a function value f(x). A function f assigns every input x exactly one function value f(x).
Example 5.1
Recall the dot product as a special case of an inner product. In the previous notation, the function $f(x) = x^\top x$, $x \in \mathbb{R}^2$, would be specified as
$f : \mathbb{R}^2 \to \mathbb{R}$
$x \mapsto x_1^2 + x_2^2$
In this chapter, we will discuss how to compute gradients of functions, which is often essential to facilitate learning in machine learning models since the gradient points in the direction of steepest ascent. Therefore, vector calculus is one of the fundamental mathematical tools we need in machine learning.
5.1 Differentiation of Univariate Functions
Differentiation
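Definition 5.1 (Difference Quotient). The difference quotient
$\frac{\delta y}{\delta x} := \frac{f(x + \delta x) - f(x)}{\delta x}$
computes the slope of the secant line through two points on the graph of f. In Figure 5.3, these are the points with x-coordinates $x_0$ and $x_0 + \delta x$.
In the limit for $\delta x \to 0$, we obtain the tangent of f at x, if f is differentiable. The tangent is then the derivative of f at x.
Definition 5.2 (Derivative). More formally, for h > 0 the derivative of f at x is defined as the limit
$\frac{df}{dx} := \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$
and the secant in Figure 5.3 becomes a tangent.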
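The limit in the definition of the derivative can be explored numerically. The following is a minimal sketch, not part of the original slides: the helper name `difference_quotient` and the test function `cube` are illustrative assumptions, chosen because the exact derivative of x^3 is known to be 3x^2.

```python
def difference_quotient(f, x, h):
    """Slope of the secant line through (x, f(x)) and (x + h, f(x + h))."""
    return (f(x + h) - f(x)) / h


def cube(x):
    # Illustrative choice of f; its exact derivative is 3 * x**2.
    return x ** 3


if __name__ == "__main__":
    x0 = 2.0
    # As h shrinks, the secant slope should approach the tangent slope 3 * x0**2.
    for h in (1.0, 0.1, 0.01, 0.001):
        print(f"h = {h:<6} secant slope = {difference_quotient(cube, x0, h):.6f}")
```

For very small h the subtraction f(x + h) - f(x) loses floating-point precision, so in practice h is kept moderately small rather than as small as possible.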
The Tangent Problem (1 of 3)
The word tangent is derived from the Latin word tangens, which means “touching.”
For a circle we could simply follow Euclid and say that a tangent is a line ℓ that intersects the circle once and only once, as in Figure 1(a).
For more complicated curves this definition is inadequate. Figure 1(b) shows a line ℓ that appears to be a tangent to the curve C at point P, but it intersects C twice.
We can think of a tangent to a curve as a line that touches the curve and follows the same direction as the curve at the point of contact. How can this idea be made precise?
Example 1
Find an equation of the tangent line to the parabola $y = x^2$ at the point P(1, 1).
Solution:
We will be able to find an equation of the tangent line ℓ as soon as we know its slope m.
The difficulty is that we know only one point, P, on ℓ, whereas we need two points to compute the slope.
But observe that we can compute an approximation to m by choosing a nearby point $Q(x, x^2)$ on the parabola (as in Figure 2) and computing the slope $m_{PQ}$ of the secant line PQ. (A secant line, from the Latin word secans, meaning cutting, is a line that cuts (intersects) a curve more than once.)
Example 1 – Solution (1 of 2)
We choose x ≠ 1 so that Q ≠ P. Then
$m_{PQ} = \frac{x^2 - 1}{x - 1}$
For instance, for the point Q(1.5, 2.25) we have
$m_{PQ} = \frac{2.25 - 1}{1.5 - 1} = \frac{1.25}{0.5} = 2.5$

x (right of 1)    m_PQ
2                 3
1.5               2.5
1.1               2.1
1.01              2.01
1.001             2.001

x (left of 1)     m_PQ
0                 1
0.5               1.5
0.9               1.9
0.99              1.99
0.999             1.999

The tables show the values of $m_{PQ}$ for several values of x close to 1.
The closer Q is to P, the closer x is to 1 and, it appears from the tables, the closer $m_{PQ}$ is to 2.
Example 1 – Solution (2 of 2)
This suggests that the slope of the tangent line ℓ should be m = 2.
We say that the slope of the tangent line is the limit of the slopes of the secant lines, and we express this symbolically by writing
$\lim_{Q \to P} m_{PQ} = m$ and $\lim_{x \to 1} \frac{x^2 - 1}{x - 1} = 2$
Assuming that the slope of the tangent line is indeed 2, we use the point-slope form of the equation of a line [y − y1 = m(x − x1)] to write the equation of the tangent line through (1, 1) as
$y - 1 = 2(x - 1)$ or $y = 2x - 1$
The Tangent Problem (2 of 3)
Figure 3 illustrates the limiting process that occurs in Example 1.
Figure 3: Q approaches P from the right.
The Tangent Problem (3 of 3)
Figure 3 (continued): Q approaches P from the left.
As Q approaches P along the parabola, the corresponding secant lines rotate about P and approach the tangent line ℓ.
5.1.2 Differentiation Rules
Constant Functions
Let’s start with the simplest of all functions, the constant function f(x) = c.
The graph of this function is the horizontal line y = c, which has slope 0, so we must have $f'(x) = 0$. (See Figure 1.)
Figure 1: The graph of f(x) = c is the line y = c, so $f'(x) = 0$.
A formal proof, from the definition of a derivative, is also easy:
$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} = \lim_{h \to 0} \frac{c - c}{h} = \lim_{h \to 0} 0 = 0$
In Leibniz notation, we write this rule as follows.
Derivative of a Constant Function
$\frac{d}{dx}(c) = 0$
Power Functions (1 of 3)
We next look at the functions $f(x) = x^n$, where n is a positive integer.
If n = 1, the graph of f(x) = x is the line y = x, which has slope 1. (See Figure 2.)
Figure 2: The graph of f(x) = x is the line y = x, so $f'(x) = 1$.
So
$\frac{d}{dx}(x) = 1$ (1)
(You can also verify Equation 1 from the definition of a derivative.) We have already investigated the cases n = 2 and n = 3. We found that
$\frac{d}{dx}(x^2) = 2x \qquad \frac{d}{dx}(x^3) = 3x^2$ (2)
Power Functions (2 of 3)
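For n = 4 we find the derivative of $f(x) = x^4$ as follows:
$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$
$= \lim_{h \to 0} \frac{(x + h)^4 - x^4}{h}$
$= \lim_{h \to 0} \frac{x^4 + 4x^3h + 6x^2h^2 + 4xh^3 + h^4 - x^4}{h}$
$= \lim_{h \to 0} \frac{4x^3h + 6x^2h^2 + 4xh^3 + h^4}{h}$
$= \lim_{h \to 0} (4x^3 + 6x^2h + 4xh^2 + h^3) = 4x^3$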
Power Functions (3 of 3)
Thus
$\frac{d}{dx}(x^4) = 4x^3$ (3)
Comparing the equations in (1), (2), and (3), we see a pattern emerging.
It seems to be a reasonable guess that, when n is a positive integer, $\frac{d}{dx}(x^n) = nx^{n-1}$. This turns out to be true.
The Power Rule If n is a positive integer, then
$\frac{d}{dx}(x^n) = nx^{n-1}$
The Power Rule (General Version) If n is any real number, then
$\frac{d}{dx}(x^n) = nx^{n-1}$
The Power Rule enables us to find tangent lines without having to resort to the definition of a derivative. It also enables us to find normal lines.
The normal line to a curve C at a point P is the line through P that is perpendicular to the tangent line at P.
Example 1
(a) If $f(x) = x^6$, then $f'(x) = 6x^5$.
(b) If $y = x^{1000}$, then $y' = 1000x^{999}$.
(c) If $y = t^4$, then $\frac{dy}{dt} = 4t^3$.
(d) $\frac{d}{dr}(r^3) = 3r^2$
New Derivatives from Old (1 of 2)
When new functions are formed from old functions by addition, subtraction, or multiplication by a constant, their derivatives can be calculated in terms of
derivatives of the old functions.
In particular, the following formula says that the derivative of a constant times a function is the constant times the derivative of the function.
The Constant Multiple Rule If c is a constant and f is a differentiable function, then
$\frac{d}{dx}[cf(x)] = c\frac{d}{dx}f(x)$
Corresponding limit law:
$\lim_{x \to a}[cf(x)] = c\lim_{x \to a}f(x)$
Example 4
(a) $\frac{d}{dx}(3x^4) = 3\frac{d}{dx}(x^4) = 3(4x^3) = 12x^3$
(b) $\frac{d}{dx}(-x) = \frac{d}{dx}[(-1)x] = (-1)\frac{d}{dx}(x) = -1(1) = -1$
New Derivatives from Old (2 of 2)
The next rule tells us that the derivative of a sum (or difference) of functions is the sum (or difference) of the derivatives.
The Sum and Difference Rules If f and g are both differentiable, then
$\frac{d}{dx}[f(x) + g(x)] = \frac{d}{dx}f(x) + \frac{d}{dx}g(x)$
$\frac{d}{dx}[f(x) - g(x)] = \frac{d}{dx}f(x) - \frac{d}{dx}g(x)$
Corresponding limit laws:
$\lim_{x \to a}[f(x) + g(x)] = \lim_{x \to a}f(x) + \lim_{x \to a}g(x)$
$\lim_{x \to a}[f(x) - g(x)] = \lim_{x \to a}f(x) - \lim_{x \to a}g(x)$
The Sum Rule can be extended to the sum of any number of functions. For instance, using this theorem twice, we get
$(f + g + h)' = [(f + g) + h]' = (f + g)' + h' = f' + g' + h'$
The Constant Multiple Rule, the Sum Rule, and the Difference Rule can be combined with the Power Rule to differentiate any polynomial.
Differentiation Rules
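Derivative of a Constant Function
$\frac{d}{dx}(c) = 0$
The Power Rule (General Version) If n is any real number, then
$\frac{d}{dx}(x^n) = nx^{n-1}$
The Constant Multiple Rule If c is a constant and f is a differentiable function, then
$\frac{d}{dx}[cf(x)] = c\frac{d}{dx}f(x)$
The Sum Rule If f and g are both differentiable, then
$\frac{d}{dx}[f(x) + g(x)] = \frac{d}{dx}f(x) + \frac{d}{dx}g(x)$
The Product Rule If f and g are both differentiable, then
$\frac{d}{dx}[f(x)g(x)] = f(x)\frac{d}{dx}g(x) + g(x)\frac{d}{dx}f(x)$
The Quotient Rule If f and g are both differentiable, then
$\frac{d}{dx}\left[\frac{f(x)}{g(x)}\right] = \frac{g(x)\frac{d}{dx}f(x) - f(x)\frac{d}{dx}g(x)}{[g(x)]^2}$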
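The Product Rule and the Quotient Rule can be sanity-checked numerically. The sketch below is an illustration only: the functions f(x) = sin x and g(x) = x^2 + 1 and the helper `derivative` (a central-difference estimate) are assumptions chosen for the example, not taken from the slides.

```python
import math


def derivative(func, x, h=1e-6):
    """Central-difference estimate of func'(x)."""
    return (func(x + h) - func(x - h)) / (2 * h)


# Illustrative pair of differentiable functions with known derivatives.
f, df = math.sin, math.cos          # f(x) = sin x,   f'(x) = cos x


def g(x):
    return x ** 2 + 1               # g(x) = x^2 + 1 (never zero)


def dg(x):
    return 2 * x                    # g'(x) = 2x


def product_rule(x):
    # (f g)' = f g' + g f'
    return f(x) * dg(x) + g(x) * df(x)


def quotient_rule(x):
    # (f / g)' = (g f' - f g') / g^2
    return (g(x) * df(x) - f(x) * dg(x)) / g(x) ** 2


if __name__ == "__main__":
    x0 = 0.7
    print(product_rule(x0), derivative(lambda t: f(t) * g(t), x0))
    print(quotient_rule(x0), derivative(lambda t: f(t) / g(t), x0))
```

Each printed pair should agree to several decimal places; the small remaining gap comes from the finite step size h.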
Example 5.5
Derivative of the Exponential Function
$\frac{d}{dx}(e^x) = e^x$
$\frac{d}{dx}(b^x) = b^x \ln b$
Derivative of the Logarithmic Functions
$\frac{d}{dx}(\ln x) = \frac{1}{x}$
$\frac{d}{dx}(\log_b x) = \frac{1}{x \ln b}$
Exponential Functions
Exponential Functions (1 of 4)
Let’s try to compute the derivative of the exponential function $f(x) = b^x$ using the definition of a derivative:
$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} = \lim_{h \to 0} \frac{b^{x+h} - b^x}{h}$
$= \lim_{h \to 0} \frac{b^x b^h - b^x}{h} = \lim_{h \to 0} \frac{b^x(b^h - 1)}{h}$
The factor $b^x$ doesn’t depend on h, so we can take it in front of the limit:
$f'(x) = b^x \lim_{h \to 0} \frac{b^h - 1}{h}$
Notice that the limit is the value of the derivative of f at 0, that is,
$\lim_{h \to 0} \frac{b^h - 1}{h} = f'(0)$
Therefore we have shown that if the exponential function $f(x) = b^x$ is differentiable at 0, then it is differentiable everywhere and
$f'(x) = f'(0)b^x$ (4)
This equation says that the rate of change of any exponential function is proportional to the function itself.
Exponential Functions (2 of 4)
Numerical evidence for the existence of $f'(0)$ is given in the table shown below for the cases b = 2 and b = 3. (Values are stated correct to four decimal places.) It appears that the limits exist and
for b = 2, $f'(0) = \lim_{h \to 0} \frac{2^h - 1}{h} \approx 0.693$
for b = 3, $f'(0) = \lim_{h \to 0} \frac{3^h - 1}{h} \approx 1.099$
Thus, from Equation 4, we have
$\frac{d}{dx}(2^x) \approx (0.693)2^x \qquad \frac{d}{dx}(3^x) \approx (1.099)3^x$ (5)
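The numerical evidence for the limit of (b^h − 1)/h can be reproduced with a few lines of code. This is a small sketch under the assumption that the limit equals ln b, which matches the rule d/dx(b^x) = b^x ln b listed among the differentiation rules; the helper name `slope_at_zero` is made up for this example.

```python
import math


def slope_at_zero(b, h):
    """Difference quotient (b^h - 1) / h, an estimate of f'(0) for f(x) = b^x."""
    return (b ** h - 1) / h


if __name__ == "__main__":
    for b in (2, 3):
        for h in (0.1, 0.01, 0.001, 0.0001):
            print(f"b = {b}, h = {h:<7} (b^h - 1)/h = {slope_at_zero(b, h):.4f}")
        # Reference value for comparison.
        print(f"b = {b}, ln b = {math.log(b):.4f}")
```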
In view of the estimates of $f'(0)$ for b = 2 and b = 3, it seems reasonable that there is a number b between 2 and 3 for which $f'(0) = 1$.
It is traditional to denote this value by the letter e. Thus we have the following definition.
Definition of the Number e e is the number such that
$\lim_{h \to 0} \frac{e^h - 1}{h} = 1$
Geometrically, this means that of all the possible exponential functions $y = b^x$, the function $f(x) = e^x$ is the one whose tangent line at (0, 1) has a slope $f'(0)$ that is exactly 1. (See Figures 6 and 7.)
Exponential Functions (4 of 4)
If we put b = e and, therefore, $f'(0) = 1$ in Equation 4, it becomes the following important differentiation formula.
Derivative of the Natural Exponential Function
$\frac{d}{dx}(e^x) = e^x$
Thus the exponential function $f(x) = e^x$ has the property that it is its own derivative. The geometrical significance of this fact is that the slope of a tangent line to the curve $y = e^x$ is equal to the y-coordinate of the point (see Figure 7).
Linear Approximations and Differentials
Linearization and Approximation
It might be easy to calculate a value f(a) of a function, but difficult (or even impossible) to compute nearby values of f.
So we settle for the easily computed values of the linear function L whose graph is the tangent line of f at (a, f(a)).
Figure 1
In other words, we use the tangent line at (a, f(a)) as an approximation to the curve y = f(x) when x is near a. An equation of this tangent line is
$y = f(a) + f'(a)(x - a)$
The linear function whose graph is this tangent line, that is,
$L(x) = f(a) + f'(a)(x - a)$ (1)
is called the linearization of f at a.
Example 1
Find the linearization of the function $f(x) = \sqrt{x + 3}$ at a = 1 and use it to approximate the numbers $\sqrt{3.98}$ and $\sqrt{4.05}$. Are these approximations overestimates or underestimates?
Solution:
The derivative of $f(x) = (x + 3)^{1/2}$ is
$f'(x) = \frac{1}{2}(x + 3)^{-1/2} = \frac{1}{2\sqrt{x + 3}}$
and so we have f(1) = 2 and $f'(1) = \frac{1}{4}$. Putting these values into Equation 1, we see that the linearization is
$L(x) = f(1) + f'(1)(x - 1) = 2 + \frac{1}{4}(x - 1) = \frac{7}{4} + \frac{x}{4}$
The corresponding linear approximation (2) is
$\sqrt{x + 3} \approx \frac{7}{4} + \frac{x}{4}$ (when x is near 1)
Example 1 – Solution
In particular, we have
$\sqrt{3.98} \approx \frac{7}{4} + \frac{0.98}{4} = 1.995$
and
$\sqrt{4.05} \approx \frac{7}{4} + \frac{1.05}{4} = 2.0125$
The linear approximation is illustrated in Figure 2. We see that, indeed, the tangent line approximation is a good approximation to the given function when x is near 1.
We also see that our approximations are overestimates because the tangent line lies above the curve.
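The worked linearization of sqrt(x + 3) at a = 1 can be checked against the true values. The following is a minimal sketch; the function names `f` and `linearization` are illustrative, and the sample points are chosen here, not taken from a table in the slides.

```python
import math


def f(x):
    return math.sqrt(x + 3)


def linearization(x):
    # L(x) = f(1) + f'(1)(x - 1) = 7/4 + x/4
    return 7 / 4 + x / 4


if __name__ == "__main__":
    # x = 0.98 and x = 1.05 correspond to sqrt(3.98) and sqrt(4.05).
    for x in (0.9, 0.98, 1.0, 1.05, 2.0, 3.0):
        estimate, actual = linearization(x), f(x)
        print(f"x = {x:<5} L(x) = {estimate:.6f}  f(x) = {actual:.6f}  "
              f"error = {estimate - actual:.6f}")
```

The error column stays non-negative, consistent with the tangent line never falling below this curve, and it grows as x moves away from 1.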
Linearization and Approximation (2 of 5)
In the following table we compare the estimates from the linear approximation in Example 1 with the true values.
Notice from this table, and also from Figure 2, that the tangent line approximation gives good estimates when x is close to 1 but the accuracy of the approximation deteriorates when x is farther away from 1.
The next example shows that by using a graphing calculator or computer we can determine an interval throughout which a linear approximation provides a specified accuracy.
Differentials
The geometric meaning of differentials is shown in Figure 5. Let dx = Δx. The corresponding change in y is
$\Delta y = f(x + \Delta x) - f(x)$
whereas the differential dy is
$dy = f'(x)\,dx$ (3)
5.1.1 Taylor Series
Taylor Series
The Taylor series is a representation of a function f as an infinite sum of terms.
These terms are determined using derivatives of f evaluated at $x_0$.
Definition 5.3 (Taylor Polynomial). The Taylor polynomial of degree n of $f : \mathbb{R} \to \mathbb{R}$ at $x_0$ is defined as
$T_n(x) := \sum_{k=0}^{n} \frac{f^{(k)}(x_0)}{k!}(x - x_0)^k$
where $f^{(k)}(x_0)$ is the kth derivative of f at $x_0$ (which we assume exists) and $\frac{f^{(k)}(x_0)}{k!}$ are the coefficients of the polynomial.
Definition 5.4 (Taylor Series). For a smooth function $f \in C^\infty$, $f : \mathbb{R} \to \mathbb{R}$, the Taylor series of f at $x_0$ is defined as
$T_\infty(x) = \sum_{k=0}^{\infty} \frac{f^{(k)}(x_0)}{k!}(x - x_0)^k$
Remark. In general, a Taylor polynomial of degree n is an approximation of a function, which does not need to be a polynomial. The Taylor polynomial is similar to f in a neighborhood around $x_0$. However, a Taylor polynomial of degree n is an exact representation of a polynomial f of degree $k \le n$ since all derivatives $f^{(i)}$, i > k vanish.
Example 5.3
When Is a Function Represented by Its Taylor Series?
Example 5.4 (Taylor Series)
Consider the function in Figure 5.4 given by f(x) = sin(x) + cos(x).
Figure 5.4 Taylor polynomials. The original function f(x) = sin(x) + cos(x) (black, solid) is approximated by Taylor polynomials (dashed) around $x_0 = 0$.
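The Taylor polynomials of f(x) = sin(x) + cos(x) can be evaluated directly from the definition. The sketch below assumes the expansion point x0 = 0, where the derivatives of f cycle through the values 1, 1, −1, −1; the function name `taylor_sin_plus_cos` is made up for this illustration.

```python
import math


def taylor_sin_plus_cos(x, n):
    """Degree-n Taylor polynomial of sin(x) + cos(x) around x0 = 0."""
    # f = sin + cos, f' = cos - sin, f'' = -sin - cos, f''' = -cos + sin,
    # so f^(k)(0) repeats the pattern 1, 1, -1, -1.
    pattern = (1.0, 1.0, -1.0, -1.0)
    return sum(pattern[k % 4] * x ** k / math.factorial(k) for k in range(n + 1))


if __name__ == "__main__":
    x = 2.0
    exact = math.sin(x) + math.cos(x)
    for n in (0, 1, 5, 10):
        approx = taylor_sin_plus_cos(x, n)
        print(f"n = {n:<2} T_n({x}) = {approx:.6f}  |error| = {abs(approx - exact):.2e}")
```

Higher degrees track the function over a wider range around 0, which is the behavior the figure illustrates.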
When Is a Function Represented by Its Taylor Series?
For $f(x) = e^x$, the Taylor polynomials at 0 with n = 1, 2, and 3 are
$T_1(x) = 1 + x$
$T_2(x) = 1 + x + \frac{x^2}{2!}$
$T_3(x) = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!}$
The graphs of the exponential function and these three Taylor polynomials are drawn in Figure 1.
As n increases, $T_n(x)$ appears to approach $e^x$ in Figure 1. This suggests that $e^x$ is equal to the sum of its Taylor series.
When Is a Function Represented by Its Taylor Series?
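8 Theorem If f(x) = Tn(x) + Rn(x), where Tn is the nth-degree Taylor polynomial of f at a, and if
$\lim_{n \to \infty} R_n(x) = 0$
for $|x - a| < R$, then f is equal to the sum of its Taylor series on the interval $|x - a| < R$.
In trying to show that $\lim_{n \to \infty} R_n(x) = 0$ for a specific function f, we usually use the following Theorem.
9 Taylor's Inequality If $|f^{(n+1)}(x)| \le M$ for $|x - a| \le d$, then the remainder Rn(x) of the Taylor series satisfies the inequality
$|R_n(x)| \le \frac{M}{(n+1)!}|x - a|^{n+1}$ for $|x - a| \le d$
An important question is whether the Taylor series converges to the original function. For the complete theorem statements, refer to a calculus textbook.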
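Taylor's Inequality can be illustrated with the exponential function. The following sketch assumes a = 0 and the interval |x| ≤ d, on which every derivative of e^x is bounded by M = e^d; the helper names `taylor_exp` and `taylor_bound` are hypothetical and exist only for this example.

```python
import math


def taylor_exp(x, n):
    """Degree-n Taylor polynomial of e^x at a = 0."""
    return sum(x ** k / math.factorial(k) for k in range(n + 1))


def taylor_bound(x, n, d):
    """Bound M |x|^(n+1) / (n+1)! with M = e^d, valid for |x| <= d."""
    M = math.exp(d)
    return M * abs(x) ** (n + 1) / math.factorial(n + 1)


if __name__ == "__main__":
    x, d = 1.5, 2.0
    for n in range(1, 10):
        remainder = abs(math.exp(x) - taylor_exp(x, n))
        print(f"n = {n}  |R_n(x)| = {remainder:.3e}  bound = {taylor_bound(x, n, d):.3e}")
```

Both columns shrink toward 0 as n grows, which is the condition under which the function equals the sum of its Taylor series at that x.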
Newton’s Method
Newton’s Method
The geometry behind Newton’s method is shown in Figure 2, where the solution that we are trying to find is labeled r.
We start with a first approximation x1, which is obtained by guessing, or from a rough sketch of the graph of f, or from a computer-generated graph of f.
Consider the tangent line L to the curve y = f(x) at the point (x1, f(x1)) and look at the x-intercept of L, labeled x2.
The idea behind Newton’s method is that the tangent line is close to the curve and so its x-intercept, x2, is close to the x-intercept of the curve (namely, the root r that we are seeking). Because the tangent is a line, we can easily find its x-intercept.
Newton’s Method
Since the x-intercept of L is x2, we know that the point (x2, 0) is on the line, and so
$0 - f(x_1) = f'(x_1)(x_2 - x_1)$
If $f'(x_1) \ne 0$, we can solve this equation for x2:
$x_2 = x_1 - \frac{f(x_1)}{f'(x_1)}$
We use x2 as a second approximation to r.
Next we repeat this procedure with x1 replaced by the second approximation x2, using the tangent line at (x2, f(x2)).
This gives a third approximation:
$x_3 = x_2 - \frac{f(x_2)}{f'(x_2)}$
Newton’s Method
If we keep repeating this process, we obtain a sequence of approximations x1, x2, x3, x4, . . . as shown in Figure 3.
In general, if the nth approximation is $x_n$ and $f'(x_n) \ne 0$, then the next approximation is given by
$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$ (2)
If the numbers $x_n$ become closer and closer to r as n becomes large, then we say that the sequence converges to r and we write $\lim_{n \to \infty} x_n = r$.
Example 1
Starting with x1 = 2, find the third approximation x3 to the solution of the equation $x^3 - 2x - 5 = 0$.
Solution:
We apply Newton’s method with $f(x) = x^3 - 2x - 5$ and $f'(x) = 3x^2 - 2$.
Newton himself used this equation to illustrate his method and he chose x1 = 2 after some experimentation because f(1) = −6, f(2) = −1, and f(3) = 16.
Equation 2 becomes
$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)} = x_n - \frac{x_n^3 - 2x_n - 5}{3x_n^2 - 2}$
With n = 1 we have
$x_2 = x_1 - \frac{x_1^3 - 2x_1 - 5}{3x_1^2 - 2} = 2 - \frac{2^3 - 2(2) - 5}{3(2)^2 - 2} = 2.1$
Example 1 – Solution
Then with n = 2 we obtain
$x_3 = x_2 - \frac{x_2^3 - 2x_2 - 5}{3x_2^2 - 2} = 2.1 - \frac{(2.1)^3 - 2(2.1) - 5}{3(2.1)^2 - 2} \approx 2.0946$
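The Newton iteration for this example can be sketched in a few lines. The function name `newton` and its signature are assumptions made for this illustration (they do not come from the slides or from any library); the cubic and its derivative are the ones used in the example.

```python
def newton(f, df, x1, steps):
    """Return [x1, x2, ..., x_{steps+1}] from x_{n+1} = x_n - f(x_n) / f'(x_n)."""
    xs = [x1]
    for _ in range(steps):
        x = xs[-1]
        slope = df(x)
        if slope == 0:
            # A horizontal tangent line has no x-intercept, so the step is undefined.
            raise ZeroDivisionError("f'(x_n) = 0; Newton step undefined")
        xs.append(x - f(x) / slope)
    return xs


def f_cubic(x):
    return x ** 3 - 2 * x - 5


def df_cubic(x):
    return 3 * x ** 2 - 2


if __name__ == "__main__":
    for n, x in enumerate(newton(f_cubic, df_cubic, 2.0, 4), start=1):
        print(f"x{n} = {x:.10f}   f(x{n}) = {f_cubic(x):.3e}")
```

The first two steps should reproduce x2 and x3 from the hand computation, and further steps drive f(x) rapidly toward 0.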
What Derivatives Tell Us about the Shape of a Graph
What Does f′ Say About f?
To see how the derivative of f can tell us where a function is increasing or decreasing, look at Figure 1.
Between A and B and between C and D, the tangent lines have positive slope and so $f'(x) > 0$.
Between B and C the tangent lines have negative slope and so $f'(x) < 0$. Thus it appears that f increases when $f'(x)$ is positive and decreases when $f'(x)$ is negative.
To prove that this is always the case, we use the Mean Value Theorem.
Increasing/Decreasing Test
(a) If $f'(x) > 0$ on an interval, then f is increasing on that interval.
(b) If $f'(x) < 0$ on an interval, then f is decreasing on that interval.
The Mean Value Theorem Let f be a function that satisfies the following hypotheses:
1. f is continuous on the closed interval [a, b].
2. f is differentiable on the open interval (a, b).
Then there is a number c in (a, b) such that
$f'(c) = \frac{f(b) - f(a)}{b - a}$ (1)
or, equivalently,
$f(b) - f(a) = f'(c)(b - a)$ (2)
Example 1
Find where the function $f(x) = 3x^4 - 4x^3 - 12x^2 + 5$ is increasing and where it is decreasing.
Solution:
We start by differentiating f:
$f'(x) = 12x^3 - 12x^2 - 24x = 12x(x - 2)(x + 1)$
To use the I/D Test we have to know where $f'(x) > 0$ and where $f'(x) < 0$. To solve these inequalities we first find where $f'(x) = 0$, namely at x = 0, 2, and −1.
Example 1 – Solution
These are the critical numbers of f, and they divide the domain into four intervals (see the number line in Figure 2).
Within each interval, $f'(x)$ must be always positive or always negative. We can determine which is the case for each interval from the signs of the three factors of $f'(x)$, namely, 12x, x − 2, and x + 1, as shown in the chart.
Example 1 – Solution (2 of 3)
A plus sign indicates that the given expression is positive, and a minus sign indicates that it is negative. The last column of the chart gives the conclusion based on the I/D Test.
For instance, $f'(x) < 0$ for 0 < x < 2, so f is decreasing on (0, 2). (It would also be true to say that f is decreasing on the closed interval [0, 2].)
Example 1 – Solution (3 of 3)
The graph of f shown in Figure 3 confirms the information in the chart.
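The sign analysis behind the Increasing/Decreasing Test can also be carried out programmatically. This is a minimal sketch for the example above: it relies on f′ keeping one sign between consecutive critical numbers, so a single sample point per interval is enough. The names `f_prime` and `classify_intervals` are illustrative, not from the slides.

```python
def f_prime(x):
    # f'(x) = 12x(x - 2)(x + 1) for f(x) = 3x^4 - 4x^3 - 12x^2 + 5
    return 12 * x * (x - 2) * (x + 1)


def classify_intervals(critical_numbers, derivative):
    """Label each interval cut out by the sorted critical numbers."""
    cs = sorted(critical_numbers)
    # One sample point left of the first critical number, one midpoint per
    # bounded interval, and one point right of the last critical number.
    samples = ([cs[0] - 1.0]
               + [(a + b) / 2 for a, b in zip(cs, cs[1:])]
               + [cs[-1] + 1.0])
    return ["increasing" if derivative(s) > 0 else "decreasing" for s in samples]


if __name__ == "__main__":
    critical = [-1.0, 0.0, 2.0]
    bounds = ["-inf"] + [str(c) for c in critical] + ["inf"]
    labels = classify_intervals(critical, f_prime)
    for left, right, label in zip(bounds, bounds[1:], labels):
        print(f"({left}, {right}): f is {label}")
```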