Linear Algebra for Computer Science
Lecture 23
Introduction to Machine Learning
and learning from data
Machine Learning
input → Model → output

Classification
input image → Classifier → Apple
input image → Classifier → Orange

Object Detection
input image → Detector → detected objects
Speech Recognition
audio input → Model → "Once upon a time, ..." (transcribed text)
Segmentation
input image → Model → segmentation map

Stock Market Prediction
historical data → Predictor → predicted prices
Learning from data
https://www.analyticsvidhya.com/blog/2018/03/comprehensive-collection-deep-learning-datasets/
Supervised Learning
http://seansoleyman.com/effect-of-dataset-size-on-image-classification-accuracy/
Training data: (X1, y1), (X2, y2), (X3, y3), …, (XN, yN)
Training data (labels as class names): Apple, Apple, Orange, …, Orange
Training data (labels as numbers): 0, 0, 1, …, 1
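(Side note, not from the slides: a minimal Python sketch of mapping the class names above to the numeric labels 0/1; the lists are made up.)

# Minimal sketch: encoding class names as integer labels.
class_names = ["Apple", "Apple", "Orange", "Orange"]   # hypothetical label list
label_of = {"Apple": 0, "Orange": 1}                    # class name -> class index
y = [label_of[name] for name in class_names]
print(y)   # [0, 0, 1, 1]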
Supervised Learning
input → Classifier/Regressor → output
Classification
input features → Classifier → y ∈ {Class1, Class2, …, Classn}
example: input image → Classifier → Apple
example: input image → Classifier → Orange
Regression
input features → Regressor → y ∈ R
input features → Regressor → y ∈ Rn
Learnable Models
input → Classifier/Regressor → output
Learnable Models: Example
input image → Classifier → 0
input image → Classifier → 1

Learnable Models: Input-output map
x ∈ Rm → f → y ∈ Rn
y = f(x),   f: Rm → Rn
Learnable Models: Example
image I,  x = I.flatten() → f → y = 0
y = f(x),   f: Rm → Rn

Learnable Models: Example
image I,  x = features(I) → f → y = 0
y = f(x),   f: Rm → Rn
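(Side note: `I.flatten()` on the slide suggests a NumPy-style image array. Below is a minimal sketch with a made-up 28×28 image and a hypothetical `features` function standing in for hand-crafted features.)

import numpy as np

# Sketch, assuming I is a NumPy image array (e.g. 28x28 grayscale).
I = np.zeros((28, 28))          # placeholder image
x = I.flatten()                 # x in R^m with m = 28*28 = 784

# A hypothetical hand-crafted alternative to raw pixels:
def features(I):
    # a few simple summary statistics of the image
    return np.array([I.mean(), I.std(), I.max(), I.min()])

x_feat = features(I)            # x in R^4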
Learnable Models: parameters
x ∈ Rm → fθ → y ∈ Rn
y = f(θ, x)
θ: model parameters
Learnable Models: parameters
x ∈ Rm → fθ → y = f(θ, x)
● Parameter Learning:
○ Given a collection of input-output pairs (x1, y1), (x2, y2), …, (xN, yN),
○ choose θ such that y = f(θ, x) is a reasonable output for any input x.
Learning from data
x ∈ Rm → fθ → y = f(θ, x)
● Parameter Learning:
○ Given a collection of input-output pairs (x1, y1), (x2, y2), …, (xN, yN),
○ choose θ such that y = f(θ, x) is a reasonable output
■ for the training data (x1, y1), (x2, y2), …, (xN, yN)
■ for unseen data (generalization)
Learning from data
x ∈ Rm → fθ → y = f(θ, x)
● Training data (x1, y1), (x2, y2), …, (xN, yN)
○ choose θ such that f(θ, xi) is close to yi
Learning from data: Cost function
x ∈ Rm → fθ → y = f(θ, x)
● Training data (x1, y1), (x2, y2), …, (xN, yN)
○ choose θ such that f(θ, xi) is close to yi
○ cost function:
C(θ) = 𝚺i=1..N d( f(θ, xi), yi )
■ yi: data output
■ f(θ, xi): model output given xi
■ d: distance
Learning from data: Cost function
x ∈ Rm → fθ → y = f(θ, x)
● Training data (x1, y1), (x2, y2), …, (xN, yN)
○ choose θ such that f(θ, xi) is close to yi
○ cost function (with squared distance):
C(θ) = 𝚺i=1..N ǁ f(θ, xi) - yi ǁ2
Learning from data: Cost function
x ∈ Rm → fθ → y = f(θ, x)
● Training data (x1, y1), (x2, y2), …, (xN, yN)
○ choose θ such that f(θ, xi) is close to yi
○ cost function:
C(θ) = 𝚺i=1..N d( f(θ, xi), yi )
○ choose θ such that the cost function C(θ) is small:
θ* = argminθ C(θ)
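(Side note: a minimal Python sketch of evaluating C(θ) with the squared distance as d; the toy model f and the data points are made up for illustration.)

import numpy as np

def cost(theta, f, xs, ys):
    # C(theta) = sum_i || f(theta, x_i) - y_i ||^2  (squared distance as d)
    return sum(np.sum((f(theta, x) - y) ** 2) for x, y in zip(xs, ys))

# Example with a toy 1-D model f(theta, x) = theta * x:
f = lambda theta, x: theta * x
xs = np.array([1.0, 2.0, 3.0])
ys = np.array([2.1, 3.9, 6.2])
print(cost(2.0, f, xs, ys))   # small cost: theta = 2 fits this data well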
Example: Linear Regression
x ∈ Rm → fθ → y = A x + b ∈ Rn
A: n by m matrix
b: n-D vector
y = f(θ, x),   θ = (A, b)
Example: Linear Regression
Maps of the form x ↦ A x + b are affine maps.
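(Side note: a minimal NumPy sketch of the affine model y = A x + b with θ = (A, b); the dimensions m = 3, n = 2 are just illustrative.)

import numpy as np

m, n = 3, 2                       # illustrative input/output dimensions
A = np.random.randn(n, m)         # n-by-m matrix
b = np.random.randn(n)            # n-D vector
theta = (A, b)                    # model parameters

def f(theta, x):
    A, b = theta
    return A @ x + b              # affine map R^m -> R^n

x = np.random.randn(m)
y = f(theta, x)                   # y in R^n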
Example: Linear Regression (scalar case)
x ∈ R → fθ → y = a x + b ∈ R
y = f(θ, x),   θ = (a, b)
Example: Linear Regression
x ∈ R → fθ → y = a x + b ∈ R
● Training data (x1, y1), (x2, y2), …, (xN, yN)
● cost function:
C(θ) = 𝚺i=1..N d( f(θ, xi), yi )
C(a,b) = 𝚺i=1..N d( f(a, b, xi), yi ) = 𝚺i=1..N d( a xi + b, yi )
● cost function (sum of squared errors):
C(a,b) = 𝚺i=1..N ( a xi + b - yi )2
a*, b* = argmina,b 𝚺i=1..N ( a xi + b - yi )2
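(Side note: a sketch that evaluates the sum-of-squared-errors cost C(a, b) on a few made-up data points.)

import numpy as np

# Made-up training data (x_i, y_i).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.2, 2.8, 5.1, 7.1])

def C(a, b):
    # Sum of squared errors over the training data.
    return np.sum((a * x + b - y) ** 2)

print(C(2.0, 1.0))   # cost of the guess a = 2, b = 1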
Example: Linear Regression
cost function (sum of squared errors):
C(a,b) = 𝚺i=1..N ( a xi + b - yi )2
a*, b* = argmina,b 𝚺i=1..N ( a xi + b - yi )2
How to find a*, b*?
Solution 1: Least squares
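(Side note: the least-squares slides are figures, so this is a sketch of the standard formulation: stack each xi with a constant 1 into a matrix X so that X [a, b]ᵀ ≈ y, then solve with np.linalg.lstsq.)

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.2, 2.8, 5.1, 7.1])

# Design matrix with a column of ones so that X @ [a, b] = a*x + b.
X = np.column_stack([x, np.ones_like(x)])
(a_star, b_star), *_ = np.linalg.lstsq(X, y, rcond=None)
print(a_star, b_star)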
Example: Linear Regression
cost function (sum of squared errors):
C(a,b) = 𝚺i=1..N ( a xi + b - yi )2
a*, b* = argmina,b 𝚺i=1..N ( a xi + b - yi )2
How to find a*, b*?
Solution 2: partial derivatives
Solution 2: partial derivatives
cost function (sum of squared errors):
C(a,b) = 𝚺i=1..N ( a xi + b - yi )2
a*, b* = argmina,b 𝚺i=1..N ( a xi + b - yi )2
∂ C(a,b) / ∂ a = 2 𝚺i=1..N xi ( a xi + b - yi ) = 0
∂ C(a,b) / ∂ b = 2 𝚺i=1..N ( a xi + b - yi ) = 0
equivalently:
𝚺i=1..N xi ( a xi + b - yi ) = 0,   𝚺i=1..N ( a xi + b - yi ) = 0
Solution 2: partial derivatives
a*, b* = argmina,b 𝚺i=1..N ( a xi + b - yi )2
𝚺i=1..N xi ( a xi + b - yi ) = a 𝚺i=1..N xi2 + b 𝚺i=1..N xi - 𝚺i=1..N xi yi = 0
𝚺i=1..N ( a xi + b - yi ) = a 𝚺i=1..N xi + b N - 𝚺i=1..N yi = 0
This is a system of linear equations in a and b:
( 𝚺i=1..N xi2 ) a + ( 𝚺i=1..N xi ) b = 𝚺i=1..N xi yi
( 𝚺i=1..N xi ) a + N b = 𝚺i=1..N yi
a*, b* ⇐ solve the system of linear equations
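(Side note: a sketch of Solution 2 on the same made-up data: build the 2-by-2 system above and solve it with np.linalg.solve; it returns the same a*, b* as least squares.)

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.2, 2.8, 5.1, 7.1])
N = len(x)

# (sum x_i^2) a + (sum x_i) b = sum x_i y_i
# (sum x_i)   a +     N     b = sum y_i
M = np.array([[np.sum(x**2), np.sum(x)],
              [np.sum(x),    N        ]])
rhs = np.array([np.sum(x * y), np.sum(y)])
a_star, b_star = np.linalg.solve(M, rhs)
print(a_star, b_star)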
Example: Linear Regression
a*, b* = argmina,b 𝚺i=1..N ( a xi + b - yi )2
a*, b* ⇐ solve the system of linear equations
x → fa*,b* → y = a* x + b*
Evaluation
x → fθ* → y = a* x + b*
● Find good parameters θ
○ θ* = argminθ 𝚺i=1..N ( f(θ, xi) - yi )2
○ or another method
● How good is θ*?
● How well does the regressor work?
● Given training data (x1, y1), (x2, y2), …, (xN, yN):
Error = C(θ*) = 𝚺i=1..N ( f(θ*, xi) - yi )2
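(Side note: a sketch of the training error C(θ*) for the fitted line, continuing the made-up data from above.)

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.2, 2.8, 5.1, 7.1])

# a_star, b_star from either solution above (recomputed here for self-containment).
X = np.column_stack([x, np.ones_like(x)])
(a_star, b_star), *_ = np.linalg.lstsq(X, y, rcond=None)

train_error = np.sum((a_star * x + b_star - y) ** 2)   # C(theta*)
print(train_error)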
Learning from data
x ∈ Rm → fθ → y = f(θ, x)
● Parameter Learning:
○ Given a collection of input-output pairs (x1, y1), (x2, y2), …, (xN, yN),
○ choose θ such that y = f(θ, x) is a reasonable output
■ for the training data (x1, y1), (x2, y2), …, (xN, yN)
■ for unseen data
○ Generalization: how well the model works on unseen data
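(Side note: a sketch of estimating generalization by holding out part of the data as an unseen test set; the data and the 80/20 split are made up.)

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=50)   # noisy line, made up

# Hold out 20% of the pairs as unseen (test) data.
idx = rng.permutation(len(x))
train, test = idx[:40], idx[40:]

# Fit on the training split only.
X_train = np.column_stack([x[train], np.ones(len(train))])
(a_star, b_star), *_ = np.linalg.lstsq(X_train, y[train], rcond=None)

train_err = np.sum((a_star * x[train] + b_star - y[train]) ** 2)
test_err = np.sum((a_star * x[test] + b_star - y[test]) ** 2)
print(train_err, test_err)   # the test error estimates generalization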