Large Scale Data Analysis Using Deep Learning
Linear Algebra
U Kang
Seoul National University
In This Lecture
Overview of linear algebra (not a comprehensive survey)
Focused on the subset most relevant to deep learning
For details on linear algebra, refer to Linear Algebra and Its Applications by Gilbert Strang
Scalar
A scalar is a single number
Integers, real numbers, rational numbers, etc.
We denote it with an italic font: a, n, x
Vectors
A vector is a 1-D array of numbers
Can be real, binary, integer, etc.
Notation for type and size: x ∈ R^n
Matrices
A matrix is a 2-D array of numbers
Example notation for type and shape: A ∈ R^{m x n}
Tensors
A tensor is an array of numbers that may have:
Zero dimensions, and be a scalar
One dimension, and be a vector
Two dimensions, and be a matrix
Three dimensions or more
Examples of 3-D tensors: a matrix over time, a knowledge base of (subject, verb, object) triples
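As a minimal illustration (my addition using NumPy, not part of the original slides), scalars, vectors, matrices, and higher-dimensional tensors differ only in the number of array dimensions:

```python
import numpy as np

s = np.float64(3.0)              # scalar: 0 dimensions
v = np.array([1.0, 2.0, 3.0])    # vector: 1-D array, shape (3,)
M = np.array([[1.0, 2.0],
              [3.0, 4.0]])       # matrix: 2-D array, shape (2, 2)
T = np.zeros((4, 2, 3))          # 3-D tensor, e.g., 4 time steps of a 2 x 3 matrix

print(v.ndim, M.ndim, T.ndim)    # 1 2 3
```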
Matrix Transpose
(A^T)_{i,j} = A_{j,i}
(AB)^T = B^T A^T
Matrix Product
C = AB, where C_{i,j} = Σ_k A_{i,k} B_{k,j}
Matrix Product
Matrix product as a sum of outer products: AB = Σ_k A_{:,k} B_{k,:}
Product with a diagonal matrix scales each coordinate: diag(v) x = v ⊙ x (element-wise product)
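A small NumPy sketch (my illustration, not from the slides) that checks the product formula, the transpose rule (AB)^T = B^T A^T, and the sum-of-outer-products view:

```python
import numpy as np

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[5., 6.], [7., 8.]])

C = A @ B                                   # C[i, j] = sum_k A[i, k] * B[k, j]
assert np.allclose((A @ B).T, B.T @ A.T)    # (AB)^T = B^T A^T

# AB as a sum of outer products of columns of A and rows of B
outer_sum = sum(np.outer(A[:, k], B[k, :]) for k in range(A.shape[1]))
assert np.allclose(C, outer_sum)

# Multiplying by a diagonal matrix scales coordinates element-wise
v = np.array([2., 3.])
x = np.array([1., 4.])
assert np.allclose(np.diag(v) @ x, v * x)
```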
Identity Matrix
Example identity matrix: I_3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
Systems of Equations
Ax = b expands to the set of linear equations A_{1,1}x_1 + ... + A_{1,n}x_n = b_1, ..., A_{m,1}x_1 + ... + A_{m,n}x_n = b_m
Solving Systems of Equations
A linear system of equations can have:
No solution
3x + 2y = 6, 3x + 2y = 12
Many solutions
3x + 2y = 6, 6x + 4y = 12
Exactly one solution: this means multiplication by the matrix is an invertible function
Ax = b  =>  x = A^{-1} b
Matrix Inversion
Matrix inverse: the inverse A^{-1} of an n x n matrix A satisfies A A^{-1} = A^{-1} A = I_n
Solving a system using an inverse: Ax = b  =>  x = A^{-1} b
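A minimal NumPy sketch (illustrative, not from the slides) of solving Ax = b, both with an explicit inverse and with the preferred solver:

```python
import numpy as np

A = np.array([[3., 2.],
              [1., 4.]])
b = np.array([6., 8.])

x_inv   = np.linalg.inv(A) @ b   # x = A^{-1} b (fine for tiny examples)
x_solve = np.linalg.solve(A, b)  # numerically preferable: no explicit inverse

assert np.allclose(x_inv, x_solve)
assert np.allclose(A @ x_solve, b)
```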
Invertibility
Matrix A cannot be inverted if
More rows than columns
More columns than rows
Redundant rows/columns (“linearly dependent”, “low rank”, i.e., not full rank)
The number 0 is an eigenvalue of A
A non-invertible matrix is called a singular matrix
An invertible matrix is called non-singular
Linear Dependence and Span
Linear combination of vectors {v^(1), ..., v^(n)}: Σ_i c_i v^(i)
The span of a set of vectors is the set of all points
obtainable by linear combination of the original vectors
Matrix-vector product Ax can be viewed as a linear combination of the column vectors of A: Ax = Σ_i x_i A_{:,i}
The span of columns of A is called column space or range of A
A set of vectors is linearly independent if no vector in the set is a linear combination of the other vectors; otherwise, the set is linearly dependent
Rank of a Matrix
Rank of a matrix A: Number of linearly independent columns (or rows) of A
Example: determining the rank of a specific matrix and explaining why (the example matrix is not preserved in this text; see the sketch below)
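A small NumPy sketch (an illustration I added, not the slide's original example) showing how a redundant column lowers the rank:

```python
import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., 9.],
              [7., 8., 15.]])    # third column = first column + second column

print(np.linalg.matrix_rank(A))  # 2: only two linearly independent columns
```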
Norms
Functions that measure how “large” a vector is
Similar to a distance between zero and the point represented by the vector
Formally, a norm is any function f that satisfies the following:
f(x) = 0 implies x = 0
f(x + y) ≤ f(x) + f(y) (triangle inequality)
f(αx) = |α| f(x) for all scalars α
Norms
L^p norm: ||x||_p = (Σ_i |x_i|^p)^{1/p}, for p ≥ 1
Most popular norm: L^2 norm, p = 2
||x||_2 = sqrt(Σ_i x_i^2): Euclidean distance from the origin
For matrices/tensors, the Frobenius norm does the same thing: ||A||_F = sqrt(Σ_{i,j} A_{i,j}^2)
L^1 norm, p = 1: ||x||_1 = Σ_i |x_i|
Called 'Manhattan' distance
Max norm, p = ∞: ||x||_∞ = max_i |x_i|
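A quick NumPy check of these norms (my illustration, not from the slides):

```python
import numpy as np

x = np.array([3., -4.])
print(np.linalg.norm(x, 1))       # L1 norm: |3| + |-4| = 7
print(np.linalg.norm(x, 2))       # L2 norm: sqrt(9 + 16) = 5
print(np.linalg.norm(x, np.inf))  # max norm: 4

A = np.array([[1., 2.], [3., 4.]])
print(np.linalg.norm(A, 'fro'))   # Frobenius norm: sqrt(1 + 4 + 9 + 16)
```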
Special Matrices and Vectors
Diagonal matrix
2 by 2 example: diag([a, b]) = [[a, 0], [0, b]]
We use diag(v) to denote the diagonal matrix where v is the vector containing the diagonal elements
The inverse of a diagonal matrix is computed easily: diag(v)^{-1} = diag([1/v_1, ..., 1/v_n])
Special Matrices and Vectors
Unit vector: a vector with unit norm, ||x||_2 = 1
Symmetric matrix: a matrix equal to its own transpose, A = A^T
Orthogonal matrix: a square matrix whose rows and columns are mutually orthonormal, so A^T A = A A^T = I and A^{-1} = A^T
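A short NumPy sketch (added illustration) of these special matrices:

```python
import numpy as np

# Diagonal matrix and its easy inverse
v = np.array([2., 4., 5.])
D = np.diag(v)
assert np.allclose(np.linalg.inv(D), np.diag(1.0 / v))

# Orthogonal matrix (a 2-D rotation): Q^T Q = I, so Q^{-1} = Q^T
theta = 0.3
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
assert np.allclose(Q.T @ Q, np.eye(2))
```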
Eigenvector and Eigenvalue
Eigenvector and eigenvalue of A: a nonzero vector v is an eigenvector of A if Av = λv; the scalar λ is the corresponding eigenvalue
Eigenvalue/eigenvector pairs are defined only for a square matrix A ∈ R^{n x n}
# of eigenvalues: n
There may be duplicates
Eigenvalues and eigenvectors can contain real or imaginary numbers
Intuition
A as a vector transformation: multiplying by A maps a vector x to a new vector x' = Ax
Example: A = [[2, 1], [1, 3]], x = [1, 0]^T  =>  x' = Ax = [2, 1]^T
Intuition
By definition, eigenvectors remain parallel to themselves ('fixed points' of the transformation's direction)
Example: for A = [[2, 1], [1, 3]], the eigenvector v1 = [0.52, 0.85]^T satisfies A v1 = λ1 v1 with λ1 = 3.62
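The same example, checked with NumPy (my addition):

```python
import numpy as np

A = np.array([[2., 1.],
              [1., 3.]])

eigvals, eigvecs = np.linalg.eig(A)
i = np.argmax(eigvals)               # index of the largest eigenvalue
lam, v = eigvals[i], eigvecs[:, i]

print(lam)                           # about 3.618
print(v)                             # about [0.53, 0.85] (up to sign)
assert np.allclose(A @ v, lam * v)   # A v = lambda v
```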
Eigendecomposition
Eigendecomposition of a matrix
Let A be a square (n x n) matrix with n linearly independent eigenvectors (i.e., diagonalizable)
Then A can be factorized as A = V diag(λ) V^{-1}
where V is an (n x n) matrix whose i-th column is the i-th eigenvector of A, and diag(λ) is the diagonal matrix of the corresponding eigenvalues
Eigendecomposition
Every real symmetric matrix has a real, orthogonal eigendecomposition: A = Q Λ Q^T, where Q is orthogonal and Λ is diagonal
A real orthogonal matrix is a square matrix with real entries whose columns and rows are orthogonal unit vectors (orthonormal vectors)
Q^T Q = Q Q^T = I
Q^T = Q^{-1}
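A minimal NumPy check (added) of the symmetric eigendecomposition A = Q Λ Q^T:

```python
import numpy as np

A = np.array([[2., 1.],
              [1., 3.]])         # real symmetric

lam, Q = np.linalg.eigh(A)       # eigh: for symmetric/Hermitian matrices
Lam = np.diag(lam)

assert np.allclose(Q.T @ Q, np.eye(2))    # Q is orthogonal
assert np.allclose(Q @ Lam @ Q.T, A)      # A = Q Λ Q^T
```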
Eigendecomposition
Interpreting matrix-vector multiplication Ax using the eigendecomposition: Ax = V diag(λ) V^{-1} x changes x to the eigenvector basis, scales each coordinate by its eigenvalue, and changes back
Eigendecomposition
Understanding the optimization of f(x) = x^T A x subject to ||x||_2 = 1 using eigenvalues and eigenvectors: for symmetric A, the maximum of f over unit vectors is the largest eigenvalue (attained at its eigenvector), and the minimum is the smallest eigenvalue
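A quick numeric check (my addition) that the constrained maximum of x^T A x equals the largest eigenvalue:

```python
import numpy as np

A = np.array([[2., 1.],
              [1., 3.]])                  # symmetric

lam, Q = np.linalg.eigh(A)                # eigenvalues in ascending order
v_max = Q[:, -1]                          # unit eigenvector of the largest eigenvalue

print(v_max @ A @ v_max)                  # about 3.618 = largest eigenvalue

# A random unit vector never exceeds the largest eigenvalue
rng = np.random.default_rng(0)
x = rng.normal(size=2)
x /= np.linalg.norm(x)
assert x @ A @ x <= lam[-1] + 1e-9
```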
Eigendecomposition
Positive definite matrix: a real symmetric matrix whose eigenvalues are all positive
Positive semidefinite matrix: all eigenvalues are ≥ 0
Negative definite matrix: all eigenvalues are negative
Negative semidefinite matrix: all eigenvalues are ≤ 0
For a positive semidefinite matrix A, x^T A x ≥ 0 for all x
For a positive definite matrix A, x^T A x > 0 for all x ≠ 0
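A small NumPy sketch (added) for checking definiteness through the eigenvalues of a symmetric matrix:

```python
import numpy as np

def is_positive_definite(A, tol=1e-10):
    """Check definiteness of a real symmetric matrix via its eigenvalues."""
    eigvals = np.linalg.eigvalsh(A)      # real eigenvalues of a symmetric matrix
    return np.all(eigvals > tol)

A = np.array([[2., 1.], [1., 3.]])       # eigenvalues ~1.38 and ~3.62: positive definite
B = np.array([[1., 2.], [2., 1.]])       # eigenvalues -1 and 3: indefinite

print(is_positive_definite(A))  # True
print(is_positive_definite(B))  # False
```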
Singular Value Decomposition (SVD)
Similar to eigendecomposition
More general; matrix need not be square
A = U Λ V^T
SVD - Definition
A [n x m] = U [n x r] Λ [r x r] (V [m x r])^T
A: n x m matrix (e.g., n documents, m terms)
U : n x r matrix (n documents, r concepts)
Left singular vectors
Λ: r x r diagonal matrix (strength of each 'concept'), where r is the rank of the matrix
V : m x r matrix (m terms, r concepts)
SVD - Definition
A [n x m] = U [n x r] x Λ [r x r] x V^T [r x m]
(diagram: the n x m matrix A drawn as the product of the three factor matrices)
SVD - Properties
THEOREM [Press+92]: it is always possible to decompose a matrix A into A = U Λ V^T, where
U, Λ, V: unique (*)
U, V: column orthonormal (i.e., columns are unit vectors, orthogonal to each other)
U^T U = I; V^T V = I (I: identity matrix)
Λ: diagonal entries (singular values) are positive and sorted in decreasing order
A ≈ U Λ V^T = Σ_i σ_i u_i v_i^T
Truncating the sum to the k largest singular values gives the best rank-k approximation of A in the Frobenius norm
(diagram: the n x m matrix A approximated by the product U Λ V^T)
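A NumPy sketch (my illustration) of the best rank-k approximation by a truncated SVD:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 5))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # keep the k largest singular values

# Error of the best rank-k approximation in the Frobenius norm:
# sqrt of the sum of the squared discarded singular values
err = np.linalg.norm(A - A_k, 'fro')
assert np.isclose(err, np.sqrt(np.sum(s[k:] ** 2)))
```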
SVD - Properties
SVD and eigendecomposition
The left-singular vectors of A are the eigenvectors of AAT
The right-singular vectors of A are the eigenvectors of ATA
The non-zero singular values of A are the square roots of the eigenvalues of A^T A (and of A A^T)
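A quick NumPy check (added) of the relation between singular values and the eigenvalues of A^T A:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 3))

s = np.linalg.svd(A, compute_uv=False)          # singular values, descending
eig_AtA = np.linalg.eigvalsh(A.T @ A)[::-1]     # eigenvalues of A^T A, descending

assert np.allclose(s, np.sqrt(eig_AtA))
```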
Moore-Penrose Pseudoinverse
Assume we want to solve A x = y for x
What if A is not invertible?
What if A is not square?
Still, we can find the 'best' x by using the pseudoinverse
For an (n x m) matrix A, the pseudoinverse A+ is an (m x n) matrix
Moore-Penrose Pseudoinverse
If the equation has:
Exactly one solution: the pseudoinverse acts the same as the inverse
No solution: it gives the x with the smallest error ||Ax − y||_2 (over-specified case)
Many solutions: it gives the solution x with the smallest norm ||x||_2 (under-specified case)
Pseudoinverse: Over-specified case
No solution: the pseudoinverse gives the x with the smallest error ||Ax − y||_2
[3 2]^T [x] = [1 2]^T (i.e., 3x = 1, 2x = 2) has no exact solution
x = ([3 2]^T)^+ [1 2]^T = ... = 7/13
This method is called 'least squares'
(figure: the line of reachable points (3x, 2x) and the desired point y = (1, 2))
Pseudoinverse: Under-specified case
Many solutions: the pseudoinverse gives the solution with the smallest norm ||x||_2
[1 2] [w z]^T = 4 (i.e., w + 2z = 4) has infinitely many solutions
(figure: the line of all solutions (w, z) and the shortest-length solution)
Computing the Pseudoinverse
The SVD allows the computation of the pseudoinverse: A^+ = V Λ^+ U^T, where Λ^+ is obtained by taking the reciprocal of each non-zero singular value and transposing the resulting matrix
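A NumPy sketch (added) computing the pseudoinverse via the SVD and applying it to the two slide examples:

```python
import numpy as np

def pinv_via_svd(A, tol=1e-12):
    """Moore-Penrose pseudoinverse A^+ = V Lambda^+ U^T computed from the SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_plus = np.where(s > tol, 1.0 / s, 0.0)      # invert only non-zero singular values
    return Vt.T @ np.diag(s_plus) @ U.T

# Over-specified: 3x = 1, 2x = 2 has no exact solution; least-squares answer is 7/13
A = np.array([[3.], [2.]])
y = np.array([1., 2.])
print(pinv_via_svd(A) @ y)                 # [0.538...] = 7/13

# Under-specified: w + 2z = 4 has many solutions; the minimum-norm one is chosen
B = np.array([[1., 2.]])
print(pinv_via_svd(B) @ np.array([4.]))    # [0.8, 1.6]

assert np.allclose(pinv_via_svd(A), np.linalg.pinv(A))
```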
Trace
Tr(A) = Σ_i A_{i,i}
Tr(ABC) = Tr(CAB) = Tr(BCA) (invariance under cyclic permutation)
Tr(A) = Tr(A^T)
||A||_F = sqrt(Tr(A A^T))
For a scalar a, Tr(a) = a
The trace of a matrix is also computed by the sum of all eigenvalues
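A quick NumPy verification (added) of the trace identities:

```python
import numpy as np

rng = np.random.default_rng(2)
A, B, C = (rng.normal(size=(3, 3)) for _ in range(3))

assert np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B))   # cyclic permutation
assert np.isclose(np.trace(A), np.trace(A.T))
assert np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.trace(A @ A.T)))
```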
What you need to know
Linear algebra is crucial for understanding the notation and mechanisms of many ML/deep learning methods
Important concepts
Matrix product, identity, invertibility, norm, eigendecomposition, singular value decomposition, pseudoinverse, and trace
Questions?