Probability and Statistics
Cookbook
Copyright © Matthias Vallentin, 2011 (vallentin@icir.org)
This cookbook integrates a variety of topics in probability theory and statistics. It is based on literature [1, 6, 3] and in-class material from courses of the statistics department at the University of California in Berkeley, but is also influenced by other sources [4, 5]. If you find errors or have suggestions for further topics, I would appreciate it if you send me an email. The most recent version of this document is available at http://matthias.vallentin.net/probability-and-statistics-cookbook/. To reproduce, please contact me.
Contents
1 Distribution Overview
  1.1 Discrete Distributions
  1.2 Continuous Distributions
2 Probability Theory
3 Random Variables
  3.1 Transformations
4 Expectation
5 Variance
6 Inequalities
7 Distribution Relationships
8 Probability and Moment Generating Functions
9 Multivariate Distributions
  9.1 Standard Bivariate Normal
  9.2 Bivariate Normal
  9.3 Multivariate Normal
10 Convergence
  10.1 Law of Large Numbers (LLN)
  10.2 Central Limit Theorem (CLT)
11 Statistical Inference
  11.1 Point Estimation
  11.2 Normal-Based Confidence Interval
  11.3 Empirical Distribution
  11.4 Statistical Functionals
12 Parametric Inference
  12.1 Method of Moments
  12.2 Maximum Likelihood
    12.2.1 Delta Method
  12.3 Multiparameter Models
    12.3.1 Multiparameter Delta Method
  12.4 Parametric Bootstrap
13 Hypothesis Testing
14 Bayesian Inference
  14.1 Credible Intervals
  14.2 Function of Parameters
  14.3 Priors
    14.3.1 Conjugate Priors
  14.4 Bayesian Testing
15 Exponential Family
16 Sampling Methods
  16.1 The Bootstrap
    16.1.1 Bootstrap Confidence Intervals
  16.2 Rejection Sampling
  16.3 Importance Sampling
17 Decision Theory
  17.1 Risk
  17.2 Admissibility
  17.3 Bayes Rule
  17.4 Minimax Rules
18 Linear Regression
  18.1 Simple Linear Regression
  18.2 Prediction
  18.3 Multiple Regression
  18.4 Model Selection
19 Non-parametric Function Estimation
  19.1 Density Estimation
    19.1.1 Histograms
    19.1.2 Kernel Density Estimator (KDE)
  19.2 Non-parametric Regression
  19.3 Smoothing Using Orthogonal Functions
20 Stochastic Processes
  20.1 Markov Chains
  20.2 Poisson Processes
21 Time Series
  21.1 Stationary Time Series
  21.2 Estimation of Correlation
  21.3 Non-Stationary Time Series
    21.3.1 Detrending
  21.4 ARIMA Models
    21.4.1 Causality and Invertibility
  21.5 Spectral Analysis
22 Math
  22.1 Gamma Function
  22.2 Beta Function
  22.3 Series
1 Distribution Overview

1.1 Discrete Distributions
For each distribution we list the notation¹, CDF F_X(x), PMF f_X(x), mean E[X], variance V[X], and MGF M_X(s).

Uniform Unif{a, ..., b}:
  F_X(x) = 0 for x < a;  (⌊x⌋ − a + 1)/(b − a + 1) for a ≤ x ≤ b;  1 for x > b
  f_X(x) = I(a ≤ x ≤ b)/(b − a + 1)
  E[X] = (a + b)/2,  V[X] = ((b − a + 1)² − 1)/12
  M_X(s) = (e^{as} − e^{(b+1)s}) / ((b − a + 1)(1 − e^s))

Bernoulli Bern(p):
  F_X(x) = 0 for x < 0;  1 − p for 0 ≤ x < 1;  1 for x ≥ 1
  f_X(x) = p^x (1 − p)^{1−x}
  E[X] = p,  V[X] = p(1 − p),  M_X(s) = 1 − p + pe^s

Binomial Bin(n, p):
  F_X(x) = I_{1−p}(n − x, x + 1)
  f_X(x) = (n choose x) p^x (1 − p)^{n−x}
  E[X] = np,  V[X] = np(1 − p),  M_X(s) = (1 − p + pe^s)^n

Multinomial Mult(n, p):
  f_X(x) = (n!/(x_1! ⋯ x_k!)) p_1^{x_1} ⋯ p_k^{x_k}, with Σ_{i=1}^k x_i = n
  E[X_i] = np_i,  V[X_i] = np_i(1 − p_i),  M_X(s) = (Σ_{i=1}^k p_i e^{s_i})^n

Hypergeometric Hyp(N, m, n):
  F_X(x) ≈ Φ((x − np)/√(np(1 − p))), with p = m/N
  f_X(x) = (m choose x)(N − m choose n − x)/(N choose n)
  E[X] = nm/N,  V[X] = nm(N − n)(N − m)/(N²(N − 1))

Negative Binomial NBin(r, p):
  F_X(x) = I_p(r, x + 1)
  f_X(x) = (x + r − 1 choose r − 1) p^r (1 − p)^x
  E[X] = r(1 − p)/p,  V[X] = r(1 − p)/p²,  M_X(s) = (p/(1 − (1 − p)e^s))^r

Geometric Geo(p):
  F_X(x) = 1 − (1 − p)^x, x ∈ ℕ⁺
  f_X(x) = p(1 − p)^{x−1}, x ∈ ℕ⁺
  E[X] = 1/p,  V[X] = (1 − p)/p²,  M_X(s) = pe^s/(1 − (1 − p)e^s)

Poisson Po(λ):
  F_X(x) = e^{−λ} Σ_{i=0}^{⌊x⌋} λ^i/i!
  f_X(x) = λ^x e^{−λ}/x!
  E[X] = λ,  V[X] = λ,  M_X(s) = e^{λ(e^s − 1)}
[Figure: PMFs of the discrete Uniform{a, ..., b}; Binomial with (n, p) = (40, 0.3), (30, 0.6), (25, 0.9); Geometric with p = 0.2, 0.5, 0.8; and Poisson with λ = 1, 4, 10.]
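The tabulated moments are easy to sanity-check numerically. Below is a minimal Python sketch (added for illustration; the original cookbook contains no code and merely assumes scipy is available) that compares the closed-form mean and variance of two table entries against scipy.stats:

```python
# Sanity-check tabulated means/variances against scipy.stats.
from scipy import stats

n, p = 40, 0.3
binom = stats.binom(n, p)
assert abs(binom.mean() - n * p) < 1e-12            # E[X] = np
assert abs(binom.var() - n * p * (1 - p)) < 1e-12   # V[X] = np(1-p)

geom = stats.geom(p)  # support {1, 2, ...}, matching the table
assert abs(geom.mean() - 1 / p) < 1e-9              # E[X] = 1/p
assert abs(geom.var() - (1 - p) / p**2) < 1e-9      # V[X] = (1-p)/p^2

print("Binomial:", binom.mean(), binom.var())
print("Geometric:", geom.mean(), geom.var())
```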
¹We use the notation γ(s, x) and Γ(x) to refer to the Gamma functions (see §22.1), and use B(x, y) and I_x to refer to the Beta functions (see §22.2).
1.2 Continuous Distributions
For each distribution we list the notation, CDF F_X(x), PDF f_X(x), mean E[X], variance V[X], and MGF M_X(s).

Uniform Unif(a, b):
  F_X(x) = 0 for x < a;  (x − a)/(b − a) for a < x < b;  1 for x > b
  f_X(x) = I(a < x < b)/(b − a)
  E[X] = (a + b)/2,  V[X] = (b − a)²/12,  M_X(s) = (e^{sb} − e^{sa})/(s(b − a))

Normal N(µ, σ²):
  Φ(x) = ∫_{−∞}^x φ(t) dt,  φ(x) = (1/(σ√(2π))) exp{−(x − µ)²/(2σ²)}
  E[X] = µ,  V[X] = σ²,  M_X(s) = exp{µs + σ²s²/2}

Log-Normal ln N(µ, σ²):
  F_X(x) = 1/2 + (1/2) erf[(ln x − µ)/√(2σ²)]
  f_X(x) = (1/(x√(2πσ²))) exp{−(ln x − µ)²/(2σ²)}
  E[X] = e^{µ + σ²/2},  V[X] = (e^{σ²} − 1) e^{2µ + σ²}

Multivariate Normal MVN(µ, Σ):
  f_X(x) = (2π)^{−k/2} |Σ|^{−1/2} e^{−(1/2)(x−µ)ᵀΣ^{−1}(x−µ)}
  E[X] = µ,  V[X] = Σ,  M_X(s) = exp{µᵀs + (1/2)sᵀΣs}

Student's t Student(ν):
  F_X(x) = I_x(ν/2, ν/2)
  f_X(x) = (Γ((ν+1)/2)/(√(νπ) Γ(ν/2))) (1 + x²/ν)^{−(ν+1)/2}
  E[X] = 0 (ν > 1),  V[X] = ν/(ν − 2) (ν > 2)

Chi-square χ²_k:
  F_X(x) = (1/Γ(k/2)) γ(k/2, x/2)
  f_X(x) = (1/(2^{k/2}Γ(k/2))) x^{k/2−1} e^{−x/2}
  E[X] = k,  V[X] = 2k,  M_X(s) = (1 − 2s)^{−k/2} (s < 1/2)

F F(d_1, d_2):
  F_X(x) = I_{d_1x/(d_1x + d_2)}(d_1/2, d_2/2)
  f_X(x) = √((d_1x)^{d_1} d_2^{d_2} / (d_1x + d_2)^{d_1+d_2}) / (x B(d_1/2, d_2/2))
  E[X] = d_2/(d_2 − 2) (d_2 > 2),  V[X] = 2d_2²(d_1 + d_2 − 2)/(d_1(d_2 − 2)²(d_2 − 4)) (d_2 > 4)

Exponential Exp(β):
  F_X(x) = 1 − e^{−x/β},  f_X(x) = (1/β) e^{−x/β}
  E[X] = β,  V[X] = β²,  M_X(s) = 1/(1 − βs) (s < 1/β)

Gamma Gamma(α, β):
  F_X(x) = γ(α, x/β)/Γ(α),  f_X(x) = (1/(Γ(α)β^α)) x^{α−1} e^{−x/β}
  E[X] = αβ,  V[X] = αβ²,  M_X(s) = (1/(1 − βs))^α (s < 1/β)

Inverse Gamma InvGamma(α, β):
  F_X(x) = Γ(α, β/x)/Γ(α),  f_X(x) = (β^α/Γ(α)) x^{−α−1} e^{−β/x}
  E[X] = β/(α − 1) (α > 1),  V[X] = β²/((α − 1)²(α − 2)) (α > 2)
  M_X(s) = (2(−βs)^{α/2}/Γ(α)) K_α(√(−4βs))

Dirichlet Dir(α):
  f_X(x) = (Γ(Σ_{i=1}^k α_i)/Π_{i=1}^k Γ(α_i)) Π_{i=1}^k x_i^{α_i−1}
  E[X_i] = α_i/Σ_{i=1}^k α_i,  V[X_i] = E[X_i](1 − E[X_i])/(Σ_{i=1}^k α_i + 1)

Beta Beta(α, β):
  F_X(x) = I_x(α, β),  f_X(x) = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1}(1 − x)^{β−1}
  E[X] = α/(α + β),  V[X] = αβ/((α + β)²(α + β + 1))
  M_X(s) = 1 + Σ_{k=1}^∞ (Π_{r=0}^{k−1} (α + r)/(α + β + r)) s^k/k!

Weibull Weibull(λ, k):
  F_X(x) = 1 − e^{−(x/λ)^k},  f_X(x) = (k/λ)(x/λ)^{k−1} e^{−(x/λ)^k}
  E[X] = λΓ(1 + 1/k),  V[X] = λ²Γ(1 + 2/k) − µ²
  M_X(s) = Σ_{n=0}^∞ (s^n λ^n/n!) Γ(1 + n/k)

Pareto Pareto(x_m, α):
  F_X(x) = 1 − (x_m/x)^α for x ≥ x_m,  f_X(x) = αx_m^α/x^{α+1} for x ≥ x_m
  E[X] = αx_m/(α − 1) (α > 1),  V[X] = x_m²α/((α − 1)²(α − 2)) (α > 2)
  M_X(s) = α(−x_m s)^α Γ(−α, −x_m s) (s < 0)
[Figure: PDFs of the continuous Uniform(a, b); Normal with (µ, σ²) = (0, 0.2), (0, 1), (0, 5), (−2, 0.5); Log-normal with (µ, σ²) = (0, 3), (2, 2), (0, 1), (0.5, 1), (0.25, 1), (0.125, 1); Student's t with ν = 1, 2, 5, ∞; χ² with k = 1, ..., 5; F with (d_1, d_2) = (1, 1), (2, 1), (5, 2), (100, 1), (100, 100); Exponential with β = 2, 1, 0.4; Gamma with (α, β) = (1, 2), (2, 2), (3, 2), (5, 1), (9, 0.5); Inverse Gamma with (α, β) = (1, 1), (2, 1), (3, 1), (3, 0.5); Beta with (α, β) = (0.5, 0.5), (5, 1), (1, 3), (2, 2), (2, 5); Weibull with λ = 1 and k = 0.5, 1, 1.5, 5; Pareto with x_m = 1 and α = 1, 2, 4.]
2 Probability Theory

Definitions
• Sample space Ω
• Outcome (point or element) ω ∈ Ω
• Event A ⊆ Ω
• σ-algebra A:
  1. ∅ ∈ A
  2. A_1, A_2, ... ∈ A ⟹ ⋃_{i=1}^∞ A_i ∈ A
  3. A ∈ A ⟹ ¬A ∈ A
• Probability distribution P:
  1. P[A] ≥ 0 for every A
  2. P[Ω] = 1
  3. P[⨆_{i=1}^∞ A_i] = Σ_{i=1}^∞ P[A_i] for pairwise disjoint A_i
• Probability space (Ω, A, P)
Properties
• P[Ω] = 1, P[∅] = 0
• B = Ω ∩ B = (A ∪ ¬A) ∩ B = (A ∩ B) ∪ (¬A ∩ B)
• P[¬A] = 1 − P[A]
• P[B] = P[A ∩ B] + P[¬A ∩ B]
• ¬(⋃_n A_n) = ⋂_n ¬A_n and ¬(⋂_n A_n) = ⋃_n ¬A_n (De Morgan)
• P[⋃_n A_n] = 1 − P[⋂_n ¬A_n]
• P[A ∪ B] = P[A] + P[B] − P[A ∩ B] ⟹ P[A ∪ B] ≤ P[A] + P[B]
• P[A ∪ B] = P[A ∩ ¬B] + P[¬A ∩ B] + P[A ∩ B]
• P[A ∩ ¬B] = P[A] − P[A ∩ B]

Continuity of Probabilities
• A_1 ⊂ A_2 ⊂ ... ⟹ lim_{n→∞} P[A_n] = P[A], where A = ⋃_{i=1}^∞ A_i
• A_1 ⊃ A_2 ⊃ ... ⟹ lim_{n→∞} P[A_n] = P[A], where A = ⋂_{i=1}^∞ A_i

Independence
A ⊥⊥ B ⟺ P[A ∩ B] = P[A] P[B]

Conditional Probability
P[A | B] = P[A ∩ B]/P[B]   (P[B] > 0)
Law of Total Probability
P[B] = Σ_{i=1}^n P[B | A_i] P[A_i], where Ω = ⨆_{i=1}^n A_i

Bayes' Theorem
P[A_i | B] = P[B | A_i] P[A_i] / Σ_{j=1}^n P[B | A_j] P[A_j], where Ω = ⨆_{i=1}^n A_i

Inclusion-Exclusion Principle
P[⋃_{i=1}^n A_i] = Σ_{r=1}^n (−1)^{r−1} Σ_{1 ≤ i_1 < ... < i_r ≤ n} P[⋂_{j=1}^r A_{i_j}]
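A quick numerical illustration (a sketch added here, not from the original text): estimating both sides of the inclusion-exclusion identity for three arbitrarily chosen events on a die roll by Monte Carlo.

```python
# Monte Carlo check of inclusion-exclusion for three events on a fair die.
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(1, 7, size=100_000)
A, B, C = (x <= 2), (x % 2 == 0), (x >= 5)  # arbitrary example events

lhs = np.mean(A | B | C)
rhs = (A.mean() + B.mean() + C.mean()
       - (A & B).mean() - (A & C).mean() - (B & C).mean()
       + (A & B & C).mean())
print(lhs, rhs)  # both close to P[A ∪ B ∪ C] = 5/6
```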
3 Random Variables

Random Variable (RV)
X: Ω → ℝ

Probability Mass Function (PMF)
f_X(x) = P[X = x] = P[{ω ∈ Ω : X(ω) = x}]

Probability Density Function (PDF)
P[a ≤ X ≤ b] = ∫_a^b f(x) dx

Cumulative Distribution Function (CDF)
F_X: ℝ → [0, 1],  F_X(x) = P[X ≤ x]
1. Nondecreasing: x_1 < x_2 ⟹ F(x_1) ≤ F(x_2)
2. Normalized: lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1
3. Right-continuous: lim_{y↓x} F(y) = F(x)

P[a ≤ Y ≤ b | X = x] = ∫_a^b f_{Y|X}(y | x) dy   (a ≤ b)
f_{Y|X}(y | x) = f(x, y)/f_X(x)

Independence
1. P[X ≤ x, Y ≤ y] = P[X ≤ x] P[Y ≤ y]
2. f_{X,Y}(x, y) = f_X(x) f_Y(y)
3.1 Transformations

Transformation function
Z = ϕ(X)

Discrete
f_Z(z) = P[ϕ(X) = z] = P[{x : ϕ(x) = z}] = P[X ∈ ϕ^{−1}(z)] = Σ_{x ∈ ϕ^{−1}(z)} f(x)

Continuous
F_Z(z) = P[ϕ(X) ≤ z] = ∫_{A_z} f(x) dx, with A_z = {x : ϕ(x) ≤ z}

Special case: if ϕ is strictly monotone,
f_Z(z) = f_X(ϕ^{−1}(z)) |dϕ^{−1}(z)/dz| = f_X(x) |dx/dz| = f_X(x)/|J|
The Rule of the Lazy Statistician
E[Z] = ∫ ϕ(x) dF_X(x)
E[I_A(X)] = ∫ I_A(x) dF_X(x) = ∫_A dF_X(x) = P[X ∈ A]

Convolution
• Z := X + Y:  f_Z(z) = ∫_{−∞}^∞ f_{X,Y}(x, z − x) dx;  if X, Y ≥ 0:  f_Z(z) = ∫_0^z f_{X,Y}(x, z − x) dx
• Z := |X − Y|:  f_Z(z) = 2 ∫_0^∞ f_{X,Y}(x, z + x) dx
• Z := X/Y:  f_Z(z) = ∫_{−∞}^∞ |x| f_{X,Y}(x, xz) dx;  if X ⊥⊥ Y:  f_Z(z) = ∫_{−∞}^∞ |x| f_X(x) f_Y(xz) dx
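As a sketch (added, not in the original), the convolution for Z = X + Y can be checked against a known closed form: for X, Y iid Exponential with scale β, Z ~ Gamma(2, β).

```python
# Empirical check of Z = X + Y for X, Y iid Exp(scale=beta):
# Z should be Gamma(shape=2, scale=beta).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
beta = 2.0
z = rng.exponential(beta, 50_000) + rng.exponential(beta, 50_000)

# Kolmogorov-Smirnov test against the analytic Gamma(2, beta) CDF.
ks = stats.kstest(z, stats.gamma(a=2, scale=beta).cdf)
print(ks.statistic, ks.pvalue)  # small statistic, typically large p-value
```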
4 Expectation

Definition and properties
• E[X] = µ_X = ∫ x dF_X(x) = Σ_x x f_X(x) if X is discrete;  ∫ x f_X(x) dx if X is continuous
• P[X = c] = 1 ⟹ E[c] = c
• E[cX] = c E[X]
• E[X + Y] = E[X] + E[Y]
• E[XY] = ∫∫ xy f_{X,Y}(x, y) dx dy
• In general E[ϕ(X)] ≠ ϕ(E[X]) (cf. Jensen inequality)
• P[X ≥ Y] = 1 ⟹ E[X] ≥ E[Y], and P[X = Y] = 1 ⟹ E[X] = E[Y]
• E[X] = Σ_{x=1}^∞ P[X ≥ x] for X taking values in ℕ

Sample mean
X̄_n = (1/n) Σ_{i=1}^n X_i

Conditional expectation
• E[Y | X = x] = ∫ y f(y | x) dy
• E[X] = E[E[X | Y]]
• E[ϕ(X, Y) | X = x] = ∫_{−∞}^∞ ϕ(x, y) f_{Y|X}(y | x) dy
• E[ϕ(Y, Z) | X = x] = ∫∫ ϕ(y, z) f_{(Y,Z)|X}(y, z | x) dy dz
• E[Y + Z | X] = E[Y | X] + E[Z | X]
• E[ϕ(X)Y | X] = ϕ(X) E[Y | X]
• E[Y | X] = c ⟹ Cov[X, Y] = 0
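A small simulation sketch (added here, not in the source) of the tower property E[X] = E[E[X | Y]], using an arbitrary Poisson-Gamma hierarchy:

```python
# Tower property: E[X] = E[E[X | Y]].
# Y ~ Gamma(2, 1) and X | Y ~ Poisson(Y), so E[X | Y] = Y and E[X] = E[Y] = 2.
import numpy as np

rng = np.random.default_rng(2)
y = rng.gamma(shape=2.0, scale=1.0, size=200_000)
x = rng.poisson(y)

print(x.mean(), y.mean())  # both approximately 2
```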
5 Variance

Definition and properties
• V[X] = σ²_X = E[(X − E[X])²] = E[X²] − E[X]²
• V[Σ_{i=1}^n X_i] = Σ_{i=1}^n V[X_i] + 2 Σ_{i≠j} Cov[X_i, X_j]
• V[Σ_{i=1}^n X_i] = Σ_{i=1}^n V[X_i] if the X_i are independent

Standard deviation
sd[X] = √V[X] = σ_X

Covariance
• Cov[X, Y] = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X] E[Y]
• Cov[X, a] = 0
• Cov[X, X] = V[X]
• Cov[X, Y] = Cov[Y, X]
• Cov[aX, bY] = ab Cov[X, Y]
• Cov[X + a, Y + b] = Cov[X, Y]
• Cov[Σ_{i=1}^n X_i, Σ_{j=1}^m Y_j] = Σ_{i=1}^n Σ_{j=1}^m Cov[X_i, Y_j]

Correlation
ρ[X, Y] = Cov[X, Y] / √(V[X] V[Y])

Independence
X ⊥⊥ Y ⟹ ρ[X, Y] = 0 ⟺ Cov[X, Y] = 0 ⟺ E[XY] = E[X] E[Y]

Sample variance
S² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄_n)²

Conditional variance
• V[Y | X] = E[(Y − E[Y | X])² | X] = E[Y² | X] − E[Y | X]²
• V[Y] = E[V[Y | X]] + V[E[Y | X]]
6 Inequalities

Cauchy-Schwarz
E[XY]² ≤ E[X²] E[Y²]

Markov
P[ϕ(X) ≥ t] ≤ E[ϕ(X)]/t   (ϕ ≥ 0, t > 0)

Chebyshev
P[|X − E[X]| ≥ t] ≤ V[X]/t²

Chernoff
P[X ≥ (1 + δ)µ] ≤ (e^δ/(1 + δ)^{1+δ})^µ   (δ > −1)

Jensen
E[ϕ(X)] ≥ ϕ(E[X])   (ϕ convex)
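A numeric sketch (added; assumes only numpy) comparing the Markov and Chebyshev bounds with the exact tail of an Exponential(1) variable, for which E[X] = V[X] = 1:

```python
# Exact tail of X ~ Exp(1) versus Markov and Chebyshev upper bounds.
import numpy as np

t = 3.0
exact = np.exp(-t)             # P[X >= t] for Exp(1)
markov = 1.0 / t               # E[X] / t
chebyshev = 1.0 / (t - 1)**2   # P[|X - 1| >= t - 1] <= V[X] / (t-1)^2

print(f"exact={exact:.4f}  markov={markov:.4f}  chebyshev={chebyshev:.4f}")
# exact 0.0498 <= chebyshev 0.25 and markov 0.3333: the bounds hold but are loose
```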
7 Distribution Relationships

Binomial
• X_i ~ Bern(p) ⟹ Σ_{i=1}^n X_i ~ Bin(n, p)
• X ~ Bin(n, p), Y ~ Bin(m, p), X ⊥⊥ Y ⟹ X + Y ~ Bin(n + m, p)
• lim_{n→∞} Bin(n, p) = Po(np)   (n large, p small)
• lim_{n→∞} Bin(n, p) = N(np, np(1 − p))   (n large, p far from 0 and 1)

Negative Binomial
• X ~ NBin(1, p) = Geo(p)
• X ~ NBin(r, p) = Σ_{i=1}^r Geo(p)
• X_i ~ NBin(r_i, p) ⟹ Σ X_i ~ NBin(Σ r_i, p)
• X ~ NBin(r, p), Y ~ Bin(s + r, p) ⟹ P[X ≤ s] = P[Y ≥ r]

Poisson
• X_i ~ Po(λ_i) ∧ X_i ⊥⊥ X_j ⟹ Σ_{i=1}^n X_i ~ Po(Σ_{i=1}^n λ_i)
• X_i ~ Po(λ_i) ∧ X_i ⊥⊥ X_j ⟹ X_i | Σ_{j=1}^n X_j ~ Bin(Σ_{j=1}^n X_j, λ_i/Σ_{j=1}^n λ_j)

Exponential
• X_i ~ Exp(β) ∧ X_i ⊥⊥ X_j ⟹ Σ_{i=1}^n X_i ~ Gamma(n, β)
• Memoryless property: P[X > x + y | X > y] = P[X > x]

Normal
• X ~ N(µ, σ²) ⟹ (X − µ)/σ ~ N(0, 1)
• X ~ N(µ, σ²) ∧ Z = aX + b ⟹ Z ~ N(aµ + b, a²σ²)
• X ~ N(µ_1, σ_1²) ∧ Y ~ N(µ_2, σ_2²), X ⊥⊥ Y ⟹ X + Y ~ N(µ_1 + µ_2, σ_1² + σ_2²)
• X_i ~ N(µ_i, σ_i²) independent ⟹ Σ_i X_i ~ N(Σ_i µ_i, Σ_i σ_i²)
• P[a < X ≤ b] = Φ((b − µ)/σ) − Φ((a − µ)/σ)
• Φ(−x) = 1 − Φ(x),  φ′(x) = −xφ(x),  φ″(x) = (x² − 1)φ(x)
• Upper quantile of N(0, 1): z_α = Φ^{−1}(1 − α)

Gamma
• X ~ Gamma(α, β) ⟺ X/β ~ Gamma(α, 1)
• Gamma(α, β) ~ Σ_{i=1}^α Exp(β)   (α integer)
• X_i ~ Gamma(α_i, β) ∧ X_i ⊥⊥ X_j ⟹ Σ_i X_i ~ Gamma(Σ_i α_i, β)
• Γ(α)/λ^α = ∫_0^∞ x^{α−1} e^{−λx} dx

Beta
• f_X(x) = (1/B(α, β)) x^{α−1}(1 − x)^{β−1} = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1}(1 − x)^{β−1}
• E[X^k] = B(α + k, β)/B(α, β) = ((α + k − 1)/(α + β + k − 1)) E[X^{k−1}]
• Beta(1, 1) ~ Unif(0, 1)
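These relationships are convenient to verify by simulation. Below is a sketch (added, not from the source; parameter values are arbitrary) checking that a sum of iid exponentials is Gamma-distributed:

```python
# Check: the sum of n iid Exp(beta) variables is Gamma(n, beta).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, beta = 5, 2.0
sums = rng.exponential(beta, size=(100_000, n)).sum(axis=1)

ks = stats.kstest(sums, stats.gamma(a=n, scale=beta).cdf)
print(ks.pvalue)  # typically large: consistent with Gamma(5, 2)
```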
8 Probability and Moment Generating Functions
• G_X(t) = E[t^X],  |t| < 1
• M_X(t) = G_X(e^t) = E[e^{Xt}] = E[Σ_{i=0}^∞ (Xt)^i/i!] = Σ_{i=0}^∞ (E[X^i]/i!) t^i
• P[X = 0] = G_X(0)
• P[X = 1] = G′_X(0)
• P[X = i] = G_X^{(i)}(0)/i!
• E[X] = G′_X(1⁻)
• E[X^k] = M_X^{(k)}(0)
• E[X!/(X − k)!] = G_X^{(k)}(1⁻)
• V[X] = G″_X(1⁻) + G′_X(1⁻) − (G′_X(1⁻))²
• G_X(t) = G_Y(t) ⟹ X ≐ Y (equality in distribution)
9 Multivariate Distributions

9.1 Standard Bivariate Normal
Let X, Z ~ N(0, 1) with X ⊥⊥ Z, and set Y = ρX + √(1 − ρ²) Z.
Joint density
f(x, y) = (1/(2π√(1 − ρ²))) exp{−(x² + y² − 2ρxy)/(2(1 − ρ²))}
Conditionals
(Y | X = x) ~ N(ρx, 1 − ρ²) and (X | Y = y) ~ N(ρy, 1 − ρ²)
Independence
X ⊥⊥ Y ⟺ ρ = 0

9.2 Bivariate Normal
Let X ~ N(µ_x, σ_x²) and Y ~ N(µ_y, σ_y²).
f(x, y) = (1/(2πσ_xσ_y√(1 − ρ²))) exp{−z/(2(1 − ρ²))}
z = ((x − µ_x)/σ_x)² + ((y − µ_y)/σ_y)² − 2ρ((x − µ_x)/σ_x)((y − µ_y)/σ_y)
Conditional mean and variance
E[X | Y] = E[X] + ρ(σ_X/σ_Y)(Y − E[Y])
V[X | Y] = σ_X²(1 − ρ²)

9.3 Multivariate Normal
Covariance matrix Σ (precision matrix Σ^{−1}):
Σ = [ V[X_1] ⋯ Cov[X_1, X_k] ; ⋮ ⋱ ⋮ ; Cov[X_k, X_1] ⋯ V[X_k] ]
If X ~ N(µ, Σ),
f_X(x) = (2π)^{−k/2} |Σ|^{−1/2} exp{−(1/2)(x − µ)ᵀΣ^{−1}(x − µ)}
Properties
• Z ~ N(0, I) ∧ X = µ + Σ^{1/2} Z ⟹ X ~ N(µ, Σ)
• X ~ N(µ, Σ) ⟹ Σ^{−1/2}(X − µ) ~ N(0, I)
• X ~ N(µ, Σ) ⟹ AX ~ N(Aµ, AΣAᵀ)
• X ~ N(µ, Σ), a a vector of length k ⟹ aᵀX ~ N(aᵀµ, aᵀΣa)
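The first property above is the standard sampling recipe. A brief sketch (added; not from the source) using a Cholesky factor as Σ^{1/2}:

```python
# Sample X ~ N(mu, Sigma) via X = mu + L Z, with L the Cholesky factor of Sigma.
import numpy as np

rng = np.random.default_rng(4)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

L = np.linalg.cholesky(Sigma)           # L @ L.T == Sigma
Z = rng.standard_normal((100_000, 2))   # iid N(0, 1) components
X = mu + Z @ L.T

print(X.mean(axis=0))  # ~ mu
print(np.cov(X.T))     # ~ Sigma
```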
10 Convergence
Let {X_1, X_2, ...} be a sequence of rv's and let X be another rv. Let F_n denote the CDF of X_n and let F denote the CDF of X.

Types of convergence
1. In distribution (weakly, in law): X_n →D X
   lim_{n→∞} F_n(t) = F(t) for all t at which F is continuous
2. In probability: X_n →P X
   (∀ε > 0) lim_{n→∞} P[|X_n − X| > ε] = 0
3. Almost surely (strongly): X_n →as X
   P[lim_{n→∞} X_n = X] = P[{ω ∈ Ω : lim_{n→∞} X_n(ω) = X(ω)}] = 1
4. In quadratic mean (L²): X_n →qm X
   lim_{n→∞} E[(X_n − X)²] = 0

Relationships
• X_n →qm X ⟹ X_n →P X ⟹ X_n →D X
• X_n →as X ⟹ X_n →P X
• X_n →D X ∧ (∃c ∈ ℝ) P[X = c] = 1 ⟹ X_n →P X
• X_n →P X ∧ Y_n →P Y ⟹ X_n + Y_n →P X + Y
• X_n →qm X ∧ Y_n →qm Y ⟹ X_n + Y_n →qm X + Y
• X_n →P X ∧ Y_n →P Y ⟹ X_n Y_n →P XY
• X_n →P X ⟹ ϕ(X_n) →P ϕ(X)
• X_n →D X ⟹ ϕ(X_n) →D ϕ(X)
• X_n →qm b ⟺ lim_{n→∞} E[X_n] = b ∧ lim_{n→∞} V[X_n] = 0
• X_1, ..., X_n iid ∧ E[X] = µ ∧ V[X] < ∞ ⟹ X̄_n →qm µ

Slutsky's Theorem
• X_n →D X and Y_n →P c ⟹ X_n + Y_n →D X + c
• X_n →D X and Y_n →P c ⟹ X_n Y_n →D cX
• In general: X_n →D X and Y_n →D Y does not imply X_n + Y_n →D X + Y
10.1 Law of Large Numbers (LLN)
Let {X_1, ..., X_n} be a sequence of iid rv's with E[X_1] = µ and V[X_1] < ∞.

Weak (WLLN)
X̄_n →P µ as n → ∞

Strong (SLLN)
X̄_n →as µ as n → ∞
10.2 Central Limit Theorem (CLT)
Let {X_1, ..., X_n} be a sequence of iid rv's with E[X_1] = µ and V[X_1] = σ².

Z_n := (X̄_n − µ)/√V[X̄_n] = √n (X̄_n − µ)/σ →D Z, where Z ~ N(0, 1)
lim_{n→∞} P[Z_n ≤ z] = Φ(z),  z ∈ ℝ

CLT notations
Z_n ≈ N(0, 1)
X̄_n ≈ N(µ, σ²/n)
X̄_n − µ ≈ N(0, σ²/n)
√n (X̄_n − µ) ≈ N(0, σ²)
√n (X̄_n − µ)/σ ≈ N(0, 1)

Continuity correction
P[X̄_n ≤ x] ≈ Φ((x + 1/2 − µ)/(σ/√n))
P[X̄_n ≥ x] ≈ 1 − Φ((x − 1/2 − µ)/(σ/√n))

Delta method
Y_n ≈ N(µ, σ²/n) ⟹ ϕ(Y_n) ≈ N(ϕ(µ), (ϕ′(µ))² σ²/n)
11 Statistical Inference
Let X_1, ..., X_n iid ~ F if not otherwise noted.

11.1 Point Estimation
• Point estimator θ̂_n of θ is a rv: θ̂_n = g(X_1, ..., X_n)
• bias(θ̂_n) = E[θ̂_n] − θ
• Consistency: θ̂_n →P θ
• Sampling distribution: F(θ̂_n)
• Standard error: se(θ̂_n) = √V[θ̂_n]
• Mean squared error: mse = E[(θ̂_n − θ)²] = bias(θ̂_n)² + V[θ̂_n]
• lim_{n→∞} bias(θ̂_n) = 0 ∧ lim_{n→∞} se(θ̂_n) = 0 ⟹ θ̂_n is consistent
• Asymptotic normality: (θ̂_n − θ)/se →D N(0, 1)
• Slutsky's Theorem often lets us replace se(θ̂_n) by some (weakly) consistent estimator σ̂_n.
11.2 Normal-Based Confidence Interval
Suppose θ̂_n ≈ N(θ, ŝe²). Let z_{α/2} = Φ^{−1}(1 − α/2), i.e., P[Z > z_{α/2}] = α/2 and P[−z_{α/2} < Z < z_{α/2}] = 1 − α, where Z ~ N(0, 1). Then
C_n = θ̂_n ± z_{α/2} ŝe
11.3 Empirical Distribution

Empirical Distribution Function (ECDF)
F̂_n(x) = (1/n) Σ_{i=1}^n I(X_i ≤ x), where I(X_i ≤ x) = 1 if X_i ≤ x and 0 otherwise

Properties (for any fixed x)
• E[F̂_n(x)] = F(x)
• V[F̂_n(x)] = F(x)(1 − F(x))/n
• mse = F(x)(1 − F(x))/n → 0
• F̂_n(x) →P F(x)

Dvoretzky-Kiefer-Wolfowitz (DKW) inequality (X_1, ..., X_n ~ F)
P[sup_x |F(x) − F̂_n(x)| > ε] ≤ 2e^{−2nε²}

Nonparametric 1 − α confidence band for F
L(x) = max{F̂_n(x) − ε_n, 0}
U(x) = min{F̂_n(x) + ε_n, 1}
ε_n = √((1/(2n)) log(2/α))
P[L(x) ≤ F(x) ≤ U(x) ∀x] ≥ 1 − α
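A compact sketch (added) computing the ECDF together with its DKW band:

```python
# ECDF with a nonparametric 1-alpha DKW confidence band.
import numpy as np

def ecdf_band(x, alpha=0.05):
    x = np.sort(x)
    n = len(x)
    F_hat = np.arange(1, n + 1) / n
    eps = np.sqrt(np.log(2 / alpha) / (2 * n))   # DKW half-width
    lower = np.clip(F_hat - eps, 0, 1)
    upper = np.clip(F_hat + eps, 0, 1)
    return x, F_hat, lower, upper

rng = np.random.default_rng(6)
x, F_hat, lo, hi = ecdf_band(rng.normal(size=200))
print(F_hat[:5], lo[:5], hi[:5])
```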
11.4 Statistical Functionals
• Statistical functional: T(F)
• Plug-in estimator of θ = T(F): θ̂_n = T(F̂_n)
• Linear functional: T(F) = ∫ ϕ(x) dF_X(x)
• Plug-in estimator for linear functional: T(F̂_n) = ∫ ϕ(x) dF̂_n(x) = (1/n) Σ_{i=1}^n ϕ(X_i)
• Often: T(F̂_n) ≈ N(T(F), ŝe²) ⟹ T(F̂_n) ± z_{α/2} ŝe
• pth quantile: F^{−1}(p) = inf{x : F(x) ≥ p}
• µ̂ = X̄_n
• σ̂² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄_n)²
• κ̂ = ((1/n) Σ_{i=1}^n (X_i − µ̂)³)/σ̂³
• ρ̂ = Σ_{i=1}^n (X_i − X̄_n)(Y_i − Ȳ_n) / (√(Σ_{i=1}^n (X_i − X̄_n)²) √(Σ_{i=1}^n (Y_i − Ȳ_n)²))
12 Parametric Inference
Let F = {f(x; θ) : θ ∈ Θ} be a parametric model with parameter space Θ ⊂ ℝ^k and parameter θ = (θ_1, ..., θ_k).

12.1 Method of Moments
jth moment
α_j(θ) = E[X^j] = ∫ x^j dF_X(x)
jth sample moment
α̂_j = (1/n) Σ_{i=1}^n X_i^j
Method of moments estimator (MoM): θ̂_n solves the system
α_1(θ̂_n) = α̂_1, ..., α_k(θ̂_n) = α̂_k

Properties of the MoM estimator
• θ̂_n exists with probability tending to 1
• Consistency: θ̂_n →P θ
• Asymptotic normality: √n (θ̂ − θ) →D N(0, Σ), where Σ = g E[YYᵀ] gᵀ, Y = (X, X², ..., X^k)ᵀ, g = (g_1, ..., g_k), and g_j = ∂α_j^{−1}(θ)/∂θ
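As a sketch (added, not from the source), MoM for the Gamma(α, β) scale family: matching the mean αβ and variance αβ² gives α̂ = X̄²/σ̂² and β̂ = σ̂²/X̄, with σ̂² the sample variance.

```python
# Method of moments for Gamma(alpha, beta) (scale parametrization):
# alpha_hat = mean^2 / var, beta_hat = var / mean.
import numpy as np

rng = np.random.default_rng(7)
x = rng.gamma(shape=3.0, scale=2.0, size=50_000)

m, v = x.mean(), x.var()
alpha_hat, beta_hat = m**2 / v, v / m
print(alpha_hat, beta_hat)  # approximately 3 and 2
```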
12.2 Maximum Likelihood
Likelihood: L_n: Θ → [0, ∞)
L_n(θ) = Π_{i=1}^n f(X_i; θ)
Log-likelihood
ℓ_n(θ) = log L_n(θ) = Σ_{i=1}^n log f(X_i; θ)
Maximum likelihood estimator (MLE)
L_n(θ̂_n) = sup_θ L_n(θ)
Score function
s(X; θ) = ∂ log f(X; θ)/∂θ
Fisher information
I(θ) = V_θ[s(X; θ)],  I_n(θ) = n I(θ)
Fisher information (exponential family)
I(θ) = E_θ[−∂s(X; θ)/∂θ]
Observed Fisher information
I_n^{obs}(θ) = −(∂²/∂θ²) Σ_{i=1}^n log f(X_i; θ)

Properties of the MLE
• Consistency: θ̂_n →P θ
• Equivariance: θ̂_n is the MLE of θ ⟹ ϕ(θ̂_n) is the MLE of ϕ(θ)
• Asymptotic normality:
  1. se ≈ √(1/I_n(θ)) and (θ̂_n − θ)/se →D N(0, 1)
  2. ŝe ≈ √(1/I_n(θ̂_n)) and (θ̂_n − θ)/ŝe →D N(0, 1)
• Asymptotic optimality (or efficiency), i.e., smallest variance for large samples. If θ̃_n is any other estimator, the asymptotic relative efficiency is are(θ̃_n, θ̂_n) = V[θ̂_n]/V[θ̃_n] ≤ 1
• Approximately the Bayes estimator
12.2.1 Delta Method
If τ = ϕ(θ), where ϕ is differentiable and ϕ′(θ) ≠ 0:
(τ̂_n − τ)/ŝe(τ̂) →D N(0, 1)
where τ̂ = ϕ(θ̂) is the MLE of τ and
ŝe(τ̂) = |ϕ′(θ̂)| ŝe(θ̂_n)
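A brief sketch (added; numeric details are an arbitrary choice) of an MLE computed by direct optimization, with the Fisher-information standard error. For an Exponential sample with rate λ, I(λ) = 1/λ², so ŝe = λ̂/√n.

```python
# Numeric MLE for the rate lambda of an Exponential sample.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(8)
x = rng.exponential(scale=1 / 3.0, size=1_000)  # true rate lambda = 3

# Negative log-likelihood: -(n log(lam) - lam * sum(x))
nll = lambda lam: -(len(x) * np.log(lam) - lam * x.sum())
res = minimize_scalar(nll, bounds=(1e-6, 100), method="bounded")
lam_hat = res.x

se_hat = lam_hat / np.sqrt(len(x))  # sqrt(1 / I_n(lam_hat))
print(lam_hat, se_hat)              # ~3 and ~0.095
```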
12.3 Multiparameter Models
Let θ = (θ_1, ..., θ_k) and let θ̂ = (θ̂_1, ..., θ̂_k) be the MLE.
H_{jj} = ∂²ℓ_n/∂θ_j²,  H_{jk} = ∂²ℓ_n/∂θ_j ∂θ_k
Fisher information matrix
I_n(θ) = −[ E_θ[H_{11}] ⋯ E_θ[H_{1k}] ; ⋮ ⋱ ⋮ ; E_θ[H_{k1}] ⋯ E_θ[H_{kk}] ]
Under appropriate regularity conditions
(θ̂ − θ) ≈ N(0, J_n), with J_n(θ) = I_n^{−1}(θ). Further, if θ̂_j is the jth component of θ̂, then
(θ̂_j − θ_j)/ŝe_j →D N(0, 1)
where ŝe_j² = J_n(j, j) and Cov[θ̂_j, θ̂_k] = J_n(j, k).

12.3.1 Multiparameter Delta Method
Let τ = ϕ(θ_1, ..., θ_k) and let the gradient of ϕ be
∇ϕ = (∂ϕ/∂θ_1, ..., ∂ϕ/∂θ_k)ᵀ
Suppose ∇ϕ|_{θ=θ̂} ≠ 0 and τ̂ = ϕ(θ̂). Then
(τ̂ − τ)/ŝe(τ̂) →D N(0, 1)
where ŝe(τ̂) = √((∇̂ϕ)ᵀ Ĵ_n ∇̂ϕ), Ĵ_n = J_n(θ̂), and ∇̂ϕ = ∇ϕ|_{θ=θ̂}.
12.4 Parametric Bootstrap
Sample from f(x; θ̂_n) instead of from F̂_n, where θ̂_n could be the MLE or the method of moments estimator.
13 Hypothesis Testing
H_0: θ ∈ Θ_0 versus H_1: θ ∈ Θ_1

Definitions
• Null hypothesis H_0
• Alternative hypothesis H_1
• Simple hypothesis: θ = θ_0
• Composite hypothesis: θ > θ_0 or θ < θ_0
• Two-sided test: H_0: θ = θ_0 versus H_1: θ ≠ θ_0
• One-sided test: H_0: θ ≤ θ_0 versus H_1: θ > θ_0
• Critical value c
• Test statistic T
• Rejection region R = {x : T(x) > c}
• Power function β(θ) = P_θ[X ∈ R]
• Power of a test: 1 − P[Type II error] = 1 − β = inf_{θ∈Θ_1} β(θ)
• Test size: α = P[Type I error] = sup_{θ∈Θ_0} β(θ)

            Retain H_0           Reject H_0
H_0 true    correct              Type I error (α)
H_1 true    Type II error (β)    correct (power)

p-value
• p-value = sup_{θ∈Θ_0} P_θ[T(X) ≥ T(x)] = inf{α : T(x) ∈ R_α}
• p-value = sup_{θ∈Θ_0} P_θ[T(X⋆) ≥ T(X)] = inf{α : T(X) ∈ R_α}, where T(X⋆) ~ F_θ, so the middle expression equals 1 − F_θ(T(X))

p-value       evidence
< 0.01        very strong evidence against H_0
0.01–0.05     strong evidence against H_0
0.05–0.1      weak evidence against H_0
> 0.1         little or no evidence against H_0

Wald test
• Two-sided test
• Reject H_0 when |W| > z_{α/2}, where W = (θ̂ − θ_0)/ŝe
• P[|W| > z_{α/2}] → α
• p-value = P_{θ_0}[|W| > |w|] ≈ P[|Z| > |w|] = 2Φ(−|w|)

Likelihood ratio test (LRT)
• T(X) = sup_{θ∈Θ} L_n(θ) / sup_{θ∈Θ_0} L_n(θ) = L_n(θ̂_n)/L_n(θ̂_{n,0}), where θ̂_{n,0} is the MLE restricted to Θ_0
• λ(X) = 2 log T(X) →D χ²_{r−q}, where Σ_{i=1}^k Z_i² ~ χ²_k and Z_1, ..., Z_k iid ~ N(0, 1)
• p-value = P_{θ_0}[λ(X) > λ(x)] ≈ P[χ²_{r−q} > λ(x)]

Multinomial LRT
• MLE: p̂_n = (X_1/n, ..., X_k/n)
• T(X) = L_n(p̂_n)/L_n(p_0) = Π_{j=1}^k (p̂_j/p_{0j})^{X_j}
• λ(X) = 2 Σ_{j=1}^k X_j log(p̂_j/p_{0j}) →D χ²_{k−1}
• The approximate size-α LRT rejects H_0 when λ(X) ≥ χ²_{k−1,α}

Pearson chi-square test
• T = Σ_{j=1}^k (X_j − E[X_j])²/E[X_j], where E[X_j] = n p_{0j} under H_0
• T →D χ²_{k−1}
• p-value = P[χ²_{k−1} > T(x)]
• Converges to χ²_{k−1} faster than the LRT, hence preferable for small n

Independence testing
• I rows, J columns, X multinomial sample of size n = I·J
• Unconstrained MLEs: p̂_{ij} = X_{ij}/n
• MLEs under H_0: p̂_{0ij} = p̂_{i·} p̂_{·j} = (X_{i·}/n)(X_{·j}/n)
• LRT: λ = 2 Σ_{i=1}^I Σ_{j=1}^J X_{ij} log(n X_{ij}/(X_{i·} X_{·j}))
• Pearson: T = Σ_{i=1}^I Σ_{j=1}^J (X_{ij} − E[X_{ij}])²/E[X_{ij}]
• Both LRT and Pearson statistics →D χ²_ν, where ν = (I − 1)(J − 1)
14 Bayesian Inference

Bayes' Theorem
f(θ | x^n) = f(x^n | θ) f(θ)/f(x^n) = f(x^n | θ) f(θ)/∫ f(x^n | θ) f(θ) dθ ∝ L_n(θ) f(θ)

Definitions
• X^n = (X_1, ..., X_n), x^n = (x_1, ..., x_n)
• Prior density f(θ)
• Likelihood f(x^n | θ): joint density of the data. In particular, X^n iid ⟹ f(x^n | θ) = Π_{i=1}^n f(x_i | θ) = L_n(θ)
• Posterior density f(θ | x^n)
• Normalizing constant c_n = f(x^n) = ∫ f(x^n | θ) f(θ) dθ
• Kernel: part of a density that depends on θ
• Posterior mean θ̄_n = ∫ θ f(θ | x^n) dθ = ∫ θ L_n(θ) f(θ) dθ / ∫ L_n(θ) f(θ) dθ
14.1 Credible Intervals
Posterior interval
P[θ ∈ (a, b) | x^n] = ∫_a^b f(θ | x^n) dθ = 1 − α
Equal-tail credible interval
∫_{−∞}^a f(θ | x^n) dθ = ∫_b^∞ f(θ | x^n) dθ = α/2
Highest posterior density (HPD) region R_n:
1. P[θ ∈ R_n | x^n] = 1 − α
2. R_n = {θ : f(θ | x^n) > k} for some k
If f(θ | x^n) is unimodal, R_n is an interval.
14.2 Function of Parameters
Let τ = ϕ(θ) and A = {θ : ϕ(θ) ≤ τ}.
Posterior CDF for τ
H(τ | x^n) = P[ϕ(θ) ≤ τ | x^n] = ∫_A f(θ | x^n) dθ
Posterior density
h(τ | x^n) = H′(τ | x^n)
Bayesian delta method
τ | X^n ≈ N(ϕ(θ̂), (ŝe ϕ′(θ̂))²)
14.3 Priors

Choice
• Subjective Bayesianism
• Objective Bayesianism
• Robust Bayesianism

Types
• Flat: f(θ) ∝ constant
• Proper: ∫_{−∞}^∞ f(θ) dθ = 1
• Improper: ∫_{−∞}^∞ f(θ) dθ = ∞
• Jeffreys' prior (transformation-invariant): f(θ) ∝ √I(θ); multiparameter case: f(θ) ∝ √det(I(θ))
• Conjugate: f(θ) and f(θ | x^n) belong to the same parametric family
14.3.1 Conjugate Priors

Discrete likelihood
Likelihood | Conjugate prior | Posterior hyperparameters
Bern(p) | Beta(α, β) | α + Σ_{i=1}^n x_i,  β + n − Σ_{i=1}^n x_i
Bin(p) | Beta(α, β) | α + Σ_{i=1}^n x_i,  β + Σ_{i=1}^n N_i − Σ_{i=1}^n x_i
NBin(p) | Beta(α, β) | α + rn,  β + Σ_{i=1}^n x_i
Po(λ) | Gamma(α, β) | α + Σ_{i=1}^n x_i,  β + n
Multinomial(p) | Dir(α) | α + Σ_{i=1}^n x^{(i)}
Geo(p) | Beta(α, β) | α + n,  β + Σ_{i=1}^n x_i

Continuous likelihood (subscript c denotes a constant)
Likelihood | Conjugate prior | Posterior hyperparameters
Unif(0, θ) | Pareto(x_m, k) | max{x_{(n)}, x_m},  k + n
Exp(λ) | Gamma(α, β) | α + n,  β + Σ_{i=1}^n x_i
N(µ, σ_c²) | N(µ_0, σ_0²) | (µ_0/σ_0² + Σ_{i=1}^n x_i/σ_c²)/(1/σ_0² + n/σ_c²),  (1/σ_0² + n/σ_c²)^{−1}
N(µ_c, σ²) | Scaled Inverse Chi-square(ν, σ_0²) | ν + n,  (νσ_0² + Σ_{i=1}^n (x_i − µ_c)²)/(ν + n)
N(µ, σ²) | Normal-scaled Inverse Gamma(λ, ν, α, β) | (νλ + nx̄)/(ν + n),  ν + n,  α + n/2,  β + (1/2) Σ_{i=1}^n (x_i − x̄)² + nν(x̄ − λ)²/(2(ν + n))
MVN(µ, Σ_c) | MVN(µ_0, Σ_0) | (Σ_0^{−1} + nΣ_c^{−1})^{−1}(Σ_0^{−1}µ_0 + nΣ_c^{−1}x̄),  (Σ_0^{−1} + nΣ_c^{−1})^{−1}
MVN(µ_c, Σ) | Inverse-Wishart(κ, Ψ) | n + κ,  Ψ + Σ_{i=1}^n (x_i − µ_c)(x_i − µ_c)ᵀ
Pareto(x_{mc}, k) | Gamma(α, β) | α + n,  β + Σ_{i=1}^n log(x_i/x_{mc})
Pareto(x_m, k_c) | Pareto(x_0, k_0) | x_0,  k_0 − kn, where k_0 > kn
Gamma(α_c, β) | Gamma(α_0, β_0) | α_0 + nα_c,  β_0 + Σ_{i=1}^n x_i
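The first table row is the classic Beta-Bernoulli update; a minimal sketch (added; prior hyperparameters below are arbitrary):

```python
# Conjugate update for Bernoulli likelihood with a Beta(alpha, beta) prior:
# posterior is Beta(alpha + sum(x), beta + n - sum(x)).
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
x = rng.binomial(1, 0.7, size=100)   # Bernoulli(0.7) data
alpha, beta = 2.0, 2.0               # prior hyperparameters

post = stats.beta(alpha + x.sum(), beta + len(x) - x.sum())
print(post.mean(), post.interval(0.95))  # posterior mean, 95% equal-tail interval
```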
14.4 Bayesian Testing
If H_0: θ ∈ Θ_0:
Prior probability P[H_0] = ∫_{Θ_0} f(θ) dθ
Posterior probability P[H_0 | x^n] = ∫_{Θ_0} f(θ | x^n) dθ

Let H_0, ..., H_{K−1} be K hypotheses and suppose θ ~ f(θ | H_k). Then
P[H_k | x^n] = f(x^n | H_k) P[H_k] / Σ_{k} f(x^n | H_k) P[H_k]

Marginal likelihood
f(x^n | H_i) = ∫_Θ f(x^n | θ, H_i) f(θ | H_i) dθ

Posterior odds (of H_i relative to H_j)
P[H_i | x^n]/P[H_j | x^n] = (f(x^n | H_i)/f(x^n | H_j)) × (P[H_i]/P[H_j])
where the first factor is the Bayes factor BF_ij and the second is the prior odds.

Bayes factor
log_10 BF_10    BF_10      evidence
0–0.5           1–3.2      Weak
0.5–1           3.2–10     Moderate
1–2             10–100     Strong
> 2             > 100      Decisive

p* = ((p/(1 − p)) BF_10) / (1 + (p/(1 − p)) BF_10), where p = P[H_1] and p* = P[H_1 | x^n]
15 Exponential Family
Scalar parameter
f_X(x | θ) = h(x) exp{η(θ)T(x) − A(θ)} = h(x) g(θ) exp{η(θ)T(x)}
Vector parameter
f_X(x | θ) = h(x) exp{Σ_{i=1}^s η_i(θ)T_i(x) − A(θ)} = h(x) exp{η(θ)·T(x) − A(θ)} = h(x) g(θ) exp{η(θ)·T(x)}
Natural form
f_X(x | η) = h(x) exp{η·T(x) − A(η)} = h(x) g(η) exp{η·T(x)} = h(x) g(η) exp{ηᵀT(x)}
16 Sampling Methods

16.1 The Bootstrap
Let T_n = g(X_1, ..., X_n) be a statistic.
1. Estimate V_F[T_n] with V_{F̂_n}[T_n].
2. Approximate V_{F̂_n}[T_n] using simulation:
(a) Repeat the following B times to get T*_{n,1}, ..., T*_{n,B}, an iid sample from the sampling distribution implied by F̂_n:
  i. Sample uniformly X*_1, ..., X*_n ~ F̂_n.
  ii. Compute T*_n = g(X*_1, ..., X*_n).
(b) Then
v_boot = V̂_{F̂_n} = (1/B) Σ_{b=1}^B (T*_{n,b} − (1/B) Σ_{r=1}^B T*_{n,r})²
16.1.1 Bootstrap Confidence Intervals
Normal-based interval
T_n ± z_{α/2} ŝe_boot
Pivotal interval
1. Location parameter θ = T(F)
2. Pivot R_n = θ̂_n − θ
3. Let H(r) = P[R_n ≤ r] be the CDF of R_n
4. Let R*_{n,b} = θ̂*_{n,b} − θ̂_n. Approximate H using the bootstrap:
   Ĥ(r) = (1/B) Σ_{b=1}^B I(R*_{n,b} ≤ r)
5. θ*_β = β sample quantile of (θ̂*_{n,1}, ..., θ̂*_{n,B})
6. r*_β = β sample quantile of (R*_{n,1}, ..., R*_{n,B}), i.e., r*_β = θ*_β − θ̂_n
7. Approximate 1 − α confidence interval C_n = (â, b̂), where
   â = θ̂_n − Ĥ^{−1}(1 − α/2) = θ̂_n − r*_{1−α/2} = 2θ̂_n − θ*_{1−α/2}
   b̂ = θ̂_n − Ĥ^{−1}(α/2) = θ̂_n − r*_{α/2} = 2θ̂_n − θ*_{α/2}
Percentile interval
C_n = (θ*_{α/2}, θ*_{1−α/2})
16.2 Rejection Sampling
Setup
• We can easily sample from g(θ)
• We want to sample from h(θ), but it is difficult
• We know h(θ) up to a proportionality constant: h(θ) = k(θ)/∫ k(θ) dθ
• Envelope condition: we can find M > 0 such that k(θ) ≤ M g(θ) ∀θ

Algorithm
1. Draw θ_cand ~ g(θ)
2. Generate u ~ Unif(0, 1)
3. Accept θ_cand if u ≤ k(θ_cand)/(M g(θ_cand))
4. Repeat until B values of θ_cand have been accepted

Example
• We can easily sample from the prior, g(θ) = f(θ)
• Target is the posterior, h(θ) ∝ k(θ) = f(x^n | θ) f(θ)
• Envelope condition: f(x^n | θ) ≤ f(x^n | θ̂_n) = L_n(θ̂_n) ≡ M
• Algorithm: 1. Draw θ_cand ~ f(θ);  2. Generate u ~ Unif(0, 1);  3. Accept θ_cand if u ≤ L_n(θ_cand)/L_n(θ̂_n)
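A sketch (added; the Bernoulli model and uniform prior are illustrative choices) of this prior-as-proposal scheme, done in log space for numerical stability:

```python
# Rejection sampling from a posterior using the prior as the proposal:
# Bernoulli(theta) likelihood, Unif(0,1) prior, M = L_n(theta_mle).
import numpy as np

rng = np.random.default_rng(11)
x = rng.binomial(1, 0.3, size=50)
s, n = x.sum(), len(x)

def log_lik(theta):
    return s * np.log(theta) + (n - s) * np.log(1 - theta)

theta_mle = s / n
accepted = []
while len(accepted) < 2_000:
    cand = rng.uniform()  # draw from the prior
    # accept if u <= L(cand) / L(mle), i.e. log u <= log L(cand) - log L(mle)
    if np.log(rng.uniform()) <= log_lik(cand) - log_lik(theta_mle):
        accepted.append(cand)

print(np.mean(accepted))  # ~ posterior mean (s + 1) / (n + 2)
```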
16.3 Importance Sampling
Sample from an importance function g rather than the target density h. Algorithm to obtain an approximation of E[q(θ) | x^n]:
1. Sample from the prior: θ_1, ..., θ_B iid ~ f(θ)
2. Compute the normalized weights w_i = L_n(θ_i)/Σ_{i=1}^B L_n(θ_i) for i = 1, ..., B
3. E[q(θ) | x^n] ≈ Σ_{i=1}^B q(θ_i) w_i
17 Decision Theory

Definitions
• Unknown quantity affecting our decision: θ ∈ Θ
• Decision rule: synonymous with an estimator θ̂
• Action a ∈ A: possible value of the decision rule. In the estimation context, the action is just an estimate of θ, θ̂(x).
• Loss function L: consequences of taking action a when the true state is θ, or the discrepancy between θ and θ̂; L: Θ × A → [−k, ∞).

Loss functions
• Squared error loss: L(θ, a) = (θ − a)²
• Linear loss: L(θ, a) = K_1(θ − a) if a − θ < 0;  K_2(a − θ) if a − θ ≥ 0
• Absolute error loss: L(θ, a) = |θ − a| (linear loss with K_1 = K_2)
• L_p loss: L(θ, a) = |θ − a|^p
• Zero-one loss: L(θ, a) = 0 if a = θ;  1 if a ≠ θ

17.1 Risk
Posterior risk
r(θ̂ | x) = ∫ L(θ, θ̂(x)) f(θ | x) dθ = E_{θ|X}[L(θ, θ̂(x))]
(Frequentist) risk
R(θ, θ̂) = ∫ L(θ, θ̂(x)) f(x | θ) dx = E_{X|θ}[L(θ, θ̂(X))]
Bayes risk
r(f, θ̂) = ∫∫ L(θ, θ̂(x)) f(x, θ) dx dθ = E_{θ,X}[L(θ, θ̂(X))]
r(f, θ̂) = E_θ[E_{X|θ}[L(θ, θ̂(X))]] = E_θ[R(θ, θ̂)]
r(f, θ̂) = E_X[E_{θ|X}[L(θ, θ̂(X))]] = E_X[r(θ̂ | X)]

17.2 Admissibility
• θ̂′ dominates θ̂ if
  ∀θ: R(θ, θ̂′) ≤ R(θ, θ̂)
  ∃θ: R(θ, θ̂′) < R(θ, θ̂)
• θ̂ is inadmissible if there is at least one other estimator θ̂′ that dominates it. Otherwise it is called admissible.

17.3 Bayes Rule
Bayes rule (or Bayes estimator)
• r(f, θ̂) = inf_{θ̃} r(f, θ̃)
• If θ̂(x) minimizes the posterior risk r(θ̂ | x) for every x, then r(f, θ̂) = ∫ r(θ̂ | x) f(x) dx
Theorems
• Squared error loss: posterior mean
• Absolute error loss: posterior median
• Zero-one loss: posterior mode

17.4 Minimax Rules
Maximum risk
R̄(θ̂) = sup_θ R(θ, θ̂),  R̄(a) = sup_θ R(θ, a)
Minimax rule
sup_θ R(θ, θ̂) = inf_{θ̃} R̄(θ̃) = inf_{θ̃} sup_θ R(θ, θ̃)
θ̂ = Bayes rule ∧ ∃c: R(θ, θ̂) = c ⟹ θ̂ is minimax
Least favorable prior
θ̂^f = Bayes rule ∧ R(θ, θ̂^f) ≤ r(f, θ̂^f) ∀θ ⟹ θ̂^f is minimax
18 Linear Regression

Definitions
• Response variable Y
• Covariate X (aka predictor variable or feature)

18.1 Simple Linear Regression
Model
Y_i = β_0 + β_1 X_i + ε_i,  E[ε_i | X_i] = 0,  V[ε_i | X_i] = σ²
Fitted line
r̂(x) = β̂_0 + β̂_1 x
Predicted (fitted) values
Ŷ_i = r̂(X_i)
Residuals
ε̂_i = Y_i − Ŷ_i = Y_i − (β̂_0 + β̂_1 X_i)
Residual sum of squares (rss)
rss(β̂_0, β̂_1) = Σ_{i=1}^n ε̂_i²
Least squares estimates
β̂ = (β̂_0, β̂_1)ᵀ minimizes rss(β̂_0, β̂_1):
β̂_0 = Ȳ_n − β̂_1 X̄_n
β̂_1 = Σ_{i=1}^n (X_i − X̄_n)(Y_i − Ȳ_n)/Σ_{i=1}^n (X_i − X̄_n)² = (Σ_{i=1}^n X_i Y_i − nX̄Ȳ)/(Σ_{i=1}^n X_i² − nX̄²)
E[β̂ | X^n] = (β_0, β_1)ᵀ
V[β̂ | X^n] = (σ²/(n s_X²)) [ n^{−1} Σ_{i=1}^n X_i²   −X̄_n ;  −X̄_n   1 ]
ŝe(β̂_0) = (σ̂/(s_X √n)) √(Σ_{i=1}^n X_i²/n)
ŝe(β̂_1) = σ̂/(s_X √n)
where s_X² = n^{−1} Σ_{i=1}^n (X_i − X̄_n)² and σ̂² = (1/(n − 2)) Σ_{i=1}^n ε̂_i² (unbiased estimate).

Further properties:
• Consistency: β̂_0 →P β_0 and β̂_1 →P β_1
• Asymptotic normality: (β̂_0 − β_0)/ŝe(β̂_0) →D N(0, 1) and (β̂_1 − β_1)/ŝe(β̂_1) →D N(0, 1)
• Approximate 1 − α confidence intervals for β_0 and β_1:
  β̂_0 ± z_{α/2} ŝe(β̂_0) and β̂_1 ± z_{α/2} ŝe(β̂_1)
• Wald test for H_0: β_1 = 0 vs. H_1: β_1 ≠ 0: reject H_0 if |W| > z_{α/2}, where W = β̂_1/ŝe(β̂_1).
R²
R² = Σ_{i=1}^n (Ŷ_i − Ȳ)²/Σ_{i=1}^n (Y_i − Ȳ)² = 1 − Σ_{i=1}^n ε̂_i²/Σ_{i=1}^n (Y_i − Ȳ)² = 1 − rss/tss

Likelihood
L = Π_{i=1}^n f(X_i, Y_i) = Π_{i=1}^n f_X(X_i) × Π_{i=1}^n f_{Y|X}(Y_i | X_i) = L_1 × L_2
L_1 = Π_{i=1}^n f_X(X_i)
L_2 = Π_{i=1}^n f_{Y|X}(Y_i | X_i) ∝ σ^{−n} exp{−(1/(2σ²)) Σ_i (Y_i − (β_0 + β_1 X_i))²}
Under the assumption of Normality, the least squares estimator is also the MLE, with
σ̂² = (1/n) Σ_{i=1}^n ε̂_i²
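A self-contained sketch (added; the simulated data are arbitrary) of the closed-form estimates and the standard error of the slope:

```python
# Simple linear regression: closed-form least squares and se(beta1_hat).
import numpy as np

rng = np.random.default_rng(12)
n = 200
X = rng.uniform(0, 10, n)
Y = 1.0 + 2.5 * X + rng.normal(0, 1.5, n)  # true beta0=1, beta1=2.5

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean())**2)
b0 = Y.mean() - b1 * X.mean()

resid = Y - (b0 + b1 * X)
sigma2_hat = resid @ resid / (n - 2)        # unbiased estimate of sigma^2
s2_X = np.mean((X - X.mean())**2)
se_b1 = np.sqrt(sigma2_hat) / (np.sqrt(s2_X) * np.sqrt(n))

print(b0, b1, se_b1)  # ~1, ~2.5, small
```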
18.2 Prediction
Observe the covariate value X = x_* and predict the outcome Y_*:
Ŷ_* = β̂_0 + β̂_1 x_*
V[Ŷ_*] = V[β̂_0] + x_*² V[β̂_1] + 2x_* Cov[β̂_0, β̂_1]
Prediction interval
ξ̂_n² = σ̂² (Σ_{i=1}^n (X_i − x_*)²/(n Σ_i (X_i − X̄)²) + 1)
Ŷ_* ± z_{α/2} ξ̂_n
18.3 Multiple Regression
Y = Xβ + ε, where
X = [ X_{11} ⋯ X_{1k} ; ⋮ ⋱ ⋮ ; X_{n1} ⋯ X_{nk} ],  β = (β_1, ..., β_k)ᵀ,  ε = (ε_1, ..., ε_n)ᵀ
Likelihood
L(µ, Σ) = (2πσ²)^{−n/2} exp{−(1/(2σ²)) rss}
rss = (y − Xβ)ᵀ(y − Xβ) = ‖Y − Xβ‖² = Σ_{i=1}^N (Y_i − x_iᵀβ)²
If the (k × k) matrix XᵀX is invertible,
β̂ = (XᵀX)^{−1} XᵀY
V[β̂ | X^n] = σ² (XᵀX)^{−1}
β̂ ≈ N(β, σ² (XᵀX)^{−1})
Estimated regression function
r̂(x) = Σ_{j=1}^k β̂_j x_j
Unbiased estimate for σ²
σ̂² = (1/(n − k)) Σ_{i=1}^n ε̂_i²,  ε̂ = Xβ̂ − Y
MLE
µ̂ = X̄,  σ̂²_mle = ((n − k)/n) σ̂²
1 − α confidence interval
β̂_j ± z_{α/2} ŝe(β̂_j)
18.4 Model Selection
Consider predicting a new observation Y_* for covariates X_* and let S ⊂ J denote a subset of the covariates in the model, where |S| = k and |J| = n.

Issues
• Underfitting: too few covariates yields high bias
• Overfitting: too many covariates yields high variance

Procedure
1. Assign a score to each model
2. Search through all models to find the one with the highest score

Hypothesis testing
H_0: β_j = 0 vs. H_1: β_j ≠ 0  ∀j ∈ J
Mean squared prediction error (mspe)
mspe = E[(Ŷ(S) − Y_*)²]
Prediction risk
R(S) = Σ_{i=1}^n mspe_i = Σ_{i=1}^n E[(Ŷ_i(S) − Y*_i)²]
Training error
R̂_tr(S) = Σ_{i=1}^n (Ŷ_i(S) − Y_i)²
R²
R²(S) = 1 − rss(S)/tss = 1 − R̂_tr(S)/tss
The training error is a downward-biased estimate of the prediction risk:
E[R̂_tr(S)] < R(S)
bias(R̂_tr(S)) = E[R̂_tr(S)] − R(S) = −2 Σ_{i=1}^n Cov[Ŷ_i, Y_i]
Adjusted R²
R̄²(S) = 1 − ((n − 1)/(n − k)) rss/tss
Mallow's C_p statistic
R̂(S) = R̂_tr(S) + 2kσ̂² = lack of fit + complexity penalty
Akaike Information Criterion (AIC)
AIC(S) = ℓ_n(β̂_S, σ̂²_S) − k
Bayesian Information Criterion (BIC)
BIC(S) = ℓ_n(β̂_S, σ̂²_S) − (k/2) log n
Validation and training
R̂_V(S) = Σ_{i=1}^m (Ŷ*_i(S) − Y*_i)²,  m = |{validation data}|, often n/4 or n/2
Leave-one-out cross-validation
R̂_CV(S) = Σ_{i=1}^n (Y_i − Ŷ_{(i)})² = Σ_{i=1}^n ((Y_i − Ŷ_i(S))/(1 − U_{ii}(S)))²
U(S) = X_S(X_SᵀX_S)^{−1}X_Sᵀ (the "hat matrix")
19 Non-parametric Function Estimation

19.1 Density Estimation
Estimate f(x), where P[X ∈ A] = ∫_A f(x) dx.
Integrated square error (ise)
L(f, f̂_n) = ∫ (f(x) − f̂_n(x))² dx = J(h) + ∫ f²(x) dx
Frequentist risk
R(f, f̂_n) = E[L(f, f̂_n)] = ∫ b²(x) dx + ∫ v(x) dx
b(x) = E[f̂_n(x)] − f(x),  v(x) = V[f̂_n(x)]

19.1.1 Histograms
Definitions
• Number of bins m
• Binwidth h = 1/m
• Bin B_j has ν_j observations
• Define p̂_j = ν_j/n and p_j = ∫_{B_j} f(u) du
Histogram estimator
f̂_n(x) = Σ_{j=1}^m (p̂_j/h) I(x ∈ B_j)
E[f̂_n(x)] = p_j/h
V[f̂_n(x)] = p_j(1 − p_j)/(nh²)
R(f̂_n, f) ≈ (h²/12) ∫ (f′(u))² du + 1/(nh)
h* = (1/n^{1/3}) (6/∫ (f′(u))² du)^{1/3}
R*(f̂_n, f) ≈ C/n^{2/3},  C = (3/4)^{2/3} (∫ (f′(u))² du)^{1/3}
Cross-validation estimate of E[J(h)]
Ĵ_CV(h) = ∫ f̂_n²(x) dx − (2/n) Σ_{i=1}^n f̂_{(−i)}(X_i) = 2/((n − 1)h) − ((n + 1)/((n − 1)h)) Σ_{j=1}^m p̂_j²
19.1.2 Kernel Density Estimator (KDE)
Kernel K
• K(x) ≥ 0
• ∫ K(x) dx = 1
• ∫ x K(x) dx = 0
• ∫ x² K(x) dx ≡ σ²_K > 0
KDE
f̂_n(x) = (1/n) Σ_{i=1}^n (1/h) K((x − X_i)/h)
R(f, f̂_n) ≈ (1/4)(hσ_K)⁴ ∫ (f″(x))² dx + (1/(nh)) ∫ K²(x) dx
h* = c_1^{−2/5} c_2^{1/5} c_3^{−1/5} n^{−1/5},  c_1 = σ²_K,  c_2 = ∫ K²(x) dx,  c_3 = ∫ (f″(x))² dx
R*(f, f̂_n) = c_4/n^{4/5},  c_4 = (5/4) (σ²_K)^{2/5} (∫ K²(x) dx)^{4/5} (∫ (f″)² dx)^{1/5}, where the kernel-dependent factor (σ²_K)^{2/5}(∫ K²(x) dx)^{4/5} is denoted C(K)
Epanechnikov kernel
K(x) = (3/(4√5))(1 − x²/5) for |x| < √5,  0 otherwise
Cross-validation estimate of E[J(h)]
Ĵ_CV(h) = ∫ f̂_n²(x) dx − (2/n) Σ_{i=1}^n f̂_{(−i)}(X_i) ≈ (1/(hn²)) Σ_{i=1}^n Σ_{j=1}^n K*((X_i − X_j)/h) + (2/(nh)) K(0)
K*(x) = K^{(2)}(x) − 2K(x),  K^{(2)}(x) = ∫ K(x − y) K(y) dy
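A direct implementation sketch (added; the Gaussian kernel and bandwidth below are arbitrary choices) of the KDE formula:

```python
# Gaussian KDE implementing f_hat(x) = (1/n) sum_i K((x - X_i)/h) / h.
import numpy as np

def kde(x_grid, data, h):
    # rows: grid points, cols: data points
    u = (x_grid[:, None] - data[None, :]) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)  # Gaussian kernel
    return K.mean(axis=1) / h

rng = np.random.default_rng(13)
data = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 0.5, 500)])
grid = np.linspace(-6, 6, 200)
f_hat = kde(grid, data, h=0.3)
print(f_hat.sum() * (grid[1] - grid[0]))  # ~1: the estimate integrates to one
```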
19.2 Non-parametric Regression
Estimate r(x), where r(x) = E[Y | X = x]. Consider pairs of points (x_1, Y_1), ..., (x_n, Y_n) related by
Y_i = r(x_i) + ε_i,  E[ε_i] = 0,  V[ε_i] = σ²

k-nearest Neighbor Estimator
r̂(x) = (1/k) Σ_{i: x_i ∈ N_k(x)} Y_i, where N_k(x) = {k values of x_1, ..., x_n closest to x}

Nadaraya-Watson Kernel Estimator
r̂(x) = Σ_{i=1}^n w_i(x) Y_i
w_i(x) = K((x − x_i)/h)/Σ_{j=1}^n K((x − x_j)/h) ∈ [0, 1]
R(r̂_n, r) ≈ (h⁴/4) (∫ x² K(x) dx)² ∫ (r″(x) + 2r′(x) f′(x)/f(x))² dx + (σ² ∫ K²(x) dx)/(nh) ∫ dx/f(x)
h* ≈ c_1/n^{1/5},  R*(r̂_n, r) ≈ c_2/n^{4/5}
Cross-validation estimate of E[J(h)]
Ĵ_CV(h) = Σ_{i=1}^n (Y_i − r̂_{(−i)}(x_i))² = Σ_{i=1}^n ((Y_i − r̂(x_i))/(1 − K(0)/Σ_{j=1}^n K((x_i − x_j)/h)))²
19.3 Smoothing Using Orthogonal Functions
Approximation
r(x) = Σ_{j=1}^∞ β_j φ_j(x) ≈ Σ_{j=1}^J β_j φ_j(x)
Multivariate regression
Y = Φβ + η, where η_i = ε_i and
Φ = [ φ_0(x_1) ⋯ φ_J(x_1) ; ⋮ ⋱ ⋮ ; φ_0(x_n) ⋯ φ_J(x_n) ]
Least squares estimator
β̂ = (ΦᵀΦ)^{−1} ΦᵀY ≈ (1/n) ΦᵀY   (for equally spaced observations only)
Cross-validation estimate
R̂_CV(J) = Σ_{i=1}^n (Y_i − Σ_{j=1}^J φ_j(x_i) β̂_{j,(−i)})²
20 Stochastic Processes
Stochastic Process
{X_t : t ∈ T},  T = {0, ±1, ...} = ℤ (discrete) or [0, ∞) (continuous)
• Notations: X_t, X(t)
• State space X
• Index set T

20.1 Markov Chains
Markov chain
P[X_n = x | X_0, ..., X_{n−1}] = P[X_n = x | X_{n−1}]  ∀n ∈ T, x ∈ X
Transition probabilities
p_{ij} ≡ P[X_{n+1} = j | X_n = i]
p_{ij}(n) ≡ P[X_{m+n} = j | X_m = i]   (n-step)
Transition matrix P (n-step: P_n)
• (i, j) element is p_{ij}
• p_{ij} ≥ 0
• Σ_j p_{ij} = 1 (each row sums to one)
Chapman-Kolmogorov
p_{ij}(m + n) = Σ_k p_{ik}(m) p_{kj}(n)
P_{m+n} = P_m P_n
P_n = P × ⋯ × P = P^n
Marginal probability
µ_n = (µ_n(1), ..., µ_n(N)), where µ_n(i) = P[X_n = i]
µ_0: initial distribution
µ_n = µ_0 P^n
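A tiny sketch (added; the two-state chain below is an arbitrary example) of evolving the marginal via µ_n = µ_0 P^n:

```python
# Evolve the marginal distribution of a two-state Markov chain: mu_n = mu_0 P^n.
import numpy as np

P = np.array([[0.9, 0.1],
              [0.4, 0.6]])  # rows sum to 1
mu = np.array([1.0, 0.0])   # start in state 0

for _ in range(50):
    mu = mu @ P
print(mu)  # approaches the stationary distribution (0.8, 0.2)
```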
20.2 Poisson Processes
Poisson process
• {X_t : t ∈ [0, ∞)} = number of events up to and including time t
• X_0 = 0
• Independent increments: ∀t_0 < ⋯ < t_n: X_{t_1} − X_{t_0} ⊥⊥ ⋯ ⊥⊥ X_{t_n} − X_{t_{n−1}}
• Intensity function λ(t)
  – P[X_{t+h} − X_t = 1] = λ(t)h + o(h)
  – P[X_{t+h} − X_t ≥ 2] = o(h)
• X_{s+t} − X_s ~ Po(m(s + t) − m(s)), where m(t) = ∫_0^t λ(s) ds
Homogeneous Poisson process
λ(t) ≡ λ ⟹ X_t ~ Po(λt),  λ > 0
Waiting times
W_t := time at which X_t occurs
W_t ~ Gamma(t, 1/λ)
Interarrival times
S_t = W_{t+1} − W_t
S_t ~ Exp(1/λ)
[Figure: timeline marking the waiting times W_{t−1}, W_t and the interarrival time S_t between them.]
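A simulation sketch (added; rate and horizon are arbitrary) of a homogeneous Poisson process built from iid exponential interarrival times:

```python
# Homogeneous Poisson process via iid Exp(1/lambda) interarrival times.
import numpy as np

rng = np.random.default_rng(14)
lam, horizon = 2.0, 1000.0
arrivals = np.cumsum(rng.exponential(1 / lam, size=int(3 * lam * horizon)))
arrivals = arrivals[arrivals <= horizon]

print(len(arrivals) / horizon)  # ~ lambda events per unit time
```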
21 Time Series
Mean function
µ_{xt} = E[x_t] = ∫_{−∞}^∞ x f_t(x) dx
Autocovariance function
γ_x(s, t) = E[(x_s − µ_s)(x_t − µ_t)] = E[x_s x_t] − µ_s µ_t
γ_x(t, t) = E[(x_t − µ_t)²] = V[x_t]
Autocorrelation function (ACF)
ρ(s, t) = Cov[x_s, x_t]/√(V[x_s] V[x_t]) = γ(s, t)/√(γ(s, s) γ(t, t))
Cross-covariance function (CCV)
γ_{xy}(s, t) = E[(x_s − µ_{xs})(y_t − µ_{yt})]
Cross-correlation function (CCF)
ρ_{xy}(s, t) = γ_{xy}(s, t)/√(γ_x(s, s) γ_y(t, t))
Backshift operator
B^k(x_t) = x_{t−k}
Difference operator
∇^d = (1 − B)^d