Probability and Statistics
Cookbook
Copyright © Matthias Vallentin, 2011 (vallentin@icir.org)
This cookbook integrates a variety of topics in probability theory and statistics. It is based on literature [1, 6, 3] and in-class material from courses of the statistics department at the University of California in Berkeley, but is also influenced by other sources [4, 5]. If you find errors or have suggestions for further topics, I would appreciate it if you send me an email. The most recent version of this document is available at http://matthias.vallentin.net/probability-and-statistics-cookbook/. To reproduce, please contact me.
Contents
1 Distribution Overview
  1.1 Discrete Distributions
  1.2 Continuous Distributions
2 Probability Theory
3 Random Variables
  3.1 Transformations
4 Expectation
5 Variance
6 Inequalities
7 Distribution Relationships
8 Probability and Moment Generating Functions
9 Multivariate Distributions
  9.1 Standard Bivariate Normal
  9.2 Bivariate Normal
  9.3 Multivariate Normal
10 Convergence
  10.1 Law of Large Numbers (LLN)
  10.2 Central Limit Theorem (CLT)
11 Statistical Inference
  11.1 Point Estimation
  11.2 Normal-Based Confidence Interval
  11.3 Empirical Distribution
  11.4 Statistical Functionals
12 Parametric Inference
  12.1 Method of Moments
  12.2 Maximum Likelihood
    12.2.1 Delta Method
  12.3 Multiparameter Models
    12.3.1 Multiparameter Delta Method
  12.4 Parametric Bootstrap
13 Hypothesis Testing
14 Bayesian Inference
  14.1 Credible Intervals
  14.2 Function of Parameters
  14.3 Priors
    14.3.1 Conjugate Priors
  14.4 Bayesian Testing
15 Exponential Family
16 Sampling Methods
  16.1 The Bootstrap
    16.1.1 Bootstrap Confidence Intervals
  16.2 Rejection Sampling
  16.3 Importance Sampling
17 Decision Theory
  17.1 Risk
  17.2 Admissibility
  17.3 Bayes Rule
  17.4 Minimax Rules
18 Linear Regression
  18.1 Simple Linear Regression
  18.2 Prediction
  18.3 Multiple Regression
  18.4 Model Selection
19 Non-parametric Function Estimation
  19.1 Density Estimation
    19.1.1 Histograms
    19.1.2 Kernel Density Estimator (KDE)
  19.2 Non-parametric Regression
  19.3 Smoothing Using Orthogonal Functions
20 Stochastic Processes
  20.1 Markov Chains
  20.2 Poisson Processes
21 Time Series
  21.1 Stationary Time Series
  21.2 Estimation of Correlation
  21.3 Non-Stationary Time Series
    21.3.1 Detrending
  21.4 ARIMA Models
    21.4.1 Causality and Invertibility
  21.5 Spectral Analysis
22 Math
  22.1 Gamma Function
  22.2 Beta Function
  22.3 Series
1 Distribution Overview

1.1 Discrete Distributions
For each distribution we list the notation¹, CDF F_X(x), PMF f_X(x), mean E[X], variance V[X], and MGF M_X(s).

Uniform Unif{a, ..., b}:
  F_X(x) = 0 for x < a;  (⌊x⌋ − a + 1)/(b − a + 1) for a ≤ x ≤ b;  1 for x > b
  f_X(x) = I(a ≤ x ≤ b)/(b − a + 1)
  E[X] = (a + b)/2,  V[X] = ((b − a + 1)² − 1)/12
  M_X(s) = (e^{as} − e^{(b+1)s}) / ((b − a + 1)(1 − e^s))

Bernoulli Bern(p):
  F_X(x) = 0 for x < 0;  1 − p for 0 ≤ x < 1;  1 for x ≥ 1
  f_X(x) = p^x (1 − p)^{1−x}
  E[X] = p,  V[X] = p(1 − p),  M_X(s) = 1 − p + pe^s

Binomial Bin(n, p):
  F_X(x) = I_{1−p}(n − x, x + 1)
  f_X(x) = (n choose x) p^x (1 − p)^{n−x}
  E[X] = np,  V[X] = np(1 − p),  M_X(s) = (1 − p + pe^s)^n

Multinomial Mult(n, p):
  f_X(x) = (n!/(x_1! ⋯ x_k!)) p_1^{x_1} ⋯ p_k^{x_k}, with Σ_{i=1}^k x_i = n
  E[X_i] = np_i,  V[X_i] = np_i(1 − p_i),  M_X(s) = (Σ_{i=1}^k p_i e^{s_i})^n

Hypergeometric Hyp(N, m, n):
  F_X(x) ≈ Φ((x − np)/√(np(1 − p))), with p = m/N
  f_X(x) = (m choose x)(N − m choose n − x)/(N choose n)
  E[X] = nm/N,  V[X] = nm(N − n)(N − m)/(N²(N − 1))

Negative Binomial NBin(r, p):
  F_X(x) = I_p(r, x + 1)
  f_X(x) = (x + r − 1 choose r − 1) p^r (1 − p)^x
  E[X] = r(1 − p)/p,  V[X] = r(1 − p)/p²,  M_X(s) = (p/(1 − (1 − p)e^s))^r

Geometric Geo(p):
  F_X(x) = 1 − (1 − p)^x, x ∈ ℕ⁺
  f_X(x) = p(1 − p)^{x−1}, x ∈ ℕ⁺
  E[X] = 1/p,  V[X] = (1 − p)/p²,  M_X(s) = pe^s/(1 − (1 − p)e^s)

Poisson Po(λ):
  F_X(x) = e^{−λ} Σ_{i=0}^{⌊x⌋} λ^i/i!
  f_X(x) = λ^x e^{−λ}/x!
  E[X] = λ,  V[X] = λ,  M_X(s) = e^{λ(e^s − 1)}
[Figure: PMFs of the discrete Uniform{a, ..., b}; Binomial with (n, p) = (40, 0.3), (30, 0.6), (25, 0.9); Geometric with p = 0.2, 0.5, 0.8; and Poisson with λ = 1, 4, 10.]
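The tabulated moments are easy to sanity-check numerically. Below is a minimal Python sketch (added for illustration; the original cookbook contains no code and merely assumes scipy is available) that compares the closed-form mean and variance of two table entries against scipy.stats:

```python
# Sanity-check tabulated means/variances against scipy.stats.
from scipy import stats

n, p = 40, 0.3
binom = stats.binom(n, p)
assert abs(binom.mean() - n * p) < 1e-12            # E[X] = np
assert abs(binom.var() - n * p * (1 - p)) < 1e-12   # V[X] = np(1-p)

geom = stats.geom(p)  # support {1, 2, ...}, matching the table
assert abs(geom.mean() - 1 / p) < 1e-9              # E[X] = 1/p
assert abs(geom.var() - (1 - p) / p**2) < 1e-9      # V[X] = (1-p)/p^2

print("Binomial:", binom.mean(), binom.var())
print("Geometric:", geom.mean(), geom.var())
```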
¹We use the notation γ(s, x) and Γ(x) to refer to the Gamma functions (see §22.1), and use B(x, y) and I_x to refer to the Beta functions (see §22.2).
1.2 Continuous Distributions
For each distribution we list the notation, CDF F_X(x), PDF f_X(x), mean E[X], variance V[X], and MGF M_X(s).

Uniform Unif(a, b):
  F_X(x) = 0 for x < a;  (x − a)/(b − a) for a < x < b;  1 for x > b
  f_X(x) = I(a < x < b)/(b − a)
  E[X] = (a + b)/2,  V[X] = (b − a)²/12,  M_X(s) = (e^{sb} − e^{sa})/(s(b − a))

Normal N(µ, σ²):
  Φ(x) = ∫_{−∞}^x φ(t) dt,  φ(x) = (1/(σ√(2π))) exp{−(x − µ)²/(2σ²)}
  E[X] = µ,  V[X] = σ²,  M_X(s) = exp{µs + σ²s²/2}

Log-Normal ln N(µ, σ²):
  F_X(x) = 1/2 + (1/2) erf[(ln x − µ)/√(2σ²)]
  f_X(x) = (1/(x√(2πσ²))) exp{−(ln x − µ)²/(2σ²)}
  E[X] = e^{µ + σ²/2},  V[X] = (e^{σ²} − 1) e^{2µ + σ²}

Multivariate Normal MVN(µ, Σ):
  f_X(x) = (2π)^{−k/2} |Σ|^{−1/2} e^{−(1/2)(x−µ)ᵀΣ^{−1}(x−µ)}
  E[X] = µ,  V[X] = Σ,  M_X(s) = exp{µᵀs + (1/2)sᵀΣs}

Student's t Student(ν):
  F_X(x) = I_x(ν/2, ν/2)
  f_X(x) = (Γ((ν+1)/2)/(√(νπ) Γ(ν/2))) (1 + x²/ν)^{−(ν+1)/2}
  E[X] = 0 (ν > 1),  V[X] = ν/(ν − 2) (ν > 2)

Chi-square χ²_k:
  F_X(x) = (1/Γ(k/2)) γ(k/2, x/2)
  f_X(x) = (1/(2^{k/2}Γ(k/2))) x^{k/2−1} e^{−x/2}
  E[X] = k,  V[X] = 2k,  M_X(s) = (1 − 2s)^{−k/2} (s < 1/2)

F F(d_1, d_2):
  F_X(x) = I_{d_1x/(d_1x + d_2)}(d_1/2, d_2/2)
  f_X(x) = √((d_1x)^{d_1} d_2^{d_2} / (d_1x + d_2)^{d_1+d_2}) / (x B(d_1/2, d_2/2))
  E[X] = d_2/(d_2 − 2) (d_2 > 2),  V[X] = 2d_2²(d_1 + d_2 − 2)/(d_1(d_2 − 2)²(d_2 − 4)) (d_2 > 4)

Exponential Exp(β):
  F_X(x) = 1 − e^{−x/β},  f_X(x) = (1/β) e^{−x/β}
  E[X] = β,  V[X] = β²,  M_X(s) = 1/(1 − βs) (s < 1/β)

Gamma Gamma(α, β):
  F_X(x) = γ(α, x/β)/Γ(α),  f_X(x) = (1/(Γ(α)β^α)) x^{α−1} e^{−x/β}
  E[X] = αβ,  V[X] = αβ²,  M_X(s) = (1/(1 − βs))^α (s < 1/β)

Inverse Gamma InvGamma(α, β):
  F_X(x) = Γ(α, β/x)/Γ(α),  f_X(x) = (β^α/Γ(α)) x^{−α−1} e^{−β/x}
  E[X] = β/(α − 1) (α > 1),  V[X] = β²/((α − 1)²(α − 2)) (α > 2)
  M_X(s) = (2(−βs)^{α/2}/Γ(α)) K_α(√(−4βs))

Dirichlet Dir(α):
  f_X(x) = (Γ(Σ_{i=1}^k α_i)/Π_{i=1}^k Γ(α_i)) Π_{i=1}^k x_i^{α_i−1}
  E[X_i] = α_i/Σ_{i=1}^k α_i,  V[X_i] = E[X_i](1 − E[X_i])/(Σ_{i=1}^k α_i + 1)

Beta Beta(α, β):
  F_X(x) = I_x(α, β),  f_X(x) = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1}(1 − x)^{β−1}
  E[X] = α/(α + β),  V[X] = αβ/((α + β)²(α + β + 1))
  M_X(s) = 1 + Σ_{k=1}^∞ (Π_{r=0}^{k−1} (α + r)/(α + β + r)) s^k/k!

Weibull Weibull(λ, k):
  F_X(x) = 1 − e^{−(x/λ)^k},  f_X(x) = (k/λ)(x/λ)^{k−1} e^{−(x/λ)^k}
  E[X] = λΓ(1 + 1/k),  V[X] = λ²Γ(1 + 2/k) − µ²
  M_X(s) = Σ_{n=0}^∞ (s^n λ^n/n!) Γ(1 + n/k)

Pareto Pareto(x_m, α):
  F_X(x) = 1 − (x_m/x)^α for x ≥ x_m,  f_X(x) = αx_m^α/x^{α+1} for x ≥ x_m
  E[X] = αx_m/(α − 1) (α > 1),  V[X] = x_m²α/((α − 1)²(α − 2)) (α > 2)
  M_X(s) = α(−x_m s)^α Γ(−α, −x_m s) (s < 0)
[Figure: PDFs of the continuous Uniform(a, b); Normal with (µ, σ²) = (0, 0.2), (0, 1), (0, 5), (−2, 0.5); Log-normal with (µ, σ²) = (0, 3), (2, 2), (0, 1), (0.5, 1), (0.25, 1), (0.125, 1); Student's t with ν = 1, 2, 5, ∞; χ² with k = 1, ..., 5; F with (d_1, d_2) = (1, 1), (2, 1), (5, 2), (100, 1), (100, 100); Exponential with β = 2, 1, 0.4; Gamma with (α, β) = (1, 2), (2, 2), (3, 2), (5, 1), (9, 0.5); Inverse Gamma with (α, β) = (1, 1), (2, 1), (3, 1), (3, 0.5); Beta with (α, β) = (0.5, 0.5), (5, 1), (1, 3), (2, 2), (2, 5); Weibull with λ = 1 and k = 0.5, 1, 1.5, 5; Pareto with x_m = 1 and α = 1, 2, 4.]
2 Probability Theory

Definitions
• Sample space Ω
• Outcome (point or element) ω ∈ Ω
• Event A ⊆ Ω
• σ-algebra A:
  1. ∅ ∈ A
  2. A_1, A_2, ... ∈ A ⟹ ⋃_{i=1}^∞ A_i ∈ A
  3. A ∈ A ⟹ ¬A ∈ A
• Probability distribution P:
  1. P[A] ≥ 0 for every A
  2. P[Ω] = 1
  3. P[⨆_{i=1}^∞ A_i] = Σ_{i=1}^∞ P[A_i] for pairwise disjoint A_i
• Probability space (Ω, A, P)
Properties
• P[Ω] = 1, P[∅] = 0
• B = Ω ∩ B = (A ∪ ¬A) ∩ B = (A ∩ B) ∪ (¬A ∩ B)
• P[¬A] = 1 − P[A]
• P[B] = P[A ∩ B] + P[¬A ∩ B]
• ¬(⋃_n A_n) = ⋂_n ¬A_n and ¬(⋂_n A_n) = ⋃_n ¬A_n (De Morgan)
• P[⋃_n A_n] = 1 − P[⋂_n ¬A_n]
• P[A ∪ B] = P[A] + P[B] − P[A ∩ B] ⟹ P[A ∪ B] ≤ P[A] + P[B]
• P[A ∪ B] = P[A ∩ ¬B] + P[¬A ∩ B] + P[A ∩ B]
• P[A ∩ ¬B] = P[A] − P[A ∩ B]

Continuity of Probabilities
• A_1 ⊂ A_2 ⊂ ... ⟹ lim_{n→∞} P[A_n] = P[A], where A = ⋃_{i=1}^∞ A_i
• A_1 ⊃ A_2 ⊃ ... ⟹ lim_{n→∞} P[A_n] = P[A], where A = ⋂_{i=1}^∞ A_i

Independence
A ⊥⊥ B ⟺ P[A ∩ B] = P[A] P[B]

Conditional Probability
P[A | B] = P[A ∩ B]/P[B]   (P[B] > 0)
Law of Total Probability
P[B] = Σ_{i=1}^n P[B | A_i] P[A_i], where Ω = ⨆_{i=1}^n A_i

Bayes' Theorem
P[A_i | B] = P[B | A_i] P[A_i] / Σ_{j=1}^n P[B | A_j] P[A_j], where Ω = ⨆_{i=1}^n A_i

Inclusion-Exclusion Principle
P[⋃_{i=1}^n A_i] = Σ_{r=1}^n (−1)^{r−1} Σ_{1 ≤ i_1 < ... < i_r ≤ n} P[⋂_{j=1}^r A_{i_j}]
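A quick numerical illustration (a sketch added here, not from the original text): estimating both sides of the inclusion-exclusion identity for three arbitrarily chosen events on a die roll by Monte Carlo.

```python
# Monte Carlo check of inclusion-exclusion for three events on a fair die.
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(1, 7, size=100_000)
A, B, C = (x <= 2), (x % 2 == 0), (x >= 5)  # arbitrary example events

lhs = np.mean(A | B | C)
rhs = (A.mean() + B.mean() + C.mean()
       - (A & B).mean() - (A & C).mean() - (B & C).mean()
       + (A & B & C).mean())
print(lhs, rhs)  # both close to P[A ∪ B ∪ C] = 5/6
```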
3 Random Variables

Random Variable (RV)
X: Ω → ℝ

Probability Mass Function (PMF)
f_X(x) = P[X = x] = P[{ω ∈ Ω : X(ω) = x}]

Probability Density Function (PDF)
P[a ≤ X ≤ b] = ∫_a^b f(x) dx

Cumulative Distribution Function (CDF)
F_X: ℝ → [0, 1],  F_X(x) = P[X ≤ x]
1. Nondecreasing: x_1 < x_2 ⟹ F(x_1) ≤ F(x_2)
2. Normalized: lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1
3. Right-continuous: lim_{y↓x} F(y) = F(x)

P[a ≤ Y ≤ b | X = x] = ∫_a^b f_{Y|X}(y | x) dy   (a ≤ b)
f_{Y|X}(y | x) = f(x, y)/f_X(x)

Independence
1. P[X ≤ x, Y ≤ y] = P[X ≤ x] P[Y ≤ y]
2. f_{X,Y}(x, y) = f_X(x) f_Y(y)
3.1 Transformations

Transformation function
Z = ϕ(X)

Discrete
f_Z(z) = P[ϕ(X) = z] = P[{x : ϕ(x) = z}] = P[X ∈ ϕ^{−1}(z)] = Σ_{x ∈ ϕ^{−1}(z)} f(x)

Continuous
F_Z(z) = P[ϕ(X) ≤ z] = ∫_{A_z} f(x) dx, with A_z = {x : ϕ(x) ≤ z}

Special case: if ϕ is strictly monotone,
f_Z(z) = f_X(ϕ^{−1}(z)) |dϕ^{−1}(z)/dz| = f_X(x) |dx/dz| = f_X(x)/|J|
The Rule of the Lazy Statistician
E[Z] = ∫ ϕ(x) dF_X(x)
E[I_A(X)] = ∫ I_A(x) dF_X(x) = ∫_A dF_X(x) = P[X ∈ A]

Convolution
• Z := X + Y:  f_Z(z) = ∫_{−∞}^∞ f_{X,Y}(x, z − x) dx;  if X, Y ≥ 0:  f_Z(z) = ∫_0^z f_{X,Y}(x, z − x) dx
• Z := |X − Y|:  f_Z(z) = 2 ∫_0^∞ f_{X,Y}(x, z + x) dx
• Z := X/Y:  f_Z(z) = ∫_{−∞}^∞ |x| f_{X,Y}(x, xz) dx;  if X ⊥⊥ Y:  f_Z(z) = ∫_{−∞}^∞ |x| f_X(x) f_Y(xz) dx
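As a sketch (added, not in the original), the convolution for Z = X + Y can be checked against a known closed form: for X, Y iid Exponential with scale β, Z ~ Gamma(2, β).

```python
# Empirical check of Z = X + Y for X, Y iid Exp(scale=beta):
# Z should be Gamma(shape=2, scale=beta).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
beta = 2.0
z = rng.exponential(beta, 50_000) + rng.exponential(beta, 50_000)

# Kolmogorov-Smirnov test against the analytic Gamma(2, beta) CDF.
ks = stats.kstest(z, stats.gamma(a=2, scale=beta).cdf)
print(ks.statistic, ks.pvalue)  # small statistic, typically large p-value
```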
4 Expectation

Definition and properties
• E[X] = µ_X = ∫ x dF_X(x) = Σ_x x f_X(x) if X is discrete;  ∫ x f_X(x) dx if X is continuous
• P[X = c] = 1 ⟹ E[c] = c
• E[cX] = c E[X]
• E[X + Y] = E[X] + E[Y]
• E[XY] = ∫∫ xy f_{X,Y}(x, y) dx dy
• In general E[ϕ(X)] ≠ ϕ(E[X]) (cf. Jensen inequality)
• P[X ≥ Y] = 1 ⟹ E[X] ≥ E[Y], and P[X = Y] = 1 ⟹ E[X] = E[Y]
• E[X] = Σ_{x=1}^∞ P[X ≥ x] for X taking values in ℕ

Sample mean
X̄_n = (1/n) Σ_{i=1}^n X_i

Conditional expectation
• E[Y | X = x] = ∫ y f(y | x) dy
• E[X] = E[E[X | Y]]
• E[ϕ(X, Y) | X = x] = ∫_{−∞}^∞ ϕ(x, y) f_{Y|X}(y | x) dy
• E[ϕ(Y, Z) | X = x] = ∫∫ ϕ(y, z) f_{(Y,Z)|X}(y, z | x) dy dz
• E[Y + Z | X] = E[Y | X] + E[Z | X]
• E[ϕ(X)Y | X] = ϕ(X) E[Y | X]
• E[Y | X] = c ⟹ Cov[X, Y] = 0
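A small simulation sketch (added here, not in the source) of the tower property E[X] = E[E[X | Y]], using an arbitrary Poisson-Gamma hierarchy:

```python
# Tower property: E[X] = E[E[X | Y]].
# Y ~ Gamma(2, 1) and X | Y ~ Poisson(Y), so E[X | Y] = Y and E[X] = E[Y] = 2.
import numpy as np

rng = np.random.default_rng(2)
y = rng.gamma(shape=2.0, scale=1.0, size=200_000)
x = rng.poisson(y)

print(x.mean(), y.mean())  # both approximately 2
```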
5 Variance

Definition and properties
• V[X] = σ²_X = E[(X − E[X])²] = E[X²] − E[X]²
• V[Σ_{i=1}^n X_i] = Σ_{i=1}^n V[X_i] + 2 Σ_{i≠j} Cov[X_i, X_j]
• V[Σ_{i=1}^n X_i] = Σ_{i=1}^n V[X_i] if the X_i are independent

Standard deviation
sd[X] = √V[X] = σ_X

Covariance
• Cov[X, Y] = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X] E[Y]
• Cov[X, a] = 0
• Cov[X, X] = V[X]
• Cov[X, Y] = Cov[Y, X]
• Cov[aX, bY] = ab Cov[X, Y]
• Cov[X + a, Y + b] = Cov[X, Y]
• Cov[Σ_{i=1}^n X_i, Σ_{j=1}^m Y_j] = Σ_{i=1}^n Σ_{j=1}^m Cov[X_i, Y_j]

Correlation
ρ[X, Y] = Cov[X, Y] / √(V[X] V[Y])

Independence
X ⊥⊥ Y ⟹ ρ[X, Y] = 0 ⟺ Cov[X, Y] = 0 ⟺ E[XY] = E[X] E[Y]

Sample variance
S² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄_n)²

Conditional variance
• V[Y | X] = E[(Y − E[Y | X])² | X] = E[Y² | X] − E[Y | X]²
• V[Y] = E[V[Y | X]] + V[E[Y | X]]
6 Inequalities

Cauchy-Schwarz
E[XY]² ≤ E[X²] E[Y²]

Markov
P[ϕ(X) ≥ t] ≤ E[ϕ(X)]/t   (ϕ ≥ 0, t > 0)

Chebyshev
P[|X − E[X]| ≥ t] ≤ V[X]/t²

Chernoff
P[X ≥ (1 + δ)µ] ≤ (e^δ/(1 + δ)^{1+δ})^µ   (δ > −1)

Jensen
E[ϕ(X)] ≥ ϕ(E[X])   (ϕ convex)
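A numeric sketch (added; assumes only numpy) comparing the Markov and Chebyshev bounds with the exact tail of an Exponential(1) variable, for which E[X] = V[X] = 1:

```python
# Exact tail of X ~ Exp(1) versus Markov and Chebyshev upper bounds.
import numpy as np

t = 3.0
exact = np.exp(-t)             # P[X >= t] for Exp(1)
markov = 1.0 / t               # E[X] / t
chebyshev = 1.0 / (t - 1)**2   # P[|X - 1| >= t - 1] <= V[X] / (t-1)^2

print(f"exact={exact:.4f}  markov={markov:.4f}  chebyshev={chebyshev:.4f}")
# exact 0.0498 <= chebyshev 0.25 and markov 0.3333: the bounds hold but are loose
```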
7 Distribution Relationships

Binomial
• X_i ~ Bern(p) ⟹ Σ_{i=1}^n X_i ~ Bin(n, p)
• X ~ Bin(n, p), Y ~ Bin(m, p), X ⊥⊥ Y ⟹ X + Y ~ Bin(n + m, p)
• lim_{n→∞} Bin(n, p) = Po(np)   (n large, p small)
• lim_{n→∞} Bin(n, p) = N(np, np(1 − p))   (n large, p far from 0 and 1)

Negative Binomial
• X ~ NBin(1, p) = Geo(p)
• X ~ NBin(r, p) = Σ_{i=1}^r Geo(p)
• X_i ~ NBin(r_i, p) ⟹ Σ X_i ~ NBin(Σ r_i, p)
• X ~ NBin(r, p), Y ~ Bin(s + r, p) ⟹ P[X ≤ s] = P[Y ≥ r]

Poisson
• X_i ~ Po(λ_i) ∧ X_i ⊥⊥ X_j ⟹ Σ_{i=1}^n X_i ~ Po(Σ_{i=1}^n λ_i)
• X_i ~ Po(λ_i) ∧ X_i ⊥⊥ X_j ⟹ X_i | Σ_{j=1}^n X_j ~ Bin(Σ_{j=1}^n X_j, λ_i/Σ_{j=1}^n λ_j)

Exponential
• X_i ~ Exp(β) ∧ X_i ⊥⊥ X_j ⟹ Σ_{i=1}^n X_i ~ Gamma(n, β)
• Memoryless property: P[X > x + y | X > y] = P[X > x]

Normal
• X ~ N(µ, σ²) ⟹ (X − µ)/σ ~ N(0, 1)
• X ~ N(µ, σ²) ∧ Z = aX + b ⟹ Z ~ N(aµ + b, a²σ²)
• X ~ N(µ_1, σ_1²) ∧ Y ~ N(µ_2, σ_2²), X ⊥⊥ Y ⟹ X + Y ~ N(µ_1 + µ_2, σ_1² + σ_2²)
• X_i ~ N(µ_i, σ_i²) independent ⟹ Σ_i X_i ~ N(Σ_i µ_i, Σ_i σ_i²)
• P[a < X ≤ b] = Φ((b − µ)/σ) − Φ((a − µ)/σ)
• Φ(−x) = 1 − Φ(x),  φ′(x) = −xφ(x),  φ″(x) = (x² − 1)φ(x)
• Upper quantile of N(0, 1): z_α = Φ^{−1}(1 − α)

Gamma
• X ~ Gamma(α, β) ⟺ X/β ~ Gamma(α, 1)
• Gamma(α, β) ~ Σ_{i=1}^α Exp(β)   (α integer)
• X_i ~ Gamma(α_i, β) ∧ X_i ⊥⊥ X_j ⟹ Σ_i X_i ~ Gamma(Σ_i α_i, β)
• Γ(α)/λ^α = ∫_0^∞ x^{α−1} e^{−λx} dx

Beta
• f_X(x) = (1/B(α, β)) x^{α−1}(1 − x)^{β−1} = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1}(1 − x)^{β−1}
• E[X^k] = B(α + k, β)/B(α, β) = ((α + k − 1)/(α + β + k − 1)) E[X^{k−1}]
• Beta(1, 1) ~ Unif(0, 1)
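These relationships are convenient to verify by simulation. Below is a sketch (added, not from the source; parameter values are arbitrary) checking that a sum of iid exponentials is Gamma-distributed:

```python
# Check: the sum of n iid Exp(beta) variables is Gamma(n, beta).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, beta = 5, 2.0
sums = rng.exponential(beta, size=(100_000, n)).sum(axis=1)

ks = stats.kstest(sums, stats.gamma(a=n, scale=beta).cdf)
print(ks.pvalue)  # typically large: consistent with Gamma(5, 2)
```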
8 Probability and Moment Generating Functions
• G_X(t) = E[t^X],  |t| < 1
• M_X(t) = G_X(e^t) = E[e^{Xt}] = E[Σ_{i=0}^∞ (Xt)^i/i!] = Σ_{i=0}^∞ (E[X^i]/i!) t^i
• P[X = 0] = G_X(0)
• P[X = 1] = G′_X(0)
• P[X = i] = G_X^{(i)}(0)/i!
• E[X] = G′_X(1⁻)
• E[X^k] = M_X^{(k)}(0)
• E[X!/(X − k)!] = G_X^{(k)}(1⁻)
• V[X] = G″_X(1⁻) + G′_X(1⁻) − (G′_X(1⁻))²
• G_X(t) = G_Y(t) ⟹ X ≐ Y (equality in distribution)
9 Multivariate Distributions

9.1 Standard Bivariate Normal
Let X, Z ~ N(0, 1) with X ⊥⊥ Z, and set Y = ρX + √(1 − ρ²) Z.
Joint density
f(x, y) = (1/(2π√(1 − ρ²))) exp{−(x² + y² − 2ρxy)/(2(1 − ρ²))}
Conditionals
(Y | X = x) ~ N(ρx, 1 − ρ²) and (X | Y = y) ~ N(ρy, 1 − ρ²)
Independence
X ⊥⊥ Y ⟺ ρ = 0

9.2 Bivariate Normal
Let X ~ N(µ_x, σ_x²) and Y ~ N(µ_y, σ_y²).
f(x, y) = (1/(2πσ_xσ_y√(1 − ρ²))) exp{−z/(2(1 − ρ²))}
z = ((x − µ_x)/σ_x)² + ((y − µ_y)/σ_y)² − 2ρ((x − µ_x)/σ_x)((y − µ_y)/σ_y)
Conditional mean and variance
E[X | Y] = E[X] + ρ(σ_X/σ_Y)(Y − E[Y])
V[X | Y] = σ_X²(1 − ρ²)

9.3 Multivariate Normal
Covariance matrix Σ (precision matrix Σ^{−1}):
Σ = [ V[X_1] ⋯ Cov[X_1, X_k] ; ⋮ ⋱ ⋮ ; Cov[X_k, X_1] ⋯ V[X_k] ]
If X ~ N(µ, Σ),
f_X(x) = (2π)^{−k/2} |Σ|^{−1/2} exp{−(1/2)(x − µ)ᵀΣ^{−1}(x − µ)}
Properties
• Z ~ N(0, I) ∧ X = µ + Σ^{1/2} Z ⟹ X ~ N(µ, Σ)
• X ~ N(µ, Σ) ⟹ Σ^{−1/2}(X − µ) ~ N(0, I)
• X ~ N(µ, Σ) ⟹ AX ~ N(Aµ, AΣAᵀ)
• X ~ N(µ, Σ), a a vector of length k ⟹ aᵀX ~ N(aᵀµ, aᵀΣa)
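The first property above is the standard sampling recipe. A brief sketch (added; not from the source) using a Cholesky factor as Σ^{1/2}:

```python
# Sample X ~ N(mu, Sigma) via X = mu + L Z, with L the Cholesky factor of Sigma.
import numpy as np

rng = np.random.default_rng(4)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

L = np.linalg.cholesky(Sigma)           # L @ L.T == Sigma
Z = rng.standard_normal((100_000, 2))   # iid N(0, 1) components
X = mu + Z @ L.T

print(X.mean(axis=0))  # ~ mu
print(np.cov(X.T))     # ~ Sigma
```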
10 Convergence
Let {X_1, X_2, ...} be a sequence of rv's and let X be another rv. Let F_n denote the CDF of X_n and let F denote the CDF of X.

Types of convergence
1. In distribution (weakly, in law): X_n →D X
   lim_{n→∞} F_n(t) = F(t) for all t at which F is continuous
2. In probability: X_n →P X
   (∀ε > 0) lim_{n→∞} P[|X_n − X| > ε] = 0
3. Almost surely (strongly): X_n →as X
   P[lim_{n→∞} X_n = X] = P[{ω ∈ Ω : lim_{n→∞} X_n(ω) = X(ω)}] = 1
4. In quadratic mean (L²): X_n →qm X
   lim_{n→∞} E[(X_n − X)²] = 0

Relationships
• X_n →qm X ⟹ X_n →P X ⟹ X_n →D X
• X_n →as X ⟹ X_n →P X
• X_n →D X ∧ (∃c ∈ ℝ) P[X = c] = 1 ⟹ X_n →P X
• X_n →P X ∧ Y_n →P Y ⟹ X_n + Y_n →P X + Y
• X_n →qm X ∧ Y_n →qm Y ⟹ X_n + Y_n →qm X + Y
• X_n →P X ∧ Y_n →P Y ⟹ X_n Y_n →P XY
• X_n →P X ⟹ ϕ(X_n) →P ϕ(X)
• X_n →D X ⟹ ϕ(X_n) →D ϕ(X)
• X_n →qm b ⟺ lim_{n→∞} E[X_n] = b ∧ lim_{n→∞} V[X_n] = 0
• X_1, ..., X_n iid ∧ E[X] = µ ∧ V[X] < ∞ ⟹ X̄_n →qm µ

Slutsky's Theorem
• X_n →D X and Y_n →P c ⟹ X_n + Y_n →D X + c
• X_n →D X and Y_n →P c ⟹ X_n Y_n →D cX
• In general: X_n →D X and Y_n →D Y does not imply X_n + Y_n →D X + Y
10.1 Law of Large Numbers (LLN)
Let {X_1, ..., X_n} be a sequence of iid rv's with E[X_1] = µ and V[X_1] < ∞.

Weak (WLLN)
X̄_n →P µ as n → ∞

Strong (SLLN)
X̄_n →as µ as n → ∞
10.2 Central Limit Theorem (CLT)
Let {X_1, ..., X_n} be a sequence of iid rv's with E[X_1] = µ and V[X_1] = σ².

Z_n := (X̄_n − µ)/√V[X̄_n] = √n (X̄_n − µ)/σ →D Z, where Z ~ N(0, 1)
lim_{n→∞} P[Z_n ≤ z] = Φ(z),  z ∈ ℝ

CLT notations
Z_n ≈ N(0, 1)
X̄_n ≈ N(µ, σ²/n)
X̄_n − µ ≈ N(0, σ²/n)
√n (X̄_n − µ) ≈ N(0, σ²)
√n (X̄_n − µ)/σ ≈ N(0, 1)

Continuity correction
P[X̄_n ≤ x] ≈ Φ((x + 1/2 − µ)/(σ/√n))
P[X̄_n ≥ x] ≈ 1 − Φ((x − 1/2 − µ)/(σ/√n))

Delta method
Y_n ≈ N(µ, σ²/n) ⟹ ϕ(Y_n) ≈ N(ϕ(µ), (ϕ′(µ))² σ²/n)
11 Statistical Inference
Let X_1, ..., X_n iid ~ F if not otherwise noted.

11.1 Point Estimation
• Point estimator θ̂_n of θ is a rv: θ̂_n = g(X_1, ..., X_n)
• bias(θ̂_n) = E[θ̂_n] − θ
• Consistency: θ̂_n →P θ
• Sampling distribution: F(θ̂_n)
• Standard error: se(θ̂_n) = √V[θ̂_n]
• Mean squared error: mse = E[(θ̂_n − θ)²] = bias(θ̂_n)² + V[θ̂_n]
• lim_{n→∞} bias(θ̂_n) = 0 ∧ lim_{n→∞} se(θ̂_n) = 0 ⟹ θ̂_n is consistent
• Asymptotic normality: (θ̂_n − θ)/se →D N(0, 1)
• Slutsky's Theorem often lets us replace se(θ̂_n) by some (weakly) consistent estimator σ̂_n.
11.2 Normal-Based Confidence Interval
Suppose θ̂_n ≈ N(θ, ŝe²). Let z_{α/2} = Φ^{−1}(1 − α/2), i.e., P[Z > z_{α/2}] = α/2 and P[−z_{α/2} < Z < z_{α/2}] = 1 − α, where Z ~ N(0, 1). Then
C_n = θ̂_n ± z_{α/2} ŝe
11.3 Empirical Distribution

Empirical Distribution Function (ECDF)
F̂_n(x) = (1/n) Σ_{i=1}^n I(X_i ≤ x), where I(X_i ≤ x) = 1 if X_i ≤ x and 0 otherwise

Properties (for any fixed x)
• E[F̂_n(x)] = F(x)
• V[F̂_n(x)] = F(x)(1 − F(x))/n
• mse = F(x)(1 − F(x))/n → 0
• F̂_n(x) →P F(x)

Dvoretzky-Kiefer-Wolfowitz (DKW) inequality (X_1, ..., X_n ~ F)
P[sup_x |F(x) − F̂_n(x)| > ε] ≤ 2e^{−2nε²}

Nonparametric 1 − α confidence band for F
L(x) = max{F̂_n(x) − ε_n, 0}
U(x) = min{F̂_n(x) + ε_n, 1}
ε_n = √((1/(2n)) log(2/α))
P[L(x) ≤ F(x) ≤ U(x) ∀x] ≥ 1 − α
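A compact sketch (added) computing the ECDF together with its DKW band:

```python
# ECDF with a nonparametric 1-alpha DKW confidence band.
import numpy as np

def ecdf_band(x, alpha=0.05):
    x = np.sort(x)
    n = len(x)
    F_hat = np.arange(1, n + 1) / n
    eps = np.sqrt(np.log(2 / alpha) / (2 * n))   # DKW half-width
    lower = np.clip(F_hat - eps, 0, 1)
    upper = np.clip(F_hat + eps, 0, 1)
    return x, F_hat, lower, upper

rng = np.random.default_rng(6)
x, F_hat, lo, hi = ecdf_band(rng.normal(size=200))
print(F_hat[:5], lo[:5], hi[:5])
```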
11.4 Statistical Functionals
• Statistical functional: T(F)
• Plug-in estimator of θ = T(F): θ̂_n = T(F̂_n)
• Linear functional: T(F) = ∫ ϕ(x) dF_X(x)
• Plug-in estimator for linear functional: T(F̂_n) = ∫ ϕ(x) dF̂_n(x) = (1/n) Σ_{i=1}^n ϕ(X_i)
• Often: T(F̂_n) ≈ N(T(F), ŝe²) ⟹ T(F̂_n) ± z_{α/2} ŝe
• pth quantile: F^{−1}(p) = inf{x : F(x) ≥ p}
• µ̂ = X̄_n
• σ̂² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄_n)²
• κ̂ = ((1/n) Σ_{i=1}^n (X_i − µ̂)³)/σ̂³
• ρ̂ = Σ_{i=1}^n (X_i − X̄_n)(Y_i − Ȳ_n) / (√(Σ_{i=1}^n (X_i − X̄_n)²) √(Σ_{i=1}^n (Y_i − Ȳ_n)²))
12 Parametric Inference
Let F = {f(x; θ) : θ ∈ Θ} be a parametric model with parameter space Θ ⊂ ℝ^k and parameter θ = (θ_1, ..., θ_k).

12.1 Method of Moments
jth moment
α_j(θ) = E[X^j] = ∫ x^j dF_X(x)
jth sample moment
α̂_j = (1/n) Σ_{i=1}^n X_i^j
Method of moments estimator (MoM): θ̂_n solves the system
α_1(θ̂_n) = α̂_1, ..., α_k(θ̂_n) = α̂_k

Properties of the MoM estimator
• θ̂_n exists with probability tending to 1
• Consistency: θ̂_n →P θ
• Asymptotic normality: √n (θ̂ − θ) →D N(0, Σ), where Σ = g E[YYᵀ] gᵀ, Y = (X, X², ..., X^k)ᵀ, g = (g_1, ..., g_k), and g_j = ∂α_j^{−1}(θ)/∂θ
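As a sketch (added, not from the source), MoM for the Gamma(α, β) scale family: matching the mean αβ and variance αβ² gives α̂ = X̄²/σ̂² and β̂ = σ̂²/X̄, with σ̂² the sample variance.

```python
# Method of moments for Gamma(alpha, beta) (scale parametrization):
# alpha_hat = mean^2 / var, beta_hat = var / mean.
import numpy as np

rng = np.random.default_rng(7)
x = rng.gamma(shape=3.0, scale=2.0, size=50_000)

m, v = x.mean(), x.var()
alpha_hat, beta_hat = m**2 / v, v / m
print(alpha_hat, beta_hat)  # approximately 3 and 2
```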
12.2 Maximum Likelihood
Likelihood: L_n: Θ → [0, ∞)
L_n(θ) = Π_{i=1}^n f(X_i; θ)
Log-likelihood
ℓ_n(θ) = log L_n(θ) = Σ_{i=1}^n log f(X_i; θ)
Maximum likelihood estimator (MLE)
L_n(θ̂_n) = sup_θ L_n(θ)
Score function
s(X; θ) = ∂ log f(X; θ)/∂θ
Fisher information
I(θ) = V_θ[s(X; θ)],  I_n(θ) = n I(θ)
Fisher information (exponential family)
I(θ) = E_θ[−∂s(X; θ)/∂θ]
Observed Fisher information
I_n^{obs}(θ) = −(∂²/∂θ²) Σ_{i=1}^n log f(X_i; θ)

Properties of the MLE
• Consistency: θ̂_n →P θ
• Equivariance: θ̂_n is the MLE of θ ⟹ ϕ(θ̂_n) is the MLE of ϕ(θ)
• Asymptotic normality:
  1. se ≈ √(1/I_n(θ)) and (θ̂_n − θ)/se →D N(0, 1)
  2. ŝe ≈ √(1/I_n(θ̂_n)) and (θ̂_n − θ)/ŝe →D N(0, 1)
• Asymptotic optimality (or efficiency), i.e., smallest variance for large samples. If θ̃_n is any other estimator, the asymptotic relative efficiency is are(θ̃_n, θ̂_n) = V[θ̂_n]/V[θ̃_n] ≤ 1
• Approximately the Bayes estimator
12.2.1 Delta Method
If τ = ϕ(θ), where ϕ is differentiable and ϕ′(θ) ≠ 0:
(τ̂_n − τ)/ŝe(τ̂) →D N(0, 1)
where τ̂ = ϕ(θ̂) is the MLE of τ and
ŝe(τ̂) = |ϕ′(θ̂)| ŝe(θ̂_n)
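A brief sketch (added; numeric details are an arbitrary choice) of an MLE computed by direct optimization, with the Fisher-information standard error. For an Exponential sample with rate λ, I(λ) = 1/λ², so ŝe = λ̂/√n.

```python
# Numeric MLE for the rate lambda of an Exponential sample.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(8)
x = rng.exponential(scale=1 / 3.0, size=1_000)  # true rate lambda = 3

# Negative log-likelihood: -(n log(lam) - lam * sum(x))
nll = lambda lam: -(len(x) * np.log(lam) - lam * x.sum())
res = minimize_scalar(nll, bounds=(1e-6, 100), method="bounded")
lam_hat = res.x

se_hat = lam_hat / np.sqrt(len(x))  # sqrt(1 / I_n(lam_hat))
print(lam_hat, se_hat)              # ~3 and ~0.095
```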
12.3 Multiparameter Models
Let θ = (θ_1, ..., θ_k) and let θ̂ = (θ̂_1, ..., θ̂_k) be the MLE.
H_{jj} = ∂²ℓ_n/∂θ_j²,  H_{jk} = ∂²ℓ_n/∂θ_j ∂θ_k
Fisher information matrix
I_n(θ) = −[ E_θ[H_{11}] ⋯ E_θ[H_{1k}] ; ⋮ ⋱ ⋮ ; E_θ[H_{k1}] ⋯ E_θ[H_{kk}] ]
Under appropriate regularity conditions
(θ̂ − θ) ≈ N(0, J_n), with J_n(θ) = I_n^{−1}(θ). Further, if θ̂_j is the jth component of θ̂, then
(θ̂_j − θ_j)/ŝe_j →D N(0, 1)
where ŝe_j² = J_n(j, j) and Cov[θ̂_j, θ̂_k] = J_n(j, k).

12.3.1 Multiparameter Delta Method
Let τ = ϕ(θ_1, ..., θ_k) and let the gradient of ϕ be
∇ϕ = (∂ϕ/∂θ_1, ..., ∂ϕ/∂θ_k)ᵀ
Suppose ∇ϕ|_{θ=θ̂} ≠ 0 and τ̂ = ϕ(θ̂). Then
(τ̂ − τ)/ŝe(τ̂) →D N(0, 1)
where ŝe(τ̂) = √((∇̂ϕ)ᵀ Ĵ_n ∇̂ϕ), Ĵ_n = J_n(θ̂), and ∇̂ϕ = ∇ϕ|_{θ=θ̂}.
12.4 Parametric Bootstrap
Sample from f(x; θ̂_n) instead of from F̂_n, where θ̂_n could be the MLE or the method of moments estimator.
13 Hypothesis Testing
H_0: θ ∈ Θ_0 versus H_1: θ ∈ Θ_1

Definitions
• Null hypothesis H_0
• Alternative hypothesis H_1
• Simple hypothesis: θ = θ_0
• Composite hypothesis: θ > θ_0 or θ < θ_0
• Two-sided test: H_0: θ = θ_0 versus H_1: θ ≠ θ_0
• One-sided test: H_0: θ ≤ θ_0 versus H_1: θ > θ_0
• Critical value c
• Test statistic T
• Rejection region R = {x : T(x) > c}
• Power function β(θ) = P_θ[X ∈ R]
• Power of a test: 1 − P[Type II error] = 1 − β = inf_{θ∈Θ_1} β(θ)
• Test size: α = P[Type I error] = sup_{θ∈Θ_0} β(θ)

            Retain H_0           Reject H_0
H_0 true    correct              Type I error (α)
H_1 true    Type II error (β)    correct (power)

p-value
• p-value = sup_{θ∈Θ_0} P_θ[T(X) ≥ T(x)] = inf{α : T(x) ∈ R_α}
• p-value = sup_{θ∈Θ_0} P_θ[T(X⋆) ≥ T(X)] = inf{α : T(X) ∈ R_α}, where T(X⋆) ~ F_θ, so the middle expression equals 1 − F_θ(T(X))

p-value       evidence
< 0.01        very strong evidence against H_0
0.01–0.05     strong evidence against H_0
0.05–0.1      weak evidence against H_0
> 0.1         little or no evidence against H_0

Wald test
• Two-sided test
• Reject H_0 when |W| > z_{α/2}, where W = (θ̂ − θ_0)/ŝe
• P[|W| > z_{α/2}] → α
• p-value = P_{θ_0}[|W| > |w|] ≈ P[|Z| > |w|] = 2Φ(−|w|)

Likelihood ratio test (LRT)
• T(X) = sup_{θ∈Θ} L_n(θ) / sup_{θ∈Θ_0} L_n(θ) = L_n(θ̂_n)/L_n(θ̂_{n,0}), where θ̂_{n,0} is the MLE restricted to Θ_0
• λ(X) = 2 log T(X) →D χ²_{r−q}, where Σ_{i=1}^k Z_i² ~ χ²_k and Z_1, ..., Z_k iid ~ N(0, 1)
• p-value = P_{θ_0}[λ(X) > λ(x)] ≈ P[χ²_{r−q} > λ(x)]

Multinomial LRT
• MLE: p̂_n = (X_1/n, ..., X_k/n)
• T(X) = L_n(p̂_n)/L_n(p_0) = Π_{j=1}^k (p̂_j/p_{0j})^{X_j}
• λ(X) = 2 Σ_{j=1}^k X_j log(p̂_j/p_{0j}) →D χ²_{k−1}
• The approximate size-α LRT rejects H_0 when λ(X) ≥ χ²_{k−1,α}

Pearson chi-square test
• T = Σ_{j=1}^k (X_j − E[X_j])²/E[X_j], where E[X_j] = n p_{0j} under H_0
• T →D χ²_{k−1}
• p-value = P[χ²_{k−1} > T(x)]
• Converges to χ²_{k−1} faster than the LRT, hence preferable for small n

Independence testing
• I rows, J columns, X multinomial sample of size n = I·J
• Unconstrained MLEs: p̂_{ij} = X_{ij}/n
• MLEs under H_0: p̂_{0ij} = p̂_{i·} p̂_{·j} = (X_{i·}/n)(X_{·j}/n)
• LRT: λ = 2 Σ_{i=1}^I Σ_{j=1}^J X_{ij} log(n X_{ij}/(X_{i·} X_{·j}))
• Pearson: T = Σ_{i=1}^I Σ_{j=1}^J (X_{ij} − E[X_{ij}])²/E[X_{ij}]
• Both LRT and Pearson statistics →D χ²_ν, where ν = (I − 1)(J − 1)
14 Bayesian Inference

Bayes' Theorem
f(θ | x^n) = f(x^n | θ) f(θ)/f(x^n) = f(x^n | θ) f(θ)/∫ f(x^n | θ) f(θ) dθ ∝ L_n(θ) f(θ)

Definitions
• X^n = (X_1, ..., X_n), x^n = (x_1, ..., x_n)
• Prior density f(θ)
• Likelihood f(x^n | θ): joint density of the data. In particular, X^n iid ⟹ f(x^n | θ) = Π_{i=1}^n f(x_i | θ) = L_n(θ)
• Posterior density f(θ | x^n)
• Normalizing constant c_n = f(x^n) = ∫ f(x^n | θ) f(θ) dθ
• Kernel: part of a density that depends on θ
• Posterior mean θ̄_n = ∫ θ f(θ | x^n) dθ = ∫ θ L_n(θ) f(θ) dθ / ∫ L_n(θ) f(θ) dθ
14.1 Credible Intervals
Posterior interval
P[θ ∈ (a, b) | x^n] = ∫_a^b f(θ | x^n) dθ = 1 − α
Equal-tail credible interval
∫_{−∞}^a f(θ | x^n) dθ = ∫_b^∞ f(θ | x^n) dθ = α/2
Highest posterior density (HPD) region R_n:
1. P[θ ∈ R_n | x^n] = 1 − α
2. R_n = {θ : f(θ | x^n) > k} for some k
If f(θ | x^n) is unimodal, R_n is an interval.
14.2 Function of Parameters
Let τ = ϕ(θ) and A = {θ : ϕ(θ) ≤ τ}.
Posterior CDF for τ
H(τ | x^n) = P[ϕ(θ) ≤ τ | x^n] = ∫_A f(θ | x^n) dθ
Posterior density
h(τ | x^n) = H′(τ | x^n)
Bayesian delta method
τ | X^n ≈ N(ϕ(θ̂), (ŝe ϕ′(θ̂))²)
14.3 Priors

Choice
• Subjective Bayesianism
• Objective Bayesianism
• Robust Bayesianism

Types
• Flat: f(θ) ∝ constant
• Proper: ∫_{−∞}^∞ f(θ) dθ = 1
• Improper: ∫_{−∞}^∞ f(θ) dθ = ∞
• Jeffreys' prior (transformation-invariant): f(θ) ∝ √I(θ); multiparameter case: f(θ) ∝ √det(I(θ))
• Conjugate: f(θ) and f(θ | x^n) belong to the same parametric family
14.3.1 Conjugate Priors

Discrete likelihood
Likelihood | Conjugate prior | Posterior hyperparameters
Bern(p) | Beta(α, β) | α + Σ_{i=1}^n x_i,  β + n − Σ_{i=1}^n x_i
Bin(p) | Beta(α, β) | α + Σ_{i=1}^n x_i,  β + Σ_{i=1}^n N_i − Σ_{i=1}^n x_i
NBin(p) | Beta(α, β) | α + rn,  β + Σ_{i=1}^n x_i
Po(λ) | Gamma(α, β) | α + Σ_{i=1}^n x_i,  β + n
Multinomial(p) | Dir(α) | α + Σ_{i=1}^n x^{(i)}
Geo(p) | Beta(α, β) | α + n,  β + Σ_{i=1}^n x_i

Continuous likelihood (subscript c denotes a constant)
Likelihood | Conjugate prior | Posterior hyperparameters
Unif(0, θ) | Pareto(x_m, k) | max{x_{(n)}, x_m},  k + n
Exp(λ) | Gamma(α, β) | α + n,  β + Σ_{i=1}^n x_i
N(µ, σ_c²) | N(µ_0, σ_0²) | (µ_0/σ_0² + Σ_{i=1}^n x_i/σ_c²)/(1/σ_0² + n/σ_c²),  (1/σ_0² + n/σ_c²)^{−1}
N(µ_c, σ²) | Scaled Inverse Chi-square(ν, σ_0²) | ν + n,  (νσ_0² + Σ_{i=1}^n (x_i − µ_c)²)/(ν + n)
N(µ, σ²) | Normal-scaled Inverse Gamma(λ, ν, α, β) | (νλ + nx̄)/(ν + n),  ν + n,  α + n/2,  β + (1/2) Σ_{i=1}^n (x_i − x̄)² + nν(x̄ − λ)²/(2(ν + n))
MVN(µ, Σ_c) | MVN(µ_0, Σ_0) | (Σ_0^{−1} + nΣ_c^{−1})^{−1}(Σ_0^{−1}µ_0 + nΣ_c^{−1}x̄),  (Σ_0^{−1} + nΣ_c^{−1})^{−1}
MVN(µ_c, Σ) | Inverse-Wishart(κ, Ψ) | n + κ,  Ψ + Σ_{i=1}^n (x_i − µ_c)(x_i − µ_c)ᵀ
Pareto(x_{mc}, k) | Gamma(α, β) | α + n,  β + Σ_{i=1}^n log(x_i/x_{mc})
Pareto(x_m, k_c) | Pareto(x_0, k_0) | x_0,  k_0 − kn, where k_0 > kn
Gamma(α_c, β) | Gamma(α_0, β_0) | α_0 + nα_c,  β_0 + Σ_{i=1}^n x_i
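The first table row is the classic Beta-Bernoulli update; a minimal sketch (added; prior hyperparameters below are arbitrary):

```python
# Conjugate update for Bernoulli likelihood with a Beta(alpha, beta) prior:
# posterior is Beta(alpha + sum(x), beta + n - sum(x)).
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
x = rng.binomial(1, 0.7, size=100)   # Bernoulli(0.7) data
alpha, beta = 2.0, 2.0               # prior hyperparameters

post = stats.beta(alpha + x.sum(), beta + len(x) - x.sum())
print(post.mean(), post.interval(0.95))  # posterior mean, 95% equal-tail interval
```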
14.4 Bayesian Testing
If H_0: θ ∈ Θ_0:
Prior probability P[H_0] = ∫_{Θ_0} f(θ) dθ
Posterior probability P[H_0 | x^n] = ∫_{Θ_0} f(θ | x^n) dθ

Let H_0, ..., H_{K−1} be K hypotheses and suppose θ ~ f(θ | H_k). Then
P[H_k | x^n] = f(x^n | H_k) P[H_k] / Σ_{k} f(x^n | H_k) P[H_k]

Marginal likelihood
f(x^n | H_i) = ∫_Θ f(x^n | θ, H_i) f(θ | H_i) dθ

Posterior odds (of H_i relative to H_j)
P[H_i | x^n]/P[H_j | x^n] = (f(x^n | H_i)/f(x^n | H_j)) × (P[H_i]/P[H_j])
where the first factor is the Bayes factor BF_ij and the second is the prior odds.

Bayes factor
log_10 BF_10    BF_10      evidence
0–0.5           1–3.2      Weak
0.5–1           3.2–10     Moderate
1–2             10–100     Strong
> 2             > 100      Decisive

p* = ((p/(1 − p)) BF_10) / (1 + (p/(1 − p)) BF_10), where p = P[H_1] and p* = P[H_1 | x^n]
15 Exponential Family
Scalar parameter
f_X(x | θ) = h(x) exp{η(θ)T(x) − A(θ)} = h(x) g(θ) exp{η(θ)T(x)}
Vector parameter
f_X(x | θ) = h(x) exp{Σ_{i=1}^s η_i(θ)T_i(x) − A(θ)} = h(x) exp{η(θ)·T(x) − A(θ)} = h(x) g(θ) exp{η(θ)·T(x)}
Natural form
f_X(x | η) = h(x) exp{η·T(x) − A(η)} = h(x) g(η) exp{η·T(x)} = h(x) g(η) exp{ηᵀT(x)}
16 Sampling Methods

16.1 The Bootstrap
Let T_n = g(X_1, ..., X_n) be a statistic.
1. Estimate V_F[T_n] with V_{F̂_n}[T_n].
2. Approximate V_{F̂_n}[T_n] using simulation:
(a) Repeat the following B times to get T*_{n,1}, ..., T*_{n,B}, an iid sample from the sampling distribution implied by F̂_n:
  i. Sample uniformly X*_1, ..., X*_n ~ F̂_n.
  ii. Compute T*_n = g(X*_1, ..., X*_n).
(b) Then
v_boot = V̂_{F̂_n} = (1/B) Σ_{b=1}^B (T*_{n,b} − (1/B) Σ_{r=1}^B T*_{n,r})²
16.1.1 Bootstrap Confidence Intervals
Normal-based interval
T_n ± z_{α/2} ŝe_boot
Pivotal interval
1. Location parameter θ = T(F)
2. Pivot R_n = θ̂_n − θ
3. Let H(r) = P[R_n ≤ r] be the CDF of R_n
4. Let R*_{n,b} = θ̂*_{n,b} − θ̂_n. Approximate H using the bootstrap:
   Ĥ(r) = (1/B) Σ_{b=1}^B I(R*_{n,b} ≤ r)
5. θ*_β = β sample quantile of (θ̂*_{n,1}, ..., θ̂*_{n,B})
6. r*_β = β sample quantile of (R*_{n,1}, ..., R*_{n,B}), i.e., r*_β = θ*_β − θ̂_n
7. Approximate 1 − α confidence interval C_n = (â, b̂), where
   â = θ̂_n − Ĥ^{−1}(1 − α/2) = θ̂_n − r*_{1−α/2} = 2θ̂_n − θ*_{1−α/2}
   b̂ = θ̂_n − Ĥ^{−1}(α/2) = θ̂_n − r*_{α/2} = 2θ̂_n − θ*_{α/2}
Percentile interval
C_n = (θ*_{α/2}, θ*_{1−α/2})
16.2 Rejection Sampling
Setup
• We can easily sample from g(θ)
• We want to sample from h(θ), but it is difficult
• We know h(θ) up to a proportionality constant: h(θ) = k(θ)/∫ k(θ) dθ
• Envelope condition: we can find M > 0 such that k(θ) ≤ M g(θ) ∀θ

Algorithm
1. Draw θ_cand ~ g(θ)
2. Generate u ~ Unif(0, 1)
3. Accept θ_cand if u ≤ k(θ_cand)/(M g(θ_cand))
4. Repeat until B values of θ_cand have been accepted

Example
• We can easily sample from the prior, g(θ) = f(θ)
• Target is the posterior, h(θ) ∝ k(θ) = f(x^n | θ) f(θ)
• Envelope condition: f(x^n | θ) ≤ f(x^n | θ̂_n) = L_n(θ̂_n) ≡ M
• Algorithm: 1. Draw θ_cand ~ f(θ);  2. Generate u ~ Unif(0, 1);  3. Accept θ_cand if u ≤ L_n(θ_cand)/L_n(θ̂_n)
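A sketch (added; the Bernoulli model and uniform prior are illustrative choices) of this prior-as-proposal scheme, done in log space for numerical stability:

```python
# Rejection sampling from a posterior using the prior as the proposal:
# Bernoulli(theta) likelihood, Unif(0,1) prior, M = L_n(theta_mle).
import numpy as np

rng = np.random.default_rng(11)
x = rng.binomial(1, 0.3, size=50)
s, n = x.sum(), len(x)

def log_lik(theta):
    return s * np.log(theta) + (n - s) * np.log(1 - theta)

theta_mle = s / n
accepted = []
while len(accepted) < 2_000:
    cand = rng.uniform()  # draw from the prior
    # accept if u <= L(cand) / L(mle), i.e. log u <= log L(cand) - log L(mle)
    if np.log(rng.uniform()) <= log_lik(cand) - log_lik(theta_mle):
        accepted.append(cand)

print(np.mean(accepted))  # ~ posterior mean (s + 1) / (n + 2)
```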
16.3 Importance Sampling
Sample from an importance function g rather than the target density h. Algorithm to obtain an approximation of E[q(θ) | x^n]:
1. Sample from the prior: θ_1, ..., θ_B iid ~ f(θ)
2. Compute the normalized weights w_i = L_n(θ_i)/Σ_{i=1}^B L_n(θ_i) for i = 1, ..., B
3. E[q(θ) | x^n] ≈ Σ_{i=1}^B q(θ_i) w_i
17 Decision Theory

Definitions
• Unknown quantity affecting our decision: θ ∈ Θ
• Decision rule: synonymous with an estimator θ̂
• Action a ∈ A: possible value of the decision rule. In the estimation context, the action is just an estimate of θ, θ̂(x).
• Loss function L: consequences of taking action a when the true state is θ, or the discrepancy between θ and θ̂; L: Θ × A → [−k, ∞).

Loss functions
• Squared error loss: L(θ, a) = (θ − a)²
• Linear loss: L(θ, a) = K_1(θ − a) if a − θ < 0;  K_2(a − θ) if a − θ ≥ 0
• Absolute error loss: L(θ, a) = |θ − a| (linear loss with K_1 = K_2)
• L_p loss: L(θ, a) = |θ − a|^p
• Zero-one loss: L(θ, a) = 0 if a = θ;  1 if a ≠ θ

17.1 Risk
Posterior risk
r(θ̂ | x) = ∫ L(θ, θ̂(x)) f(θ | x) dθ = E_{θ|X}[L(θ, θ̂(x))]
(Frequentist) risk
R(θ, θ̂) = ∫ L(θ, θ̂(x)) f(x | θ) dx = E_{X|θ}[L(θ, θ̂(X))]
Bayes risk
r(f, θ̂) = ∫∫ L(θ, θ̂(x)) f(x, θ) dx dθ = E_{θ,X}[L(θ, θ̂(X))]
r(f, θ̂) = E_θ[E_{X|θ}[L(θ, θ̂(X))]] = E_θ[R(θ, θ̂)]
r(f, θ̂) = E_X[E_{θ|X}[L(θ, θ̂(X))]] = E_X[r(θ̂ | X)]

17.2 Admissibility
• θ̂′ dominates θ̂ if
  ∀θ: R(θ, θ̂′) ≤ R(θ, θ̂)
  ∃θ: R(θ, θ̂′) < R(θ, θ̂)
• θ̂ is inadmissible if there is at least one other estimator θ̂′ that dominates it. Otherwise it is called admissible.

17.3 Bayes Rule
Bayes rule (or Bayes estimator)
• r(f, θ̂) = inf_{θ̃} r(f, θ̃)
• If θ̂(x) minimizes the posterior risk r(θ̂ | x) for every x, then r(f, θ̂) = ∫ r(θ̂ | x) f(x) dx
Theorems
• Squared error loss: posterior mean
• Absolute error loss: posterior median
• Zero-one loss: posterior mode

17.4 Minimax Rules
Maximum risk
R̄(θ̂) = sup_θ R(θ, θ̂),  R̄(a) = sup_θ R(θ, a)
Minimax rule
sup_θ R(θ, θ̂) = inf_{θ̃} R̄(θ̃) = inf_{θ̃} sup_θ R(θ, θ̃)
θ̂ = Bayes rule ∧ ∃c: R(θ, θ̂) = c ⟹ θ̂ is minimax
Least favorable prior
θ̂^f = Bayes rule ∧ R(θ, θ̂^f) ≤ r(f, θ̂^f) ∀θ ⟹ θ̂^f is minimax
18 Linear Regression

Definitions
• Response variable Y
• Covariate X (aka predictor variable or feature)

18.1 Simple Linear Regression
Model
Y_i = β_0 + β_1 X_i + ε_i,  E[ε_i | X_i] = 0,  V[ε_i | X_i] = σ²
Fitted line
r̂(x) = β̂_0 + β̂_1 x
Predicted (fitted) values
Ŷ_i = r̂(X_i)
Residuals
ε̂_i = Y_i − Ŷ_i = Y_i − (β̂_0 + β̂_1 X_i)
Residual sum of squares (rss)
rss(β̂_0, β̂_1) = Σ_{i=1}^n ε̂_i²
Least squares estimates
β̂ = (β̂_0, β̂_1)ᵀ minimizes rss(β̂_0, β̂_1):
β̂_0 = Ȳ_n − β̂_1 X̄_n
β̂_1 = Σ_{i=1}^n (X_i − X̄_n)(Y_i − Ȳ_n)/Σ_{i=1}^n (X_i − X̄_n)² = (Σ_{i=1}^n X_i Y_i − nX̄Ȳ)/(Σ_{i=1}^n X_i² − nX̄²)
E[β̂ | X^n] = (β_0, β_1)ᵀ
V[β̂ | X^n] = (σ²/(n s_X²)) [ n^{−1} Σ_{i=1}^n X_i²   −X̄_n ;  −X̄_n   1 ]
ŝe(β̂_0) = (σ̂/(s_X √n)) √(Σ_{i=1}^n X_i²/n)
ŝe(β̂_1) = σ̂/(s_X √n)
where s_X² = n^{−1} Σ_{i=1}^n (X_i − X̄_n)² and σ̂² = (1/(n − 2)) Σ_{i=1}^n ε̂_i² (unbiased estimate).

Further properties:
• Consistency: β̂_0 →P β_0 and β̂_1 →P β_1
• Asymptotic normality: (β̂_0 − β_0)/ŝe(β̂_0) →D N(0, 1) and (β̂_1 − β_1)/ŝe(β̂_1) →D N(0, 1)
• Approximate 1 − α confidence intervals for β_0 and β_1:
  β̂_0 ± z_{α/2} ŝe(β̂_0) and β̂_1 ± z_{α/2} ŝe(β̂_1)
• Wald test for H_0: β_1 = 0 vs. H_1: β_1 ≠ 0: reject H_0 if |W| > z_{α/2}, where W = β̂_1/ŝe(β̂_1).
R²
R² = Σ_{i=1}^n (Ŷ_i − Ȳ)²/Σ_{i=1}^n (Y_i − Ȳ)² = 1 − Σ_{i=1}^n ε̂_i²/Σ_{i=1}^n (Y_i − Ȳ)² = 1 − rss/tss

Likelihood
L = Π_{i=1}^n f(X_i, Y_i) = Π_{i=1}^n f_X(X_i) × Π_{i=1}^n f_{Y|X}(Y_i | X_i) = L_1 × L_2
L_1 = Π_{i=1}^n f_X(X_i)
L_2 = Π_{i=1}^n f_{Y|X}(Y_i | X_i) ∝ σ^{−n} exp{−(1/(2σ²)) Σ_i (Y_i − (β_0 + β_1 X_i))²}
Under the assumption of Normality, the least squares estimator is also the MLE, with
σ̂² = (1/n) Σ_{i=1}^n ε̂_i²
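A self-contained sketch (added; the simulated data are arbitrary) of the closed-form estimates and the standard error of the slope:

```python
# Simple linear regression: closed-form least squares and se(beta1_hat).
import numpy as np

rng = np.random.default_rng(12)
n = 200
X = rng.uniform(0, 10, n)
Y = 1.0 + 2.5 * X + rng.normal(0, 1.5, n)  # true beta0=1, beta1=2.5

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean())**2)
b0 = Y.mean() - b1 * X.mean()

resid = Y - (b0 + b1 * X)
sigma2_hat = resid @ resid / (n - 2)        # unbiased estimate of sigma^2
s2_X = np.mean((X - X.mean())**2)
se_b1 = np.sqrt(sigma2_hat) / (np.sqrt(s2_X) * np.sqrt(n))

print(b0, b1, se_b1)  # ~1, ~2.5, small
```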
18.2 Prediction
Observe the covariate value X = x_* and predict the outcome Y_*:
Ŷ_* = β̂_0 + β̂_1 x_*
V[Ŷ_*] = V[β̂_0] + x_*² V[β̂_1] + 2x_* Cov[β̂_0, β̂_1]
Prediction interval
ξ̂_n² = σ̂² (Σ_{i=1}^n (X_i − x_*)²/(n Σ_i (X_i − X̄)²) + 1)
Ŷ_* ± z_{α/2} ξ̂_n
18.3 Multiple Regression
Y = Xβ + ε, where
X = [ X_{11} ⋯ X_{1k} ; ⋮ ⋱ ⋮ ; X_{n1} ⋯ X_{nk} ],  β = (β_1, ..., β_k)ᵀ,  ε = (ε_1, ..., ε_n)ᵀ
Likelihood
L(µ, Σ) = (2πσ²)^{−n/2} exp{−(1/(2σ²)) rss}
rss = (y − Xβ)ᵀ(y − Xβ) = ‖Y − Xβ‖² = Σ_{i=1}^N (Y_i − x_iᵀβ)²
If the (k × k) matrix XᵀX is invertible,
β̂ = (XᵀX)^{−1} XᵀY
V[β̂ | X^n] = σ² (XᵀX)^{−1}
β̂ ≈ N(β, σ² (XᵀX)^{−1})
Estimated regression function
r̂(x) = Σ_{j=1}^k β̂_j x_j
Unbiased estimate for σ²
σ̂² = (1/(n − k)) Σ_{i=1}^n ε̂_i²,  ε̂ = Xβ̂ − Y
MLE
µ̂ = X̄,  σ̂²_mle = ((n − k)/n) σ̂²
1 − α confidence interval
β̂_j ± z_{α/2} ŝe(β̂_j)
18.4 Model Selection
Consider predicting a new observation Y_* for covariates X_* and let S ⊂ J denote a subset of the covariates in the model, where |S| = k and |J| = n.

Issues
• Underfitting: too few covariates yields high bias
• Overfitting: too many covariates yields high variance

Procedure
1. Assign a score to each model
2. Search through all models to find the one with the highest score

Hypothesis testing
H_0: β_j = 0 vs. H_1: β_j ≠ 0  ∀j ∈ J
Mean squared prediction error (mspe)
mspe = E[(Ŷ(S) − Y_*)²]
Prediction risk
R(S) = Σ_{i=1}^n mspe_i = Σ_{i=1}^n E[(Ŷ_i(S) − Y*_i)²]
Training error
R̂_tr(S) = Σ_{i=1}^n (Ŷ_i(S) − Y_i)²
R²
R²(S) = 1 − rss(S)/tss = 1 − R̂_tr(S)/tss
The training error is a downward-biased estimate of the prediction risk:
E[R̂_tr(S)] < R(S)
bias(R̂_tr(S)) = E[R̂_tr(S)] − R(S) = −2 Σ_{i=1}^n Cov[Ŷ_i, Y_i]
Adjusted R²
R̄²(S) = 1 − ((n − 1)/(n − k)) rss/tss
Mallow's C_p statistic
R̂(S) = R̂_tr(S) + 2kσ̂² = lack of fit + complexity penalty
Akaike Information Criterion (AIC)
AIC(S) = ℓ_n(β̂_S, σ̂²_S) − k
Bayesian Information Criterion (BIC)
BIC(S) = ℓ_n(β̂_S, σ̂²_S) − (k/2) log n
Validation and training
R̂_V(S) = Σ_{i=1}^m (Ŷ*_i(S) − Y*_i)²,  m = |{validation data}|, often n/4 or n/2
Leave-one-out cross-validation
R̂_CV(S) = Σ_{i=1}^n (Y_i − Ŷ_{(i)})² = Σ_{i=1}^n ((Y_i − Ŷ_i(S))/(1 − U_{ii}(S)))²
U(S) = X_S(X_SᵀX_S)^{−1}X_Sᵀ (the "hat matrix")
19 Non-parametric Function Estimation

19.1 Density Estimation
Estimate f(x), where P[X ∈ A] = ∫_A f(x) dx.
Integrated square error (ise)
L(f, f̂_n) = ∫ (f(x) − f̂_n(x))² dx = J(h) + ∫ f²(x) dx
Frequentist risk
R(f, f̂_n) = E[L(f, f̂_n)] = ∫ b²(x) dx + ∫ v(x) dx
b(x) = E[f̂_n(x)] − f(x),  v(x) = V[f̂_n(x)]

19.1.1 Histograms
Definitions
• Number of bins m
• Binwidth h = 1/m
• Bin B_j has ν_j observations
• Define p̂_j = ν_j/n and p_j = ∫_{B_j} f(u) du
Histogram estimator
f̂_n(x) = Σ_{j=1}^m (p̂_j/h) I(x ∈ B_j)
E[f̂_n(x)] = p_j/h
V[f̂_n(x)] = p_j(1 − p_j)/(nh²)
R(f̂_n, f) ≈ (h²/12) ∫ (f′(u))² du + 1/(nh)
h* = (1/n^{1/3}) (6/∫ (f′(u))² du)^{1/3}
R*(f̂_n, f) ≈ C/n^{2/3},  C = (3/4)^{2/3} (∫ (f′(u))² du)^{1/3}
Cross-validation estimate of E[J(h)]
Ĵ_CV(h) = ∫ f̂_n²(x) dx − (2/n) Σ_{i=1}^n f̂_{(−i)}(X_i) = 2/((n − 1)h) − ((n + 1)/((n − 1)h)) Σ_{j=1}^m p̂_j²
19.1.2 Kernel Density Estimator (KDE)
Kernel K
• K(x) ≥ 0
• ∫ K(x) dx = 1
• ∫ x K(x) dx = 0
• ∫ x² K(x) dx ≡ σ²_K > 0
KDE
f̂_n(x) = (1/n) Σ_{i=1}^n (1/h) K((x − X_i)/h)
R(f, f̂_n) ≈ (1/4)(hσ_K)⁴ ∫ (f″(x))² dx + (1/(nh)) ∫ K²(x) dx
h* = c_1^{−2/5} c_2^{1/5} c_3^{−1/5} n^{−1/5},  c_1 = σ²_K,  c_2 = ∫ K²(x) dx,  c_3 = ∫ (f″(x))² dx
R*(f, f̂_n) = c_4/n^{4/5},  c_4 = (5/4) (σ²_K)^{2/5} (∫ K²(x) dx)^{4/5} (∫ (f″)² dx)^{1/5}, where the kernel-dependent factor (σ²_K)^{2/5}(∫ K²(x) dx)^{4/5} is denoted C(K)
Epanechnikov kernel
K(x) = (3/(4√5))(1 − x²/5) for |x| < √5,  0 otherwise
Cross-validation estimate of E[J(h)]
Ĵ_CV(h) = ∫ f̂_n²(x) dx − (2/n) Σ_{i=1}^n f̂_{(−i)}(X_i) ≈ (1/(hn²)) Σ_{i=1}^n Σ_{j=1}^n K*((X_i − X_j)/h) + (2/(nh)) K(0)
K*(x) = K^{(2)}(x) − 2K(x),  K^{(2)}(x) = ∫ K(x − y) K(y) dy
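A direct implementation sketch (added; the Gaussian kernel and bandwidth below are arbitrary choices) of the KDE formula:

```python
# Gaussian KDE implementing f_hat(x) = (1/n) sum_i K((x - X_i)/h) / h.
import numpy as np

def kde(x_grid, data, h):
    # rows: grid points, cols: data points
    u = (x_grid[:, None] - data[None, :]) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)  # Gaussian kernel
    return K.mean(axis=1) / h

rng = np.random.default_rng(13)
data = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 0.5, 500)])
grid = np.linspace(-6, 6, 200)
f_hat = kde(grid, data, h=0.3)
print(f_hat.sum() * (grid[1] - grid[0]))  # ~1: the estimate integrates to one
```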
19.2 Non-parametric Regression
Estimate r(x), where r(x) = E[Y | X = x]. Consider pairs of points (x_1, Y_1), ..., (x_n, Y_n) related by
Y_i = r(x_i) + ε_i,  E[ε_i] = 0,  V[ε_i] = σ²

k-nearest Neighbor Estimator
r̂(x) = (1/k) Σ_{i: x_i ∈ N_k(x)} Y_i, where N_k(x) = {k values of x_1, ..., x_n closest to x}

Nadaraya-Watson Kernel Estimator
r̂(x) = Σ_{i=1}^n w_i(x) Y_i
w_i(x) = K((x − x_i)/h)/Σ_{j=1}^n K((x − x_j)/h) ∈ [0, 1]
R(r̂_n, r) ≈ (h⁴/4) (∫ x² K(x) dx)² ∫ (r″(x) + 2r′(x) f′(x)/f(x))² dx + (σ² ∫ K²(x) dx)/(nh) ∫ dx/f(x)
h* ≈ c_1/n^{1/5},  R*(r̂_n, r) ≈ c_2/n^{4/5}
Cross-validation estimate of E[J(h)]
Ĵ_CV(h) = Σ_{i=1}^n (Y_i − r̂_{(−i)}(x_i))² = Σ_{i=1}^n ((Y_i − r̂(x_i))/(1 − K(0)/Σ_{j=1}^n K((x_i − x_j)/h)))²
19.3 Smoothing Using Orthogonal Functions
Approximation
r(x) = Σ_{j=1}^∞ β_j φ_j(x) ≈ Σ_{j=1}^J β_j φ_j(x)
Multivariate regression
Y = Φβ + η, where η_i = ε_i and
Φ = [ φ_0(x_1) ⋯ φ_J(x_1) ; ⋮ ⋱ ⋮ ; φ_0(x_n) ⋯ φ_J(x_n) ]
Least squares estimator
β̂ = (ΦᵀΦ)^{−1} ΦᵀY ≈ (1/n) ΦᵀY   (for equally spaced observations only)
Cross-validation estimate
R̂_CV(J) = Σ_{i=1}^n (Y_i − Σ_{j=1}^J φ_j(x_i) β̂_{j,(−i)})²
20 Stochastic Processes
Stochastic Process
{X_t : t ∈ T},  T = {0, ±1, ...} = ℤ (discrete) or [0, ∞) (continuous)
• Notations: X_t, X(t)
• State space X
• Index set T

20.1 Markov Chains
Markov chain
P[X_n = x | X_0, ..., X_{n−1}] = P[X_n = x | X_{n−1}]  ∀n ∈ T, x ∈ X
Transition probabilities
p_{ij} ≡ P[X_{n+1} = j | X_n = i]
p_{ij}(n) ≡ P[X_{m+n} = j | X_m = i]   (n-step)
Transition matrix P (n-step: P_n)
• (i, j) element is p_{ij}
• p_{ij} ≥ 0
• Σ_j p_{ij} = 1 (each row sums to one)
Chapman-Kolmogorov
p_{ij}(m + n) = Σ_k p_{ik}(m) p_{kj}(n)
P_{m+n} = P_m P_n
P_n = P × ⋯ × P = P^n
Marginal probability
µ_n = (µ_n(1), ..., µ_n(N)), where µ_n(i) = P[X_n = i]
µ_0: initial distribution
µ_n = µ_0 P^n
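A tiny sketch (added; the two-state chain below is an arbitrary example) of evolving the marginal via µ_n = µ_0 P^n:

```python
# Evolve the marginal distribution of a two-state Markov chain: mu_n = mu_0 P^n.
import numpy as np

P = np.array([[0.9, 0.1],
              [0.4, 0.6]])  # rows sum to 1
mu = np.array([1.0, 0.0])   # start in state 0

for _ in range(50):
    mu = mu @ P
print(mu)  # approaches the stationary distribution (0.8, 0.2)
```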
20.2 Poisson Processes
Poisson process
• {X_t : t ∈ [0, ∞)} = number of events up to and including time t
• X_0 = 0
• Independent increments: ∀t_0 < ⋯ < t_n: X_{t_1} − X_{t_0} ⊥⊥ ⋯ ⊥⊥ X_{t_n} − X_{t_{n−1}}
• Intensity function λ(t)
  – P[X_{t+h} − X_t = 1] = λ(t)h + o(h)
  – P[X_{t+h} − X_t ≥ 2] = o(h)
• X_{s+t} − X_s ~ Po(m(s + t) − m(s)), where m(t) = ∫_0^t λ(s) ds
Homogeneous Poisson process
λ(t) ≡ λ ⟹ X_t ~ Po(λt),  λ > 0
Waiting times
W_t := time at which X_t occurs
W_t ~ Gamma(t, 1/λ)
Interarrival times
S_t = W_{t+1} − W_t
S_t ~ Exp(1/λ)
[Figure: timeline marking the waiting times W_{t−1}, W_t and the interarrival time S_t between them.]
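A simulation sketch (added; rate and horizon are arbitrary) of a homogeneous Poisson process built from iid exponential interarrival times:

```python
# Homogeneous Poisson process via iid Exp(1/lambda) interarrival times.
import numpy as np

rng = np.random.default_rng(14)
lam, horizon = 2.0, 1000.0
arrivals = np.cumsum(rng.exponential(1 / lam, size=int(3 * lam * horizon)))
arrivals = arrivals[arrivals <= horizon]

print(len(arrivals) / horizon)  # ~ lambda events per unit time
```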
21 Time Series
Mean function
µ_{xt} = E[x_t] = ∫_{−∞}^∞ x f_t(x) dx
Autocovariance function
γ_x(s, t) = E[(x_s − µ_s)(x_t − µ_t)] = E[x_s x_t] − µ_s µ_t
γ_x(t, t) = E[(x_t − µ_t)²] = V[x_t]
Autocorrelation function (ACF)
ρ(s, t) = Cov[x_s, x_t]/√(V[x_s] V[x_t]) = γ(s, t)/√(γ(s, s) γ(t, t))
Cross-covariance function (CCV)
γ_{xy}(s, t) = E[(x_s − µ_{xs})(y_t − µ_{yt})]
Cross-correlation function (CCF)
ρ_{xy}(s, t) = γ_{xy}(s, t)/√(γ_x(s, s) γ_y(t, t))
Backshift operator
B^k(x_t) = x_{t−k}
Difference operator
∇^d = (1 − B)^d