Probability and Statistics Cookbook

Copyright © Matthias Vallentin, 2011 (vallentin@icir.org)

This cookbook integrates a variety of topics in probability theory and statistics. It is based on literature [1, 6, 3] and in-class material from courses of the statistics department at the University of California, Berkeley, but is also influenced by other sources [4, 5]. If you find errors or have suggestions for further topics, I would appreciate if you send me an email. The most recent version of this document is available at http://matthias.vallentin.net/probability-and-statistics-cookbook/. To reproduce, please contact me.

Contents

1 Distribution Overview
  1.1 Discrete Distributions
  1.2 Continuous Distributions
2 Probability Theory
3 Random Variables
  3.1 Transformations
4 Expectation
5 Variance
6 Inequalities
7 Distribution Relationships
8 Probability and Moment Generating Functions
9 Multivariate Distributions
  9.1 Standard Bivariate Normal
  9.2 Bivariate Normal
  9.3 Multivariate Normal
10 Convergence
  10.1 Law of Large Numbers (LLN)
  10.2 Central Limit Theorem (CLT)
11 Statistical Inference
  11.1 Point Estimation
  11.2 Normal-Based Confidence Interval
  11.3 Empirical Distribution
  11.4 Statistical Functionals
12 Parametric Inference
  12.1 Method of Moments
  12.2 Maximum Likelihood
    12.2.1 Delta Method
  12.3 Multiparameter Models
    12.3.1 Multiparameter Delta Method
  12.4 Parametric Bootstrap
13 Hypothesis Testing
14 Bayesian Inference
  14.1 Credible Intervals
  14.2 Function of Parameters
  14.3 Priors
    14.3.1 Conjugate Priors
  14.4 Bayesian Testing
15 Exponential Family
16 Sampling Methods
  16.1 The Bootstrap
    16.1.1 Bootstrap Confidence Intervals
  16.2 Rejection Sampling
  16.3 Importance Sampling
17 Decision Theory
  17.1 Risk
  17.2 Admissibility
  17.3 Bayes Rule
  17.4 Minimax Rules
18 Linear Regression
  18.1 Simple Linear Regression
  18.2 Prediction
  18.3 Multiple Regression
  18.4 Model Selection
19 Non-parametric Function Estimation
  19.1 Density Estimation
    19.1.1 Histograms
    19.1.2 Kernel Density Estimator (KDE)
  19.2 Non-parametric Regression
  19.3 Smoothing Using Orthogonal Functions
20 Stochastic Processes
  20.1 Markov Chains
  20.2 Poisson Processes
21 Time Series
  21.1 Stationary Time Series
  21.2 Estimation of Correlation
  21.3 Non-Stationary Time Series
    21.3.1 Detrending
  21.4 ARIMA Models
    21.4.1 Causality and Invertibility
  21.5 Spectral Analysis
22 Math
  22.1 Gamma Function
  22.2 Beta Function
  22.3 Series

1 Distribution Overview

1.1 Discrete Distributions

For each distribution the entries give the notation, CDF F_X(x), PMF f_X(x), mean E[X], variance V[X], and MGF M_X(s).¹

Uniform Unif{a, …, b}:
  F_X(x) = 0 for x < a,  (⌊x⌋ − a + 1)/(b − a + 1) for a ≤ x ≤ b,  1 for x > b
  f_X(x) = I(a ≤ x ≤ b)/(b − a + 1)
  E[X] = (a + b)/2,  V[X] = ((b − a + 1)^2 − 1)/12
  M_X(s) = (e^{as} − e^{(b+1)s})/(s(b − a))

Bernoulli Bern(p):
  F_X(x) = (1 − p)^{1−x},  f_X(x) = p^x (1 − p)^{1−x}
  E[X] = p,  V[X] = p(1 − p),  M_X(s) = 1 − p + pe^s

Binomial Bin(n, p):
  F_X(x) = I_{1−p}(n − x, x + 1),  f_X(x) = (n choose x) p^x (1 − p)^{n−x}
  E[X] = np,  V[X] = np(1 − p),  M_X(s) = (1 − p + pe^s)^n

Multinomial Mult(n, p):
  f_X(x) = n!/(x_1! ⋯ x_k!) · p_1^{x_1} ⋯ p_k^{x_k}  with Σ_{i=1}^k x_i = n
  E[X_i] = np_i,  V[X_i] = np_i(1 − p_i),  M_X(s) = (Σ_{i=1}^k p_i e^{s_i})^n

Hypergeometric Hyp(N, m, n):
  F_X(x) ≈ Φ((x − np)/√(np(1 − p))),  f_X(x) = (m choose x)(N − m choose n − x)/(N choose n)
  E[X] = nm/N,  V[X] = nm(N − n)(N − m)/(N^2(N − 1))

Negative Binomial NBin(r, p):
  F_X(x) = I_p(r, x + 1),  f_X(x) = (x + r − 1 choose r − 1) p^r (1 − p)^x
  E[X] = r(1 − p)/p,  V[X] = r(1 − p)/p^2,  M_X(s) = (p/(1 − (1 − p)e^s))^r

Geometric Geo(p):
  F_X(x) = 1 − (1 − p)^x,  f_X(x) = p(1 − p)^{x−1},  x ∈ ℕ⁺
  E[X] = 1/p,  V[X] = (1 − p)/p^2,  M_X(s) = pe^s/(1 − (1 − p)e^s)

Poisson Po(λ):
  F_X(x) = e^{−λ} Σ_{i=0}^{⌊x⌋} λ^i/i!,  f_X(x) = λ^x e^{−λ}/x!
  E[X] = λ,  V[X] = λ,  M_X(s) = e^{λ(e^s − 1)}

[Figure: PMFs of the discrete distributions — Uniform (discrete), Binomial (n=40, p=0.3; n=30, p=0.6; n=25, p=0.9), Geometric (p=0.2, 0.5, 0.8), and Poisson (λ=1, 4, 10).]

¹We use the notation γ(s, x) and Γ(x) to refer to the Gamma functions (see §22.1), and use B(x, y) and I_x to refer to the Beta functions (see §22.2).

1.2 Continuous Distributions

Uniform Unif(a, b):
  F_X(x) = 0 for x < a,  (x − a)/(b − a) for a < x < b,  1 for x > b
  f_X(x) = I(a < x < b)/(b − a)
  E[X] = (a + b)/2,  V[X] = (b − a)^2/12
  M_X(s) = (e^{sb} − e^{sa})/(s(b − a))

Normal N(µ, σ^2):
  F_X(x) = Φ(x) = ∫_{−∞}^x φ(t) dt
  f_X(x) = φ(x) = (1/(σ√(2π))) exp(−(x − µ)^2/(2σ^2))
  E[X] = µ,  V[X] = σ^2,  M_X(s) = exp(µs + σ^2 s^2/2)

Log-Normal ln N(µ, σ^2):
  F_X(x) = 1/2 + (1/2) erf((ln x − µ)/√(2σ^2))
  f_X(x) = (1/(x√(2πσ^2))) exp(−(ln x − µ)^2/(2σ^2))
  E[X] = e^{µ + σ^2/2},  V[X] = (e^{σ^2} − 1) e^{2µ + σ^2}

Multivariate Normal MVN(µ, Σ):
  f_X(x) = (2π)^{−k/2} |Σ|^{−1/2} exp(−(1/2)(x − µ)^T Σ^{−1}(x − µ))
  E[X] = µ,  V[X] = Σ,  M_X(s) = exp(µ^T s + (1/2) s^T Σ s)

Student's t Student(ν):
  f_X(x) = (Γ((ν + 1)/2)/(√(νπ) Γ(ν/2))) (1 + x^2/ν)^{−(ν+1)/2}
  E[X] = 0 (ν > 1),  V[X] = ν/(ν − 2) (ν > 2)

Chi-square χ^2_k:
  F_X(x) = γ(k/2, x/2)/Γ(k/2),  f_X(x) = (1/(2^{k/2} Γ(k/2))) x^{k/2−1} e^{−x/2}
  E[X] = k,  V[X] = 2k,  M_X(s) = (1 − 2s)^{−k/2} for s < 1/2

F F(d1, d2):
  F_X(x) = I_{d1·x/(d1·x + d2)}(d1/2, d2/2)
  f_X(x) = √((d1·x)^{d1} d2^{d2} / (d1·x + d2)^{d1+d2}) / (x B(d1/2, d2/2))
  E[X] = d2/(d2 − 2) (d2 > 2),  V[X] = 2 d2^2 (d1 + d2 − 2)/(d1 (d2 − 2)^2 (d2 − 4)) (d2 > 4)

Exponential Exp(β):
  F_X(x) = 1 − e^{−x/β},  f_X(x) = (1/β) e^{−x/β}
  E[X] = β,  V[X] = β^2,  M_X(s) = 1/(1 − βs) for s < 1/β

Gamma Gamma(α, β):
  F_X(x) = γ(α, x/β)/Γ(α),  f_X(x) = (1/(Γ(α) β^α)) x^{α−1} e^{−x/β}
  E[X] = αβ,  V[X] = αβ^2,  M_X(s) = (1/(1 − βs))^α for s < 1/β

Inverse Gamma InvGamma(α, β):
  F_X(x) = Γ(α, β/x)/Γ(α),  f_X(x) = (β^α/Γ(α)) x^{−α−1} e^{−β/x}
  E[X] = β/(α − 1) (α > 1),  V[X] = β^2/((α − 1)^2(α − 2)) (α > 2)
  M_X(s) = (2(−βs)^{α/2}/Γ(α)) K_α(√(−4βs))

Dirichlet Dir(α):
  f_X(x) = (Γ(Σ_{i=1}^k α_i)/Π_{i=1}^k Γ(α_i)) Π_{i=1}^k x_i^{α_i − 1}
  E[X_i] = α_i/Σ_{i=1}^k α_i,  V[X_i] = E[X_i](1 − E[X_i])/(Σ_{i=1}^k α_i + 1)

Beta Beta(α, β):
  F_X(x) = I_x(α, β),  f_X(x) = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1}(1 − x)^{β−1}
  E[X] = α/(α + β),  V[X] = αβ/((α + β)^2(α + β + 1))
  M_X(s) = 1 + Σ_{k=1}^∞ (Π_{r=0}^{k−1} (α + r)/(α + β + r)) s^k/k!

Weibull Weibull(λ, k):
  F_X(x) = 1 − e^{−(x/λ)^k},  f_X(x) = (k/λ)(x/λ)^{k−1} e^{−(x/λ)^k}
  E[X] = λΓ(1 + 1/k),  V[X] = λ^2 Γ(1 + 2/k) − µ^2
  M_X(s) = Σ_{n=0}^∞ (s^n λ^n/n!) Γ(1 + n/k)

Pareto Pareto(x_m, α):
  F_X(x) = 1 − (x_m/x)^α for x ≥ x_m,  f_X(x) = α x_m^α/x^{α+1} for x ≥ x_m
  E[X] = αx_m/(α − 1) (α > 1),  V[X] = x_m^2 α/((α − 1)^2(α − 2)) (α > 2)
  M_X(s) = α(−x_m s)^α Γ(−α, −x_m s) for s < 0

[Figure: PDFs of the continuous distributions — Uniform (continuous), Normal (µ=0, σ^2=0.2, 1, 5; µ=−2, σ^2=0.5), Log-normal, Student's t (ν=1, 2, 5, ∞), χ^2 (k=1,…,5), F (various d1, d2), Exponential (β=2, 1, 0.4), Gamma, Inverse Gamma, Beta, Weibull (λ=1; k=0.5, 1, 1.5, 5), and Pareto (x_m=1; α=1, 2, 4).]

2 Probability Theory

Definitions
• Sample space Ω
• Outcome (point or element) ω ∈ Ω
• Event A ⊆ Ω
• σ-algebra 𝒜:
  1. ∅ ∈ 𝒜
  2. A_1, A_2, … ∈ 𝒜 ⟹ ∪_{i=1}^∞ A_i ∈ 𝒜
  3. A ∈ 𝒜 ⟹ ¬A ∈ 𝒜
• Probability distribution P:
  1. P[A] ≥ 0 for every A
  2. P[Ω] = 1
  3. P[⊔_{i=1}^∞ A_i] = Σ_{i=1}^∞ P[A_i] for disjoint A_i
• Probability space (Ω, 𝒜, P)

Properties
• P[∅] = 0
• B = Ω ∩ B = (A ∪ ¬A) ∩ B = (A ∩ B) ∪ (¬A ∩ B)
• P[¬A] = 1 − P[A]
• P[B] = P[A ∩ B] + P[¬A ∩ B]
• P[Ω] = 1, P[∅] = 0
• De Morgan: ¬(∪_n A_n) = ∩_n ¬A_n and ¬(∩_n A_n) = ∪_n ¬A_n
• P[∪_n A_n] = 1 − P[∩_n ¬A_n]
• P[A ∪ B] = P[A] + P[B] − P[A ∩ B] ⟹ P[A ∪ B] ≤ P[A] + P[B]
• P[A ∪ B] = P[A ∩ ¬B] + P[¬A ∩ B] + P[A ∩ B]
• P[A ∩ ¬B] = P[A] − P[A ∩ B]

Continuity of Probabilities
• A_1 ⊂ A_2 ⊂ … ⟹ lim_{n→∞} P[A_n] = P[A] where A = ∪_{i=1}^∞ A_i
• A_1 ⊃ A_2 ⊃ … ⟹ lim_{n→∞} P[A_n] = P[A] where A = ∩_{i=1}^∞ A_i

Independence ⊥⊥
  A ⊥⊥ B ⟺ P[A ∩ B] = P[A] P[B]

Conditional Probability
  P[A | B] = P[A ∩ B]/P[B],  P[B] > 0

Law of Total Probability
  P[B] = Σ_{i=1}^n P[B | A_i] P[A_i]  where Ω = ⊔_{i=1}^n A_i

Bayes' Theorem
  P[A_i | B] = P[B | A_i] P[A_i] / Σ_{j=1}^n P[B | A_j] P[A_j]  where Ω = ⊔_{i=1}^n A_i

Inclusion-Exclusion Principle
  |∪_{i=1}^n A_i| = Σ_{r=1}^n (−1)^{r−1} Σ_{i_1 < ⋯ < i_r ≤ n} |A_{i_1} ∩ ⋯ ∩ A_{i_r}|

3 Random Variables

Random Variable (RV)
  X : Ω → ℝ

Probability Mass Function (PMF)
  f_X(x) = P[X = x] = P[{ω ∈ Ω : X(ω) = x}]

Probability Density Function (PDF)
  P[a ≤ X ≤ b] = ∫_a^b f(x) dx

Cumulative Distribution Function (CDF)
  F_X : ℝ → [0, 1],  F_X(x) = P[X ≤ x]
  1. Nondecreasing: x_1 < x_2 ⟹ F(x_1) ≤ F(x_2)
  2. Normalized: lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1
  3. Right-continuous: lim_{y↓x} F(y) = F(x)

Conditional density
  P[a ≤ Y ≤ b | X = x] = ∫_a^b f_{Y|X}(y | x) dy,  a ≤ b
  f_{Y|X}(y | x) = f(x, y)/f_X(x)

Independence
  1. P[X ≤ x, Y ≤ y] = P[X ≤ x] P[Y ≤ y]
  2. f_{X,Y}(x, y) = f_X(x) f_Y(y)

3.1 Transformations

Transformation function
  Z = φ(X)

Discrete
  f_Z(z) = P[φ(X) = z] = P[{x : φ(x) = z}] = P[X ∈ φ^{−1}(z)] = Σ_{x ∈ φ^{−1}(z)} f(x)

Continuous
  F_Z(z) = P[φ(X) ≤ z] = ∫_{A_z} f(x) dx  with A_z = {x : φ(x) ≤ z}

Special case if φ strictly monotone
  f_Z(z) = f_X(φ^{−1}(z)) |d φ^{−1}(z)/dz| = f_X(x) |dx/dz| = f_X(x) / |J|

The Rule of the Lazy Statistician
  E[Z] = ∫ φ(x) dF_X(x)
  E[I_A(X)] = ∫ I_A(x) dF_X(x) = ∫_A dF_X(x) = P[X ∈ A]

Convolution
• Z := X + Y:  f_Z(z) = ∫_{−∞}^∞ f_{X,Y}(x, z − x) dx  (= ∫_0^z f_{X,Y}(x, z − x) dx if X, Y ≥ 0)
• Z := |X − Y|:  f_Z(z) = 2 ∫_0^∞ f_{X,Y}(x, z + x) dx
• Z := X/Y:  f_Z(z) = ∫_{−∞}^∞ |x| f_{X,Y}(x, xz) dx  (= ∫_{−∞}^∞ |x| f_X(x) f_Y(xz) dx if X ⊥⊥ Y)

4 Expectation

Definition and properties
• E[X] = µ_X = ∫ x dF_X(x) = Σ_x x f_X(x) if X is discrete, ∫ x f_X(x) dx if X is continuous
• P[X = c] = 1 ⟹ E[c] = c
• E[cX] = c E[X]
• E[X + Y] = E[X] + E[Y]
• E[XY] = ∫∫ xy f_{X,Y}(x, y) dx dy
• In general E[φ(X)] ≠ φ(E[X]) (cf. Jensen's inequality)
• P[X ≥ Y] = 1 ⟹ E[X] ≥ E[Y];  P[X = Y] = 1 ⟹ E[X] = E[Y]
• E[X] = Σ_{x=1}^∞ P[X ≥ x]  (X taking values in ℕ)

Sample mean
  X̄_n = (1/n) Σ_{i=1}^n X_i

Conditional expectation
• E[Y | X = x] = ∫ y f(y | x) dy
• E[X] = E[E[X | Y]]
• E[φ(X, Y) | X = x] = ∫_{−∞}^∞ φ(x, y) f_{Y|X}(y | x) dy
• E[φ(Y, Z) | X = x] = ∫∫ φ(y, z) f_{(Y,Z)|X}(y, z | x) dy dz
• E[Y + Z | X] = E[Y | X] + E[Z | X]
• E[φ(X) Y | X] = φ(X) E[Y | X]
• E[Y | X] = c ⟹ Cov[X, Y] = 0

5 Variance

Definition and properties
• V[X] = σ_X^2 = E[(X − E[X])^2] = E[X^2] − E[X]^2
• V[Σ_{i=1}^n X_i] = Σ_{i=1}^n V[X_i] + 2 Σ_{i≠j} Cov[X_i, X_j]
• V[Σ_{i=1}^n X_i] = Σ_{i=1}^n V[X_i] if X_i ⊥⊥ X_j

Standard deviation
  sd[X] = √V[X] = σ_X

Covariance
• Cov[X, Y] = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X] E[Y]
• Cov[X, a] = 0
• Cov[X, X] = V[X]
• Cov[X, Y] = Cov[Y, X]
• Cov[aX, bY] = ab Cov[X, Y]
• Cov[X + a, Y + b] = Cov[X, Y]
• Cov[Σ_{i=1}^n X_i, Σ_{j=1}^m Y_j] = Σ_{i=1}^n Σ_{j=1}^m Cov[X_i, Y_j]

Correlation
  ρ[X, Y] = Cov[X, Y]/√(V[X] V[Y])

Independence
  X ⊥⊥ Y ⟹ ρ[X, Y] = 0 ⟺ Cov[X, Y] = 0 ⟺ E[XY] = E[X] E[Y]

Sample variance
  S^2 = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄_n)^2

Conditional variance
• V[Y | X] = E[(Y − E[Y | X])^2 | X] = E[Y^2 | X] − E[Y | X]^2
• V[Y] = E[V[Y | X]] + V[E[Y | X]]

6 Inequalities

Cauchy-Schwarz
  E[XY]^2 ≤ E[X^2] E[Y^2]

Markov
  P[φ(X) ≥ t] ≤ E[φ(X)]/t  (φ ≥ 0, t > 0)

Chebyshev
  P[|X − E[X]| ≥ t] ≤ V[X]/t^2

Chernoff
  P[X ≥ (1 + δ)µ] ≤ (e^δ/(1 + δ)^{1+δ})^µ,  δ > −1

Jensen
  E[φ(X)] ≥ φ(E[X])  for convex φ

7 Distribution Relationships

Binomial
• X_i ∼ Bern(p) ⟹ Σ_{i=1}^n X_i ∼ Bin(n, p)
• X ∼ Bin(n, p), Y ∼ Bin(m, p), X ⊥⊥ Y ⟹ X + Y ∼ Bin(n + m, p)
• lim_{n→∞} Bin(n, p) = Po(np)  (n large, p small)
• lim_{n→∞} Bin(n, p) = N(np, np(1 − p))  (n large, p far from 0 and 1)

Negative Binomial
• X ∼ NBin(1, p) = Geo(p)
• X ∼ NBin(r, p) = Σ_{i=1}^r Geo(p)
• X_i ∼ NBin(r_i, p) ⟹ Σ X_i ∼ NBin(Σ r_i, p)
• X ∼ NBin(r, p), Y ∼ Bin(s + r, p) ⟹ P[X ≤ s] = P[Y ≥ r]

Poisson
• X_i ∼ Po(λ_i) ∧ X_i ⊥⊥ X_j ⟹ Σ_{i=1}^n X_i ∼ Po(Σ_{i=1}^n λ_i)
• X_i ∼ Po(λ_i) ∧ X_i ⊥⊥ X_j ⟹ X_i | Σ_{j=1}^n X_j ∼ Bin(Σ_{j=1}^n X_j, λ_i/Σ_{j=1}^n λ_j)

Exponential
• X_i ∼ Exp(β) ∧ X_i ⊥⊥ X_j ⟹ Σ_{i=1}^n X_i ∼ Gamma(n, β)
• Memoryless property: P[X > x + y | X > y] = P[X > x]

Normal
• X ∼ N(µ, σ^2) ⟹ (X − µ)/σ ∼ N(0, 1)
• X ∼ N(µ, σ^2) ∧ Z = aX + b ⟹ Z ∼ N(aµ + b, a^2σ^2)
• X ∼ N(µ_1, σ_1^2) ∧ Y ∼ N(µ_2, σ_2^2), X ⊥⊥ Y ⟹ X + Y ∼ N(µ_1 + µ_2, σ_1^2 + σ_2^2)
• X_i ∼ N(µ_i, σ_i^2) independent ⟹ Σ_i X_i ∼ N(Σ_i µ_i, Σ_i σ_i^2)
• P[a < X ≤ b] = Φ((b − µ)/σ) − Φ((a − µ)/σ)
• Φ(−x) = 1 − Φ(x),  φ′(x) = −xφ(x),  φ″(x) = (x^2 − 1)φ(x)
• Upper quantile of N(0, 1): z_α = Φ^{−1}(1 − α)

Gamma
• X ∼ Gamma(α, β) ⟺ X/β ∼ Gamma(α, 1)
• Gamma(α, β) = Σ_{i=1}^α Exp(β) for integer α
• X_i ∼ Gamma(α_i, β) ∧ X_i ⊥⊥ X_j ⟹ Σ_i X_i ∼ Gamma(Σ_i α_i, β)
• Γ(α)/λ^α = ∫_0^∞ x^{α−1} e^{−λx} dx

Beta
• f_X(x) = (1/B(α, β)) x^{α−1}(1 − x)^{β−1} = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1}(1 − x)^{β−1}
• E[X^k] = B(α + k, β)/B(α, β) = ((α + k − 1)/(α + β + k − 1)) E[X^{k−1}]
• Beta(1, 1) = Unif(0, 1)

8 Probability and Moment Generating Functions

• G_X(t) = E[t^X],  |t| < 1
• M_X(t) = G_X(e^t) = E[e^{Xt}] = E[Σ_{i=0}^∞ (Xt)^i/i!] = Σ_{i=0}^∞ E[X^i]/i! · t^i
• P[X = 0] = G_X(0)
• P[X = 1] = G′_X(0)
• P[X = i] = G_X^{(i)}(0)/i!
• E[X] = G′_X(1^−)
• E[X^k] = M_X^{(k)}(0)
• E[X!/(X − k)!] = G_X^{(k)}(1^−)
• V[X] = G″_X(1^−) + G′_X(1^−) − (G′_X(1^−))^2
• G_X(t) = G_Y(t) ⟹ X =^d Y

9 Multivariate Distributions

9.1 Standard Bivariate Normal

Let X, Z ∼ N(0, 1) with X ⊥⊥ Z, and define Y = ρX + √(1 − ρ^2) Z.

Joint density
  f(x, y) = (1/(2π√(1 − ρ^2))) exp(−(x^2 + y^2 − 2ρxy)/(2(1 − ρ^2)))

Conditionals
  (Y | X = x) ∼ N(ρx, 1 − ρ^2)  and  (X | Y = y) ∼ N(ρy, 1 − ρ^2)

Independence
  X ⊥⊥ Y ⟺ ρ = 0

9.2 Bivariate Normal

Let X ∼ N(µ_x, σ_x^2) and Y ∼ N(µ_y, σ_y^2) with correlation ρ.

  f(x, y) = (1/(2πσ_xσ_y√(1 − ρ^2))) exp(−z/(2(1 − ρ^2)))
  z = ((x − µ_x)/σ_x)^2 + ((y − µ_y)/σ_y)^2 − 2ρ((x − µ_x)/σ_x)((y − µ_y)/σ_y)

Conditional mean and variance
  E[X | Y] = E[X] + ρ(σ_X/σ_Y)(Y − E[Y])
  V[X | Y] = σ_X^2(1 − ρ^2)

9.3 Multivariate Normal

Covariance matrix Σ (precision matrix Σ^{−1})

  Σ = [ V[X_1]        ⋯  Cov[X_1, X_k]
        ⋮             ⋱  ⋮
        Cov[X_k, X_1] ⋯  V[X_k]        ]

If X ∼ N(µ, Σ),
  f_X(x) = (2π)^{−n/2} |Σ|^{−1/2} exp(−(1/2)(x − µ)^T Σ^{−1}(x − µ))

Properties
• Z ∼ N(0, 1) ∧ X = µ + Σ^{1/2} Z ⟹ X ∼ N(µ, Σ)
• X ∼ N(µ, Σ) ⟹ Σ^{−1/2}(X − µ) ∼ N(0, 1)
• X ∼ N(µ, Σ) ⟹ AX ∼ N(Aµ, AΣA^T)
• X ∼ N(µ, Σ) ∧ a a vector of length k ⟹ a^T X ∼ N(a^T µ, a^T Σ a)
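The affine property above gives a direct way to draw from N(µ, Σ): transform standard normal draws with a square root of Σ. A minimal sketch in Python/NumPy, using the Cholesky factor in the role of Σ^{1/2}; the parameter values are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    mu = np.array([1.0, -2.0])
    Sigma = np.array([[2.0, 0.8],
                      [0.8, 1.0]])

    L = np.linalg.cholesky(Sigma)           # L @ L.T = Sigma, plays the role of Sigma^{1/2}
    Z = rng.standard_normal((100_000, 2))   # iid N(0, 1) draws
    X = mu + Z @ L.T                        # X ~ N(mu, Sigma)

    print(X.mean(axis=0))                   # close to mu
    print(np.cov(X, rowvar=False))          # close to Sigma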

10 Convergence

Let {X_1, X_2, …} be a sequence of rv's and let X be another rv. Let F_n denote the cdf of X_n and let F denote the cdf of X.

Types of convergence
1. In distribution (weakly, in law): X_n →D X
   lim_{n→∞} F_n(t) = F(t) at all t where F is continuous
2. In probability: X_n →P X
   (∀ε > 0) lim_{n→∞} P[|X_n − X| > ε] = 0
3. Almost surely (strongly): X_n →as X
   P[lim_{n→∞} X_n = X] = P[{ω ∈ Ω : lim_{n→∞} X_n(ω) = X(ω)}] = 1
4. In quadratic mean (L2): X_n →qm X
   lim_{n→∞} E[(X_n − X)^2] = 0

Relationships
• X_n →qm X ⟹ X_n →P X ⟹ X_n →D X
• X_n →as X ⟹ X_n →P X
• X_n →D X ∧ (∃c ∈ ℝ) P[X = c] = 1 ⟹ X_n →P X
• X_n →P X ∧ Y_n →P Y ⟹ X_n + Y_n →P X + Y
• X_n →qm X ∧ Y_n →qm Y ⟹ X_n + Y_n →qm X + Y
• X_n →P X ∧ Y_n →P Y ⟹ X_nY_n →P XY
• X_n →P X ⟹ φ(X_n) →P φ(X)
• X_n →D X ⟹ φ(X_n) →D φ(X)
• X_n →qm b ⟺ lim_{n→∞} E[X_n] = b ∧ lim_{n→∞} V[X_n] = 0
• X_1, …, X_n iid ∧ E[X] = µ ∧ V[X] < ∞ ⟹ X̄_n →qm µ

Slutsky's Theorem
• X_n →D X and Y_n →P c ⟹ X_n + Y_n →D X + c
• X_n →D X and Y_n →P c ⟹ X_nY_n →D cX
• In general: X_n →D X and Y_n →D Y does not imply X_n + Y_n →D X + Y

10.1 Law of Large Numbers (LLN)

Let {X_1, …, X_n} be a sequence of iid rv's with E[X_1] = µ and V[X_1] < ∞.

Weak (WLLN): X̄_n →P µ as n → ∞
Strong (SLLN): X̄_n →as µ as n → ∞

10.2 Central Limit Theorem (CLT)

Let {X_1, …, X_n} be a sequence of iid rv's with E[X_1] = µ and V[X_1] = σ^2.

  Z_n := (X̄_n − µ)/√(V[X̄_n]) = √n (X̄_n − µ)/σ →D Z  where Z ∼ N(0, 1)
  lim_{n→∞} P[Z_n ≤ z] = Φ(z),  z ∈ ℝ

CLT notations
  Z_n ≈ N(0, 1)
  X̄_n ≈ N(µ, σ^2/n)
  X̄_n − µ ≈ N(0, σ^2/n)
  √n (X̄_n − µ) ≈ N(0, σ^2)
  √n (X̄_n − µ)/σ ≈ N(0, 1)

Continuity correction
  P[X̄_n ≤ x] ≈ Φ((x + 1/2 − µ)/(σ/√n))
  P[X̄_n ≥ x] ≈ 1 − Φ((x − 1/2 − µ)/(σ/√n))

Delta method
  Y_n ≈ N(µ, σ^2/n) ⟹ φ(Y_n) ≈ N(φ(µ), (φ′(µ))^2 σ^2/n)
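A quick way to see the CLT and the delta method in action is by simulation. A minimal sketch with Exp(1) summands (so µ = σ = 1); the transformation φ(x) = x^2 is an illustrative choice:

    import numpy as np

    rng = np.random.default_rng(1)
    n, reps = 50, 20_000
    X = rng.exponential(scale=1.0, size=(reps, n))    # iid Exp(1): mu = 1, sigma = 1
    Xbar = X.mean(axis=1)

    Zn = np.sqrt(n) * (Xbar - 1.0) / 1.0              # CLT: approximately N(0, 1)
    print(Zn.mean(), Zn.var())                        # close to 0 and 1

    # Delta method for phi(x) = x^2: phi(Xbar) ~ N(phi(mu), (phi'(mu))^2 sigma^2 / n)
    phi = Xbar**2
    print(phi.var(), (2 * 1.0)**2 * 1.0 / n)          # the two numbers should be close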

11 Statistical Inference

Let X_1, …, X_n iid ∼ F if not otherwise noted.

11.1 Point Estimation

• Point estimator θ̂_n of θ is a rv: θ̂_n = g(X_1, …, X_n)
• bias(θ̂_n) = E[θ̂_n] − θ
• Consistency: θ̂_n →P θ
• Sampling distribution: F(θ̂_n)
• Standard error: se(θ̂_n) = √V[θ̂_n]
• Mean squared error: mse = E[(θ̂_n − θ)^2] = bias(θ̂_n)^2 + V[θ̂_n]
• lim_{n→∞} bias(θ̂_n) = 0 ∧ lim_{n→∞} se(θ̂_n) = 0 ⟹ θ̂_n is consistent
• Asymptotic normality: (θ̂_n − θ)/se →D N(0, 1)
• Slutsky's Theorem often lets us replace se(θ̂_n) by some (weakly) consistent estimator σ̂_n.

11.2 Normal-Based Confidence Interval

Suppose θ̂_n ≈ N(θ, ŝe^2). Let z_{α/2} = Φ^{−1}(1 − α/2), i.e., P[Z > z_{α/2}] = α/2 and P[−z_{α/2} < Z < z_{α/2}] = 1 − α where Z ∼ N(0, 1). Then

  C_n = θ̂_n ± z_{α/2} ŝe
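As a concrete illustration, a normal-based 95% interval for a mean uses θ̂_n = X̄_n and ŝe = S/√n. A minimal sketch (the data are simulated for illustration):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(2)
    x = rng.normal(loc=3.0, scale=2.0, size=200)       # illustrative sample

    alpha = 0.05
    theta_hat = x.mean()
    se_hat = x.std(ddof=1) / np.sqrt(len(x))
    z = norm.ppf(1 - alpha / 2)                        # z_{alpha/2}
    ci = (theta_hat - z * se_hat, theta_hat + z * se_hat)
    print(theta_hat, ci)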

11.3 Empirical Distribution

Empirical Distribution Function (ECDF)

  F̂_n(x) = (1/n) Σ_{i=1}^n I(X_i ≤ x),  where I(X_i ≤ x) = 1 if X_i ≤ x and 0 if X_i > x

Properties (for any fixed x)
• E[F̂_n(x)] = F(x)
• V[F̂_n(x)] = F(x)(1 − F(x))/n
• mse = F(x)(1 − F(x))/n → 0
• F̂_n(x) →P F(x)

Dvoretzky-Kiefer-Wolfowitz (DKW) inequality (X_1, …, X_n ∼ F)

  P[sup_x |F(x) − F̂_n(x)| > ε] ≤ 2e^{−2nε^2}

Nonparametric 1 − α confidence band for F

  L(x) = max{F̂_n(x) − ε_n, 0}
  U(x) = min{F̂_n(x) + ε_n, 1}
  ε_n = √((1/(2n)) log(2/α))
  P[L(x) ≤ F(x) ≤ U(x) for all x] ≥ 1 − α
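The ECDF and the DKW band are straightforward to compute. A minimal sketch (simulated data for illustration):

    import numpy as np

    rng = np.random.default_rng(3)
    x = np.sort(rng.exponential(size=100))
    n = len(x)
    alpha = 0.05

    F_hat = np.arange(1, n + 1) / n                    # ECDF evaluated at the sorted sample
    eps = np.sqrt(np.log(2 / alpha) / (2 * n))         # DKW epsilon_n
    lower = np.clip(F_hat - eps, 0, 1)
    upper = np.clip(F_hat + eps, 0, 1)

    # With probability >= 1 - alpha the band [lower, upper] covers F everywhere.
    print(eps, lower[:3], upper[:3])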

11.4 Statistical Functionals

• Statistical functional: T(F)
• Plug-in estimator of θ = T(F): θ̂_n = T(F̂_n)
• Linear functional: T(F) = ∫ φ(x) dF_X(x)
• Plug-in estimator for a linear functional:
  T(F̂_n) = ∫ φ(x) dF̂_n(x) = (1/n) Σ_{i=1}^n φ(X_i)
• Often: T(F̂_n) ≈ N(T(F), ŝe^2) ⟹ T(F̂_n) ± z_{α/2} ŝe
• pth quantile: F^{−1}(p) = inf{x : F(x) ≥ p}
• µ̂ = X̄_n
• σ̂^2 = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄_n)^2
• Skewness: κ̂ = ((1/n) Σ_{i=1}^n (X_i − µ̂)^3)/σ̂^3
• Correlation: ρ̂ = Σ_{i=1}^n (X_i − X̄_n)(Y_i − Ȳ_n) / (√(Σ_{i=1}^n (X_i − X̄_n)^2) √(Σ_{i=1}^n (Y_i − Ȳ_n)^2))

12 Parametric Inference

Let ℱ = {f(x; θ) : θ ∈ Θ} be a parametric model with parameter space Θ ⊆ ℝ^k and parameter θ = (θ_1, …, θ_k).

12.1 Method of Moments

jth moment
  α_j(θ) = E[X^j] = ∫ x^j dF_X(x)

jth sample moment
  α̂_j = (1/n) Σ_{i=1}^n X_i^j

Method of moments estimator (MoM): θ̂_n solves the system
  α_1(θ) = α̂_1
  α_2(θ) = α̂_2
  ⋮
  α_k(θ) = α̂_k

Properties of the MoM estimator
• θ̂_n exists with probability tending to 1
• Consistency: θ̂_n →P θ
• Asymptotic normality:
  √n (θ̂ − θ) →D N(0, Σ) where Σ = g E[YY^T] g^T, Y = (X, X^2, …, X^k)^T, g = (g_1, …, g_k) and g_j = ∂α_j^{−1}(θ)/∂θ
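For a Gamma(α, β) model the first two moments are αβ and αβ^2 + (αβ)^2, so the MoM estimates follow from the sample mean and variance in closed form. A minimal sketch (simulated data; the true parameters are illustrative):

    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.gamma(shape=3.0, scale=2.0, size=5_000)    # Gamma(alpha=3, beta=2): E = alpha*beta, V = alpha*beta^2

    m1 = x.mean()                                      # first sample moment
    m2 = (x**2).mean()                                 # second sample moment
    var = m2 - m1**2                                   # implied variance

    alpha_hat = m1**2 / var                            # solve alpha*beta = m1, alpha*beta^2 = var
    beta_hat = var / m1
    print(alpha_hat, beta_hat)                         # close to 3 and 2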

12.2 Maximum Likelihood

Likelihood: L_n : Θ → [0, ∞)
  L_n(θ) = Π_{i=1}^n f(X_i; θ)

Log-likelihood
  ℓ_n(θ) = log L_n(θ) = Σ_{i=1}^n log f(X_i; θ)

Maximum likelihood estimator (mle)
  L_n(θ̂_n) = sup_θ L_n(θ)

Score function
  s(X; θ) = ∂ log f(X; θ)/∂θ

Fisher information
  I(θ) = V_θ[s(X; θ)]
  I_n(θ) = n I(θ)

Fisher information (exponential family)
  I(θ) = −E_θ[∂s(X; θ)/∂θ]

Observed Fisher information
  I_n^obs(θ) = −(∂^2/∂θ^2) Σ_{i=1}^n log f(X_i; θ)

Properties of the mle
• Consistency: θ̂_n →P θ
• Equivariance: θ̂_n is the mle ⟹ φ(θ̂_n) is the mle of φ(θ)
• Asymptotic normality:
  1. se ≈ √(1/I_n(θ)) and (θ̂_n − θ)/se →D N(0, 1)
  2. ŝe ≈ √(1/I_n(θ̂_n)) and (θ̂_n − θ)/ŝe →D N(0, 1)
• Asymptotic optimality (or efficiency), i.e., smallest variance for large samples. If θ̃_n is any other estimator, the asymptotic relative efficiency is
  are(θ̃_n, θ̂_n) = V[θ̂_n]/V[θ̃_n] ≤ 1
• Approximately the Bayes estimator

12.2.1 Delta Method

If τ = φ(θ) where φ is differentiable and φ′(θ) ≠ 0:

  (τ̂_n − τ)/ŝe(τ̂) →D N(0, 1)

where τ̂ = φ(θ̂) is the mle of τ and

  ŝe(τ̂) = |φ′(θ̂)| ŝe(θ̂_n)
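Numerically, the mle is just the maximizer of ℓ_n, and the curvature at the maximum (observed Fisher information) gives ŝe. A minimal sketch for an Exp(β) model, where the mle is X̄_n and the observed information at the mle is n/β̂^2, together with the delta method for τ = φ(β) = 1/β (the rate):

    import numpy as np

    rng = np.random.default_rng(5)
    x = rng.exponential(scale=2.0, size=1_000)         # Exp(beta = 2) in the cookbook's parameterization

    beta_hat = x.mean()                                # mle of beta
    obs_info = len(x) / beta_hat**2                    # observed Fisher information at the mle
    se_beta = 1 / np.sqrt(obs_info)                    # = beta_hat / sqrt(n)
    print(beta_hat, se_beta)

    # Delta method for tau = phi(beta) = 1/beta (the rate):
    tau_hat = 1 / beta_hat
    se_tau = abs(-1 / beta_hat**2) * se_beta           # |phi'(beta_hat)| * se(beta_hat)
    print(tau_hat, se_tau)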

12.3 Multiparameter Models

Let θ = (θ_1, …, θ_k) and θ̂ = (θ̂_1, …, θ̂_k) be the mle.

  H_jj = ∂^2 ℓ_n/∂θ_j^2,  H_jk = ∂^2 ℓ_n/∂θ_j ∂θ_k

Fisher information matrix

  I_n(θ) = −[ E_θ[H_11] ⋯ E_θ[H_1k]
              ⋮         ⋱  ⋮
              E_θ[H_k1] ⋯ E_θ[H_kk] ]

Under appropriate regularity conditions

  (θ̂ − θ) ≈ N(0, J_n)

with J_n(θ) = I_n^{−1}(θ). Further, if θ̂_j is the jth component of θ̂, then

  (θ̂_j − θ_j)/ŝe_j →D N(0, 1)

where ŝe_j^2 = J_n(j, j) and Cov[θ̂_j, θ̂_k] = J_n(j, k).

12.3.1 Multiparameter Delta Method

Let τ = φ(θ_1, …, θ_k) and let the gradient of φ be

  ∇φ = (∂φ/∂θ_1, …, ∂φ/∂θ_k)^T

Suppose ∇φ evaluated at θ = θ̂ is nonzero and τ̂ = φ(θ̂). Then

  (τ̂ − τ)/ŝe(τ̂) →D N(0, 1)

where

  ŝe(τ̂) = √((∇̂φ)^T Ĵ_n (∇̂φ))

and Ĵ_n = J_n(θ̂) and ∇̂φ is ∇φ evaluated at θ = θ̂.

12.4 Parametric Bootstrap

Sample from f(x; θ̂_n) instead of from F̂_n, where θ̂_n could be the mle or the method of moments estimator.

13 Hypothesis Testing

  H_0 : θ ∈ Θ_0  versus  H_1 : θ ∈ Θ_1

Definitions
• Null hypothesis H_0
• Alternative hypothesis H_1
• Simple hypothesis: θ = θ_0
• Composite hypothesis: θ > θ_0 or θ < θ_0
• Two-sided test: H_0 : θ = θ_0 versus H_1 : θ ≠ θ_0
• One-sided test: H_0 : θ ≤ θ_0 versus H_1 : θ > θ_0
• Critical value c
• Test statistic T
• Rejection region R = {x : T(x) > c}
• Power function β(θ) = P[X ∈ R]
• Power of a test: 1 − P[Type II error] = 1 − β = inf_{θ∈Θ_1} β(θ)
• Test size: α = P[Type I error] = sup_{θ∈Θ_0} β(θ)

             Retain H_0           Reject H_0
  H_0 true   correct              Type I error (α)
  H_1 true   Type II error (β)    correct (power)

p-value
• p-value = sup_{θ∈Θ_0} P_θ[T(X) ≥ T(x)] = the smallest α at which the test rejects
• If T(X⋆) ∼ F_θ under H_0, then p-value = sup_{θ∈Θ_0} P_θ[T(X⋆) ≥ T(X)] = 1 − F_θ(T(X))

p-value evidence
  < 0.01      very strong evidence against H_0
  0.01 − 0.05 strong evidence against H_0
  0.05 − 0.1  weak evidence against H_0
  > 0.1       little or no evidence against H_0

Wald test
• Two-sided test
• Reject H_0 when |W| > z_{α/2} where W = (θ̂ − θ_0)/ŝe
• P[|W| > z_{α/2}] → α
• p-value = P_{θ_0}[|W| > |w|] ≈ P[|Z| > |w|] = 2Φ(−|w|)
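A minimal sketch of a Wald test for a Bernoulli proportion, H_0: p = 0.5, using θ̂ = p̂ and ŝe = √(p̂(1 − p̂)/n) (simulated data for illustration):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(6)
    x = rng.binomial(1, 0.55, size=500)                # Bernoulli sample, true p = 0.55

    p0 = 0.5
    p_hat = x.mean()
    se_hat = np.sqrt(p_hat * (1 - p_hat) / len(x))
    W = (p_hat - p0) / se_hat                          # Wald statistic
    p_value = 2 * norm.cdf(-abs(W))                    # 2 * Phi(-|w|)
    print(W, p_value, abs(W) > norm.ppf(0.975))        # reject at alpha = 0.05?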

Likelihood ratio test (LRT)
• T(X) = sup_{θ∈Θ} L_n(θ) / sup_{θ∈Θ_0} L_n(θ) = L_n(θ̂_n)/L_n(θ̂_{n,0}), where θ̂_{n,0} is the mle over Θ_0
• λ(X) = 2 log T(X) →D χ²_{r−q}, where Σ_{i=1}^k Z_i² ∼ χ²_k for Z_1, …, Z_k iid ∼ N(0, 1)
• p-value = P_{θ_0}[λ(X) > λ(x)] ≈ P[χ²_{r−q} > λ(x)]

Multinomial LRT
• mle: p̂_n = (X_1/n, …, X_k/n)
• T(X) = L_n(p̂_n)/L_n(p_0) = Π_{j=1}^k (p̂_j/p_{0j})^{X_j}
• λ(X) = 2 Σ_{j=1}^k X_j log(p̂_j/p_{0j}) →D χ²_{k−1}
• The approximate size-α LRT rejects H_0 when λ(X) ≥ χ²_{k−1,α}

Pearson Chi-square Test
• T = Σ_{j=1}^k (X_j − E[X_j])²/E[X_j] where E[X_j] = np_{0j} under H_0
• T →D χ²_{k−1}
• p-value = P[χ²_{k−1} > T(x)]
• Converges in distribution faster than the LRT, hence preferable for small n

Independence testing
• I rows, J columns, X a multinomial sample over the I × J table of total size n
• mles unconstrained: p̂_ij = X_ij/n
• mles under H_0: p̂_{0ij} = p̂_{i·} p̂_{·j} = (X_{i·}/n)(X_{·j}/n)
• LRT: λ = 2 Σ_{i=1}^I Σ_{j=1}^J X_ij log(n X_ij/(X_{i·} X_{·j}))
• Pearson χ²: T = Σ_{i=1}^I Σ_{j=1}^J (X_ij − E[X_ij])²/E[X_ij]
• LRT and Pearson →D χ²_ν, where ν = (I − 1)(J − 1)

14 Bayesian Inference

Bayes' Theorem

  f(θ | x^n) = f(x^n | θ) f(θ)/f(x^n) = f(x^n | θ) f(θ) / ∫ f(x^n | θ) f(θ) dθ ∝ L_n(θ) f(θ)

Definitions
• X^n = (X_1, …, X_n),  x^n = (x_1, …, x_n)
• Prior density f(θ)
• Likelihood f(x^n | θ): joint density of the data. In particular, if X^n is iid, then f(x^n | θ) = Π_{i=1}^n f(x_i | θ) = L_n(θ)
• Posterior density f(θ | x^n)
• Normalizing constant c_n = f(x^n) = ∫ f(x^n | θ) f(θ) dθ
• Kernel: part of a density that depends on θ
• Posterior mean θ̄_n = ∫ θ f(θ | x^n) dθ = ∫ θ L_n(θ) f(θ) dθ / ∫ L_n(θ) f(θ) dθ

14.1 Credible Intervals

Posterior interval
  P[θ ∈ (a, b) | x^n] = ∫_a^b f(θ | x^n) dθ = 1 − α

Equal-tail credible interval
  ∫_{−∞}^a f(θ | x^n) dθ = ∫_b^∞ f(θ | x^n) dθ = α/2

Highest posterior density (HPD) region R_n
  1. P[θ ∈ R_n] = 1 − α
  2. R_n = {θ : f(θ | x^n) > k} for some k
R_n is unimodal ⟹ R_n is an interval

14.2 Function of Parameters

Let τ = φ(θ) and A = {θ : φ(θ) ≤ τ}. Posterior CDF for τ:
  H(τ | x^n) = P[φ(θ) ≤ τ | x^n] = ∫_A f(θ | x^n) dθ

Posterior density
  h(τ | x^n) = H′(τ | x^n)

Bayesian delta method
  τ | X^n ≈ N(φ(θ̂), ŝe |φ′(θ̂)|)

14.3 Priors

Choice
• Subjective Bayesianism
• Objective Bayesianism
• Robust Bayesianism

Types
• Flat: f(θ) ∝ constant
• Proper: ∫_{−∞}^∞ f(θ) dθ = 1
• Improper: ∫_{−∞}^∞ f(θ) dθ = ∞
• Jeffreys' prior (transformation-invariant): f(θ) ∝ √I(θ),  f(θ) ∝ √det(I(θ))
• Conjugate: f(θ) and f(θ | x^n) belong to the same parametric family

14.3.1 Conjugate Priors

Discrete likelihood (Likelihood — conjugate prior — posterior hyperparameters):
• Bern(p) — Beta(α, β) — α + Σ_{i=1}^n x_i, β + n − Σ_{i=1}^n x_i
• Bin(p) — Beta(α, β) — α + Σ_{i=1}^n x_i, β + Σ_{i=1}^n N_i − Σ_{i=1}^n x_i
• NBin(p) — Beta(α, β) — α + rn, β + Σ_{i=1}^n x_i
• Po(λ) — Gamma(α, β) — α + Σ_{i=1}^n x_i, β + n
• Multinomial(p) — Dir(α) — α + Σ_{i=1}^n x^{(i)}
• Geo(p) — Beta(α, β) — α + n, β + Σ_{i=1}^n x_i

Continuous likelihood (subscript c denotes a known constant):
• Unif(0, θ) — Pareto(x_m, k) — max{x_{(n)}, x_m}, k + n
• Exp(λ) — Gamma(α, β) — α + n, β + Σ_{i=1}^n x_i
• N(µ, σ_c^2) — N(µ_0, σ_0^2) — mean (µ_0/σ_0^2 + Σ_{i=1}^n x_i/σ_c^2)/(1/σ_0^2 + n/σ_c^2), variance (1/σ_0^2 + n/σ_c^2)^{−1}
• N(µ_c, σ^2) — Scaled Inverse Chi-square(ν, σ_0^2) — ν + n, (νσ_0^2 + Σ_{i=1}^n (x_i − µ_c)^2)/(ν + n)
• N(µ, σ^2) — Normal-scaled Inverse Gamma(λ, ν, α, β) — (νλ + nx̄)/(ν + n), ν + n, α + n/2, β + (1/2)Σ_{i=1}^n (x_i − x̄)^2 + nν(x̄ − λ)^2/(2(ν + n))
• MVN(µ, Σ_c) — MVN(µ_0, Σ_0) — (Σ_0^{−1} + nΣ_c^{−1})^{−1}(Σ_0^{−1}µ_0 + nΣ_c^{−1}x̄), (Σ_0^{−1} + nΣ_c^{−1})^{−1}
• MVN(µ_c, Σ) — Inverse-Wishart(κ, Ψ) — n + κ, Ψ + Σ_{i=1}^n (x_i − µ_c)(x_i − µ_c)^T
• Pareto(x_{mc}, k) — Gamma(α, β) — α + n, β + Σ_{i=1}^n log(x_i/x_{mc})
• Pareto(x_m, k_c) — Pareto(x_0, k_0) — x_0, k_0 − k_c n where k_0 > k_c n
• Gamma(α_c, β) — Gamma(α_0, β_0) — α_0 + nα_c, β_0 + Σ_{i=1}^n x_i
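The first row of the table (Bernoulli likelihood, Beta prior) is easy to verify numerically: the posterior is Beta(α + Σx_i, β + n − Σx_i). A minimal sketch (prior hyperparameters and data are illustrative):

    import numpy as np
    from scipy.stats import beta

    rng = np.random.default_rng(7)
    x = rng.binomial(1, 0.3, size=40)                  # Bernoulli(p = 0.3) data

    a0, b0 = 2.0, 2.0                                  # Beta(alpha, beta) prior
    a_post = a0 + x.sum()                              # alpha + sum(x_i)
    b_post = b0 + len(x) - x.sum()                     # beta + n - sum(x_i)

    posterior = beta(a_post, b_post)
    print(posterior.mean())                            # posterior mean of p
    print(posterior.interval(0.95))                    # 95% equal-tail credible interval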

14.4 Bayesian Testing

If H_0 : θ ∈ Θ_0:

Prior probability: P[H_0] = ∫_{Θ_0} f(θ) dθ
Posterior probability: P[H_0 | x^n] = ∫_{Θ_0} f(θ | x^n) dθ

Let H_0, …, H_{K−1} be K hypotheses. Suppose θ ∼ f(θ | H_k). Then

  P[H_k | x^n] = f(x^n | H_k) P[H_k] / Σ_{k=1}^K f(x^n | H_k) P[H_k]

Marginal likelihood
  f(x^n | H_i) = ∫_Θ f(x^n | θ, H_i) f(θ | H_i) dθ

Posterior odds (of H_i relative to H_j)
  P[H_i | x^n]/P[H_j | x^n] = (f(x^n | H_i)/f(x^n | H_j)) × (P[H_i]/P[H_j])
  where the first factor is the Bayes factor BF_ij and the second the prior odds.

Bayes factor evidence
  log10 BF_10   BF_10     evidence
  0 − 0.5       1 − 1.5   Weak
  0.5 − 1       1.5 − 10  Moderate
  1 − 2         10 − 100  Strong
  > 2           > 100     Decisive

  p* = (p/(1 − p)) BF_10 / (1 + (p/(1 − p)) BF_10)

where p = P[H_1] is the prior and p* = P[H_1 | x^n] the posterior probability.

15 Exponential Family

Scalar parameter
  f_X(x | θ) = h(x) exp{η(θ)T(x) − A(θ)} = h(x) g(θ) exp{η(θ)T(x)}

Vector parameter
  f_X(x | θ) = h(x) exp{Σ_{i=1}^s η_i(θ)T_i(x) − A(θ)} = h(x) exp{η(θ)·T(x) − A(θ)} = h(x) g(θ) exp{η(θ)·T(x)}

Natural form
  f_X(x | η) = h(x) exp{η·T(x) − A(η)} = h(x) g(η) exp{η·T(x)} = h(x) g(η) exp{η^T T(x)}

16 Sampling Methods

16.1 The Bootstrap

Let T_n = g(X_1, …, X_n) be a statistic.

1. Estimate V_F[T_n] with V_{F̂_n}[T_n].
2. Approximate V_{F̂_n}[T_n] using simulation:
   (a) Repeat the following B times to get T*_{n,1}, …, T*_{n,B}, an iid sample from the sampling distribution implied by F̂_n:
       i. Sample uniformly X*_1, …, X*_n ∼ F̂_n.
       ii. Compute T*_n = g(X*_1, …, X*_n).
   (b) Then
       v_boot = V̂_{F̂_n} = (1/B) Σ_{b=1}^B (T*_{n,b} − (1/B) Σ_{r=1}^B T*_{n,r})^2

16.1.1 Bootstrap Confidence Intervals

Normal-based interval
  T_n ± z_{α/2} ŝe_boot

Pivotal interval
1. Location parameter θ = T(F)
2. Pivot R_n = θ̂_n − θ
3. Let H(r) = P[R_n ≤ r] be the cdf of R_n
4. Let R*_{n,b} = θ̂*_{n,b} − θ̂_n. Approximate H using the bootstrap:
   Ĥ(r) = (1/B) Σ_{b=1}^B I(R*_{n,b} ≤ r)
5. θ*_β = β sample quantile of (θ̂*_{n,1}, …, θ̂*_{n,B})
6. r*_β = β sample quantile of (R*_{n,1}, …, R*_{n,B}), i.e., r*_β = θ*_β − θ̂_n
7. Approximate 1 − α confidence interval C_n = (â, b̂) where
   â = θ̂_n − Ĥ^{−1}(1 − α/2) = θ̂_n − r*_{1−α/2} = 2θ̂_n − θ*_{1−α/2}
   b̂ = θ̂_n − Ĥ^{−1}(α/2) = θ̂_n − r*_{α/2} = 2θ̂_n − θ*_{α/2}

Percentile interval
  C_n = (θ*_{α/2}, θ*_{1−α/2})
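A minimal sketch of the nonparametric bootstrap for the median, computing ŝe_boot together with the normal-based, pivotal, and percentile intervals (data simulated for illustration):

    import numpy as np

    rng = np.random.default_rng(8)
    x = rng.exponential(scale=2.0, size=200)           # illustrative data
    theta_hat = np.median(x)
    B, alpha = 5_000, 0.05

    idx = rng.integers(0, len(x), size=(B, len(x)))    # resample indices uniformly with replacement
    theta_star = np.median(x[idx], axis=1)             # T*_{n,1}, ..., T*_{n,B}

    se_boot = theta_star.std(ddof=1)
    normal_ci = (theta_hat - 1.96 * se_boot, theta_hat + 1.96 * se_boot)
    q_lo, q_hi = np.quantile(theta_star, [alpha / 2, 1 - alpha / 2])
    pivotal_ci = (2 * theta_hat - q_hi, 2 * theta_hat - q_lo)
    percentile_ci = (q_lo, q_hi)
    print(se_boot, normal_ci, pivotal_ci, percentile_ci)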

16.2 Rejection Sampling

Setup
• We can easily sample from g(θ)
• We want to sample from h(θ), but it is difficult
• We know h(θ) up to a proportionality constant: h(θ) = k(θ)/∫ k(θ) dθ
• Envelope condition: we can find M > 0 such that k(θ) ≤ M g(θ) for all θ

Algorithm
1. Draw θ^cand ∼ g(θ)
2. Generate u ∼ Unif(0, 1)
3. Accept θ^cand if u ≤ k(θ^cand)/(M g(θ^cand))
4. Repeat until B values of θ^cand have been accepted

Example
• We can easily sample from the prior g(θ) = f(θ)
• Target is the posterior: h(θ) ∝ k(θ) = f(x^n | θ) f(θ)
• Envelope condition: f(x^n | θ) ≤ f(x^n | θ̂_n) = L_n(θ̂_n) ≡ M
• Algorithm
  1. Draw θ^cand ∼ f(θ)
  2. Generate u ∼ Unif(0, 1)
  3. Accept θ^cand if u ≤ L_n(θ^cand)/L_n(θ̂_n)
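A minimal sketch of the posterior-sampling example above for a Bernoulli likelihood with a Unif(0, 1) prior on p, so that M = L_n(p̂_n) with p̂_n = x̄ (data simulated for illustration):

    import numpy as np

    rng = np.random.default_rng(9)
    x = rng.binomial(1, 0.7, size=30)                  # Bernoulli data
    s, n = x.sum(), len(x)

    def log_lik(p):
        return s * np.log(p) + (n - s) * np.log(1 - p)

    p_mle = s / n
    log_M = log_lik(p_mle)                             # envelope: L_n(p) <= L_n(p_mle)

    samples = []
    while len(samples) < 2_000:
        cand = rng.uniform(0, 1)                       # draw from the prior g = Unif(0, 1)
        u = rng.uniform(0, 1)
        if np.log(u) <= log_lik(cand) - log_M:         # accept with prob L_n(cand)/L_n(p_mle)
            samples.append(cand)

    print(np.mean(samples))                            # close to the exact posterior mean (s+1)/(n+2)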

16.3 Importance Sampling

Sample from an importance function g rather than from the target density h. Algorithm to obtain an approximation to E[q(θ) | x^n]:

1. Sample from the prior: θ_1, …, θ_B iid ∼ f(θ)
2. Compute the normalized weights w_i = L_n(θ_i)/Σ_{i=1}^B L_n(θ_i) for all i = 1, …, B
3. E[q(θ) | x^n] ≈ Σ_{i=1}^B q(θ_i) w_i
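A minimal sketch of this prior-as-importance-function scheme, reusing the Bernoulli setup from the rejection-sampling example to approximate the posterior mean E[p | x^n]:

    import numpy as np

    rng = np.random.default_rng(10)
    x = rng.binomial(1, 0.7, size=30)
    s, n = x.sum(), len(x)
    B = 50_000

    theta = rng.uniform(0, 1, size=B)                  # draws from the prior f(theta) = Unif(0, 1)
    log_w = s * np.log(theta) + (n - s) * np.log(1 - theta)   # log L_n(theta_i)
    w = np.exp(log_w - log_w.max())                    # stabilize before normalizing
    w /= w.sum()                                       # normalized importance weights

    posterior_mean = np.sum(theta * w)                 # approximates E[p | x^n]
    print(posterior_mean, (s + 1) / (n + 2))           # compare with the exact Beta posterior mean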

17 Decision Theory

Definitions
• Unknown quantity affecting our decision: θ ∈ Θ
• Decision rule: synonymous with an estimator θ̂
• Action a ∈ 𝒜: possible value of the decision rule. In the estimation context, the action is just an estimate of θ, θ̂(x).
• Loss function L(θ, a): consequences of taking action a when the true state is θ, i.e., the discrepancy between θ and θ̂.

Loss functions
• Squared error loss: L(θ, a) = (θ − a)^2
• Linear loss: L(θ, a) = K_1(θ − a) if a − θ < 0, K_2(a − θ) if a − θ ≥ 0
• Absolute error loss: L(θ, a) = |θ − a| (linear loss with K_1 = K_2 = 1)
• L_p loss: L(θ, a) = |θ − a|^p
• Zero-one loss: L(θ, a) = 0 if a = θ, 1 if a ≠ θ

17.1 Risk

Posterior risk
  r(θ̂ | x) = ∫ L(θ, θ̂(x)) f(θ | x) dθ = E_{θ|X}[L(θ, θ̂(x))]

(Frequentist) risk
  R(θ, θ̂) = ∫ L(θ, θ̂(x)) f(x | θ) dx = E_{X|θ}[L(θ, θ̂(X))]

Bayes risk
  r(f, θ̂) = ∫∫ L(θ, θ̂(x)) f(x, θ) dx dθ = E_{θ,X}[L(θ, θ̂(X))]
  r(f, θ̂) = E_θ[E_{X|θ}[L(θ, θ̂(X))]] = E_θ[R(θ, θ̂)]
  r(f, θ̂) = E_X[E_{θ|X}[L(θ, θ̂(X))]] = E_X[r(θ̂ | X)]

17.2 Admissibility

• θ̂′ dominates θ̂ if
  ∀θ: R(θ, θ̂′) ≤ R(θ, θ̂)
  ∃θ: R(θ, θ̂′) < R(θ, θ̂)
• θ̂ is inadmissible if there is at least one other estimator θ̂′ that dominates it. Otherwise it is called admissible.

17.3 Bayes Rule

Bayes rule (or Bayes estimator)
• r(f, θ̂) = inf_{θ̃} r(f, θ̃)
• θ̂(x) = argmin of r(θ̂ | x) for each x ⟹ r(f, θ̂) = ∫ r(θ̂ | x) f(x) dx

Theorems
• Squared error loss: the Bayes rule is the posterior mean
• Absolute error loss: the posterior median
• Zero-one loss: the posterior mode

17.4 Minimax Rules

Maximum risk
  R̄(θ̂) = sup_θ R(θ, θ̂),  R̄(a) = sup_θ R(θ, a)

Minimax rule
  sup_θ R(θ, θ̂) = inf_{θ̃} R̄(θ̃) = inf_{θ̃} sup_θ R(θ, θ̃)

  θ̂ = Bayes rule ∧ ∃c: R(θ, θ̂) = c ⟹ θ̂ is minimax

Least favorable prior
  θ̂_f = Bayes rule ∧ R(θ, θ̂_f) ≤ r(f, θ̂_f) ∀θ ⟹ θ̂_f is minimax and f is least favorable

18 Linear Regression

Definitions
• Response variable Y
• Covariate X (aka predictor variable or feature)

18.1 Simple Linear Regression

Model
  Y_i = β_0 + β_1 X_i + ε_i,  E[ε_i | X_i] = 0,  V[ε_i | X_i] = σ^2

Fitted line
  r̂(x) = β̂_0 + β̂_1 x

Predicted (fitted) values
  Ŷ_i = r̂(X_i)

Residuals
  ε̂_i = Y_i − Ŷ_i = Y_i − (β̂_0 + β̂_1 X_i)

Residual sums of squares (rss)
  rss(β̂_0, β̂_1) = Σ_{i=1}^n ε̂_i^2

Least squares estimates: β̂ = (β̂_0, β̂_1)^T minimizes rss
  β̂_0 = Ȳ_n − β̂_1 X̄_n
  β̂_1 = Σ_{i=1}^n (X_i − X̄_n)(Y_i − Ȳ_n) / Σ_{i=1}^n (X_i − X̄_n)^2 = (Σ_{i=1}^n X_iY_i − nX̄Ȳ)/(Σ_{i=1}^n X_i^2 − nX̄^2)

  E[β̂ | X^n] = (β_0, β_1)^T
  V[β̂ | X^n] = (σ^2/(n s_X^2)) [ n^{−1}Σ_{i=1}^n X_i^2   −X̄_n
                                  −X̄_n                   1    ]
  ŝe(β̂_0) = (σ̂/(s_X √n)) √(Σ_{i=1}^n X_i^2/n)
  ŝe(β̂_1) = σ̂/(s_X √n)

where s_X^2 = n^{−1} Σ_{i=1}^n (X_i − X̄_n)^2 and σ̂^2 = (1/(n − 2)) Σ_{i=1}^n ε̂_i^2 (unbiased estimate). Further properties:

• Consistency: β̂_0 →P β_0 and β̂_1 →P β_1
• Asymptotic normality: (β̂_0 − β_0)/ŝe(β̂_0) →D N(0, 1) and (β̂_1 − β_1)/ŝe(β̂_1) →D N(0, 1)
• Approximate 1 − α confidence intervals for β_0 and β_1:
  β̂_0 ± z_{α/2} ŝe(β̂_0) and β̂_1 ± z_{α/2} ŝe(β̂_1)
• Wald test for H_0 : β_1 = 0 vs. H_1 : β_1 ≠ 0: reject H_0 if |W| > z_{α/2} where W = β̂_1/ŝe(β̂_1).

R^2
  R^2 = Σ_{i=1}^n (Ŷ_i − Ȳ)^2 / Σ_{i=1}^n (Y_i − Ȳ)^2 = 1 − Σ_{i=1}^n ε̂_i^2 / Σ_{i=1}^n (Y_i − Ȳ)^2 = 1 − rss/tss

Likelihood
  L = Π_{i=1}^n f(X_i, Y_i) = Π_{i=1}^n f_X(X_i) × Π_{i=1}^n f_{Y|X}(Y_i | X_i) = L_1 × L_2
  L_1 = Π_{i=1}^n f_X(X_i)
  L_2 = Π_{i=1}^n f_{Y|X}(Y_i | X_i) ∝ σ^{−n} exp(−(1/(2σ^2)) Σ_i (Y_i − (β_0 + β_1X_i))^2)

Under the assumption of normality, the least squares estimator is also the mle, with
  σ̂^2 = (1/n) Σ_{i=1}^n ε̂_i^2

18.2 Prediction

Observe X = x_* of the covariate and predict the outcome Y_*.

  Ŷ_* = β̂_0 + β̂_1 x_*
  V[Ŷ_*] = V[β̂_0] + x_*^2 V[β̂_1] + 2x_* Cov[β̂_0, β̂_1]

Prediction interval
  ξ̂_n^2 = σ̂^2 (Σ_{i=1}^n (X_i − x_*)^2/(n Σ_i (X_i − X̄)^2) + 1)
  Ŷ_* ± z_{α/2} ξ̂_n

18.3 Multiple Regression

  Y = Xβ + ε

where

  X = [ X_11 ⋯ X_1k
        ⋮    ⋱  ⋮
        X_n1 ⋯ X_nk ],  β = (β_1, …, β_k)^T,  ε = (ε_1, …, ε_n)^T

Likelihood
  L(µ, Σ) = (2πσ^2)^{−n/2} exp(−rss/(2σ^2))
  rss = (y − Xβ)^T(y − Xβ) = ‖Y − Xβ‖^2 = Σ_{i=1}^N (Y_i − x_i^Tβ)^2

If the (k × k) matrix X^TX is invertible,
  β̂ = (X^TX)^{−1}X^TY
  V[β̂ | X^n] = σ^2(X^TX)^{−1}
  β̂ ≈ N(β, σ^2(X^TX)^{−1})

Estimate regression function
  r̂(x) = Σ_{j=1}^k β̂_j x_j

Unbiased estimate for σ^2
  σ̂^2 = (1/(n − k)) Σ_{i=1}^n ε̂_i^2,  ε̂ = Xβ̂ − Y

mle
  µ̂ = X̄,  σ̂^2_mle = ((n − k)/n) σ̂^2

1 − α confidence interval
  β̂_j ± z_{α/2} ŝe(β̂_j)
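The normal-equations estimator β̂ = (X^TX)^{−1}X^TY and its estimated covariance σ̂^2(X^TX)^{−1} translate directly into code. A minimal sketch with simulated covariates (an intercept column is included as the first covariate):

    import numpy as np

    rng = np.random.default_rng(11)
    n = 500
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # intercept + 2 covariates
    beta_true = np.array([1.0, 2.0, -0.5])
    Y = X @ beta_true + rng.normal(scale=1.5, size=n)

    k = X.shape[1]
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ Y                       # (X'X)^{-1} X'Y
    resid = Y - X @ beta_hat
    sigma2_hat = resid @ resid / (n - k)               # unbiased estimate of sigma^2
    se_hat = np.sqrt(sigma2_hat * np.diag(XtX_inv))    # se(beta_hat_j)

    print(beta_hat)
    print(beta_hat - 1.96 * se_hat, beta_hat + 1.96 * se_hat)    # approximate 95% CIs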

18.4 Model Selection

Consider predicting a new observation Y_* for covariates X_* and let S ⊆ J denote a subset of the covariates in the model, where |S| = k and |J| = n.

Issues
• Underfitting: too few covariates yields high bias
• Overfitting: too many covariates yields high variance

Procedure
1. Assign a score to each model
2. Search through all models to find the one with the highest score

Hypothesis testing
  H_0 : β_j = 0 vs. H_1 : β_j ≠ 0  ∀j ∈ J

Mean squared prediction error (mspe)
  mspe = E[(Ŷ(S) − Y_*)^2]

Prediction risk
  R(S) = Σ_{i=1}^n mspe_i = Σ_{i=1}^n E[(Ŷ_i(S) − Y_i^*)^2]

Training error
  R̂_tr(S) = Σ_{i=1}^n (Ŷ_i(S) − Y_i)^2

R^2
  R^2(S) = 1 − rss(S)/tss = 1 − R̂_tr(S)/tss

The training error is a downward-biased estimate of the prediction risk:
  E[R̂_tr(S)] < R(S)
  bias(R̂_tr(S)) = E[R̂_tr(S)] − R(S) = −2 Σ_{i=1}^n Cov[Ŷ_i, Y_i]

Adjusted R^2
  R̄^2(S) = 1 − ((n − 1)/(n − k)) rss/tss

Mallow's C_p statistic
  R̂(S) = R̂_tr(S) + 2kσ̂^2 = lack of fit + complexity penalty

Akaike Information Criterion (AIC)
  AIC(S) = ℓ_n(β̂_S, σ̂_S^2) − k

Bayesian Information Criterion (BIC)
  BIC(S) = ℓ_n(β̂_S, σ̂_S^2) − (k/2) log n

Validation and training
  R̂_V(S) = Σ_{i=1}^m (Ŷ_i^*(S) − Y_i^*)^2,  m = |{validation data}|, often n/4 or n/2

Leave-one-out cross-validation
  R̂_CV(S) = Σ_{i=1}^n (Y_i − Ŷ_{(i)})^2 = Σ_{i=1}^n ((Y_i − Ŷ_i(S))/(1 − U_ii(S)))^2
  U(S) = X_S(X_S^TX_S)^{−1}X_S^T  (the "hat matrix")

19 Non-parametric Function Estimation

19.1 Density Estimation

Estimate f(x), where P[X ∈ A] = ∫_A f(x) dx.

Integrated square error (ise)
  L(f, f̂_n) = ∫ (f(x) − f̂_n(x))^2 dx = J(h) + ∫ f^2(x) dx

Frequentist risk
  R(f, f̂_n) = E[L(f, f̂_n)] = ∫ b^2(x) dx + ∫ v(x) dx
  b(x) = E[f̂_n(x)] − f(x),  v(x) = V[f̂_n(x)]

19.1.1 Histograms

Definitions
• Number of bins m
• Binwidth h = 1/m
• Bin B_j has ν_j observations
• Define p̂_j = ν_j/n and p_j = ∫_{B_j} f(u) du

Histogram estimator
  f̂_n(x) = Σ_{j=1}^m (p̂_j/h) I(x ∈ B_j)

  E[f̂_n(x)] = p_j/h
  V[f̂_n(x)] = p_j(1 − p_j)/(nh^2)
  R(f̂_n, f) ≈ (h^2/12) ∫ (f′(u))^2 du + 1/(nh)
  h* = (1/n^{1/3}) (6/∫ (f′(u))^2 du)^{1/3}
  R*(f̂_n, f) ≈ C/n^{2/3},  C = (3/4)^{2/3} (∫ (f′(u))^2 du)^{1/3}

Cross-validation estimate of E[J(h)]
  Ĵ_CV(h) = ∫ f̂_n^2(x) dx − (2/n) Σ_{i=1}^n f̂_{(−i)}(X_i) = 2/((n − 1)h) − ((n + 1)/((n − 1)h)) Σ_{j=1}^m p̂_j^2

19.1.2 Kernel Density Estimator (KDE)

Kernel K
• K(x) ≥ 0
• ∫ K(x) dx = 1
• ∫ xK(x) dx = 0
• ∫ x^2K(x) dx ≡ σ_K^2 > 0

KDE
  f̂_n(x) = (1/n) Σ_{i=1}^n (1/h) K((x − X_i)/h)

  R(f, f̂_n) ≈ (1/4)(hσ_K)^4 ∫ (f″(x))^2 dx + (1/(nh)) ∫ K^2(x) dx
  h* = c_1^{−2/5} c_2^{1/5} c_3^{−1/5} n^{−1/5},  c_1 = σ_K^2, c_2 = ∫ K^2(x) dx, c_3 = ∫ (f″(x))^2 dx
  R*(f, f̂_n) = c_4/n^{4/5},  c_4 = (5/4) (σ_K^2)^{2/5} (∫ K^2(x) dx)^{4/5} (∫ (f″)^2 dx)^{1/5}
  (the first two factors of c_4 are often written C(K))

Epanechnikov kernel
  K(x) = (3/(4√5))(1 − x^2/5) for |x| < √5, 0 otherwise

Cross-validation estimate of E[J(h)]
  Ĵ_CV(h) = ∫ f̂_n^2(x) dx − (2/n) Σ_{i=1}^n f̂_{(−i)}(X_i) ≈ (1/(hn^2)) Σ_{i=1}^n Σ_{j=1}^n K*((X_i − X_j)/h) + (2/(nh)) K(0)
  K*(x) = K^{(2)}(x) − 2K(x),  K^{(2)}(x) = ∫ K(x − y)K(y) dy
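A minimal sketch of the KDE with a Gaussian kernel, evaluated on a grid. The bandwidth here is the simple normal-reference rule h ≈ 1.06 σ̂ n^{−1/5}, an illustrative choice rather than the cross-validated h*:

    import numpy as np

    rng = np.random.default_rng(12)
    x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])   # bimodal sample
    n = len(x)

    h = 1.06 * x.std(ddof=1) * n ** (-1 / 5)           # normal-reference bandwidth (illustrative)
    grid = np.linspace(x.min() - 1, x.max() + 1, 400)

    def gauss_kernel(u):
        return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

    # f_hat(x) = (1/n) * sum_i (1/h) K((x - X_i)/h), vectorized over the grid
    f_hat = gauss_kernel((grid[:, None] - x[None, :]) / h).mean(axis=1) / h
    print(np.trapz(f_hat, grid))                       # integrates to approximately 1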

19.2 Non-parametric Regression

Estimate r(x) where r(x) = E[Y | X = x]. Consider pairs of points (x_1, Y_1), …, (x_n, Y_n) related by

  Y_i = r(x_i) + ε_i,  E[ε_i] = 0,  V[ε_i] = σ^2

k-nearest-neighbor estimator
  r̂(x) = (1/k) Σ_{i : x_i ∈ N_k(x)} Y_i  where N_k(x) = {the k values of x_1, …, x_n closest to x}

Nadaraya-Watson kernel estimator
  r̂(x) = Σ_{i=1}^n w_i(x) Y_i
  w_i(x) = K((x − x_i)/h) / Σ_{j=1}^n K((x − x_j)/h) ∈ [0, 1]

  R(r̂_n, r) ≈ (h^4/4)(∫ x^2K(x) dx)^2 ∫ (r″(x) + 2r′(x) f′(x)/f(x))^2 dx + (σ^2 ∫ K^2(x) dx)/(nh) ∫ (1/f(x)) dx
  h* ≈ c_1/n^{1/5}
  R*(r̂_n, r) ≈ c_2/n^{4/5}

Cross-validation estimate of E[J(h)]
  Ĵ_CV(h) = Σ_{i=1}^n (Y_i − r̂_{(−i)}(x_i))^2 = Σ_{i=1}^n (Y_i − r̂(x_i))^2 / (1 − K(0)/Σ_{j=1}^n K((x_i − x_j)/h))^2

19.3 Smoothing Using Orthogonal Functions

Approximation
  r(x) = Σ_{j=1}^∞ β_j φ_j(x) ≈ Σ_{j=1}^J β_j φ_j(x)

Multivariate regression
  Y = Φβ + η
where η_i = ε_i and
  Φ = [ φ_0(x_1) ⋯ φ_J(x_1)
        ⋮        ⋱  ⋮
        φ_0(x_n) ⋯ φ_J(x_n) ]

Least squares estimator
  β̂ = (Φ^TΦ)^{−1}Φ^TY ≈ (1/n)Φ^TY  (for equally spaced observations only)

Cross-validation estimate of E[J(h)]
  R̂_CV(J) = Σ_{i=1}^n (Y_i − Σ_{j=1}^J φ_j(x_i) β̂_{j,(−i)})^2

20 Stochastic Processes

Stochastic Process
  {X_t : t ∈ T},  T = {0, ±1, …} = ℤ (discrete) or [0, ∞) (continuous)

• Notations: X_t, X(t)
• State space 𝒳
• Index set T

20.1 Markov Chains

Markov chain
  P[X_n = x | X_0, …, X_{n−1}] = P[X_n = x | X_{n−1}]  ∀n ∈ T, x ∈ 𝒳

Transition probabilities
  p_ij ≡ P[X_{n+1} = j | X_n = i]
  p_ij(n) ≡ P[X_{m+n} = j | X_m = i]  (n-step)

Transition matrix P (n-step: P_n)
• (i, j) element is p_ij
• p_ij ≥ 0
• Σ_j p_ij = 1

Chapman-Kolmogorov
  p_ij(m + n) = Σ_k p_ik(m) p_kj(n)
  P_{m+n} = P_m P_n
  P_n = P × ⋯ × P = P^n

Marginal probability
  µ_n = (µ_n(1), …, µ_n(N)) where µ_n(i) = P[X_n = i]
  µ_0 is the initial distribution
  µ_n = µ_0 P^n
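A minimal sketch of propagating the marginal distribution µ_n = µ_0 P^n for a small two-state chain (the transition matrix is illustrative):

    import numpy as np

    P = np.array([[0.9, 0.1],          # illustrative transition matrix, rows sum to 1
                  [0.4, 0.6]])
    mu0 = np.array([1.0, 0.0])         # start in state 0

    mu = mu0.copy()
    for n in range(50):                # mu_n = mu_0 P^n
        mu = mu @ P
    print(mu)                          # approaches the stationary distribution (0.8, 0.2)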

20.2 Poisson Processes

Poisson process
• {X_t : t ∈ [0, ∞)} = number of events up to and including time t
• X_0 = 0
• Independent increments: ∀t_0 < ⋯ < t_n: X_{t_1} − X_{t_0} ⊥⊥ ⋯ ⊥⊥ X_{t_n} − X_{t_{n−1}}
• Intensity function λ(t):
  – P[X_{t+h} − X_t = 1] = λ(t)h + o(h)
  – P[X_{t+h} − X_t ≥ 2] = o(h)
• X_{s+t} − X_s ∼ Po(m(s + t) − m(s)) where m(t) = ∫_0^t λ(s) ds

Homogeneous Poisson process
  λ(t) ≡ λ ⟹ X_t ∼ Po(λt),  λ > 0

Waiting times
  W_t := time at which X_t occurs
  W_t ∼ Gamma(t, 1/λ)

Interarrival times
  S_t = W_{t+1} − W_t
  S_t ∼ Exp(1/λ)

[Figure: timeline showing consecutive waiting times W_{t−1}, W_t and the interarrival time S_t between them.]
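A minimal sketch simulating a homogeneous Poisson process via its Exp(1/λ) interarrival times and checking that the count in [0, T] is approximately Po(λT):

    import numpy as np

    rng = np.random.default_rng(13)
    lam, T, reps = 3.0, 10.0, 10_000

    counts = []
    for _ in range(reps):
        t, n = 0.0, 0
        while True:
            t += rng.exponential(1 / lam)   # interarrival times S_t ~ Exp(1/lambda), mean 1/lambda
            if t > T:
                break
            n += 1
        counts.append(n)

    counts = np.array(counts)
    print(counts.mean(), counts.var())      # both close to lambda * T = 30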

21 Time Series

Mean function
  µ_{xt} = E[x_t] = ∫_{−∞}^∞ x f_t(x) dx

Autocovariance function
  γ_x(s, t) = E[(x_s − µ_s)(x_t − µ_t)] = E[x_s x_t] − µ_s µ_t
  γ_x(t, t) = E[(x_t − µ_t)^2] = V[x_t]

Autocorrelation function (ACF)
  ρ(s, t) = Cov[x_s, x_t]/√(V[x_s] V[x_t]) = γ(s, t)/√(γ(s, s) γ(t, t))

Cross-covariance function (CCV)
  γ_{xy}(s, t) = E[(x_s − µ_{xs})(y_t − µ_{yt})]

Cross-correlation function (CCF)
  ρ_{xy}(s, t) = γ_{xy}(s, t)/√(γ_x(s, s) γ_y(t, t))

Backshift operator
  B^k(x_t) = x_{t−k}

Difference operator
  ∇^d = (1 − B)^d
