Statistics Lab with R VERSION 2015

(1)

Statistics Lab

Rodolfo Metulini

IMT Institute for Advanced Studies, Lucca, Italy

(2)

Introduction

Modern statistics was built and developed around the normal distribution.

It is often said in academia that, if the empirical distribution is normal (or approximately normal), everything works well. Whether this holds depends mainly on the sample size.

That said, it is important to understand under which circumstances we can state that the distribution is normal.

(3)

The Law of Large Numbers (LLN)

Suppose we have a random variable $X$ with expected value $E(X) = \mu$.

We extract $n$ observations from $X$ (say $x = \{x_1, x_2, \ldots, x_n\}$).

If we define $\hat{X}_n = \frac{\sum_i x_i}{n} = \frac{x_1 + x_2 + \ldots + x_n}{n}$, the LLN states that, for $n \to \infty$, $\hat{X}_n \to \mu$.
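The statement of the LLN can be illustrated with a minimal sketch. The values mu = 100 and sigma = 10, the three sample sizes and the seed are illustrative assumptions, not part of the theorem; the sketch only shows the sample mean getting closer to the expected value as n grows.

```r
set.seed(1)                       # arbitrary seed, only for reproducibility
mu = 100; sigma = 10              # assumed illustrative parameters
sizes = c(10, 1000, 100000)       # assumed sample sizes n

# sample mean of n normal draws, for each n
means = sapply(sizes, function(n) mean(rnorm(n, mean = mu, sd = sigma)))

# the absolute error |sample mean - mu| tends to shrink as n grows
print(data.frame(n = sizes, sample_mean = means, abs_error = abs(means - mu)))
```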

(4)

The Central Limit Theorem (CLT)

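Suppose we have a random variable $X$ with expected value $E(X) = \mu$ and variance $v(X) = \sigma^2$.

The CLT states that, for $n \to \infty$, $\hat{X}_n$ is approximately normally distributed with expected value $\mu$ and variance $\frac{\sigma^2}{n}$, that is $\hat{X}_n \sim N(\mu, \frac{\sigma^2}{n})$, whatever the distribution of $X$ may be.

N.B. If $X$ is normally distributed, $\hat{X}_n \sim N(\mu, \frac{\sigma^2}{n})$ holds exactly, for any $n$.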
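As a rough sketch of the "whatever the distribution" part, the following draws sample means from an exponential distribution, which is far from normal. The choice of the exponential with rate 1 (so that µ = 1 and σ² = 1), the sample size n = 30, the 2000 repetitions and the use of replicate are all assumptions made for illustration.

```r
set.seed(2)                 # arbitrary seed
n = 30                      # assumed sample size
reps = 2000                 # assumed number of samples

# each element is the mean of one sample of size n from Exponential(rate = 1)
xbar_exp = replicate(reps, mean(rexp(n, rate = 1)))

mean(xbar_exp)              # the CLT suggests a value close to mu = 1
var(xbar_exp)               # the CLT suggests a value close to sigma^2 / n = 1/30

# histogram of the sample means with the N(mu, sigma^2 / n) density on top
hist(xbar_exp, prob = TRUE)
curve(dnorm(x, mean = 1, sd = sqrt(1 / n)), add = TRUE)
```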

(5)

CLT: Empiricals

To better understand the CLT, it is recommended to examine the theorem empirically, step by step, introducing new commands in the R programming language along the way.

In the first part, we will show how to draw and visualize a sample of random numbers from a distribution.

(6)

Drawing random numbers - 1

We already introduced the use of the letters d, p and q in relation to the various distributions (e.g. normal, uniform, exponential). A reminder of their use follows:

◮ d is for density: it is used to find values of the probability density function.

◮ p is for probability: it is used to find the probability that the random variable lies to the left of a given number.

◮ q is for quantile: it is used to find the quantiles of a given distribution.

There is a fourth letter, namely r, used to draw random numbers from a distribution. For example, runif and rexp would be used to draw random numbers from the uniform and exponential distributions, respectively.
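The four letters can be tried side by side in a short sketch. The normal distribution with mean 100 and sd 10, the evaluation points, the seed and the rate of the exponential are illustrative assumptions.

```r
dnorm(100, mean = 100, sd = 10)   # d: density of N(100, 10^2) at its mean
pnorm(110, mean = 100, sd = 10)   # p: P(X <= 110), about 0.841
qnorm(0.5, mean = 100, sd = 10)   # q: the 50% quantile (median), equal to 100

set.seed(3)                       # arbitrary seed
u = runif(5)                      # r: 5 draws from Uniform(0, 1)
e = rexp(5, rate = 2)             # r: 5 draws from Exponential(rate = 2)
u
e
```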

(7)

Drawing random numbers - 2

Let us use the rnorm command to draw 500 numbers at random from a normal distribution having mean 100 and standard deviation (sd) 10.

> x = rnorm(500, mean = 100, sd = 10)

Typing x in the R console shows the result: a vector of 500 numbers extracted at random from a normal distribution with mean 100 and sd 10.

When you examine the numbers stored in the vector x, there is a sense that you are pulling random numbers that are clumped about a mean of 100. However, a histogram of this selection provides a different picture of the data stored.

(8)

Drawing random numbers - Comments

Several comments are in order regarding the histogram in the figure.

1. The histogram is approximately normal in shape.

2. The balance point of the histogram appears to be located near 100, suggesting that the random numbers were drawn from a distribution having mean 100.

(9)

Drawing random numbers - a new drawing

Let's try the experiment again, drawing a new set of 500 random numbers from the normal distribution having mean 100 and standard deviation 10:

> x = rnorm(500, mean = 100, sd = 10)
> hist(x, prob = TRUE, ylim = c(0, 0.04))

Take a look at the histogram. It is different from the first one; however, it shares some common traits: (1) it appears normal in shape; (2) it appears to be balanced around 100; (3) all values appear to occur within 3 increments of 10 of the mean.
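The traits read off the histogram can also be checked numerically. This is a minimal sketch; the seed is arbitrary, and the 99.7% figure is the theoretical share of a normal distribution lying within three standard deviations of the mean.

```r
set.seed(4)                           # arbitrary seed
x = rnorm(500, mean = 100, sd = 10)

mean(x)                               # balance point, expected near 100
sd(x)                                 # spread, expected near 10
mean(abs(x - 100) <= 30)              # share of values within 3 sd of 100;
                                      # about 0.997 in theory
```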

This is strong evidence that the random numbers have been drawn from a normal distribution having mean 100 and sd 10. We can support this claim by superimposing a normal density curve on the histogram.

(10)

The curve command

The curve command is new. Some comments on its use follow:

1. In its simplest form, the syntax curve(f(x), from = , to = ) draws the function defined by f(x) on the interval (from, to). Our function is dnorm(x, mean = 100, sd = 10). The curve command sketches this function of x on the interval (from, to).

2. The notation from = and to = may be omitted if the arguments are passed to the curve command in the proper order: function first, value of from second, value of to third. That is what we have done.
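A possible way to put the two comments into practice is sketched below. The interval (70, 130) and the add = TRUE argument, which draws the curve on top of the existing histogram instead of opening a new plot, are assumptions made here for illustration; the seed is arbitrary.

```r
set.seed(5)                               # arbitrary seed
x = rnorm(500, mean = 100, sd = 10)
hist(x, prob = TRUE, ylim = c(0, 0.04))

# function first, then from (70) and to (130) passed by position;
# add = TRUE overlays the density on the histogram drawn above
curve(dnorm(x, mean = 100, sd = 10), 70, 130, add = TRUE)
```

Because the histogram is drawn with prob = TRUE, its bars are on the density scale, so the curve and the bars are directly comparable.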

(11)

The distribution of $\hat{X}_n$ (sample mean)

In our previous example we drew 500 random numbers from a normal distribution with mean 100 and standard deviation 10. This leads to ONE sample of size n = 500. Now the question is: what is the mean of our sample?

> mean(x)
[1] 100.14132

If we take another sample of 500 random numbers from the SAME distribution, we get a new sample with a different mean.

> x = rnorm(500, mean = 100, sd = 10)
> mean(x)
[1] 100.07884

(12)

Producing a vector of sample means

We will repeatedly sample from the normal distribution, 500 times. Each of the 500 samples will contain 5 random numbers (instead of 500) drawn from the normal distribution having mean 100 and sd 10. We will then compute the mean of each sample.

We begin by declaring the mean and the standard deviation. Then, we declare the sample size.

> µ = 100; σ = 10
> n = 5

We then need some place to store the means of the samples. We initialize a vector xbar to contain 500 zeros.

> xbar = rep(0, 500)

(13)

Producing a vector of sample means - the for loop

It is easy to draw a sample of size n = 5 from the normal distribution having mean µ = 100 and standard deviation σ = 10. We simply issue the command

rnorm(n, mean = µ, sd = σ).

To find the mean of this sample, we simply wrap the command in mean:

mean(rnorm(n, mean = µ, sd = σ)).

The final step is to store this result in the vector xbar. Then we must repeat this same process an additional 499 times. This requires the use of a for loop.

(14)

The for loop

◮ The i in for (i in 1:500) is called the index of the for loop.

◮ The index i is first set equal to 1, then the body of the for loop is executed. On the next iteration, i is set equal to 2 and the body of the loop is executed again. The loop continues in this manner, incrementing by 1, finally setting the index i to 500. After executing the last iteration, the for loop terminates.

◮ In the body of the for loop, we have xbar[i] = mean(rnorm(n, mean = µ, sd = σ)). This draws a sample of size 5 from the normal distribution, calculates the mean of the sample, and stores the result in xbar[i].

◮ When the for loop completes 500 iterations, the vector xbar contains the means of 500 samples of size 5 drawn from the normal distribution having µ = 100 and σ = 10.
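Putting the pieces described above together gives the following sketch. The names mu and sigma are ASCII stand-ins for µ and σ (an assumption made for portability across R locales), the seed is arbitrary, and the hist arguments are borrowed from the later experiment with n = 10.

```r
set.seed(6)                 # arbitrary seed
mu = 100; sigma = 10        # ASCII stand-ins for µ and σ
n = 5                       # size of each sample
xbar = rep(0, 500)          # storage for the 500 sample means

for (i in 1:500) {
  # draw one sample of size n and store its mean in position i
  xbar[i] = mean(rnorm(n, mean = mu, sd = sigma))
}

mean(xbar)                  # mean of the sample means, expected near 100
hist(xbar, prob = TRUE, breaks = 12, xlim = c(70, 130), ylim = c(0, 0.1))
```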

(15)

Distribution of $\hat{X}_n$ - observations

1. The previous histograms described the shape of the 500 randomly selected numbers; here, the histogram describes the distribution of 500 different sample means, each of which was found by selecting n = 5 random numbers from the normal distribution.

2. The distribution of xbar appears normal in shape. This is so even though the sample size is relatively small (n = 5).

3. It appears that the balance point occurs near 100. This can be checked with the following command:

> mean(xbar)

This is the mean of the sample means, which is almost equal to the mean of the distribution from which the random numbers were drawn.

(16)

Increasing the sample size

Let's repeat the last experiment, but this time let's draw samples of size n = 10 from the same distribution (µ = 100, σ = 10).

> µ = 100; σ = 10
> n = 10
> xbar = rep(0, 500)
> for (i in 1:500) { xbar[i] = mean(rnorm(n, mean = µ, sd = σ)) }
> hist(xbar, prob = TRUE, breaks = 12, xlim = c(70, 130), ylim = c(0, 0.1))

(17)

Key Ideas

1. When we select samples from a normal distribution, the distribution of sample means is also normal in shape.

2. The mean of the distribution of sample means appears to be the same as the mean of the random numbers (parent population) (compare the balance points).

3. By increasing the sample size of our samples, the histogram becomes narrower. In fact, we would expect a more accurate estimate of the mean of the parent population if we take the mean from a larger sample.
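The narrowing in point 3 can be quantified with a short sketch that compares the standard deviation of the sample means for n = 5 and n = 10 with the value σ/√n implied by the CLT variance σ²/n. The helper sample_means is hypothetical (introduced only for this illustration), mu and sigma are ASCII stand-ins for µ and σ, and the seed is arbitrary.

```r
set.seed(7)                  # arbitrary seed
mu = 100; sigma = 10         # ASCII stand-ins for µ and σ

# hypothetical helper: returns reps sample means, each from a sample of size n
sample_means = function(n, reps = 500) {
  xbar = rep(0, reps)
  for (i in 1:reps) xbar[i] = mean(rnorm(n, mean = mu, sd = sigma))
  xbar
}

sd5 = sd(sample_means(5))    # theory: sigma / sqrt(5), about 4.47
sd10 = sd(sample_means(10))  # theory: sigma / sqrt(10), about 3.16
c(sd5 = sd5, sd10 = sd10)
```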

(18)

Summary

We finish by restating what the experiments showed about the CLT:

1. If you draw samples from a normal distribution, then the distribution of the sample means is also normal

2. The mean of the sample means is roughly identical to the mean of the parent population

(19)

Homework

Experiment 1: Draw the xbar histogram for n = 1000. What is the shape of the histogram?

Experiment 2: Repeat the full experiment drawing random numbers and sample means from (1) a uniform and (2) a Poisson distribution. Is the histogram of xbar normal in shape for n = 5 and for n = 30?

Experiment 3: Repeat the full experiment using real data instead of random numbers. (HINT: select samples of size n = 5 from the real data, not from rnorm)

(20)

Application of the Law of Large Numbers

Experiment: toss a coin 100 times.

This experiment is like repeating 100 times a random draw from a Bernoulli distribution with parameter ρ = 0.5.

We expect to get heads (value = 1) 50 times and tails (value = 0) 50 times, if the coin is fair.

But, in practice, this does not happen: repeating the experiment, we get a distribution of the number of heads that is centered on 50, but spread out.

(21)

Application of the Law of Large Numbers - 2

x = rbinom(100, 1, 0.5)
x
x2 = rbinom(100, 2, 0.5)

# hist random numbers
hist(x)

# define the empirical frequency
sum(x)

# define the empirical frequency for the sample mean
xfreq = rep(0, 1000)
xfreq

# for loop: define the number i
N = rep(0, 1000)
for (i in 1:1000) N[i] = i
N

# define the cumulated frequency (total)
xfreq[1] = sum(rbinom(100, 1, 0.5))
xfreq[1]
for (i in 2:1000) xfreq[i] = (sum(rbinom(100, 1, 0.5)) + xfreq[i-1])
xfreq

# define the sample mean (cumulative freq divided by number of experiments)
xfreq2 = rep(0, 1000)
for (i in 1:1000) xfreq2[i] = xfreq[i]/i
xfreq2
plot(xfreq2, ylim = c(48, 52))
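The same running mean can be sketched more compactly without explicit loops. This is an alternative formulation, not the script above: it assumes that drawing the number of heads in 100 tosses directly from a binomial distribution with size 100 is an acceptable substitute for summing 100 Bernoulli draws (the two have the same distribution), and it uses cumsum and seq_along for the cumulative totals. The names heads and running_mean and the seed are illustrative.

```r
set.seed(8)                                    # arbitrary seed

# heads[i]: number of heads in the i-th experiment of 100 tosses
heads = rbinom(1000, size = 100, prob = 0.5)

# running mean of the number of heads after 1, 2, ..., 1000 experiments
running_mean = cumsum(heads) / seq_along(heads)

plot(running_mean, type = "l", ylim = c(48, 52))
abline(h = 50, lty = 2)                        # expected value 100 * 0.5 = 50
```

As the number of experiments grows, the line settles closer and closer to 50, which is the behaviour the LLN describes.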