Lecture 02 2017 18 Ch02 EDA

Academic year: 2018

(1)

Exploratory Data Analysis

Prof. dr. Siswanto Agus Wilopo, M.Sc., Sc.D.

Department of Biostatistics, Epidemiology and

Population Health

Faculty of Medicine

(2)

Table:

Assess the use of tables for each type of data.

Describe a frequency distribution.

Create a frequency table from raw data.

Construct relative frequency, cumulative frequency and relative cumulative frequency tables.

Construct grouped frequency tables.

Construct a cross-tabulation table.

Illustrate the use of a contingency table.

(3)

Graph:

Assess the most appropriate chart for a given data type.

Construct pie charts and simple, clustered and stacked bar charts.

Create histograms.

Create step charts and ogives.

Construct time series charts, including statistical process control (SPC).

Interpret and assess what a chart reveals.

Assess the meaning by looking at the ‘shape’ of a frequency distribution.

Appraise negatively skewed, symmetric and positively skewed distributions.

Describe a bimodal distribution.

Describe the approximate shape of a frequency distribution from

(4)

Numeric Summary:

Describe what a summary measure of location is, and understand the meaning of, and the difference between, the mode, the median and the mean.

Compute the mode, median and mean for a set of values.

Formulate the role of data type and distributional shape in choosing the most appropriate measure of location.

Describe what a percentile is, and calculate any given percentile value.

Describe what a summary measure of spread is.

Differentiate between, and calculate, the range, the interquartile range and the standard deviation.

Interpret estimated percentile values.

Formulate the role of data type and distributional shape in

(5)

The Big Picture

Recall “The Big Picture,” the four-step process that encompasses statistics (as it is presented in this course):

1. Producing Data: choosing a sample from the population of interest and collecting data.

2. Exploratory Data Analysis (EDA) or Descriptive Statistics: summarizing the data we’ve collected.

3. Probability.

4. Inference: drawing conclusions about the entire population based on the data collected from the sample.

(6)
(7)
(8)
(9)
(10)
(11)

Goals of EDA

Exploratory Data Analysis (EDA) is how

we make sense of the data by

(12)

EDA consists of:

organizing and summarizing the raw

data,

discovering important features and

patterns in the data and any striking

deviations from those patterns, and then

interpreting our findings in the context of

(13)

(continued)

And can be useful for:

describing the distribution of a single

variable (center, spread, shape, outliers)

checking data (for errors or other

problems)

checking assumptions of more complex

statistical analyses

investigating relationships between

(14)

EDA

Exploratory data analysis (EDA) methods are

often called Descriptive Statistics due to the

fact that they simply describe, or provide

estimates based on, the data at hand.

Comparisons can be visualized and values of

interest estimated using EDA but descriptive

statistics alone will provide no information

(15)

Important Features of Exploratory Data

Analysis

There are two important features to the

structure of the EDA unit in this course:

The material in this unit covers two

broad topics:

Examining Distributions — exploring data one

variable at a time.

Examining Relationships — exploring data two

(16)

Important Features of Exploratory Data

Analysis

In Exploratory Data Analysis, our

exploration of data will always consist

of the following two elements:

visual displays, supplemented by

numerical measures.

(17)
(18)

Examining Distributions

We will begin the EDA part of the course

by exploring (or looking at) one variable

at a time.

As we have seen, the data for each

(19)

Examining Distributions

In order to convert these raw data into

useful information, we need to summarize

and then examine the distribution of the

variable.

By distribution of a variable, we mean:

what values the variable takes, and

how often the variable takes those values.

We will first learn how to summarize and

examine the distribution of a single

(20)
(21)

Example:

Distribution of One Categorical Variable

What is your perception of your own

body? Do you feel that you are

overweight, underweight, or about right?

A random sample of 1,200 college

(22)

Example Raw Data out of 1200 students

Student

Body Image

student 25

overweight

student 26

about right

student 27

underweight

student 28

about right

(23)

Here is some information that would be

interesting to get from these data:

What percentage of the sampled students fall into

each category?

How are students divided across the three body

image categories?

Are they equally divided? If not, do the

(24)

There is no way that we can answer

these questions by looking at the raw

data, which are in the form of a long list

of 1,200 responses, and thus not very

useful.

However, these questions will be

easily answered once we summarize and

look at the distribution of the variable

Body Image (i.e., once we summarize

how often each of the categories

(25)

Numerical Measures

In order to summarize the distribution of

a categorical variable, we first create a

table of the different values (categories)

the variable takes, how many times each

value occurs (count) and, more

importantly, how often each value occurs

(by converting the counts to

percentages).

The result is often called a Frequency

(26)

A Frequency Distribution or Frequency Table

Category       Count   Percent
About right      855   (855/1200)*100 = 71.3%
Overweight       235   (235/1200)*100 = 19.6%
Underweight      110   (110/1200)*100 = 9.2%
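The percent column above can be reproduced with a short Python sketch. The counts come from the table; the function name `percent` is only illustrative. Half-up rounding via `decimal` is an assumption made so that 71.25 is reported as 71.3, matching the slide, because Python's built-in `round` rounds exact ties to the even digit.

```python
from decimal import Decimal, ROUND_HALF_UP

# Counts taken from the frequency table above.
counts = {"About right": 855, "Overweight": 235, "Underweight": 110}
total = sum(counts.values())  # 1200 sampled students


def percent(count, total):
    """Percentage of the total, rounded half-up to one decimal place."""
    value = Decimal(100 * count) / Decimal(total)
    return value.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


table = {category: percent(n, total) for category, n in counts.items()}

for category, n in counts.items():
    print(f"{category:<12} {n:>5} {table[category]:>6}%")
```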

(27)
(28)

Visual or Graphical Displays

(29)
(30)

To display data from one quantitative

variable graphically, we can use either

a histogram or boxplot.

We will also present several “by-hand”

(31)

Numerical Measures

The overall pattern of the distribution of

a quantitative variable is described by

its shape, center, and spread.

By inspecting the histogram or boxplot,

we can describe the shape of the

(32)

Numerical Measures

A description of the distribution of a

quantitative variable must include, in

addition to the graphical display, a

more precise numerical description of

the center and spread of the

(33)

Numerical Measures

how to quantify the center and spread of

a distribution with various numerical

measures;

some of the properties of those numerical

measures; and

how to choose the appropriate numerical

measures of center and spread to

supplement the histogram.

We will also discuss a few measures of

position or location which allow us to

(34)

How To Create Histograms

Here are the exam grades of 15 students, summarized by score interval:

Score       Count
[40-50)       1
[50-60)       2
[60-70)       4
[70-80)       5
[80-90)       2
[90-100)      1

(35)

Stemplot (Stem and Leaf Plot)

The stemplot (also called stem and leaf plot) is another graphical display of the distribution of a quantitative variable.

The idea is to separate each data point into a

stem and leaf, as follows:

The leaf is the right-most digit.

The stem is everything except the right-most digit.

So, if the data point is 34, then 3 is the stem and 4 is the leaf.

If the data point is 3.41, then 3.4 is the stem and 1 is the leaf.

Note: For this to work, ALL data points should

(36)

Stemplot (Stem and Leaf Plot)

EXAMPLE: Best Actress Oscar Winners

We will use the Best Actress Oscar winners

example

34 34 26 37 42 41 35 31 41 33 30 74 33 49 38 61 21

41 26 80 43 29 33 35 45 49 39 34 26 25 35 33

To make a stemplot:

Separate each observation into a stem and a leaf.

Write the stems in a vertical column with the

smallest at the top, and draw a vertical line at the

right of this column.

Go through the data points, and write each leaf in

the row to the right of its stem.
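The three steps above can be sketched in Python using the ages listed in this example. This is a minimal illustration that assumes two-digit, non-negative integers (stem = tens digit, leaf = units digit); the function name `stemplot` is hypothetical.

```python
from collections import defaultdict

# Ages of the Best Actress Oscar winners, as listed in the example above.
ages = [34, 34, 26, 37, 42, 41, 35, 31, 41, 33, 30, 74, 33, 49, 38, 61, 21,
        41, 26, 80, 43, 29, 33, 35, 45, 49, 39, 34, 26, 25, 35, 33]


def stemplot(values):
    """Return one text row per stem, smallest stem first."""
    leaves = defaultdict(list)
    for value in sorted(values):
        leaves[value // 10].append(value % 10)  # stem, leaf
    rows = []
    # Empty stems are kept so that gaps in the distribution stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        row_leaves = "".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem} | {row_leaves}")
    return rows


rows = stemplot(ages)
print("\n".join(rows))
```

Sorting before splitting puts the leaves of each row in increasing order, which makes the shape of the distribution easier to read.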

(37)
(38)

Summary Measures

Arithmetic Mean

Median

Mode

Describing Data Numerically

Variance

Standard Deviation

Coefficient of Variation

Range

Interquartile Range

Geometric Mean

Skewness

(39)
(40)

Measures of Central Tendency

Central Tendency

Arithmetic Mean

Median

Mode

Geometric Mean


(41)

Arithmetic Mean

The arithmetic mean (sample mean)

(42)

Arithmetic Mean

The most common measure of central tendency

Mean = sum of values divided by the number of

values

Affected by extreme values (outliers)
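The verbal definition "sum of values divided by the number of values" is usually written as follows. The symbols are assumptions that match the rest of these slides, with X_i the i-th observed value and n the sample size.

```latex
\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}
```

Because every value enters the sum, a single extreme value shifts the mean, which is why the mean is described as affected by outliers.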

(43)

Median

In an ordered array, the median is the

“middle” number (50% above, 50%

below)

Not affected by extreme values

[Figure: values plotted on a 0 to 10 axis; Median = 3]

(44)

Finding the Median

The location of the median: median position = (n + 1)/2 in the ordered data

If the number of values is odd, the median is the middle number

If the number of values is even, the median is the average of the two middle numbers

Note that (n + 1)/2 is not the value of the median, only the position of the median in the ranked data
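A minimal Python sketch of the position rule, assuming the (n + 1)/2 position used for Q2 elsewhere in these slides. The odd-sized data set is the ordered array from the quartile slides, and the even-sized one is the Best Actress ages; the function name is illustrative.

```python
def median(values):
    """Median via the (n + 1) / 2 position rule on the ranked data."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2            # 1-based position, e.g. 5.0 or 16.5
    lower = ordered[int(position) - 1]
    if position == int(position):     # odd n: the position is a whole number
        return lower
    upper = ordered[int(position)]    # even n: average the two middle values
    return (lower + upper) / 2


odd_sample = [11, 12, 13, 16, 16, 17, 18, 21, 22]            # n = 9
ages = [34, 34, 26, 37, 42, 41, 35, 31, 41, 33, 30, 74, 33, 49, 38, 61, 21,
        41, 26, 80, 43, 29, 33, 35, 45, 49, 39, 34, 26, 25, 35, 33]  # n = 32

print(median(odd_sample))  # 5th ranked value
print(median(ages))        # average of the 16th and 17th ranked values
```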

(45)

Mode

A measure of central tendency

Value that occurs most often

Not affected by extreme values

Used for either numerical or

categorical (nominal) data

There may be no mode

There may be several modes

(46)

Mean is generally used, unless extreme values (outliers) exist

Then median is often used, since the median is not sensitive to extreme values.

Problem

(47)

Measures of Location

Comparison of Mean and Median

Let us use cholesterol data as an example:

(48)

Measures of Location

Comparison of Mean and Median

Suppose we replace 250 with 215: we will find the mean is 178.7 and the median remains 166.

(49)

Geometric Mean

Geometric mean

Used to measure the rate of change of a variable

over time

Geometric mean rate of return

Measures the status of an investment over time

(50)

Example

An investment of $100,000 declined to $50,000 at

the end of year one and rebounded to $100,000

at end of year two:

(51)

Example

Use the 1-year returns to compute the

arithmetic mean and the geometric mean:
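A short Python sketch of this comparison, using only the three investment values stated in the example. The geometric mean rate of return is taken here to be the n-th root of the product of (1 + return) minus 1, which is the usual definition and an assumption about what the missing slide formula showed.

```python
import math

# Investment value at the start, after year one, and after year two.
values = [100_000, 50_000, 100_000]

# One-year returns: (end - start) / start for each consecutive pair.
returns = [(end - start) / start for start, end in zip(values, values[1:])]

arithmetic_mean = sum(returns) / len(returns)
geometric_mean = math.prod(1 + r for r in returns) ** (1 / len(returns)) - 1

print(returns)          # yearly returns as fractions
print(arithmetic_mean)  # average of the yearly returns
print(geometric_mean)   # compound rate that reproduces the final value
```

The arithmetic mean suggests a 25% average yearly gain even though the investment ended where it started; the geometric mean of 0% reflects the actual change over the two years.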

(52)
(53)

Same center,

Measures of Variation

Range

Interquartile Range

Variance

Standard Deviation

Coefficient of Variation

Measures of variation give information on the spread

(54)

Range

Simplest measure of variation

Difference between the largest and

the smallest values in a set of data:

$\text{Range} = X_{\text{largest}} - X_{\text{smallest}}$

Example (values from 1 to 14, shown on a 0 to 14 axis): Range = 14 - 1 = 13

(55)

Disadvantages of the Range

Ignores the way in which data are distributed

[Figure: two differently distributed data sets on a 7 to 12 axis, each with Range = 12 - 7 = 5]

Sensitive to outliers

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5

Range = 5 - 1 = 4

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120

Range = 120 - 1 = 119

(56)

Quartiles

Quartiles split the ranked data into 4 segments with an equal number of values per segment (25% each)

The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger

Q2 is the same as the median (50% are smaller, 50% are larger)

Only 25% of the observations are greater than the third quartile

(57)

Quartile Formulas

Find a quartile by determining the value in the appropriate position in the ranked data, where

First quartile position: Q1 = (n+1)/4

Second quartile position: Q2 = (n+1)/2 (the median position)

Third quartile position: Q3 = 3(n+1)/4

(58)

Calculating Quartiles

Sample Data in Ordered Array: 11 12 13 16 16 17 18 21 22

Example: Find the first quartile

(n = 9)

Q1 is in the (9+1)/4 = 2.5 position of the ranked data, so use the value half way between the 2nd and 3rd values, so Q1 = 12.5

Q1 and Q3 are measures of noncentral location

Q2 = median, a measure of central tendency

(59)

Quartiles

Sample Data in Ordered Array: 11 12 13 16 16 17 18 21 22

Example:

(n = 9)

Q1 is in the (9+1)/4 = 2.5 position of the ranked data, so Q1 = 12.5

Q2 is in the (9+1)/2 = 5th position of the ranked data, so Q2 = median = 16

Q3 is in the 3(9+1)/4 = 7.5 position of the ranked data, so Q3 = 19.5
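The position formulas can be checked with a small Python sketch on the nine ordered values used in this example. One assumption is labeled in the code: a fractional position is resolved by linear interpolation between its two neighbours, which agrees with the "half way between" rule for positions ending in .5 but may differ from the slides' convention for other fractions. The function names are illustrative.

```python
def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional, position in ranked data.

    Assumption: fractional positions are interpolated linearly between
    the two neighbouring ranked values.
    """
    whole = int(position)
    fraction = position - whole
    lower = ordered[whole - 1]
    if fraction == 0:
        return lower
    return lower + fraction * (ordered[whole] - lower)


def quartiles(values):
    """Q1, Q2, Q3 at positions (n+1)/4, (n+1)/2, 3(n+1)/4 (needs n >= 3)."""
    ordered = sorted(values)
    n = len(ordered)
    return tuple(value_at_position(ordered, k * (n + 1) / 4) for k in (1, 2, 3))


data = [11, 12, 13, 16, 16, 17, 18, 21, 22]
q1, q2, q3 = quartiles(data)
iqr = q3 - q1  # interquartile range = 3rd quartile - 1st quartile
print(q1, q2, q3, iqr)
```

Statistical packages use several different quartile conventions, so results from software may differ slightly from this hand method on small samples.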

(60)

Interquartile Range

Can eliminate some outlier problems by using the interquartile range

Eliminate some high- and low-valued observations and calculate the range from the remaining values

Interquartile range = 3rd quartile – 1st quartile

(61)

Interquartile Range

Example:

[Figure: boxplot with X minimum = 12, Q1 = 30, Median (Q2) = 45, Q3 = 57, X maximum = 70; each of the four segments holds 25% of the values]

(62)

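Variance

Average (approximately) of squared deviations of values from the mean

Sample variance:

$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$

Where

$\bar{X}$ = mean

n = sample size

$X_i$ = i-th value of the variable X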

(63)

Standard Deviation

Most commonly used measure of

variation

Shows variation about the mean

Is the square root of the variance

Has the same units as the original data

Sample standard deviation:

$$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$
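As a worked illustration, the sketch below computes the sample variance and sample standard deviation with the usual n - 1 divisor. The data are borrowed from the ordered array on the quartile slides, not from the calculation example that follows, whose values are not shown here; the function names are illustrative.

```python
import math

# Borrowed from the quartile example; used here only for illustration.
data = [11, 12, 13, 16, 16, 17, 18, 21, 22]


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_std(values):
    """Square root of the sample variance; same units as the data."""
    return math.sqrt(sample_variance(values))


variance = sample_variance(data)
std_dev = sample_std(data)
print(round(variance, 3), round(std_dev, 3))
```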

(64)

Calculation Example:

Sample Standard Deviation

Sample

(65)

Measuring variation

Small standard deviation

(66)

Comparing Standard Deviations

[Figure: dot plots of Data A and Data B on a common 11 to 21 axis. Both have Mean = 15.5; the S values shown are 3.338 and 0.926.]

(67)

Advantages of Variance and

Standard Deviation

Each value in the data set is used in

the calculation

Values far from the mean are given

extra weight

(68)

Coefficient of Variation

Measures relative variation

Always in percentage (%)

Shows variation relative to mean

Can be used to compare two or more

(69)

Comparing Coefficient of Variation

Hospital A:

Average surplus in the last 10 years = 50 Billion Rp.

Standard deviation = 5 Billion Rp.

CV of A = (5 / 50) x 100% = 10%

Hospital B:

Average surplus in the last 10 years = 100 Billion Rp.

Standard deviation = 5 Billion Rp.

CV of B = (5 / 100) x 100% = 5%

Both hospitals have the same standard deviation, but hospital B is less variable relative to its surplus
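The hospital comparison can be reproduced in a few lines of Python. The sketch assumes the standard definition of the coefficient of variation, the standard deviation divided by the mean and expressed as a percentage; the function name is illustrative.

```python
def coefficient_of_variation(std_dev, mean):
    """Relative variation as a percentage: 100 * S / mean."""
    return 100 * std_dev / mean


# Figures in billion Rp., taken from the hospital example.
cv_a = coefficient_of_variation(std_dev=5, mean=50)
cv_b = coefficient_of_variation(std_dev=5, mean=100)

print(f"Hospital A: {cv_a:.0f}%")
print(f"Hospital B: {cv_b:.0f}%")
```

Because the units cancel, the same function can compare variables measured on different scales.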

(70)

Standardized Scores (Z-Scores)

Z-scores use the mean and standard deviation as the

primary measures of center and spread and are therefore

most useful when the mean and standard deviation are

appropriate, i.e. when the distribution is reasonably

symmetric with no extreme outliers.

For any individual, the z-score tells us how many standard

deviations the raw score for that individual deviates from

the mean and in what direction.

To calculate a z-score, we take the individual value and

subtract the mean and then divide this difference by the

standard deviation.

A positive z-score indicates the individual is above

(71)

Z Scores

A measure of distance from the mean (for

example, a Z-score of 2.0 means that a value is 2.0

standard deviations from the mean)

The difference between a value and the mean,

divided by the standard deviation

A value with a Z-score above 3.0 or below -3.0 is considered an outlier

$$Z = \frac{X - \bar{X}}{S}$$

(72)

Z Scores

Example:

If the mean is 14.0 and the standard deviation is

3.0, what is the Z score for the value 18.5?

The value 18.5 is 1.5 standard deviations above the

mean

(A negative Z-score would mean that a value is less

than the mean)
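The calculation behind this example can be sketched in Python. The 3.0 cutoff in `is_outlier` follows the rule of thumb stated on the previous slide; both function names are illustrative.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / std_dev


def is_outlier(z, cutoff=3.0):
    """Rule of thumb from the slides: |Z| above 3.0 flags an outlier."""
    return abs(z) > cutoff


z = z_score(18.5, mean=14.0, std_dev=3.0)  # (18.5 - 14.0) / 3.0
print(z, is_outlier(z))
```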

(73)

MEASURE SPREAD AND DISTRIBUTION

(74)
(75)
(76)

Shape

When describing the shape of a

distribution, we should consider:

Symmetry/skewness of the

distribution.

Peakedness (modality) — the

(77)
(78)
(79)
(80)

Shape of a Distribution

Describes how data are distributed

Measures of shape: symmetric or skewed

Left-skewed (negatively skewed): Mean < Median

Symmetric: Mean = Median

Right-skewed (positively skewed): Median < Mean

(81)

Numerical Measures

for a Population

Population summary measures are called parameters

The population mean is the sum of the values in the population divided by the population size, N

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$

μ = population mean

N = population size

$X_i$ = i-th value of the variable X

(82)

Population Variance

Average of squared deviations of values from the mean

Population variance:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$

Where

μ = population mean

N = population size

(83)

Population Standard Deviation

Most commonly used measure of

variation

Shows variation about the mean

Is the square root of the population variance

Has the same units as the original data

Population standard deviation:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$$

(84)

The Sample Covariance

The sample covariance measures the strength of the linear relationship between two variables (called bivariate data)

The sample covariance:

$$\text{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$

Only concerned with the strength of the relationship

No causal effect is implied

(85)

Covariance between two random variables:

cov(X,Y) > 0: X and Y tend to move in the same direction

cov(X,Y) < 0: X and Y tend to move in opposite directions

cov(X,Y) = 0: X and Y have no linear relationship (this alone does not establish independence)

(86)

Coefficient of Correlation

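Measures the relative strength of the linear relationship between two variables

Sample coefficient of correlation:

$$r = \frac{\text{cov}(X, Y)}{S_X S_Y}$$

where $S_X$ and $S_Y$ are the sample standard deviations of X and Y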

(87)

Features of

Correlation Coefficient, r

Unit free

Ranges between –1 and 1

The closer to –1, the stronger the

negative linear relationship

The closer to 1, the stronger the

positive linear relationship

The closer to 0, the weaker the linear
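A minimal Python sketch of the sample covariance and correlation coefficient, written by hand so that each step is visible. All three small data sets are hypothetical and chosen only to show the extremes: a perfect positive line, a perfect negative line, and a curved relationship whose covariance is zero even though Y is completely determined by X.

```python
import math


def mean(values):
    return sum(values) / len(values)


def sample_covariance(x, y):
    """Sum of cross-products of deviations, divided by n - 1."""
    x_bar, y_bar = mean(x), mean(y)
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (len(x) - 1)


def sample_std(values):
    # The covariance of a variable with itself is its sample variance.
    return math.sqrt(sample_covariance(values, values))


def correlation(x, y):
    """r = cov(X, Y) / (S_X * S_Y); unit free, between -1 and 1."""
    return sample_covariance(x, y) / (sample_std(x) * sample_std(y))


# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
r_up = correlation(x, [2, 4, 6, 8, 10])      # points on a rising line
r_down = correlation(x, [10, 8, 6, 4, 2])    # points on a falling line

u = [-2, -1, 0, 1, 2]
cov_curve = sample_covariance(u, [v ** 2 for v in u])  # Y = X squared

print(round(r_up, 6), round(r_down, 6), cov_curve)
```

The last case illustrates that a covariance of zero only rules out a linear relationship.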

(88)
(89)

The Empirical Rule

If the data distribution is approximately bell-shaped, then the interval:

μ ± 1σ contains about 68% of the values in the population or the sample

(90)

The Empirical Rule

μ ± 2σ contains about 95% of the values in the population or the sample

μ ± 3σ contains about 99.7% of the values in the population or the sample

(91)

Chebyshev Rule

Regardless of how the data are distributed, at least $(1 - 1/k^2) \times 100\%$ of the values will fall within k standard deviations of the mean (for k > 1)

Examples:

k = 1 (μ ± 1σ): $(1 - 1/1^2) \times 100\%$ = 0%

k = 2 (μ ± 2σ): $(1 - 1/2^2) \times 100\%$ = 75%

k = 3 (μ ± 3σ): $(1 - 1/3^2) \times 100\%$ = 89%
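The bound is easy to tabulate. The sketch below evaluates (1 - 1/k^2) x 100% for the three values of k in the examples; the function name is illustrative, and k = 1 is accepted only to reproduce the trivial 0% case shown above.

```python
def chebyshev_lower_bound(k):
    """Minimum percentage of values within k standard deviations of the mean."""
    if k < 1:
        raise ValueError("the bound is only meaningful for k >= 1")
    return (1 - 1 / k ** 2) * 100


bounds = {k: chebyshev_lower_bound(k) for k in (1, 2, 3)}
for k, bound in bounds.items():
    print(f"k = {k}: at least {bound:.0f}%")
```

These are guaranteed minimums for any distribution, so they are lower than the approximate 68%, 95% and 99.7% that the Empirical Rule gives for bell-shaped data.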

(92)
(93)
(94)

Five-Number Summary

The combination of the five numbers (min, Q1,

M, Q3, Max) is called the

five number

summary.

It provides a quick numerical description of

both the center and spread of a distribution.

Each of the values represents a measure of

position in the dataset.

The min and max providing the boundaries

and the quartiles and median providing

(95)
(96)
(97)
(98)

The 1.5(IQR) Criterion for Outliers

An observation is considered a suspected outlier or potential outlier if it is:

below Q1 – 1.5(IQR) or

above Q3 + 1.5(IQR).

(99)
(100)

EXAMPLE: Best Actress Oscar Winners

We will continue with the Best Actress Oscar winners example.

We can now use the 1.5(IQR) criterion to check whether the three highest ages should indeed be classified as potential outliers:

For this example, we found Q1 = 32 and Q3 = 41.5, which give an IQR = 9.5

Q1 – 1.5 (IQR) = 32 – (1.5)(9.5) = 17.75

Q3 + 1.5 (IQR) = 41.5 + (1.5)(9.5) = 55.75

The 1.5(IQR) criterion tells us that any observation with an age that is below 17.75 or above 55.75 is considered a suspected outlier.

We therefore conclude that the observations with ages of 61, 74 and 80 should be flagged as suspected outliers in the distribution of ages.

Note that since the smallest observation is 21, there are no suspected low outliers in this distribution.
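The check can be reproduced in Python with the ages listed earlier. The quartiles here are computed as the medians of the lower and upper halves of the ranked data, which is an assumption about the method; it reproduces the Q1 = 32 and Q3 = 41.5 quoted for this example. The function names are illustrative.

```python
ages = [34, 34, 26, 37, 42, 41, 35, 31, 41, 33, 30, 74, 33, 49, 38, 61, 21,
        41, 26, 80, 43, 29, 33, 35, 45, 49, 39, 34, 26, 25, 35, 33]


def median_of_sorted(ordered):
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles_by_halves(values):
    """Q1 and Q3 as medians of the lower and upper halves of the ranked data."""
    ordered = sorted(values)
    half = len(ordered) // 2          # the middle value is left out when n is odd
    return median_of_sorted(ordered[:half]), median_of_sorted(ordered[-half:])


def suspected_outliers(values):
    """Apply the 1.5(IQR) criterion; return both fences and the flagged values."""
    q1, q3 = quartiles_by_halves(values)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    flagged = sorted(v for v in values if v < low_fence or v > high_fence)
    return low_fence, high_fence, flagged


low_fence, high_fence, flagged = suspected_outliers(ages)
print(low_fence, high_fence, flagged)
```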

(101)

Possible methods for handling

outliers in practice

Why is it important to identify possible outliers, and how should

they be dealt with? The answers to these questions depend on the

reasons for the outlying values.

Here are several possibilities:

Even though it is an extreme value, if an outlier can be

understood to have been produced by essentially the same sort

of physical or biological process as the rest of the data, and if

such extreme values are expected to eventually occur again,

then such an outlier indicates something important and

(102)

If an outlier can be explained to have been produced

under fundamentally different conditions from the rest of

the data (or by a fundamentally different process), such

an outlier can be removed from the data if your goal is to

investigate only the process that produced the rest of the

data.

An outlier might indicate a mistake in the data (like a

typo, or a measuring error), in which case it should be

corrected if possible or else removed from the data

before calculating summary statistics or making

(103)

BOXPLOTS

(104)

EXAMPLE: Best Actress Oscar Winners

We will use data on the Best Actress Oscar

winners as an example

34 34 26 37 42 41 35 31 41 33 30 74 33 49

38 61 21 41 26 80 43 29 33 35 45 49 39 34

26 25 35 33

The five number summary of the age of

Best Actress Oscar winners (1970-2001) is:

min = 21, Q1 = 32, M = 35,

(105)

Box Plot and Outliers

Lines extend from the edges of the box to the smallest and largest observations that were not classified as suspected outliers (using the 1.5xIQR criterion).

In our example, we have no low outliers, so the bottom line goes down to the smallest observation, which is 21.

Since we have three

(106)

The following information is visually depicted in the boxplot:

the five number summary (blue)

the range and IQR (red)

outliers

(107)
(108)

Box Plot Summarized

The five-number summary of a distribution

consists of M, Q1, Q3 and the extremes Min, Max.

The median describes the center, and the

extremes (which give the range) and the quartiles

(which give the IQR) describe the spread.

The boxplot visually displays the five-number summary and any suspected outliers identified by the 1.5(IQR) criterion.

Boxplots presented in side-by-side to compare

(109)
(110)

Classification

In most studies involving two variables, each of the

variables has a role. We distinguish between:

the response variable (dependent) — the outcome of the

study; and

the explanatory variable (independent) — the variable that

claims to explain, predict or affect the response.

The variable we wish to predict is commonly called

the dependent variable, the outcome variable, or

the response variable.

Any variable we are using to predict (or explain

(111)

If we further classify each of the two relevant

variables according to type (categorical or

quantitative),

We get the following 4 possibilities

for “role-type classification”

(112)
(113)

Case C→Q:

Exploring the relationship amounts

to comparing the distributions of the

quantitative response variable for each

category of the explanatory variable.

To do this, we use:

Display: side-by-side boxplots.

Numerical summaries: descriptive statistics of the

(114)

Case C→C:

Exploring the relationship amounts

to comparing the distributions of the

categorical response variable, for

each category of the explanatory

variable.

To do this, we use:

Display: two-way table.

Numerical summaries: conditional percentages (of

(115)
(116)
(117)

Case Q→Q

We examine the relationship using:

Display: scatterplot.

When describing the relationship as

displayed by the scatterplot, be sure to

consider:

Overall pattern → direction, form, strength.

Deviations from the pattern → outliers.

Labeling the scatterplot (including a

(118)
(119)
(120)

Interpreting Scatterplots

• How do we explore the relationship between two

quantitative variables using the scatterplot?

(121)
(122)
(123)

In the special case that the scatterplot displays a linear relationship (and only then), we supplement the scatterplot with:

Numerical summaries: Pearson’s correlation

coefficient (r) measures the direction and, more

importantly, the strength of the linear relationship.

The closer r is to 1 (or -1), the stronger the positive

(or negative) linear relationship. r is unitless,

(124)
(125)

When the relationship is linear (as

displayed by the scatterplot, and

supported by the correlation r), we can

summarize the linear pattern using

the least squares regression line.

Remember that:

The slope of the regression line tells us the average change in the response variable associated with a 1-unit increase in the explanatory variable.

When using the regression line for predictions, you should

(126)
(127)

When examining the relationship between two variables (regardless of the case), any observed relationship (association) does not imply causation, due to the possible presence of lurking variables.

When we include a lurking variable in our

analysis, we might need to rethink the

(128)
(129)
(130)

Simpson’s paradox

Note that despite our earlier finding that overall Hospital A has a

higher death rate (3% vs. 2%) when we take into account the lurking

variable, we find that actually it is Hospital B that has the higher

death rate both among the severely ill patients (4% vs. 3.8%) and

among the not severely ill patients (1.3% vs. 1%).

(131)
