Descriptive statistics (part 2)
Thursday, June 13, 2024
Fakultas Ilmu Komputer Universitas Indonesia 2014
CSF2600102 – Statistics and Probability
FASILKOM, Universitas Indonesia
Credits
These course slides were prepared by Alfan F. Wicaksono.
Suggestions, comments, and criticism regarding these slides are welcome. Please kindly send your inquiries to [email protected].
Technical questions regarding the topic should be directed to current lecturer team members:
• Ika Alfina, S.Kom., M.Kom.
• Prof. T. Basaruddin, Ph.D.
• Alfan F. Wicaksono
The content is based on the previous semester's (odd semester 2013/2014) course slides, created by all previous lecturers.
References
Introduction to Probability and Statistics for Engineers & Scientists, 4th ed., Sheldon M. Ross, Elsevier, 2009.
Applied Statistics for the Behavioral Sciences, 5th ed., Hinkle, Wiersma, and Jurs, Houghton Mifflin Company, New York, 2003.
Elementary Statistics: A Step-by-Step Approach, 8th ed., Allan G. Bluman, McGraw-Hill, 2012.
Summarizing data sets
part II: measures of variability (dispersion)
Measures of variability (dispersion)
These statistics measure the amount of scatter in a data set.
Basic question: how widely are the scores spread throughout the distribution?
Measures of dispersion give us information about how much our variables vary from the mean.
• The measures of variation to be discussed:
• range
• mean deviation
• variance
• standard deviation
Range
Range [Hinkle, 2003]
• Range is the number of units on the scale of measurement that include the highest and lowest values.
Range = (highest score – lowest score) + 1 unit
Sample (ordered data):
Dist. 1: 11 16 18 ... 31 37
Dist. 2: 18 19 21 ... 26 29
Range(Dist. 1) = 37 – 11 + 1 = 27, Range(Dist. 2) = 29 – 18 + 1 = 12
Dist. 1 is more “varied”!
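The range definition above can be sketched in a few lines of Python. The function name `inclusive_range` and the `unit` parameter are illustrative assumptions, and the lists contain only the values visible on the slide (the elided middle values do not affect the range).

```python
def inclusive_range(scores, unit=1):
    """Range as defined above: (highest - lowest) + one measurement unit."""
    return (max(scores) - min(scores)) + unit

# Only the values shown on the slide; the elided "..." values are omitted,
# which is safe because the range depends only on the two extremes.
dist1 = [11, 16, 18, 31, 37]
dist2 = [18, 19, 21, 26, 29]

print(inclusive_range(dist1))  # 27
print(inclusive_range(dist2))  # 12
```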
Mean Deviation
A deviation score (DS) is the difference between a given score and the mean.
Mean deviation (MD) is the average of the absolute values of the deviation scores.
A larger MD indicates greater variation!
$$MD = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n} = \frac{\sum_{i=1}^{n} |DS_i|}{n}$$
where $DS_i = x_i - \bar{x}$ is the i-th deviation score.
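As a minimal sketch of the mean deviation, assuming a small made-up data set (the values below are hypothetical and do not come from the slides):

```python
def mean_deviation(data):
    """Average of the absolute deviations from the mean."""
    mean = sum(data) / len(data)
    return sum(abs(x - mean) for x in data) / len(data)

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values, mean = 5
print(mean_deviation(sample))  # 1.5
```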
Variance
• Use squared deviations instead of absolute deviations.
• Variance is the average of the squared deviations around the mean.
Symbol:
σ² is the variance of a population
s² is the variance of a sample
SS: sum of squares (the sum of squared deviations around the mean)
Population variance:
$$\sigma^2 = \frac{SS}{N} = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
Sample variance:
$$s^2 = \frac{SS}{n-1} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$
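The difference between the two divisors can be illustrated with a short sketch. The data values are hypothetical; `statistics.pvariance` and `statistics.variance` are the standard-library functions for the population and sample variance and are used here only as a cross-check.

```python
import statistics

def sum_of_squares(data):
    """SS: sum of squared deviations around the mean."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values
ss = sum_of_squares(data)

pop_var = ss / len(data)         # divide by N (data treated as a population)
samp_var = ss / (len(data) - 1)  # divide by n - 1 (data treated as a sample)

print(ss, pop_var)               # 32.0 4.0
print(round(samp_var, 4))        # 4.5714
```

The sample variance is always slightly larger than the population variance computed from the same values, because the divisor is smaller.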
Sample Variance
Definition
The sample variance, call it $s^2$, of the data set $x_1, \ldots, x_n$ is defined by
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$
Sample Variance
• Modify the data: multiply each value by a constant a and add a constant b, so that $y_i = a x_i + b$.
• Only the constant a affects the variance of the new data: $s_y^2 = a^2 s_x^2$.
• This can be used to simplify our computation.
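A quick numerical check of this property, assuming hypothetical data and arbitrary constants a = 3 and b = 10 (none of these values come from the slides):

```python
import statistics

x = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values
a, b = 3, 10                   # arbitrary constants for illustration
y = [a * xi + b for xi in x]   # transformed data

var_x = statistics.variance(x)
var_y = statistics.variance(y)

# Adding b shifts every value equally, so it leaves the spread unchanged;
# multiplying by a scales every deviation by a, hence the variance by a squared.
print(abs(var_y - a ** 2 * var_x) < 1e-9)  # True
```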
Sample Variance
[Figure: two distributions compared, one labeled “High variance” and one labeled “Low variance”]
Sample Variance
for Grouped Data [Hinkle et al, 2003]
$$s^2 = \frac{\sum_{i} f_i (m_i - \bar{x})^2}{n-1}, \qquad n = \sum_{i} f_i$$
$f_i$: frequency of the i-th interval
$m_i$: midpoint of the i-th interval
Algebraic Identity
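$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2$$
Proof:
$$\begin{aligned}
\sum_{i=1}^{n} (x_i - \bar{x})^2 &= \sum_{i=1}^{n} (x_i^2 - 2 x_i \bar{x} + \bar{x}^2) \\
&= \sum_{i=1}^{n} x_i^2 - 2\bar{x} \sum_{i=1}^{n} x_i + \sum_{i=1}^{n} \bar{x}^2 \\
&= \sum_{i=1}^{n} x_i^2 - 2n\bar{x}^2 + n\bar{x}^2 \\
&= \sum_{i=1}^{n} x_i^2 - n\bar{x}^2
\end{aligned}$$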
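The identity can be checked numerically. The sketch below computes the sum of squares both ways on hypothetical data; the function names are illustrative only.

```python
def ss_definition(data):
    """Sum of squared deviations, computed directly from the definition."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data)

def ss_shortcut(data):
    """Same quantity via the identity: sum of x_i^2 minus n times mean^2."""
    n = len(data)
    mean = sum(data) / n
    return sum(x * x for x in data) - n * mean ** 2

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values
print(ss_definition(data), ss_shortcut(data))  # 32.0 32.0
```

The shortcut form is convenient for hand computation because it needs only the running totals of $x_i$ and $x_i^2$. In floating-point arithmetic it can lose precision when the mean is large relative to the spread, so the direct form is usually the safer choice in code.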
Standard Deviation
Standard deviation is the square root of the variance.
Symbol:
σ is the standard deviation of a population
s is the sample standard deviation
$$\sigma = \sqrt{\sigma^2}$$
$$s = \sqrt{s^2}$$
Example
A sample consisting of 7 elements:
i    Xi    Xi – mean    (Xi – mean)²
1    9     3            9
2    12    6            36
3    7     1            1
4    5     -1           1
5    2     -4           16
6    3     -3           9
7    4     -2           4
∑    42    0            76
n = 7
total = 42
mean = 42/7 = 6
$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1} = \frac{76}{6} = 12.67$$
$$s = \sqrt{s^2} = \sqrt{12.67} = 3.56$$
Problem
Compute the mean, mean deviation, and variance of the following grouped frequency table!
Class Interval    Frequency
0-2               3
3-5               6
6-8               6
9-11              4
12-14             1
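One way to check a hand-computed answer is the sketch below. It uses the interval midpoints and the grouped sample-variance formula with divisor n – 1. The grouped mean and mean deviation are assumed to weight each midpoint by its frequency, which the slides do not state explicitly.

```python
# (lower limit, upper limit, frequency) for each class interval in the table
table = [(0, 2, 3), (3, 5, 6), (6, 8, 6), (9, 11, 4), (12, 14, 1)]

midpoints = [(lo + hi) / 2 for lo, hi, _ in table]
freqs = [f for _, _, f in table]
n = sum(freqs)  # total number of scores

# Assumption: every score in an interval is represented by the midpoint.
mean = sum(f * m for f, m in zip(freqs, midpoints)) / n
md = sum(f * abs(m - mean) for f, m in zip(freqs, midpoints)) / n
var = sum(f * (m - mean) ** 2 for f, m in zip(freqs, midpoints)) / (n - 1)

print(n, round(mean, 2), round(md, 2), round(var, 2))  # 20 6.1 2.79 11.46
```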
Summarizing data sets
part III: measures of position
Percentile & Percentile Rank
• Statistics for describing individual scores
• A score is actually meaningless without an adequate frame of reference, i.e., without an indication of the relative position of a score in the total distribution of scores.
• Some statistics for this problem
• Percentile
• Percentile Rank
Sample Percentile
Definition
The sample 100p percentile is that data value such that:
• 100p percent of the data are less than or equal to it
• 100(1 - p) percent are greater than or equal to it
If two data values satisfy this condition, then the sample 100p percentile is the average of these two values.
Notation: sample 100p percentile = P100p, e.g., sample 25 percentile = P25
Sample Percentile
To determine the sample 100p percentile of a data set of size n, we need to determine the data values such that:
1. At least np of the values are less than or equal to it.
2. At least n(1-p) of the values are greater than or equal to it.
First, you need to arrange the data in increasing order!
Sample Percentile
Example:
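If n = 22, determine the position of the 80 percentile!
np = 22(0.8) = 17.6, so at least 17.6 of the values must be less than or equal to it.
n(1 - p) = 22(0.2) = 4.4, so at least 4.4 of the values must be greater than or equal to it.
Clearly, only the 18th smallest value satisfies both conditions (at least 18 values are less than or equal to it, and at least 5 values are greater than or equal to it). So, this is the sample 80 percentile: P80 = 18th value.
If np is an integer, then both values in positions np and np+1 satisfy both conditions, and so the sample 100p percentile is the average of these values.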
Sample Percentile
Summary, for non-grouped data
To determine the sample 100p percentile of data of size n:
1. Arrange the data in order (lowest to highest).
2. Compute np.
3. Test:
   1. If np is not a whole number, round up to the next whole number; the percentile is the value in that position.
   2. If np is a whole number, compute the average of the values in positions np and np+1.
Sample Percentile
Definition
First quartile (Q1): the sample 25 percentile.
Second quartile (Q2): the sample 50 percentile.
Third quartile (Q3): the sample 75 percentile.
The second quartile is the sample median.
Interquartile Range (IQR) = Q3 – Q1.
Example:
Determine the first, second, and third quartiles, as well as P70, of the following data set!
{17.11, 6.6, 6.59, 11.06, 2.78, 6.96, 3.79, 4.3}
Sample Percentile
Ordered data set:
{2.78, 3.79, 4.3, 6.59, 6.6, 6.96, 11.06, 17.11}
P25 -> np = 8(0.25) = 2. P25 = (3.79 + 4.3)/2 = 4.045
P50 -> np = 8(0.50) = 4. P50 = (6.59 + 6.6)/2 = 6.595
P75 -> np = 8(0.75) = 6. P75 = (6.96 + 11.06)/2 = 9.01
P70 -> np = 8(0.70) = 5.6, rounded up to position 6. P70 = 6.96
Interquartile Range (IQR) = Q3 – Q1 = 9.01 – 4.045 = 4.965
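The np rule for non-grouped data can be sketched as follows, applied to the example data set. The function name `sample_percentile` is an illustrative choice, and the tolerance check with `math.isclose` is an added safeguard against floating-point noise in n·p; neither is part of the slides.

```python
import math

def sample_percentile(data, p):
    """Sample 100p percentile using the np rule for non-grouped data."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    ordered = sorted(data)
    pos = len(ordered) * p
    nearest = round(pos)
    if math.isclose(pos, nearest):
        # np is a whole number: average the values in positions np and np+1
        # (positions are 1-based, so the indices are nearest-1 and nearest).
        return (ordered[nearest - 1] + ordered[nearest]) / 2
    # Otherwise round np up to the next whole number (1-based position).
    return ordered[math.ceil(pos) - 1]

data = [17.11, 6.6, 6.59, 11.06, 2.78, 6.96, 3.79, 4.3]
q1 = sample_percentile(data, 0.25)
q2 = sample_percentile(data, 0.50)
q3 = sample_percentile(data, 0.75)
p70 = sample_percentile(data, 0.70)
print(round(q1, 3), round(q2, 3), round(q3, 3), p70)  # 4.045 6.595 9.01 6.96
```

Library routines such as `statistics.quantiles` use different interpolation conventions, so their results may differ slightly from this rule.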
Sample Percentile
For a grouped frequency table [Hinkle, 2003]:
$$P_X = ll + \left(\frac{np - cf}{f_i}\right) w$$
$P_X$: the X-th percentile
ll: lower exact limit of the interval containing the percentile point
n: total number of scores
p: proportion corresponding to the desired percentile
cf: cumulative frequency of scores below the interval containing the percentile point
$f_i$: frequency of scores in the interval containing the percentile point
w: width of the class interval
For the left-end-inclusion case, the lower limit of an interval is the left interval bound.
Sample Percentile
Find the 34th percentile!
Here n = 180, and the interval containing the percentile point has ll = 44.5, cf = 50, $f_i$ = 42, w = 5:
$$P_{34} = 44.5 + \left(\frac{180(0.34) - 50}{42}\right)(5) = 45.83$$
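The grouped-data percentile formula translates directly into code. This sketch takes the interval quantities as arguments instead of locating the interval in a frequency table; the function name is an illustrative assumption, and the values are those used in the worked example.

```python
def grouped_percentile(ll, n, p, cf, fi, w):
    """Percentile for grouped data: ll + ((n*p - cf) / fi) * w."""
    return ll + ((n * p - cf) / fi) * w

p34 = grouped_percentile(ll=44.5, n=180, p=0.34, cf=50, fi=42, w=5)
print(round(p34, 2))  # 45.83
```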
Percentile Rank
The percentile rank of a score is the percent of scores less than or equal to that score.
Suppose you got 65 on the final exam of this course. You want to know what percent of students scored lower.
Find the percentile rank of score 65! Notation: PR65
Determining the percentile rank in non-grouped data is easy! How?
We will focus on how to determine the percentile rank in grouped data.
Percentile Rank
For non-grouped data:
Find percentile rank of a score of 12 from the following data:
18, 15, 12, 6, 8, 2, 3, 5, 20, 10
$$PR_X = \frac{(\text{number of values below } X) + 0.5}{\text{total number of values}} \times 100$$
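Applying the formula to the data above gives a percentile rank of 65 for the score 12, since six values lie below it. A minimal sketch (the function name is illustrative):

```python
def percentile_rank(data, x):
    """Percentile rank of x: (values below x + 0.5) / total * 100."""
    below = sum(1 for v in data if v < x)
    return 100 * (below + 0.5) / len(data)

data = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]
print(percentile_rank(data, 12))  # 65.0
```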
Percentile Rank
Percentile rank in grouped data [Hinkle, 2003]
$$PR_X = \left[\frac{cf + \left(\frac{X - ll}{w}\right) f_i}{n}\right](100)$$
$PR_X$ = percentile rank of score X
cf = cumulative frequency of scores below the interval containing the percentile point
ll = exact lower limit of the interval containing the percentile point
w = width of the class interval
$f_i$ = frequency of scores in the interval containing the percentile point
n = total number of scores
For the left-end-inclusion case, the lower limit of an interval is the left interval bound.
Percentile Rank
Find the percentile rank of score 61!
Here n = 180, and the interval containing the score has ll = 59.5, cf = 159, $f_i$ = 15, w = 5:
$$PR_{61} = \left[\frac{159 + \left(\frac{61 - 59.5}{5}\right)(15)}{180}\right](100) = 90.83$$
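The grouped percentile-rank formula can be sketched in the same style. The second call is a consistency check that assumes the interval quantities ll = 44.5, cf = 50, fi = 42, w = 5 from the 34th-percentile example: the score 45.8333... found there should map back to a percentile rank of 34. The function name is illustrative.

```python
def grouped_percentile_rank(x, ll, cf, fi, w, n):
    """Percentile rank for grouped data: (cf + ((x - ll) / w) * fi) / n * 100."""
    return (cf + ((x - ll) / w) * fi) / n * 100

pr61 = grouped_percentile_rank(x=61, ll=59.5, cf=159, fi=15, w=5, n=180)
print(round(pr61, 2))  # 90.83

# Round trip: the score at the 34th percentile has percentile rank 34.
score_p34 = 44.5 + ((180 * 0.34 - 50) / 42) * 5
pr_back = grouped_percentile_rank(x=score_p34, ll=44.5, cf=50, fi=42, w=5, n=180)
print(round(pr_back, 2))  # 34.0
```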
Ogive & Percentile
An ogive (cumulative frequency curve) can be used to find percentiles and percentile ranks.
Percentile Rank – an ordinal scale
[Figure: positions of percentiles under a normal distribution]
In the middle, a difference of 6 raw-score points (45 to 51) is equivalent to a difference of 20 percentile points!
In the tails, the opposite phenomenon occurs: a large raw-score difference corresponds to only a few percentile points.
Percentile rank is an ordinal scale!
Percentile Rank – an ordinal scale
The differences P50 – P40 and P20 – P10 may not be the same! -> ordinal scale
Suppose there are two distributions. Score A is from distribution 1, and score B is from distribution 2.
PRA = 50, PRB = 48
Even though the two PRs differ by just 2 points, we have no idea about |A – B|. It could be small or large.
Percentiles should be used only for describing points in a distribution (relative position/rank in a distribution), NOT for making comparisons across distributions.
Box Plot
A box plot is a straight line segment stretching from the smallest to the largest data value. Its “box” part spans the first to the third quartile.
[Figure: box plot of the previous example]
Min = 2.78, Q1 = 4.045, Q2 = 6.595, Q3 = 9.01, Max = 17.11
The length of the box represents the interquartile range, Q3 – Q1.
Outliers
• An outlier is an unusual score in a distribution that may warrant special consideration.
• Outliers can arise because of a measurement or recording error or because of equipment failure during an experiment, etc.
• An outlier might be indicative of a sub-population, e.g. an abnormally low or high value in a medical test could indicate presence of an illness in the patient.
Slide: introduction to statistics, Colm O’Dushlaine, Neuropsychiatric Genetics, TCD
Outliers & Box Plot
Modify the box plot to use 5 numbers as follows:
• RUB (reasonable upper boundary): RUB = Q3 + 1.5(IQR)
• Q3 (third quartile)
• Median
• Q1 (first quartile)
• RLB (reasonable lower boundary): RLB = Q1 – 1.5(IQR)
Outliers are all scores above the RUB or below the RLB.
[Figure: modified box plot; each outlier is drawn as a separate dot]
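As a sketch, the boundaries can be applied to the earlier percentile example, taking Q1 = 4.045 and Q3 = 9.01 from that example as given. The function name `outlier_bounds` is an illustrative assumption.

```python
def outlier_bounds(q1, q3):
    """Return (RLB, RUB) = (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [2.78, 3.79, 4.3, 6.59, 6.6, 6.96, 11.06, 17.11]
q1, q3 = 4.045, 9.01  # quartiles from the earlier worked example

rlb, rub = outlier_bounds(q1, q3)
outliers = [x for x in data if x < rlb or x > rub]

print(round(rlb, 4), round(rub, 4))  # -3.4025 16.4575
print(outliers)                      # [17.11]
```

Under this rule the largest value, 17.11, would be drawn as a separate dot in the modified box plot.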
Standard Scores
Standard Scores
Suppose a student has the following scores in 3 classes: 68 in math, 77 in physics, and 83 in history.
In which class did the student perform best?
How do we answer this question? We cannot use raw scores!
• Those 3 distributions may have different means and variances
• Those 3 scores may have different scales of measurement
We cannot use percentiles either! Percentile is an ordinal scale!
Standard Scores
One way is to transform the scores into scores on an equal interval scale.
Standard scores do this by using standard deviation as the unit of measure.
The standard score, or z-score, is computed as follows:
$$z = \frac{x - \bar{X}}{s}$$
where $\bar{X}$ is the mean and s is the standard deviation of the distribution.
Standard Scores
Example
z = (10-6)/3.18 = 1.26
Properties of z-scores
• retains the shape of the distribution of the original scores
• mean = 0
• variance = 1, standard deviation = 1
A z-score indicates the number of standard deviations a corresponding raw score is above or below the mean.
Standard Scores
Back to the previous question...
Suppose we know the mean and standard deviation of those 3 distributions. Then we can compute the z-scores:
Subject    x     mean   s    z
Math       68    65     6    0.50
Physics    77    77     9    0.00
History    83    89     8    -0.75
Relative to the other students, this student performed best on the math exam.
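The comparison in the table can be reproduced with a few lines of code. The dictionary layout and names below are illustrative; the numbers are those in the table.

```python
def z_score(x, mean, s):
    """Number of standard deviations that x lies above or below the mean."""
    return (x - mean) / s

# subject -> (student's score, class mean, class standard deviation)
subjects = {"Math": (68, 65, 6), "Physics": (77, 77, 9), "History": (83, 89, 8)}

z = {name: z_score(*values) for name, values in subjects.items()}
best = max(z, key=z.get)

print(z)     # {'Math': 0.5, 'Physics': 0.0, 'History': -0.75}
print(best)  # Math
```

Although 83 in history is the highest raw score, it is the lowest z-score, because it lies below that class's mean.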
Weighted Averages
How do we develop a composite score from two or more individual scores?
For example: we want to compute composite scores from two technical tests and one personality test to evaluate a job seeker.
$$\text{Weighted score}_j = \frac{\sum_i W_i z_{ij}}{\sum_i W_i}$$
$W_i$ = weight of each test
$z_{ij}$ = standard score for person j on test i
Why do we use standard scores?
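A minimal sketch of the weighted composite score follows. The weights and z-scores are hypothetical values chosen for illustration: two technical tests weighted 2 each and one personality test weighted 1.

```python
def weighted_score(weights, z_scores):
    """Composite score: sum(W_i * z_ij) / sum(W_i) for one person j."""
    if len(weights) != len(z_scores):
        raise ValueError("weights and z_scores must have the same length")
    return sum(w * z for w, z in zip(weights, z_scores)) / sum(weights)

weights = [2, 2, 1]              # hypothetical weights for the three tests
applicant_z = [0.5, 0.0, -0.75]  # hypothetical z-scores of one applicant

print(round(weighted_score(weights, applicant_z), 2))  # 0.05
```

Because the inputs are z-scores, each test contributes in standard-deviation units, so a test with a larger raw-score scale does not dominate the composite.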
Transformed Standard Scores
Z-scores are for purposes of comparison.
But they can be misleading for people. In the previous table, the z-score for math is 0.5 and the z-score for history is -0.75. People might consider “0.5” a bad score!
So, we need to transform those z-scores into a different distribution of scores, so that people can interpret the values easily.
$$X' = (s')(z) + \bar{X}'$$
X’ = the transformed score
s’ = the desired standard deviation
$\bar{X}'$ = the desired mean
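As an illustration, the sketch below rescales the three z-scores from the earlier table to a desired mean of 50 and a desired standard deviation of 10. These target values are an assumption chosen for the example, not something the slides prescribe.

```python
def transform(z, desired_mean, desired_sd):
    """Transformed standard score: X' = s' * z + desired mean."""
    return desired_sd * z + desired_mean

z_scores = {"Math": 0.5, "Physics": 0.0, "History": -0.75}

# Assumed target scale: mean 50, standard deviation 10.
transformed = {name: transform(z, 50, 10) for name, z in z_scores.items()}
print(transformed)  # {'Math': 55.0, 'Physics': 50.0, 'History': 42.5}
```

The ordering of the scores is unchanged; only the scale is easier to read.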
Chebyshev’s Inequality
Chebyshev’s Inequality
Let $\bar{x}$ and s be the sample mean and sample standard deviation of the data set consisting of the data $x_1, \ldots, x_n$, where s > 0. Let
$$S_k = \{\, i,\ 1 \le i \le n : |x_i - \bar{x}| < ks \,\}$$
and let $N(S_k)$ be the number of elements in the set $S_k$. Then, for any k ≥ 1,
$$\frac{N(S_k)}{n} \ge 1 - \frac{n-1}{nk^2} > 1 - \frac{1}{k^2}$$
Chebyshev’s Inequality
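Proof:
Let $S_k^C = \{\, i,\ 1 \le i \le n : |x_i - \bar{x}| \ge ks \,\}$ be the complement of $S_k$, so that $(x_i - \bar{x})^2 \ge k^2 s^2$ for every $i \in S_k^C$.
$$\begin{aligned}
(n-1)s^2 &= \sum_{i=1}^{n} (x_i - \bar{x})^2 && \text{(by definition of sample SD)} \\
&= \sum_{i \in S_k} (x_i - \bar{x})^2 + \sum_{i \in S_k^C} (x_i - \bar{x})^2 \\
&\ge \sum_{i \in S_k^C} (x_i - \bar{x})^2 \\
&\ge \sum_{i \in S_k^C} k^2 s^2 && \text{(by definition of } S_k\text{)} \\
&= k^2 s^2 \,(n - N(S_k))
\end{aligned}$$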
Chebyshev’s Inequality
Proof (Cont’d):
Dividing both sides of $(n-1)s^2 \ge k^2 s^2 (n - N(S_k))$ by $nk^2s^2$:
$$\frac{n-1}{nk^2} \ge 1 - \frac{N(S_k)}{n}$$
Hence
$$\frac{N(S_k)}{n} \ge 1 - \frac{n-1}{nk^2} > 1 - \frac{1}{k^2}$$
where the last inequality holds because $\frac{n-1}{n} < 1$ (n > 0, sample size). Q.E.D.
Chebyshev’s Inequality
It means that more than 100(1 – 1/k²) percent of the data lie within the interval from $\bar{x} - ks$ to $\bar{x} + ks$:
$$S_k = \{\, i,\ 1 \le i \le n : |x_i - \bar{x}| < ks \,\} = \{\, i,\ 1 \le i \le n : \bar{x} - ks < x_i < \bar{x} + ks \,\}$$
$$\frac{N(S_k)}{n} \ge 1 - \frac{n-1}{nk^2} > 1 - \frac{1}{k^2} \quad \text{(proportions)}$$
We only need to know the standard deviation and the mean!
Chebyshev’s Inequality
Example: the 10 top-selling passenger cars in the U.S. in 2008.
$\bar{x}$ = 31,879.4
s = 7,514.7
With k = 3/2, we obtain from Chebyshev’s inequality that more than 100(5/9) = 55.55 percent of the data from any data set lie between $\bar{x} - 1.5s$ and $\bar{x} + 1.5s$.
$(\bar{x} - 1.5s,\ \bar{x} + 1.5s)$ = (20,607.35, 43,151.45)
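The bound and the interval in this example can be recomputed with a short sketch. The function names are illustrative. The final check uses a small hypothetical data set (not the car data, which is not reproduced in these slides) to show that the observed proportion respects the bound.

```python
import statistics

def chebyshev_bound(k):
    """Lower bound 1 - 1/k^2 on the proportion of data within k sample SDs."""
    if k < 1:
        raise ValueError("the inequality is stated for k >= 1")
    return 1 - 1 / k ** 2

def chebyshev_interval(mean, s, k):
    """Interval (mean - k*s, mean + k*s)."""
    return mean - k * s, mean + k * s

low, high = chebyshev_interval(31879.4, 7514.7, 1.5)
print(round(chebyshev_bound(1.5), 4))  # 0.5556
print(round(low, 2), round(high, 2))   # 20607.35 43151.45

# Sanity check on hypothetical data: proportion strictly within k sample SDs.
data = [2, 4, 4, 4, 5, 5, 7, 9]
m, s = statistics.mean(data), statistics.stdev(data)
inside = sum(1 for x in data if abs(x - m) < 1.5 * s) / len(data)
print(inside)  # 0.875
```

The bound is deliberately conservative: it holds for any data set, so the actual proportion is often much higher.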
Normal Data Sets
Normal Data Sets
Many of the large data sets observed in practice have histograms that are similar in shape.
These histograms often reach their peaks at the sample median and then decrease on both sides of this point in a bell-shaped, symmetric fashion.
[Figure: three histograms: normal, skewed left, skewed right]
Normal Data Sets
If the histogram of a data set is close to being a normal histogram, then we say that the data set is approximately normal.
If a data set is approximately normal with sample mean $\bar{x}$ and sample standard deviation s, then the following statements are true:
• Approximately 68% of the data lie within $\bar{x} \pm s$
• Approximately 95% of the data lie within $\bar{x} \pm 2s$
• Approximately 99.7% of the data lie within $\bar{x} \pm 3s$
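These percentages can be checked by simulation. The sketch below draws pseudo-random values from a normal distribution with an arbitrary mean of 50 and standard deviation of 10 (assumed values for illustration) and counts the fraction within 1, 2, and 3 sample standard deviations of the sample mean.

```python
import random
import statistics

random.seed(0)  # fixed seed so that the run is repeatable
data = [random.gauss(50, 10) for _ in range(10_000)]

mean = statistics.mean(data)
s = statistics.stdev(data)

def fraction_within(values, center, half_width):
    """Fraction of values lying within center +/- half_width."""
    return sum(1 for x in values if abs(x - center) <= half_width) / len(values)

fractions = {k: fraction_within(data, mean, k * s) for k in (1, 2, 3)}
for k, frac in fractions.items():
    print(k, round(frac, 3))  # expected to be near 0.68, 0.95, and 0.997
```

Compare these with Chebyshev's inequality, which guarantees only more than 0%, 75%, and about 88.9% for k = 1, 2, 3; the extra assumption of approximate normality gives much tighter statements.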