Central Tendency and Variability Measures


See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/353378173

Chapter 2 Central Tendency and Variability Measures

Presentation · July 2021

DOI: 10.13140/RG.2.2.22991.20641


Author: Raid Salha, Islamic University of Gaza

All content following this page was uploaded by Raid Salha on 22 July 2021.


Chapter 2 Central Tendency and Variability Measures

2.1 Central Tendency Measures

2.2 Variability Measures

2.3 Outliers


2.1 Central Tendency Measures

Central tendency refers to the location of a “typical” data value—the data value around which other data values tend to cluster. Because a value is more likely to be typical if it is in the middle of a distribution than if it is at an extreme, the term central tendency has come to be used for this class of descriptive statistics.

1. The Mean

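The most commonly used index of central tendency is the mean, the term used in statistics for the arithmetic average. The equation for calculating the mean is as follows:

$$\bar{X} = \frac{\sum_{i=1}^{N} X_i}{N}$$

where $N$ is the number of data values and $X_i$ is the $i$-th data value.

Example 1: Consider the following set of values

10 12 15 5 10 8 6 14

$$\bar{X} = \frac{80}{8} = 10$$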
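The calculation in Example 1 can be checked with a short Python sketch. The variable names are illustrative and not part of the original notes; `statistics.mean` is from the Python standard library.

```python
from statistics import mean

# Data values from Example 1.
values = [10, 12, 15, 5, 10, 8, 6, 14]

# Mean by the formula: the sum of the values divided by their count N.
manual_mean = sum(values) / len(values)

# The standard library function should agree with the manual calculation.
library_mean = mean(values)

print(manual_mean, library_mean)
```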


2. The Median

A second descriptive statistic used to indicate central tendency is the median. The median is the point in a data distribution that divides the distribution into two equal halves: 50% of the data values lie above the median, and the other 50% lie below it.

To calculate the median, the data values must first be sorted in ascending order from the smallest to the largest value or in descending order from the largest value to the smallest.

There are two cases:

1. If the data size N is odd, then the median is the observation whose order is $\frac{N+1}{2}$.

2. If the data size N is even, then the median is the mean of the two observations whose orders are $\frac{N}{2}$ and $\frac{N+2}{2}$.

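Example 2: Find the median of the following data

9 10 15 5 10 8 6 14 7

Note that the sample size N = 9 is odd. Sorted in ascending order, the data are

5 6 7 8 9 10 10 14 15

The median is the observation of order (9 + 1)/2 = 5, so Median = 9.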


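Example 3: Find the median of the following data

9 10 15 5 10 8 6 14 7 13

Note that the sample size N = 10 is even. Sorted in ascending order, the data are

5 6 7 8 9 10 10 13 14 15

The median is the mean of the observations of order 5 and 6:

$$\text{Median} = \frac{9 + 10}{2} = 9.5$$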
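The odd and even cases of the median rule can be sketched in Python as follows. The function name `median_by_rank` is a hypothetical helper introduced here for illustration; `statistics.median` is the standard library function used as a cross-check.

```python
from statistics import median

def median_by_rank(data):
    """Median using the odd/even rank rule (ranks are 1-based)."""
    ordered = sorted(data)  # ascending order
    n = len(ordered)
    if n % 2 == 1:
        # Odd N: the observation of order (N + 1) / 2.
        return ordered[(n + 1) // 2 - 1]
    # Even N: mean of the observations of order N / 2 and (N + 2) / 2.
    return (ordered[n // 2 - 1] + ordered[(n + 2) // 2 - 1]) / 2

example_2 = [9, 10, 15, 5, 10, 8, 6, 14, 7]       # N = 9 (odd)
example_3 = [9, 10, 15, 5, 10, 8, 6, 14, 7, 13]   # N = 10 (even)

print(median_by_rank(example_2), median(example_2))
print(median_by_rank(example_3), median(example_3))
```

The subtraction of 1 in each index converts the 1-based order used in the text to Python's 0-based list positions.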

3. The Mode

The mode is the numerical value in a distribution that occurs most frequently.

Example 4:

20 21 21 22 22 22 22 23 23 24

Mode = 22

Example 5:

21 21 21 22 22 22 23 24 25 25

There are two modes, 21 and 22.
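As a quick check of the two data sets above, the following sketch uses `statistics.multimode` from the Python standard library (available from Python 3.8), which returns every value tied for the highest frequency. The variable names are illustrative.

```python
from statistics import multimode

# The two data sets used above to illustrate the mode.
first = [20, 21, 21, 22, 22, 22, 22, 23, 23, 24]
second = [21, 21, 21, 22, 22, 22, 23, 24, 25, 25]

# multimode returns a list, so a distribution with two modes is reported in full.
modes_first = multimode(first)
modes_second = multimode(second)

print(modes_first, modes_second)
```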


Example 6: Using SPSS, find the central tendency measures for the data in Example 2 (Heart rate data).

Statistics: heartrate

N (Valid)      100
N (Missing)    0
Mean           65.21
Median         66.00
Mode           66

The relationship between Mean, Median and Mode

For a unimodal distribution, the relationship between the three central tendency measures is given by:

1. If the distribution is symmetric, then 𝑀𝑒𝑎𝑛 = 𝑀𝑒𝑑𝑖𝑎𝑛 = 𝑀𝑜𝑑𝑒.

2. If the distribution is positively skewed, then 𝑀𝑜𝑑𝑒 < 𝑀𝑒𝑑𝑖𝑎𝑛 < 𝑀𝑒𝑎𝑛.

3. If the distribution is negatively skewed, then 𝑀𝑒𝑎𝑛 < 𝑀𝑒𝑑𝑖𝑎𝑛 < 𝑀𝑜𝑑𝑒.
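The ordering for skewed distributions can be illustrated with a small sketch. The data below are hypothetical values chosen only to show the pattern; they are not taken from the notes. The ordering is a rule of thumb for typical unimodal distributions and can fail for some data sets.

```python
from statistics import mean, median, mode

# Hypothetical positively skewed data: most values are small, a few are large.
skewed_right = [1, 2, 2, 2, 3, 3, 4, 5, 9, 14]

# Mirroring the values around 15 gives a negatively skewed data set.
skewed_left = [15 - x for x in skewed_right]

right = (mode(skewed_right), median(skewed_right), mean(skewed_right))
left = (mean(skewed_left), median(skewed_left), mode(skewed_left))

print("positive skew: mode, median, mean =", right)
print("negative skew: mean, median, mode =", left)
```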


2.2 Variability Measures

In addition to a distribution’s shape and central tendency, another important characteristic is its variability. Variability refers to how spread out or dispersed the scores are about their mean. Two distributions with identical means and similar shapes (e.g., both symmetric) could nevertheless differ considerably in terms of variability.

Consider, for example, the two distributions in Figure 1. This figure shows body weight data for two hypothetical samples, both of which have means of 150 pounds, but, clearly, the two samples differ markedly. In sample A, there is great diversity: Some people weigh as little as 100 pounds, while others weigh up to 200 pounds. In sample B, by contrast, there are few people at either extreme: The weights cluster more tightly around the mean of 150. We can verbally describe a sample’s variability. We can say, for example, that sample A is more varied than sample B with regard to weight.

Figure 1: Two distributions with the same mean and different variability


Variability indexes

Statisticians have developed indexes that express the extent to which data values on quantitative variables deviate from one another in a distribution.

Four of these indexes are described here.

1. The Range

The range, the simplest measure of variability, is the difference between the highest data value and the lowest data value in the distribution.

For example, in Figure 1,

Range for sample A = 200 − 100 = 100

Range for sample B = 175 − 125 = 50

Note: In research reports, the range is often shown as the minimum and maximum value, without the subtracted difference score.

2. Interquartile Range

The interquartile range (𝐼𝑄𝑅) is a variability index calculated on the basis of quartiles. The lower (first) quartile (𝑄₁) is the point below which 25% of the data lie, while the upper (third) quartile (𝑄₃) is the point below which 75% of the data lie. The interquartile range (𝐼𝑄𝑅) is the distance between these two values:

𝐼𝑄𝑅 = 𝑄₃ − 𝑄₁
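A minimal sketch of the quartile and IQR calculation, applied to the ten values of Example 3. It uses `statistics.quantiles` from the Python standard library (Python 3.8 or later) with its default "exclusive" method. Statistical packages use different interpolation rules for quartiles, so other software may report slightly different values for the same data.

```python
from statistics import quantiles

# The ten data values from Example 3.
data = [9, 10, 15, 5, 10, 8, 6, 14, 7, 13]

# n=4 splits the data into four equal parts and returns the three cut points.
q1, q2, q3 = quantiles(data, n=4)

# Interquartile range: the distance between the third and first quartiles.
iqr = q3 - q1

print("Q1 =", q1, "Q2 (median) =", q2, "Q3 =", q3, "IQR =", iqr)
```

The second cut point equals the median, which matches the 9.5 obtained for these data in Example 3.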


Note: The median is the second quartile (𝑄₂), the point below which 50% of the data lie.

Figure 2: The three quartiles

Example 7: Use SPSS to find the following for the heart rate data:

a. The three quartiles and 𝐼𝑄𝑅.

b. The percentile points in steps of 10%.

c. The 15%, 25%, 45%, 50%, 60%, 75%, and 95% percentile points.

a.

N (Valid)        100
N (Missing)      0
Percentile 25    62.00
Percentile 50    66.00
Percentile 75    68.00

𝐼𝑄𝑅 = 𝑄₃ − 𝑄₁ = 68 − 62 = 6


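b.

N (Valid)        100
N (Missing)      0
Percentile 20    61.00
Percentile 30    63.00
Percentile 40    64.00
Percentile 50    66.00
Percentile 60    66.60
Percentile 70    68.00
Percentile 80    69.00
Percentile 90    71.00

c.

N (Valid)        100
N (Missing)      0
Percentile 25    62.00
Percentile 45    65.00
Percentile 50    66.00
Percentile 60    66.60
Percentile 75    68.00
Percentile 95    72.00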


3. The Variance

Another index of variability is the variance (often abbreviated as Var). The variance is based on the differences between every data value and the mean. The formula for the variance is given by:

$$\mathit{Var} = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N - 1}$$

where

$N$ is the sample size,

$\bar{X}$ is the sample mean, and

$X_i$ are the data values.

Example 8: The following data represent the weights of 10 people in pounds. Compute the variance of the data.

110 120 130 140 150 150 160 170 180 190

X_i     X_i − X̄    (X_i − X̄)²
110     −40         1600
120     −30         900
130     −20         400
140     −10         100
150     0           0
150     0           0
160     10          100
170     20          400
180     30          900
190     40          1600

$$\sum_{i=1}^{N} X_i = 1500, \qquad \sum_{i=1}^{N} (X_i - \bar{X}) = 0, \qquad \sum_{i=1}^{N} (X_i - \bar{X})^2 = 6000$$


$$\bar{X} = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{1500}{10} = 150$$

$$\mathit{Var} = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N - 1} = \frac{6000}{9} = 666.67$$
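The steps of Example 8 can be reproduced with a short Python sketch. The variable names are illustrative; `statistics.variance` is the standard library function for the sample variance (divisor N − 1) and is used here only as a cross-check.

```python
from statistics import variance

# Weights of the 10 people in Example 8 (pounds).
weights = [110, 120, 130, 140, 150, 150, 160, 170, 180, 190]

n = len(weights)
x_bar = sum(weights) / n                   # sample mean
deviations = [x - x_bar for x in weights]  # X_i minus the mean
squared = [d ** 2 for d in deviations]     # squared deviations

# Sample variance: sum of squared deviations divided by N - 1.
var_manual = sum(squared) / (n - 1)
var_library = variance(weights)

print(x_bar, sum(deviations), sum(squared), round(var_manual, 2))
```

The deviations always sum to zero, which is why they are squared before being averaged.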

Note: Because the variance is not in the same measurement units as the original data (in this example, it is in pounds squared), the variance is rarely used as a descriptive statistic.

4. The Standard Deviation

The most widely used index of variability is the standard deviation (often abbreviated as SD). The standard deviation is the square root of the variance:

$$SD = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N - 1}} = \sqrt{\mathit{Var}}$$

Example 9: The standard deviation for the data in Example 8 is given by

$$SD = \sqrt{\mathit{Var}} = \sqrt{666.67} = 25.82$$
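As a check on this result, the sketch below computes the standard deviation of the ten weights from the variance example in two ways: with `statistics.stdev` from the Python standard library (sample standard deviation, divisor N − 1) and as the square root of the sample variance. The variable names are illustrative.

```python
import math
from statistics import stdev, variance

# The ten weights (pounds) used in the variance example.
weights = [110, 120, 130, 140, 150, 150, 160, 170, 180, 190]

# Sample standard deviation computed directly.
sd_direct = stdev(weights)

# The same quantity as the square root of the sample variance.
sd_from_variance = math.sqrt(variance(weights))

print(round(sd_direct, 2), round(sd_from_variance, 2))
```

Unlike the variance, the result is in the original measurement units (pounds).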

Note: The standard deviation is often easier to interpret in a comparative context. For example, looking back at Figure 1, distributions A and B both had a mean of 150, but sample A would have an SD of about 20, while sample B would have an SD of about 10. The SD index communicates that sample A is more varied than sample B.


Example 10: Using SPSS, find the variability measures for the heart rate data.

N (Valid)         100
N (Missing)       0
Std. Deviation    4.495
Variance          20.208
Range             19

Example 11: Using SPSS, find the central tendency and variability measures for the heart rate data.

N (Valid)         100
N (Missing)       0
Mean              65.21
Median            66.00
Mode              66
Std. Deviation    4.495
Variance          20.208
Range             19


2.3 Outliers

Outliers are often identified in relation to the value of a distribution’s IQR.

Types of outliers: There are two types of outliers.

1. A mild outlier is a data value that lies between 1.5 and 3.0 times the IQR below Q₁ or above Q₃.

2. An extreme outlier is a data value that is more than three times the IQR below Q₁ or above Q₃.

Example 12: Are there any outliers in the heart rate data? If yes, find them.

From Example 7, for the heart rate data,

𝐼𝑄𝑅 = 6, Q₁ = 62 and Q₃ = 68

1.5 × 𝐼𝑄𝑅 = 1.5 × 6 = 9

3 × 𝐼𝑄𝑅 = 3 × 6 = 18

Q₁ − 1.5 × 𝐼𝑄𝑅 = 62 − 9 = 53

Q₁ − 3 × 𝐼𝑄𝑅 = 62 − 18 = 44

Q₃ + 1.5 × 𝐼𝑄𝑅 = 68 + 9 = 77

Q₃ + 3 × 𝐼𝑄𝑅 = 68 + 18 = 86

Figure 3: The outliers for the heart rate data


• A mild lower outlier would be any value between 44 and 53, and a mild upper outlier would be any value between 77 and 86.

There are no mild outliers.

• An extreme lower outlier would be a value less than 44, and an extreme upper outlier would be a value above 86.

There are no extreme outliers.

In our data distribution, there are no outliers.
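The classification rule can be sketched as a small Python function. The name `classify_outlier` and the test values are hypothetical and introduced only for illustration. The sketch assumes that a value lying exactly on a 1.5 × IQR boundary is not an outlier and that a value exactly on a 3 × IQR boundary is mild; the definitions above do not settle these edge cases.

```python
def classify_outlier(value, q1, q3):
    """Classify a value as 'none', 'mild', or 'extreme' using IQR fences."""
    iqr = q3 - q1
    lower_mild, upper_mild = q1 - 1.5 * iqr, q3 + 1.5 * iqr      # inner fences
    lower_extreme, upper_extreme = q1 - 3 * iqr, q3 + 3 * iqr    # outer fences
    if value < lower_extreme or value > upper_extreme:
        return "extreme"
    if value < lower_mild or value > upper_mild:
        return "mild"
    return "none"

# Quartiles of the heart rate data, as reported in Example 7.
Q1, Q3 = 62, 68

# Illustrative values only; these are not the actual heart rate observations.
for value in (66, 50, 40, 80, 90):
    print(value, classify_outlier(value, Q1, Q3))
```

With Q₁ = 62 and Q₃ = 68 the function uses the same fences as Example 12 (53, 44, 77 and 86).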

A graph called a boxplot is a useful way to visualize percentiles and to identify outliers. Figure 4 shows the boxplot for the original heart rate data for 100 people. A boxplot shows a box that has the 75th percentile as its upper edge (here, at 68) and the 25th percentile at its lower edge (here, at 62). The horizontal line through the middle of the box is the median (here, 66). The “whiskers” that extend from the box show the highest and lowest values that are not outliers, in relation to the IQR, as defined earlier. This graph confirms that there are no outliers in the original dataset.


Figure 4: The boxplot graph for the heart rate data

Example 13:

To illustrate what a computer-generated boxplot shows when there are outliers, we added six extreme values to the original dataset: 40, 45, and 50 at the lower end and 90, 95, and 100 at the upper end. Figure 5 shows the resulting boxplot. The six data values that we added all are shown as outliers, outside the outer limits of the whiskers. Mild outliers are shown with circles: Case number 106 with a value of 50 and case number 105 with a value of 45 are mild outliers, for example. Cases that are extreme outliers are shown with asterisks. When there are outliers, the first thing to do is to see if they are legitimate values, or reflect errors in data entry. If they are true values, researchers can decide on whether it is appropriate to make adjustments, such as trimming the mean.

Figure 5: The boxplot graph for the heart rate data after adding six extreme values
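One common way to trim the mean is to drop a fixed proportion of the smallest and largest values before averaging. The sketch below assumes that reading; the function name `trimmed_mean`, the 10% proportion, and the data values are all hypothetical and are not taken from the heart rate data.

```python
def trimmed_mean(data, proportion):
    """Mean after dropping `proportion` of the values from each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(data)
    k = int(len(ordered) * proportion)    # number of values cut from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Hypothetical values with one low and one high outlier.
sample = [40, 62, 64, 65, 66, 66, 67, 68, 70, 100]

ordinary = sum(sample) / len(sample)
trimmed = trimmed_mean(sample, 0.10)      # drops 40 and 100

print(ordinary, trimmed)
```

Here the trimmed mean is less affected by the two extreme values than the ordinary mean.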


Exercises

Q1. The following numbers represent the data values of 30 psychiatric inpatients on a widely used measure of depression (the Center for Epidemiologic Studies-Depression scale). What are the mean, the median, and the mode for these data?

41 27 32 24 21 28 22 25 35 27 31 40 23 27 29 33 42 30 26 30 27 39 26 34 28 38 29 36 24 37

If the values of these indexes are not the same, discuss what they suggest about the shape of the distribution.

Q2. Find the medians for the following distributions:

(a) 1 5 7 8 9

(b) 3 5 6 8 9 10

(c) 3 4 4 4 6 20

(d) 2 4 5 5 8 9

Q3. For which distribution in question Q2 would the median be preferred to the mean as the index of central tendency? Why?

Q4. The following ten data values are systolic blood pressure readings.

Compute the mean, the range, the SD, and the variance for these data.

130 110 160 120 170 120 150 140 160 140


Q5. The following data represent the blood pressure of 30 people.

115 110 130 135 120 125 135 120 90 130 140 250 50 125 120 110 130 140 150 75 145 60 125 110 80 140 160 120 135 100

a. Calculate the IQR.

b. Are there any outliers in the blood pressure data? If yes, find them and classify them as mild or extreme outliers.

c. Graph the boxplot for the data.
