HOW TO: QM1
PART 1 → DESCRIBING AND DEPICTING DATA
PART 2 → PROBABILITY
PART 3 → INFERENCE AND HYPOTHESIS TESTING
PART 1 → DESCRIBING AND DEPICTING DATA
3 Broad Categories/Types of data:
1. Interval (numerical) (e.g. heights between 100 and 120cm)
2. Categorical (nominal) (e.g. favourite colours)
3. Ordinal
a. Sort of a combo of interval and categorical
b. Has a sense of order or ranking of the categories, but there is no legit numerical value attached to them
c. (e.g. disagree, neither, agree)
Make an appropriate graphical depiction to generate an immediate impression of a set of observations/data
↓
For numerical data
↓
Histograms
✓ depicts distribution well by dividing the range of observed values into bins and counting the frequency of values in each bin
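A minimal sketch of the binning idea described above, assuming equal-width bins and at least two distinct values. The function name `histogram_counts` and the heights are made up for illustration.

```python
def histogram_counts(values, num_bins):
    """Split the observed range into equal-width bins and count the values in each bin."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("need at least two distinct values to form bins")
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # The largest value would fall just past the last bin, so clamp it into the last bin.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, counts


heights = [100, 102, 105, 107, 110, 111, 113, 115, 118, 120]  # made-up heights in cm
edges, counts = histogram_counts(heights, 4)

# Text version of the histogram: one row per bin, one '#' per observation.
for left, right, n in zip(edges, edges[1:], counts):
    print(f"{left:6.1f} to {right:6.1f} cm | {'#' * n}")
```

Changing `num_bins` changes the picture quite a bit, which is part of why histograms only give a rough impression.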
↓
But there are limitations on how precisely/efficiently we can describe data from histograms, sooo…
↓
Use Descriptive Statistics: (UNIVARIATE)
1. Sample Mean (Measure of Central Tendency)
2. Sample Variance (Measure of Spread)
3. Sample Skewness (Measure of Skewness)
(Skewness: tells us about the symmetry of data)
Left skewed (long left tail) = -ve:
will give -ve sample skewness
Right skewed (long right tail) = +ve:
will give +ve sample skewness
If highly skewed, the median will give a better sense of location
4. Sample Kurtosis (Measure of Kurtosis)
(Kurtosis: tells us about tail behavior in data)
Leptokurtic (Fat Tailed) data:
- Few observations that are extreme deviations from the mean
- +ve sample kurtosis (when kurtosis is measured relative to the normal distribution, i.e. excess kurtosis)
Platykurtic data:
- Many observations that only modestly deviate from the mean
- -ve sample kurtosis
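A rough sketch of the four univariate statistics above. Assumptions: the variance uses the n - 1 divisor, and skewness and kurtosis use the simple moment-based formulas (kurtosis reported as excess kurtosis, so a normal distribution scores about 0). The formula sheet may use slightly different small-sample corrections, so treat the exact numbers as illustrative. The function name `describe` and the data are made up.

```python
def describe(data):
    """Return (mean, sample variance, skewness, excess kurtosis) for a list of numbers."""
    n = len(data)
    mean = sum(data) / n
    dev = [x - mean for x in data]

    variance = sum(d ** 2 for d in dev) / (n - 1)  # sample variance (n - 1 divisor)

    # Central moments with divisor n, used for the shape measures.
    m2 = sum(d ** 2 for d in dev) / n
    m3 = sum(d ** 3 for d in dev) / n
    m4 = sum(d ** 4 for d in dev) / n

    skewness = m3 / m2 ** 1.5          # -ve: left skewed, +ve: right skewed
    excess_kurtosis = m4 / m2 ** 2 - 3  # +ve: leptokurtic, -ve: platykurtic
    return mean, variance, skewness, excess_kurtosis


symmetric = [1, 2, 3, 4, 5]                 # made-up symmetric data
right_skewed = [1, 2, 2, 3, 3, 3, 4, 10]    # made-up data with one large value

for name, data in [("symmetric", symmetric), ("right skewed", right_skewed)]:
    mean, var, skew, kurt = describe(data)
    print(f"{name}: mean={mean:.2f} var={var:.2f} skew={skew:.2f} kurt={kurt:.2f}")
```

The single large value in `right_skewed` drags the skewness positive and fattens the tail, so the kurtosis comes out positive too.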
buutttt Numerical Descriptive Statistics are sensitive to extreme observations (can distort view of data)
↓
Consider Relative Measures of central tendency (median) and spread (IQR = 75th percentile - 25th percentile)
Should specify the percentile range used, so that the range in which observations count as outliers is identified too
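A small sketch of why the median and IQR hold up better against extreme observations. Assumptions: the data are made up, the quartiles come from Python's `statistics.quantiles` (other quartile conventions give slightly different values), and the 1.5 x IQR outlier fences are a common rule of thumb rather than something stated in these notes.

```python
import statistics

data = [2, 3, 4, 5, 6, 7, 8, 100]  # made-up data with one extreme observation

mean = statistics.mean(data)
median = statistics.median(data)

# Quartiles: the three cut points that split the sorted data into four parts.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Rule-of-thumb outlier range (an assumption, not from the notes):
# anything more than 1.5 * IQR outside the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(f"mean={mean}, median={median}")
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"outliers: {outliers}")
```

The mean gets pulled far above most of the data by the single value 100, while the median barely moves.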
Nowwww BIVARIATE DATA :)
Let's call them Data Set X and Y
Sample Covariance (FS)
- The average of the product of deviations of x and y from their corresponding means
- is +ve when X and Y are positively related (meaning X values that are greater than the mean of X correspond to Y values that are greater than the mean of Y)
- 0 if no linear relationship
- -ve if negatively related
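A minimal sketch of the sample covariance described above, assuming the n - 1 divisor that is usual for a sample version (check it against the formula sheet). The function name `sample_covariance` and the data are made up.

```python
def sample_covariance(x, y):
    """Sum of products of deviations from the means, divided by n - 1."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of observations")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    products = [(xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)]
    return sum(products) / (n - 1)


x = [1, 2, 3, 4, 5]          # made-up data
y = [2, 4, 5, 4, 5]          # tends to rise with x
y_flipped = [-v for v in y]  # tends to fall as x rises

cov_pos = sample_covariance(x, y)
cov_neg = sample_covariance(x, y_flipped)
print(cov_pos, cov_neg)
```

The size of the covariance depends on the units of X and Y, so only the sign is easy to interpret here.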
Sample Correlation (FS): (aka coefficient of correlation)
- scaled version of sample covariance (covariance divided by the product of the two sample standard deviations), gives the degree to which X and Y are linearly related
- so that's why it takes values from -1 to 1 (inclusive)
- same signs as covariance: +ve if positively related, -ve if negatively related, 0 if no linear relationship
Line of Best Fit: (NOT IN FS)
For the line $\hat{y} = b_0 + b_1 x$:
$b_0 = \bar{y} - b_1 \bar{x}$,
where $b_1 = \dfrac{s_{xy}}{s_x^2}$ (this is the sample covariance of x and y divided by the sample variance of x)
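A sketch tying the sample correlation and the line of best fit together, under the same assumption as before that the sample covariance and variance use the n - 1 divisor. The helper names (`cov`, `correlation`, `best_fit`) and the data are made up.

```python
import math


def cov(x, y):
    """Sample covariance with the n - 1 divisor; cov(x, x) is the sample variance of x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def correlation(x, y):
    """Covariance scaled by both standard deviations, so it lies between -1 and 1."""
    return cov(x, y) / math.sqrt(cov(x, x) * cov(y, y))


def best_fit(x, y):
    """Return (b0, b1) for the line y_hat = b0 + b1 * x."""
    b1 = cov(x, y) / cov(x, x)                 # slope: s_xy / s_x^2
    b0 = sum(y) / len(y) - b1 * sum(x) / len(x)  # intercept: y_bar - b1 * x_bar
    return b0, b1


x = [1, 2, 3, 4, 5]  # made-up data
y = [2, 4, 5, 4, 5]

r = correlation(x, y)
b0, b1 = best_fit(x, y)
print(f"r = {r:.3f}")
print(f"y_hat = {b0:.2f} + {b1:.2f} * x")
```

The slope and the correlation always share the sign of the covariance, since both are the covariance divided by something positive.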