ECMT1010: Business and Economics Statistics A Notes
The University of Sydney Part 2 -> Describing Data
One Categorical Variable: This is important when we consider the proportion of cases that fall into a certain category (eg comparing people who agree against those who disagree or don’t know)
NOTE: Proportions are also known as relative frequencies. We are able to calculate the proportion (relative frequency) for a category as follows:
proportion = (number of cases in the category) / (total number of cases)
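As a rough illustration of the calculation, the Python sketch below builds a frequency table and converts it to relative frequencies. The response counts are hypothetical (they are not data from these notes), and the names `responses`, `frequencies` and `proportions` are made up for the example.

```python
from collections import Counter

# Hypothetical survey responses, invented for illustration only.
responses = ["agree"] * 12 + ["disagree"] * 6 + ["don't know"] * 2

frequencies = Counter(responses)   # frequency: count in each category
n = sum(frequencies.values())      # total number of cases in the sample

# Relative frequency (proportion) = count in the category / total count
proportions = {category: count / n for category, count in frequencies.items()}

for category in frequencies:
    print(f"{category:<12} frequency={frequencies[category]:>2}  "
          f"proportion={proportions[category]:.2f}")
```

The proportions always add up to 1, which is a quick way to check a relative frequency table.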
Using the above information, we can graph the sample data to get a better understanding of it. We can choose to graph the frequencies in a bar chart as shown below OR we can use the relative frequencies to display a pie chart of the given data.
NOTE: When writing the notation for a proportion:
• p denotes the proportion of a population
• p̂ denotes the proportion of a sample
Two Categorical variables
Sometimes we might ask questions like “does the opinion differ between males and females?” These types of questions ask about a relationship between two categorical variables. NOTE that the two categorical variables here are opinion and gender.
For this, we set up a two-way table, which has the categories of one variable along the rows and the categories of the other variable along the columns. Sometimes it is useful to add a total row and a total column.
NOTE: the overall total for the two-way table should be the same as the total for a frequency table if you are looking at the same scenario.
Two Way Table: A two-way table shows the relationship between two categorical variables by listing the categories of one variable along the rows and the categories of the other along the columns.
CASE STUDY EXAMPLE: In the StudentSurvey dataset, students are asked which award they would prefer to win: an Academy Award, a Nobel prize, or an Olympic gold medal. The data show that 20 of the 31 students who prefer an Academy award are female, 76 of the 149 students who prefer a Nobel prize are female, and 73 of the 182 who prefer an Olympic gold medal are female. Create a two-way table for these variables!
NOTE: The two categorical variables to use in this case are gender and type of award
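One possible way to build the table for the case study is sketched below in Python. The female counts and the award totals are the figures stated in the case study; the male counts are not stated, so they are derived here as total minus females, which assumes every student is recorded as either female or male. The variable names are illustrative only.

```python
# Figures stated in the case study: (number of females, total) for each award.
stated = {
    "Academy": (20, 31),
    "Nobel": (76, 149),
    "Olympic": (73, 182),
}
awards = list(stated)

# ASSUMPTION: male count = total - female count (not given directly in the notes).
table = {
    "Female": {a: stated[a][0] for a in awards},
    "Male": {a: stated[a][1] - stated[a][0] for a in awards},
}

row_totals = {gender: sum(row.values()) for gender, row in table.items()}
column_totals = {a: stated[a][1] for a in awards}
grand_total = sum(row_totals.values())

# Print the two-way table with a total row and a total column.
print(f"{'':<8}" + "".join(f"{a:>9}" for a in awards) + f"{'Total':>9}")
for gender, row in table.items():
    print(f"{gender:<8}" + "".join(f"{row[a]:>9}" for a in awards)
          + f"{row_totals[gender]:>9}")
print(f"{'Total':<8}" + "".join(f"{column_totals[a]:>9}" for a in awards)
      + f"{grand_total:>9}")
```

The grand total (362) matches the sum of the three award totals, which is the consistency check mentioned in the NOTE about totals.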
When observing a histogram, we have to ask ourselves whether the data is symmetrical or skewed. There are 4 cases that can occur:
• Symmetrical: The data is considered symmetrical if, when we fold the graph down the middle, both sides are roughly similar to each other
• Right Skewed: The data is considered right skewed if most of the data is on the left and there is an extended tail running to the right
• Left Skewed: The data is considered left skewed if most of the data is on the right and there is an extended tail running to the left
• Bell-shaped: A bell-shaped histogram will look like one hill that slopes up and then down.
Mean: The mean of a given quantitative variable is the average of the numerical data: add up all the values and divide by the number of values. This can be denoted in the following ways:
• µ denotes the average of a given population
• x̄ denotes the average of a sample taken from that population
Median: The median is the middle entry of an ordered data list if the list has an odd number of values, or the average of the two middle values if the list has an even number of values. This means that the median splits the data in half.
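The two definitions can be checked with a short Python sketch using the standard `statistics` module. The lists below are hypothetical values chosen for illustration; they are not the heart rate data from the case study.

```python
import statistics

# Hypothetical values, invented for illustration only.
odd_list = [3, 9, 4, 1, 8]         # 5 values (odd count)
even_list = [3, 9, 4, 1, 8, 12]    # 6 values (even count)

# Mean: sum of the values divided by how many there are.
mean_odd = sum(odd_list) / len(odd_list)

# Median: sort first, then take the middle value (odd count)
# or the average of the two middle values (even count).
median_odd = statistics.median(odd_list)     # sorted: 1, 3, 4, 8, 9
median_even = statistics.median(even_list)   # sorted: 1, 3, 4, 8, 9, 12

print("mean of odd list:   ", mean_odd)
print("median of odd list: ", median_odd)
print("median of even list:", median_even)
```

For the odd list the median is the third sorted value (4); for the even list it is the average of 4 and 8.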
CASE STUDY EXAMPLE: Find the median and the mean for the heart rates, in beats per minute, of 20-year-old patients in an ICU treatment study.
2) Resistance
Sometimes in statistics, a dataset contains outliers, so we need to know whether a statistic is resistant (robust) to them. Outliers affect the mean and the median differently.
Resistance/Robust: A statistic is resistant when it is relatively unaffected by extreme or large values.
This means that the median is resistant but the mean is not.
• In other words, the median is resistant because no matter how extreme an added value is, the median is relatively unaffected. However, the mean can be severely impacted.
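A small Python sketch makes the difference concrete. The numbers are hypothetical: the second list is the first list with its largest value replaced by an extreme outlier.

```python
import statistics

# Hypothetical values, invented for illustration only.
values = [10, 12, 13, 15, 20]
with_outlier = [10, 12, 13, 15, 200]   # largest value replaced by an outlier

mean_before = statistics.mean(values)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(values)
median_after = statistics.median(with_outlier)

print("mean:  ", mean_before, "->", mean_after)      # pulled towards the outlier
print("median:", median_before, "->", median_after)  # unchanged
```

Here the mean jumps from 14 to 50 while the median stays at 13, which is what it means for the median to be resistant.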
Standard Deviation: Standard deviation is a quantitative measure of the spread of data in a dataset. As the standard deviation increases, the spread of the data also increases. For a sample of n values, it is calculated by the following formula:
s = √( Σ(x - x̄)² / (n - 1) )
Similar to proportion, standard deviation also has its own notation.
• s denotes the standard deviation for a sample, measuring how spread out the data is from the sample mean x̄
• σ denotes the standard deviation for a population, measuring how spread out the data is from the population mean μ
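The sketch below computes both versions in Python on a hypothetical list of values. It assumes the usual textbook definitions: the sample standard deviation s divides the sum of squared deviations by n - 1, while the population standard deviation σ divides by n. The manual calculation is compared against `statistics.stdev` and `statistics.pstdev` from the standard library.

```python
import math
import statistics

# Hypothetical values, invented for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
x_bar = sum(data) / n

# Sum of squared deviations from the mean.
squared_deviations = sum((x - x_bar) ** 2 for x in data)

# ASSUMPTION: sample SD divides by n - 1, population SD divides by n.
s = math.sqrt(squared_deviations / (n - 1))
sigma = math.sqrt(squared_deviations / n)

print("mean:", x_bar)
print("sample SD (s):        ", round(s, 3), "library:", round(statistics.stdev(data), 3))
print("population SD (sigma):", round(sigma, 3), "library:", round(statistics.pstdev(data), 3))
```

The sample value is slightly larger than the population value because it divides by n - 1 rather than n.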
Note: Standard deviation allows us to determine how many standard deviations a certain value is from the mean (Eg: 1 standard deviation, 2 standard deviations etc.)
IMPORTANT: When looking at a bell-shaped symmetrical curve, approximately 95% of the values should fall within 2 standard deviations of the mean. Mathematically this means that the values should fall between x̄ - 2s and x̄ + 2s. This is shown in the graph on the right.
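The 95% rule can be checked by simulation. The Python sketch below generates bell-shaped data with `random.gauss` (the centre of 100 and spread of 15 are arbitrary choices for the example, not values from these notes) and counts how many values land within 2 standard deviations of the mean.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

# Simulated bell-shaped data; 100 and 15 are arbitrary illustrative choices.
sample = [random.gauss(100, 15) for _ in range(10_000)]

x_bar = statistics.mean(sample)
s = statistics.stdev(sample)

lower = x_bar - 2 * s
upper = x_bar + 2 * s

# Fraction of values inside the interval [x_bar - 2s, x_bar + 2s].
inside = sum(lower <= x <= upper for x in sample)
fraction = inside / len(sample)

print(f"interval: {lower:.1f} to {upper:.1f}")
print(f"fraction within 2 standard deviations: {fraction:.3f}")
```

The printed fraction should come out close to 0.95. The rule is only an approximation and applies to symmetric, bell-shaped data, so a skewed dataset can give a noticeably different figure.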
When analysing data, we are particularly interested in the centre and spread of a distribution. To combine the two, we look at how many standard deviations a value is from the mean. This is called the z-score.
Z-Score: Measures how many standard deviations a given data value is from the mean.
Mathematically, this is calculated using the following formula:
z = (x - x̄) / s for a sample, or z = (x - μ) / σ for a population
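A minimal Python sketch of the calculation follows, assuming the standard definition of a z-score (the value minus the mean, divided by the standard deviation). The function name `z_score` and the numbers used are hypothetical.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

# Hypothetical example: a distribution with mean 70 and standard deviation 10.
z_high = z_score(85, 70, 10)   # above the mean, so positive
z_low = z_score(50, 70, 10)    # below the mean, so negative

print("z-score of 85:", z_high)
print("z-score of 50:", z_low)
```

A positive z-score means the value is above the mean and a negative z-score means it is below. For bell-shaped data, a z-score beyond -2 or +2 marks a value outside the range that holds roughly 95% of the data.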
Percentiles: When looking at a range of data, we can describe where a value sits within the data using percentiles.
EXACTLY LIKE HSC MARKS, the percentile of a given mark tells us the percentage of all other marks that it beat. Some examples are:
• 92nd percentile in mathematics means that the student’s mark beat 92% of all other mathematics marks
• 21.5th percentile in visual arts means that the student’s mark beat 21.5% of all other visual arts marks
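The idea can be sketched in Python as the percentage of marks that fall strictly below a given mark. This follows the "beat" wording used above; other definitions of a percentile exist (for example, counting marks at or below the value), so treat this as one convention rather than the only one. The function `percentile_rank` and the list of marks are hypothetical.

```python
def percentile_rank(marks, mark):
    """Percentage of marks in the list that are strictly below the given mark."""
    below = sum(m < mark for m in marks)
    return 100 * below / len(marks)

# Hypothetical marks, invented for illustration only.
marks = [45, 52, 58, 61, 67, 70, 74, 79, 85, 93]

rank_85 = percentile_rank(marks, 85)
rank_52 = percentile_rank(marks, 52)

print("a mark of 85 beat", rank_85, "% of marks")
print("a mark of 52 beat", rank_52, "% of marks")
```

In this list, 8 of the 10 marks are below 85, so under this convention a mark of 85 sits at the 80th percentile.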
[Figure: bell-shaped curve showing that approximately 95% of the data falls within 2 standard deviations of the mean]