The success of the open-source statistical software "R" has had a significant impact on statistics education and research over the past decade. In the first part of the book, we're going to introduce methods that help us describe data, and the second and third parts of the book focus on inferential statistics, which means drawing conclusions from data.
Population, Sample, and Observations
If we are interested in the social conditions under which Indian people live, then we would define all the inhabitants of India as Ω and each of its inhabitants as ω. All participants in the course form the population Ω, and each participant is a unit or observation ω.
Variables
- Qualitative and Quantitative Variables
- Discrete and Continuous Variables
- Scales
- Grouped Data
The values that these variables can take can be arranged in a logical and natural way. But even quantitative variables can be discrete: shoe size or number of academic semesters would be discrete because the number of values these variables can take is limited.
Data Collection
If the researcher decided to randomly assign toothpaste A to half of the study participants and toothpaste B to the other half, then this is an experiment because only the researcher decides which toothpaste should be used by any of the participants. The production process for the various units (facilities) is therefore under the control of management.
Creating a Data Set
Statistical Software
In most of these applications, it is possible to save the data as an ASCII file (.dat), as a tab-separated file (.txt), or as a comma-separated value file (.csv). A detailed description of data input and output can be found in the respective Manual available at http://.
Key Points and Further Issues
Exercises
Now, given this data set, try to write down the research questions above as accurately as possible. Let's write the salary of the first student as x1, the salary of the second student as x2, etc.
Absolute and Relative Frequencies
The sum of the absolute frequencies is equal to the total number of units in the data: $\sum_{j=1}^{k} n_j = n$. The relative frequency of the jth class is defined as $f_j = n_j / n$. We are interested in pizza deliveries by branch, and we generate a corresponding frequency table showing the distribution of the data using the table command in R (after reading in and attaching the data).
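The following is a minimal Python sketch of a frequency table, not the R workflow the text refers to. The branch labels are hypothetical and only illustrate how absolute frequencies n_j and relative frequencies f_j = n_j / n relate.

```python
from collections import Counter

# Hypothetical branch labels for ten deliveries (illustrative only).
branches = ["East", "West", "Centre", "East", "Centre",
            "West", "East", "Centre", "Centre", "East"]

n = len(branches)                                    # total number of units
absolute = Counter(branches)                         # absolute frequencies n_j
relative = {b: c / n for b, c in absolute.items()}   # relative frequencies f_j

for b in sorted(absolute):
    print(f"{b:<8} n_j = {absolute[b]:2d}   f_j = {relative[b]:.2f}")
```

The absolute frequencies add up to n and the relative frequencies add up to 1, which is a quick sanity check for any frequency table.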
Empirical Cumulative Distribution Function
ECDF for Ordinal Variables
The 200 customers who have had a car service performed within the last 30 days were asked to rate their overall satisfaction with the quality of the car service on a scale of 1 to 5 based on the following options: 1 = not at all satisfied, 2 = not satisfied, 3 = satisfied, 4 = very satisfied, and 5 = completely satisfied. Example 2.2.2 Suppose that in Example 2.2.1 we want to know how many customers are not satisfied with their car service.
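A small Python sketch of the empirical cumulative distribution function for an ordinal variable may help here. The counts per satisfaction level are hypothetical (the source does not list them); they are only chosen so that they add up to 200 customers.

```python
from itertools import accumulate

levels = [1, 2, 3, 4, 5]          # satisfaction scale from the survey
counts = [4, 16, 90, 70, 20]      # hypothetical absolute frequencies, sum = 200

n = sum(counts)
cum_counts = list(accumulate(counts))                 # cumulative absolute frequencies
ecdf = {x: c / n for x, c in zip(levels, cum_counts)} # F(x) = share of values <= x

# Customers who are "not satisfied" or worse are those with a rating of at most 2.
not_satisfied = cum_counts[levels.index(2)]
print(f"F(2) = {ecdf[2]:.2f}, i.e. {not_satisfied} of {n} customers")
```

Under these assumed counts the question of Example 2.2.2 is answered by reading off the cumulative frequency at level 2.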
ECDF for Continuous Variables
It is interesting to see that the graphs resulting from using the grouped data and ungrouped data are similar in this particular example. We therefore conclude, based on the grouped data, that only about 42% of the deliveries were completed in the required time frame.
Graphical Representation of a Variable
Bar Chart
The frequency table forms the basis for the bar chart, with the absolute or relative frequencies on the y-axis. Figure 2.5 shows the bar charts for the number and share of pizza deliveries per store.
Pie Chart
Instead, the area of the segment is proportional to the angle f_j · 360° (and also depends on the radius of the whole circle). It has been argued that this can cause misinterpretations as the human eye can capture the area of the segment more easily than the angle of a segment.
Histogram
Since the width of the class intervals is the same, the heights of the bars are proportional to the relative frequency of the corresponding category. Table 2.3 shows the summary of the grouped data and the values needed to calculate the histogram.
Kernel Density Plots
Summing the functions as described in Eq. (2.12) gives the solid black line, which is a plot of the kernel density of the five observations. If we reduce the bandwidth to half the default bandwidth (option adjust=0.5), the kernel density graph becomes more wiggly, as shown in Figure 2.10b.
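The idea of summing one kernel function per observation can be sketched in Python as follows. This is an illustration under stated assumptions, not the book's Eq. (2.12): the Gaussian kernel, the five data values, and the baseline bandwidth h = 1 are all hypothetical choices.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K(u) (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, data, h):
    """Kernel density estimate at x: (1 / (n*h)) * sum_i K((x - x_i) / h)."""
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (len(data) * h)

def count_local_maxima(values):
    """Number of interior grid points that are higher than both neighbours."""
    return sum(1 for a, b, c in zip(values, values[1:], values[2:]) if a < b > c)

data = [1.0, 2.5, 3.0, 4.5, 7.0]   # five hypothetical observations
h = 1.0                            # hypothetical baseline bandwidth
step = 0.05
grid = [-5.0 + i * step for i in range(361)]   # evaluation grid from -5 to 13

areas = {}
for bw in (h, 0.5 * h):            # baseline bandwidth and half of it
    dens = [kde(x, data, bw) for x in grid]
    areas[bw] = sum(dens) * step   # Riemann sum, should be close to 1
    print(f"bandwidth {bw}: area = {areas[bw]:.4f}, "
          f"local maxima = {count_local_maxima(dens)}")
```

Halving the bandwidth makes each kernel narrower, so the individual observations show up more strongly in the summed curve, which is the effect the text describes for adjust=0.5.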
Key Points and Further Issues
Exercises
Determine the time by which the first goal had been scored in 80% of the matches. The use of each measure depends on the scale of the relevant variable; see Appendix D.1 for a detailed overview.
Measures of Central Tendency
- Arithmetic Mean
- Median and Quantiles
- Quantile – Quantile Plots (QQ-Plots)
- Mode
- Geometric Mean
- Harmonic Mean
Properties of the arithmetic mean. i) The sum of the deviations of each observation from the arithmetic mean is zero: $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$. The mean and median are similar here because the distribution of the observations is symmetric about the center.
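Both statements can be checked numerically. The sketch below uses a small hypothetical data set that is symmetric about its center, so the deviations from the mean cancel and the mean coincides with the median.

```python
from statistics import mean, median

x = [2, 4, 6, 8, 10]                   # hypothetical, symmetric about 6

x_bar = mean(x)
deviations = [xi - x_bar for xi in x]  # x_i - x_bar for every observation

print("sum of deviations:", sum(deviations))
print("mean:", x_bar, "median:", median(x))
```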
Measures of Dispersion
Range and Interquartile Range
The range is a measure of dispersion defined as the difference between the largest and smallest data values: $R = x_{(n)} - x_{(1)}$. However, in accordance with most of the statistical literature, we define the interquartile range as a measure of dispersion, i.e. $d_Q = \tilde{x}_{0.75} - \tilde{x}_{0.25}$.
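As a rough illustration, the range and the interquartile range can be computed in Python as shown below. The data are hypothetical, and the quartiles depend on the interpolation rule: the "inclusive" method of the standard library is one of several conventions and need not match the quantile definition used in the text.

```python
from statistics import quantiles

x = [12, 15, 11, 20, 18, 14, 25, 16]   # hypothetical observations

data_range = max(x) - min(x)           # largest minus smallest value

# Quartiles under the "inclusive" interpolation rule (an assumed convention).
q1, q2, q3 = quantiles(x, n=4, method="inclusive")
iqr = q3 - q1                          # third quartile minus first quartile

print("range:", data_range)
print("quartiles:", q1, q2, q3, "interquartile range:", iqr)
```

The range reacts to a single extreme value, whereas the interquartile range only depends on the middle half of the data.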
Absolute Deviation, Variance, and Standard Deviation
Suppose another event B is defined as "the sum of the numbers on the upper faces is 6". The probability of the complementary event Ā is then the probability of not finding a marzipan chocolate.
Coefficient of Variation
Box Plots
The lower end of the box refers to the first quartile and the upper end of the box to the third quartile. Whiskers at the end of the graph indicate the minimum and maximum data values.
Measures of Concentration
Lorenz Curve
A sure event is "get an even number or an odd number"; an impossible event would be "the sum of the two dice is greater than 13". This observation shows the basic conclusion of the following chapter: the larger the sample size, the more confident we are in our conclusions.
Gini Coefficient
Key Points and Further Issues
If data for a continuous variable are grouped and the original ungrouped data are not known, additional assumptions are needed to calculate measures of central tendency and dispersion. In some cases, however, these prerequisites may not be met, and the specified formulas may give imprecise results.
Exercises
Figure 3.8 shows the QQ-plots for Y and X given Z. Exercise 3.6 There is no built-in function in R to calculate the mode of a variable. The manager of a bus company operating a route wants his buses to finish their run in 8 hours. a).
Summarizing the Distribution of Two Discrete Variables
Contingency Tables for Discrete Data
In this chapter, we present measures and graphical summaries for the association of two variables—depending on their scale. For example, the last column of the table shows that 7 passengers flew in economy class, 4 passengers in business class and 1 passenger in first class.
Joint, Marginal, and Conditional Frequency
We now collect and evaluate the responses of 100 customers (instead of 12 passengers as in Example 4.1.1) regarding their choice of travel class and their overall satisfaction with the quality of the flight. The conditional frequency distribution of "travel class" (X) of passengers given "overall rating of flight quality" (Y) is obtained from f_{X|Y = satisfaction level}.
Graphical Representation of Two Nominal or Ordinal Variables
In the context of contingency tables, two variables are independent of each other when the joint relative frequency is equal to the product of the marginal relative frequencies of the two variables, i.e. $f_{ij} = f_{i+} f_{+j}$. Note that the absolute frequencies are always integers, but the expected absolute frequencies may not always be integers.
Measures of Association for Two Discrete Variables
- Pearson's χ² Statistic
- Cramer's V Statistic
- Contingency Coefficient C
- Relative Risks and Odds Ratios
Thus, χ² ≈ 57 indicates a moderate association between "travel class" and "overall rating of flight quality" of passengers. There is a moderate to strong relationship between "travel class" and "overall flight quality rating" of passengers.
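To make these association measures concrete, the sketch below computes the expected frequencies under independence, Pearson's χ², Cramer's V, and the contingency coefficient C for a 3 × 3 table. The table is hypothetical (it is not the flight data from the example, so it does not reproduce χ² ≈ 57), and the formulas used are the standard textbook definitions.

```python
import math

# Hypothetical 3 x 3 contingency table of absolute frequencies
# (rows: travel class, columns: satisfaction level).
table = [[10, 20, 30],
         [15, 10, 5],
         [5, 3, 2]]

n = sum(sum(row) for row in table)
row_tot = [sum(row) for row in table]
col_tot = [sum(col) for col in zip(*table)]

# Expected absolute frequencies under independence: n_i+ * n_+j / n.
expected = [[r * c / n for c in col_tot] for r in row_tot]

# Pearson's chi-squared statistic: sum of (observed - expected)^2 / expected.
chi2 = sum((table[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(len(table)) for j in range(len(table[0])))

k, l = len(table), len(table[0])
cramers_v = math.sqrt(chi2 / (n * (min(k, l) - 1)))   # lies between 0 and 1
contingency_c = math.sqrt(chi2 / (chi2 + n))          # always below 1

print(f"chi2 = {chi2:.3f}, V = {cramers_v:.3f}, C = {contingency_c:.3f}")
```

Note that several of the expected frequencies are not integers, even though all observed frequencies are. Unlike χ² itself, V is normalised to the interval from 0 to 1, which is why it is easier to interpret as "weak", "moderate", or "strong".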
Association Between Ordinal and Continuous Variables
- Graphical Representation of Two Continuous Variables
- Correlation Coefficient
- Spearman's Rank Correlation Coefficient
- Measures Using Discordant and Concordant Pairs
Example 4.3.5 Let's follow Example 4.3.3 a bit further and calculate the Spearman rank correlation coefficient for the first five observations of the decathlon data. The differences between the correlation coefficient and the rank correlation coefficient are numerous: first, the Pearson correlation coefficient can only be used for continuous variables, but not for nominal or ordinal variables.
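A minimal Python sketch of Spearman's rank correlation coefficient is given below. It uses the standard formula R = 1 − 6 Σ d_i² / (n(n² − 1)), which assumes there are no ties, and five hypothetical pairs of values rather than the decathlon data.

```python
def ranks(values):
    """Rank 1 for the smallest value; assumes all values are distinct."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman(x, y):
    """Spearman's rank correlation for data without ties."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical measurements for five athletes (illustrative only).
x = [10.2, 11.5, 9.8, 12.1, 10.9]
y = [50, 47, 55, 49, 52]

r_s = spearman(x, y)
print("Spearman's R:", r_s)

# A strictly increasing transformation leaves the ranks unchanged.
print("R for x and x**3:", spearman(x, [v ** 3 for v in x]))
```

Because only ranks enter the calculation, the coefficient measures monotone rather than strictly linear association, which is one of the differences to the Pearson coefficient mentioned above.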
Visualization of Variables from Different Scales
Key Points and Further Issues
Exercises
Suppose there are two players A and B. The probabilities of hitting the target by A and B are 0.4 and 0.5, respectively. a). After collecting the weights of 200 babies, the researcher has a sample of 200 realized values (i.e. weights in grams).
Probability Calculus
Introduction
This can be generalized to a situation in which there are n balls in the urn and we want to draw m balls. In urn parlance, we are drawing without replacement (since each athlete can only have one particular seat).
Permutations
- Permutations without Replacement
- Permutations with Replacement
Combinations
- Combinations without Replacement
- Combinations with Replacement
- Combinations with Replacement
The total number of different combinations for the setting without replacement and with consideration of the order is $\frac{n!}{(n-m)!}$. The total number of different combinations with replacement and without consideration of the order is $\binom{n+m-1}{m}$.
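These counts can be checked by brute-force enumeration. The sketch below assumes an urn with n = 4 balls from which m = 2 are drawn; both numbers are hypothetical and only serve to compare the standard formulas n!/(n − m)! and C(n + m − 1, m) with an explicit listing.

```python
import math
from itertools import permutations, combinations_with_replacement

n, m = 4, 2                    # hypothetical: 4 balls in the urn, 2 draws
balls = range(1, n + 1)

# Without replacement, order matters: n! / (n - m)!
ordered_no_repl = math.perm(n, m)
listed_ordered = len(list(permutations(balls, m)))

# With replacement, order ignored: binomial coefficient C(n + m - 1, m)
unordered_repl = math.comb(n + m - 1, m)
listed_unordered = len(list(combinations_with_replacement(balls, m)))

print(ordered_no_repl, listed_ordered)     # both counts agree
print(unordered_repl, listed_unordered)    # both counts agree
```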
Key Points and Further Issues
Exercises
Another interpretation refers to a geometric representation of the binomial coefficient $\binom{n}{k}$. a) Show that each entry in the bold third diagonal line can be represented via $\binom{n}{2}$. The investment in a new advertising campaign can therefore only make sense if the chance of success is greater than that of the current advertising.
Basic Concepts and Set Theory
The collection of all simple events contained in event A is denoted by Ω_A. This means that all simple events of A are also part of the sample space of B.
Relative Frequency and Laplace Probability
Example 7.1.1 The properties of a pinball experiment, a game of roulette, or the lifetime of a television can all be described by a random variable, see Table 7.1. We can calculate the waiting time variance for a train using the probability density function.
The Axiomatic Definition of Probability
- Corollaries Following from Kolmogorov's Axioms
- Calculation Rules for Probabilities
Conditional Probability
- Bayes' Theorem
The blood sample is not infected and the test does not diagnose an infection. If one already knows that the test is positive and wants to determine the probability that the infection is actually present, then this can be achieved by calculating the respective conditional probability P(IP | T+).
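The calculation of such a conditional probability with Bayes' theorem can be sketched as follows. The prevalence, sensitivity, and specificity are hypothetical numbers, not those of the example; IP stands for "infection present" and T+ for "test positive".

```python
# Hypothetical inputs (assumptions for illustration only).
p_ip = 0.01            # P(IP): prior probability that the infection is present
sensitivity = 0.95     # P(T+ | IP)
specificity = 0.90     # P(T- | infection absent)

p_ia = 1 - p_ip                        # infection absent
p_pos_given_ia = 1 - specificity       # false positive rate

# Law of total probability for P(T+), then Bayes' theorem for P(IP | T+).
p_pos = sensitivity * p_ip + p_pos_given_ia * p_ia
posterior = sensitivity * p_ip / p_pos

print(f"P(T+) = {p_pos:.4f}")
print(f"P(IP | T+) = {posterior:.4f}")
```

With a rare infection, the posterior probability can be far below the sensitivity of the test, because most positive results then come from the large group of uninfected samples.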
Independence
The difference between pairwise independence and general stochastic independence is explained in the following example. The probability of all three events happening at the same time is zero because there is no ball with 111 printed on it.
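The distinction can be verified by enumeration. The sketch below assumes an urn with four equally likely balls labelled 110, 101, 011, and 000; this labelling is an assumption chosen to be consistent with the remark that no ball shows 111. The event A_i is "the i-th digit on the drawn ball is 1".

```python
from fractions import Fraction
from itertools import combinations

balls = ["110", "101", "011", "000"]   # assumed urn, each ball equally likely

def prob(event):
    """Laplace probability: favourable balls divided by all balls."""
    return Fraction(sum(1 for b in balls if event(b)), len(balls))

# P(A_i) for i = 1, 2, 3 (digit positions 0, 1, 2)
p_single = [prob(lambda b, i=i: b[i] == "1") for i in range(3)]

# Pairwise independence: P(A_i and A_j) == P(A_i) * P(A_j) for every pair.
pairwise_ok = all(
    prob(lambda b, i=i, j=j: b[i] == "1" and b[j] == "1") == p_single[i] * p_single[j]
    for i, j in combinations(range(3), 2)
)

# Mutual independence would additionally need P(A_1 and A_2 and A_3) = 1/8.
p_all = prob(lambda b: b == "111")
mutual_ok = p_all == p_single[0] * p_single[1] * p_single[2]

print("pairwise independent:", pairwise_ok)
print("P(all three):", p_all, "mutually independent:", mutual_ok)
```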
Key Points and Further Issues
Exercises
What is the probability that Dr. Obermeier's neighbors did not take care of the plant? What is the probability that at least one of the players succeeds in his shot?
Random Variables
Now we discuss the concepts needed to draw statistical conclusions from a sample of data about a population of interest. However, Chapters 9–11 show how a sample of data can be used to estimate unknown probabilities and other quantities, given a pre-specified level of uncertainty.
Cumulative Distribution Function (CDF)
- CDF of Continuous Random Variables
- CDF of Discrete Random Variables
We can see that X is a discrete random variable because its space is finite and countable. We can easily calculate various types of probabilities for discrete random variables using the CDF.
Expectation and Variance of a Random Variable
- Expectation
- Variance
- Quantiles of a Distribution
- Standardization
The expectation E(X), usually denoted by μ = E(X), refers to the arithmetic mean of the distribution of the population. A low standard deviation value indicates that the values are highly concentrated around the mean.
Tschebyschev's Inequality
However, if we apply our distribution knowledge that F(x) = x/20 (for 0 ≤ x ≤ 20), then we get a much more accurate result. It should be kept in mind that only the lack of distribution knowledge makes the inequality useful.
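The gap between the inequality and the exact result can be illustrated numerically. The sketch assumes a waiting time that is uniformly distributed on [0, 20], which matches F(x) = x/20; the half-width c = 7 of the interval around the mean is a hypothetical choice.

```python
a, b = 0.0, 20.0        # support of the uniform distribution, F(x) = x / 20
c = 7.0                 # hypothetical half-width of the interval around the mean

mu = (a + b) / 2                # expectation of a uniform distribution
var = (b - a) ** 2 / 12         # variance of a uniform distribution

# Tschebyschev: P(|X - mu| < c) >= 1 - Var(X) / c^2, using only mu and Var(X).
bound = 1 - var / c ** 2

def cdf(x):
    """CDF of the uniform distribution on [a, b]."""
    return min(max((x - a) / (b - a), 0.0), 1.0)

# Exact probability from the known distribution.
exact = cdf(mu + c) - cdf(mu - c)

print(f"Tschebyschev lower bound: {bound:.3f}")
print(f"exact probability:        {exact:.3f}")
```

The bound holds for every distribution with this mean and variance, so it is necessarily conservative compared with the probability computed from the known CDF.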
Bivariate Random Variables
The marginal distribution of X is contained in the last column of the table and lists the probability of smoking (unconditional on educational level), e.g. The slope of the marginal distribution is essentially the slope of the surface of the joint distribution shown in Fig. 7.6a.
Calculation Rules for Expectation and Variance
- Expectation and Variance of the Arithmetic Mean
If we roll two dice, X1 and X2, then the expectation of the sum of the two outcomes is E(X1 + X2) = E(X1) + E(X2) = 3.5 + 3.5 = 7. If this bus arrives only every 60 min, then let Y be the random variable describing the waiting time for the bus, with its own PDF.
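The calculation rules for sums can be confirmed by enumerating all 36 outcomes of two dice. The sketch assumes two fair, independent dice and uses exact fractions, so no rounding is involved.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
p_face = Fraction(1, 6)     # fair die (assumption)
p_pair = Fraction(1, 36)    # two independent fair dice

# Expectation and variance of a single die.
e_single = sum(a * p_face for a in faces)
var_single = sum((a - e_single) ** 2 * p_face for a in faces)

# Expectation and variance of the sum, by enumerating all pairs.
e_sum = sum((a + b) * p_pair for a, b in product(faces, repeat=2))
var_sum = sum((a + b - e_sum) ** 2 * p_pair for a, b in product(faces, repeat=2))

print("E(X1) =", e_single, " E(X1 + X2) =", e_sum)
print("Var(X1) =", var_single, " Var(X1 + X2) =", var_sum)
```

The expectation of the sum equals the sum of the expectations in general, whereas the variances only add up because the two dice are independent.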
Covariance and Correlation
- Covariance
- Correlation Coefficient
This makes sense as the waiting time for the train and the bus should be independent of each other. Therefore the correlation coefficient is also 0 indicating no linear relationship between bus and train waiting times.
Key Points and Further Issues
Exercises
What is the variance of X? Exercise 7.4 A quality index summarizes various properties of a product using a score. What is the joint PDF of X and Y? b) Calculate the marginal distributions of X and Y. d) Determine the joint PDF for U = X + Y. Exercise 7.8 Remember the urn model we introduced in Chap. 5.
Standard Discrete Distributions
- Discrete Uniform Distribution
- Degenerate Distribution
- Bernoulli Distribution
- Binomial Distribution
- Poisson Distribution
- Multinomial Distribution
- Geometric Distribution
- Hypergeometric Distribution
Consider the probability of the random event of drawing "2 red balls, 1 white ball and 1 black ball". The geometric distribution can be used to determine the probability that an event of interest occurs for the first time on the kth trial.
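For the geometric distribution, a short Python sketch under the usual parameterisation P(X = k) = (1 − p)^(k − 1) p is shown below. The success probability p = 1/6 (for instance, rolling a six with a fair die) is a hypothetical choice.

```python
def geometric_pmf(k, p):
    """Probability that the first success occurs on trial k (k = 1, 2, ...)."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    return (1 - p) ** (k - 1) * p   # k - 1 failures followed by one success

p = 1 / 6                           # hypothetical success probability per trial

for k in (1, 2, 3):
    print(f"P(X = {k}) = {geometric_pmf(k, p):.4f}")

# The probabilities over all k add up to 1 (checked here up to k = 200).
total = sum(geometric_pmf(k, p) for k in range(1, 201))
print(f"sum over k = 1..200: {total:.6f}")
```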
Standard Continuous Distributions
- Continuous Uniform Distribution
- Normal Distribution
- Exponential Distribution
The weights of the boxes vary and are assumed to be normally distributed with mean μ = 15 kg and variance σ². An interesting property of the exponential distribution is its lack of memory: if time t has already been reached, the probability of reaching a time greater than t + Δ does not depend on t.
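The lack of memory can be checked directly from the survival function P(X > t) = exp(−λt) of the exponential distribution. In the sketch below the rate λ, the elapsed times t, and the increment Δ are hypothetical values.

```python
import math

def survival(t, lam):
    """P(X > t) for an exponential distribution with rate lam."""
    return math.exp(-lam * t)

lam = 0.5      # hypothetical rate parameter
delta = 2.0    # hypothetical additional time

# P(X > t + delta | X > t) = P(X > t + delta) / P(X > t)
conditionals = []
for t in (0.0, 1.0, 5.0, 10.0):
    cond = survival(t + delta, lam) / survival(t, lam)
    conditionals.append(cond)
    print(f"t = {t:4.1f}: P(X > t + delta | X > t) = {cond:.6f}")

print(f"P(X > delta) = {survival(delta, lam):.6f}")
```

Each conditional probability equals P(X > Δ), regardless of how much time t has already passed.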
Sampling Distributions
- χ²-Distribution
- t-Distribution
- F-Distribution
Example 8.2.4 Let Y be the random variable that counts “the number of accesses per second for a search engine”. The random variable X, “waiting time until next access”, is then exponentially distributed with parameter λ = 10.
Key Points and Further Issues
An application of the F-distribution relates to the ratio of the sample variances of two independent samples, where each sample is i.i.d. We encourage the use of R to obtain quantiles of sampling distributions, but Tables C.1–C.3 also list some of them.
Exercises
It is also intuitively clear that the sample must be a representative sample of the voters' population to avoid any inconsistency or bias in the prediction. Note that the standard deviation based on the sample values divided by the square root of the sample size, i.e. σ̂/√n, is called the standard error of the mean.
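As a brief illustration, the quantity σ̂/√n can be computed as follows. The eight weights are hypothetical, and σ̂ is taken to be the sample standard deviation with divisor n − 1, which is an assumption about the estimator meant here.

```python
import math
from statistics import mean, stdev

# Hypothetical sample of birth weights in grams (illustrative only).
weights = [3200, 3450, 2980, 3610, 3330, 3120, 3540, 3275]

n = len(weights)
sigma_hat = stdev(weights)             # sample standard deviation (divisor n - 1)
std_error = sigma_hat / math.sqrt(n)   # sigma_hat / sqrt(n)

print(f"mean = {mean(weights):.1f} g")
print(f"sigma_hat = {sigma_hat:.1f} g, sigma_hat / sqrt(n) = {std_error:.1f} g")
```

Quadrupling the sample size halves this quantity, which reflects the greater precision of the mean of a larger sample.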
Inductive Statistics
Introduction
In the election example, the intuitive estimates for the proportions in the population are the proportions in the sample, and we call them sample estimates. It is reasonable to assume that the weight distribution of k-year-old boys follows a normal distribution with some unknown parameters μ_kb and σ²_kb.
Properties of Point Estimators
- Unbiasedness and Efficiency
- Consistency of Estimators
- Sufficiency of Estimators
We conclude that X̄ is an unbiased estimator of μ and that its variance is σ²/n, regardless of the choice of distribution of X. Therefore, an unbiased estimator of the population proportion of all library members who return books late is the corresponding sample proportion.
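A small simulation can make the unbiasedness of the sample mean and its variance σ²/n plausible. The sketch draws repeated samples from an exponential distribution with μ = 1 and σ² = 1; the distribution, the sample size n = 5, the number of repetitions, and the seed are all hypothetical choices, and the simulated values only approximate the theoretical ones.

```python
import random
from statistics import mean, pvariance

random.seed(1)         # fixed seed so that the run is reproducible

n = 5                  # sample size
reps = 20000           # number of simulated samples
mu, sigma2 = 1.0, 1.0  # mean and variance of the exponential(1) distribution

# One sample mean per simulated sample of size n.
sample_means = [sum(random.expovariate(1.0) for _ in range(n)) / n
                for _ in range(reps)]

mean_of_means = mean(sample_means)      # should be close to mu
var_of_means = pvariance(sample_means)  # should be close to sigma2 / n

print(f"average of sample means:  {mean_of_means:.3f} (theory: {mu})")
print(f"variance of sample means: {var_of_means:.3f} (theory: {sigma2 / n})")
```

The exponential distribution is deliberately skewed here, which illustrates that the result does not rely on normality.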
Point Estimation
- Maximum Likelihood Estimation
- Method of Moments
The principle of maximum likelihood estimation now states that the estimator p̂ of p is the value of p which maximizes the likelihood (9.8) or (9.9). We use the well-known principle of maxima and minima to maximize the likelihood function in this case.
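The maximisation can also be carried out numerically. The sketch below evaluates the Bernoulli log-likelihood on a grid of candidate values for p and compares the maximiser with the closed-form solution, the sample proportion. The ten observations and the grid are hypothetical, and the log-likelihood is written in its standard form rather than copied from Eqs. (9.8) or (9.9).

```python
import math

# Hypothetical Bernoulli sample: 1 = success, 0 = failure.
x = [1, 0, 0, 1, 1, 0, 1, 1, 0, 1]
n, successes = len(x), sum(x)

def log_likelihood(p):
    """Bernoulli log-likelihood: sum_i [x_i * log(p) + (1 - x_i) * log(1 - p)]."""
    return successes * math.log(p) + (n - successes) * math.log(1 - p)

# Grid search over candidate values 0.001, 0.002, ..., 0.999.
grid = [i / 1000 for i in range(1, 1000)]
p_hat_grid = max(grid, key=log_likelihood)

# Setting the first derivative of the log-likelihood to zero gives the sample proportion.
p_hat_closed = successes / n

print("grid maximiser:", p_hat_grid)
print("closed form:   ", p_hat_closed)
```

Maximising the log-likelihood instead of the likelihood gives the same estimate, because the logarithm is strictly increasing.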
Interval Estimation
- Introduction
- Confidence Interval for the Mean of a Normal Distribution
- Confidence Interval for a Binomial Probability
- Confidence Interval for the Odds Ratio
Sample Size Determinations
Key Points and Further Issues
Exercises
Introduction
Basic Definitions
- One- and Two-Sample Problems
- Hypotheses
- One- and Two-Sided Tests
- Type I and Type II Error
- How to Conduct a Statistical Test
- Test Decisions Using the p-Value
- Test Decisions Using Confidence Intervals
Parametric Tests for Location Parameters
- Test for the Mean When the Variance Is Known
- Test for the Mean When the Variance Is Unknown
- Comparing the Means of Two Independent Samples
- Test for Comparing the Means
Parametric Tests for Probabilities
- One-Sample Binomial Test for the Probability p
- Two-Sample Binomial Test
Tests for Scale Parameters
Wilcoxon–Mann–Whitney (WMW) U-Test
Key Points and Further Issues
Exercises
The Linear Model
Method of Least Squares
- Properties of the Linear Regression Line
Goodness of Fit
Linear Regression with a Binary Covariate
Linear Regression with a Transformed Covariate
Linear Regression with Multiple Covariates
- Matrix Notation
- Categorical Covariates
- Transformations
The Inductive View of Linear Regression
- Properties of Least Squares and Maximum Likelihood Estimators
- The ANOVA Table
- Interactions
Comparing Different Models
Checking Model Assumptions
Association Versus Causation
Key Points and Further Issues
Exercises