Christian Heumann · Michael Schomaker · Shalabh: Introduction to ... - ITDA

The success of the open-source statistical software "R" has had a significant impact on statistics education and research over the past decade. In the first part of the book, we're going to introduce methods that help us describe data, and the second and third parts of the book focus on inferential statistics, which means drawing conclusions from data.

Population, Sample, and Observations

If we are interested in the social conditions under which Indian people live, then we would define all the inhabitants of India as Ω and each of its inhabitants as ω. All participants in the course form the population Ω, and each participant refers to a unit or observation ω.

Variables

Qualitative and Quantitative Variables
Discrete and Continuous Variables
Scales
Grouped Data

The values that these variables can take can be arranged in a logical and natural way. But even quantitative variables can be discrete: shoe size or number of academic semesters would be discrete because the number of values these variables can take is limited.

Data Collection

If the researcher decided to randomly assign toothpaste A to half of the study participants and toothpaste B to the other half, then this is an experiment because only the researcher decides which toothpaste should be used by any of the participants. The production process for the various units (facilities) is therefore under the control of management.

Creating a Data Set

Statistical Software

In most of these applications, it is possible to save the data as an ASCII file (.dat), as a tab-separated file (.txt), or as a comma-separated value file (.csv). A detailed description of data input and output can be found in the respective Manual available at http://.

Key Points and Further Issues

Exercises

Now, given this data set, try to write down the research questions above as accurately as possible. Let's write the salary of the first student as x1, the salary of the second student as x2, etc.

Absolute and Relative Frequencies

The sum of the absolute frequencies is equal to the total number of units in the data: $\sum_{j=1}^{k} n_j = n$. The relative frequency of the jth class is defined as $f_j = n_j / n$, $j = 1, \dots, k$. We are interested in pizza deliveries by branch and generate the corresponding frequency table, which shows the distribution of the data, using the table command in R (after reading and attaching the data).
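The frequency definitions above can be illustrated with a minimal sketch. It is written in Python rather than R, and the branch labels are hypothetical values chosen for illustration, not the pizza delivery data.

```python
from collections import Counter

# Hypothetical branch labels for eight deliveries (assumption, not the book's data).
branch = ["East", "West", "Centre", "East", "West", "West", "Centre", "East"]

n = len(branch)

# Absolute frequencies n_j: how often each category occurs.
abs_freq = Counter(branch)

# Relative frequencies f_j = n_j / n.
rel_freq = {category: n_j / n for category, n_j in abs_freq.items()}

for category in sorted(abs_freq):
    print(category, abs_freq[category], round(rel_freq[category], 3))
```

The absolute frequencies add up to n and the relative frequencies add up to 1, which is a quick sanity check for any frequency table.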

Empirical Cumulative Distribution Function

ECDF for Ordinal Variables

The 200 customers who have had a car service performed within the last 30 days were asked to respond regarding their overall satisfaction with the quality of the car service on a scale of 1 to 5 based on the following options: 1 = not at all satisfied, 2 = not satisfied, 3 = satisfied, 4 = very satisfied, and 5 = completely satisfied. Example 2.2.2 Suppose that in Example 2.2.1 we want to know how many customers are not satisfied with their car service.

Fig. 2.1 ECDF for the satisfaction survey

ECDF for Continuous Variables

It is interesting to see that the graphs resulting from using the grouped data and ungrouped data are similar in this particular example. We therefore conclude, based on the grouped data, that only about 42 % of the deliveries were completed in the required time frame.

Fig. 2.2 Illustration of the ECDF for continuous data available in groups/intervals ∗

Graphical Representation of a Variable

Bar Chart

The frequency table forms the basis for the bar chart, with either the absolute or the relative frequencies on the y-axis. Figure 2.5 shows the bar charts for the absolute and relative frequencies of pizza deliveries per branch.

Pie Chart

Instead, the area of the segment is proportional to the angle $f_j \cdot 360^\circ$ (and also depends on the radius of the whole circle). It has been argued that this can cause misinterpretations, as the human eye may catch the area of a segment more easily than its angle.

Histogram

Since the width of the class intervals is the same, the heights of the bars are proportional to the relative frequency of the corresponding category. Table 2.3 shows the summary of the grouped data and the values needed to calculate the histogram.

Kernel Density Plots

Summing the functions, as described in Eq. (2.12), gives the solid black line, which is the kernel density plot of the five observations. If we reduce the bandwidth to half the default bandwidth (option adjust=0.5), the kernel density plot becomes more wiggly, as shown in Figure 2.10b.
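The idea of summing one kernel function per observation can be sketched as follows. This is an illustration in Python, not the book's R code: it assumes the usual estimator $\hat{f}(x) = \frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$ with a Gaussian kernel, and the five observations and both bandwidths are hypothetical.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kernel_density(x, data, h):
    """Kernel density estimate at x: sum of scaled kernels, one per observation."""
    n = len(data)
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (n * h)

data = [1.0, 2.5, 3.0, 4.0, 7.0]   # five hypothetical observations
step = 0.01
grid = [i * step for i in range(-1000, 2001)]   # evaluation points from -10 to 20

# Numerically integrate the estimate for a wide and a halved bandwidth.
areas = {}
for h in (1.0, 0.5):
    areas[h] = sum(kernel_density(x, data, h) for x in grid) * step
    print("bandwidth", h, "area under the curve", round(areas[h], 4))
```

Halving the bandwidth makes each kernel narrower, so the summed curve follows the individual observations more closely while still integrating to approximately 1.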

Exercises

Determine the time point by which the first goal had been scored in 80 % of the matches. The use of each measure depends on the scale of the relevant variable; see Appendix D.1 for a detailed overview.

Measures of Central Tendency

Arithmetic Mean
Median and Quantiles
Quantile–Quantile Plots (QQ-Plots)
Mode
Geometric Mean
Harmonic Mean

Properties of the arithmetic mean. (i) The sum of the deviations of each observation from the arithmetic mean is zero: $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$. The mean and median are similar here because the distribution of the observations is symmetric about the center.

Fig. 3.3 Different patterns for a QQ-plot

Measures of Dispersion

Range and Interquartile Range

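The range is a measure of dispersion defined as the difference between the largest and smallest data values: $R = x_{(n)} - x_{(1)}$. However, in accordance with most of the statistical literature, we define the interquartile range as a measure of dispersion, i.e. as the difference between the third and first quartiles: $d_Q = \tilde{x}_{0.75} - \tilde{x}_{0.25}$.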

Absolute Deviation, Variance, and Standard

If another event B is defined as "the sum of numbers on the upper levels is 6", then. Then the probability of the complementary event is A, that is, the probability of not finding a marzipan chocolate.

Coefficient of Variation

Box Plots

The lower end of the box refers to the first quartile and the upper end of the box to the third quartile. Whiskers at the end of the graph indicate the minimum and maximum data values.

Measures of Concentration

Lorenz Curve

A sure event is "get an even number or an odd number"; an impossible event would be "the sum of the two dice is greater than 13". This observation shows the basic conclusion of the following chapter: the larger the sample size, the more confident we are in our conclusions.

Gini Coefficient

Key Points and Further Issues

If data for a continuous variable are grouped and the original ungrouped data are not known, additional assumptions are needed to calculate measures of central tendency and dispersion. In some cases, however, these prerequisites may not be met, and the specified formulas may give imprecise results.

Exercises

Figure 3.8 shows the QQ-plots for Y and X given Z. Exercise 3.6 There is no built-in function in R to calculate the mode of a variable. The manager of a bus company operating a route wants his buses to finish their run in 8 hours. a).

Figure 3.8 shows the QQ-plots for both Y and X given Z. Interpret both graphs.

Summarizing the Distribution of Two Discrete Variables

Contingency Tables for Discrete Data

In this chapter, we present measures and graphical summaries for the association of two variables—depending on their scale. For example, the last column of the table shows that 7 passengers flew in economy class, 4 passengers in business class and 1 passenger in first class.

Table 4.1 Contingency table for travel class and satisfaction

Joint, Marginal, and Conditional Frequency

We now collect and evaluate the responses of 100 customers (instead of 12 passengers as in example 4.1.1) regarding their choice of travel class and their overall satisfaction with the quality of the flight. The conditional frequency distribution of "travel class" (X) of passengers given "overall rating of flight quality" (Y) is obtained from fX|Y=satisfaction level.

Table 4.3 Contingency table for travel class and satisfaction Overall rating of flight quality

Graphical Representation of Two Nominal or

In the context of contingency tables, two variables are independent of each other when the joint relative frequency equals the product of the marginal relative frequencies of the two variables, i.e. $f_{ij} = f_{i+} f_{+j}$. Note that the absolute frequencies are always integers, but the expected absolute frequencies may not always be integers.

Measures of Association for Two Discrete Variables

Pearson's χ² Statistic
Cramer's V Statistic
Contingency Coefficient C
Relative Risks and Odds Ratios

Thus, χ2≈57 indicates a moderate association between "travel class" and "overall rating of flight quality" of passengers. There is a moderate to strong relationship between "travel class" and "overall flight quality rating" of passengers.
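How the expected frequencies under independence feed into Pearson's χ² statistic and Cramer's V can be sketched as below. The code is Python rather than R, the 2 × 3 table is hypothetical (it is not Table 4.3), and the formulas used are the standard ones: expected frequency $n_{i+} n_{+j} / n$, $\chi^2 = \sum_{i,j} (n_{ij} - \tilde{n}_{ij})^2 / \tilde{n}_{ij}$, and $V = \sqrt{\chi^2 / (n (\min(k, l) - 1))}$.

```python
import math

# Hypothetical 2 x 3 contingency table of absolute frequencies n_ij (assumption).
table = [[10, 20, 30],
         [20, 15, 5]]

n = sum(sum(row) for row in table)
row_totals = [sum(row) for row in table]          # marginal frequencies n_i+
col_totals = [sum(col) for col in zip(*table)]    # marginal frequencies n_+j

# Expected absolute frequencies under independence: n_i+ * n_+j / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson's chi-square statistic: squared deviations scaled by the expected counts.
chi2 = sum(
    (table[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(len(table))
    for j in range(len(table[0]))
)

# Cramer's V rescales chi-square to the interval [0, 1].
k, l = len(table), len(table[0])
cramers_v = math.sqrt(chi2 / (n * (min(k, l) - 1)))

print("chi-square:", round(chi2, 3))
print("Cramer's V:", round(cramers_v, 3))
```

Because χ² grows with the sample size and the table dimensions, a normalized measure such as V is easier to compare across tables.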

Association Between Ordinal and Continuous Variables

Graphical Representation of Two Continuous
Correlation Coefficient
Spearman's Rank Correlation Coefficient
Measures Using Discordant and Concordant Pairs

Example 4.3.5 Let's follow Example 4.3.3a a bit further and calculate the Spearman rank correlation coefficient for the first five observations of the decathlon data. The differences between the correlation coefficient and the rank correlation coefficient are numerous: first, the Pearson correlation coefficient can only be used for continuous variables, but not for nominal or ordinal variables.
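A rank correlation of this kind can be computed as in the following sketch. It is written in Python, assumes there are no ties so that the formula $R = 1 - \frac{6 \sum_{i} d_i^2}{n(n^2 - 1)}$ applies, and uses five hypothetical pairs of values rather than the decathlon data.

```python
def ranks(values):
    """Rank 1 goes to the smallest value; assumes all values are distinct."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman(x, y):
    """Spearman's rank correlation via squared rank differences (no ties)."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Five hypothetical paired observations (assumption, not the book's data).
x = [7.2, 6.8, 7.9, 6.5, 7.5]
y = [14.1, 13.2, 15.3, 13.6, 15.0]

print("ranks of x:", ranks(x))
print("ranks of y:", ranks(y))
print("Spearman's R:", spearman(x, y))
```

Only the ordering of the values enters the calculation, which is why the coefficient can also be used for ordinal variables.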

Fig. 4.4 Scatter plot between tweets and followers

Visualization of Variables from Different Scales

Exercises

Suppose there are two players A and B. The probabilities of hitting the target by A and B are 0.4 and 0.5, respectively. a). After collecting the weights of 200 babies, the researcher has a sample of 200 realized values (i.e. weights in grams).

Fig. 4.7 Temperature and hotel occupancy for the different cities

Probability Calculus

Introduction

This can be generalized to a situation in which there are n balls in the urn and we want to draw m balls. In urn parlance, we are drawing without replacement (since each athlete can occupy only one particular seat).

Fig. 5.1 a Representation of the urn model. Drawing from the urn model b with replacement and c without replacement

Permutations

Permutations without Replacement
Permutations with Replacement

Combinations

Combinations without Replacement
Combinations with Replacement
Combinations with Replacement

When drawing m out of n balls, the total number of different combinations without replacement and with consideration of the order is $\frac{n!}{(n-m)!}$. The total number of different combinations with replacement and without consideration of the order is $\binom{n+m-1}{m}$.
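The four standard counting rules for drawing m out of n balls can be checked against brute-force enumeration, as in this Python sketch. The values n = 5 and m = 3 are hypothetical and chosen only to keep the enumeration small.

```python
import math
from itertools import (combinations, combinations_with_replacement,
                       permutations, product)

n, m = 5, 3   # hypothetical urn size and number of draws

# Closed-form counts for the four settings.
counts = {
    "without replacement, with order": math.perm(n, m),          # n! / (n - m)!
    "without replacement, without order": math.comb(n, m),       # binomial coefficient
    "with replacement, with order": n ** m,
    "with replacement, without order": math.comb(n + m - 1, m),
}

# Brute-force enumeration of the same four settings.
balls = range(n)
enumerated = {
    "without replacement, with order": len(list(permutations(balls, m))),
    "without replacement, without order": len(list(combinations(balls, m))),
    "with replacement, with order": len(list(product(balls, repeat=m))),
    "with replacement, without order": len(list(combinations_with_replacement(balls, m))),
}

for setting, value in counts.items():
    print(setting, value, enumerated[setting])
```

Enumeration is only feasible for small n and m, but it is a convenient way to confirm which formula belongs to which setting.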

Exercises

Another interpretation refers to a geometric representation of the binomial coefficient,n. a) Show that each item in the bold third diagonal line can be represented vian. The investment in a new advertising campaign can therefore only make sense if the chance of success is greater than that of the current advertising.

Fig. 5.2 Excerpt from Pascal's triangle (left) and its representation by means of binomial coefficients (right)

Basic Concepts and Set Theory

The collection of all simple events contained in event A is denoted by ΩA. This means that all simple events of A are also part of the sample space of B.

Relative Frequency and Laplace Probability

Example 7.1.1 The properties of a pinball experiment, a game of roulette, or the lifetime of a television can all be described by a random variable, see Table 7.1. We can calculate the waiting time variance for a train using the probability density function.

The Axiomatic Definition of Probability

Corollaries Following from Kolmogorov's
Calculation Rules for Probabilities

Conditional Probability

Bayes' Theorem

The blood sample does not have an infection and the test does not diagnose it, i.e. If one already knows that the test is positive and wants to determine the probability that the infection is actually present, then this can be achieved by the respective conditional probability $P(IP \mid T_+)$, which is.
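The conditional probability of an infection given a positive test can be worked out with Bayes' theorem, as in the Python sketch below. The prevalence, sensitivity, and specificity are hypothetical numbers (they are not taken from Table 6.1), and IA for "infection absent" is a label introduced here for illustration.

```python
# Hypothetical inputs (assumptions, not the frequencies of Table 6.1).
prevalence = 0.01     # P(IP): infection present
sensitivity = 0.95    # P(T+ | IP): test positive given infection present
specificity = 0.90    # P(T- | IA): test negative given infection absent

# Law of total probability: overall chance of a positive test.
p_pos_given_absent = 1 - specificity
p_positive = sensitivity * prevalence + p_pos_given_absent * (1 - prevalence)

# Bayes' theorem: P(IP | T+) = P(T+ | IP) * P(IP) / P(T+).
posterior = sensitivity * prevalence / p_positive

print("P(T+)      =", round(p_positive, 4))
print("P(IP | T+) =", round(posterior, 4))
```

With a rare infection, most positive results come from the large group without the infection, so the posterior probability stays well below the sensitivity of the test.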

Table 6.1 Absolute frequencies of test results and infection status

Independence

The difference between pairwise independence and general stochastic independence is explained in the following example. The probability of all three events happening at the same time is zero because there is no ball with 111 printed on it.

Exercises

What is the probability that Dr. Obermeier's neighbors did not take care of the plant? What is the probability that at least one of the players succeeds in his shot?

Random Variables

Now we discuss the concepts needed to draw statistical conclusions from a sample of data about a population of interest. However, Chapters 9–11 show how a sample of data can be used to estimate unknown probabilities and other quantities, given a pre-specified level of uncertainty.

Cumulative Distribution Function (CDF)

CDF of Continuous Random Variables
CDF of Discrete Random Variables

We can see that X is a discrete random variable because its space is finite and countable. We can easily calculate various types of probability for discrete random variables using CDF.

Fig. 7.1 Probability density function (PDF) and cumulative distribution function (CDF) for waiting time in Example 7.2.1

Expectation and Variance of a Random Variable

Expectation
Variance
Quantiles of a Distribution
Standardization

The expectation E(X) is usually denoted by μ = E(X) and refers to the arithmetic mean of the population distribution. A low standard deviation indicates that the values are highly concentrated around the mean.

Fig. 7.4 First quartile, median, and third quartile ∗

Tschebyschev's Inequality

However, if we apply our distribution knowledge that $F(x) = \frac{1}{20}x$ (for $0 \le x \le 20$), then we get a much more accurate result, viz. It should be kept in mind that only the lack of distribution knowledge makes the inequality useful.
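The gap between the distribution-free bound and the exact probability can be made concrete with a short Python sketch. It assumes a continuous uniform distribution on [0, 20], which matches the CDF quoted above, uses the inequality in the form $P(|X - \mu| < c) \ge 1 - \mathrm{Var}(X)/c^2$, and picks c = 7 as a hypothetical value.

```python
from fractions import Fraction

# Uniform distribution on [a, b] = [0, 20], matching F(x) = x / 20.
a, b = Fraction(0), Fraction(20)
mu = (a + b) / 2                  # expectation of the uniform distribution
variance = (b - a) ** 2 / 12      # variance of the uniform distribution

def cdf(x):
    """CDF of the uniform distribution on [a, b]."""
    if x < a:
        return Fraction(0)
    if x > b:
        return Fraction(1)
    return (x - a) / (b - a)

c = Fraction(7)   # hypothetical half-width of the interval around mu

# Tschebyschev's lower bound needs only the variance.
bound = 1 - variance / c ** 2

# Exact probability using the known distribution.
exact = cdf(mu + c) - cdf(mu - c)

print("lower bound:", bound, "=", float(bound))
print("exact value:", exact, "=", float(exact))
```

The bound holds for every distribution with this variance, which is why it is far less sharp than the value obtained from the known CDF.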

Bivariate Random Variables

The marginal distribution of X is contained in the last column of the table and lists the probability of smoking (unconditional on educational level), e.g. The slope of the marginal distribution is essentially the slope of the surface of the joint distribution shown in Fig. 7.6a.

Fig. 7.5 Area covering all points of (X, Y) with (x1 ≤ X ≤ x2, y1 ≤ Y ≤ y2) ∗

Calculation Rules for Expectation and Variance

Expectation and Variance of the Arithmetic Mean

If we roll two dice X1 and X2, then the expectation is the sum of the two outcomes. If this bus only arrives every 60 min, then the PDF of the random variable Y is the waiting time for the bus.

Covariance and Correlation

Covariance
Correlation Coefficient

This makes sense as the waiting time for the train and the bus should be independent of each other. Therefore the correlation coefficient is also 0 indicating no linear relationship between bus and train waiting times.

Exercises

What is the variance of X? Exercise 7.4 A quality index summarizes various properties of a product using a score. What is the joint PDF of X and Y? b) Calculate the marginal distributions of X and Y. d) Determine the joint PDF for U = X + Y. Exercise 7.8 Remember the urn model we introduced in Chap. 5.

Standard Discrete Distributions

Discrete Uniform Distribution
Degenerate Distribution
Bernoulli Distribution
Binomial Distribution
Poisson Distribution
Multinomial Distribution
Geometric Distribution
Hypergeometric Distribution

The probability of the random event of drawing "2 red balls, 1 white ball and 1 black ball" is: The geometric distribution can be used to determine the probability that an event of interest occurs for the first time on the kth trial.
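The geometric distribution mentioned above has the probability mass function $P(X = k) = p(1 - p)^{k-1}$ for $k = 1, 2, \dots$. A minimal Python sketch, with a hypothetical success probability p = 0.2:

```python
def geometric_pmf(k, p):
    """Probability that the first success occurs on trial k (k = 1, 2, ...)."""
    if k < 1:
        return 0.0
    return p * (1 - p) ** (k - 1)

p = 0.2   # hypothetical success probability per trial

# Probability that the event occurs for the first time on the third trial:
# two failures followed by one success.
p_third = geometric_pmf(3, p)

# Summing over many trials should give a total probability close to 1.
total = sum(geometric_pmf(k, p) for k in range(1, 201))

print("P(X = 3):", p_third)
print("sum over k = 1..200:", total)
```

The probabilities shrink by the factor (1 - p) with each additional trial, so the first trial is always the single most likely one.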

Fig. 8.1 Frequency distribution of 1000 generated discrete uniform random numbers with possible outcomes (2, 5, 8, 10)

Standard Continuous Distributions

Continuous Uniform Distribution
Normal Distribution
Exponential Distribution

The weights of the boxes vary and are assumed to be normally distributed with μ = 15 kg and σ² =. An interesting property of the exponential distribution is its lack of memory: if time t has already been reached, the probability of reaching a time greater than t + Δ does not depend on t.
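The lack of memory can be verified numerically from the survival function $P(X > x) = e^{-\lambda x}$ of the exponential distribution. In the Python sketch below, the rate and the time increment are hypothetical values.

```python
import math

def survival(x, lam):
    """P(X > x) for an exponential distribution with rate lam."""
    return math.exp(-lam * x)

lam = 0.5     # hypothetical rate parameter
delta = 3.0   # hypothetical additional time

# P(X > t + delta | X > t) = P(X > t + delta) / P(X > t) for several values of t.
conditional = {
    t: survival(t + delta, lam) / survival(t, lam)
    for t in (0.0, 1.0, 2.0, 10.0)
}

for t, prob in conditional.items():
    print("t =", t, "P(X > t + delta | X > t) =", round(prob, 6))

print("P(X > delta) =", round(survival(delta, lam), 6))
```

Every conditional probability equals P(X > Δ), regardless of how much time has already elapsed.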

Fig. 8.5 PDF of N(0, 2), N(0, 1) and N(0, 0.5) distributions ∗

Sampling Distributions

χ²-Distribution
t-Distribution
F-Distribution

Example 8.2.4 Let Y be the random variable that counts "the number of accesses per second for a search engine". The random variable X, "waiting time until next access", is then exponentially distributed with parameter λ = 10.

Fig. 8.8 Probability density functions for different F-distributions ∗

Key Points and Further Issues

An application of the F-distribution relates to the ratio of the sample variances of two independent samples, where each sample is an i.i.d. sample. We encourage the use of R to obtain quantiles of sampling distributions, but Tables C.1–C.3 also list some of them.

Exercises

It is also intuitively clear that the sample must be a representative sample of the voters' population to avoid any inconsistency or bias in the prediction. Note that the standard deviation based on the sample values divided by the square root of the sample size, i.e. $\hat{\sigma}/\sqrt{n}$.

Inductive Statistics

Introduction

In the election example, the intuitive estimates for the ratios in the population are the ratios in the sample and we call them sample estimates. It is reasonable to assume that the weight distribution of k-year-old boys follows a normal distribution with some unknown parameters $\mu_{kb}$ and $\sigma^2_{kb}$.

Properties of Point Estimators

Unbiasedness and Efficiency
Consistency of Estimators
Sufficiency of Estimators

We conclude that $\bar{X}$ is an unbiased estimator of μ and its variance is $\sigma^2/n$ regardless of the choice of distribution of X. Therefore, an unbiased estimator of the population proportion of all library members who return books late is .

Fig. 9.1 Illustration of bias and variance

Point Estimation

Maximum Likelihood Estimation
Method of Moments

The principle of maximum likelihood estimation now states that the estimator pˆ of p is the value of p which maximizes the likelihood (9.8) or (9.9). We use the well-known maxima-minima principle to maximize the likelihood function in this case.
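The maximization principle can be illustrated with a small numerical sketch in Python. It assumes, as a guess about the setting of (9.8) and (9.9), an i.i.d. Bernoulli sample with success probability p; the ten observations are hypothetical. The log-likelihood is evaluated on a grid and its maximizer is compared with the sample proportion.

```python
import math

# Hypothetical Bernoulli sample (1 = success, 0 = failure); an assumption.
sample = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
n = len(sample)
successes = sum(sample)

def log_likelihood(p):
    """Log-likelihood of an i.i.d. Bernoulli sample for 0 < p < 1."""
    return successes * math.log(p) + (n - successes) * math.log(1 - p)

# Evaluate the log-likelihood on a fine grid and pick the maximizing value.
grid = [i / 1000 for i in range(1, 1000)]
p_hat_grid = max(grid, key=log_likelihood)

# Setting the derivative to zero gives the closed form: the sample proportion.
p_hat_closed = successes / n

print("grid maximizer:   ", p_hat_grid)
print("sample proportion:", p_hat_closed)
```

Maximizing the log-likelihood instead of the likelihood gives the same estimate, because the logarithm is a strictly increasing function.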

Interval Estimation

Introduction
Confidence Interval for the Mean of a Normal
Confidence Interval for a Binomial Probability
Confidence Interval for the Odds Ratio

Sample Size Determinations

Exercises

Introduction

Basic Definitions

One- and Two-Sample Problems
Hypotheses
One- and Two-Sided Tests
Type I and Type II Error
How to Conduct a Statistical Test
Test Decisions Using the p-Value
Test Decisions Using Confidence Intervals

Parametric Tests for Location Parameters

Test for the Mean When the Variance
Test for the Mean When the Variance
Comparing the Means of Two Independent
Test for Comparing the Means

Parametric Tests for Probabilities

One-Sample Binomial Test for the Probability p
Two-Sample Binomial Test

Tests for Scale Parameters

Wilcoxon–Mann–Whitney (WMW) U-Test

Exercises

The Linear Model

Method of Least Squares

Properties of the Linear Regression Line

Goodness of Fit

Linear Regression with a Binary Covariate

Linear Regression with a Transformed Covariate

Linear Regression with Multiple Covariates

Matrix Notation
Categorical Covariates
Transformations

The Inductive View of Linear Regression

Properties of Least Squares and Maximum
The ANOVA Table
Interactions

Comparing Different Models

Checking Model Assumptions

Association Versus Causation

Exercises