Probability and Statistics for Computer Science

In my experience, many students learned some or all of this material without realizing how useful it was, and then forgot it. I also find that computer science students find simple Markov chains natural (although they may find the notation annoying) and will suggest simulating a chain before the instructor does.

Notation and Conventions

Fairness: Each face of a fair coin or die has the same probability of landing face up on a flip or roll. If a roulette wheel is properly balanced, the ball has the same probability of landing in each slot.

Describing Datasets

First Tools for Looking at Data

Datasets

Throughout the book we will see many datasets downloaded from various internet sources because people are so generous in publishing interesting datasets on the internet. In the next chapter, we'll look at two-dimensional data, and in Chapter 10, we'll look at high-dimensional data.

What’s Happening? Plotting Data

Bar Charts
Histograms
How to Make Histograms
Conditional Histograms

In this case, the height of the box is determined by the number of data items in the box. Each entry represents the number of data items that lie in that interval.
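As a minimal sketch of the counting behind a histogram, the Python below splits the range of the data into equal-width intervals and counts the data items in each one. The function name `histogram_counts`, the bin convention (left-closed intervals, with the maximum placed in the last bin), and the sample data are illustrative assumptions rather than anything taken from the text.

```python
def histogram_counts(data, n_bins):
    """Count how many data items fall in each of n_bins equal-width intervals.

    Assumes the data items are not all equal, so the interval width is nonzero.
    """
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in data:
        # Intervals are [lo + i*width, lo + (i+1)*width); the maximum goes in the last bin.
        i = min(int((x - lo) / width), n_bins - 1)
        counts[i] += 1
    return counts

# Hypothetical data: each count is the height of one histogram box.
heights = histogram_counts([1, 2, 2, 3, 3, 3, 4, 4, 5, 9], n_bins=4)
print(heights)
```

The counts always sum to the number of data items, which is a quick sanity check on any binning rule.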

Table 1.2 Chase and Dunner, in a study described in the text, collected data on what students thought made other students popular

Summarizing 1D Data

The Mean
Standard Deviation
Computing Mean and Standard Deviation Online
Variance
The Median
Interquartile Range
Using Summaries Sensibly

Similarly, after viewing the elements, you will have an estimate of the standard deviation based on those elements. The properties of the variance derive from the fact that it is the square of the standard deviation.
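One way to keep a running estimate of the mean and standard deviation as elements arrive is sketched below. It uses Welford's update, a standard technique that may differ from the recursion given in the book; the class name `RunningStats` and the choice to divide by N rather than N - 1 are assumptions made for illustration.

```python
import math

class RunningStats:
    """Running mean and standard deviation, updated one item at a time."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        # Divides by N (not N - 1): an assumption, chosen to describe the data seen so far.
        return math.sqrt(self.m2 / self.n) if self.n > 0 else 0.0

stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(x)
print(stats.mean, stats.std())
```

Squaring the value returned by `std()` gives the variance, so no separate bookkeeping is needed for it.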

Fig. 1.3 On top, a histogram of body temperatures, from the dataset published at http://www2.stetson.edu/~jrasp/data.htm

Plots and Summaries

Some Properties of Histograms
Standard Coordinates and Normal Data
Box Plots

This means that the right tail of the histogram is longer, so the histogram is skewed to the right. Recall the definition of the median (form an ordered list of the data points, and find the point halfway along the list).

Fig. 1.5 On the top, an example of a symmetric histogram, showing its tails (relatively uncommon values that are significantly larger or smaller than the peak or mode)

Whose is Bigger? Investigating Australian Pizzas

There is a vertical box whose height corresponds to the interquartile range of the data (the width is just to make the figure easy to interpret). One possible explanation is that Eagleboys has tighter control over the size of the final pizza.

Fig. 1.8 A box plot showing the box, the median, the whiskers and two outliers. Notice that we can compare the two datasets rather easily; the next section explains the comparison

You Should

Remember These Definitions
Remember These Terms
Remember These Facts
Be Able to

Yet another possibility is that Dominos controls portions by mass of dough (so thin crust diameters tend to be larger), but Eagleboys controls by crust diameter. The fact that Dominos and Eagleboys seem to follow different strategies successfully suggests that more than one strategy may work.

Problems

Programming Exercises

Use conditional histogram plots to investigate whether students from small families drink more alcohol on the weekend than those from large families. Use box plots to investigate whether gender, education, or marital status has any effect on the amount of debt (again, use X1 for debt).

Looking at Relationships

Plotting 2D Data

Categorical Data, Counts, and Charts

Then I made the bar chart on the left, which shows the number of children of each gender selecting each goal. The height of the bar is given by the number of elements in the type, and the bar is divided into sections corresponding to the number of elements of each subtype.

Fig. 2.1 I sorted the children in the Chase and Dunner study into six categories (two genders by three goals), and counted the number of children that fell into each cell

[Figure panels: Gender by goals; Goals by gender; Gender by goals, relative frequencies; Goals by gender, relative frequencies]

Series
Scatter Plots for Spatial Data
Exposing Relationships with Scatter Plots
Correlation

The Correlation Coefficient
Using Correlation to Predict
Confusion Caused by Correlation

Sterile Males in Wild Horse Herds
You Should

Be Able to

Figure 2.10 shows a scatter plot of the shipping data, where I have plotted the number of skins against price. Fortunately, scaling or translating data does not change the magnitude of the correlation coefficient (although the sign flips if exactly one of the scale factors is negative).
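The effect of scaling and translation on the correlation coefficient can be checked numerically. The sketch below computes the coefficient as the mean of products of standard coordinates; the function name `correlation` and the small dataset are hypothetical and chosen only for illustration.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient: mean of products of standard coordinates."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / n

xs = [1.0, 2.0, 3.0, 4.0, 5.0]            # hypothetical data
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
r = correlation(xs, ys)
r_shifted = correlation([3 * x + 7 for x in xs], ys)  # positive scale plus translation
r_flipped = correlation([-2 * x for x in xs], ys)     # negative scale
print(r, r_shifted, r_flipped)
```

The first two values agree, and the third has the same magnitude with the opposite sign.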

Fig. 2.3 A heat map of the Chase and Dunner data. The color of each cell corresponds to the count of the number of elements of that type

Probability

Experiments, Outcomes and Probability

Outcomes and Probability

Worked example 3.2 (Find the queen, twice) We play find the queen twice, replacing the card we have chosen. This number indicates the relative frequency of the result of interest when an experiment is repeated a very large number of times. The probability is one because each experiment must have one of the outcomes in the sample space.
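The relative-frequency reading of probability lends itself to simulation. The sketch below repeats a two-round game many times and reports how often the queen is found both times. It assumes a game with three cards, exactly one of which is the queen, chosen uniformly at random; that setup, the function name, and the seed are assumptions made for illustration.

```python
import random

def relative_frequency(n_trials, seed=0):
    """Fraction of repeated experiments in which the queen is found both times."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # Assumed game: three cards, one of which is the queen (card 0).
        first = rng.randrange(3)
        second = rng.randrange(3)  # the card is replaced, so the second draw is like the first
        if first == 0 and second == 0:
            hits += 1
    return hits / n_trials

print(relative_frequency(100_000))  # should be close to 1/9 under the assumed game
```

Increasing `n_trials` makes the relative frequency settle closer to the probability.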

Events

Computing Event Probabilities by Counting Outcomes
The Probability of Events
Computing Probabilities by Reasoning About Sets

The number of outcomes in the event comes from noting that each outcome in the event is an ordering of the cards, with the first seven cards being the 2 through 8 of hearts, in that order. The number of outcomes in the event is the number of lists of 30 days, all different. The total number of lists is the number of lists of 30 days, each chosen from the days of the year.
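Counting arguments of this kind translate directly into code. The sketch below computes the probability that 30 people all have different birthdays by dividing the number of lists of 30 different days by the number of all lists of 30 days. It assumes 365 equally likely days and independent birthdays; the function name is hypothetical.

```python
from math import perm

def prob_all_birthdays_differ(n_people, days=365):
    """P(all n_people have different birthdays), by counting lists of days."""
    distinct_lists = perm(days, n_people)  # ordered lists of n_people different days
    all_lists = days ** n_people           # every ordered list of n_people days
    return distinct_lists / all_lists

p_differ = prob_all_birthdays_differ(30)
print(p_differ, 1 - p_differ)  # second value: at least two people share a birthday
```

Python integers have arbitrary precision, so the counts are exact and only the final division is rounded.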

Fig. 3.1 If you think of the probability of an event as measuring its “size”, many of the rules are quite straightforward to remember

Independence

Example: Airline Overbooking

This coin is flipped seven times, and we are interested in the probability of getting seven T's. Now we flip the coin eight times and are interested in the probability of getting more than six T's. Now we flip the coin eight times and are interested in the probability of getting exactly six T's.
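Because the flips are independent, each of these probabilities is a sum of terms of the form (number of arrangements) times (probability of one arrangement). The sketch below evaluates all three; the value `p_tail = 0.5` is an assumption, since the excerpt does not state the coin's bias, and the function names are hypothetical.

```python
from math import comb

def prob_exactly(k, n, p_tail):
    """P(exactly k tails in n independent flips), with P(T) = p_tail."""
    return comb(n, k) * p_tail ** k * (1 - p_tail) ** (n - k)

def prob_more_than(k, n, p_tail):
    """P(more than k tails in n independent flips)."""
    return sum(prob_exactly(j, n, p_tail) for j in range(k + 1, n + 1))

p_tail = 0.5  # assumed value for illustration
print(prob_exactly(7, 7, p_tail))    # seven T's in seven flips
print(prob_more_than(6, 8, p_tail))  # more than six T's in eight flips
print(prob_exactly(6, 8, p_tail))    # exactly six T's in eight flips
```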

Fig. 3.2 On the left, A and B are independent. A spans 1/4 of Ω, and A ∩ B spans 1/4 of B

Conditional Probability

Evaluating Conditional Probabilities
Detecting Rare Events Is Hard
Conditional Probability and Various Forms of Independence
Warning Example: The Prosecutor’s Fallacy
Warning Example: The Monty Hall Problem

What is the conditional probability of getting a royal flush, conditioned on the event that this card is the nine of spades? If the test says you do have the disease, what is the probability that you actually have the disease? This means that knowing that A has occurred tells you nothing about B: the probability that B will occur is the same whether you know that A has occurred or not.
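The disease-testing question is an application of Bayes' rule, and a short calculation shows why a positive result for a rare disease is weaker evidence than it seems. The prior, sensitivity, and false positive rate below are hypothetical numbers chosen for illustration, as is the function name.

```python
def prob_disease_given_positive(prior, sensitivity, false_positive_rate):
    """P(disease | positive test) from Bayes' rule."""
    # Total probability of a positive test, over people with and without the disease.
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive

# Hypothetical numbers: a rare disease and a fairly accurate test.
posterior = prob_disease_given_positive(prior=0.001, sensitivity=0.99,
                                        false_positive_rate=0.01)
print(posterior)
```

With these numbers most positive results come from the large healthy population, so the posterior stays well below one half.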

Extra Worked Examples

Outcomes and Probability
Events
Independence
Conditional Probability

Worked example 3.45 (Birthdays in a row) We randomly stop three people and ask them the day of the week on which they were born. Worked example 3.53 (Which disease do you have?) Disease A occurs with probability 0.1 (i.e., it is present in 10% of the population) and disease B occurs with probability 0.2. Worked example 3.54 (Fraud or Psychic Powers?) You want to investigate the powers of a supposed psychic.

You Should

Remember and Use These Facts
Remember These Points
Be Able to

What is the probability that the ball will end up in a red spot with an even number? What is the conditional probability that the card you draw is a red king, conditional on the card drawn being a king? What is the conditional probability that the card you draw is a red king, conditional on the removed card being a red king?

Random Variables and Expectations

Random Variables

Joint and Conditional Probability for Random Variables

A function that takes a discrete random variable into a set of numbers is also a discrete random variable. Worked example 4.2 (Coin bets) One way to obtain a random variable is to think about the prize for a bet. We will write P(X) to denote the probability distribution of the random variable, and P(x) or P(X = x) to denote the probability that the random variable assumes a particular value.

Bayes’ Rule

Just a Little Continuous Probability
Expectations and Expected Values

Expected Values
Mean, Variance and Covariance
Expectations and Statistics

The Weak Law of Large Numbers

IID Samples
Two Inequalities
Proving the Inequalities
The Weak Law of Large Numbers

Using the Weak Law of Large Numbers

Should You Accept a Bet?
Odds, Expectations and Bookmaking: A Cultural Diversion
Ending a Game Early
Making a Decision with Decision Trees and Expectations
Utility

You Should

Use and Remember These Facts
Remember These Points
Be Able to

This is not the actual revenue from a single game (which would be different, depending on what the coin did). Now X_N is a random variable (the samples are IID, and for a different set of samples you will get a different, random X_N). Otherwise, X gets the value of the four-sided die and Y gets the value of the six-sided die. Z always gets the sum of the values of the dice. (a) What is P(X), the probability distribution of this random variable?

Table 4.1 A table of the joint probability distribution of S (vertical axis; scale 2, ..., 12) and D (horizontal axis; scale

Useful Probability Distributions

Discrete Distributions

The Discrete Uniform Distribution
Bernoulli Random Variables
The Geometric Distribution
The Binomial Probability Distribution
Multinomial Probabilities
The Poisson Distribution

Note that the number of heads in N tosses can be obtained by adding the number of heads in each toss. For example, you can take the length of a road, divide it into equal intervals, and then count the number of animals killed on the road in each interval. A Poisson point process with intensity λ is a set of random points with the property that the number of points in an interval of length s is a Poisson random variable with parameter λs.
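To make the Poisson count concrete, the sketch below evaluates the Poisson probability distribution for the number of points in one interval and checks numerically that the probabilities sum to one and that the mean equals the parameter. The intensity of 3 per unit length and the interval length of 2 are hypothetical values; `lam` stands for the intensity.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson random variable with parameter lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam = 3.0  # hypothetical intensity per unit length
s = 2.0    # hypothetical interval length
# The count in an interval of length s has parameter lam * s; 100 terms is ample here.
pmf = [poisson_pmf(k, lam * s) for k in range(100)]
total = sum(pmf)
mean = sum(k * p for k, p in enumerate(pmf))
print(total, mean)
```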

Continuous Distributions

The Continuous Uniform Distribution
The Beta Distribution
The Gamma Distribution
The Exponential Distribution

Figure 5.2 shows plots of the probability density function of the Gamma distribution for a variety of different values of α and β. We assume that failures form a Poisson process over time; then the time to the next failure is exponentially distributed. The time between calls will be exponentially distributed with parameter λ, and the expected time until the next call is 1/λ (in hours).

Fig. 5.1 Probability density functions for the Beta distribution with a variety of different choices of α and β

The Normal Distribution

The Standard Normal Distribution
The Normal Distribution
Properties of the Normal Distribution

From this (and tables for the error function, or your favorite math package) we get that, for a standard normal random variable. About 95% of the time, a normal random variable takes a value within two standard deviations of the mean. About 99% of the time, a normal random variable takes a value within three standard deviations of the mean.
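These fractions can be reproduced with the error function from a standard math library. The sketch below uses the identity P(|Z| <= k) = erf(k / sqrt(2)) for a standard normal Z; the function name `prob_within` is hypothetical.

```python
import math

def prob_within(k):
    """P(|Z| <= k) for a standard normal random variable Z."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
```

The same values apply to any normal random variable once it is measured in standard deviations from its mean.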

Approximating Binomials with Large N

Large N
Getting Normal
Using a Normal Approximation to the Binomial Distribution

The main problem with Figure 5.4 (and with the argument above) is that the mean and standard deviation of the binomial distribution tend to infinity as the number of coin tosses tends to infinity. Then, for sufficiently large N, the probability distribution P(x) can be approximated by the probability density function $\frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right)$. We know, for example, that a standard normal random variable has a value between -1 and 1 about 68% of the time.
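A quick numerical check of the approximation is to sum the exact binomial probabilities that fall within one standard deviation of the mean and compare the total with the standard normal value. The sketch below does this for a fair coin; the function name and the chosen values of N are assumptions made for illustration.

```python
from math import comb, erf, sqrt

def binomial_prob_within_one_sd(n, p):
    """P(|h - n p| <= sqrt(n p q)) for h ~ Binomial(n, p), summed exactly."""
    q = 1 - p
    mean, sd = n * p, sqrt(n * p * q)
    return sum(comb(n, h) * p ** h * q ** (n - h)
               for h in range(n + 1) if abs(h - mean) <= sd)

normal_value = erf(1 / sqrt(2))  # P(-1 <= Z <= 1) for a standard normal Z
for n in (10, 100, 1000):
    print(n, binomial_prob_within_one_sd(n, 0.5), normal_value)
```

The binomial totals do not approach the normal value smoothly, because the count h is an integer and the one-standard-deviation window gains or loses whole values as N changes.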

Fig. 5.5 Plots of the distribution for the normalized variable x, with P(x) given in the text, obtained from the binomial distribution with p = q = 0.5 for different values of N

You Should

Remember These Points

Toss a coinN once and countN. Show that the probability distribution for is the same as the probability distribution for hN. What is the probability that the plane will travel with one or more empty seats? 5.20 Show that the multinomial distribution. Use this and pattern matching to show that the variance of the Poisson distribution with intensity parameter λ is λ.

Inference

Samples and Populations

The Sample Mean

The Sample Mean Is an Estimate of the Population Mean
The Variance of the Sample Mean
When The Urn Model Works
Distributions Are Like Populations

Under our sampling model, the expected value of the sample mean is the population mean. It is random because different samples from the population will have different values of the sample mean. Knowing the variance of X^(N) will tell us how accurate our estimate of the population mean is.
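Both claims can be checked by simulation. The sketch below draws many samples with replacement from a small population, then compares the average of the sample means with the population mean, and the variance of the sample means with the population variance divided by the sample size. The population of integers 1 to 100, the sample size of 25, and the function name are hypothetical choices.

```python
import random
import statistics

def sample_means(population, n, n_samples, seed=0):
    """Means of n_samples samples of size n, drawn with replacement."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.choices(population, k=n)) for _ in range(n_samples)]

population = list(range(1, 101))  # hypothetical population: 1, ..., 100
means = sample_means(population, n=25, n_samples=20_000)
# Expected value of the sample mean versus the population mean.
print(statistics.fmean(means), statistics.fmean(population))
# Variance of the sample mean versus population variance / N.
print(statistics.pvariance(means), statistics.pvariance(population) / 25)
```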

Confidence Intervals

Constructing Confidence Intervals
Estimating the Variance of the Sample Mean
The Probability Distribution of the Sample Mean
Confidence Intervals for Population Means
Standard Error Estimates from Simulation

Our estimate of the unknown number popmean({X}) is the mean of the sample we have, which we write mean({x}). Assume the sample is large enough that (mean({x}) - popmean({X})) / stderr({x}) is a standard normal random variable. Suppose we want to estimate the standard error of a statistic S({x}), which is a function of our dataset {x} of N data items.
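One common way to estimate the standard error of a statistic by simulation is the bootstrap: resample the dataset with replacement many times, evaluate the statistic on each resample, and take the standard deviation of the results. The sketch below follows that recipe; the function name, the number of replicates, and the synthetic dataset are assumptions made for illustration.

```python
import random
import statistics

def bootstrap_standard_error(data, statistic, n_replicates=2000, seed=0):
    """Estimate the standard error of statistic(data) by resampling with replacement."""
    rng = random.Random(seed)
    replicates = [statistic(rng.choices(data, k=len(data)))
                  for _ in range(n_replicates)]
    return statistics.stdev(replicates)

# Hypothetical dataset of N = 200 items.
rng = random.Random(1)
data = [rng.gauss(170.0, 10.0) for _ in range(200)]
print(bootstrap_standard_error(data, statistics.median))
print(bootstrap_standard_error(data, statistics.fmean))
```

For the mean, the bootstrap estimate should land close to the usual formula (sample standard deviation divided by the square root of N), which is a useful check before applying the method to statistics, such as the median, that have no simple formula.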

Fig. 6.1 A simple demonstration that sample means behave as described, by computing sample means from the heights dataset

You Should

Remember These Facts
Use These Procedures

Use the reasoning and data above to construct a 99% confidence interval for the likelihood of a male birth. Use each sample to calculate a centered 90% confidence interval for the population mean, using the t-distribution. Use each sample to make a bootstrap estimate of a centered 90% confidence interval for the population median.

The Significance of Evidence

Significance

Evaluating Significance
P-Values

We estimate how odd the sample would have to be to give the value we actually see, if the hypothesis is true. You should think of the p-value as the fraction of samples that would give an absolute value greater than the observed one if the hypothesis were true. Sometimes, the p-value is even smaller, and this can be interpreted as very strong evidence that the null hypothesis is wrong.
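When the test statistic is a standard normal random variable under the hypothesis, this fraction has a closed form. The sketch below computes a two-sided p-value from an observed sample mean, a hypothesized population mean, and a standard error; the function name and the numbers in the example are hypothetical.

```python
import math

def two_sided_p_value(observed, hypothesized_mean, standard_error):
    """Fraction of samples giving a larger |test statistic|, if the hypothesis holds."""
    z = (observed - hypothesized_mean) / standard_error
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical numbers: sample mean 10.4, hypothesized mean 10, standard error 0.2.
print(two_sided_p_value(10.4, 10.0, 0.2))
```

For small samples the statistic follows a t-distribution instead, so this normal version is only an approximation there.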

Comparing the Mean of Two Populations

Assuming Known Population Standard Deviations
Assuming Same, Unknown Population Standard Deviation
Assuming Different, Unknown Population Standard Deviation

If popmean({X}) = popmean({Y}), then mean({x}) - mean({y}) is the value of a random variable whose mean is 0 and whose variance is $\frac{\text{popsd}(\{X\})^2}{k_x} + \frac{\text{popsd}(\{Y\})^2}{k_y}$. Calculate the p-value using the recipe in Procedure 7.2; the number of degrees of freedom is $\frac{\left(\text{stdunbiased}(\{x\})^2/k_x + \text{stdunbiased}(\{y\})^2/k_y\right)^2}{\left[\text{stdunbiased}(\{x\})^2/k_x\right]^2/(k_x - 1) + \left[\text{stdunbiased}(\{y\})^2/k_y\right]^2/(k_y - 1)}$.
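For the case of different, unknown population standard deviations, the test statistic and degrees of freedom can be computed as in the sketch below. It uses the standard Welch-Satterthwaite expression for the degrees of freedom, which is assumed here to be the one the procedure intends; the function name and the two small samples are hypothetical.

```python
import math
import statistics

def welch_t_and_dof(xs, ys):
    """Test statistic and degrees of freedom for two samples with different, unknown sds."""
    kx, ky = len(xs), len(ys)
    vx = statistics.variance(xs) / kx  # stdunbiased({x})^2 / k_x
    vy = statistics.variance(ys) / ky  # stdunbiased({y})^2 / k_y
    t = (statistics.fmean(xs) - statistics.fmean(ys)) / math.sqrt(vx + vy)
    # Welch-Satterthwaite approximation (assumed to match the procedure in the text).
    dof = (vx + vy) ** 2 / (vx ** 2 / (kx - 1) + vy ** 2 / (ky - 1))
    return t, dof

xs = [5.1, 4.9, 5.6, 5.8, 6.0, 5.2]       # hypothetical samples
ys = [4.1, 4.5, 4.0, 4.9, 4.3, 4.4, 4.7]
print(welch_t_and_dof(xs, ys))
```

The resulting degrees of freedom are generally not an integer; the p-value is then read from a t-distribution with that many degrees of freedom.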

Other Useful Tests of Significance

F-Tests and Standard Deviations

This means that the distribution depends on the number of degrees of freedom for each dataset (i.e., N_x - 1 and N_y - 1). Once we have the best estimate of the intensity, we still want to know whether the model is consistent with the data. If we estimate p parameters from the data, then the number of degrees of freedom will be k - p - 1 (because there are k numbers, they must yield p parameter values, and they must add to 1).

P-Value Hacking and Other Dangerous Behavior

You Should

Weigh 20 fatty Zucker rats and get a mean weight of 1000 grams with a standard deviation of 100 grams. Weigh 35 fatty Zucker rats and get a mean weight of 1000 grams with a standard deviation of 100 grams. You will evaluate, step by step, the evidence against the claim that a fat Zucker rat weighs exactly twice as much as a thin Zucker rat.

A Simple Experiment: The Effect of a Treatment

Randomized Balanced Experiments
Decomposing Error in Predictions
Estimating the Noise Variance
The ANOVA Table

Finding out whether the groups are different requires care, because the subjects will differ for various irrelevant reasons (body weight, sensitivity to medication, and so on), and we must decide whether the differences we see are due to the treatment or to those irrelevant reasons. This means that it is helpful if there are the same number of subjects in each group (we say the experiment is balanced), so that the error due to chance effects is the same in each group. This means that MSB would be greater than it would be if the treatment had no effect.
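The comparison of between-group and within-group variation can be sketched in a few lines for a balanced experiment. The code below computes the between-group mean square (MSB), the within-group mean square (written `msw` here, a name that may differ from the book's), and their ratio, the F statistic. The function name and the three small groups are hypothetical.

```python
import statistics

def one_way_anova_f(groups):
    """F statistic MSB / MSW for a balanced one-way experiment."""
    n_groups, per_group = len(groups), len(groups[0])
    grand_mean = statistics.fmean(x for g in groups for x in g)
    group_means = [statistics.fmean(g) for g in groups]
    # Between-group mean square: variation of the group means about the grand mean.
    ssb = per_group * sum((m - grand_mean) ** 2 for m in group_means)
    msb = ssb / (n_groups - 1)
    # Within-group mean square: variation of subjects about their own group mean.
    ssw = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    msw = ssw / (n_groups * (per_group - 1))
    return msb / msw

groups = [[4.0, 5.0, 6.0], [6.0, 7.0, 8.0], [9.0, 10.0, 11.0]]  # hypothetical data
print(one_way_anova_f(groups))
```

A large ratio suggests the group means differ by more than chance variation among subjects would explain; the p-value comes from an F-distribution with (groups - 1) and groups * (subjects per group - 1) degrees of freedom.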

Evaluating Whether a Treatment Has Significant Effects with a One-Way ANOVA for Balanced Experiments

Unbalanced Experiments
Significant Differences
Two Factor Experiments

Decomposing the Error

This is the result of the weighting terms in the expressions for the mean squared error. With luck, the treatment effect will be so strong that significance testing is a formality. As in the case of a single factor, we assume that noise is independent of treatment level.