An Introduction to Statistical Learning with Applications in Python

Full Text

For example, take a look at the left panel of Figure 1.1, which shows the consumption for each of the individuals in the data set. Since then, Python implementations of the important techniques in statistical learning have been in increasing demand.

What Is Statistical Learning?

  • Why Estimate f ?
  • How Do We Estimate f?
  • Supervised Versus Unsupervised Learning
  • Regression Versus Classification Problems

In this case, the non-parametric fit provided a remarkably accurate estimate of the true f, shown in Figure 2.3. In this setting, we can try to cluster the customers on the basis of the measured variables in order to identify distinct groups of customers.

Assessing Model Accuracy

Measuring the Quality of Fit

The test MSE is displayed using the red curve in the right-hand panel of Figure 2.9. As model flexibility increases, the training MSE will decrease, but the test MSE may not.

The Bias-Variance Trade-Off

Again, we observe that the training MSE decreases monotonically as the model flexibility increases, and that there is a U-shape in the test MSE. In Chapter 5, we discuss cross-validation, which is a way to estimate the test MSE using the training data.

The Classification Setting

When K = 1, the decision boundary is overly flexible and finds patterns in the data that do not correspond to the Bayes decision boundary. The jumpiness of the curves is due to the small size of the training data set.

Lab: Introduction to Python

Getting Started

Basic Commands

The following command instructs Python to concatenate the numbers 3, 4, and 5 and store them as a list named x. It is reasonable to try the following code, although it will not produce the desired results.
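A minimal sketch of both commands (x is the list from the text; y is an assumed second list):

    x = [3, 4, 5]   # store the numbers 3, 4, and 5 as a list named x
    y = [4, 9, 7]   # a second, hypothetical list
    # One might expect x + y to add the two lists elementwise, but for
    # Python lists the + operator means concatenation:
    x + y           # [3, 4, 5, 4, 9, 7]

This is why the naive code does not produce the desired results: elementwise arithmetic requires the numpy arrays introduced below.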

Introduction to Numerical Python

We can learn more about this feature by looking at the help page, via a call to np.random.normal?. In all the labs in this book, we use np.random.default_rng() when performing calculations with random quantities within numpy.
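A minimal sketch of the generator-based workflow (the seed 1303 is an arbitrary choice):

    import numpy as np

    rng = np.random.default_rng(1303)         # seeded generator for reproducibility
    x = rng.normal(loc=0, scale=1, size=10)   # ten draws from N(0, 1)
    print(x.mean(), x.std())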

Graphics

To fine-tune the output of the ax.contour() function, take a look at the help file by typing ?plt.contour. The ax.imshow() method is similar to ax.contour(), except that it produces a color-coded plot whose colors depend on the z value.
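A sketch of both plot types, using an assumed example function f evaluated on a grid:

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(-np.pi, np.pi, 50)
    y = np.linspace(-np.pi, np.pi, 50)
    f = np.multiply.outer(np.cos(y), 1 / (1 + x**2))   # f[i, j] depends on (x[j], y[i])

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
    ax1.contour(x, y, f, levels=45)   # contour lines; more levels gives finer detail
    ax2.imshow(f)                     # color-coded plot whose colors depend on the z value
    plt.show()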

Sequences and Slice Notation

Indexing Data

Now suppose we want to select the submatrix consisting of the second and fourth rows as well as the first and third columns. The slice 1:4:2 captures the second and fourth items of a sequence, while the slice 0:3:2 captures the first and third items (the third element in a slice sequence is the step size).
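For concreteness, a small sketch of both indexing styles on a 4 × 4 array:

    import numpy as np

    A = np.arange(16).reshape((4, 4))
    # np.ix_() builds the mesh needed to pull out rows 1, 3 and columns 0, 2:
    A[np.ix_([1, 3], [0, 2])]
    # The slice-based equivalent uses a step size of 2 as the third component:
    A[1:4:2, 0:3:2]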

Loading Data

Remember that the first argument of the [] method always applies to array rows. In summary, a powerful set of operations is available to index the rows and columns of data frames. For functional queries that filter rows, use the loc[] method with a function (usually a lambda) in the rows argument.
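A sketch of such a functional query, assuming the Auto data frame used in this lab (indexed by car name, with columns including year, weight, and origin; the file path is an assumption):

    import pandas as pd

    Auto = pd.read_csv('Auto.csv', index_col='name')
    # loc[] accepts a function (usually a lambda) in the rows argument:
    Auto.loc[lambda df: df['year'] > 80, ['weight', 'origin']]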

For Loops

As another example, suppose we want to retrieve all Ford and Datsun cars with displacements less than 300. We check whether each name entry contains either the string ford or datsun using the str.contains() method of the str accessor. Inserting the value of something into a string is a common task, made simple by some of the powerful string formatting tools in Python.
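Continuing with the hypothetical Auto data frame above, a sketch of this query:

    # Keep rows whose name contains 'ford' or 'datsun' and whose
    # displacement is below 300, via str.contains() on the string index:
    Auto.loc[lambda df: (df['displacement'] < 300)
                        & (df.index.str.contains('ford')
                           | df.index.str.contains('datsun'))]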

Additional Graphical and Numerical Summaries

We see that the template.format() method accepts two arguments, {0} and {1:.2%}, and the latter includes some formatting information. However, since there are only a small number of possible values for this variable, we may wish to treat it as qualitative.
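A minimal sketch of such a template (the column name and percentage are made-up values):

    template = '"{0}" has {1:.2%} missing values'
    # {0} is replaced by the first argument; {1:.2%} formats the second
    # argument as a percentage with two decimal places:
    template.format('mpg', 0.03)   # '"mpg" has 3.00% missing values'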

Exercises

What is the range, mean, and standard deviation of each predictor in the subset of data that remains? Comment on the range of each predictor. (f) How many suburbs in this data set border the Charles River? (g) How many of the suburbs in this data set average more than seven rooms per home?

Simple Linear Regression

Estimating the Coefficients

In the advertising example, this data set consists of the TV advertising budget and product sales in n = 200 different markets. Writing ei = yi − ŷi for the ith residual, we define the residual sum of squares (RSS) as

RSS = e1² + e2² + ⋯ + en².    (3.3)

The least squares approach chooses β̂0 and β̂1 to minimize the RSS. In each graph, the red dot represents the pair of least squares estimates (β̂0, β̂1) given by (3.4).
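A minimal numpy sketch of the least squares estimates (3.4), using simulated data rather than the Advertising data:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 100, size=200)                  # a budget-like predictor
    y = 7 + 0.05 * x + rng.normal(scale=2, size=200)   # true intercept 7, slope 0.05

    # Closed-form estimates that minimize the RSS:
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
    b0 = y.mean() - b1 * x.mean()
    rss = np.sum((y - (b0 + b1 * x))**2)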

Assessing the Accuracy of the Coefficient Estimates

At first glance, the difference between a population regression line and a least squares line can seem subtle and confusing. Each least-squares line is different, but the least-squares lines are, on average, quite close to the population regression line. For the advertising data, least squares model coefficients for the regression of the number of units sold on the television advertising budget.

Assessing the Accuracy of the Model

The RSE provides an absolute measure of the lack of fit of the model (3.5) to the data. An R² statistic close to 1 indicates that a large proportion of the variability in the response is explained by the regression.
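Continuing the simulated example above, both quantities follow directly from the RSS (here n = 200 observations and one predictor):

    n = len(y)
    rse = np.sqrt(rss / (n - 2))      # residual standard error for simple regression
    tss = np.sum((y - y.mean())**2)   # total sum of squares
    r2 = 1 - rss / tss                # proportion of variability explained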

Multiple Linear Regression

Estimating the Regression Coefficients

Table 3.4 shows the estimates of the multiple regression coefficients when television, radio, and newspaper advertising budgets are used to predict product sales using the Advertising data. Does it make sense for the multiple regression to imply no relationship between sales and newspaper while the simple linear regression implies the opposite? For the Advertising data, the table reports the least squares coefficient estimates of the multiple linear regression of number of units sold on television, radio, and newspaper advertising budgets.

Some Important Questions

Recall that in simple regression, R2 is the square of the correlation between the response and the variable. An R2 value close to 1 indicates that the model explains a large portion of the variance in the response variable. From the pattern of the residuals we can see that there is a pronounced nonlinear relationship in the data.

Other Considerations in the Regression Model

Qualitative Predictors

However, the coefficients and their p-values depend on the choice of dummy variable coding. Using this dummy variable approach presents no difficulties when both quantitative and qualitative predictors are included. There are many different ways of coding qualitative variables besides the dummy variable approach taken here.

Extensions of the Linear Model

β3 can be interpreted as the increase in television advertising effectiveness associated with a one-unit increase in radio advertising (or vice versa). The green curve in Figure 3.8 shows the fit resulting from including all polynomials up to the fifth degree in model (3.36). The approach we just described to extend the linear model to accommodate nonlinear relationships is known as polynomial regression, since we have included polynomial functions of the predictors in the regression model.
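As a sketch of fitting such an interaction term, with simulated stand-ins for the TV, radio, and sales variables:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    tv, radio = rng.uniform(0, 100, size=(2, 200))
    sales = 3 + 0.05*tv + 0.1*radio + 0.001*tv*radio + rng.normal(size=200)

    # Main effects plus the tv*radio interaction, whose coefficient is beta_3:
    X = sm.add_constant(np.column_stack([tv, radio, tv * radio]))
    fit = sm.OLS(sales, X).fit()
    fit.params   # estimates of beta_0, beta_1, beta_2, beta_3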

Potential Problems

Unfortunately, it is often the case that the variances of the error terms are not constant. An example is shown in the left panel of Figure 3.11, in which the size of the residuals tends to increase with the fitted values. The right panel of Figure 3.13 provides a plot of the studentized residuals versus the fitted values for the data in the left panel of Figure 3.13.

The Marketing Plan

This can usually be done without major compromises in the regression fit, since the presence of collinearity means that the information that this variable provides about the response is redundant in the presence of the other variables. In Section 3.3.2, we discussed the inclusion of predictor transformations in a linear regression model to adjust for non-linear relationships. Including the interaction term in the model results in a significant increase in R², from about 90% to almost 97%.

Comparison of Linear Regression with K-Nearest Neighbors

The blue dashed line in the left panel of Figure 3.18 represents a linear regression fit to the same data. In this case, we see that the test MSE for linear regression is still better. Test MSE for linear regression (black dashed lines) and KNN (green curves) as the number of variables increases.

Lab: Linear Regression

  • Importing packages
  • Simple Linear Regression
  • Multiple Linear Regression
  • Multivariate Goodness of Fit
  • Interaction Terms
  • Non-linear Transformations of the Predictors
  • Qualitative Predictors

For a complete and somewhat exhaustive summary of the fit, we can use the summary() method (output not shown). The effectively zero p-value associated with the quadratic term (i.e. the third row above) suggests that it leads to an improved model. In this case, the F-statistic is the square of the t-statistic for the quadratic term in the linear model summary for results3.
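A sketch of such a quadratic fit, with simulated data standing in for the lab's variables (the object name results is arbitrary; the lab's own results3 is not reproduced here):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x = rng.normal(size=100)
    y = 1 + 2*x - 3*x**2 + rng.normal(size=100)

    X = sm.add_constant(np.column_stack([x, x**2]))
    results = sm.OLS(y, X).fit()
    results.summary()          # complete (and somewhat exhaustive) summary of the fit
    results.tvalues[2]**2      # the square of the quadratic term's t-statistic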

Exercises

In this problem we will investigate the t-statistic for the null hypothesis H0: β = 0 in a simple linear regression without an intercept. (a) Recall that the estimate of the coefficient β̂ for a linear regression of Y on X without an intercept is given by (3.38). In which of the models is there a statistically significant relationship between the predictor and the response?

An Overview of Classification

In this chapter, we will illustrate the concept of classification using a simulated Default data set, in which we learn how to build a model to predict default (Y) for any value of balance (X1) and income (X2). However, to illustrate the classification procedures discussed in this chapter, we use an example in which the relationship between the predictor and the response is somewhat exaggerated.

Why Not Linear Regression?

Using this coding, least squares can be used to fit a linear regression model to predict Y on the basis of a set of predictors X1, …, Xp. In the binary case, it is not difficult to show that even if we reverse the above encoding, linear regression will produce the same final predictions. Oddly enough, it appears that the classifications we get if we use linear regression to predict a binary response will be the same as for the linear discriminant analysis (LDA) procedure we discuss in Section 4.4.

Logistic Regression

  • The Logistic Model
  • Estimating the Regression Coefficients
  • Making Predictions
  • Multiple Logistic Regression
  • Multinomial Logistic Regression

Although we can use (non-linear) least squares to fit the model (4.4), the more general maximum likelihood method is preferred, as it has better statistical properties. To be precise, a one-unit increase in balance is associated with an increase in the log odds of default by 0.0055 units.
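For instance, a sketch using the coefficient estimates reported for the Default data (intercept roughly −10.65, balance coefficient 0.0055):

    import numpy as np

    b0, b1 = -10.6513, 0.0055   # fitted intercept and slope for the Default data

    def p_default(balance):
        # logistic model: p(X) = exp(b0 + b1*X) / (1 + exp(b0 + b1*X))
        z = b0 + b1 * balance
        return np.exp(z) / (1 + np.exp(z))

    # A one-unit increase in balance adds b1 = 0.0055 to the log odds:
    p_default(1000), p_default(2000)   # roughly 0.006 and 0.586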

Generative Models for Classification

  • Linear Discriminant Analysis for p = 1
  • Linear Discriminant Analysis for p >1
  • Quadratic Discriminant Analysis
  • Naive Bayes

The figure indicates that the LDA decision boundary is slightly to the left of the optimal Bayes decision boundary, which is instead equal to (µ1 + µ2)/2 = 0. We can perform LDA on the Default data to predict whether or not an individual will default on the basis of credit card balance and student status. The LDA model fit to the 10,000 training observations results in a training error rate of 2.75%. Comparison of the naive Bayes predictions with the true default status for the 10,000 training observations in the Default data set, when we predict default for any observation for which P(Y = default | X = x) > 0.2.

A Comparison of Classification Methods

An Analytical Comparison

Naive Bayes has a higher error rate, but correctly predicts nearly two-thirds of the true defaults. Every classifier with a linear decision boundary is a special case of naive Bayes with gkj(xj) = bkj xj. In this case, naive Bayes is actually a special case of LDA, where Σ is restricted to a diagonal matrix with the jth diagonal element equal to σj².

An Empirical Comparison

Boxplots of the test error rates for each of the linear scenarios described in the main text. Boxplots of the test error rates for each of the nonlinear scenarios described in the main text. LDA and logistic regression performed poorly because the true decision boundary is non-linear, due to the unequal covariance matrices.

Generalized Linear Models

Linear Regression on the Bikeshare Data

A least squares linear regression model was fit to predict the number of bikers in the Bikeshare data set. Left: the coefficients associated with the month of the year. The heteroscedasticity of the data therefore calls into question the appropriateness of the linear regression model. The fit of a smoothing spline is shown in green. Right: the log of the rider count is now displayed on the y-axis.

Poisson Regression on the Bikeshare Data

A Poisson regression model was fit to predict the number of bikers in the Bikeshare data set. Left: the coefficients associated with the month of the year. Bike usage is highest in the spring and fall, and lowest in the winter. Right: the coefficients associated with the hour of the day. In contrast, nearly 10% of the predictions were negative when we fit a linear regression model to the Bikeshare data set.

Generalized Linear Models in Greater Generality

Thus, by modeling bike usage with a Poisson regression, we implicitly assume that mean bike usage in a given hour is equal to the variance of bike usage during that hour. In contrast, under a linear regression model, the variance of bike usage is assumed to take a constant value. Recall from Figure 4.14 that in the Bikeshare data, when biking conditions are favorable, both the mean and the variance of bike usage are much higher than when conditions are unfavorable.
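A minimal sketch of a Poisson regression fit with statsmodels, using simulated counts rather than the Bikeshare data:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x = rng.uniform(size=500)
    counts = rng.poisson(np.exp(1 + 2 * x))   # mean (= variance) is exp(b0 + b1*x)

    X = sm.add_constant(x)
    pois = sm.GLM(counts, X, family=sm.families.Poisson()).fit()
    pois.params   # estimates of b0 and b1 on the log scale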

Lab: Logistic Regression, LDA, QDA, and KNN

  • The Stock Market Data
  • Logistic Regression
  • Linear Discriminant Analysis
  • Quadratic Discriminant Analysis
  • Naive Bayes
  • K-Nearest Neighbors
  • Linear and Poisson Regression on the Bikeshare Data

We compute the correlation matrix using the corr() method of data frames, which produces a matrix containing all the pairwise correlations. (We suppress the output here.) The pandas library does not report a correlation for the Direction variable because it is qualitative. The != notation means not equal to, and so the last command computes the test set error rate. As we observed in our comparison of classification methods (Section 4.5), the LDA and logistic regression predictions are almost identical.
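A sketch of these steps with plain pandas and statsmodels (the labs use the ISLP package's helpers; here generic calls stand in, and the file path is an assumption):

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    Smarket = pd.read_csv('Smarket.csv')
    Smarket.drop(columns=['Direction']).corr()   # pairwise correlations; Direction is qualitative

    X = sm.add_constant(Smarket[['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5', 'Volume']])
    y = Smarket['Direction'] == 'Up'             # boolean response
    glm = sm.GLM(y, X, family=sm.families.Binomial()).fit()
    labels = np.where(glm.predict(X) > 0.5, 'Up', 'Down')
    np.mean(labels != Smarket['Direction'])      # error rate computed via !=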
