Statistical Analysis Using Regression
4. SOME CAUTIONARY NOTES ABOUT REGRESSION ANALYSIS
dependent variables. What can possibly go wrong? If you actually want to use your regression for something other than fun, the answer is “plenty.”
First, it is by no means true that a high correlation (a high t-statistic) between an independent and a dependent variable means that the former "causes" the latter to vary. For example, if we regress the number of airline flights from an airport on the number of crimes in the area, the correlation is bound to be high. But this does not mean that more airline flights cause an increase in crimes. Two variables can be highly correlated without one causing the other. In this case, both total airline flights and number of crimes are probably caused (in part) by a third variable, population density.
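The common-cause problem can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the variable names, coefficients, and noise levels are hypothetical, and density is constructed to drive both flights and crimes while neither affects the other.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def residuals(ys, xs):
    """Residuals of ys after a least-squares straight-line fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical data: density drives both variables; neither drives the other.
density = [rng.uniform(1, 10) for _ in range(500)]
flights = [20 * d + rng.gauss(0, 15) for d in density]
crimes = [5 * d + rng.gauss(0, 4) for d in density]

# Correlation between flights and crimes, ignoring density.
r_raw = pearson(flights, crimes)

# Correlation of what is left after removing the part explained by density.
r_partial = pearson(residuals(flights, density), residuals(crimes, density))

print(f"raw correlation: {r_raw:.2f}")
print(f"after controlling for density: {r_partial:.2f}")
```

Under these assumptions the raw correlation is high, while the correlation that remains once density is accounted for is close to zero.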
Computer programs allow us to include dozens of independent variables with relative ease, but that ease does not mean that all variables should be included. It is extremely important that our reasons for including each variable be based on some theory or policy relevance of the variable to the decision at hand. Otherwise, you might well end up with peculiar or meaningless results from your analysis.
Second, even if an observed correlation is due to a causal relationship, the direction of causation may be the reverse of that implied by the regression.
For example, suppose that we regress the profit of a car manufacturer on its advertising expenditures, where profit is the dependent variable to be explained and advertising expenditure the independent variable. If the correlation between these two variables turns out to be high, does this imply that high advertising expenditures produce high profits? Perhaps not. The line of causation could run the other way: High profits could enable management to spend more on advertising. Thus, in interpreting the results of regression studies, it is important to ask whether the line of causation assumed in the studies is correct. Again, you need to draw on theory – in this case, theories of how firms operate – to develop your model.
Third, regressions are sometimes used to forecast values of the dependent variable lying beyond the range of sample data. For example, an examination of observations of road tolls and auto commuter trips may show that the dependent variable, trips, ranges from 16 to 28 per month, and the independent variable, tolls, ranges from $0 to $5. If the regression is used to evaluate an extreme policy of very high road tolls (e.g., $15), both the causal variable and the resulting forecast for the number of trips per month (say, 9 trips) are outside the range over which the original data were collected. This procedure, known as "extrapolation," is dangerous because the available data provide no evidence that the estimated relationship found in your regression holds beyond the range of the sample data. Within your sample, you may make a very good estimate of the effect of changes in an independent variable, but, outside the range of that independent variable, the true relationship between the independent and dependent variable may take a very different form. Just as two points are best connected with a straight line but adding a third may make a curve more desirable, observations outside the sampled range might reveal a different shape, so extrapolating beyond the known data can lead to unreliable results.
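The danger of extrapolation can be sketched numerically. The example below assumes a hypothetical curved "true" relationship between tolls and trips (the function true_trips is invented for illustration, not taken from any data), fits a straight line using tolls between $0 and $5 only, and then compares the line with the assumed curve at a $15 toll.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

def true_trips(toll):
    """Hypothetical 'true' demand curve, assumed only for this illustration."""
    return 28 * math.exp(-0.11 * toll)

# Sample tolls cover only $0 to $5 in steps of $0.50.
tolls = [0.5 * i for i in range(11)]
trips = [true_trips(t) for t in tolls]

intercept, slope = fit_line(tolls, trips)

# Largest gap between the fitted line and the data inside the sampled range.
max_in_sample_error = max(abs(intercept + slope * t - y)
                          for t, y in zip(tolls, trips))

# The same line evaluated far outside the sampled range.
linear_at_15 = intercept + slope * 15
true_at_15 = true_trips(15)

print(f"largest in-sample error: {max_in_sample_error:.2f} trips")
print(f"line at $15: {linear_at_15:.1f} trips; assumed curve at $15: {true_at_15:.1f} trips")
```

Within the sampled range the line stays within about one trip of the assumed curve, yet at $15 it predicts a negative number of trips. Nothing in the sample itself would have warned of this.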
Fourth, you should be careful to avoid creating a spurious correlation by dividing or multiplying both the dependent variable and an independent variable by the same quantity. For example, we might want to determine whether a city's auto commuter trips are related to its trips for all purposes. To normalize for differences in population among cities, it may seem sensible to use trips per capita. This procedure may result in these ratios being highly correlated (because the denominator of the ratio is the same for both variables), even if there is little or no relationship between the number of auto commuter trips and trips for all purposes.
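A short simulation can show how a shared denominator manufactures correlation. All numbers here are hypothetical: the two trip totals are drawn independently of each other and of population, so any correlation between the per capita ratios comes from the division alone.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(1)  # fixed seed so the run is repeatable
n_cities = 2000

# Hypothetical city totals, drawn independently so they are unrelated by construction.
commuter_trips = [rng.uniform(50, 150) for _ in range(n_cities)]
all_trips = [rng.uniform(50, 150) for _ in range(n_cities)]
population = [rng.uniform(1, 10) for _ in range(n_cities)]

# Correlation between the raw totals.
r_totals = pearson(commuter_trips, all_trips)

# Correlation after dividing both variables by the same population figure.
commuter_pc = [c / p for c, p in zip(commuter_trips, population)]
all_pc = [a / p for a, p in zip(all_trips, population)]
r_per_capita = pearson(commuter_pc, all_pc)

print(f"correlation of totals: {r_totals:.2f}")
print(f"correlation of per capita ratios: {r_per_capita:.2f}")
```

In this setup the totals are essentially uncorrelated, while the per capita ratios are strongly correlated simply because both share the population denominator.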
Fifth, a regression is based on past data and may not be a good predictor of future values due to changes in the relationships between the variables.
For example, past data indicate that age has a negative effect on travel demand. However, better health care and greater physical fitness may cause future consumers not to reduce their travel as fast with age as they did in the past when the data were collected. Thus, predictions based on historical data may be poor. One way to account for this in the regression is to test whether there is a systematic trend over time. If so, future values of this trend variable can be inserted to reflect continuation of this trend. (Of course, the hazard here is that past trends may not continue at the same rate into the future.) Another way to account for these changes is to incorporate data, such as data on population health, that are likely to reflect the true underlying factors for the amount of travel.
4.2 Assumptions of Regression Techniques
When we undertake a regression analysis, it is important to check whether the assumptions underlying the approach are at least approximately met.
Essentially, this involves checking for misspecification of the model, heteroscedasticity, multicollinearity, and whether the independent variables are truly independent. In some cases, violations of the regression assumptions are unlikely to affect your results much; in other cases, you may end up with incorrect estimates of the statistical relationships you want to uncover.
Regression analysis makes the following assumptions:
First, the dependent variable is assumed to be a linear function of the independent variables, or it can be transformed (for instance, using logs) to a linear function. (Note, again, that the model must be linear in the coefficients, but the variables themselves may be nonlinear.) This was discussed previously.
Second, the distribution of sample observations around the predicted value of the dependent variable is the same for all values of an independent variable. This characteristic is called homoscedasticity, and its violation, heteroscedasticity. Under heteroscedasticity, the variance around your prediction may increase as your independent variable either increases or decreases. Correcting for this problem often involves weighting the data or observations in relation to their variance. While heteroscedasticity is unlikely to lead to major problems in your estimates of the coefficients, it could lead to problems in the estimates of your standard errors, and thus your t-statistics may be misleading.
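The weighting idea can be sketched as follows. This is a minimal illustration, not a full diagnostic procedure: the data-generating process (y = 2 + 3x with noise whose standard deviation is proportional to x) is assumed, and the weights 1/x² are appropriate only because that variance pattern is known by construction. In practice the variance pattern has to be estimated or argued for.

```python
import random

def weighted_fit(xs, ys, ws):
    """Weighted least squares for y = intercept + slope * x."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    slope = sxy / sxx
    return my - slope * mx, slope

def rms(values):
    """Root mean square, used here as a simple measure of residual spread."""
    return (sum(v * v for v in values) / len(values)) ** 0.5

rng = random.Random(2)  # fixed seed so the run is repeatable
xs = [rng.uniform(1, 10) for _ in range(400)]

# Assumed process: y = 2 + 3x, with noise whose standard deviation grows with x.
ys = [2 + 3 * x + rng.gauss(0, 0.5 * x) for x in xs]

# Unweighted fit (all weights equal), then compare residual spread
# for small and large values of x.
b0_ols, b1_ols = weighted_fit(xs, ys, [1.0] * len(xs))
resid = [y - (b0_ols + b1_ols * x) for x, y in zip(xs, ys)]
spread_low = rms([e for x, e in zip(xs, resid) if x < 5.5])
spread_high = rms([e for x, e in zip(xs, resid) if x >= 5.5])

# Weighted fit: each observation is weighted by the inverse of its (assumed) variance.
b0_wls, b1_wls = weighted_fit(xs, ys, [1 / x ** 2 for x in xs])

print(f"residual spread, low x: {spread_low:.2f}; high x: {spread_high:.2f}")
print(f"slope, unweighted: {b1_ols:.2f}; weighted: {b1_wls:.2f}")
```

The residual spread is visibly larger for high values of x, which is the signature of heteroscedasticity. Both fits recover a slope near the assumed value of 3, consistent with the point that the coefficients are usually less affected than the standard errors.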
Third, the values of each observation are independent of each other, and each independent variable is independent of the other independent variables.
The individual observations are generally independent of one another in the case of survey data from hundreds of individuals, but they may not be with time series data. The failure of this condition, known as autocorrelation, may (but does not always) lead to biased estimates of your coefficients. Additionally, the values of an independent variable are not always independent of the values of other variables. For instance, age may be correlated with wealth, or wealth may be correlated with educational attainment. Violation of this assumption is called multicollinearity, discussed above under Regression Step 6.
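One common check for autocorrelation in time series residuals is the Durbin-Watson statistic, which is not described in this chapter and is offered here only as an illustration. Values near 2 suggest little first-order autocorrelation, and values well below 2 suggest positive autocorrelation. The sketch below uses simulated data; the trend, the noise level, and the autocorrelation of 0.8 are all assumptions.

```python
import random

def ols_residuals(xs, ys):
    """Residuals from an ordinary least squares straight-line fit."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]

def durbin_watson(resid):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(e * e for e in resid)
    return num / den

rng = random.Random(3)  # fixed seed so the run is repeatable
n = 300
periods = list(range(n))

# Case 1: errors that are independent from one period to the next.
e_indep = [rng.gauss(0, 1) for _ in range(n)]

# Case 2: errors where each period carries over 80% of the previous error.
e_auto = []
prev = 0.0
for _ in range(n):
    prev = 0.8 * prev + rng.gauss(0, 1)
    e_auto.append(prev)

# Same hypothetical trend in both cases; only the error structure differs.
y_indep = [1 + 0.05 * t + e for t, e in zip(periods, e_indep)]
y_auto = [1 + 0.05 * t + e for t, e in zip(periods, e_auto)]

dw_indep = durbin_watson(ols_residuals(periods, y_indep))
dw_auto = durbin_watson(ols_residuals(periods, y_auto))

print(f"Durbin-Watson, independent errors: {dw_indep:.2f}")
print(f"Durbin-Watson, autocorrelated errors: {dw_auto:.2f}")
```

With independent errors the statistic comes out near 2. With the autocorrelated errors it falls far below 2, flagging that the observations are not independent of one another.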
Fourth, only the dependent variable is regarded as a random variable, and the independent variables are independent of the dependent variable. The values of the independent variables are assumed to be known with certainty.
For example, if regression analysis is used to estimate the demand for a good when the price is $2, the true quantity purchased at this price can be predicted subject to confidence limits, but the price ($2) is known precisely.
Additionally, that price should not be affected by the quantity demanded. For an individual consumer, the price is likely to be exogenous – you rarely have control over the prices you are charged at a store, for instance. However, if you are a major purchaser of a good and can negotiate over both the price and quantity of a good that you buy, then price is not an independent variable: it is indeed another dependent variable. In this situation, a linear regression with quantity as the dependent variable and price as the independent variable may give you very poor results.
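The effect of a jointly determined price can be sketched with a simulation. The setup is an assumption chosen for simplicity rather than the negotiation case described above: price and quantity are both set where a hypothetical demand curve (q = 100 - 2p + u) meets a hypothetical supply curve (q = 10 + p + v), with random shocks u and v. Regressing quantity on price then fails to recover the demand slope of -2.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

TRUE_DEMAND_SLOPE = -2.0  # assumed for this illustration

rng = random.Random(4)  # fixed seed so the run is repeatable
prices, quantities = [], []
for _ in range(2000):
    u = rng.gauss(0, 5)  # shock to demand
    v = rng.gauss(0, 5)  # shock to supply
    # Solve demand = supply for price: 100 - 2p + u = 10 + p + v.
    p = (90 + u - v) / 3
    # Quantity follows from the supply curve at that price.
    q = 10 + p + v
    prices.append(p)
    quantities.append(q)

# Naive regression of quantity on price, treating price as independent.
intercept, ols_slope = fit_line(prices, quantities)

print(f"assumed demand slope: {TRUE_DEMAND_SLOPE:.1f}")
print(f"slope from regressing quantity on price: {ols_slope:.2f}")
```

Under these assumptions the estimated slope is roughly -0.5, far from the demand slope of -2, because price responds to the same shocks that move quantity. Techniques for handling this situation are covered in econometrics texts.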
This chapter cannot possibly cover completely all the issues you need to understand for a regression. Our major hope in reviewing these assumptions is to flag for you places where you may need to seek advanced help, either from a statistical consultant or from a statistics or econometrics text.