Statistical Analysis Using Regression
3. STEPS IN RUNNING A REGRESSION
3.6 Evaluating Regression Results Using Other Statistical Information
As is probably clear by now, you have a number of decisions to make in running a regression -- what variables to include, what functional form to use, and what sources of data to employ, for instance. How do you decide whether you’ve done a good job? In part, the regression itself can tell you. Some of the information provided by the statistical package (and any package that runs a regression will provide you with at least some of this information) can give you guidance on how well your regression worked.
Look again at Equation 6-22, above. In addition to the regression equation itself, the information provided includes t-statistics for each coefficient in the regression, the R² for the equation, an F statistic, and the standard error of the estimate. Each of these can tell you about the quality of your regression.
3.6.1 The T-Statistics
How can you determine whether a particular independent variable has an influence on your dependent variable? If the variable has no effect, then its coefficient should be zero: a change in the independent variable leads to no change in the dependent variable. Of course, just due to chance the regression equation might find some effect other than zero. To determine whether a non-zero coefficient is in fact different from zero, you can use the t-statistics provided by the regression equation. These numbers, in parentheses below the regression coefficients in Equation 6-22, measure how many standard errors of the coefficient lie between that coefficient and zero.
The t-statistic is calculated by the computer program by dividing the coefficient by the "standard error of the coefficient." The standard error of the coefficient is like the standard deviation; it provides a measure of the variability of the coefficient. If the standard error is quite small relative to the magnitude of the coefficient, then we can place more confidence in our conclusion that this particular variable has a systematic (i.e., non-zero) effect on the dependent variable. A frequent test of whether the variable is significant is whether each regression coefficient is at least twice its standard error. This is equivalent, in most cases, to observing a "t-value" of
approximately 2 for a regression coefficient. If this is the case, there is less than a 5 percent chance that the independent variable is not significantly related to the dependent variable. An equivalent way of thinking of this is in terms of the confidence interval around the coefficient. If a variable is statistically significant at the 5% level, then the 95% confidence interval around the coefficient does not include zero.
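As a concrete illustration of how these quantities can be obtained from a typical statistical package, the sketch below uses Python's statsmodels library with a small, made-up data set; the variable names and numbers are hypothetical and are not those of Equation 6-22.

```python
# A minimal sketch, assuming Python's statsmodels package and hypothetical data.
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical data set: vehicle miles travelled (vmt) explained by
# gasoline price and household income (in thousands of dollars).
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "gas_price": rng.uniform(2.0, 5.0, n),
    "income": rng.uniform(30, 120, n),
})
data["vmt"] = (12000 - 900 * data["gas_price"] + 25 * data["income"]
               + rng.normal(0, 500, n))

X = sm.add_constant(data[["gas_price", "income"]])
results = sm.OLS(data["vmt"], X).fit()

# The t-statistic for each coefficient is the coefficient divided by its
# standard error; the package reports the same numbers directly.
print(results.params / results.bse)
print(results.tvalues)

# 95 percent confidence intervals: significant variables have intervals
# that do not include zero.
print(results.conf_int(alpha=0.05))
```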
In our example above, all of the variables are significantly different from zero. However, we often learn as much from which variables are insignificant as from those that are significant. That is, it would be quite valuable when evaluating alternative policies to find out if one of the policy variables actually has no systematic effect on the environmental parameter of interest.
In this case, we can rule out that policy variable, as it would be a totally ineffective instrument to achieve the environmental policy goal. For example, if we find that the presence of a carpool lane in other cities has no detectable effect on vehicle miles travelled, this would be quite useful in suggesting that this policy instrument may not work in our city.
The rule of thumb for interpreting a t-statistic is that a value of around 2 or higher is statistically significant, while a t-statistic below about 1.5 is not significantly different from zero at the commonly used significance levels of 10 percent or less. Like any rule of thumb, though, it needs to be viewed cautiously. It is possible that a variable does in fact have a significant effect, even if your t-statistic suggests that it does not. One common reason for insignificance is a small data set, and the associated large variance of the estimated coefficients.
Identifying statistical significance is in part dependent on the number of observations: with more data, it is easier to identify a statistical effect (or to be confident of no statistical effect). For example, think back to estimating the average height of the people in your city. If you only have the heights of a few people -- say, the people in your family -- you are not likely to be very confident of your results. If you sampled a thousand people in your city, you would be much more confident that your estimate of the average height in your city was good. Similarly, for any statistical analysis, your results will be more reliable with more data.
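To see why sample size matters, recall that the precision of an estimated mean improves with the square root of the number of observations:

$$ SE(\bar{x}) = \frac{s}{\sqrt{n}} $$

where s is the sample standard deviation and n is the number of observations. Moving from 10 observations to 1,000, for example, shrinks the standard error by a factor of 10; the same logic carries over to the standard errors of regression coefficients.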
Another possible reason for insignificance when there really is an effect is multicollinearity, a situation where two independent variables are closely related to each other. The t-statistics identify the effect of each independent variable separately. If two independent variables are closely correlated -- for instance, the price of gasoline and the price of motor oil, both petroleum products -- then it is difficult to distinguish the effect of either variable separately from the other, even though the effect of the variables together may be large. The two variables are, in a way, sharing explanatory power between them; though neither is significant on its own, they might be significant together. It is possible to test for the joint significance of variables using an F test (described below). It is also possible, if two variables are highly correlated (calculated as a correlation coefficient), to drop one from the regression equation and rely on the second, if the separate effects are not considered important. There is also no harm in leaving both in the regression equation, even though the variables are not statistically significant. A statistics or econometrics book can provide more ideas on how to cope with multicollinearity in a regression.
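As a rough illustration of checking for multicollinearity, the correlation coefficient between two candidate regressors can be computed directly; the sketch below uses Python with made-up price data.

```python
# A minimal sketch of checking the correlation between two candidate
# regressors (hypothetical gasoline and motor oil prices).
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
gas_price = rng.uniform(2.0, 5.0, 100)
# In this illustration, the motor oil price moves closely with the gasoline price.
oil_price = 1.5 * gas_price + rng.normal(0, 0.1, 100)

prices = pd.DataFrame({"gas_price": gas_price, "oil_price": oil_price})
print(prices.corr())  # a correlation near 1 signals potential multicollinearity
```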
In sum, the t-statistics are your first indication of whether an independent variable influences your dependent variable. They play an important explanatory role in regressions. Significance of a variable in a regression provides strong evidence that that variable in fact influences the dependent variable and can, in and of itself, provide important policy information. For instance, a great deal of research on environmental justice hinges on whether race and income are statistically significantly related to the location of a hazardous waste site or other environmental bad. Being able to interpret t-statistics enables you to understand and critique others’ analysis as well as to understand your own regression results.
3.6.2 R² as a Measure of Goodness of Fit
The coefficient of determination, identified by the symbol R², is defined as the proportion (or percent) of the variation in the dependent variable that is explained by the full set of independent variables included in the equation. In other words, R² is an approximate indicator of how well you’ve done in explaining your dependent variable. In Equation 6-22, it indicates that the regression explains 82 percent of the variation in the dependent variable. R² can have a value ranging from 0, indicating the equation explains none of the variation in the dependent variable, to 1.0, indicating that all of the variation has been explained by the independent variables.
Regression models will seldom have an R² equal to either 0 or 1.0. In empirical demand estimation, R² values of 0.50, indicating that 50 percent of the variation in demand is explained, are quite acceptable. For aggregate or time series regression models, R² values of 0.80 or even slightly higher are sometimes obtainable. However, for regression models based on individual consumer or producer behavior, you must be satisfied with considerably less explanation of variation in demand. When the coefficient of determination is very low -- say, in the range of 0.05 to 0.15 -- it is an indication that the equation may be inadequate for explaining the effect of the policy on human behavior. The most general cause of this problem is the omission of an important variable or variables from the equation or model. Thus, the R² can be used as a feedback mechanism to both the analyst and the decision maker regarding the need for additional variables and associated data.
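For reference, the standard definition of the coefficient of determination (written here in general form, not specific to Equation 6-22) is

$$ R^2 = \frac{\text{explained variation}}{\text{total variation}} = 1 - \frac{\sum_i (Y_i - \hat{Y}_i)^2}{\sum_i (Y_i - \bar{Y})^2} $$

where the numerator of the fraction on the right sums the squared prediction errors and the denominator sums the squared deviations of the dependent variable from its mean.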
Because an R² of 1 is virtually impossible to attain, and because the value of R² for an equation varies somewhat with the nature of the data with which you’re working, having a high or low R² in and of itself is not highly useful.
Where R² can be especially useful is in selecting among different regression models that you may run. For instance, suppose you run two alternative versions of the regression equation -- for example, with different sets of independent variables -- using the same set of data. If one has a much higher value for R² than the other, then the regression with the higher R² explains more of the variation in the dependent variable and is thus a better fit for the data.
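A minimal sketch of this kind of comparison, again using Python's statsmodels with hypothetical demand data, might look like the following.

```python
# A minimal sketch comparing two specifications on the same data by R-squared;
# the data and variable names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 150
df = pd.DataFrame({
    "price": rng.uniform(1.0, 4.0, n),
    "income": rng.uniform(20, 100, n),
})
df["Q"] = 100 - 8 * df["price"] + 0.5 * df["income"] + rng.normal(0, 5, n)

# Specification 1: quantity demanded explained by price only.
m1 = sm.OLS(df["Q"], sm.add_constant(df[["price"]])).fit()
# Specification 2: price and income.
m2 = sm.OLS(df["Q"], sm.add_constant(df[["price", "income"]])).fit()

print(m1.rsquared, m2.rsquared)          # the higher R-squared fits the data better
print(m1.rsquared_adj, m2.rsquared_adj)  # adjusted R-squared penalizes extra variables
```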
Two additional points should be made with respect to the coefficient of determination. First, most computer programs print out two coefficients of determination. The first, referred to simply as R², is a straight calculation of the amount of variation explained by the independent variables; the second, the adjusted R², is adjusted (in an imprecise way) for the number of variables included. Adding an independent variable to a regression can’t hurt your regression results: if it is unrelated, it will leave the rest of your regression alone. If it is related, of course you want it included. The harder cases are where adding a variable has an effect, but you’re not sure if it’s real or just by chance. The unadjusted R² will never go down, and might (just by chance) go up due to including more variables. The adjusted R² adds a penalty for adding a variable; it might not go up when a new variable is added, because the new variable may add something just by chance to the equation. The adjusted R² is a more appropriate measure, since it puts a burden of proof on the new variable to justify its inclusion in the regression.
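One common form of the adjustment, though the exact formula can differ slightly across packages, is

$$ \bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1} $$

where n is the number of observations and k is the number of independent variables; the penalty grows as more variables are added relative to the available data.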
Second, some computer programs show the proportion of the variation in the dependent variable that is explained by each of the independent variables included in the equation, the sum of which equals the adjusted For example, price may explain 20 percent of the variation, income 10 percent, etc.
3.6.3 The F-Test
The "F-test"or"F-ratio"is used to estimate whether there is a significant
relationship between the dependent variable and all the independent variables taken together. Tables are available in any statistical textbook showing the values of F that are exceeded with certain probabilities, such as 0.05 and
0.01, for various degrees of freedom of the dependent and independent variables. The F-ratio of 126.4 indicates the overall equation is highly significant.
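For a regression with an intercept, the overall F-statistic reported by most packages can be written in terms of the R² as

$$ F = \frac{R^2 / k}{(1 - R^2)/(n - k - 1)} $$

where n is the number of observations and k is the number of independent variables.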
As mentioned above in the context of the t-statistic, an F-test can also be used to test whether two or more variables are important to an equation, even if individually they are insignificant. More detail about this approach can be found in Kennedy (1998), p. 56.
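In general form, such a joint test compares the sum of squared errors of the equation with the variables included (unrestricted) and with them dropped (restricted); one standard version is

$$ F = \frac{(SSE_R - SSE_U)/q}{SSE_U/(n - k - 1)} $$

where SSE_R and SSE_U are the sums of squared errors of the restricted and unrestricted equations, q is the number of variables being tested jointly, and n and k refer to the unrestricted equation. A large F indicates the variables matter jointly even if their individual t-statistics are small.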
3.6.4 Standard Error of the Estimate
Another useful statistic printed out by the computer is the "standard error of the estimate." This statistic provides a measure of the precision of the estimates of the dependent variable from the regression model as a whole.
Greater predictive accuracy is associated with smaller standard errors of the estimate. If the errors are normally distributed about the regression equation, there is a 95 percent probability that observations of the dependent variable will lie within the range of two standard errors on either side of the estimate at the mean of the data.
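For reference, the standard error of the estimate is computed as

$$ s_e = \sqrt{\frac{\sum_i (Y_i - \hat{Y}_i)^2}{n - k - 1}} $$

so the approximate 95 percent band near the mean of the data is simply the predicted value plus or minus two standard errors of the estimate.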
The confidence band concept is illustrated in Figure 6-4. Here we see the least squares regression line and the upper and lower 95 percent confidence limits. Thus, there is only a 5 percent chance that the true value lies outside this confidence interval. Note, however, that the confidence interval or confidence band is tightest around the center (that is, the mean of the observed independent variables) and wider the further one moves from that mean. This suggests that predictions from independent variables that are substantially above or below the mean of the data carry with them less confidence about their accuracy, and the prediction errors can be much greater than just two standard errors of the estimate. The interested reader should consult a regression or econometrics book for the details of calculating confidence intervals around regression predictions.
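For the simple one-variable case, the widening of the band can be seen directly in the formula for the standard error of a prediction made at a new value X_0 (multiple-variable formulas follow the same logic but are more involved):

$$ SE(\hat{Y}_0) = s_e \sqrt{1 + \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum_i (X_i - \bar{X})^2}} $$

The last term under the square root grows with the squared distance of X_0 from the sample mean, which is why the confidence band fans out toward the edges of the data.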