Statistical Analysis Using Regression
1. WHAT IS A REGRESSION?
Often, a variable of interest to a policy analyst – for instance, the quantity of gasoline a family purchases, or proximity of someone’s home to a hazardous waste site – is influenced by a number of factors, such as the kind of car a family drives and where the family lives relative to jobs, or race and income. A regression is a statistical method that uses observations – data on the relevant variables – to identify the contribution of multiple independent, or exogenous, variables to a dependent, or endogenous, variable. The independent variables predict the dependent variable: that is, if you are given the independent variables and your regression model, you can come up with a good estimate of the dependent variable.
For instance, suppose that you wanted to develop a model to predict the height of a person as a fully grown adult. That height is the dependent variable. One way to estimate a random person’s height is to take the average height of the population and use that as a prediction. While that method works well on average, it does not tell you much about any particular individual. You would like to use characteristics of an individual to predict that person’s height. After some research, you identify several factors that influence an individual’s height as an adult. Those might include the heights of that person’s parents and that person’s gender, along with nutrition as a youth. Your model then becomes:

Adult height = f(Height of father, Height of mother, Gender, Nutrition)
Now you collect data, not only on the height of everyone in your sample, but also on their parents’ heights, their gender, and their nutrition as a child.
You can graph each independent variable against your dependent variable; you will probably get a scattering of points, but perhaps with an identifiable trend to them. You may believe that each of these independent variables makes some contribution in a linear fashion to determining height, though you don’t know the relative contributions of each factor. The magnitude of the effect of each independent variable on the dependent variable is measured by the coefficients, or Bs, in the equation below. The equation is:

Height = B0 + B1 × (Height of father) + B2 × (Height of mother) + B3 × (Male) + B4 × (Nutrition)
The purpose of running the regression is to estimate the most likely numerical relationship between your independent variables and your dependent variable, i.e., the numerical values of the Bs. A regression takes all the independent variables into account and deciphers the contribution of each factor individually to height. Running the regression tells you the coefficients, the change in the dependent variable associated with a change in each independent variable. For instance, once you have run your data through a multiple regression computer package, the resulting equation might be:

Height = 2 + 0.5 × (Height of father) + 0.46 × (Height of mother) + 2 × (Male) + 0.5 × (Good nutrition)
where the first 2 is referred to as the intercept (since it is not multiplied by an independent variable), 0.5 is the coefficient on Height of father, 0.46 is the coefficient on Height of mother, etc. Now, if you know how tall a person’s parents are, whether the person is male or female, and the quality of the person’s diet while growing up, you can use this regression equation to predict that person’s height. For instance, if a woman’s father was 70 inches tall, her mother 64 inches tall, and she had a good diet growing up, then plugging these values into this equation provides the prediction that she would be 66.94 inches tall, while her brother (with the same nutrition) would be 2 inches taller (because the coefficient on Male adds 2 inches to height).
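The arithmetic behind this prediction can be sketched in a few lines of Python. This is an illustrative sketch rather than part of the chapter: the intercept, father, mother, and Male coefficients are the ones quoted above, Male and Good nutrition are coded as 0/1 indicator variables, and the 0.5 coefficient on good nutrition is inferred from the worked example (it is what makes the prediction come out to 66.94 inches).

```python
def predict_height(father, mother, male, good_nutrition):
    """Predicted adult height in inches from the fitted regression.

    male and good_nutrition are 0/1 indicators. The nutrition
    coefficient (0.5) is inferred from the worked example in the text.
    """
    return (2.0                       # intercept
            + 0.50 * father           # coefficient on height of father
            + 0.46 * mother           # coefficient on height of mother
            + 2.0 * male              # coefficient on Male
            + 0.5 * good_nutrition)   # coefficient on good nutrition (inferred)

# The woman and her brother from the example: father 70", mother 64",
# both with good childhood nutrition.
sister = predict_height(70, 64, male=0, good_nutrition=1)
brother = predict_height(70, 64, male=1, good_nutrition=1)
print(round(sister, 2), round(brother, 2))  # 66.94 68.94
```

The 2-inch gap between the siblings comes entirely from the Male coefficient, since every other variable is the same for both of them.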
If you go back to the original data, you can now predict, for any person in your data set, how tall you expect that person to be, just as we did with the woman and her brother in the previous paragraph. When you actually find the woman and her brother, though, you discover that in fact she is 65.5 inches tall (1.44 inches less than predicted), and her brother is 70 inches tall (1.06 inches taller than predicted). Does that mean that your regression is invalid? Remember that your regression is a way of estimating the relationship between the independent and dependent variables when you don’t know the underlying truth. You should not expect to get exact answers (except in some unusual cases where there is a true physical relationship between variables, such as with some constants in physics). Instead, you hope to find statistically significant relationships between the independent and dependent variables. Statistical significance refers to a high probability (typically, more than 90 or 95 percent) that a coefficient is different from zero. If a coefficient is significantly different from zero, then the variable with which that coefficient is associated does influence the dependent variable; if there is no statistically significant relationship, then the coefficient might be zero, and changing the independent variable might not affect the dependent variable.
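The residual check described above can be sketched as follows. The predicted and actual heights are the ones from the sibling example in the text, and a residual is simply the actual value minus the predicted value:

```python
# Predictions from the fitted equation and the actual observed heights
# for the two siblings in the text's example (inches).
predictions = {"sister": 66.94, "brother": 68.94}
actuals = {"sister": 65.5, "brother": 70.0}

# Residual = actual minus predicted; nonzero residuals are expected,
# since the regression estimates a relationship rather than an exact law.
residuals = {name: round(actuals[name] - predictions[name], 2)
             for name in predictions}
print(residuals)  # {'sister': -1.44, 'brother': 1.06}
```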
How can you determine statistical significance? The same computer package that gave you your coefficients probably gave you, as well, the standard errors and t-statistics for your coefficients. The standard errors provide an estimate of the variation around the estimate of the coefficient – that is, whether the range of error around the coefficient estimate is small or large. The t-statistics (calculated as the coefficient divided by its standard error) are a way of measuring whether the coefficient is statistically significantly different from zero. A t-statistic of about 1.7 or higher (in absolute value) suggests statistical significance; below that level, the estimate is unreliable, and the effect of the independent variable usually cannot be distinguished from zero.
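The t-statistic calculation and the 1.7 rule of thumb can be sketched in a few lines of Python. The coefficient and standard-error values below are invented for illustration; they are not the chapter’s actual regression output.

```python
def t_statistic(coef, std_err):
    """t-statistic: the coefficient divided by its standard error."""
    return coef / std_err

def is_significant(coef, std_err, threshold=1.7):
    """Apply the ~1.7 (absolute value) rule of thumb from the text."""
    return abs(t_statistic(coef, std_err)) >= threshold

# Hypothetical output: a coefficient of 0.46 with a small standard error
# is clearly significant; one of 0.50 with a large standard error is not.
print(round(t_statistic(0.46, 0.10), 2))  # 4.6
print(is_significant(0.50, 0.40))         # False: t = 1.25 is below 1.7
```

Note that a large coefficient with a large standard error can be less reliable than a small coefficient that is measured precisely.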
For instance, suppose that the results from your regression, above, included a t-statistic along with each coefficient.
Coefficients that are clearly statistically significantly different from zero include those on height of mother, gender, and nutrition, since their t-statistics all exceed 1.7 in absolute value. The coefficient on height of father approaches statistical significance but does not quite achieve it, and the intercept is also not statistically significant.
In other words, this regression has found significant relationships for three of the five variables, but it has not demonstrated a relationship between height of father and an adult’s height. This issue of statistical significance is discussed further below.
Note that the causality runs from the independent variables to the dependent variable. A child does not cause her parents’ heights, nor does a child’s height cause her gender or diet. In a regression model, careful thought needs to be given to what is dependent on other factors. The causal relationship should only go one way: the independent variables influence the dependent variable, but the dependent variable should not influence the independent variables. If no variable is dependent on another variable, then there is no predictability: the occurrence of sunspots, for instance, is unlikely to have any influence at all on someone’s height, or vice versa. If two variables are likely to influence each other – for instance, a person’s height is likely to influence her weight, and weight may influence height through nutrition – then special thought needs to be given to what exactly is being studied. Perhaps some set of independent variables predicts each of them separately, or perhaps each affects the other. More advanced statistical methods may be necessary in such cases.
Learning how to calculate the effects of changes in the independent variables in a regression will enable you to provide useful predictions for policy analysis. For example, the demand curve for beach recreation is influenced by the quality of water at a beach, price (in distance and money) of travel to the beach, a person’s income, and availability of substitute vacation spots. An example of such a regression is:

Number of beach trips = B0 + B1 × (Water quality) + B2 × (Price of travel) + B3 × (Income) + B4 × (Availability of substitutes)
As will be discussed in Chapter 7, having a demand curve for recreation permits calculation of the benefits associated with that recreational use. The benefits (or costs) of changing water quality can be estimated using this regression equation, by changing the water quality variable and re-calculating the benefits of the recreational resource.
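A minimal sketch of that re-calculation, assuming an entirely hypothetical linear demand equation (the chapter does not report coefficients for the beach example): hold every variable fixed except water quality, change water quality, and compare the predicted number of trips.

```python
def predicted_trips(water_quality, travel_price, income, substitutes):
    """Hypothetical beach-recreation demand equation.

    Every coefficient here is invented for illustration only.
    """
    return (1.0
            + 0.8 * water_quality   # better water -> more trips
            - 0.5 * travel_price    # higher travel price -> fewer trips
            + 0.02 * income
            - 0.3 * substitutes)    # more substitutes -> fewer trips

# Re-calculate after a water-quality improvement, all else held fixed.
before = predicted_trips(water_quality=3, travel_price=4, income=100, substitutes=2)
after = predicted_trips(water_quality=5, travel_price=4, income=100, substitutes=2)
print(round(after - before, 2))  # 1.6 additional trips from cleaner water
```

Because only the water-quality variable changed, the difference in predicted trips is attributable entirely to that variable, which is what makes this re-calculation useful for valuing a quality change.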
Among the independent variables, it is useful to distinguish between policy variables, which can be controlled by policy makers, and state-of-nature variables, which cannot be influenced by a decision-maker. For example, suppose that the number of bus trips a person takes per day is estimated as:

Bus trips per day = B0 + B1 × (Bus fare) + B2 × (Price of gasoline) + B3 × (Income) + B4 × (Average age of population) + B5 × (Distance from work)
Income, average age of the population, and distance a person lives from work are often taken as state-of-nature variables in the short run. It is hard to imagine a transportation policy that can have any effect on the age distribution of the population. Yet the age distribution of the population influences the demand for mass transit. In contrast, bus fare and price of gasoline can be influenced by a policy-maker through tax or subsidy measures.
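The policy/state-of-nature distinction can be sketched as follows, assuming a hypothetical version of the bus-trip equation (the coefficients are invented for illustration): a policy scenario changes only the policy variable, leaving the state-of-nature variables alone.

```python
def bus_trips(fare, gas_price, income, avg_age, distance):
    """Hypothetical bus-trip demand equation; coefficients invented."""
    return (0.5
            - 1.2 * fare        # policy variable: higher fare -> fewer trips
            + 0.6 * gas_price   # policy variable (tax or subsidy)
            + 0.001 * income    # state of nature in the short run
            + 0.01 * avg_age    # state of nature
            + 0.02 * distance)  # state of nature

# A fare subsidy changes only the fare; income, age distribution, and
# commuting distance are outside the policy-maker's control.
baseline = bus_trips(fare=2.0, gas_price=3.0, income=400, avg_age=40, distance=5)
with_subsidy = bus_trips(fare=1.5, gas_price=3.0, income=400, avg_age=40, distance=5)
print(round(with_subsidy - baseline, 2))  # 0.6 extra trips from a 50-cent fare cut
```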