Statistical Analysis Using Regression
3. STEPS IN RUNNING A REGRESSION
3.2 Choosing a Functional Form: Linear or Non-Linear?
Once the variables have been chosen, the next step is to specify the form of the equation, or the manner in which the independent variables are assumed to interact to determine the level of the dependent variable.
Multiple regression is often termed linear regression, a term that makes people think that a linear relationship between the dependent and independent variables is the only one possible. In fact, the “linear” in linear regression only refers to the coefficients, not to the dependent or independent variables.
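While a regression in which each independent variable enters the equation as itself is obviously a linear regression, so is one in which an independent variable is squared, logged, or otherwise transformed before it enters the equation. In each case, even though the relationship between an independent variable and Y is more complicated than the simple linear one, the equation is still linear in the coefficients and can be estimated using ordinary least squares, the classical linear regression method.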
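A minimal sketch of this point, assuming NumPy and synthetic data (the variable names, the "true" coefficients, and the noise level are all illustrative assumptions): the regressors are a squared term and a logged term, yet ordinary least squares still applies because the equation is linear in the coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Two hypothetical independent variables, kept positive so the log is defined.
x1 = rng.uniform(1.0, 10.0, n)
x2 = rng.uniform(1.0, 10.0, n)

# Assumed "true" coefficients, used only to generate the synthetic data.
true_b = np.array([2.0, 0.5, -3.0])
y = true_b[0] + true_b[1] * x1**2 + true_b[2] * np.log(x2) + rng.normal(0.0, 0.1, n)

# The design matrix holds the transformed variables; Y is nonlinear in x1 and x2
# but linear in the coefficients, so ordinary least squares can estimate them.
X = np.column_stack([np.ones(n), x1**2, np.log(x2)])
b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print("estimated coefficients:", b_hat)
```

The estimates should land close to the assumed values, even though a plot of Y against either variable would be curved.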
If you were to graph the relationship between the dependent variable and an independent variable, sometimes it makes sense for the relationship to be a straight line. For instance, if you double the use of an input to production, you are likely to double the cost associated with that input. In other cases, though, a straight line might not make sense. For instance, if you were graphing the relationship between an input to production (such as fertilizer) and output (such as crop yield), the theory of diminishing marginal product would argue that doubling fertilizer would not be expected to double yield; yield would increase more slowly. In this case, a linear relationship between yield and fertilizer would not be appropriate, and a functional form that allowed for a nonlinear relationship would be more appropriate.
To illustrate, let us assume that we are analyzing the demand for commuting trips to work. If the demand function has been specified as a linear relationship, each independent variable enters into the equation as itself, and only itself:
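As a sketch only, under assumed notation (Q for the number of commuting trips demanded per time period, P for the price of a trip, I for income, B_0, B_1, and B_2 for the coefficients, and u for the error term; the choice of variables and symbols is illustrative), a linear demand function of this kind might be written as:

```latex
Q = B_0 + B_1 P + B_2 I + u
```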
Linear demand functions, such as this, have great appeal in empirical work for two reasons. First, the slope or regression coefficients provide a direct measure of the marginal relationships in the demand function. That is, they indicate the change in quantity demanded caused by a one-unit change in each of the related variables. For example, if the coefficient on price is -2, quantity demanded will decline by 2 units per time period with each $1 increase in the price. Second, sometimes demand relationships are in fact approximately linear over the range in which decisions are made. The demand curve derived from a demand function such as this is a straight line, as shown in Figure 6-2.
If a linear relationship between the dependent and independent variables is not appropriate, there are numerous other functional forms that can reflect nonlinear relationships, as illustrated above. Figure 6-2 illustrates some of the most common functional forms used in regressions, including quadratic, double-log, and semi-log.
The quadratic equation includes, in addition to the linear variables, squared terms for one or more of the independent variables:
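As a sketch only, under assumed notation (Q for quantity demanded, P for price, I for income, B_0 through B_3 for the coefficients, and u for the error term; the symbols and the choice to square only price are illustrative), a quadratic demand function might be written as:

```latex
Q = B_0 + B_1 P + B_2 P^2 + B_3 I + u
```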
In this form, a one-unit change in price has a more complicated effect on quantity: dQ/dP, the slope of the demand curve, now equals the coefficient on price plus two times the coefficient on squared price times price, so the slope itself changes with the level of price.
The relationship between price and quantity (or quantity and income, for that matter) is no longer constrained to be a straight line. If the coefficient on price is negative, the first-order effect is that the demand curve slopes down; however, depending on the sign of the coefficient on squared price (that is, whether it is positive or negative), the demand curve may become steeper or shallower as price increases. Because the quadratic form is shaped like a parabola, with either a maximum or a minimum, in some cases this shape may produce nonsensical results over some range of the policy analysis. For instance, as price goes up, quantity may decrease over one range but increase over another. While this shape might make sense for other kinds of regressions (for instance, in cases of extreme diminishing marginal product), it may make less sense for a demand curve.
Still, if the analyst is aware that the resulting equation should only be used in a limited range of the independent variable, the quadratic form can be useful in many cases.
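The caution about the parabola can be made concrete with a small sketch. The coefficients below are hypothetical values chosen for illustration, not estimates from any data set; the code evaluates the slope dQ/dP at several prices and locates the price at which the fitted curve turns upward, beyond which a quadratic demand curve would no longer make sense.

```python
# Hypothetical coefficients for Q = b0 + b1*P + b2*P**2 (illustrative values only).
b0, b1, b2 = 100.0, -8.0, 0.25


def quantity(p):
    """Quantity demanded implied by the quadratic form at price p."""
    return b0 + b1 * p + b2 * p**2


def slope(p):
    """Slope of the demand curve, dQ/dP = b1 + 2*b2*P, at price p."""
    return b1 + 2.0 * b2 * p


# The parabola turns where the slope is zero: P = -b1 / (2*b2).
turning_price = -b1 / (2.0 * b2)

for p in (4.0, 12.0, 20.0):
    print(f"P = {p:5.1f}  Q = {quantity(p):7.2f}  dQ/dP = {slope(p):6.2f}")
print("slope changes sign at P =", turning_price)
```

With these assumed values the slope is negative below the turning price and positive above it, so the equation would only be used for prices below that point.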
The double-logarithmic equation takes the natural (i.e., to the base e) logarithmic form on both sides of the equation:
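As a sketch only, under assumed notation (Q for quantity demanded, P for price, I for income, B_0, B_1, and B_2 for the coefficients, and u for the error term; the symbols are illustrative), a double-log demand function might be written as:

```latex
\ln Q = B_0 + B_1 \ln P + B_2 \ln I + u
```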
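This expression is equivalent to the so-called Cobb-Douglas functional form, in which Q equals e raised to the power of the intercept, multiplied by each independent variable raised to the power of its own coefficient (where e refers to the exponential, sometimes called the antilog, of 1, 2.71828. . .).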
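The equivalence can be sketched by exponentiating both sides of a double-log equation. The notation is assumed for illustration (Q for quantity, P for price, I for income, B_0, B_1, and B_2 for the coefficients), and the error term is omitted for brevity:

```latex
\begin{align}
\ln Q &= B_0 + B_1 \ln P + B_2 \ln I \\
Q &= e^{B_0 + B_1 \ln P + B_2 \ln I} = e^{B_0}\, P^{B_1}\, I^{B_2}
\end{align}
```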
Depending on the sign and size of the Bs (the coefficients), the relationship between Q and an independent variable could be increasing or decreasing at either an increasing or a decreasing rate; thus, again, a curved relationship is possible between the dependent and independent variables.
Here the graph of the relationship between the dependent and an independent variable does not have a maximum or minimum; thus, for some data sets it may be more desirable than the linear or quadratic. On the other hand, the double log may not be desirable for a demand curve, because there is no finite “choke” price that will lead to quantity demanded going to zero (a likely result of very high prices for most goods).
The unique advantage of the double log form is the ease with which the regression coefficients can be compared. It is the only functional form in which the regression coefficients are the elasticities, the percent change in the dependent variable associated with a one percent change in each of the independent variables. For instance, if the coefficient on price is -0.5 and the coefficient on income is 1.0 in the above equation, then a 10% increase in price leads to a 5% decrease in quantity purchased (-5%/10% = -0.5), while a 10% increase in income leads to a 10% increase in quantity purchased. Because the coefficients are elasticities, their magnitudes can be directly compared to provide an accurate indication of which independent variable has the largest effect on the dependent variable. This cannot be done with a linear functional form because the magnitudes of the B coefficients depend on the scale of the units of measurement for each independent variable.
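The contrast between elasticities and linear coefficients can be illustrated with a short sketch, assuming NumPy and synthetic data generated with a price elasticity of -0.5 and an income elasticity of 1.0 (the data, units, and helper function `ols` are hypothetical). Re-expressing price in cents instead of dollars leaves the double-log slope coefficients unchanged but shrinks the linear price coefficient by a factor of 100.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
price = rng.uniform(1.0, 20.0, n)      # assumed to be in dollars
income = rng.uniform(20.0, 100.0, n)   # assumed to be in thousands of dollars

# Synthetic demand built with price elasticity -0.5 and income elasticity 1.0.
log_q = 1.0 - 0.5 * np.log(price) + 1.0 * np.log(income) + rng.normal(0.0, 0.05, n)
q = np.exp(log_q)


def ols(y, *regressors):
    """Least-squares coefficients, intercept first (illustrative helper)."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Double-log form: the slope coefficients are elasticities.
loglog_dollars = ols(np.log(q), np.log(price), np.log(income))
loglog_cents = ols(np.log(q), np.log(price * 100.0), np.log(income))

# Linear form: the price coefficient depends on the units of price.
linear_dollars = ols(q, price, income)
linear_cents = ols(q, price * 100.0, income)

print("double-log slopes, price in dollars:", loglog_dollars[1:])
print("double-log slopes, price in cents:  ", loglog_cents[1:])
print("linear price coefficient, dollars vs cents:", linear_dollars[1], linear_cents[1])
```

Only the intercept of the double-log regression changes when the units change, which is why its slope coefficients can be compared across variables.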
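In the semi-logarithmic form, either the dependent variable or one or more independent variables may be logged. Thus, the semi-logarithmic equation can be written in two ways: first, with the natural logarithm of the dependent variable on the left-hand side and the independent variables in their linear form; or second, with the dependent variable in its linear form and the natural logarithm of one or more independent variables on the right-hand side.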
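As a sketch only, under assumed notation (Q for quantity demanded, P for price, I for income, B_0, B_1, and B_2 for the coefficients, and u for the error term; the symbols are illustrative), the two semi-logarithmic variants, first with the dependent variable logged and then with the independent variables logged, might be written as:

```latex
\begin{align}
\ln Q &= B_0 + B_1 P + B_2 I + u \\
Q &= B_0 + B_1 \ln P + B_2 \ln I + u
\end{align}
```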
In recreation demand studies, applications of the semilog form frequently take the first form, converting number of trips to a natural logarithm and leaving all independent variables in their linear form. This form has a unique advantage: the average consumer surplus per unit of Q can be estimated as simply one divided by the absolute value of the regression coefficient for price.
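The consumer surplus result can be sketched as follows, under assumptions that go beyond the text: Q is the number of trips, P is the price per trip, the price coefficient B_1 is negative, and all other independent variables are held fixed and absorbed into the intercept B_0. Integrating the demand curve from the current price P_0 upward gives total consumer surplus, and dividing by quantity gives the per-unit value:

```latex
\begin{align}
\ln Q &= B_0 + B_1 P, \quad B_1 < 0
  \;\Longrightarrow\; Q(P) = e^{B_0 + B_1 P} \\
CS(P_0) &= \int_{P_0}^{\infty} e^{B_0 + B_1 P}\, dP = -\frac{Q(P_0)}{B_1} \\
\frac{CS(P_0)}{Q(P_0)} &= -\frac{1}{B_1}
\end{align}
```

For example, a hypothetical price coefficient of -0.04 per dollar would imply an average consumer surplus of 1/0.04 = $25 per trip.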
A large range of functional forms is possible for any regression. Identifying a form that fits the data well is often the reason for running many regressions on the same data set. As suggested above, theory can be used to suggest characteristics of the functional form. For instance, it is likely that a production function (a relationship between input and output) will display diminishing marginal product, meaning that each additional unit of input leads to a smaller amount of additional output, and thus one of the non-linear forms suggested above would be appropriate. In some production functions, it might be true that the amount of one input can influence the marginal productivity of another input. For instance, ground-level ozone, a pollutant, is produced by a chemical reaction between hydrocarbons (HC) and nitrogen oxides (NOx) in the presence of sunlight. The amount of ozone produced when NOx is increased depends on how much HC is present. If you were to run a regression of ozone on NOx, HC, and an interaction term equal to the product of NOx and HC, then the effect of one more unit of NOx on ozone production would be the coefficient on NOx plus the coefficient on the interaction term multiplied by the level of HC. This effect now depends on the level of hydrocarbon, as theory predicts, because of the interaction term.
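The role of an interaction term can be sketched with synthetic data, assuming NumPy. The coefficients used to generate the data, the variable ranges, and the helper `marginal_effect_of_nox` are hypothetical; the point is only that the estimated effect of one more unit of NOx differs with the level of HC.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
nox = rng.uniform(0.0, 10.0, n)   # hypothetical NOx levels
hc = rng.uniform(0.0, 10.0, n)    # hypothetical hydrocarbon levels

# Synthetic ozone with an assumed interaction between NOx and HC.
ozone = 5.0 + 0.2 * nox + 0.1 * hc + 0.3 * nox * hc + rng.normal(0.0, 0.5, n)

# Regress ozone on NOx, HC, and their product (the interaction term).
X = np.column_stack([np.ones(n), nox, hc, nox * hc])
b, *_ = np.linalg.lstsq(X, ozone, rcond=None)


def marginal_effect_of_nox(hc_level):
    """Estimated change in ozone per unit of NOx at a given HC level."""
    return b[1] + b[3] * hc_level


for level in (2.0, 5.0, 8.0):
    print(f"HC = {level}: effect of one more unit of NOx = {marginal_effect_of_nox(level):.3f}")
```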
In other cases, theory may not suggest what form the equation should take. In these cases, you should let the data suggest the functional form, choosing the form that provides the best possible fit to the data. Scatter diagrams of the relationship between the dependent variable and each of the independent variables may suggest which form reflects the true relationships between them. In practice, several different forms may be tested, and the one that best fits the data (using some of the statistical tests described below) should be selected as being most likely to reflect the true relationship.
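Testing several forms can be sketched as follows, assuming NumPy and synthetic data with diminishing returns. R-squared is used here only as a simple stand-in for a fit statistic; it is comparable across these candidates because they all share the same untransformed dependent variable, and it is not necessarily the test the text has in mind. All names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.uniform(1.0, 50.0, n)   # e.g. a hypothetical input such as fertilizer

# Synthetic output generated from a logarithmic (diminishing-returns) relationship.
y = 10.0 + 4.0 * np.log(x) + rng.normal(0.0, 0.3, n)

# Candidate functional forms, all with the same dependent variable y.
candidates = {
    "linear": np.column_stack([np.ones(n), x]),
    "quadratic": np.column_stack([np.ones(n), x, x**2]),
    "semi-log": np.column_stack([np.ones(n), np.log(x)]),
}


def r_squared(X, y):
    """Share of the variation in y explained by a least-squares fit on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


fits = {name: r_squared(X, y) for name, X in candidates.items()}
best = max(fits, key=fits.get)

for name, value in fits.items():
    print(f"{name:10s} R-squared = {value:.4f}")
print("best-fitting form:", best)
```

When the dependent variable itself is transformed (for example, Q in one regression and ln Q in another), R-squared values are no longer directly comparable and a different comparison is needed.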