Ekki Syamsulhakim*
CAUSALITY VS CORRELATION
1. Regression analysis is a method to obtain a specific functional (or, in most cases, "causal") relationship when we have data on two or more variables. Without regression analysis, there may only exist general functions. Regression analysis is widely used in empirical analysis, especially to compute the marginal effect of a change in one or more variables on one particular variable of interest. Regression analysis is the foundation of econometrics.
2. When we talk about a causal relationship (or causality), we talk about the theoretical background (e.g. economic theories) of the relationship between the variables: which of those variables depends on the others? For example, the age of a person and his/her wage, or the price of good x and its quantity demanded. In other words, which of these variables is the dependent variable and which is the independent variable?
3. Another important issue is the correlation between the variables. To understand the concept of correlation, we need to understand the concept of covariance. The covariance between two variables x and y is defined as Cov(x, y) = (1/n)∑(x − x̄)(y − ȳ), a number indicating whether x and y have a positive, negative, or neutral relationship. Its magnitude depends on the scale of the data: the bigger the values of x (or y), the bigger the value of the covariance.
4. We can plot x and y in a scatter plot and divide the plot using two lines at the averages of x and y, x̄ and ȳ, so that the plot is divided into 4 quadrants. From the formula Cov(x, y) = (1/n)∑(x − x̄)(y − ȳ), we can see that if the data are spread more in quadrants 1 and 3, the covariance is positive, and if the data are spread more in quadrants 2 and 4, the covariance will be negative.
5. Covariance, however, does not tell us about the strength of the relationship, which can be measured using the correlation coefficient. The formula for the correlation coefficient is Corr(x, y) = Cov(x, y) / √(Var(x) · Var(y)). By dividing the covariance by the square root of the product of Var(x) and Var(y), we get −1 ≤ Corr(x, y) ≤ 1. The higher the absolute value of the correlation coefficient, the stronger the relationship between x and y.
*sites.google.com/a/fe.unpad.ac.id/ekki; Senior Lecturer, Department of Economics & Development Studies,
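The covariance and correlation formulas in points 3–5 can be sketched directly in Python (this is outside the Excel-based course; the data here are made up for illustration):

```python
# Illustrative data (made up for this sketch).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Cov(x, y) = (1/n) * sum((x_i - x_bar) * (y_i - y_bar))
cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n

# Variances with the same 1/n convention as the covariance above.
var_x = sum((xi - x_bar) ** 2 for xi in x) / n
var_y = sum((yi - y_bar) ** 2 for yi in y) / n

# Corr(x, y) = Cov(x, y) / sqrt(Var(x) * Var(y)), so -1 <= corr <= 1.
corr_xy = cov_xy / (var_x * var_y) ** 0.5
print(round(corr_xy, 4))
```

Because y rises almost perfectly linearly with x in this toy sample, the correlation comes out close to 1.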
6. Important note: "Correlation does not imply causality." Two variables may be closely related (high correlation coefficient). Does that mean that a change in one of the variables causes the change in the other?
THE STRUCTURE OF ECONOMIC DATA
7. Knowledge about the data structure is important in doing regression analysis, because specific regression methods depend on the dataset that we have in hand. This will be studied further in econometrics.
8. The first data structure is what is called cross-sectional, which consists of a sample of individuals, households, firms, cities, states, countries, or a variety of other units, taken at a given point in time.
Example: IFLS 2014
9. Second, it can be time series, which consists of observations on a variable or several variables over time.
Example: Education Indicators from BPS
10. Third, we can have the pooled (repeated) cross-section structure, which is a combination of two or more cross-sectional data sets.
Example: SUSENAS KOR RT 2012 & SUSENAS KOR RT 2013
11. The fourth one is the panel structure, which consists of a time series for each cross-sectional member in the data set.
Example: IFLS 2014 and IFLS 2007
REGRESSION MODELS: SIMPLE VS MULTIPLE
12. The equation that describes how y is related to x is called the regression model. Regression models may be linear (in parameters) or non-linear. We focus on linear models.
14. In micro-econometrics (econometrics using micro data), simple linear regression (SLR) is very uncommon. However, there may be examples of SLR applications in time-series or macro-econometrics. As Wooldridge (2013) said: How can we hope to learn in general about the ceteris paribus effect of x on y, holding other factors fixed, when we are ignoring all those other factors?
15. A key assumption is that the error term u and the regressor x are uncorrelated; if u and x are uncorrelated, then, as random variables, they are not linearly related. Hence, E(u|x) = 0.
16. We now have our simple linear regression equation (or Population Regression Function, PRF): E(y|x) = β0 + β1x. We call β0 the intercept of the regression line and β1 the slope of the regression line.
17. We rarely have population data in hand, so we use sample data instead. When we use sample data, we have the Sample Regression Function (SRF), in the form ŷ = β̂0 + β̂1x.
HOW TO GET THE COEFFICIENT (MATHEMATICALLY)
18. The most widely used method to get the parameters (statistics) of the regression is the Ordinary Least Squares (OLS) method. It minimizes the sum of squared residuals, ∑ û².
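For the SLR case, the OLS formulas reduce to β̂1 = Cov(x, y)/Var(x) and β̂0 = ȳ − β̂1x̄. A minimal sketch (not part of the Excel-based course, with made-up educ/wage numbers):

```python
# Made-up (educ, wage) sample for illustration only.
educ = [8, 10, 12, 12, 16]
wage = [2.0, 2.5, 3.5, 3.0, 5.0]

n = len(educ)
x_bar = sum(educ) / n
y_bar = sum(wage) / n

# beta1_hat = Cov(x, y) / Var(x); the common 1/n factors cancel.
cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(educ, wage)) / n
var_x = sum((x - x_bar) ** 2 for x in educ) / n

beta1_hat = cov_xy / var_x             # slope of the SRF
beta0_hat = y_bar - beta1_hat * x_bar  # intercept of the SRF

print(beta0_hat, beta1_hat)
```

A quick sanity check on any OLS fit: the regression line passes through the point (x̄, ȳ), which is exactly what the intercept formula enforces.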
21. The variables "female" and "married" are called "dummy variables". A dummy variable essentially takes the value 1 if the observation belongs to a category (e.g. female, married) and 0 otherwise.
22. When we use dummy variables, only one category can be put into the equation; the other must be left out as the base category. Otherwise we are in a situation called the "dummy variable trap" (later in econometrics, this is discussed under "perfect collinearity").
23. We are going to apply the formula above (point number 18). Suppose we have a PRF of SLR:
wage_i = β0 + β1 educ_i + u_i
24. What are the values of β0 and β1?
25. What is our SRF?
26. Suppose we add more variables, say, experience, so we have a PRF of MLR:
wage_i = β0 + β1 educ_i + β2 exper_i + u_i
27. What is our SRF?
28. Suppose we add more variables, say, female and married, so we have a PRF of MLR:
wage_i = β0 + β1 educ_i + β2 exper_i + β3 female_i + β4 married_i + u_i
29. What is our SRF?
30. To display complete regression statistics, you have to activate the "Analysis ToolPak" add-in (go to Excel Options, select Add-ins, and activate the Analysis ToolPak).
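Outside the Excel workflow used in this course, the same multiple regression can be sketched with a least-squares solver. The (educ, exper, wage) numbers below are made up for illustration:

```python
import numpy as np

# Made-up micro sample; in the course the regression is run in Excel instead.
educ = np.array([8, 10, 12, 12, 16, 14], dtype=float)
exper = np.array([10, 8, 5, 12, 3, 7], dtype=float)
wage = np.array([2.0, 2.5, 3.0, 4.0, 4.5, 4.2])

# Design matrix: a column of ones for the intercept, then the regressors.
X = np.column_stack([np.ones_like(educ), educ, exper])

# lstsq minimizes the sum of squared residuals, exactly what OLS does.
beta_hat, *_ = np.linalg.lstsq(X, wage, rcond=None)
print(beta_hat)  # [beta0_hat, beta1_hat, beta2_hat]
```

A defining property of the OLS solution is that the residuals are orthogonal to every column of the design matrix, which is a handy way to verify a fit.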
INTERPRETATION OF THE COEFFICIENT (AND THE INTERCEPT)
31. SLR β̂0: The model predicts [dependent variable] to be [β̂0] [unit of dependent variable] when [independent variable] is zero [unit of independent variable].
32. SLR β̂1: A one-[unit of independent variable] increase in [independent variable] [increases/decreases] [dependent variable] by [β̂1] [unit of dependent variable].
33. MLR β̂0: The model predicts [dependent variable] to be [β̂0] [unit of dependent variable] when [all independent variables] are zero.
34. MLR β̂i: A one-[unit of independent variable] increase in [independent variable] [increases/decreases] [dependent variable] by [β̂i] [unit of dependent variable], assuming all other independent variables do not change (ceteris paribus).
35. MLR for dummy variables, cross-section data: "Suppose we have two [cross-section units], which are identical in every aspect, except that one is [dummy category = 1] and the other is [dummy category = 0]. The [dummy category = 1] has a [dependent variable] [β̂i] [unit of dependent variable] [higher/lower] than the [dummy category = 0]."
MODEL'S GOODNESS OF FIT
37. Do we have a good or a bad model? It is often useful to compute a statistic that summarizes how well the OLS regression line fits the data. How can we conclude that the model is acceptable? One way is to use the coefficient of determination (R²) to check the goodness of fit of the model. Note: R² is only comparable across models with the same dependent variable. A limitation of R² is explained below.
38. R², or the coefficient of determination, measures the proportion of the variation in the dependent variable explained by the variation in the explanatory variable(s). The calculation of R² involves a breakdown of y for each observation. As Wooldridge (2013) says, "Thus, we can view OLS as decomposing each y_i into two parts, a fitted value and a residual. The fitted values and residuals are uncorrelated in the sample."
39. Define the total sum of squares (SST), the explained sum of squares (SSE), and the residual sum of squares (SSR, also known as the sum of squared residuals) as follows:
SST = ∑_{i=1}^{n} (y_i − ȳ)²
SSE = ∑_{i=1}^{n} (ŷ_i − ȳ)²
SSR = ∑_{i=1}^{n} û_i²
40. R² = SSE/SST = ∑(ŷ_i − ȳ)² / ∑(y_i − ȳ)² (Wooldridge, 2013); in Anderson (2014) the abbreviations are different but the formula is the same. Note: as more and more independent variables are included in the model, R² will increase. For example, adding a second independent variable to a model with one independent variable can never lower R².
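The decomposition in points 39–40, and the note that R² cannot fall when a regressor is added, can be checked numerically. A sketch on simulated data (not part of the Excel-based course); x2 is pure noise, yet it still does not lower R²:

```python
import numpy as np

# Simulated data: y truly depends only on x1; x2 is irrelevant noise.
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

def r_squared(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta
    sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
    sse = np.sum((y_hat - y.mean()) ** 2)  # explained sum of squares
    ssr = np.sum((y - y_hat) ** 2)         # residual sum of squares
    assert np.isclose(sst, sse + ssr)      # SST = SSE + SSR when an intercept is included
    return sse / sst

ones = np.ones(n)
r2_one = r_squared(np.column_stack([ones, x1]), y)
r2_two = r_squared(np.column_stack([ones, x1, x2]), y)
print(r2_one, r2_two)  # r2_two >= r2_one, even though x2 is pure noise
```

This mechanical increase is why R² alone cannot be used to choose between models with different numbers of regressors.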
43. We have two types of statistical tests: the individual t-test and the joint F-test. For the individual t-test, we can have a one-sided or a two-sided test. For all of the tests, we can use either the test-statistic method or the p-value method.
44. The "long, formal" steps of completing a statistical test:
o State the null and alternative hypotheses
o Choose the level of significance
o For the test-statistic method: observe the t-statistic and compute the t-critical value
o For the p-value method: compute the p-value
o State the decision rule
o State the conclusion
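The steps in point 44 can be walked through in code with the test-statistic method. The slope estimate, standard error, and degrees of freedom below are hypothetical stand-ins, and the critical value is taken from a standard t table:

```python
# Hypothetical regression output (stand-ins, not from any real data set).
beta1_hat = 0.54  # estimated slope
se_beta1 = 0.20   # its standard error
df = 20           # degrees of freedom, n - k - 1

# Step 1: H0: beta1 = 0 vs H1: beta1 != 0 (two-sided test).
# Step 2: choose the level of significance.
alpha = 0.05
# Step 3: compute the t-statistic; the two-sided 5% critical value
# for df = 20 is 2.086 (from a standard t table).
t_stat = (beta1_hat - 0) / se_beta1
t_crit = 2.086
# Step 4: decision rule: reject H0 if |t| > t_crit.
reject = abs(t_stat) > t_crit
# Step 5: state the conclusion.
print("reject H0" if reject else "fail to reject H0")
```

Here |t| = 2.7 exceeds 2.086, so at the 5% level we would reject H0 and call the slope statistically significant.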
COMPUTER EXERCISE 2
45. Open bwght.xlsx. The data are taken from Wooldridge (2013) and contain 1988 information on the birth weight of a baby (in oz), the price of cigarettes per pack (cents), family income (in $000), and the number of cigarettes smoked per day while pregnant.
46. Run a regression analysis where the birth weight of a baby is the dependent variable
o Interpret the coefficients of all variables
o Explain whether the model fits the data well
o Does the number of cigarettes smoked during pregnancy affect the birth weight of a baby?
o Does the cigarette price positively affect the birth weight of a baby?
o Do all variables jointly have effects on the birth weight of a baby?
o Calculate the predicted weight of a baby if the price of cigarettes per pack is 122 cents, family income is 80 ($000), and the number of cigarettes smoked by the mother per day is 5
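The last task in point 46 is just plugging the given values into the SRF. A sketch with hypothetical coefficients (stand-ins for whatever your own Excel regression output reports, not real estimates from bwght.xlsx):

```python
# HYPOTHETICAL coefficients, for illustration only; substitute the
# intercept and slopes from your actual regression output.
b0, b_price, b_faminc, b_cigs = 110.0, 0.05, 0.09, -0.46

# Values given in the exercise.
price, faminc, cigs = 122, 80, 5

# Fitted value: bwght_hat = b0 + b1*price + b2*faminc + b3*cigs
bwght_hat = b0 + b_price * price + b_faminc * faminc + b_cigs * cigs
print(round(bwght_hat, 2))
```

With these stand-in numbers the predicted birth weight is 110 + 6.1 + 7.2 − 2.3 = 121.0 oz; the arithmetic is the same whatever coefficients Excel reports.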
BASIC TIME SERIES ANALYSIS
47. A time series data set comes with a temporal ordering. For analysing time series data in the social sciences, we must recognize that the past can affect the future, but not vice versa (unlike in the Star Trek universe).
49. There are four elements of a time series (Inder et al., 2010). The first one is the trend, which is the "persistent, long term upward or downward pattern of movement". Secondly, there is the cycle, "a pattern of up-and-down swings that tend to repeat every 2-10 years". Then we have the seasonal component, which is "… a regular pattern of fluctuations that occur within each year, and tend to repeat year after year". Lastly, we have irregularity, which "represents whatever is 'left over' after identifying the other three systematic components".
50. Economic time series often have trends. It is very important to be able to check the trend of the series whose causal relationship we want to investigate. As Wooldridge (2013) states, "Ignoring the fact that two sequences are trending in the same or opposite directions can lead us to falsely conclude that changes in one variable are actually caused by changes in another variable."
51. We need to model the trend in time series analysis. Inder et al. (2010) argue that there may be many ways to model the trend. Wooldridge (2013) discusses two methods of capturing the trend, namely linear and exponential. In this preliminary (matrikulasi) program, we will discuss only the linear trend model.
[Figure: a linear trend and an exponential (non-linear) trend. Source: Wooldridge (2013)]
52. The linear trend model is written:
y_t = β0 + β1 t + e_t,  t = 1, 2, 3, …, T
The subscript "t" denotes that the data we are dealing with are a time series. The frequency of the series may be annual, quarterly, monthly, daily, etc.
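Fitting the linear trend model is an OLS regression of y_t on an intercept and t. A sketch on a simulated series (not part of the Excel-based course; the true trend here is 0.3 per period):

```python
import numpy as np

# Simulated series with a known linear trend: y_t = 5 + 0.3*t + noise.
T = 40
t = np.arange(1, T + 1, dtype=float)
rng = np.random.default_rng(1)
y = 5.0 + 0.3 * t + rng.normal(scale=1.0, size=T)

# OLS of y_t on an intercept and the time trend t.
X = np.column_stack([np.ones(T), t])
(beta0_hat, beta1_hat), *_ = np.linalg.lstsq(X, y, rcond=None)

# beta1_hat estimates the average change in y per period.
print(beta1_hat)
```

The residuals from this regression are the detrended series, which is one common way to separate the trend component from the rest.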
COMPUTER EXERCISE 3 – & ELASTICITIES
53. Open linvhouse1.xlsx. These data are taken from Wooldridge (2013) and describe housing investment in 1947–1988. In the data we see real housing investment (inv, million $), population (in 000s), a housing price index (1982 = 1), a time trend (t = 1, 2, …, 42), and some logarithmic transformations of the variables.
54. Plot the investment and population variables and check whether there is a trend associated with them.
55. Regress the log of investment per capita on the log of price. Interpret the result. Note: remember the result from point number 54 above.
56. Regress the log of investment per capita on the trend. Interpret the result. (You can use the words: "the estimated average growth in [dependent variable] is β̂_trend per year".)
58. Conduct hypothesis testing on the log of price
59. Conduct forecasting
COMPUTER EXERCISE 4 – MONTHLY DATA
60. Open data-time-series.xlsx
61. Conduct a regression of international airline passengers on the trend; interpret the result
62. Conduct a regression of international airline passengers on the trend and monthly dummies; interpret the result
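The trend-plus-monthly-dummies regression in exercise 62 can be sketched on a simulated monthly series. Note that only 11 monthly dummies enter the equation (January is the omitted base category, avoiding the dummy variable trap from point 22); the series itself is made up:

```python
import numpy as np

# Simulated monthly series: level + trend + seasonal pattern + noise.
n_years, n_months = 6, 12
T = n_years * n_months
t = np.arange(1, T + 1, dtype=float)
month = np.tile(np.arange(n_months), n_years)  # 0 = Jan, ..., 11 = Dec

rng = np.random.default_rng(2)
seasonal = np.array([0, 0, 1, 2, 4, 6, 8, 8, 5, 3, 1, 0], dtype=float)
y = 100 + 0.5 * t + seasonal[month] + rng.normal(size=T)

# Design matrix: intercept, trend, and 11 dummies for Feb..Dec
# (January is the omitted base category).
dummies = np.column_stack([(month == m).astype(float) for m in range(1, n_months)])
X = np.column_stack([np.ones(T), t, dummies])

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat[1])  # estimated trend coefficient
```

Each dummy coefficient is then read relative to the base month, following the interpretation template in point 35.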
References
Anderson, Sweeney, Williams, Camm, Cochran; Statistics for Business and Economics, 12th Edition, 2014
Dougherty, C., Introduction to Econometrics, 2011
Ellis, P. D., The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, and the Interpretation of Research Results, Cambridge, 2010
Inder, B., et al., Business and Economic Statistics, Department of Econometrics & Business Statistics, Monash University, Australia, 2010
Syamsulhakim, E., "Pengajaran Pengantar Ekonometrika Menggunakan Stata" [Teaching Introductory Econometrics Using Stata], Training for Trainers Ekonometrika, Departemen Ilmu Ekonomi, Universitas Padjadjaran, 2016
Wooldridge, J., Introductory Econometrics: A Modern Approach, 2013