

Comment

Jonathan H. Wright

Department of Economics, Johns Hopkins University, Baltimore, MD 21218 ([email protected])

The test of equal predictive accuracy proposed by Diebold and Mariano (1995) (henceforth the DM test) has been extraordinarily influential. Whenever the accuracy of two forecasts is compared, the DM test is used to assess whether the difference in forecast accuracy is a fluke or not. The article has more than 3500 Google Scholar cites. And it has the virtue of great simplicity. It is no more than a t-test of the hypothesis that the difference in means between two series is equal to zero.
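To make this concrete, the following is a minimal sketch of the DM statistic as exactly such a t-test on the mean loss differential, assuming squared-error loss. The function name and the Bartlett-weighted long-run variance are my own illustrative choices, not the only way to implement the test.

```python
# Hedged sketch: DM test as a t-test that the mean loss differential is zero.
# Assumes squared-error loss; the Bartlett (Newey-West style) long-run variance
# with h-1 lags is one common implementation choice, not the only one.
import numpy as np
from scipy import stats

def dm_test(e1, e2, h=1):
    """DM statistic and two-sided normal p-value for forecast errors e1, e2."""
    d = np.asarray(e1, dtype=float) ** 2 - np.asarray(e2, dtype=float) ** 2
    n = d.size
    d_bar = d.mean()
    lrv = np.sum((d - d_bar) ** 2) / n            # lag-0 autocovariance
    for k in range(1, h):                         # Bartlett weights for lags 1..h-1
        cov = np.sum((d[k:] - d_bar) * (d[:-k] - d_bar)) / n
        lrv += 2.0 * (1.0 - k / h) * cov
    dm_stat = d_bar / np.sqrt(lrv / n)
    p_value = 2.0 * (1.0 - stats.norm.cdf(abs(dm_stat)))
    return dm_stat, p_value
```

For one-step-ahead forecasts (h = 1), this is literally a t-test on the sample mean of the squared-error differential.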

Yet the DM test poses a certain conundrum. Often we are comparing forecasts from nested models such as comparing

$$y_{t+1} = \beta_1 x_{1t} + \varepsilon_{1t}, \qquad (1)$$

and

$$y_{t+1} = \beta_1 x_{1t} + \beta_2 x_{2t} + \varepsilon_{2t}. \qquad (2)$$

In this case, the hypothesis of equal population forecast accuracy requires $\beta_2$ to be equal to zero. In other words, the two models have to be the same. It then turns out that the null limiting distribution of the DM statistic is nonstandard and has a nonzero mean. This makes sense; if the extra variables in the large model are truly irrelevant, then their inclusion should make forecasts less accurate.
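To see why, a rough back-of-the-envelope calculation (my own, not from the article) is helpful: for one-step-ahead forecasts from a correctly specified linear model with $k$ recursively estimated coefficients, the expected squared forecast error at origin $t$ is approximately

$$E\big[(y_{t+1} - x_t'\hat\beta_t)^2\big] \approx \sigma^2\Big(1 + \frac{k}{t}\Big),$$

so under the null $\beta_2 = 0$ the extra coefficient in model (2) adds roughly $\sigma^2/t$ of estimation noise to its MSPE, which is what pushes the mean of the loss differential, and hence of the DM statistic, away from zero.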

Consequently, many authors have used complicated methods for applying the DM test to nested forecast comparisons. This gives rise to odd situations. Say that we are comparing forecasts from a small and a large model. It is possible that the forecasts from the large model have a higher mean square prediction error than the forecasts from the small model, and yet the large model is judged significantly better than the small model. This seems a very odd outcome, and is, I think, a case of the "abuse" of the Diebold–Mariano test.

I agree with Diebold's conclusion here that the DM test statistic should generally be compared to standard normal critical values, but I get to that conclusion via a slightly different route. In Diebold's Assumption DM, the forecast errors are taken as primitives, and it is assumed that the loss differential evaluated at the estimated parameters is stationary and has mean zero. In the example above, it is assumed that

$$E\big[(y_{t+1} - \hat\beta_1 x_{1t})^2\big] = E\big[(y_{t+1} - \hat\beta_1 x_{1t} - \hat\beta_2 x_{2t})^2\big]$$

in all sample sizes, and then of course the asymptotic normality of the DM statistic follows immediately. The trouble is that there are no actual parameter values for which Assumption DM will hold for all sample sizes. Several alternative routes have been considered. Giacomini and White (2006) assumed estimation based on rolling windows with fixed window size, so that parameter estimation error does not vanish asymptotically. Calhoun (2011) let the number of additional variables in the large model go to infinity, for the same purpose. To me, the most broadly applicable approach is that of Clark and McCracken (2013, 2014), who specified that $\beta_2 = K T^{-1/2}$, where $K$ is a fixed constant that ensures that models (1) and (2) have equal finite-sample forecast accuracy. Diebold refers to this as "new school WCM." In this case, the DM statistic still does not have a normal asymptotic distribution. The distribution is complicated and model-dependent. However, in many contexts, using standard normal critical values will not be too bad, in the sense that the effective size of the test will be close enough to its nominal level. Clark and McCracken (2013) provided some Monte Carlo simulation evidence on this point. They found that comparing the DM statistic to standard normal critical values gives a test of the null of equal finite-sample forecast accuracy that can be significantly oversized if Newey–West standard errors are used. But this size distortion can be greatly reduced if the standard errors instead use the rectangular window and the small-sample adjustment of Harvey, Leybourne, and Newbold (1997). None of this means that using standard normal critical values is right under all circumstances, just that it is often a simple way of getting an approximately correctly sized test of the relevant hypothesis (for forecasters): equal finite-sample forecast accuracy. And of course it should be emphasized that this is not a test of models, or of population forecast accuracy; likelihood ratio tests are better suited to that task.
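The variance choice matters for the size results just described. Below is a hedged sketch of the variant described above: a rectangular (unweighted, truncated) window for the long-run variance together with the Harvey, Leybourne, and Newbold (1997) small-sample correction and Student-t critical values. The function name is illustrative.

```python
# Hedged sketch: DM statistic with a rectangular-window long-run variance
# (autocovariances up to lag h-1, unweighted) and the Harvey-Leybourne-Newbold
# (1997) small-sample adjustment, compared against t(n-1) critical values.
import numpy as np
from scipy import stats

def dm_test_hln(e1, e2, h=1):
    d = np.asarray(e1, dtype=float) ** 2 - np.asarray(e2, dtype=float) ** 2
    n = d.size
    d_bar = d.mean()
    lrv = np.sum((d - d_bar) ** 2) / n                  # lag-0 term
    for k in range(1, h):                               # rectangular window: weight 1
        lrv += 2.0 * np.sum((d[k:] - d_bar) * (d[:-k] - d_bar)) / n
    dm_stat = d_bar / np.sqrt(lrv / n)
    hln_factor = np.sqrt((n + 1 - 2 * h + h * (h - 1) / n) / n)   # HLN correction
    hln_stat = hln_factor * dm_stat
    p_value = 2.0 * (1.0 - stats.t.cdf(abs(hln_stat), df=n - 1))
    return hln_stat, p_value
```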

I agree with Diebold that the standard out-of-sample method for selecting models wastes power. The reasons that are often given for using the out-of-sample approach do not really hold up to close scrutiny. There would be considerable merit to a genuinely out-of-sample comparison: one where the researcher specified the models to be compared without seeing the holdout sample, and then evaluated the models on this holdout sample. But this is not what the pseudo-out-of-sample methodology generally entails. Still, there is one good reason for using the out-of-sample method. Data are heavily revised, and at least in the United States, these revisions are largely unforecastable. The out-of-sample method can replicate precisely what a researcher could have done in real time, using only the data as observed then. With real-time datasets like the one provided by the Federal Reserve Bank of Philadelphia, out-of-sample forecasting on vintage data is easy and common. Diebold describes this as "rarely done," but there are many articles that do out-of-sample forecasting on vintage data. It seems to me to be a valid and important exercise. Conversely, I see little point in selecting models using the standard out-of-sample method if it is not done using real-time data. To be sure, the asymptotics of the DM test get complicated if data revisions are involved. But again, using standard normal critical values should often provide an approximately correctly sized test of the null of equal finite-sample forecast accuracy.

© 2015 American Statistical Association
Journal of Business & Economic Statistics
January 2015, Vol. 33, No. 1
DOI: 10.1080/07350015.2014.969429


REFERENCES

Calhoun, G. (2011), "Out-of-Sample Comparisons of Overfit Models," Working Paper 10002, Iowa State University.

Clark, T. E., and McCracken, M. W. (2013), "Advances in Forecast Evaluation," in Handbook of Economic Forecasting (Vol. 2), eds. G. Elliott and A. Timmermann, Amsterdam: Elsevier.

——— (2014), "Nested Forecast Model Comparisons: A New Approach to Testing Equal Accuracy," Journal of Econometrics, forthcoming.

Diebold, F. X., and Mariano, R. S. (1995), "Comparing Predictive Accuracy," Journal of Business and Economic Statistics, 13, 253–263.

Giacomini, R., and White, H. (2006), "Tests of Conditional Predictive Ability," Econometrica, 74, 1545–1578.

Harvey, D. I., Leybourne, S. J., and Newbold, P. (1997), "Testing the Equality of Prediction Mean Squared Errors," International Journal of Forecasting, 13, 281–291.

Comment

Lutz Kilian

Department of Economics, University of Michigan, Ann Arbor, MI 48109 ([email protected])

Professor Diebold's personal reflections about the history of the DM test remind us that this test was originally designed to compare the accuracy of model-free forecasts such as judgmental forecasts generated by experts, forecasts implied by financial markets, survey forecasts, or forecasts based on prediction markets. This test is used routinely in applied work. For example, Baumeister and Kilian (2012) use the DM test to compare oil price forecasts based on prices of oil futures contracts against the no-change forecast.

Much of the econometric literature that builds on Diebold and Mariano (1995), in contrast, has been preoccupied with testing the validity of predictive models in pseudo-out-of-sample environments. In this more recent literature, the concern actually is not the forecasting ability of the models in question. Rather, the focus is on testing the null hypothesis that there is no predictive relationship from one variable to another in population. Testing for the existence of a predictive relationship in population is viewed as an indirect test of all economic models that suggest such a predictive relationship. A case in point is studies of the predictive power of monetary fundamentals for the exchange rate (e.g., Mark 1995). Although this testing problem may seem similar to that in Diebold and Mariano (1995) at first sight, it is conceptually quite different from the original motivation for the DM test. As a result, numerous changes have been proposed in the way the test statistic is constructed and in how its distribution is approximated.

In a linear regression model, testing for predictability in population comes down to testing the null hypothesis of zero slopes, which can be assessed using standard in-sample t- or Wald tests. Alternatively, the same null hypothesis of zero slopes can also be tested based on recursive or rolling estimates of the loss in fit associated with generating pseudo-out-of-sample predictions from the restricted rather than the unrestricted model. Many empirical studies, including Mark (1995), implement both tests.
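As a concrete illustration of these two routes to the same null, the sketch below runs an in-sample slope t-test and a recursive pseudo-out-of-sample MSPE comparison on simulated data. The data-generating process, sample split, and variable names are mine and purely illustrative; they are not Mark's (1995) specification.

```python
# Hedged sketch: the same zero-slope null tested (a) in sample and
# (b) via recursive pseudo-out-of-sample forecasts. Purely illustrative DGP.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
T = 200
x = rng.normal(size=T)              # predictor, known before the target is realized
y = 0.1 * x + rng.normal(size=T)    # target with a small true slope

# (a) In-sample test: t-statistic on the slope of y on x
t_slope = sm.OLS(y, sm.add_constant(x)).fit().tvalues[1]

# (b) Recursive pseudo-out-of-sample comparison over the second half of the sample
R = T // 2
err_r, err_u = [], []
for t in range(R, T):
    beta_u = sm.OLS(y[:t], sm.add_constant(x[:t])).fit().params   # unrestricted fit
    err_r.append(y[t] - y[:t].mean())                     # restricted: mean-only forecast
    err_u.append(y[t] - beta_u @ np.array([1.0, x[t]]))   # unrestricted forecast
mspe_r, mspe_u = np.mean(np.square(err_r)), np.mean(np.square(err_u))
print(f"in-sample t = {t_slope:.2f}, restricted MSPE = {mspe_r:.3f}, unrestricted MSPE = {mspe_u:.3f}")
```

Both numbers speak to the same population null of a zero slope; the in-sample t-test simply uses all T observations to do so.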

Under standard assumptions, it follows immediately that pseudo-out-of-sample tests have the same asymptotic size as, but lower power than, in-sample tests of the null hypothesis of no predictability in population, which raises the question of why anyone would want to use such tests. While perhaps obvious, this point has nevertheless generated extensive debate. The power advantages of in-sample tests of predictability were first formally established in Inoue and Kilian (2004). Recent work by Hansen and Timmermann (2013) elaborates on the same point. Less obviously, it can be shown that these asymptotic power advantages also generalize to comparisons of models subject to data mining, serial correlation in the errors, and even certain forms of structural breaks (see Inoue and Kilian 2004).

WHERE DID THE LITERATURE GO OFF TRACK?

In recent years, there has been increased recognition of the fact that tests of population predictability designed to test the validity of predictive models are not suitable for evaluating the accuracy of forecasts. The difference is best illustrated within the context of a predictive regression with coefficients that are modeled as local to zero. The local asymptotics here serve as a device to capture our inability to detect nonzero regression coefficients with any degree of reliability. Consider the data-generating process $y_{t+1} = \beta + \varepsilon_{t+1}$, where $\beta = 0 + \delta T^{-1/2}$, $\delta > 0$, and $\varepsilon_t \sim \mathrm{NID}(0,1)$. The Pitman drift parameter $\delta$ cannot be estimated consistently. We restrict attention to one-step-ahead forecasts. One is the restricted forecast $y_{t+1|t} = 0$; the other is the unrestricted forecast $y_{t+1|t} = \hat\beta$, where $\hat\beta$ is the recursively obtained least-squares estimate of $\beta$. This example is akin to the problem of choosing between a random walk with drift and without drift in generating forecasts of the exchange rate.
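A small Monte Carlo sketch of this setup may help fix ideas; the sample size, forecast origin, and number of replications below are arbitrary choices of mine. Because estimation here is recursive over a particular sample split, the simulated crossover point need not sit exactly at the population value of $\delta$ discussed next.

```python
# Hedged sketch: local-to-zero DGP y_{t+1} = delta/sqrt(T) + eps, comparing the
# restricted forecast (always 0) with the unrestricted recursive-mean forecast.
import numpy as np

rng = np.random.default_rng(1)
T, R, reps = 200, 100, 5000          # sample size, first forecast origin, replications

for delta in (0.5, 1.0, 1.5, 2.0):
    beta = delta / np.sqrt(T)
    mspe_r = mspe_u = 0.0
    for _ in range(reps):
        y = beta + rng.normal(size=T)
        e_r = y[R:]                                              # restricted forecast is 0
        rec_mean = np.cumsum(y)[R - 1:T - 1] / np.arange(R, T)   # recursive sample means
        e_u = y[R:] - rec_mean                                   # unrestricted forecast
        mspe_r += np.mean(e_r ** 2) / reps
        mspe_u += np.mean(e_u ** 2) / reps
    print(f"delta = {delta}: restricted MSPE = {mspe_r:.4f}, unrestricted MSPE = {mspe_u:.4f}")
```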

It is useful to compare the asymptotic MSPEs of these two forecasts. The MSPE can be expressed as the sum of the forecast variance and the squared forecast bias. The restricted forecast has zero variance by construction for all values of $\delta$, but is biased away from the optimal forecast by $\delta$, so its MSPE is $\delta^2$. The unrestricted forecast, in contrast, has zero bias, but a constant variance for all $\delta$, which can be normalized to unity without loss of generality. As Figure 1 illustrates, the MSPEs of the two forecasts are equal for $\delta = 1$. This means that for values of $\delta$ below 1 the restricted forecast has the lower asymptotic MSPE, while for values above 1 the unrestricted forecast does.

© 2015 American Statistical Association
Journal of Business & Economic Statistics
January 2015, Vol. 33, No. 1
DOI: 10.1080/07350015.2014.969430

Color versions of one or more of the figures in the article can be found online at www.tandfonline.com/r/jbes.
