
9.2.4 The Conditional Backtesting (Christoffersen) Approach

A useful adaptation to these approaches is the conditional backtesting approach suggested by Christoffersen (1998). The idea here is to separate out the particular hypotheses being tested, and then test each hypothesis separately. For example, the full null hypothesis in a standard frequency-of-tail-losses test is that the model generates a correct frequency of exceptions and, in addition, that exceptions are independent of each other. The second assumption is usually subsidiary and made only to simplify the test. However, it raises the possibility that the model could fail the test, not because it generates the wrong frequency of failures, as such, but because failures are not independent of each other.

The Christoffersen approach is designed to avoid this problem. To use it, we break down the joint null hypothesis into its constituent parts, thus giving us two distinct sub-hypotheses: the sub-hypothesis that the model generates the correct frequency of tail losses, and the sub-hypothesis that tail losses are independent. If we make appropriate assumptions for the alternative hypotheses, then each of these hypotheses has its own likelihood ratio test, and these tests are easy to carry out. This means that we can test our sub-hypotheses separately, as well as test the original joint hypothesis that the model has the correct frequency of independently distributed tail losses.

The Christoffersen approach therefore helps us to separate out testable hypotheses about the dynamic structure of our tail losses from testable hypotheses about the frequency of tail losses. This is potentially useful because it not only indicates whether models fail backtests, but also helps to identify the reasons why.
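To make this concrete, the sketch below computes the usual likelihood ratio statistics from a 0/1 sequence of VaR exceptions: a coverage statistic, an independence statistic based on the first-order Markov alternative commonly associated with Christoffersen's test, and their sum as the joint statistic. The sketch is written in Python purely for illustration (the chapter does not prescribe any particular software), and the function and variable names are assumptions of the example.

```python
import numpy as np
from scipy.stats import chi2

def christoffersen_tests(hits, p):
    """LR tests on a 0/1 exception sequence `hits` for target coverage rate `p`."""
    hits = np.asarray(hits, dtype=int)
    n = len(hits)
    x = int(hits.sum())                       # number of exceptions

    # Unconditional coverage (frequency) LR statistic
    pi_hat = x / n
    def loglik(pi):
        # small constant guards against log(0) in degenerate samples
        return (n - x) * np.log(1 - pi + 1e-300) + x * np.log(pi + 1e-300)
    lr_uc = -2 * (loglik(p) - loglik(pi_hat))

    # Independence LR statistic: first-order Markov transition counts
    n00 = np.sum((hits[:-1] == 0) & (hits[1:] == 0))
    n01 = np.sum((hits[:-1] == 0) & (hits[1:] == 1))
    n10 = np.sum((hits[:-1] == 1) & (hits[1:] == 0))
    n11 = np.sum((hits[:-1] == 1) & (hits[1:] == 1))
    pi01 = n01 / max(n00 + n01, 1)            # P(exception | no exception yesterday)
    pi11 = n11 / max(n10 + n11, 1)            # P(exception | exception yesterday)
    pi1 = (n01 + n11) / (n00 + n01 + n10 + n11)
    def ll_markov(p01, p11):
        return (n00 * np.log(1 - p01 + 1e-300) + n01 * np.log(p01 + 1e-300)
                + n10 * np.log(1 - p11 + 1e-300) + n11 * np.log(p11 + 1e-300))
    lr_ind = -2 * (ll_markov(pi1, pi1) - ll_markov(pi01, pi11))

    lr_cc = lr_uc + lr_ind                    # joint (conditional coverage) statistic
    return {"LR_uc": (lr_uc, chi2.sf(lr_uc, 1)),
            "LR_ind": (lr_ind, chi2.sf(lr_ind, 1)),
            "LR_cc": (lr_cc, chi2.sf(lr_cc, 2))}
```

Under the null, the coverage and independence statistics are each approximately chi-squared with one degree of freedom and the joint statistic approximately chi-squared with two, which is what the p-values in the sketch assume.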

9.3 STATISTICAL BACKTESTS BASED ON THE SIZES OF TAIL LOSSES

Figure 9.3 Predicted and empirical tail-loss distribution functions. [Plot of cumulative probability against tail losses, showing the theoretical distribution function and the empirical distribution function.]

Note: Estimates obtained for a random sample of size 1,000 drawn from a Student t-distribution with five degrees of freedom. The model assumes that the P/L process is normal, and the confidence level is 95%.

Suppose we have a random sample of 1,000 P/L observations drawn from a Student t-distribution with five degrees of freedom, but our model assumes that the P/L process is normal. If we select a 95% VaR confidence level, we would expect 5% of 1,000, or 50, loss observations to exceed our VaR. Given our sample, we estimate the VaR itself to be 2.153, and there are 47 tail-loss observations.

We now wish to assess the (in this case, incorrect) null hypothesis that the empirical and predicted distribution functions (DFs) are the same. We can form some idea of whether they are the same from the plot of the two tail-loss distributions shown in Figure 9.3, and casual observation would indeed suggest that the two distributions are quite different: the empirical DF is below the predicted DF, indicating that tail losses are higher than predicted. This casual impression is quite useful for giving us some ‘feel’ for our data, but we also need to check our impressions with a formal test of the distance between the two cdfs.

Perhaps the most straightforward formal test of this difference is provided by the Kolmogorov–Smirnov test, which gives us a test value of 0.2118 and a probability value for the null hypothesis of 2.49%. If we take the standard 5% significance level, our test result is significant and we would (rightly) reject the null hypothesis. The Kupiec test, by contrast, would have accepted the null hypothesis as correct.
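The following sketch illustrates the kind of calculation involved, under assumptions that mirror the example just described: 1,000 P/L observations drawn from a Student t with five degrees of freedom, a normal model fitted to the sample, and a 95% VaR. It uses Python with SciPy's kstest routine purely for illustration, and the exact VaR, exception count, KS statistic and p-value will differ from the 2.153, 47, 0.2118 and 2.49% quoted above because they depend on the particular random sample; the seed and names are arbitrary.

```python
import numpy as np
from scipy import stats

# "True" P/L: Student t with 5 degrees of freedom (illustrative sample)
pl = stats.t.rvs(df=5, size=1000, random_state=42)

# Fit the (misspecified) normal model and read off the 95% VaR of losses
mu, sigma = pl.mean(), pl.std(ddof=1)
cl = 0.95
var = -(mu + sigma * stats.norm.ppf(1 - cl))   # VaR as a positive loss number

losses = -pl
tail_losses = losses[losses > var]             # exceptions: losses beyond VaR

# Conditional distribution of losses beyond VaR implied by the normal model
loss_dist = stats.norm(loc=-mu, scale=sigma)
p_tail = loss_dist.sf(var)                     # 5% by construction
cond_cdf = lambda x: (loss_dist.cdf(x) - loss_dist.cdf(var)) / p_tail

ks_stat, p_value = stats.kstest(tail_losses, cond_cdf)
print(len(tail_losses), var, ks_stat, p_value)
```

As Box 9.2 below notes, the reported p-value is only approximate when the model's parameters are themselves estimated from the sample.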

The main difference between these tests is that the sizes-of-tail-losses test takes account of the sizes of losses exceeding VaR and the Kupiec test does not. The Kupiec test will not be able to tell the difference between a ‘good’ model that generates tail losses compatible with the model, and a ‘bad’ one that generates tail losses incompatible with the model, provided that they have the right tail-loss frequencies. By contrast, our sizes-of-tail-losses approach does take account of the difference between the two models, and should be able to distinguish between them. We should therefore expect this test to be more discriminating and, hence, more reliable than the Kupiec test. Some simulations I ran to check this out broadly confirmed this expectation, and suggest that the new test is practically feasible given the amount of data we often have to work with.
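A rough version of such a simulation is sketched below, under the same illustrative assumptions as before (t(5) P/L, a normal model, a 95% VaR, samples of 1,000). It compares how often a Kupiec frequency test and the sizes-of-tail-losses (KS) test reject the misspecified model at the 5% level. The number of trials, the helper names and the choice of a variance-matched normal model are assumptions made for the illustration, not details taken from the author's own simulations.

```python
import numpy as np
from scipy import stats

def kupiec_pvalue(n_exceptions, n, p):
    """Kupiec LR test p-value for the frequency of exceptions."""
    x, pi = n_exceptions, n_exceptions / n
    ll = lambda q: (n - x) * np.log(1 - q + 1e-300) + x * np.log(q + 1e-300)
    lr = -2 * (ll(p) - ll(pi))
    return stats.chi2.sf(lr, 1)

def tail_sizes_pvalue(losses, loss_dist, cl=0.95):
    """KS p-value for the sizes of losses beyond the model's VaR."""
    var = loss_dist.ppf(cl)
    tail = losses[losses > var]
    cond_cdf = lambda x: (loss_dist.cdf(x) - cl) / (1 - cl)
    return stats.kstest(tail, cond_cdf).pvalue if len(tail) > 0 else 1.0

# Rough power comparison: t(5) data judged against a variance-matched normal model
rng = np.random.default_rng(0)
n_trials = 500
rejections = np.zeros(2)
model = stats.norm(loc=0, scale=np.sqrt(5 / 3))    # normal with the variance of t(5)
for _ in range(n_trials):
    losses = -rng.standard_t(df=5, size=1000)
    n_exc = np.sum(losses > model.ppf(0.95))
    rejections += [kupiec_pvalue(n_exc, 1000, 0.05) < 0.05,
                   tail_sizes_pvalue(losses, model) < 0.05]
print(rejections / n_trials)    # rejection frequencies of the two tests
```

Because the t(5) data produce roughly the right exception frequency but fatter tail losses than the normal model predicts, the frequency test should reject only rarely while the sizes-based test rejects far more often.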

Box 9.2 Testing the Difference between Two Distributions

There are often situations (e.g., when applying the Crnkovic–Drachman or other backtests) where we wish to test whether two distribution functions — an empirical distribution and a theoretical or predicted distribution — are significantly different from each other. We can carry out this type of test using a number of alternative test statistics.9

The most popular is the Kolmogorov–Smirnov (KS) test statistic, which is the maximum value of the absolute difference between the two distribution functions. The KS statistic is easy to calculate and its critical values can be approximated by a suitable program (e.g., using the routine described in Press et al. (1992, pp. 624–626)). We can also calculate both the test statistic and its significance using MATLAB’s ‘kstest’ function. However, the KS statistic tends to be more sensitive around the median value of the distribution and less sensitive around the extremes.

This means that the KS statistic is less likely to detect differences between the tails of the distributions than differences between their central masses — and this can be a problem for VaR and ETL estimations, where we are more concerned with the tails than the central mass of the distribution.

An alternative that avoids this latter problem is Kuiper’s statistic, and it is for this reason that Crnkovic and Drachman prefer the Kuiper statistic over the KS one. The Kuiper statistic is the sum of the maximum amount by which each distribution exceeds the other, and its critical values can be determined in a manner analogous to the way in which we obtain the critical values of the KS statistic. However, Crnkovic and Drachman (1996, p. 140) also report that the Kuiper test statistic is data-intensive: results begin to deteriorate with less than 1,000 observations, and are of little validity for less than 500.10 Both these tests are therefore open to objections, and how useful they might be in practice remains controversial.
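For concreteness, here is a minimal sketch of both statistics computed directly from their definitions, assuming a sample and a fully specified theoretical distribution function; obtaining p-values for the Kuiper statistic would still require the asymptotic formulae in Press et al. or Monte Carlo simulation, as discussed above. The function name is illustrative.

```python
import numpy as np
from scipy import stats

def ks_and_kuiper(sample, cdf):
    """Return the Kolmogorov-Smirnov and Kuiper statistics for `sample`
    against the theoretical distribution function `cdf`."""
    x = np.sort(np.asarray(sample))
    n = len(x)
    f = cdf(x)
    d_plus = np.max(np.arange(1, n + 1) / n - f)    # empirical above theoretical
    d_minus = np.max(f - np.arange(0, n) / n)       # theoretical above empirical
    return max(d_plus, d_minus), d_plus + d_minus   # KS statistic, Kuiper statistic

# Example: the KS statistic agrees with SciPy's own implementation
sample = stats.norm.rvs(size=500, random_state=0)
ks, kuiper = ks_and_kuiper(sample, stats.norm.cdf)
print(ks, stats.kstest(sample, "norm").statistic, kuiper)
```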

9.3.2 The Crnkovic–Drachman Backtest Procedure

Another size-based backtest is suggested by Crnkovic and Drachman (CD; 1995, 1996). The essence of their approach is to evaluate a market model by testing the difference between the empirical P/L distribution and the predicted P/L distribution, across their whole range of values.11 Their argument is that each P/L observation can be classified into a percentile of the forecasted P/L distribution, and if the model is good, the P/L observations classified this way should be uniformly distributed and independent of each other. This line of reasoning suggests two distinct tests:

• A test of whether the classified observations have a uniform density function over the range (0,1), which we can operationalise by testing whether the empirical distribution of classified observations matches the predicted distribution of classified observations.

• A test of whether the classified observations are independent of each other, which we can carry out by means of a standard independence test (e.g., the BDS test, as suggested by Crnkovic and Drachman).

9 The tests discussed in Box 9.2 — the Kolmogorov–Smirnov and Kuiper tests — are both tests of the differences between two continuous distributions. However, if we are dealing with discrete distributions, we can also test for differences between distributions using more straightforward chi-squared test procedures (see, e.g., Press et al. (1992, pp. 621–623)). These tests are very easy to carry out — and we can always convert continuous distributions into discrete ones by putting the observations into bins. See also Section 9.3.3 below.

10 One other cautionary point should be kept in mind: both these statistics presuppose that the parameters of the distributions are known, and if we use estimates instead of known true parameters, then we can’t rely on these test procedures and we should strictly speaking resort to Monte Carlo methods (Press et al. (1992, p. 627)) or an alternative test, such as the Lilliefors test (Lilliefors (1967)), which also examines the maximum distance between two distributions, but uses sample estimates of parameter values instead of parameter values that are assumed to be known.

11 Since this test uses more information than the frequency-of-tail-losses test, we would also expect it to be more reliable, and this seems to be broadly confirmed by results reported by Crnkovic and Drachman (1995).

The first of these is effectively a test of whether the predicted and empirical P/L (or L/P) distributions are the same, and is therefore equivalent to our earlier sizes-of-tail-losses test applied to all observations rather than just tail losses. This means that the main difference between the sizes-of-tail-losses test and the (first) Crnkovic–Drachman test boils down to the location of the threshold that separates our observations into tail observations and non-tail ones.12
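A minimal sketch of this first test might look as follows: each P/L observation is mapped through the forecast distribution function into a percentile, and the resulting series is tested against U(0,1). For simplicity the sketch uses the KS statistic rather than the Kuiper statistic preferred by Crnkovic and Drachman (footnote 12 notes that the choice of distance test is ancillary), and the fitted-normal forecast, the seed and the function names are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def cd_uniformity_test(pl, forecast_cdf):
    """First Crnkovic-Drachman-style check: map each P/L observation through the
    forecast distribution function and test the result against U(0,1)."""
    u = forecast_cdf(np.asarray(pl))          # classified observations (percentiles)
    ks = stats.kstest(u, "uniform")           # KS version; CD themselves prefer Kuiper
    return u, ks.statistic, ks.pvalue

# Illustration: t(5) P/L judged against a (misspecified) fitted normal forecast
pl = stats.t.rvs(df=5, size=1000, random_state=1)
fitted = stats.norm(loc=pl.mean(), scale=pl.std(ddof=1))
u, stat, pval = cd_uniformity_test(pl, fitted.cdf)
print(stat, pval)
```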

The impact of the threshold is illustrated in Figure 9.4. If we take the threshold to be the VaR at the 95% confidence level, as we did earlier, then we get the high threshold indicated in the figure, with only 5% of expected observations lying in the tail to the right: 95% of the observations used by the CD test will be truncated, and there will be a large difference between the observation sets used by the two tests. However, if the threshold is sufficiently low (i.e., sufficiently to the left in the figure), we would expect all observations to lie to the right of it. In that case, we could expect the two tests to be indistinguishable.

Figure 9.4 The impact of alternative thresholds on tail losses. [Plot of probability against loss/profit, marking a very low threshold and a high threshold set at the VaR at the 95% confidence level.]

12 There are also certain minor differences. In particular, Crnkovic and Drachman suggest the use of Kuiper’s statistic rather than the Kolmogorov–Smirnov statistic in testing for the difference between the predicted and empirical distributions. Nonetheless, the choice of distance test is ancillary to the basic procedure, and we could easily carry out the sizes-of-tail-losses test using Kuiper’s statistic instead of the Kolmogorov–Smirnov one. Crnkovic and Drachman also suggest weighting Kuiper’s statistic to produce a ‘worry function’ that allows us to place more weight on the deviations of large losses from predicted values. This is a good suggestion, because it enables us to take account of the economic implications of alternative outcomes, and if we wish to, we can easily incorporate a ‘worry function’ into any of the backtests discussed in the text.

The bottom line is that we can regard the first CD test as a special case of the sizes-of-tail-losses test, and this special case occurs when the sizes-of-tail-losses test is used with a very low threshold. This, in turn, implies that the sizes-of-tail-losses test can be regarded as a generalization of the (first) CD test.

But which test is more helpful? The answer depends in part on the relevance of the non-tail observations in our data set to the tail observations (or tail quantiles, such as VaRs) that we are (presumably) more interested in. If we believed that the central-mass observations gave us useful and unbiased information relevant for the estimation of tail probabilities and quantiles, then we might want to make use of them to get the benefits of a larger sample size. This might suggest making use of the whole P/L sample — which in turn suggests we should use the CD test. However, there are many circumstances in which we might wish to truncate our sample and use the sizes-of-tail-losses test instead. We might be concerned that the distribution fitted to the tail did not fit the more central observations so well (e.g., as would be the case if we were fitting an extreme value distribution to the tail), or we might be concerned that the distribution fitted to the central-mass observations was not providing a good fit for the tail. In either case, including the more central observations could distort our results, and we might be better off truncating them.

There is, in effect, a sort of trade-off between variance and bias: making the tail larger (and, in the limit, using the CD test) gives us more observations and hence greater precision (or a lower variance); but if the extreme observations are particularly distinctive for any reason, then including the central ones (i.e., working with a larger tail or, in the limit, using the CD test) could bias our results. Strictly speaking, therefore, we should only use the CD test if we are confident that the whole empirical distribution of P/L is relevant to tail losses we are concerned about. In terms of our variance–bias trade-off, this would be the case only if concerns about ‘variance’ dominated concerns about ‘bias’. But if we are concerned about both variance and bias, then we should choose a tail size that reflects an appropriate balance between these factors — and this would suggest that we should use the sizes-of-tail-losses test with an appropriately chosen threshold.

Box 9.3 Tests of Independence

The second test in the Crnkovic–Drachman backtest procedure is a test of the independence of classified P/L observations. They suggest carrying this out with the BDS test of Brock et al. (1987). The BDS test is powerful but quite involved and data-intensive, and we might wish to test independence in other ways. For instance, we could use a likelihood ratio test (as in Christoffersen).

Alternatively, we could estimate the autocorrelation structure of our classified P/L observations, and these autocorrelations should be (jointly) insignificant if the observations are independent. We can then test for independence by testing the significance of the empirical autocorrelations, and implement such tests using standard econometrics packages or MATLAB (e.g., using the ‘crosstab’ command in the Statistics Toolbox or the ‘crosscor’ command in the Garch Toolbox).
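One simple way to implement the autocorrelation-based alternative is a Ljung-Box style test that the first few autocorrelations of the classified observations are jointly zero. The sketch below is an illustration of that idea rather than the BDS test itself, and the number of lags and the variable names are arbitrary choices.

```python
import numpy as np
from scipy import stats

def ljung_box(x, n_lags=10):
    """Ljung-Box Q-test that the first `n_lags` autocorrelations are jointly zero."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    acf = np.array([np.sum(x[k:] * x[:-k]) / np.sum(x * x)
                    for k in range(1, n_lags + 1)])
    q = n * (n + 2) * np.sum(acf**2 / (n - np.arange(1, n_lags + 1)))
    return q, stats.chi2.sf(q, n_lags)          # statistic and p-value

# Example: apply to a series of classified (percentile) observations
u = stats.uniform.rvs(size=1000, random_state=2)
q, pval = ljung_box(u)
print(q, pval)
```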

9.3.3 The Berkowitz Approach

There is also another, more useful, size-based approach to backtesting. Recall that the CD approach tells us that the distribution of classified P/L observations should be ‘iid U(0,1)’ distributed — both uniform over the range (0,1) and independently and identically distributed over time. Instead of testing these predictions directly, Berkowitz (2001) suggests that we transform these classified observations to make them normal under the null hypothesis, and we can obtain our transformed normal series by applying an inverse normal transformation to the uniform series (i.e., making use of the result that if x_t is iid U(0,1), then z_t = Φ⁻¹(x_t) is iid N(0,1)). What makes this suggestion so attractive is that once the data are transformed into normal form, we can apply much more powerful statistical tools than we can apply directly to uniform data. In particular, it now becomes possible to apply a battery of likelihood ratio tests and identify more clearly the possible sources of model failure.

One possible use of such a procedure is to test the null hypothesis (i.e., that z_t is iid N(0,1)) against a fairly general first-order autoregressive process with a possibly different mean and variance. If we write this alternative process as:

$$ z_t - \mu = \rho (z_{t-1} - \mu) + \varepsilon_t \qquad (9.2) $$

then the null hypothesis maintains that µ = 0, ρ = 0, and σ², the variance of ε_t, is equal to 1. The log-likelihood function associated with Equation (9.2) is known to be:

$$
\begin{aligned}
L(\mu, \sigma^2, \rho) = {}& -\frac{1}{2}\log(2\pi) - \frac{1}{2}\log\!\left[\frac{\sigma^2}{1-\rho^2}\right] - \frac{\bigl[z_1 - \mu/(1-\rho)\bigr]^2}{2\sigma^2/(1-\rho^2)} \\
&- \frac{T-1}{2}\log(2\pi) - \frac{T-1}{2}\log(\sigma^2) - \sum_{t=2}^{T} \frac{(z_t - \mu - \rho z_{t-1})^2}{2\sigma^2} \qquad (9.3)
\end{aligned}
$$

(Berkowitz (2001, p. 468)). The likelihood ratio test statistic for the null hypothesis is then:

$$ LR = -2\bigl[L(0, 1, 0) - L(\hat{\mu}, \hat{\sigma}^2, \hat{\rho})\bigr] \qquad (9.4) $$

where µ̂, σ̂ and ρ̂ are maximum likelihood estimates of the parameters concerned. LR is distributed under the null hypothesis as χ²(3), a chi-squared with three degrees of freedom. We can therefore test the null hypothesis against this alternative hypothesis by obtaining maximum likelihood estimates of the parameters, deriving the value of the LR statistic, and comparing that value against the critical value for a χ²(3). This is a powerful test because the alternative hypothesis is quite a general one and because, unlike the last two approaches just considered, this approach captures both aspects of the null hypothesis — uniformity/normality and independence — within a single test.
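A sketch of the whole procedure might look as follows: transform the classified observations with the inverse normal, maximise the log-likelihood of Equation (9.3) numerically, and compare the resulting LR statistic of Equation (9.4) against χ²(3). The clipping of the uniform series away from 0 and 1, the generic optimiser and the illustrative data are assumptions of the sketch, not part of Berkowitz's prescription.

```python
import numpy as np
from scipy import stats, optimize

def berkowitz_loglik(params, z):
    """Exact log-likelihood of Equation (9.3) for the AR(1) alternative."""
    mu, sigma2, rho = params
    if sigma2 <= 0 or abs(rho) >= 1:
        return -np.inf                              # outside the admissible region
    T = len(z)
    var1 = sigma2 / (1 - rho**2)                    # stationary variance of z_1
    ll = (-0.5 * np.log(2 * np.pi * var1)
          - (z[0] - mu / (1 - rho))**2 / (2 * var1))
    resid = z[1:] - mu - rho * z[:-1]
    ll += (-(T - 1) / 2 * np.log(2 * np.pi * sigma2)
           - np.sum(resid**2) / (2 * sigma2))
    return ll

def berkowitz_lr(u):
    """LR test of Berkowitz (2001): u are the classified (percentile) observations."""
    z = stats.norm.ppf(np.clip(u, 1e-6, 1 - 1e-6))  # inverse normal transform
    res = optimize.minimize(lambda p: -berkowitz_loglik(p, z),
                            x0=np.array([0.0, 1.0, 0.0]),   # start at the null values
                            method="Nelder-Mead")
    lr = -2.0 * (berkowitz_loglik([0.0, 1.0, 0.0], z) - (-res.fun))
    return lr, stats.chi2.sf(lr, df=3)

# Example: percentiles from a misspecified normal forecast of t(5) P/L
pl = stats.t.rvs(df=5, size=1000, random_state=3)
u = stats.norm(loc=pl.mean(), scale=pl.std(ddof=1)).cdf(pl)
print(berkowitz_lr(u))
```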

We can also adapt the Berkowitz approach to test whether the sizes of tail losses are consistent with expectations under the null hypothesis. The point here is that if the underlying data are fatter tailed than the model presumes, then the transformed z_t will also be fatter tailed than the normal distribution predicted by the null hypothesis. We can test this prediction by transforming our tail-loss data and noting that their likelihood function is a truncated normal log-likelihood function (i.e., because we are dealing only with values of z_t corresponding to tail losses, rather than the whole range of possible z_t values). We then construct the LR test in the same way as before: we estimate the parameters, substitute these into the truncated log-likelihood function, substitute the value of the latter into the equation for the LR test statistic, Equation (9.4), and compare the resulting test value against the critical value, bearing in mind that the test statistic is distributed as χ²(3) under the null hypothesis.

Box 9.4 A Risk-return Backtest

Beder et al. (1998, pp. 302–303) suggest a rather different backtest based on the realised risk–return ratio. Assume for the sake of argument that we have normally distributed P/L, with mean µ and standard deviation σ. Our VaR is then −µ − α_cl σ, and the ratio of expected P/L to VaR is −µ/(µ + α_cl σ) = −1/[1 + α_cl σ/µ]. If the model is a good one and there are no untoward developments over the backtesting period, we should find that the ratio of actual P/L to VaR is equal to this value, give or take a (hopefully well-behaved) random error. We can test this prediction formally (e.g., using Monte Carlo simulation), or we can use the ratio of actual P/L to VaR as an informal indicator to check for problems with our risk model.
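A possible implementation of the Monte Carlo version of this check is sketched below: simulate P/L paths under the assumed normal model, form the sampling distribution of the realised P/L-to-VaR ratio, and see where the actual ratio falls. Treating 'actual P/L' as the average P/L over the backtest period and the two-sided simulated p-value are simplifying assumptions of the sketch, as are the function and parameter names.

```python
import numpy as np
from scipy import stats

def risk_return_backtest(pl_realised, mu, sigma, cl=0.95, n_sims=10_000, seed=0):
    """Monte Carlo check of the realised P/L-to-VaR ratio against the model's
    predicted ratio -mu/(mu + alpha_cl*sigma) for normal P/L."""
    alpha = stats.norm.ppf(1 - cl)                 # lower-tail quantile, e.g. -1.645
    var = -mu - alpha * sigma                      # VaR as a positive number
    ratio_actual = np.mean(pl_realised) / var      # realised (average) P/L over VaR
    ratio_predicted = -mu / (mu + alpha * sigma)

    rng = np.random.default_rng(seed)
    T = len(pl_realised)
    sims = rng.normal(mu, sigma, size=(n_sims, T)).mean(axis=1) / var
    # two-sided simulated p-value for the observed ratio
    p = 2 * min(np.mean(sims <= ratio_actual), np.mean(sims >= ratio_actual))
    return ratio_actual, ratio_predicted, p

# Example: a year of daily P/L that is slightly worse than the model expects
pl = stats.norm.rvs(loc=0.02, scale=1.0, size=250, random_state=4)
print(risk_return_backtest(pl, mu=0.05, sigma=1.0))
```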