9.2 STATISTICAL BACKTESTS BASED ON THE FREQUENCY OF TAIL LOSSES
Having completed our preliminary data analysis, we turn now to formal statistical backtesting. All statistical tests are based on the idea that we first select a significance level, and then estimate the probability associated with the null hypothesis being ‘true’. Typically, we would accept the null hypothesis if the estimated value of this probability, the estimated prob-value, exceeds the chosen significance level, and we would reject it otherwise. The lower the significance level, the more likely we are to accept the null hypothesis, and the less likely we are to incorrectly reject a true model (i.e., to make a Type I error, to use the jargon). However, it also means that we are more likely to incorrectly accept a false model (i.e., to make a Type II error). Any test therefore involves a trade-off between these two types of possible error.1
In principle, we should select a significance level that takes account of the likelihoods of these errors (and, in theory, their costs as well) and strikes an appropriate balance between them. However, in practice, it is very common to select some arbitrary significance level such as 5% and apply that level in all our tests. A significance level of this magnitude gives the model a certain benefit of the doubt, and implies that we would reject the model only if the evidence against it is reasonably strong:
for example, if we are working with a 5% significance level, we would conclude that the model was adequate if we obtained any prob-value estimate greater than 5%.
A test can be said to be reliable if it is likely to avoid both types of error when used with an appropriate significance level.
9.2.1 The Basic Frequency-of-tail-losses (or Kupiec) Test
Perhaps the most widely used test is the basic frequency-of-tail-losses test (see Kupiec (1995)). The idea behind this approach is to test whether the observed frequency of tail losses (or frequency of losses that exceed VaR) is consistent with the frequency of tail losses predicted by the model. In particular, under the null hypothesis that the model is ‘good’ (i.e., consistent with the data), the number of tail losses x follows a binomial distribution:
$$\operatorname{Prob}(x \mid n, p) = \binom{n}{x}\, p^{x} (1-p)^{n-x} \qquad (9.1)$$
1 We should keep in mind that the critical value associated with the null hypothesis will depend on the alternative hypothesis (e.g., whether the alternative hypothesis is that the ‘true’ prob-value is different from, or greater than, or less than, the prob-value under the null hypothesis). However, in what follows we will assume that the alternative hypothesis is the last of these, namely, that the ‘true’ prob-value is less than the null-hypothesis prob-value.
where n is the number of P/L observations and p, the predicted frequency of tail losses, is equal to 1 minus the VaR confidence level. Given the values of the parameters n, p and x, the Kupiec test statistic is easily calculated using a suitable calculation engine (e.g., using the ‘binomdist’ function in Excel or the ‘binocdf’ function in MATLAB).
To implement the Kupiec test, we require data on n, p and x. The first two are easily found from the sample size and the VaR confidence level, and we can derive x from a set of paired observations of P/L and VaR each period. These paired observations could be the actual observations (i.e., observed P/L and associated VaR forecasts each period)2 or historical simulation ones (i.e., the historical simulation P/L we would have observed on a given portfolio, had we held it over the observation period, and the set of associated VaR forecasts).
For example, suppose we have a random sample of n = 1,000 P/L observations drawn from a portfolio. We take the confidence level to be 95%, and our model predicts that we should get np = 50 tail losses in our sample. With this sample, the number of tail-loss observations, x, is 55. The Kupiec test then gives us an estimated prob-value of 21%, where the latter is taken to be the estimated probability of 55 or more excess loss observations. At a standard significance level such as 5%, we would therefore have no hesitation in ‘passing’ the model as acceptable.
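For readers who wish to reproduce this calculation, a minimal sketch in Python follows, using scipy.stats.binom as a stand-in for the ‘binomdist’/‘binocdf’ functions mentioned above (variable names are illustrative). Note that the figure obtained depends on whether the tail is counted as ‘x or more’ or ‘more than x’ exceptions:

```python
# Sketch of the basic Kupiec frequency-of-tail-losses test.
from scipy.stats import binom

n = 1000    # number of P/L observations
cl = 0.95   # VaR confidence level
p = 1 - cl  # predicted frequency of tail losses
x = 55      # observed number of tail losses

# One-sided prob-value: probability of x or more tail losses under the
# null, i.e. P(X >= x) = 1 - P(X <= x - 1).
prob_value = binom.sf(x - 1, n, p)  # sf(k) = 1 - cdf(k)
print(f"expected tail losses: {n * p:.0f}")
print(f"P(X >= {x}) = {prob_value:.3f}")
# The alternative 'more than x' convention uses binom.sf(x, n, p),
# which is the convention that yields a figure close to 21% here.

significance = 0.05
print("model accepted" if prob_value > significance else "model rejected")
```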
The Kupiec test has a simple intuition, is very easy to apply and does not require a great deal of information. However, it also has some drawbacks:
• The Kupiec test is not reliable except with very large sample sizes.3
• Since it focuses exclusively on the frequency of tail losses, the Kupiec test throws away potentially valuable information about the sizes of tail losses.4 This suggests that the Kupiec test should be relatively inefficient, compared to a suitable test that took account of the sizes as well as the frequency of tail losses.5

2 Note, therefore, that the Kupiec test allows the portfolio or the VaR to change over time. The same goes for the other tests considered in this chapter, although it may sometimes be necessary (e.g., as with the basic sizes-of-excess-losses test considered below) to first apply some transformation to the data to make the tests suitably invariant. It should be obvious that any test that requires a fixed portfolio or constant VaR is seldom of much practical use.
3 Frequency-of-tail-loss tests have even more difficulty as the holding period rises. If we have a holding period longer than a day, we can attempt to apply these tests in one of two ways: by straightforward temporal aggregation (i.e., so we work with P/L and VaR over a period of h days rather than 1 day), and by using rolling h-day windows with 1-day steps (see, e.g., Tilman and Brusilovskiy (2001, pp. 85–86)). However, the first route cuts down our sample size by a factor of h, and the second is tricky to implement. When backtesting, it is probably best to work with data of daily frequency — or higher frequency, if that is feasible.
4 The Kupiec test also throws away useful information about the pattern of tail losses over time. If the model is correct, then not only should the observed frequency of tail losses be close to the frequency predicted by the model, but the sequence of observed indicator values — that is to say, observations that take the value 1 if the loss exceeds VaR and 0 otherwise — should be independently and identically distributed. One way to test this prediction is suggested by Engle and Manganelli (1999, pp. 9–12): if we define hit_t as the value of the indicator in period t minus the VaR tail probability, 1 − cl, then hit_t should be uncorrelated with any other variables in the current information set. (In this case, the indicator variable takes the value 1 if an exception occurs that day, and the value 0 otherwise.) We can test this prediction by specifying a set of variables in our current information set and regressing hit_t against them: if the prediction is satisfied, these variables should have jointly insignificant regression coefficients.
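To illustrate the regression in footnote 4, here is a hypothetical sketch in Python using statsmodels; the choice of the lagged hit and lagged VaR forecast as the information-set variables is an assumption made purely for illustration:

```python
# Illustrative sketch of the hit-regression test in footnote 4.
# The information set chosen here (a constant, the lagged hit, and the
# lagged VaR forecast) is an assumption made for illustration.
import numpy as np
import statsmodels.api as sm

def hit_regression_test(pnl, var_forecast, cl):
    """pnl: daily P/L; var_forecast: positive VaR forecasts; cl: VaR confidence level."""
    indicator = (pnl < -var_forecast).astype(float)  # 1 if loss exceeds VaR
    hit = indicator - (1 - cl)                       # hit_t = indicator - (1 - cl)
    X = sm.add_constant(np.column_stack([hit[:-1], var_forecast[:-1]]))
    y = hit[1:]
    res = sm.OLS(y, X).fit()
    # Under the null, the information-set variables should be jointly
    # insignificant; f_pvalue tests the slope coefficients jointly.
    return res.f_pvalue

# Toy usage: i.i.d. N(0,1) P/L against its own correct 95% VaR (1.645).
rng = np.random.default_rng(42)
pnl = rng.standard_normal(1000)
print(f"joint prob-value: {hit_regression_test(pnl, np.full(1000, 1.645), 0.95):.3f}")
```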
5 Nonetheless, one way to make frequency-of-tail-loss tests more responsive to the data is to broaden the indicator function. As noted already, in the case of a pure frequency-of-tail-losses test, the indicator function takes a value of 1 if we have a loss in excess of VaR and a value of 0 otherwise. We can broaden this function to award higher indicator values to higher tail losses (e.g., as in Tilman and Brusilovskiy (2001, pp. 86–87)), and so give some recognition to the sizes as well as the frequencies of tail losses. However, these broader indicator functions complicate the testing procedure, and I would suggest we are better off moving directly to sizes-of-tail-losses tests instead.
Box 9.1 Regulatory Backtesting Requirements
Commercial banks in the G-10 countries are obliged to carry out a set of standardised backtests prescribed by the 1996 Amendment to the 1988 Basle Accord, which lays down capital adequacy standards for commercial banks. The main features of these regulations are:6
• Banks must calibrate daily VaR measures to daily P/L observations, and these VaRs are predicated on a confidence level of 99%.
• Banks are required to use two different P/L series: actual net trading P/L for the next day; and the theoretical P/L that would have occurred had the position at the close of the previous day been carried forward to the next day.
• Backtesting must be performed daily.
• Banks must identify the number of days when trading losses, if any, exceed the VaR.

The results of these backtests are used by supervisors to assess the risk models, and to determine the multiplier (or hysteria) factor to be applied: if the number of exceptions during the previous 250 days is less than five, then the multiplier is 3; if the number of exceptions is five, the multiplier is 3.40, and so forth; and 10 or more exceptions warrant a multiplier of 4.
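This exception-to-multiplier mapping is the Basel ‘traffic light’ scheme; a minimal sketch of it follows, with the intermediate yellow-zone multipliers (elided by the ‘and so forth’ above) taken from the 1996 Basel table:

```python
# Sketch of the Basel 'traffic light' mapping from the number of
# exceptions over the previous 250 days to the capital multiplier.
# The endpoints come from the text; the intermediate yellow-zone
# values (6-9 exceptions) are taken from the 1996 Basel table.
def basel_multiplier(exceptions: int) -> float:
    yellow_zone = {5: 3.40, 6: 3.50, 7: 3.65, 8: 3.75, 9: 3.85}
    if exceptions < 5:
        return 3.00                  # green zone
    if exceptions >= 10:
        return 4.00                  # red zone
    return yellow_zone[exceptions]   # yellow zone

print([basel_multiplier(k) for k in (0, 5, 7, 10)])  # [3.0, 3.4, 3.65, 4.0]
```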
Leaving aside problems relating to the capital standards themselves, these backtesting rules are open to a number of objections:
• They use only one basic type of backtest, a Kupiec test, which is known to be unreliable except with very large samples.
• They ignore other backtests and don’t make use of valuable information about the sizes of exceptions.
• Models can fail the regulatory backtests in abnormal situations (e.g., a market crash or natural disaster) and lead banks to incur unwarranted penalties.
• The rules relating the capital multiplier to the number of exceptions are arbitrary, and there are concerns that the high scaling factor could discourage banks from developing and implementing best practice.
• Backtesting procedures might discourage institutions from reporting their ‘true’ VaR estimates to supervisors.

However, even if these problems were dealt with or at least ameliorated, there would always remain deeper problems: any regulatory prescriptions would inevitably be crude, inflexible, probably counterproductive, and certainly behind best market practice. It would be better if regulators did not presume to tell banks how they should conduct their backtests, and did not make capital requirements contingent on the results of their own preferred backtest procedure — and a primitive one at that.
9.2.2 The Time-to-first-tail-loss Test
There are also related approaches. One of these is to test for the time when the first tail loss occurs (see Kupiec (1995, pp. 75–79)). If the probability of a tail loss is p, the probability of observing the first tail loss in period T is p(1 − p)^(T−1), and the probability of observing the first tail loss by period T is 1 − (1 − p)^T: the time of the first tail loss obeys a geometric distribution.
6 For more on regulatory backtesting, see Crouhy et al. (1998, pp. 15–16).
[Figure 9.2 here: cumulative probability (vertical axis, 0 to 1) of the time of first tail loss (horizontal axis, 0 to 100).]
Figure 9.2 Probabilities for the time of first tail loss.
Note: Estimated for an assumed p-value of 5%, using the ‘geocdf’ function in MATLAB.
These probabilities are easily calculated, and Figure 9.2 shows a plot of the probability of observing our first tail loss by period T, for a p-value of 5%. The figure shows that the probability of observing our first loss by time T rises with T itself — for example, at T = 5, the probability of having observed a first tail loss is 22.6%; but for T = 50, the same probability is 92.3%.
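These figures are easily reproduced; a short sketch using scipy.stats.geom, the Python analogue of the MATLAB ‘geocdf’ function used for the figure, is:

```python
# Sketch of the time-to-first-tail-loss probabilities behind Figure 9.2.
# scipy.stats.geom models the number of trials up to and including the
# first 'success' (here, the first tail loss), so geom.cdf(T, p) equals
# 1 - (1 - p)**T, the probability of a first tail loss by period T.
from scipy.stats import geom

p = 0.05  # tail-loss probability for a 95% VaR
for T in (5, 50):
    print(f"P(first tail loss by T = {T}) = {geom.cdf(T, p):.3f}")
# Prints roughly 0.226 and 0.923, matching the figures in the text.
```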
However, this test is inferior to the basic Kupiec test because it uses less information: it only uses information since the previous tail loss, and in effect throws all our other information away. It is therefore perhaps best regarded as a diagnostic to be used alongside more powerful tests, rather than as a substitute for them.
9.2.3 A Tail-Loss Confidence-interval Test
A related alternative is to estimate a confidence interval for the number of tail losses, based on the available sample, and then check whether the expected number of tail losses lies within this interval.
We can construct a confidence interval for the number of tail losses by using the inverse of the tail-loss binomial distribution (e.g., by using the ‘binofit’ function in MATLAB). For example, if we have x = 55 tail losses out of n = 1,000 observations, then the 95% confidence interval for the number of tail losses is [42, 71]. Since this includes the number of tail losses (i.e., 50) we would expect under the null hypothesis that the model is ‘true’, we can conclude that the model is acceptable.
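A sketch of this calculation in Python is given below; scipy’s binomtest exposes a Clopper–Pearson interval for the ‘true’ tail-loss probability, analogous to MATLAB’s ‘binofit’, and scaling it by n converts it into an interval for the number of tail losses:

```python
# Sketch of the tail-loss confidence-interval test.
from scipy.stats import binomtest

n, x = 1000, 55          # sample size and observed tail losses
cl = 0.95                # VaR confidence level
expected = n * (1 - cl)  # expected tail losses under the null: 50

# Clopper-Pearson 95% interval for the tail-loss probability,
# scaled by n to give an interval for the number of tail losses.
ci = binomtest(x, n).proportion_ci(confidence_level=0.95)
lo, hi = n * ci.low, n * ci.high
print(f"95% CI for the number of tail losses: [{lo:.0f}, {hi:.0f}]")  # ~[42, 71]
print("model accepted" if lo <= expected <= hi else "model rejected")
```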
This approach uses the same data as the Kupiec test and should, in theory, deliver the same model assessment.7
7 The same goes for a final binomial alternative, namely, using binomial theory to estimate the confidence interval for the ‘true’ prob-value given the number of exceptions and the sample size.
9.2.4 The Conditional Backtesting (Christoffersen) Approach
A useful adaptation of these approaches is the conditional backtesting approach suggested by Christoffersen (1998). The idea here is to separate out the particular hypotheses being tested, and then test each hypothesis separately. For example, the full null hypothesis in a standard frequency-of-tail-losses test is that the model generates a correct frequency of exceptions and, in addition, that exceptions are independent of each other. The second assumption is usually subsidiary and made only to simplify the test. However, it raises the possibility that the model could fail the test, not because it generates the wrong frequency of failures, as such, but because failures are not independent of each other.
The Christoffersen approach is designed to avoid this problem. To use it, we break down the joint null hypothesis into its constituent parts, thus giving us two distinct sub-hypotheses: the sub-hypothesis that the model generates the correct frequency of tail losses, and the sub-hypothesis that tail losses are independent. If we make appropriate assumptions for the alternative hypotheses, then each of these hypotheses has its own likelihood ratio test, and these tests are easy to carry out. This means that we can test our sub-hypotheses separately, as well as test the original joint hypothesis that the model has the correct frequency of independently distributed tail losses.
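A compact sketch of the two likelihood ratio statistics, under the standard assumptions (a binomial likelihood for the frequency sub-hypothesis and a first-order Markov alternative for the independence sub-hypothesis; function and variable names are illustrative), follows:

```python
# Sketch of the Christoffersen (1998) likelihood ratio tests.
# LR_uc tests the frequency of tail losses, LR_ind tests their
# independence against a first-order Markov alternative, and their
# sum LR_cc tests the joint hypothesis. Assumes the sample contains
# at least one exception.
import numpy as np
from scipy.special import xlogy          # xlogy(0, y) = 0, avoiding log(0)
from scipy.stats import chi2

def christoffersen_tests(exceed, p):
    """exceed: 0/1 array of daily VaR exceptions; p: predicted tail-loss probability."""
    exceed = np.asarray(exceed)
    n, x = len(exceed), exceed.sum()
    pi = x / n                           # observed exception frequency
    # Unconditional coverage: predicted p vs observed pi.
    lr_uc = -2 * (xlogy(n - x, 1 - p) + xlogy(x, p)
                  - xlogy(n - x, 1 - pi) - xlogy(x, pi))
    # Transition counts n_ij: state i at t-1 followed by state j at t.
    n00 = np.sum((exceed[:-1] == 0) & (exceed[1:] == 0))
    n01 = np.sum((exceed[:-1] == 0) & (exceed[1:] == 1))
    n10 = np.sum((exceed[:-1] == 1) & (exceed[1:] == 0))
    n11 = np.sum((exceed[:-1] == 1) & (exceed[1:] == 1))
    pi01 = n01 / (n00 + n01)             # P(exception | none yesterday)
    pi11 = n11 / (n10 + n11)             # P(exception | exception yesterday)
    pi1 = (n01 + n11) / (n - 1)          # pooled exception probability
    # Independence: single exception probability vs Markov alternative.
    lr_ind = -2 * (xlogy(n00 + n10, 1 - pi1) + xlogy(n01 + n11, pi1)
                   - xlogy(n00, 1 - pi01) - xlogy(n01, pi01)
                   - xlogy(n10, 1 - pi11) - xlogy(n11, pi11))
    lr_cc = lr_uc + lr_ind               # joint conditional-coverage test
    return {"LR_uc prob-value": chi2.sf(lr_uc, df=1),
            "LR_ind prob-value": chi2.sf(lr_ind, df=1),
            "LR_cc prob-value": chi2.sf(lr_cc, df=2)}

# Toy usage: i.i.d. exceptions with the correct 5% frequency should pass.
rng = np.random.default_rng(1)
print(christoffersen_tests((rng.random(1000) < 0.05).astype(int), 0.05))
```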
The Christoffersen approach therefore helps us to separate out testable hypotheses about the dynamic structure of our tail losses from testable hypotheses about the frequency of tail losses. This is potentially useful because it not only indicates whether models fail backtests, but also helps to identify the reasons why.