AN HISTORICAL PERSPECTIVE ON FORECAST ERRORS.
Clements, Michael P. ; Hendry, David F.
 Michael P. Clements [*]
 David F. Hendry [**]
 Using annual observations on industrial production over the last
three centuries, and on GDP over a 100-year period, we seek an
historical perspective on the forecastability of these UK output
measures. The series are dominated by strong upward trends, so we
consider various specifications of this, including the local linear
trend structural time-series model, which allows the level and slope of
the trend to vary. Our results are not unduly sensitive to how the trend
in the series is modelled: the average sizes of the forecast errors of
all models, and the wide span of prediction intervals, attest to a
great deal of uncertainty in the economic environment. It appears that,
from an historical perspective, the postwar period has been relatively
more forecastable.
 Introduction
 How well or badly forecasters are doing should be measured relative
to the predictability of the relevant variables. Little is known about
any lower bounds to the predictability of macroeconomic time series, and
how such bounds change over time, but the historical properties of data
are open to analysis. Consequently, we investigate the historical
forecastability of UK output measures, using annual observations on UK
GDP and industrial production (IP) over the last 100 and 300 years
respectively, to gauge changes in predictability relative to some
'base-line' models. [1]
 The forecastability of a variable depends both upon its intrinsic
properties and on the model used. The former is not under the
investigators' control, but the epochs studied here witnessed many
fundamental technological, legal, social and political changes, most of
which had consequences that were not well understood till long
afterwards. Part of the apparent change in forecast accuracy may be
improvements in the measurement system itself. The forecasting model,
however, is under an investigator's control, and recent research
into the sources of forecast failure suggests using univariate
time-series forecasting models which are relatively robust to
deterministic shifts: the following section reviews the arguments. A
univariate approach obviates the need to model 'explanatory
variables', which is distinctly difficult over the long historical
periods we consider (see Hendry, 2000a, for models of several macro time
series over 1875-1990), and anyway need not improve forecast accuracy or
precision when underlying relationships are changing. Linear
time-series models (for example, of the Box and Jenkins, 1976, form) are
easily understood and estimated, and while they may exhibit
non-constancy, we allow for that by re-estimating the models'
parameters as the estimation and forecast windows move through the data,
allowing for gradual parameter evolution. We also consider a class of
models that allows the levels and slopes of the trend components of the
models to evolve through time. Casual observation of the data (see chart
1) shows that strong upward trends are the dominant features to be
modelled, so we focus on that aspect. Year-on-year growth rates have
been volatile, more so for IP than GDP, and their volatility has not
been constant over time.
 The paper has three objectives. First, we investigate whether the
'structural' time-series models employed here, which allow the
trend function to change over time, yield either more accurate point
forecasts, or better capture the uncertainty surrounding such forecasts,
than linear time-series models. To resolve this issue, we report a
number of measures of forecast accuracy that assess the performance of
models against each other, and consider whether the models'
prediction intervals are well calibrated. Secondly, we examine whether
the last century was relatively more quiescent (and therefore more
forecastable) than the 18th and 19th centuries. Finally, we seek to
identify whether periods when forecast performance was particularly poor
coincide with periods of turbulence in the economic or legislative
fields, including domestic and international conflict.
 Clements and Hendry (1998b, 1999a) argue that there should be such
a match, insofar as turbulence induces deterministic shifts, which are
the most pernicious source of forecast failure.
 The next section reviews the theoretical background that leads to
our choice of forecasting models, and is followed by a discussion of the
models we use. The fourth section describes the empirical forecasting
exercise and our results, and the fifth section concludes. Two
appendices contain more technical information. Appendix 1 describes the
derivation of the mean squared forecast error for the model with the
trend allowed to change over time, and appendix 2 provides an overview
of the measures used to evaluate forecast performance.
 Background
 Economic forecasting occurs in a non-stationary and evolving world,
where the model and the data generation process (DGP) are bound to
differ due to the complexity of the latter. Long historical time series
merely serve to highlight change; and change in turn reveals
mis-specifications in models of the DGP. Forecast failure then publicly
exposes the impact of unmodelled changes on models.
 Key attributes of forecasts are their accuracy and precision, and
although other features of the forecast-error distribution may matter,
we will focus on bias and variance as measures thereof, despite their
well-known drawbacks. Inaccurate or imprecise forecasts may simply
reflect the intrinsic difficulty of forecasting the given series.
Because there are no absolutes against which to assess the accuracy
and/or precision of a forecast, root mean-square forecast errors (which
weight the inaccuracy and imprecision together) will be compared across
models, and statistical tests used to check if one model is
significantly better than the others on this criterion. We also use
tests due to Christoffersen (1998) to check whether each model's
prediction intervals contain the appropriate number of realised values
of the process. However, forecast failure - which occurs when there is a
significant deterioration in forecast performance relative to the
anticipated outcome - is more easily detected, and can be assessed in a
number of ways: see, for example, the tests discussed in Kiviet (1986)
and Clements and Hendry (1999b).
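 As a minimal sketch of how the RMSFE weights inaccuracy and imprecision together, the Python fragment below computes the bias, the variance and the RMSFE of a set of forecast errors; the mean-square error equals the squared bias plus the variance. The function name and the three observations are purely illustrative and are not taken from our calculations.

```python
import math

def forecast_error_summary(actuals, forecasts):
    """Return (bias, variance, RMSFE) of the errors e = actual - forecast."""
    errors = [a - f for a, f in zip(actuals, forecasts)]
    n = len(errors)
    bias = sum(errors) / n
    # Divisor n (not n - 1), so that MSFE = bias^2 + variance holds exactly
    variance = sum((e - bias) ** 2 for e in errors) / n
    msfe = sum(e ** 2 for e in errors) / n
    return bias, variance, math.sqrt(msfe)

# Hypothetical outcomes and forecasts, for illustration only
bias, variance, rmsfe = forecast_error_summary([1.0, 2.0, 4.0], [0.5, 2.5, 3.0])
```

 A predictor can therefore have a smaller bias and still a larger RMSFE if its errors are sufficiently more variable.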
 Most forecasting models have three main components: deterministic
terms (like intercepts) whose future values are known; observed
stochastic variables (like prices) with unknown future values; and
unobserved errors all of whose values (past, present and future) are
unknown. Any, or all, of these components, or the relationships between
them, could be inappropriately formulated, inaccurately estimated, or
change in unanticipated ways. All nine types of mistake could induce
poor forecast performance, either from inaccurate or imprecise
forecasts. However, Clements and Hendry (1998b, 1999a) find that some
mistakes have pernicious effects on forecasts, whereas others are
relatively less important in most settings.
 This result follows from a taxonomy of forecast errors which allows
for structural change in the forecast period, the model and DGP to
differ over the sample period, the parameters of the model to be
estimated (possibly inconsistently) from (potentially inaccurate) data,
and the forecasts to commence from incorrect initial conditions. The
taxonomy reveals the central role in forecast failure of deterministic
shifts over the forecast period, whereas other problems, such as
poorly-specified models, inaccurate data, inadequate methodology,
over-parameterisation, or incorrect estimators seem less relevant.
Surprisingly, it is even difficult to detect shifts in parameters other
than those of deterministic terms (see for example, Hendry, 2000b). Of
course, all inadequacies in models reduce forecast performance relative
to the optimum, but the analytics, supported by Monte Carlo and
empirical studies, suggest that such effects are 'swamped' by
the large errors manifested in forecast failure.
 Such a theory reveals that many of the conclusions which can be
established formally for correctly-specified forecasting models of
constant-parameter processes no longer hold. In weakly stationary
processes -- or integrated processes reducible to stationarity by
differencing and cointegration -- a congruent, encompassing model
in-sample will dominate in forecasting at all horizons (see Hendry,
1995, for definitions). Causal variables will always improve forecasts
relative to non-causal; forecast accuracy will deteriorate as the
horizon increases; and there should be no forecast-accuracy gains from
pooling forecasts across methods or models: indeed, pooling refutes
encompassing. Forecast failure will be a rare event, precisely because
the future will be like the present and past. Instead, when the DGP is
not stationary and the model does not coincide with the DGP, then new
implications are that causal variables need not dominate non-causal in
forecasting; forecast failure will occur when the future changes from
the present, and is not predictable from in-sample tests, but need not
entail changes in estimated parameters nor invalidate a model; and
h-step ahead forecasts can 'beat' 1-step forecasts made h-1
periods later (h > 1). Moreover, the outcome of forecasting
competitions on economic time series will be heavily influenced by the
occurrence of unmodelled deterministic shifts that occur prior to
forecast-evaluation periods: models which are robust to such shifts will
do relatively well.
 Given that forecast failure mainly derives from deterministic
shifts, there are four potential solutions: differencing, co-breaking,
intercept corrections, and updating. Differencing lowers the polynomial degree of deterministic terms: in particular, double differencing
usually leads to a mean-zero, trend-free series, because continuous
acceleration is rare in economics (even during hyperinflations). The
recent study of the Norges Bank model in Eitrheim, Husebo and Nymoen
(1999) illustrates the effectiveness of that approach. Next,
deterministic non-stationarity can also be removed by co-breaking,
namely the cancellation of breaks across linear combinations of
variables (see for example, Clements and Hendry, 1999a, chapter 9).
Finally, intercept corrections can be shown to help robustify forecasts
against biases due to deterministic shifts, as can the closely-related
approach of updating the estimated intercept with an increased weight
accorded to more recent data.
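 The effect of differencing on deterministic shifts can be illustrated with two artificial series (hypothetical numbers, not the data analysed here): a linear trend whose level jumps, and a linear trend whose slope changes. The sketch below simply differences each series once and twice.

```python
def difference(series, times=1):
    """Apply the first-difference operator the given number of times."""
    for _ in range(times):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

# Illustrative series: slope 2, with the level jumping by 10 at t = 5
level_shift = [2 * t + (10 if t >= 5 else 0) for t in range(10)]
# Illustrative series: slope changing from 2 to 3 at t = 5
slope_shift = [2 * t + (t - 5 if t >= 5 else 0) for t in range(10)]

d1_level = difference(level_shift)       # constant 2 apart from one blip
d2_level = difference(level_shift, 2)    # zero apart from two adjacent blips
d1_slope = difference(slope_shift)       # steps from 2 to 3 and stays there
d2_slope = difference(slope_shift, 2)    # zero apart from a single blip
```

 A level shift shows up as a single outlier in the first differences, and a slope change as a single outlier in the second differences, so a predictor based on the differenced series errs at the break and is then back on track.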
 These implications help determine the class of model that might
minimise systematic forecast failure over the long periods under
analysis here.
 Models
 Because of the vexed question of whether output series possess unit
roots or are better described as stationary around a linear trend, we
estimate both types of model, namely difference stationary (DS) and
trend stationary (TS) models. The stochastic-trend model treats the
variable $\{y_t\}$ as integrated of order one, $y_t \sim I(1)$, as
in the random walk with drift:
 \[ y_t = y_{t-1} + \mu + \epsilon_t \quad \text{where } \epsilon_t \sim \mathrm{IN}[0, \sigma^2_\epsilon]. \tag{1} \]
 The TS model is ostensibly quite different, whereby $\{y_t\}$ is
stationary about a deterministic function of time, here taken to be a
simple linear trend:
 \[ y_t = \phi + \gamma t + u_t \quad \text{where } u_t \sim \mathrm{IN}[0, \sigma^2_u]. \tag{2} \]
 The disturbances in both models can be stationary Gaussian
autoregressive-moving average (ARMA) processes without fundamentally
altering the properties of these models (see Clements and Hendry, 2001).
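 To make the contrast between the two models concrete, the following sketch computes h-step forecasts from each in the simplest case of white-noise disturbances. It assumes that the drift in (1) is estimated by the mean of the first differences and that the intercept and slope in (2) are estimated by least squares; the function names and the data are hypothetical.

```python
def ds_forecast(y, h):
    """Random walk with drift: last observation plus h times the mean growth."""
    drift = (y[-1] - y[0]) / (len(y) - 1)     # mean of the first differences
    return y[-1] + h * drift

def ts_forecast(y, h):
    """Linear trend fitted by OLS on t = 1..n, extrapolated to n + h."""
    n = len(y)
    t_bar = (n + 1) / 2.0
    y_bar = sum(y) / n
    s_ty = sum((t - t_bar) * (v - y_bar) for t, v in enumerate(y, start=1))
    s_tt = sum((t - t_bar) ** 2 for t in range(1, n + 1))
    gamma = s_ty / s_tt                       # estimated trend coefficient
    phi = y_bar - gamma * t_bar               # estimated intercept
    return phi + gamma * (n + h)

# Hypothetical series whose level jumps in the final period
y = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 6.0]
ds_1, ts_1 = ds_forecast(y, 1), ts_forecast(y, 1)
```

 Here the DS forecast (about 6.71) starts from the last observation, whereas the TS forecast (5.75) reverts towards the fitted line, which is why the two models respond so differently to a shift in level.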
 However, both these models can be viewed as special cases of the
'structural time-series' class of models of Harvey (1989)
(Proietti, 2001, provides a review). Ignoring cyclical and seasonal
components, the local linear trend (LLT) structural time-series model
can be written as:
 \[ y_t = \mu_t + \epsilon_t \tag{3} \]
 where:
 \[ \mu_t = \mu_{t-1} + \beta_{t-1} + \eta_{t-1} \tag{4} \]
 \[ \beta_t = \beta_{t-1} + \zeta_{t-1} \]
 and the disturbance terms $\epsilon_t$, $\eta_t$ and $\zeta_t$ are
zero mean, uncorrelated white noise, with variances given by
$\sigma^2_\epsilon$, $\sigma^2_\eta$ and $\sigma^2_\zeta$. The
interpretation of this model is that $\mu_t$ is the (unobserved)
trend component of the time series, and $\epsilon_t$ is the
irregular component. $\eta_t$ affects the level of the trend, and
$\zeta_t$ allows its slope ($\beta$) to change. We investigate
whether this more flexible approach leads to improved forecast accuracy
over (1) and (2).
 Differencing the equation for $\mu_t$ in (4):
 \[ \Delta^2 \mu_t = \Delta\beta_{t-1} + \Delta\eta_{t-1} = \zeta_{t-2} + \Delta\eta_{t-1}, \]
 and substituting into (3) differenced twice, we find:
 \[ \Delta^2 y_t = \zeta_{t-2} + \Delta\eta_{t-1} + \Delta^2\epsilon_t, \]
 which is a restricted ARIMA(0, 2, 2). [2]
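 The MA(2) structure, and the sense in which it is restricted, can be seen by writing out the autocovariances of the twice-differenced series. The following is a worked sketch under the stated assumption of mutually uncorrelated white-noise disturbances; the symbols $w_t$ and $c_k$ are introduced here for illustration only.

```latex
\begin{aligned}
w_t &\equiv \Delta^2 y_t
     = \zeta_{t-2} + \eta_{t-1} - \eta_{t-2}
       + \epsilon_t - 2\epsilon_{t-1} + \epsilon_{t-2}, \\
c_0 &= \mathrm{Var}(w_t)
     = \sigma^2_\zeta + 2\sigma^2_\eta + 6\sigma^2_\epsilon, \\
c_1 &= \mathrm{Cov}(w_t, w_{t-1})
     = -\sigma^2_\eta - 4\sigma^2_\epsilon, \\
c_2 &= \mathrm{Cov}(w_t, w_{t-2})
     = \sigma^2_\epsilon, \\
c_k &= 0 \quad (k \geq 3).
\end{aligned}
```

 Since the three variances are non-negative, the implied autocorrelations satisfy $-2/3 \leq c_1/c_0 \leq 0$ and $0 \leq c_2/c_0 \leq 1/6$, a narrower range than an unrestricted second-order moving average allows.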
 Estimation of the unknown error variances is by maximum likelihood,
most easily achieved after casting the model into state-space form
(SSF), for example:
 \[ y_t = Z\alpha_t + G\boldsymbol{\epsilon}_t \]
 \[ \alpha_{t+1} = T\alpha_t + H\boldsymbol{\epsilon}_t, \]
 which define the measurement and state equations respectively,
where $\alpha_t = (\mu_t \;\; \beta_t)'$, $Z = (1 \;\; 0)$,
$G = (\sigma_\epsilon \;\; 0 \;\; 0)$,
$\boldsymbol{\epsilon}_t' = (\epsilon_t/\sigma_\epsilon \;\; \eta_t/\sigma_\eta \;\; \zeta_t/\sigma_\zeta)$
and:
 [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
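 The expression that could not be reproduced presumably defines the remaining system matrices T and H. Matching the state equation to (4), with the state vector and the standardised disturbance vector as defined above, they would take the following form; this is an inference from the surrounding equations and not a transcription of the original display.

```latex
T = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix},
\qquad
H = \begin{pmatrix} 0 & \sigma_\eta & 0 \\ 0 & 0 & \sigma_\zeta \end{pmatrix}
```

 With these choices the first row of the state equation reproduces the level equation in (4) and the second row reproduces the slope equation, each dated one period ahead.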
 For given values of the hyperparameters ($\sigma_\epsilon$,
$\sigma_\eta$, $\sigma_\zeta$), the Kalman filter is used to
obtain the prediction-error decomposition, that is, the one-step ahead
independent prediction errors, from which the likelihood can be
assembled. The likelihood is then maximised over these hyperparameters
by a numerical optimisation routine.
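 The prediction-error decomposition can be sketched as follows for the LLT model. This is not the Gauss code used for the results reported here: it is a minimal Python illustration that assumes a large but finite prior variance in place of an exact diffuse initialisation, drops the first two prediction errors from the likelihood, and uses a coarse grid search as a stand-in for a numerical optimiser. All names and the data are hypothetical.

```python
import math

def llt_filter(y, var_eps, var_eta, var_zeta, diffuse=1e7, burn=2):
    """Kalman filter for the local linear trend model.

    Returns (log-likelihood, filtered (level, slope) at the last
    observation, filtered covariance (p11, p12, p22)).
    """
    m, b = 0.0, 0.0                          # predicted level and slope
    p11, p12, p22 = diffuse, 0.0, diffuse    # assumed near-diffuse prior
    loglik = 0.0
    for t, obs in enumerate(y):
        v = obs - m                          # one-step prediction error
        f = p11 + var_eps                    # its variance
        if t >= burn:                        # assumption: skip start-up errors
            loglik -= 0.5 * (math.log(2.0 * math.pi * f) + v * v / f)
        k1, k2 = p11 / f, p12 / f            # Kalman gain
        m_f, b_f = m + k1 * v, b + k2 * v    # filtered state
        q11 = p11 - k1 * p11                 # filtered covariance
        q12 = p12 - k1 * p12
        q22 = p22 - k2 * p12
        m, b = m_f + b_f, b_f                # predict the next state
        p11 = q11 + 2.0 * q12 + q22 + var_eta
        p12 = q12 + q22
        p22 = q22 + var_zeta
    return loglik, (m_f, b_f), (q11, q12, q22)

def grid_search(y, grid):
    """Crude stand-in for an optimiser: best variance triple on a grid."""
    candidates = [(e, n, z) for e in grid for n in grid for z in grid]
    return max(candidates, key=lambda c: llt_filter(y, *c)[0])

# Hypothetical series, for illustration only
y = [1.0, 1.6, 1.9, 2.7, 3.1, 3.4, 4.2, 4.5, 5.3, 5.6]
best = grid_search(y, [0.01, 0.1, 1.0])
```

 In practice a gradient-based routine would replace the grid, and the variances would usually be parameterised so that they remain non-negative during the search.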
 Letting $I_T$ denote available information, the h-step ahead
forecasts are given by:
 \[ E[y_{T+h} \mid I_T] = Z\,E[\alpha_{T+h} \mid I_T] = Z T^h \alpha_{T|T} \]
 since $E[G\boldsymbol{\epsilon}_{T+h} \mid I_T] = 0$, where $\alpha_{T|T}$
is the smoothed estimate of the state vector at time T. For the LLT
model, after estimation:
 \[ \hat{y}_{T+h|T} = E[y_{T+h} \mid I_T] = \mu_{T|T} + h\beta_{T|T} \]
 with an h-step ahead mean-square forecast error (MSFE) of:
 \[ \mathrm{MSFE}[y_{T+h|T}] = p_{11} + 2h p_{12} + h^2 p_{22} + h\sigma^2_\eta + h(h-1)(2h-1)\sigma^2_\zeta/6 + \sigma^2_\epsilon \tag{5} \]
 where $P = P_{T|T} = \{p_{ij}\}$ is the smoothed estimate of
the covariance of the state vector: details are provided in the
appendix. The coefficient on $\sigma^2_\zeta$ is
$\sum_{k=1}^{h-1} k^2 = h(h-1)(2h-1)/6$, the accumulated effect of the
slope disturbances on the level.
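 The cumulated variances entering the h-step MSFE can also be obtained numerically, by iterating the state covariance forward h periods and then adding the variance of the irregular component. The sketch below does this for the LLT transition implied by (4); the function name and the input values are hypothetical.

```python
def llt_msfe(h, p11, p12, p22, var_eps, var_eta, var_zeta):
    """h-step ahead MSFE of y for the local linear trend model."""
    for _ in range(h):
        # P <- T P T' + diag(var_eta, var_zeta), with T = [[1, 1], [0, 1]]
        p11, p12, p22 = (p11 + 2.0 * p12 + p22 + var_eta,
                         p12 + p22,
                         p22 + var_zeta)
    return p11 + var_eps      # add the irregular component's variance

# Hypothetical inputs: state known exactly at T (P = 0),
# var_eps = 1.0, var_eta = 0.5, var_zeta = 0.25
msfe_by_horizon = [llt_msfe(h, 0.0, 0.0, 0.0, 1.0, 0.5, 0.25)
                   for h in range(1, 5)]
```

 With these inputs the MSFE rises from 1.5 at one step to 6.5 at four steps, the slope disturbance accounting for a growing share of the total as the horizon lengthens.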
 Suppose now that $\sigma^2_\zeta = 0$, then from (4),
$\Delta\mu_t = \beta + \eta_{t-1}$, so that differencing
(3) and substituting delivers:
 \[ \Delta y_t = \beta + \eta_{t-1} + \Delta\epsilon_t, \]
 which is the same as (1) when the disturbance in the latter is an
MA(1), since $\eta_{t-1} + \Delta\epsilon_t$ has no autocorrelation
beyond lag one. Thus, the DS model is a structural time-series model in
which the 'slope' does not change. The TS model can be obtained as a
special case of the structural time-series model by setting
$\sigma^2_\eta = \sigma^2_\zeta = 0$ so that
(4) becomes:
 \[ \mu_t = \mu_{t-1} + \beta = \mu_0 + \beta t. \tag{6} \]
 (6) indicates that the trend has a constant 'slope'
($\sigma^2_\zeta = 0$), and the 'level' is not
subject to stochastic shocks ($\sigma^2_\eta = 0$).
Substituting into (3) results in:
 \[ y_t = \mu_0 + \beta t + \epsilon_t \]
 which is identical to (2) when $\mu_0 = \phi$, $\beta = \gamma$
and $\epsilon_t = u_t$. Thus, the trend-stationary model
is a limiting case of the structural time-series model in which the
level and the slope of the trend component are constant over time.
 As noted in the previous section, devices are available to
robustify forecasts against breaks in means or growth rates. These are
discussed at length in Clements and Hendry (1998a), drawing on, inter
alia, Clements and Hendry (1996, 1998a), where it is shown that bias can
often be reduced at the cost of greater imprecision. We experimented
with a 'double-difference' (DD) predictor, whereby forecasts
are generated by solving $\Delta^2 y_t = 0$ for t = T+1, ...,
T + h. Here, $\Delta = 1 - L$, $L^n x_t = x_{t-n}$, so
$\Delta^2 = (1 - L) - (1 - L)L$. This predictor instantly adapts to
changes in the growth rate of $y_t$: the growth rate observed at
period T is predicted into the future. Notice that the DS model
incorporates a unit root which will adapt for changes in the level of
the series, but not for changes in the growth rate, $\mu$. That is,
the growth rate $\mu$ is expected to hold globally. The LLT model
allows a local trend, as does a model which estimates $\mu$ on a
window of observations ending at the forecast origin. We term this model
DS-L, and calculate the growth rate from the last five years of data.
The DS model uses all the data up to the forecast origin (and may
include dynamics, but this is of secondary importance). Finally, we
consider 'intercept-correcting' the TS model forecasts by
adding in the average of the previous two errors (up to and including
the forecast origin error). This predictor is denoted TS-IC.
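 The three robustifying devices can be sketched in a few lines. The fragment assumes that the DS-L growth rate is the mean of the last five first differences, and that the errors used for the intercept correction are supplied by the user (for instance the two most recent 1-step errors of the TS model); the function names and numbers are hypothetical.

```python
def dd_forecast(y, h):
    """Double-difference predictor: project the latest observed growth."""
    return y[-1] + h * (y[-1] - y[-2])

def ds_local_forecast(y, h, window=5):
    """DS-L: drift estimated from the last `window` first differences."""
    drift = (y[-1] - y[-1 - window]) / window
    return y[-1] + h * drift

def intercept_corrected(ts_forecast, recent_errors):
    """TS-IC: add the average of the most recent errors to the TS forecast."""
    return ts_forecast + sum(recent_errors) / len(recent_errors)

# Hypothetical series whose growth doubles in the final period
y = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 7.0]
dd_2 = dd_forecast(y, 2)                        # uses growth of 2 per period
dsl_2 = ds_local_forecast(y, 2)                 # uses average growth of 1.2
ts_ic = intercept_corrected(10.0, [0.4, 0.6])   # illustrative TS forecast and errors
```

 The DD predictor adopts the latest growth rate in full, which removes bias after a break but passes every transitory fluctuation into the forecast; DS-L averages over five years and so adapts more slowly but with less noise.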
 Empirical forecast comparisons
 A number of sampling schemes could be adopted for forecast
evaluation (see, for example, West and McCracken, 1998, pp. 818-9). We
use a 'recursive' scheme, where the parameters are
re-estimated in each period as the forecast origin moves forward through
the sample. Thus, for an observation vector of length T, the model is
first specified and estimated on the data up to period
$T_0$ ($T_0 < T$), and a forecast (point or interval)
of $T_0 + 1$ to $T_0 + 4$ is made. Then, the model is
re-estimated on data up to and including $T_0 + 1$, and forecasts of
$T_0 + 2$ to $T_0 + 5$ are made, and so on.
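 The recursive scheme amounts to a loop over forecast origins, sketched below. The forecasting model is passed in as a function, and a random walk with estimated drift is used purely as an illustrative stand-in; the names, the data and the choice to score only realised outcomes are assumptions of the sketch.

```python
def recursive_forecasts(y, first_origin, max_h, forecaster):
    """Recursive scheme: re-fit on y[:origin] for each successive origin.

    forecaster(history, h) must return a point forecast h steps beyond
    the end of history. Returns {h: [(target_index, forecast, error)]}.
    """
    results = {h: [] for h in range(1, max_h + 1)}
    for origin in range(first_origin, len(y)):
        history = y[:origin]                  # data up to the forecast origin
        for h in range(1, max_h + 1):
            target = origin + h - 1           # zero-based index being forecast
            if target < len(y):               # only score realised outcomes
                forecast = forecaster(history, h)
                results[h].append((target, forecast, y[target] - forecast))
    return results

def drift_forecaster(history, h):
    """Illustrative stand-in model: random walk with estimated drift."""
    drift = (history[-1] - history[0]) / (len(history) - 1)
    return history[-1] + h * drift

# Hypothetical series, for illustration only
series = [0.0, 0.4, 1.1, 1.5, 2.2, 2.4, 3.1, 3.6, 4.0, 4.7]
out = recursive_forecasts(series, first_origin=5, max_h=4,
                          forecaster=drift_forecaster)
```

 Because the estimation sample grows with each origin, later forecasts rest on more data than earlier ones, in contrast to a rolling scheme with a fixed window length.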
 The specifications of the models are fixed, although we could have
optimised lag orders for each estimation sample according to some
information criterion, such as BIC (see Schwarz, 1978) or PIC (see
Phillips, 1994): Swanson and White (1997) found improved accuracy from
such a strategy in some cases. Prediction intervals were calculated
using the 'plug-in' approach of Box and Jenkins for the DS and
TS models. For the LLT model, a $100(1-\alpha)\%$ interval was calculated as
the central forecast $\pm\,\Phi^{-1}(\alpha/2)\sqrt{\mathrm{MSFE}[y_{T+h|T}]}$,
where $\mathrm{MSFE}[y_{T+h|T}]$ is given in (5),
and $\Phi$ is the standard normal distribution function. This approach
ignores parameter-estimation uncertainty (the model parameters are
simply replaced by sample estimates), and assumes the model disturbances
are Gaussian. Clements and Taylor (2000) survey this literature, and
suggest a preferable bootstrap approach, but these refinements would be
unlikely to alter the overall picture.
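 A plug-in Gaussian interval of this kind is straightforward to compute once the MSFE is available. The sketch below takes the nominal coverage probability directly as an argument (an assumption of the sketch, chosen to avoid ambiguity over notation) and uses the 1-step MSFE of 4.82 quoted below for the DS model of IP as an illustrative input.

```python
import math
from statistics import NormalDist

def prediction_interval(point, msfe, coverage):
    """Plug-in Gaussian interval with the given nominal coverage (0 < coverage < 1)."""
    z = NormalDist().inv_cdf(0.5 + coverage / 2.0)   # two-sided normal quantile
    half_width = z * math.sqrt(msfe)
    return point - half_width, point + half_width

# Central projection normalised to zero; MSFE in squared percentage points
lower, upper = prediction_interval(0.0, 4.82, 0.95)
```

 With the exact normal quantile of about 1.96 in place of the round figure of 2, the half-width is about 4.3 percentage points.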
 Industrial production (IP)
 We generated sequences of 1 to 4-step ahead forecasts from these
univariate time series models for the logarithm of the IP series
($y_t$) (1700-1988). For the unit root (or difference-stationary,
DS) model we specified an ARIMA (1,1,0), allowing a non-zero mean; and
for the trend-stationary (TS) model, we specified an ARIMA (2,0,0)
process for $u_t = y_t - \phi - \gamma t$, where $\phi$ is the
intercept and $\gamma$, the coefficient on the time trend (t), is the
underlying growth rate. Alternative specifications were generally no
better in terms of MSFE for the period as a whole, but we did not
undertake extensive specification searches. The LLT was as described
above, as were the DD, DS-L and TS-IC.
 The models were estimated from the period 1700 to 1758, and
forecasts were then made of the next four years. The same specifications
were re-estimated on data up to 1759, and 1 to 4-step ahead forecasts
were again generated. This process was continued up to an estimation
sample ending in 1987, with 1 to 4-step ahead forecasts of the
observations 1988 to 1991. At each forecast origin, we also calculated
prediction intervals, and recorded whether the actual fell within, or
outside the associated interval.
 For the forecast-evaluation exercise, we split the overall sample
of forecasts into three periods: 1759-1850, 1851-1944, 1945-1988 (for
1-step forecasts - for 2-steps ahead, the dates were 1760-1851, and so
on). For the whole sample of forecast errors, table 1 shows that the DS
and LLT models are more accurate than the TS model, with the LLT having
a small advantage at 3 and 4-steps ahead, although the RMSFEs (root
MSFEs) suggest considerable uncertainty for all models. For example, an
approximate 95 per cent prediction interval for a 1-step ahead forecast
using the DS model would be $\pm 2\sqrt{4.82}$, that is,
$\pm 4.4$ percentage points of the central projection. We have omitted
second-moment results for the DD, DS-L and TS-IC predictors, because
these predictors generally fared badly, and in only few cases yielded
minor improvements. As is evident from table 2, which records the biases
in the forecast errors of IP over the whole period for all the
predictors, this is due to higher variability of their forecast errors,
because they are manifestly much less biased. For example, the 1-step
forecasts of the DS model are on average too low by 1 per cent, those of
the LLT are too low by 1/2 per cent, while those of DD are virtually
unbiased: on average they are too low by only 0.03 per cent. This is
consistent with Clements and Hendry (1999a), suggesting that the
variance costs offset the bias reductions when weighted by the RMSFE
metric: for this reason, our subsequent focus is on the three models of
trend, DS, TS and LLT.
 Generally, the DS model is the most accurate on RMSFE for the
sub-periods as well. The 1945-88 sub-period is associated with less
uncertainty, but even here a 95 per cent interval would be $\pm 3.6$
percentage points of the central projection, rising to $\pm 5.2$
percentage points for a four-year ahead forecast. Table 3 indicates that
the DS forecasts at 1-step ahead are often significantly more accurate
than those of the TS model on MSFE. This is the case for the whole
period and the earliest sub-period, while at conventional significance
levels, the LLT is statistically no worse than the DS model at 1-step.
For the whole sample period, neither the standard nor modified versions
of the forecast-encompassing test reject the null that the DS 1-step
forecasts encompass those of the TS forecasts (in line with the findings
of the Diebold-Mariano test of equal accuracy), nor do they reject the
null that DS encompasses LLT, nor that LLT encompasses TS, but they
clearly reject when the tests are reversed, that is, TS does not
encompass DS, LLT does not encompass DS, and TS does not encompass LLT.
For the first sub-period, the LLT forecasts are generally preferred on
the encompassing criteria, while for 1851-1944, given that we have the
DS and TS forecasts, nothing is gained from having the LLT model
forecasts as well, though the converse is not true.
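 For 1-step errors, tests of this kind reduce to asking whether the mean of a suitably defined series is zero. The sketch below is not the procedure set out in appendix 2: it assumes squared-error loss for the equal-accuracy test, assumes the product form $e_{1t}(e_{1t} - e_{2t})$ for the encompassing test, applies no small-sample modification, and relies on a normal approximation. The function names and the error series are hypothetical.

```python
import math
from statistics import NormalDist

def dm_type_test(d):
    """Normal-approximation test that the mean of the series d is zero
    (no autocorrelation correction, as suits 1-step errors)."""
    n = len(d)
    d_bar = sum(d) / n
    var_d = sum((x - d_bar) ** 2 for x in d) / n
    stat = d_bar / math.sqrt(var_d / n)
    p_two_sided = 2.0 * (1.0 - NormalDist().cdf(abs(stat)))
    return stat, p_two_sided

def equal_accuracy(e1, e2):
    """Diebold-Mariano-type test under squared-error loss."""
    return dm_type_test([a * a - b * b for a, b in zip(e1, e2)])

def encompassing(e1, e2):
    """Assumed form: model 1 encompasses model 2 if mean of e1*(e1 - e2) is zero."""
    return dm_type_test([a * (a - b) for a, b in zip(e1, e2)])

# Hypothetical 1-step errors from two models, for illustration only
e_a = [1.0, -2.0, 1.5, -0.5, 0.8, -1.2]
e_b = [0.5, -1.0, 1.0, -0.25, 0.6, -0.9]
stat_acc, p_acc = equal_accuracy(e_a, e_b)
stat_enc, p_enc = encompassing(e_a, e_b)
```

 A positive equal-accuracy statistic indicates that the first model has the larger squared errors on average, and a significantly positive encompassing statistic indicates that the second model's forecasts contain information missing from the first.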
 The interval-forecast tests in table 4 indicate that, with the
exception of the third sub-period (1945-88), the unconditional coverage
of the 1-step prediction intervals is appropriate at the three nominal
levels we consider. In the third period there are far too few
'misses' (actuals outside the intervals) and this is signalled
by p-values below 5 per cent. Table 5 records the dates of the misses,
and confirms their general absence from the postwar period. The
whole-period test of the independence of the sequences of hits and
misses is more demanding, and rejects for all three models for 50 and 75
per cent intervals (and has a p-value of 0.052 for the TS model even for
the 95 per cent interval). [3] Table 5 confirms both the clustering of
misses for a particular model, and the high correlation of misses
between models. Consequently, for the whole period, all three models are
rejected at the 10 per cent level on the joint test of correct
conditional coverage of 50 and 75 per cent intervals. There is no
evidence against the models on the first sub-period, although the second
(1851-1944) resembles the whole period. Here again, though, there is
less evidence against the LLT model 50 per cent interval, indicating
that the LLT model may be able to characterise more accurately, or
account for, the uncertainty surrounding the point forecasts at
different points in time.
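 The three interval-forecast tests can be sketched as likelihood-ratio statistics computed from the sequence of hits and misses. The fragment below is a minimal illustration in the spirit of Christoffersen (1998), assuming a first-order Markov alternative for the independence test and treating the joint statistic as the sum of the other two; the function names and the artificial hit sequence are hypothetical.

```python
import math

def _xlogy(x, y):
    """x * log(y) with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)

def christoffersen(hits, coverage):
    """hits[t] = 1 if the outcome fell inside the interval, else 0.
    Returns p-values (unconditional coverage, independence, joint)."""
    n1 = sum(hits)
    n0 = len(hits) - n1
    pi_hat = n1 / len(hits)
    lr_uc = -2.0 * (_xlogy(n0, 1 - coverage) + _xlogy(n1, coverage)
                    - _xlogy(n0, 1 - pi_hat) - _xlogy(n1, pi_hat))
    n = [[0, 0], [0, 0]]                      # n[i][j]: transitions i -> j
    for prev, cur in zip(hits, hits[1:]):
        n[prev][cur] += 1
    from_miss, from_hit = n[0][0] + n[0][1], n[1][0] + n[1][1]
    pi01 = n[0][1] / from_miss if from_miss else 0.0
    pi11 = n[1][1] / from_hit if from_hit else 0.0
    pi2 = (n[0][1] + n[1][1]) / (from_miss + from_hit)
    ll_markov = (_xlogy(n[0][0], 1 - pi01) + _xlogy(n[0][1], pi01)
                 + _xlogy(n[1][0], 1 - pi11) + _xlogy(n[1][1], pi11))
    ll_indep = (_xlogy(n[0][0] + n[1][0], 1 - pi2)
                + _xlogy(n[0][1] + n[1][1], pi2))
    lr_ind = -2.0 * (ll_indep - ll_markov)
    lr_cc = lr_uc + lr_ind
    p_uc = math.erfc(math.sqrt(max(lr_uc, 0.0) / 2.0))    # chi-square(1) tail
    p_ind = math.erfc(math.sqrt(max(lr_ind, 0.0) / 2.0))  # chi-square(1) tail
    p_cc = math.exp(-max(lr_cc, 0.0) / 2.0)               # chi-square(2) tail
    return p_uc, p_ind, p_cc

# Artificial sequence: exactly 75 per cent hits, with misses evenly spaced
p_uc, p_ind, p_cc = christoffersen([1, 1, 1, 0] * 10, 0.75)
```

 In this artificial case the hit rate matches the nominal level exactly, so the unconditional test cannot reject, but the perfectly regular spacing of the misses is detected by the independence test; a sequence with no misses at all would instead fail the unconditional test.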
 Table 5 also notes any dates where wars (and the oil crises) may
have played a role. There is a link between forecast failure and
manifest turbulence, but it is not strong: many dates coincide with
wars, but there were also many other wars over that period which do not
seem to have had marked effects (note that the data were interpolated
during the Second World War). That the link is weak is perhaps
unsurprising: other events (financial crises, gold discoveries and so
on) and possible interactions have not been investigated, but we note
that there are 4 dates between 1929 and 1939. Because the models are
rather crude empirical [4] descriptions of the macroaggregates, events
which might be expected to result in failure are not always signalled,
and there are failures with no obvious causes.
 Gross domestic product (GDP)
 For the logarithm of the historical GDP series ($y_t$)
(1830-1991), we again generated sequences of 1 to 4-steps ahead
forecasts. For the DS model, we specified an ARIMA(1,1,0), allowing a
non-zero mean, and for the TS model, we specified an ARIMA (2,0,0) for
$u_t = y_t - \phi - \gamma t$. The LLT was as described in the
third section on models above, but with $\sigma^2_\zeta = 0$,
because the optimisation routines failed to solve when this parameter
was allowed to be non-zero. Thus the model is a restricted ARIMA
(0,1,1).
 The models were estimated on the period 1830 to 1888, and forecasts
were then made for the next four years. The same specifications were
re-estimated on data up to 1889, and 1 to 4-step ahead forecasts were
again made. This process was continued up to an estimation sample ending
in 1987, with 1 to 4-step ahead forecasts of the observations 1988 to
1991. Because there is only a sample of 100 forecasts, it did not seem
prudent to analyse subsets of forecasts.
 The RMSEs for GDP over the period 1889-1988 are comparable in
magnitude to those recorded for industrial production. The DS model
again appears to be the most accurate, followed by the 'LLT'
(recall $\sigma^2_\zeta = 0$). But the differences are not
statistically significant at 1-step ahead on the Diebold-Mariano test
applied to the 1-step forecast errors. The forecast-encompassing tests
are able to discriminate between the models. On pairwise comparisons, DS
encompasses, but is not encompassed by, TS, and similarly, DS
encompasses, but is not encompassed by, LLT. Comparing the two dominated
models, TS does not encompass LLT, but at the 5 per cent level we cannot
reject that LLT encompasses the TS (although we can at the 10 per cent
level). Turning now to the prediction intervals, at the 5 per cent
level, there is evidence against the independence of the 75 per cent
interval hits and misses for TS, and against the 90 per cent LLT
interval, as well as against the 90 per cent DS interval on the joint
test.
 The RMSE figures can be compared with those calculated by agencies
such as the Treasury and NIESR for their GDP forecasts (see, for
example, respectively, Mellis and Whittaker, 2000 and Poulizac, Weale
and Young, 1996), albeit that they are for the recent period and are
based on quarterly data. Nevertheless, Mellis and Whittaker (2000, Table
3.2, p. 41) give an RMSE of 1.89 for 4-step forecasts for the period
1971-96, compared with our figure for the 100-year period of 3 1/4 for the
DS model (see the last panel of table 1).
 Table 6 gives the dates of the misses for the 90 per cent
intervals. Many of the dates coincide with well-known events, though, as
before, there are important events that do not coincide with any
forecast failures.
 Conclusion
 In terms of the first issue -- which model appears to offer the
most useful characterisations of UK output measures historically -- the
findings support models with stochastic trends. However, the added
flexibility of the LLT model, which allows the slope and level of the
trend to change, appears to offer little benefit over the DS
model's unit root and fixed underlying growth rate, except perhaps
for forecasts of three or four years ahead. Nevertheless, the average
sizes of the forecast errors, and consequently the wide span of
prediction intervals, attest to a great deal of uncertainty in the
economic environment.
 In relation to the second issue, for IP, the postwar period has
been relatively more quiescent, with lower MSFEs, and few (if any)
misses using prediction intervals which (because of the way they have
been calculated) largely reflect the earlier periods. For GDP, forecast
failures occur somewhat less often postwar (about once per decade, but
bunched in the 1970s) than otherwise, but the 1920s stand out as the
hard-to-forecast decade.
 Finally, to resolve the third issue, perusal of tables 5 and 6
provides some support for our claim that forecast failure is associated
with turbulent periods. For GDP, for example, note the asterisks in the
three years following WW1, around the time of WW2, the OPEC oil price
hikes, combined in 1980 with the onset of recession. The match seems
less close for industrial production, in that several major wars do not
appear in the list, albeit that many others do, but as the data were
missing during the Second World War and were linearly interpolated, they
cannot reflect any 'boom-bust' that may actually have
occurred.
 (*.) Department of Economics, University of Warwick.
 (**.) Department of Economics, University of Oxford. Financial
support from the UK Economic and Social Research Council under grant no.
LI16251015 is gratefully acknowledged by both authors. All computations
were performed using code written in Gauss. Tommaso Proietti kindly
provided the Gauss code to estimate the local linear trend model.
 NOTES
 (1.) The data were kindly provided by Charlie Bean. The industrial
production series is the 'Output in Industry' series compiled
from Crafts and Harley (1992), p. 725; Mitchell (1988), p. 846; and
Central Statistical Office (1993): the data were missing during the
Second World War so were linearly interpolated between 1938 and 1946.
Real GDP in 1985 prices at factor cost comes from Mitchell (1988), p.
836; and Economic Trends Annual Supplement, (1993), corrected for the
exclusion of Southern Ireland.
 (2.) See, for example, Harvey (1989) or Proietti (2001) for the
restrictions on the lag 1 and 2 autocorrelations relative to an
unrestricted second-order MA process.
 (3.) The power of the test falls as the nominal coverage of the
interval increases. Intuitively, there are fewer misses
from which to deduce whether they are independently distributed amongst
the hits.
 (4.) See, for example, Hand (1999) on empirical versus iconic models.
 REFERENCES
 Box, G.E.P. and Jenkins, G.M. (1976), Time Series Analysis,
Forecasting and Control, San Francisco, Holden-Day (first published
1970).
 Central Statistical Office (1993), Economic Trends Annual
Supplement, London, HMSO.
 Chong, Y.Y. and Hendry, D.F. (1986), 'Econometric evaluation
of linear macro-economic models', Review of Economic Studies, 53,
pp. 671-90, reprinted in Granger, C.W.J. (ed.) (1990), Modelling
Economic Series, Oxford, Clarendon Press.
 Christoffersen, P.F. (1998), 'Evaluating interval
forecasts', International Economic Review, 39, pp. 841-62.
 Clements, M.P. and Hendry, D.F. (1996), 'Intercept corrections
and structural change', Journal of Applied Econometrics, 11, pp.
475-94.
 -- (1998a), 'Forecasting economic processes',
International Journal of Forecasting, 14, pp. 111-31.
 -- (1998b), Forecasting Economic Time Series: The Marshall
Lectures on Economic Forecasting, Cambridge, Cambridge University Press.
 -- (1999a), Forecasting Non-stationary Economic Time Series: The
Zeuthen Lectures on Economic Forecasting, Cambridge, Mass., MIT Press.
 -- (1999b), 'Modelling methodology and forecast failure',
Econometrics Journal (forthcoming).
 -- (2001), 'Forecasting with difference-stationary and
trend-stationary models', Econometrics Journal, 4, S1-19.
 Clements, M.P. and Taylor, N. (2001), 'Bootstrapping
prediction intervals for autoregressive models', International
Journal of Forecasting (forthcoming).
 Crafts, N.F.R. and Harley, C.K. (1992), 'Output growth and the
British Industrial Revolution: a restatement of the Crafts-Harley
view', Economic History Review, 45, pp. 703-30.
 Diebold, F.X. and Mariano, R.S. (1995), 'Comparing predictive
accuracy', Journal of Business and Economic Statistics, 13, pp.
253-63.
 Eitrheim, O., Husebo, T.A. and Nymoen, R. (1999),
'Equilibrium-correction versus differencing in macroeconometric
forecasting', Economic Modelling, 16, pp. 515-44.
 Ericsson, N.R. (1992), 'Parameter constancy, mean square
forecast errors, and measuring forecast performance: an exposition,
extensions, and illustration', Journal of Policy Modeling, 14, pp.
465-95.
 Hand, D.J. (1999), 'Discussion contribution on "Data
mining reconsidered: encompassing and the general-to-specific approach
to specification search" by Hoover and Perez', Econometrics
Journal, 2, pp. 241-3.
 Harvey, A.C. (1989), Forecasting, Structural Time Series Models and
the Kalman Filter, Cambridge, Cambridge University Press.
 Harvey, D., Leybourne, S. and Newbold, P. (1997), 'Testing the
equality of prediction mean squared errors', International Journal
of Forecasting, 13, pp. 281-91.
 -- (1998), 'Tests for forecast encompassing', Journal of
Business and Economic Statistics, 16, pp. 254-9.
 Hendry, D.F. (1995), Dynamic Econometrics, Oxford, Oxford
University Press.
 -- (2000a), 'Does money determine UK inflation over the long
run?' in Backhouse, R. and Salanti, A. (eds), Macroeconomics and
the Real World, Oxford, Oxford University Press.
 -- (2000b), 'On detectable and non-detectable structural
change', Structural Change and Economic Dynamics, 11, pp. 45-65.
 Kiviet, J.F. (1986), 'On the rigor of some mis-specification
tests for modelling dynamic relationships', Review of Economic
Studies, 53, pp. 241-61.
 Mellis, C. and Whittaker, R. (2000), 'The Treasury's
forecasts of GDP and the RPI: how have they changed and what are the
uncertainties?' in Holly, S. and Weale, M. (eds), Econometric
Modelling: Techniques and Applications, Cambridge, Cambridge University
Press.
 Mitchell, B.R. (1988), British Historical Statistics, Cambridge,
Cambridge University Press.
 Phillips, P.C.B. (1994), 'Bayes models and forecasts of
Australian macroeconomic time series' in Hargreaves, C. (ed.),
Non-stationary Time-series Analyses and Cointegration, Oxford, Oxford
University Press.
 Poulizac, D., Weale, M. and Young, G. (1996), 'The performance
of National Institute economic forecasts', National Institute
Economic Review, 156, pp. 56-62.
 Proietti, T. (2001), 'Forecasting with structural time series
models', mimeo, Dipartimento di Scienze Statistiche, Universita di
Udine, forthcoming in Clements, M.P. and Hendry, D.F. (eds) (2001), A
Companion to Economic Forecasting, Oxford, Basil Blackwell.
 Schwarz, G. (1978), 'Estimating the dimension of a
model', Annals of Statistics, 6, pp. 461-4.
 Swanson, N.R. and White, H. (1997), 'Forecasting economic time
series using flexible versus fixed specification and linear versus
nonlinear econometric models', International Journal of
Forecasting, 13, pp. 439-62.
 Wallis, K.F. (1995), 'Large-scale macroeconometric
modelling', in Pesaran, M.H. and Wickens, M.R. (eds), Handbook of
Applied Econometrics: Macroeconomics, Oxford, Basil Blackwell.
 West, K.D. and McCracken, M.W. (1998), 'Regression-based tests
of predictive ability', International Economic Review, 39, pp.
817-40.
Table 1 RMSE and MAE of multi-step forecast errors
            RMSE                   MAE
h      DS     TS    LLT      DS     TS    LLT
Industrial production forecast errors: 1759-1988
1    4.82   5.08   4.95    3.82   4.07   3.97
2    6.56   7.07   6.55    5.20   5.70   5.18
3    7.72   8.60   7.63    6.18   7.12   6.01
4    8.60   9.84   8.49    6.89   8.20   6.70
Industrial production forecast errors: 1759-1850
1    4.52   5.15   4.41    3.64   4.25   3.55
2    5.67   6.98   5.43    4.44   5.63   4.27
3    6.44   8.68   6.16    5.16   7.40   4.87
4    7.66  10.60   7.28    6.01   9.10   5.66
Industrial production forecast errors: 1851-1944
1    5.56   5.58   5.87    4.46   4.47   4.87
2    7.75   7.80   7.91    6.39   6.42   6.54
3    9.32   9.39   9.34    7.73   7.78   7.72
4   10.08  10.19  10.14    8.24   8.32   8.22
Industrial production forecast errors: 1945-1988
1    3.58   3.60   3.68    2.82   2.84   2.93
2    5.41   5.46   5.42    4.25   4.32   4.21
3    6.27   6.37   6.23    4.98   5.15   4.76
4    6.86   7.01   6.85    5.86   6.05   5.60
GDP forecast errors: 1889-1988
1    3.25   3.49   3.39    2.30   2.43   2.41
2    5.50   6.08   5.58    3.94   4.22   4.00
3    7.31   8.05   7.30    5.25   5.73   5.27
4    8.58   9.41   8.48    6.19   6.70   6.21
Note: All RMSE and MAE figures have been multiplied by a hundred.
Table 2 Biases of multi-step forecast errors for industrial production, 1759-1988
h     DS     TS    LLT     DD  TS-IC   DS-L
1   0.93   1.57   0.46   0.03   0.21   0.05
2   1.60   2.64   0.71   0.05   0.40   0.10
3   2.34   3.75   0.96   0.07   0.61   0.15
4   3.03   4.76   1.18   0.06   0.80   0.17
Note: All figures have been multiplied by a hundred.
Table 3 Point forecast evaluation: tests of equal forecast accuracy and of forecast encompassing
             DS,TS           DS,LLT          TS,LLT
Industrial production forecast errors: 1759-1988
DM       -               0.147           0.859
FE(1)    0.166   -       0.249   -       -       0.114
FE(2)    0.094   -       0.325   0.002   -       0.143
Industrial production forecast errors: 1759-1850
DM       -               0.740           1.0
FE(1)    0.180   -       0.023   0.438   -       0.072
FE(2)    0.127   -       0.050   0.505   -       0.081
Industrial production forecast errors: 1851-1944
DM       0.092           0.058           0.076
FE(1)    0.305   0.192   0.962   0.002   0.820   0.002
FE(2)    0.237   0.133   0.963   0.002   0.823   0.003
Industrial production forecast errors: 1945-1988
DM       0.182           0.169           0.258
FE(1)    0.450   0.303   0.679   0.112   0.979   0.172
FE(2)    0.451   0.306   0.690   0.125   0.979   0.185
GDP forecast errors: 1889-1988
DM       0.147           0.122           0.735
FE(1)    0.397   -       0.618   0.005   0.002   0.079
FE(2)    0.528   0.004   0.748   0.064   0.003   0.082
Notes: Rows headed DM report p-values of Diebold-Mariano
tests of equal forecast accuracy or MSFE of x, y.
A value of 0.04, say, indicates that there is only
a 4 per cent chance that y is no less accurate than x.
A value of 0.80, say, indicates that there is a 20
per cent chance that x is no less accurate than y.
FE(1) is the standard forecast encompassing test,
where for x, y, the first entry is the p-value of the
null that x forecast encompasses y, and the second
entry is the p-value of the hypothesis that y
forecast encompasses x. FE(2) is the modified form
of the test given in Harvey et al. (1998). '-' denotes
zero to three decimal places.
Table 4 Interval forecast evaluation
              DS                        TS                        LLT
      Uncond.  Ind.    Cond.    Uncond.  Ind.    Cond.    Uncond.  Ind.    Cond.
Industrial production: 1759-1988
50%   0.235    -       -        0.692    -       0.001    0.147    0.071   0.068
75%   0.591    -       0.001    0.407    0.002   0.005    0.939    -       -
90%   0.105    0.412   0.193    0.825    0.052   0.149    0.827    0.112   0.275
Industrial production: 1759-1850
50%   0.210    0.304   0.269    0.300    0.546   0.484    0.060    0.781   0.163
75%   0.325    0.214   0.284    0.239    0.246   0.255    1.000    0.086   0.229
90%   0.427    0.280   0.407    0.350    0.710   0.603    0.238    0.375   0.336
Industrial production: 1851-1944
50%   0.302    0.003   0.008    0.215    0.001   0.002    0.148    0.121   0.105
75%   0.132    0.003   0.004    0.132    0.013   0.014    0.132    0.003   0.004
90%   0.890    0.235   0.489    0.838    0.071   0.192    0.037    0.383   0.078
Industrial production: 1945-1988
50%   0.015    0.094   0.013    0.015    0.094   0.013    0.006    0.445   0.017
75%   0.023    0.563   0.065    0.023    0.563   0.065    0.007    0.326   0.016
90%   0.002    NA      NA       0.002    NA      NA       0.181    0.659   0.371
GDP: 1889-1988
50%   0.071    0.857   0.193    0.317    0.557   0.510    0.161    0.191   0.159
75%   0.817    0.190   0.413    0.494    0.033   0.081    0.482    0.440   0.580
90%   0.118    0.051   0.044    0.206    0.422   0.325    0.063    0.021   0.012
Notes: The elements are p-values of the nulls of the Christoffersen
tests of correct unconditional coverage, independence, and correct
conditional coverage of 50, 75 and 90 per cent nominal intervals. '-'
indicates zero to three decimal places. 'NA' indicates that there
were no 'misses', so the $LR_{ind}$ (and therefore $LR_{cc}$) tests
are inapplicable.
Table 5 Dates of forecast failure
 DS TS LLT War
Industrial production: 1759-1988
1774 * - * w?
1792 - * - w
1799 * * - w
1810 - * - w
1811 - * - w
1815 - * - w
1817 - * * -
1825 * * * -
1827 - * * w?
1828 * * * -
1836 * * - -
1839 * * - w?
1842 - - * -
1844 * * - -
1867 - - * -
1869 - - * -
1870 * * - w?
1871 - * - -
1879 - - * w?
1885 - - * w?
1886 - - * -
1906 * * * -
1916 - - * w
1917 * * * w
1918 - - * w
1920 * * * a
1921 * * * a
1922 * * - a
1924 * * * -
1929 * * * -
1931 * * * -
1934 - - * -
1936 - - * -
1975 - - * o
1980 - - * o
No. 16 22 24
% 7 10 10
Notes: The dates record when 1-step forecasts lay outside their
90 per cent prediction intervals. (w) denotes a war involving the UK
directly, (w?) when there was a war ongoing, (a) denotes the aftermath
of World War I, and (o) the oil crises.
Table 6 Dates of forecast failure
 DS TS LLT
GDP: 1889-1988
1892 * * *
1894 * - *
1898 - * -
1903 - * -
1908 * * *
1915 * - *
1919 * * *
1920 * * *
1921 * * *
1922 * - -
1923 * - -
1926 * * *
1927 * * *
1931 * * *
1940 * * *
1941 - - *
1944 * * *
1945 - - *
1946 - - *
1973 * * *
1974 * - -
1975 - - *
1980 - - *
No. 15 14 16
% 15 14 16
Note: The dates record when 1-step forecasts lay
outside their 90 per cent prediction intervals.
 [Graph omitted]
 [Graph omitted]
 [Graph omitted]
 [Graph omitted]
 Appendix 1. Derivation of MSFE for the LLT
 Recall that the model in SSF is given by:
 $$y_t = Z\alpha_t + G\epsilon_t$$
 $$\alpha_{t+1} = T\alpha_t + H\epsilon_t,$$
 where $\alpha_t = (\mu_t \;\; \beta_t)'$, $Z = (1 \;\; 0)$,
$G = (\sigma_\epsilon \;\; 0 \;\; 0)$,
$\epsilon_t' = (\epsilon_t/\sigma_\epsilon \;\; \eta_t/\sigma_\eta \;\; \zeta_t/\sigma_\zeta)$
and:
 $$T = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}.$$
 Below, the matrix $P = \{p_{ij}\}$ is the updated smoothed MSE matrix
of the state vector. The h-step ahead forecasts are given by:
 $$E[y_{T+h} \mid I_T] = ZT^h\alpha_{T|T}.$$
 For the LLT model:
 $$y_{T+h|T} \equiv E[y_{T+h} \mid I_T] = \mu_{T|T} + h\beta_{T|T}.$$
 The h-step forecast error is:
 $$\begin{aligned}
 e_{T+h|T} &= y_{T+h} - y_{T+h|T} \\
 &= Z(\alpha_{T+h} - T^h\alpha_{T|T}) + G\epsilon_{T+h} \\
 &= ZT^h(\alpha_T - \alpha_{T|T}) + Z\sum_{i=0}^{h-1} T^iH\epsilon_{T+h-1-i} + G\epsilon_{T+h} \qquad (7)
 \end{aligned}$$
 because:
 $$\alpha_{T+h} = T^h\alpha_T + \sum_{i=0}^{h-1} T^iH\epsilon_{T+h-1-i}.$$
 The three terms in (7) are uncorrelated, so:
 $$E[e_{T+h|T}^2] = ZT^hPT^{h\prime}Z' + Z\left(\sum_{i=0}^{h-1} T^iHH'T^{i\prime}\right)Z' + GG'$$
 which for the LLT specialises to:
 $$\mathrm{MSFE} = p_{11} + 2hp_{12} + h^2p_{22} + h\sigma_\eta^2 + h(h-1)(2h-1)\sigma_\zeta^2/6 + \sigma_\epsilon^2.$$
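 The specialisation from the matrix expression to the closed form can be checked numerically. The sketch below is illustrative only: the function names and all numerical values are hypothetical, and it assumes that H selects the level and slope disturbances so that HH' is the diagonal matrix with entries sigma_eta^2 and sigma_zeta^2, and that GG' equals sigma_epsilon^2. These assumptions are consistent with the closed form above but H is not written out in this appendix.

```python
def mat_mul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def mat_pow(t, h):
    """Return t**h for a 2x2 matrix t and an integer h >= 0."""
    result = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(h):
        result = mat_mul(result, t)
    return result


def llt_msfe_matrix(p, var_eps, var_eta, var_zeta, h):
    """MSFE from Z T^h P T^h' Z' + Z sum_i T^i HH' T^i' Z' + GG'."""
    z = [[1.0, 0.0]]
    t = [[1.0, 1.0], [0.0, 1.0]]
    # Assumption: HH' = diag(var_eta, var_zeta) and GG' = var_eps.
    hh = [[var_eta, 0.0], [0.0, var_zeta]]

    def quad(m, i):
        zt = mat_mul(z, mat_pow(t, i))        # the 1x2 row (1, i)
        col = [[zt[0][0]], [zt[0][1]]]        # its transpose
        return mat_mul(mat_mul(zt, m), col)[0][0]

    return quad(p, h) + sum(quad(hh, i) for i in range(h)) + var_eps


def llt_msfe_closed(p, var_eps, var_eta, var_zeta, h):
    """Closed-form MSFE for the LLT model, as stated in the text."""
    return (p[0][0] + 2 * h * p[0][1] + h * h * p[1][1]
            + h * var_eta + h * (h - 1) * (2 * h - 1) * var_zeta / 6
            + var_eps)


# Hypothetical state MSE matrix and disturbance variances.
P_EXAMPLE = [[0.5, 0.1], [0.1, 0.2]]
for horizon in range(1, 5):
    a = llt_msfe_matrix(P_EXAMPLE, 1.0, 0.3, 0.05, horizon)
    b = llt_msfe_closed(P_EXAMPLE, 1.0, 0.3, 0.05, horizon)
    print(horizon, round(a, 6), round(b, 6))
```

 The two columns printed for each horizon should agree, because Z T^i is the row (1, i), so the accumulated disturbance term is the sum of sigma_eta^2 + i^2 sigma_zeta^2 over i = 0, ..., h-1.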
 Appendix 2. Forecast evaluation measures
 Until recently, attention in macroeconomics was focused primarily
on the production and evaluation of point forecasts (see, for example,
Wallis, 1995), with only summary information regarding the degree of
uncertainty associated with forecasts, such as standard errors, being
provided. However, Christoffersen (1998) suggests ways of evaluating
interval forecasts, as a step toward providing a more complete
description of the uncertainty surrounding forecasts.
 Point forecasts
 Although there is a large literature on forecast-accuracy
comparisons, there have been few formal comparisons of rival forecasts
of the same phenomena, where by 'formal', we mean comparisons
which attempt to assess whether differences between rival forecasts can
be attributed to sampling variability, or whether they are
'significant'. Diebold and Mariano (1995) propose a test of
the null of equal forecast accuracy, for an arbitrary 'loss
function', $g(e_{i,T+h})$, where $e_{i,T+h}$ is the h-step
ahead forecast error from using model i commencing at the forecast
origin T. The loss differential is defined as
$d_{T+h} \equiv g(e_{i,T+h}) - g(e_{j,T+h})$ for rival forecasts i and j,
so that equal forecast accuracy entails $E[d_{T+h}] = 0$. Given a
covariance-stationary sample realisation $\{d_t\}$ of n
'observations', the asymptotic distribution of the sample mean
loss differential $\bar{d}$ ($\bar{d} = n^{-1}\sum_{t=1}^{n} d_t$)
under the null is given by:
 $$\sqrt{n}\,\bar{d} \xrightarrow{D} N[0, nV(\bar{d})],$$
 where we can approximate the variance by
 $$V(\bar{d}) = n^{-1}\left(\gamma_0 + 2\sum_{i=1}^{h-1}\gamma_i\right), \qquad (8)$$
 where $\gamma_i$ is the ith autocovariance (so $V(\bar{d})$ is a
weighted sum of autocovariances), and the truncation is determined by
the forecast horizon. Diebold and Mariano (1995) discuss the choice of
weights (here, rectangular) and the truncation lag.
 The large-sample statistic that Diebold and Mariano (1995) propose
for testing the null of equal forecast accuracy is:
 $$\bar{d}/\sqrt{\hat{V}(\bar{d})} \overset{app}{\sim} N[0,1],$$
 where $\hat{V}(\bar{d})$ is a consistent estimate of $V(\bar{d})$,
using a weighted sum of the sample autocovariances in (8):
 $$\hat{\gamma}_k = n^{-1}\sum_{t=k+1}^{n}(d_t - \bar{d})(d_{t-k} - \bar{d}).$$
 Harvey, Leybourne and Newbold (1997) propose a modified version of
this test statistic that corrects for its tendency to be over-sized, and
present simulation evidence that attests to the usefulness of their
modification. They suggest using:
 $$\sqrt{\frac{n+1-2h+n^{-1}h(h-1)}{n}} \times \frac{\bar{d}}{\sqrt{\hat{V}(\bar{d})}}$$
 and comparing this to the Student t distribution with n-1 degrees
of freedom. To simplify matters, we consider only 1-step forecasts, so
that serial correlation is ignored, and use:
 $$\hat{\gamma}_0 = n^{-1}\sum_{k=1}^{n}(d_k - \bar{d})^2,$$
 with:
 $$\bar{d}/\sqrt{n^{-1}\hat{\gamma}_0} \overset{app}{\sim} N[0,1],$$
 because for h = 1 the correction factor is $\sqrt{n^{-1}(n-1)} \approx 1$,
and for n > 30, the normal closely approximates the t
distribution.
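 The 1-step version of the test described above can be sketched in a few lines. This is an illustration under stated assumptions, not the Gauss code used for the paper: the function names are hypothetical, squared-error loss is assumed as the default loss function, and the reported probability is the lower tail of the standard normal, which is one possible one-sided convention.

```python
import math


def normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def dm_test_one_step(errors_i, errors_j, loss=lambda e: e * e):
    """Diebold-Mariano statistic for 1-step forecast errors.

    Returns the statistic, the version scaled by the h = 1
    correction factor sqrt((n - 1) / n), and the lower-tail normal
    probability of the unscaled statistic (an assumed convention).
    """
    n = len(errors_i)
    # Loss differential d_t = g(e_{i,t}) - g(e_{j,t}).
    d = [loss(ei) - loss(ej) for ei, ej in zip(errors_i, errors_j)]
    d_bar = sum(d) / n
    # With h = 1 only the variance gamma_0 enters V(d_bar) = gamma_0 / n.
    gamma0 = sum((dt - d_bar) ** 2 for dt in d) / n
    dm = d_bar / math.sqrt(gamma0 / n)
    dm_corrected = math.sqrt((n - 1) / n) * dm
    return dm, dm_corrected, normal_cdf(dm)


# Hypothetical forecast errors from two rival models.
e_i = [0.10, -0.20, 0.15, -0.05, 0.10, -0.12]
e_j = [0.30, -0.40, 0.35, -0.20, 0.25, -0.30]
stat, stat_corrected, p_lower = dm_test_one_step(e_i, e_j)
print(stat, stat_corrected, p_lower)
```

 A negative statistic indicates that model i has the smaller average loss in the sample. For horizons above one, the autocovariance terms in (8) would need to be added to the variance estimate.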
 Forecast encompassing
 For two rival 1-step forecasts, $\hat{y}_{i,t}$ and $\hat{y}_{j,t}$, a
forecast-encompassing test is implemented by regressing
$e_{i,t}$ ($= y_t - \hat{y}_{i,t}$) on $\hat{y}_{i,t} - \hat{y}_{j,t}$:
 $$y_t - \hat{y}_{i,t} = \alpha(\hat{y}_{i,t} - \hat{y}_{j,t}) + \zeta_t \qquad (9)$$
 and using a t-test of $H_0$: $\alpha = 0$. This is equivalent to
the regression:
 $$y_t = (1+\alpha)\hat{y}_{i,t} - \alpha\hat{y}_{j,t} + \zeta_t \qquad (10)$$
 and is a variant of the test due to Chong and Hendry (1986): see
Ericsson (1992) and Clements and Hendry (1998b). Harvey, Leybourne and
Newbold (1998) show that this test may be oversized when there is a
substantial sample of forecast errors available and the forecast errors
are not approximately normal. They suggest a number of alternatives,
including:
 $$R_1 = \frac{1}{\sqrt{n}}\,\hat{Q}_1^{-1/2}\sum_{t=1}^{n}(e_{i,t} - e_{j,t})^2\,\hat{\alpha}$$
 where:
 $$\hat{Q}_1 = \frac{1}{n}\sum_{t=1}^{n}(e_{i,t} - e_{j,t})^2\,\hat{\zeta}_t^2$$
 and $\hat{\alpha}$ and $\hat{\zeta}_t$ are the OLS estimates of
$\alpha$ and the residuals from (9). Harvey et al. (1998) also show
that the Diebold-Mariano test of equal MSFE is closely related to the
notion of forecast encompassing.
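 The regression in (9) and the modified statistic can be sketched as follows. This is a minimal illustration with hypothetical function names and data: it fits (9) by OLS without an intercept, forms the conventional t-ratio using a residual variance with n-1 degrees of freedom (an assumption, since the degrees-of-freedom choice is not stated here), and then computes the statistic built from the squared regressor and squared residuals.

```python
import math


def encompassing_test(y, f_i, f_j):
    """Test whether forecast f_i encompasses forecast f_j.

    Returns the OLS estimate of alpha in (9), its t-ratio, and the
    modified statistic based on Q_1.
    """
    n = len(y)
    e_i = [yt - a for yt, a in zip(y, f_i)]      # errors of forecast i
    x = [a - b for a, b in zip(f_i, f_j)]        # regressor in (9)
    sxx = sum(v * v for v in x)
    alpha = sum(xv * ev for xv, ev in zip(x, e_i)) / sxx
    resid = [ev - alpha * xv for xv, ev in zip(x, e_i)]
    # Conventional t-ratio (n - 1 degrees of freedom assumed).
    s2 = sum(r * r for r in resid) / (n - 1)
    t_ratio = alpha / math.sqrt(s2 / sxx)
    # (e_i - e_j)^2 equals x^2, because e_i - e_j = f_j - f_i.
    q1 = sum(xv * xv * r * r for xv, r in zip(x, resid)) / n
    r1 = sxx * alpha / math.sqrt(n * q1)
    return alpha, t_ratio, r1


# Hypothetical outcomes and two rival 1-step forecasts.
actual = [1.0, 2.1, 2.9, 4.2, 5.0, 5.8]
forecast_i = [1.1, 2.0, 3.0, 4.0, 5.1, 6.0]
forecast_j = [0.5, 2.5, 2.6, 4.6, 4.4, 6.5]
print(encompassing_test(actual, forecast_i, forecast_j))
```

 Swapping the two forecast arguments gives the test of the reverse hypothesis, that forecast j encompasses forecast i, which corresponds to the second entry of each pair in Table 3.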
 Interval evaluation
 Christoffersen (1998) suggests that a 'good' interval
forecast should have correct conditional coverage, such that the
interval is wider in volatile periods than in those of greater
tranquility: otherwise, observations outside the interval will be
clustered in volatile periods, and absent in tranquil periods.
Christoffersen (1998, p. 844) develops a 'unified framework for
testing the conditional coverage' using three likelihood-ratio (LR)
tests for unconditional coverage, independence, and conditional coverage
respectively.
 Let $L_{t|t-1}(p)$ and $U_{t|t-1}(p)$ denote the lower and
upper limits of the interval forecast of $y_t$ made at time t-1,
with a coverage probability p. Define the indicator function, $1_t$,
which takes the value unity when $y_t$ lies inside the interval
range and zero otherwise. For 1-step ahead forecasts t = 1,..., n, there
is a sequence of outcomes
 $$\{1_t\}_{t=1}^{n}.$$
 We wish to test that:
 $$E[1_t \mid 1_{t-1}, 1_{t-2}, \ldots, 1_1] = E[1_t] = p.$$
 That is, at each point in time, the probability that the actual
outcome will fall within the interval is p, and this is independent of
the history up to that point. Christoffersen (1998, Lemma 1) establishes
that this is equivalent to testing that the sequence $\{1_t\}$ is
identically and independently distributed as a Bernoulli distribution
with parameter p, denoted: $\{1_t\} \sim IID_B(p)$.
 We can test this hypothesis in two parts. First, the test of correct
unconditional coverage: $E[1_t] = p$ versus $E[1_t] = \pi \neq p$. For
a sample of n intervals, the likelihood under the null is:
 $$L(p; 1_1, 1_2, \ldots, 1_n) = (1-p)^{n_0}p^{n_1}$$
 where $n_0$ is the number of times the actual does not fall
within the interval, and $n_1$ is the number of 'hits'.
The likelihood under the alternative is:
 $$L(\pi; 1_1, 1_2, \ldots, 1_n) = (1-\pi)^{n_0}\pi^{n_1},$$
 so the standard LR test is:
 $$LR_{uc} = -2\ln\left(L(p;\cdot)/L(\hat{\pi};\cdot)\right) \overset{a}{\sim} \chi^2_1,$$
 where $\overset{a}{\sim}$ denotes 'asymptotically distributed as',
and $\hat{\pi} = n_1/(n_0 + n_1)$.
 To test for independence, model the indicator function by a binary
first-order Markov chain with transition probability matrix:
 $$\Pi_1 = \begin{pmatrix} 1-\pi_{01} & \pi_{01} \\ 1-\pi_{11} & \pi_{11} \end{pmatrix}, \qquad (11)$$
 where $\pi_{ij} = \Pr(1_t = j \mid 1_{t-1} = i)$. Under
independence, $\pi_{ij} = \pi_j$, i, j = 0, 1 where
$\pi_j = \Pr(1_t = j)$. So under the null, (11) is restricted
to:
 $$\Pi_2 = \begin{pmatrix} 1-\pi_1 & \pi_1 \\ 1-\pi_1 & \pi_1 \end{pmatrix}. \qquad (12)$$
 The $\pi_{ij}$ and $\pi_1$ are estimated by their sample
frequencies, so:
 $$\hat{\Pi}_1 = \begin{pmatrix} \dfrac{n_{00}}{n_{00}+n_{01}} & \dfrac{n_{01}}{n_{00}+n_{01}} \\ \dfrac{n_{10}}{n_{10}+n_{11}} & \dfrac{n_{11}}{n_{10}+n_{11}} \end{pmatrix},$$
 where $n_{ij}$ is the number of times i is followed by j,
and:
 $$\hat{\pi}_1 = \frac{n_{01} + n_{11}}{n_{00} + n_{10} + n_{01} + n_{11}}.$$
 The unrestricted likelihood is:
 $$L(\Pi_1) = (1-\pi_{01})^{n_{00}}\,\pi_{01}^{n_{01}}\,(1-\pi_{11})^{n_{10}}\,\pi_{11}^{n_{11}} \qquad (13)$$
 and the restricted is:
 $$L(\Pi_2) = (1-\pi_1)^{(n_{00}+n_{10})}\,\pi_1^{(n_{01}+n_{11})}. \qquad (14)$$
 The likelihood-ratio test statistic is:
 $$LR_{ind} = -2\ln\left(L(\hat{\Pi}_2)/L(\hat{\Pi}_1)\right) \overset{a}{\sim} \chi^2_1,$$
 under the null hypothesis of independently-distributed indicator
function values.
 A joint test of correct conditional coverage is
$LR_{cc} = LR_{uc} + LR_{ind} \overset{a}{\sim} \chi^2_2$.
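 The three statistics can be computed directly from the sequence of hits and misses. The following is a minimal sketch under stated assumptions, not the code used for the paper: the function names are hypothetical, the nominal coverage p is assumed to lie strictly between 0 and 1, terms with a zero count are set to zero, and the independence and conditional coverage tests are returned as None when there are no misses, mirroring the 'NA' entries in Table 4.

```python
import math


def xlogy(count, prob):
    """Return count * ln(prob), with the convention 0 * ln(0) = 0."""
    return 0.0 if count == 0 else count * math.log(prob)


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variate (df = 1 or 2)."""
    x = max(x, 0.0)
    return math.erfc(math.sqrt(x / 2.0)) if df == 1 else math.exp(-x / 2.0)


def christoffersen_tests(hits, p):
    """LR tests of unconditional coverage, independence and
    conditional coverage for a 0/1 hit sequence and nominal coverage p."""
    n1 = sum(hits)
    n0 = len(hits) - n1
    pi_hat = n1 / (n0 + n1)
    lr_uc = -2.0 * ((xlogy(n0, 1 - p) + xlogy(n1, p))
                    - (xlogy(n0, 1 - pi_hat) + xlogy(n1, pi_hat)))
    out = {"LR_uc": lr_uc, "p_uc": chi2_sf(lr_uc, 1),
           "LR_ind": None, "p_ind": None, "LR_cc": None, "p_cc": None}
    if n0 == 0:          # no misses: independence test inapplicable
        return out
    # Transition counts n[(i, j)]: state i followed by state j.
    n = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for prev, cur in zip(hits, hits[1:]):
        n[(prev, cur)] += 1
    row0 = n[(0, 0)] + n[(0, 1)]
    row1 = n[(1, 0)] + n[(1, 1)]
    pi01 = n[(0, 1)] / row0 if row0 else 0.0
    pi11 = n[(1, 1)] / row1 if row1 else 0.0
    pi1 = (n[(0, 1)] + n[(1, 1)]) / (row0 + row1)
    ll_unres = (xlogy(n[(0, 0)], 1 - pi01) + xlogy(n[(0, 1)], pi01)
                + xlogy(n[(1, 0)], 1 - pi11) + xlogy(n[(1, 1)], pi11))
    ll_res = (xlogy(n[(0, 0)] + n[(1, 0)], 1 - pi1)
              + xlogy(n[(0, 1)] + n[(1, 1)], pi1))
    lr_ind = -2.0 * (ll_res - ll_unres)
    lr_cc = lr_uc + lr_ind
    out.update({"LR_ind": lr_ind, "p_ind": chi2_sf(lr_ind, 1),
                "LR_cc": lr_cc, "p_cc": chi2_sf(lr_cc, 2)})
    return out


# Hypothetical hit sequence for a nominal 90 per cent interval.
example = [1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1]
print(christoffersen_tests(example, 0.9))
```

 Working with log-likelihoods rather than the likelihoods in (13) and (14) avoids numerical underflow in long samples; the tail probabilities use the closed forms available for one and two degrees of freedom.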