AN HISTORICAL PERSPECTIVE ON FORECAST ERRORS.
Michael P. Clements [*]
David F. Hendry [**]
Using annual observations on industrial production over the last
three centuries, and on GDP over a 100-year period, we seek an
historical perspective on the forecastability of these UK output
measures. The series are dominated by strong upward trends, so we
consider various specifications of this, including the local linear
trend structural time-series model, which allows the level and slope of
the trend to vary. Our results are not unduly sensitive to how the trend
in the series is modelled: the average sizes of the forecast errors of
all models, and the wide span of prediction intervals, attest to a
great deal of uncertainty in the economic environment. It appears that,
from an historical perspective, the postwar period has been relatively
more forecastable.
Introduction
How well or badly forecasters are doing should be measured relative
to the predictability of the relevant variables. Little is known about
any lower bounds to the predictability of macroeconomic time series, and
how such bounds change over time, but the historical properties of data
are open to analysis. Consequently, we investigate the historical
forecastability of UK output measures, using annual observations on UK
GDP and industrial production (IP) over the last 100 and 300 years
respectively, to gauge changes in predictability relative to some
'base-line' models. [1]
The forecastability of a variable depends both upon its intrinsic
properties and on the model used. The former is not under the
investigators' control, but the epochs studied here witnessed many
fundamental technological, legal, social and political changes, most of
which had consequences that were not well understood till long
afterwards. Part of the apparent change in forecast accuracy may be
improvements in the measurement system itself. The forecasting model,
however, is under an investigator's control, and recent research
into the sources of forecast failure suggests using univariate
time-series forecasting models which are relatively robust to
deterministic shifts: the following section reviews the arguments. A
univariate approach obviates the need to model 'explanatory
variables', which is distinctly difficult over the long historical
periods we consider (see Hendry, 2000a, for models of several macro time
series over 1875-1990), and anyway need not improve forecast accuracy or
precision when underlying relationships are changing. Linear
time-series models (for example, of the Box and Jenkins, 1976, form) are
easily understood and estimated, and while they may exhibit
non-constancy, we allow for that by re-estimating the models'
parameters as the estimation and forecast windows move through the data,
allowing for gradual parameter evolution. We also consider a class of
models that allows the levels and slopes of the trend components of the
models to evolve through time. Casual observation of the data (see chart
1) shows that strong upward trends are the dominant features to be
modelled, so we focus on that aspect. Year-on-year growth rates have
been volatile, more so for IP than GDP, and their volatility has not
been constant over time.
The paper has three objectives. First, we investigate whether the
'structural' time-series models employed here, which allow the
trend function to change over time, yield either more accurate point
forecasts, or better capture the uncertainty surrounding such forecasts,
than linear time-series models. To resolve this issue, we report a
number of measures of forecast accuracy that assess the performance of
models against each other, and consider whether the models'
prediction intervals are well calibrated. Secondly, we examine whether
the last century was relatively more quiescent (and therefore more
forecastable) than the 18th and 19th centuries. Finally, we seek to
identify whether periods when forecast performance was particularly poor
coincide with periods of turbulence in the economic or legislative
fields, including domestic and international conflict.
Clements and Hendry (1998b, 1999a) argue that there should be such
a match, insofar as turbulence induces deterministic shifts, which are
the most pernicious source of forecast failure.
The next section reviews the theoretical background that leads to
our choice of forecasting models, and is followed by a discussion of the
models we use. The fourth section describes the empirical forecasting
exercise and our results, and the fifth section concludes. Two
appendices contain more technical information. Appendix 1 describes the
derivation of the mean squared forecast error for the model with the
trend allowed to change over time, and appendix 2 provides an overview
of the measures used to evaluate forecast performance.
Background
Economic forecasting occurs in a non-stationary and evolving world,
where the model and the data generation process (DGP) are bound to
differ due to the complexity of the latter. Long historical time series
merely serve to highlight change; and change in turn reveals
mis-specifications in models of the DGP. Forecast failure then publicly
exposes the impact of unmodelled changes on models.
Key attributes of forecasts are their accuracy and precision, and
although other features of the forecast-error distribution may matter,
we will focus on bias and variance as measures thereof, despite their
well-known drawbacks. Inaccurate or imprecise forecasts may simply
reflect the intrinsic difficulty of forecasting the given series.
Because there are no absolutes against which to assess the accuracy
and/or precision of a forecast, root mean-square forecast errors (which
weight the inaccuracy and imprecision together) will be compared across
models, and statistical tests used to check if one model is
significantly better than the others on this criterion. We also use
tests due to Christoffersen (1998) to check whether each model's
prediction intervals contain the appropriate number of realised values
of the process. However, forecast failure - which occurs when there is a
significant deterioration in forecast performance relative to the
anticipated outcome - is more easily detected, and can be assessed in a
number of ways: see, for example, the tests discussed in Kiviet (1986)
and Clements and Hendry (1999b).
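The unconditional-coverage part of the Christoffersen (1998) approach can be sketched as a likelihood-ratio test on the sequence of interval 'hits'. A minimal sketch, with a hypothetical `hits` input of 0/1 indicators:

```python
import math

def lr_unconditional_coverage(hits, p):
    """LR test of correct unconditional coverage (Christoffersen, 1998).
    hits: sequence of 0/1 indicators, 1 when the realisation fell inside
    the prediction interval; p: nominal coverage (e.g. 0.75).
    Under the null, the statistic is asymptotically chi-squared(1).
    (Sketch only: assumes at least one hit and one miss, so the
    empirical coverage is strictly between 0 and 1.)"""
    n = len(hits)
    n1 = sum(hits)                  # observations inside the interval
    n0 = n - n1                     # 'misses'
    pi_hat = n1 / n                 # empirical coverage
    def loglik(q):                  # Bernoulli log-likelihood
        return n0 * math.log(1.0 - q) + n1 * math.log(q)
    return 2.0 * (loglik(pi_hat) - loglik(p))
```

The conditional-coverage test adds an analogous LR component for first-order independence of the hit sequence; the joint statistic is the sum of the two parts.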
Most forecasting models have three main components: deterministic
terms (like intercepts) whose future values are known; observed
stochastic variables (like prices) with unknown future values; and
unobserved errors all of whose values (past, present and future) are
unknown. Any, or all, of these components, or the relationships between
them, could be inappropriately formulated, inaccurately estimated, or
change in unanticipated ways. All nine types of mistake could induce
poor forecast performance, either from inaccurate or imprecise
forecasts. Instead, Clements and Hendry (1998b, 1999a) find that some
mistakes have pernicious effects on forecasts, whereas others are
relatively less important in most settings.
This result follows from a taxonomy of forecast errors which allows
for structural change in the forecast period, the model and DGP to
differ over the sample period, the parameters of the model to be
estimated (possibly inconsistently) from (potentially inaccurate) data,
and the forecasts to commence from incorrect initial conditions. The
taxonomy reveals the central role in forecast failure of deterministic
shifts over the forecast period, whereas other problems, such as
poorly-specified models, inaccurate data, inadequate methodology,
over-parameterisation, or incorrect estimators seem less relevant.
Surprisingly, it is even difficult to detect shifts in parameters other
than those of deterministic terms (see for example, Hendry, 2000b). Of
course, all inadequacies in models reduce forecast performance relative
to the optimum, but the analytics, supported by Monte Carlo and
empirical studies, suggest that such effects are 'swamped' by
the large errors manifested in forecast failure.
Such a theory reveals that many of the conclusions which can be
established formally for correctly-specified forecasting models of
constant-parameter processes no longer hold. In weakly stationary
processes -- or integrated processes reducible to stationarity by
differencing and cointegration -- a congruent, encompassing model
in-sample will dominate in forecasting at all horizons (see Hendry,
1995, for definitions). Causal variables will always improve forecasts
relative to non-causal; forecast accuracy will deteriorate as the
horizon increases; and there should be no forecast-accuracy gains from
pooling forecasts across methods or models: indeed, pooling refutes
encompassing. Forecast failure will be a rare event, precisely because
the future will be like the present and past. Instead, when the DGP is
not stationary and the model does not coincide with the DGP, then new
implications are that causal variables need not dominate non-causal in
forecasting; forecast failure will occur when the future changes from
the present, and is not predictable from in-sample tests, but need not
entail changes in estimated parameters nor invalidate a model; and
h-step ahead forecasts can 'beat' 1-step forecasts made h-1
periods later (h > 1). Moreover, the outcome of forecasting
competitions on economic time series will be heavily influenced by the
occurrence of unmodelled deterministic shifts that occur prior to
forecast-evaluation periods: models which are robust to such shifts will
do relatively well.
Given that forecast failure mainly derives from deterministic
shifts, there are four potential solutions: differencing, co-breaking,
intercept corrections, and updating. Differencing lowers the polynomial degree of deterministic terms: in particular, double differencing
usually leads to a mean-zero, trend-free series, because continuous
acceleration is rare in economics (even during hyperinflations). The
recent study of the Norges Bank model in Eitrheim, Husebo and Nymoen
(1999) illustrates the effectiveness of that approach. Next,
deterministic non-stationarity can also be removed by co-breaking,
namely the cancellation of breaks across linear combinations of
variables (see for example, Clements and Hendry, 1999a, chapter 9).
Finally, intercept corrections can be shown to help robustify forecasts
against biases due to deterministic shifts, as can the closely-related
approach of updating the estimated intercept with an increased weight
accorded to more recent data.
These implications help determine the class of model that might
minimise systematic forecast failure over the long periods under
analysis here.
Models
Because of the vexed question of whether output series possess unit
roots or are better described as stationary around a linear trend, we
estimate both types of model, namely difference stationary (DS) and
trend stationary (TS) models. The stochastic-trend model treats the
variable $\{y_t\}$ as integrated of order one, $y_t \sim I(1)$, as in
the random walk with drift:

$$y_t = y_{t-1} + \mu + \epsilon_t \quad \text{where} \quad \epsilon_t \sim \mathrm{IN}[0, \sigma_\epsilon^2]. \qquad (1)$$
The TS model is ostensibly quite different, whereby $\{y_t\}$ is
stationary about a deterministic function of time, here taken to be a
simple linear trend:

$$y_t = \phi + \gamma t + u_t \quad \text{where} \quad u_t \sim \mathrm{IN}[0, \sigma_u^2]. \qquad (2)$$
The disturbances in both models can be stationary Gaussian
autoregressive-moving average (ARMA) processes without fundamentally
altering the properties of these models (see Clements and Hendry, 2001).
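The contrast between (1) and (2) as forecasting devices can be sketched directly: the DS model extrapolates the last observed level by the estimated mean growth rate, while the TS model extrapolates a fitted whole-sample trend line. A minimal sketch on simulated data (the parameter values are illustrative assumptions, not estimates from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, T, h = 0.02, 200, 4                    # drift, sample size, horizon

# Simulate log-output as a random walk with drift (the DS case, eq. (1)).
y = np.cumsum(mu + 0.05 * rng.standard_normal(T))

# DS forecast: last level plus h times the mean of the first differences.
mu_hat = np.mean(np.diff(y))
ds_forecast = y[-1] + h * mu_hat

# TS forecast: OLS fit of a linear trend (eq. (2)), extrapolated h steps.
t = np.arange(1, T + 1)
gamma_hat, phi_hat = np.polyfit(t, y, 1)   # slope, intercept
ts_forecast = phi_hat + gamma_hat * (T + h)
```

When the data really are integrated, the TS forecast is anchored to the whole-sample trend line and can sit far from the latest level, which is one source of the forecast failures discussed below.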
However, both these models can be viewed as special cases of the
'structural time-series' class of models of Harvey (1989)
(Proietti, 2001, provides a review). Ignoring cyclical and seasonal
components, the local linear trend (LLT) structural time-series model
can be written as:
[y.sub.t] = [[micro].sub.t] + [[epsilon].sub.t] (3)
where:
[[micro].sub.t] = [[micro].sub.t-1] + [[beta].sub.t-1] +
[[eta].sub.t-1] (4)
[[beta].sub.t] = [[beta].sub.t-1] + [[zeta].sub.t-1]
and the disturbance terms [[epsilon].sub.t], [[eta].sub.t] and
[[zeta].sub.t] and are zero mean, uncorrelated white noise, with
variances given by [[[sigma].sup.2].sub.[epsilon]],
[[[sigma].sup.2].sub.[eta]] and [[[sigma].sup.2].sub.[zeta]]. The
interpretation of this model is that is [[micro].sub.t] the (unobserved)
trend component of the time series, and [[epsilon].sub.t] is the
irregular component. [[eta].sub.t] affects the level of the trend, and
[[zeta].sub.t] allows its slope ([beta]) to change. We investigate
whether this more flexible approach leads to improved forecast accuracy
over (1) and (2).
Differencing the equation for $\mu_t$ in (4):

$$\Delta^2 \mu_t = \Delta\beta_{t-1} + \Delta\eta_{t-1} = \zeta_{t-2} + \Delta\eta_{t-1},$$

and substituting into (3) differenced twice, we find:

$$\Delta^2 y_t = \zeta_{t-2} + \Delta\eta_{t-1} + \Delta^2\epsilon_t,$$

which is a restricted ARIMA(0, 2, 2). [2]
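This reduction can be checked by simulation: second differences of data generated from (3)-(4) should have autocovariances that vanish beyond lag 2, matching an MA(2). A sketch with illustrative variance values (chosen by us, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
n, s_eps, s_eta, s_zeta = 50_000, 1.0, 0.5, 0.2

eps = s_eps * rng.standard_normal(n)
eta = s_eta * rng.standard_normal(n)
zeta = s_zeta * rng.standard_normal(n)

# Simulate the local linear trend model (3)-(4).
y = np.empty(n)
mu_t, beta_t = 0.0, 0.0
for t in range(n):
    y[t] = mu_t + eps[t]
    mu_t, beta_t = mu_t + beta_t + eta[t], beta_t + zeta[t]

d2y = np.diff(y, 2)                 # second differences

def acov(x, k):
    """Sample autocovariance at lag k."""
    x = x - x.mean()
    return float(np.mean(x[k:] * x[: len(x) - k]))

# Theoretical autocovariances of zeta_{t-2} + d(eta_{t-1}) + d^2(eps_t):
g0 = s_zeta**2 + 2 * s_eta**2 + 6 * s_eps**2
g1 = -(s_eta**2) - 4 * s_eps**2
g2 = s_eps**2
```

The sample autocovariances of `d2y` match `g0`, `g1`, `g2` and are approximately zero from lag 3 onwards, as a restricted MA(2) requires.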
Estimation of the unknown error variances is by maximum likelihood,
most easily achieved after casting the model into state-space form
(SSF), for example:

$$y_t = Z\alpha_t + G\epsilon_t, \qquad \alpha_{t+1} = T\alpha_t + H\epsilon_t,$$

which define the measurement and state equations respectively,
where $\alpha_t = (\mu_t \;\; \beta_t)'$, $Z = (1 \;\; 0)$,
$G = (\sigma_\epsilon \;\; 0 \;\; 0)$,
$\epsilon_t' = (\epsilon_t/\sigma_\epsilon \;\; \eta_t/\sigma_\eta \;\; \zeta_t/\sigma_\zeta)$
and:

$$T = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}, \qquad H = \begin{pmatrix} 0 & \sigma_\eta & 0 \\ 0 & 0 & \sigma_\zeta \end{pmatrix}.$$
For given values of the hyperparameters $(\sigma_\epsilon,
\sigma_\eta, \sigma_\zeta)$, the Kalman filter is used to
obtain the prediction-error decomposition, that is, the one-step ahead
independent prediction errors, from which the likelihood can be
assembled. The likelihood is then maximised over these hyperparameters
by a numerical optimisation routine.
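A minimal sketch of this prediction-error decomposition for the LLT model, with a large-variance approximation to diffuse initialisation (the paper's exact initialisation may differ):

```python
import numpy as np

def llt_loglik(y, s_eps, s_eta, s_zeta):
    """Gaussian log-likelihood of the local linear trend model (3)-(4),
    assembled from the Kalman filter's one-step prediction errors."""
    Tmat = np.array([[1.0, 1.0], [0.0, 1.0]])  # state transition
    Q = np.diag([s_eta**2, s_zeta**2])         # state disturbance covariance
    a = np.zeros(2)                            # state mean (mu, beta)
    P = 1e7 * np.eye(2)                        # 'diffuse' prior variance
    ll = 0.0
    for yt in y:
        v = yt - a[0]                          # one-step prediction error
        F = P[0, 0] + s_eps**2                 # its variance: Z P Z' + var(eps)
        K = P[:, 0] / F                        # Kalman gain: P Z' / F
        ll += -0.5 * (np.log(2.0 * np.pi * F) + v * v / F)
        a = Tmat @ (a + K * v)                 # update, then predict
        P = Tmat @ (P - np.outer(K, P[0, :])) @ Tmat.T + Q
    return ll
```

Maximising `llt_loglik` over the hyperparameters with any numerical optimiser (e.g. `scipy.optimize.minimize` on the negative log-likelihood) reproduces the estimation step described above.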
Letting $I_T$ denote available information, the h-step ahead
forecasts are given by:

$$\mathrm{E}[y_{T+h} \mid I_T] = Z\,\mathrm{E}[\alpha_{T+h} \mid I_T] = Z T^h \alpha_{T|T}$$

since $\mathrm{E}[G\epsilon_{T+h} \mid I_T] = 0$, where $\alpha_{T|T}$
is the smoothed estimate of the state vector at time $T$. For the LLT
model, after estimation:

$$y_{T+h|T} = \mathrm{E}[y_{T+h} \mid I_T] = \mu_{T|T} + h\beta_{T|T}$$

with an h-step ahead mean-square forecast error (MSFE) of:

$$\mathrm{MSFE}[y_{T+h|T}] = p_{11} + 2h\,p_{12} + h^2 p_{22} + h\sigma_\eta^2 + \frac{h(h-1)(2h-1)\sigma_\zeta^2}{6} + \sigma_\epsilon^2 \qquad (5)$$

where $P = P_{T|T} = \{p_{ij}\}$ is the smoothed estimate of
the covariance of the state vector: details are provided in the
appendix.
Suppose now that $\sigma_\zeta^2 = 0$; then from (4),
$\Delta\mu_t = \beta + \eta_{t-1}$, so that differencing
(3) and substituting delivers:

$$\Delta y_t = \beta + \eta_{t-1} + \Delta\epsilon_t,$$

which is the same as (1) when the disturbance in the latter is an
MA(2). Thus, the DS model is a structural time-series model in which the
'slope' does not change. The TS model can be obtained as a
special case of the structural time-series model by setting
$\sigma_\eta^2 = \sigma_\zeta^2 = 0$ so that (4) becomes:

$$\mu_t = \mu_{t-1} + \beta = \mu_0 + \beta t. \qquad (6)$$

(6) indicates that the trend has a constant 'slope'
($\sigma_\zeta^2 = 0$), and the 'level' is not
subject to stochastic shocks ($\sigma_\eta^2 = 0$).
Substituting into (3) results in:

$$y_t = \mu_0 + \beta t + \epsilon_t$$

which is identical to (2) when $\mu_0 = \phi$, $\beta = \gamma$ and
$\epsilon_t = u_t$. Thus, the trend-stationary model
is a limiting case of the structural time-series model in which the
level and the slope of the trend component are constant over time.
As noted in the previous section, devices are available to
robustify forecasts against breaks in means or growth rates. These are
discussed at length in Clements and Hendry (1998a), drawing on, inter
alia, Clements and Hendry (1996, 1998a), where it is shown that bias can
often be reduced at the cost of greater imprecision. We experimented
with a 'double-difference' (DD) predictor, whereby forecasts
are generated by solving $\Delta^2 y_t = 0$ for $t = T+1, \ldots,
T+h$. Here, $\Delta = 1 - L$, $L^n x_t = x_{t-n}$, so
$\Delta^2 = (1 - L)^2$. This predictor instantly adapts to
changes in the growth rate of $y_t$: the growth rate observed at
period $T$ is predicted into the future. Notice that the DS model
incorporates a unit root which will adapt to changes in the level of
the series, but not to changes in the growth rate, $\mu$. That is,
the growth rate $\mu$ is expected to hold globally. The LLT model
allows a local trend, as does a model which estimates $\mu$ on a
window of observations ending at the forecast origin. We term this model
DS-L, and calculate the growth rate from the last five years of data.
The DS model uses all the data up to the forecast origin (and may
include dynamics, but this is of secondary importance). Finally, we
consider 'intercept-correcting' the TS model forecasts by
adding in the average of the previous two errors (up to and including
the forecast origin error). This predictor is denoted TS-IC.
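These three adaptive devices are simple enough to state in a few lines each. A sketch (the function names are ours, not the paper's):

```python
def dd_forecast(y, h):
    """Double-difference (DD) predictor: solving delta^2 y = 0 forward
    projects the growth rate observed at the origin into the future."""
    return y[-1] + h * (y[-1] - y[-2])

def ds_l_forecast(y, h, window=5):
    """DS-L predictor: unit root plus a drift estimated from the last
    `window` years of data only."""
    growth = (y[-1] - y[-1 - window]) / window
    return y[-1] + h * growth

def ts_ic_forecast(ts_point, last_two_errors):
    """TS-IC predictor: the TS forecast intercept-corrected by the
    average of the two most recent forecast errors (up to and
    including the forecast-origin error)."""
    return ts_point + sum(last_two_errors) / 2.0
```

On a series growing at a steady one unit per year, the DD and DS-L drift estimates coincide; the devices differ only when growth shifts near the forecast origin, which is precisely when their robustness matters.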
Empirical forecast comparisons
A number of sampling schemes could be adopted for forecast
evaluation (see, for example, West and McCracken, 1998, pp. 818-9). We
use a 'recursive' scheme, where the parameters are
re-estimated in each period as the forecast origin moves forward through
the sample. Thus, for an observation vector of length T, the model is
first specified and estimated on the data up to period
$T_0$ ($T_0 < T$), and a forecast (point or interval)
of $T_0 + 1$ to $T_0 + 4$ is made. Then, the model is
re-estimated on data up to and including $T_0 + 1$, and forecasts of
$T_0 + 2$ to $T_0 + 5$ are made, and so on.
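The recursive scheme reduces to a single loop over forecast origins. A sketch, with `fit_and_forecast` standing in for any of the models' estimate-then-predict routines (a hypothetical interface):

```python
import numpy as np

def recursive_forecasts(y, t0, h_max, fit_and_forecast):
    """Recursive evaluation: re-estimate on y[:origin] at each origin
    and return {origin: [1-step, ..., h_max-step forecasts]}."""
    results = {}
    for origin in range(t0, len(y) - h_max + 1):
        sample = y[:origin]                    # expanding estimation window
        results[origin] = [fit_and_forecast(sample, h)
                           for h in range(1, h_max + 1)]
    return results

# Example: the DS forecast rule (last level plus h * mean growth rate).
ds_rule = lambda s, h: s[-1] + h * float(np.mean(np.diff(s)))
fcasts = recursive_forecasts(np.arange(10.0), t0=5, h_max=2,
                             fit_and_forecast=ds_rule)
```

Each entry of `fcasts` corresponds to one forecast origin, so the collection of h-step errors for the evaluation statistics falls out directly.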
The specifications of the models are fixed, although we could have
optimised lag orders for each estimation sample according to some
information criterion, such as BIC (see Schwarz, 1978) or PIC (see
Phillips, 1994): Swanson and White (1997) found improved accuracy from
such a strategy in some cases. Prediction intervals were calculated
using the 'plug-in' approach of Box and Jenkins for the DS and
TS models. For the LLT model, a $100\alpha$ per cent interval was
calculated as the central forecast $\pm\,\Phi^{-1}((1+\alpha)/2)
\times \sqrt{\mathrm{MSFE}[y_{T+h|T}]}$, where
$\mathrm{MSFE}[y_{T+h|T}]$ is given in (5),
and $\Phi$ is the standard normal distribution function. This approach
ignores parameter-estimation uncertainty (the model parameters are
simply replaced by sample estimates), and assumes the model disturbances
are Gaussian. Clements and Taylor (2001) survey this literature, and
suggest a preferable bootstrap approach, but these refinements would be
unlikely to alter the overall picture.
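The plug-in interval itself is a one-liner. A sketch, using the standard library's `NormalDist.inv_cdf` as $\Phi^{-1}$:

```python
from math import sqrt
from statistics import NormalDist

def plug_in_interval(point, msfe, coverage=0.95):
    """Central forecast +/- z * sqrt(MSFE), where z is the standard
    normal quantile for the requested two-sided coverage. Parameter
    estimates are simply 'plugged in', so estimation uncertainty is
    ignored, as described above."""
    z = NormalDist().inv_cdf(0.5 + coverage / 2.0)
    half_width = z * sqrt(msfe)
    return point - half_width, point + half_width
```

With an MSFE of 4.82 this gives a 95 per cent half-width of about 4.3 percentage points, close to the approximate $\pm 2\sqrt{4.82} \approx \pm 4.4$ figure quoted for the 1-step DS forecasts of IP.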
Industrial production (IP)
We generated sequences of 1 to 4-step ahead forecasts from these
univariate time series models for the logarithm of the IP series
($y_t$) (1700-1988). For the unit root (or difference-stationary,
DS) model we specified an ARIMA(1,1,0), allowing a non-zero mean; and
for the trend-stationary (TS) model, we specified an ARIMA(2,0,0)
process for $u_t = y_t - \phi - \gamma t$, where $\phi$ is the
intercept and $\gamma$, the coefficient on the time trend ($t$), is the
underlying growth rate. Alternative specifications were generally no
better in terms of MSFE for the period as a whole, but we did not
undertake extensive specification searches. The LLT was as described
above, as were the DD, DS-L and TS-IC.
The models were estimated from the period 1700 to 1758, and
forecasts were then made of the next four years. The same specifications
were re-estimated on data up to 1759, and 1 to 4-step ahead forecasts
were again generated. This process was continued up to an estimation
sample ending in 1987, with 1 to 4-step ahead forecasts of the
observations 1988 to 1991. At each forecast origin, we also calculated
prediction intervals, and recorded whether the actual fell within, or
outside the associated interval.
For the forecast-evaluation exercise, we split the overall sample
of forecasts into three periods: 1759-1850, 1851-1944, 1945-1988 (for
1-step forecasts - for 2-steps ahead, the dates were 1760-1851, and so
on). For the whole sample of forecast errors, table 1 shows that the DS
and LLT models are more accurate than the TS model, with the LLT having
a small advantage at 3 and 4-steps ahead, although the RMSFEs (root
MSFEs) suggest considerable uncertainty for all models. For example, an
approximate 95 per cent prediction interval for a 1-step ahead forecast
using the DS model would be $\pm 2\times\sqrt{4.82}$, that is,
$\pm 4.4$ percentage points of the central projection. We have omitted
second-moment results for the DD, DS-L and TS-IC predictors, because
these predictors generally fared badly, and in only a few cases yielded
minor improvements. As is evident from table 2, which records the biases
in the forecast errors of IP over the whole period for all the
predictors, this is due to the higher variability of their forecast
errors, since they are manifestly much less biased. For example, the 1-step
forecasts of the DS model are on average too low by 1 per cent, those of
the LLT are too low by 1/2 per cent, while those of DD are virtually
unbiased: on average they are too low by only 0.03 per cent. This is
consistent with Clements and Hendry (1999a), suggesting that the
variance costs offset the bias reductions when weighted by the RMSFE
metric: for this reason, our subsequent focus is on the three models of
trend, DS, TS and LLT.
Generally, the DS model is the most accurate on RMSFE for the
sub-periods as well. The 1945-88 sub-period is associated with less
uncertainty, but even here a 95 per cent interval would be $\pm 3.6$
percentage points of the central projection, rising to $\pm 5.2$
percentage points for a four-year ahead forecast. Table 3 indicates that
the DS forecasts at 1-step ahead are often significantly more accurate
than those of the TS model on MSFE. This is the case for the whole
period and the earliest sub-period, while at conventional significance
levels, the LLT is statistically no worse than the DS model at 1-step.
For the whole sample period, neither the standard nor modified versions
of the forecast-encompassing test reject the null that the DS 1-step
forecasts encompass those of the TS model (in line with the findings
of the Diebold-Mariano test of equal accuracy), nor do they reject the
null that DS encompasses LLT, nor that LLT encompasses TS, but they
clearly reject when the tests are reversed, that is, TS does not
encompass DS, LLT does not encompass DS, and TS does not encompass LLT.
For the first sub-period, the LLT forecasts are generally preferred on
the encompassing criteria, while for 1851-1944, given that we have the
DS and TS forecasts, nothing is gained from having the LLT model
forecasts as well, though the converse is not true.
The interval-forecast tests in table 4 indicate that, with the
exception of the third sub-period (1945-88), the unconditional coverage
of the 1-step prediction intervals is appropriate at the three nominal
levels we consider. In the third period there are far too few
'misses' (actuals outside the intervals) and this is signalled
by p-values below 5 per cent. Table 5 records the dates of the misses,
and confirms their general absence from the postwar period. The
whole-period test of the independence of the sequences of hits and
misses is more demanding, and rejects for all three models for 50 and 75
per cent intervals (and has a p-value of 0.052 for the TS model even for
the 95 per cent interval). [3] Table 5 confirms both the clustering of
misses for a particular model, and the high correlation of misses
between models. Consequently, for the whole period, all three models are
rejected at the 10 per cent level on the joint test of correct
conditional coverage of 50 and 75 per cent intervals. There is no
evidence against the models on the first sub-period, although the second
(1851-1944) resembles the whole period. Here again, though, there is
less evidence against the LLT model 50 per cent interval, indicating
that the LLT model may be able to characterise more accurately, or
account for, the uncertainty surrounding the point forecasts at
different points in time.
Table 5 also notes any dates where wars (and the oil crises) may
have played a role. There is a link between forecast failure and
manifest turbulence, but it is not strong: many dates coincide with
wars, but there were also many other wars over that period which do not
seem to have had marked effects (note that the data were interpolated
during the Second World War). That the link is weak is perhaps
unsurprising: other events (financial crises, gold discoveries and so
on) and possible interactions have not been investigated, but we note
that there are 4 dates between 1929 and 1939. Because the models are
rather crude empirical [4] descriptions of the macroaggregates, events
which might be expected to result in failure are not always signalled,
and there are failures with no obvious causes.
Gross domestic product (GDP)
For the logarithm of the historical GDP series ($y_t$)
(1830-1991), we again generated sequences of 1 to 4-step ahead
forecasts. For the DS model, we specified an ARIMA(1,1,0), allowing a
non-zero mean, and for the TS model, we specified an ARIMA(2,0,0) for
$u_t = y_t - \phi - \gamma t$. The LLT was as described in the
third section on models above, but with $\sigma_\zeta^2 = 0$,
because the optimisation routines failed to solve when this parameter
was allowed to be non-zero. Thus the model is a restricted
ARIMA(0,1,2).
The models were estimated on the period 1830 to 1888, and forecasts
were then made for the next four years. The same specifications were
re-estimated on data up to 1889, and 1 to 4-step ahead forecasts were
again made. This process was continued up to an estimation sample ending
in 1987, with 1 to 4-step ahead forecasts of the observations 1988 to
1991. Because there is only a sample of 100 forecasts, it did not seem
prudent to analyse subsets of forecasts.
The RMSEs for GDP over the period 1889-1988 are comparable in
magnitude to those recorded for industrial production. The DS model
again appears to be the most accurate, followed by the 'LLT'
(recall $\sigma_\zeta^2 = 0$). But the differences are not
statistically significant at 1-step ahead on the Diebold-Mariano test
applied to the 1-step forecast errors. The forecast-encompassing tests
are able to discriminate between the models. On pairwise comparisons, DS
encompasses, but is not encompassed by, TS, and similarly, DS
encompasses, but is not encompassed by, LLT. Comparing the two dominated
models, TS does not encompass LLT, but at the 5 per cent level we cannot
reject that LLT encompasses the TS (although we can at the 10 per cent
level). Turning now to the prediction intervals, at the 5 per cent
level, there is evidence against the independence of the 75 per cent
interval hits and misses for TS, and against the 90 per cent LLT
interval, as well as against the 90 per cent DS interval on the joint
test.
The RMSE figures can be compared with those calculated by agencies
such as the Treasury and NIESR for their GDP forecasts (see, for
example, respectively, Mellis and Whittaker, 2000 and Poulizac, Weale
and Young, 1996), albeit that they are for the recent period and are
based on quarterly data. Nevertheless, Mellis and Whittaker (2000, Table
3.2, p. 41) give an RMSE of 1.89 for 4-step forecasts for the period
1971-96, compared with our figure for the 100-year period of 3 1/4 for
the DS model (see the last panel of table 1).
Table 6 gives the dates of the misses for the 90 per cent
intervals. Many of the dates coincide with well-known events, though, as
before, there are important events that do not coincide with any
forecast failures.
Conclusion
In terms of the first issue -- which model appears to offer the
most useful characterisations of UK output measures historically -- the
findings support models with stochastic trends. However, the added
flexibility of the LLT model, which allows the slope and level of the
trend to change, appears to offer little benefit over the DS
model's unit root and fixed underlying growth rate, except perhaps
for forecasts of three or four years ahead. Nevertheless, the average
sizes of the forecast errors, and consequently the wide span of
prediction intervals, attest to a great deal of uncertainty in the
economic environment.
In relation to the second issue, for IP, the postwar period has
been relatively more quiescent, with lower MSFEs, and few (if any)
misses using prediction intervals which (because of the way they have
been calculated) largely reflect the earlier periods. For GDP, forecast
failures occur somewhat less often postwar (about once per decade, but
bunched in the 1970s) than otherwise, but the 1920s stand out as the
hard-to-forecast decade.
Finally, to resolve the third issue, perusal of tables 5 and 6
provides some support for our claim that forecast failure is associated
with turbulent periods. For GDP, for example, note the asterisks in the
three years following WW1, around the time of WW2, the OPEC oil price
hikes, combined in 1980 with the onset of recession. The match seems
less close for industrial production, in that several major wars do not
appear in the list, albeit that many others do, but as the data were
missing during the Second World War and were linearly interpolated, they
cannot reflect any 'boom-bust' that may actually have
occurred.
(*.) Department of Economics, University of Warwick.
(**.) Department of Economics, University of Oxford. Financial
support from the UK Economic and Social Research Council under grant no.
LI16251015 is gratefully acknowledged by both authors. All computations
were performed using code written in Gauss. Tommaso Proietti kindly
provided the Gauss code to estimate the local linear trend model.
NOTES
(1.) The data were kindly provided by Charlie Bean. The industrial
production series is the 'Output in Industry' series compiled
from Crafts and Harley (1992), p. 725; Mitchell (1988), p. 846; and
Central Statistical Office (1993): the data were missing during the
Second World War so were linearly interpolated between 1938 and 1946.
Real GDP in 1985 prices at factor cost comes from Mitchell (1988), p.
836; and Economic Trends Annual Supplement, (1993), corrected for the
exclusion of Southern Ireland.
(2.) See, for example, Harvey (1989) or Proietti (2001) for the
restrictions on the lag 1 and 2 autocorrelations relative to an
unrestricted second-order MA process.
(3.) The power of the test falls as the nominal coverage of the
interval rises. Intuitively, there are fewer misses
from which to deduce whether they are independently distributed amongst
the hits.
(4.) See, for example, Hand (1999) on empirical versus iconic models.
REFERENCES
Box, G.E.P. and Jenkins, G.M. (1976), Time Series Analysis,
Forecasting and Control, San Francisco, Holden-Day (first published
1970).
Central Statistical Office (1993), Economic Trends Annual
Supplement, London, HMSO.
Chong, Y.Y. and Hendry, D.F. (1986), 'Econometric evaluation
of linear macro-economic models', Review of Economic Studies, 53,
pp. 671-90, reprinted in Granger, C.W.J. (ed.) (1990), Modelling
Economic Series, Oxford, Clarendon Press.
Christoffersen, P.F. (1998), 'Evaluating interval
forecasts', International Economic Review, 39, pp. 841-62.
Clements, M.P. and Hendry, D.F. (1996), 'Intercept corrections
and structural change', Journal of Applied Econometrics, 11, pp.
475-94.
----- (1998a), 'Forecasting economic processes',
International Journal of Forecasting, 14, pp. 111-31.
----- (1998b), Forecasting Economic Time Series: The Marshall
Lectures on Economic Forecasting, Cambridge, Cambridge University Press.
----- (1999a), Forecasting Non-stationary Economic Time Series: The
Zeuthen Lectures on Economic Forecasting, Cambridge, Mass., MIT Press.
----- (1999b), 'Modelling methodology and forecast failure',
Econometrics Journal (forthcoming).
----- (2001), 'Forecasting with difference-stationary and
trend-stationary models', Econometrics Journal, 4, S1-19.
Clements, M.P. and Taylor, N. (2001), 'Bootstrapping
prediction intervals for autoregressive models', International
Journal of Forecasting (forthcoming).
Crafts, N.F.R. and Harley, C.K. (1992), 'Output growth and the
British Industrial Revolution: a restatement of the Crafts-Harley
view', Economic History Review, 45, pp. 703-30.
Diebold, F.X. and Mariano, R.S. (1995), 'Comparing predictive
accuracy', Journal of Business and Economic Statistics, 13, pp.
253-63.
Eitrheim, O., Husebo, T.A. and Nymoen, R. (1999),
'Equilibrium-correction versus differencing in macroeconometric
forecasting', Economic Modelling, 16, pp. 515-44.
Ericsson, N.R. (1992), 'Parameter constancy, mean square
forecast errors, and measuring forecast performance: an exposition,
extensions, and illustration', Journal of Policy Modeling, 14, pp.
465-95.
Hand, D.J. (1999), 'Discussion contribution on "Data
mining reconsidered: encompassing and the general-to-specific approach
to specification search" by Hoover and Perez', Econometrics
Journal, 2, pp. 241-3.
Harvey, A.C. (1989), Forecasting, Structural Time Series Models and
the Kalman Filter, Cambridge, Cambridge University Press.
Harvey, D., Leybourne, S. and Newbold, P. (1997), 'Testing the
equality of prediction mean squared errors', International Journal
of Forecasting, 13, pp. 281-91.
----- (1998), 'Tests for forecast encompassing', Journal of
Business and Economic Statistics, 16, pp. 254-9.
Hendry, D.F. (1995), Dynamic Econometrics, Oxford, Oxford
University Press.
----- (2000a), 'Does money determine UK inflation over the long
run?', in Backhouse, R. and Salanti, A. (eds), Macroeconomics and
the Real World, Oxford, Oxford University Press.
----- (2000b), 'On detectable and non-detectable structural
change', Structural Change and Economic Dynamics, 11, pp. 45-65.
Kiviet, J.F. (1986), 'On the rigor of some mis-specification
tests for modelling dynamic relationships', Review of Economic
Studies, 53, pp. 241-61.
Mellis, C. and Whittaker, R. (2000), 'The Treasury's
forecasts of GDP and the RPI: how have they changed and what are the
uncertainties?', in Holly, S. and Weale, M. (eds), Econometric
Modelling: Techniques and Applications, Cambridge, Cambridge University
Press.
Mitchell, B.R. (1988), British Historical Statistics, Cambridge,
Cambridge University Press.
Phillips, P.C.B. (1994), 'Bayes models and forecasts of
Australian macroeconomic time series', in Hargreaves, C. (ed.),
Non-stationary Time-series Analyses and Cointegration, Oxford, Oxford
University Press.
Poulizac, D., Weale, M. and Young, G. (1996), 'The performance
of National Institute economic forecasts', National Institute
Economic Review, 156, pp. 56-62.
Proietti, T. (2001), 'Forecasting with structural time series
models', mimeo, Dipartimento di Scienze Statistiche, Universita di
Udine, forthcoming in Clements, M.P. and Hendry, D.F. (eds) (2001), A
Companion to Economic Forecasting, Oxford, Basil Blackwell.
Schwarz, G. (1978), 'Estimating the dimension of a
model', Annals of Statistics, 6, pp. 461-4.
Swanson, N.R. and White, H. (1997), 'Forecasting economic time
series using flexible versus fixed specification and linear versus
nonlinear econometric models', International Journal of
Forecasting, 13, pp. 439-62.
Wallis, K.F. (1995), 'Large-scale macroeconometric
modelling', in Pesaran, M.H. and Wickens, M.R. (eds), Handbook of
Applied Econometrics: Macroeconomics, Oxford, Basil Blackwell.
West, K.D. and McCracken, M.W. (1998), 'Regression-based tests
of predictive ability', International Economic Review, 39, pp.
817-40.
Table 1 RMSE and MAE of multi-step forecast errors
RMSE MAE
h DS TS LLT DS TS LLT
Industrial production forecast
errors: 1759-1988
1 4.82 5.08 4.95 3.82 4.07 3.97
2 6.56 7.07 6.55 5.20 5.70 5.18
3 7.72 8.60 7.63 6.18 7.12 6.01
4 8.60 9.84 8.49 6.89 8.20 6.70
Industrial production forecast
errors: 1759-1850
1 4.52 5.15 4.41 3.64 4.25 3.55
2 5.67 6.98 5.43 4.44 5.63 4.27
3 6.44 8.68 6.16 5.16 7.40 4.87
4 7.66 10.60 7.28 6.01 9.10 5.66
Industrial production forecast
errors: 1851-1944
1 5.56 5.58 5.87 4.46 4.47 4.87
2 7.75 7.80 7.91 6.39 6.42 6.54
3 9.32 9.39 9.34 7.73 7.78 7.72
4 10.08 10.19 10.14 8.24 8.32 8.22
Industrial production forecast
errors: 1945-1988
1 3.58 3.60 3.68 2.82 2.84 2.93
2 5.41 5.46 5.42 4.25 4.32 4.21
3 6.27 6.37 6.23 4.98 5.15 4.76
4 6.86 7.01 6.85 5.86 6.05 5.60
GDP forecast errors: 1889-1988
1 3.25 3.49 3.39 2.30 2.43 2.41
2 5.50 6.08 5.58 3.94 4.22 4.00
3 7.31 8.05 7.30 5.25 5.73 5.27
4 8.58 9.41 8.48 6.19 6.70 6.21
Note: All RMSE and MAE figures have been multiplied by a hundred.
Table 2 Biases of multi-step forecast errors for industrial
production, 1759-1988
h DS TS LLT DD TS-IC DS-L
1 0.93 1.57 0.46 0.03 0.21 0.05
2 1.60 2.64 0.71 0.05 0.40 0.10
3 2.34 3.75 0.96 0.07 0.61 0.15
4 3.03 4.76 1.18 0.06 0.80 0.17
Note: All figures have been multiplied by a hundred.
Table 3 Point forecast evaluation: tests of equal forecast
accuracy and of forecast encompassing
DS,TS DS,LLT TS,LLT
Industrial production forecast
errors: 1759-1988
DM - 0.147 0.859
FE(1) 0.166 - 0.249 - -
FE(2) 0.094 - 0.325 0.002 -
Industrial production forecast
errors: 1759-1850
DM - 0.740 1.0
FE(1) 0.180 - 0.023 0.438 -
FE(2) 0.127 - 0.050 0.505 -
Industrial production forecast
errors: 1851-1944
DM 0.092 0.058 0.076
FE(1) 0.305 0.192 0.962 0.002 0.820
FE(2) 0.237 0.133 0.963 0.002 0.823
Industrial production forecast
errors: 1945-1988
DM 0.182 0.169 0.258
FE(1) 0.450 0.303 0.679 0.112 0.979
FE(2) 0.451 0.306 0.690 0.125 0.979
GDP forecast errors: 1889-1988
DM 0.147 0.122 0.735
FE(1) 0.397 - 0.618 0.005 0.002
FE(2) 0.528 0.004 0.748 0.064 0.003
Industrial production forecast
errors: 1759-1988
DM
FE(1) 0.114
FE(2) 0.143
Industrial production forecast
errors: 1759-1850
DM
FE(1) 0.072
FE(2) 0.081
Industrial production forecast
errors: 1851-1944
DM
FE(1) 0.002
FE(2) 0.003
Industrial production forecast
errors: 1945-1988
DM
FE(1) 0.172
FE(2) 0.185
GDP forecast errors: 1889-1988
DM
FE(1) 0.079
FE(2) 0.082
Notes: Rows headed DM report p-values of Diebold-Mariano
tests of equal forecast accuracy (equal MSFE) for the pair x, y.
A value of 0.04, say, indicates that there is only
a 4 per cent chance that y is no less accurate than x;
a value of 0.80, say, indicates that there is a 20
per cent chance that x is no less accurate than y.
FE(1) is the standard forecast-encompassing test,
where for the pair x, y the first entry is the p-value of the
null that x forecast encompasses y, and the second
entry is the p-value of the null that y
forecast encompasses x. FE(2) is the modified form
of the test given in Harvey et al. (1998). '-' denotes
zero to three decimal places.
Table 4 Interval forecast evaluation
                 DS                       TS                      LLT
      Uncond.  Ind.   Cond.    Uncond.  Ind.   Cond.   Uncond.  Ind.   Cond.
Industrial production: 1759-1988
50%    0.235    -       -       0.692    -     0.001    0.147  0.071  0.068
75%    0.591    -     0.001     0.407  0.002   0.005    0.939    -      -
90%    0.105  0.412   0.193     0.825  0.052   0.149    0.827  0.112  0.275
Industrial production: 1759-1850
50%    0.210  0.304   0.269     0.300  0.546   0.484    0.060  0.781  0.163
75%    0.325  0.214   0.284     0.239  0.246   0.255    1.000  0.086  0.229
90%    0.427  0.280   0.407     0.350  0.710   0.603    0.238  0.375  0.336
Industrial production: 1851-1944
50%    0.302  0.003   0.008     0.215  0.001   0.002    0.148  0.121  0.105
75%    0.132  0.003   0.004     0.132  0.013   0.014    0.132  0.003  0.004
90%    0.890  0.235   0.489     0.838  0.071   0.192    0.037  0.383  0.078
Industrial production: 1945-1988
50%    0.015  0.094   0.013     0.015  0.094   0.013    0.006  0.445  0.017
75%    0.023  0.563   0.065     0.023  0.563   0.065    0.007  0.326  0.016
90%    0.002   NA      NA       0.002   NA      NA      0.181  0.659  0.371
GDP: 1889-1988
50%    0.071  0.857   0.193     0.317  0.557   0.510    0.161  0.191  0.159
75%    0.817  0.190   0.413     0.494  0.033   0.081    0.482  0.440  0.580
90%    0.118  0.051   0.044     0.206  0.422   0.325    0.063  0.021  0.012
Notes: The elements are p-values of the nulls of the
Christoffersen tests of correct unconditional coverage
('Uncond.'), independence ('Ind.') and correct conditional
coverage ('Cond.') of 50, 75 and 90 per cent nominal
intervals. '-' indicates zero to three decimal places.
'NA' indicates that there were no 'misses', so the
$LR_{ind}$ (and therefore $LR_{cc}$) tests are inapplicable.
Table 5 Dates of forecast failure
DS TS LLT War
Industrial production: 1759-1988
1774 * - * w?
1792 - * - w
1799 * * - w
1810 - * - w
1811 - * - w
1815 - * - w
1817 - * * -
1825 * * * -
1827 - * * w?
1828 * * * -
1836 * * - -
1839 * * - w?
1842 - - * -
1844 * * - -
1867 - - * -
1869 - - * -
1870 * * - w?
1871 - * - -
1879 - - * w?
1885 - - * w?
1886 - - * -
1906 * * * -
1916 - - * w
1917 * * * w
1918 - - * w
1920 * * * a
1921 * * * a
1922 * * - a
1924 * * * -
1929 * * * -
1931 * * * -
1934 - - * -
1936 - - * -
1975 - - * o
1980 - - * o
No. 16 22 24
% 7 10 10
Notes: The dates record when 1-step
forecasts lay outside their 90 per cent
prediction intervals. (w) denotes a war
involving the UK directly, (w?) that a war
was ongoing, (a) the aftermath of
World War I, and (o) the oil crises.
Table 6 Dates of forecast failure
DS TS LLT
GDP: 1889-1988
1892 * * *
1894 * - *
1898 - * -
1903 - * -
1908 * * *
1915 * - *
1919 * * *
1920 * * *
1921 * * *
1922 * - -
1923 * - -
1926 * * *
1927 * * *
1931 * * *
1940 * * *
1941 - - *
1944 * * *
1945 - - *
1946 - - *
1973 * * *
1974 * - -
1975 - - *
1980 - - *
No. 15 14 16
% 15 14 16
Note: The dates record when 1-step forecasts lay
outside their 90 per cent prediction intervals.
[Four graphs omitted]
Appendix 1. Derivation of the MSFE for the LLT
Recall that the model in SSF is given by:
$y_t = Z\alpha_t + G\epsilon_t$
$\alpha_{t+1} = T\alpha_t + H\epsilon_t$,
where $\alpha_t = (\mu_t \; \beta_t)'$, $Z = (1 \; 0)$,
$G = (\sigma_\epsilon \; 0 \; 0)$,
$\epsilon_t' = (\epsilon_t/\sigma_\epsilon \;\; \eta_t/\sigma_\eta \;\; \zeta_t/\sigma_\zeta)$
and:
$T = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}$.
Below, the matrix $P = \{p_{ij}\}$ is the updated smoothed MSE matrix
of the state vector. The h-step ahead forecasts are given by:
$E[y_{T+h} \mid I_T] = Z T^h \alpha_{T|T}$.
For the LLT model:
$y_{T+h|T} \equiv E[y_{T+h} \mid I_T] = \mu_{T|T} + h\beta_{T|T}$.
The h-step forecast error is:
$e_{T+h|T} = y_{T+h} - y_{T+h|T}$
$= Z(\alpha_{T+h} - T^h \alpha_{T|T}) + G\epsilon_{T+h}$
$= Z T^h (\alpha_T - \alpha_{T|T})
+ Z \sum_{i=0}^{h-1} T^i H \epsilon_{T+h-1-i} + G\epsilon_{T+h}$ (7)
because:
$\alpha_{T+h} = T^h \alpha_T + \sum_{i=0}^{h-1} T^i H \epsilon_{T+h-1-i}$.
The three terms in (7) are uncorrelated, so:
$E[e_{T+h|T}^2] = Z T^h P T^{h\prime} Z'
+ Z \sum_{i=0}^{h-1} T^i H H' T^{i\prime} Z' + GG'$,
which for the LLT specialises to:
$MSFE = p_{11} + 2h p_{12} + h^2 p_{22} + h\sigma_\eta^2
+ h(h-1)(2h-1)\sigma_\zeta^2/6 + \sigma_\epsilon^2$.
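The closed-form expression can be checked numerically against the state-space recursion it specialises. A minimal Python sketch (the state MSE matrix P and the disturbance variances below are illustrative values, not estimates from the paper):

```python
import numpy as np

def llt_msfe(h, P, sig2_eps, sig2_eta, sig2_zeta):
    """Closed-form h-step MSFE for the local linear trend model."""
    p11, p12, p22 = P[0, 0], P[0, 1], P[1, 1]
    return (p11 + 2 * h * p12 + h ** 2 * p22
            + h * sig2_eta
            + h * (h - 1) * (2 * h - 1) * sig2_zeta / 6
            + sig2_eps)

def llt_msfe_recursive(h, P, sig2_eps, sig2_eta, sig2_zeta):
    """E[e^2] = Z T^h P T^h' Z' + Z (sum_i T^i HH' T^i') Z' + GG'."""
    Z = np.array([[1.0, 0.0]])
    T = np.array([[1.0, 1.0], [0.0, 1.0]])
    HHt = np.diag([sig2_eta, sig2_zeta])      # H H' for the state disturbances
    Th = np.linalg.matrix_power(T, h)
    mse = Z @ Th @ P @ Th.T @ Z.T             # uncertainty in the current state
    for i in range(h):                        # accumulated state disturbances
        Ti = np.linalg.matrix_power(T, i)
        mse = mse + Z @ Ti @ HHt @ Ti.T @ Z.T
    return mse.item() + sig2_eps              # GG' = measurement-error variance
```

For any symmetric P and horizon h the two functions agree, confirming the specialisation: $ZT^i = (1 \; i)$, so each disturbance term contributes $\sigma_\eta^2 + i^2\sigma_\zeta^2$, and summing $i^2$ over $i = 0, \ldots, h-1$ yields the $h(h-1)(2h-1)/6$ factor.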
Appendix 2. Forecast evaluation measures
Until recently, attention in macroeconomics was focused primarily
on the production and evaluation of point forecasts (see, for example,
Wallis, 1995), with only summary information regarding the degree of
uncertainty associated with forecasts, such as standard errors, being
provided. However, Christoffersen (1998) suggests ways of evaluating
interval forecasts, as a step toward providing a more complete
description of the uncertainty surrounding forecasts.
Point forecasts
Although there is a large literature on forecast-accuracy
comparisons, there have been few formal comparisons of rival forecasts
of the same phenomena, where by 'formal', we mean comparisons
which attempt to assess whether differences between rival forecasts can
be attributed to sampling variability, or whether they are
'significant'. Diebold and Mariano (1995) propose a test of
the null of equal forecast accuracy, for an arbitrary 'loss
function', g([e.sub.i,T+h]), where [e.sub.i,T+h] is the h-step
ahead forecast error from using model i commencing at the forecast
origin T. The loss differential is defined as
$d_{T+h} \equiv g(e_{i,T+h}) - g(e_{j,T+h})$ for rival forecasts i and j, so
that equal forecast accuracy entails $E[d_{T+h}] = 0$. Given a
covariance-stationary sample realisation $\{d_t\}$ of n
'observations', the asymptotic distribution of the sample mean
loss differential $\bar{d} = n^{-1} \sum_{t=1}^{n} d_t$
under the null is given by:
$\sqrt{n}\,\bar{d} \stackrel{D}{\rightarrow} N[0, V]$,
where we can approximate the variance of $\bar{d}$ by:
$V(\bar{d}) = n^{-1}\left(\gamma_0 + 2\sum_{i=1}^{h-1} \gamma_i\right)$, (8)
where $\gamma_i$ is the ith autocovariance (so $V(\bar{d})$ is a
weighted sum of autocovariances), and the truncation is determined by
the forecast horizon. Diebold and Mariano (1995) discuss the choice of
weights (here, rectangular) and the truncation lag.
The large-sample statistic that Diebold and Mariano (1995) propose
for testing the null of equal forecast accuracy is:
$\bar{d}/\sqrt{\hat{V}(\bar{d})} \stackrel{a}{\sim} N[0,1]$,
where $\hat{V}(\bar{d})$ is a consistent estimate of $V(\bar{d})$,
using a weighted sum of the sample autocovariances in (8):
$\hat{\gamma}_k = n^{-1}
\sum_{t=k+1}^{n} (d_t - \bar{d})(d_{t-k} - \bar{d})$.
Harvey, Leybourne and Newbold (1997) propose a modified version of
this test statistic that corrects for its tendency to be over-sized, and
present simulation evidence that attests to the usefulness of their
modification. They suggest using:
$\sqrt{\dfrac{n + 1 - 2h + n^{-1}h(h-1)}{n}}
\times \dfrac{\bar{d}}{\sqrt{\hat{V}(\bar{d})}}$
and comparing this to the Student t distribution with n-1 degrees
of freedom. To simplify matters, we consider only 1-step forecasts, so
that serial correlation is ignored, and use:
$\hat{\gamma}_0 = n^{-1} \sum_{t=1}^{n} (d_t - \bar{d})^2$,
with:
$\bar{d}/\sqrt{n^{-1}\hat{\gamma}_0} \stackrel{a}{\sim} N[0,1]$,
because for h = 1 the correction factor is
$\sqrt{n^{-1}(n-1)} \approx 1$,
and for $n > 30$, the normal closely approximates the t
distribution.
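The simplified 1-step test can be sketched in a few lines. A minimal Python version under squared-error loss (the forecast-error series below are illustrative, not data from the paper):

```python
import numpy as np
from math import erf, sqrt

def dm_test_1step(e_i, e_j):
    """1-step Diebold-Mariano test on squared-error loss; serial
    correlation in the loss differential is ignored, as in the text.
    Returns the statistic and a two-sided asymptotic p-value."""
    d = e_i ** 2 - e_j ** 2                 # loss differential d_t
    n = d.size
    d_bar = d.mean()
    gamma0 = ((d - d_bar) ** 2).mean()      # sample autocovariance at lag 0
    stat = d_bar / sqrt(gamma0 / n)         # d_bar / sqrt(n^{-1} gamma_0)
    p_value = 1 - erf(abs(stat) / sqrt(2))  # 2 * (1 - Phi(|stat|))
    return stat, p_value
```

A positive statistic indicates that forecast i has the larger average loss; swapping the arguments flips the sign but leaves the p-value unchanged.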
Forecast encompassing
For two rival 1-step forecasts, $y_{i,t}$ and $y_{j,t}$, a
forecast-encompassing test is implemented by regressing
$e_{i,t} (= y_t - y_{i,t})$ on $y_{i,t} - y_{j,t}$:
$y_t - y_{i,t} = \alpha(y_{i,t} - y_{j,t}) + \zeta_t$ (9)
and using a t-test of $H_0: \alpha = 0$. This is equivalent to
the regression:
$y_t = (1+\alpha)y_{i,t} - \alpha y_{j,t} + \zeta_t$ (10)
and is a variant of the test due to Chong and Hendry (1986): see
Ericsson (1992) and Clements and Hendry (1998b). Harvey, Leybourne and
Newbold (1998) show that this test may be over-sized when there is a
substantial sample of forecast errors available and the forecast errors
are not approximately normal. They suggest a number of alternatives,
including:
$R_1 = \dfrac{1}{\sqrt{n}} Q_1^{-1/2}
\sum_{t=1}^{n} (e_{i,t} - e_{j,t})^2 \, \hat{\alpha}$,
where:
$Q_1 = \dfrac{1}{n} \sum_{t=1}^{n} (e_{i,t} - e_{j,t})^2 \hat{\zeta}_t^2$
and $\hat{\alpha}$ and $\hat{\zeta}_t$ are the OLS estimate of $\alpha$ and
the residuals from (9). Harvey et al. (1998) also show that the
Diebold-Mariano test of equal MSFE is closely related to the notion of
forecast encompassing.
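Both the conventional t-test from (9) and the robust statistic $R_1$ can be computed directly. A minimal Python sketch (the outcomes and rival forecasts below are illustrative):

```python
import numpy as np

def encompassing_tests(y, f_i, f_j):
    """Test H0: forecast i encompasses forecast j, i.e. alpha = 0 in (9).
    Returns the conventional t-statistic and the robust HLN statistic R_1."""
    e_i = y - f_i                                  # errors of forecast i
    x = f_i - f_j                                  # regressor in (9)
    n = y.size
    alpha_hat = (x * e_i).sum() / (x ** 2).sum()   # OLS slope, no intercept
    resid = e_i - alpha_hat * x                    # residuals zeta_t
    # conventional t-statistic for alpha = 0
    s2 = (resid ** 2).sum() / (n - 1)
    t_stat = alpha_hat / np.sqrt(s2 / (x ** 2).sum())
    # robust R_1; note e_{i,t} - e_{j,t} = -(f_i - f_j), so (e_i - e_j)^2 = x^2
    Q1 = (x ** 2 * resid ** 2).mean()
    R1 = (x ** 2).sum() * alpha_hat / (np.sqrt(n) * np.sqrt(Q1))
    return t_stat, R1
```

Since both statistics share the sign of $\hat{\alpha}$ and differ only in how the variance is estimated, they always point in the same direction; $R_1$ simply weights the squared regressor by the squared residuals, making it robust to heteroscedastic, non-normal forecast errors.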
Interval evaluation
Christoffersen (1998) suggests that a 'good' interval
forecast should have correct conditional coverage, such that the
interval is wider in volatile periods than in those of greater
tranquility: otherwise, observations outside the interval will be
clustered in volatile periods, and absent in tranquil periods.
Christoffersen (1998, p. 844) develops a 'unified framework for
testing the conditional coverage' using three likelihood-ratio (LR)
tests for unconditional coverage, independence, and conditional coverage
respectively.
Let $L_{t|t-1}(p)$ and $U_{t|t-1}(p)$ denote the lower and
upper limits of the interval forecast of $y_t$ made at time t-1,
with a coverage probability p. Define the indicator function, $1_t$,
which takes the value unity when $y_t$ lies inside the interval
range and zero otherwise. For 1-step ahead forecasts $t = 1, \ldots, n$,
there is a sequence of outcomes $\{1_t\}_{t=1}^{n}$.
We wish to test that:
$E[1_t \mid 1_{t-1}, 1_{t-2}, \ldots, 1_1] = E[1_t] = p$.
That is, at each point in time, the probability that the actual
outcome will fall within the interval is p, and this is independent of
the history up to that point. Christoffersen (1998, Lemma 1) establishes
that this is equivalent to testing that the sequence $\{1_t\}$ is
identically and independently distributed as a Bernoulli
with parameter p, denoted: $\{1_t\} \sim IID_B(p)$.
We can test this hypothesis in two parts. First, the test of correct
unconditional coverage: $E[1_t] = p$ versus $E[1_t] = \pi \neq p$.
For a sample of n intervals, the likelihood under the null is:
$L(p; 1_1, 1_2, \ldots, 1_n) = (1-p)^{n_0} p^{n_1}$,
where $n_0$ is the number of times the actual does not fall
within the interval, and $n_1$ is the number of 'hits'.
The likelihood under the alternative is:
$L(\pi; 1_1, 1_2, \ldots, 1_n) = (1-\pi)^{n_0} \pi^{n_1}$,
so the standard LR test is:
$LR_{uc} = -2\ln\left(L(p;\cdot)/L(\hat{\pi};\cdot)\right)
\stackrel{a}{\sim} \chi^2_1$,
where $\stackrel{a}{\sim}$ denotes 'asymptotically distributed as',
and $\hat{\pi} = n_1/(n_0 + n_1)$.
To test for independence, model the indicator function by a binary
first-order Markov chain with transition probability matrix:
$\Pi_1 = \begin{pmatrix} 1-\pi_{01} & \pi_{01} \\
1-\pi_{11} & \pi_{11} \end{pmatrix}$, (11)
where $\pi_{ij} = \Pr(1_t = j \mid 1_{t-1} = i)$. Under
independence, $\pi_{ij} = \pi_j$, $i,j = 0,1$, where
$\pi_j = \Pr(1_t = j)$. So under the null, (11) is restricted
to:
$\Pi_2 = \begin{pmatrix} 1-\pi_1 & \pi_1 \\
1-\pi_1 & \pi_1 \end{pmatrix}$. (12)
The $\pi_{ij}$ and $\pi_1$ are estimated by their sample
frequencies, so:
$\hat{\Pi}_1 = \begin{pmatrix}
n_{00}/(n_{00}+n_{01}) & n_{01}/(n_{00}+n_{01}) \\
n_{10}/(n_{10}+n_{11}) & n_{11}/(n_{10}+n_{11}) \end{pmatrix}$,
where $n_{ij}$ is the number of times an outcome i is followed by an
outcome j, and:
$\hat{\pi}_1 = \dfrac{n_{01} + n_{11}}{n_{00} + n_{10} + n_{01} + n_{11}}$.
The unrestricted likelihood is:
$L(\hat{\Pi}_1) = (1-\hat{\pi}_{01})^{n_{00}}
\hat{\pi}_{01}^{n_{01}} (1-\hat{\pi}_{11})^{n_{10}}
\hat{\pi}_{11}^{n_{11}}$ (13)
and the restricted is:
$L(\hat{\Pi}_2) = (1-\hat{\pi}_1)^{(n_{00}+n_{10})}
\hat{\pi}_1^{(n_{01}+n_{11})}$. (14)
The likelihood-ratio test statistic is:
$LR_{ind} = -2\ln\left(L(\hat{\Pi}_2)/L(\hat{\Pi}_1)\right)
\stackrel{a}{\sim} \chi^2_1$
under the null hypothesis of independently distributed indicator
function values.
A joint test of correct conditional coverage is
$LR_{cc} = LR_{uc} + LR_{ind} \stackrel{a}{\sim} \chi^2_2$.
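The three LR statistics follow directly from the hit/miss and transition counts. A minimal Python sketch (the hit sequence and nominal coverage below are illustrative; the code assumes every outcome and transition type occurs at least once, since otherwise a log(0) arises, as with the 'NA' entries in table 4):

```python
import numpy as np
from math import log

def christoffersen_tests(hits, p):
    """Return (LR_uc, LR_ind, LR_cc) for an indicator series of
    interval-forecast hits (1 = inside interval) and nominal coverage p."""
    hits = np.asarray(hits, dtype=int)
    n1 = int(hits.sum())                 # hits
    n0 = hits.size - n1                  # misses
    pi_hat = n1 / (n0 + n1)
    # LR_uc: correct unconditional coverage
    lr_uc = -2 * (n0 * log(1 - p) + n1 * log(p)
                  - n0 * log(1 - pi_hat) - n1 * log(pi_hat))
    # transition counts n_ij: the number of times i is followed by j
    trans = np.zeros((2, 2), dtype=int)
    for a, b in zip(hits[:-1], hits[1:]):
        trans[a, b] += 1
    n00, n01 = trans[0, 0], trans[0, 1]
    n10, n11 = trans[1, 0], trans[1, 1]
    pi01 = n01 / (n00 + n01)
    pi11 = n11 / (n10 + n11)
    pi1 = (n01 + n11) / (n00 + n10 + n01 + n11)
    # LR_ind: restricted likelihood (14) against unrestricted (13)
    lr_ind = -2 * ((n00 + n10) * log(1 - pi1) + (n01 + n11) * log(pi1)
                   - n00 * log(1 - pi01) - n01 * log(pi01)
                   - n10 * log(1 - pi11) - n11 * log(pi11))
    return lr_uc, lr_ind, lr_uc + lr_ind
```

Each of LR_uc and LR_ind is compared with chi-squared(1) critical values, and their sum LR_cc with chi-squared(2), matching the p-values reported in table 4.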