Forecast Evaluation Tests and Negative Long-run Variance Estimates in Small Samples

In this paper, we show that when computing standard Diebold-Mariano-type tests for equal forecast accuracy and forecast encompassing, the long-run variance estimate can frequently be negative when dealing with multi-step-ahead predictions in small, but empirically relevant, sample sizes. We subsequently consider a number of alternative approaches to dealing with this problem, including direct inference in the problem cases and use of long-run variance estimators that guarantee positivity. The finite sample size and power of the different approaches are evaluated using extensive Monte Carlo simulation exercises. Overall, for multi-step-ahead forecasts, we find that the recently proposed Coroneo and Iacone (2016) test, which is based on a weighted periodogram long-run variance estimator, offers the best finite sample size and power performance.


Introduction
Given the critical role that forecasting plays in economic and financial research and policymaking, the evaluation of competing forecasts of the same outcomes has become an extensive and prominent field in the econometric and empirical economic literatures. Within this field, the most common forecast evaluation exercise is to compare the accuracy of two or more sets of forecasts on the basis of some measure of loss associated with the forecast errors, such as mean squared forecast error. In a key contribution to the literature, Diebold and Mariano (1995) [DM] proposed an approach for testing equal forecast accuracy valid for potentially contemporaneously correlated, serially correlated and non-normal forecast errors, based on testing for a zero mean in a series defined as the difference between the two forecasts' error loss functions (the "loss differential"). Harvey et al. (1997) [HLNa] suggested two finite sample modifications to the DM statistic to improve size control in small samples, based on a finite sample bias correction to the test statistic, and using Student's t critical values rather than those from a standard normal. Application of the DM test or its HLNa variant has now become prevalent in empirical forecasting research, to the extent that it is now routine for the results of such forecast accuracy tests to be reported alongside any forecast comparisons.
Testing for equal forecast accuracy is just one approach to evaluating the predictive ability of rival forecasts. A second popular evaluation method is to test for whether one set of forecasts encompasses another, in the sense that the encompassed forecasts do not result in a reduction in forecast accuracy when used in combination with the encompassing set of forecasts. Harvey et al. (1998) [HLNb] proposed a forecast encompassing test based on a DM-type approach, where the loss differential is redefined to permit testing an encompassing null hypothesis, and the approach has become standard in cases where one abstracts from model parameter estimation uncertainty.
In this paper we focus on the behaviour of the DM/HLNa tests based on squared error loss, and the HLNb test for encompassing, in small samples. Our work is therefore in a similar vein to that of Ashley (2003) and Ashley and Tsang (2014) who investigate out-of-sample inference with limited data availability. The DM test statistic is fundamentally comprised of the loss differential series sample mean standardised by an estimate of the long-run variance. DM make use of the fact that optimal $h$-steps-ahead forecasts are at most $(h-1)$-order dependent to advocate use of a rectangular kernel in the long-run variance estimator which truncates at lag $h-1$. While this approach results in decent finite sample size and power properties for many sample size and $h$ settings, the long-run variance estimator is not guaranteed to be positive whenever $h > 1$. DM note this possibility, but suggest that such an outcome would be rare; similarly, Clark (1999) finds a low occurrence of negative long-run variance estimates in his equal accuracy test simulations (always less than 3% of replications). However, these observations were made on the basis of results that considered predictions only up to two steps ahead. Our first contribution is to highlight that the prevalence of negative long-run variance estimates can be much greater in small samples when longer horizon forecasts are considered. For example, when testing equal mean squared forecast error with $h = 6$, we find that negative variance estimates arise approximately 20% of the time for a sample size of 16, rising to over 40% of the time for a sample size of 8.
In practical applications, often due to the limitations of economic or forecast data, it is not uncommon for forecast evaluation to be conducted using sample size and forecast horizon settings that lie in the region where negative variance estimates occur frequently. For example, in the context of testing for equal forecast accuracy, the recent papers Caporale and Gil-Alana (2014), Dreger and Wolters (2014), Dib et al. (2008) and Qin et al. (2008) all implement the DM/HLNa tests in forecast samples smaller than 25 observations with horizons of $h = 6$ or greater. Further, Mehl (2009) and Chow and Choy (2006) have reported finding negative DM/HLNa long-run variance estimates when using samples of 18 and 24 forecasts at horizons of 6 and 5-6, respectively.
Given that negative variance estimates can arise frequently in situations of practical relevance, it is important to determine the best approach to deliver a reliable testing procedure in terms of small sample size and power properties. DM suggest treating a negative variance estimate as a zero, thereby automatically rejecting the null against a two-sided alternative in such cases. However, given the low occurrence of negative variance estimates in their simulations, the size implications of such an approach are not fully explored. In the simulation work of Clark (1999), the relatively few replications where negative variance estimates were obtained were excluded from the simulations, thereby abstracting from the effects of dealing with certain problematic cases. HLNa and HLNb simulated combinations of sample size and forecast horizon where negative variances can occur frequently, but their simulations failed to correctly deal with negative variance estimates, thus again the impact of negativity in the variance estimate is not clear. Of course, other long-run variance estimators exist which ensure non-negativity; for example, DM, Clark (1999) and others discuss the possible use of the Bartlett kernel, and in a recent paper on testing equal forecast accuracy, Coroneo and Iacone (2016) recommend use of the nonparametric periodogram estimator of Hualde and Iacone (2017), combined with use of bandwidth-dependent critical values. The second contribution of this paper is therefore to formally assess the behaviour of different strategies for dealing with the potential problem of negative long-run variance estimation in tests of equal accuracy and encompassing. We conduct an extensive set of Monte Carlo simulations to establish the small sample size and power properties of different approaches.
Broadly, we find that for multi-step-ahead forecasts, the Coroneo and Iacone (2016) approach outperforms other methods, and the attractive finite sample properties reported in their paper for moderate sample sizes and forecast horizons extend to the small sample and longer horizon region under focus in this paper for both equal accuracy and encompassing tests, where the DM/HLNa and HLNb tests can suffer from negative variance estimates.
The outline of the paper is as follows. In section 2, we briefly outline the DM, HLNa and HLNb tests for equal mean squared forecast error and forecast encompassing. Section 3 highlights by simulation the frequency with which negative long-run variance estimates can arise for different sample sizes and forecast horizons. In section 4, a number of ways of dealing with these cases are considered, including alternative long-run variance estimators that are guaranteed to be positive, and section 5 investigates the performance of these procedures using finite sample size and power simulations. In section 6 we conduct a related set of simulation experiments using a DGP calibrated to the empirical work of Dreger and Wolters (2014), while section 7 considers simulations for the case where forecasts are obtained from estimated models. Section 8 concludes.

Standard tests for equal accuracy and encompassing
Consider first the issue of evaluating whether two competing sets of forecasts are equally accurate according to some loss function-based accuracy measure, or whether one forecast outperforms the other in terms of that metric. Denote the actuals by $y_t$ and the competing forecasts by $f_{1t}$ and $f_{2t}$, $t = 1, \ldots, T$, and consider a given loss function $L(\cdot)$ that depends on the forecast errors, so that the cost of error associated with the forecast $f_{it}$ is $L(e_{it})$, $i = 1, 2$. Now define the loss differential series
$$d_t = L(e_{1t}) - L(e_{2t}), \quad t = 1, \ldots, T.$$
The null hypothesis of equal forecast accuracy, according to the specified loss function $L(\cdot)$, can then be expressed as
$$H_0: E(d_t) = 0.$$
For example, under squared error loss, $d_t = e_{1t}^2 - e_{2t}^2$ and the null hypothesis entails the equality of population mean squared forecast errors.
Under the assumptions that $d_t$ is covariance stationary and short memory, DM propose a test of $H_0$ based on the asymptotic distribution of the sample mean loss differential $\bar{d} = T^{-1}\sum_{t=1}^{T} d_t$. Denoting a consistent long-run variance estimator by $\hat{\omega}^2$, the DM test statistic is then given by
$$DM = \frac{\bar{d}}{\sqrt{\hat{\omega}^2 / T}},$$
which has an asymptotic standard normal distribution under the null. DM suggest use of a long-run variance estimator comprised of a weighted sum of sample autocovariances, and, motivated by the fact that optimal $h$-steps-ahead forecast errors are at most $(h-1)$-dependent, they advocate using a rectangular kernel truncated at lag $h-1$, i.e.
$$\hat{\omega}^2 = \hat{\gamma}_0 + 2\sum_{j=1}^{h-1} \hat{\gamma}_j, \qquad \hat{\gamma}_j = T^{-1}\sum_{t=j+1}^{T} (d_t - \bar{d})(d_{t-j} - \bar{d}). \qquad (1)$$
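As an illustrative sketch of the calculation just described, the following Python fragment computes the rectangular-kernel long-run variance estimate and the DM statistic under squared error loss. The function names (`autocov`, `lrv_rectangular`, `dm_statistic`) are hypothetical, and the sample autocovariances are assumed to be normalised by $T$; the sketch is not taken from the authors' code.

```python
import math


def autocov(d, j):
    """Sample autocovariance of d at lag j, normalised by T (an assumption)."""
    T = len(d)
    dbar = sum(d) / T
    return sum((d[t] - dbar) * (d[t - j] - dbar) for t in range(j, T)) / T


def lrv_rectangular(d, h):
    """Rectangular-kernel long-run variance estimate, truncated at lag h - 1."""
    return autocov(d, 0) + 2.0 * sum(autocov(d, j) for j in range(1, h))


def dm_statistic(e1, e2, h):
    """DM statistic under squared error loss.

    Returns None when the variance estimate is not positive, in which case
    the statistic cannot be computed.
    """
    d = [a * a - b * b for a, b in zip(e1, e2)]
    T = len(d)
    dbar = sum(d) / T
    omega2 = lrv_rectangular(d, h)
    if omega2 <= 0:
        return None
    return dbar / math.sqrt(omega2 / T)
```

A strongly alternating loss differential such as `[1, -1, 1, -1]` with `h = 2` already produces a negative estimate in this sketch, which is the situation the remainder of the paper is concerned with.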
HLNa propose a modification of the DM statistic designed to improve the small sample size behaviour of the test. Their statistic is based on an approximate bias correction to the long-run variance estimator, and can be written as
$$MDM = T^{-1/2}\left[T + 1 - 2h + T^{-1}h(h-1)\right]^{1/2} DM. \qquad (2)$$
These authors also suggest use of $t_{T-1}$ critical values in place of those from the standard normal, again to better control small sample size.
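A minimal sketch of the HLNa correction follows, assuming the standard form of the modification in which the DM statistic is scaled by $T^{-1/2}[T + 1 - 2h + T^{-1}h(h-1)]^{1/2}$. The function name `mdm_from_dm` is hypothetical.

```python
import math


def mdm_from_dm(dm, T, h):
    """Scale a DM statistic by the HLNa small-sample correction factor."""
    factor = (T + 1 - 2 * h + h * (h - 1) / T) / T
    return math.sqrt(factor) * dm
```

Expanding the bracket shows that the factor under the square root equals $(T-h)(T-h+1)/T^2$, so it is positive whenever $T > h$ and always shrinks the statistic towards zero. The resulting value would then be compared with $t_{T-1}$ critical values taken from tables or a statistics library.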
Next consider investigating whether one set of forecasts encompasses another, in that the accuracy of one set of (encompassing) forecasts $f_{1t}$ cannot be improved through linear combination with a second set of (encompassed) forecasts $f_{2t}$. HLNb develop a test for forecast encompassing based on the Bates and Granger (1969) forecast combination scheme, where the combination weights sum to one.$^2$ Denoting the combined forecast by $f_{ct}$, the combination is
$$f_{ct} = (1 - \lambda) f_{1t} + \lambda f_{2t},$$
where $\lambda$ ($0 \le \lambda \le 1$) determines the weights associated with the constituent forecasts. In this context, forecast $f_{1t}$ encompasses forecast $f_{2t}$ if the optimal mean squared error-minimising combination weight
$$\lambda_{opt} = \frac{E(e_{1t}^2) - E(e_{1t}e_{2t})}{E(e_{1t}^2) + E(e_{2t}^2) - 2E(e_{1t}e_{2t})}$$
is equal to zero. The null of forecast encompassing can then be expressed in a DM-type form:
$$H_0: E(d_t) = 0, \qquad d_t = e_{1t}(e_{1t} - e_{2t}). \qquad (3)$$
HLNb therefore propose applying the DM approach to this testing problem, along with the HLNa bias correction and use of $t_{T-1}$ critical values. The test statistic is then (2) but with $d_t$ given by (3). The test is conducted against the one-sided alternative $E(d_t) > 0$ (i.e. $\lambda > 0$), given the assumption of a non-negative combination weight.
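The link between the combination weight and the encompassing loss differential can be sketched as follows. The sample analogue of the optimal weight is the mean of $e_{1t}(e_{1t} - e_{2t})$ divided by the mean of $(e_{1t} - e_{2t})^2$, so a zero weight corresponds to a zero-mean differential. The function names are hypothetical, and replacing population moments with sample moments is an illustrative assumption rather than part of the HLNb test itself.

```python
def encompassing_differential(e1, e2):
    """Loss differential d_t = e_1t * (e_1t - e_2t) used in the encompassing test."""
    return [a * (a - b) for a, b in zip(e1, e2)]


def sample_combination_weight(e1, e2):
    """Sample analogue of the MSE-minimising weight on the second forecast."""
    T = len(e1)
    m11 = sum(a * a for a in e1) / T
    m22 = sum(b * b for b in e2) / T
    m12 = sum(a * b for a, b in zip(e1, e2)) / T
    return (m11 - m12) / (m11 + m22 - 2.0 * m12)
```

The numerator of `sample_combination_weight` equals the sample mean of `encompassing_differential(e1, e2)`, which is why a test on the mean of $d_t$ is a test on the weight.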
$^1$ Note that ARCH-type behaviour in the forecast errors induces additional autocorrelation into $d_t$, requiring use of higher order lags; see Harvey, Leybourne and Newbold (1999).
$^2$ Extensions of the test to allow for biased forecasts and combination weights that are not constrained to sum to one are discussed in Clements and Harvey (2009).

Frequency of negative long-run variance estimates
The long-run variance estimator (1), based on the rectangular kernel, is not guaranteed to be positive whenever $h > 1$. In practice, of course, a negative outcome is highly problematic since the $MDM$ statistic for testing equal accuracy or encompassing cannot be computed. In such circumstances, a practitioner must then decide how to deal with such a result; suggestions in the literature include treating the estimate as zero or using an alternative long-run variance estimator that guarantees positivity. Whatever strategy is followed will have implications for the size and power of the resulting testing procedure, so it is valuable to quantify how frequently negative long-run variance estimates are likely to be encountered in practice.
While DM, Clark (1999) and Coroneo and Iacone (2016), inter alios, note that (1) can produce a negative result, little evidence has so far been provided as to the extent of this potential problem.
To shed more light on the issue, in this section we report results from Monte Carlo simulation experiments to determine the frequency with which negative long-run variance estimates arise for different sample sizes and forecast horizons, both for equal accuracy and encompassing tests.
To begin, we consider the case of testing for equal forecast accuracy, adopting a standard simulation data generating process [DGP] consistent with the work of DM, HLNa and Clark (1999). We assume mean squared error loss, so that $d_t = e_{1t}^2 - e_{2t}^2$, $t = 1, \ldots, T$, generating the forecast errors according to the following DGP, which allows for the errors of $h$-steps-ahead forecasts to follow moving average [MA] processes of order $h-1$: [equation missing in source], $t = 1-(h-1), \ldots, T$. The ratio of the variances of the two forecast errors is given by $R > 0$, with $R = 1$ giving the null and $R \neq 1$ the alternative. Focusing on the small samples that are often employed in forecast evaluation exercises, we simulate this DGP for $T \in \{8, 16, 32, 64\}$, $h \in \{2, 3, 4, 5, 6\}$, and calculate the frequency with which negative values of the long-run variance estimator (1) arise. We consider three settings for the MA parameters: (i) the case of no serial correlation with $\theta_j = 0$ $\forall j$, (ii) a case of moderate serial correlation with $\theta_j = 0.9/(h-1)$ $\forall j$, and (iii) a case of high degree serial correlation with $\theta_j$ set to the $j$th element of $\theta = (0.95, 0.9, 0.8, 0.65, 0.6)$, these values being drawn from the US inflation forecast error-based DGP 1 of Clark and McCracken (2013). Here and throughout the paper, simulations are conducted using 10,000 Monte Carlo replications. Table 1 reports the results under the null ($R = 1$), and under the alternative ($R > 1$), with the settings $R = 12$, $R = 7$, $R = 3$ and $R = 2$ for $T = 8$, $T = 16$, $T = 32$ and $T = 64$, respectively (chosen to ensure that the test powers considered in section 5 are roughly comparable across sample sizes).
As might be expected, we find negative long-run variance estimates occur with a frequency that increases with the forecast horizon, and decreases with the sample size. While the occurrence of negative estimates is rare when $T = 64$, the problem can be substantial for the smaller sample sizes considered, particularly for longer forecast horizons where the frequency can rise above 40%. In such circumstances, a practitioner would be unable to compute the standard $DM$ or $MDM$ test statistics almost half the time. The pattern of frequencies for negative long-run variance estimates has very little dependence on whether the simulations are conducted under the null or alternative hypotheses, and while there is a reduction in the frequency of negative estimates as the degree of serial correlation increases, the overall features of the results are similar across the different dependence settings, particularly for the longer forecast horizons. We also considered simulations where the forecast errors were contemporaneously correlated, but this had little effect on the proportion of negative long-run variances obtained. These results highlight a potentially serious issue with the implementation of standard tests for equal forecast accuracy in small samples.
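A Monte Carlo experiment of this kind can be sketched as below. Because the exact DGP is not reproduced in this text, the sketch assumes a simple form: each forecast error is an MA($h-1$) process in independent standard normal innovations, and the first error series is scaled by $\sqrt{R}$ so that $R$ is the variance ratio. These choices, together with the function names, are illustrative assumptions and will not necessarily reproduce the frequencies in Table 1.

```python
import random


def ma_errors(T, theta, rng):
    """Simulate T observations of an MA(q) process, q = len(theta), N(0,1) innovations."""
    q = len(theta)
    eps = [rng.gauss(0.0, 1.0) for _ in range(T + q)]
    return [
        eps[t + q] + sum(theta[j] * eps[t + q - 1 - j] for j in range(q))
        for t in range(T)
    ]


def lrv_rectangular(d, h):
    """Rectangular-kernel long-run variance estimate truncated at lag h - 1."""
    T = len(d)
    dbar = sum(d) / T

    def gamma(j):
        return sum((d[t] - dbar) * (d[t - j] - dbar) for t in range(j, T)) / T

    return gamma(0) + 2.0 * sum(gamma(j) for j in range(1, h))


def negative_frequency(T, h, theta=None, R=1.0, reps=10000, seed=0):
    """Share of replications in which the long-run variance estimate is negative."""
    rng = random.Random(seed)
    # Default: no serial correlation; otherwise use the first h - 1 MA parameters.
    theta = [0.0] * (h - 1) if theta is None else list(theta)[: h - 1]
    scale = R ** 0.5  # assumed: R scales the variance of the first error series
    negatives = 0
    for _ in range(reps):
        e1 = [scale * x for x in ma_errors(T, theta, rng)]
        e2 = ma_errors(T, theta, rng)
        d = [a * a - b * b for a, b in zip(e1, e2)]
        if lrv_rectangular(d, h) < 0:
            negatives += 1
    return negatives / reps
```

For example, `negative_frequency(8, 6)` and `negative_frequency(64, 2)` can be compared to see the qualitative pattern described above, with the frequency rising as $h$ increases and $T$ falls.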
Turning now to testing for forecast encompassing, we let $d_t = e_{1t}(e_{1t} - e_{2t})$, $t = 1, \ldots, T$, where the forecast errors are generated according to the following DGP, again allowing for MA($h-1$)-dependence in the errors of $h$-steps-ahead forecasts: [equation missing in source]; $t = 1-(h-1), \ldots, T$ with 2 > 2 . The null hypothesis that forecast $f_{1t}$ encompasses $f_{2t}$ is obtained by setting = 1, while a setting of < 1 gives the alternative. Under the alternative, it can be shown that power depends only on the single parameter k = p 2 2 =(1 ). Table 2 reports results for the frequency of negative long-run variance estimates using the same settings for $T$, $h$ and $\theta_j$ as in Table 1. Results are reported under both the null and alternative, with the settings for $k$ under the alternative being $k = 1.25$, $k = 2.00$, $k = 3.00$ and $k = 4.50$ for $T = 8$, $T = 16$, $T = 32$ and $T = 64$, respectively (again chosen to broadly align the test power levels considered in section 5 across sample sizes).
The pattern of negative estimates for the long-run variance is very similar in the case of testing for forecast encompassing to that for testing for equal forecast accuracy. Indeed, on comparing Tables 1 and 2 for a given combination of $T$, $h$ and $\theta_j$, it is clear that the numerical frequencies are very close to each other, suggesting that the prevalence of negative long-run variance estimates is driven more by the interplay of sample size, serial correlation and the number of estimated autocovariances ($h-1$), rather than by the precise form of $d_t$. We again see a rising incidence of negative estimates as $T$ decreases and as $h$ increases. As with the equal accuracy results, it makes little difference whether the long-run variance is being calculated under the null or alternative, and the frequencies of negative estimates are highest for lower degrees of serial correlation. The overall finding is that negative long-run variance estimates can occur with very high probability for equal accuracy and encompassing tests when using multi-step-ahead forecasts with small, yet practically relevant, sample sizes.

Adjusted Diebold-Mariano-type tests
Given the prevalence of negative long-run variance estimates that arise for multi-step-ahead forecasts in small samples when using the standard long-run variance estimator in the DM-type tests, it is important to establish methods for dealing with this potential problem. In this section we consider a number of possible approaches, all based on the DM-type tests for equal accuracy and encompassing. The following section then evaluates their relative performance in terms of finite sample size and power.
The first approach we consider is the suggested method of DM in the equal accuracy testing context, which is to treat any occurrence of a negative long-run variance as a zero, viewing the negative estimate as indicative of a very small long-run variance. This of course implies a test statistic of $\pm\infty$, depending on the sign of the numerator $\bar{d}$. In a two-sided testing context, as in DM, such a treatment induces an immediate rejection of the null hypothesis, so a negative long-run variance estimate always indicates evidence in favour of the alternative hypothesis under this approach. When testing against a one-sided alternative, as is common in applications of equal accuracy tests and always the case when testing for encompassing, treating a negative long-run variance as zero will either induce automatic rejection or non-rejection, depending on whether the implied test statistic value of $+\infty$ or $-\infty$ lies in the relevant one-tailed critical region. Applying this approach to the $MDM$ tests of HLNa and HLNb, we can express the method as
$$MDM_{rej} = \begin{cases} MDM & \text{if } \hat{\omega}^2 > 0 \\ \text{sign}(\bar{d}) \cdot \infty & \text{if } \hat{\omega}^2 \le 0 \end{cases}$$
with the test statistic to be compared with $t_{T-1}$ critical values.
Given the frequency with which negative long-run variance estimates can occur, the $MDM_{rej}$ approach will induce substantial over-size in two-sided equal accuracy testing procedures for $h > 1$ and small $T$, as all occurrences of a negative $\hat{\omega}^2$ trigger a rejection of the null. A similar, albeit reduced, feature of over-size would also be expected for one-sided equal accuracy tests and tests for forecast encompassing, with rejections of the null occurring whenever a negative $\hat{\omega}^2$ coincides with the appropriate sign of $\bar{d}$. A simple conservative approach which would avoid such properties is to treat the occurrence of a negative long-run variance estimate as a failure to correctly estimate the true long-run variance, and default to non-rejection of the null in such instances. One way of writing such a method would be to define the adjusted test statistic as
$$MDM_{non} = \begin{cases} MDM & \text{if } \hat{\omega}^2 > 0 \\ 0 & \text{if } \hat{\omega}^2 \le 0 \end{cases}$$
with the test statistic again being compared with $t_{T-1}$ critical values. A potential down-side of this approach is that the greater size control afforded by treating negative estimate cases as non-rejections is also likely to be associated with low power under the respective test alternative.
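The two decision rules can be sketched side by side. The fragment below takes the sample mean $\bar{d}$ and the rectangular-kernel estimate $\hat{\omega}^2$ as inputs and returns the adjusted statistic; the function names are hypothetical, and the HLNa correction factor is assumed to take its standard form.

```python
import math


def hln_factor(T, h):
    """HLNa small-sample correction factor (standard form, assumed here)."""
    return math.sqrt((T + 1 - 2 * h + h * (h - 1) / T) / T)


def mdm_rej(dbar, omega2, T, h):
    """Treat a non-positive variance estimate as zero: the statistic becomes +/- infinity."""
    if omega2 > 0:
        return hln_factor(T, h) * dbar / math.sqrt(omega2 / T)
    # Sign follows dbar; a dbar of exactly zero maps to +infinity in this sketch.
    return math.copysign(math.inf, dbar)


def mdm_non(dbar, omega2, T, h):
    """Treat a non-positive variance estimate as a non-rejection: the statistic is zero."""
    if omega2 > 0:
        return hln_factor(T, h) * dbar / math.sqrt(omega2 / T)
    return 0.0
```

Returning an infinite or zero statistic keeps the comparison with $t_{T-1}$ critical values unchanged: an infinite value lies in the rejection region of its own sign, and zero never does.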
Another simple approach is to deal with a negative long-run variance estimate by replacing it with the corresponding short-run variance estimate $\hat{\gamma}_0$, thereby reducing the bandwidth in (1) from $h-1$ to zero. While this approach neglects the impact of autocorrelation terms, it can be argued that the very presence of a negative estimate indicates that estimation of such components is highly unreliable in these situations. When the short-run variance estimator is used, the appropriate bias correction in the $MDM$ statistic is that for $h = 1$, i.e. the correction factor in (2) becomes $T^{-1/2}(T-1)^{1/2}$, and the overall test statistic that adopts this statistic when a negative long-run variance is encountered can be written as
$$MDM_{SR} = \begin{cases} MDM & \text{if } \hat{\omega}^2 > 0 \\ T^{-1/2}(T-1)^{1/2}\, \dfrac{\bar{d}}{\sqrt{\hat{\gamma}_0 / T}} & \text{if } \hat{\omega}^2 \le 0. \end{cases}$$
Critical values from the $t_{T-1}$ distribution are again to be used.
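A sketch of the short-run variance fallback is given below. It assumes the standard HLNa correction factor, which reduces to $\sqrt{(T-1)/T}$ when $h = 1$, and autocovariances normalised by $T$. The function names are hypothetical.

```python
import math


def autocov(d, j):
    """Sample autocovariance of d at lag j, normalised by T (an assumption)."""
    T = len(d)
    dbar = sum(d) / T
    return sum((d[t] - dbar) * (d[t - j] - dbar) for t in range(j, T)) / T


def mdm_sr(d, h):
    """MDM statistic, falling back to the short-run variance when the
    rectangular-kernel long-run variance estimate is not positive."""
    T = len(d)
    dbar = sum(d) / T
    gamma0 = autocov(d, 0)
    omega2 = gamma0 + 2.0 * sum(autocov(d, j) for j in range(1, h))
    if omega2 > 0:
        correction = (T + 1 - 2 * h + h * (h - 1) / T) / T
        variance = omega2
    else:
        correction = (T - 1) / T  # HLNa correction evaluated at h = 1
        variance = gamma0
    return math.sqrt(correction) * dbar / math.sqrt(variance / T)
```

For instance, `mdm_sr([2, -1, 2, -1], 2)` takes the fallback branch, because the lag-one autocovariance is strongly negative and drives the rectangular-kernel estimate below zero.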
While the above methods replace negative long-run variances with simple decision rules or a short-run variance estimate, the next two approaches we consider retain a proper estimate of the long-run variance, but make use of estimators that impose positivity. An obvious possibility in this class is to replace the rectangular kernel in (1) with the Bartlett kernel, i.e.
where $m$ denotes the bandwidth. Clark (1999) considered such an approach with Newey-West and pre-whitened Newey-West bandwidth selection. While Clark's simulations abstracted from issues of negative variance estimation, it was found that a Bartlett-based approach could result in greater finite sample over-size than when using the rectangular kernel, hence it would not be recommended to use the Bartlett kernel in all circumstances, particularly when the rectangular kernel does not have negative variance estimate problems. Here, we consider a hybrid approach, whereby the standard $MDM$ test is used provided the long-run variance estimate is positive, but in the case of a negative estimate, the statistic switches to one based on the Bartlett kernel.
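The hybrid rule can be sketched as follows. The Bartlett weights are assumed here to follow the Newey-West convention $1 - j/(m+1)$ for $j = 1, \ldots, m$; the paper's own display is not reproduced in this text, so this convention, the function names, and the absence of a bias correction on the Bartlett branch are assumptions of the sketch. The bandwidth is set to $m = h-1$, matching the choice the paper makes for this approach.

```python
import math


def autocov(d, j):
    """Sample autocovariance of d at lag j, normalised by T (an assumption)."""
    T = len(d)
    dbar = sum(d) / T
    return sum((d[t] - dbar) * (d[t - j] - dbar) for t in range(j, T)) / T


def bartlett_lrv(d, m):
    """Bartlett-kernel long-run variance with assumed weights 1 - j / (m + 1)."""
    return autocov(d, 0) + 2.0 * sum(
        (1.0 - j / (m + 1.0)) * autocov(d, j) for j in range(1, m + 1)
    )


def mdm_b(d, h):
    """Use MDM when the rectangular-kernel estimate is positive, otherwise
    switch to an uncorrected DM statistic based on the Bartlett kernel."""
    T = len(d)
    dbar = sum(d) / T
    omega2 = autocov(d, 0) + 2.0 * sum(autocov(d, j) for j in range(1, h))
    if omega2 > 0:
        correction = math.sqrt((T + 1 - 2 * h + h * (h - 1) / T) / T)
        return correction * dbar / math.sqrt(omega2 / T)
    return dbar / math.sqrt(bartlett_lrv(d, h - 1) / T)
```

The declining weights are what keep the Bartlett estimate from turning negative in cases where the equally weighted rectangular sum does.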
For consistency with the optimal forecast-motivated choice of truncation $h-1$ in (1), along with the fact that use of the Bartlett kernel is most likely to arise in small samples, we set the Bartlett bandwidth to $m = h-1$. As the HLNa bias correction does not apply to $\hat{\omega}^2_{Bart}$ (and an equivalent bias correction is not possible to obtain without effectively reducing $\hat{\omega}^2_{Bart}$ to $\hat{\omega}^2$), we define the DM statistic that uses the Bartlett long-run variance estimator as
$$DM_{Bart} = \frac{\bar{d}}{\sqrt{\hat{\omega}^2_{Bart} / T}}.$$
We can then write the third testing approach as
$$MDM_{B} = \begin{cases} MDM & \text{if } \hat{\omega}^2 > 0 \\ DM_{Bart} & \text{if } \hat{\omega}^2 \le 0. \end{cases}$$
The original DM test, and the variants outlined above, all make use of weighted sample autocovariances in the long-run variance estimator. An alternative approach proposed by Coroneo and Iacone (2016) is to use a weighted periodogram estimator, and these authors recommend construction of a DM-type test using the estimator of Hualde and Iacone (2017). Denoting the periodogram of $d_t$ for Fourier frequency $\lambda_j = 2\pi j / T$ by
$$I(\lambda_j) = \left| \frac{1}{\sqrt{2\pi T}} \sum_{t=1}^{T} d_t e^{i \lambda_j t} \right|^2,$$
with $i$ the imaginary unit, they suggest use of the Daniell kernel with bandwidth $m$ to construct the weighted periodogram estimator of the long-run variance
$$\hat{\omega}^2_{Dan} = 2\pi \frac{1}{m} \sum_{j=1}^{m} I(\lambda_j),$$
which is then used to construct the DM-type test statistic
$$DM_{CI} = \frac{\bar{d}}{\sqrt{\hat{\omega}^2_{Dan} / T}}.$$
If the bandwidth is treated as fixed, $\hat{\omega}^2_{Dan}$ is not a consistent estimator of $\omega^2$, but is asymptotically unbiased, and under the null hypothesis of $E(d_t) = 0$, $DM_{CI}$ follows an asymptotic $t_{2m}$ distribution. This fixed-$m$ treatment results in a test with appealing finite sample properties, offering better size control relative to the $m \to \infty$ treatment that results in standard normal limit theory. Coroneo and Iacone observe that the $t_{2m}$ distribution can act as a better approximation of the true null distribution for a smaller bandwidth, whereas larger bandwidths can be associated with higher power, hence a size-power trade-off emerges.
Following these authors, we consider two versions of the test, setting the bandwidths according to $m = \lfloor T^{1/3} \rfloor$ and $m = \lfloor T^{1/4} \rfloor$ (where $\lfloor \cdot \rfloor$ denotes the integer part of the argument), denoting the resulting test statistics by $DM_{CI,1}$ and $DM_{CI,2}$, respectively. Note that for any given sample size, $m$ is then treated as a fixed number so that the fixed-$m$ asymptotic theory can be applied, with critical values drawn from the $t_{2m}$ distribution.
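The weighted periodogram statistic and the two bandwidth rules can be sketched as follows. The periodogram is assumed to be normalised by $2\pi T$ and the Daniell estimator to be $2\pi$ times the average of the first $m$ periodogram ordinates; the function names are hypothetical. The integer part is computed by an exact integer search, because a floating-point cube root such as `64 ** (1 / 3)` can fall just below 4 and be truncated to 3.

```python
import cmath
import math


def integer_root(T, k):
    """Largest integer m with m ** k <= T, i.e. the integer part of T ** (1 / k)."""
    m = int(round(T ** (1.0 / k)))
    while m ** k > T:
        m -= 1
    while (m + 1) ** k <= T:
        m += 1
    return m


def periodogram(d, j):
    """Periodogram of d at Fourier frequency 2 * pi * j / T (assumed normalisation)."""
    T = len(d)
    lam = 2.0 * math.pi * j / T
    s = sum(d[t] * cmath.exp(1j * lam * (t + 1)) for t in range(T))
    return abs(s) ** 2 / (2.0 * math.pi * T)


def dm_ci(d, m):
    """Weighted periodogram DM-type statistic and its t degrees of freedom (2m)."""
    T = len(d)
    dbar = sum(d) / T
    omega2 = 2.0 * math.pi * sum(periodogram(d, j) for j in range(1, m + 1)) / m
    return math.sqrt(T) * dbar / math.sqrt(omega2), 2 * m
```

A call such as `dm_ci(d, integer_root(len(d), 3))` corresponds to the first bandwidth rule and `dm_ci(d, integer_root(len(d), 4))` to the second. Because the estimator averages squared moduli, it cannot be negative, which is the property that motivates its use here.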
In addition to the above methods, we also experimented with other possible solutions to the negative variance estimate problem. We considered replacing a negative long-run variance estimate with a modified estimate based on reducing the rectangular kernel bandwidth sequentially until a positive estimate was obtained, and we investigated the exponential covariogram-based long-run variance estimator proposed in the spatial prediction context by Hering and Genton (2011). We also considered alternatives to the Bartlett long-run variance estimator with bandwidth $h-1$, examining results for the Bartlett kernel using a larger bandwidth setting of $2(h-1)$, and also the standard and pre-whitened quadratic spectral long-run variance estimators of Andrews (1991) and Andrews and Monahan (1992) with automatic bandwidth selection.
However, these alternatives did not deliver superior finite sample size and power performance relative to the better of the approaches considered above, hence we do not detail these tests and their results in this paper; full results are available from the authors on request.

Finite sample size and power
In this section we consider the finite sample performance of the different methods outlined in the previous section. We first consider testing for equal forecast accuracy, again focusing on mean squared error loss ($d_t = e_{1t}^2 - e_{2t}^2$), and simulate the empirical sizes of the $MDM_{rej}$, $MDM_{non}$, $MDM_{SR}$, $MDM_{B}$, $DM_{CI,1}$ and $DM_{CI,2}$ testing approaches, with the tests conducted against a two-sided alternative at the nominal 0.10-level. In addition to these six approaches, for comparison we also report results for the $DM_{Bart}$ statistic compared with $t_{T-1}$ critical values, which always employs the Bartlett kernel-based estimator $\hat{\omega}^2_{Bart}$ regardless of the sign of the rectangular kernel-based estimator $\hat{\omega}^2$. As with the earlier simulations in section 3, we use a standard simulation setup in line with DM, HLNa and Clark (1999). Table 3 reports the sizes for the same simulation DGPs that were considered in the negative long-run variance simulations of section 3 when the null hypothesis was imposed ($R = 1$). Note that $DM_{CI,1}$ and $DM_{CI,2}$ are identical when $T = 16$ since $\lfloor T^{1/3} \rfloor = \lfloor T^{1/4} \rfloor$ in this case.
When $h = 1$, the original $MDM$ statistic cannot suffer from negative long-run variance estimation problems, so $MDM_{rej}$, $MDM_{non}$, $MDM_{SR}$ and $MDM_{B}$ all amount to simply conducting $MDM$. (Note also that when $h = 1$, no serial correlation is present in the DGP, hence the $\theta_j$ settings play no role.) Here, the test is well behaved, with sizes very close to the nominal level, with only modest under-size displayed for $T = 8$ and $T = 16$. A very similar pattern of size behaviour is also seen for $DM_{CI,1}$ and $DM_{CI,2}$, while $DM_{Bart}$ exhibits some minor over-size but is also generally well behaved. All tests are therefore reliable for one-step-ahead forecasts and there is little to choose between them in terms of finite sample size.
For multi-step-ahead forecasts ($h > 1$), the possibility of negative long-run variance estimates arises and so the method of dealing with these problem cases results in different size properties for the overall procedures that we consider. The $MDM_{rej}$ approach translates any negative long-run variance estimate into a rejection of the null, thus the high frequency of negative estimates for larger $h$ and smaller $T$ induces a high degree of over-size for this approach.
In line with the results of Table 1, the size of $MDM_{rej}$ reaches almost 0.50, and such large upward size distortions render this procedure invalid. The $DM_{Bart}$ test can also exhibit severe over-size, consistent with the simulations of Clark (1999), with size rising to almost 0.50 in the worst cases. The $MDM_{B}$ method achieves better size control through use of the Bartlett kernel only in problem cases, but is again subject to quite substantial over-size for moderate values of $h$ and $T$, with empirical size rising above 0.30 in the case of high degree serial correlation.
The $MDM_{SR}$ approach (which replaces negative long-run variance estimates with a short-run variance estimate) offers better size control for the cases of no serial correlation and modest serial correlation, but, as might be expected, when the degree of serial correlation is high, the simplification of using only a short-run variance results in substantial size distortions. Of the $MDM$-based approaches, the best performing method is $MDM_{non}$ (which translates negative variance estimates into non-rejections of the null). However, the size can still be inflated above the nominal level, with sizes of around 0.16 occurring. In contrast, the $DM_{CI,1}$ and $DM_{CI,2}$ weighted periodogram approaches offer a much greater degree of size control across $h$ and $T$.
Apart from the case of $T = 8$ with high degree serial correlation, the two versions generally have size close to 0.10, with the worst upward size distortion being a size below 0.12, offering a clear improvement over the other methods considered. When $T = 8$, $h > 3$ and the errors are highly serially correlated, $DM_{CI,1}$ can suffer from more substantial over-size, while $DM_{CI,2}$ retains excellent size control. The attractive finite sample size results reported in Coroneo and Iacone (2016) for moderate sample sizes and forecast horizons therefore extend to the small sample and longer horizon region under focus here, particularly for $DM_{CI,2}$, suggesting a valuable role for the $DM_{CI}$ approach in delivering forecast accuracy tests with reliable size in small samples.
When comparing results for the over-sized $DM_{Bart}$ test and the well-behaved $DM_{CI,2}$ test, both of which always use a long-run variance estimator that is guaranteed to be positive yet have very different finite sample size properties, it is interesting to examine the differences between the tests, so as to ascertain the components of $DM_{CI,2}$ that are instrumental in achieving size control. The $DM_{CI,2}$ statistic makes use of a different form of long-run variance estimator […] In addition to evaluating the empirical sizes of the procedures, it is also important to assess their relative powers. Table 4 […] 3. The results are reported in Table 5. As for the equal accuracy case, for one-step-ahead forecasts, we observe that the $MDM$-based procedures (which are identical for $h = 1$) and the $DM_{CI}$ tests display good finite sample size control. Indeed, the $MDM$ test has almost no size distortions even for small $T$, while only a very modest amount of under-size is displayed for $DM_{CI,1}$ and $DM_{CI,2}$. On the other hand, $DM_{Bart}$ is over-sized, particularly for the smaller sample sizes. When $h > 1$, we find a similar picture of size behaviour to that in Table 3.
Specifically, $MDM_{rej}$ and $DM_{Bart}$ can be substantially over-sized, although the over-size of $MDM_{rej}$ is less severe than for the equal accuracy case, since here the encompassing test is conducted against a one-sided alternative, hence only a proportion of the negative long-run variances obtained induce a rejection of the null. Of the $MDM$-based approaches, $MDM_{non}$ offers the best size control with size always below 0.14, while $MDM_{SR}$ and $MDM_{B}$ suffer from greater size distortion, although to a lesser extent than was found in the equal accuracy testing context. $DM_{CI,1}$ and $DM_{CI,2}$ again have very good size behaviour across most settings, the exceptions being when the errors are highly serially correlated and either $T = 16$ together with $h = 5$ or $h = 6$, where $DM_{CI,1}$ and $DM_{CI,2}$ can suffer from a small amount of upward size distortion, or $T = 8$ and $h > 2$, in which case $DM_{CI,1}$ can again be over-sized, while $DM_{CI,2}$ offers greater size control in these cases.
Turning to power for forecast encompassing tests, Table 6 gives results for the size-adjusted powers of $MDM_{non}$, $MDM_{SR}$, $DM_{Bart}$, $MDM_B$, $DM_{CI,1}$ and $DM_{CI,2}$ for the relevant DGPs of section 3, with k varying across T as specified in that section; the critical values used for the size-adjustment are again obtained by simulation from the corresponding size experiment. The relative power rankings of the tests are unchanged compared to tests for equal forecast accuracy, so the comments and conclusions outlined above are equally applicable in this context.
We again find that MDM has a power advantage over the $DM_{CI}$ tests for h = 1, while $DM_{CI,1}$ generally outperforms the other procedures for T = 8 and h > 1, and for the longer forecast horizons when T is larger. $DM_{CI,2}$ again has generally lower power than $DM_{CI,1}$, although this is only of real import when T = 8. Once again, therefore, MDM is to be recommended for one-step-ahead forecasts, but for multi-step-ahead forecasts, apart from a potential role for $MDM_{non}$ when a simple MDM-based modification is desired, it is the $DM_{CI,1}$ and $DM_{CI,2}$ tests that are to be preferred. These tests offer the best finite sample performance in terms of size and relative power, with $DM_{CI,2}$ recommended for T = 8 when h > 2, and $DM_{CI,1}$ otherwise.

Simulations calibrated from empirical data
In order to ensure that our simulation results are representative of what is likely to be encountered in practical applications, we now consider a set of simulations for a DGP where the sample sizes, forecast horizons and forecast error serial correlation settings are all calibrated according to a particular application in the literature. Specifically, we follow the Dreger and Wolters (2014) application where Euro-area inflation is forecast one, two and three years ahead from an autoregressive model using quarterly data. We obtained HICP inflation data from the authors for the period 1981Q1-2010Q4, and, following Dreger and Wolters, we construct 1-, 4-, 8- and 12-quarter inflation rates as follows
To determine the degree of serial correlation present in the forecast errors, we fit moving average processes to the three forecast error series, determining the order of MA process in each case according to the Akaike information criterion, selecting from MA processes up to order h − 1. We find the selected models to be MA(3), MA(2) and MA(4) for h = 4, h = 8 and h = 12, respectively, with the fitted MA coefficients given in Table 7. Although these MA parameters have been estimated using a very small sample size, it can be seen that the values obtained are not inconsistent with the settings adopted in the earlier simulation exercises.
Given the calibrations obtained from the Dreger and Wolters application, we repeat the simulation experiments considered in sections 3 and 5, but now with the settings T = {29, 25, 21}, h = {4, 8, 12} and the corresponding $\theta_j$ values from Table 7. Accordingly, Table 8 reports the frequency with which negative long-run variance estimates arise when using the standard rectangular kernel-based estimator (1), for both equal accuracy tests and encompassing tests, under both the respective null and alternative hypotheses. The settings under the alternative for the three horizon/sample size pairings considered are R = {4, 6, 8} (for testing equal accuracy) and k = {2, 1.8, 1.8} (for encompassing testing), again chosen so that the test powers are broadly comparable across sample sizes. For h = 4, we observe a very low occurrence of negative long-run variance estimates, while for h = 8 the proportion of negative estimates across the simulations is in the region of 0.15, rising to around 0.33 for h = 12. These comments apply equally to tests for equal forecast accuracy and tests for forecast encompassing. The sample sizes considered in this empirically calibrated exercise lie in between the T = 16 and T = 32 settings used in the section 3 simulations, and two of the forecast horizons considered are greater than the range considered in section 3. However, it is clear that the pattern of frequencies for negative estimates is consistent with the earlier results, with a high incidence of problematic negative outcomes as the forecast horizon increases. This further demonstrates that the possibility of obtaining a negative long-run variance estimate is an empirically relevant issue when applying standard tests for equal accuracy and encompassing in small samples.
forecast accuracy and forecast encompassing, and the two bandwidth settings in $DM_{CI,1}$ and $DM_{CI,2}$ deliver similar size results. Given that here we consider longer forecast horizons than in section 5, it is reassuring to see that $DM_{CI,1}$ and $DM_{CI,2}$ retain good size control across h. As would be expected given the earlier simulations, substantial over-size is seen for $MDM_{rej}$
Table 10 reports the corresponding size-adjusted powers of the procedures, and, with the exception of the badly over-sized $DM_{Bart}$ test, $DM_{CI,1}$ displays the best power performance, followed by $DM_{CI,2}$. In contrast, the best-sized MDM-based procedure, $MDM_{non}$, suffers from relatively low size-adjusted power for h = 8 and h = 12. These results clearly strengthen the case for use of $DM_{CI,1}$ or $DM_{CI,2}$ in practical applications.
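To make concrete how a negative estimate can arise, the following is a minimal Monte Carlo sketch, not the authors' simulation code. It assumes that the rectangular kernel-based estimator is the sample variance of the loss differential plus twice the sum of its first h − 1 sample autocovariances (each scaled by 1/T), that the two forecast error series are independent Gaussian MA processes with identical coefficients, and that loss is squared error. The function names and the MA coefficients in the usage note are hypothetical placeholders and are not the Table 7 values.

```python
import random


def rectangular_lrv(d, h):
    """Rectangular-kernel long-run variance estimate with truncation lag h - 1."""
    T = len(d)
    d_bar = sum(d) / T
    dev = [x - d_bar for x in d]

    def autocov(j):
        # Sample autocovariance at lag j, scaled by 1/T.
        return sum(dev[t] * dev[t - j] for t in range(j, T)) / T

    return autocov(0) + 2.0 * sum(autocov(j) for j in range(1, h))


def ma_errors(T, theta, rng):
    """Simulate T draws of e_t = eps_t + sum_{j=1..q} theta[j-1] * eps_{t-j}."""
    q = len(theta)
    eps = [rng.gauss(0.0, 1.0) for _ in range(T + q)]
    return [
        eps[t + q] + sum(theta[j] * eps[t + q - 1 - j] for j in range(q))
        for t in range(T)
    ]


def negative_lrv_frequency(T, h, theta, reps=2000, seed=0):
    """Share of replications in which the rectangular estimate is negative."""
    rng = random.Random(seed)
    negatives = 0
    for _ in range(reps):
        e1 = ma_errors(T, theta, rng)
        e2 = ma_errors(T, theta, rng)
        # Squared-error loss differential under the equal-accuracy null.
        d = [a * a - b * b for a, b in zip(e1, e2)]
        if rectangular_lrv(d, h) < 0.0:
            negatives += 1
    return negatives / reps
```

A call such as `negative_lrv_frequency(T=21, h=12, theta=[0.5, 0.5, 0.5, 0.5])` mimics the longest-horizon setting in spirit only, since the coefficients are illustrative. For h = 1 the estimate reduces to a sample variance, so the function returns zero by construction.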

Impact of model parameter estimation uncertainty
Beginning primarily with West (1996), much work on forecast evaluation testing has focused on cases where the forecasts have been produced by estimated models, either non-nested or nested, and more sophisticated methods have been proposed to properly account for the impact that model parameter estimation uncertainty can have on the distributions of DM-type forecast accuracy and encompassing tests in such circumstances. For reviews of this literature, see West (2006) and Clark and McCracken (2013). In some situations where forecasts have been obtained from estimated models, the original DM approach is asymptotically valid without the need for any modification. Examples are where the forecast models are non-nested, linear and estimated by ordinary least squares (OLS), along with the loss function being mean squared forecast error, or when the number of forecast observations is small relative to the number of observations used for model estimation. In this section, we consider a set of simulations designed to examine the same issues of negative long-run variance estimation and test size performance in small samples, but now where the forecasts have first been obtained from estimated models. In order to focus on tests that are asymptotically valid, we restrict attention to tests for equal mean squared forecast error where the forecasts are obtained from non-nested linear models estimated by OLS.
Our forecasting exercise involves an in-sample period for model estimation, t = 1, ..., N, and an out-of-sample period for forecast evaluation, t = N + 1, ..., N + T. We consider the following DGP
$$y_t = \beta_1 x_{1t} + \beta_2 x_{2t} + \varepsilon_t, \quad t = 1, \ldots, N + T$$
where, without loss of generality, we interpret $x_{1t}$ and $x_{2t}$ to be predictor variables useful for forecasting $y_t$ at horizon h. We set $[x_{1t}, x_{2t}]' \sim N(0, I_2)$ and, as our focus here is on the impact of parameter estimation uncertainty rather than forecast error serial correlation, we simply generate $\varepsilon_t \sim N(0, 1)$, t = 1, ..., N + T, independently of $[x_{1t}, x_{2t}]'$. As in the $\theta_j = 0$ simulations of sections 3 and 5, we do not assume knowledge of this lack of serial correlation when constructing the test statistics, so the results can be compared directly with the $\theta_j = 0$ sections of Tables 1 and 3. We consider two model-based forecasts, with the models given by
$$\text{Model 1}: \quad y_t = \beta_1 x_{1t} + e_{1t}$$
$$\text{Model 2}: \quad y_t = \beta_2 x_{2t} + e_{2t}$$
which are first estimated by OLS over the period t = 1, ..., N to give the parameter estimates $\hat{\beta}_1$ and $\hat{\beta}_2$. The two forecast series are then specified as
$$f_{1t} = \hat{\beta}_1 x_{1t}, \quad t = N + 1, \ldots, N + T$$
$$f_{2t} = \hat{\beta}_2 x_{2t}, \quad t = N + 1, \ldots, N + T$$
with the corresponding forecast errors
$$\hat{e}_{1t} = y_t - f_{1t}, \quad \hat{e}_{2t} = y_t - f_{2t}$$
The tests for equal forecast accuracy are then defined exactly as in section 4, but with $e_{1t}$ and $e_{2t}$ replaced with $\hat{e}_{1t}$ and $\hat{e}_{2t}$, respectively. By setting $\beta_1 = \beta_2$, it is straightforward to show that $E(e_{1t}^2) = E(e_{2t}^2)$, so that the forecasts have equal accuracy in population, thereby giving the null hypothesis for our testing exercise. We set $\beta_1 = \beta_2 = 1$ and consider two in-sample period sizes, N = 40 and N = 80, combined with the same set of out-of-sample sizes and forecast horizons employed in the earlier simulations of sections 3 and 5. Table 11 reports results for the frequency of negative long-run variance estimates.
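The equal-accuracy property in population can be checked in two lines. The derivation below is a sketch that writes the two DGP coefficients as $\beta_1$ and $\beta_2$ (a notational assumption) and uses only the stated properties of the DGP: unit-variance, mutually uncorrelated predictors, and a unit-variance disturbance independent of them.

```latex
\begin{align*}
e_{1t} &= y_t - \beta_1 x_{1t} = \beta_2 x_{2t} + \varepsilon_t
  &&\Rightarrow\quad E(e_{1t}^2) = \beta_2^2 E(x_{2t}^2) + E(\varepsilon_t^2) = \beta_2^2 + 1, \\
e_{2t} &= y_t - \beta_2 x_{2t} = \beta_1 x_{1t} + \varepsilon_t
  &&\Rightarrow\quad E(e_{2t}^2) = \beta_1^2 E(x_{1t}^2) + E(\varepsilon_t^2) = \beta_1^2 + 1.
\end{align*}
```

The cross terms vanish because the disturbance is independent of the predictors, so the two mean squared errors coincide whenever $\beta_1^2 = \beta_2^2$, and in particular when $\beta_1 = \beta_2 = 1$, in which case both equal 2.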
On comparing these results (for both N = 40 and N = 80) with the $\theta_j = 0$ section of Panel A of Table 1, we find that the results are almost identical, hence the presence of forecast model parameter estimation uncertainty has almost no effect on the prevalence of negative estimates. Table 12 gives results for the empirical sizes of the test procedures, and while some minor differences are seen between the results for N = 40 and N = 80, the results for the different in-sample sizes are broadly similar to each other, and there is little difference between these results and those for $\theta_j = 0$ in Table 3. Once again, therefore, the impact of estimating the forecast models is very slight, and the same comments made in section 5 apply here. The fundamental findings of (i) a high frequency of negative long-run variance estimates when evaluating multi-step-ahead forecasts using small numbers of out-of-sample forecast errors, and (ii) the $DM_{CI}$ tests offering the best size control among the alternative procedures considered, are therefore equally relevant in the context of forecasts obtained from estimated models.
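The design of this experiment can be sketched in a few lines of code. The example below is illustrative only and is not the authors' implementation: it generates one replication of the two-predictor DGP, estimates each single-regressor model by no-intercept OLS on the first N observations, and returns the squared-error loss differential over the T evaluation periods. The function name, argument names and defaults are hypothetical.

```python
import random


def simulate_loss_differential(N, T, beta1=1.0, beta2=1.0, seed=0):
    """One replication: OLS-estimated forecasts and their loss differential."""
    rng = random.Random(seed)
    n = N + T
    x1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    x2 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [beta1 * a + beta2 * b + rng.gauss(0.0, 1.0) for a, b in zip(x1, x2)]

    def ols_slope(x, y_in):
        # No-intercept OLS slope: sum(x * y) / sum(x^2).
        return sum(a * b for a, b in zip(x, y_in)) / sum(a * a for a in x)

    # Estimate each model on the in-sample period only.
    b1_hat = ols_slope(x1[:N], y[:N])
    b2_hat = ols_slope(x2[:N], y[:N])

    # Out-of-sample forecast errors for the two competing models.
    e1_hat = [y[t] - b1_hat * x1[t] for t in range(N, n)]
    e2_hat = [y[t] - b2_hat * x2[t] for t in range(N, n)]

    d = [a * a - b * b for a, b in zip(e1_hat, e2_hat)]
    return b1_hat, b2_hat, d
```

For example, `simulate_loss_differential(N=40, T=16)` returns the two slope estimates and a list of 16 loss differentials, which is the series to which a long-run variance estimator and test statistic would then be applied. Repeating the call over many seeds gives the Monte Carlo frequencies.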

Conclusion
In this paper, we have highlighted that application of the standard DM-based tests for equal forecast accuracy and forecast encompassing can often result in a negative long-run variance estimate when dealing with multi-step-ahead predictions and small, but empirically relevant, sample sizes. Having examined a number of possible approaches to dealing with this problem, we have found that the recently proposed testing approach of Coroneo and Iacone (2016), which uses a weighted periodogram long-run variance estimator combined with fixed-bandwidth asymptotics, offers the best overall finite sample size and power performance. Use of this test with a bandwidth setting of $T^{1/3}$ or $T^{1/4}$ (the choice being determined by the sample size and forecast horizon involved) results in only modest size distortions, while power levels are appealing relative to other approaches, permitting reliable inference even in the small sample/long horizon cases we consider. Aside from this preferred approach, a case could possibly be made for a strategy that uses the MDM tests of Harvey et al. (1997, 1998) when a positive long-run variance estimate is obtained, and defaulting to a non-rejection of the null hypothesis when a negative long-run variance arises; while this approach does not perform as well as the Coroneo and Iacone (2016) procedure, it does have the advantage of simplicity, since no additional computation beyond calculation of the MDM statistic is required. Finally, when the forecast evaluation is being done with one-step-ahead predictions, no negative long-run variance estimates can arise with the standard tests, and the MDM tests provide good size control and superior power to the Coroneo and Iacone (2016) test.
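The simple fallback strategy described above can be sketched as follows. This is a hedged illustration for the two-sided equal accuracy case, not a reference implementation. It assumes the rectangular-kernel long-run variance estimator with truncation lag h − 1 and the Harvey et al. (1997) small-sample correction factor. The critical value `crit` must be supplied by the user (for example, a Student's t quantile with T − 1 degrees of freedom taken from tables). The function name is hypothetical, and treating an exactly zero estimate as a non-rejection is an added assumption.

```python
import math


def mdm_non_reject(d, h, crit):
    """Return True if equal accuracy is rejected (two-sided), else False."""
    T = len(d)
    d_bar = sum(d) / T
    dev = [x - d_bar for x in d]
    # Sample autocovariances at lags 0, ..., h - 1, each scaled by 1/T.
    gamma = [sum(dev[t] * dev[t - j] for t in range(j, T)) / T for j in range(h)]
    lrv = gamma[0] + 2.0 * sum(gamma[1:])
    if lrv <= 0.0:
        # Non-positive variance estimate: default to not rejecting the null.
        return False
    dm = d_bar / math.sqrt(lrv / T)
    # Small-sample correction of Harvey et al. (1997).
    correction = math.sqrt((T + 1 - 2 * h + h * (h - 1) / T) / T)
    return abs(correction * dm) > crit
```

For instance, `mdm_non_reject(d, h=4, crit=c)` applies the rule to a loss differential series `d` at horizon 4, where `c` is the chosen critical value. The only computation beyond the usual statistic is the sign check on the variance estimate.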
The simulations conducted in this paper considered a range of sample sizes and forecast horizons, as well as different degrees of serial correlation in the forecast errors. While we have focused throughout on normally distributed forecast errors, we also considered simulations based on errors drawn from the $t_6$ distribution, given that forecast errors often appear to display fat-tailed behaviour. We found the results to be qualitatively similar to those based on normal errors, hence our conclusions would be unchanged under such a forecast error assumption.
Finally, we note that the issue of negative long-run variance estimates would also be relevant in the recommended test of Harvey and Newbold (2000) for multiple forecast encompassing (where the null is that one forecast encompasses a number of competing predictors), since this test employs a multivariate version of the MDM approach. It would be expected that the variance-covariance estimator in the test statistic could fail to be positive definite for small samples and multi-step-ahead predictions, and in future work it would be interesting to consider extensions of the above techniques to that context.
Table 3. Empirical size of nominal 0.10-level tests for equal forecast accuracy.