The role of information in nonstationary regression

ABSTRACT The role of standard likelihood-based measures of information and efficiency is unclear when regressions involve nonstationary data. Typically, the standardized score is not asymptotically Gaussian and the standardized Hessian has a stochastic, rather than a deterministic, limit. Here we consider a time series regression involving a deterministic covariate, which can be evaporating, slowly evolving or nonstationary. It is shown that conditional information, or equivalently profile Kullback-Leibler and Fisher information, remains informative about both the accuracy, i.e. the asymptotic variance, of profile maximum likelihood estimators and the power of point optimal invariant tests for a unit root. Specifically, these information measures indicate that fractional, rather than linear, trends may minimize inferential accuracy. This is confirmed in a numerical experiment.


Introduction
Inference in models involving nonstationary variables is challenging in two important regards. First the standard Cramér-Rao efficiency theory does not apply. Estimators are, generally, not asymptotically normal nor do their covariances converge to Fisher information. Secondly, the asymptotic analysis of such models invariably provides stochastic representations for estimators and tests, rather than their distributional properties. Fisher information, as a probability metric, is not applicable in such models. Some of the asymptotic implications of these issues are explored in [1], while Marsh [2] considers the finite sample properties of Kullback-Leibler divergence. This paper considers two standard time series specifications, either for t = 1, . . . , T, ε t ∼ iidN(0, σ 2 ). In these models, d t represents a deterministic component that will be employed to capture the effect of both stationary or ergodic as well as nonstationary covariates. Typically, interest is in inference on ρ, i.e. testing for a unit root, while if d t = α x t for some choice of x t , then α will be a nuisance. In such circumstances, conditional information, Bhapkar and Srinivasan [3] and Zhu and Reid [4], ought be employed as a probability metric (see also [5] for different choices of such metrics) for inference about the interest parameter. Conditional information is defined for a log-likelihood l(θ 1 , θ 2 ) depending on interest parameter θ 1 and nuisance parameter θ 2 by CI θ 1 |θ 2 = I θ 1 θ 1 − I θ 1 θ 2 I −1 θ 2 θ 2 I θ 1 θ 2 , where I θ 1 θ 2 = E[−∂ 2 l(θ 1 , θ 2 )/∂θ 1 ∂θ 2 ]. Since standard information theory does not apply in nonstationary models, here an analogue is defined via expectation of the stochastic limit of the scaled log-likelihood Hessian. This limit is found by first imposing the unit root, giving a preferred point (see [6]) probability metric analogue. 
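In matrix terms, conditional information is the Schur complement of the nuisance block of the information matrix. A minimal numerical sketch (the matrix values are illustrative, not taken from the paper):

```python
import numpy as np

def conditional_information(info, k):
    """Conditional information for the first k parameters:
    the Schur complement I11 - I12 @ I22^{-1} @ I21 of the nuisance block."""
    I11, I12 = info[:k, :k], info[:k, k:]
    I21, I22 = info[k:, :k], info[k:, k:]
    return I11 - I12 @ np.linalg.solve(I22, I21)

# Illustrative information matrix for (theta_1, theta_2)
info = np.array([[2.0, 1.0],
                 [1.0, 2.0]])
print(conditional_information(info, 1))  # [[1.5]] = 2 - 1 * (1/2) * 1
```

When the off-diagonal block is zero the Schur complement reduces to I_{θ_1θ_1}, which is exactly why a Fisher matrix with zero cross-information appears, misleadingly, to carry no effect of the nuisance parameter.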
It is shown that conditional information about ρ in specification A corresponds to profile Kullback-Leibler and profile Fisher information in specification B. Although this metric neither bounds nor equals the asymptotic variance of an unbiased estimator for ρ, it remains informative about inferential accuracy. Specifically, it is found that these measures can be convex functions of the degree of trending: when d_t = α t^β they attain a unique minimum, at β* = (√6 − 1)/2 when only t^β is fitted and at a corresponding value β^+ when a constant is also included. The prediction that inferential accuracy is therefore minimized at these points is supported by a numerical experiment. The analysis of unit root tests began in the context of specification A. More recently, the set-up of specification B has dominated the literature, as it permits straightforward construction of invariant tests, having distributions free of nuisance parameters. In the context of the impact of covariates in unit root testing, Elliott et al. [7] characterize the asymptotic power envelope both for a general d_t = o(T^{1/2}) and for the linear trend case. Marsh [8] shows that Fisher information in the maximal invariant (to a linear trend) vanishes under a unit root, while Phillips [9] considers the impact of nonlinear and slowly evolving trends. On the other hand, Hansen [10] (see also [11,12]) explores the impact of stationary stochastic regressors in specification A. The results of this paper help shed light on these findings.
The plan for the paper is as follows. The motivation for the results is provided in Section 2 via consideration of the original Dickey-Fuller [13] formulation (i.e. specification A) and the effect of stationary covariates as in [10]. The main results of the paper are provided for specification B in Section 3, while Section 4 discusses these results and Section 5 concludes. An appendix provides the proofs of the main results as well as tables and graphs for the numerical analysis.

Motivation via specification A
The original Dickey-Fuller [13] unit root testing framework considered a model as in specification A, and it is within this context that the power enhancement from stationary covariates, see [10], is explored. In the simplest possible set-up, Dickey-Fuller tests of H_0: ρ = 1 are demonstrated in [10], and also [11], to have powers increasing in R², the squared correlation between the covariate and the regression errors. Since in the limit R² → 1 we could, in fact, observe the errors (y_t − ρ y_{t−1})_{t=1}^T, this result is to be expected, as well as being of empirical importance.
Here we explore the effect of the degree of covariate trending when testing H_0: ρ = 1 in the fitted model

y_t = ρ y_{t−1} + α t^β + ε_t,   (3)

with y_0 = 0, where we assume β ≥ −0.5 and that the data are generated via the pure random walk, y_t = u_t. In (3) we attempt to capture the effect of the covariate via the proxy variable {t^β}_{t=1}^T, i.e. we put d_t = α t^β. The aim is to capture the influence of different asymptotic covariate behaviour, i.e. whether the sequence {d_t}_{t=1}^T diverges or converges, and at what rate, on measures of inferential accuracy for the interest parameter ρ.
Specifically, when −0.5 ≤ β < 0 then {t^β} is an 'evaporating' trend, and captures the effect of an ergodic regressor, in that when H_0 is true E[Δy_t] converges to a constant (zero, in the simplest case). Instead, when β > 0, E[Δy_t] diverges. For 0 < β < 0.5, Elliott et al. [7] term the trend 'slowly evolving', although nonstationary. Since a pure random walk has stochastic order O(T^{1/2}), we might view the covariate trend as dominant if β > 0.5, and the stochastic trend as dominant if β < 0.5. The purpose of the following analysis is to detail the effect of the rate of divergence or convergence of the covariate on inference about ρ.
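The normalization D_T = diag{T, T^{β+1/2}} used below reflects the rate Σ_{t≤T} t^{2β} ≈ T^{2β+1}/(2β+1) for β > −1/2, so the covariate component of the score has standard deviation of order T^{β+1/2} whether the trend evaporates or diverges. A quick numerical check (the horizon T and the β grid are arbitrary choices):

```python
import numpy as np

T = 1_000_000
t = np.arange(1, T + 1, dtype=float)

for beta in (-0.3, 0.0, 0.5, 1.0):
    # Sum of squared trend terms versus its leading-order approximation
    s = (t ** (2 * beta)).sum()
    approx = T ** (2 * beta + 1) / (2 * beta + 1)
    print(f"beta={beta:+.1f}  ratio={s / approx:.4f}")  # ratios close to 1
```

At β = −0.5 exactly the approximation breaks down, since Σ t^{−1} grows logarithmically rather than like a power of T.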
Consider the score and Hessian for model (3), initially assuming σ² = 1 for simplicity:

S(ρ, α) = ( Σ_{t=1}^T y_{t−1} ε_t, Σ_{t=1}^T t^β ε_t )′,   H(ρ, α) = − [ Σ y_{t−1}², Σ t^β y_{t−1} ; Σ t^β y_{t−1}, Σ t^{2β} ],

where ε_t = y_t − ρ y_{t−1} − α t^β. Imposing y_t = u_t and y_0 = 0, then E[y_{t−1}] = 0 and Fisher information is

I = E[−H] = diag{ Σ_{t=1}^T (t − 1), Σ_{t=1}^T t^{2β} }.

Using this as an inferential metric would be misleading, since it would imply no impact of the covariate on inference on ρ. Instead, note the standard results

T^{−2} Σ y_{t−1}² ⇒ ∫_0^1 W(r)² dr,   T^{−1} Σ y_{t−1} ε_t ⇒ ∫_0^1 W dW =_d (χ²_1 − 1)/2,
T^{−(β+3/2)} Σ t^β y_{t−1} ⇒ ∫_0^1 r^β W(r) dr,   T^{−(β+1/2)} Σ t^β ε_t ⇒ Z ∼ N(0, (2β + 1)^{−1}),

where W(r) is standard Brownian motion, χ²_1 denotes a chi-square random variable with one degree of freedom, ⇒ denotes weak convergence and =_d denotes equality in distribution. The score then obeys the following limit:

D_T^{−1} S(ρ, α) ⇒ ( ∫_0^1 W dW, Z )′,   (4)

where D_T = diag{T, T^{β+1/2}}. Expansion of the score in the Gaussian case yields D_T(θ̂_MLE − θ) = (D_T^{−1}(−H)D_T^{−1})^{−1} D_T^{−1} S + o_p(1) for θ = (ρ, α)′, and hence

T(ρ̂_MLE − 1) ⇒ ( ∫_0^1 W dW − (2β + 1) Z ∫_0^1 r^β W dr ) / H̄_{ρ|α}   (5)

and

T^{β+1/2} α̂_MLE ⇒ ( Z − ( ∫_0^1 r^β W dr / ∫_0^1 W² dr ) ∫_0^1 W dW ) / H̄_{α|ρ}.   (6)

Note that if we define the limit of the scaled Hessian by

H̄ = [ ∫_0^1 W² dr, ∫_0^1 r^β W dr ; ∫_0^1 r^β W dr, (2β + 1)^{−1} ],

then the quantities scaling the limit distributions of the components of the score in (5) and (6) are H̄_{ρ|α} = ∫_0^1 W² dr − (2β + 1)(∫_0^1 r^β W dr)² and H̄_{α|ρ} = (2β + 1)^{−1} − (∫_0^1 r^β W dr)²/∫_0^1 W² dr, so that H̄_{ρ|α} and H̄_{α|ρ} are the stochastic analogues of conditional information.¹ Bhapkar and Srinivasan [3] and Zhu and Reid [4] argue that conditional information (2) should form the basis of any efficiency theory, e.g. application of the Cramér-Rao lower bound to any estimator of ρ. In the current context this would fail, since I_{ρα} = 0 would wrongly imply that the value of β does not affect the limit distribution of ρ̂_MLE. On the other hand, the stochastic quantity H̄_{ρ|α} depends explicitly on β and should therefore prove informative about inference on ρ, as a function of β.
Indeed, here the limit in (5) can be interpreted as a ratio of a Gaussian-type score component to the stochastic conditional information H̄_{ρ|α}. Only H̄_{ρ|α} contains any information on the impact of the covariate on the asymptotic distribution of ρ̂_MLE. It does not, however, measure its variance directly, since it is correlated with Z.
Specification A is extremely useful in two regards. First, as in [10], it exposes the effects of even stationary covariates on tests for nonstationarity. Second, here, a sensible stochastic analogue of conditional information arises naturally, and its role in the limit distribution is clear. However, the latter applies only upon imposing α = 0; in general the distribution of ρ̂_MLE will depend explicitly upon α, and any other value will produce different, as well as quickly intractable, limit theory. Specification B, on the other hand, allows construction of invariant statistics, and in the next section it will be shown that H̄_{ρ|α} has far wider applicability in that context.

Profile likelihood and information measures
In the context of specification B, suppose that a process (u_t)_{t=1}^T is generated according to

u_t = ρ u_{t−1} + ε_t,  u_0 = 0,   (7)

and we are interested in testing the null hypothesis H_0: ρ = 1 against H_1: ρ = 1 − c/T, for c > 0. In the simplest case, we assume that the observed time series data (y_t)_{t=1}^T are given by y_t = u_t; however, we explicitly 'de-trend' the observations according to two non-linear trend models:

M_1: y_t = α t^β + u_t,   M_2: y_t = α_0 + α_1 t^β + u_t.   (8)

The purpose is to measure the influence of β on our ability to determine whether or not (u_t)_{t=1}^T has a unit root. Let α̂, α̂_0 and α̂_1 denote the OLS estimators for α, α_0 and α_1 in (8), respectively. Unit root tests are constructed from the detrended data

u*_t = y_t − α̂ t^β and u^+_t = y_t − α̂_0 − α̂_1 t^β.

The hypotheses H_0 and H_1 are invariant with respect to the groups of transformations defined, respectively, by

G_1: y_t → y_t + a t^β and G_2: y_t → y_t + a_0 + a_1 t^β.

Similar to [14,15], the maximal invariants under G_1 and G_2 are v_1 = C_1 y and v_2 = C_2 y, where C_j is a (T − j) × T matrix whose rows form an orthonormal basis of the orthogonal complement of the trend regressors, so that C_j C_j′ = I_{T−j} and C_j′ C_j = M_j, the corresponding OLS residual-maker; then all statistics constructed only from u*_t (u^+_t) are invariant, having distributions not depending on α, or on α_0 and α_1, respectively. In particular, any quantity derived via the imposition of α = α_0 = α_1 = 0 will, in the context of specification B, still apply more generally, unlike with specification A.
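The invariance claims above are easy to check numerically: the OLS residuals u*_t and u^+_t are unchanged when y is shifted by any member of G_1 or G_2. A minimal sketch (β = 0.75, T = 200 and the shift coefficients are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
T, beta = 200, 0.75
t = np.arange(1, T + 1, dtype=float)

# M1 regressor t^beta; M2 adds a constant
X1 = t[:, None] ** beta
X2 = np.column_stack([np.ones(T), t ** beta])

def ols_residuals(y, X):
    # u* (or u+): y minus its OLS projection onto the trend regressors
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef

y = np.cumsum(rng.standard_normal(T))       # pure random walk (the null)
shifted = y + 3.1 * t ** beta               # a G1 transformation of y
u_star = ols_residuals(y, X1)
assert np.allclose(u_star, ols_residuals(shifted, X1))  # invariant under G1
```

Because the residual-maker annihilates the trend regressors, every statistic built from these residuals has a distribution free of α (or of α_0 and α_1).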
To measure the effect of the trend parameter β on asymptotic inference, we will focus upon likelihood-based measures constructed from the Gaussian profile likelihood

l̃(ρ, σ²) = −(T/2) log(2πσ²) − (1/(2σ²)) Σ_{t=1}^T (ũ_t − ρ ũ_{t−1})²,

where ũ_t = u*_t for M_1 and ũ_t = u^+_t for M_2, with the likelihood profiled with respect to the nuisance parameters α or (α_0, α_1), respectively, via OLS. Accordingly, define the following profile measures.

Kullback-Leibler divergence
Define the profile log-likelihood ratio by LR(c) = l̃(1, σ²) − l̃(1 − c/T, σ²); then the asymptotic profile Kullback-Leibler divergence is given by KL(β) = lim_{T→∞} E_{H_0}[LR(c)].
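For intuition, a second-order expansion of the profile log-likelihood ratio under the local alternative ρ = 1 − c/T links the divergence to profile Fisher information. The identity below is a hedged reconstruction, not a quotation, but it is consistent with each pair of values reported in the discussion (Ĩ_1 = 1/2 with KL = c²/4; I*_1(0) = 1/6 with KL*(0) = c²/12):

```latex
\widetilde{KL}(\beta)
  \;=\; \lim_{T\to\infty} \mathbb{E}_{H_0}\!\bigl[\tilde\ell(1,\sigma^2)-\tilde\ell(1-c/T,\sigma^2)\bigr]
  \;=\; \frac{c^2}{2}\,\tilde I_1(\beta),
\quad\text{since}\quad
\tilde\ell(1-c/T,\sigma^2)\approx\tilde\ell(1,\sigma^2)-\frac{c}{T}\,S(1)
  +\frac{c^2}{2T^2}\,\frac{\partial^2\tilde\ell}{\partial\rho^2}\Big|_{\rho=1},
\qquad \mathbb{E}_{H_0}[S(1)]=0 .
```

Taking expectations under H_0 kills the score term, leaving c²/2 times the limit of the scaled expected negative Hessian, i.e. profile Fisher information.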

Fisher and conditional information
For specification B, the profile score and Hessian follow directly from l̃(ρ, σ²). The Gaussian profile MLEs satisfy

ρ̃ = Σ_{t=2}^T ũ_t ũ_{t−1} / Σ_{t=2}^T ũ_{t−1}²,   σ̃² = T^{−1} Σ_{t=2}^T (ũ_t − ρ̃ ũ_{t−1})².

Imposing ρ = 1, and noting that ũ_T = O_p(T^{1/2}) and ε̃_T = Δũ_T = O_p(1), the limit of the scaled Hessian, H̄(ρ, σ²), obtains, as does its expectation. Asymptotic Fisher information in (ũ_t)_{t=1}^T about ρ when Δy_t = ε_t is

Ĩ_1(β) = lim_{T→∞} E[ −T^{−2} ∂²l̃(ρ, σ²)/∂ρ² ]|_{ρ=1},

and conditional information in ρ given σ² is equal to Fisher information in this case, i.e. CI_{ρ|σ²} = Ĩ_1(β).
Before proceeding we will require limiting forms for the OLS estimators of the nuisance parameters, α̂ and (α̂_0, α̂_1), when Δu_t = ε_t. These generalize results found in [16] and are given in the following lemma, proved in Appendix 1.

Lemma 3.1:
Note that, as is well known, α̂_0 is never consistent, while neither α̂ nor α̂_1 is consistent if β < 0.5. This, for α̂, contrasts with the limit for α̂_MLE implied by (6), which could be generalized for α ≠ 0 if α were the interest parameter, for instance.
Applying the results of Lemma 3.1 to the appropriate profile likelihood yields explicit expressions for the profile Kullback-Leibler, Fisher and conditional information as given below. For each model, we find that these are all asymptotically equivalent and depend upon the degree of trending, β, in exactly the same way. The findings are summarized in the following theorem, which is also proved in Appendix 1.

Theorem 3.1: (Part I) Let y_t := u_t = u_{t−1} + ε_t and suppose that we de-trend y_t according to M_1. Then

I*_1(β) = CI*_{1|σ²} = (2/c²) KL*(β) = 1/2 − 2(2β + 1)/((β + 2)(2β + 3)),

which is uniquely minimized at β* = (√6 − 1)/2.

(Part II) Now let y_t := u_t = u_{t−1} + ε_t and suppose that we de-trend y_t according to M_2. Then, likewise, I^+_1(β) = CI^+_{1|σ²} = (2/c²) KL^+(β), which is uniquely minimized at β^+.
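Profile Fisher information in M_1 admits a simple closed form, obtained by projecting the Brownian limit onto r^β: I*_1(β) = E[∫_0^1 W² dr] − (2β+1) E[(∫_0^1 r^β W dr)²] = 1/2 − 2(2β+1)/((β+2)(2β+3)). This expression is a reconstruction derived here, not a quotation, but it reproduces the benchmark values I*_1(0) = 1/6 and I*_1(−0.5) = 1/2 and the minimizer β* = (√6 − 1)/2 discussed in the next section, as a short numerical check confirms:

```python
import numpy as np

def info_M1(beta):
    # Reconstructed profile Fisher information for M1 (derived, not quoted):
    # E[int_0^1 W^2 dr] - (2b+1) * E[(int_0^1 r^b W dr)^2]
    return 0.5 - 2 * (2 * beta + 1) / ((beta + 2) * (2 * beta + 3))

# Benchmark values quoted in the discussion
assert abs(info_M1(0.0) - 1 / 6) < 1e-12
assert abs(info_M1(-0.5) - 0.5) < 1e-12

# Grid-minimize over beta and compare with beta* = (sqrt(6) - 1) / 2
grid = np.linspace(-0.49, 5.0, 200_001)
beta_star = grid[np.argmin(info_M1(grid))]
print(beta_star)  # close to (sqrt(6) - 1) / 2 ~ 0.7247
```

The minimum value itself is approximately 0.096, well below the pure random walk benchmark of 1/2.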

Discussion and analysis
(1) Returning to the original Dickey-Fuller [13] model (i.e. specification A in (1)), we find that the expectation of the limit of the conditional Hessian is E[H̄_{ρ|α}] = I*_1(β). That is, the measure of conditional information derived for specification A is identical to profile Fisher information in specification B. This finding can be generalized, at some considerable algebraic cost, to the case of d_t = α_0 + α t^β.
(2) In all cases, it is clear that the covariate is relevant for inference on ρ, whether it is evaporating or nonstationary, whether slowly evolving or explosive. For instance, in M_1 with β = 0, we have I*_1(0) = CI*_{1|σ²} = 1/6 and KL*(0) = c²/12. The outcomes can also be compared with the benchmark of a pure random walk (i.e. the likelihood does not need profiling), in which case we find I_1 = 1/2 and KL = c²/4. In the case of M_1, I*_1(β) < 1/2 for all −0.5 < β < ∞, although I*_1(−0.5) = 1/2 and lim_{β→∞} I*_1(β) = 1/2. That is, profiling with respect to the limiting evaporating or explosive covariate has, effectively, no effect on information. For M_2 the benchmark case can be taken as M_1 with β = 0. Once again we find I^+_1(β) < 1/6 for all −0.5 < β < ∞, but I^+_1(−0.5) = 1/6 and lim_{β→∞} I^+_1(β) = 1/6.

(3) In order to demonstrate that these findings are genuinely informative about the effect of regressing out t^β on unit root inference, we examine the power envelope. Adding scale invariance to the groups of transformations G_1 and G_2 defined above, so that, for example, y_t → b(y_t + a t^β) with b > 0 under G_1, (9), then from [14] the maximal invariant (under (9)) for testing H_0: ρ = 1 in (7) is v_j = C_j y/√(y′ M_j y), where C_j and M_j are defined above. The statistic v_j has density (with respect to normalized Haar measure on the surface of the unit T − j sphere)

p(v_j; ρ) ∝ |C_j Σ_ρ C_j′|^{−1/2} ( v_j′ (C_j Σ_ρ C_j′)^{−1} v_j )^{−(T−j)/2},   Σ_ρ = (Δ_ρ′ Δ_ρ)^{−1},

where Δ_ρ = I_T − ρL and L is the lag-operator matrix. The Neyman-Pearson tests for H_0 against the alternative H_1: ρ = 1 − c/T are to reject H_0 if

p(v_j; 1 − c/T)/p(v_j; 1) > k_δ,

where k_δ is chosen so that the size is δ.
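The point optimal invariant test just described can be sketched directly: build C_j from an orthonormal basis of the orthogonal complement of the trend, evaluate the invariant densities at ρ = 1 and ρ = 1 − c/T, and simulate the rejection rate. This is a reduced-scale sketch (T = 100, 3000 replications and β = 0.5 are arbitrary choices, far smaller than the paper's experiment), using the standard King-type density form as an assumption rather than a quotation:

```python
import numpy as np

rng = np.random.default_rng(7)
T, beta, c, reps, delta = 100, 0.5, 10.0, 3000, 0.05
t = np.arange(1, T + 1, dtype=float)
X = (t ** beta)[:, None]

# Rows of C: orthonormal basis of the orthogonal complement of span(X)
Q, _ = np.linalg.qr(X, mode="complete")
C = Q[:, 1:].T

L = np.eye(T, k=-1)                       # lag-operator matrix

def precompute(rho):
    D = np.eye(T) - rho * L               # Delta_rho = I - rho L
    Sigma = np.linalg.inv(D.T @ D)        # Var(u) under AR(1) with u_0 = 0
    A = C @ Sigma @ C.T
    return np.linalg.inv(A), np.linalg.slogdet(A)[1]

A1, ld1 = precompute(1.0)
Ac, ldc = precompute(1.0 - c / T)

def np_stat(y):
    """Log-ratio of the invariant densities at rho = 1 - c/T versus rho = 1."""
    v = C @ y
    v = v / np.linalg.norm(v)
    return 0.5 * (ld1 - ldc) - 0.5 * (T - 1) * (
        np.log(v @ Ac @ v) - np.log(v @ A1 @ v))

Dinv1 = np.linalg.inv(np.eye(T) - 1.0 * L)            # draws under H0
Dinvc = np.linalg.inv(np.eye(T) - (1.0 - c / T) * L)  # draws under H1

null_stats = np.array([np_stat(Dinv1 @ rng.standard_normal(T)) for _ in range(reps)])
alt_stats = np.array([np_stat(Dinvc @ rng.standard_normal(T)) for _ in range(reps)])
crit = np.quantile(null_stats, 1 - delta)
print("power at c=10:", np.mean(alt_stats > crit))
```

Since the determinant terms are constants, they cancel in the ranking of replications; only the quadratic forms in v drive the rejection decision.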
In Table A1 (in Appendix 2), the resulting power envelope was simulated for T = 250, for ρ = 1 − c/T with c = 1, 2, . . . , 10, and for different values of β. The simulations were carried out with two million replications; note that β = T is used to approximate the limiting case of β → ∞. In Table A1 a clear prediction is supported: in M_1 power is not maximized when β = 0; detrending with respect to an evaporating trend can yield as much, or even more, power. It is not quite possible, in this context, to confirm the prediction that β* and β^+ minimize power, for two reasons. First, the powers are clearly very close, and insignificantly different even with two million replications. Second, the properties of the power envelope are determined by the behaviour of tests under both the null and the alternative, whereas Theorem 3.1 applies only under the null.
(4) Instead, consider the profile maximum likelihood estimators for ρ in M_1 and M_2,

ρ̂_1 = Σ_{t=2}^T u*_{t−1} u*_t / Σ_{t=2}^T (u*_{t−1})² and ρ̂_2 = Σ_{t=2}^T u^+_{t−1} u^+_t / Σ_{t=2}^T (u^+_{t−1})²,

where u*_t and u^+_t are defined above. Figures A1 and A2, in Appendix 2, plot the simulated (with T = 250 and two million replications) variances of T(ρ̂_1 − 1) and T(ρ̂_2 − 1), respectively, for different values of the trend parameter β. Also plotted are vertical lines at β* and β^+. These figures confirm, finally, the third prediction: there is a value of β which minimizes inferential accuracy, and, crucially, that value is fractional, not the linear-trend value β = 1.
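A reduced-scale version of the experiment behind Figure A1 can be sketched as follows (T = 250 as in the paper, but only a few thousand replications and an arbitrary β grid); the simulated variance of T(ρ̂_1 − 1) should be largest near β*:

```python
import numpy as np

rng = np.random.default_rng(42)
T, reps = 250, 4000
t = np.arange(1, T + 1, dtype=float)

def var_rho_hat(beta):
    """Simulated variance of T(rho_hat - 1) in M1: detrend a pure random
    walk (the null) on t^beta by OLS, then fit the AR(1) coefficient."""
    x = t ** beta
    y = np.cumsum(rng.standard_normal((reps, T)), axis=1)
    a_hat = (y @ x) / (x @ x)              # OLS slope on the single regressor
    u = y - np.outer(a_hat, x)             # u*_t = y_t - a_hat * t^beta
    num = np.sum(u[:, :-1] * u[:, 1:], axis=1)
    den = np.sum(u[:, :-1] ** 2, axis=1)
    return float(np.var(T * (num / den - 1.0)))

for beta in (-0.4, 0.0, 0.7247, 2.0):      # 0.7247 ~ beta* = (sqrt(6) - 1)/2
    print(f"beta={beta:+.2f}  var(T(rho_hat-1))={var_rho_hat(beta):8.2f}")
```

With only a few thousand replications the printed values are noisy, but the peak near β* mirrors the paper's two-million-replication figures.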

Conclusions
This paper argues that likelihood-based measures of information and efficiency remain informative about inferential accuracy even in regressions involving nonstationary data. This is true even though such models satisfy none of the assumptions required for consistent, efficient and asymptotically normal estimation.
The equivalence of conditional information in specification A with profile information in specification B justifies the use of the simpler Kullback-Leibler or Fisher information, applied to the profile likelihood, for unit root inference in the presence of a general covariate. These measures are informative, in that their clear predictions, including maximum inferential efficiency for 'evaporating' trends and minimum efficiency for fractional, not linear, trends, are supported by a numerical experiment.

Note
1. The author is grateful to an anonymous referee for steps leading to this interpretation.

Disclosure statement
No potential conflict of interest was reported by the author.
Moreover, by arguments almost identical to those given above in the proof of Part I, the corresponding result holds for M_2.
(c) Immediate from the definition of KL^+.