Inference on Factor Structures in Heterogeneous Panels

This paper develops an estimation and testing framework for a stationary large panel model with observable regressors and unobservable common factors. We allow for slope heterogeneity and for correlation between the common factors and the regressors. We propose a two-stage estimation procedure for the unobservable common factors and their loadings, based on the Common Correlated Effects estimator and the Principal Component estimator. We also develop two tests for the null of no factor structure: one for the null that the loadings are cross-sectionally homogeneous, and one for the null that the common factors are homogeneous over time. Our tests are based on extremes of the estimated loadings and common factors. The test statistics have an asymptotic Gumbel distribution under the null, and have power versus alternatives where even a single loading or common factor differs from the others. Monte Carlo evidence shows that the tests have the correct size and good power. JEL codes: C12, C33.


Introduction
Consider the following model for stationary panel data:

y_it = β_i′ x_it + γ_i′ f_t + ϵ_it, (1)
x_it = Λ_i f_t + ϵ^x_it, (2)

where i = 1, ..., n, t = 1, ..., T, x_it is an m-dimensional vector of observable explanatory variables and f_t is an r-dimensional vector of unobservable common factors; in equation (2), Λ_i is a matrix of coefficients of dimension m × r. Model (1)-(2) is based on Pesaran (2006).
Arguably, (1)-(2) is a model with a huge potential for empirical applications. For example, Kapetanios and Pesaran (2007) consider an APT model allowing for individual asset returns to be affected by common factors (both observable and unobservable); Eberhardt and Teal (2012) adopt a common factor model approach to estimate cross-country production functions for the agriculture sector; Eberhardt, Helmers and Strauss (2012) consider the impact of spillovers in the estimation of private returns to R&D within a common factor framework; Castagnetti and Rossi (2012) adopt a heterogeneous panel with a multifactor error model to study the determinants of credit spread changes in the Euro corporate bond market.
As far as conducting inference on (1) is concerned, the inferential theory for the slope coefficients β_i has been developed in various contributions. In particular, Pesaran (2006) proposes a family of estimators for β_i based on instrumenting the f_t s through cross-sectional averages of the x_it s and y_it s; such estimation techniques are referred to as the Common Correlated Effects (CCE) estimators. One of the key features of the CCE estimator is that it does not require any inference to be carried out on γ_i or f_t. Pesaran and Tosetti (2011) and Castagnetti and Rossi (2012) show that, in principle, residuals computed from (1) using CCE estimators can be used to extract γ_i and f_t using e.g. Principal Components (henceforth, PC). However, the properties of the estimated γ_i and f_t are not discussed. In addition to the CCE estimators, Bai (2009a,b) develops a different estimation technique for (1)-(2) under the assumption of homogeneous slopes, i.e. β_i = β. This technique is known as the Interactive Effects (henceforth IE) estimator, and it is based on iteratively computing β for given values of γ_i and f_t, and then γ_i and f_t for a given value of β. Although results are available for the estimated triple (β, γ_i, f_t), inference is developed under the assumption of homogeneous β_i s; moreover, no explicit asymptotics for γ_i or f_t are derived beyond consistency.
This paper makes two contributions to the literature. Firstly, we derive the inferential theory for the unobservable common factors f_t and their coefficients γ_i in (1)-(2). We estimate γ_i and f_t by applying PC to the residuals computed from (1) using the CCE estimator. This two-stage procedure builds on an idea in Pesaran (2006, p. 1000) and Pesaran and Tosetti (2011), while the asymptotics of the estimated (γ_i, f_t) is studied by adapting the method of proof in Bai (2009a,b) to the case of heterogeneous β_i s.
As a second contribution, we develop two tests: one for the null that γ_i = γ for all i, and one for the null that f_t = f for all t. The rationale for these two tests can be understood by noting that, as Pesaran (2006) points out, model (1)-(2) nests various alternative specifications. In the case of homogeneous loadings (i.e. γ_i = γ), equation (1) is tantamount to a panel regression with a time effect; in such a case, therefore, there is no real common factor structure. This fact is used by Sarafidis, Yamagata and Robertson (2009) to test for cross dependence in a dynamic panel context. Similarly, in the case of homogeneous factors (i.e. f_t = f), equation (1) boils down to a heterogeneous panel with individual effects; in this case, too, there is no real common factor structure. Therefore, the two tests described above can be used to verify whether a factor structure in (1)-(2) indeed exists, or whether simpler specifications nested in (1)-(2) should be employed. In this respect, our paper is closely related to a recent contribution by Baltagi, Kao and Na (2012), who propose an approach, based on finite-sample corrections and the wild bootstrap, to testing for H_0 : γ_i = 0 in a standard panel factor model. The tests developed in this paper should therefore be employed before trying to estimate any factor structure, including the number of common factors, as we also discuss in Section 3.
From a methodological point of view, we use statistics based on extrema of the estimated γ i and f t , in a similar fashion to the tests for slope homogeneity developed by Kapetanios (2003) and Westerlund and Hess (2011). From a technical point of view, in our proofs we use similar arguments to the changepoint literature (see e.g. Csorgo and Horvath, 1997): we approximate the sequences of estimated parameters with sequences of normals, and apply Extreme Value Theory (EVT henceforth). In this respect, our paper is a first attempt to systematize the use of extrema of estimated parameters in the context of a panel regression with unobservable common factors. As far as small sample properties are concerned, we show through a Monte Carlo exercise that the tests have correct size and satisfactory power for different levels of the signal-to-noise ratio and for several simulation designs.
The paper is organized as follows. The estimation procedure, and the asymptotics of the estimates of γ i and f t are in Section 2; Section 3 contains results about the two tests mentioned above. Section 4 contains a validation of our theory through synthetic data.
NOTATION. We use "→" to denote the ordinary limit; "→_d" and "→_p" to denote convergence in distribution and in probability respectively; and we use "a.s." as short-hand for "almost surely". The Euclidean norm of a vector x is denoted as ∥x∥; similarly, the induced Euclidean norm of a matrix A is denoted as ∥A∥ = max_{x ≠ 0} ∥Ax∥ / ∥x∥. Other notation is defined throughout the paper and in the Appendix.

Estimation
In model (1)-(2), where x_it is m-dimensional and f_t is r-dimensional, we consider the following notation, which we use throughout the whole paper. We define F = (f_1, ..., f_T)′, and, for each i, y_i = (y_i1, ..., y_iT)′ and X_i = (x_i1, ..., x_iT)′. Based on this, the β_i s in (1) can be estimated as

β̂_i = (X_i′ M_w X_i)^{-1} X_i′ M_w y_i, (3)

where M_w projects onto the space orthogonal to the cross-sectional averages of the y_it s and x_it s; (3) is the CCE estimator of Pesaran (2006), and it is consistent as (n, T) → ∞. In order to estimate γ_i and f_t, we propose the following two-step procedure.
Step 1 Estimate the β_i s using the CCE estimator, and compute the residuals ṽ_i = y_i − X_i β̂_i.
Step 2 Apply the PC estimator to ṽ_i, obtaining γ̂_i and f̂_t under the restrictions F̂′F̂ = T I_r and n^{-1} ∑_{i=1}^n γ̂_i γ̂_i′ diagonal. In Step 2, F̂ is calculated as √T times the eigenvectors corresponding to the r largest eigenvalues of (nT)^{-1} ṽ′ṽ, with ṽ = (ṽ_1, ..., ṽ_n)′, and the loadings are computed as γ̂_i = T^{-1} F̂′ ṽ_i. (4) In (1), γ_i and f_t are not separately identifiable; as is typical in this literature, we only manage to estimate a rotation of γ_i and f_t, say H^{-1} γ_i and H′ f_t. However, for our purposes knowing H^{-1} γ_i and H′ f_t is as good as knowing γ_i and f_t.
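The two-step procedure above can be sketched numerically. The snippet below is a minimal illustration, not the paper's code: the DGP, dimensions and variable names are hypothetical, the CCE projection is built from an intercept plus the cross-sectional averages of y_it and x_it, and Step 2 extracts F̂ from the eigendecomposition of (nT)^{-1} ṽ′ṽ.

```python
# Sketch of the two-step procedure (Steps 1-2). Illustrative assumptions:
# y_it = beta_i'x_it + gamma_i'f_t + eps_it and x_it loads on f_t.
import numpy as np

rng = np.random.default_rng(0)
n, T, m, r = 50, 200, 2, 1

# Hypothetical DGP, for illustration only
f = rng.standard_normal((T, r))                     # latent factors
gamma = 1.0 + rng.standard_normal((n, r))           # loadings
beta = 1.0 + rng.standard_normal((n, m))            # heterogeneous slopes
Lam = rng.standard_normal((n, m, r))
x = np.einsum('nmr,tr->ntm', Lam, f) + rng.standard_normal((n, T, m))
y = np.einsum('ntm,nm->nt', x, beta) + gamma @ f.T + rng.standard_normal((n, T))

# Step 1: CCE estimation unit by unit, then residuals
xbar, ybar = x.mean(axis=0), y.mean(axis=0)         # cross-sectional averages
W = np.column_stack([np.ones(T), ybar, xbar])       # T x (m + 2)
M_w = np.eye(T) - W @ np.linalg.pinv(W)             # project off the averages
v = np.empty((n, T))
for i in range(n):
    Xi, yi = x[i], y[i]                             # T x m, T
    beta_hat = np.linalg.solve(Xi.T @ M_w @ Xi, Xi.T @ M_w @ yi)
    v[i] = yi - Xi @ beta_hat                       # CCE residuals

# Step 2: PC on the residuals; F_hat = sqrt(T) times the r leading eigenvectors
eigval, eigvec = np.linalg.eigh(v.T @ v / (n * T))  # ascending eigenvalue order
F_hat = np.sqrt(T) * eigvec[:, -r:]                 # T x r, F_hat'F_hat = T*I_r
G_hat = v @ F_hat / T                               # n x r estimated loadings
```

The estimated pair (G_hat, F_hat) recovers the loadings and factors only up to the rotation H discussed above.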
Consider the following assumptions.
a.s. for all i, where l min (·) denotes the smallest eigenvalue; (ii) C ≡ n −1 ∑ n i=1 C i has rank r ≤ m + 1.
∞ for some δ > 0; (iii) the γ_i s are non-stochastic and such that max_i ∥γ_i∥ < ∞.

Assumption 1 deals with the error term ϵ_it and, for example, it entails that Assumption C in Bai (2009a) holds. Assuming that the ϵ_it s are cross-sectionally independent is stronger than necessary; this is e.g. not needed in Pesaran (2006) or in Bai (2009a). In our context, we make this assumption only for the sake of simplicity; alternatively, it could be relaxed by assuming that the cross-sectional covariances satisfy some summability conditions (see e.g. Assumption C in Bai, 2009a). Throughout the paper, we derive results under this assumption, and discuss, for each result, to what extent it can be relaxed.
Similarly, as far as serial dependence is concerned, more general forms of dependence can be considered for ϵ_it; for example, using the theory developed in Phillips and Solo (1992) (see Phillips and Moon, 1999, for the extension to the panel context), we could assume a linear process for ϵ_it, at the price of more complicated algebra. Finally, note the requirement on the existence of the 12-th moment of ϵ_it. This assumption is stronger than what the literature normally considers; e.g. in Bai (2009a), assuming E|ϵ_it|^8 < ∞ suffices. In our context, the existence of the 12-th moment is needed in order to derive consistency of γ̂_i and f̂_t (see in particular the proof of Lemma A.1).
As far as Assumption 2 is concerned, the requirements that the ϵ^x_it s are cross-sectionally independent, and that both ϵ^x_it and f_t are i.i.d. over time, are again stronger than necessary and only made for simplicity. As far as Assumption 3 is concerned, the rank condition in part (ii) is the same as equation (21) in Pesaran (2006), and it guarantees the consistency of the CCE estimator β̂_i, defined in (3). It is well known (see e.g. Remark 2 in Pesaran and Tosetti, 2011) that the CCE estimator can be consistent even if the rank condition is not met. In our context, some results in the proofs (see e.g. Lemma A.1) also require this rank condition. Finally, Assumption 4 is standard; the requirement that the γ_i s are non-random in part (iii) can be relaxed in a similar fashion to Assumption 3 in Pesaran (2006).
We now turn to studying the asymptotics of γ̂_i and f̂_t.
Theorem 1 Under Assumptions 1-4, it holds that, for every i,

γ̂_i − H^{-1} γ_i = O_p(T^{-1/2}) + O_p(n^{-1}); (5)

if, in addition, √T/n → 0, then

√T (γ̂_i − H^{-1} γ_i) →_d N(0, Σ_γi). (6)

Theorem 1 can be compared with Theorem 2 in Bai (2003, p. 147): the rates of convergence in (5) are exactly the same. As is typical in this literature, consistency is affected by the use of generated regressors, f̂_t, when computing γ̂_i; see equation (4). This is the reason the O_p(n^{-1}) term appears in (5): it is not possible to estimate the γ_i s consistently unless both n and T pass to infinity.
On the other hand, the limiting distribution in (6) is different from the one in Theorem 2 in Bai (2003): this is due to the presence, in our context, of the idiosyncratic regressors x_it.
In the statement of the Theorem, the expression of the limiting covariance matrix is reported for the general case of serial dependence. Under independence (Assumptions 1 and 2), an estimator of Σ_γi can be constructed from the sample counterparts of its components (see also Bai, 2003, p. 150). Under more general forms of serial dependence, Σ_γi can be estimated with a HAC estimator in the spirit of Newey and West (1987), where the bandwidth parameter q is chosen so that q → ∞ with q/T^{1/4} → 0. Other HAC estimators can also be employed (see e.g. Andrews, 1991).
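A Bartlett-kernel (Newey-West type) long-run covariance routine can be sketched as follows. This is a generic illustration under hypothetical names, not the paper's exact formula for Σ̂_γi.

```python
# Minimal sketch of a Bartlett-kernel (Newey-West) long-run covariance
# estimator for a T x d array of (possibly serially dependent) vectors u_t.
import numpy as np

def newey_west(u, q):
    """HAC estimate of the long-run covariance of u, with bandwidth q."""
    u = u - u.mean(axis=0)
    T = u.shape[0]
    S = u.T @ u / T                        # lag-0 sample covariance
    for j in range(1, q + 1):
        w = 1.0 - j / (q + 1.0)            # Bartlett weights
        G = u[j:].T @ u[:-j] / T           # lag-j sample autocovariance
        S += w * (G + G.T)
    return S

rng = np.random.default_rng(1)
T = 500
e = rng.standard_normal((T + 1, 2))
u = e[1:] + 0.5 * e[:-1]                   # MA(1): long-run variance 2.25 per element
q = int(4 * (T / 100.0) ** (2.0 / 9.0))    # a common bandwidth rule of thumb
S_hat = newey_west(u, q)
```

With q = 0 the routine reduces to the ordinary sample covariance; the Bartlett weights guarantee a positive semi-definite estimate.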
We now present the asymptotic results forf t .
Theorem 2 Under Assumptions 1-4, it holds that, for every t,

f̂_t − H′ f_t = O_p(n^{-1/2}) + O_p(T^{-1/2}); (8)

if, in addition, n/T → 0, then

√n (f̂_t − H′ f_t) →_d N(0, Σ_ft). (9)

Theorem 2 is the counterpart to Theorem 1 in Bai (2003, p. 145). In particular, the asymptotic distribution in equation (9) is exactly the same. However, the rates of convergence given in (8) are different: in Theorem 1 in Bai (2003), the corresponding rate is O_p(n^{-1/2}) + O_p(T^{-1}). In our case, the fact that the x_it s are correlated with f_t introduces an additional O_p(T^{-1/2}) term. As a consequence, the restriction n/T → 0 is needed in (9). As in the case of Theorem 1, the asymptotic covariance matrix of √n(f̂_t − H′f_t) is expressed in its general form. We note that, since the limiting distribution is the same as in Bai (2003), the asymptotic covariance matrix of √n(f̂_t − H′f_t) can be estimated using equation (7) in Bai (2003, p. 150). Specifically, letting ṽ = (ṽ_1, ..., ṽ_n)′, and defining V_nT as a diagonal matrix containing the r largest eigenvalues of (nT)^{-1} ṽṽ′ in descending order, the estimator of Σ_Γϵ,t is based on the cross-sectional sample moments of the estimated ϵ_it s; this is consistent with Assumption 1, which postulates cross-sectional independence. However, even if this assumption were relaxed, estimating Σ_Γϵ,t would still require some restrictions on the presence of cross dependence; this is because it is in general not possible to estimate Σ_Γϵ,t consistently unless some ordering among the cross-sectional units is assumed; see also Bai (2003, p. 150).
Combining Theorems 1 and 2, we obtain the asymptotics for the estimated common component. Corollary 1 Under Assumptions 1-4, as (n, T) → ∞, for all i and t, where Σ_ft is defined in Theorem 2.
The rate of convergence is the same as in Theorem 3 in Bai (2003). However, the limiting distribution is different under T/n → 0, since now the estimation error f̂_t − H′f_t also comes up in the expression.
Having discussed the asymptotic properties of γ̂_i and f̂_t, we now turn to deriving tests for the null of no factor structure.

Testing for no factor structure
In this section, we develop tests for the null of no factor structure in (1). Motivated by Sarafidis, Yamagata and Robertson (2009), we propose two tests for, respectively: (a) the null of cross-sectional homogeneity of the loadings γ i s; and (b) the null of homogeneity, over time, of the f t s.
Formally, we propose two tests for the null hypotheses

H_0^a : γ_i = γ for all i, (12)
H_0^b : f_t = f for all t. (13)

As mentioned in the Introduction, we argue that both (12) and (13) entail that there is no real factor structure in (1). Consider (12) first. When H_0^a holds, equation (1) can be rewritten as

y_it = β_i′ x_it + γ′ f_t + ϵ_it, (14)

i.e. a standard panel specification with a time effect; similarly, when H_0^b holds, (1) is tantamount to

y_it = β_i′ x_it + γ_i′ f + ϵ_it, (15)

i.e. a standard panel specification with a unit-specific effect.
The considerations made above also entail that testing for (12) and (13) amounts to testing for the absence of a genuine factor structure in (1). Let γ̄ = n^{-1} ∑_{i=1}^n γ̂_i and f̄ = T^{-1} ∑_{t=1}^T f̂_t. In order to test for (12) and (13), we propose the following max-type test statistics:

S_γ,nT = max_{1≤i≤n} T (γ̂_i − γ̄)′ Σ̂_γi^{-1} (γ̂_i − γ̄), (16)
S_f,nT = max_{1≤t≤T} n (f̂_t − f̄)′ Σ̂_ft^{-1} (f̂_t − f̄). (17)

This approach has been proposed, in the context of testing for poolability with observable regressors, by Westerlund and Hess (2011), whose simulations show that the power properties are very promising, although issues may arise in the presence of ties (Hall and Miller, 2010).
Under the null hypotheses H_0^a and H_0^b, the spaces spanned by the loadings and by the factors (respectively) have rank equal to one. This fact was already noted by Sarafidis, Yamagata and Robertson (2009) who, building on it, suggest running their test setting r = 1. This applies to our context also: S_γ,nT and S_f,nT can be used setting r = 1, which avoids having to estimate r. Indeed, based on the same reasoning, the tests can be carried out with any arbitrarily chosen value of r: the asymptotics will be the same, although the finite sample properties can be expected to differ across different values of r. From a methodological perspective, this entails that tests for (12) and (13) can be implemented without prior knowledge of the number of factors: testing does not require estimation of r as a preliminary step. Indeed, we note that tests for (12) and (13) are to be implemented before determining r. If the null is not rejected, the conclusion can be drawn that no factor structure is needed, and either (14) or (15) is the correct specification.
Conversely, if the null is rejected, then it follows that there is a genuine factor structure.
Hence, the next step is determining the number of latent common factors r, e.g. by applying the information criteria in Bai and Ng (2002). The asymptotic properties of the estimated common factors, loadings and common components are those given in Section 2.
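The testing sequence can be illustrated numerically. The sketch below is an illustration under assumptions, not the paper's code: it takes the max-type statistic, for r = 1, to be the studentized squared deviation of each estimated loading from the cross-sectional average, and pairs it with a Gumbel critical value using normalizing constants A_n = 1/2 and B_n = 2 ln n + (r − 2) ln ln n − 2 ln Γ(r/2), consistent with a chi-square(r) tail; the simulated loadings are hypothetical.

```python
# Sketch of a max-type homogeneity test run with r = 1 (no estimate of r needed).
import math
import numpy as np

def max_stat(est, scale, size):
    """Max over units of the studentized squared deviation from the mean.

    est   : (N, 1) array of estimated loadings (or factors), r = 1
    scale : scalar variance estimate
    size  : T when testing the loadings, n when testing the factors
    """
    dev = est[:, 0] - est[:, 0].mean()
    return size * float(np.max(dev ** 2)) / scale

def gumbel_cv(N, alpha, r=1):
    """Asymptotic critical value B_N - (1/A_N) * ln ln (1 - alpha)^(-1),
    with A_N = 1/2 (an assumption about the elided constants)."""
    B = (2.0 * math.log(N) + (r - 2.0) * math.log(math.log(N))
         - 2.0 * math.lgamma(r / 2.0))
    return B - 2.0 * math.log(math.log(1.0 / (1.0 - alpha)))

# Illustration under homogeneous loadings (the null): rejection should be rare
rng = np.random.default_rng(2)
n, T = 100, 200
g_hat = 1.0 + rng.standard_normal((n, 1)) / np.sqrt(T)  # gamma_i = 1 + noise
S_gamma = max_stat(g_hat, scale=1.0, size=T)
reject = S_gamma > gumbel_cv(n, alpha=0.05)
```

The same `max_stat` routine, applied over t with size n, gives the factor-homogeneity statistic.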
It is worth pointing out that an alternative approach could be based on using averages of the estimated loadings and factors, as in, for example, Pesaran and Yamagata (2008). From a technical standpoint, our proofs approximate the sequence of estimated parameters with a sequence of normally distributed random variables, plus an error term whose supremum taken over n is negligible. In light of this, the proofs are similar, in spirit, to the ones found in the changepoint literature (see e.g. Csorgo and Horvath, 1997).
Let k 1 be the largest number for which E |ϵ it | k 1 , E ∥x it ∥ k 1 and E ∥f t ∥ k 1 are finite. In view of Assumption 1, k 1 ≥ 12.
Also, let the critical value c_α,n be defined such that P(S_γ,nT ≤ c_α,n) = 1 − α under H_0^a. Finally, let Γ(·) denote the Gamma function. It holds that:

Theorem 3 Let Assumptions 1-4 hold, and let (n, T) → ∞ with √T n^{1/k_1}/n → 0 and n^{4/k_1}/T → 0. Then, under H_0^a,

lim P [A_n (S_γ,nT − B_n) ≤ x] = exp(−e^{−x}), (20)

where A_n = 1/2 and B_n = 2 ln n + (r − 2) ln ln n − 2 ln Γ(r/2). Under the alternative H_1^a that γ_i ≠ γ for at least one i, it holds that lim P(S_γ,nT > c_α,n) = 1.
Theorem 3 states that S_γ,nT has an asymptotic Gumbel distribution. This holds in the joint limit (n, T) → ∞, under some restrictions on the relative rate of divergence of n and T. Since k_1 ≥ 12, the first restriction requires T/n^{11/6} → 0, which is marginally stricter than the condition √T/n → 0 needed for (6). Similarly, the restriction n^{4/k_1}/T → 0 becomes, under Assumptions 1 and 2, n/T^3 → 0. The two restrictions indicate that the test should be applied when n is not exceedingly larger than T, and vice versa. Equation (20) also provides a rule to calculate asymptotic critical values c_α,n, which are given by

c_α,n = B_n − A_n^{-1} ln ln (1 − α)^{-1}. (21)

Thus, for a given level α, c_α,n is nuisance free, and it depends only on the cross-sectional sample size, n. A well-known issue in EVT is that convergence to Extreme Value distributions is in general rather slow. Canto e Castro (1987) shows that the rate of convergence for the maximum of a sequence of random variables following a Gamma distribution is O(1/ln² n).
Unreported Monte Carlo evidence shows that tests based on using c_α,n perform quite well, although they are a bit oversized. As an alternative, one can replace B_n with F^{-1}_{χr}(1 − 1/n), where F^{-1}_{χr}(·) is the inverse of the cumulative distribution function of a chi-square with r degrees of freedom; see Embrechts, Kluppelberg and Mikosch (1997).
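The quality of this approximation can be checked numerically for r = 1, where F_{χ1}(x) = erf(√(x/2)) can be inverted by bisection using only the standard library. The asymptotic constant used below, B_n = 2 ln n + (r − 2) ln ln n − 2 ln Γ(r/2), is our reading of the elided formula and should be treated as an assumption.

```python
# Comparing the asymptotic location constant B_n (r = 1) with the
# finite-sample chi-square quantile F^{-1}_{chi2_1}(1 - 1/n).
import math

def chi2_1_quantile(p):
    """Invert the chi-square(1) CDF, F(x) = erf(sqrt(x/2)), by bisection."""
    lo, hi = 0.0, 200.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if math.erf(math.sqrt(mid / 2.0)) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def B_asymptotic(n, r=1):
    return (2.0 * math.log(n) + (r - 2.0) * math.log(math.log(n))
            - 2.0 * math.lgamma(r / 2.0))

for n in (50, 100, 500):
    print(n, round(B_asymptotic(n), 3), round(chi2_1_quantile(1.0 - 1.0 / n), 3))
```

The two values differ by roughly 0.1 even for n = 500, which is consistent with the slow, logarithmic-rate convergence discussed above.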
As far as consistency of the test is concerned, equation (21) shows that c_α,n grows only logarithmically in n. Thus, when using max-type statistics such as S_γ,nT, n does not play a role in enhancing the power of the test. On the other hand, the test is powerful as long as just one γ_i is different from the others.
Finally, from a technical point of view, the assumption that the ϵ it s are cross sectionally and serially independent can be relaxed. In the proof of Theorem 3, we show that (20) still holds if one assumes |E (ϵ it ϵ js )| < σ ij with σ ij = o p (1) as n → ∞ for all i ̸ = j.

Testing for H_0^b : f_t = f
In this subsection, we study the asymptotics of S_f,nT.
In order to present the results, we let k 2 be the largest number such that E ∥f t ∥ k 2 , E ∥x it ∥ k 2 and E |ϵ it | k 2 are all finite. In view of Assumptions 1 and 2, k 2 ≥ 12.
Let the critical value c_α,T be defined such that P(S_f,nT ≤ c_α,T) = 1 − α under H_0^b. It holds that:

Theorem 4 Let Assumptions 1-4 hold, and let (n, T) → ∞ with n T^{2/k_2}/T → 0 and n^{2/k_2} T^{2/k_2}/n → 0. Then, under H_0^b,

lim P [A_T (S_f,nT − B_T) ≤ x] = exp(−e^{−x}), (23)

where A_T = 1/2 and B_T = 2 ln T + (r − 2) ln ln T − 2 ln Γ(r/2). Under the alternative H_1^b that f_t ≠ f for at least one t, it holds that lim P(S_f,nT > c_α,T) = 1. (24)
Theorem 4 is very similar to Theorem 3; convergence to the Gumbel distribution under the null is shown for (n, T) → ∞ jointly, under some restrictions between n and T. Namely, we require n T^{2/k_2}/T → 0 and n^{2/k_2} T^{2/k_2}/n → 0. Since k_2 ≥ 12, the former restriction is, at most, n/T^{5/6} → 0. This is marginally stronger than n/T → 0, which is required for (9) to hold. Similarly, requiring that n^{2/k_2} T^{2/k_2}/n → 0 entails T/n^5 → 0. This means that T ought to be larger than n, but not massively larger (the restriction T/n^5 → 0 can be expected to always hold in practice).
Critical values for a test of level α can be calculated as in (21), with A_T and B_T replacing A_n and B_n; alternatively, B_T can be approximated by F^{-1}_{χr}(1 − 1/T). As far as power is concerned, (24) stipulates that the test is consistent versus shrinking alternatives. Similarly to Theorem 3, it suffices that f_t differs from f in just one period t for the test to reject H_0^b.
Results are derived under the assumption of no serial correlation in the ϵ_it s. In the proof, we show that this requirement can be relaxed: the Theorem still holds as long as the serial covariances of the ϵ_it s vanish suitably, in the same spirit as the cross-sectional condition discussed after Theorem 3.

Small sample properties
In this section, we evaluate, through synthetic data, the small sample properties of estimators of γ i and f t (discussed in Section 2), and the power and size of tests for (12) and (13) based on S γ,nT and S f,nT (discussed in Section 3).
The Monte Carlo settings are as follows. Based on model (1)-(2), we consider the data generating process (DGP) given in (26)-(27).

Small sample properties - γ̂_i and f̂_t
We evaluate the small sample properties of the estimators γ̂_i and f̂_t.
As far as f̂_t is concerned, we follow the same logic as in Bai (2003): for each replication, we compute the correlation between the estimated and the true common factors; results are in Table 1 (recall that J = 1,000).
[Insert Table 1 somewhere here] Table 1 illustrates that the estimated common factor f̂_t is highly correlated with the unobserved common factor f_t. This reinforces the finding in Bai (2003), albeit obtained in a different context, that the estimated factors are quite good at tracking the true ones; indeed, our numerical values are very similar to those in Table I in Bai (2003, p. 151). When n and T are both at least 100, the estimated factors can be treated as if they were the true ones.
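The diagnostic underlying Table 1 can be reproduced in a stylized way: simulate a one-factor panel, estimate the factor by PC, and compute its correlation with the truth (the sign is identified only up to rotation). The DGP below is hypothetical and much simpler than the paper's.

```python
# Stylized version of the Table 1 diagnostic: correlation between the
# PC-estimated factor and the true one in a simulated one-factor model.
import numpy as np

rng = np.random.default_rng(3)
n, T, r = 100, 100, 1
f = rng.standard_normal((T, r))                 # true factor
gamma = 1.0 + rng.standard_normal((n, r))       # loadings
v = gamma @ f.T + rng.standard_normal((n, T))   # n x T panel, no regressors

eigval, eigvec = np.linalg.eigh(v.T @ v / (n * T))
f_hat = np.sqrt(T) * eigvec[:, -r:]             # estimated factor, up to sign

corr = np.corrcoef(f_hat[:, 0], f[:, 0])[0, 1]
print(round(abs(corr), 3))                      # close to 1 when n, T are large
```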
As far as γ̂_i is concerned, we report confidence intervals for γ_i. In order to illustrate how the confidence intervals shrink as T grows, we set n = 50 and T = 20, 50, 100, 1000.
According to equation (6) in Theorem 1, as (n, T) → ∞, √T(γ̂_i − H^{-1}γ_i) is asymptotically normal with covariance matrix Σ_γi. Further, let δ̂ be the least squares estimate from regressing γ̂_i on γ_i. By rotating γ̂_i towards γ_i, we can consider the confidence interval for γ_i directly, reported in Figure 1. [Insert Figure 1 somewhere here] Figure 1 shows that, in most cases and for all combinations of n and T, the confidence intervals contain the true value of γ_i. This also holds true for the case (n, T) = (50, 1000), where the ratio √T/n is not negligible, contrary to what the theory would require. As predicted by the theory, as T grows, the confidence intervals collapse to the true value of γ_i.

Small sample properties - S_γ,nT and S_f,nT
In this subsection, we report empirical rejection frequencies and power for tests based on the max-type statistics S γ,nT and S f,nT defined in (16) and (17) respectively.
As far as the design of the Monte Carlo is concerned, recall that the variance of the common component c_it = γ_i′f_t is set equal to 1 across all experiments. In addition to conducting simulations under the DGP (26), we also consider two alternative DGPs that are nested in (26), in order to assess the robustness of the proposed tests to different specifications of (1)-(2). We firstly consider a DGP for the regressors x_it that modifies (27) by not containing common factors (equation (28)).
In this case, cross dependence in the y it s is purely due to the presence of f t in (26). The rank condition in Assumption 3(ii) does not hold, although the CCE estimator is still consistent.
Secondly, we consider a DGP for (1) in which there are no unit-specific regressors (equation (29)); this is a pure factor model, which fits in the class of models considered by Bai (2003). In this case, it can be argued that testing for no factor structure (either by using S_γ,nT or S_f,nT) complements the information criteria in Bai and Ng (2002), by being a test for r = 0. This can also be compared with the framework in Baltagi, Kao and Na (2012).
Critical values have been computed by approximating B_n and B_T as discussed in Section 3. Unreported simulations show that results worsen only slightly when using the asymptotic critical values.

Testing for H_0^a : γ_i = γ

When evaluating the empirical rejection frequencies for tests based on S_γ,nT, we run the Monte Carlo simulations under the null γ_i = 1 for all i. When evaluating power, we generate the γ_i s under the alternative, reporting results for the case σ_γ = 0.2. Given that ϵ_it is cross-sectionally uncorrelated and homoskedastic by design, Σ_γi is estimated by its sample counterpart. Results for size and power when using the main DGP (26)-(27) are in Table 2.
[Insert Table 2 somewhere here] We firstly consider the empirical rejection frequencies (left panel in the table). The test has a tendency to be oversized in small samples; as a general rule, the correct size is attained when T ≥ 100 and n ≥ 50; indeed, when σ²_ϵ = 1 (high signal-to-noise ratio), the test has satisfactory size properties even for T = 50. The Table also shows that, as the signal-to-noise ratio decreases (i.e., as σ²_ϵ increases), the tendency towards oversizing in small samples worsens. This is not so when T ≥ 100 and n ≥ 50: the test attains the correct size even for large values of σ²_ϵ.
As far as the power is concerned (right panel in the Table), the test has good power properties in all cases: the power is above 50% in almost all cases. We note that, similarly to the size, the power deteriorates as the signal-to-noise ratio decreases; when n and T are sufficiently large, this effect disappears.

We also consider size and power under the alternative specifications (26)-(28) and (29); results are in Tables 3 and 4.

[Insert Tables 3 and 4 somewhere here]
Results do not change much with respect to the ones in Table 2, as far as both empirical rejection frequencies and power are concerned. Indeed, the size improves in both cases (especially when simulations are conducted under (29)). When the signal-to-noise ratio is sufficiently high, the test attains its nominal size for all values of n, as long as T ≥ 100.
It is interesting to note that both size and power become much better under (29) than in the other cases. The correct size is attained as long as n ≥ 30 and T ≥ 50; moreover, the power is always above 90% for all combinations of n and T.

Testing for H_0^b : f_t = f

We run the Monte Carlo simulations under the null f_t = 1 for all t when evaluating the size of tests based on S_f,nT. When evaluating the power, we generate the common factors under the alternative, reporting results for the case σ_f = 0.2. Finally, we estimate Σ_ft by its sample counterpart. Results when using (26)-(27) are in Table 5.
[Insert Table 5 somewhere here] The size of the test is almost always the correct one, with few exceptions -the test is oversized for small T when σ 2 ϵ is high. Both n and T have a quite limited impact on the results.
The test has very good power properties, especially when the signal-to-noise ratio is high.
We note that the power increases with both n and T , in a more pronounced way with n.
As in the previous subsection, we also considered size and power under the alternative specifications (26)-(28) and (29); results are in Tables 6 and 7.
[Insert Tables 6 and 7 somewhere here] Results computed under (26)-(28) do not differ much from the values in Table 5. Actually, as was noted for the case of S_γ,nT, results improve slightly, in particular the power. Similar considerations hold for the empirical rejection frequencies computed under (29): the size is always the correct one, and the power is also very good under all possible combinations of parameters.

Conclusions
In this contribution, we develop an inferential theory for the unobservable common factors and their loadings in a large, stationary panel model with observable regressors. Our framework allows for slope heterogeneity; we also allow for correlation between common factors and observable regressors, by modelling the DGP of the observable regressors as containing the common factors, in a similar spirit to Pesaran (2006).
We extend the framework in Pesaran (2006) by providing a two-stage estimator for the unobserved common factors and their loadings. We derive rates of convergence and limiting distributions of both the estimated factors and loadings, using a similar method of proof to Bai (2009a). The main finding is that results differ only marginally from the case of a stand-alone panel factor model such as the one studied by Bai (2003).
In a similar vein to Sarafidis, Yamagata and Robertson (2009), we also propose two max-type tests for the null of no factor structure, based on extremes of the estimated loadings and of the estimated factors.

Appendix A: Technical Lemmas
In this Appendix and the next one, we set H = I r in the proofs (although not in the statements of the Lemmas), for the sake of notational simplicity. Inequalities are written, when possible, omitting constants.
The Lemmas in this Section extend various results in Bai (2009a,b) to our framework.
All proofs rely upon the decomposition in (30); see Proposition A.1 in Bai (2009a). In (30), the main difference with Bai (2009a) is the presence of the unit-specific estimates β̂_j. We also use further notation, introduced where needed throughout the Appendices, for moments of order r ≤ 3.
By a well-known norm inequality (see e.g. Strang, 1988, p. 369, exercise 7.2), where the last equality holds by symmetry. In view of Assumption 3(i), and omitting γ_i, this entails the stated bound, where we have used: Burkholder's inequality; Hölder's inequality; the C_r-inequality and Jensen's inequality; the Cauchy-Schwarz inequality; and the fact that, by Assumptions 1 and 2(i), E|ϵ_it|^{2r} < ∞ and E∥x_it∥^{2r} < ∞ respectively. Using the Cauchy-Schwarz inequality in this context is more than is necessary, since x_it and ϵ_it are independent. Turning to II, note that, for sufficiently large n and omitting higher order terms, the same bound as in Pesaran (2006) applies. Therefore, letting ϵ̄ = n^{-1} ∑_{i=1}^n ϵ_i and omitting higher order terms, consider E∥I∥^r; since C has full rank by Assumption 3(i) and D_w is invertible, the first term is O(1) by Assumption 2(i). As far as the second term is concerned, it can be bounded after similar passages as in equation (32). It holds that E∥f_t∥^{4r} < ∞ by Assumption 2(i).
Further, using the fact that the ϵ_it s are cross-sectionally independent, we obtain the stated bound by similar passages as in equation (32); putting everything together, again by similar passages as above, the result follows. Proof. The proof of A.2(i) is very similar to, and in fact simpler than, that of A.2(ii); thus we focus on the latter only. Using (30), the proof follows very similar lines to that of Lemma A.4(ii) in Bai (2009b): the only difference here is the presence of the unit-specific estimation errors, β̂_j − β_j. Thus, we report only the complete passages needed to determine the order of magnitude of I; the same logic applies to all the other terms in the expansion. The only term for which the passages differ slightly is V, and we report the full-blown proof for it.
Consider now V, whose proof is marginally different from that in Bai (2009b, p. 5): we use the Cauchy-Schwarz inequality (first line), Hölder's inequality with the same orders as in (35) (second line), and Lemma A.1; the relevant expectation is zero otherwise by Assumption 1, whence the result. Proof. The Lemma is a refinement of Lemma A.3 in Bai (2009a). Consider part (i). Using (30); henceforth, we omit γ_i in the passages, based on Assumption 4(ii).
Using Hölder's inequality and the Cauchy-Schwarz inequality in a similar way to (35), this term is bounded as stated. Turning to II, we use Assumptions 1, 2(ii) and 3(i) and the Cauchy-Schwarz inequality. Term III can be decomposed and bounded using the same logic as above, together with similar passages and Lemma A.1. Term IV has the same order of magnitude as term II. As far as V is concerned, a similar logic applies, using Lemma A.2(i). Putting everything together, part (i) of the Lemma follows.
The proof of part (ii) follows essentially the same passages, and is therefore omitted. As far as part (iii) is concerned, the same logic as above can be applied directly to (30), obtaining the stated bound. QED

Lemma A.4 Let Assumptions 1-4 hold. Under H
Also, under H_0^a, it holds that γ̄ − γ = n^{-1} ∑_{i=1}^n (γ̂_i − γ). Using (36) and neglecting higher order terms coming from F̂, consider I; using cross-sectional independence in Assumption 1 and Assumption 3(i), I is bounded by the square root of

by Assumption 3(i) and Lemma A.2(i). Similarly, III is bounded by
, by Lemma A.3(ii). The bounds for II and III are not necessarily the sharpest ones, but they are sufficient for our purpose. Putting everything together, the stated bound for γ̄ follows. QED

A.1. Similar arguments yield II
. Putting all together, this yields F − . QED

Lemma A.6 Let Assumptions 1-4 hold, and let k denote the largest finite moment of
ϵ_it, f_t and x_it. It holds that Proof. In the proof, we extensively use the fact that, for an arbitrary sequence of random variables with a finite k-th moment, the maximum over n terms is o_p(n^{1/k}). The proofs are rather repetitive, and where possible we only provide an intuition of the main argument, omitting passages.
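The display for this maximal bound is lost in this extract; a hedged reconstruction consistent with the o_p(n^{2/k}) and o_p(T^{2/k}) rates used throughout the Lemma (our statement, assuming identically distributed terms) is:

```latex
% Assumption (not verifiable from this extract): the U_i are identically
% distributed with E|U_1|^k \le M < \infty.  The union bound, together with
% the tail estimate P(|U_1| > x) = o(x^{-k}) implied by E|U_1|^k < \infty,
% gives, for every \epsilon > 0,
P\Big( \max_{1 \le i \le n} |U_i| > \epsilon\, n^{1/k} \Big)
\;\le\; n\, P\big( |U_1| > \epsilon\, n^{1/k} \big) \;\longrightarrow\; 0,
\qquad\text{so that}\qquad
\max_{1 \le i \le n} |U_i| = o_p\big(n^{1/k}\big),
\quad
\max_{1 \le i \le n} U_i^2 = o_p\big(n^{2/k}\big).
```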
Consider part (i). We know, from the proof of Lemma A.1, that so that the order of magnitude of max_{1≤i≤n} ‖β̂_i − β_i‖² can be derived by studying T^{-1} max_{1≤i≤n} Consider the former. By the proof of Lemma The largest a for which this moment exists is a = k/2, whence As far as max_{1≤i≤n} ‖T^{-1} X_i' M_w F‖² is concerned, we know from the proof of Lemma A.1 . When applying max_{1≤i≤n}, this only affects the x_it's. To illustrate this, consider term I in (33): (nT)^{-1/2} max_{1≤i≤n} . Therefore, the whole expression is of order o_p(n^{-1/2} T^{-1/2} n^{2/k}). Applying similar passages to terms II and III in (33) . Part (i) follows putting everything together.
Consider part (ii). The passages of the proof are rather repetitive. The main argument is that, based on (40), is bounded by terms such as etc. This entails that, when taking the maximum across t, the order of magnitude of the maximum is given by terms like max_{1≤t≤T} ‖x_jt‖², max_{1≤t≤T} ‖f_t‖² and max_{1≤t≤T} ‖ϵ_jt‖², which are of order o_p(T^{2/k}). This proves part (ii).
The proof of part (iii) is based on (39): By Assumption 4(iii), max_{1≤i≤n} ‖T^{-1/2} F̂'(F̂ − F)‖² ‖γ_i‖² has the same order of magnitude Consider I; based on the same arguments as in (32), we have max_{1≤i≤n} can be studied using (34). It follows that max_{1≤i≤n} T^{-1} , which follows from (35) (first line) and from the C_r-inequality (second line). Note that . Part (iii) follows from putting everything together.
Consider parts (iv) and (v). Using the definition of ε̂_it: Parts (iv) and (v) follow immediately from Assumptions 1 and 2. Explicit rates are derived using the other parts of this Lemma. Parts (vi) and (vii) can be proved similarly, using . Also, note using the same passages as in the proof of Lemma A.2. Putting all together, part (i) follows.

As far as part (ii) is concerned, it follows immediately from noting that
As far as parts (iii)-(vi) are concerned, we extensively use the fact that, for an i.i.d. sequence of random variables Z_1, ..., Z_m such that E(Z_1) = 0 and E|Z_1|^k ≤ M for some k > 2, and given a Brownian motion Results like (38) are known as "Hungarian constructions"; see, inter alia, Csörgő and Révész (1975a,b) and Komlós, Major and Tusnády (1975, 1976); we also refer to Shorack and Wellner (1986). We turn to the proof of part (iii) of the Lemma. We have Consider I. Note that the sequence z_ϵγ,it = γ_i γ_i' ϵ_it² − Σ_Γϵ,t has mean zero and is independent . Also, by using Assumption 4(iii), the largest k . As far as II is concerned, it has the same order of magnitude as max_{1≤t≤T} |ε̂_it² − ϵ_it²|, given in Lemma A.6(v). Turning to III, its order of magnitude is given by O_p , which comes from Lemma A.6(iv).
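The display (38) itself is missing from this extract; the strong approximation established in the references just cited, which the surrounding argument appears to rely on, can be stated as follows (our reconstruction of the standard Komlós-Major-Tusnády result, not the paper's exact display):

```latex
% KMT (Hungarian) construction: for i.i.d. Z_1, \dots, Z_m with
% E(Z_1) = 0, E|Z_1|^k \le M for some k > 2 and Var(Z_1) = \sigma^2,
% the probability space can be enlarged to carry a Brownian motion W with
\max_{1 \le s \le m} \left| \sum_{i=1}^{s} Z_i - \sigma\, W(s) \right|
= o_{a.s.}\big( m^{1/k} \big).
```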
Term IV is dominated. Putting all together, part (iii) of the Lemma follows.
As far as part (iv) is concerned, the proof is similar, in spirit, to that of part (iii). Recall As far as I is concerned, the proof is similar to that of part (i) of this Lemma, upon noting that the largest a for which . Considering III, a similar logic as above yields by the C_r-inequality and Lemma A.3(ii). Using Lemma A.6(vi), we have III = o_p(n^{2/k} T^{1/2} δ_nT^{-2}).
Term IV is dominated. Putting all together, part (iv) follows.
Consider now part (v). We provide the proof setting M_Xi = I_T for simplicity. For an i.i.d., zero mean, Gaussian sequence , we can write, in view of (38), where the largest k for which this can be written corresponds to the largest a for which Term I satisfies (38) with also, the largest existing moment over t is of order k. Thus, max_{1≤t≤T} n^{1/2−1/k} n Putting everything together, and defining n^{-1/2} ∑_{i=1}^{n} N_it² = N_t, part (vi) follows. We point out that, in parts (v) and (vi), independence over, respectively, t and i is not required; thus, Assumptions 1 and 2 could be relaxed to accommodate serial and cross sectional dependence, as long as the Hungarian construction (38) holds. QED

Appendix B: Proofs
Proof of Theorem 1. By definition, we have We start by considering the denominator of (39):

Repeated application of Lemma A.3 yields
. Thus, as We turn to the numerator of (39). It holds that By Assumptions 1 and 2, it holds that I = O_p(1). As far as II is concerned, note that, applying Lemma A.1(i) (to the first term), and Lemma A.3(i) and Assumptions 1 and 2 (to the second term), it follows that . Thus, the numerator of (39) is of order Finally, as (n, T) → ∞ under the restriction Using a CLT based on Assumptions 1 and 2, which proves (6). QED

Proof of Theorem 2.
Using (30), we can write The order of magnitude of I follows from exactly the same passages as in the proof of Lemma . Consider II; omitting γ_j in view of Assumption 4(iii), we have . Using Lemma A.3(i), it can be shown that . As far as IV is concerned, note that Similar passages as in the proof of the order of magnitude of II_a, and the fact that are based on the same arguments as in Bai (2003), since the estimation error β̂_j − β_j does not appear in their expression. Putting everything together, it holds that f̂ . As (n, T) → ∞ with n/T → 0, the dominant term is VII, whose asymptotics is exactly the same as studied in Bai (2003, Theorem 1). QED

Proof of Theorem 3. Prior to proving the Theorem, we lay out some preliminary results and notation. We write Under H_0^a, a_i = 0; also, b_i can be rewritten as b_i = γ̂_i − γ̂. Using (39), we have After these preliminary calculations, we turn to proving (20). In order to do this, we firstly show that can be approximated by the maximum of a sequence of independent random variables with a χ²_r distribution, up to a negligible error. Given that the maximum of a sequence of chi-squares is of order O_p(ln n), the approximation error should be at most o_p(ln n). Secondly, we show that I_i − V_i in (42) are also all o_p(ln n) uniformly in i.
, and consider in particular the sequence It holds that . As far as the numerator of this expression is concerned, by Lemma A.7(v) we write T^{-1/2} F' M_Xi ϵ_i = N_i + R_Ni, with N_i defined in Lemma A.7 as zero mean Gaussian with covariance matrix Σ_fMe,i, and . Note also that √T b_1i and √T b_1j are independent for i ≠ j as T → ∞, due to Assumption 1. Indeed, it can be shown through a Cramér-Wold device that has, asymptotically, a normal distribution with covariance where σ_ij is such that |E(ϵ_it ϵ_js)| ≤ σ_ij for all (t, s). Thus, in order to have asymptotic independence, it suffices to have σ_ij → 0 as (n, T) → ∞. As far as the denominator of √T b_1i is concerned, based on Lemma A.7(ii) we write . Hence we write Based on (43), and on the definitions of Σ_fMe,i and of Σ_fM,i, it holds that and it is dominated. Therefore After proving that max_{1≤i≤n} T we turn again to equation (42). Turning to max_{1≤i≤n} II_i, note that, in equation (42), by the invertibility of Σ_γi^{-1} and Lemma A.7(iv), max_{1≤i≤n} T has the same order of magnitude as max_{1≤i≤n} ; it can be evaluated by considering the orders of magnitude The former can be shown to be o_p( , based on the proof of Lemma A.6(iii). The latter has the same order of magnitude as by Lemmas A.4 and A.7(iv). Finally, putting all together, and using (45), it holds that where the remainders are negligible as (n, T) → ∞ with √T n^{2/k_1}/n → 0 and n^{4/k_1}/T → 0.
Equation (20) follows from (46); the asymptotics of max_{1≤i≤n} N_i' Σ_fMe,i^{-1} N_i is studied e.g. in Embrechts, Klüppelberg and Mikosch (1997, Table 3.4.4, p. 156). We now finish the proof of the Theorem by analysing the power properties of the test. In order to evaluate the presence of power when γ_i ≠ γ for some (at least one) i, after some algebra it can be shown that, under the alternative, S_γ,nT has non-centrality parameter given by in view of Lemma A.6(iii); which tends to 1 if c_α,n − S_γ,nT^NC → −∞ as (n, T) → ∞. In view of equation (22), we know that c_α,n = O(ln n), whence (21) follows. QED

Proof of Theorem 4. The proof is very similar, in spirit, to the proof of Theorem 3, and therefore some passages are omitted to save space. Consider the following preliminary notation and derivations. We write where b_2t contains terms I−VI and VIII in (40). Also, for each t, Neglecting higher order terms containing o_p After these preliminary calculations, we now turn to proving (23). Similarly to the proof of Theorem 3, we firstly prove that max_{1≤t≤T} n can be approximated by the maximum of a stationary sequence of random variables with a χ²_r distribution, up to a negligible error. Secondly, we show that, in (48), max_{1≤t≤T} I_t, ..., max_{1≤t≤T} V_t are all o_p(ln T) uniformly in t.
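The Gumbel limit for maxima of (approximately) χ²_r variables, invoked above via Embrechts, Klüppelberg and Mikosch (1997, Table 3.4.4), can be illustrated by simulation. A minimal sketch, assuming the norming constants a_n, b_n obtained by specialising the Gamma-case entry of that table to χ²_r (our specialisation, not reproduced from the paper; the choice r = 2 in the demo is purely illustrative):

```python
import numpy as np
from math import lgamma, log

def gumbel_norming(n, r):
    """Norming constants (a_n, b_n) for the maximum of n i.i.d.
    chi-square(r) draws, specialising the Gamma(alpha, beta) entry of
    Embrechts et al. (1997, Table 3.4.4) to alpha = r/2, beta = 1/2."""
    a_n = 2.0
    b_n = 2.0 * (log(n) + (r / 2.0 - 1.0) * log(log(n)) - lgamma(r / 2.0))
    return a_n, b_n

def normalized_maxima(n, r, reps, seed=0):
    """Simulate `reps` maxima of n chi-square(r) draws and apply the
    Gumbel norming; the result is approximately standard Gumbel."""
    rng = np.random.default_rng(seed)
    draws = rng.chisquare(r, size=(reps, n))
    a_n, b_n = gumbel_norming(n, r)
    return (draws.max(axis=1) - b_n) / a_n

if __name__ == "__main__":
    # The standard Gumbel law has mean equal to the Euler-Mascheroni
    # constant (about 0.577); convergence is logarithmic in n, so the
    # simulated mean should only be read as a rough check.
    m = normalized_maxima(n=5000, r=2, reps=1000)
    print(round(float(m.mean()), 3))
```

The slow (logarithmic) rate of convergence to the Gumbel law is consistent with the critical values c_α,n = O(ln n) appearing in the proof.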
We start from max_{1≤t≤T} n with N_t defined in Lemma A.7 as zero mean Gaussian with covariance matrix Σ_Γϵ,t, and . The N_t's are independent across t by Assumption 1. More generally, the covariance matrix between N_t and N_s for s ≠ t is given which is known as the Berman condition (Berman, 1964). Note also that by Lemmas A.7(i) and A.3(ii). Hence, we write The passages are very similar to those after (44) in the proof of Theorem 3. In particular, it can be shown using Lemma A.7 that max_{1≤t≤T} I_t^b1 and max_{1≤t≤T} II_t^b1 are both ; and that max_{1≤t≤T} IV_t^b1, ..., max_{1≤t≤T} IX_t^b1 are all dominated and therefore negligible. Thus where the approximation errors are negligible as long as (n, T) → ∞ with n^{1/k_2} T^{1/k_2}/√n → 0. We turn back to equation (48). We show that max_{1≤t≤T} I_t, ..., max_{1≤t≤T} V_t in (48) are all by using Lemma A.7(iii). Also, combining Lemmas A.5 and Lemma . As far as max_{1≤t≤T} II_t and max_{1≤t≤T} III_t are concerned, studying their order of magnitude involves finding a bound for max_{1≤t≤T} ‖b_2t‖ and max_{1≤t≤T} ‖b_2t‖². Recall Similar passages as in the proof of Lemma A.6(ii) yield max_{1≤t≤T} ‖b_2t‖ = o_p( . We now turn to analysing max_{1≤t≤T} II_t and max_{1≤t≤T} III_t.
As far as the former is concerned, max Putting all together, we have Equation (23) follows from standard EVT (Embrechts, Klüppelberg and Mikosch, 1997); in case of serial dependence, if the Berman condition (49) holds, equation (23) can be shown e.g. by using Theorem 3.5.1 in Leadbetter and Rootzén (1988, p. 470).
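Condition (49) is not reproduced in this extract; in its textbook form for a stationary Gaussian sequence (Berman, 1964; Leadbetter and Rootzén, 1988), the condition presumably reads:

```latex
% Berman condition: for a stationary Gaussian sequence with lag-h
% correlation r(h), the i.i.d.-type Gumbel limit for the maximum is
% preserved whenever the correlations decay fast enough that
r(h)\,\ln h \;\longrightarrow\; 0 \qquad \text{as } h \to \infty .
```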
We now complete the proof of the Theorem by studying the power versus local alternatives. Under H_1^b, it can be shown that S_f,nT has non-centrality parameter given by by construction. Also, II is bounded by by Lemma A.6(ii); similarly . Let S_f,nT^0 denote the null distribution of S_f,nT. Then, under H_1^b ; P[S_f,nT > c_α,T] tends to 1 if c_α,T − S_f,nT^NC → −∞ as (n, T) → ∞; since we know from (25) that c_α,T = O(ln T), (21) follows from (24). QED

In addition to the proofs of the main results in the paper, we now report two negative results concerning the use of average-type statistics to test for the null of no factor structure, as discussed in Section 3; see equations (18) and (19).
Theorem B.1 Let Assumptions 1-4 hold, and assume that, as (n, T) → ∞

Proof of Theorem B.1 Consider I; using (36), we can write By assumption, I_a is O_p(1). The joint limit assumed in the statement of the Theorem could be shown under more primitive conditions, but this suffices for our purposes. Turning to I_b, it is bounded by where we have used the consistency of Σ̂_γi and Assumptions 3(i) and 4(iii). Applying . By a similar logic, it can be shown that I_c is Turning to I_d, similar passages as above entail that it is bounded by Similarly, I_e is bounded by using a similar logic, it can be shown that . Putting all together, I = . Finally, consider II and III in (54). As far as . Turning to III, this is bounded by , which has the same order of magnitude as II.
Putting all together, it holds that . QED

Theorem B.2 Let Assumptions 1-4 hold, and assume that, as (n, T) → ∞ .

Proof of Theorem B.2 Under
Consider I; using (47) we may write By assumption, I_a is O_p(1); this joint limit could be shown under more primitive assumptions. As far as I_b is concerned, by virtue of the consistency of Σ̂_ft, it is bounded by , which follows from the proof of Theorem 2. Finally, turning to I_c and setting Σ̂_Ft^{-1} = I_r for simplicity, we may write and I_c, . We now turn to analysing II and III in (55). By Lemma . As far as III is concerned, using the consistency of Σ̂_ft, it is

Table 3: Empirical rejection frequencies (for a nominal size of 5%) and power of the test for H_0^a: γ_i = γ, based on S_γ,nT. The DGP used in the simulations is (26)-(28), i.e. the case of no common factor structure in the regressors.

Table 7: Empirical rejection frequencies (for a nominal size of 5%) and power of the test for H_0^b: f_t = f, based on S_f,nT. The DGP used in the simulations is (29), i.e. the case of a pure factor model for y_it.