A Randomized Sequential Procedure to Determine the Number of Factors

ABSTRACT This article proposes a procedure to estimate the number of common factors k in a static approximate factor model. The building block of the analysis is the fact that the first k eigenvalues of the covariance matrix of the data diverge, while the others stay bounded. On the grounds of this, we propose a test for the null that the ith eigenvalue diverges, using a randomized test statistic based directly on the estimated eigenvalue. The test only requires minimal assumptions on the data, and no assumptions are required on factors, loadings or idiosyncratic errors. The randomized tests are then employed in a sequential procedure to determine k. Supplementary materials for this article are available online.


Introduction
This article proposes a procedure to determine the number of factors in a static approximate factor model, viz.

X_{i,t} = φ_i′ F_t + u_{i,t}, 1 ≤ i ≤ N, 1 ≤ t ≤ T, (1)

where φ_i and F_t are column vectors of finite dimension k.
Starting from the seminal contribution by Chamberlain and Rothschild (1983), inference on (1) has been the subject of several studies. In recent years, many contributions have focused on the case of panel data, where both N and T are large (see, inter alia, the review by Bai and Ng 2008). The first step in the analysis of (1) is, arguably, the determination of the number of common factors, k. To this end, the literature has developed numerous techniques, which are usually based on a well-established fact: the first k eigenvalues of the covariance matrix of the X_{i,t}s diverge to infinity, whereas the others stay bounded. Two main approaches have been developed. The first one is based on finding a threshold for the eigenvalues of the covariance matrix of the X_{i,t}s, which can be used to decide which eigenvalues are finite and which ones are not; the information criteria proposed by Bai and Ng (2002) belong in this category. The second approach is based on computing the ratio of adjacent eigenvalues, again exploiting the fact that such a ratio eventually diverges: this is the rationale employed by Onatski (2009, 2012) and Ahn and Horenstein (2013), inter alia. Neither approach is free from problems. Typically, eigenvalue thresholding requires the choice of a penalty function, as is customary in the context of information criteria (see Bai and Ng 2002). However, such a choice is not unique, which is bound to affect at least the finite sample properties of the estimated k; note, however, that, building on an idea in Hallin and Liska (2007), Alessi, Barigozzi, and Capasso (2010) proposed a robust, data-driven methodology to tune the choice of the penalty function, which works very well in simulations. Moreover, existing techniques also require restrictions on the relative rate at which N and T pass to infinity.

In this article, we propose a test for the null hypothesis that the pth largest eigenvalue of the covariance matrix of the data, λ^(p), diverges, that is,

H_0 : λ^(p) → ∞ versus H_A : λ^(p) < ∞. (2)

The tests are then employed as part of a sequential procedure to determine k. From a methodological point of view, the test statistic employed in this article mimics the behavior of λ^(p), that is, it diverges to positive infinity under H_0. Owing to such lack of randomness under the null, we base our tests on randomizing the test statistic. This approach is not new, per se, in the literature: the original idea dates back to Pearson (1950), and it has recently been introduced in econometrics (see, e.g., Corradi and Swanson 2006). In particular, in this article we follow the approach used in Corradi and Swanson (2006), where randomization is employed in conjunction with sample conditioning: randomness is added to the basic statistic, and then the asymptotics is derived conditional on the sample, showing its validity for all samples, save for a zero measure set. Therefore, as explained by Corradi and Swanson (2006), the notion of size is different from the standard one: classically, the level α of a test means that, if a researcher applies the test B times and the null is valid, then (s)he will reject the null with frequency α, that is, (s)he will be wrong αB times. Conversely, in this context α means that, out of J researchers who apply the test, αJ of them will reject the null when this is true. Still, in our article we obtain a test statistic which, for a given level α, rejects the null with probability α (when true), and with probability 1 (when false).
Although randomization is still not widely employed in econometrics, it has several advantages in this context. To begin with, it makes it possible to actually test for the existence of common factors, rather than merely serving as a diagnostic. Also, no restrictions are required on the relative rate of divergence of N and T as they pass to infinity: the number of factors can be estimated for any values of N and T, as long as min{N, T} → ∞. This feature makes the test applicable in virtually any context, and it may be helpful in several applications where one dimension is much larger than the other; examples include such diverse fields as accounting (where data are often recorded on an annual basis and are available for many companies, but for a limited number of years), finance (where, e.g., data on hedge fund performance are available for thousands of funds, which are live for a relatively short span), microeconometrics with firm-level data, marketing studies (where revealed preferences are often recorded over a limited period of time for many consumers), and genomics (where usually thousands of genomes are observed for tens of patients). Finally, it is important to note that the approach adopted in this article relies on a well-known asymptotic representation result: on account of (1), the process X_{i,t} is decomposed into two subspaces, the one spanned by (eigenvectors associated with) the spiked eigenvalues of the covariance matrix of the data, and its orthogonal complement. Thus, the observations decompose into the sum of their projections onto these subspaces: such a decomposition always exists and is unique (we refer to the comments by Hallin and Lippi 2013). As a consequence, the theory in this article uses assumptions spelled out directly on the observable quantities X_{i,t}, rather than on F_t or u_{i,t}.
The remainder of the article is organized as follows. In Section 2, we spell out the main assumptions and derive preliminary results. The test and its properties (null distribution and consistency) are discussed in Section 3; in particular, in Section 3.3, we discuss the sequential procedure to determine k. Section 4 contains a set of simulations to verify the properties of the test for no factor structure, and of the whole procedure to determine k. Section 5 concludes. Extensions to the case of weak factors and proofs are in the Supplemental Material to the article (Trapani 2017).
Notation. We denote ordinary limits as "→", and use the symbol "≍" to indicate that two sequences, say a_{N,T} and b_{N,T}, have the same order of magnitude, that is, a_{N,T} = O_p(b_{N,T}) and b_{N,T} = O_p(a_{N,T}). We use "a.s." as shorthand for "almost surely", and "≡" for definitional equality. The notation M (and, where needed, M′, M″, etc.) denotes a finite, generic constant that may differ from line to line. Other relevant notation is introduced in the remainder of the article.

Assumptions and Preliminary Theory
Consider the matrix form of (1): X_t = Φ F_t + u_t, where X_t ≡ [X_{1,t}, . . . , X_{N,t}]′, u_t ≡ [u_{1,t}, . . . , u_{N,t}]′, and Φ is an N × k matrix whose ith row is φ_i′. Henceforth, we assume, without loss of generality, that the data have mean zero, and also that common factors and idiosyncratic errors are orthogonal, as is typical in this literature.
Assumption 1. It holds that: (i) E(X_{i,t}) = 0 for every i and t; (ii) E(F_t u_{i,t}) = 0 for every i and t.

Assumption 1(ii) is actually a consequence of the asymptotic representation in (1). By Assumption 1, the covariance matrix of X_t decomposes as Σ_X = Σ_F + Σ_u, where Σ_F and Σ_u denote the covariance matrices of the common and of the idiosyncratic component, respectively. The following notation will also be used extensively henceforth: the pth largest eigenvalue of Σ_X is denoted as λ^(p); the pth eigenvalue of Σ_F as γ^(p); and, finally, the pth eigenvalue of Σ_u as ω^(p).
Assumption 2. It holds that: (i) γ^(p) = c^(p) N + o(N), with c^(p) > 0, for 1 ≤ p ≤ k; (ii) γ^(p) = 0 for p ≥ k + 1; (iii) ω^(p) ≤ M < ∞ for all p.

Assumption 2 adds some structure to the spectra of Σ_F and Σ_u, and it is similar, in spirit, to Assumptions 4, 5, and 8 in Forni et al. (2009). As far as the ω^(p)s are concerned, we require that they all be finite; however, they do not need to be distinct or bounded away from zero, and some or all of them could indeed be zero.
As far as the nonzero γ^(p)s are concerned, part (i) of the assumption requires that they diverge to positive infinity, as N → ∞, at a rate O(N). This assumption is typical of factor analysis: for example, Bai and Ng (2002) required that, in addition to the covariance matrix of F_t being positive definite, N^{−1}Φ′Φ tends to a positive definite matrix, which is tantamount to assuming that γ^(p) passes to infinity at a rate O(N). Such behavior of the common factors is often referred to, in the literature, as having "strong" or "pervasive" factors (see Onatski 2015, in particular Assumption 2 and the discussion thereafter). However, weaker factors could also be considered, where γ^(p) is allowed to diverge at a rate slower than N. For the time being, and to make the presentation of results easier to follow, we focus on the case of strong factors only; in the Supplement (Trapani 2017), we investigate the more general case of γ^(p) = m_p N^{1−ν_p} with ν_p ∈ [0, 1).
Finally, note that Assumption 2 does not require that the λ^(p)s be distinct, or that the diverging eigenvalues be well-separated, which are typical requirements in this literature (see, e.g., Wang and Fan 2017, and also Forni et al. 2009).
The following well-known result characterizes the eigenvalues of Σ_X.
Lemma 2.1. Let {c^(p)} be a set of nonnegative finite numbers, which are strictly positive for p ≤ k. Then, under Assumptions 1 and 2(i)-(ii), it holds that, as N → ∞, λ^(p) = c^(p) N + o(N) for 1 ≤ p ≤ k; under Assumptions 1 and 2, it holds that λ^(p) ≤ c^(p) < ∞ for p ≥ k + 1.

According to Lemma 2.1, λ^(p) either diverges at a rate O(N), or it converges to a finite constant (which may well be equal to zero), according as p ≤ k or not. Basically, the behavior of the eigenvalues of Σ_X as N passes to infinity is the same as that of the eigenvalues of Σ_F.
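To fix ideas, consider a minimal one-factor illustration (ours, not part of the lemma): take k = 1 with φ_i = 1 for every i, Var(F_t) = 1, and Σ_u = I_N, so that Σ_X = ι_N ι_N′ + I_N, where ι_N ≡ (1, . . . , 1)′. The eigenvalues of Σ_X are λ^(1) = N + 1, which diverges at a rate O(N), and λ^(p) = 1 for all p ≥ 2, exactly as Lemma 2.1 predicts.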

Estimation of λ (p)
Consider the sample covariance matrix Σ̂_X ≡ T^{−1} ∑_{t=1}^{T} X_t X_t′, and let λ̂^(p) denote the pth largest eigenvalue of Σ̂_X. To derive the asymptotics of λ̂^(p), we need the following assumption.
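As a computational aside, the following minimal sketch (ours, with hypothetical function names; NumPy assumed) computes Σ̂_X and its ordered eigenvalues from an N × T panel of zero-mean data, matching the definitions above.

```python
import numpy as np

def sample_eigenvalues(X):
    """Ordered eigenvalues of Sigma_hat_X = T^{-1} sum_t X_t X_t'.

    X is an N x T array whose tth column is the cross-section X_t; the data
    are assumed to have mean zero, as in the article's setup.
    """
    N, T = X.shape
    sigma_hat = (X @ X.T) / T            # N x N sample covariance matrix
    lam = np.linalg.eigvalsh(sigma_hat)  # ascending order for symmetric input
    return lam[::-1]                     # lambda_hat^(1) >= lambda_hat^(2) >= ...
```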

Assumption 3. It holds that: (i) E|X_{i,t}|^4 ≤ M for every i and t; (ii) E|T^{−1/2} ∑_{t=1}^{T} [X_{h,t}X_{j,t} − E(X_{h,t}X_{j,t})]|^2 ≤ M for every h and j.
Assumption 3(ii) is a high-level condition which deserves further comment. In essence, it poses a constraint on the amount of serial correlation that one can have in the process {X_{h,t}X_{j,t}}_{t=1}^{T}, and therefore, albeit indirectly, in X_{i,t}. According to the assumption, X_{i,t} does not need to be independent across t, which is a requirement in "classical" Random Matrix Theory (see Bai 1999). On the other hand, restrictions similar to Assumption 3(ii) are customarily employed in the literature on factor models (see, e.g., Bai and Ng 2002; Forni et al. 2009; and Onatski 2015). However, Assumption 3(ii) differs from the assumptions typically made in this literature, since it restricts the amount of serial dependence directly in the X_{i,t}s, as opposed to considering the unobservable quantities F_t and u_{i,t} (see, however, Forni et al. 2009). In this respect, Assumption 3(ii), on account of its involving observable quantities only, should be easier to understand and verify.
Two examples are reported below to illustrate how Assumption 3(ii) can be verified from more primitive conditions.

Example 1. Assumption 3(ii) holds if the data are independent. Indeed, assuming that {X_{h,t}, X_{j,t}} is independent across t, Burkholder's inequality (see Lin and Bai 2010) and Assumption 3(i), together with the maximal inequality in Serfling (1970, p. 1231), entail that Assumption 3(ii) holds.
Example 2. To consider more general cases of (weak) dependence, assume that X_{i,t} is a stationary process with the representation X_{i,t} = g(ε_{i,t}, ε_{i,t−1}, . . .), such that

‖X_{i,t} − E(X_{i,t} | F^i_{s,t})‖_2 ≤ c_{i,t} ν(s), (7)

where F^i_{s,t} is the σ-field generated by {ε_{i,t}, ε_{i,t−1}, . . . , ε_{i,t−s}}, ‖·‖_2 denotes the L_2-norm, c_{i,t} is a sequence of nonnegative numbers, and ν(s) → 0 as s → ∞, so that X_{i,t} is L_2-NED (near-epoch dependent) on {ε_{i,t}}. Condition (7) is very popular when considering nonlinear transformations, and it holds for a wide variety of processes, including linear processes, ARCH and GARCH processes, and data from dynamical systems and Volterra series (see Davidson 2002, inter alia). By Assumption 3(i) and (7), it follows that X_{h,t}X_{j,t} is L_2-NED of size 1/2 (see Example 17.17 in Davidson 1994, p. 273). Thus, by Theorem 17.5 in Davidson (1994, p. 204), X_{h,t}X_{j,t} is an L_2-mixingale of size 1/2; hence, Assumption 3(ii) follows from McLeish's maximal inequality (McLeish 1975).
The rate of convergence of λ̂^(p) is given in the following lemma.
Lemma 2.2. Under Assumptions 1 and 3, it holds that, for every 1 ≤ p ≤ min{N, T} and every ε > 0,

λ̂^(p) − λ^(p) = o_{a.s.}(N T^{−1/2} ln^{1/2+ε} T). (8)

Lemma 2.2 contains a strong rate for λ̂^(p) − λ^(p), and it can be viewed as the sample counterpart to the population result in Lemma 2.1. The rate is valid for any combination of N and T, and for all estimated eigenvalues. Further, the lemma does not require Assumption 2, and therefore it does not require any assumptions on the λ^(p)s: these do not need to be distinct or (when they diverge) well-separated; some of the eigenvalues may be equal to zero; and the eigenvalues that diverge do not need to do so at any special rate. In essence, the lemma states that eigenvalues are estimated with an error which depends on the dimensions of the dataset, N and T; in light of (8), the estimation error is quite large. It is, however, comparatively small for 1 ≤ p ≤ k, since λ^(p) is of order O(N). Conversely, the estimation error is very large when k + 1 ≤ p ≤ min{N, T}, compared to λ^(p), which is bounded. As shown in Section 3, the rates in (8), while not necessarily sharp for all estimated eigenvalues, afford the construction of a test statistic for (2).
Albeit only incidental to the main arguments in the article, the result in (8) can be compared with related findings in the literature. In the context of "classical" Random Matrix Theory, it has been shown that, when X_{i,t} is iid across i and t and N/T converges to a finite constant, the extreme eigenvalues of the sample covariance matrix converge almost surely; see Bai and Yin (1993). Lemma 2.2 illustrates what happens in the presence of common factors, which introduce a spiked eigenvalue structure, with some of the λ^(p) diverging. Using different assumptions (chiefly, N > T and a restriction on the rate at which λ^(p) passes to infinity), Wang and Fan (2017; see Theorem 3.1) derive the limiting distribution of λ̂^(p) for 1 ≤ p ≤ k, showing asymptotic normality at a rate N/√T. This suggests that the strong rate in (8) should be optimal, modulo the logarithmic terms, at least for 1 ≤ p ≤ k; note, however, that Lemma 2.2 holds for all eigenvalues, not only for the spiked ones. Similarly, in a different context and with slightly more stringent assumptions on the eigenvalues, Forni et al. (2009; see Lemma 2(a)) show that λ̂^(p) − λ^(p) = O_P(max{1, N/√T}). Lemma 2.2 (which is an almost sure result) yields essentially the same rate when N and T are of comparable magnitude.

The Test
In this section, we define the test statistic and study its asymptotics under the null and the alternative:

H_0 : λ^(p) = m_p N versus H_A : λ^(p) ≤ m_p,

for some m_p, with 0 < m_p < ∞.

The Test Statistic
Let β ≡ ln N / ln T, and define δ ∈ [0, 1) such that

δ = 0 if β < 1/2, and δ > 1 − 1/(2β) if β ≥ 1/2. (9)

Also, consider the following (scale-normalizing) estimator:

λ̃_N ≡ N^{−1} ∑_{j=1}^{N} λ̂^(j). (10)

We are now ready to introduce the test. Define

ϕ^(p) ≡ exp{N^{−δ} λ̂^(p) / λ̃_N}. (11)

Under the null that λ^(p) = m_p N, ϕ^(p) → ∞ at a rate exp{N^{1−δ}}; conversely, ϕ^(p) converges to a finite number under the alternative that λ^(p) < ∞. We now provide a full-fledged explanation of the latter statement. To understand the need for rescaling by N^{−δ}, note that under the alternative it is required that the estimation error N^{−δ}(λ̂^(p) − λ^(p)) drift to zero; by Lemma 2.2, this error is o_{a.s.}(N^{1−δ} T^{−1/2} ln^{1/2+ε} T), and on account of (9) it holds that −1/2 + β(1 − δ) < 0, which yields the desired result. Hence, under the alternative, it follows that N^{−δ} λ̂^(p) / λ̃_N = N^{−δ} λ^(p) / λ̃_N + o_{a.s.}(1), so that ϕ^(p) converges a.s. to a finite limit. Note that, under the null, N^{−δ} λ̂^(p) diverges, given that N^{−δ} λ^(p) = m_p N^{1−δ} and δ < 1 by construction. Finally, we point out that λ̃_N makes the argument of the exponential scale-free; in principle, any statistic that ensures scale invariance may also be used.
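In code, ϕ^(p) can be sketched as follows (ours; the exact form of δ in (9) and the margin eps are assumptions made explicit in the reconstruction above, not the article's prescriptions):

```python
import numpy as np

def phi_statistic(lam_hat, p, N, T, eps=0.05):
    """Compute phi^(p) = exp{N^{-delta} * lam_hat^(p) / lam_tilde_N}.

    delta follows condition (9) as reconstructed above: delta = 0 when
    beta = ln N / ln T < 1/2, and slightly above 1 - 1/(2*beta) otherwise
    (the margin eps is a user choice, labeled as an assumption).
    """
    beta = np.log(N) / np.log(T)
    delta = 0.0 if beta < 0.5 else min(1.0 - 1.0 / (2.0 * beta) + eps, 0.99)
    lam_tilde = lam_hat[:N].mean()       # lam_tilde_N = N^{-1} sum_j lam_hat^(j)
    # np.exp may overflow to inf under a strong factor; downstream code copes.
    return np.exp(N ** (-delta) * lam_hat[p - 1] / lam_tilde)
```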
Given that ϕ^(p) → ∞ under the null, we cannot use it directly, and we instead propose a randomized version of it. We present the construction of the test statistic as a four-step algorithm.
Step 1. Generate an artificial sample {ξ_j^(p)}_{j=1}^{R} of iid N(0, 1) variates, and define the sequence x_j^(p) ≡ √(ϕ^(p)) × ξ_j^(p), 1 ≤ j ≤ R;

Step 2. Define the sample {ζ_j^(p)(u)}_{j=1}^{R}, where ζ_j^(p)(u) ≡ I[x_j^(p) ≤ u], with u extracted from a distribution F(u) with support U ⊂ R\{0}; (12)

Step 3. Compute

ϑ^(p)(u) ≡ (2/√R) ∑_{j=1}^{R} [ζ_j^(p)(u) − 1/2]; (13)

Step 4. Define the test statistic

Θ^(p) ≡ ∫_U [ϑ^(p)(u)]² dF(u). (14)

We give a heuristic preview of how the test statistic works. Under the null, ϕ^(p) passes to infinity, so that the variance of √(ϕ^(p)) × ξ_j^(p) should be ∞; consequently, the iid sequence {ζ_j^(p)(u)}_{j=1}^{R} should be made of Bernoulli variates with mean 1/2, so that ϑ^(p)(u) in (13) is asymptotically standard normal. Under the alternative, ϕ^(p) stays finite, so that in (13) there is a sum of iid random variables with nonzero mean, which diverges (in absolute value) at a speed √R.
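A compact sketch of Steps 1-4 follows (ours; the √(ϕ^(p)) scaling follows the reconstruction above, and the two-point choice of F(u) anticipates the recommendation u = ±√2 made in Section 3.2):

```python
import numpy as np

def randomized_test_stat(phi_p, R=400, u_grid=(2.0 ** 0.5, -(2.0 ** 0.5)), rng=None):
    """Steps 1-4: randomized statistic Theta^(p), asymptotically chi2(1) under H0.

    F(u) is taken as the two-point distribution placing equal weight on
    +/- sqrt(2), an assumption anticipating Section 3.2.
    """
    rng = np.random.default_rng() if rng is None else rng
    xi = rng.standard_normal(R)                   # Step 1: iid N(0,1) draws
    x = np.sqrt(phi_p) * xi                       # scaled sequence sqrt(phi^(p)) * xi_j
    theta = 0.0
    for u in u_grid:
        zeta = (x <= u).astype(float)             # Step 2: zeta_j(u) = I[x_j <= u]
        vartheta = 2.0 / np.sqrt(R) * np.sum(zeta - 0.5)   # Step 3, eq. (13)
        theta += vartheta ** 2 / len(u_grid)      # Step 4: integrate wrt F(u)
    return theta
```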

Asymptotic Properties
We now discuss the null distribution and the power versus H_A : λ^(p) ≤ m_p < ∞. Henceforth, we frequently employ the following notation: P* is the probability law of {ζ_j^(p)(u)}_{j=1}^{R} conditional on the sample, and "→^{D*}" denotes convergence in distribution according to P*.
The following theorem characterizes the null distribution of Θ^(p).

Theorem 3.1. Let Assumptions 1-3 hold, and let R → ∞ as min{N, T} → ∞, with

R / ϕ^(p) → 0 a.s. (15)

Then, under H_0, it holds that Θ^(p) →^{D*} χ²(1) for almost all realizations of {X_{i,t}}.
Theorem 3.1 states that, under the null, Θ^(p) follows a chi-squared distribution with one degree of freedom; the result holds for all samples, save for a zero measure set, and no restrictions are needed on the relative rate of divergence of N and T as they pass to infinity.
In order for Theorem 3.1 to hold, it is necessary that R → ∞, which is natural since equation (13) is an application of the CLT; equation (15) provides an upper bound for R.
Define c_α such that, as min{N, T, R} → ∞, it holds that P[Θ^(p) > c_α] = α under H_0. The following theorem states that the test is consistent versus the alternative H_A : λ^(p) ≤ m_p.

Theorem 3.2. Let Assumptions 1-3 hold, and let R → ∞ as min{N, T} → ∞. Then, under H_A, it holds that P*[Θ^(p) > c_α] → 1 for almost all realizations of {X_{i,t}}.

In the proofs of Theorems 3.1 and 3.2, we show that ϑ^(p)(u) has a noncentrality parameter asymptotically equal to (2R/π)(u²/ϕ^(p)) exp{−ū²}, where ū ∈ (0, |u|/√(2ϕ^(p))). Under the null, this term should go to zero, whence (15). Under the alternative, the term is bounded from below by (2R/π)(u²/ϕ^(p)) exp{−u²/(2ϕ^(p))}; this expression has a local maximum at |u| = √(2ϕ^(p)), and, if δ > 0, then ϕ^(p) converges to 1; these heuristic considerations point toward choosing u = ±√2.
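As a quick sketch of the last point (based on the lower bound as reconstructed above), maximizing u² exp{−u²/(2ϕ^(p))} over |u| is a one-line calculus exercise: setting the derivative (2u − u³/ϕ^(p)) exp{−u²/(2ϕ^(p))} equal to zero yields |u| = √(2ϕ^(p)), which indeed tends to √2 as ϕ^(p) → 1.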

Determining k
In this section, we study how the individual tests for H_0 : λ^(p) → ∞ can be used, in a sequential procedure, to determine the number of common factors. The estimator of k (say k̂) is the output of the following algorithm:

Step 1. Run the test for H_0 : λ^(1) → ∞ based on Θ^(1). If the null is rejected, set k̂ = 0 and stop; otherwise, go to the next step.
Step 2. Starting from p = 1, run the test for H_0 : λ^(p+1) → ∞ based on Θ^(p+1), constructed using an artificial sample {ξ_j^(p+1)}_{j=1}^{R} drawn independently of the samples used at the previous steps. If the null is rejected, set k̂ = p and stop; otherwise, repeat the step until the null is rejected (or until a pre-specified maximum number, say k_max, is reached). A sketch of the full procedure is given below.

As can be expected, in this context a pivotal role is played by the level of the individual tests, α, which should be chosen so that k̂ is a good approximation of k, at least asymptotically.

Theorem 3.3. Let the assumptions of Theorems 3.1 and 3.2 hold, and let α → 0 as min{N, T, R} → ∞. Then it holds that P[k̂ = k] → 1.

Theorem 3.3 states that k̂ is consistent, as long as the level α of the individual tests is chosen so as to converge to zero: no specific rates are required (see also Kapetanios 2010). Further, the theorem does not require any special choice of k_max: as long as this value is "large enough" (that is, as long as k_max ≥ k), the theorem holds. It is worth noting that the literature usually selects k_max through Schwert's rule (Schwert 1989; see also the comments in Bai and Ng 2002, p. 203), although other choices are also possible. Indeed, simulations show that the estimation procedure is not sensitive to the choice of k_max.
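The following sketch wires the pieces together (ours; it reuses the hypothetical helpers sample_eigenvalues, phi_statistic, and randomized_test_stat defined above, and requires SciPy; the shrinking level 0.01/min{N, T} anticipates Section 4):

```python
import numpy as np
from scipy.stats import chi2

def estimate_k(X, k_max=8, R=400, rng=None):
    """Sequential procedure of Section 3.3 (a sketch, not the article's code).

    Tests H0: lambda^(p) -> infinity for p = 1, 2, ..., stopping at the first
    rejection; k_hat is the last index at which the null was not rejected.
    """
    rng = np.random.default_rng() if rng is None else rng
    N, T = X.shape
    lam_hat = sample_eigenvalues(X)
    alpha = 0.01 / min(N, T)                   # vanishing level (Theorem 3.3)
    c_alpha = chi2.ppf(1.0 - alpha, df=1)      # reject H0 for large Theta^(p)
    for p in range(1, k_max + 2):
        phi_p = phi_statistic(lam_hat, p, N, T)
        theta_p = randomized_test_stat(phi_p, R=R, rng=rng)  # fresh draws per p
        if theta_p > c_alpha:
            return p - 1                       # first rejection at p => k_hat = p - 1
    return k_max                               # no rejection up to k_max
```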

Simulations
We evaluate the performance of the sequential procedure to determine k, using synthetic data. Data are generated as

X_{i,t} = ∑_{j=1}^{k} φ_{i,j} F_{j,t} + √θ u_{i,t}, (17)

where the loadings φ_{i,j} are iid N(1, 1) for 1 ≤ i ≤ N and 1 ≤ j ≤ k, and the factors F_{j,t} are iid N(0, 1). The design in (17) is very similar to Bai and Ng (2002) and Ahn and Horenstein (2013). The idiosyncratic error u_{i,t} is generated as

u_{i,t} = ρ u_{i,t−1} + (1 − ρ²)^{1/2} v_{i,t}, (18)

v_{i,t} = (1 + 2Cb²)^{−1/2} (e_{i,t} + b ∑_{h≠i, |h−i|≤C} e_{h,t}), (19)

with e_{i,t} iid N(0, 1). In (19), the parameter b introduces cross-sectional dependence among the u_{i,t}s. By (18), for most of the units it holds that Var(u_{i,t}) = 1; thus, in (17), θ^{−1} represents the signal-to-noise ratio of the common factors.
Data are generated according to three different schemes, which correspond to different levels of serial and cross-sectional correlation: (a) iid data, corresponding to ρ = b = C = 0; (b) serially dependent, but cross-sectionally uncorrelated data: ρ = 0.5, b = C = 0; (c) serially and cross-sectionally correlated data: ρ = 0.5, b = 0.5, and C = max{10, N/20}. Case (c) is arguably the most interesting (and problematic) one: the presence of strong cross-sectional dependence in the idiosyncratic term u_{i,t} is observationally equivalent to having a weak common factor whose associated eigenvalue diverges at a rate N^{1−ν} for ν close to 1.
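To make the design concrete, here is a minimal data-generating sketch (ours; the unit-variance normalization in (19) and the factor distribution follow the reconstruction above and are labeled assumptions in the code):

```python
import numpy as np

def generate_panel(N, T, k, theta=1.0, rho=0.0, b=0.0, C=0, rng=None):
    """Simulate X from (17)-(19) as reconstructed above (the unit-variance
    normalization in (19) is our assumption).

    Schemes: (a) rho = b = C = 0; (b) rho = 0.5, b = C = 0;
             (c) rho = 0.5, b = 0.5, C = max(10, N // 20).
    """
    rng = np.random.default_rng() if rng is None else rng
    burn = 50                                      # burn-in for the AR(1) in (18)
    Phi = rng.normal(1.0, 1.0, size=(N, k))        # loadings phi_{i,j} ~ iid N(1,1)
    F = rng.standard_normal((k, T))                # factors F_{j,t} ~ iid N(0,1)
    e = rng.standard_normal((N, T + burn))
    v = e.copy()
    if b > 0 and C > 0:                            # (19): cross-sectional dependence
        for i in range(N):
            nbrs = [h for h in range(max(0, i - C), min(N, i + C + 1)) if h != i]
            v[i] = (e[i] + b * e[nbrs].sum(axis=0)) / np.sqrt(1.0 + b ** 2 * len(nbrs))
    u = np.zeros((N, T + burn))
    for t in range(1, T + burn):                   # (18): AR(1) with ~unit variance
        u[:, t] = rho * u[:, t - 1] + np.sqrt(1.0 - rho ** 2) * v[:, t]
    return Phi @ F + np.sqrt(theta) * u[:, burn:]  # (17)
```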
The test statistics Θ^(p) are specified as follows. Based on (9), we set δ = 0 whenever β < 1/2, and δ marginally above 1 − 1/(2β) otherwise. The estimated eigenvalues are rescaled by λ̃_N as suggested in (11) when N ≤ T; this choice also works for N > T, but in this case better results are found using λ̃_{N,(p)} ≡ N^{−1} ∑_{j=p}^{N} λ̂^(j) in place of λ̃_N. We use u = ±√2, chosen with equal weight. Finally, based on Theorem 3.3, the level of each test should be chosen so as to go to zero as min{N, T} → ∞; we have employed 0.01/min{N, T}, which is in the spirit of Bonferroni-type approaches. As a general comment, this (conservative) choice may result in overstating rather than understating k, which could be more desirable.
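In code, the alternative rescaling for N > T amounts to a one-line change relative to (10) (our sketch):

```python
def trailing_rescale(lam_hat, p):
    """N > T case: lam_tilde_{N,(p)} = N^{-1} sum_{j >= p} lam_hat^(j)."""
    N = lam_hat.shape[0]
    return lam_hat[p - 1:].sum() / N
```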
The procedure suggested here is compared against the methodologies suggested in Bai and Ng (2002; referred to as IC1, IC2, PC1, and PC2 below), considering also the refinements developed by Alessi, Barigozzi, and Capasso (2010); Onatski (2010; referred to as ON); and Ahn and Horenstein (2013; referred to as ER and GR). In implementing these criteria, we define V(k) ≡ (NT)^{−1} ∑_{i=1}^{N} ∑_{t=1}^{T} (X_{i,t} − φ̂_i′F̂_t)², with φ̂_i and F̂_t the estimators of φ_i and F_t studied in Bai (2003) under exactly k factors. We also define σ̂² ≡ V(k_max), the threshold ū ≡ 2.7λ̂^(k_max+1) − 1.7λ̂^(2k_max+1), and v(k) ≡ ∑_{j=k+1}^{min{N,T}} λ̂^(j). In their contribution, Alessi, Barigozzi, and Capasso (2010) recommended employing different values of the tuning constant C_0, and evaluating the estimated number of factors over a whole range of values of C_0, thereby selecting the optimal one, identified as the value which yields a stable estimate of k. We implemented this procedure by searching for the optimal value of C_0 over the grid [0, 13], using intervals of width 0.005; results are reported for PC1, which was the best performing criterion across all exercises. We do not report the criteria proposed by Bai and Ng (2002), since the results are usually worse than those obtained using the refinement proposed by Alessi, Barigozzi, and Capasso (2010); they are anyway reported in the tables in the Supplement (Trapani 2017).
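For reference, the eigenvalue-ratio criteria of Ahn and Horenstein (2013) can be sketched as follows (a sketch using their standard definitions; the mock eigenvalue λ̂^(0), used below for the k = 0 case, is omitted):

```python
import numpy as np

def er_gr(lam_hat, k_max):
    """ER and GR criteria of Ahn and Horenstein (2013), standard definitions.

    ER(k) = lam_k / lam_{k+1}; GR(k) = ln(V_{k-1}/V_k) / ln(V_k/V_{k+1}),
    with V_k = sum_{j > k} lam_j; both are maximized over k = 1, ..., k_max.
    """
    lam = np.asarray(lam_hat, dtype=float)
    V = np.array([lam[k:].sum() for k in range(k_max + 2)])  # V[k] = sum_{j>k} lam_j
    er = lam[:k_max] / lam[1:k_max + 1]
    gr = np.log(V[:k_max] / V[1:k_max + 1]) / np.log(V[1:k_max + 1] / V[2:k_max + 2])
    return int(np.argmax(er)) + 1, int(np.argmax(gr)) + 1
```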
We consider two different experiments.

Experiment I: (Testing For) No Factor Structure
We start by considering the case k = 0, that is, the case of no factor structure. In this context, scheme (c) is particularly interesting, since it allows for the existence of cross-sectional correlation among the data, but not due to common factors. This part of the analysis is related to the papers by Trapani (2015a, 2015b), who propose tests for the null of no factor structure (see also the discussion in Bai 2009). All the criteria described above are able, in principle, to determine whether k = 0; specifically, the ones designed by Ahn and Horenstein (2013) require an initialization, which we base on λ̂^(0) = max{N^{−1/2}, T^{−1/2}}. As far as our test is concerned, we recommend that the first step in the analysis should be to carry out a test, at level α (we choose α = 0.05), for H_0 : λ^(1) → ∞ versus H_A : λ^(1) < ∞; upon rejecting the null, the conclusion can be reached that k = 0; conversely, if the null is not rejected, then the procedure described in Section 3.3 should be employed, thus casting the estimation of k into a two-stage procedure. We implement the test with the (very simple) specification R = 200, which works well for all cases considered.
The results in Figure 1 show that the test proposed here has excellent power for all cases considered, being able to detect whether k = 0 or not. Note that in the worst-case scenario, corresponding to serial and cross-sectional dependence, with (N, T ) = (25, 100), the power is anyway in the region of 95%. Indeed, when there is cross-sectional dependence, the test proposed in this article clearly dominates all other approaches.
All other criteria also work very well when there is neither serial nor cross-sectional dependence; a possible exception, shown in the Supplement (Trapani 2017), is given by the criteria developed by Bai and Ng (2002), but the correction by Alessi, Barigozzi, and Capasso (2010) dramatically improves their performance, especially when N ≥ 50. In the presence of serial dependence, the performance of the other criteria is also very good, at least in moderate to large samples: in particular, the tests developed by Ahn and Horenstein (2013) work extremely well when max{N, T} ≥ 100, whereas the test developed by Onatski (2010) yields accurate results as long as T ≥ 100. All criteria, however, systematically overstate the number of factors in the presence of cross-sectional dependence, thereby suggesting that there is a factor structure in the data when in fact there is none.
As a final note, we tried the same experiment setting θ = 2; results for k̂ are the same as for θ = 1, and thus we do not report them.

Experiment II: Determining the Number of (Strong) Common Factors
We evaluate the procedure to determine k considering the cases k = 1, 3, and 5. In the first set of results (Figure 2), we set θ = 1. As far as implementation is concerned, the test works very well with R = 400 in all cases considered; indeed, unreported experiments show that the less costly choice R = 200 also works well, at least when N ≥ 50.
We do not report results corresponding to scheme (a) (no serial or cross-sectional dependence), since all estimators perform very well. Figure 2 also shows that results are good across the board when there is only serial dependence; in this case, there is a slight worsening of IC1, IC2, PC1, and PC2 (see the Supplement), which is however limited to when either dimension (N or T) is very small; the impact of small T seems more acute in such a case, although the refinement proposed by Alessi, Barigozzi, and Capasso (2010) makes these criteria as good as the other ones. Conversely, k̂ and both ER and GR perform very well in this case too. Remarkably, k̂ is the best criterion when N is small (see the cases (N, T) = (25, 100) and (50, 50)), and it also works well in the opposite case (N, T) = (200, 25), although it tends to understate the true number of factors when k = 5. However, when (N, T) = (200, 50), k̂ becomes very good even when k = 5, which seems to suggest that k̂ performs well for a wide spectrum of values of N, especially when T ≥ 50.

Figure 2. In the first three columns, we set θ = 1. The notation ABC refers to the criteria developed in Alessi, Barigozzi, and Capasso (2010); ER/GR means that we report the best between the two criteria. Only the cases of serially, and serially and cross-sectionally, dependent data are reported.
When there is cross-sectional dependence, conclusions become more mixed; the only exception is given by the criteria developed by Bai and Ng (2002; see the tables in the Supplement), which are systematically wrong and tend to overstate k in all possible cases, irrespective of the values of (N, T) and of the true k. However, when tuning the penalty function as suggested in Alessi, Barigozzi, and Capasso (2010), even these estimators become very reliable, especially when T ≥ 50. As far as the other estimators are concerned, there is no clear winner among the methodologies employed: k̂ fares better than the other criteria when N is quite small (see the cases (N, T) = (25, 100) and (50, 50)), especially under cross-sectional dependence and k ≤ 3. When k = 5 and T is small, k̂ has a tendency to understate k; indeed, all estimators fare worse as k increases. One interesting exception is the criterion developed by Onatski (2010), whose performance actually improves as k increases (considering especially the cases where N > T). Finally, the criteria developed by Ahn and Horenstein (2013) also work well across all cases considered, with the same exceptions detailed above.
We now turn to the case of a weaker factor structure, which we simulate by using the same design as above but with θ = 2. Figure 3 shows that results are far less clear-cut in this case, especially when there is cross-sectional dependence: no technique dominantly outperforms the others. As a general comment, k̂ is better when k is small, but its performance deteriorates as k increases; similar results are found when considering the techniques developed by Ahn and Horenstein (2013); similarly to the results in Figure 2, however, the estimator developed by Onatski (2010) works better for large values of k. The estimator developed by Alessi, Barigozzi, and Capasso (2010) also performs very well, as long as N is sufficiently large (at least in the case of cross-sectional dependence). As far as k̂ is concerned, note that it breaks down, for large values of k, when T is small; the case (N, T) = (200, 25) is exemplary in this respect. However, when T increases, the estimator improves rapidly; see the case (N, T) = (200, 50).

Conclusions
We develop a procedure to estimate the number of common factors in a stationary panel factor model. As is typical in this literature, we exploit the fact that the first k eigenvalues of the data covariance matrix diverge as N → ∞, whilst the others stay bounded. We therefore derive a test statistic, based on sample eigenvalues, which diverges or converges according as the corresponding population eigenvalue is unbounded or finite. Given that, under H_0, the test statistic diverges, we suggest a randomized test to recover standard inference. The individual tests are then used as part of a sequential procedure to determine k. Results are derived under minimal assumptions; we show that the estimator of k is robust to a wide variety of data features, including serial and cross-sectional dependence, the presence of weak factors, and several combinations of N and T. A noteworthy feature of the proposed test is that it is very good at determining whether a factor structure actually exists in the data or not. The setup developed in this article hinges on a static factor model; however, it would be desirable to also consider dynamic factor models. A possible approach would be based on casting the dynamic factor model into a static one, following the approach proposed by Bai and Ng (2007).
Finally, a word of warning on the meaning of the hypotheses tested for. The question of whether an eigenvalue is infinite or not is, strictly speaking, ill-posed in finite samples. However, the test proposed here is a test on the divergence rate of estimated eigenvalues: despite the asymptotic characterization of the test, the purpose of the analysis in this article is to assess the magnitude of an eigenvalue, rather than its actual behavior at infinity, thus allowing a researcher to decide whether the pth eigenvalue of the covariance matrix of the data is "large enough", so that a model with at least p common factors is a good characterization of the X_{i,t}s.

Supplementary Materials
The online supplement (Trapani 2017) contains further material for the article. In particular, it reports: the extension of the results derived in the article to weak factors (Section 1); technical lemmas and proofs (Section 2); and the full set of tables containing the Monte Carlo simulations (Section 3).