Protein Collapse is Encoded in the Folded State Architecture

Natural protein sequences that self-assemble to form globular structures are compact with high packing densities in the folded states. It is known that proteins unfold upon addition of denaturants, adopting random coil structures. The dependence of the radii of gyration on protein size in the folded and unfolded states obeys the same scaling laws as synthetic polymers. Thus, one might surmise that the mechanism of collapse in proteins and polymers ought to be similar. However, because the number of amino acids in single domain proteins is not significantly greater than about two hundred, it has not been resolved if the unfolded states of proteins are compact under conditions that favor the folded states - a problem at the heart of how proteins fold. By adopting a theory used to derive polymer-scaling laws, we find that the propensity for the unfolded state of a protein to be compact is universal and is encoded in the contact map of the folded state. Remarkably, analysis of over 2000 proteins shows that proteins rich in $\beta$-sheets have a greater tendency to be compact than $\alpha$-helical proteins. The theory provides insights into the reasons for the small size of single domain proteins and the physical basis for the origin of multi-domain proteins. Application to non-coding RNA molecules shows that they have evolved to collapse, sharing similarities with $\beta$-sheet proteins. An implication of our theory is that the evolution of natural foldable sequences is guided by the requirement that for efficient folding they should populate minimum energy compact states under folding conditions. This concept also supports the compaction selection hypothesis used to rationalize the unusually condensed states of viral RNA molecules.


INTRODUCTION
Folded states of globular proteins, which are evolved (slightly) branched heteropolymers made from twenty amino acids, are roughly spherical and are nearly maximally compact with high packing densities [1][2][3]. Despite achieving high packing densities in the folded states, globular proteins tolerate large volume substitutions while retaining the native fold [4]. This is explained in a couple of interesting theoretical studies [5,6], which demonstrated that there is sufficient free volume in the folded state to accommodate mutations. Collectively these and related studies show that folded proteins are compact. When they unfold, which can be achieved upon addition of high concentrations of denaturants (or applying a mechanical force), they swell, adopting expanded conformations. The radius of gyration ($R_g$) of a folded globular protein is well described by the Flory law with $R_g \approx 3.3N^{1/3}$ [7], whereas in the swollen state $R_g \approx a_D N^{\nu}$, where $a_D$ is an effective monomer size and the Flory exponent $\nu \approx 0.6$ [8]. Thus, viewed from this perspective we could surmise that proteins must undergo a coil-to-globule transition [9,10], a process that is reminiscent of the well characterized equilibrium collapse transition in homopolymers [11,12]. The latter is driven by the balance between conformational entropy and intra-polymer interaction energy resulting in the collapsed globular state. The swollen state is realized in good solvents (interaction between monomer and solvents is favorable) whereas in the collapsed state monomer-monomer interactions are preferred. The coil-to-globule transition in large homopolymers is akin to a phase transition. The temperature at which the interactions between the monomers roughly balance monomer-solvent energetics is the $\theta$ temperature. By analogy, we may identify high (low) denaturant concentrations with good (poor) solvent for proteins.
Despite the expected similarities between the equilibrium collapse transition in homopolymers and the compaction of proteins, it is still debated whether the unfolded states of proteins under folding conditions are more compact compared to the states created at high denaturant concentrations. If polypeptide chain compaction is universal, is collapse in proteins essentially the same phenomenon as in homopolymer collapse or is it driven by a different mechanism [13][14][15][16][17]? Surprisingly, this fundamental question in the protein folding field has not been answered satisfactorily [10,18]. In order to explain the plausible difficulties in quantifying the extent of compaction, let us consider a protein, which undergoes an apparent two-state transition from an unfolded (swollen) to a folded (compact) state as the denaturant concentration (C) is decreased. At the concentration, C m , the populations of the folded and unfolded states are equal. A vexing question, which has been difficult to unambiguously answer in experiments, is: what is the size, R g , of the unfolded state under folding conditions (C < C m )? Small Angle X-ray Scattering (SAXS) experiments on some proteins show practically no change in the unfolded R g as C is changed [19]. On the other hand, from experiments based on single molecule Fluorescence Resonance Energy Transfer (smFRET) it has been concluded that the size of the unfolded state is more compact below C m compared to its value at high C [20,21]. The so-called smFRET-SAXS controversy is unresolved. Resolving this apparent controversy is not only important in our understanding of the physics of protein folding but also has implications for the physical basis of the evolution of natural sequences.
The difficulties in describing the collapse of unfolded states as C is lowered could be attributed to the following reasons. (1) Following de Gennes [22], homopolymer collapse can be pictured as formation of a large number of the blobs driven by local interactions between monomers on the scale of the blob size. Coarsening of blobs results in the equilibrium globule formation with the number of maximally compact conformations whose number scales exponentially with the number of monomers. Other scenarios resulting in fractal globules, enroute to the formation of equilibrium maximally collapsed structures, have also been proposed [23]. The globule formation is driven by non-specific interactions between the monomers or the blobs. Regardless of how the equilibrium globule is reached it is clear that it is largely stabilized by local interactions, because contacts between monomers that are distant along the sequence are entropically unfavorable. In contrast, even in high denaturant concentrations proteins could have residual structure, which likely becomes prominent at C < C m . At low C there are specific favorable interactions between residues separated by a few or several residues along the sequence. As their strength grows, with respect to the entropic forces, the specific interactions may favor compaction in a manner different from the way non-specific local interactions induce homopolymer collapse. In other words, the dominant native-like contacts also drive compaction of unfolded states of proteins. (2) A consequence of the impact of the native-like contacts (local and non-local) on collapse of unfolded states is that specific energetic considerations dictate protein compaction resulting in the formation of minimum energy compact structures (MECS) [24]. The number of MECS, which are not fully native, is small, scaling as ln N with N being the number of amino acid residues. 
Therefore, below C m their contributions to R g have to be carefully dissected, which is more easily done in single molecule experiments than in ensemble measurements such as SAXS. (3) Single domain proteins are finite-sized with N rarely exceeding ∼ 200. Most of those studied experimentally have N < 100. Thus, the extent of change in R g of the unfolded states is predicted to be small, requiring high precision experiments to quantify the changes in R g as C is changed. For example, in a recent study [25], we showed that in PDZ2 domain the change in R g of the unfolded states as the denaturant concentration changes from 6 M guanidine chloride to 0 M is only about 8%. Recent experiments have also established that changes in R g in helical proteins are small [20].
In homopolymers there are only two possible states, coil and globule, with a transition between the two occurring at $T_\theta$. On the other hand, even in proteins that fold in a two-state manner one can conceive of at least three states (we ignore intermediates here): (i) the unfolded state $U_D$ at high $C$; (ii) the compact but unfolded state $U_C$, which could possibly exist below $C_m$; (iii) the native state. Do the sizes of $U_D$ and $U_C$ differ? This question requires a clear answer as it impacts our understanding of how proteins fold, because the characteristics of the unfolded states of proteins play a key role in determining protein foldability [26][27][28].
Given the flexibility of proteins (persistence length on the order of 0.5 − 0.6 nm), we expect that the size of the extended polypeptide chain must gradually decrease as the solvent quality is altered. Experiments on a number of proteins show that this is the case [29][30][31].
However, in some SAXS experiments the theoretical expectation that $R_g^{U_C} < R_g^{U_D}$ was not borne out for one protein [10,19], precipitating a more general question: are chemically denatured proteins compact at low $C$? The absence of collapse is not compatible with inferences based on smFRET [21] and theory [26]. Here, we create a theory to not only resolve the smFRET-SAXS controversy but also provide a quantitative description of how the propensity to be compact is encoded in the native topology. The theory, based on polymer physics concepts, includes specific attractive interactions (mimicking interactions accounting for native contacts in the Protein Data Bank (PDB)) and a two-body excluded volume repulsion. By construction the model does not have a native state. In order to validate the theoretical predictions, we performed simulations using a completely different model often used in protein folding simulations. In both models, there are only two states (analogues of $U_D$ and $U_C$). The formation of $U_C$ is driven by the contact map of the folded state. Thus, chain compaction is driven in much the same way as in homopolymers, altered only by specific interactions that differentiate proteins from homopolymers.
Theory and simulations predict how the extent of compaction (collapsibility) is determined by the strength and the number of the native contacts and their locations along the chain. We use a large representative selection of proteins from the PDB to establish that collapsibility is an inherent characteristic of evolved protein sequences. A major outcome of this work is that β-sheet proteins are far more collapsible than structures dominated by α-helices. Our theory suggests that there is an evolutionary pressure on proteins for being compact as a pre-requisite for kinetic foldability, as we predicted over twenty years ago [26].
We come to the inevitable conclusion that the unfolded state of proteins must be compact under native conditions, and the mechanism of polypeptide chain compaction has similarities as well as differences to collapse in homopolymers. As a by-product of this work, we also establish that certain non-coding RNA molecules must undergo compaction prior to folding as their folded structures are stabilized predominantly by long-range tertiary contacts.

THEORY
We start with an Edwards Hamiltonian for a polymer chain [32], Eq. (1), where $\mathbf{r}(s)$ is the position of monomer $s$, $a_0$ is the monomer size, and $N$ is the number of monomers. The first term in Eq. (1) accounts for chain connectivity, and the second term represents volume interactions and favorable interactions between select monomers, given by Eq. (2). The first term in Eq. (2) accounts for the homopolymer (non-specific) two-body interactions. It is well established in the theory of homopolymers that in good solvents with $v > 0$ the polymer swells with $R_g \sim aN^{\nu}$ ($\nu \approx 0.6$). In poor solvents ($v < 0$) the polymer undergoes a coil-globule transition with $R_g \sim aN^{\nu}$ ($\nu \approx 1/3$). These are the celebrated Flory laws. Here, we consider only the excluded volume repulsion case ($v > 0$).
The second term in Eq. (2) requires an explanation. The generic scenario for homopolymer collapse is based on an observation by de Gennes, who pictured the collapse process as being driven by the initial formation of blobs that arrange to form a sausage-like structure. At later stages the globule forms to maximize favorable intra-molecular contacts while simultaneously minimizing surface tension. Compaction in proteins, although it shares many features with homopolymer collapse, could be different. A key difference is that the folded states of almost all proteins are stabilized by a mixture of local contacts (interactions between residues separated by fewer than about 8 but more than 3 residues) as well as non-local (> 8 residues) contacts. Note that the demarcation at 8 residues between local and non-local contacts is arbitrary, and is not germane to the present argument. These specific interactions also dominate the enthalpy of formation of the compact, non-native state $U_C$, playing an important role in its stability. Previous studies using lattice models of proteins in two [33] and three [34] dimensions showed that formation of compact but unfolded states is predominantly driven by native interactions, with non-native interactions playing a subdominant role. A more recent study [35], analyzing atomically detailed folding trajectories, arrived at the same conclusion. Therefore, our assumption is that the topology of the folded state, through its native contacts, also governs the compaction of the unfolded state.
It is worth mentioning that several studies investigated the consequences of optimal packing of polymer-like representations of proteins [36][37][38][39][40][41][42]. These studies primarily explain the emergence of secondary structural elements by considering only hard core interactions, attractive interactions due to crowding effects [40,43], or formation of compact states induced by anisotropic attractive patchy interactions [42].
However, the absence of tertiary interactions in these models, which give rise to compact states of varying topologies, prevents them from addressing the coil-to-globule transition. This requires creating a microscopic model along the lines described here.
We note in passing (with discussion to follow) that a number of studies have considered the effect of crosslinks on the shape of polymer chains [44][45][46][47][48][49][50]. Polymers with crosslinks have served as models for polymer gels and rubber elasticity [51][52][53]. In these studies the contacts were either random, leading to the random loop model [45], or explicit averages over the probability of realizing such contacts were made [44,54], as may be appropriate in modeling gels. These studies inevitably predict a coil-to-globule phase transition as the number of crosslinks increases.
In contrast to models with random crosslinks, in our theory attraction exists only between specific residues, described by the second term in Eq. (2), where the sum is over the set of interactions (native contacts) involving pairs $\{s_i, s_j\}$. We use the contact map of the protein (extracted from the PDB structure) in order to assign the specific interactions (their total number being $N_{nc}$). A contact is assigned to any two residues $s_i$ and $s_j$ if the distance between their C$_\alpha$ atoms in the PDB entry is less than $R_c = 0.8$ nm and $|s_i - s_j| > 2$. We use Gaussian potentials in order to have short (but finite) range attractive interactions. For the excluded volume repulsion, this range is on the order of the size of the monomer, $a_0 = 0.38$ nm. For the specific attraction, the range is the average distance in the PDB entry between C$_\alpha$ atoms forming a contact (averaged across a selection of proteins from the PDB). We obtain $\sigma = 0.63$ nm.
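The contact assignment rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `contact_map`, the input format (a list of C$_\alpha$ coordinates in nm), and the two toy chains are assumptions made here for the example; only the cutoff of 0.8 nm and the condition $|s_i - s_j| > 2$ come from the text.

```python
import math

def contact_map(ca_coords, r_cut=0.8, min_sep=3):
    """Return native contacts (i, j), i < j, from C-alpha coordinates given in nm.

    A pair is a contact when the C-alpha distance is below r_cut and the
    sequence separation is at least min_sep, i.e. |i - j| > 2.
    """
    contacts = []
    n = len(ca_coords)
    for i in range(n):
        for j in range(i + min_sep, n):
            if math.dist(ca_coords[i], ca_coords[j]) < r_cut:
                contacts.append((i, j))
    return contacts

# Toy input 1 (hypothetical): a straight chain with 0.38 nm spacing.
# Residues three apart are already 1.14 nm from each other, so no contacts.
straight = [(0.38 * i, 0.0, 0.0) for i in range(10)]

# Toy input 2 (hypothetical): two antiparallel strands 0.5 nm apart,
# a crude stand-in for a beta-hairpin.
hairpin = [(0.38 * i, 0.0, 0.0) for i in range(5)] + \
          [(0.38 * (4 - i), 0.5, 0.0) for i in range(5)]

if __name__ == "__main__":
    print(contact_map(straight))
    print(contact_map(hairpin))
```

In the toy hairpin all contacts connect residues on opposite strands, which is the kind of longer-range pairing that the theory weights most heavily.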
By changing the value of $\kappa$, and hence the strength of attraction, there is a transition between the extended and compact states. Decreasing $\kappa$ is analogous to chemically denaturing proteins, although the connection is not precise. At high denaturant concentrations ($\kappa \approx 0$, good solvent) the excluded volume repulsion (first term in Eq. (2)) dominates the attraction, while at low $C$ (high $\kappa$, poor solvent) the attractive interactions are important. The point where attraction balances repulsion is the $\theta$-point, and the corresponding value of $\kappa$ is $\kappa_\theta$. Although the term is reserved for the coil-to-globule transition in the limit $N \gg 1$ in homopolymers, we will use the same notation ($\theta$-point) here. In our model, at the $\theta$-point, the chain behaves like an ideal chain. To describe the globular state, a three-body repulsion needs to be added to the Hamiltonian (Eq. (2)), but we focus on the region between the extended coil and the $\theta$-point because our interest is to assess only the collapsibility of proteins. If $\kappa_\theta$ is very large then significant chain compaction would only occur at very low ($C \ll C_m$) denaturant concentrations, implying low propensity to collapse. Conversely, small $\kappa_\theta$ implies ease of collapsibility.
Note that the ground state ($\kappa \gg 1$) of the Hamiltonian in Eq. (2) is a collapsed chain whose $R_g$ is on the order of the monomer size. In other words, a stable native state does not exist for the model described in Eq. (2). Thus, we define protein collapse as the propensity of the polypeptide chain to reach the $\theta$-point as measured by the $\kappa_\theta$ value, and use the changes in the radius of gyration $R_g$ as a measure of the extent of compaction.
Assessing collapsibility: For our model, which encodes protein topology without favoring the folded state, we calculate $R_g^2$ using the Edwards-Singh (ES) method [55]. Although from a technical viewpoint the ES method has pros as well as cons, numerous applications show that in practice it yields physically sensible results for a number of systems. First, ES showed that the method does give the correct dependence of $R_g^2$ on $N$ for homopolymers. Second, even when attractive interactions are included, the ES method leads to predictions that have subsequently been verified by more sophisticated theories. An example of particular relevance here is the problem of the size of a polymer in the presence of obstacles (crowding particles). The results of the ES method [56] and those obtained using renormalization group calculations [57] are qualitatively similar. Here, we adopt the ES method, allowing us to draw more far-reaching conclusions about protein collapsibility than is possible based solely on simulations. We use simulations on a limited set of proteins to further justify the conclusions reached using the analytic theory.
The ES method is a variational type calculation that represents the exact Hamiltonian by a Gaussian chain, whose effective monomer size is determined as follows. Consider a virtual chain without excluded volume interactions, with the radius of gyration $R_g^2 = Na^2/6$ [55], described by a Gaussian-chain Hamiltonian $H_v$ in which the monomer size is $a$. We split the deviation $W$ between the virtual chain Hamiltonian and the real Hamiltonian as $W = W_1 + W_2$. The radius of gyration is $R_g^2 = \frac{1}{N}\int_0^N \langle \mathbf{r}^2(s)\rangle\, ds$, with the average expressed in terms of $\langle \cdots \rangle_v$, which denotes the average over $H_v$.
Assuming that the deviation $W$ is small, we calculate the average to first order in $W$, which gives the radius of gyration to the same order (Eq. (A5)). If we choose the effective monomer size $a$ in $H_v$ such that the first order correction (second and third terms on the right hand side of Eq. (A5)) vanishes, then the size of the chain is $R_g^2 = Na^2/6$. This is an estimate of the exact $R_g^2$, and is an approximation because we have neglected $W^2$ and higher powers of $W$. Thus, in the ES theory, the optimal value of $a$ is the one for which the first order correction in Eq. (A5) vanishes. Since $W = W_1 + W_2$, this condition separates into contributions from $W_1$ and $W_2$. Evaluating the $\langle \mathbf{r}^2(s) W_1 \rangle_v$ term, and with the help of Eq. (11) and Eq. (9), we obtain a self-consistent expression for $a$. Calculating the averages in Fourier space, where $\mathbf{r}(s) = 2\sum_{n \geq 1} \cos\left(\frac{\pi n s}{N}\right) \mathbf{r}_n$ and $R_g^2 = 2\sum_n \langle |\mathbf{r}_n|^2 \rangle$, we obtain Eq. (13). The best estimate of the effective monomer size $a$ can be obtained by numerically solving Eq. (13) provided the contact map is known. A bound for the actual size of the chain is $R_g^2 = Na_0^2/6$. Because we are interested only in the collapsibility of proteins, we use the definition of the $\theta$-point to assess the condition for protein compaction instead of solving the complicated Eq. (13) numerically. The volume interactions are on the right hand side of Eq. (13). At the $\theta$-point, the $v$-term should exactly balance the $\kappa$-term. Since at the $\theta$-point the chain is ideal with $a = a_0$, we can substitute this value for $a$ in the sums in the denominators of the $v$- and $\kappa$-terms. By equating the two, we obtain an expression for $\kappa_\theta$.
Thus, from Eq. (13), the specific interaction strength at which two-body repulsion ($v$-term) equals two-body attraction ($\kappa$-term) is given by Eq. (14). The numerator in Eq. (14) is a consequence of chain connectivity and the denominator encodes protein topology through the contact map, determining the extent to which the sizes in the $U_D$ and $U_C$ states change as $C$ becomes less than $C_m$. The numerical value of $\kappa_\theta$ is a measure of collapsibility.
A comment about the solution of Eq. (13) for $a$ is worth making. For $\kappa = 0$, corresponding to the good solvent condition, we expect that $a \gg a_0$. In this case, analysis of Eq. (13), in a manner described in Appendix A, shows that there is only one solution with $a \sim N^{1/10}$. Similarly, at $\kappa_\theta$ Eq. (13) also admits only one solution. Thus, from the structure of Eq. (13) we surmise there are no multiple solutions, at least in the extreme limits $v = 0$ and $\kappa = 0$.
The expression for $\kappa_\theta$ (Eq. (14)) is equally applicable to homopolymers in which contacts between all monomers are allowed, provided the self-avoidance condition is not violated. In that limit, our model correctly reproduces the known $N$ dependence of $T_\theta$ obtained long ago by Flory [58] using insightful mean field arguments.

RESULTS
Native topology determines collapsibility: The central result in Eq. (14) can be used to quantitatively predict the extent to which a given protein has a propensity to collapse.
We used a list of proteins with low mutual sequence identity selected from the Protein Data Bank (PDBselect [59]), and calculated $\kappa_\theta$ using Eq. (14) for these proteins. In all we considered 2306 proteins. For each contact $(i, j)$, the energetic contribution due to interaction between $i$ and $j$ is $k = (2\pi\sigma^2)^{-3/2}\kappa$ according to Eq. (2). Thus, $k_\theta = (2\pi\sigma^2)^{-3/2}\kappa_\theta$ is the average strength (in units of $k_B T$) of a contact at the $\theta$-point. If $\kappa_\theta$, calculated using Eq. (14), is too large then the extent of polypeptide chain collapse is expected to be small. It is worth reiterating that the theory cannot be used to determine the stability of the folded state, because in the Hamiltonian there are only two states, $U_D$ ($\kappa = 0$ in Eq. (2)) and $U_C$ ($\kappa > \kappa_\theta$).
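The conversion between $\kappa$ and the contact strength $k$ quoted above is a one-line rescaling, sketched here for concreteness. The function names are hypothetical, and the assumption that $\kappa$ is expressed in nm$^3$ (so that $k$ comes out in units of $k_B T$ when $\sigma$ is in nm) is made here for dimensional consistency rather than stated in the text.

```python
import math

SIGMA_NM = 0.63  # range of the specific attraction quoted in the text (nm)

def contact_strength(kappa, sigma=SIGMA_NM):
    """k = (2*pi*sigma^2)^(-3/2) * kappa; kappa assumed to be in nm^3."""
    return kappa / (2.0 * math.pi * sigma ** 2) ** 1.5

def kappa_from_strength(k, sigma=SIGMA_NM):
    """Inverse of contact_strength: kappa = (2*pi*sigma^2)^(3/2) * k."""
    return k * (2.0 * math.pi * sigma ** 2) ** 1.5

if __name__ == "__main__":
    # Example: the kappa value that corresponds to a contact strength of 3 k_B T.
    print(kappa_from_strength(3.0))
```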
The strength of contacts in real proteins (excluding possibly salt bridges) is typically on the order of a few $k_B T$ in the absence of denaturants. This is the upper bound for the contact strength any theory should predict, as adding denaturant only decreases the strength. If $k_\theta$ is unrealistically high (tens of $k_B T$) then the attractive interactions of the protein would be too weak to counteract the excluded volume repulsion even at zero denaturant concentration, resulting in negligible difference in $R_g$ between the $U_D$ and $U_C$ states. Fig.(1a) shows the proteins in the set in the $(N, k_\theta)$ plane. For the majority of small proteins (less than 150 residues) the value of $k_\theta$ is less than 3 $k_B T$, indicating that the unfolded states of all of these proteins should become compact at $C < C_m$. That collapse must occur, as predicted by our theory and established previously in lattice [26] and off-lattice models of proteins [60], does not necessarily imply that it can be easily detected in standard scattering experiments, because the changes could be small, requiring high precision experiments (see below).
Weight function of a contact: For a given $N$, the criterion for collapsibility in Eq. (14) depends on the architecture of the proteins, explicitly represented in the denominator through the contact map. Analysis of the weight function of a contact, defined below, provides a quantitative measure of how a specific contact influences protein compaction. Some contacts may facilitate collapse to a greater extent than others, depending on the location of the pair of residues in the polypeptide chain. In this case, the same number of native contacts $N_{nc}$ in a protein of the same length $N$ might yield a lower (easier collapse) or higher (harder collapse) value of $k_\theta$. In order to determine the relative importance of the contacts with respect to collapse, we consider the contribution of the contact between residues $i$ and $j$ in the denominator of Eq. (14), which defines the weight function $W(i-j)$ (Eq. (15)). A plot of $W(i-j)$ in Fig.(1b) for different values of the chain length $N$ shows that the weight depends on the distance between the residues along the chain. Contacts between neighboring residues have negligible weight, and there is a maximum near $i - j \approx 30$, almost independent of the protein length. The maximum is at a higher value for proteins with $N > 100$ residues. The figure further shows that longer range contacts make a greater contribution to chain compaction than short range contacts. The results in Fig.(1b) imply that proteins with a large fraction of non-local contacts are more easily collapsible than those dominated by short range contacts, which we elaborate further below.
Maximum and minimum collapsibility boundaries: We can design protein sequences to optimize for "collapsibility". To design a "maximally collapsible" protein, for fixed $N$ and number of native contacts $N_{nc}$, we assign each of the $N_{nc}$ contacts one by one to the pair $i, j$ with a maximal $W(i, j)$ among the available pairs, subject to the criterion that $|i-j| > 2$. Such an assignment necessarily implies that the artificially designed contact map will not correspond to any known protein. Similarly, we can design an artificial contact map by selecting $i, j$ pairs with minimal $W(i, j)$ until all the $N_{nc}$ contacts are assigned. Such a map, which will be dominated by local contacts, corresponds to a minimally collapsible structure.
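The greedy assignment described above can be sketched as follows. This is an illustrative sketch under stated assumptions: `design_contacts` is a hypothetical name, and `toy_weight` is a placeholder that merely peaks at a sequence separation of 30; it is not the weight function of Eq. (15), which would have to be supplied in its place.

```python
import math

def design_contacts(n, n_contacts, weight, maximize=True, min_sep=3):
    """Pick n_contacts pairs (i, j) with |i - j| > 2 by extremal weight.

    Choosing the best available pair one at a time is equivalent to sorting
    all admissible pairs by weight and taking the first n_contacts.
    """
    pairs = [(i, j) for i in range(n) for j in range(i + min_sep, n)]
    if n_contacts > len(pairs):
        raise ValueError("more contacts requested than admissible pairs")
    pairs.sort(key=lambda p: weight(p[0], p[1]), reverse=maximize)
    return pairs[:n_contacts]

def toy_weight(i, j):
    """Placeholder weight (assumption): depends only on |i - j|, peaks at 30."""
    sep = abs(i - j)
    return sep * math.exp(-sep / 30.0)

if __name__ == "__main__":
    print(design_contacts(100, 5, toy_weight, maximize=True))   # "maximally collapsible"
    print(design_contacts(100, 5, toy_weight, maximize=False))  # "minimally collapsible"
```

With a weight that depends only on sequence separation, many pairs tie; the sort keeps the first ones in index order, which is an arbitrary tie-breaking choice made for this sketch.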
The white lines in Fig.(1a) show $k_\theta$ of chains of length $N$ with $N_{nc}(N)$ contacts distributed in ways to maximize or minimize collapsibility. We estimated $N_{nc}(N) \approx 0.6N^{\gamma}$, with $\gamma \approx 1.3$, from the fit of the proteins selected from the PDBselect set (a fuller discussion is presented in Appendix A). Since the lines are calculated for $N_{nc}$ from the fit over the entire set, and not from $N_{nc}$ for every protein, there are proteins below the minimal and above the maximal curve in Fig.(1a). For a given protein, with $N$ and $N_{nc}$ defined by its PDB structure, $k_\theta$ for all possible arrangements of native contacts is largely in between the maximally and minimally collapsible lines in Fig.(1a). The majority of proteins in our set are closer to the maximal collapsible curves, suggesting that the unfolded proteins have evolved to be compact under native folding conditions. This theoretical prediction is in accord with our earlier studies which suggested that foldability is determined by both collapse and folding transitions [26], and more recently supported by experiments [20].
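A power-law estimate of the kind $N_{nc}(N) \approx 0.6N^{\gamma}$ can be obtained by a straight-line fit in log-log space. The sketch below is one common way to do this and is not necessarily the fitting procedure used for the PDBselect set; the function name and the synthetic data are assumptions made for illustration only.

```python
import math

def fit_power_law(ns, ncs):
    """Fit ncs ~ prefactor * ns**gamma by least squares on log-log data."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(c) for c in ncs]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    gamma = sxy / sxx
    prefactor = math.exp(my - gamma * mx)
    return prefactor, gamma

# Synthetic data (not from the paper) generated from the quoted fit itself,
# used only to show that the routine recovers a known prefactor and exponent.
chain_lengths = [50, 100, 150, 200, 300]
contact_counts = [0.6 * n ** 1.3 for n in chain_lengths]

if __name__ == "__main__":
    print(fit_power_law(chain_lengths, contact_counts))
```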
β-sheet rather than α-helical proteins undergo larger compaction: The weight function W (Eq. (15) and Fig.(1b)) suggests that contacts in α-helices (|i − j| = 4) only make a small contribution to collapse. Contacts corresponding to the maximum of W at i − j ≈ 30 are typically found in loops and long antiparallel β-sheets. Fig.(2) shows a set of proteins with high α-helix (> 90%) and a set with high content of β-sheets (> 70%) [61].
The values of k θ for the two sets are very distinct, so they barely overlap. We find that many of the α-helical proteins lie on or above the curve of minimal collapsibility while the rest are closer to the maximal collapsibility. The smaller β-rich proteins lie on the curve of maximal collapsibility slightly diverging from it as the chain length grows. These results show that the extent of collapse of proteins that are mostly α-helical is much less than those with predominantly β-sheet structures.
A note of caution is in order. The minimal collapsibility of most α-helical proteins in the set may be a consequence of some of them being transmembrane proteins, which do not fold in the same manner as globular proteins. Instead, the transmembrane α-helices are inserted into the membrane by the translocon, one by one, as they are synthesized. Such proteins would not have the evolutionary pressure to be compact.
Comparison between theory and simulations: The major conclusions, summarized in Figs.(1-2), are based on an approximate theory. In order to validate the theoretical predictions, we performed simulations for 21 proteins using realistic models (see Appendix B for details) that capture the known characteristics of the unfolded states of proteins and the coil to globule transition.
In accord with our theoretical predictions, $R_g$ decreases as $k$ increases. For $k = 0$, corresponding to the maximally expanded state (high denaturant concentration), we expect that $R_g \approx a_D N^{0.588}$. A plot of $R_g$ versus $N^{0.588}$ is linear with a value of $a_D = 0.25$ nm (Fig.3a). Remarkably, this finding is in accord with the experimental fit showing $R_g \approx a_D N^{0.588}$ with $a_D = 0.2$ nm [8]. The modest increase in $a_D$, compared to the experimental fit, predicted here can be explained by noting that in real proteins there is residual structure even at high denaturant concentrations whereas in our model this is less probable. The scaling shown in Fig. (3a) shows that the model used in the simulations provides a realistic picture of the unfolded states. We emphasize that the parameters in the simulations were not adjusted to obtain the correct $R_g$ scaling or $a_D$.
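The prefactor $a_D$ in $R_g \approx a_D N^{0.588}$ is the slope of a line through the origin in a plot of $R_g$ versus $N^{0.588}$. A minimal sketch of such a fit is given below; the function name and the synthetic input are assumptions made here, and the exact fitting procedure used for Fig.(3a) is not specified in the text.

```python
def fit_prefactor(ns, rgs, nu=0.588):
    """Least-squares slope of R_g versus N**nu for a line through the origin."""
    xs = [n ** nu for n in ns]
    return sum(x * y for x, y in zip(xs, rgs)) / sum(x * x for x in xs)

# Synthetic radii of gyration in nm (not simulation data), generated with
# a_D = 0.25 nm purely to illustrate that the fit returns the input value.
demo_lengths = [56, 100, 153, 246]
demo_rgs = [0.25 * n ** 0.588 for n in demo_lengths]

if __name__ == "__main__":
    print(fit_prefactor(demo_lengths, demo_rgs))
```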
In Fig. (4) we show the dependence of $R_g$ on $k$ for three representative proteins along with their native and unfolded structures and contact maps. The $\alpha$-helical protein myoglobin and $\beta$-lactoglobulin, which has a $\beta$-sheet architecture, have nearly the same number of amino acids, $N \sim 150$. The sizes of the two proteins are similar (Fig.4b) when $k$ is small ($k < 0.5$), implying that the values of $R_g$ in the unfolded states are determined solely by $N$ (see Fig.3a). For each protein, we identified $k_\theta$ from simulations as the $k$ value at which $dR_g/dk$ is a minimum. Using this method, we find that the $k_\theta$ value for $\beta$-lactoglobulin is less than for myoglobin. This result is consistent with the theoretical prediction, demonstrating that generically $\alpha$ proteins are less collapsible than $\beta$ proteins. Interestingly, TIM barrel, an $\alpha/\beta$ protein with larger chain length ($N = 246$), collapses at $k_\theta = 1.6$, which is larger than the value for $\beta$-lactoglobulin but smaller than that for myoglobin (purple line in Fig.4b). These results are qualitatively consistent with theoretical predictions.
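Locating $k_\theta$ as the point where $dR_g/dk$ is a minimum can be sketched with a central finite difference on a sampled $R_g(k)$ curve. The function name is hypothetical and the sigmoidal test curve is synthetic, chosen only so that its steepest descent sits at a known value of $k$; the numerical differentiation actually used for the simulation data is not described in the text.

```python
import math

def k_theta_from_curve(ks, rgs):
    """Return the interior k at which the central-difference dRg/dk is most negative.

    ks must be sorted in increasing order and contain at least three points.
    """
    slopes = []
    for i in range(1, len(ks) - 1):
        slope = (rgs[i + 1] - rgs[i - 1]) / (ks[i + 1] - ks[i - 1])
        slopes.append((slope, ks[i]))
    return min(slopes)[1]

# Synthetic collapse curve (not simulation data): R_g drops from about 3.5 nm
# to about 2.0 nm, with the steepest decrease placed at k = 1.6 by construction.
demo_ks = [i / 10 for i in range(31)]
demo_curve = [2.0 + 1.5 / (1.0 + math.exp(4.0 * (k - 1.6))) for k in demo_ks]

if __name__ == "__main__":
    print(k_theta_from_curve(demo_ks, demo_curve))
```

On noisy simulation data one would likely smooth $R_g(k)$ before differentiating, since a finite difference amplifies noise.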
The absolute values of k θ are different between simulations and theory because we used entirely different models to describe the coil to globule transition. The potential used in the theory, convenient for serving analytic expression for k θ , is far too soft to describe the structures of polypeptide chains. As a result the polypeptide chains explore small R g values without significant energetic penalty. Such unphysical conformations are prohibited in the realistic model used in the simulations. Consequently, we expect that the theoretical values of k θ should differ from the values obtained in simulations. Despite the differences in the potentials used in theory and simulations, the trends in k θ predicted using theory are the same as in simulations. The Pearson correlation coefficient, ρ = 0.79. Since we examined only 21 proteins in simulations, which is fewer than theoretical predictions made for 2306 proteins, we analyzed the correlation data by the bootstrap method to ascertain the statistical significance of ρ. The estimated probability distribution of ρ is shown in Fig. (5b). The mean of correlation coefficient is 0.78 and ρ 90% > 0.61 with 90% confidence.
The distribution is bimodal indicating that there is at least one outlier in the data set, which is likely to be the three helix bundle B domain of Protein A (labeled 5 in Fig. (5)).
For 20 proteins excluding Protein A, the distribution has a single peak (green broken line) with the mean 0.88 and ρ 90% > 0.82 (green dotted line in Fig. (5)). From these results, we surmise that both theory and simulations qualitatively lead to the conclusion that proteins with β-sheet architecture are more collapsible than α-helical structures, which is one of the major predictions of this work.
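The bootstrap analysis of ρ described above can be sketched as follows. This is a minimal illustration under stated assumptions: the paired k θ values are synthetic, the function name `bootstrap_pearson` is hypothetical, and reading ρ 90% as the 10th percentile of the resampled distribution (a one-sided lower bound) is our interpretation rather than a documented detail of the analysis.

```python
import numpy as np

def bootstrap_pearson(x, y, n_resamples=2000, seed=0):
    """Resample (x, y) pairs with replacement and collect Pearson coefficients."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(x)
    rhos = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # indices drawn with replacement
        xs, ys = x[idx], y[idx]
        if xs.std() == 0.0 or ys.std() == 0.0:
            continue  # correlation is undefined for a constant resample
        rhos.append(np.corrcoef(xs, ys)[0, 1])
    return np.array(rhos)

# Synthetic stand-in for 21 (theory, simulation) k_theta pairs (assumption).
demo_rng = np.random.default_rng(1)
theory_k = demo_rng.uniform(1.0, 3.0, size=21)
sim_k = 0.8 * theory_k + demo_rng.normal(0.0, 0.3, size=21)

rhos = bootstrap_pearson(theory_k, sim_k)
rho_mean = rhos.mean()
rho_lower = np.percentile(rhos, 10)  # one-sided 90% lower bound (assumed reading)
print(rho_mean, rho_lower)
```

A single influential point, such as the outlier discussed above, tends to show up as a second peak in the histogram of `rhos`, because resamples either include or omit it.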
Given that the simulations describe the characteristics of the unfolded states, we show in Thus, generally R g of the U C state is less than that of the U D state. The end-to-end distribution, P (R ee ), for different values of k in Fig.(3c) is broad at k = 0, corresponding to the unfolded protein. The average R ee decreases as the attractive strength increases and the distribution becomes narrower. The results in Fig.(3) show that both R ee , which can be inferred using smFRET, and R g (measurable using SAXS), are smaller in the U C state than in the U D state. However, the extent of decrease is greater in R ee than in R g , an observation that has contributed to the smFRET-SAXS controversy.
RNAs are compact: There are major differences between how RNA and proteins fold [62]. In contrast to the apparent controversy in proteins, it is well established that RNA molecules are compact [63][64][65] at high ion concentrations or at low temperatures. Because our theory relies only on knowledge of the contact map, we used it to assess collapsibility of the Azoarcus ribozyme and the MMTV pseudoknot, merely to illustrate the collapsibility of RNA (Fig. (6)).
The k θ values (green stars in Fig. (2)) are close to the lower β-sheet line, indicating that these molecules must undergo compaction as they fold. This prediction from the theory is fully supported by both equilibrium and time-resolved SAXS experiments [66] on Azoarcus ribozyme. In this case (N = 196) the changes are so large that collapse is readily observed even in low resolution experiments [67]. We should emphasize that the sizes of different RNAs (for example viral, coding, non-coding) vary greatly. For a fixed length, single-stranded viral RNAs have evolved to be maximally compact, which is rationalized in terms of the density of branching. Although the viral RNAs considered in [68] are much longer than the Azoarcus ribozyme, the notion that compaction is determined by the density of branching might be valid even when N ∼ 200.
Dependence of k θ on the values of the cut-off: In order to ensure that the theoretical predictions do not change qualitatively if the cutoff values are changed, we varied them over a reasonable range. The reason for our choice of R c is that in the majority of folding simulations using a C α representation of proteins, R c = 0.8 nm is typically used. Consider the variation of k θ with R c , the cut-off used to define contacts, at a fixed σ = 0.63 nm. As R c increases the number of contacts also increases. From Eq. (14) it follows that k θ should decrease, which is borne out in the results in Fig.(8a).
Reassuringly, the trends are preserved. In particular, the prediction that β-rich proteins are more collapsible than α-rich proteins holds irrespective of the value of R c .
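The statement that the number of contacts grows with R c follows directly from how a distance-based contact map is built, as the sketch below shows. The helper `contact_pairs`, the minimum sequence separation of three residues, and the idealized helical C α trace are all assumptions made for illustration; the exact contact criterion used for the PDB set is not restated in this section.

```python
import numpy as np

def contact_pairs(coords, r_c, min_separation=3):
    """List residue pairs (i, j), j - i >= min_separation, closer than r_c (nm)."""
    coords = np.asarray(coords, dtype=float)
    n = len(coords)
    # Full matrix of pairwise distances between beads.
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    i, j = np.triu_indices(n, k=min_separation)
    mask = dist[i, j] < r_c
    return list(zip(i[mask].tolist(), j[mask].tolist()))

# Idealized helix-like trace (assumption): 0.23 nm radius, 0.15 nm rise, 100 degrees per residue.
idx = np.arange(30)
angle = np.deg2rad(100.0) * idx
coords = np.column_stack([0.23 * np.cos(angle), 0.23 * np.sin(angle), 0.15 * idx])

counts = [len(contact_pairs(coords, r_c)) for r_c in (0.6, 0.8, 1.0)]
print(counts)
```

Because every pair within a smaller cutoff is also within a larger one, the contact sets are nested and the counts can never decrease as R c grows.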

DISCUSSION
We have shown that polymer chains with specific interactions, like proteins (but ones without a unique native state), become compact as the strength of the specific interaction changes. A clear implication is that the size of the U D state should decrease continuously as C decreases. In other words, the unfolded state under folding conditions is more compact than it is at high denaturant concentrations. Compaction is driven roughly by the same mechanism as the collapse transition in homopolymers in the sense that when the solvent quality is poor (below C m ) the size of the unfolded state decreases continuously. When the set of specific interactions is taken from protein native contacts in the PDB, our theory shows that the values of k θ are in the range expected for interaction between amino acids in proteins. This implies that collapsibility should be a universal feature of foldable proteins but the extent of compaction varies greatly depending on the architecture in the folded state.
This is manifested in our finding that proteins dominated by β-sheets are more collapsible compared to those with α-helical structures.
Magnitude of k θ and plausible route to multi-domain formation: The scaling of k θ with N allows us to provide arguments for the emergence of multi-domain proteins.
In Eqs. (13) or (14) attractive (κ-) and repulsive (v-) terms have the same structure. The only difference in their scaling with N is due to the difference in the sums (over all the monomers in the repulsive term and over native contacts in the attractive term). Double summation over all the monomers gives a factor of N 2 to the repulsive term. The summation over native contacts in the attractive term scales as N nc . Therefore, to compensate for the repulsion, N nc should scale as N 2 . However, for a given protein with a certain length N and a certain number of contacts, it is not clear how the denominator in Eq. (14) scales with N . Empirically we find that N nc across a representative set of sequences scales as N γ with γ at most ≈ 1.3 (Appendix A). Thus, it follows from Eq. (14) that k θ increases without bound as N continues to increase. Because this is unphysical, it would imply that proteins whose lengths exceed a threshold value N C cannot become maximally compact even at C = 0. An instability must ensue when N exceeds N C . This argument in part explains why single domain proteins are relatively small [70].
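The exponent γ in N nc ∼ N γ can be estimated by a straight-line fit on a log-log scale. The sketch below uses synthetic contact counts generated with γ = 1.3 and a hypothetical prefactor, only to show the fitting step and the consequence that N nc /N 2 shrinks with N; it does not reproduce the PDBselect analysis.

```python
import numpy as np

def fit_power_law(n_residues, n_contacts):
    """Least-squares fit of log(N_nc) = gamma * log(N) + log(prefactor)."""
    log_n = np.log(np.asarray(n_residues, dtype=float))
    log_c = np.log(np.asarray(n_contacts, dtype=float))
    gamma, log_prefactor = np.polyfit(log_n, log_c, 1)
    return gamma, np.exp(log_prefactor)

# Synthetic data (assumption): exact power law with gamma = 1.3, prefactor 0.7.
chain_lengths = np.array([50, 72, 100, 150, 200, 246], dtype=float)
contact_counts = 0.7 * chain_lengths ** 1.3

gamma_fit, prefactor_fit = fit_power_law(chain_lengths, contact_counts)
# Ratio of attractive-term to repulsive-term counting factors, N_nc / N^2.
ratio = contact_counts / chain_lengths ** 2
print(gamma_fit, prefactor_fit, ratio)
```

For any γ < 2 the ratio N nc /N 2 decreases monotonically with N, which is the counting argument behind the growth of k θ with chain length.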
Scaling of N nc as a power law in N γ means that as the protein size grows, the value of k θ will deviate more and more from those found in globular proteins, implying such The present work, surveying over 2300 proteins, shows that the compact state has to exist, engendered by mechanisms that have much in common with homopolymer collapse. For protein-L, k θ = 1.7 k B T , a very typical value, is right on the peak of the heat map in Fig.(1). We have previously argued that because the change in R g between the U D and U C states for small proteins is not large, high precision experiments are needed to measure the predicted changes in R g between U C and U D . For protein-L the change is less than 10% [71], making its detection in ensemble experiments very difficult. Similar conclusions were reached in recent experiments [20]. A clear message from our theory is that, tempting as it may be, one cannot draw universal conclusions about polypeptide compaction by performing experiments on just a few proteins. One has to survey a large number of proteins with varying N and native topology to quantitatively assess the extent of compaction. Our theory provides a framework for interpreting the results of such experiments.
Random contact maps, local and non-local contacts: In order to differentiate collapsibility between evolved and random proteins, we created twelve random contact maps keeping the total number of contacts the same as in protein-L (see Fig.(7) for examples).
For each of these pseudo-proteins we calculated k θ using Eq. (14). We find that for all the random contact maps the k θ values are less than for protein-L, implying that the propensity of the pseudo-proteins to become compact is greater than for the wild type. This finding is in accord with studies based on homopolymer and heteropolymer collapse with random crosslinks. These studies showed that the polymer undergoes a collapse transition as the density of crosslinks is increased [45,47,48]. Of particular note is the demonstration by Camacho and Schanke [50], who showed using exact enumeration of random heteropolymers and scaling arguments that the collapse can be either a first or second order transition depending on the fraction of hydrophobic residues.
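One way to build random contact maps with a fixed total number of contacts is to draw residue pairs uniformly without replacement, as in the sketch below. The function `random_contact_map` and the exclusion of pairs closer than three residues along the chain are assumptions for illustration; the values N = 72 and 185 contacts are those quoted for protein-L in the text.

```python
import numpy as np

def random_contact_map(n_residues, n_contacts, min_separation=3, seed=0):
    """Symmetric boolean map with exactly n_contacts randomly chosen pairs."""
    rng = np.random.default_rng(seed)
    # All eligible pairs (i, j) with j - i >= min_separation.
    i, j = np.triu_indices(n_residues, k=min_separation)
    chosen = rng.choice(len(i), size=n_contacts, replace=False)
    cmap = np.zeros((n_residues, n_residues), dtype=bool)
    cmap[i[chosen], j[chosen]] = True
    return cmap | cmap.T  # mirror the upper triangle

# Twelve pseudo-protein maps with the protein-L size and contact count.
maps = [random_contact_map(72, 185, seed=s) for s in range(12)]
print([int(m.sum()) // 2 for m in maps])
```

Each map would then be passed to the expression for k θ in Eq. (14), which is not reproduced in this sketch.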
Some time ago Abkevich et al. [72] showed, using Monte Carlo simulations of protein-like lattice polymers, that the folding transition in proteins with predominantly non-local contacts was first-order-like, which is not the case for proteins in which local contacts dominate. In light of this finding, it is interesting to examine how compaction is affected by local and non-local contacts. We created for N = 72 (protein-L) contact maps with 185 predominantly local contacts, the same number as in WT protein-L (Fig.(7b)). The values of k θ for these pseudo-proteins are considerably larger than for the WT, implying that proteins dominated by local contacts are minimally collapsible. We repeated the exercise by creating contact maps with predominantly non-local contacts (Fig.(7c)). Interestingly, k θ values in this case are significantly less than for the WT. This finding explains why in proteins with varied α/β topology there is a balance between the number of local and non-local contacts.
Such a balance is needed to achieve native state stability and speed of folding [72] with polypeptide compaction playing an integral part [26].
Based on these findings we conclude that R g of the unfolded states of proteins dominated by non-local contacts must undergo greater compaction compared to those that have mostly local contacts. The results in Fig. (2) also show that proteins rich in β-sheet are more collapsible than predominantly α-helical proteins. It follows that β-sheet proteins must have a larger fraction of non-local contacts than proteins rich in α-helices. In Fig. (7d) we plot the distribution of the fraction of non-local contacts for the 2306 proteins. Interestingly, there is a clear separation in the distribution of non-local contacts between α-helical rich and β-sheet rich proteins. The latter have a substantial fraction of non-local contacts, which readily explains the findings in Fig. (7c) and the predictions in Fig. (2).
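The fraction of non-local contacts can be computed from a list of contact pairs once a sequence-separation threshold is chosen. In the sketch below the threshold of eight residues, the function name `nonlocal_fraction`, and the two toy contact lists (helix-like and hairpin-like) are hypothetical choices for illustration; the threshold used for Fig. (7d) is not stated in this section.

```python
def nonlocal_fraction(pairs, threshold=8):
    """Fraction of contacts (i, j) whose sequence separation exceeds threshold."""
    if not pairs:
        return 0.0
    n_nonlocal = sum(1 for i, j in pairs if abs(j - i) > threshold)
    return n_nonlocal / len(pairs)

n = 20
# Helix-like toy pattern: only (i, i+3) and (i, i+4) contacts.
helix_pairs = [(i, i + d) for d in (3, 4) for i in range(n - d)]
# Hairpin-like toy pattern: residue i pairs with residue n-1-i.
hairpin_pairs = [(i, n - 1 - i) for i in range(n // 2) if (n - 1 - i) - i >= 3]

helix_fraction = nonlocal_fraction(helix_pairs)
hairpin_fraction = nonlocal_fraction(hairpin_pairs)
print(helix_fraction, hairpin_fraction)
```

The toy patterns mirror the qualitative point in the text: helical contacts are short-ranged in sequence, whereas strand pairing brings distant residues together.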

CONCLUSIONS
We have created a theory to assess collapsibility of proteins using a combination of analytical modeling and simulations. The major implications of the theory are the following. enhanced if a pair of U D rather than U C molecules collided due to cellular stress because the contact radius in the former would be greater than in the latter. Second, the fraction of exposed hydrophobic residues in U D is much greater than in U C , thus greatly increasing the probability of aggregation. The second factor is likely to be more important than the first.
Consequently, transient population of U C due to cellular stress minimizes the probability of aggregation. (iii) We have also shown that the positions of the residues forming the native contacts greatly influence collapsibility, with β-sheet proteins (containing a number of non-local contacts) showing greater compaction than α-helical proteins, which are typically stabilized by local contacts. Our theory also shows that most RNAs may have evolved to be compact in their natural environments. Although the evolutionary pressure to be compact is likely to be substantial for viral RNAs [64,65,68,73], it is apparent that even non-coding RNAs are likely to be almost maximally compact in their natural environments. Our theory suggests that, to a large extent, collapsibility of RNA is similar to that of proteins with β-sheet structures. Both classes of biological macromolecules are stabilized by non-local contacts. Interestingly, it has been argued that the need to be compact ("Compaction selection hypothesis" [73]) could be a major determinant for evolved biopolymers to have minimum energy compact structures as their ground states.
Appendix A: Collapse of homopolymers: The theory described for protein collapse resulting in Eq. (14) is general and applicable to the collapse of homopolymers as well. We show in this Appendix that the ES formalism can be used to derive the scaling of k θ with N , the number of monomers.
Consider a homopolymer with the following Hamiltonian: where r(s) is the position of the monomer s, and a 0 is the monomer size. The first term in Eq.(A1) accounts for chain connectivity, and the second term represents volume interactions and favorable interactions between monomers, given by V H (r(s)), interactions, the range is σ. In good solvents, with v > 0, the polymer swells with R g ∼ aN ν (ν ≈ 0.6). In poor solvents (v < 0), the polymer undergoes a coil-to-globule transition with R g ∼ aN ν (ν = 1/3). These are the well-known Flory laws.
Following the ES method described in the main text, we arrive at the self-consistent equation for a for the homopolymer chain, The expression for k θ in Eq. (A4) for homopolymers differs from k θ (Eq. (14)) for proteins only by the term in the denominator. The sum over specific interactions for proteins is replaced by the non-specific interaction in Eq. (A4). It can be shown that the N dependence is the same in both the numerator and denominator in Eq. (A4). Therefore, to leading order in W, k θ is independent of N for a homopolymer.
In order to derive the scaling of k θ with N , we need to analyze the corrections arising from second order in W. To second order in W, the radius of gyration is, Here, W 1 is the same as Eq. (5), and W 2 is given by Eq. A2. The terms associated with W 1 are zero at the θ-transition point. By counting the powers of N it follows that r 2 (s)W 2 2 v scales as 1 N 7 and r 2 (s) v W 2 2 v scales as 1 N 5 . Hence, at the θ-point, we find that k θ satisfies the following quadratic equation, in the large N limit. The scaling law for k θ (∝ T θ ) obtained first by Flory [58], was confirmed using simulations much later [74]. To our knowledge this is the first microscopic derivation of the result. Thus, our general formalism can be applied to describe collapse of homopolymers as well as proteins and RNA.

Proteins:
The results for homopolymers given above may be extended to obtain the N dependence of k θ for proteins. By considering the second order correction to the radius of gyration, we obtain the following quadratic equation for k θ , In deriving the above equation we assume that total number of contacts N nc ∼ N γ . A plot of N nc as a function of N (Fig. (8e)) for the PDBselect proteins confirms that this is indeed the case. For γ = 1.3, k θ ∼ N 0.9 , which shows that larger proteins are less collapsible than smaller ones, implying that when N exceeds a critical value they are likely to form multidomain structures. Comparison of Eqs. (A6) and (A7) shows that collapsibility in proteins and homopolymers differs dramatically. For homopolymers the coil-to-globule transition occurs at a finite temperature. The sharpness of the transition increases as N increases. In sharp contrast, the growth of k θ with N for proteins (Eq. (A7)) implies that larger proteins must organize themselves into domains with individual domains forming compact structures.

Appendix B: Simulations
The theoretical results were obtained using a set of approximations, whose validity needs to be confirmed using simulations. The purpose of these simulations is to show that the predicted theoretical values of k θ correlate well with simulation results. We performed Langevin dynamics simulations for 21 globular proteins (Fig. (5)). The set includes both all-α and all-β proteins as well as α + β and α/β proteins according to the Structural Classification Of Proteins (SCOP).
The simple form (sum of Gaussians) of the interaction energy in Eq. (2) was devised in order to obtain an analytic expression for k θ so that collapsibility of two thousand or more proteins could be easily analyzed. The potential in Eq. (2) has no hard core, which is physically unrealistic. Because of the soft interactions it is clear that the theoretical values of k θ have to be an upper bound. In order to firmly establish the qualitative predictions obtained using theory we use a realistic interaction energy in the simulations. The potential function in the simulations is, where . (B2) The first term, describing chain connectivity, is the discrete version of the first term in Eq. (1) with a 0 = 0.38 nm. The second term accounts for excluded volume interactions used for any pair of residues not included in the contact map. We chose ε v = 1.0 kcal/mol so that monomer particles do not overlap with each other. In this crucial respect, the potential function is drastically different from the interaction potential used in the theory, in which the Gaussian-type soft core potential was used in order to solve the problem analytically.
The summation in the last term in Eq. (B1) runs over all pairs in the contact map. The potential, Φ WCA , is the Weeks-Chandler-Andersen potential [75], a variant of the Lennard-Jones potential, consisting of well-separated repulsive and attractive terms (Fig. 8(c), (d)). This is necessary in order to vary the strength of the attractive potential without affecting the repulsive interactions. The coefficient of the attractive term is ε k = k · k B T . We varied k between 0.0 and 5.0 to find the collapse-transition point, k = k θ . The contact distance is the same as in the theory, σ = 0.63 nm.
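The idea of separating a Lennard-Jones interaction into independently scaled repulsive and attractive parts can be sketched with the standard Weeks-Chandler-Andersen decomposition. This is an illustration under stated assumptions: the exact functional form of Eq. (B2) is not reproduced in this section, so the 12-6 form, the split at 2^(1/6) σ, and the function name `wca_split` are assumptions rather than the simulation potential itself.

```python
import numpy as np

K_B_T_300K = 0.596  # kcal/mol, thermal energy at T = 300 K

def wca_split(r, sigma=0.63, eps_rep=1.0, eps_att=1.0):
    """Standard WCA split of a 12-6 potential (assumed form); r and sigma in nm."""
    r = np.asarray(r, dtype=float)
    r_min = 2.0 ** (1.0 / 6.0) * sigma  # position of the 12-6 minimum
    sr6 = (sigma / r) ** 6
    lj_unit = 4.0 * (sr6 ** 2 - sr6)  # 12-6 potential with unit well depth
    # Repulsive branch: shifted up so it vanishes smoothly at r_min.
    repulsive = np.where(r < r_min, eps_rep * (lj_unit + 1.0), 0.0)
    # Attractive branch: flat well inside r_min, 12-6 tail outside.
    attractive = np.where(r < r_min, -eps_att, eps_att * lj_unit)
    return repulsive, attractive

r_grid = np.linspace(0.5, 1.5, 201)
rep_weak, att_weak = wca_split(r_grid, eps_rep=1.0, eps_att=1.0 * K_B_T_300K)
rep_strong, att_strong = wca_split(r_grid, eps_rep=1.0, eps_att=3.0 * K_B_T_300K)
print(np.allclose(rep_weak, rep_strong))
```

Raising `eps_att`, which plays the role of ε k = k · k B T , deepens the well while leaving the repulsive core untouched, which is the property the text requires of the simulation potential.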
For each protein and k value, we generated 100 independent simulation trajectories.
Initial conformations were generated in a preliminary simulation at high temperature T = 400 K with k = 0. Each production run at T = 300 K lasts for 10 8 steps. We discarded the first 2 × 10 7 steps in analyzing the data. Conformations are sampled every 10 4 steps. In total, 8 × 10 5 conformations were sampled to calculate the average radius of gyration, R g for each k.
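The sample count quoted above follows from the run parameters, and the per-conformation R g is a simple function of the bead coordinates. The sketch below checks the arithmetic and computes R g for equal-mass beads; averaging R g directly over conformations (rather than averaging its square) and the function name `radius_of_gyration` are assumptions for illustration.

```python
import numpy as np

n_trajectories = 100
steps_per_run = 10 ** 8
discarded_steps = 2 * 10 ** 7
sampling_stride = 10 ** 4

# Conformations kept per trajectory, then summed over all trajectories.
frames_per_trajectory = (steps_per_run - discarded_steps) // sampling_stride
total_frames = n_trajectories * frames_per_trajectory

def radius_of_gyration(coords):
    """R_g of equal-mass beads: rms distance from the centroid."""
    coords = np.asarray(coords, dtype=float)
    centered = coords - coords.mean(axis=0)
    return float(np.sqrt((centered ** 2).sum(axis=1).mean()))

# Toy ensemble of random bead positions (assumption), only to show the averaging step.
rng = np.random.default_rng(0)
toy_frames = rng.normal(size=(5, 72, 3))
mean_rg = np.mean([radius_of_gyration(frame) for frame in toy_frames])
print(frames_per_trajectory, total_frames, mean_rg)
```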