A Machine Learning Protocol for Predicting Protein Infrared Spectra

: Infrared (IR) absorption provides important chemical fingerprints of biomolecules. Protein secondary structure determination from IR spectra is tedious since its theoretical interpretation requires repeated expensive quantum-mechanical calculations in a fluctuating environment. Herein we present a novel machine learning (ML) protocol that uses a few key structural descriptors to rapidly predict amide I IR spectra of various proteins and agrees well with experiment. Its transferability enabled us to distinguish protein secondary structures, probe atomic structure variations with temperature, and monitor protein folding. This approach offers a cost-effective tool to model the relationship between protein spectra and their biological/chemical properties.


Introduction
Understanding the function of proteins benefits enormously from knowledge of their atomistic structure.][6] The theoretical interpretation of spectroscopic signals and connecting them with structural detail is an expensive task, which requires many electronic structure calculations at the quantum chemistry (QC) level for a large number (typically thousands) of representative configurations.
For decades, the map methods have been widely used [7][8][9][10] , to predict vibrational properties without large scale QC calculations.Its basic philosophy is to compute (or predict) vibrational modes by using empirical polynomial function in the local electric fields around the targeted molecules or amide group. 3,7 2] However, the transferability of map methods is limited since a few-parameter fitting of observables to key structural parameters cannot account for the full versatility and complexity of proteins. 7,13 he biased parameterization might bring errors in spectroscopic simulations.Developing a cost-effective approach that has greater predictive power and transferability is called for.
5][16][17][18] In particular, neural networks (NN), a class of machine-learning algorithms, can establish the structure-property relationships by iteratively learning with a complex highdimensional function. 19NN has been proven useful for handling complex non-linear problems, and offers a transferrable tool for simulating protein spectroscopy 20 and for predicting the frequency and transition dipole moments of the O-H stretch in water. 13¹» ï ±º ïð   21 and Ghosh et al. used DNN to obtain spectra information directly from the molecular structure, which greatly accelerates the spectroscopic analysis of materials. 22t is always a worthy goal to realize first-principles predictions of proteins IR spectra, despite of its computationally prohibitive difficulties.In this study, we develop an ML protocol for predicting the amide I IR spectra of proteins with density functional theory (DFT) accuracy.The simulated fine structure of IR signals of various proteins from the trained ML model agrees well with experiment.Applications are presented for the identification of secondary structures, probing structural variations with temperature, and monitoring of protein folding.

Theory and Computation Detail
Quantum mechanics treatment for amide I vibration.We adopt a divide-conquer strategy to treat the amide I vibrations of the whole protein.The protein vibrations are represented as a set of n oscillators associated with each peptide bond in its backbone.The Frenkel exciton model is employed to construct a vibrational model Hamiltonian, 23 in which the diagonal elements are the frequency ( i ) of the i-th amide I oscillator, and the off-diagonal elements represents the coupling between two oscillators i and j (Fig. 1).For a pair of non-neighboring oscillators, since the distances between oscillators are greater than their sizes, the coupling is calculated with the dipole approximation: 24 , where 0 is the dielectric constant, ( ) is the transition dipole of peptide bond i (j), and i j r ij is the vector connecting dipole i and j.For two neighboring oscillators, the couplings are computed directly using a dipeptide model. 10,25  represents the peptide bond moiety and has been widely used for generating parameters by the empirical map method. 11,26 or the vibrational couplings between two neighboring peptide bonds, we employed the N-acetyl-glycine-N'-methylamide (GLDP) molecule (Fig. S1), also known as the glycine dipeptide. Quantum mechanical calculations for data generation.Configurations of NMA were extracted from ab initio molecular dynamics (AIMD) simulation at 300 K in the NVT ensemble, conducted with the CP2K 29 program (Details in Supporting Information).In order to sample relevant configurations, we have run seven independent AIMD simulations with different initial conformations (Fig. S2).From 241.5 ps trajectories with a of 0.5 fs, time step a total of 9660 configurations were extracted at a 25-fs interval to avoid overly correlated configurations for machine learning training.For each configuration, we extracted the NMA molecule and surrounding water molecules within a 5 Å radius for the Hessian calculations using the Gaussian 16 package 30 at the B3LYP/cc-pVDZ level to generate data for machine learning training.The harmonic vibrational frequencies were scaled by 0.97.
A total of 5128 structures of GLDP molecules were generated with the Ramachandran angles -180°180°and -180°1 80°at 5°intervals for both angles for machine learning training (Fig. S1).Then all Ramachandran angles were fixed and the remaining coordinates were optimized. 10The Hessian calculations were performed on the obtained structures, and solvation effects were modeled implicitly by the integral equation formalism polarizable continuum model, using the Gaussian16 package at the B3LYP/cc-pVDZ level.The local coupling of nearest neighbor amide I vibrational modes was calculated by the localizing normal modes scheme of Jacob and Reiher. 25,31 ta analytics.9660 NMA and 5128 GLDP conformations were generated as training set to predict the vibrational frequency ( i ), transition dipole ( ), and neighboring coupling (J ij ).The calculated Neural networks architecture and descriptors.Multilayer Perceptron (MLP) with supervised training scheme using a backpropagation algorithm implemented in TensorFlow 32 to predict the properties ( i , , J ij ) from the geometric descriptors.] The NN consists of one input layer, three hidden layers and one output layer.For each hidden NN layer we used the Rectified Linear Unit activation function. 35The number of hidden layer neurons are 32, 64 and 128, respectively.We adopted different learning rates of the Adam optimizer 36 in TensorFlow for the training process to avoid being trapped into local minima.The learning rate is set to be halved every 500 steps, and the initial learning rate is set to 0.0004.For the training, we added L2 regularization 37 to the architecture of the neural network to prevent overfitting.Hyperparameters were optimized by using random search algorithm 38 in TensorFlow (including neurons for hidden layers, learning rate and l2 regularization parameter) to create a reasonable ML protocol in this work.
In order to establish structure-property relationship between geometry and optical properties of proteins, the ground state Coulomb Matrix 39 (CM) of the NMA and part of GLDP molecules (excluding hydrogens and solvent molecule) was taken as descriptor (Fig. S6): , where i 2.4 0.5 and j are atomic indices, is the interatomic distance, and R R i j represents nuclear charge.The merit of CM lies in its simplicity

Q i
and efficiency, 39 it is also a sufficient descriptor for the molecular spectra. 22Internal coordinates were also tested for the ML training, we did not adopt it since the less accurate and efficient than the Coulomb Matrix (Fig. S7).As a rotationally invariant Machine learning model evaluation.The NN predictive accuracy is reported using the Pearson coefficient (r) and mean absolute deviation (MAE), and its robustness is verified by the standard cross-validation 40 procedure.All data sets were randomly and evenly distributed into 10 bins in this procedure.Each bin was used as a test set while the remaining nine bins as training set.We have calculated the learning curves of whole ML process in this work.The learning curves (Fig. S10) indicates that the NN training for vibrational frequency and transition dipole moment converges with 6000 NMA samples, while that of coupling constant needs 4000 GLDP samples.Importantly, there is no significant overfit issues after adding the standard L2 regularization 37 treatment to the NN architecture (Fig. S10).It is straightforward to predict the frequency and coupling constants because they mainly depend on the ground state structure.However, since the transition properties (e.g.vibration transition dipole moment) involve two different vibration states, it is expected to see more outliers because these quantities are more sensitive to structural changes.This phenomenon indeed poses a great challenge for NN training (Table S8).With the high Pearson coefficient (r>0.9) and low MAE values (1.761 cm -1 ) obtained in cross-validation, we have achieved accurate ML predictions for the vibration frequency ( i ), transition dipole ( ), and coupling constants (J ij ) in the exciton Hamiltonian i (Fig. 2).

Results and Discussion
Ð¿¹» í ±º ïð   Finally, IR spectra were simulated with the model Hamiltonian, using the SPECTRON program developed by Mukamel and coworkers 41 .As Fig2, Fig S11 and S12 shows, our ML model can reproduce the DFT data, and we also make this ML protocol [42][43] and simulation data 44 available online to provide rapid protein IR spectroscopy prediction, paving the way for a real-time operation of ultrafast experimental spectroscopy.Good agreement (the quantitative agreement between the predicted and experimental spectra were measured by Spearman rank correlation coefficients 45 , see Table 1) is obtained between the experimental spectra of the proteins measured in D 2 O (black lines) [46][47][48] and the ML predictions based on 1000 MD configurations (red lines).Intensity is scaled to have the same maximum intensity for each panel.
IR spectroscopic assignment by ML for protein secondary structures.Then we applied the ML protocol to simulate the amide I IR spectra of 12 proteins (Fig4 and Fig S13).The good agreements between our ML predictions and experimental spectra is evident from the high Spearman rank correlation coefficients ( > 0.80 for 11 cases, except one with 0.71 for the 1DHR) (Table 1).This is a widely used measure for the agreement between the predicted and experimental spectra 45,[49][50][51] .The structures of proteins are reflected by distinct spectral characteristics, such as the wavelength region for the dominant signal peak 52 : -helices: 1640~1650 cm -1 , -sheets: 1620~1640 cm -1 & 1680~1690 cm -1 , random coil: 1650~1660 cm - 1 .As indicated by Table 1 and Fig. 4, NN predictions can distinguish the -helix and -sheet secondary structures.Thehelical (PDB code: 1MBC) and -sheet (PDB code: 1REI) proteins exhibit the major spectral peaks at 1650 cm -1 , 1634 and 1680 cm -1 , respectively.Proteins containing both secondary structures ( + ) show characteristic peaks for both motifs.Taking the advantage of the speed of the ML model (Table 1), we can predict the IR spectra by averaging the NN predicted signals of 1000 MD configurations (which would be prohibitively expensive via direct QC computations), so as to capture the fluctuating dynamics for each protein (Details in Supporting Information).The essential features (both main peaks and lineshapes) of experimental spectra are successfully reproduced by the simulated spectra with high Spearman rank correlation coefficient (Fig. 4 and Table 1).We have further investigated the amide I signals of -Immunoglobulin (1REI) in different states, as shown in Fig. S16, the dominant peak of spectra has a blue shift which corresponds to the change of secondary structure content ( -turns and coil increased whilestrands decreased (Table S4)).We have also predicted transient amide I spectra of 1REI at different time moments based on MD trajectories.As Fig. S17 shows, the main peak has a red shift accompanied with the decrease of -turns and coil content, and the increase of -strands content (Table S4).The structural change is clearly captured by the change of spectra (Fig. S16 and S17).This would be useful for tracking conformation changes of proteins.
Map method calculates the IR spectra.2] As expected, due to the use of simple empirical polynomial functions, the map method is much faster (10~20 times) than our neural network (NN) protocol.Roughly speaking, It is at least five orders of magnitude faster than the full DFT calculation (Table S1 and Table S2).Unfortunately, the map predictions can only explain the experimental spectra for 4 (1MBC, 1PPN, 3EST, 5PCA) out of 12 proteins (Fig. S14 and Fig S15 ).Compared with experiment, map results have the RMSE (root mean square error) values of 1.48 to 4.52, and the Spearman rank correlation coefficients of 0.18 to 0.93 (Normally a high Spearman coefficient > 0.6 is required for a good theoretical prediction).In contrast, NN results have RMSEs between 1.43 and 2.81, and the Spearman rank correlation coefficients between 0.71 and 0.96 (Fig S15 and Table S9).The local electric potential/field used in map dependens on the quality of atomic charges in the chosen force field.And the empirical function of map trained by a set of protein dataset may not fit for other types of proteins.In short, the use of empirical parameters and force-field-dependent electric field values limits the transferability of map method.We expect that the improvement of map results may require re-parameterization the model for specific proteins of interest, or a more accurate force field for proteins.
Probing structure variations of Ubiquitin with temperature.We have examined the ML transferability by simulating the IR spectra of Ubiquitin (PDB code: 1UBQ) at different temperatures (1.6 , 28.6 , 55.6 , (Details in Supporting Information).Ubiquitin is a 76-residue protein which contains both and secondary structures, which is frequently used as an exemplar of the folding/unfolding process.As the temperature rises in the range of interest, the dominant peak undergoes a blue-shift from 1642 cm - 1 to 1657 cm -1 (Table S5), accompanied by a broadening of bands and decrease in intensity (Fig. 5a).This result is in line with experiment (Fig. 5a and Fig. S18) 53 , the peak shift in ML prediction effectively reflects the effect on temperature, because temperature changes will lead to changes in protein structure which can be well handled by ML protocol, demonstrating good transferability of our ML model to varying external environment factors.
Monitoring folding path of Trp-cage protein.We have verified the variation of amide I IR spectra across a protein folding path.Trp-cage (PDB code: 1L2Y) is a 20-residue mini-protein which has been widely used for studying folding dynamics.100, 000 MD configurations along the Trp-cage folding pathway were retrieved from our previous study. 54Five stages are taken to reflect the evolution from the un-folded strand (S1), slightly folding but Table 1.ML predicts IR protein spectra with the root mean square error (RMSE) and high Spearman rank correlation ( ) indicates the quantitative agreement with experiment.Structures of 12 proteins with different sizes were taken from the Protein Data Bank, representing a diverse range of secondary structure contents, i.e., different fractions of helix and sheet.The IR spectrum of each protein was computed based on 1000 MD configurations.All reported calculation times refer to calculations on eight cores of an Intel(R) Xeon(R) CPU (E5-2683v4 @ 2.1GHz).retaining the coil structure (S25), rapid folding stage with a large amount of helical structures (S50), and to the folded helix system like a cage (S75 and S100).The ML amide I IR spectra, predicted by averaging over 100 MD snapshots for each state, depicted in Fig. 5b.As the folding process proceeds the random coil content decreases followed by an increase in the helix content (Fig. 5b and Table S6), leading to a 10 cm -1 red shift (S1:1652 cm - 1 , S25:1650 cm -1 , S50:1646 cm -1 , S75:1644 cm -1 , S100:1642 cm -1 ) of the dominant peak (Fig. 5b and Fig. S18 and Table S7).This is consistent with recent time-resolved IR experiments 55 and theoretical simulations 56 .

Summary
We have reported a machine learning protocol based on ab initio data for predicting the amide I IR spectra of a protein from its structure.It shows a promise for providing IR spectra characterization of protein dynamics for different proteins under varying conditions, including secondary structure, temperature dependence, and folding status.It significantly boosts the speed of IR spectra simulation compared to conventional quantum chemistry approaches.We are currently improving the transferability of the model by increasing the size of data set and consider explicit solvent effect in the ML training to reduce ML model errors.This approach can be expanded to predict optical properties of proteins in other spectral regimes including UV, Raman, and other techniques including sum of frequency generation (SFG), and multi-dimensional IR and UV spectroscopies.

Figure 1 .
Figure 1.Model Hamiltonian for amide I vibrations in a protein.Machine Learning Protocol for the vibrational Hamiltonian matrix.The direct QC calculations of the necessary molecular quantities are time consuming.Our aim is to predict the vibrational frequency ( i ), transition dipole ( ), and neighboring coupling (J ij ) i parameters from a NN model.The N-methylacetamide (NMA) molecule (Fig.S1) was taken as the model system for NN training.It represents the peptide bond moiety and has been widely used for generating parameters by the empirical map method.11,26For the vibrational couplings between two neighboring peptide bonds, we employed the N-acetyl-glycine-N'-methylamide (GLDP) molecule (Fig.S1), also known as the glycine dipeptide.This molecule has

i
root mean square deviation (RMSD) of the extracted NMA (NMA molecule and surrounding water molecules within a 5 Å radius) and GLDP molecules indicating large conformational changes and low similarity (Fig. S3 and S4), which mitigates issues originated from over correlation in training data.The broad distribution of training data ( i , , J ij ) indicated that the sampling procedure adequately i covered the ensemble of conformations (Fig. S2 and S5), and the resulting data set is appropriate for establishing structure-property relationships via ML training.

Figure 2 .
Figure 2. (a) Correlation between the DFT-computed ( _DFT) (black lines/dots) and NN-predicted ( _NN) (red lines/dots) amide I vibrational frequencies after cross-validation.(b-d) Comparison of the DFT-computed amide I vibrational transition dipole moment in the x, y, z direction ( x,y,z _DFT) (black lines/dots) and NNpredicted ( x,y,z _NN) (red lines/dots) after cross-validation.(e) Amide I vibrational normal modes (a, b) and local modes (c, d) of GLDP with DFT B3LYP/cc-pVDZ.(f) Comparison of DFTcomputed (J_DFT) (black lines/dots) and NN-predicted (J_NN) (red lines/dots) coupling constants of nearest neighboring amide I modes after cross-validation.descriptor, the CM lacks of orientation information for predicting the vibrational transition dipole moment ( ).To remove the i complexity of orientation dependence during the NN training for i , a rotation matrix operation was applied on each NMA, to set the carbonyl C atom as the zero point in the xyz Cartesian Coordinate, the C-O bond along the positive y axis, and the OCN triangle in the x-y plane (Fig. S8).Consequently, the NN prediction of the i for a new NMA also starts with a treatment of transferring back it to the original coordinate by using the inverse of rotation matrix.The elements of CM (NMA:15; GLDP:21) were then used as inputs (Fig. S6) for NN training and the output (size:1) data are then compared with DFT calculations (Fig. 2).A total of five ML models ( i , (x,y,z), J ij ) were obtained to construct the vibration i model Hamiltonian.Machine learning model evaluation.The NN predictive accuracy is reported using the Pearson coefficient (r) and mean absolute deviation (MAE), and its robustness is verified by the standard cross-validation40 procedure.All data sets were randomly and evenly distributed into 10 bins in this procedure.Each bin was used as a test set while the remaining nine bins as training set.We have calculated the learning curves of whole ML process in this work.The learning curves (Fig.S10) indicates that the NN training for vibrational frequency and transition dipole moment converges with 6000 NMA samples, while that of coupling constant needs 4000 GLDP samples.Importantly, there is no significant overfit issues after adding the standard L2 regularization37 treatment to the NN architecture (Fig.S10).It is straightforward to predict the frequency and coupling constants because they mainly depend on the ground state structure.However, since the transition properties (e.g.vibration transition dipole moment) involve two different vibration states, it is expected to see more outliers because these quantities are more sensitive to structural changes.This phenomenon indeed poses a great challenge for NN training (TableS8).With the high Pearson coefficient (r>0.9) and low MAE values (1.761 cm -1 ) obtained in cross-validation, we have achieved accurate ML predictions for the vibration frequency ( i ), transition Machine learning prediction of IR spectra.The ML predicted parameters were applied to construct the amide I band Hamiltonian using the protocol sketched in Fig.3.The protein is split into individual peptide bonds and dipeptides.The i and values i predicted from the NMA NN model are used to generate Hamiltonian diagonal elements and the off-diagonal elements arising from non-neighboring peptides couplings (computing J ij via the dipole approximation), respectively.The J ij values predicted from GLDP NN model are used for generating off-diagonal elements owing to the nearest neighboring peptide couplings.

Figure 3 .
Figure 3. Machine learning protocol for predicting protein IR spectroscopy.Finally, IR spectra were simulated with the model Hamiltonian, using the SPECTRON program developed by Mukamel and coworkers41 .As Fig2, Fig S11and S12shows, our ML model can reproduce the DFT data, and we also make this ML protocol[42][43] and simulation data44 available online to provide rapid protein IR spectroscopy prediction, paving the way for a real-time operation of ultrafast experimental spectroscopy.

Figure 4 .
Figure 4. Good agreement (the quantitative agreement between the predicted and experimental spectra were measured by Spearman rank correlation coefficients45 , see Table1) is obtained between the experimental spectra of the proteins measured in D 2 O (black lines)[46][47][48] and the ML predictions based on 1000 MD configurations (red lines).Intensity is scaled to have the same maximum intensity for each panel.IR spectroscopic assignment by ML for protein secondary structures.Then we applied the ML protocol to simulate the amide I IR spectra of 12 proteins (Fig4 and FigS13).The good agreements between our ML predictions and experimental spectra is evident from the high Spearman rank correlation coefficients ( > 0.80 for 11 cases, except one with 0.71 for the 1DHR) (Table1).This is a widely used measure for the agreement between the predicted and experimental spectra45,[49][50][51] .The structures of proteins are reflected by distinct spectral characteristics, such as the wavelength region for the dominant signal peak 52 : -helices: 1640~1650 cm -1 , -sheets: 1620~1640 cm -1 & 1680~1690 cm -1 , random coil: 1650~1660 cm - 1 .As indicated by Table1and Fig.4, NN predictions can distinguish the -helix and -sheet secondary structures.Thehelical (PDB code: 1MBC) and -sheet (PDB code: 1REI) proteins exhibit the major spectral peaks at 1650 cm -1 , 1634 and 1680 cm -1 , respectively.Proteins containing both secondary structures ( + ) show characteristic peaks for both motifs.Taking the advantage of the speed of the ML model (Table1), we can predict the IR spectra by averaging the NN predicted signals of 1000 MD configurations (which would be prohibitively expensive via direct QC computations), so as to capture the fluctuating dynamics for each protein (Details in Supporting Information).The essential features (both main peaks and lineshapes) of experimental spectra are successfully reproduced by the simulated spectra with high

Figure 5 .
Figure 5. (a) From left to right : Simulated (red line) and Experimental 53 (black line) IR spectra of Ubiquitin at four different temperatures (1.6 °C ~82.6 °C) and the temperature variation of the dominant peak position.(b)The ML-predicted IR spectra of the Trp-cage protein along its folding path (S1 the original unfolded strand structure; S25: slightly folded but retaining the coil structure; S50: folding rapidly with the emergence of helix elements; S75-S100: stably folded protein with helix structures forming a cage.)All spectra are averaged over 100 (1000) MD snapshots for each state of Trp-cage (Ubiquitin).retaining the coil structure (S25), rapid folding stage with a large amount of helical structures (S50), and to the folded helix system like a cage (S75 and S100).The ML amide I IR spectra, predicted by averaging over 100 MD snapshots for each state, depicted in Fig.5b.As the folding process proceeds the random coil content decreases followed by an increase in the helix content (Fig.5band TableS6), leading to a 10 cm -1 red shift (S1:1652 cm - 1 , S25:1650 cm -1 , S50:1646 cm -1 , S75:1644 cm -1 , S100:1642 cm -1 ) of the dominant peak (Fig.5band Fig.S18and TableS7).This is consistent with recent time-resolved IR experiments 55 and theoretical simulations 56 .SummaryWe have reported a machine learning protocol based on ab initio data for predicting the amide I IR spectra of a protein from its structure.It shows a promise for providing IR spectra characterization of protein dynamics for different proteins under varying conditions, including secondary structure, temperature dependence, and folding status.It significantly boosts the speed of IR spectra simulation compared to conventional quantum chemistry approaches.We are currently improving the transferability of the model by increasing the size of data set and consider explicit solvent effect in the ML training to reduce ML model errors.This approach can be expanded to predict optical properties of proteins in other spectral regimes including UV, Raman, and other techniques including sum of frequency generation (SFG), and multi-dimensional IR and UV spectroscopies.