Gaussian Process Models of Potential Energy Surfaces with Boundary Optimisation

A strategy is outlined to reduce the number of training points required to model intermolecular potentials using Gaussian processes, without reducing accuracy. An asymptotic function is used at long range and the cross-over distance between this model and the Gaussian process is learnt from the training data. Results are presented for different implementations of this procedure, known as boundary optimisation, across the following dimer systems: CO-Ne, HF-Ne, HF-Na + , CO 2 -Ne and (CO 2 ) 2 . The technique reduces the number of training points, at ﬁxed accuracy, by up to ∼ 49 %, compared to our previous work based on a sequential learning technique. The approach is readily transferable to other statistical methods of prediction or modelling problems.


I. INTRODUCTION
In any molecular simulation, approximations of the potential energy surfaces (PESs) that describe the relevant interactions are a pre-requisite.Traditionally, these approximations have been made using empirical potentials (forcefields) 1,2 .However, such force-fields have closed functional forms, which limit their capacity to capture the complicated topography of the PES.Furthermore, they are laborious to produce and may fail to capture accurately even the fitted data.As the accuracy of the simulation depends on the potential employed, much work has been devoted to developing forcefields that provide approximations of PESs with quantummechanical accuracy.
Herein, GPs are employed as the statistical technique.An existing example of a force-field that uses GPs is FFLUX 35 , which has been used to approximate the energies of weakly bound complexes 36 and water clusters 37 , among other applications [38][39][40] .Furthermore, in the field of materials science, GP models that invoke a smooth overlap of atomic positions (SOAP) kernel 41 have seen successful applications to many systems [29][30][31][42][43][44][45] .
GP models have also produced promising results in applications to intermolecular interactions [17][18][19] .Initially 17 , the traina) Electronic mail: jack.broad@nottingham.ac.uk (author to whom correspondence should be addressed) b) Electronic mail: richard.wheatley@nottingham.ac.uk ing sets for these models were constructed with Latin hypercube sampling [46][47][48] though sequential design strategies 49,50 , which achieve a prescribed accuracy with a smaller training set, have been shown to outperform such methods 19 .Regardless of training set design, a fixed cross-over distance, R cross , was imposed prior to training.This parameter defines a boundary beyond which a simple, long-range asymptotic function takes over prediction from the GP 17,19 .A similar approach is used in materials science, in which the contribution by one atom to the neighbour density of another is assumed to be zero if the two are separated by a distance in excess of R cross 31,44 .Cross-over distances have been applied alongside other statistical methods of prediction too 5,51 , with neural network [9][10][11][12] and moment tensor 15 models of PESs employing fixed, pre-determined cross-over distances.
Use of a cross-over distance limits the portion of the PES approximated by the statistical method to the region around the potential well.Consequently, fewer training points are required to develop an accurate model.Previously these distances were all fixed and specified a priori, however, they can be learnt from the training data.Herein the work of Uteva et al. 17,19 is extended into a sequential design strategy in which R cross varies as a function of the number of training points and the species of the interacting atoms.This produces models of the same accuracy with fewer training points than the same design strategy with a fixed R cross .The process by which R cross is optimised is referred to here as boundary optimisation.

A. GP Regression
All GP models herein make predictions via GP regression, with detailed descriptions of GP regression theory available elsewhere [52][53][54] .This section briefly introduces the key concepts in the context of intermolecular potentials.For any intermolecular interaction the PES can be thought of as a multivariate function f (x), x ∈ R Z , where x is a vector of inputs and Z is the number of elements in x.Here the inputs are inverse interatomic separations, though promising results have also been obtained with Morse variables 24,55 .Only pair intermolecular interactions are considered, although GP models have been applied successfully to non-additive interactions 17,19,40 and the methodology outlined here can be extended to such cases straightforwardly.
When the outputs f (x i ) = Y i are available at N values of i, the set {x i , Y i } N i=1 forms the training set.Letting Y be a vector of the observed energies from this set, GP regression can be used to approximate the value of f (x) at a new point x * as where K * is a vector of the covariances between x * and all x i , and K is the positive-definite covariance matrix, Here σ 2 n is the Gaussian noise variance, which accounts for the noise in the training data.
All entries in K are found by evaluating a covariance function, k(x, x ), where x and x are configurations from the training set.The same holds for K * , only with one training configuration replaced by the new point x * .A common example is the squared exponential covariance function, for which Here D is the total number of interatomic distances which comprise unique atomic pairs, σ 2 f is the signal variance, x d and x d are the inverse separations in x and x which contain the dth atomic pair, and l d is the lengthscale for the interaction between this pair of atoms.The hyperparameters of the covariance function are σ 2 n , σ 2 f and the set of lengthscales, l d .This covariance function can be modified to account for the fact that two configurations can have equivalent energies due to symmetry.The result is the symmetric squared exponential covariance function, k sym (x, x ), investigated by Uteva et al. 17 , which is used here.The set of all permutations of the interatomic distances in x under which the energy surface is unchanged is denoted as P, while p is a single permutation within this set.Assuming that l d is invariant under interchange of x d and x d , where k(x, x ) is the squared exponential covariance function from equation 3. GPs also employ a mean function, which is taken here to be zero everywhere.
To specify the hyperparameters of k sym (x, x ) that produce a model that achieves the best fit to the training data, the log of the marginal likelihood function, log(L ), is maximised 52,54 log(L Equation 5 shows that this optimisation entails inversion of the N x N covariance matrix.This incurs a sizable computational cost, which translates to scaling of order O(N 3 ) for hyperparameter optimisation 56 .Consequently, though they achieve higher predictive accuracies than other statistical methods when modelling PESs 51,57,58 , GPs require greater computational effort to train 51 .For prediction, meanwhile, K −1 Y in equation 1 needs to be calculated once only, and as a result the cost scales linearly with N.Even so, molecular simulations that employ a GP model for which N is large are more computationally intensive than those that use a traditional forcefield.These issues have led to attempts to minimise the training set size required by a GP model to achieve a given error.For two training strategies that achieve the same error, that which does so with fewer training points is more computationally efficient.Attempts to develop more computationally efficient training strategies have involved active learning or sequential design methods 19,29,59 , composite kernels 24 and new sampling methods 27,60 .
However, no attempt was made in these previous works to increase training efficiency by learning the optimal crossover distance from the data used in training.Our boundary optimisation approach has the potential to increase the gains in efficiency from these prior methods further still with little associated increase in computational time, provided a viable long-range approximation is available.Such an improvement is shown here, building upon the sequential design method of Uteva et al. 19 .

B. Training Set Design
It was found by Uteva et al. 17 that the predictive performance of a GP for a PES is enhanced by using inverse interatomic separations rather than non-inverse separations to describe intermolecular configurations.This is because k sym (x, x ) is a stationary kernel (i.e. it relies only on the distance between the two configurations being compared), meaning it assumes that the rate of change of the output (i.e. the interaction energy) with the input (the interatomic distances) is constant.This is not the case for a PES, where the energy varies rapidly with separation in the short-range repulsive wall but barely at all in the long-range asymptotic region.The r → r −1 conversion addresses this issue: for low r the rate of change in r −1 is faster than in r, meaning the change in the input more closely matches the change in the output under the former; meanwhile, for large r, the rate of change in r −1 is far smaller, as is that of the energy.
Previously, Latin hypercube (LHC) sampling, which is a space-filling design, has been employed to build data sets when modelling PESs with GPs [17][18][19]61 . Theapproach of Uteva et al. 17 entailed designing LHCs in inverse separations over a range of angles that specified the symmetry-distinct region 17 .The algorithm generates a large number of candidate LHCs and finds the minimum separation in each.The LHC with the largest such separation is selected under what is referred to here as the 'maximin' criterion. Quntum chemical calculations are undertaken on the selected LHC only.This LHC is then subject to a high energy cut-off, E cut , and any configuration for which the interaction energy exceeds E cut is discarded from the LHC 17 .In addition, a geometric constraint of 8.5 Å was placed on all LHCs to ensure no configurations with minimum separations above this threshold were included 17 .It was observed that at separations greater than the geometric constraint the GP predictions tended towards a small, non-zero constant 17 .As these predictions should tend towards zero, a long-range function derived from the multipole expansion 62 was introduced to approximate energies at large separations.
In a later work, Uteva et al. 19 presented active and sequential learning approaches, in which the model of the PES is progressively refined by adding new points to the GP.In this approach, training and validation of the GP models involves three data sets: a training, a reference and a test set.The training set is used in the GP regression.The reference set provides a pool of configurations from which new training points can be selected, while the test set determines the GP's accuracy against an independent data set.
One sequential design strategy presented by Uteva et al. 19 is the highest error search.This approach selects new training points based on the configuration in the reference set with the largest GP prediction error.We adopt this approach herein.However, Uteva et al. 19 applied this strategy only at separations below the fixed, pre-determined cross-over distance 19 , meaning no new training data could be added at separations above R cross .
The highest error search is best described as a sequential design rather than an active learning 63,64 method.This is because active learning methods are a subset of sequential design methods that compute the output corresponding to an input only at the point it is added to the training set.This means, in the context of modelling PESs, an active learning method would use a reference set comprising configurations for which the energies were not calculated.Active learning strategies have, however, been successfully used to develop GP 19,29,59 , moment tensor 15 and neural network 13 models of PESs, and the methodology used here for boundary optimisation could be altered for use with the two set method of Uteva et al. 19 , which is an active learning method.
When using a sequential design method, meanwhile, the energies in the reference and test sets are pre-computed using a relatively computationally inexpensive ab initio technique.Once a model with the requisite predictive accuracy is obtained, the energies in the minimal training set can be recalculated using a more accurate, and costly, ab initio method before use in applications.This process is known as transfer learning 65,66 and it allows a GP model of a PES to be produced with the accuracy of a high-level ab initio technique with relatively few computationally expensive calculations.For example, MP2 67 energies have been upgraded to CCSD(T) 68 energies for calculation of the CO 2 -CO second virial coefficient 17 .In previous applications of both LHC learning 17 and sequential design methods 19 a fixed value of the cross-over distance, R cross , was determined a priori.Rather than fix R cross prior to training, however, it is possible to determine its optimal value from the reference data.This is boundary optimisation and is possible because a sequential design method permits R cross to be varied each time the GP is updated.Consequently, the size of the region over which GP regression is used for prediction will grow with the predictive accuracy of the GP.
Boundary optimisation may lead to increased accuracy for the following reason.When the number of training points, N TP , is low the predictive accuracy of the long-range function at the outer edge of the potential well will exceed that of the GP.Consequently, it is anticipated that allowing the longrange function to approximate the energies of configurations in this region at low N TP will increase the predictive accuracy of the overall model and facilitate more efficient model development.Both the reference and test sets for each chemical system were generated under the LHC design strategy discussed in I B. All energies were calculated in Molpro 69 using MP2 with an aug-cc-pVTZ basis set and the counterpoise correction with the only exception being HF-Na + , which used an augcc-pwCVTZ basis set instead.The specifications of the LHCs for each system are given in table I. Therein r is the distance between the bond centres (not centres of mass), θ is the angle between r and the bond axis of the molecule, θ 1 and θ 2 are the angles between r and the bond axis of the first and second CO 2 molecules respectively, and φ is the torsional angle between the two CO 2 molecules.The molecules were kept rigid, with r CO = 1.1283Å for CO, r CO = 1.1632Å for CO 2 and r HF = 0.9170 Å.However, boundary optimisation could be applied to non-rigid systems straightforwardly, with the only additional requirement being generalisation of the long-range function to non-rigid molecules.A larger geometric constraint of 100 Å (instead of 8.5 Å 17 ) was employed to probe the long range behaviour of the system.The maximum value of N TP was 100 for all systems apart from (CO 2 ) 2 , which used 300 training points at most.
A high energy cut-off, E cut , was applied to the reference and test sets to remove configurations with interaction energies which exceeded its value.E cut = 0.005 E h for all systems apart from HF-Na + , for which E cut = 0.05 E h because the charge-dipole interaction increases the well-depth.(1 E h ≈ 27.211 eV ≈ 2625.5 kJ mol −1 .) For systems with N test N ref the independence of the test set is self-evident.However, for systems where N test ≈ N ref it is also true that the test set is independent.This follows because, although the reference and test sets for each system were designed using the same 'maximin' strategy, the stochastic nature of the LHC algorithm means that separate LHCs contain completely independent sets of configurations.Fur-TABLE II: The properties included in the multipolar long-range functions of each system, as well as the ab initio methods used in their calculation.All calculations were carried out using an aug-cc-pVQZ basis set apart from those for the dispersion coefficients, which used an aug-cc-pVTZ basis set.

System
Dipole Quadrupole Polarizability Dispersion CO-Ne X HF-Ne X HF-Na + Level of Theory MRCI 71,72 MRCI MP2 CCSD thermore, the 'maximin' criterion is based on just one separation in the whole data set, meaning that two LHCs with similar maximin will still be dissimilar.This is demonstrated in figure 1, which shows the energies against the inverse C-Ne separation for the reference and test sets used in training models of the CO 2 -Ne potential.

C. Long-range asymptotic functions
For the long-range energy model, multipole series were employed for all systems apart from HF-Na + .The contributions included in the multipolar long-range functions are shown in table II for each system apart from CO 2 -Ne, for which the function is already described in previous work 17 , and (CO 2 ) 2 .The latter was developed prior to the other multipolar longrange functions and uses atomistic charge, dipole, quadrupole and polarizability contributions from Hartree-Fock 70 theory, which were scaled to give the known total molecular properties.For HF-Na + a fitted long-range function was used, which was derived by fitting a sum of two power laws between two points where the energy was predicted by GP regression.More information on the motivation for and derivation of this function is found in appendix A.

D. Classification of Phase Space using a Boundary
When modelling PESs using GPs it is often necessary to classify configurations as suitable for prediction via the GP or via a long-range asymptotic function.In the work of Uteva et al. 17,19 the classifier formed a boundary from the superposition of atom-centred spheres defined by a single constant, R cross .Specifically, if any interatomic distance was less than R cross the GP was used.As R cross was fixed, this classifier is referred to here as C fixed .Denoting the region in which GP regression was used for prediction as A GP and the region which employed the long-range function as A LR , under C fixed these regions were and where r is a set of intermolecular atom-atom distances, and min(r) is the smallest separation in r.
In boundary optimisation, R cross varies according to the GP accuracy.Thus R cross is not constant but a model parameter, which is learnt from the reference data.As this classifier is still parameterised by a single value, R cross , it is referred to here as C single .
More elaborate classifiers are possible by using more detailed parametric forms to define the boundary region.A simple way of defining a more complex classifier is for the value of R cross to depend on the atom types that comprise the interatomic distance.The resulting classifier is referred to here as C multi .For a system of molecules with D interatomic pairs of chemically different atoms, using C multi requires the vector of cross-over distances R cross = (R 1 , ..., R d , ..., R D ).This defines a multiple-parameter boundary region, as follows, where min d (r) is the minimum separation in r that involves an atomic pair of type d.
The optimum value of the classifier parameters are determined by minimising the error between the model and the reference set (i.e. by minimising the training error), meaning the sizes of A GP and A LR vary with the GP.The sum of squared errors, SSE tot , over the two regions is where Here, "method" denotes either GP or LR, N method the number of points in A method , Ŷi the prediction of the energy for the ith configuration from the desired method and Y i the calculated energy of the same configuration.The RMSE against the test set, RMSE test , is given by where SSE test is SSE tot over the test set and N test is the number of configurations in the test set.RMSE test is therefore a function of R cross with discrete steps, as the RMSE changes only when a change in R cross causes a configuration in the test set to be re-classified.Both C single and C multi are simple, parametric classifiers that pre-impose a mathematical form on the classification.Hence, neither is expected to be optimal with respect to the RMSE against the reference or test sets.That is, a more complicated boundary than that described by these classifiers will likely produce a lower RMSE against a given data set.However, it is shown later that an artificial 'ideal' classifier produces only very marginal improvements over C multi , suggesting this classifier provides a very good balance of simplicity and accuracy.

E. Overall algorithm
Using the data and calculation methods above, the algorithm generates a GP model sequentially as follows: train the GP to the current training set; select the classifier parameters by minimising the RMSE against the reference set; move a new point from the reference to the training set based on the largest error.Below are further details of each step, along with how the choice of classifier and placement strategy affects the algorithm.

GP training
All GPs described herein were trained using the GPy 73 package in Python 2.7.Optimisation of the hyperparameters was carried out by maximising log(L ) using 20 independent restarts whenever a configuration was added to the training set.Moreover, a gamma distribution with an expectation of one and a variance of two was used as a prior on all hyperparameters for all systems.This was to weakly penalise large hyperparameter values, given that the expected values are typically of order 0.1 or below.

Direct Search Algorithm
When using C single , R cross can be optimised via a direct search that exploits how the RMSE varies as a piecewise constant function of R cross .Because of this feature, all possible values of SSE tot (and hence the RMSE) can be computed readily against the reference set.A direct search of these values is guaranteed to find the global minimum.Full details of the direct search are given in algorithm 1.

Algorithm 1 Direct Search Algorithm
1: Compute the squared errors for the GP and long-range function and min(r) at each configuration in the reference set.The direct search algorithm is computationally cheap for the following reasons.Each instance of the direct search requires only a single set of predictions from each of the long range function and GP regression.As the long range function is fixed throughout the sequential design process, these predictions need only be calculated once at the start.The GP pre-dictions need to be re-calculated whenever the GP is updated, which occurs once per step of the sequential design.However, these predictions are already required by a sequential design step when choosing the new training point.Furthermore, calculation of all possible squared error values (step 4) is cheap because calculating each value in order requires only a simple update of SSE LR and SSE GP .Consequently, the direct search is fast compared to the other steps of the sequential design algorithm and can therefore be undertaken at each design step with negligible additional computational effort.

Orthogonal direct search algorithm
An adaption of the direct search algorithm is required for C multi , as this requires a multidimensional optimisation of multiple classifier parameters.This is called the orthogonal direct search as it optimises one element of R cross while keeping the others fixed.The single element that varies is optimised using the direct search algorithm, as this is guaranteed to return the best minimum along that 1D slice of R cross .Although orthogonal optimisations can be time-consuming, the speed of the one-dimensional direct search means that repeating it multiple times for all cross-over distances is feasible.c: For each column in M, order the values from smallest to largest to produce an ordered list in each column.3: For each restart, select a row from M at random to be the initial guess at R cross .4: For each d in turn, fix all cross-over distances apart from R d and find the optimal value of R d using a direct search.5: Repeat step 4 until NI = NI max or the elements of R cross remain unchanged; save this R cross and its corresponding SSE tot .6: Repeat steps 3-5 until NR = NR max .7: Select the R cross that corresponds to the lowest value of SSE tot .
The orthogonal direct search proceeds via algorithm 2. As in the one-dimensional direct search, the square errors of the long-range and GP predictions at each configuration are precomputed for this algorithm.The orthogonal search algorithm is not guaranteed to find the global minimum, as local minima may exist in the RMSE landscape.Hence, NR max restarts are performed with randomly selected starting points.Each restart involves NI max optimisations of each R d .Here NI max = 15 and NR max = 5 for all systems.If the values in R cross converge such that further one-dimensional searches in any orthogonal direction do not change its elements prior to NI = NI max , R cross is saved and the next restart undertaken.In fact, it was rare that NI reached NI max for any of the systems ex-plored here.Moreover, despite the low NR max employed, the same minimum was usually found across multiple restarts.
The direct search offers a method that is at once very fast, designed for the discrete-stepped surface and perfect in a single dimension.This latter property means it skips over any local minima in a given orthogonal direction.As such, optimal cross-over distances are obtained quickly and reproducibly under this method.Typically, about 80 % of random restarts return the same set of cross-over distances for a given system and a given training set.

Training Point Placement Methods
Once the classifier parameters are optimized, the PES model is specified and the next training point determined from the highest error method.The point with the greatest error can be selected either from A GP alone or from the union of A GP and A LR .Using A GP alone is referred to as the constrained placement method, which proceeds via algorithm 3. Steps 2-5 of this algorithm comprise a stage of training, with the RMSE test also calculated at each such stage.Choosing new points from the union of A GP and A LR is named the open placement method.This proceeded identically to the constrained placement method (algorithm 3) except the highest error point was from either A GP or A LR .In each region this point is found using the corresponding method of prediction (i.e. the error of the long-range function was used in A LR ).
Models were trained under both point placement methods because each held potential advantages over the other.The constrained placement method ensures all training points are added in the GP region and so are of immediate use in prediction.Meanwhile, the open placement method is capable of immediately placing points in regions of the PES where the long-range function performed poorly, potentially transferring these regions to A GP more rapidly than under the constrained placement approach.

Closest Model Training Strategy
In the present work, all training strategies used are combinations of a classifier and a point placement method.A further strategy, which is intended for comparison only, is the closest model training strategy.This employs the open placement method to select training points and classifies a configuration using C optimal , where Here, SE method is the squared error in the prediction from "method" at r. C optimal is so named because it classifies a configuration based on whether GP regression or the long-range function best approximate its energy.Equation 12shows that C optimal employs no boundary and so requires prior knowledge of the energy of a configuration in order to classify it.Consequently, models obtained from the closest model strategy are unsuitable for prediction.However, this method represents an 'ideal' classifier that is guaranteed to find the optimal RMSE test for a given GP model.Hence C optimal is useful for estimating how inaccuracies in the parametric classifiers affect the training efficiency.If models from C optimal significantly outperform the other classifiers, this suggests the other classifiers are too simple to properly approximate the true boundary.
There are circumstances where C optimal may not give an optimal RMSE against the test set.There may be short-range hypersurfaces where the interaction energy is predicted nearexactly by the long-range function due to chance.Points in the reference set that are near these hypersurfaces will be classified by C optimal as part of A LR instead of A GP .If the configuration density of the reference set around such a hypersurface is insufficiently high, no training points will be added in its vicinity.Consequently, points in the test set near to the hypersurface will be inadequately approximated by either method of prediction; the GP will perform poorly due to a lack of nearby training points and the long-range function will be inaccurate for short-range points that are not extremely close to the hypersurface.This problem was avoided by using dense reference sets and by not using the constrained placement strategy with C optimal .

Combining Classifiers and Placement Strategies
The sequential design method, or training strategy, requires a choice of classifier and placement strategy.The combinations of these examined in this work are given in table III.Methods that do not involve C optimal can make predictions and so are suitable for applications.The method involving C optimal can only classify points if the true energy is already known, and so is only useful to estimate the loss in performance due to inaccuracies in the parametric classifiers.The fixed boundary method corresponds to the method of Uteva et al. 19 and is included to allow comparison with this prior method, on which the new methods build.The method of Uteva et al.

III. RESULTS AND DISCUSSION
Comparisons of the performances of the different training strategies are made using the HF-Ne, HF-Na + , CO-Ne, CO 2 -Ne and (CO 2 ) 2 potentials.These were selected as they provide a range of interaction types and well depths to test the robustness of the new training strategies.
The number of training points is N TP , which in each system was less than 10 % of the number of configurations in the corresponding reference set (see table I and the related text).Consequently, the training sets for the models discussed are candidates for transfer learning because their small size makes, for example, CCSD(T) calculations possible for the whole training set.Boundary optimisation is anticipated to improve training efficiency as, for small N TP , the long-range function will outperform the GP at the outer edge of the potential well.This is illustrated in figure 2 for the (CO 2 ) 2 system, Figure 2 shows that the predictive accuracy of the longrange function exceeds that of GP regression for configurations with C-C separations above the cross-over distance.Despite this, in the fixed boundary method these configurations would be predicted with the GP.This not only reduces overall accuracy, but also means new training points must be placed at long range to address this, when allowing the long-range function to predict these energies would give adequate accuracy.
The impact of the above on training efficiency compared to fixed boundary training is shown for HF-Ne in figure 3.For C multi the RMSE falls faster with N TP compared to C fixed , indicating improved training efficiency.This improvement is most pronounced when N TP is low and so RMSE test is high.This is expected because increasing the size of the training set increases the size of the GP region, meaning R cross in the boundary-optimised strategies will approach the fixed value of 8.5 Å.Consequently, the difference between the fixed boundary and boundary-optimised models will close as N TP increases.Equivalent plots to figure 3 for all training strategies for all systems are found in the supplementary material.The RMSE data are somewhat noisy.Possible causes of this noise are minor variations in the hyperparameters upon retraining, the discrete nature and stochastic design of the reference set and the low values of N TP meaning that addition of a single point has a non-smooth effect on the RMSE.The new methods herein have considerably lower noise than the prior fixed boundary method.
To compare the efficiency gain over fixed boundary training across all new strategies from subsections II E 5 and II E 6, a metric, E, is used.This compares the N TP required by two different strategies to achieve a given RMSE test .When com-  3 for the HF-Ne potential.Fits are made in the region where the RMSE decays as a power law of N TP .For all systems this corresponds to 1 x 10 −6 E h ≤ RMSE text ≤ 1 x 10 −4 E h , apart from HF-Na + where the region is 1 x 10 −5 E h ≤ RMSE text ≤ 1 x 10 −3 E h due to the larger high energy cut-off for this potential.The fits provide continuous lines to interpolate to any RMSE test within the above range.This enables a comparison of N TP between training strategies at fixed RMSE test that accounts for the somewhat noisy data.Moreover, this is the range of errors in which the models are useful for applications and the decrease in log(RMSE test ) with log(N TP ) is linear.
The fitted equations for each training strategy for the CO-Ne system are given in table IV as power laws in N TP .Equivalent tables for all other systems are found in the supplementary material.As the data are noisy, these tables also show the R 2 value of each fit.In the case of CO-Ne, these evidence the high quality of the fits.In fact, no fit from any training strategy for any system achieves an R 2 value lower than 0.8907 (from the fixed boundary training strategy on the HF-Na + system).Furthermore, figure 3 and the corresponding plots in the supplementary material show that the RMSE data follow a straight line (in a log-log plot).This implies that the R 2 arises from scatter in the data rather than unsuitability of the fitting function.
Re-arranging the equations in table IV provides expressions for N TP in terms of RMSE test for the CO-Ne potential.These give E as a function of RMSE test , via equation 13, for all new training strategies herein.This was done for all other potentials as well, with plots of E against RMSE test for all training strategies for each system given in figure 4.
As observed earlier, the training efficiency gains in figure 4  4).
Generally, the closest model strategy generates the largest efficiency gain, while the smallest improvement comes from strategies involving C single .Efficiency gains only slightly below those from the closest model strategy are achieved by the strategies that use C multi .This hierarchy of improvement implies that the choice of classifier, rather than point placement strategy, is most important because strategies with the same classifier perform more similarly than those with the same point placement method.The closest model strategy is included only to illustrate the total training efficiency gain possible from an 'ideal' classifier, and it is encouraging that the best boundary optimisation methods are close to this ideal case.Indeed, the difference in E between this ideal method and the closest-performing C multi strategy never exceeds ∼ 3 % for any system.This implies that C multi captures the true shape of the boundary region for the systems explored sufficiently.Thus introducing a more detailed classifier would not be worth the increased cost of evaluating the PES.
C multi always outperforms C single .However, for the CO 2 -Ne potential (figure 4d), C multi also outperforms the closest model strategy.This is because the long-range function is nearly exact for a group of short-range configurations in the reference set.This is shown in figure 5 by the thin 'peninsula' of points that are best estimated by the long range function (in red) which encroaches deep into the GP region (in blue).This 'peninsula' exists because the long-range function is of higher energy than the MP2 data in some regions of the PES and of lower energy in others, meaning there must be some hypersurface in between where the two are equal.Prediction for test configurations close to the 'peninsula' is problematic under C optimal unless the reference set is very dense in this region.This is because the exact predictions of the long-range function on the 'peninsula' mean no training points are added there, leading the closest model strategy to perform relatively poorly for the CO 2 -Ne potential.
The total gain in training efficiency achieved by boundary optimisation varies somewhat between systems.While the best-performing training strategy for HF-Ne improved training efficiency by between 25-39 % (i.e.E = 61-75 %), for (CO 2 ) 2 the gain was only 12-18 % (E = 82-88 %).The more limited gains for (CO 2 ) 2 may be because the a priori choice of R cross = 8.5 Å, required for the fixed boundary method, is reasonable for this system.Nevertheless, even for the (CO 2 ) 2 po-0 1 2 3 4 5 6 7 8 x / Å 0 In contrast, the fixed boundary training method adds its fifth training point at a minimum separation above 5 Å and its eighth at 8.5 Å.This demonstrates that a fixed boundary strategy switches between placing training points in the repulsive wall and at the boundary from the onset of training.This difference is because fixed boundary training requires the GP to predict energies at separations up to 8.5 Å from the start of training.Consequently, training points must be added at separations near the 8.5 Å boundary from the onset, even though the energies at these separations are very small and generally well-approximated by the long-range function.This contrasts with the boundary-optimised and closest model strategies, which allow the long-range function to approximate con-   Training point plots for the CO-Ne potential, given in figure 7, show that the single-open strategy extends A GP much further than single-constrained training.This suggests that the capacity to place points in A LR facilitates faster expansion of the boundary for this system.However, these plots also indicate that this discrepancy is most noticeable when the number of training points is large; specifically, rapid expansion of the boundary under single-open training was instigated by placement of a single training point at long range, which facilitated Similarities in the evolution of the boundary are seen when the cross-over distances achieved under C multi and C single are compared for a given point placement method.Figure 9 shows the results of such a comparison for the HF-Na + and CO 2 -Ne systems.The cross-over distances generally grow at a similar rate, but the value of R cross is consistently larger than all values in R cross .This suggests that there are differences between C single and C multi , which manifest in both the training efficiency and boundary placement.The larger GP region under C single compared with C multi is explained by noting that both are approximations of an 'ideal' classifier.When N TP is low (i.e. one or two), such a classifier will attempt to make A GP as small as possible because the training set comprises configurations from the repulsive wall only.Consequently, given the same training set, C single and C multi also try to minimise the size of A GP .As C multi is more flexible its approximation of the 'ideal' classifier will be closer than that of C single , meaning A GP under C single will be initially larger.Other than the repulsive wall, the largest errors tend to occur at separations around the boundary.Thus, due to its larger A GP , a C single strategy will place its first long-range training point at a larger separation than a C multi strategy.This facilitates faster expansion of the boundary under C single than C multi regardless of which point placement method is used.Figure 9a shows that R H-Na + < R F-Na + throughout most of training for the HF-Na + potential, meaning that the interaction involving the larger atom (F) obeys a larger cross-over distance.Such ordering of the cross-over distances also applies to the HF-Ne system for most of training.Moreover, figure 9b shows that the ordering of the cross-over distances for the CO 2 -Ne potential is consistent, with R O-Ne < R C-Ne throughout training.In fact, for the (CO 2 ) 2 and CO-Ne potentials it holds generally throughout training that R O-O < R O-C < R C-C and R O-Ne < R C-Ne respectively for both point placement methods.This implies that the ordering of the cross-over distances under C multi is consistent between systems as well as being physically reasonable.
Additionally, figures 6-9 show that the cross-over distance increases with the size of the training set.This is expected  The resultant more accurate GP enables larger cross-over distance(s).This is the case for all training strategies across all systems.
Boundary optimisation methods decrease the cost of using PES, by reducing the number of training points and the size of the GP region.Fewer training points means GP evaluations are cheaper, as the GP cost is proportional to N TP .Additionally, the lower R cross or R cross of the boundary-optimised model means that in any application the GP would be used less often, in favour of the much cheaper asymptotic function.Hence boundary optimisation produces models that are more efficient to implement in applications than fixed boundary training, with no reduction in overall accuracy.
Boundary optimisation also increases the reproducibility of training.That is, for two separate models from the same training strategy and reference set, the results will be more sim-  10 shows that for fixed boundary training separate runs were identical only up to N TP = 27.While the exact reproducibility in figure 10a is not present for the other potentials, there is generally significantly less difference between independent runs of the boundary-optimised training methods than for fixed boundary training.For example, models of the HF-Ne potential from single-constrained and single-open training are reproducible up to N TP = 30 and N TP = 41 respectively, compared with N TP = 16 for fixed boundary training.Equivalent plots to figure 10 for this system under these strategies are given in the supplementary material.This reproducibility increase compared to fixed boundary training does not transfer to the use of the C multi strategies, likely because a direct search is not as reproducible in multiple dimensions as in a single dimension.However, from the R 2 values in table IV it can be seen that use of either C multi or C single reduces the scatter in the RMSE test data relative to use of C fixed , as evidenced by the larger R 2 values achieved when training with the former two.This trend is repeated across all other potentials examined here, which suggests that use of boundary optimisation leads to more stability in selection of the hyperparameters and hence more consistent predictions.
For the (CO 2 ) 2 system, reproducibility is seen not just for repeat runs of identical training strategies but between the training strategies that use C single .This is illustrated in figure 11, which shows that the single-constrained and single-open training strategies choose identical training points until N TP = 210.Such an observation explains why the two methods have identical E in figure 4e.This is because when N TP = 210 the RMSE test values for single-open and single-constrained placement were 2.817 x 10 −7 E h and 2.784 x 10 −7 E h respectively.Consequently the error was too low at this N TP to be included in the fit, meaning the two strategies were identical over the RMSE test values used in fitting.

IV. CONCLUSION
It has been shown that boundary optimisation produces GP PESs of the same accuracy using fewer training points than fixing R cross a priori.This improvement in efficiency is hierarchical, with a boundary defined by a single, variable crossover distance offering a modest improvement and a boundary defined by multiple such distances facilitating a further gain.
The results presented imply that the classifier is more important to the training strategy than the point placement method.In the RMSE range that is suitable for first principles calculations (∼2 x 10 −5 E h ) the boundary optimisation methods typically reduce the required number of training points by 15-33% relative to a training strategy that is already established as efficient 19 .Because of their reduced training set size, the resulting boundary-optimised PESs are strong candidates for transfer learning, in which the existing ab initio calculations are upgraded to a higher level of theory.Furthermore, as the size of the GP region increases with the size of the training set, only as needed, the resulting GPs are also less computationally intensive in applications than fixed boundary methods, as they employ the GP over a smaller region of phase space.
The classifier C multi , which uses different cross-over distances for difference atomic pairs, performed almost as well as an 'ideal' classifier.Across all systems, the largest difference in performance between the closest model strategy, which uses an 'ideal' classifier, and nearest boundary-optimised strategy was ∼ 3 %.In all cases the best-performing boundary-optimised strategy employed C multi , implying that a classifier comprising of a spherical boundary with a unique radius on each unique atomic pair, captures the true boundary effectively.Further refinement of the classifier would not result in a sufficient reduction in training points to justify the extra classifier expense.
The cross-over distance(s) are learned from the reference data under boundary optimisation using a direct search.This is sufficiently fast to be used at every stage of training whether one or many cross-over distances are employed and in the multi-dimensional case returns cross-over distances in a physically reasonable order.Moreover, both direct search algorithms can be used easily in conjunction with another machine learning technique or on another chemical interaction.In fact, a direct search can be applied to any problem whereby a boundary is sought between a good method of approximation in one region of phase space (the long-range function here) and a machine learning technique in another.This means that the methodology which underpins boundary optimisation is both fast and flexible, in addition to being effective in solving the problem of reducing the computational expense associated with training a GP model of an intermolecular PES.
Physical systems in which the behaviour crosses over from a simple asymptotic function to more complicated behaviour are common in many fields.A prominent example is the transition from ideal to non-ideal gas behaviour.As the boundary optimization techniques herein exploit this cross-over in behaviour, there are many potential applications of this technique to physical problems beyond intermolecular potentials.

V. SUPPLEMENTARY MATERIAL
See supplementary material for more plots of results from the explored systems.functions derived by the empirical method (red) and the method (green) for the HF-Na + potential in a linear configuration with the F proximal to the Na + (i.e. with cos(θ ) = 1).Also shown for comparison are MP2 energies (blue) for this configuration.
Evidence of this is given in figure 12, which also highlights the superiority of the predictions of the empirical long-range function.The RMSEs of each method of prediction were separated by two orders of magnitude, with the empirical function achieving an RMSE of 3.07 x 10 −7 E h versus 4.15 x 10 −5 E h for the multipolar function against the MP2 data shown in the figure .Figure 12 evidences that the multipolar long-range function captured the power law of the interaction energy for this system in this configuration, with the source of its error being that it was offset from the calculated energies.This offset was a product of the energies being calculated using MP2 while many of the properties used to derive the multipolar long-range function were calculated at higher levels of theory (see table II).In the other systems, the small magnitude of the long-range energies meant that this disagreement was negligible and the multipolar long-range function was usable.However, in the case of HF-Na + , long-range energies with larger magnitudes were commonplace due to the strong repulsive and attractive interactions between the H-F dipole and the Na + cation.This exacerbated the difference between the predictions of the multipolar long-range function and the MP2 energies.
To approximate accurately the long-range data without upgrading the reference and test data to a higher level of theory, a long-range function was produced by fitting directly to these data.This was the empirical long-range function.Taking r to be the distance between the centre of the H-F bond (not centre of mass) and the Na + , this function estimated the energy, E, as a sum of power laws, E = Ar −2 + Br −3 . (A1) In doing so, the empirical long-range function exploits that the dominant powers of r in the HF-Na + interaction are known to be -2 and -3 but assumes that the coefficients of these terms are unknown.
As the energy varies differently with r when θ changes, a sum such as in equation A1 must be fitted for every configuration in a given data set for which θ is unique.For a given θ the coefficients in equation A1 can be found using simultaneous equations, which are set up by following algorithm 4. In all cases, r min = 8.5 Å and r max = 10.5 Å, though the power laws which resulted were accurate up to 100 Å.Furthermore, both GPs in the algorithm were trained on ∼ 300 training points from a LHC design.
Algorithm 4 Coefficients for the Empirical Long-range Function 1: Train a GP, GP min , on a range of θ values at separation r = r min ; do the same for another GP, GP max , at r = r max .2: For a given θ , predict E min and E max using GP min and GP max respectively.3: Set up simultaneous equations of the form shown in equation A1: one with E = E min and r = r min , and another with E = E max and r = r max .4: Solve the equations from step three for the coefficients for the current θ .
By repeating steps 2-4 in algorithm 4, a sum of power laws was determined for every θ value in the reference set and test sets.Fitting to the latter was possible as knowledge of the energies in the set over which fitting was undertaken was unnecessary.This was because the GP predictions were based on their respective training sets, and r and θ were found from the inverse interatomic separations at each configuration alone.
Deriving a long-range function from the predictions of two GPs could be problematic if a transfer learning approach were to be invoked as the data used to train these GPs would itself need to be upgraded.For the HF-Na + potential this would not be an issue because increasing the quality of the training data would increase the quality of the fit from the multipolar long-range function.
However, when an empirical function is the only option for modelling the long-range data, upgrading the data used in training GP min and GP max would be of considerable computational expense.This is because, currently, each comprise ∼ 300 configurations.Such an issue could be circumvented by using a sequential design strategy to build minimal training sets for these GPs, which could then be upgraded instead.Furthermore, the training sets used for GP min and GP max are independent of that used to train a GP on the wider PES.Letting the number of training points in the training sets of GP min and GP max be Y and that in the training set of the other GP be Z, the cost of short-range predictions in any simulation would scale linearly with Z and the cost of any long-range predictions would scale linearly with 2Y.Given that the training of GP min and GP max take place at a fixed separations, it is likely that 2Y < Z if all GPs were trained under a sequential design strategy.This means that the predictions of the empirical longrange function would not be the computational bottleneck and that such a function is suitable for use in simulations.Finally, the method is sufficiently flexible that the GP predictions can easily be replaced by those of another statistical method.

Algorithm 2
Orthogonal Direct Search Algorithm 1: Choose limits NI max and NR max on the number of iterations and restarts respectively (see the text for details).2: Assemble the array of minimum distances M, as follows: a: For every configuration, collect all interatomic distances which comprise the same atoms types into group d and find min d (r) for each d.b: Arrange the lists of min d (r) values in an N x D array, M, where N is the number of configurations in the data set and D is the number of unique atomic pairs in r.

Algorithm 3
Constrained Placement Algorithm 1: Select the configuration in the reference set with the highest energy; add this configuration to the training set and remove it from the reference set.2: Retrain the GP to the updated training set.3: Determine the boundary that minimises the RMSE against the reference set (see subsection II E 2).4: Find the configuration in A GP for which the GP error is highest, where A GP is defined by the boundary from the previous step.5: Add this configuration to the training set and remove it from the reference set.6: Repeat steps 2-5 until the desired RMSE or number of training points is reached.

FIG. 3 :
FIG.3: Plot of RMSE test / E h against N TP on a log 10 scale for models of the HF-Ne PES trained via the multi-constrained (circles) and fixed boundary (triangles) training strategies.The blue circles and pink triangles are points which were included in the fitting of the lines shown.

FIG. 5 :
FIG.5: of the x and y coordinates of the Ne in the CO 2 -Ne system for configurations in the reference set.The CO 2 molecule is aligned along the x-axis with the C at the origin.Points are classified as GP points (blue) or asymptotic points (red) by C optimal using a model from the closest model strategy trained up to N TP = 100.

FIG. 6 :
FIG. 6: Plots showing training point placement for the first 100 training points for models of the HF-Ne PES trained using the fixed boundary (a), single-constrained placement (b) and closest model (c) strategies.These points are coloured based on the shortest interatomic distance in the configuration, with only this shortest distance shown for each.Boundary values are represented by black lines where applicable.

FIG. 7 :
FIG. 7: Plots showing training point placement for the first 100 training points for models of the CO-Ne PES trained using the single-constrained placement (a) and single-open placement (b) strategies.These points are coloured based on the shortest interatomic distance in the configuration, with only this shortest distance shown for each.Boundary values are represented by black lines.

FIG. 8 :
FIG. 8: Plot showing the values of R cross achieved for the first 100 training points by single-open and single-constrained training for models of the CO 2 -Ne potential.

FIG. 9 :
FIG. 9: Plots showing the value of R cross and entries in R cross for the first 100 training points for models of the HF-Na + potential from single/multi-open training (a) and the CO 2 -Ne potential from single/multi-constrained training (b).

FIG. 10 :
FIG. 10: Plots showing the first 100 training points and values of R cross from two runs of single-open training (a) and fixed boundary training (b) for models of the CO 2 -Ne potential.

FIG. 12 :
FIG.12: Plots of the predicted energies of the long-range functions derived by the empirical method (red) and the method (green) for the HF-Na + potential in a linear configuration with the F proximal to the Na + (i.e. with cos(θ ) = 1).Also shown for comparison are MP2 energies (blue) for this configuration.

TABLE I :
The co-ordinates for the reference and test LHCs for each system.N ref and N test are the number of points in the reference and test sets respectively after application of the high-energy cut-off, while the maximum number of training points for models of each potential are given in the text.Also shown is the minimum energy across the reference and test sets, E min , in Hartrees (E h ), though no attempt was made to approximate the global minimum energy for any potential.
2: Order the configurations from smallest to largest in terms of min(r).3: Approximate SSE tot initially from the squared errors of the longrange function alone.4: Iterate through the min(r) from the lowest to the highest: a: set R cross = min(r), which moves a single configuration from A LR to A GP ; b: update SSE tot by deducting the squared long range error of the moved point from SSE LR and adding its squared GP error to SSE GP ; c: store the new SSE tot .5: Select whichever value of R cross corresponds to the smallest value of SSE tot .

TABLE III :
The classifier and training point placement method for all training strategies examined here.
FIG. 2: The predictions of the GP (blue) and the long-range function (red) for a slice through the PES of the (CO 2 ) 2 potential in which the two molecules are in a T-shaped configuration.The cross-over distance for this model is shown by the black line at 4.505 Å.Data points (black circles) are independent of the GP training data.was previously shown to significantly reduce the number of training points compared to LHC design.It is demonstrated below that these new boundary optimisation methods improve further the already efficient methods of Uteva et al. 19 .

TABLE IV :
The equation of the lines of best fit for RMSE test of models from all training strategies for CO-Ne as power laws in the number of training points, N TP .Also shown are the R 2 values of each fit on the data.All fits were over points in the range 1 x 10 −6 E h ≤ RMSE test ≤ 1 x 10 −4 E h .TP and N TP,fixed are the numbers of training points required by the new strategy and fixed boundary training, respectively.These quantities and E are shown as functions of RMSE test as they vary with its value.For a given RMSE test , the values of N TP and N TP,fixed are determined from least squares fits of log 10 (RMSE test ) versus log 10 (N TP ), with examples of such fits shown in figure 17ots of E against RMSE test for all potentials are shown in parts (a) to (e).The potential referred to in each frame is shown in the upper right corner.aremorepronounced at high RMSE test for all training strategies across all systems.This suggests that boundary optimisation is most effective when the training set is small, making this technique ideal for applications where a computationally cheap but less accurate potential is required.Nevertheless, significant reductions in the number of training points are also obtained in the RMSE range where PESs become useful for first principles predictions.For example, a GP potential with an RMSE of 3 x 10 −4 eV per atom (1.1 x 10 −5 E h per atom) was employed in a recent simulation of the thermal properties of β -Ga 2 O 331.Furthermore, Uteva et al. successfully determined the CO 2 -CO second virial coefficient using a GP PES with an RMSE of 2.4 x 10 −5 E h17.In this RMSE range the boundary optimisation methods typically reduce the required number of training points by 15-33% (see figure For example, when modelling the CO 2 -Ne potential with single-open training the results from two separate training runs were identical.This is shown in figure 10.This is noteworthy because the values of the hyperparameters selected when maximising log(L ) can vary slightly, even for the same training set.Such variations can alter the predictions of the GP, leading to different values for R cross or R cross and the selection of different training points.Thus, that two separate runs of the same training strategy are totally identical is encouraging.Furthermore, figure