Reinforcement learning of rare diffusive dynamics

We present a method to probe rare molecular dynamics trajectories directly using reinforcement learning. We consider trajectories that are conditioned to transition between regions of configuration space in finite time, like those relevant in the study of reactive events, as well as trajectories exhibiting rare fluctuations of time-integrated quantities in the long time limit, like those relevant in the calculation of large deviation functions. In both cases, reinforcement learning techniques are used to optimize an added force that minimizes the Kullback-Leibler divergence between the conditioned trajectory ensemble and a driven one. Under the optimized added force, the system evolves the rare fluctuation as a typical one, affording a variational estimate of its likelihood in the original trajectory ensemble. Low-variance gradient estimates employing value functions are proposed to accelerate convergence to the optimal force. The method we develop employing these gradients leads to efficient and accurate estimates of both the optimal force and the likelihood of the rare event for a variety of model systems.


I. INTRODUCTION
Rare but important events play a significant role in phenomena occurring throughout the sciences, ranging from physics 1 and chemistry 2 to climate science 3 and economics. 4 As a consequence, methods developed to study rare events can transcend disciplines. In molecular systems, rare events determine the rates at which chemical reactions occur and phases interconvert, 5 and they also encode the response of systems driven to flow or unfold. [6][7][8][9][10] Strategies that afford a means of studying rare dynamical events in statistically unbiased ways are particularly desired, in order to deduce the intrinsic pathways by which they occur and to evaluate their likelihoods. Borrowing notions from reinforcement learning, 11 we have developed a method to generate rare dynamical trajectories directly through the optimization of an auxiliary dynamics that generates an ensemble of trajectories with the correct relative statistical weights. Within this ensemble of trajectories, a variational estimate of the likelihood of the rare event is obtainable from a simple expectation value.
Much research has been devoted to the enhanced sampling of molecular dynamics simulations, yet there remain active areas of open research. Methods for sampling dynamical fluctuations, especially those away from equilibrium, are considerably less developed than their equilibrium and configurational counterparts. 12,13 Recent work has sought to construct methods for finding an effective auxiliary dynamics, [14][15][16][17][18][19][20] with the goal of sampling rare dynamical fluctuations with the corresponding correct statistical weights directly, by evolving simulations with additional parametrized forces. Such methods are often designed to approximate the so-called Doob transform, [21][22][23][24] which is the unique force that evolves a trajectory conditioned on a rare event.
A general approach to the optimization of a sampling dynamics based on a variational principle for the Doob transform for diffusive processes has recently been developed. 25 Within this context of diffusive processes, optimal forces have been used to elucidate mechanisms and rates of nonlinear response, 26,27 to encode dynamical phase diagrams, [28][29][30] and to deduce inverse design principles. 31,32 In this work we aim to extend a reinforcement learning 11 based approach to the optimization of a sampling dynamics to diffusive systems, building on the work of Refs. 25 and 33 and past literature on reinforcement learning for continuous time processes. [34][35][36][37][38][39][40][41] The techniques of reinforcement learning aim to learn the best decisions to make in each state in order to achieve some goal. Algorithms developed in this context have led to many significant advances in recent years across tasks requiring an intelligent agent to interact with an environment, such as in gameplay [42][43][44] and robotics, [45][46][47] with a variety of recent applications in physics. [48][49][50][51][52][53][54][55] However, many of these situations are framed as discrete time problems, with relatively little work done on stochastic continuous time control. 34,35 For diffusive processes and importance sampling of molecular dynamics, we formulate a reinforcement learning procedure that learns a force modifying the probability of choosing each next state. From this perspective, we take a policy gradient based approach, 35,45,46,56,57 learning a generative model for the evolution of the state. The optimized force is such that rare events are made typical while staying close to the original force, providing a dynamics that can aid in efficiently sampling the targeted trajectory ensemble.
A key advantage of the reinforcement learning techniques we develop is the use of an additional learning process for a function which guides the optimization of the dynamics, a so-called value function, 58 which describes how relevant each state is to the rare events of interest. This value function substantially reduces the variance in estimates of the gradient of the parameters specifying a force, allowing for the use of less data in each optimization step and subsequently more complex approximations to the auxiliary dynamics. We show how this approach can be successfully applied to both finite time problems in which the dynamics is constrained to guarantee the occurrence of some rare transition like a barrier crossing, and to time-homogeneous problems where we are interested in the statistics of time-integrated observables in the long time limit as characterized by its large deviation function.

II. TRAJECTORY ENSEMBLE FORMALISM
We consider systems evolving with a diffusive dynamics over time t of a configuration x. These configurations evolve according to a force vector F(x, t) and a noise vector W of equal dimension, with associated constant noise matrix G invertible within the stochastically evolving subspace, represented by the stochastic differential equation

dx_t = F(x_t, t) dt + G · dW_t,    (1)

where the noise W follows a Wiener process, with increments dW drawn from a Gaussian with zero mean and variance dt. Throughout we will work in dimensionless variables that imply unit energy scales and mobilities. The requirement of G being invertible within the stochastic subspace may in principle be relaxed; however, in that case there may be multiple noise vectors corresponding to the same change of state, making the evaluation of the transition probabilities, necessary for our optimization approach, difficult. We will follow the Ito convention for ease of notation and implementation with standard numerical integrators. Throughout, we do not assume in Eq. 1 that the force is gradient or that the noise obeys a detailed balance, and thus our approach is generally applicable to equilibrium as well as nonequilibrium dynamics. We aim to probe rare fluctuations in trajectory observables. Here we consider trajectories, X_{0,T}, defined as the sequence of configurations over an observation time T, though generalizations to fluctuating observation times are possible. 59 Generally, we will consider observables that are functions of time-integrated variables over the trajectory,

O[X_{0,T}] = ∫_0^T dt A[x_t, t] + ∫_0^T B[x_t, t] · dx_t,    (2)

where the first term is a state dependent observable, while the second term depends on a stochastic increment, with both A[x_t, t] and B[x_t, t] being state dependent. However, we will also consider cases in which A[x_t, t] is a function of a single time in order to impose end point conditioning. Expectations of functions of such observables are defined through path integrals of the form

⟨ g(O[X_{0,T}]) ⟩ = ∫ D[X_{0,T}] P[X_{0,T}] g(O[X_{0,T}]),    (3)

where P[X_{t,t′}] is the total probability of a trajectory, decomposable as P[X_{t,t′}] = p[X_{t,t′}|x_t] ρ(x_t), where p[X_{t,t′}|x_t] is the transition probability conditioned on starting in configuration x_t with initial probability ρ(x_t).
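To make these conventions concrete, the following is a minimal Euler-Maruyama integrator for Eq. 1 in one dimension; the function name and the free-diffusion example force are illustrative choices of ours rather than code from this work.

```python
import numpy as np

def euler_maruyama(force, x0, T, dt, G=np.sqrt(2.0), rng=None):
    """Integrate dx_t = F(x_t, t) dt + G dW_t in the Ito convention,
    with Wiener increments dW ~ N(0, dt)."""
    if rng is None:
        rng = np.random.default_rng()
    n = int(T / dt)
    xs = np.empty(n + 1)
    xs[0] = x0
    for j in range(n):
        dW = rng.normal(0.0, np.sqrt(dt))
        xs[j + 1] = xs[j] + force(xs[j], j * dt) * dt + G * dW
    return xs

# Example: free diffusion (F = 0), the reference dynamics of Sec. IV A.
traj = euler_maruyama(lambda x, t: 0.0, x0=0.0, T=1.0, dt=1e-3)
```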
Probabilities for trajectories between times t and t′ starting at x_t are defined by

p[X_{t,t′}|x_t] ∝ exp{ −(1/2) ∫_t^{t′} dt″ [ẋ_{t″} − F]ᵀ (GGᵀ)⁻¹ [ẋ_{t″} − F] },    (4)

where we suppressed the arguments of x_{t″} and F[x_{t″}, t″] for shorthand. This is the standard Onsager-Machlup form for the diffusive dynamics considered here. 60 The measure over paths between times t and t′ starting from position x_t is defined such that

∫ D[X_{t,t′}] p[X_{t,t′}|x_t] = 1,

where the transition probability is normalized when integrated over all trajectories. These path probabilities satisfy the composition property

p[X_{t,t″}|x_t] = p[X_{t′,t″}|x_{t′}] p[X_{t,t′}|x_t],  t ≤ t′ ≤ t″,

due to the Markovian noise in Eq. 1.
Trajectories sampled with P[X_{0,T}] will be dominated by the most typical values of O[X_{0,T}]. We will encode the rare trajectories with atypical values of O[X_{0,T}] by reweighting the original trajectory ensemble defined by Eq. 4, multiplying each trajectory by an observable dependent factor. Such reweightings occur naturally in statistical studies of rare events and are isomorphic to extended ensemble approaches in equilibrium configurational problems. The ensemble of events we are interested in is constructed by weighting the probability of trajectories in the original dynamics by a positive exponential factor,

P_s[X_{0,T}] = P[X_{0,T}] e^{−sO[X_{0,T}] − λ(s,T)},    (8)

where P_s[X_{0,T}] is denoted a tilted path ensemble, biased by a statistical field s in such a way as to promote rare fluctuations in O[X_{0,T}]. The quantity λ(s, T) normalizes the tilted distribution, and is identifiable as a cumulant generating function (CGF) equal to the logarithm of the tilted path partition function Z(s, T),

λ(s, T) = ln Z(s, T) = ln ⟨ e^{−sO[X_{0,T}]} ⟩.

The exponential weight e^{−sO[X_{0,T}]} can also be constructed to function as a filter based on fulfilling specific criteria. In such cases P_s[X_{0,T}] is identified as the probability that a trajectory fulfills a specific conditioning, and its ensemble a corresponding conditioned path ensemble. Common examples are Brownian bridges, [61][62][63] where trajectories are conditioned to end at a chosen point: O[X_{0,T}] is taken as an indicator function that is 1 when the trajectory ends at that point and is 0 otherwise, and s is taken sufficiently negative that only those trajectories for which the constraint is satisfied have significant weight.

III. GRADIENT OPTIMIZATION FOR FINITE TIME CONSTRAINED DYNAMICS
Our aim is to find a dynamics which generates trajectories with probabilities as close to those of the reweighted trajectory ensemble as possible.
For the diffusive dynamics considered here, this is exactly achievable in principle through a so-called generalized Doob transformation. 21,22,[64][65][66][67] The generalized Doob transformation defines a modified dynamics with an added drift force that is generally time dependent but with an identical noise as in the original SDE. However, constructing this transformation is often not possible in practice, as it requires diagonalizing a modified Fokker-Planck operator which in interacting systems is exponentially complex. 24 Here we aim to parametrize a drift force with tunable parameters θ to approximate the generalized Doob transform. With the modified force, F_θ(x, t), we have a modified SDE,

dx_t = F_θ(x_t, t) dt + G · dW_t,    (10)

with corresponding trajectory probabilities

p_θ[X_{t,t′}|x_t] ∝ exp{ −(1/2) ∫_t^{t′} dt″ [ẋ_{t″} − F_θ]ᵀ (GGᵀ)⁻¹ [ẋ_{t″} − F_θ] },    (11)

which still satisfy the Markovian properties of the original dynamics and the same normalization. See Ref. 33 for a discussion of problems in which the optimal dynamics is required to be non-Markovian, in the context of discrete time Markov processes. We seek to learn a set of parameters θ that minimize the Kullback-Leibler (KL) divergence between the modified dynamics and the reweighted trajectory ensemble defined by Eq. 8. The KL divergence is defined as

D_KL(p_θ|p_s) = ⟨ ln( p_θ[X_{0,T}|x_0] / p_s[X_{0,T}|x_0] ) ⟩_{p_θ},

where the expectation is taken with respect to the parametrized dynamics. This quantity is a measure of the similarity between the modified and reweighted trajectory ensembles. The KL divergence vanishes when p_θ is given by the generalized Doob transform, and thus has a unique minimum whenever the Doob transformed dynamics is contained within the space of parametrized dynamics, providing a variational estimate of the CGF. We note that this definition of the KL divergence differs from much of the literature considering optimization of a parametrized diffusive dynamics, 17,[68][69][70] where the parametrized dynamics p_θ and target dynamics p_s appear in the opposite order. In principle the initial distribution should also be parametrized, as it will be modified by the reweighting; however, depending on the space of distributions chosen, such distributions can be hard to sample. We drop this modification for simplicity.

A. Low variance gradient estimation
In order to optimize the force, F_θ, we follow techniques introduced in the reinforcement learning literature. 11,45,[71][72][73][74] Substituting the parametrized and reweighted trajectory probabilities into the KL divergence, we may rewrite it as an average over a parameter dependent time-integrated observable,

D_KL(p_θ|p_s) = λ(s, T) − ⟨ R[X_{0,T}] ⟩_{p_θ},    (13)

where in the language of reinforcement learning we define a return, R[X_{0,T}], as

R[X_{0,T}] = −sO[X_{0,T}] − ln( p_θ[X_{0,T}|x_0] / p[X_{0,T}|x_0] ),    (14)

with the negative of the average of the second term measuring the KL divergence, D_KL(p_θ|p), between the parametrized dynamics and the original dynamics. This return is analogous to a regularized form of reinforcement learning 72,74 similar to that considered in maximum-entropy reinforcement learning. 45,46,73 When evaluated at the generalized Doob transform the KL divergence vanishes and the return evaluates to the CGF. Away from the Doob transform, the positivity of the KL divergence results in the return variationally bounding the CGF from below. 23 We aim to minimize the KL divergence through stochastic gradient descent in the parameter space. For this we need the gradient of D_KL(p_θ|p_s) with respect to θ,

∇_θ D_KL(p_θ|p_s) = −⟨ R[X_{0,T}] ∇_θ ln p_θ[X_{0,T}|x_0] ⟩_{p_θ},    (15)

where we note ⟨∇_θ ln p_θ[X_{0,T}|x_0]⟩_{p_θ} = 0 due to conservation of probability. 33 The factor multiplying the return is commonly referred to as the Malliavin weight in the stochastic analysis literature, 75 and corresponds to a particular case of the eligibility traces found in reinforcement learning, 11,58,[76][77][78] which we denote as y_θ(T) = ∇_θ ln p_θ[X_{0,T}|x_0]. It can be rewritten by substituting the path probability,

y_θ(T) = ∫_0^T dt ẏ_θ(t),  ẏ_θ(t) = ∇_θ F_θ(x_t, t)ᵀ (GGᵀ)⁻¹ [ẋ_t − F_θ(x_t, t)],    (18)

where ẏ_θ(t) is the integrand of the Malliavin weight.
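As a concrete sketch, the score-function form of Eq. 15 can be estimated for a one-dimensional process with reference force F = 0 by accumulating the Malliavin weight and return along sampled trajectories; `force`, `grad_force`, and `end_obs` are hypothetical user-supplied callables (the latter two returning the parameter gradient of the force and an end-point observable), and all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
G, dt, T, s = np.sqrt(2.0), 1e-3, 1.0, -100.0
n = int(T / dt)

def mcr_gradient(theta, force, grad_force, end_obs, batch=12):
    """Estimate -grad_theta D_KL = < R[X] y_theta(T) > via Eq. 15,
    for a 1D dynamics with reference force F = 0."""
    g = np.zeros_like(theta)
    for _ in range(batch):
        x, y, R = 0.0, np.zeros_like(theta), 0.0
        for j in range(n):
            t = j * dt
            dW = rng.normal(0.0, np.sqrt(dt))
            f = force(theta, x, t)
            y += grad_force(theta, x, t) * dW / G        # Malliavin increment, Eq. 18
            R -= f**2 / (2 * G**2) * dt + f * dW / G     # -ln(p_theta/p) piece of Eq. 14
            x += f * dt + G * dW
        R -= s * end_obs(x)                              # -s O[X], end-point conditioning
        g += R * y / batch
    return g
```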
Were we to stop at Eq. 15, we would proceed to optimize a generative model (the diffusive dynamics with our parametrized force) of the trajectories using a score-function based approach, similar to standard unsupervised learning. However, following the methods of reinforcement learning, we can use a combination of the Markovianity of the generative model and other variance reduction techniques to produce a gradient estimator which is much more efficient to evaluate. To begin with, we can simplify Eq. 15 by noting that, due to Markovianity, the Malliavin weight only correlates with the return in the future, and we can rewrite the gradient as

∇_θ D_KL(p_θ|p_s) = −∫_0^T dt ⟨ R[X_{t^−,T}] ẏ_θ(t) ⟩_{p_θ},

where we used t^− as a shorthand for t − ε for some small positive ε. We refer to the optimization of the modified dynamics using this formulation of the gradient as χ_MCR, as it is analogous to the Monte-Carlo returns (MCR), or REINFORCE, 79,80 policy gradient algorithm in reinforcement learning. In the long observation time limit, employing this gradient in stochastic optimization reduces to previous variational Monte Carlo procedures. 25 This estimator of the gradient is non-optimal for two reasons. First, it requires evaluation of a two time correlation function. In steady state, stationarity can be invoked to eliminate one of the time integrals; however, under finite time conditioning this simplification is not possible. Second, it has a high variance and requires significant averaging to converge accurate gradients. This is because both the Malliavin weight and the return undergo a random walk with linearly increasing variance. 75 Building on the analogies with the reinforcement learning formalism, we define a value function as a path average of the return, conditioned on starting at the position and time, x_t = x,

V(x, t) = ⟨ R[X_{t,T}] ⟩_{p_θ, x_t = x}.

Introduced into the gradients of D_KL(p_θ|p_s) in distinct ways, the value functions can be used to tame both problems of the naive MCR gradient estimate.
First, we introduce the value function as a baseline that depends only on the state at the time t in order to reduce the variance of the gradient. We note that ẏ_θ(t) is linear in the noise and thus averages to zero when multiplied by a function of the state at or before t. Defining a temporal difference error

δ(t) = R[X_{t^−,T}] − V(x_t, t),

we write the dynamical gradient as

χ_MCVB(θ, T) = −∫_0^T dt ⟨ δ(t) ẏ_θ(t) ⟩_{p_θ},

where we have formally subtracted zero. We refer to this gradient estimator as χ_MCVB, for Monte Carlo Value Baseline (MCVB). 11 The subtraction of the state point dependent value function reduces the variance of the gradient by removing the mean, uncorrelated part of each return between t^− and T multiplying ẏ_θ(t), focusing on how this return differs from the average behaviour encoded by the value function. Second, we introduce a value function that encodes an estimate of the return in the future, in order to further reduce the variance and also the complications associated with estimating the two-time correlation function. We can replace part of the return by a value function conditioned at a later time t + τ, such that t^− < t + τ ≤ T,

R[X_{t^−,T}] ≈ R[X_{t^−,t+τ}] + V(x_{t+τ}, t + τ),

where we set the value function V(x, t) to zero for t > T. Combining this value function form of the kernel of the gradient with the value baseline, we define another temporal difference error,

δ̄(t) = R[X_{t^−,t+τ}] + V(x_{t+τ}, t + τ) − V(x_t, t),

and we arrive at a distinct formulation of the gradient,

χ_AC(θ, T) = −∫_0^T dt ⟨ δ̄(t) ẏ_θ(t) ⟩_{p_θ},

which we denote χ_AC(θ, T) for the actor-critic (AC) gradient estimator, after the analogous algorithm in reinforcement learning. 11,45 Here the value function is seen as criticizing the transitions generated by the dynamics, i.e. the actor. Variance reduction of gradient estimates is therefore achieved by replacing potentially noisy return samples with the average behaviour expected in the future of the state x_{t+τ}. In Sec. IV, we will compare the accuracy and statistical efficiency of these three gradient estimators: MCR, MCVB, and AC. Before that we discuss how the value functions are simultaneously parametrized and learnt alongside the modified force.

B. Parametrizing value functions
While the gradient expressions are exact and the use of value functions is expected to facilitate their convergence, using them requires knowledge of the exact value function for the modified dynamics, a formidable task in complex problems. In order to make their use tractable, we optimize a representation of the value function in addition to the modified force. Specifically, we introduce a parametrization of the value function denoted V_ψ. To optimize this approximation we note that the value functions satisfy a self-consistency equation called the Bellman equation, 81

V(x, t) = ⟨ R[X_{t,t+τ}] + V(x_{t+τ}, t + τ) ⟩_{p_θ, x_t = x},    (26)

which has a unique solution for a given dynamics and return (as defined by the tilting observable and the dynamics via Eq. 14). We aim to minimize the error in this equation, thus optimizing our parametrized value towards this unique solution. Our approach is to minimize the squared difference between the two sides of Eq. 26 with the true value function replaced by the parametrized value function, and apply gradient descent to it. Such an approach is the subject of gradient temporal difference methods, 82-84 but produces a gradient estimate which is difficult to evaluate, containing products of expectations which require independent samples. A part of the resultant gradient is, however, simpler to compute. We derive it by substituting only the right hand side of Eq. 26 with our parametrized value function, to provide a fixed target for the left, and defining a corresponding error function based on the squared difference. To construct a loss, we integrate these errors along each trajectory and average them over the trajectory ensemble. This results in a loss function L(ψ, ψ_i) that we take as a function of two sets of weights, ψ and ψ_i, where the weights ψ_i are those after update i, used to provide the fixed target estimate towards which we want to move the functional of ψ. The derivative is then taken with respect to ψ, before setting ψ = ψ_i to find the gradient of this loss for the current parameters. Such an approach is referred to as semi-gradient in the reinforcement learning literature, 11 is used to achieve the majority of state-of-the-art reinforcement learning results, and proves stable provided the data used to estimate the gradient is sampled using a dynamics which is close to p_θ, as we intend to do. As mentioned above, alternative methods which additionally consider the variation of the target with ψ can be found in the RL literature, allowing for the use of data sampled from an alternative dynamics, utilized via importance sampling. [82][83][84] Writing an approximate temporal difference for the value function parametrization,

δ_ψ(t) = R[X_{t^−,T}] − V_ψ(x_t, t),

within MCVB or for AC we have gradients of the form

∇_ψ L(ψ, ψ_i)|_{ψ=ψ_i} = −⟨ ∫_0^T dt δ_{ψ_i}(t) ∇_ψ V_ψ(x_t, t) ⟩_{p_θ},

for the loss function from the value function parametrization, where for the AC algorithm δ_{ψ_i} is replaced with its bootstrapped counterpart δ̄_{ψ_i}. Given this value function approximation, we can approximate the gradient of the KL divergence by replacing the exact temporal difference with these approximate temporal differences. We then use the same trajectories to estimate the force and value function gradients and simultaneously learn both. For the MCVB algorithm, an approximate value function does not bias the gradients, as the future return that correlates with the Malliavin weight stays intact and the expectation of the Malliavin weight is identically 0. However, for the AC algorithm, an approximate value function can introduce a bias into the gradients, as it replaces the average of the future return, which it may not accurately represent.
Employing gradients with or without value functions, we can construct a stochastic descent algorithm to optimize the modified forces, which can be used to estimate the likelihoods of rare events and the trajectories by which they emerge. The algorithms require the evaluation of the forces, the value function, their parametric gradients, and the noises over the course of simulating trajectories. Ensembles of trajectories can then be used to construct an empirical estimate of the gradient by computing the Malliavin weights, returns, and temporal differences. These empirical estimates then iterate the two sets of weights with respective learning rates α_θ and α_ψ for the force and value function. The resultant algorithm is outlined in pseudocode below in Alg. 1. Detailed versions of the individual algorithms with computationally efficient on-the-fly implementations for simulating trajectories with discrete timesteps are presented in Appendix A.
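A compact sketch of the structure of Alg. 1 in its MCVB variant, again for a one-dimensional reference dynamics with F = 0, is given below; the callables and their signatures are our own illustrative choices, and practical implementations would accumulate these quantities on the fly with discrete timesteps as described in Appendix A.

```python
import numpy as np

def train_mcvb(theta, psi, force, grad_force, value, grad_value, end_obs,
               s, T, dt, G, alpha_th, alpha_ps, steps, batch=12, seed=0):
    """Sketch of Alg. 1, MCVB variant: joint stochastic optimization of the
    force parameters theta and value-baseline parameters psi, for a 1D
    reference dynamics with F = 0."""
    rng = np.random.default_rng(seed)
    n = int(T / dt)
    for _ in range(steps):
        g_th, g_ps = np.zeros_like(theta), np.zeros_like(psi)
        for _ in range(batch):
            x, xs, dWs, rs = 0.0, [0.0], [], []
            for j in range(n):                    # simulate one driven trajectory
                t = j * dt
                dW = rng.normal(0.0, np.sqrt(dt))
                f = force(theta, x, t)
                rs.append(-f**2 / (2 * G**2) * dt - f * dW / G)  # KL part of Eq. 14
                x += f * dt + G * dW
                xs.append(x)
                dWs.append(dW)
            rs[-1] -= s * end_obs(x)              # end-point observable term of Eq. 14
            R_togo = np.cumsum(rs[::-1])[::-1]    # return-to-go R[X_{t,T}]
            for j in range(n):
                t = j * dt
                delta = R_togo[j] - value(psi, xs[j], t)   # value-baseline TD error
                g_th += delta * grad_force(theta, xs[j], t) * dWs[j] / G
                g_ps += delta * grad_value(psi, xs[j], t)  # semi-gradient direction
        theta += alpha_th * g_th / batch          # ascend the return, descend the KL
        psi += alpha_ps * g_ps / batch
    return theta, psi
```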

IV. RARE FLUCTUATIONS IN FINITE TIME
We have used the algorithms discussed above to examine rare fluctuations of trajectories of fixed duration, starting from a fixed point in configuration space. The specific observable we have investigated is an indicator function for reaching a desired region, Γ, in configuration space at the final time T,

O[X_{0,T}] = h_Γ(x_T),

where h_Γ(x) = 1 for x ∈ Γ and 0 otherwise. Rare trajectories reaching a target basin in configuration space are often of interest as transition paths for reactive events, and significant development has been undertaken to efficiently generate them. [85][86][87][88][89] Computing optimal drift forces for generating these rare trajectories enables the study of reactive dynamics in a direct manner. We expect these algorithms to find use in the study of diffusive dynamics where Monte Carlo approaches have difficulty sampling. [90][91][92][93] Further, as the modified force is used with the original noise from the SDE, we have access to the full reactive trajectory ensemble, allowing the interrogation of the statistics of the reactive events in a way that other direct path methods, like nudged elastic band and zero temperature string methods, do not, as they represent only the dominant path. [94][95][96][97] As a consequence, we expect our method will find use when there is a large path space entropy.
The CGF for an indicator variable is given by

λ(s, T) = ln ⟨ e^{−s h_Γ(x_T)} ⟩,

as an average in the original reference dynamics. From Eq. (13), the KL divergence being nonnegative implies the average return is bounded above by the value of the CGF λ(s, T). The bound can be saturated only by the unique optimal drift force. We compare the value of the optimized return to numerically exact estimates of the CGF given as

λ(s, T) = ln[ (e^{−s} − 1) ⟨h_Γ⟩_p + 1 ],    (32)

where the definition of the indicator function and the final time distribution ρ(x, T), evolved from a specific initial condition, have been used, with ⟨h_Γ⟩_p = ∫_Γ dx ρ(x, T). This form demonstrates that the statistics of a single-time indicator observable are described solely by its mean,

⟨h_Γ⟩_s = e^{−s} ⟨h_Γ⟩_p / [ (e^{−s} − 1) ⟨h_Γ⟩_p + 1 ].

For a rare fluctuation such that ⟨h_Γ⟩_p < 0.5, this form indicates that there are two distinct regimes in the biased ensemble with s < 0. For a small magnitude of the bias, the indicator average stays close to the unbiased value. Below a critical value of s, near s* ≈ ln⟨h_Γ⟩_p, the indicator average crosses over to being close to 1. For all of our calculations, we choose a fixed value of s estimated to be smaller than this threshold. With this value of s, we compute the right side of Eq. (32) using an eigen-expansion of the propagator of the Fokker-Planck equation of the original dynamics, and compare with the value of the average return from the gradient descent algorithms at the same value of s. Details of this calculation and a comparison to an approximate Kramers escape rate are in Appendix C.
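Because Eq. (32) and the tilted mean are closed-form, the two regimes and the crossover in s can be checked directly; the numerical values below are purely illustrative.

```python
import numpy as np

def cgf_indicator(s, h_mean):
    """lambda(s, T) = ln[(e^{-s} - 1) <h_Gamma>_p + 1], Eq. (32)."""
    return np.log((np.exp(-s) - 1.0) * h_mean + 1.0)

def tilted_mean(s, h_mean):
    """<h_Gamma>_s = e^{-s} <h_Gamma>_p / [(e^{-s} - 1) <h_Gamma>_p + 1]."""
    return np.exp(-s) * h_mean / ((np.exp(-s) - 1.0) * h_mean + 1.0)

# A rare event with <h_Gamma>_p = 1e-4 crosses over near s* ~ ln(1e-4) ~ -9.2:
print(tilted_mean(-5.0, 1e-4))   # ~0.015, still rare
print(tilted_mean(-15.0, 1e-4))  # ~0.997, conditioning dominates
```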

A. Softened Brownian bridges
The first example we consider is a softened version of a so-called Brownian bridge, 61,98 in which a one-dimensional Brownian motion starting from the origin is biased to end near a particular point. The reference dynamics is simply given by free diffusion,

dx_t = √2 dW_t,

where comparing to Eq. 1 we have F = 0 and G = √2. We consider the target well, Γ, to be defined as {x : 1 − ϵ ≤ x ≤ 1 + ϵ} with ϵ = 0.1. The dynamics is simulated with a discrete timestep of 0.001. We use a tilting parameter s = −100 to bias the original ensemble towards a higher occurrence of the rare event.
We optimize a force and value function parametrized by linear combinations of Gaussians,

F_θ(x, t) = Σ_m θ_m exp[ −(x − x_m)²/(2σ_x²) − (t − t_m)²/(2σ_t²) ],

and similarly V_ψ(x, t) with coefficients ψ_m, where initially the basis sets are a grid of 31 × 21 Gaussians in the x-t space (with widths σ_x, σ_t set per grid family as described next). The Gaussians in time are spaced uniformly over t ∈ [0, T), with standard deviations equal to half the grid spacing. A third of the Gaussians in space are placed in x ∈ [−4, −0.5], a third in x ∈ (−0.5, 1.5), and a third in x ∈ [1.5, 5]. These three families of Gaussians each have standard deviations equal to half of the corresponding grid spacings. We initialize all θ_m = ψ_m = 0.
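For this linear parametrization the force, the value function, and their parameter gradients all reduce to evaluations of the same basis; a minimal sketch follows, with our own naming and a single width per dimension for brevity (the text uses family-dependent widths, which can be passed as arrays).

```python
import numpy as np

def gaussian_basis(x, t, xc, tc, sx, st):
    """phi_m(x, t) for all grid points (x_m, t_m), as the outer product of
    1D Gaussians in x and t; sx and st may be arrays matching xc and tc."""
    gx = np.exp(-((x - xc) ** 2) / (2 * sx**2))
    gt = np.exp(-((t - tc) ** 2) / (2 * st**2))
    return np.outer(gx, gt).ravel()

def force(theta, x, t, basis_args):
    """F_theta(x, t) = sum_m theta_m phi_m(x, t); for this linear form the
    parameter gradient of the force is the basis vector itself."""
    return theta @ gaussian_basis(x, t, *basis_args)
```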
We consider the performance of the three algorithms differing in the gradient used to optimize them. These include an algorithm that uses no value function (MCR), one that uses a value baseline (MCVB), and one that uses a value function for future returns with τ = 0.1 (AC). We evaluate the efficiency of the algorithms by comparing learning curves, convergence with respect to the basis, and properties of the learnt dynamics, shown in Fig. 1. All figures comparing different algorithms use the same noise history and the same amount of statistics, such that the differences can be solely ascribed to the learned dynamics. The MCR algorithm uses a learning rate of α_θ = 0.4, the MCVB algorithm uses learning rates α_θ = 0.4 and α_ψ = 50, and the AC algorithm uses learning rates α_θ = 1 and α_ψ = 0.05.
In Figs. 1(a-c), we show learning curves for the total return, the average of the indicator observable, and the KL divergence, generated with 12 trajectories at each optimization step for each of the three algorithms. We have compared the results obtained with this finite basis to the numerically exact value of the optimal return and the corresponding observable average and KL divergence, obtained from Eq. 32, where for free diffusion the distribution is known. We find that while all three algorithms quickly achieve a dynamics which mostly fulfills the indicator function conditioning, the MCR algorithm struggles to optimize the KL divergence cost, while the MCVB and AC algorithms achieve converged values efficiently. As expected, each algorithm provides a variational estimate of the CGF, with MCVB and AC outperforming MCR. Trajectories with the final learned dynamics for the three algorithms are plotted in Figs. 1(d-f). The MCR algorithm finds forces that constrain the bridge trajectories too excessively, which results in the suboptimal estimate of the KL divergence. The AC trajectories are closest to the optimal bridge trajectories, 61 while the MCVB trajectories lie in between. The main reason for the difference in performance among the three algorithms is the suppression of statistical errors in the gradient estimate. This is illustrated in Figs. 1(g-h), where the convergence of the gradients of the 31 Gaussian coefficients at a time slice of t = 0.7 is shown for both MCR and MCVB. Since the learning rate α_θ is the same in both algorithms, the large suppression of fluctuations in the MCVB learning curves results from a more statistically converged gradient estimate using a value function. This suppression of gradient errors at limited statistics in the MCVB and AC algorithms is directly illustrated in Appendix B.
We have studied the convergence of the KL divergence estimate towards the optimal value extracted from the numerically exact CGF, using the MCVB algorithm with an increasing position and time basis. We increased the number of time Gaussians from 21 to 41 to 81, and observed the KL divergence cost shrinking as the finer-grained force can better support the singular indicator function condition at the end of the trajectory. We also ran the optimization with a much bigger basis of 41 × 201 Gaussians in x and t, using 248 trajectories at every optimization step and learning rates α_θ = 5, α_ψ = 1000. The Gaussians in x have standard deviations equal to half the grid spacing, while the Gaussians in t have standard deviations equal to a third of the grid spacing. While the estimate improved, in this particular problem obtaining the numerically exact KL divergence would require the use of still finer-grained Gaussians in space and time, in order to represent the singularities at the edges of the target region and at the last timestep.

B. Barrier crossing with multiple reaction pathways
We now investigate the ability of the three algorithms to find the optimal dynamics in two-dimensional barrier-crossing problems, the first involving a potential allowing for multiple reaction pathways. The two-dimensional potential U(x) we consider 99 has two minima and two degenerate reaction pathways involving the upper and lower halves of the x = (x, y) plane, as illustrated in Fig. 2. Barrier crossing from one well to the other is a rare event occurring through one randomly chosen pathway. 100 Without prior knowledge of the possibility of multiple reaction paths, path sampling algorithms typically need special techniques to discover them. 101 We use our reinforcement learning algorithms to compute an optimal force F_θ(x, t) that reproduces unbiased and uncorrelated reaction paths.
The reference equation of motion we consider is

dx_t = −∇U(x_t) dt + G · dW_t,    (36)

where the matrix G = √2 I is proportional to the identity. We use a discretization timestep of 0.001. The trajectories start from the minimum of the left well, at (x, y) = (−1.11, 0), are allowed to run for a duration of T = 1.5, and are checked for reaching the right target well, defined as x > 0 and U(x, y) < 0. This small region centered around (1.11, 0) is used as Γ for defining the indicator function observable. The value of T has been chosen to be slightly greater than the typical transition path timescale, such that the optimized force should reproduce trajectories that follow the natural steady-state fluctuations of the system. As long as the choice of T is larger than the typical transition path timescale, the optimally generated trajectories will represent unbiased reactive transitions, with additional time being spent in the initial or final metastable states. 102 In the absence of an approximate transition path time estimate, the optimization can be performed over a range of T increasing by orders of magnitude until one enters the regime where side-side correlation functions for the dynamics of barrier crossing behave linearly. 100 We use a value of s = −500 to obtain the CGF. The force and the value function are approximated again as a grid of Gaussians with optimizable coefficients, a simple generalization of the one-dimensional Brownian bridge. The duration of the trajectories we consider, T, is much smaller than the typical first passage time for the rare fluctuation we are interested in studying. As such, a general complication arises in initializing our algorithms: in the absence of a modified force, few trajectories satisfy the indicator function condition. Consequently, the gradients for updating the modified forces are generally very small and noisy. In order to initialize our learning process, we start with a softened version of the indicator function of the form

h̃_Γ(x_T) = −(x_T − x_f)² − (y_T − y_f)²,    (37)

which is quadratic and non-vanishing across the full domain. After optimizing the return with this observable, we obtain a force that can surpass the barrier, and the optimization with the sharp indicator function observable can begin. This technique of breaking down the optimization of the return into two segments, prioritizing each of the two terms of the return, is analogous to curriculum learning in reinforcement learning. 103 In many-body systems, the quadratic metric can be defined only in the space of the order parameter that distinguishes the initial and product states. For our multi-channel problem, we initialize learning with (x_f, y_f) = (1.11, 0) in the softened indicator, which is the minimum of the target well.
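The two-stage curriculum can be organized as a pair of observables swapped between optimization phases; the quadratic form below follows Eq. (37) as reconstructed here, and the function names are hypothetical.

```python
def soft_indicator(xT, yT, xf=1.11, yf=0.0):
    """Stage 1: smooth quadratic surrogate for h_Gamma, Eq. (37), which is
    non-vanishing everywhere so that early gradients carry signal."""
    return -((xT - xf) ** 2 + (yT - yf) ** 2)

def hard_indicator(xT, yT, U):
    """Stage 2: the sharp conditioning on reaching the product well,
    x > 0 with U(x, y) < 0, used once barrier crossing is learned."""
    return 1.0 if (xT > 0.0 and U(xT, yT) < 0.0) else 0.0
```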
Our approach consists of comparing the performance of the three algorithms MCR, MCVB and AC in the initialization with the quadratic observable, and then using the AC algorithm to optimize the return with the indicator function observable. The learning rates used in the initialization are α_θ = 1 for MCR, α_θ = 1 and α_ψ = 0.5 for MCVB, and α_θ = 1, α_ψ = 0.5, τ = 0.001 for AC, while the learning rates for the final optimization are α_θ = 0.2 and α_ψ = 0.08, with τ = 0.1, in the AC algorithm. In the learning curves, we compare the convergence of the return with numerically exact values obtained by computing the right side of Eq. 32 with a spectral expansion using a Discrete Variable Representation basis. 104 We see that all three algorithms quickly find forces that satisfy the conditioning, but the KL divergence cost is optimized best by the AC algorithm. While each affords a similar variational estimate after the initial optimization, we find qualitative differences in the families of barrier-crossing trajectories obtained from the MCR/MCVB and from the AC algorithms.
Typical trajectories obtained with forces from the end of initialization with MCVB and AC, and at the end of optimization with AC, are shown in Figs. 2(d-f). The force obtained from MCVB spontaneously breaks the symmetry in the potential and chooses one reaction path out of the two. This force solution is a local optimum in the MCR and MCVB algorithms, and it does not naturally relax to a symmetric force that would be representative of the degeneracy of the reaction paths. Trajectories from the AC algorithm spend a significant amount of time exploring the initial well, such that the discovered forces approximately recognize the presence of multiple pathways. These forces are further refined during the second optimization, such that the reactive trajectories obtained at the end are restored to be almost fully symmetric, as the natural barrier-crossing fluctuations of the system are expected to be. These symmetric two-dimensional forces obtained at the end of the AC optimization are plotted at three slices of time in Figs. 2(g-i). The forces grow in magnitude as a function of time and generally follow the contours of the underlying potential, and towards the end they gather support in unlikely parts of the potential. The ability of the AC algorithm to discover time-dependent forces that lead to exploration of multiple reaction pathways can prove valuable in uncovering reactive trajectories in systems where such degeneracies are not known a priori.

C. Barrier crossing with a long lived intermediate
Another difficult problem in the generation of transition paths and reactive trajectories typically comes from the presence of long-lived intermediates. In order to study the usefulness of our learning algorithms in this context, we consider as an example the dynamics on the so-called Müller-Brown potential. 105 This two-dimensional potential surface has been used extensively as a test case for methods relying on the instantonic approximation for barrier-crossing trajectories. 102,106 The potential is a sum of four Gaussians, 107 where three local minima are separated by two barriers, as illustrated in Fig. 3. We employed our algorithms to find forces that generate uncorrelated trajectories that cross both barriers, starting from a local minimum and ending in the global minimum, which are positioned on either side of the third, metastable minimum.
The system evolves with diffusive Langevin dynamics of the same form as Eq. 36, using a timestep of 0.0001. We are interested in trajectories starting from x = (0.63, 0.03) in the rightmost local minimum, and ending near the global minimum, centered around x = (−0.5, 1.5), with the indicator function region Γ being defined by U(x) < −145. The trajectories are chosen to be of a fixed duration of T = 0.15, which is on the order of the expected total transition path timescale from Kramers' theory added to the expected relaxation time in the intermediate well. 102,108 For initializing the forces we use a softened quadratic modification of the indicator, as in Eq. 37, with s = −10000, while we use a bias value of s = −2000 with the indicator observable to compute the CGF. To represent the x and y components of the time-dependent optimal force independently, and to represent the value function, we use a basis of Gaussians with optimizable coefficients placed on a 21 × 21 × 21 grid in x, y, and t. The time Gaussians are placed uniformly over t ∈ [0, T), while the space Gaussians are placed uniformly over x ∈ [−1.5, 1.5] and y ∈ [−0.5, 2]. In Figs. 3(a-c), we have compared the learning curves with the MCR, MCVB and AC algorithms during initialization with the smooth indicator function in Eq. 37, and with the AC algorithm for the final optimization of the full return with the sharp indicator function. Each algorithm uses 60 trajectories at every optimization step to estimate the gradient. The learning rates for the initialization are α_θ = 1 for MCR, α_θ = 1, α_ψ = 1 for MCVB, and α_θ = 0.5, α_ψ = 0.2, τ = 0.0001 for AC, and the learning rates for the final optimization are α_θ = 0.1, α_ψ = 0.01, τ = 0.01 for AC. The learning curves have been compared with approximate values of the CGF and the KL divergence obtained with a Kramers escape rate estimate along the minimum energy path. 94 We find that all three algorithms optimize the quadratic observable relatively quickly, but the AC algorithm performs best at optimizing the KL divergence cost. In Figs. 3(d-h), we illustrate a few uncorrelated trajectories generated with the modified forces at various stages of the initialization and optimization with the AC method, and at the end of the initialization with the MCVB method. We find that the forces from the AC algorithm are such that the trajectories discover and cross the two barriers and the metastable well between them one after another. At the end of the AC initialization, the trajectories have discovered the metastable well and have crossed both barriers to end in the target well. The AC algorithm by this stage of optimization has also moved the major part of the short trajectory from the initial well to the metastable well. This feature persists throughout the AC optimization, with only minor changes in the force being carried out inside the target end well. The force from the MCVB initialization, on the other hand, only generates trajectories that connect the initial and target wells without relaxing significantly in the metastable well. This is contrary to the instantonic relaxation mechanism in the system, as the stochastic action is minimized by the local relaxation in the metastable well. In Fig. 3(i) we have plotted the potential energy as a function of time for 100 uncorrelated barrier-crossing trajectories driven by the final force from the AC algorithm.
The trajectories cross the two barriers at roughly fixed times, and spend the majority of the time in the metastable well.
The comparison of the three algorithms illustrates the significant improvement in convergence performance of the MCVB and AC algorithms over the naive MCR approach, afforded by value functions. For rare reactive events, we have found that the AC algorithm is best suited to finding trajectories that explore configuration space widely in search of the easiest barriers to cross, and thus comes closest to resembling the natural fluctuations of the system. The errors in the converged values of the CGF depend on the truncation of the force basis and statistical uncertainties. The MCVB and AC algorithms preserve the computational scaling of MCR with the trajectory duration, and only change the prefactors of the scaling by a small fraction, making them viable methods for applications to complex systems. The AC algorithm with a small τ will incur a systematic error in the gradients if the value approximation is not accurate; this error goes away at an intermediate τ, but at the expense of a larger memory cost that may slow down the algorithm without any change in the scaling. Nevertheless, it is possible to use these algorithms with useful combinations of hyperparameters to achieve efficient convergence with a small amount of averaging. The value functions obtained during the optimizations serve as dynamical equivalents of the committor function, in that they encode the expected probability to reach the target well and the associated KL divergence cost, starting from any point in configuration space at any point in time. Understanding these connections to reaction coordinate design is likely a fruitful future direction of research.

V. GRADIENT OPTIMIZATION FOR INFINITE TIME DYNAMICS
We now generalize the approach of the previous sections to focus on the statistics of time-integrated quantities in the long time limit. While for finite times the generalized Doob transform is time dependent, under mild assumptions the optimal dynamics in the long time limit is time-homogeneous. 21 As a consequence, the parametrization of the modified force and value function is simplified, depending explicitly only on the instantaneous configuration of the system. The generalization of the algorithms to this case consists of two main changes. First, we employ online learning, since there is no end to each trajectory. Second, modified definitions of the return and value are required to avoid divergences in the infinite time limit.
We formulate the infinite time problem by adapting an approach in reinforcement learning based on time-averaged returns. 57,109-111 Specifically, we consider the long-time average of the KL divergence of the trajectory ensemble,

D̄_KL(p_θ|p_s) = lim_{T→∞} (1/T) D_KL(p_θ|p_s).

Under assumptions of time-independence and ergodicity, the time-averaged KL divergence reduces to an average over the steady state distribution of the instantaneous change of the return, r(x, ẋ),

D̄_KL(p_θ|p_s) = λ(s) − ⟨ r(x, ẋ) ⟩_{p_θ}.

Above, we have defined a scaled CGF,

λ(s) = lim_{T→∞} λ(s, T)/T,

that is finite as long as the cumulants of the time-integrated observable are time extensive. The reward, r(x, ẋ), is defined as

r(x, ẋ) = −s( A[x] + B[x] · ẋ ) − [F_θ(x) − F(x)]ᵀ (GGᵀ)⁻¹ [ ẋ − (F_θ(x) + F(x))/2 ],

and is time-independent and evaluatable within the steady state. A gradient expression analogous to MCR can be derived straightforwardly. 25 The previous definition of the value function would diverge in the infinite time limit. A simple modification to address this issue is to remove the average reward scaled by the length of the trajectory segment, defining a differential return,

R̄[X_{t,t+τ}] = R[X_{t,t+τ}] − τ ⟨ r(x, ẋ) ⟩_{p_θ},

and a corresponding differential value function,

V(x) = lim_{T→∞} ⟨ R̄[X_{0,T}] ⟩_{p_θ, x_0 = x},

which satisfies a modified Bellman equation containing the differential return between states, rather than the standard return, relating the values of states separated by a period of time τ,

V(x) = ⟨ R̄[X_{0,τ}] + V(x_τ) ⟩_{p_θ, x_0 = x}.    (43)

This modified Bellman equation can be simply rearranged to give an alternative equation for our time-averaged KL divergence,

⟨ r(x, ẋ) ⟩_{p_θ} = (1/τ) ⟨ R[X_{0,τ}] + V(x_τ) − V(x) ⟩_{p_θ, x_0 = x},

which we note holds for all x. Differentiating the right side of this equation with respect to θ does not involve the gradient of the stationary state. Therefore, taking the derivative and then averaging over the stationary state under F_θ, 112 we can write an estimate of the dynamical gradient as

∇_θ D̄_KL(p_θ|p_s) = −(1/τ′) ⟨ δ̄(0) y_θ(τ′) ⟩_{p_θ},

where we have defined the differential temporal difference error,

δ̄(t) = R[X_{t,t+τ}] − τ ⟨ r(x, ẋ) ⟩_{p_θ} + V(x_{t+τ}) − V(x_t),

reached after introducing an additional baseline in the form of τ⟨r(x, ẋ)⟩_{p_θ}. In this equation we have arrived at a gradient estimate which depends only on the gradient of the transition probabilities, contained in the Malliavin weight y_θ(τ′), and not the gradient of the stationary state itself. This can thus be easily calculated during a simulation using the parametrized dynamics. Note the period of time τ over which the temporal difference is calculated is independent of the period of time τ′ over which the Malliavin weight is calculated, provided the former is longer. The specific algorithm we consider involves taking the time τ′ small enough that the Malliavin weight can be approximated by τ′ ẏ_θ(0), which is possible due to the time homogeneous steady state we average within. We thus calculate the estimate as

χ̄_AC(θ) = −⟨ δ̄(0) ẏ_θ(0) ⟩_{p_θ},

which we denote as the actor-critic gradient in the long time limit. In practice, we will take τ′ = Δt, a single time-step in a numerical simulation. A long time limit generalization of the MCVB gradient could be constructed similarly, but this is not considered here. As in the finite time case, to construct this estimate we also need an approximation to the value function, V_ψ(x). Following a similar construction for the loss function as before, averaging the error over the stationary state, we estimate the gradient by which to update the value function parameters as

∇_ψ L(ψ, ψ_i)|_{ψ=ψ_i} = −⟨ δ̄_{ψ_i}(0) ∇_ψ V_ψ(x_0) ⟩_{p_θ},

with the approximate temporal difference

δ̄_ψ(t) = R[X_{t,t+τ}] − τ r̄ + V_ψ(x_{t+τ}) − V_ψ(x_t),    (49)

which also replaces the exact temporal difference in the gradient estimates for the dynamics. Finally, we also have flexibility in our estimate of the scaled CGF. This can be done using a running average of the reward,

r̄_{i+1} = r̄_i + α_r ( r_{p_θ_i} − r̄_i ),

where α_r is the learning rate and the subscript p_θ_i denotes a sample generated with the parameters from the ith iteration.
Alternatively, a lower variance, higher bias estimate may be constructed by noting that we can rearrange Eq. 43 to find an alternative equation for the average, updating the estimate in proportion to the temporal difference error itself, r̄_{i+1} = r̄_i + α_r δ̄_{ψ_i}. After discretization, an algorithm based on utilizing single-transition estimates of these gradients is outlined in pseudocode below in Alg. 2.
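A single transition of the resulting online algorithm can then be sketched as follows, using the temporal-difference-based update for the running average just described; the helper callables and their signatures are hypothetical.

```python
import numpy as np

def online_ac_step(theta, psi, r_bar, x, rng, force, grad_force, value,
                   grad_value, step_reward, dt, G, a_th, a_ps, a_r):
    """One transition of the long-time actor-critic (sketch of Alg. 2):
    a single Euler step yields the differential TD error of Eq. (49),
    which simultaneously updates actor, critic, and CGF estimate."""
    dW = rng.normal(0.0, np.sqrt(dt))
    x_new = x + force(theta, x) * dt + G * dW
    # differential temporal difference error over one timestep (tau = dt)
    delta = step_reward(theta, x, dW) - r_bar * dt \
        + value(psi, x_new) - value(psi, x)
    theta = theta + a_th * delta * grad_force(theta, x) * dW / G  # actor update
    psi = psi + a_ps * delta * grad_value(psi, x)                 # critic (semi-gradient)
    r_bar = r_bar + a_r * delta                                   # running reward average
    return theta, psi, r_bar, x_new
```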

VI. RARE FLUCTUATIONS IN THE LONG TIME LIMIT
Here we apply our approach to study the statistics of time-integrated currents in the long time limit. Persistent currents are the hallmark of a nonequilibrium system, and their fluctuations have been studied intensively. 26,[113][114][115] Foundational results have been derived that constrain the symmetries of current fluctuations and relate their cumulants. For example, the fluctuation theorems dictate that the CGF satisfies a reflection symmetry about the driving force for the current, due to the microscopic reversibility of the underlying stochastic dynamics. 116,117 A number of numerical approaches have been developed to evaluate the scaled cumulant generating function, an example of a large deviation function. 1,85,[118][119][120][121][122] These functions provide information about the long time behavior of stochastic systems, and encode response relationships and stability. Within this context, our approach is similar to other controlled-dynamics [14][15][16][17][18][19]25,123 based means of evaluating large deviation functions in the continuum, and can be used directly, as we show below, or in concert with Monte Carlo algorithms.
To study the accuracy and efficiency of the algorithm, we consider statistics of the velocity of a particle on a ring of length L = 2π with position x moving in a periodic potential. The periodic potential has the form U(x) = U_0 cos(x) with magnitude U_0, and the particle is driven by a constant force f, such that

F(x) = −∂_x U(x) + f = U_0 sin(x) + f

is the total force for the particle on the ring. The observable we consider is the integrated current, O[X_{0,T}] = J[X_{0,T}], given by

J[X_{0,T}] = ∫_0^T dx_t.

This observable has a different interpretation depending on whether the dynamics are under- or overdamped, both of which we consider below. In the underdamped case, the current is simply a function of the state with A(x) = v and B = 0, while in the overdamped case it depends on the stochastic increment, A(x) = 0, B(x) = 1.
The corresponding scaled CGF we aim to compute is

λ(s) = lim_{T→∞} (1/T) ln ⟨ e^{−sJ[X_{0,T}]} ⟩.

The first derivative of λ(s),

v(s) = −dλ(s)/ds,    (55)

reports on the average velocity in the tilted ensemble and is a useful indicator of the tails of the reference distribution. The scaled CGF exhibits a Lebowitz-Spohn symmetry 116 such that

λ(s) = λ(f − s),

where f is the affinity for the current. The scaled CGF can be computed by the numerical solution of a generalized eigenvalue problem, 25 which we use for this low dimensional system to compare the accuracy of our results. Despite its simplicity, this system has been shown to present non-trivial non-equilibrium phenomena due to the competition between ballistic and diffusive motion. 122,124,125 Here, the overdamped regime acts as a simple benchmark which can be easily solved by diagonalizing a projection of the Fokker-Planck equation. 125 The underdamped regime is a much more difficult problem to solve, due to a higher dimensional state space and long relaxation time. Indeed, despite access to the scaled CGF via diagonalization, 125 accurate results for the force in the underdamped case have been elusive. However, the actor-critic approach can solve this problem easily.
A. Current fluctuations of an overdamped particle

In the overdamped case, the evolution equation for the particle on a ring is given by

dx_t = [U_0 sin(x_t) + f] dt + √2 dW_t,

which is a dimensionless one-dimensional SDE. We integrate this equation with a timestep of 0.001. Since the position is periodic, an ideal representation of both the force and value function is given by a Fourier series,

F_θ(x) = a^θ + Σ_{i=1}^M [ b_i^θ cos(ix) + c_i^θ sin(ix) ],

and similarly V_ψ(x) with coefficients a^ψ, b_i^ψ, c_i^ψ. The results of the differential AC algorithm are shown in Fig. 4. We have truncated the basis with M = 5 and used learning rates of α_θ = 0.1 and α_ψ = 0.01. We annealed across the range of s considered, first learning the dynamics at s = −0.5, before sweeping across to s = 1.5 in steps of Δs = 0.1. The reward learning rate began at α_R = 10⁻⁵ and decreased linearly to α_R = 10⁻⁶ throughout training at each value of s, to enable rapid convergence to an accurate result.
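A minimal sketch of this parametrization follows; the naming and the illustrative values of U_0 and f are ours, and whether the learned force is written as the full series or as a periodic correction added to the reference force is an implementation choice (the correction form is shown).

```python
import numpy as np

M = 5  # Fourier truncation used in the text

def fourier_features(x):
    """[1, cos(x), ..., cos(Mx), sin(x), ..., sin(Mx)] on the ring."""
    m = np.arange(1, M + 1)
    return np.concatenate(([1.0], np.cos(m * x), np.sin(m * x)))

def modified_force(theta, x, U0=1.0, f=1.0):
    """F_theta(x): reference force U0 sin(x) + f plus a learned periodic
    correction; theta = 0 recovers the original dynamics."""
    return U0 * np.sin(x) + f + theta @ fourier_features(x)
```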
We detail estimates of three quantities calculated during the learning process. In Fig. 4(a) we show the estimate of λ(s), the quantity the algorithm is attempting to maximize. In Fig. 4(b) we show an estimate of the time-averaged KL divergence. In Fig. 4(c) we show an estimate of the time-averaged velocity. These estimates are running averages calculated using the samples taken from each transition, with learning rates of 0.1α_R. Learning curves are plotted for training at each individual bias s during the annealing process. For small changes of s, we see that convergence to an accurate estimate of the scaled CGF is achieved in approximately 10^6 training steps, each utilizing data from a single transition. This amounts to a speed-up of up to two orders of magnitude over the MCR algorithm. 25 In Figs. 4(d-f) we plot the end points of each of these learning curves for the three observables plotted in Figs. 4(a-c). In Fig. 4(d) we see the expected Lebowitz-Spohn symmetry, with reflection about s = 1/2 for the scaled CGF; the inset shows the absolute error compared to the numerically exact values. Correspondingly, the time-averaged velocity in Fig. 4(f) is antisymmetric about s = 1/2. This antisymmetry implies that the optimal force differs from the reference force more for s > 1 than s < 0. This demonstrates that the regular production of trajectories with significant negative time-integrated velocities requires a substantial change in the system's dynamics, in contrast to those with a significant positive velocity. Nevertheless, the learning algorithm employed here is capable of parametrizing the modified force sufficiently well to work across these regimes.

B. Current fluctuations of an underdamped particle
In the underdamped case, the position and velocity evolve according to two coupled SDEs given by

dx_t = v_t dt,
dv_t = [U_0 sin(x_t) + f − v_t] dt + √2 dW_t,

where the noise acts only on the velocity, v, and the friction, inverse temperature, and mass are taken as unity.
As before, we discretize our equations with a timestep of 0.001. For the underdamped case, the modified force and value function depend on both the position and velocity of the particle. The approximation need only provide a single output for a force applied to the velocity, as the optimal dynamics cannot change the evolution of the position, since the position is not directly influenced by noise. To accomplish this, a simple approach we have taken is to discretize the force and value function approximation along the velocity dimension. More precisely, we can adapt the Fourier series from the overdamped case, with velocity dependent coefficients given by

a^θ(v) = Σ_j a_j^θ I_{j,j+1}(v),

where

I_{j,j+1}(v) = { 1 if v_0 + jΔv < v ≤ v_0 + (j+1)Δv; 0 else },    (63)

and the boundary cases I_0(v) and I_{M_2+1}(v) return 1 for v less than v_0 or greater than v_0 + (M_2 + 1)Δv, respectively. We employ analogous equations for b_i^θ(v) and c_i^θ(v). To achieve accurate results, we find a spacing of Δv = 0.02 is sufficient, with v_0 = −8 and M_2 = 700 providing a broad enough range to encompass all relevant velocities at the biases considered. We use a Fourier basis with M_1 = 5. As before, we use the same functional form for the value function as for the modified force. Figure 5 shows estimates of the same three quantities as in the overdamped case, throughout the same annealed learning process. Here we increased the value learning rate to α_ψ = 0.1, retain a dynamics learning rate of α_θ = 0.1, and keep the scaled CGF learning rate fixed at α_R = 10⁻⁶ throughout training. Curves in Figs. 5(b,c) are produced from data calculated using the same learning rate as the scaled CGF, before using a windowed average over 100 steps to smooth the curves. We generally see fast convergence to an accurate result in approximately 10^8 transitions worth of updates. The larger learning time compared to the overdamped results reflects the significantly finer basis employed for the underdamped model. The ends of these curves are plotted below in Figs. 5(d-f). In the inset of Fig. 5(d) we see that we find accurate results compared to the numerically exact answers across the range of s considered. We see results analogous to the overdamped case, reproduced as dashed lines in Figs. 5(e,f): the underdamped system obeys the expected Lebowitz-Spohn symmetry. Compared to the overdamped system, the features of the KL divergence and average velocity in the underdamped system are sharper.
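The velocity dependence amounts to a piecewise-constant lookup per Fourier coefficient; a sketch with our own naming:

```python
import numpy as np

def binned_coeff(a_grid, v, v0=-8.0, dv=0.02):
    """a^theta(v) via the indicator bins of Eq. (63): one coefficient per
    velocity bin, with the first and last bins absorbing any v outside
    the covered range [v0, v0 + len(a_grid) * dv]."""
    j = int(np.clip(np.floor((v - v0) / dv), 0, len(a_grid) - 1))
    return a_grid[j]
```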
There are three distinct behaviors for the system as a function of s. For comparison, we have optimized the same functional form using the MCR algorithm, analogous to Ref. 25. The AC algorithm provides more accurate results than MCR when optimized using the same amount of statistics. 25 The MCR results are produced by annealing from s = 1.5 down to s = −0.5 in steps of 0.1. Training for each value of s involves 20 updates constructed using 50 trajectories with 10^6 time steps each, for a total of 10^9 transitions worth of data. Even after optimizing the hyperparameters, we see in Fig. 7 that convergence of the MCR algorithm is still much slower than that of the AC algorithm. As a consequence, the best results we can achieve using the same amount of transitions fail to converge to the correct values of the scaled CGF for biases close to s = 1. This demonstrates one key advantage of utilizing value functions: due to the reduction in the variance of gradient estimates using a small amount of data, we can perform many more updates using the same amount of transitions, improving convergence.

VII. CONCLUSIONS
In this paper we have demonstrated how regularized reinforcement learning algorithms can be used to optimize a diffusive dynamics to effectively sample rare trajectories. A key ingredient of our approach is a value function that estimates how relevant each state is to the rare dynamics, a function learnt while simultaneously guiding the optimization of the dynamics, allowing for reduced data generation and more detailed function approximations. Across a range of systems and observables, we found that the lower variance estimates of the gradient employing value functions enabled accurate and efficient characterization of rare dynamical fluctuations. In finite time problems, the AC algorithm in particular was able to solve challenging problems associated with multiple reactive channels and long-lived intermediates. In the long time limit, the AC algorithm reproduces exact results for the cumulant generating function by directly optimizing an accurate representation of the Doob dynamics, removing the need to calculate additional corrections or perform additional importance sampling.
While we have focused here on the simulation of rare event dynamics and the direct evaluation of their likelihoods, the methods of finding optimized forces developed here can be straightforwardly combined with trajectory importance sampling methods such as transition path sampling85 or cloning120 to correct for inaccuracies associated with an incomplete basis. Indeed, previous work has demonstrated that auxiliary dynamics can significantly improve the statistical efficiency of trajectory sampling methods.25,126-128 Further, Monte Carlo approaches can be used to generate data to train the optimal dynamics in a feedback routine, as previously demonstrated.14,15 This could emphasize the parts of the state space relevant to the rare events earlier than simply generating data with the current dynamics, thus speeding up optimization. Application to more complex models, such as many-body systems, will be an important development of this line of research. Accurate approximation of the force in many-body problems may require more sophisticated function approximations, such as neural networks; however, a difficult balance will need to be struck between the representative power of the approximation and the computational cost of evaluating it. More powerful function approximations will also necessitate more sophisticated algorithms, as training such approximations can become unstable when using correlated data, as we do here.

FIG. 7. Comparison between AC and MCR algorithms: (a) learning curves plotted versus the amount of data used during training, for the AC algorithm (solid, colored lines) and the MCR algorithm (colored crosses and dashed gray lines). Curves and crosses are color coded by the value of the bias s being trained for. (b) Final results for the AC algorithm (colored crosses) and the MCR algorithm (gray triangles), with absolute errors relative to the value from numerical diagonalization shown in the inset.

ACKNOWLEDGMENTS
AD and DTL were supported by NSF Grant CHE-1954580. DCR and JPG were supported by University of Nottingham grant no. FiF1/3 and EPSRC Grant no. EP/R04421X/1. JPG is grateful to All Souls College, Oxford, for support through a Visiting Fellowship during part of this work. DCR is grateful for access to the University of Nottingham Augusta HPC service.

DATA AVAILABILITY

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Appendix A: Discrete time implementation

We now describe how the time-continuous equations of the reinforcement learning algorithm are efficiently implemented in simulations with a fixed discrete timestep ∆t, though variable timesteps may easily be used. We use an Euler propagator to integrate the SDE in Equation (10), in which the noise increment ∆W is a Gaussian random variable with mean 0 and variance ∆t. The trajectory probability from Eq. (11) is then given by a product of stepwise Gaussian transition probabilities. Next we discretize the gradient of the logarithm of the trajectory probability using the Ito convention, and propagate the Malliavin weights from Eq. (18) accordingly. We also write the full return (14) as a sum of stepwise rewards, where the timestep index j starts from −1, with the notation t− accounting for the timestep before the current one and the subscript j referring to the time t + j∆t. The reward at each step is defined using the definition of the observable from Eq. (2), with an additional singular reward at the end of the trajectory after the last timestep n. The first three terms of the stepwise reward come from the observable, and the last two terms represent the KL divergence between the original and optimized dynamics.
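To make the discretization concrete, the following minimal Python sketch propagates a one-dimensional dynamics with an Euler step and updates the Malliavin weight accordingly. The unit diffusion constant, the generic parametrized force `force(theta, x)`, and its parameter gradient `force_grad(theta, x)` are illustrative assumptions, not the specific models used in the text.

```python
import numpy as np

rng = np.random.default_rng(1)

def euler_step(x, theta, dt, force, force_grad, y):
    """One Euler step of dx = F_theta(x) dt + sqrt(2) dW (unit diffusion
    assumed), returning the new position and the updated Malliavin weight
    y += d(log p(x'|x))/d(theta) for the Gaussian stepwise probability."""
    dW = rng.normal(0.0, np.sqrt(dt))        # mean 0, variance dt
    x_new = x + force(theta, x) * dt + np.sqrt(2.0) * dW
    # p(x'|x) ~ exp(-(x' - x - F dt)^2 / (4 dt)), and the residual
    # x' - x - F dt equals sqrt(2) dW, giving the increment below.
    y = y + force_grad(theta, x) * dW / np.sqrt(2.0)
    return x_new, y
```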
Now we combine the rewards, Malliavin weights and value functions in multiple ways to produce the gradients used in the different algorithms. Pseudocode for efficient implementations of each is presented below.

Monte-Carlo returns
The gradient in the Monte Carlo returns algorithm can be rewritten from Equation (19) by expressing the return as a time integral of its differential changes, where t+ is shorthand for t + ε for some small positive ε. This converts the double time integral into a single time integral, which is then evaluated on-the-fly while propagating the trajectory. An implementation of this algorithm with a fixed timestep ∆t is described in the pseudocode in Alg. 3.

Algorithm 3 Finite time MCR
1: inputs dynamical approximation F_θ(x, t)
2: parameters learning rate α_θ; total optimization steps I; trajectory length T consisting of J timesteps of duration ∆t each; number of trajectories N
3: initialize choose initial weights θ; define iteration variables i and j, force gradient δP, and stepwise rewards r representing the increments in the return
4: i ← 0
5: repeat
6: Use the chosen method to generate N trajectories X_{0,T}, with configurations, times, noises, Malliavin weights and rewards denoted by x_j, t_j, ∆W_j, y_θ(t_j) and r(x_{j+1}, x_j, t_j) = r_j respectively
7: δP ← (1/N) Σ_trajectories Σ_j r_j y_θ(t_j+), accumulated on-the-fly during propagation
8: θ ← θ + α_θ δP
9: i ← i + 1
10: until i = I
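As a complement to the pseudocode, here is a minimal runnable sketch of the MCR update in Python, reusing the hypothetical `euler_step` from the sketch above. The stepwise `reward(x_new, x, t)` is assumed supplied by the problem, and the initial-condition reward at j = −1 is omitted for brevity; this is an illustration, not the authors' implementation.

```python
import numpy as np

def mcr_update(theta, reward, force, force_grad, dt, J, N, alpha):
    """One MCR optimization step: a REINFORCE-style gradient in which the
    reward at each step multiplies the Malliavin weight accumulated up to
    and including that step (the discretized single time integral)."""
    grad = np.zeros_like(theta)
    for _ in range(N):                    # N independent trajectories
        x, y = 0.0, np.zeros_like(theta)  # state and Malliavin weight
        dP = np.zeros_like(theta)         # on-the-fly gradient accumulator
        t = 0.0
        for _ in range(J):
            x_new, y = euler_step(x, theta, dt, force, force_grad, y)
            dP += reward(x_new, x, t) * y
            x, t = x_new, t + dt
        grad += dP / N
    return theta + alpha * grad           # gradient ascent on the return
```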

Monte-Carlo returns with a value baseline
We use a similar technique to rewrite the double time integral for the gradient in the Monte Carlo value baseline algorithm, Equation (22), as a single time integral. We rewrite the gradient of the value error in Eq. (30) similarly, where the arguments of the value function V_ψ(x_t, t) have been suppressed as V_ψ(t), and the integral of the gradient of the value function up to and including the current time has been denoted z_ψ(t+). We explicitly set V_ψ(x_t, t) to 0 for any t ≥ T, i.e., after the last timestep, in these expressions. The single time integral is then evaluated on-the-fly as the trajectory is propagated. If the force and the value function approximations use the same set of basis functions, as we do with a fixed grid of Gaussians, the MCVB algorithm incurs no additional computational cost over the MCR algorithm. An implementation of this algorithm with a fixed timestep ∆t is described in the pseudocode in Alg. 4.
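For comparison, here is a minimal sketch of the value-baseline update, written in a simple two-pass form (forward trajectory, backward return-to-go) rather than the on-the-fly single-integral form of Alg. 4. The linear value model V_ψ(x, t) = ψ·φ(x, t) with user-supplied features `phi` is our illustrative assumption, and `euler_step` is the hypothetical helper from above.

```python
import numpy as np

def mcvb_update(theta, psi, reward, force, force_grad, phi,
                dt, J, N, alpha_theta, alpha_psi):
    """Monte Carlo returns with a value baseline: the forward pass stores
    per-step quantities, the backward pass forms returns-to-go, and the
    baseline-subtracted return multiplies each Malliavin increment."""
    g_theta = np.zeros_like(theta)
    g_psi = np.zeros_like(psi)
    for _ in range(N):
        x, y_prev, t = 0.0, np.zeros_like(theta), 0.0
        steps = []
        for _ in range(J):
            x_new, y = euler_step(x, theta, dt, force, force_grad, y_prev)
            dy = y - y_prev                   # stepwise Malliavin increment
            steps.append((x, t, reward(x_new, x, t), dy))
            x, y_prev, t = x_new, y, t + dt
        G = 0.0                               # return-to-go; V(x, t >= T) = 0
        for x_j, t_j, r_j, dy_j in reversed(steps):
            G += r_j
            adv = G - psi @ phi(x_j, t_j)     # baseline-subtracted return
            g_theta += dy_j * adv / N
            g_psi += phi(x_j, t_j) * adv / N  # regress V toward the return
    return theta + alpha_theta * g_theta, psi + alpha_psi * g_psi
```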

Appendix B: Comparing errors in gradient estimates
In Figure 8 we directly compare the three algorithms on their ability to reduce the variance of the gradient estimates during optimization in the softened Brownian bridge problem. We have chosen the force and value function coefficients θ and ψ from the i = 100 step of the MCVB optimization run in Fig. 1(b). This value function is thus not exact for the corresponding force, but is representative of typical inaccuracies encountered during learning. Keeping these coefficients fixed, we have estimated the gradients of the KL divergence using the three algorithms while varying the number of uncorrelated trajectories N_w. The dependence on N_w in log-log scale corresponds to a linear trend with a slope of −1, as expected for the variance of sample means of uncorrelated samples. We find that use of the MCVB and AC algorithms greatly reduces the variance compared to the MCR approach, equivalent to a 5 to 100 times increase in the amount of input trajectory data. The smallest variance corresponds to the AC algorithm with the smallest possible τ, set to the timestep 0.001. However, this choice incurs a systematic error in the expectation of the gradient due to the inaccuracy in the value function, while neither MCVB nor AC with a large τ is susceptible to it. This is manifested in the scaled L_1 norm of the error in the expected gradient from the algorithms. The expectation is calculated over 10^5 trajectories, and the error in MCR is zero by definition. The L_1 norms of the errors, divided by that of the true gradient, are 0.22, 7.49 and 1.16 for MCVB, AC(τ = 0.001) and AC(τ = 0.1) respectively. This shows that the systematic error incurred by AC at small τ can be reduced by using a larger τ, while still retaining significantly less variance than MCVB and MCR. The crossover between the systematic and statistical error in the AC algorithm as a function of τ is also the reason that starting the optimization with a small τ and later annealing to a larger τ is an efficient strategy, given that the memory requirement scales linearly with τ. We note that the systematic error is formally zero by definition in the expectation of the MCVB gradient estimate as well: the small non-zero value stems from the finite number of samples used to estimate the expectation.
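The expected 1/N_w scaling of the estimator variance can be checked with a few lines; the toy Gaussian estimator below stands in for the gradient estimators of the text.

```python
import numpy as np

rng = np.random.default_rng(2)
for Nw in (10, 100, 1000):
    # variance of the sample mean across 2000 independent batches of Nw
    means = rng.normal(size=(2000, Nw)).mean(axis=1)
    print(Nw, means.var())  # falls off as 1/Nw: slope -1 on a log-log plot
```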
Appendix C: Alternative CGF estimates

Numerically exact CGF
We have compared the CGF from the reinforcement learning algorithms in Section IV B with numerically exact values obtained from explicitly calculating ⟨h_Γ⟩_p in Equation (33) through a spectral decomposition of the corresponding Fokker-Planck operator L for the original dynamics in Eq. (36), where F(x) = −∇U(x) is the underlying conservative force. We use this operator to propagate the initial distribution for a time T and obtain the probability ⟨h_Γ⟩_p by exponentiating the operator in its spectral eigenbasis. Since the forces in the original dynamics are conservative, diagonalizing L is made easier by a similarity transform into a Hermitian operator L̃,61,130 which we diagonalize to obtain eigenvalues −λ_n and eigenfunctions φ_n(x), satisfying L̃φ_n(x) = −λ_n φ_n(x).

Since L̃ is Hermitian, the eigenfunctions {φ_n(x)} are mutually orthonormal and can be used to introduce a resolution of the identity. The original operator L, related by the similarity transform, has the same eigenvalues −λ_n and eigenfunctions e^{−U(x)/2}φ_n(x). This spectral expansion of L can be used to estimate the probability ⟨h_Γ⟩_p. The final time T that we use in our barrier-crossing simulations is chosen such that τ_rlx < T < τ_rxn, where τ_rlx and τ_rxn are respectively the timescale of relaxation in the starting or ending well, and the timescale of the barrier-crossing reaction, which is expected to be the slowest dynamical mode in the system. Hence when the set {λ_n} is ordered, the factor e^{−λ_n T} should be negligible for all but the few smallest values of n, and the sum over n in Equation (C6) is thus expected to converge within a few terms. We diagonalize the operator L̃ using a Discrete Variable Representation basis constructed from Hermite polynomials104 in two dimensions, χ_{M,N}(αx, αy), where α = 5 is a scaling factor. We obtain identically converged estimates of ⟨h_Γ⟩_p with basis sizes ranging from 50×50 to 100×100, using 10 terms in the spectral expansion. The CGF value is then calculated from ⟨h_Γ⟩_p using Equations (32) and (33).
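To illustrate the spectral route in the simplest setting, the sketch below computes a barrier-crossing probability for a hypothetical one-dimensional double well, with unit temperature and friction assumed so that the Hermitian operator takes the standard form H = ∂_x² − [(U′)²/4 − U″/2], and with a finite-difference grid standing in for the DVR basis used here.

```python
import numpy as np

# Grid and hypothetical double-well potential (unit temperature/friction).
x = np.linspace(-3.0, 3.0, 600)
h = x[1] - x[0]
U = (x**2 - 1.0)**2            # illustrative double well, wells at x = +/-1
Up = np.gradient(U, h)         # U'
Upp = np.gradient(Up, h)       # U''

# Hermitian operator H = d^2/dx^2 - [(U')^2/4 - U''/2], finite differences.
n = len(x)
D2 = (np.diag(np.ones(n - 1), -1)
      - 2.0 * np.eye(n)
      + np.diag(np.ones(n - 1), 1)) / h**2
H = D2 - np.diag(Up**2 / 4.0 - Upp / 2.0)

lam, phi = np.linalg.eigh(-H)  # eigenvalues lambda_n, ascending
phi /= np.sqrt(h)              # normalize so that sum(phi^2) * h = 1

# Probability of reaching the product region B = {x > 0} at time T,
# starting from x0 = -1, keeping the 10 slowest modes as in the text.
T, x0 = 5.0, -1.0
i0 = np.argmin(np.abs(x - x0))
hB = (x > 0).astype(float)
p = sum(
    np.exp(-lam[m] * T)
    * np.sum(hB * np.exp(-U / 2.0) * phi[:, m]) * h
    * np.exp(U[i0] / 2.0) * phi[i0, m]
    for m in range(10)
)
print(p)
```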

CGF from Kramers escape rate
In one dimension, with dynamics along a reaction coordinate q, an approximate expression for the barrier-crossing probability in time T is given by the Kramers escape rate in the overdamped limit,131
\[
\langle h_\Gamma\rangle_p \approx kT, \qquad k = \frac{\sqrt{U''(q_A)\,|U''(q^\dagger)|}}{2\pi\gamma}\, e^{-\beta[U(q^\dagger)-U(q_A)]},
\]
where q_A and q† are the locations of the initial well and the barrier, respectively. In the case of the Müller-Brown potential, we take the ideal reaction coordinate to lie along the minimum-energy path obtained using a Nudged Elastic Band method.94-96 With the potential energy U(q) computed along this path, we use quadratic fits around the initial well (q_A) and around the largest barrier (q†) to find the second-derivative terms. Finally we use this approximate value of ⟨h_Γ⟩_p in Equations (32) and (33) to obtain the CGF.
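A short sketch of this estimate follows, with hypothetical curvatures and barrier height standing in for the values obtained from quadratic fits along the minimum-energy path.

```python
import numpy as np

def kramers_probability(w2_A, w2_dagger, dU, T, beta=1.0, gamma=1.0):
    """Overdamped Kramers estimate of the probability of crossing a barrier
    of height dU in time T, given curvatures U''(q_A) = w2_A and
    |U''(q_dagger)| = w2_dagger from quadratic fits (k*T << 1 assumed)."""
    k = np.sqrt(w2_A * w2_dagger) / (2.0 * np.pi * gamma) * np.exp(-beta * dU)
    return k * T

# Hypothetical numbers, e.g. from quadratic fits along a minimum-energy path:
print(kramers_probability(w2_A=2.0, w2_dagger=1.5, dU=5.0, T=4.0))
```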