Charting Galactic Accelerations: When and How to Extract a Unique Potential from the Distribution Function

The advent of datasets of stars in the Milky Way with six-dimensional phase-space information makes it possible to construct empirically the distribution function (DF). Here, we show that the accelerations can be uniquely determined from the DF using the collisionless Boltzmann equation, providing the Hessian determinant of the DF with respect to the velocities is non-vanishing. We illustrate this procedure and requirement with some analytic examples. Methods to extract the potential from datasets of discrete positions and velocities of stars are then discussed. Following Green&Ting (arXiv:2011.04673), we advocate the use of normalizing flows on a sample of observed phase-space positions to obtain a differentiable approximation of the DF. To then derive gravitational accelerations, we outline a semi-analytic method involving direct solutions of the over-constrained linear equations provided by the collisionless Boltzmann equation. Testing our algorithm on mock datasets derived from isotropic and anisotropic Hernquist models, we obtain excellent accuracies even with added noise. Our method represents a new, flexible and robust means of extracting the underlying gravitational accelerations from snapshots of six-dimensional stellar kinematics of an equilibrium system.


INTRODUCTION
In galactic astronomy, a fundamental problem is to extract the underlying gravitational potential from the kinematics of a tracer population. If stars are moving on circular orbits in a spherical potential, then the matching of the centrifugal force to the gravitational one gives the rotation curve, and by extension the potential. Elaborations of this basic idea to stellar streams have proved to be one of the most powerful methods available to us today (e.g., Lynden-Bell 1982;Johnston et al. 1999;Bowden, Belokurov & Evans 2015;Erkal et al. 2019;Malhan & Ibata 2019).
If the stellar population is not kinematically cold, the traditional way in which the problem is tackled is via the Jeans equations (Binney & Tremaine 2008, chap. 4). Given measurements of the second velocity moments and the density of the tracer population, the Jeans equations can be solved to yield the potential. There are numerous applications of this method both to the Milky Way (e.g., King et al. 2015;Bowden, Evans & Williams 2016;Nitschai, Cappellari & Neumayer 2020) and external galaxies (e.g., Cappellari 2008;Walker et al. 2009). Some studies have instead worked directly with the distribution function, fitting some assumed parametric form to the observed stellar data (e.g., Binney & Piffl 2015;Williams & Evans 2015;Posti & Helmi 2019). More rarely, the distribution function is constructed directly from the data, as in Kuijken & Gilmore (1989)'s numerical Abel inversion of the vertical tracer density. This though relies on the assumption that the vertical and in-plane dynamics are decoupled, and so is not of general applicability.
However, the advent of the Gaia satellite (Gaia Collaboration 2016) has made possible the empirical construction of the full phasespace distribution function for stellar populations in the Milky Way, and perhaps even for some of its satellite galaxies. The data now comprise the full positions and velocities of many millions of stars. The process of averaging to obtain the second velocity moments does not do justice to the richness of the data. Green & Ting (2020) recently raised the possibility of direct determination of the gravitational potential from the distribution function using the collisionless Boltzmann equation itself. This is the continuity equation satisfied by the distribution function in the six-dimensional phase space of positions and velocities.
At every location in physical space, the collisionless Boltzmann equation provides a single constraint on the three unknown components of the gravitational force. Thus, it is unclear if the identification of a stationary distribution function is sufficient to specify uniquely the gravitational potential (modulo an additive constant). So, the first aim of our paper is to establish the conditions under which the potential can be uniquely recovered, given the distribution function. The second aim of our paper is to provide a working algorithm to extract the potential. Whereas Green & Ting (2020) proposed a neural network, we instead utilize an efficient and accurate semi-analytic method, based on a direct solution of the collisionless Boltzmann equation. We demonstrate the efficacy of our method on mock datasets sampled from isotropic and anisotropic distribution functions of galaxy models, including the effects of errors.

THE COLLISIONLESS BOLTZMANN EQUATION AND THE POTENTIAL
Here, we address the theoretical question that underlies all this work: namely, when is the potential uniquely specified by the distribution function? We prove a uniqueness theorem in Section 2.1 subject to certain conditions, and investigate the instances when the conditions are violated in Section 2.2.

Uniqueness theorem
If F( p; x) is a phase-space distribution function (DF) in equilibrium in the static potential Φ(x), then it is an integral of motion of the Hamiltonian H = 1 2 3 j,k=1 g jk p j p k +Φ(x). Here p = (p 1 , p 2 , p 3 ) is the momentum component conjugate to a coordinate set x = (x 1 , x 2 , x 3 ) with the metric coefficients g i j and its inverse g i j . A mathematical representation of F being an integral of motion is given by the vanishing Poisson bracket of the integral F with the Hamiltonian H : namely, Considered as a partial differential equation for F, this is equivalent to the (time-independent) collisionless Boltzmann equation (CBE). 1 Since equation (1) is a linear homogeneous equation for F, any func- is also a solution. That is, the CBE only describes a (necessary) condition for the DF to be stationary and cannot uniquely determine the DF for any given potential. In fact, physical considerations make it obvious that many different DFs can indeed be in equilibrium with the given potential.
On the other hand, if a stationary DF is known, the CBE may also be interpreted as a partial differential equation for the potential. Here the question is whether the given DF (or more generally an integral of motion) can determine a unique potential through the CBE. The CBE is linear in Φ (albeit non-homogeneous) and so there exists a gauge freedom such that, if Φ 0 is a particular solution, the function Φ 0 +G(I), where G(I) is an arbitrary function of a particular solution I to the homogeneous counterpart, also satisfies the same inhomogeneous differential equation. However, the potential Φ = Φ(x) is a function of only the configuration-space coordinates, whereas the CBE is a partial differential equation in phase space. In other words, we must only consider the solutions that are also constant along any direction in momentum space; that is, the solution must also be subject to the constraints that ∂Φ/∂p 1 = ∂Φ/∂p 2 = ∂Φ/∂p 3 = 0. Are these then sufficient to uniquely determine the potential Φ for the given DF?
Let us suppose that a DF F(p; x) is known to be stationary in the potential Φ 0 (x). Then it follows that {F, If there exists another potential Φ which the same F is also a stationary DF in, the potential Φ satisfies the CBE with F in equation (1) or equivalently equation (2) but with Φ 0 → Φ. Eliminating the common terms between two CBEs, we can construct a homogeneous linear partial differential equation for the difference Φ − Φ 0 : Here Φ − Φ 0 is a function of only the real-space component (x 1 , x 2 , x 3 ), whereas F is a function of phase space in general. Thus taking the partial derivative with respect to one of the momentum components results in the set of three differential equations: Since the Hessian matrix [∂ p i ∂ p j F] (where ∂ p i = ∂/∂p i and so on) is real symmetric, it is diagonalizable at least locally by a point-wise orthogonal transformation. In the local coordinate diagonalizing the Hessian (in which ∂p i ∂p j F = 0 for i j), equations (4a) reduce to Therefore, if λ i = ∂ 2 p i F 0 for a direction in the transformed coordinate, then ∂(Φ−Φ 0 )/∂q i = 0 along the conjugate coordinate direction associated with the non-zero eigenvalue λ i . If m is the rank (i.e. the number of non-zero eigenvalues) of the Hessian, the difference Φ − Φ 0 is consequently an arbitrary function of 3 − m functionallyindependent functions q j = q j (x 1 , x 2 , x 3 ), which are the coordinate functions corresponding to the eigenvectors associated with the null eigenvalues.
In particular, if the Hessian determinant is non-vanishing, then λ i 0 for all i and m = 3. Solving equations (4a) as a series of linear equations for ∂(Φ − Φ 0 )/∂x i then results in where C is an arbitrary constant; that is, the potential Φ(x) satisfying the CBE for a given DF, if it exists, is essentially unique up to an additive constant (resulting in the identical gravitational acceleration field). In other words, the non-vanishing Hessian of equation (5) is a sufficient condition for the uniqueness of the potential for a given stationary DF.
2.2 Are there physical DFs that do not specify a unique potential?
If the Hessian [∂ p i ∂ p j F] is singular, there exists a local momentumspace coordinate system (p 1 ,p 2 ,p 3 ) such that the directional derivative of F in a fixed coordinate direction must be constant in momentum space. That is to say, the singularity condition indicates that at least one eigenvalue, which is the second-order partial derivative in the corresponding coordinate direction, must be zero (i.e. λ j = ∂ 2 p j F = 0 for ∃ j). Since the coordinate can be chosen to be orthogonal so that all the second-order cross partial derivatives vanish (∂p i ∂p j F = 0 if i j), there then exists a coordinate system in which all the second derivatives involving one particular coordinate should be zero (i.e. ∂p i ∂p j F = 0 for ∀i and ∃ j). Therefore the directional derivative of F in the same coordinate direction must be constant in momentum space; that is, ∂p j F = k 0 (x) for ∃ j. In the original coordinates, this implies that 3 i=1 k i ∂ p i F = k 0 where k i 's are the constants in momentum space (but they are functions of the real-space positions) and at least one of { k 1 (x), k 2 (x), k 3 (x) } is nonzero. In fact, if there are two or more distinct potentials satisfying the CBE with the given DF, equation (3) further indicates that there exists { k 1 , k 2 , k 3 } such that k 2 1 + k 2 2 + k 2 3 0 and (k 1 ∂ p 1 + k 2 ∂ p 2 + k 3 ∂ p 3 )F = 0. In other words, if the function F is an integral of motion in two (or more) distinct -as in generating different gravitational accelerations -potentials, then there exists a fixed direction (k 1 , k 2 , k 3 ) in momentum space that is tangent to the level surfaces of the DF everywhere in momentum space. However, the integral curve of a constant vector is a straight line and momentum space is topologically equivalent to R 3 . Consequently, all the level surfaces of F have infinite extent and the inverse image of any real interval under F −1 in momentum space cannot have a compact support (unless empty). That is to say, such a function F is not integrable and cannot be a physical DF.
In light of this, we argue that the unique determination of the potential is a property related to the global behaviour in momentum space. That is to say, the CBE only describes the balance amongst the gradients of the DF and the external acceleration field in the local neighborhood of a fixed phase-space location, whilst the external gravitational acceleration is shared in the whole momentum space at a fixed real-space position. By joining all the constraints on the acceleration fields coming from the CBE in different momentum-space locations (but at a fixed real-space position), we can narrow down to the unique acceleration. This fact is also demonstrated by the examples presented in the following section (Sect. 3) where a unique potential actually follows from insisting that the CBE holds for all values of the momentum components.

EXAMPLES
To gain insight into the steps needed to extract a unique potential from the CBE, we first look at some analytic examples.

Ergodic distributions: a unique potential
We start by examining the case of an ergodic DF F = f (E) in a fixed potential Φ 0 (x), where E = 1 2 2 + Φ 0 is the specific energy and is known as a function of the phase-space coordinates. Here, no further assumption is made on the self-consistency of the system and so the potential need not be spherically symmetric (cf. An, Evans & Sanders 2017). In Cartesian coordinates, the CBE is then reducible to the differential equation on the difference between any two possible potentials: Assuming that the DF in itself is not constant, that is, f (E) 0, then in order for this to hold everywhere in phase space, Therefore Φ = Φ 0 + C and the potential is unique (up to an additive constant).

Separable potentials with third integrals
If there exists a DF of the form , then the resulting CBE in equation (1) reduces to a cubic polynomial equation on p i 's. This is of course the old "ellipsoidal hypothesis" (see Chandrasekhar 1939;Camm 1941;Evans & Lynden-Bell 1991, and references therein). Assuming that the DF is stationary, the CBE should hold for any p i 's and so the coefficients to all the monomial terms (p i p j p k , p i p j , and p i etc.) must vanish identically. It is then found that the coefficients to the cubic and quadratic terms respectively only involve the tensor K i j and the vector X i , and the first-order partial differential equations resulting from setting them to be zero restrict the possible forms for K i j and X i (An 2013, and references therein). However if the DF is already given and known to be stationary, these conditions must hold automatically.
On the other hand, setting the coefficients to the linear terms to be zero results in the set of three differential equations: If ξ(x) is known, these can be considered as the coupled differential equations on the potential Φ. Provided that the matrix [K i j ] is invertible (here also note that K i j = ∂ p i ∂ p j J), equation (9) can be uniquely solved for ∂Φ/∂x i so that where In other words, if the local DF that is a function of a non-degenerate quadratic form of the canonical momenta is stationary, the gravitational acceleration is uniquely specified in the neighborhood.
As a concrete example, suppose that there exists a stationary DF of the form F = f (J) where with constants a and k, is the third integral of the Kuzmin (1956) disc potential in the cylindrical polar coordinate (R, φ, z) and = = ( · ) 1/2 is the magnitude of the specific angular momentum. Here, = x ×ẋ = (Rê R + zê z ) × ( RêR + φêφ + zêz ) and so follows that Since this holds for all (p R , p φ , p z ), we have ∂Φ/∂φ = 0 and where we have used ∂|z|/∂z = z/|z| (NB: ∂ξ/∂z at z = 0 does not exist). If a 0, we can solve equation (13) for ∂Φ/∂R and ∂Φ/∂z: which satisfies the compatibility condition, (14) can be directly integrated to yield a unique solution: which recovers the axisymmetric potential of the Kuzmin disc up to an additive constant C.

Integrals of motion due to the symmetry of the potential
Let us consider the DF F = f ( z ) where z = ·ê z is the component of the specific angular momentum in a fixed (say, Cartesian z) direction. Technically any such function cannot be integrable over the whole phase space and so is unphysical. Nevertheless, the CBE merely requires F to be an integral of motion, and so is still appli- Provided f ( z ) 0, this implies that ∂Φ/∂φ = 0, the general solution of which is any axisymmetric potential; that is, an arbitrary function Φ = Φ(R, z) of two coordinate functions R and z. Also note ∂F/∂p R = ∂F/∂p z = 0 indicates that the only non-zero component of the Hessian [∂ p i ∂ p j F] is ∂ 2 F/∂p 2 φ and so it follows that the rank of Hessian is 1 as long as ∂ 2 F/∂p 2 φ = f ( z ) 0. The result is independent of the choice of the coordinate, although the calculation may be more complicated. For example, in Cartesian coordinates, z = x y − y x and so the Hessian becomes and so, unless f ( z ) = 0, we have a homogeneous first-order linear partial differential equation on Φ(x, y, z), Utilizing standard techniques such as the method of characteristics, its general solution is found to be Φ = Φ(x 2 + y 2 , z), which is again an arbitrary axisymmetric function.
Similarly if a stationary DF (or rather an integral of motion) of the form F = f ( 2 ) is available, the CBE in the canonical phase-space coordinate (p r , p θ , p φ ; r, θ, φ) inherited from the spherical polar coordinate (r, θ, φ) is reducible to If f ( 2 ) is a non-constant integral of motion, equation (19b) should hold everywhere in phase space (i.e. for any p θ and p φ ) and so Hence the general solution is any spherically symmetric potential.
As for the rank of the corresponding Hessian, we observe that the rank of the matrix [∂ p i ∂ p j 2 ] is 2 (independent of the coordinate system) with the radial vector being the eigenvector associated with a null eigenvalue (note ∂ 2 /∂p r = 0 in the spherical polar coordinate). In addition the radial vector is also in the null space of the matrix [(∂ p i 2 )(∂ p j 2 )], thanks again to ∂ 2 /∂p r = 0. Hence, for any f ( 2 ), the radial vector is in the null space of the Hessian matrix; In other words, the Hessian is singular and its rank is at most 2.
Since any axisymmetric or spherical potential admits the integral of motion z or 2 , it is not an unexpected result that F = f ( z ) or f ( 2 ) only constrains the associated symmetry of the potential and cannot specify the unique potential. The above examples however demonstrate that such integrals of motion also fail the necessary condition of having a non-singular Hessian in momentum space. Furthermore, we also observe that f ( z ) and f ( 2 ) are not actually integrable in momentum space. That is to say, f ( z ) is independent of p R and p z , but both components are unbounded, and so the integral of any non-negative f ( z ) over the whole momentum space is infinite (unless it is identically zero). A similar argument can also be made for f ( 2 ) and the component p r . We have argued in Section 2.2 that this is not an accident, but that there is a logical connection between the singular Hessian and the non-integrability.

ALGORITHMS FOR EXTRACTING THE GRAVITATIONAL ACCELERATION
Suppose that stationary DF F is known. How then can we extract the gravitational accelerations? First consider the CBE in an arbitrary curvilinear orthogonal coordinate (in which the line element is ds 2 = h 2 1 dx 2 1 + h 2 2 dx 2 2 + h 2 3 dx 2 3 ) rearranged to be ∂F ∂ 1 where i = h iẋi is the velocity component projected on the orthonormal frame. Next let us observe that ∇Φ is constant at all the velocity-space points, given a fixed real-space position. Hence the subset of equations (22a) sampled over the range of velocity space at a fixed position results in an over-determined (assuming there are more than three sampling points) system of linear equations on (∂Φ/∂x 1 , ∂Φ/∂x 2 , ∂Φ/∂x 3 ). Technically, we only need samples at three different velocity-space points so as to uniquely determine the local gravitational acceleration, provided that the three vectors ∇ F at the three sampled points -where ∇ = (∂ 1 , ∂ 2 , ∂ 3 ) is the gradient operator in velocity space -are mutually linearly independent.
In fact, the non-singular Hessian of F as discussed in Section 2 guarantees the existence of such three points in velocity space (and so is a sufficient condition for the unique determination of the potential).
On physical grounds, the over-determined system of equations (22a) resulting from more than three velocity-space points at a single spatial location should be consistent and must possess a unique solution. However, due to the uncertainties in the data, the exact solution may not be necessarily found with the actual set of equations in practice. Instead, the problem should be approached by methods such as least-square: that is, minimizing where S is as defined in equation (22b), and the summation is over a suitably-chosen sample of velocities with the weights ς −2 . Finding the extrema with respect to ∇Φ = (∂Φ/∂x 1 , ∂Φ/∂x 2 , ∂Φ/∂x 3 ) is then equivalent to solving the set of linear equations: where which is basically the set of standard normal equations. Therefore, provided the matrix [A i j ] defined as in equation (24b) is invertible, ∇Φ that minimizes equation (23) at the same position can be found through a matrix inversion. Alternatively one may also attempt to minimize equation (23) summed over data points ranging in a region of space, in order to get the potential as an optimizing functional solution. In principle, this can be done with a suitably-chosen parametric function for the potential or non-parametrically (pixelized or otherwise), which is closer to the implementation proposed by Green & Ting (2020) to recover the potential. After reconstructing the DF from the discrete dataset via normalizing flows, Green & Ting (2020) characterized the potential as an optimized feed-forward neural network minimizing the cost function, which is defined similarly to equation (23) but with the absolute value instead of the square and also includes the penalty for the negative density. This procedure combines the determination of the local accelerations and their integration into the potential as one single optimization problem. Nonetheless, the actual physical constraints due to the CBE are in the form of an algebraic relation on the local acceleration and so the measurements of the accelerations at different spatial locations should in principle be independent (except for possible systematic correlations relating to the determination of the DF).

EFFECTS OF DISEQUILIBRIUM
If the stellar system is not in equilibrium, its DF F(p; x; t) by definition, is no longer an integral of motion. Provided that the collisional effects are negligible, the evolution of the DF is still governed by the CBE, but the CBE now must include explicit time dependence; D t F = ∂ t F + {F, H} = 0, where D t F is the (Lagrangian) phase-space convective derivative and ∂ t F = ∂F/∂t is the (Eulerian) time rate of change of F at a fixed phase-space coordinate, whilst {F, H} is the same as equation (1). We observe that the argument in Section 2.1 still holds for the time-dependent CBE as long as ∂ t F is also a known quantity. In particular, equation (22a) maintains the same form but the right-hand side additionally includes the ∂ t F term (S → S +∂ t F), and so the determination of the acceleration is still possible if ∂ t F's are known throughout phase space. However ∂ t F is impossible to measure directly within a practical time scale barring few exceptional situations -by contrast, if ∇Φ is known independently, ∂ t F may instead be determined using the CBE. If ∂ t F is considered as unknown, the system of equations (22a) becomes under-constrained and the problem is technically insoluble without some additional restrictive assumptions on the behaviours of ∇Φ or ∂ t F.
Nevertheless, we may still infer effects due to the system not being in equilibrium. If the time derivatives are neglected when not warranted, that will introduce a systematic bias. Notably, the linear system of equations (22a) would then not necessarily be consistent even if all the phase-space derivatives of F are known exactly. Whilst equation (24) still has a unique solution despite the system of equations (22a) being inconsistent, the resulting solution is actually offset by the "sample average" of ∂ t F. That is to say, if ∂Φ s /∂x i is the solution of inverting equation (24) with ∂ t F = 0 (whereas ∂Φ/∂x i is the true gravitational acceleration component), then and A −1 i j is the matrix element of the inverse matrix of [A i j ] in equation (24b). This follows from the fact that ∂Φ/∂x i is actually the solution of equation (24) with S → S + ∂ t F. If we insert back the solution (eq. 25) into equation (22a) and consider the departure from the equality at each sample point, then (with In other words, the residuals consist of the time derivative ∂ t F and the projection of the bias (i.e. B i ) onto ∇ F. We note that B i 's are unknown but fixed constants and so the last term is also considered as ∇ F projected onto a fixed (albeit unknown) direction, which behaves in a predictable systematic pattern. Consequently it would be a smoking gun for a system in disequilibria if the observed residual on each sample point exhibits a systematic behaviour over velocity space not consistent with a projection of ∇ F onto a fixed direction.

IMPLEMENTATION
Given a known DF, equation (24) furnishes us with a way to calculate gravitational accelerations, under the assumption of equilibrium.
We now wish to test this technique on a mock dataset.
Here, we demonstrate a complete pipeline from a six-dimensional (6D) stellar kinematics dataset to a map of accelerations. This will necessitate an additional step in the procedure, that is, obtaining the underlying DF of the data. Whereas a conventional approach might  (2020) and construct a non-parametric DF directly from the data. Our method can thus be summarized: (i) Employing a normalizing flow technique, we reconstruct a non-parametric DF from the mock data.
(ii) With this reconstructed DF in hand, we exploit eq. (24) to calculate accelerations. This exercise serves mainly as a proof of concept. In a subsequent paper (Naik et al., in prep.), we shall apply the same methodology to local stellar kinematics, with a view towards mapping the acceleration field (and thence the distribution of matter) in the solar neighbourhood.
It is worth noting that an acceleration field calculated with our method is not guaranteed to be physical, in the sense that it might show negative divergences (i.e. negative mass densities) or non-zero curls (i.e. non-conservative force). We view this feature as an advantage: the existence of such non-Newtonian accelerations can serve as a valuable post hoc test of our method. If they are found to be robust, they might hint at disequilibrium features or non-gravitational force (even modified gravity). On the other hand, the requirements for non-negative divergences and vanishing curls can be imposed a priori if so desired, by adding penalty terms to the loss function used to train the normalizing flow. These non-Newtonian accelerations are then still possible in principle, but heavily suppressed.

Ergodic models
We consider a simple galaxy halo model in which the DF selfconsistently generates both the potential and the density. We generate a mock 6D dataset using this DF, and then attempt to derive the underlying acceleration field from the mock data. For this model, we adopt the spherical Hernquist (1990) profile, specified by the potential-density pair where M and a are respectively the galaxy mass and scale radius. The isotropic (ergodic) DF for this model is given by 2 where = −Ea/GM ≥ 0 (here, −E is the specific binding energy of a star). In this case, the phase-space gradients of F are determined solely by the gradients of the energy E. A visualization of the isotropic DF, for M = 10 10 M and a = 5 kpc, is given in the left-hand panel of Figure 1. There is a clear curve above which the DF is everywhere zero: viz. the escape velocity esc = √ 2GM/(r + a). With this DF, we employ an MCMC technique to sample a mock 6D dataset with 10 6 stars. For this, we use the affine-invariant ensemble sampler implemented in the software package emcee (Foreman-Mackey et al. 2013). A density plot of this mock dataset is shown in the second panel of Figure 1.
From this mock dataset, we now want to learn the underlying DF by means of a normalizing flow technique (Rezende & Mohamed 2015). Normalizing flows are a relatively new probability density estimation technique, and the basic principle behind them is rather straightforward: a simple base distribution such as a Gaussian is subject to a series (or "flow") of complex (but bijective and invertible) transformations into a target distribution. The parameters of these transformations are then optimized so as to give a target distribution that closely resembles the data. More detailed descriptions of the technique are given in the article by Rezende & Mohamed (2015)  first describing normalizing flows, and the recent review articles by Kobyzev, Prince & Brubaker (2020) or Papamakarios et al. (2021).
Despite taking a single Gaussian as the starting point, a flow with sufficiently flexible transformations (and sufficiently many of them) is able to mimic arbitrarily complex, multimodal data distributions. In practice, even rather minimalist flow architectures are capable of achieving great complexity (see e.g., Kingma & Dhariwal 2018, for an impressive application of flows in image generation).
Another class of density estimation technique capable of emulating arbitrarily complex datasets is kernel density estimation. The advantages of flow-based techniques over kernel-based techniques are two-fold. First, flows are less susceptible to over/under-fitting data (Both & Kusters 2019). The second advantage is more contextdependent. Kernel-based techniques typically require no training beyond simply loading the kernels into memory, and perhaps some tuning of the kernel-width parameter. However, given a dataset of size N, evaluating the kernel density PDF then essentially requires the computation of N kernel functions, which can be costly as N grows large. Flows do require a training procedure, the cost and duration of which depend on the flow architecture and the size and complexity of the dataset in question. However, given a trained flow, evaluating the PDF is then a mere matter of computing a single Gaussian and a small number of transformations, regardless of N. In summary, ker- nel densities are cheap to train but expensive to evaluate, while flow densities are expensive to train but cheap to evaluate. In our context, we need to train a density estimator only once to learn the DF, but would then like to evaluate it many times, e.g., for the sums in equation (24). This would therefore suggest flows over kernels. Another notable aspect of normalizing flows is that the target distribution is guaranteed to be a well-behaved probability distribution, i.e. positive everywhere and normalized to unity. The positivity requirement is met straightforwardly by working in log-space, but the normalization requirement is more exacting: it restricts the space of usable transformations to bijective and invertible functions. This space is then restricted further by the desire for computational efficiency. Different normalizing flow techniques differ primarily in the details of these transformations, as well as the base distributions and flow architectures.
We differ from Green & Ting (2020) in that we employ "masked autoregressive flows" (MAFs; Papamakarios, Pavlakou & Murray 2017). This choice is motivated by the benchmarking of a number of normalizing flow algorithms. We train an ensemble of 30 MAFs, each with 8 transformations along the flow, each transformation being a neural network with one hidden layer of 64 units. We use the implementation of MAFs in the publicly available software package nflows. 3 The MAFs are trained on the mock data, and thus learn a nonparametric DF that closely resembles the data. This learned DF is shown in the third panel of Figure 1. It is worth emphasizing that, whilst this plot is in two dimensions, the MAFs are trained using 6D data and learn a 6D DF. The plotted values here are taken from a 2D Model / Exact -1 Figure 4. The anisotropic Hernquist DF, projected into rt space at fixed position (r = a). The four panels carry the same meanings as in their isotropic analogues in Fig. 1, although some differences are discussed in the text. The normalizing flow technique is also successful at recovering the anisotropic Hernquist DF, albeit with larger residuals than in the isotropic case.
slice through this 6D DF, with y = z = y = z = 0 (so that x = r, x = ). The rightmost panel of Figure 1 shows fractional residuals, i.e. F model /F exact − 1. Encouragingly, the residuals are less than 5 % throughout most of phase space. In other words, our algorithm is successfully able to reproduce the isotropic Hernquist DF.
One apparent qualification to this success is the region near the esc -curve, where the DF is consistently overestimated. The esccurve represents a hard edge in the Hernquist DF, and even very flexible non-parametric density estimation schemes can struggle to reproduce such a hard edge. However, this need not be a cause for concern, for the following reason: if we progress to step (ii) of our method and attempt to derive acceleration at a given spatial location using this learned DF, the right-hand side of equation (24) requires us to choose a number of points in velocity space. At this stage, we are free to choose whichever velocities we like, and we can thus choose to steer well clear of this region near esc , which we term a "zone of avoidance". Of course, in real-world applications, one might not know the exact value of esc , but one can always make an educated guess (e.g., Williams et al. 2017;Deason et al. 2019). Equation (24) requires the spatial and velocity derivatives of the DF to calculate accelerations. We therefore check if our technique accurately recovers not just the DF, but also its derivatives. Here, a compelling benefit of the normalizing flow technique is that the learned DF is everywhere exactly differentiable, irrespective of the complexity of the flow architecture. Thus, we can efficiently calculate exact derivatives, obviating the need for potentially noisy finite difference schemes. Figure 2 compares the first derivatives ∂F/∂x and ∂F/∂ x of the exact and reconstructed DFs, evaluated on a 2D (x, x ) plane in phase space. Inspecting the residuals in the lower panels of Figure 2, it is apparent that the MAFs are rather successful at accurately recovering the gradients of the DF; the residuals are less than 10 % throughout most of phase space.
As seen in Figure 1, there is a problematic region of larger residuals near esc . In addition to this, two more such regions are apparent. First, the ∂F/∂ x residuals grow rather large in the immediate vicinity of x = 0. This is the peak of the 1D x -distribution, and so the nearby gradients are small and susceptible to mis-estimation. Second, the ∂F/∂x residuals show similar issues around x = 0. The same arguments hold here, perhaps exacerbated by the power-law cusp in the Hernquist model. For calculating accelerations, the first problem can be avoided as in the esc case, i.e. by sampling velocities that avoid the region around x = 0 (likewise y , z ). However, in the second region around x = 0, the residuals appear to be consistently large throughout velocity space, suggesting that our calculated accelerations at these small very radii will be biased.
With these points in mind, we now progress to step (ii) of our method, and derive accelerations from our learned DF using equation (24). Here, we take 50 points along the x-axis, and at each of these points we sample 10 3 velocities for the sums on the right-hand side of equation (24). We perform this sampling by calculating the escape speed esc at each spatial point, then uniformly sampling 10 4 speeds between 0 and 0.9 esc . Random directions are then chosen from the unit sphere. Finally, we randomly subsample 10 3 velocities from this set, avoiding the region around i = 0.
After performing this sampling, we have 10 3 points in phase space at which we evaluate equation (24) for each spatial location. The results of this are shown as the "Isotropic, σ = 0" curve in Figure 3. It is clear that the method derives the accelerations in the isotropic Hernquist model very well. The fractional residuals shown in the lower panel indicate an accuracy everywhere at the level of 3 %.

Anisotropic models
We repeat this exercise, using a simple anisotropic DF for the Hernquist model (Baes & Dejonghe 2002;Evans & An 2005, 2006) Now, the DF depends on the magnitude of the angular momentum = r t (here 2 t = 2 θ + 2 φ ) as well as the (dimensionless) binding energy . As before, we sample one million positions and velocities from this DF, then feed this data to an ensemble of MAFs. Figure 4 is the anisotropic analogue of Figure 1, and shows the exact DF, a density plot of the mock data, the learned DF and the fractional residuals. As the DF is not isotropic, we do not show the DFs projected into (r, ) space, but rather into ( r , t ) or radial versus tangential velocity space at fixed position (r = a). Consequently, the "Data" panel does not show the full dataset as in Figure 1, but only the stars within a small radial slice around r = a. The residuals in the anisotropic DF are generally larger than in the isotropic case, but nonetheless reasonably small, ∼ 5-10 %. Moreover, there seems to a be an additional zone of avoidance here beyond those already discussed in the isotropic case, around t = 0. The source of the large errors here can be seen directly from the form of the DF (29): t = 0 means = 0, so the DF diverges. The probability distribution remains well-behaved, but the MAFs nonetheless struggle to reproduce the sharp rise in probability density at small t .
Despite these foibles, the accelerations are still well recovered in the anisotropic case. These are shown as the points labelled "Anisotropic, σ = 0" in Figure 3. Indeed, the residuals here are comparable to the isotropic case.
One aspect of our procedure worth emphasizing is that successful calculation of accelerations relies on the judicious choice of velocity samples, steering clear of the "zones of avoidance" in which the DF and its gradients are poorly estimated. We have seen above that the existence and locations of zones can vary from context to context, and so it might be difficult to know a priori where they are for any given real stellar population. This is a potential drawback to our method, but it can be readily circumvented by performing tests on mock datasets.

Effect of errors
As a final test, we assess the potential impact of observational errors by adding Gaussian noise to the isotropic dataset, at the 1 % and 10 % level. The results of this trial are also shown in Figure 3, alongside the original results for the noiseless dataset. Based on this test, it appears that random errors of this magnitude have no appreciable adverse impact on the calculation of accelerations, with residuals still at the percent level. The application of our method to real data is therefore unlikely to be limited by statistical error.
Going beyond our simple test, there is a natural way to propagate observational errors in our method: when training an ensemble of MAFs on the data, each MAF could be provided with a slightly different dataset from which to learn, generated from a different realisation of the error distribution. Each member of the ensemble will then have a different learned opinion about the acceleration at a given spatial location, and the spread of these values will incorporate observational errors.

CONCLUSIONS
The phase-space distribution (DF) for the stars in the Milky Way is an obvious way to organize the new datasets comprising of nearby stars in the full six-dimensional phase-space coordinates. One question that follows is what information the DF actually contains about the overall properties of the Galaxy. We have proved that, if the stationary DF of a population is known locally in the neighborhood of a fixed real-space position, then the gravitational acceleration at that location can be uniquely determined from the phase-space gradients of the DF, using the collisionless Boltzmann equation (CBE) under the assumption of dynamical equilibrium. A sufficient condition for this to be true is that the Hessian of the DF with respect to the momenta does not vanish (see eq. 5).
In practice, once the CBEs are set up locally at more than three independent phase-space points sharing the real-space coordinates, we have an over-determined system of linear equations on the potential gradients, which can be solved via techniques, such as the least square and normal equations. A practical prescription of how to do this is provided in equation (24).
In light of this finding, we address the question as to how to empirically reconstruct the DF suitable for the local measurements of the gravitational acceleration. Recent developments in machine learning techniques offer great promise in this regard. In particular, Green & Ting (2020) proposed that the DF of stars can be reconstructed from samples of discrete positions and velocities via the method of normalizing flows and the underlying potential can be recovered from this empirical DF. We examine this suggestion by devising tests derived from isotropic and anisotropic Hernquist models using masked autoregressive flows to build the DF. Once built, direct solution of the over-constrained linear equations for the accelerations (eq. 24) is highly efficient, and preferable to use of a neural network (cf. Green & Ting 2020). The accelerations are everywhere well reproduced with samplings of ∼ 1000 velocities at any given position. One caveat here is the existence of regions of velocity space in which the DF is poorly estimated, which need to be avoided in the sampling.
Tests with the addition of Gaussian noise at the 1 % or 10 % level suggest that the method is stable against errors of this magnitude.
There are a number of evident applications of this method, some of which we are actively pursuing. For example, if we reconstruct the velocity distributions of a homogeneous (in equilibrium) stellar population in the solar neighbourhood from the sample of the nearby stars (e.g., Gaia Collaboration 2021), it is possible to measure the local gravity at the sun's position due to the Galactic potential (Naik et al., in prep.). This has implications both for the measurement of the local dark matter density and for tests of alternative theories of gravity. Equally, the method is potentially applicable to the datasets of Milky Way halo stars to measure the mass of the Milky Way and its escape speed.
One assumption underlying the implementation of our method is that of dynamical equilibrium. Incorrectly assuming ∂F/∂t = 0 leads to an additive bias in the derived accelerations that is linear in ∂F/∂t. In addition, disequilibrium can manifest itself through the system of equations (22a) sampled at many different velocity space positions being inconsistent with a single value of ∂Φ/∂x i (after accounting for observational uncertainties), or equation (24) resulting in different values of the acceleration for distinct choices of samples.
There is now a significant body of evidence suggesting the existence of disequilibria in the Milky Way disc (e.g., Antoja et al. 2018;Schönrich & Dehnen 2018;Salomon et al. 2020), which will need to be carefully considered in future applications of our technique to local stellar kinematics. Banik, Widrow & Dodelson (2017) find the bias in inferred accelerations to be at the 10 % level if such systematic perturbations are ignored. So, it is interesting to explore whether the pattern of residuals at a sampling point has a systematic behaviour over velocity space that may be a tell-tale signature of departures from equilibrium (cf. Li & Widrow 2021, for a somewhat similar idea).
It is also worth remarking that the first step of our outlined procedure, i.e. learning the DF with normalizing flows, is entirely assumption-free. Given this learned DF, one could then study the non-equilibrium structures themselves. These non-equilbrium structures imprinted in the stellar kinematics are much more than merely sources of systematic error: perturbations to a system can reveal insights about the system itself. For example, Widmark et al. (2021) has shown that the shape of the Gaia phase spiral can be used to constrain the local gravitational potential.
To summarize, our method bypasses many of the assumptions that have been traditionally adopted in studies of galactic dynamics, and represents an efficient, flexible, and data-driven means of extracting underlying gravitational accelerations from snapshots of stellar kinematics.

ACKNOWLEDGEMENT
We thank Gregory Green and Yuan-Sen Ting for useful discussions, as well as the anonymous referee for a very useful report. APN and CB are supported by a Research Leadership Award from the Leverhulme Trust. CB is also supported by a Royal Society University Research Fellowship.