Accelerated Bayesian inference using deep learning

We present a novel Bayesian inference tool that uses a neural network to parameterise efficient Markov Chain Monte-Carlo (MCMC) proposals. The target distribution is first transformed into a diagonal, unit variance Gaussian by a series of non-linear, invertible, and non-volume preserving flows. Neural networks are extremely expressive, and can transform complex targets to a simple latent representation. Efficient proposals can then be made in this space, and we demonstrate a high degree of mixing on several challenging distributions. Parameter space can naturally be split into a block diagonal speed hierarchy, allowing for fast exploration of subspaces where it is inexpensive to evaluate the likelihood. Using this method, we develop a nested MCMC sampler to perform Bayesian inference and model comparison, finding excellent performance on highly curved and multi-modal analytic likelihoods. We also test it on {\em Planck} 2015 data, showing accurate parameter constraints, and calculate the evidence for simple one-parameter extensions to LCDM in $\sim20$ dimensional parameter space. Our method has wide applicability to a range of problems in astronomy and cosmology.


I. INTRODUCTION
In the last few years, we have witnessed a revolution in machine learning. The use of deep neural networks (NNs) has become widespread due to increased computational power, the availability of large datasets, and their ability to solve problems previously deemed intractable (see [1] for an overview). Deep learning is particularly suited to the era of data driven astronomy and cosmology, but so far applications have mainly focused on supervised learning tasks such as classification and regression.
Bayesian inference is now a standard technique, with codes such as CosmoMC [2], CosmoSIS [3], Emcee [4], MontePython [5], MultiNest [6][7][8] and PolyChord [9,10] used for parameter estimation and model selection. Many of these use Metropolis-Hastings Markov Chain Monte Carlo (MCMC) to draw samples from the posterior distribution. The dimensionality of the problem can be high, with 20 parameters for the Planck likelihood (including nuisance parameters), and potentially more for future large-scale surveys. For some problems the likelihood can be non-Gaussian (e.g. with curving degeneracies) and/or multi-modal, making conventional MCMC techniques inefficient and liable to miss regions of parameter space. Likelihoods can also be expensive to calculate, so minimizing the total number of evaluations is advantageous. This was the motivation behind BAMBI [11,12], which used a neural network to approximate the likelihood function where possible. Efficient exploration of the posterior distribution is crucial for Bayesian inference and model selection.
A good proposal function is vital to fully explore the distribution and ensure convergence of the chain. Deep learning has recently been shown to accelerate MCMC * adam.moss@nottingham.ac.uk sampling by using NNs to parameterise efficient proposals that maintain detailed balance (see e.g. [13,14]). The proposal function can be trained, for example, to minimise the autocorrelation length of the chain. Some of these methods (e.g. generalisations of Hamiltonian Monte Carlo) exploit the gradient of the target, and analytic gradients are often not available for astronomical or cosmological models. The training time can also be prohibitive, offsetting the gain in efficiency when sampling. In this work we use a NN to transform the likelihood to a simpler representation, which requires no gradient information and is very fast to train. This approach is inspired by representation learning, which hypotheses that deep NNs have the potential to yield representation spaces in which Markov chains mix faster [15,16].
The idea of optimising the proposal function originated in [17,18], which suggested using a normal distribution N (x, 2.38 2 Σ/d), where Σ is the covariance matrix, x the current state and d the dimension. The factor of 2.38 2 /d was shown to minimise the autocorrelation length of the chain. This is equivalent to transforming variables by L −1 x, where L is a lower triangular matrix defined by the Cholesky decomposition of the covariance matrix, Σ = LL T . Proposals can then be made using the diagonal normal distribution N (x, 2.38 2 I/d), where I is the identity matrix. Standard practice in cosmological parameter estimation is to use an initial set of samples to estimate the covariance matrix, and make proposals using the Cholesky decomposition. Linear transformations are also used in affine invariant MCMC methods [4,19].
This transformation works well when samples are well described by their covariance matrix, but can become inefficient for more complex distributions. In this paper we use a NN to parameterise more expressive transformations that are suitable for curved and multi-modal targets. The NN learns a non-linear, invertible, and nonvolume preserving mapping between data and a simpler arXiv:1903.10860v1 [astro-ph.CO] 26 Mar 2019 latent space via maximum-likelihood, and proposals are then made in this space. We apply our method to nested sampling, a commonly used tool to perform Bayesian inference and model comparison, although it could easily be incorporated into other MCMC frameworks. A major challenge in nested sampling is drawing new samples from a constrained target distribution, and we show that NNs can lead to improved performance over existing rejection and MCMC based approaches.
The structure of the paper is as follows. In section II we provide a short overview of nested sampling. In section III we show how NNs can be trained to transform complex target data to a simpler latent space. We give further details of our algorithm in section IV, and demonstrate results on both analytic likelihoods and Planck data in section V. We conclude and provide an outlook for future work in section VI.

II. NESTED SAMPLING
The nested sampling algorithm [20,21] was developed to accurately calculate the Bayesian evidence (or marginal likelihood). From Bayes' theorem, the posterior distribution of a set of parameters x, given data d and model M is is the prior and the normalising constant p(d|M ) is the evidence. This can be expressed as Two competing models M 1 and M 2 can be compared by calculating the Bayes factor which simplifies to the evidence ratio if the models have equal prior probability. This integral (2.2) is typically hard to evaluate, but can be turned into a simpler 1-d integral by a change of variables [20], such that X(λ) is the prior volume associated with a likelihood constant L(x) > λ. Parameters are scaled to the unit hypercube, so that the prior is normalised, and X takes values between 0 and 1.
The nested sampling algorithm first draws a set of n live live points from π(x). The prior is typically uniform or Gaussian, so is easy to sample from. At each iteration the point with the lowest likelihood is replaced by a new sample drawn from the prior, with the condition that the new point has a higher likelihood. The discarded samples are termed dead points. For a sequence of decreasing X values, the integral in (2.4) can then be estimated using trapezoidal integration The error in log Z can be estimated by H/n live , where H is the negative relative entropy [6] Nested sampling can also perform posterior inference by using the sequence of dead points (and the current set of live points), and assigning a weight p i to the ith point The main difficulty with nested sampling is drawing a new, independent sample subject to a hard likelihood constraint. This can be achieved by rejection sampling, using an envelope function that encloses the current set of live points. For Gaussian likelihoods, it is most efficient to use a ellipsoidal envelope [22], and for multi-modal distributions a set of (possibly) overlapping ellipsoids can be drawn around clusters of live points [6,7]. The disadvantage of rejection sampling is that the envelope function requires scaling by an enlargement factor to ensure it contains the entire iso-likelihood contour. This can lead to poor scaling with dimension and introduces a choice of user specified hyper-parameter.
An alternative to rejection sampling is MCMC. Starting from an existing live point, a new, independent sample is obtained after performing a 'sufficient' number of steps. MCMC scales better with dimension, but is not guaranteed to explore or fully mix the distribution, and is generally not well suited if the likelihood is highly curved and/or multi-modal. Variants of MCMC developed to cope with challenging targets include Galilean dynamics [23], diffusive sampling [24] and slice sampling [9,10]. Some of these also use clustering algorithms to identify and sample from multi-modal distributions. The key point is that they all try and choose better proposals -in our case we will use a neural network to try and learn one.

III. NEURAL NETWORK SAMPLING
A. Non-volume preserving flows Initially, we have a set of data {x i } N i=1 , sampled from a target distribution x ∼ p X (x), where x has dimension n dim . Latent variables are drawn from a simpler prior distribution z ∼ p Z (z), with the same dimension. Given a bijection f : X → Z, the change of variables formula gives the distribution on X The inverse f −1 (z) provides a mapping from latent space to real space. The bijection can be parameterised by a neural network, with trainable parameters θ. Neural networks, however, are not generally invertible, and the Jacobian determinant in (3.1) is not easily tractable. Non-volume preserving (NVP) flows, introduced in [25], can transform simple latent distributions into rich target distributions. They are invertible and allow for tractable Jacobians by using a specific architecture for the neural network. They exploit the fact that the determinant of a triangular matrix is the product of its diagonal terms. The NVP transformation is [25] x = m x+(1−m) x exp s θs (m x) +t θt (m x) , where m is a binary mask vector consisting of alternating 1's and 0's, s θs and t θt are separate (s)cale and (t)ranslation NN's with trainable parameters θ s and θ t , and is the element-wise product. The transformation for n dim = 2 with m = (1, 0), for example, is with Jacobian determinant exp(s θs (x 1 )), and the inverse is with Jacobian determinant exp(−s θs (x 1 )). Note that x 1 is unmodified in this transformation.
A series of transformations can be composed into a flow by permuting components of the inputs in successive transformations, such that those modified in one transformation are left unchanged in the next. This can be achieved by setting the mask in the next transformation to 1 − m, so that successive masks resemble a checkerboard pattern. The Jacobian determinant is still tractable, and is simply the product of each individual transformation. A flow transforms a target to latent space, and an inverse flow transforms latent space to the target. Each transformation step of the flow has separate s and t networks. Given a set of samples, the s and t networks can be trained by minimising the loss function Parameters are updated by back-propagating the loss using gradient descent, and the minimum loss is equivalent to the maximum-likelihood estimate of θ s and θ t .
As an example, we fit flows to two datasets, shown in Fig 1. The initial data x ∼ p(x) is obtained from the nested sampling of the 2 dimensional Rosenbrock and Himmelblau functions (defined in section V) for n live = 1000 points, evaluated when the prior volume X = 0.02. The target distribution is therefore uniform when L(x) > λ , with λ defined by X(λ ) = 0.02, otherwise the probability density is zero. These functions are chosen as they are challenging examples of curved and multi-modal distributions.
We choose the prior distribution p Z to be a diagonal Gaussian with unit variance, i.e. z ∼ N (0, I). We use 4 successive transformations in the flow, each parameterised by a fully connected s and t network with an input layer of dimension 2, two hidden layers of dimension 128, and an output layer of dimension 2. We use rectified non-linear activation (ReLU) functions after the input and hidden layers in each network.
The resulting inverse flows are shown in Fig. 2, after training for 50 epochs (each epoch is a complete pass over training data). Each transformation step begins with an (x 1 dependent) scaling and translation of x 2 (the vertical axis in the plot), with x 1 unchanged. These are then permuted, and scaling and transformation operations are applied to (the now) x 1 . The initial Gaussian can be flowed into the Rosenbrock function continuously, but narrow connecting ridges appear for the Himmelblau function. This is because it cannot continuously be deformed into the target distribution. Nethertheless, the volume of these ridges is quite small compared to the region where the target probability density is non-zero.
If the network could fit the target distribution perfectly, it would be trivial to generate new, independent samples. One could simply sample from latent space z ∼ N (0, I) and perform the inverse flow to obtain x. In reality the fit isn't perfect, but we can use the learnt mapping to improve the efficiency of MCMC by making proposals in the simpler latent space rather than data space.

B. MCMC Sampling
The acceptance probability required to maintain detailed balance in a Metropolis-Hastings update is where the proposal function q(x |x) is the conditional probability of state x given x. If the proposal function is symmetric (e.g. a Gaussian with the same covariance matrix for each state) then q(x |x) = q(x|x ). For proposals made in latent space z, the acceptance probability must be modified by the Jacobian determinant to satisfy detailed balance Given the prior distribution of latent space is a diagonal, unit variance Gaussian, we use a symmetric proposal function where σ is a scaling parameter. Based on estimates of optimal proposals for Gaussian distributions [17,18], we tune σ to give an acceptance rate of 50% using the method in [6], where N a and N r are the number of accepted and rejected samples in the current MCMC chain.
To demonstrate this gives the correct distribution on p X , we generate new samples for the flows fitted to the 2 dimensional Rosenbrock and Himmelblau functions. Starting from an existing sample x init , we run a chain of length N c = 1000 and repeat this for 20,000 chains, each time choosing the same initial x init . In Fig. 3 we plot a histogram of the resulting samples after 5 and 20 MCMC iterations, also showing the first 20 proposed moves of an example chain (note that not all of these proposals are accepted). After only 5 iterations, the resulting distribution is non-uniform, with a higher probability near the initial point, but after 20 iterations the samples appear to have lost all memory of where they began. By sampling in latent space, the chain is able to take large steps in data space, even jumping directly between modes, with an overall acceptance rate in each case of around 40%.
To quantify how many iterations are required to generate a new, independent sample, we calculate the effective sample size (ESS). Given a chain of N c correlated sam- where ρ s is the autocorrelation of x at lag s. Since there is an autocorrelation and ESS for each parameter, we use the (worst-case) minimum ESS to set the chain length requirement. We use the following estimate for ρ s , whereμ andσ 2 are the mean and variance of the initial data. We truncate the sum over ρ s whenρ s < 0.05, as the estimate can become dominated by noise for large lags [26].
For the 2 dimensional Rosenbrock and Himmelblau functions, we obtain an average minimum ESS of ∼ 100 for N c = 1000. This suggests that, on average, it takes around 10 iterations to generate a new, independent sample. Empirically, we find this scales as ∼ 1/n dim for higher dimensions.

C. Fast-slow decorrelation
In practice, the likelihood function can be computationally more expensive to evaluate for some parameters ('slow' parameters) than others ('fast' parameters). In astronomy/cosmology applications, for example, nuisance parameters are often much faster to evaluate than physical parameters of the model, when keeping physical parameters fixed. It is therefore desirable to split parameter space into a speed hierarchy, allowing for fast exploration of subspaces where it is inexpensive to evaluate the likelihood [27].
Fast-slow decorrelation can naturally incorporated into our method by fitting flows to each subspace and performing a further transformation to decorrelate them. In the case of a single hierarchy, for example, we fit separate flows to the slow (x s ) and fast (x f ) subspaces, concatenate the output into the vector (y s , y f ), and then apply a transformation with mask (1, 0). This means that slow parameters are unchanged by updating only the fast block, and a slow update changes both fast and slow parameters. This is illustrated in Fig. 4.
Given a speed hierarchy, we choose the sampling rate to be proportional to the number of parameters in each block. In the case of a single hierarchy, with n dim = n slow + n fast , where n slow is the number of slow parameters and n fast the number of fast parameters, at each MCMC iteration we perform a fast update with probability n fast /(n slow + n fast ), otherwise performing a slow update. In our experiments we find this leads to a minimum ESS similar to the full update of all parameters.

IV. NEURAL NEST ALGORITHM
In this section we give further details of our algorithm applied to nested sampling, although it could easily be incorporated into other MCMC frameworks. As outlined in section II, the algorithm begins by drawing n live points from the prior distribution π(x). At each iteration the point with the lowest likelihood (denoted by λ ) is replaced by a new sample drawn from the prior, subject to the condition that L(x) > λ .
To obtain a new, independent sample, we first train a flow on the current set of live points. Starting from an existing live point (chosen at random), we then per-form sampling in latent space, finally accepting the new point after n MCMC steps, with the requirement that it must have made at least one move (in practice it will make many moves). Further pertinent details of our implementation are: Number of MCMC iterations: Based on estimates of the ESS, we set n MCMC = 5n dim . Empirically, we find this works well for a range of target distributions. We monitor the ESS to ensure it does not significantly drop below 1 as the algorithm progresses, and that the chain performs a large number of updates by adjusting the proposal width as in (3.9).
Fast-slow hierarchy: We will consider models with either no hierarchy (n dim = n slow ) or a single fastslow hierarchy (n dim = n slow + n fast ). In the latter case, we perform fast updates at each MCMC iteration with probability n fast /(n slow + n fast ), otherwise performing a slow update that changes all pa- rameters. On average, there will be approximately 5n slow slow likelihood evaluations per chain.
Initial rejection sampling: In the initial stages of selecting a new point, the prior volume X ∼ 1. In this case, it is more efficient to use rejection sampling from the prior hypercube. We switch to MCMC when the rejection efficiency is equal to the MCMC efficiency, i.e. after the prior volume has decreased by a factor of 1/(5n slow ).
Training updates: The set of live points changes relatively slowly, so we only retrain the flow every n live iterations. We train each update for 50 epochs. Training is fast, taking < 60 seconds on a CPU. The NN is trained by backpropagating the loss in (3.5) using the Adam optimiser [28].
Adding jitter: We add 'jitter' (random perturbations) to the set of live points during training to reduce overfitting. Jitter is chosen to be Gaussian with zero mean and a standard deviation of 0.2 times the average nearest neighbour separation between live points. The level of jitter therefore reduces as the algorithm progresses.
Validation data: During training we use 90% of the current live points to train the flow. The remaining 10% are used as validation data to ensure the loss does not increase due to overfitting.
Termination: The algorithm is terminated on determining the fractional remaining Z to 0.5 in logevidence (see [6] for details).
Parallelisation: We parallelise our code using MPI, communicating the set of live points between processes. Each process trains a separate flow to the (same) set of points, providing a type of ensembling across NNs. A new live point is generated by each process, and these are communicated back to the master process, which is responsible for updating the set of live points.
The NN was coded using the PyTorch library 1 and the nested sampling code is available on request from the author.

A. Analytic likelihoods
We first test our method on several challenging analytic likelihoods: Mixture of 4 Gaussians: This is the same multimodal distribution given in [10], with a likelihood function Rosenbrock function: The is the archetypal example of a banana shaped degeneracy, with a loglikelihood (5.2) We choose uniform priors U(−5, 5) on the parameters x. The analytic evidence for n dim = 2 is log Z = −5.80 [11]. There is no analytic expression for n dim > 2, so for n dim = 3 we perform numerical integration to obtain the ground truth value. For higher dimensions we found this too expensive to compute numerically.
Himmelblau function: This is an example of a multimodal distribution, with a log-likelihood We also choose uniform priors U(−5, 5) on the parameters x. The Himmelblau function has four identical local minima at (3, 2), (-2.81, 3.13), (-3.78, -3.28) and (3.58, -1.85). There is also no analytic expression for the evidence, so we perform numerical integration to obtain the ground truth value for Z.
In Fig. 5 we show the marginalised 1 and 2-d posterior distributions for the Gaussian mixture model with n dim = 4. These marginalised values agree well with the expected values.
In Tab. I we show log Z and the number of likelihood evaluations for each of the 3 analytic likelihoods. We compare our results to the nested sampling codes MultiNest [7] and PolyChord [9]. MultiNest uses multimodal ellipsoidal rejection sampling, and PolyChord uses MCMC slice-sampling with multi-modal clustering. Each is run with their default settings 2 , and are set to stop on 2 An efficiency factor of 0.3 is used for MultiNest and nrepeats = 5n dim for PolyChord. determining the fractional remaining log-evidence to an accuracy of 0.5. In our code, we use 5 transformations in the NVP flow, each parameterised by a fully connected s and t network with an input layer of dimension 2, two hidden layers of dimension 128, and an output layer of dimension 2. We use rectified non-linear activation (ReLU) functions after the input and hidden layers in each network. In all three codes we set n live = 1000 and perform 5 separate runs to obtain summary statistics of log Z and the number of likelihood evaluations. Each code also produces an estimate of the evidence error for each run.
For the Gaussian mixture model we obtain results consistent with the ground truth values. The number of required likelihood evaluations is around a factor of 5-10 higher (i.e. less efficient) than MultiNest, primarily due to the n MCMC = 5n dim iterations we perform to obtain a new, independent sample. In contrast, the number of evaluations required per sample for MultiNest is only ∼ 3. Using default settings, however, MultiNest tends to overestimate log Z for n dim ≥ 10 -this can be alleviated by decreasing the efficiency parameter, but at the cost of more likelihood evaluations. Decreasing the efficiency to 0.05, for example, gives log Z = −59.92 ± 0.07 for n dim = 20, with an average 703,182 likelihood evaluations, and log Z = −89.95 ± 0.15 for n dim = 30, with 962,111 likelihood evaluations. This means that MultiNest is still a factor ∼ 5 times more efficient. Our error estimate in log Z using (2.7) is consistent with our summary statistics over 5 runs, being 0.08, 0.12, 0.17 and 0.21 for n dim = 5, 10, 20 and 30 respectively.
We also consider the same Gaussian mixture model but with a fast/slow hierarchy. In this case we choose x 1 and x 2 to be slow parameters, with the remainder fast parameters. We take the number of likelihood evaluations to be the number of slow evaluations. This number is now significantly reduced and is comparable to MultiNest at low dimensions, which does not implement any speed hierarchy, and is more efficient at high dimensions.
For the Himmelblau function we obtain results consistent with the ground truth value. The number of likelihood evaluations is now only around a factor of 2 higher than MultiNest. This is an improvement over the mixture model at the same dimension, as ellipsoidal rejection sampling is less efficient for non-Gaussian distributions.
For the Rosenbrock function we also obtain results consistent with the available ground truth values. For n dim = 2 the number of likelihood evaluations is now less than a factor of 2 higher than MultiNest, and for n dim ≥ 20 the poor scaling of rejection sampling becomes apparent. There is a difference in log Z compared to MultiNest for n dim ≥ 20, and although the results of PolyChord are consistent with ours, the lack of a ground truth value makes it difficult to draw conclusions. Our error estimate in log Z using (2.7) is again consistent with our summary statistics, being 0.07, 0.09, 0.11, 0.13, 0.19, 0.28 and 0.34 for n dim = 2, 3, 4, 5, 10, 20 and 30 respectively.
Compared to PolyChord, also an MCMC sampler, our method requires a lower number of likelihood evaluations, by a factor of ∼ 5. PolyChord also performs n MCMC = 5n dim repeats to obtain a new sample, but requires additional evaluations to determine the width of the slice. Recently, dyPolyChord [10] has been developed, which dynamically allocates live points during sampling. We have not performed direct comparisons with dyPolyChord in Tab. I, as we wish to compare the efficiency of each algorithm with a constant number of live points. From our experiments, however, dyPolyChord has similar performance to PolyChord at low dimensions, but significantly improves the scaling at higher dimensions. For the Rosenbrock function, for example, an average of 6,660,409 likelihood evaluations are required for n dim = 20 and only 9,182,957 for n dim = 30.
Other MCMC based results in the literature are limited, but in [23] it was shown that Galilean dynamics required around 120,000 and 220,000 likelihood evaluations for the Himmelblau and n dim = 2 Rosenbrock functions respectively.

B. Planck
Our Python implementation can easily be integrated with codes such as MontePython [5,29] and cobaya 3 to perform cosmological parameter estimation and model selection. The Planck datasets used in our analysis come from the 2015 mission [30,31]. In particular, we use the TT+lowP+lensing combination, which contains the 100-GHz, 143-GHz, and 217-GHz binned half-mission TT cross-spectra for = 30 − 2508 with CMB-cleaned 353-GHz map, CO emission maps, and Planck catalogues for the masks and 545-GHz maps for the dust residual contamination template. It also uses the joint temperature and the E and B cross-spectra for = 2 − 29 with E and B maps from the 70-GHz LFI full mission data and foreground contamination determined by 30-GHz Low Frequency Instrument (LFI) and 353-GHz High Frequency Instrument maps. The Planck lensing likelihood [32] uses both temperature and polarization data in the multipole range = 100 − 2048 to estimate the lensing power spectrum.
We use the full version of the Planck likelihood with MontePython, fitting for a total of 6 base LCDM parameters and 15 nuisance parameters. We also fit for simple one-parameter extensions to LCDM with a variable effective number of neutrino species N eff and curvature density Ω K . We assume uniform priors on the cosmological parameters, with the upper and lower limits corresponding to approximate Planck ±5σ values. To account for any Gaussian priors on nuisance parameters used in the Planck analysis, we use uniform priors with ±5σ limits, and add an additional term to the likelihood function. The prior ranges, along with a description of each parameter, are shown in Tab. II. For Planck the total computational time is dominated by the calculation of the cosmological observables, so we use a fast hierarchy for the nuisance parameters. We use n live = 500 points and the same network architecture as in the previous section. In Tab. III we give the marginalised cosmological parameters for the base LCDM, LCDM + N eff and LCDM + Ω K models. These agree very well with the published Planck values, and in Fig. 6 we show marginalised 1 and 2-d posterior distributions for the base LCDM model, compared to results from standard MCMC. These again agree extremely well, showing that we obtain accurate parameter constraints using our nested sampler.
We have also calculated the evidence, finding the Bayes factor to be log B = −1.7 ± 0.2 and −2.1 ± 0.2 for LCDM + N eff and LCDM + Ω K respectively. The error on the Bayes factor is obtained from adding the errors from (2.7) in quadrature. In the revised Jeffreys scale [33], | log B| > 1 is regarded as positive evidence, | log B| > 3 as strong evidence, and | log B| > 5 as very strong. These results therefore suggest that Planck disfavours both extensions to LCDM. Although the evidence is dependant on the choice of priors, our results are consistent with those in [34], who reuse MCMC chains produced for parameter inference to calculate the evidence.

VI. CONCLUSIONS
In this paper we have trained a neural network to parameterise efficient MCMC proposals, by transforming the target distribution to a simpler latent representation. This approach is inspired by representative learning, which suggests that deep NNs yield latent spaces in which Markov chains can mix faster. Our method is a non-linear extension of the commonly used technique of transforming parameter space using the Cholesky decomposition of the covariance matrix.
We have applied this method to nested sampling, finding excellent performance on highly curved and multimodal targets. At low dimensions the efficiency is within a factor of a few times that of state-of-the-art multimodal rejection sampling, but has better scaling in higher dimensions. Parameter space can also naturally be split into a speed hierarchy, making it suitable for models with a subset of parameters where it is inexpensive to evaluate the likelihood. We demonstrate this for Planck data in ∼ 20 dimensional parameter space, accurately recovering the expected posterior distributions. As example applications, we calculate the Bayesian evidence for variable effective number of neutrino species N eff and curvature density Ω K , finding the data disfavours these extensions to LCDM.  There are several possibilities for future work. Firstly, it would be interesting to see if the flow model can more naturally be extended to multi-modal distributions. Currently, the latent representation forms narrow connecting ridges between modes, which reduces the efficiency on models with a very high number of modes. One could also potentially use more general types of neural network, but these may not have the desirables properties of being invertible with tractable Jacobian determinants.
In terms of nested sampling, it has recently been shown that dynamically allocating the number of live points can significantly improve performance. It would be interesting to apply this technique to our method, potentially even using a neural network to estimate the posterior mass L(X)X and control the allocation of points.
In follow up work we will develop a NN sampler specifically designed for fast inference, that can easily be integrated into standard parameter estimation codes. This would improve on the standard technique of using the covariance matrix to parameterise the proposal function, working for both highly curved and multi-modal likelihoods. With the ability of NNs to characterise complex data by simple representations, we expect they will become useful tools to improve the speed of inference on a variety of problems.