Exact Bayesian Inference for the Bingham Distribution

This paper is concerned with making Bayesian inference from data that are assumed to be drawn from a Bingham distribution. A barrier to the Bayesian approach is the parameter-dependent normalising constant of the Bingham distribution, which, even when it can be evaluated or accurately approximated, would have to be calculated at each iteration of an MCMC scheme, thereby greatly increasing the computational burden. We propose a method which enables exact (in the Monte Carlo sense) Bayesian inference for the unknown parameters of the Bingham distribution by completely avoiding the need to evaluate this constant. We apply the method to simulated and real data, and illustrate that it is simpler to implement, faster, and performs better than an alternative algorithm that has recently been proposed in the literature.


Introduction
Observations that inherit a direction occur in many scientific disciplines (see, for example, Mardia and Jupp, 2000). For example, directional data arise naturally in the biomedical field for protein structure (Boomsma et al., 2008), cell-cycle (Rueda et al., 2009) and circadian clock experiments (Levine et al., 2002); see also the references in Ehler and Galanis (2011). A distribution that has proved useful as a model for spherical data which arise as unsigned directions is the Bingham distribution (Bingham, 1974; Mardia and Jupp, 2000).
The Bingham distribution can be constructed by conditioning a zero-mean multivariate Normal (MVN) distribution to lie on the sphere S^{q-1} of unit radius in R^q. In particular, for a given matrix A of dimension q × q, the density with respect to the uniform measure on S^{q-1} is given by

f(x; A) = c(A)^{-1} exp(-x^T A x),    (1)

where c(A) is the corresponding normalising constant.
Having observed some directional data, interest then lies in inference for the matrix A in (1). The likelihood of the observed data given the parameters can easily be written down, and at first glance it appears that maximum likelihood inference for A is straightforward. However, inferring the matrix A is rather challenging. This is because the likelihood of the observed data given the matrix A involves the parameter-dependent normalising constant c(A), which, in the general case, is not available in closed form. This poses significant challenges for statistical inference involving the Bingham distribution, in either a frequentist or a Bayesian setting.
Although a maximum likelihood estimator for A can be derived by iterative techniques which are based on being able to approximate c(A) (see, for example, Kent, 2005, 2007; Sei and Kume, 2013), very little attention has been paid in the literature to estimation of A within a Bayesian framework. Walker (2013) considered Bayesian inference for the Bingham distribution which removes the need to compute the normalising constant, using a (more general) method developed earlier (Walker, 2011) that cleverly gets around the intractable nature of the normalising constant. However, it requires the introduction of several latent variables and a Reversible-Jump Markov Chain Monte Carlo (RJMCMC) sampling scheme.
The main contribution of this paper is to show how one can draw Bayesian inference for the matrix A by exploiting recent developments in Bayesian computation for distributions with doubly intractable normalising constants (Møller et al., 2006; Murray et al., 2006). The main advantage of our approach is that it does not require any numerical approximation to c(A) and hence enables exact (in the Monte Carlo sense) Bayesian inference for A. Our method relies on being able to simulate exact samples from the Bingham distribution, which can be done by employing an efficient rejection sampling algorithm proposed by Kent and Ganeiber (2012).
The rest of the paper is structured as follows. In Section 2 we introduce the family of Angular Central Gaussian distributions and illustrate how such distributions serve as efficient proposal densities to sample from the Bingham distribution. In Section 3 we describe our proposed algorithm while in Section 4 we illustrate our method both using simulated and real directional data from earthquakes in New Zealand. In Section 5 we discuss the computational aspects of our method as well as directions for future research.
Rejection Sampling

Preliminaries

Rejection sampling (Ripley) is a method for drawing independent samples from a distribution with probability density function f(x) = f*(x)/Z_f, assuming that we can evaluate f*(x) for any value x but may not necessarily know Z_f. Suppose that there exists another distribution, with probability density function g(x) = g*(x)/Z_g, often termed an envelope density, from which we can easily draw independent samples and whose kernel g*(x) we can evaluate at any value x. We further assume that there exists a constant M* for which M* g*(x) ≥ f*(x) for all x. We can then draw samples from f(x) as follows:

1. Draw a candidate value y from g(x) and u from U(0, 1).
2. If u ≤ f*(y) / (M* g*(y)), accept y; otherwise reject y and return to step 1.

The set of accepted points provides a sample from the target density f(x). It can be shown that the number of trials until a candidate is accepted has a geometric distribution with mean M, where

M = M* Z_g / Z_f.

Therefore, the algorithm will work efficiently provided that M is small or, in other words, that the probability of acceptance (1/M) is large. Moreover, it is important to note that it is not necessary to know the normalising constants Z_f and Z_g to implement the algorithm; the only requirements are being able to draw from the envelope density g(x) and knowledge of M* (rather than M).
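The two steps above can be sketched in a few lines of Python; everything here (the function names and the toy truncated-normal target) is our own illustration rather than code from the paper:

```python
import numpy as np

def rejection_sample(f_star, g_star, g_sample, M_star, n_samples, rng):
    """Draw n_samples from f(x) = f*(x)/Z_f using envelope g(x) = g*(x)/Z_g.

    Requires M* g*(x) >= f*(x) for all x; neither Z_f nor Z_g is needed.
    """
    samples = []
    while len(samples) < n_samples:
        y = g_sample(rng)                          # step 1: candidate from envelope
        u = rng.uniform()                          # step 1: u ~ U(0, 1)
        if u <= f_star(y) / (M_star * g_star(y)):  # step 2: accept/reject
            samples.append(y)
    return np.array(samples)

# Toy check: target f*(x) = exp(-x^2/2) on [-1, 1] (a truncated standard normal,
# unnormalised); envelope uniform on [-1, 1] with g*(x) = 1. M* = 1 suffices
# because exp(-x^2/2) <= 1 everywhere.
rng = np.random.default_rng(0)
draws = rejection_sample(lambda x: np.exp(-0.5 * x**2),
                         lambda x: 1.0,
                         lambda r: r.uniform(-1.0, 1.0),
                         1.0, 2000, rng)
```

Note that the toy target's normalising constant is never computed; only f*, g* and M* are used.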

The Angular Central Gaussian Distribution
The family of angular central Gaussian (ACG) distributions is an alternative to the family of Bingham distributions for modelling antipodally symmetric directional data (Tyler). An angular central Gaussian distribution on the (q − 1)-dimensional sphere S^{q-1} can be obtained by projecting a multivariate Gaussian distribution in R^q, q ≥ 2, with mean zero onto S^{q-1} of unit radius. In other words, if the vector y has a multivariate Normal distribution in R^q with mean 0 and variance-covariance matrix Ψ, then the vector x = y/||y|| follows an ACG distribution on S^{q-1} with q × q symmetric positive-definite parameter matrix Ψ (Mardia and Jupp, 2000). The probability density function of the ACG distribution with respect to the surface measure on S^{q-1} is given by

g(x; Ψ) = w_q^{-1} |Ψ|^{-1/2} (x^T Ψ^{-1} x)^{-q/2},

where the constant w_q = 2π^{q/2}/Γ(q/2) is the surface area of S^{q-1}. Denote by c_ACG(Ψ) = w_q^{-1} |Ψ|^{-1/2} the corresponding normalising constant. Kent and Ganeiber (2012) have demonstrated that one can draw samples from the Bingham distribution using the ACG distribution as an envelope density within a rejection sampling framework. In particular, the following algorithm can be used to simulate a value from the Bingham distribution with parameter matrix A:
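Simulating from the ACG distribution is immediate from this projection construction, which is what makes it attractive as an envelope. A minimal Python sketch (the function names are ours, and the log-density simply transcribes the formula above):

```python
import math
import numpy as np

def sample_acg(Psi, n, rng):
    """ACG(Psi) samples on S^{q-1}: draw y ~ N(0, Psi) and project x = y/||y||."""
    q = Psi.shape[0]
    y = rng.multivariate_normal(np.zeros(q), Psi, size=n)
    return y / np.linalg.norm(y, axis=1, keepdims=True)

def acg_logpdf(x, Psi):
    """log g(x; Psi) = -log(w_q) - (1/2) log|Psi| - (q/2) log(x^T Psi^{-1} x),
    with w_q = 2 pi^{q/2} / Gamma(q/2) the surface area of S^{q-1}."""
    q = Psi.shape[0]
    log_wq = math.log(2.0) + 0.5 * q * math.log(math.pi) - math.lgamma(0.5 * q)
    _, logdet = np.linalg.slogdet(Psi)
    quad = float(x @ np.linalg.inv(Psi) @ x)
    return -log_wq - 0.5 * logdet - 0.5 * q * math.log(quad)

rng = np.random.default_rng(1)
xs = sample_acg(np.diag([1.0, 2.0, 5.0]), 1000, rng)   # 1000 points on S^2
```

As a sanity check, setting Ψ = I recovers the uniform density 1/w_q on the sphere.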

Rejection Sampling for the Bingham Distribution
1. Set Ψ^{-1} = I_q + (2/b) A, where b is a tuning constant, and choose M* ≥ sup_x f*(x)/g*(x).
2. Draw u from U(0, 1) and a candidate value y from the ACG distribution on the sphere with parameter matrix Ψ.
3. If u ≤ f*(y) / (M* g*(y)), accept y; otherwise reject y and return to step 2.

Let A = V Λ V^T denote the spectral decomposition of A, where Λ = diag(λ_1, . . . , λ_q) and V is an orthogonal matrix; then it can be shown that if x is drawn from a distribution with probability density function f(x; A), the corresponding random vector y = V^T x is drawn from a distribution with density f(y; Λ) (see, for example, Kume and Walker, 2007). Therefore, without loss of generality, we assume that A = Λ = diag(λ_1, . . . , λ_q). Moreover, to ensure identifiability, we assume that λ_1 ≥ λ_2 ≥ . . . ≥ λ_{q-1} ≥ λ_q = 0 (Kent). The probability density function with respect to the uniform measure on the sphere therefore becomes

f(x; Λ) = c(Λ)^{-1} exp(−Σ_{i=1}^{q-1} λ_i x_i^2).    (4)

Suppose (x_1, x_2, . . . , x_n) is a sample of unit vectors in S^{q-1} from the Bingham distribution with density (4). Then the likelihood function is given by

L(Λ; x_1, . . . , x_n) = c(Λ)^{-n} exp(−n Σ_{i=1}^{q-1} λ_i τ_i),    (5)

where τ_i = (1/n) Σ_{j=1}^{n} x_{ji}^2 and x_{ji} denotes the i-th component of x_j. The data can therefore be summarised by (n, τ_1, . . . , τ_{q-1}), and (τ_1, . . . , τ_{q-1}) are sufficient statistics for (λ_1, . . . , λ_{q-1}).
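Concretely, the rejection sampler of Kent and Ganeiber (2012) can be sketched as follows. The closed-form bound M* and the fixed tuning constant b = 1 are our own simplifying choices (in practice b would be optimised), so this is an illustration rather than their exact implementation:

```python
import numpy as np

def sample_bingham(A, n, rng, b=1.0):
    """Rejection-sample n points from Bingham(A) on S^{q-1}, f*(x) = exp(-x^T A x),
    using an ACG envelope with Psi^{-1} = I_q + (2/b) A.

    With t = x^T Psi^{-1} x = 1 + (2/b) x^T A x, the ratio f*/g* is proportional
    to exp(-b (t - 1) / 2) t^{q/2}, maximised over t > 0 at t = q/b; this yields
    the bound M* = exp((b - q)/2) (q/b)^{q/2} (our derivation). Assumes A is
    symmetric positive semi-definite, so Psi^{-1} is positive definite.
    """
    q = A.shape[0]
    Psi_inv = np.eye(q) + (2.0 / b) * A
    Psi = np.linalg.inv(Psi_inv)
    log_M = 0.5 * (b - q) + 0.5 * q * np.log(q / b)
    out = []
    while len(out) < n:
        y = rng.multivariate_normal(np.zeros(q), Psi)
        x = y / np.linalg.norm(y)
        log_f = -float(x @ A @ x)                          # log f*(x)
        log_g = -0.5 * q * np.log(float(x @ Psi_inv @ x))  # log g*(x), unnormalised
        if np.log(rng.uniform()) <= log_f - log_g - log_M:
            out.append(x)
    return np.array(out)

rng = np.random.default_rng(2)
xs = sample_bingham(np.diag([2.0, 1.0, 0.0]), 500, rng)   # lambda = (2, 1, 0)
```

Since λ_1 = 2 penalises the first coordinate most heavily and λ_3 = 0 penalises the third not at all, the samples concentrate antipodally about the third axis.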

Bayesian Inference
We are interested in drawing Bayesian inference for the matrix Λ, or equivalently, for λ = (λ 1 , . . . , λ q−1 ). The likelihood function in (5) reveals that the normalising constant c(Λ) plays a crucial role. The fact that there does not exist a closed form expression for c(Λ) makes Bayesian inference for Λ very challenging.
For example, if we assign independent Exponential prior distributions with rates μ_i (i.e. means 1/μ_i) to the elements of λ, subject to the constraint that λ_1 ≥ λ_2 ≥ . . . ≥ λ_{q-1}, then the density of the posterior distribution of λ given the data is, up to proportionality,

π(λ | x_1, . . . , x_n) ∝ c(Λ)^{-n} exp(−n Σ_{i=1}^{q-1} λ_i τ_i) Π_{i=1}^{q-1} μ_i exp(−μ_i λ_i) 1{λ_1 ≥ λ_2 ≥ . . . ≥ λ_{q-1} ≥ 0}.

Consider the following Metropolis-Hastings algorithm, which aims to draw samples from π(λ | x_1, . . . , x_n):

1. At the current state λ_cur, propose a candidate value λ_can from N_{q-1}(λ_cur, S).
2. Accept λ_can with probability min{1, π(λ_can | x_1, . . . , x_n) / π(λ_cur | x_1, . . . , x_n)}; otherwise remain at λ_cur.

Note that N_{q-1}(m, S) denotes the density of a multivariate Normal distribution with mean vector m and variance-covariance matrix S.
Step 2 of the above algorithm requires the evaluation of the ratio π(λ_can | x_1, . . . , x_n) / π(λ_cur | x_1, . . . , x_n), which in turn involves evaluation of the ratio c(Λ_can)/c(Λ_cur). Therefore, implementing the above algorithm requires an approximation of the normalising constant. In principle, one can employ one of the methods proposed in the literature, which are based either on asymptotic expansions (Kent), saddlepoint approximations (Kume and Wood, 2005) or holonomic gradient methods (Sei and Kume, 2013). Although such an approach is feasible, in practice it can be very computationally costly, since the normalising constant would have to be approximated at every single MCMC iteration. Furthermore, however accurate these approximations may be, the stationary distribution of such an MCMC algorithm will not be the distribution of interest π(λ | x_1, . . . , x_n), but only an approximation to it.
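To make this burden concrete, here is a minimal sketch of such an approximation-based sampler for q = 3, in which c(Λ) is approximated by quadrature over the sphere at every posterior evaluation. The quadrature rule, grid size and function names are our own illustrative choices, not one of the approximation schemes cited above:

```python
import math
import numpy as np

def log_c(lam, n_grid=200):
    """Approximate log c(Lambda) for q = 3: the integral of
    exp(-lam1*x1^2 - lam2*x2^2) over S^2 (surface measure), via a midpoint rule
    in spherical coordinates. The uniform-measure constant differs only by the
    factor 4*pi, which cancels in any Metropolis-Hastings ratio."""
    dt, dp = np.pi / n_grid, 2.0 * np.pi / n_grid
    theta = (np.arange(n_grid) + 0.5) * dt               # polar angle
    phi = (np.arange(n_grid) + 0.5) * dp                 # azimuth
    T, P = np.meshgrid(theta, phi, indexing="ij")
    x1 = np.sin(T) * np.cos(P)
    x2 = np.sin(T) * np.sin(P)
    integrand = np.exp(-lam[0] * x1**2 - lam[1] * x2**2) * np.sin(T)
    return math.log(float(integrand.sum()) * dt * dp)

def mh_bingham(tau, n, mu, n_iter, sigma, rng):
    """Random-walk Metropolis-Hastings for (lambda_1, lambda_2): note that every
    posterior evaluation needs a fresh quadrature approximation of c(Lambda)."""
    lam = np.array([1.0, 0.5])

    def log_post(l):
        if not (l[0] >= l[1] >= 0.0):                    # ordering constraint
            return -np.inf
        return -n * log_c(l) - n * float(l @ tau) - float(mu @ l)  # log-lik + Exp priors

    lp = log_post(lam)
    chain = np.empty((n_iter, 2))
    for i in range(n_iter):
        cand = lam + sigma * rng.standard_normal(2)      # N(lam_cur, sigma^2 I)
        lp_cand = log_post(cand)
        if np.log(rng.uniform()) < lp_cand - lp:
            lam, lp = cand, lp_cand
        chain[i] = lam
    return chain

rng = np.random.default_rng(3)
chain = mh_bingham(np.array([0.30, 0.32]), 100, np.array([0.01, 0.01]), 200, 0.5, rng)
```

Every Metropolis-Hastings step here costs a fresh sphere integral, and the chain targets only an approximation of the posterior, which is precisely what the exchange algorithm of the next section avoids.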

An Exchange Algorithm
The main contribution of this paper is to demonstrate that recent developments in Markov Chain Monte Carlo algorithms for so-called doubly intractable distributions enable exact Bayesian inference for the Bingham distribution without having to resort to any kind of approximation. Møller et al. (2006) proposed an auxiliary variable MCMC algorithm to sample from doubly intractable distributions by introducing a cleverly chosen variable into the Metropolis-Hastings (M-H) algorithm such that the normalising constants cancel in the M-H ratio. In order for their algorithm to have good mixing and convergence properties, one should have access to some sort of typical value of the parameter of interest, for example a pseudo-likelihood estimator. A simpler version that avoids having to specify such an appropriate auxiliary variable was proposed by Murray et al. (2006). Although both approaches rely on being able to simulate realisations from the Bingham distribution (see Section 2.3), we choose to adapt to our context the approach of Murray et al. (2006), because it is simple and easy to implement: no typical value of the parameter of interest needs to be specified.
Consider augmenting the observed data x with auxiliary data y, so that the corresponding augmented posterior density becomes

π(λ, y | x) ∝ π(x | λ) π(λ) π(y | λ),    (7)

where π(y | λ) is the same distribution as the original distribution on which the data x are defined (i.e., in the present case, the Bingham distribution). Proposal values for updating the parameter λ are drawn from a proposal density h(· | λ), although in general this density does not have to depend on the variable λ; for example, random-walk proposals centred at λ or independence-sampler proposals could be used. Now consider the following algorithm:

1. Draw λ′ ∼ h(· | λ).
2. Draw auxiliary data y ∼ π(· | λ′).
3. Propose the exchange move from λ to λ′, and accept it with probability

min{1, [f*(x; Λ′) f*(y; Λ) π(λ′) h(λ | λ′)] / [f*(x; Λ) f*(y; Λ′) π(λ) h(λ′ | λ)]},

where f*(x; A) = exp(−x^T A x) is the unnormalised Bingham density as previously (with a slight abuse of notation, f*(x; Λ) denotes the product of the unnormalised densities over the observed sample, and similarly for f*(y; ·)). This scheme targets the posterior distribution of interest (the marginal distribution of λ in (7)); most importantly, note that all intractable normalising constants cancel between the numerator and the denominator, and hence the acceptance probability can be evaluated, unlike in a standard Metropolis-Hastings scheme. In effect, the exchange move proposes to offer the observed data x to the candidate parameter λ′ and, similarly, to offer the auxiliary data y to the current parameter λ.
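Putting the pieces together, here is a minimal sketch of the exchange algorithm for the diagonal parametrisation (λ_1, λ_2) with λ_3 = 0, using a symmetric random-walk proposal so that the h terms cancel from the acceptance ratio. All function names and tuning choices (b = 1, the starting point) are ours:

```python
import numpy as np

def rbingham(lam, n, rng, b=1.0):
    """Rejection sampler for Bingham(diag(lam)) on S^{q-1} with an ACG envelope,
    Psi^{-1} = I + (2/b) diag(lam), in the spirit of Kent and Ganeiber (2012);
    b = 1 is a simple (not optimised) tuning choice."""
    q = lam.size
    psi_inv = 1.0 + (2.0 / b) * lam                     # diagonal of Psi^{-1}
    log_M = 0.5 * (b - q) + 0.5 * q * np.log(q / b)     # log of envelope bound M*
    out = np.empty((n, q))
    k = 0
    while k < n:
        y = rng.standard_normal(q) / np.sqrt(psi_inv)   # N(0, Psi) draw (Psi diagonal)
        x = y / np.linalg.norm(y)
        t = float((psi_inv * x**2).sum())               # x^T Psi^{-1} x
        log_ratio = -float((lam * x**2).sum()) + 0.5 * q * np.log(t)  # log f*/g*
        if np.log(rng.uniform()) <= log_ratio - log_M:
            out[k] = x
            k += 1
    return out

def exchange_sampler(tau, n, mu, n_iter, sigma, rng):
    """Exchange algorithm for (lambda_1, lambda_2) with lambda_3 = 0: auxiliary
    data y ~ Bingham(Lambda_can) make all c(Lambda) terms cancel."""
    lam = np.array([1.0, 0.5, 0.0])
    chain = np.empty((n_iter, 2))
    for i in range(n_iter):
        cand = lam.copy()
        cand[:2] = lam[:2] + sigma * rng.standard_normal(2)  # symmetric proposal h
        if cand[0] >= cand[1] >= 0.0:                        # else prior density is 0
            y = rbingham(cand, n, rng)                       # auxiliary data
            t_aux = (y**2).mean(axis=0)[:2]                  # sufficient stats of y
            log_alpha = (-n * float((cand[:2] - lam[:2]) @ tau)     # f*(x; cand)/f*(x; cur)
                         - n * float((lam[:2] - cand[:2]) @ t_aux)  # f*(y; cur)/f*(y; cand)
                         - float(mu @ (cand[:2] - lam[:2])))        # Exp prior ratio
            if np.log(rng.uniform()) < log_alpha:
                lam = cand
        chain[i] = lam[:2]
    return chain

rng = np.random.default_rng(4)
chain = exchange_sampler(np.array([0.30, 0.32]), 100, np.array([0.01, 0.01]), 500, 0.5, rng)
```

Because the likelihood depends on the data only through the sufficient statistics, both the observed and auxiliary data enter the log acceptance ratio through (τ_1, τ_2) alone.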

Artificial Data
We now illustrate the proposed algorithm by sampling from the posterior distribution of λ for artificial data.

Dataset 1
Consider a sample of n = 100 unit vectors (x_1, . . . , x_100) which results in the pair of sufficient statistics (τ_1, τ_2) = (0.30, 0.32). We assign independent Exponential prior distributions with rate 0.01 (i.e. mean 100) to the parameters of interest λ_1 and λ_2, subject to the constraint that λ_1 ≥ λ_2; note that we also implicitly assume that λ_1 ≥ λ_2 ≥ λ_3 = 0. We implemented the algorithm described in Section 3.2.1. The parameters were updated in blocks by proposing a candidate vector from a bivariate Normal distribution with mean equal to the current values of the parameters and variance-covariance matrix σI, where I is the identity matrix, and the samples were thinned by keeping every 10th value. Convergence was assessed by visual inspection of the Markov chains; we found that with σ = 1 the mixing was good, with an acceptance rate between 25% and 30%. Figure 1 shows a scatter plot of the sample from the joint posterior distribution (left panel), whilst the marginal posterior densities for λ_1 and λ_2 are shown in the top row of Figure 2. The autocorrelation function (ACF) plots, shown in the top row of Figure 3, reveal good mixing properties of the MCMC algorithm, which (by visual inspection) appears to be much better than that of the algorithm proposed by Walker (2013, Figure 1). Mardia and Zemroch (1977) report maximum likelihood estimates of λ̂_1 = 0.588 and λ̂_2 = 0.421, with which our results broadly agree. Although in principle one can derive (approximate) confidence intervals based on regularity conditions under which the MLEs are (asymptotically) Normally distributed, an advantage of our (Bayesian) approach is that it allows quantification of the uncertainty in the parameters of interest in a probabilistic manner.

Dataset 2
We now consider an artificial dataset of 100 vectors which results in the pair of sufficient statistics (τ_1, τ_2) = (0.02, 0.40), for which the maximum likelihood estimates are λ̂_1 = 25.31 and λ̂_2 = 0.762, as reported by Mardia and Zemroch (1977). We implement the proposed algorithm by assigning the same prior distributions to λ_1 and λ_2 as for Dataset 1. A scatter plot of a sample from the joint posterior distribution is shown in Figure 1, showing that our approach gives results which are consistent with the MLEs. This example shows that our algorithm performs well even when λ_1 ≫ λ_2.

Earthquake data
As an illustration of an application to real data, we consider an analysis of earthquake data recently analysed by Arnold and Jupp (2013). An earthquake gives rise to three orthogonal axes, and geophysicists are interested in analysing such data in order to compare earthquakes at different locations and/or at different times. More specifically, an earthquake gives rise to a pair of orthogonal axes, known as the compressional (P) and tensional (T) axes, from which a third axis, known as the null (A) axis, is obtained via A = P × T. Each of these quantities is determined only up to sign, and so models for axial data are appropriate. The data can be treated as orthogonal axial 3-frames in R^3 and analysed accordingly, as in Arnold and Jupp (2013), but we will illustrate our method using the A axes only. In general, an orthogonal axial r-frame in R^p, r ≤ p, is an ordered set of r axes, {±u_1, . . . , ±u_r}, where u_1, . . . , u_r are orthonormal vectors in R^p (Arnold and Jupp, 2013). The more familiar case of axial data on the sphere S^2 is the special case corresponding to p = 3, r = 1, which is the case we consider here. The data consist of three clusters of observations relating to earthquakes in New Zealand. The first two clusters each consist of 50 observations near Christchurch, which took place before and after a large earthquake on 22 February 2011; we will label these two clusters CCA and CCB respectively. For these two clusters, the P and T axes are quite highly concentrated in the horizontal plane, and as a result the majority of the A axes are concentrated about the vertical axis. It is of interest to geophysicists to establish whether there is a difference between the pattern of earthquakes before and after the large earthquake. The third cluster is a more diverse set of 32 observations obtained from earthquakes in the north of New Zealand's South Island, and we will label this cluster SI.
We will illustrate our method by fitting Bingham models to the A axes from each of the individual clusters and considering the posterior distributions of the Bingham parameters. We will denote the parameters from the CCA, CCB and SI models by λ_i^A, λ_i^B and λ_i^S respectively, i = 1, 2. The observations for the two clusters near Christchurch yield sample data of (τ_1^A, τ_2^A) = (0.1152360, 0.1571938) for CCA and (τ_1^B, τ_2^B) = (0.1127693, 0.1987671) for CCB. The data for the South Island observations are (τ_1^S, τ_2^S) = (0.2288201, 0.3035098). We fit each dataset separately by implementing the proposed algorithm. Exponential prior distributions (mean 100) were assigned to all parameters of interest, subject to the constraint that λ_1^j ≥ λ_2^j for j = A, B, S. Scatter plots from the joint posterior distributions of the parameters from each individual analysis are shown in Figure 4. The plots for CCA and CCB look fairly similar, although λ_2 is a little lower for the CCB cluster. The plot for the SI cluster suggests that these data are somewhat different.

Figure 4: Posterior samples of (λ_1^j, λ_2^j) from each individual cluster analysis.
To establish more formally whether there is any evidence of a difference between the two Christchurch clusters, we consider the bivariate quantity (λ_1^A − λ_1^B, λ_2^A − λ_2^B). If there is no difference between the two clusters, then this quantity should be (0, 0). In Figure 5 (left panel), we show the posterior sample of this quantity, and a 95% probability region obtained by fitting a bivariate normal distribution with parameters estimated from this sample. The origin is contained comfortably within this region, suggesting there is no real evidence for a difference between the two clusters. Arnold and Jupp (2013) obtained a p-value of 0.890 from a test of equality of the two populations based on treating the data as full axial frames, and our analysis on the A axes alone agrees with this.
The right panel of Figure 5 shows a similar plot for the quantity (λ_1^A − λ_1^S, λ_2^A − λ_2^S). Here, the origin lies outside the 95% probability region, suggesting a difference between the first Christchurch cluster and the South Island cluster. Arnold and Jupp (2013) give a p-value of less than 0.001 for equality of the two populations, so again our analysis on the A axes agrees with this.

Figure 5: Posterior samples for differences in λ_1 and λ_2 for the two sets of Christchurch data (left) and South Island and Christchurch data A (right). This shows a clear difference between the South Island and Christchurch data, but suggests no difference between the two sets of Christchurch data.
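The probability-region check used above can be implemented directly: fit a bivariate normal to the posterior sample of differences and test whether the origin falls inside the 95% ellipse via its Mahalanobis distance. The function name and the toy data are ours, purely for illustration:

```python
import numpy as np

def origin_in_region(diff_sample, chi2_95=5.991):
    """Fit a bivariate normal to a posterior sample of parameter differences and
    report whether (0, 0) lies inside the fitted 95% probability ellipse.
    5.991 is the 0.95 quantile of the chi-squared distribution with 2 df."""
    m = diff_sample.mean(axis=0)                 # fitted mean
    S = np.cov(diff_sample, rowvar=False)        # fitted covariance
    d2 = float(m @ np.linalg.inv(S) @ m)         # squared Mahalanobis distance of origin
    return d2 <= chi2_95

rng = np.random.default_rng(5)
centred = rng.normal(size=(1000, 2))             # stand-in for a posterior sample
```

A sample of differences centred near the origin passes the check, while a clearly shifted sample does not.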

Discussion
There is a growing number of applications that require inference for doubly intractable distributions, including directional statistics, social networks (Caimo and Friel, 2011), latent Markov random fields (Everitt) and large-scale spatial statistics (Aune et al.), to name but a few. Most conventional inferential methods for such problems have relied on approximating the normalising constant and embedding the latter into a standard MCMC algorithm (e.g. Metropolis-Hastings). Such approaches are not only approximate, in the sense that the target distribution is an approximation to the true posterior distribution of interest, but can also be very computationally intensive. Only fairly recently have algorithms become available which avoid the need to approximate or evaluate the normalising constant; see Møller et al. (2006), Murray et al. (2006), Walker (2011) and Girolami et al. (2013).
In this paper we were concerned with exact Bayesian inference for the Bingham distribution which has been a difficult task so far. We proposed an MCMC algorithm which allows us to draw samples from the posterior distribution of interest without having to approximate this constant. We have shown that the MCMC scheme is i) fairly straightforward to implement, ii) mixes very well in a relatively short number of sweeps and iii) does not require the specification of good guesses of the unknown parameters. We have applied our method to both real and simulated data, and showed that the results agree with maximum likelihood estimates for the parameters. However, we believe that a fully Bayesian approach has the benefit of providing an honest assessment of the uncertainty of the parameter estimates and allows exploration of any non-linear correlations between the parameters of interest. In comparison to the approach recently proposed by Walker (2013) (which also avoids approximating the normalising constant) we argue that our algorithm is easier to implement, runs faster and the Markov chains appear to mix better.
In terms of further computational aspects, our algorithm is not computationally intensive, and this is particularly true for the numbers of dimensions that are commonly met in practice (e.g. q = 3). For all the results presented here, we ran our MCMC chains for 10^6 iterations for each of the simulated and real data examples, which we found to be sufficient for good mixing in all cases. Our method was implemented in C++ and each example took between 20 and 30 seconds on a desktop PC with a 3.1GHz processor; note that this is considerably faster than the algorithm proposed by Walker (2013), in which "running 10^5 iterations takes a matter of minutes on a standard laptop". In general, the time taken by our proposed algorithm will depend on the number of auxiliary data points n that need to be simulated, as well as on the efficiency of the underlying rejection algorithm for the particular parameter values at each iteration. The efficiency of the rejection algorithm is likely to deteriorate as the dimension q increases, but we found it to be very efficient for all our examples, and it is reasonably efficient for at least a moderate number of dimensions according to simulations by Ganeiber (2012).
Statistical inference, in general, is not limited to parameter estimation. Therefore, a possible direction for future research within this context is to develop methodology to enable calculation of the model evidence (marginal likelihood). This quantity is vital in Bayesian model choice and knowledge of it will allow a formal comparison between competing models for a given dataset such as the application presented in Section 4.2.