Scaled von Mises–Fisher Distributions and Regression Models for Paleomagnetic Directional Data

Abstract We propose a new distribution for analyzing paleomagnetic directional data, that is, a novel transformation of the von Mises–Fisher distribution. The new distribution has ellipse-like symmetry, as does the Kent distribution; however, unlike the Kent distribution the normalizing constant in the new density is easy to compute and estimation of the shape parameters is straightforward. To accommodate outliers, the model also incorporates an additional shape parameter, which controls the tail-weight of the distribution. We also develop a general regression model framework that allows both the mean direction and the shape parameters of the error distribution to depend on covariates. The proposed regression procedure is shown to be equivariant with respect to the choice of coordinate system for the directional response. To illustrate, we analyses paleomagnetic directional data from the GEOMAGIA50.v3 database. We predict the mean direction at various geological time points and show that there is significant heteroscedasticity present. It is envisaged that the regression structures and error distribution proposed here will also prove useful when covariate information is available with (i) other types of directional response data; and (ii) square-root transformed compositional data of general dimension. Supplementary materials for this article are available online. Code submitted with this article was checked by an Associate Editor for Reproducibility and is available as an online supplement.


Background: Paleomagnetic Directional Data
Spherical data are frequently encountered in the earth and environmental sciences (e.g., Schuenemeyer and Drew 2011;Borradaile 2003). A common example is paleomagnetic data consisting of observations on the direction of magnetism in either rocks, sediment or in archeological specimens, measured at various geological time points and spatial locations. The directions are usually measured as declination and inclination angles based on strike and dip coordinates (see Schuenemeyer and Drew 2011, p. 379 for a full definition). Often it is of interest to calculate a sample mean and standard error estimate of the direction at a particular spatial location and in small geological time ranges (e.g., Acton, Galbrun, and King 2000, p. 166). In other cases, depending on the data available, it is of interest to explore the relationships between the directions versus geological time and/or space to understand how the Earth's magnetic field has evolved. In this case, to account for the highly nonlinear relationships between the geomagnetic field directions and the covariates, in the geophysics literature, the geomagnetic field is usually expressed in terms of spherical harmonics, and the temporal evolution of the process is modeled using cubic B-splines. The residuals in these models are then assumed to have either an approximate Gaussian or Laplace distribution (e.g., Walker and Jackson 2000;Panovska et al. 2015). Paleomagnetic data is typically heavy-tailed and contains outliers (e.g., Acton, Galbrun, and King 2000;Panovska et al. 2015).
In this article, we focus on analyzing archeomagnetic data in the GEOMAGIA50.v3 database (GMAG; Brown et al. 2015), extracted in February 2017. GMAG is a very detailed online database providing access to a large amount of published paleomagnetic, rock magnetic, and chronological data from a variety of materials that record Earth's magnetic field over the past 50,000 years. For simplicity we restrict our analysis to a single spatial location which is the Eifel maars (EIF) lakes in Germany. Similarly to Panovska et al. (2015), we relocate nearby archeomagnetic data (latitudes in the range [40 • , 60 • ] and longitudes in the range [−3 • , 17 • ]) to the EIF location using an axial dipole correction as defined by eq. (1) in Noel and Batt (1990). Our archeomagnetic data is, therefore, equivalent or close to equivalent to (Panovska et al. 2015, fig. 10), top two plots (we exclude the sediment data). These plots are given here in Figure 1 and they show that the angles may be heavy-tailed and there is some evidence of nonconstant variability (heteroscedasticity) across time. Before, we analyze the data, we convert these angles to Cartesian coordinates defined on S 2 , where S p−1 denotes the unit sphere {y ∈ R p : ||y|| = 1}. In the conversion, we use Historically both the Kent distribution and von Mises-Fisher have been used to a limited extent to summarize paleomagnetic data samples (e.g., Fisher 1987;Tauxe 2010). One major issue with the Kent distribution is that the normalizing constant does not exist in closed form and involves multidimensional integrals that are difficult to compute. This has led to the use of either a high concentration or saddlepoint density approximation (Kent 1982;Kume and Wood 2005) to estimate the shape parameters. However, often the residual variability in applications is not small and these methods can lead to biased estimates especially when the shape parameters are spread and the ellipticity is high (Scealy and Welsh 2014). More recently Scealy and Welsh (2017) avoided the issue of estimating the shape parameters by instead modeling and then estimating the first-and secondorder moments of the Kent distribution. They based inference for the moments on a nonparametric bootstrap method, but this has the disadvantage of being computationally intensive and is cumbersome to apply.

Main Contributions of the Article
We propose a new family of flexible yet tractable error distributions for directional response data, which we call the Scaled von Mises-Fisher (SvMF) family. The SvMF family is generated by applying a bijective transformation of the sphere to itself which is defined in Section 2. The SvMF family has the same symmetry properties as the Kent (1982) distribution and the new density has virtually identical contours to the Kent density for certain ranges of the shape parameters. However, unlike the Kent distribution, the normalizing constant in a SvMF distribution is essentially that of the underlying von Mises-Fisher distribution and is, therefore, highly tractable. A further inter-esting property of the SvMF family is an extra parameter, which should be thought of as a tuning parameter, which allows some control of the tail-weight of the distribution for a given level of concentration of the distribution. We demonstrate that the shape parameters can be estimated in a computationally convenient way using standard maximum likelihood estimation methods. Moreover, we show how the new model can be used in the regression setting, allowing both the mean direction and the shape parameters to be modeled directly as functions of a general covariate vector. Simulation from the new model is also straightforward, as it just involves a simple transformation of a von Mises-Fisher random variable. We use this new modeling approach to analyze GMAG paleomagnetic data (see Brown et al. 2015).

Relevant Literature
We briefly mention some other families of distributions on S p of interest but do not consider them further in this article. Jones and Pewsey (2005) point out that the family they consider on the circle S 1 has an extension to S p , where p > 1. However, this family necessarily exhibits rotational symmetry, unlike the families considered here and mentioned below. Second, Paine et al. (2018) consider a subfamily of the angular Gaussian family (see Mardia and Jupp 2000 for the definition) whose distributions are Kent-like, that is, they have contours of constant density which exhibit ellipse-like symmetry. This angular Gaussian subfamily has some similar features to the family proposed here, though the mathematical form of the density is somewhat different. Downs and Mardia (2002) and Kato and Jones (2010) considered families of distributions on the unit circle S 1 generated by the Möbius transformation, while Kato and McCullagh (2015) proposed a Cauchy family of distributions on the unit sphere S p , p ≥ 1, which is based on a Möbius transformation on S p ; see Section 3 of their paper. However, although these constructions are similar in spirit to the construction proposed here, the resulting families of distributions are quite different. Jupp and Kent (1987) and Di Marzio, Panzera, and Taylor (2014) proposed nonparametric regression approaches on the sphere, but restricted to the case of a scalar covariate or a unit vector covariate (in the latter case only), and an isotropic error structure appears to be assumed in both papers. In contrast, our goal here is to develop a general flexible regression framework on the sphere which can handle general vector covariate structures and can accommodate heavy-tails and heteroscedasticity, without assuming a priori that the error distribution is rotationally symmetric. Finally, we mention that Rivest et al. (2016) suggested some interesting ideas for regression modeling on the circle; some of these ideas may prove useful for regression modeling on the sphere.

Structure of the Article
The rest of this article is organized as follows. In Section 2, we specify the family of transformations of S p−1 used to create the SvMF family, which is presented in Section 3. In Section 4, we propose iterative estimation schemes for the parameters, first considering the iid case and then focusing on the regression case. In Section 5, we describe our analysis of the GMAG data discussed in Section 1.1. In Section 6, we present simulation results which provide information about the properties of the parameter estimators for the SvMF model in the iid case. Conclusions are briefly summarized in Section 7 and proofs are given in appendices. Although the paleomagnetic directional data is defined on S 2 , throughout most of the article, we keep the dimension p ≥ 3 quite general.

A Group of Transformations on S p−1
The best-known transformation group on the unit sphere is of course the group of isometries. In a given Cartesian coordinate system, such an isometry may be represented by y → y, where is a p × p orthogonal matrix satisfying = = I p , where I p is the p × p identity matrix. We now consider a second type of transformation. In words, we consider a bijection of S p−1 onto itself obtained by rescaling the coordinate axes in the ambient space R p , and then projecting each point in the image of S p−1 (under the linear transformation of the ambient space) back onto the unit sphere.
To make this more mathematically explicit, define R + to be the set of strictly positive real numbers. For each a = (a 1 , . . . , a p ) ∈ R p + , define the transformation T a : By construction, z ∈ S p−1 , and it is clear that T a is a bijection from S p−1 to itself. Moreover, the set of transformations {T a : a ∈ R p + } forms a group with group operation T a • T b = T a•b , where, abusing notation slightly, we have used the same symbol for the group operation and for the Hadamard product of two vectors; here a • b = (a 1 b 1 , . . . , a p b p ) , where b = (b 1 , . . . , b p ) . Note that the inverse transformation T −1 a is given by T b , where b = (1/a 1 , . . . , 1/a p ) .
Let dS p−1 denote the standard geometric measure on the unit sphere. Let Z denote a random unit vector in S p−1 with probability density function f Z (z) with respect to the surface area measure dS p−1 . Then, since for each a ∈ R p + , T a defines a smooth bijection, it follows that if Z = T a (Y) then the random unit vector Y ∈ S p−1 has probability density function which satisfies The Jacobian function J a (y) is determined in the following lemma whose proof is given in Appendix A.1.
Lemma 2.1. For a ∈ R p + and y ∈ S p−1 , the function J a (y) is given by It is interesting to note that, when we take f Z (z) to be the probability density function of the uniform distribution on S p−1 , the resulting distribution of y turns out to be the angular central Gaussian distribution; see Watson (1983, p. 110) and Mardia and Jupp (2000).

Construction of a Kent-Like Distribution
When we take f Z (z) to be the von Mises-Fisher distribution and apply the transformation T a , we obtain a useful and seemingly new family of distributions, referred to as the SvMF family in the Introduction. Suppose that the components of a = (a 1 , . . . , a p ) in T a satisfy p j=2 a j = 1, and let f Z (z) denote the probability density function of the von Mises-Fisher distribution with respect to geometric measure dS p−1 (z) on S p−1 , and given by where z = (z 1 , . . . , z p ) , e j is the p-vector with component j equal to 1 and all other components zero, j = 1, . . . , p, c p (κ) = (2π) p/2 I (p/2)−1 (κ)/κ (p/2)−1 is the normalizing constant, and I ν denotes the modified Bessel function of the first kind of order ν. When p = 3, the normalizing constant takes the simple form c p (κ) = 2π(e κ − e −κ )/κ. Substituting z = T a (y), where a satisfies Equation (4), and using Equations (2) and (3), leads to the probability density function In Equation (6), the coordinate axes play a special role. We may write y j = y e j , where e j is the jth coordinate axis defined above, and generalizing from {e 1 , . . . , e p } to a general orthonormal basis {μ, γ 2 , . . . , γ p }, we obtain the general density Three theoretical results are now presented. The first result gives a sufficient condition for the density to be unimodal. The proof is given in Appendix A.2.
Proposition 1. Consider the density Equation (7) on S p−1 , where Equation (4) is satisfied, and without loss of generality assume that a 2 = max(a 2 , . . . , a p ) and a 1 ≥ 1. Then Equation (7) is unimodal and has a unique mode y = μ if If κ > 0 and a 1 > a 2 then, on the other hand, the density has a global maximum at y = μ (but is not necessarily unimodal).
Our second result shows that in the high-concentration limit, that is, when κ → ∞, the density is asymptotically Gaussian. The proof is given in Appendix A.3.
Later, for estimation it will also prove useful to consider the following alternative parameterization of γ 2 , γ 3 , . . . , γ p and a 2 , a 3 , . . . , a p . Following Scealy and Welsh (2014), define the p × p orthogonal matrix where μ L = (μ 2 , μ 3 , . . . , μ p ) and H * (μ) is a p×(p−1) matrix whose columns are orthogonal to μ. Let K * be a general (p−1)× (p − 1) orthogonal matrix defined such that holds. Then let where V is a (p − 1) × (p − 1) dimensional symmetric positive definite matrix with the constraint det (V) = 1, which corresponds to condition (4); and assume a 2 > a 3 > · · · > a p . In general the lower p − 1 elements on the diagonal of D in Proposition 3 do not correspond to the eigenvalues of V except under high concentration (see Proposition 2).
To obtain a Kent-like distribution it is convenient, for many practical purposes, to set a 1 = 1. However, numerical investigations indicate that as a 1 increases the density becomes heavier tailed with a higher probability of outliers and the shapes of the densities in the tangent space are more similar to those of a multivariate t-distribution of dimension p−1. The model is a Qsymmetric model as defined by Rivest (1984) and Rivest showed that the information matrix in such models is block diagonal (with the first block associated with location parameters and the second block associated with the shape parameters). In this context, the shape parameters are κ, a 1 , a 2 , . . . , a p and the location parameters are {μ, γ 2 , . . . , γ p }. It is also straightforward to prove that, under condition (4), both a 1 and κ are information orthogonal to all the shape parameters {a 2 , a 3 , . . . , a p }, but a 1 and κ are not information orthogonal to each other. In fact, as discussed later, a 1 and κ are not jointly estimable by maximum likelihood.

Models and Estimators
In this section, we assume that a 1 is fixed and not estimated. By default we suggest setting a 1 = 1 unless a heavier-tailed density is required.

Independent and Identically Distributed Data Case
Let Y 1 , . . . , Y n be an independent and identically distributed sample from the distribution with density Equation (7) and let y i = (y 1,i , y 2,i , . . . , y p,i ) for i = 1, 2, . . . , n denote the observed values of these random variables.

Moment and M-estimators of Location Parameters
The location parameters in can be estimated straightforwardly by using the Kent (1982) moment estimators. The moment estimator of μ is the sample mean directioñ and the moment estimators of γ 2 , γ 3 , . . . , γ p denoted bỹ γ 2 ,γ 3 , . . . ,γ p , respectively, are the unit eigenvectors corresponding, in decreasing order, to the p − 1 strictly positive eigenvalues of Note that there is some nonuniqueness in the definition of thẽ , in that any choice of the form {μ, ±γ 2 , ±γ 3 , . . . , ±γ p } will suffice. If we wish to specify the signs of theγ j uniquely, we can do this with probability one by choosing, for example, the first component of eachγ j to be positive.
The sample mean direction may not be efficient for heavytailed distributions (as seen in the simulation experiment in Section 6). In this case, the normalized spatial median estimator or the spherical median estimator of location available for the von Mises-Fisher distribution can be used (e.g., Ko and Chang 1993). These M-estimators are consistent under the model due to symmetry.

Maximum Likelihood Estimation of All Parameters
If μ and κ are known, then the log-likelihood for V is where y * * 1,i = μ y i and y * * L,i = H * (μ) y i . To estimate V, we maximize Equation (11) with respect to V subject to det (V) = 1 or equivalently subject to log{det (V)} = 0. This constrained optimization problem can be solved by using the method of Lagrange multipliers and the resulting Lagrangian function has a similar form to the log-likelihood for a general scatter matrix for an elliptically symmetric distribution defined on R p−1 . Similar to Maronna (1976, pp. 51-52), maximizing the Lagrangian function leads to the following estimating equation: This estimating equation can be solved by applying the following iterative reweighting algorithm , for some suitable starting value such asV 0 = I p−1 , wherê The modified Bessel function of the first kind is available in many software packages including in R and therefore the above log-likelihood is straightforward to maximize by applying onedimensional derivative free interval search methods. Given μ, to compute joint estimates of κ and V, we suggest iterating between maximizing Equations (11) and (12), where the most recent update of V is used in Equation (12) and the most recent update of κ is used in Equation (11), until convergence of both sets of parameters.
In practice μ, κ, and V are all unknown so we suggest first calculating a preliminary estimate of μ using the sample mean direction Equation (10) or the normalized spatial median estimator and then maximizing the log-likelihood conditional on the preliminary estimate of μ to update κ and V. Then, given the κ and V estimates, we suggest maximizing the log-likelihood for μ to obtain a second, but more efficient estimate of μ. This second estimator of μ can be calculated using the Nelder-Mead simplex algorithm (Nelder and Mead 1965) when the dimension p is low.
The parameter a 1 is set to a fixed value because it was not possible to jointly estimate both κ and a 1 at the same time using the method of maximum likelihood estimation. Specifically, κ and a 1 are not jointly identifiable. We simulated lots of datasets, both heavy-tailed and not heavy-tailed and we observed that in all of these cases the log-likelihood function increased as a 1 → 0 and κ → ∞. This phenomenon of parameters approaching boundary points is not unusual when modeling error distributions with an extra shape parameter; for example, see the comment in Taylor (1992, p. 41). Even for the tdistribution, often the degrees of freedom parameter is treated as a tuning constant rather than estimated because maximum likelihood estimation can sometimes give unsatisfactory results (e.g., Lange, Little, and Taylor 1989).

Preliminary Transformation
For computational convenience, prior to estimation we suggest applying the following orthogonal transformation to the response data where theỹ i ∈ S p−1 are the original data in Cartesian coordinates and˜ is the moment estimator of based on the original data. This preliminary transformation is needed to ensure the final estimates of μ and K * are not too far from the north pole (1, 0, . . . , 0) and the identity matrix, respectively. This initial transformation will lead to approximate information orthogonality of μ and K * and will be exact in the large sample limit case. Note that Kent, Mardia, and McDonnell (2006, pp. 758-759) applied a similar idea in estimation for the complex Bingham quartic distribution which is the analog of the Kent distribution in landmark-based shape analysis for two-dimensional objects.

Regression Case
Assume we have vector responses {Y i ∈ S p−1 : i = 1, 2, . . . , n} associated with a set of covariates {X i ∈ R q : i = 1, 2, . . . , n} and the responses are assumed to be conditionally independent given the covariates. Scealy and Welsh (2011) modeled the conditional distribution of Y i given X i = x i as having a Kent distribution (Kent 1982). In this model the location parameters were modeled as a function of x i and the shape parameters were assumed to be constant. We now describe a tractable way to also model shape parameters as functions of x i . We assume that the density of each Y i conditional on X i = x i is given by Equation (7), where all the parameters in the model are now functions of x i except a 1 , which is assumed fixed at some value, for example, a 1 = 1.

Preliminary Transformation
Prior to estimation, we suggest first replacing the observations y i byTy i for i = 1, . . . , n, whereT = H(p −1/2 1 p )˜ , and˜ is defined in Section 4.1.1. Note that, under mild conditions, T is a consistent estimator of its population analogue T = H(p −1/2 1 p ) and consequently, in large samples, the columns of T play an important role in the specification of the regression model if consistency holds. There is also the option of estimating T and using maximum likelihood estimation, a possibility that deserves further investigation, but in this article, we have opted to use a simpler approach, specifically the moment estimator for T indicated in Section 4.1.3. This transformation usingT results in estimators and fitted values, which are equivariant and is also convenient from a computational point of view, as typically K * is not too far from the identity and the response data is centered as near as possible to the middle of the positive orthant. Centering the responses in this way helps to avoid any of the regression coefficients getting too close to infinity points on the link function scale (see below for further details).

Link Functions
There are many different choices of link functions available to model the parameters in density Equation (7). For convenience and comparative purposes, we choose the same link functions as Scealy and Welsh (2017). However, an interesting topic for further research is the construction and exploration of other link is a vector of regression coefficients. This model assumes that the mean direction is in the positive orthant. In many applications, including the paleomagnetic data example discussed in Section 5, it is reasonable to assume that the conditional mean direction μ(x i ) is not highly variable across the range of x i and is contained well within the positive orthant after re-centering the data using the preliminary transformation discussed in Section 4.2.1. For p = 3, let where v i = g(x i ) ∈ R is a known function and σ 3 > 0, σ 4 > 0, δ 3 ∈ R, δ 4 ∈ R, and c 1 ∈ (−1, 1) are five variance component parameters. The above parameterization Equation (14) implies that where σ 1 > 0, σ 2 > 0, δ 1 ∈ R, δ 2 ∈ R, and c * 1 ∈ (−1, 1) are five variance component parameters, which satisfy δ 1 = δ 3 +δ 4 , δ 2 = δ 3 − δ 4 , σ 2 = σ −1 3 , c * 1 = c 1 , and σ 2 1 = σ 4 σ 3 (1 − c 2 1 ) −0.5 . The right hand side of Equation (15) is the same covariance matrix structure used by Scealy and Welsh (2017) to model their Kent distribution second-order moment matrix. This is a standard general flexible heteroscedastic variance-covariance structure (e.g., Pinheiro and Bates 2000, p. 205) and it can easily be extended into higher dimensions.

Estimation
The regression model parameters can be estimated directly by maximizing the log-likelihood. The log-likelihood is given by where We suggest a two-step iterative algorithm to maximize the above log-likelihood. First, calculate a preliminary estimate of the regression coefficients β by solving, for example, the estimating eq. (17) in Scealy and Welsh (2011). Then repeat the following two step algorithm until convergence of the parameters.
Step 1: Given β, update the variance component parameters in κ(x i ) and V(x i ) by maximizing Equation (16) with respect to these variance components.
Step 2: Given the variance components update from step 1, update β by maximizing Equation (16) with respect to β.
A standard second derivative Newton-Raphson algorithm can be applied in each step to do the optimizations. Note that the derivatives of the modified Bessel function of the first kind can be calculated straightforwardly from known recurrence relations. Approximate standard errors for β can also be estimated directly by using the observed information matrix obtained from the second derivative matrix for β conditional on the other parameters (treating the variance components as fixed). Or alternatively, a bootstrap can be employed to calculate estimated standard errors by resampling the (y i , x i ) pairs.

Equivariance
An important property of our new regression model is that the estimators are equivariant to orthogonal transformations. This is proved in Proposition 4 below. First, denote the original sample data by y 1 , y 2 , . . . , y n and let˜ y be the moment estimator of for this data defined in Section 4.1.1. Also definẽ which is the orthogonal matrix given in Section 4.2.1. Now defineỹ i =Q y y i , i = 1, . . . , n. We apply the regression modeling to theỹ i , not the y i . Suppose that, after doing the regression modeling, we end up with fitted mean directionŝ μ(x 1 ), . . . ,μ(x n ) forỹ 1 , . . . ,ỹ n , respectively. If we wish to find the corresponding fitted mean directions in the original coordinate system for the y i , we calculatê We now state the equivariance result. The proof is given in Appendix A.5. In what follows the subscript y indicates quantities based on the y i , and a subscript w indicates quantities based on the w i , defined in the proposition below.
Proposition 4. Suppose that y 1 , . . . , y n are unit p-vectors which span R p and have a nonzero vector sum. Let A denote an arbitrary orthogonal p × p matrix and define w i = Ay i , i = 1, . . . , n. Then there exists a choice˜ w such that the jth column of˜ w , j = 2, . . . , p, are eigenvectors of corresponding to positive descending eigenvalues; and alsõ w = A˜ y .
Moreover,Q w =Q y A ; theμ(x i ) based on thew i =Q w w i , i = 1, . . . , n, are invariant; and we havê and consequently the fitted mean directions are equivariant with respect to orthogonal transformation A.
There is a finite number of possible choices of˜ w , as of y . In both cases, this number is 2 p−1 , corresponding to sign changes of theγ j,w andγ j,y , respectively, and assuming distinct eigenvalues which occurs with probability one when n ≥ p. However, if we require that˜ w is continuous in A as A ranges over the p×p orthogonal matrices, then uniqueness in the choice of˜ w is recovered, and this leads to the equivariance claimed in the proposition.

Analysis of Paleomagnetic Directional Data
We now describe our analysis of the GMAG data discussed in Section 1. For illustrative purposes we considered three further subsets of the data. Case 1 refers to a single time point, where the geological time variable Age (in years) is set equal to 1250; this is the time point with the most data, leading to a sample size of n = 50. Case 2 covers the Age range 0 to 1500, giving a sample of size n = 788; and Case 3 covers the Age range 1500 to 1900, giving a sample of size n = 150. We fitted the independent and identically distributed model to Case 1 and the regression models to Case 2 and Case 3, with Age as the covariate. As a first step we calculated moment estimates for each of the three cases separately and then transformed the samples so that they were centered at the north pole using Equation (13).
We now discuss our analysis of the Case 1 data. The top two plots and bottom left plot in Figure 2 contain kernel density estimates of the components y 2,i , y 3,i , and y 1,i , respectively. The top left plot shows that a model with heavy-tails may be needed. We interpret the bottom right scatterplot of y 3,i versus y 2,i as providing evidence that the contours of the underlying density are elliptical in shape. As a first step we fitted the Kent distribution to the data using maximum likelihood estimation coupled with a saddlepoint approximation for the shape parameters as in Scealy and Welsh (2014). We then simulated a sample of size n = 100,000 from the fitted Kent model and plotted the resulting nonparametric kernel density estimate (solid black line in Figure 2). We then fitted the distribution defined in Section 3 with a 1 = 1 and then a 1 = 6 using the estimators defined in Section 4.1. The parameter estimates for these models are the true values in Tables 3 and 4 used in the simulation experiment described in Section 6 and the standard errors in these tables can be considered as parametric bootstrap estimates. We simulated large samples from these two fitted models and in Figure 2, we plotted the resulting kernel density estimates. The value a 1 = 6 was chosen to give densities as close as possible to the observed sample marginal distributions of y 2,i and y 3,i based on making the Kolmogorov-Smirnov (KS) two sample test statistics small. The p-values for the KS test statistics for y 2,i , when a 1 = 1 and a 1 = 6, were 0.07 and 0.79, respectively. This gives evidence that the sample could have been generated from the Kent or a 1 = 1 distribution since the test statistic was borderline significant at the 5% level, but the heavy-tailed distribution with a 1 = 6 provided a better fit. For the component y 3,i , all of the KS test statistics were similar and nonsignificant, and there was little difference between the fitted distributions.
We now describe the analysis of the Case 2 and Case 3 data. Letx i represent the ith value of Age. For convenience we rescaled the covariate for each of the two cases separately as In both Figures 3 and 4, the top left, top right, and middle left panels are plots of y 1,i versus x i , y 2,i versus x i , and y 3,i versus x i , respectively. It is seen that there are nonlinear relationships between y i and x i and the variability appears roughly to increase with x i for Case 2 and decrease with x i for Case 3. Before fitting the new regression models, we further transformed the two samples so that they were centered in the middle of the positive orthant as follows wherey i are the samples centered at the north pole based on the preliminary transformation given at Equation (13). We modeled the mean direction using a 6th degree polynomial and so the covariate vector is . Some of the regression coefficients were not significant based on the size of the estimated standard errors and these were removed from the models (i.e., these regression coefficients were set to zero and are omitted in Tables 1 and 2).  Next, we fitted three different models to both the Case 2 and Case 3 data separately. First we fitted the Kent model defined in Scealy and Welsh (2017), but with the random effects omitted (this is equivalent to a fixed-effects only regression model). In this model there are two regression coefficient vectors β 1 = (β 11 , β 12 , β 13 , β 14 , β 15 , β 16 , β 17 ) and β 2 = (β 21 , β 22 , β 23 , β 24 , β 25 , β 26 , β 27 ) and five variance component parameters. Similarly to sec. 5 of Scealy and Welsh (2011), we obtained approximate maximum likelihood estimates of these parameters. In summary, to update the regression coefficients we maximized the Kent log-likelihood and to update the shape parameters we maximized the approximate Gaussian log-likelihood. We repeated these two steps until convergence. In this Kent model ) using the right hand side of Equation (15) with v i = x i , and from the large concentration whereκ(x i ) andβ(x i ) are the Kent shape parameters for the ith unit. The parameter estimates, we obtained for this model are given in Tables 1 and 2. We also include approximate standard errors for the regression coefficients based on the observed Kent information matrix conditional on the asymptotic approximations for the shape parameters as well as standard error estimates from the nonparametric bootstrap with 1000 resamples. With the bootstrap standard errors account is taken of the preliminary transformation but this is not the case with the standard errors based on observed information; nevertheless, these two types of standard errors are often in reasonable agreement. We also fitted the new regression model defined in Section 4.2 with a 1 = 1 and a 1 = 6 to both the Case 2 and Case 3 data with the same covariate vector as the Kent model and with V(x i ) and κ(x i ) parameterized by Equation (14) with v i = x i . In Tables 1 and 2, covering Case 2 and Case 3, respectively, we present the following: parameter estimates; approximate standard errors for the regression coefficients based on the observed information matrix conditional on the variance component estimates; and standard error estimates obtained from the nonparametric bootstrap with 1000 resamples. Table 1 shows that the Case 2 parameter estimates for a 1 = 1 and the Kent model are virtually identical. In all three models the bootstrap standard errors are similar in size to the observed Fisher information standard errors. The asymptotic Gaussian approximation appears to be working reasonably well here for the Kent distribution. All of the retained regression coefficients are significantly different from zero based on the size of the estimated standard errors. The estimates of the regression coefficients for a 1 = 6 are not the same as the a 1 = 1 case, although they are not significantly different based on the size of the estimated standard errors. The standard error estimates are smaller for the a 1 = 6 model and this is not surprising because it has heavier tails and accounts better for outliers. Note that the t-distribution with small degrees of freedom often gives smaller standard errors in models when compared with the Gaussian distribution (Lange, Little, and Taylor 1989). Based on the size of the bootstrap standard errors, there is evidence that σ 3 in Equation (14) satisfies σ 3 > 1, implying that an elliptically symmetric model is needed. There is also evidence that δ 3 in Equation (14) satisfies δ 3 > 0, implying a heteroscedastic model is needed. Table 2 shows that the Case 3 parameter estimates for a 1 = 1 and the Kent model are a lot more different than in Case 2. The bootstrap standard errors and the observed Fisher information standard errors for the a 1 = 1 model are similar in size, but for the Kent model the observed Fisher information standard errors are sometimes much larger than the bootstrap ones. We suspect that the observed information method is grossly overestimating the standard errors for the Kent model because the shape parameter estimates are biased due to the asymptotic Gaussian approximation breaking down. Being able to approximate the standard errors well in both the a 1 = 1 and a 1 = 6 models using the observed information matrix is very useful because the bootstrap method is much more computationally intensive. Model selection for the terms in β can also be based on the approximate observed Fisher information standard errors, simplifying the analysis. Based on the size of the bootstrap standard errors, there is evidence that σ 3 in Equation (14) satisfies σ 3 = 1, implying that a rotationally symmetric model is reasonable in this case. Similarly to Case 2, there is evidence that δ 3 = 0, implying that a heteroscedastic model is needed. In both Cases 2 and 3 there is evidence that δ 4 = 0 which implies that V may not depend on Age.
The solid lines in the top two plots and the middle left plot in Figures 3 and 4 were obtained by plotting H(p −1/2 1 p ) μ(x i ) versus x i (we transformed from the center of the positive orthant back to the north pole), whereμ(x i ) is the predicted value from the fitted a 1 = 1 models (we also included a cubic smoothing spline for comparison in the y 2 and y 3 plots). The model for the mean direction appears to fit the data reasonably well in both cases. To further check the fit of the a 1 = 1 model we also calculated the following standardized residuals: The middle right, bottom left, and bottom right plots in Figures 3 and 4 contain plots of r 2,i versus x i , r 3,i versus x i , and r 3,i versus r 2,i , respectively, (we also included a cubic smoothing spline for comparison in the first two plots). These plots show no obvious patterns and the residuals appear randomly dispersed about zero with constant variance.

Simulation
We simulated 1000 samples of size n = 50 from the following models fitted to the data of Case 1 in Section 5: (i) Distribution defined in Section 3 with a 1 = 1 (P 1 ), and (ii) distribution defined in Section 3 with a 1 = 6 (P 6 ). For each sample we fitted both models and calculated five different estimates of μ using (a) moment estimator, (b) normalized spatial median, (c) maximum likelihood estimator obtained using the Nelder-Mead algorithm under the P 6 model, (d) maximum likelihood estimator obtained using the Nelder-Mead algorithm under the P 1 model, and (e) maximum likelihood estimator for the Kent model with shape parameters estimated via the asymptotic Gaussian approximation. We also calculated the maximum likelihood estimates of κ and V for each sample under the true model. Table 3 contains the estimated standard errors and true values of μ conditioned on in the simulations (the estimated standard errors are parametric bootstrap estimates for the Case 1 data in Section 5). The estimated biases in the mean direction estimators were all negligible and the standard errors and root mean squared errors were all very similar. Table 4 contains the true values of κ and V conditioned on in the simulations as well as standard error and bias estimates, where V ij denotes the (i, j)th element in V.
In Table 3 there are no results for μ 1 because we have the identity μ 2 1 +μ 2 2 +μ 2 3 = 1 and μ 1 is determined up to sign from μ 2 and μ 3 . It is best to give results for μ 2 and μ 3 only because the marginal distributions of y 2 and y 3 are centered close to zero and are approximately symmetric, whereas the marginal distributions of y 1 are highly left skewed due to being distributed close to the upper boundary 1. Biases and standard errors are less informative in this asymmetric case. The standard errors suggest that μ 2 and μ 3 could both be zero. This is due to the fact that we applied the prior orthogonal transformation before the analysis to guarantee equivariance (this transformation recentered the data so that the sample mean direction is at the north pole).
From Table 3 we see that the moment estimator has similar efficiency to the Kent and P 1 maximum likelihood estimator when simulating under P 1 . The normalized spatial median and the P 6 maximum likelihood estimator are slightly less efficient.  When simulating under P 6 , the normalized spatial median, and P 6 maximum likelihood estimator of the mean direction were more efficient than the P 1 maximum likelihood estimator, the moment estimator and the Kent maximum likelihood estimator. The normalized spatial median is only slightly less efficient than the P 6 maximum likelihood estimator when simulating under P 6 . Table 4 shows that V is not significantly different from the identity matrix when simulating under P 6 and there is evidence that the true model could be rotationally symmetric. This is not the case for P 1 as both V 11 and V 22 appear marginally significantly different from 1 based on the size of the standard errors in Table 4.

Conclusion
We introduced a flexible heteroscedastic regression model for paleomagnetic directional data. The error distribution, which is obtained via a novel transformation of the von Mises-Fisher distribution, has some desirable properties. Specifically, the error density has elliptical symmetry; and its normalizing constant is tractable, so that the shape parameters can be estimated directly using maximum likelihood estimation. The new model was successfully applied to the analysis of paleomagnetic data in the GEOMAGIA50.v3 database. It is evident from our analysis that there is significant heteroscedasticity in the data and that the new regression model provides a useful framework which captures nonlinear features in the data. Moreover, the model has a tuning parameter that enables the accommodation of both light-tailed and heavy-tailed directional data. . Now let v = v 2 , v 3 , . . . , v p , where v j = κ 1/2 z * j , j = 2, 3, . . . , p. When κ is large, it follows that the density of v is the (p − 1) dimensional Gaussian density with mean 0 and diagonal covariance matrix. The variables v j for j = 2, 3, . . . , p are each O p (1) and therefore z * j for j = 2, 3, . . . , p is O p (κ −1/2 ). By definition, κ 1/2 y * L = κ 1/2 1− z * 2 /4 1/2 z * = κ 1/2 z * + O p (κ −1 ) and therefore | κ 1/2 z * − κ 1/2 y * L |→ 0 in probability and y * L also has the same asymptotic Gaussian distribution as z * .

A.5. Proof of Proposition 4
First note thatμ Moreover, since A is an orthogonal matrix, I p −μ wμ w = A I p −μ yμ y A , and therefore Consequently, the first part of the proposition holds and, in particular, we may choose˜ w = A˜ y . It then follows directly from Equation (17) thatQ w =Q y A ; w i =Q w w i =Q y A Ay i =Q y y i =ỹ i , i = 1, . . . , n, so that, in particular, theμ(x i ) are invariant (as opposed to equivariant); and, finally, as required.