Nonparametric Statistical Methods on Manifolds

One of the many fundamental contributions that Rabi Bhattacharya, together with his coauthors, has made is the development of a general nonparametric theory of statistical inference on manifolds, in particular related to both intrinsic and extrinsic Fréchet means of probability distributions thereon (cf. Bhattacharya and Bhattacharya 2012; Bhattacharya and Patrangenaru 2005, 2013). With the increasing importance of statistical analysis for non-Euclidean data in many applications, there is much scope for further advances in this broad area of research. In what follows, we concentrate on two particularly important themes in data analysis on manifolds: nonparametric bootstrap methods and nonparametric curve fitting.


Bootstrap methods
A major application of the central limit results in [4] and [5] for intrinsic and extrinsic means on general manifolds is to statistical inference, e.g. the construction of confidence regions and multi-sample tests via bootstrap methods.
The bootstrap, see [8], provides a way of estimating, on the basis of an observed sample, the sampling distribution of a statistic T(S; F), where S = {X_1, . . . , X_n} is an independent random sample from a probability distribution F. Here we will assume that F is a distribution on a Riemannian manifold M and that the target for inference is a population location parameter µ = µ(F) ∈ M. Let F̂ denote the empirical distribution function (EDF) based on the sample S; so, for A ⊂ M,

F̂(A) = n⁻¹ ∑_{i=1}^n δ_{X_i}(A),

where δ_{X_i}(A) = 1 if X_i ∈ A and zero otherwise. Let S* = {X*_1, . . . , X*_n} denote a sample of size n drawn randomly with replacement from the original sample S, with corresponding EDF F̂*. Then Efron's bootstrap principle is that the distribution of μ̂ = µ(F̂), based on an i.i.d. sample {X_1, . . . , X_n} from F, is approximately the same as the distribution of μ̂* = µ(F̂*), based on an i.i.d. sample {X*_1, . . . , X*_n} from F̂. The usefulness of this result is that the distribution of μ̂* can be approximated with arbitrary accuracy by simulation and hence used as a basis for inference about µ.
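To make the bootstrap principle concrete, here is a minimal illustrative sketch (not taken from the references): directions on the unit circle S¹ ⊂ R², with µ(·) taken to be the extrinsic mean, i.e. the normalised Euclidean sample mean. All names and numerical choices are purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def extrinsic_mean(X):
    """Extrinsic mean on the circle: project the Euclidean mean back onto S^1."""
    m = X.mean(axis=0)
    return m / np.linalg.norm(m)

# Simulate n directions concentrated around angle 0.
n = 100
theta = rng.normal(0.0, 0.3, size=n)
X = np.column_stack([np.cos(theta), np.sin(theta)])

mu_hat = extrinsic_mean(X)

# Efron's bootstrap: resample rows of X with replacement and recompute the mean.
B = 500
boot_means = np.array([
    extrinsic_mean(X[rng.integers(0, n, size=n)]) for _ in range(B)
])

# The spread of the bootstrap means approximates the sampling variability of mu_hat.
angles = np.arctan2(boot_means[:, 1], boot_means[:, 0])
print(mu_hat, angles.std())
```

The standard deviation of the bootstrap angles is a simulation-based stand-in for the unknown sampling variability of μ̂.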
Two very basic but commonly arising inference problems concerning µ are as follows: (i) given a sample S = {X 1 , . . . , X n }, construct a confidence region for µ ∈ M; and (ii) given k ≥ 2 independent samples S 1 , . . . , S k , test the hypothesis that µ 1 = · · · = µ k , where µ i = µ(F i ) and F i is the population distribution from which the sample S i = {X i,1 , . . . , X i,ni } is drawn.
There are many possible ways to construct bootstrap procedures to address (i) and (ii), but doing so using pivotal statistics has well-known advantages; see, for example, [10] and [12]. A pivotal statistic is one whose asymptotic distribution does not depend on unknown parameters. Pivotal statistics for bootstrapping in the setting where µ is an extrinsic mean and M is the unit sphere were devised in [10], and the approach was generalised in [4] and [5] to general M.
Consider the following strategy for constructing a pivotal statistic for an extrinsic mean. Suppose that {X_1, . . . , X_n} is an i.i.d. sample from a subset of a finite-dimensional linear space, represented as R^q or C^q, in which M is embedded. Assume that a location parameter of interest can be written as

µ = φ(Ξ), where Ξ = E(X_1)

and φ is a smooth map with codomain of dimension t, say. Typically, extrinsic means can be written in this form.
Suppose that the sample mean Ξ̂ = n⁻¹ ∑_{i=1}^n X_i satisfies a central limit theorem, so that as n → ∞,

√n (Ξ̂ − Ξ) → N(0, V)

in distribution. Since φ(·) is smooth, then, by the delta method,

√n {φ(Ξ̂) − φ(Ξ)} = L √n (Ξ̂ − Ξ) + o_p(1) → N(0, L V L⊤)

in distribution, where L is a matrix of derivatives. Then, provided L V L⊤ has full rank,

n {φ(Ξ̂) − φ(Ξ)}⊤ (L V L⊤)⁻¹ {φ(Ξ̂) − φ(Ξ)} → χ²_t

in distribution. In practice it is convenient to replace L V L⊤ with its asymptotically equivalent sample analogue L̂ V̂ L̂⊤, leading to the pivotal statistic

T(µ) = n {φ(Ξ̂) − µ}⊤ (L̂ V̂ L̂⊤)⁻¹ {φ(Ξ̂) − µ}.   (1)

A bootstrap version of this statistic is

T*(μ̂) = n {φ(Ξ̂*) − μ̂}⊤ (L̂* V̂* L̂*⊤)⁻¹ {φ(Ξ̂*) − μ̂},   (2)

where Ξ̂*, L̂* and V̂* are computed from a bootstrap resample; (1) and (2) can be used to construct bootstrap algorithms to address problems (i) and (ii) as follows.
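Before turning to the algorithms, the statistic (1) can be illustrated in the simplest possible setting, directions on the unit circle with φ mapping the ambient mean Ξ to its angle, so that t = 1 and L is a 1-by-2 matrix of derivatives. This sketch, with all names illustrative, is not from the references.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
theta = rng.normal(0.5, 0.2, size=n)
X = np.column_stack([np.cos(theta), np.sin(theta)])   # sample in R^2

Xi_hat = X.mean(axis=0)                               # sample mean in the ambient space
V_hat = np.cov(X.T, bias=True)                        # sample analogue of V

def phi(xi):
    """Smooth map sending the ambient mean to the angle of the extrinsic mean."""
    return np.arctan2(xi[1], xi[0])

# L = gradient of phi evaluated at Xi_hat (here a vector, since t = 1).
L_hat = np.array([-Xi_hat[1], Xi_hat[0]]) / (Xi_hat @ Xi_hat)

def T(mu_angle):
    """Pivotal statistic (1): asymptotically chi-squared with 1 degree of freedom."""
    diff = phi(Xi_hat) - mu_angle
    return n * diff**2 / (L_hat @ V_hat @ L_hat)

print(T(0.5), T(0.0))
```

At the true angle 0.5 the statistic is of order one, while at a wrong value such as 0.0 it is very large, which is exactly the behaviour a confidence region or test exploits.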
Step 4. Return the approximate 100(1 − α)% confidence region

{µ ∈ M : T(µ) ≤ c*_{1−α}},

where T(µ) is the pivotal statistic (1) based on the original sample S and c*_{1−α} is the 100(1 − α)% empirical quantile of the bootstrap values of (2).
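For the circle example above, the whole pivotal bootstrap confidence-region construction for problem (i) can be sketched as follows; this is an illustrative toy implementation, not the algorithm of the references.

```python
import numpy as np

rng = np.random.default_rng(2)
n, B, alpha = 200, 999, 0.05
theta = rng.normal(1.0, 0.3, size=n)
X = np.column_stack([np.cos(theta), np.sin(theta)])

def angle_stat(Y, mu_angle):
    """Pivotal statistic for the angle of the extrinsic mean of Y."""
    m = Y.mean(axis=0)
    L = np.array([-m[1], m[0]]) / (m @ m)      # derivative of atan2 at m
    V = np.cov(Y.T, bias=True)
    diff = np.arctan2(m[1], m[0]) - mu_angle
    return len(Y) * diff**2 / (L @ V @ L)

mu_hat_angle = np.arctan2(X[:, 1].mean(), X[:, 0].mean())

# Bootstrap the pivotal statistic, centred at the sample estimate.
T_star = np.array([
    angle_stat(X[rng.integers(0, n, size=n)], mu_hat_angle) for _ in range(B)
])
c = np.quantile(T_star, 1 - alpha)             # bootstrap critical value

# The confidence region is the set of mu with T(mu) <= c.
def in_region(mu):
    return angle_stat(X, mu) <= c

print(c, in_region(mu_hat_angle), in_region(0.0))
```

The region is reported implicitly, as a membership test, which is often the most convenient representation on a manifold.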
Step 2. Find μ̂_pooled to minimise ∑_{j=1}^k T_j(µ) over µ, and write T̂_j = T_j(μ̂_pooled), j = 1, . . . , k.
Step 3. Set up the bootstrap null hypothesis H_Boot by adjusting the empirical distribution functions F̂_1, . . . , F̂_k to F̂_1^adj, . . . , F̂_k^adj, in such a way that

µ(F̂_1^adj) = · · · = µ(F̂_k^adj) = μ̂_pooled.

One can adjust the F̂_j either by transforming the samples in some way, or by resampling with non-uniform resampling probabilities, or by a combination of the two.
Step 4. For b = 1, . . . , B, generate resamples S*_{1b}, . . . , S*_{kb} from F̂_1^adj, . . . , F̂_k^adj, respectively, and calculate T*_{1b}, . . . , T*_{kb}, where T*_{jb} is the pivotal statistic based on S*_{1b}, . . . , S*_{kb}.
Step 5. Calculate the bootstrap p-value

p* = B⁻¹ #{b : ∑_{j=1}^k T*_{jb} ≥ ∑_{j=1}^k T_j(μ̂_pooled)}.

Types of manifold-valued data for which bootstrap algorithms of this kind have been developed include directions, axes, and 2D and 3D shapes (cf. [1], [2], [10], [23] and [24]). In 2D shape analysis, for example, preshapes (configurations of landmarks with location and scale information removed) can be written as complex unit vectors Z_1, . . . , Z_n. Taking X_i = Z_i Z_i*, where * denotes conjugate transpose, further removes rotation information, and the space of all possible X_i can then be identified with 2D similarity shape space (cf. [5]). The population and sample extrinsic mean shapes are the unit eigenvectors corresponding to the largest eigenvalues of Ξ = E(X_1) and Ξ̂ = n⁻¹ ∑_i X_i, respectively. Defining φ(Ξ̂) to be the unit eigenvector corresponding to the largest eigenvalue of Ξ̂ leads to straightforward calculations for L̂ and V̂ in (1) (cf. [2]). In 3D shape analysis with p landmarks, preshapes can be written as 3-by-(p − 1) matrices Z_i, and taking X_i = Z_i⊤ Z_i removes rotation (as well as reflection) information. A map φ(Ξ̂) defined to project Ξ̂ onto the space of positive-definite matrices of rank ≤ 3 can be interpreted as giving an extrinsic mean reflection shape (cf. [7]). Calculations for L̂ and V̂ are lengthier but tractable. An important issue, however, is that the limiting χ² distribution of (1) has 3p − 7 degrees of freedom, and unless sample sizes are very large then L̂* V̂* L̂*⊤ can be singular or ill-conditioned under bootstrap resampling. This leads to poor coverage accuracy even with reasonably large sample sizes. Using regularised test statistics, such as (1) and (2) with adjustments made to the smaller eigenvalues of L̂ V̂ L̂⊤, appears necessary, and numerical evidence suggests such an approach is very effective (cf. [23] and [24]).
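A deliberately simplified numerical analogue of the k-sample algorithm, for k = 2 samples of directions on the circle, might look as follows. It is nonpivotal (the statistic is an unstandardised sum of squared angular distances to the pooled extrinsic mean) and the null adjustment rotates each sample so that its mean angle matches the pooled one, which is one instance of the sample-transformation option in Step 3; all details are illustrative assumptions, not the procedure of the references.

```python
import numpy as np

rng = np.random.default_rng(3)

def mean_angle(Y):
    """Angle of the extrinsic mean of directions Y on the unit circle."""
    m = Y.mean(axis=0)
    return np.arctan2(m[1], m[0])

def sample_on_circle(mu, sd, n):
    th = rng.normal(mu, sd, size=n)
    return np.column_stack([np.cos(th), np.sin(th)])

def rotate(Y, a):
    """Rotate every direction in Y by the angle a."""
    R = np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
    return Y @ R.T

S1, S2 = sample_on_circle(0.4, 0.3, 150), sample_on_circle(0.4, 0.3, 120)
mu_pool = mean_angle(np.vstack([S1, S2]))

def stat(SA, SB):
    """Sum of squared angular deviations from the recomputed pooled mean."""
    mp = mean_angle(np.vstack([SA, SB]))
    return (mean_angle(SA) - mp) ** 2 + (mean_angle(SB) - mp) ** 2

t_obs = stat(S1, S2)

# Null adjustment: after rotation, both samples share the pooled mean angle.
S1_adj = rotate(S1, mu_pool - mean_angle(S1))
S2_adj = rotate(S2, mu_pool - mean_angle(S2))

# Resample under the bootstrap null and compute the p-value.
B = 500
t_star = np.empty(B)
for b in range(B):
    R1 = S1_adj[rng.integers(0, len(S1_adj), len(S1_adj))]
    R2 = S2_adj[rng.integers(0, len(S2_adj), len(S2_adj))]
    t_star[b] = stat(R1, R2)

p_value = np.mean(t_star >= t_obs)
print(p_value)
```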
In summary, the pivotal statistic for Algorithms 1 and 2 requires: a central limit theorem for the extrinsic mean (as provided by [5] in a general manifold setting), and a smooth map φ(·) that defines a meaningful location parameter, for which the calculations of L and V are tractable. If any ingredient is missing, one may still develop bootstrap approaches which are nonpivotal; see examples for analysis of projective shape in [21], and planar curves in [9].
A broadly applicable nonpivotal bootstrap approach addressing problem (i) above is the following, in which d is a metric on M.
Step 2: Generate B independent resamples S*_1, . . . , S*_B, each drawn randomly with replacement from the original sample S, and calculate μ̂*_b = µ(F̂*_b) for b = 1, . . . , B.
Step 4: Return the (approximate) 100(1 − α)% confidence region

{µ ∈ M : d(µ, μ̂) ≤ q*_{1−α}},

where q*_{1−α} is the 100(1 − α)% empirical quantile of the distances d(μ̂*_1, μ̂), . . . , d(μ̂*_B, μ̂).

It remains an open question how best to develop effective bootstrap algorithms on more general spaces, such as stratified manifolds and, more generally, various types of metric space of potential interest in applications. For the example of "open books" (disjoint copies of half-spaces glued along their boundary hyperplanes), the central limit theorem has a limit distribution which is a mixture of components (cf. [16]), which is a non-standard setting from a bootstrap perspective.
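The nonpivotal construction above is particularly easy to implement. A minimal sketch on the unit sphere S², using the great-circle metric for d and, purely for computational convenience, the extrinsic mean for µ(·):

```python
import numpy as np

rng = np.random.default_rng(4)

def d(x, y):
    """Great-circle (intrinsic) distance on the unit sphere."""
    return np.arccos(np.clip(x @ y, -1.0, 1.0))

def extrinsic_mean(Y):
    m = Y.mean(axis=0)
    return m / np.linalg.norm(m)

# Simulate n points concentrated near the north pole.
n, B, alpha = 150, 500, 0.05
Z = rng.normal(0, 0.15, size=(n, 3)) + np.array([0.0, 0.0, 1.0])
X = Z / np.linalg.norm(Z, axis=1, keepdims=True)

mu_hat = extrinsic_mean(X)

# Bootstrap estimates and their distances to mu_hat.
dist_star = np.array([
    d(extrinsic_mean(X[rng.integers(0, n, size=n)]), mu_hat) for _ in range(B)
])
q = np.quantile(dist_star, 1 - alpha)

# The region is the geodesic ball {mu in M : d(mu, mu_hat) <= q}.
print(mu_hat, q)
```

The resulting region is a geodesic ball centred at μ̂, which is easy to report and to interpret on any metric space.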

Curve fitting
Other classical statistical methodology, such as the widely-used method of principal component analysis, can be adapted and developed to analyse further the variability of manifold-valued data. The most straightforward way to apply principal component analysis to manifold-valued data is to perform standard principal component analysis in the tangent space at the Fréchet mean of the data, i.e. to obtain the eigendecomposition of the sample covariance of the tangent coordinates, with the eigenvectors giving the principal components, ordered by the corresponding eigenvalues. In other words, the data are first transferred, using the inverse exponential map or other similar maps, to the tangent space at their Fréchet mean, and then principal component analysis is applied to the transferred data to find the lower-dimensional subspace in that tangent space that maximizes the variance of the projection of the transferred data. This method, combined with generalised Procrustean analysis, is widely used in statistical shape analysis (cf. [6]). If the first principal component obtained in this way accounts for a sufficiently high proportion of the total variance, a unit vector in the resulting 1-dimensional sub-tangent space determines a geodesic through the Fréchet mean which often gives a good first approximation to the variability of the data.
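Tangent-space principal component analysis is easy to sketch on S². In the toy example below (an illustration, with the extrinsic mean used as a convenient stand-in for the Fréchet mean), the data are mapped to the tangent space via the inverse exponential (log) map and ordinary PCA is applied there:

```python
import numpy as np

rng = np.random.default_rng(5)

def log_map(p, x):
    """Inverse exponential map on the sphere: tangent vector at p pointing to x."""
    c = np.clip(x @ p, -1.0, 1.0)
    th = np.arccos(c)
    if th < 1e-12:
        return np.zeros_like(p)
    return th / np.sin(th) * (x - c * p)

# Data spread mainly along one great circle through the north pole.
n = 200
t = rng.normal(0.0, 0.4, size=n)
s = rng.normal(0.0, 0.05, size=n)
X = np.column_stack([np.sin(t), s, np.cos(t)])
X /= np.linalg.norm(X, axis=1, keepdims=True)

p = X.mean(axis=0)
p /= np.linalg.norm(p)          # base point (stand-in for the Fréchet mean)

# Tangent coordinates and their eigendecomposition.
U = np.array([log_map(p, x) for x in X])
evals, evecs = np.linalg.eigh(np.cov(U.T, bias=True))
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]

# First principal component: dominant direction of variation in the tangent space.
print(evals, evecs[:, 0])
```

The third eigenvalue is numerically zero because every log-mapped vector is orthogonal to the base point, so all the variation lives in the 2-dimensional tangent plane.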
This idea of using geodesics to model the variability of the data can be refined by replacing the use of the tangent space by working directly on the manifold, leading to the concept of principal geodesic component analysis, as introduced in [14]. Applications of principal geodesic component analysis so defined to Kendall's shape spaces can be found in [13] and [15], and to medially-defined anatomical shapes in [11]. For a set of data {X_1, . . . , X_n} in a given Riemannian manifold M with induced metric d, the first principal geodesic γ_0 for this set of data is defined to be

γ_0 = arg min_{γ ∈ G(M)} ∑_{i=1}^n d(X_i, γ)²,

where G(M) denotes the set of all possible maximal geodesics in M and the distance between a given point and a geodesic is defined as the minimum of the distances between that given point and points on the geodesic. To find such an optimal γ_0, we note that, up to re-parametrization and translation along its curve, any geodesic γ can be expressed, using the exponential map, in terms of a point x on γ and a unit tangent vector v at x as γ(t) = exp_x(tv). Then, the above minimization problem can be expressed in the more tractable form

min_{x ∈ M, v ∈ T_x M, ‖v‖ = 1} ∑_{i=1}^n min_t d(X_i, exp_x(tv))².

This allows us to use iterative computer algorithms to approximate the first principal geodesic. However, on account of the ambiguity in the choice of the reference point x, mentioned above, the solution to this minimization problem is no longer unique.
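On S² the search simplifies, since a maximal geodesic is a great circle determined by its unit normal, and the intrinsic distance from a point x to the circle with normal n is arcsin(|⟨x, n⟩|). The hedged sketch below solves the small-residual approximation of the problem by an eigendecomposition of the scatter matrix, rather than a general iterative algorithm:

```python
import numpy as np

rng = np.random.default_rng(6)

n_pts = 300
t = rng.uniform(-1.2, 1.2, size=n_pts)
eps = rng.normal(0.0, 0.05, size=n_pts)
# Points scattered around the great circle in the xz-plane (normal = y-axis).
X = np.column_stack([np.sin(t), eps, np.cos(t)])
X /= np.linalg.norm(X, axis=1, keepdims=True)

def sum_sq_dist(normal):
    """Sum of squared intrinsic distances from the data to the great circle."""
    return np.sum(np.arcsin(np.abs(X @ normal)) ** 2)

# Smallest eigenvector of the scatter matrix: approximately optimal normal,
# since arcsin(u) ~ u for small residuals.
evals, evecs = np.linalg.eigh(X.T @ X)
n_hat = evecs[:, 0]            # eigenvector with the smallest eigenvalue

print(n_hat, sum_sq_dist(n_hat))
```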
We can also consider searching for an optimal curve among other prescribed sets of curves to capture the main features of the data, for example among the set of small circles for data lying on a sphere. The method of principal nested spheres (PNS), proposed in [17], introduces a general framework for a novel non-geodesic decomposition of the variability of data lying on high-dimensional spheres. Instead of searching for an optimal small circle directly, it decomposes a high-dimensional sphere into a sequence of sub-manifolds of decreasing intrinsic dimension, which can be interpreted as an analogue of principal component analysis. The procedure for finding the PNS involves iterative reduction of the data dimension. To describe it more explicitly, we assume that {X_1, . . . , X_n} is a sample in the unit m-sphere S^m, where m > 1. Then the best fitting sub-sphere for this set of data is defined as the sub-sphere of dimension m − 1 in S^m minimizing, among all possible such sub-spheres, the sum of the squares of the distances of the data points to it. Since any sub-sphere A_{m−1} of dimension m − 1 can be characterized by an r ∈ (0, π/2] and a direction x ∈ S^m as

A_{m−1}(x, r) = {y ∈ S^m : d(x, y) = r},

where d denotes the intrinsic distance on S^m, the problem of finding the best fitting sub-sphere becomes that of searching for (x̂, r̂) which solves

arg min_{x ∈ S^m, r ∈ (0, π/2]} ∑_{i=1}^n d(X_i, A_{m−1}(x, r))².

Then, each step of the iterative procedure for fitting principal nested spheres repeats this, after rescaling to standardize the radius of the spheres, for the data (orthogonally) projected onto the best fitting sub-sphere obtained in the previous step. If r̂ = π/2, then the best fitting sub-sphere is a great sub-sphere, so that this method generalises the method of finding principal geodesics, in a similar spirit to the previous approaches to manifold principal component analysis.
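One fitting step can be sketched numerically on S², where the sub-sphere A(x, r) is a small circle and the intrinsic distance from X_i to it is |d(x, X_i) − r|. The crude search below uses a Fibonacci grid over candidate axes x, with the optimal radius for each axis available in closed form as the mean of the distances d(x, X_i); a practical implementation would use proper numerical optimisation, so this is only an illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

def dist(axis, Y):
    """Intrinsic distances from the axis point to each row of Y."""
    return np.arccos(np.clip(Y @ axis, -1.0, 1.0))

# Simulate data along a small circle of radius ~0.6 about the z-axis.
n = 250
phi = rng.uniform(0, 2 * np.pi, size=n)
r0 = 0.6 + rng.normal(0, 0.03, size=n)
X = np.column_stack([np.sin(r0) * np.cos(phi), np.sin(r0) * np.sin(phi), np.cos(r0)])

# Candidate axes: Fibonacci point set on the sphere.
m = 2000
i = np.arange(m)
z = 1 - 2 * (i + 0.5) / m
ang = np.pi * (1 + 5 ** 0.5) * i
C = np.column_stack([np.sqrt(1 - z**2) * np.cos(ang),
                     np.sqrt(1 - z**2) * np.sin(ang), z])

def objective(axis):
    """Residual sum of squares about the best radius for this axis."""
    dd = dist(axis, X)
    return np.sum((dd - dd.mean()) ** 2)

x_hat = min(C, key=objective)
r_hat = dist(x_hat, X).mean()
if r_hat > np.pi / 2:                      # canonicalise so that r in (0, pi/2]
    x_hat, r_hat = -x_hat, np.pi - r_hat

print(x_hat, r_hat)
```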
The procedure for implementing PNS makes it clear that the method can also be used for dimension reduction of spherical data. Moreover, although the PNS method is primarily proposed for data lying on spheres as suggested by its name, the idea could possibly be generalised to fit more general manifold-valued data using principal nested sub-manifolds defined by a sequence of constraints.
For example, [17] did include principal nested 2D shape spaces, which is a simple generalisation. This would expand the range of techniques for the analysis of variability of such data. Nevertheless, it would require an understanding of the geometry of the underlying manifold. One of the challenging issues here is the careful choice of the class of sub-manifolds, so that it is possible to implement the method and so that the interpretation is meaningful.
The methods mentioned above all use some form of orthogonal projection from the observed data to an estimated curve. However, this is not always adequate for the interpolation of time-indexed observed data on manifolds, so that it is necessary to adapt other classical techniques, such as generalising Euclidean cubic spline fitting. Recall first that, for a given data set {X_0, . . . , X_n} in R^m, where X_j is observed at time t_j ∈ T, j = 0, . . . , n, the cubic smoothing spline in R^m fitted to this dataset with smoothing parameter λ is the function f(·, λ) : T → R^m minimizing

∑_{j=0}^n ‖X_j − f(t_j)‖² + λ ∫_T ‖f″(t)‖² dt   (3)

among all C²-functions, where T is a time interval containing all the time points. One way to generalise the Euclidean cubic spline to manifolds is to use parallel transport to transfer data to tangent spaces, preserving the inter-relationships among the data as much as possible, and then to use the known procedure in Euclidean space to find the cubic spline for the transported data. More precisely, for a given dataset S = {X_0, . . . , X_n} in a manifold M, where X_j is observed at time t_j, and smoothing parameter λ, the M-valued smoothing spline fitted to S with parameter λ is defined to be the C²-function γ(·, λ) : [t_0, t_n] → M such that its unrolling γ† onto the tangent space of M at γ(t_0, λ) is the cubic smoothing spline fitted to the data S† obtained by unwrapping S at times t_j, with respect to γ, into the tangent space of M at γ(t_0, λ). For a more formal definition, see [18]. The terms unrolling and unwrapping are both intuitive descriptions of moving the curve and general points on the manifold using the concept of parallel transport. The M-valued smoothing spline defined in this way is, except at the data times, the solution to the 4th-order differential equation

∇³_t γ̇(t) = 0,

where ∇_t denotes the covariant derivative along γ (cf. [18]).
For a given data set, the search for the M-valued smoothing spline involves a straightforward iterative algorithm: given an estimate γ^(i) at the ith stage, fit a Euclidean smoothing spline (γ^(i+1))† with parameter λ, in the tangent space at γ^(i)(t_0), to the data S† unwrapped with respect to γ^(i); and then wrap (γ^(i+1))† at the corresponding times back onto the manifold with respect to γ^(i) to define γ^(i+1). This method was introduced in [18] for solving a non-parametric smoothing problem on the sphere. Subsequent developments have included applications to regression problems and extensions to more complicated manifolds (cf. [19] and [20]). However, for the construction of M-valued splines of this type to be feasible, it is crucial to have knowledge of parallel transport in the manifold M.
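The basic geometric ingredient of unrolling and unwrapping, parallel transport, has a simple closed form on S². The sketch below (illustrative, not the full spline algorithm) transports a tangent vector from p to q along the connecting great-circle arc by rotating its along-geodesic component and leaving the orthogonal component unchanged:

```python
import numpy as np

def parallel_transport(p, q, v):
    """Parallel transport of tangent vector v at p to q along the geodesic on S^2."""
    c = np.clip(p @ q, -1.0, 1.0)
    th = np.arccos(c)
    if th < 1e-12:
        return v.copy()
    u = (q - c * p) / np.sin(th)          # unit tangent at p toward q
    a = v @ u                             # along-geodesic component of v
    w = v - a * u                         # component orthogonal to the geodesic
    u_q = -np.sin(th) * p + np.cos(th) * u  # u transported to q (tangent there)
    return a * u_q + w

p = np.array([0.0, 0.0, 1.0])             # north pole
q = np.array([1.0, 0.0, 0.0])
v = np.array([1.0, 1.0, 0.0])             # tangent vector at p
v_q = parallel_transport(p, q, v)
print(v_q)
```

Transport preserves the length of v and returns a vector tangent at q, which is exactly what makes the wrap and unwrap steps of the iterative algorithm well defined.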
A more direct generalisation of the Euclidean spline to manifold-valued data {X_1, . . . , X_n} in M, where X_i is observed at time t_i, is to solve the analogue for manifolds of the minimization problem (3), i.e. to minimize

E(γ) = ∑_i d(X_i, γ(t_i))² + λ ∫_T ‖∇_t γ̇(t)‖² dt

within a certain set of C²-curves on M. Then, the minimizing function is four-times differentiable and satisfies the differential equation

∇³_t γ̇(t) + R(∇_t γ̇(t), γ̇(t)) γ̇(t) = 0

except at the data times, where R denotes the curvature tensor of M. Clearly, this differential equation is heavily entangled with the geometry of the manifold, and so finding the minimizing curve directly requires full knowledge of that geometry. Rather than solving such a, usually complicated, differential equation, it is possible to search for the minimizer by moving iteratively in the steepest-descent direction (cf. [25]): at each step γ is replaced by γ_τ, where γ_τ(t) = exp_{γ(t)}(−τ grad E(γ)(t)) and τ is a pre-determined positive constant. To implement this procedure successfully, the crucial feature is the use of the second-order Palais metric, defined by

⟨⟨v, w⟩⟩ = ⟨v(t_0), w(t_0)⟩ + ⟨∇_t v(t_0), ∇_t w(t_0)⟩ + ∫_T ⟨∇²_t v(t), ∇²_t w(t)⟩ dt

for any tangent vector fields v and w along γ. With this metric, the gradient of E has a relatively simple closed-form expression in terms of the curvature tensor, together with parallel transport. Compared with the previous method, this requires more detailed knowledge of the geometry of the manifold. However, it has proved possible to carry it out (cf. [26]). In general, the choice of model appropriate to the problem needs to take into account the balance between the need to respect the geometry of the manifold, in order to preserve the inter-relationships among the data, and the requirement to know that geometry in detail.
If the data points are not indexed in some natural order, it is also possible to fit a curve that captures patterns of non-local variation by using differential equations, as in the method of principal flows [22]. The idea behind the method is that, at each point on the curve, the derivative of the fitted curve is the first principal component generated by a localised tangent principal component analysis, so that the curve always follows the direction of maximal variability. For a given set of data {X_1, . . . , X_n} in M, the method relies on the introduction of a localised version of the tangent covariance matrix, a tensor on a suitably restricted open set of M defined by

Σ_λ(x) = (∑_{i=1}^n κ_λ(X_i, x))⁻¹ ∑_{i=1}^n κ_λ(X_i, x) exp_x⁻¹(X_i) exp_x⁻¹(X_i)⊤.

The positive constant λ is used to control the size of the neighbourhood, and κ_λ(X_i, x) = K(λ⁻¹ d(X_i, x)) for a smooth non-increasing univariate kernel K on [0, ∞). Then, the local tangent principal component is the vector field W(x) defined to be the first principal unit eigenvector of Σ_λ(x). Assuming that Σ_λ is defined and has distinct first and second eigenvalues on an open set containing the Fréchet mean x̄ of the data set, the principal flow is the curve γ that solves

γ̇(t) = W(γ(t)), γ(0) = x̄.
The principal flow γ can be extended in both directions from x̄ by choosing the opposite sign for W(x̄); elsewhere, the sign of W is determined by requiring γ̇ to be smooth. Note that W(x) so defined is always tangent to the manifold, and so the solution γ lies in M. The rigidity or flexibility of principal flows can be controlled by varying λ.
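A principal flow can be sketched numerically on S² by alternating two steps: form the kernel-weighted tangent covariance at the current point, then follow its leading unit eigenvector by a small exponential-map step. This is a crude Euler discretisation for illustration, not the integrator of [22]; the Gaussian kernel, bandwidth and step size are assumed for the example.

```python
import numpy as np

rng = np.random.default_rng(8)

def log_map(p, x):
    c = np.clip(x @ p, -1.0, 1.0)
    th = np.arccos(c)
    if th < 1e-12:
        return np.zeros(3)
    return th / np.sin(th) * (x - c * p)

def exp_map(p, v):
    nv = np.linalg.norm(v)
    if nv < 1e-12:
        return p
    return np.cos(nv) * p + np.sin(nv) * v / nv

def W(x, X, lam=0.5):
    """Leading unit eigenvector of the kernel-weighted tangent covariance at x."""
    U = np.array([log_map(x, xi) for xi in X])
    dists = np.linalg.norm(U, axis=1)
    k = np.exp(-(dists / lam) ** 2)          # smooth, non-increasing kernel
    S = (U * k[:, None]).T @ U / k.sum()     # local tangent covariance
    evals, evecs = np.linalg.eigh(S)
    return evecs[:, -1]

# Data along the great circle in the xz-plane.
t = rng.normal(0.0, 0.6, size=300)
eps = rng.normal(0.0, 0.05, size=300)
X = np.column_stack([np.sin(t), eps, np.cos(t)])
X /= np.linalg.norm(X, axis=1, keepdims=True)

# Integrate gamma'(t) = W(gamma(t)) by small exponential-map steps.
gamma = [np.array([0.0, 0.0, 1.0])]
step = 0.05
w_prev = W(gamma[0], X)
for _ in range(20):
    w = W(gamma[-1], X)
    if w @ w_prev < 0:                       # keep the sign of W consistent
        w = -w
    gamma.append(exp_map(gamma[-1], step * w))
    w_prev = w

gamma = np.array(gamma)
print(gamma[-1])
```

Because W(x) is always tangent at x, the exponential-map step keeps every point of the discretised flow exactly on the sphere.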
Over the last decade many models have been proposed for curve fitting on manifolds: the preceding selection is by no means complete. On the other hand, although some statistical analysis, such as inference, has been carried out for the estimated curves (cf. [19]), their asymptotic properties have so far been relatively little explored. While behaviour similar to that of their Euclidean counterparts is to be expected, the extent of the role played by the geometry of the underlying manifolds, and its effect on that behaviour, remains unclear. Methodology based on extensions of the work of Rabi Bhattacharya, e.g. the bootstrap results in [4] and [5], is one plausible avenue to explore.