Radiologic Image-based Statistical Shape Analysis of Brain Tumors

We propose a curve-based Riemannian-geometric approach for general shape-based statistical analyses of tumors obtained from radiologic images. A key component of the framework is a suitable metric that (1) enables comparisons of tumor shapes, (2) provides tools for computing descriptive statistics and implementing principal component analysis on the space of tumor shapes, and (3) allows for a rich class of continuous deformations of a tumor shape. The utility of the framework is illustrated through specific statistical tasks on a dataset of radiologic images of patients diagnosed with glioblastoma multiforme, a malignant brain tumor with poor prognosis. In particular, our analysis discovers two patient clusters with very different survival, subtype and genomic characteristics. Furthermore, it is demonstrated that adding tumor shape information into survival models containing clinical and genomic variables results in a significant increase in predictive power.


Radiological imaging in cancer
There is intensive worldwide interest in preventing, detecting and treating cancer. Radiologic tools for detecting and treating cancer play central roles in disease management and surveillance. Technological advances in imaging equipment and techniques, and development of stage-specific methods for cancer, make medical imaging an indispensable tool for clinicians to monitor and stage various cancers (Gutman et al., 2013). Clinical decision-making, particularly for the brain, is routinely made on the basis of radiological image-based features in a magnetic resonance image (MRI). There are three main analytical tasks in such settings: (1) extraction or segmentation of the tumor region from the MRI, (2) characterization of the tumor via its shape, volume or other features, and (3) development of prognostic or predictive (survival time) models that link MRI features with genomic and clinical variables. Each task brings its own set of challenges.
In this article, we focus primarily on the latter two tasks. Brain tumor characterization is not straightforward because the tissue surrounding the tumor is often heterogeneous in spatial and imaging profiles (Krabbe et al., 1997), and sometimes overlaps with normal tissues (Provenzale et al., 2006). For example, it is extremely difficult to distinguish between primary central nervous system lymphoma and high-grade glioma using MRI (Liu et al., 2011). Integrating volumetric and morphological features of tumors obtained from MRI with clinical and genomic variables is usually based on non-objective numerical summaries of the features generated by experts. Thus, it is difficult to ascertain the reliability and reproducibility of such studies, to generalize to different clinical settings.
The biological process governing tumor growth generates artifacts that can assist in the tasks described above. A tumor normally originates from a single cell, and as it proliferates in size, it exhibits heterogeneity in physiological and shape-related features (Marusyk et al., 2012). Both inter-and intra-tumor heterogeneity are critical for characterizing tumors (De Sousa et al., 2013). Inter-patient tumor heterogeneity can be quantified by morphological characteristics such as the shape and size of the tumor (McLendon et al., 2008), in addition to the genomics and clinical characteristics of a patient.
The relevance of tumor shape in characterizing tumor heterogeneity is linked to its growth process. Intrinsic brain tumors tend to evolve along tracts of white matter, altering the tracts in complex ways that include infiltration, displacement and disruption (Goldberg-Zimring et al., 2005). It is conceivable that new insight into patterns of tumor growth and invasion in the brain can be obtained through a better understanding of the shape and evolution of the tumor. Tumor shape is significantly influenced by the location in the brain and other anatomical constraints-in some places it might infiltrate and in others displace the fiber tracts. Irregular or spiculated shapes suggest an anisotropic structure of the underlying white matter; spherical or regular shapes imply a lack of structural or anatomical restrictions. The size of the tumor evidently affects its shape, especially in the presence of anatomical restrictions. It is reasonable to theorize that a better understanding of the relationship between the tumor's shape and size, and histopathological factors related to the brain tumor would enhance the understanding of the tumor's biological growth process; this would not only enable better prognosis but also potentially predict the likelihood of therapeutic success. For example, Figure 1 of our motivating dataset shows two semi-automated segmentations of T2-weighted fluid-attenuated inversion recovery (FLAIR) brain-axial MRI of patients diagnosed with glioblastoma multiforme (GBM), also known as grade IV glioma, with survival times of longer than 50 months (left) and shorter than one month (right), respectively. The tumor shape for the patient with longer survival appears to be more regular or spherical than the irregular one corresponding to the patient with a short survival; the tumor sizes appear to be quite different as well. Evidently, the tumor locations for the two patients are different, which influences both size and shape.
While the potential importance of tumor shape as a prognostic biomarker has been rec- ognized (Goldberg-Zimring et al., 2005;McLendon et al., 2008), there is a striking paucity of progress in this direction. This is primarily due to the difficulty of representing and integrating tumor shape into existing statistical models. Current approaches that incorporate the information of a segmented brain tumor's shape and size into models for tumor characterization and classification are based on subjective features provided by experts such as tumor circularity/sphericity and irregularity, and numerical summaries such as surface-tovolume ratio, total tumor area and entropy of the radial distribution of boundary pixels (Krabbe et al., 1997). Such radiological features are only indicative of tumor shape and do not fully characterize the shape. Furthermore, the subjective nature of the features ensures that statistical inference founded on them will suffer from a lack of reproducibility and reliability. In a recent article exploring the predictive power of MRI features in the context of GBM, Gutman et al. (2013) state that (page 568): "...it is often challenging to extract objective information for scientific analysis from prose statements of imaging features by neuroradiologists who typically use idiosyncratic vocabulary." Gutman et al. (2013) used various measures of agreement of ordinal and numerical values of neuroimaging features such as size and percentage of necrosis suggested by three expert radiologists, and noted that volumetric and morphological information of the GBM tumor is informative for characterizing its biological growth process.

Statistical challenges and contributions
We can circumvent issues associated with qualitative and quantitative summaries of tumor shape by quantifying and utilizing information about the entire tumor shape. This extension, however, is not straightforward. Viewed statistically, tumor shape is a non-Euclidean object residing on (a quotient space of) some nonlinear manifold. Thus, appropriate representation of a tumor shape should naturally employ statistical methods for non-Euclidean data objects. Motivated by this need, we focus on examining the utility of the 2D shape of GBM tumors obtained from a single brain axial imaging slice with the largest tumor area in two contexts: (1) for detection of inter-tumor heterogeneity, and (2) for evaluation of its association with molecular (genomic) profiles and survival times of patients diagnosed with GBM. The methods we employ are broadly applicable to various tumor types. Recent studies of scalar on image regression models in neuroimaging data applications incorporated the entire image (Reiss and Ogden, 2010;Jiang and Ogden, 2010;Goldsmith et al., 2013;Li et al., 2015;Reiss et al., 2011;Huang et al., 2013); such methods are not applicable in the current setting for the following reasons: (1) the tumors grow and are located in different parts of the brain, (2) the whole-brain images contain a lot of additional information about other brain structures, which might not be directly related to tumor morphology, and (3) there is no standard template to which each tumor image can be registered to make population-level analyses conformable across patients. Thus, our work is distinguished by its focus on the shape of the tumor only.
We model the 2D tumor shapes as properties of parametric curves in R 2 , which provides the flexibility to accommodate uncertainty regarding landmarks and other curve features. Broadly speaking, parameterized curves representing tumor shapes can be viewed as functions defined on a suitable domain; this allows for viewing the tumor shape as a functional observation, making available a broad array of existing tools from functional data analysis. However, we show that the space of these curves cannot be modeled as a linear vector space (e.g., Hilbert space of square integrable functions), as is usually assumed in these techniques; it is hence not possible to directly use basis representations of the tumor shapes. Parametric curves representing tumor shapes are viewed as points on an infinite-dimensional nonlinear manifold, and the statistical methods developed have to be compatible with the constraints posed by such spaces. We adapt the geometric framework for the statistical shape analysis of closed curves proposed by Srivastava et al. (2011) for this purpose. In summary, our main contributions are: (i) We represent tumors as parameterized, planar, closed curves given by their outlines in 2D MRIs, and define a suitable shape space that captures relevant information pertaining to the tumor shapes.
(ii) We define notions of a geodesic path and distance between two tumor shapes, and an average tumor shape; we also describe methods to perform shape-based principal component analysis (sPCA) on the tumor shape space and to identify and visualize principal directions of variation in a sample of tumor shapes.
(iii) On the GBM application (Section 3.1), we illustrate the utility of the developed tools in clustering and classifying tumor shapes, and inferential tasks such as two-sample testing (tumor shape heterogeneity detection) and survival time modeling (prognostic and prognostic capabilities of tumor shape). These tasks are examined in the presence of genomic and clinical covariates, from which we assess the predictive capabilities of the proposed representation in terms of patient survival times.
We develop of a coherent statistical representation of the tumor shape, and use the geometric framework to assist us in implementing supervised and unsupervised tasks such as clustering and integrating tumor shape as a potential predictive or prognostic factor in statistical models commonly used in oncology studies. We find the motivation in the GBM dataset, for which issues about the use of MR imaging features have been recognized but not satisfactorily addressed. We examine statistical methods to integrate the tumor shape with genomic and clinical features of GBM, and investigate associations between them; this can subsequently accelerate effective personalized therapeutic strategies for cancer development and progression. Note that the presented method is more general and can be applied to other cancers and imaging modalities as well.
The rest of this paper is organized as follows. In Section 2, we provide statistical tools for analyzing tumor shapes under an elastic framework. In particular, we focus on comparing and averaging tumor shapes, and summarizing shape variability in a sample of tumors. Section 3 considers specific statistical tasks on the GBM dataset including clustering, hypothesis testing and survival modeling. Section 4 provides a short discussion and directions for future work.
2 Quantifying and visualizing variability in tumor shapes: A geometric approach Some issues associated with characterizing tumors in MRIs can be alleviated through a suitable representation, which should be versatile enough to accommodate various subjective evaluations by neuroradiologists, and at the same time, be mathematically and statistically well-defined so as to facilitate various inferential tasks. Shape analysis based on landmarks (finite collection of ordered points) (Dryden and Mardia, 1998) is not flexible enough in this context since the tumors rarely possess landmark features as such. Even if present, identifying tumor landmarks is difficult, may require subjective assessment, and may not correspond between tumors from different patients. A natural way to represent a tumor is to use a 2D curve that corresponds to its boundary or outline. This allows for uncertainty in all landmark locations, and offers a visually appealing representation.
We adopt and suitably customize the shape definition of Srivastava et al. (2011) that is particularly attractive in the current context (see (Joshi et al., 2007a,b), (Srivastava et al., 2011), and  for details). While describing the tools, we concurrently T1  T2  T1  T2  T1  T2 (1) (2) shapes and outcomes, more sophisticated approaches are required.

Representation of tumor shape and elastic metric
The tumor shapes should be invariant to translation and rotation. Scaling might be considered important, and can easily be incorporated into our framework. Denote a parameterized, planar, closed curve representing the outline of a tumor by a function β : S 1 → R 2 .
Since the tumor outline is a closed curve, it is natural to parameterize it using the domain S 1 , instead of an interval. There are several possibilities for representing β for the purpose of shape analysis. One can simply use the x and y coordinate functions of β; another possibility is to parameterize β using the arc length and compute the angleβ = dβ dt makes with the x-axis (here, t is the curve parameter) (Klassen and Srivastava, 2006).
The choice of a metric on the tumor shape space is vital for comparing two shapes.
Unlike typical problems in shape analysis, there is no template shape available while considering tumors. In this context, it is imperative that the metric capture and be based on all possible deformations or transformations that match one tumor shape to another. One candidate metric is the elastic metric, defined as follows. Suppose p(t) = |β(t)| is the speed function and θ(t) =β/|β(t)| is the angle function. Consider two tangent vectors (small perturbations) (δp i , δθ i ), i = 1, 2 in the tangent space of (p, θ). The elastic metric (Mio et al., 2007) is defined as: for constants a, b > 0. The first term in Equation (1) measures variations in the speed function (i.e., how fast the tumor outline is traversed), while the second term measures the variation in the direction of the unit tangent vectors; a and b provide the relative weights for the two terms. In other words, the first term captures the amount of stretching and the second term captures the amount of bending required to deform one tumor shape into another. Both terms are needed to preserve the geometric properties and natural deformations between tumor shapes; we explicitly show this in Section 2.2.3 via tumor shape reconstruction errors. However, choosing a and b is hard and problem-dependent.
An important source of variation is the choice of parameterization of the tumor contours. This is a nuisance parameter when comparing tumor shapes, since the choice of parameterization is arbitrary and shape preserving, i.e., the tumor contour can be re-parameterized in many different ways, but it's shape remains unchanged. A common approach in the shape analysis literature is to normalize curve parameterizations to arc length to ensure that traversal along the curve is at unit speed. Under this scenario, only bending deformations are allowed, which often results in suboptimal point correspondences across shapes (Mio et al., 2007). We describe how it is possible to not only profitably employ the elastic metric, but also ensure that the resulting geodesic distance is invariant to the choice of parameter-ization. Unless otherwise stated, all curves representing tumor shapes are parameterized according to their arc lengths.

Square-root velocity function
Let Γ = {γ : S 1 → S 1 |γ is an orientation-preserving diffeomorphism} be the group of reparameterization functions, and orientation imply clockwise or counter-clockwise traversal of the contour. The reparameterization of a tumor curve β, termed the action of Γ on the space of curves, is given by the composition (β, γ) = β • γ. The chief issue with using the popular L 2 metric is that the distance between two tumor contours β 1 and β 2 is not preserved under the action of Γ: is the L 2 metric for functions on S 1 . In other words, the action of Γ on the space of tumor curves is not isometric, which means that a comparison of two tumor shapes depends on their parameterizations.
A proposed solution (Joshi et al., 2007a,b;Srivastava et al., 2011;Kurtek et al., 2012) is to use a different representation of curves called the square-root velocity function (SRVF), , where | · | is the standard Euclidean norm in R 2 . This representation is convenient because it is automatically translation invariant. Conversely, β can be reconstructed from q up to a translation. If a tumor curve β is reparameterized to β • γ, then its SRVF changes from q to (q, γ) = (q • γ) √γ .
The main reasons for using the SRVF for tumor shape analysis are as follows.
(i) The complicated but desirable elastic metric reduces to the standard L 2 metric with a = 1/4 and b = 1, allowing for both bending and stretching of tumor shapes: if δq 1 and δq 2 are two tangent vectors to the SRVF q of a tumor curve β = (p, θ), then where ·, · is the standard L 2 inner product (defined later).
If invariance to scale is required, each tumor shape can be re-scaled to unit length. After re-scaling, q 2 = S 1 |q(t)| 2 dt = S 1 |β(t)|dt = 1, i.e., the representation space of all SRVFs is a Hilbert sphere. For tumor shapes, the scale or size of the tumor is often important, and the variability in tumor shape due to scale differences is considered to be important as well. In the GBM data example, we decouple tumor shape and size and consider them individually as covariates in the survival models. For a closed curve, which characterizes the tumor contours we are studying, the corresponding SRVF satisfies the additional closure condition S 1 q(t)|q(t)|dt = 0. Thus, the space of all unit length, planar, closed tumor curves, represented by their SRVFs, is given by

Geodesic paths and distances in the elastic shape space
In the absence of a template tumor shape, it is critical to be able to visualize deformations or changes in tumor shape. The choice of the elastic metric and the SRVF of two tumor shapes makes it possible to compute natural geodesic paths and distances between them; as a consequence, we can visually examine the meaningful deformations of one tumor shape that transforms it into the other by traversing the geodesic path. This is potentially useful to radiologists for assessing possible changes in tumor morphology, thereby facilitating targeted interventions. Figure 3 illustrates this with a simulated example, and Figure 4 offers illustrations on the GBM dataset.
Pre-shape space C with parameterization and rotation variability: The pre-shape space C is a nonlinear submanifold of the Hilbert sphere due to the closure condition. It becomes a Riemannian manifold with the standard where v 1 , v 2 ∈ T q (C) and the inner product in the integrand is the standard Euclidean inner product in R 2 . The task of computing geodesic paths between any two elements q 1 , q 2 ∈ C is accomplished numerically, using an algorithm called path straightening, introduced by Klassen and Srivastava (2006) and adapted to the SRVF representation by Srivastava et al. (2011). This algorithm initializes a path in C connecting q 1 and q 2 , and iteratively 'straightens' it until it becomes a geodesic. Let α * : [0, 1] → C denote the resulting geodesic path. Then, the length of this geodesic path provides a geodesic distance between q 1 and q 2 in C: The issue with d C is that it contains contributions from two nuisance sources of variation: (i) It is a non-zero distance between two tumor shapes when one is just a rotation of the other; (ii) It is a non-zero distance between two tumor shapes when one has been obtained through a reparameterization of the other; since we start with arc length parameterized tumor curves, this distance does not capture stretching deformations that pertain to the first term of the elastic metric in Equation (1).
Shape space S accounting for parameterization and rotation variability: To remedy the issues with the pre-shape geodesic distance d C between two tumor shapes, it needs to be computed while accounting for (i) all possible rotations and (ii) all possible reparameterizations of one tumor shape to optimally register it to the other. This is achieved in the following manner.
Let SO(2) be the group of 2×2 rotation matrices (special orthogonal group). For an arc length parameterized tumor contour β and a rotation O ∈ SO(2), the transformed SRVF of β is given by Oq. Thus, in order to unify all elements in C that denote the same tumor shape, we define equivalence classes of the type Each such equivalence class [q] is associated with a unique tumor shape and vice versa.
The set of all equivalence classes is called the shape space S and is the quotient space C/(SO(2) × Γ). The distance d C can be used to define a distance between tumor shapes on S according to The geodesic distance d S is now the elastic distance on the space of tumor shapes and is invariant to rotation and reparameterization; as a consequence, all possible deformations pertaining to stretching and bending of tumor shapes are captured. Moreover, d S is bounded above by π/2, thereby offering a natural scale for comparing tumor shapes. In practice, the minimization in the definition of d S is performed by sampling each curve with a large number of points, and then recursively applying singular value decomposition (SVD) to find the optimal rotation and the dynamic programming algorithm with an additional seed search to find the optimal reparameterization.
Illustrative examples: We present multiple simulated and real data examples comparing nonelastic geodesic paths and distances (we only optimize over rotations and the seed placement but not the full reparameterization group) to the proposed elastic versions computed in the shape space. The points along the geodesic path between two tumor shapes should be viewed as the possible deformations transforming one tumor shape into the other.
Since, in contrast to elastic shape analysis, the nonelastic framework does not allow stretching and compression deformations, we observe some unnatural shapes appearing along the geodesic paths in that case.
We first illustrate our approach on two simulated curves that are 'toy' tumor shapes.
The curves were generated so as to reflect the protrusion-type behavior of real GBM tumors, and were both initially parameterized with respect to their arc lengths. This example is shown in Figure 3. First, with the given arc length parameterizations, the geometric features on the two curves do not match. This can be seen from the four colored points.
Panel ( Curve with two protruding peaks before reparameterization (uniform spacing of points). (c) Same as (b) after reparameterization (optimal non-uniform spacing of points). We show four colored points of correspondence for improved visualization. The resulting geodesic paths were sampled uniformly using seven points (NE=nonelastic).
correspond to three peaks. Panel (b) shows the second tumor shape, where the green point corresponds to a peak while the other two do not. This results in an unnatural nonelastic geodesic deformation between these two surfaces, where two of the peaks on the first shape are distorted to form the second peak on the second shape; the resulting distance is d N E = 0.9249. Under the elastic framework on the shape space S, the optimal reparameterization is able to match the first two peaks across the two curves very well (green and black points). Of course, there is no counterpart to the third peak on the second curve (cyan point). This results in a natural deformation where the two matched peaks are preserved along the geodesic path while the third one simply grows; the resulting distance is d S = 0.4709 (nearly a 50% decrease). We hypothesize that improvements such as the one in this simulated example are extremely important in capturing natural variability Nonelastic geodesic path and distance Nonelastic geodesic path and distance Elastic Geodesic path and distance Elastic geodesic path and distance shown for the T2-weighted FLAIR modality in Figure 5. It is important to note that these geodesic path improvements are accompanied by significant distance reductions between the nonelastic and elastic frameworks; improvements of this form are also even more drastic when one considers statistical modeling of such tumor shapes. The presented examples thus support our proposal for the use of elastic shape analysis of GBM tumors for association with patient survival and genomic variables.

Statistical summaries of tumor shapes
Hereafter, our analyses focus on the shape space S and the distance d S . However, we illustrate the resulting differences in the statistical summaries under nonelastic and elastic shape analysis. We define and illustrate computations of a mean tumor shape and covariance of a sample of tumor shapes, both defined with respect to d S . Consequently, we demonstrate how sPCA can be applied to explore and visualize the directions of variation in tumor shape based on patient-level information. Identifying such directions can be useful in understanding the most likely deformations of the tumor shape, and can be potentially used to monitor the disease and for targeted therapeutic interventions.

Mean and covariance
Under the SRVF framework, the shape space S is a (quotient space of a) nonlinear submanifold of the Hilbert sphere, which is equipped with a Riemannian structure under the L 2 metric. We first introduce some notation. Let q 1 , q 2 ∈ C be the SRVFs of two tumor pre-shapes, and v ∈ T q 1 (C) denote an element of the tangent space to C at q 1 . Then, the maps q 2 → v = exp −1 q 1 (q 2 ) ∈ T q 1 (C) and v → q 2 = exp q 1 (v) ∈ C are the exponential and inverse exponential maps, respectively. These are not available analytically for the pre-shape space of closed curves; algorithms for computing these quantities (Srivastava et al., 2011) are similar to the technique for finding geodesics.
Let {β 1 , β 2 , . . . , β n } denote a sample of given tumor contours, and {q 1 , q 2 , . . . , q n } be their corresponding SRVFs. Then, the Karcher (Frechet) mean tumor shape is defined as A gradient-based approach for finding this mean is provided in Le (2001) and Dryden and Mardia (1998), and is omitted here for brevity. The Karcher mean is actually an entire equivalence class of curves. For the remainder of our analysis, we select one element of this classq ∈ [q]. The general computation of the covariance around the estimated shape mean is as follows. Let v i = exp −1 q (q * i ), i = 1, 2, . . . , n denote the shooting vectors from the mean shape to each of the shapes in the given data. This first involves an optimal rotation O * and optimal reparameterization γ * of each q i , resulting in q * i = (O * q, γ * ), to register it to the mean shapeq. Then, the covariance kernel can be defined as a function K q : In practice, since the curves have to be sampled with a finite number of points, say m, the resulting covariance matrices are finite-dimensional. Often, the observation size n is much less than m and, consequently, n controls the degree of variability in the stochastic model. A comparison of elastic (allowing optimal non-uniform spacing of points) and nonelastic (uniform spacing of points) averages for the T1-weighted post-contrast and T2-weighted FLAIR GBM tumor shapes is provided in Section 2 of the Supplementary Material. The elastic method is better at summarizing the shapes of GBM tumors as it provides averages that have sharp geometric features.

Shape-based principal component analysis
We explore dominant directions of variation in a sample of tumor shapes with an efficient basis for T [q] (S) using traditional PCA as follows. Let V ∈ R 2m×n be the observed tangent data matrix with n observations and m sample points in R 2 on each tangent, i.e., each column of V is v i = exp −1 q (q * i ), i = 1, 2, . . . , n, stacked into a long vector. Let K ∈ R 2m×2m be the resulting covariance matrix and let K = U ΣU T be its SVD. The submatrix formed by the first r columns of U , calledŨ , spans the r-dimensional principal subspace of the observed shapes and provides the observations of the principal coefficients as C = U T V ∈ R r×n . Thus, each original tumor shape can be represented using a finite set of principal coefficients acting as Euclidean coordinates. These coefficients can then be used in a survival model for prediction as shown later.
A comparison of elastic versus nonelastic sPCA is provided in Section 2 of the Supplementary Material. Elastic sPCA provides a more natural representation of variability in the given dataset. We also compute the overall variance for each sPCA model as 7.86 (elastic) and 12.74 (nonelastic) for T1-weighted post-contrast images, and 13.43 (elastic) and 27.68 (nonelastic) for T2-weighted FLAIR images. The elastic models are much more compact, and thus provide a more efficient Euclidean representation of the tumor shapes in terms of the principal coefficients.

Context-based simulation of tumor shapes
The exponential and inverse exponential maps provide tools to generate random tumor shapes using the principal directions of shape variability obtained from sPCA. Figure ?? Next, we study the quality of the elastic sPCA model by measuring the leave-one-out shape reconstruction error for each modality. This test is performed as follows. We first form a dataset of size n − 1 by leaving out the test tumor contour, say q i , and estimate the mean shapeq and PC basis U as previously described. Then, we compute two vectors in the tangent space T [q] (S): (1) v i = exp −1 q (q * i ) or the true tangent vector corresponding to the shape q * i (registered toq), and (2)ṽ i = n−2 j=1 v i , U j U j or the reconstruction of v i using the elastic PC basis. The leave-one-out reconstruction error for shape i is measured using E(i) = v i −ṽ i 2 . We repeat this for all tumor contours in the dataset, i.e., i = 1, . . . , 63. We repeat this procedure for the nonelastic framework and report all results of this simulation in Table 1. First, it is clear that the elastic sPCA model is better at reconstructing GBM tumor shapes than its nonelastic counterpart. Again, this suggests that elastic shape analysis is a more appropriate framework for studying GBM tumors. The elastic sPCA model estimated for the T1-weighted post-contrast tumor shapes is better at leave-one-out reconstruction (than the T2-weighted FLAIR elastic sPCA model) and thus provides a more faithful representation of the given tumor shapes. Nonetheless, for both modalities, the elastic sPCA models result in very low reconstruction errors. The maximum observed reconstruction error for the T1-weighted post-contrast tumor shapes is only 0.0446 or 1.81% of the maximum possible error ((π/2) 2 ); the median reconstruction error for the T2-weighted FLAIR tumor shapes is only 0.0955 or 3.87% of the maximum possible error.
The overall variability is also low in both cases for the elastic approach.

Radiologic imaging in Glioblastoma
The elastic framework for analyzing tumor shapes allows one to perform a variety of estimation and inferential statistical tasks. In particular, sPCA of tumors provides the possibility of devising methods based on principal coefficients, which can be profitably viewed as Euclidean features or summaries of the tumor shape for inclusion in regression models. Using a dataset of MRIs of GBM brain tumors, we applied clustering, two-sample testing, and survival modeling to illustrate the advantages associated with the elastic representation of tumor shapes and the related geometric framework in the context of assessing patient survival and association with genomic/clinical variables. Note that from here on, we perform statistical analysis via the elastic framework only. We begin with a detailed description of the GBM dataset.

Description of dataset
GBM, the most common malignant brain tumor found in adults, is a morphologically heterogeneous disease. Despite recent medical advancements, the prognosis for most patients with GBM is extremely poor. In the United States alone, 12 thousand new cases are being diagnosed every year 1 , among which less than 10% survive 5 years after diagnosis (Tutt, 2011 (Karnofsky and Burchenal, 1949). KPS indicates the ability of cancer patients to perform simple tasks (Crooks et al., 1991) and is widely used to assess quality of life during disease diagnosis and treatment.
The demographic variables corresponding to the clinical covariates in this dataset are presented in Section 1 of the Supplementary Material. Recent investigations have identified four different subtypes of GBM: classical, mesenchymal, neural and proneural, each of which is characterized by different molecular alterations (Verhaak et al., 2010). We also curated the information about these subtypes of GBM and some well-characterized  GBM driver genes (Frattini et al., 2013): DDIT3, EGFR, KIT, MDM4, PDGFRA, PIK3CA and PTEN. Biologically, a gene is known as a driver gene when there is a mutation along with DNA-level changes (amplifications or deletions). The full tumor volumes from T1weighted post-contrast and T2-weighted FLAIR MRIs were also recorded for each patient.
Pre-processing of images including details of segmentation, and a more detailed description of the clinical and genomic covariates can be found in Section 1 of the Supplementary Material.

Clustering of GBM tumor shapes
As a first unsupervised task, we performed hierarchical clustering of T1-weighted postcontrast and T2-weighted FLAIR tumor shapes using the elastic shape metric. We first calculated the pairwise distance matrix and then used complete linkage to separate the shapes into two clusters (motivated by short vs. long survival and supported by cluster visualization, see bottom panel of Figure 7) for each modality. To better visualize the variability in each cluster, we performed cluster-wise sPCA and plotted the three principal directions of variation in each cluster for the T1-weighted post-contrast and T2-weighted FLAIR modalities in Figure 7. We also report the cumulative variance in each cluster for both modalities in Table 2. In both cases, the variance in cluster 1 is much smaller than the variance in cluster 2. This can also be seen in the principal directions of variation; the shapes shown along cluster 1 directions (including the mean shape) are much smoother (or less 'wiggly') and circular.
We present a multidimensional scaling plot of the data in the bottom panel of Figure 7. This plot confirms that cluster 2 is much more variable than cluster 1. Furthermore, the separability of the clusters is very good for both modalities suggesting that the choice of two clusters is appropriate in this setting. In Table 3, we provide the mean and median survival times associated with the clusters, computed using tumor shape data in each modality. We see that the mean and median survival times are higher in cluster 1, which contains much lower cumulative variance. This suggests that cluster 1 is more homogeneous, which is associated with longer survival times. Cluster 2 is more heterogeneous and is associated with shorter survival times. This can also be attributed to the general morphological structure of tumors in the two clusters. The tumors in cluster 1 are often smoother and  Table 3: Summaries of cluster-wise survival for the T1-weighted post-contrast (T1) and T2-weighted FLAIR (T2) tumor data.
more spherical than those in cluster 2, which are more irregular. It is this irregularity and non-smoothness that is indicative of a more severe and infiltrative tumor with blurred margins and as a result, shorter survival times. Second, the mean difference in survival times between cluster 1 and cluster 2 computed using T1-weighted post-contrast tumor shapes is 6.8 months. This is a large number considering the median survival time in GBM is only approximately 12 months.

Cluster validation via enrichment
We use the concept of Bayesian cluster enrichment to study the association between the computed clusters, the tumor subtypes and other genomic covariates. In this approach, we want to compare the relative occurrence of a specific dichotomous covariate (with label 0 for no occurrence and 1 for occurrence) across the two clusters. To develop a Bayesian model for this purpose, let θ 1 ∈ [0, 1] denote the true proportion of 1s in cluster 1; let y 1 denote the observed number of 1s in cluster 1. Similarly, let θ 2 ∈ [0, 1] denote the true proportion of 0s in cluster 1 and y 2 denote the observed number of 0s in cluster 1.
Then, y 1 ∼ Binomial(n 1 , θ 1 ) and y 2 ∼ Binomial(n 2 , θ 2 ), where n 1 is the total number of 1s and n 2 is the total number of 0s. Consider a U nif orm(0, 1) (Beta(1, 1)) prior on the true proportions θ 1 and θ 2 . Since the Beta distribution is conjugate for the Binomial, the posterior distribution is of the same family as the prior; the resulting posterior distributions for θ 1 and θ 2 are given by π θ 1 (θ 1 |y 1 , n 1 ) ∼ Beta(y 1 + 1, n 1 − y 1 + 1) and π θ 2 (θ 2 |y 2 , n 2 ) ∼ Beta(y 2 +1, n 2 −y 2 +1). We generate a large number of samples from the two posteriors π θ 1 and π θ 2 and approximate the true probability P (θ 1 > θ 2 ) using a Monte Carlo method. We refer to this approximate quantity as the enrichment probability. The intuition behind this approach is as follows. If the computed clusters are not associated with the dichotomous covariate of interest, the resulting posteriors for θ 1 and θ 2 should be very similar. This in turn results in a Monte Carlo estimate of P (θ 1 > θ 2 ) close to 0.5, or no enrichment. On the other hand, when the two posteriors are drastically different, the Monte Carlo estimate of P (θ 1 > θ 2 ) would be either very close to 1 (if y 1 is much larger than y 2 ) or 0 (if y 1 is much smaller than y 2 ). These two scenarios constitute high enrichment of the covariate in one of the two computed clusters (a given covariate can be enriched in only one cluster at a time).
We present enrichment plots in Figure 8. Each plot shows the enrichment probabilities as a line plot with high and low cutoffs in the form of horizontal lines at 0.75 and 0.25. We note the following trends from the enrichment plots. The classical and mesenchymal tumor subtypes are enriched in cluster 1 for both modalities. The proneural tumor subtype is enriched in cluster 2 for the T2-weighted FLAIR modality. Interestingly, the mesenchymal subtype, a very aggressive form of GBM, was enriched in the cluster with higher survival.
However, upon closer examination, there was an equal number of mesenchymal and nonmesenchymal subtypes in cluster 1 for both modalities (the enrichment probability was mostly driven by the arrangement in cluster 2). Furthermore, the patients in cluster 1 with the mesenchymal subtype had lower survival than their nonmesenchymal counterparts (by approximately 1.5 months).
The enrichment plots for both imaging modalities display results consistent with some of the well-characterized genomic signatures in GBM. We note the following strong associations between tumor subtypes and driver gene mutations that have also been found in other studies (McNamara et al., 2013;Verhaak et al., 2010): 1. Proneural subtype and PDGFRA mutation (in T2-weighted FLAIR); 2. Classical and mesenchymal subtypes and EGFR mutation (in T2-weighted FLAIR).
EGFR mutation is a common molecular signature of GBM. It promotes proliferation of the tumor, which is associated with classical and mesenchymal subtypes (Fischer and Aldape, 2010). PDGFRA also plays an important role in cell proliferation and migration, and angiogenesis. Unlike EGFR, this gene was found to be mutated in high amounts in the proneural subtype of GBM tumors only (Verhaak et al., 2010).

Permutation test for difference in tumor shape means
The distance d S between two tumor shapes opens up the possibility of a distance-based nonparametric two-sample test for differences in mean tumor shapes. To ascertain the association between tumor shapes and survival times of GBM patients, we dichotomize the data based on four different survival cutoffs examined in the literature (Nebert, 2000;Affronti et al., 2009;Mazurowski et al., 2013): 12, 13, 14 and 15 months. Under the null hypothesis that the two groups have equal mean shapes, a permutation test analogous to the case of landmark-based shape analysis (Dryden and Mardia, 1998)   We only use the mean shape information in this hypothesis test, although we expect that the covariance information is also useful. We demonstrate how that can be achieved using a principal coefficient representation of tumor shapes in subsequent survival modeling.

Survival model adjusted for tumor shape
Next, we ascertain the utility of augmenting clinical and genetic information with imaging information when modeling survival probabilities of GBM patients. In particular, we investigate the association between the shape of a tumor and survival times (with censoring), in the presence of genetic and clinical covariates, using the geometry-based elastic shape methodology. Upon performing sPCA in the shape space S, each tumor contour shape is represented in the principal directions of the variation basis using its principal coefficients, which can be used as predictors in a survival model. Geodesic paths constructed using principal shooting vectors allow for the possibility of traversing the principal directions of shape variation and monitoring changes in the shape of a tumor. It is customary to choose a handful of principal directions that explain most of the shape variability; however, since S is infinite-dimensional, and it is unclear as to how one can interpret the directions in the context of tumor shapes, we propose to use all available directions so as to capture maximal information from the data. Indeed, it may very well be that a direction corresponding to a small (in magnitude) eigenvalue represents a physiologically important tumor shape deformation. In order to incorporate all information from the images, we perform separate sPCA on tumors obtained from both T1-weighted post-contrast and T2-weighted FLAIR MRIs, and collate the principal coefficients from each imaging modality. Employing all available shape principal coefficients translates to a large number of imaging-based shape predictors in a potential survival model necessitating dimension reduction through variable selection.
To assess whether incorporating imaging covariates, through principal tumor shape where A ⊂ B denotes that model A is nested within model B.
The clinical covariate of KPS contains a few missing values; therefore, we impute a value of 80 for those cases as advised by Gutman et al. (2013). A proportional hazards model (Cox, 1972), hereafter referred to as the Cox model, is used as the de facto model where |η| 1 is the L 1 norm of η. We use the R package glmnet by Friedman et al. (2011) for our implementation of model M 3 with leave-one-out cross-validation. The set I is consequently redefined to contain only the principal coefficients with non-zero regression coefficients obtained from this lasso regression.

Significant directions of shape variation and other results
Next, we focus on the results of fitting the three models. Using the lasso penalty for model M 3, we first identify the principal tumor shape coefficients with non-zero regression coefficients, owing to the lack of a general accepted way of testing for significance within the lasso framework (see recent work by Lockhart et al. (2014)). We uncover six principal Results from fitting the three Cox models are given in Table 5. Gutman et al. (2013) Table 5: Results from fitting Cox models M 1, M 2 and M 3. Predictors significant at the 5% level are tabulated, and the two concordance indices (C-index 1 = (Harrell et al., 1982), C-index 2 = (Gömen and Heller, 2005)) are reported.
from T1-weighted post-contrast and T2-weighted FLAIR images, we considered the shapes of tumor outlines rather than shapes and sizes. The size of the tumor was included in the model as a separate covariate through the tumor volume. It is known that tumors with EGFR mutations are larger than tumors with other mutations (Hatanpaa et al., 2010). In our analyses, EGFR and tumor volumes from both T1-weighted post-contrast and T2-weighted FLAIR images were not found to significantly correlate with survival time in the presence of tumor shape information. This finding is at odds with that of Gutman et al. (2013) where lesion size was used. It is known that older patients with GBM show high EGFR amplification. However, the variable EGFR informs us only if a mutation has occurred, not amplification. The age of a patient diagnosed with GBM is known to influence the survival time (Weller and Wick, 2011). Older age is typically used as a surrogate marker for change in the biology of GBM. The mean age in our dataset was 56.33 years; the variable Age appeared to have significant correlation with survival time in all three models, and the inclusion of tumor shape information did not alter that.
The discriminatory power of models M 1, M 2 and M 3 are compared using their concordance indices (C-indices), which are defined as the proportion of all pairs of patients whose predicted survival times are correctly ordered among all patients that can actually be ordered. For comparison purposes, we use the C-index proposed by Harrell et al. (1982Harrell et al. ( , 1984, and another version of it based on a U-statistic (Gömen and Heller, 2005). The C-indices (obtained through both methods) for the model M 3 are significantly higher than the C-indices for M 2 and M 1. This indicates a clear benefit in incorporating imaging predictors in the form of tumor shape principal coefficients into a survival model in order to obtain good discriminatory power. The Kaplan-Meier estimates of the survival functions for the three models, along with a detailed description, are provided in Section 3 of the Supplementary Material.
In summary, amongst the driver genes known to be significant in GBM studies, only DDIT3 appears to have a significant correlation with the survival time of a patient when adjusted for the effect of tumor shape. Mutation of the driver-gene DDIT3 appears to be associated with low survival probability (see Figure 5 in the Supplementary Material); it is known to indirectly regulate the glioma pathway through unregulated genes. Our analyses indicate that the shape of the tumor captures sufficient information about the individual relationships between each of the driver genes and survival time. A deeper study of the relationships between the shape of the tumor and driver genes is well worth exploring.

Discussion and future work
The use of shape analysis in medical imaging has been proposed before in other disease domains; we refer the reader to Chapter 17 of the edited volume by Tofts (2003) and references within, for a good review. The shape of specific anatomical structures in the brain has been successfully used in multiple sclerosis studies by Goldberg-Zimring et al. (1998), who, based the analysis of the shape on a few shape indices of the lesion. Landmark-based techniques using Procrustes averaging were used to study schizophrenia by DeQuardo et al. (1996). However, landmark and descriptor-based methods are not directly applicable to oncology due to multiple issues mentioned in this paper. In this work, we provide a comprehensive, Riemannian geometric solution to this problem that provides tools for various statistical analyses of tumor shapes. The benefits of this framework are clear: (1) it pro-vides an elastic metric to measure interpretable shape deformations, (2) it defines a formal mathematical and statistical framework, and (3) it provides tools for shape alignment, comparison, summarization, clustering, classification, hypothesis testing and other tasks.
We demonstrate these benefits through a detailed study of tumor shapes in the context of GBM. The proposed method can be readily extended to any cancer and/or other imaging modalities with similar data characteristics and scientific questions.
The focus of this article is on 2D tumor shapes obtained from the segmented tumor of a single axial slice of the brain with largest tumor area. The influence of the location and anisotropic nature of the white matter tracts on the shape of the tumor can be better assessed with 3D shape analysis, which is currently in progress. The geometric framework presented in this paper allows for the extension to 3D shapes (square-root normal fields (Jermyn et al., 2012)), but visualization of principal directions of variation becomes more difficult. Except for the work of Goldberg-Zimring et al. (2005) who used spherical harmonic functions to model the 3D shape of a tumor (akin to nonelastic analysis of tumor shapes), there is a lack of progress in this direction.
One way to view the proposed survival model is within the context offered by regression with functional predictors. The parametric closed curve representing a tumor shape predictor can be viewed as an element of the pre-shape space C, which is the Hilbert space S ∞ .
Nevertheless, S ∞ is a submanifold of L 2 (S 1 ) and not a vector space; current approaches with functional predictors using basis representations of the tumor shape or the coefficient function, or both, are hence inapplicable (see (Morris, 2015) for a detailed review).
The geometric framework used in this article enables us to perform PCA on the space of tumor shapes under a Riemannian metric. The physiological interpretation of the principal directions, however, is unclear and much work remains to be done in this direction.
Construction of a set of basis functions for the tangent space of a tumor shape that captures the biologically relevant deformations of the shape would be particularly useful; this requires significant input from clinicians in the form of prior shape information. In Figure  9, the deformations observed in the tumor shape as we move away from the mean tumor shape along the direction of decreased survival are striking; the shape appears to become more spiculated, which is consistent with the heuristic understanding of the seriousness of an irregularly shaped tumor. The visualization afforded within our framework, in our opinion, can profitably be used by neuroradiologists for initial non-invasive diagnoses.
Applying the survival model M 3 to the GBM dataset, we uncover several potentially interesting relationships between the shape of the tumor (expressed through the principal coefficients) and driver genes. This merits further consideration, and the implementation of our methods on other GBM datasets would offer more insight. Biological validation of the correlations between the two can significantly impact targeted personalized treatment strategies for GBM patients. Importantly, prognostic and predictive biomarkers of the transition time from a low-grade glioma to a malignant one can be determined.