Massively parallel-in-space-time, adaptive finite element framework for non-linear parabolic equations

We present an adaptive methodology for the solution of (linear and) non-linear time-dependent problems that is especially tailored for massively parallel computations. The basic concept is to solve for large blocks of space-time unknowns instead of marching sequentially in time. The methodology combines a computationally efficient implementation of a parallel-in-space-time finite element solver with a posteriori space-time error estimates and a parallel mesh generator. This methodology enables, in principle, simultaneous adaptivity in both the space and time (within the block) domains. We explore this basic concept in the context of a variety of time-steppers, including $\Theta$-schemes and Backward Difference Formulas. We specifically illustrate this framework with applications involving time-dependent linear, quasi-linear and semi-linear diffusion equations. We focus on investigating how the coupled space-time refinement indicators for this class of problems affect spatial adaptivity. Finally, we show good scaling behavior up to 150,000 processors on the Blue Waters machine. This is achieved by careful usage of memory via blocked storage and non-zero formats, along with lumped communication for matrix assembly. This methodology enables scaling on next-generation multi-core machines by simultaneously solving for a large number of time-steps, and reduces computational overhead by refining spatial blocks that can track localized features. It also opens up the possibility of efficiently incorporating adjoint equations for error estimators and inverse design problems, since blocks of space-time are simultaneously solved and stored in memory.


Introduction
We describe the methodology, implementation details and application examples of space-time block adaptive solutions to parabolic partial differential equations (evolution problems). This approach is primarily motivated by the necessity of designing computational methodologies that can scale to leverage the availability of very large computing clusters (with ∼ 10's to 100's of thousands of processors). For evolution problems, the standard approach of decomposing the spatial domain is a powerful paradigm of parallelization. However, for a fixed spatial discretization, the efficiency of purely spatial domain decomposition degrades substantially beyond a threshold (usually ∼ thousands of processors, beyond which efficiency is communication limited), which makes this approach unsuitable on larger machines. To overcome this barrier, a natural approach is to consider the time domain as an additional dimension and simultaneously solve for blocks of time, instead of the standard approach of sequential time-stepping. Early work on this approach was considered by Hughes and coworkers [15,16], Tezduyar and co-workers [24], and Reddy and co-workers [21], while variations on this theme have recently been explored by several groups [6,9,18,19,22,28]. Other examples include methods that exploit the fact that some coefficients during matrix assembly can be calculated concurrently [7] and methods that transform the equations so as to build one system of equations for more than one timestep [17].
The concept of solving for blocks of time simultaneously has recently gained a lot of attention to enable effective usage of exascale computing resources (see for instance the US DOE's Exascale Mathematics Working Group [14]). In addition to this obvious advantage, solving for space-time blocks also allows natural incorporation of a posteriori error estimates for mesh adaptivity, and enables the solution of inverse problems (involving adjoints). This has several additional tangible benefits in the context of computational overhead. For evolution problems that exhibit regionalized behavior in space and time (wave equations, moving interfaces like bubbles and shocks), solving in blocks of space-time that are locally refined to match the regional behavior provides substantial computational gain [9]. Similarly, the availability of error estimates across a block of time allows optimal choices of space and time adaptivity.
Motivated by these considerations, this paper presents a methodology for the solution of linear, quasi-linear and semi-linear (time-dependent) diffusion equations in three dimensions. We discuss the development of a parallel adaptive framework for the solution of large blocks of space-time. We detail the development of the block space-time framework for two classes of time-steppers (Θ-schemes and Backward Difference Formulas (BDF)). We discuss implementation details of the massively parallel adaptive block space-time framework and illustrate scaling behavior up to 150,000 processors. We subsequently define a posteriori space-time error indicators to identify spatial regions for mesh adaption. Finally, we demonstrate that for sufficiently large problems the block space-time approach is much more computationally efficient than the standard sequential time-stepping approach.
The outline of the rest of the paper is as follows: Sections 2 and 3 detail the block space-time framework for linear and non-linear evolution equations, respectively. Section 4 discusses the space-time error estimates for these classes of problems. In Section 5, we discuss implementation details and show scaling performance and analysis. Section 6 illustrates several numerical examples of the framework. We conclude in Section 7.

2.
Basic space-time formulation: linear and non-linear versions

2.1. Space-time framework for a linear problem. Given a bounded domain Ω ⊂ R³ and a finite time domain [0, T], consider the parabolic equation that solves for u : Ω × [0, T] → R:

    ∂u/∂t − ∇ · (κ∇u) = f  in Ω × (0, T],    (1)

where f : Ω × [0, T] → R is a smooth source function, and κ > 0. We consider (without loss of generality) that Dirichlet boundary conditions are imposed on the boundary Γ, unless otherwise specified. Considering a tessellation, T ≡ {Ω_1, ..., Ω_e, ...}, of the domain Ω into elements (with average size h), the weak form of this equation is given as:

    (∂u_h/∂t, v) + (κ∇u_h, ∇v) = (f, v)  ∀ v ∈ V_h,    (2)

where (·, ·) is the L² inner product on Ω and

    V_h = {v ∈ H¹(Ω) : v|_{Ω_e} ∈ P(Ω_e)  ∀ Ω_e ∈ T},    (3)

with P(Ω_e) being the space of the standard polynomial finite element shape functions on element Ω_e. To obtain a fully discretized form, we employ a time-stepping technique on the above semi-discrete equation. While any time-stepping method can be used, as an example, consider the Euler Backward formula that is defined on a discretization {0, t_1, t_2, ..., T} of the time domain:

    ((u_{n+1} − u_n)/∆t, v) + (κ∇u_{n+1}, ∇v) = (f_{n+1}, v),    (4)

where the subscript denotes evaluation at that discrete time, and ∆t = t_{n+1} − t_n is the timestep. Following standard FEM practice, with the tessellation of the domain resulting in k nodal values (to describe spatial variation) of the field u, equation (4) can be expressed in terms of matrix-vector products as:

    (M + ∆tK) u_{n+1} = M u_n + ∆t b_{n+1},    (5)

where M and K are the global mass and stiffness matrices, respectively, and b_{n+1} is the load vector at time t_{n+1}.
u_{n+1} and u_n are vectors containing the nodal values of the field u at time steps n + 1 and n, respectively. Equation (5) represents the system of equations solved to get the solution for timestep n + 1. The size of the vector u is equal to the number of nodal unknowns, k. Similarly, the matrices K and M are sparse matrices of size k × k. Consider a block-wise division of the total time domain. Each block, B_i, consists of multiple timesteps. This is schematically represented in Figure 1. Instead of sequentially solving for each time step (as in (5)), consider solving for the field variable in a complete time block, B, consisting of N timesteps simultaneously, i.e. solve for u_i, i = 1, ..., N simultaneously. This results in a block lower-bidiagonal matrix of size (N × k)-by-(N × k) given by:

    ⎡  I                                 ⎤ ⎡ u_0 ⎤   ⎡   IC   ⎤
    ⎢ −M   M + ∆tK                       ⎥ ⎢ u_1 ⎥   ⎢ ∆t b_1 ⎥
    ⎢      −M       M + ∆tK              ⎥ ⎢ u_2 ⎥ = ⎢ ∆t b_2 ⎥    (6)
    ⎢                ⋱       ⋱          ⎥ ⎢  ⋮  ⎥   ⎢   ⋮    ⎥
    ⎣                       −M   M + ∆tK ⎦ ⎣ u_N ⎦   ⎣ ∆t b_N ⎦

where I is an identity matrix (of size k), IC are the imposed initial conditions, and b_n is the load vector at time t_n. This system solves for N timesteps at once with a total number of unknowns N × k (plus the trivially imposed initial-condition block).
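To make the block construction concrete, the following sketch (our own illustrative code, not the paper's C++/PETSc implementation) assembles the Backward Euler block system for a 1D finite-difference analogue of the heat equation with zero forcing, and checks that one block solve reproduces sequential time-stepping:

```python
# Sketch: assemble the block space-time system for Backward Euler on a 1D
# heat equation u_t = kappa*u_xx (Dirichlet, zero forcing) and verify that
# solving all N steps at once matches sequential stepping.

def solve_dense(A, b):
    """Gaussian elimination with partial pivoting (stdlib-only helper)."""
    n = len(b)
    A = [row[:] for row in A]; b = b[:]
    for col in range(n):
        p = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[p] = A[p], A[col]; b[col], b[p] = b[p], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(A[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (b[r] - s) / A[r][r]
    return x

def heat_matrices(k, h, kappa):
    """Lumped mass M = I and stiffness K (tridiagonal -kappa d^2/dx^2)."""
    M = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    K = [[0.0] * k for _ in range(k)]
    for i in range(k):
        K[i][i] = 2.0 * kappa / h ** 2
        if i > 0: K[i][i - 1] = -kappa / h ** 2
        if i < k - 1: K[i][i + 1] = -kappa / h ** 2
    return M, K

def block_backward_euler(u0, N, dt, M, K):
    """One (N*k)x(N*k) solve: row n enforces (M + dt*K) u_{n+1} = M u_n."""
    k = len(u0)
    A = [[0.0] * (N * k) for _ in range(N * k)]
    b = [0.0] * (N * k)
    for n in range(N):
        for i in range(k):
            for j in range(k):
                A[n * k + i][n * k + j] = M[i][j] + dt * K[i][j]  # diagonal block
                if n > 0:
                    A[n * k + i][(n - 1) * k + j] = -M[i][j]      # sub-diagonal block
            if n == 0:
                b[i] = sum(M[i][j] * u0[j] for j in range(k))     # initial condition
    u = solve_dense(A, b)
    return [u[n * k:(n + 1) * k] for n in range(N)]

def sequential_backward_euler(u0, N, dt, M, K):
    k = len(u0)
    A = [[M[i][j] + dt * K[i][j] for j in range(k)] for i in range(k)]
    u, out = u0[:], []
    for _ in range(N):
        rhs = [sum(M[i][j] * u[j] for j in range(k)) for i in range(k)]
        u = solve_dense(A, rhs)
        out.append(u)
    return out
```

The initial-condition block of equation (6) is eliminated here by placing M u_0 directly on the right-hand side of the first block row; both choices yield the same solution.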
Remark 1. By treating the (unknown) nodal values at different time steps as multiple degrees of freedom associated with each spatial node, we can leverage standard algorithmic approaches (assembly, memory usage) tailored for multiple d.o.f. problems. Many approaches in uncertainty quantification (polynomial chaos representation, spectral stochastic methods) leverage such an approach of representing field variation along additional dimensions (stochastic dimensions) as simply additional degrees of freedom at each spatial location [20,27,11,23].

2.2. Space-time framework for non-linear problems. Extending the approach to certain non-linear problems is straightforward. Consider the case where κ is a function of the dependent variable, u. We assume that κ(u) satisfies appropriate smoothness and boundedness assumptions to ensure existence and uniqueness, 0 < κ_min ≤ κ(u) ≤ κ_max < ∞. In this case the weak form for a block B is

    ((u_{n+1} − u_n)/∆t, v) + (κ(u_{n+1})∇u_{n+1}, ∇v) = (f_{n+1}, v),  n = 0, ..., N − 1.    (7)

Such non-linear equations are usually solved via (quasi-)Newton schemes.
The methodology involves construction of the Jacobian and residual, which are used to compute updates. This is represented in matrix-vector terms as:

    J(u^i_{n+1}) δu = −F(u^i_{n+1}),  u^{i+1}_{n+1} = u^i_{n+1} + δu,    (8)

for i = 1, ... until convergence and for n = 0, 1, ..., where J(u^i_{n+1}) is the Jacobian (or linearized form) and F(u^i_{n+1}) is the residual of the above equation, both computed using u^i_{n+1}. More specifically, for the non-linear diffusion equations defined by (7), the residual, F(u^i_{n+1}), is given by

    F(u^i_{n+1}) = (M + ∆t K(u^i_{n+1})) u^i_{n+1} − M u_n − ∆t b_{n+1},    (9)

where K(u^i_{n+1}) denotes the solution-dependent stiffness matrix and b_{n+1} the load vector. The Jacobian is given as:

    J(u^i_{n+1}) = M + ∆t K(u^i_{n+1}) + ∆t (∂K/∂u)(u^i_{n+1}) u^i_{n+1}.    (10)

Instead of sequentially solving for each time step (as in (9)), consider solving for the field variable in a complete time block, B, consisting of N timesteps simultaneously. That is, the block Jacobian has the same lower-bidiagonal structure as the linear block system, with diagonal blocks J(u^i_{n+1}) and sub-diagonal blocks −M. One can alternatively ignore the off-diagonal entries of the block Jacobian to construct an approximate block-diagonal Jacobian. The propagation of time information is then limited to the residual on the right-hand side. We tested both approaches; the latter takes marginally more iterations to converge, while providing substantial ease of implementation. Unless otherwise stated, all our results are based on the latter approach.
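The two Jacobian options discussed above (the full block Jacobian versus the approximate diagonal one) can be illustrated on a scalar stand-in problem. The equation, the choice κ(u) = 1 + u², and all function names below are our own illustrative assumptions, not the paper's formulation:

```python
# Sketch (scalar stand-in, not the paper's FEM system): Newton for a block of
# N Backward Euler steps of u' = -kappa(u)*u with kappa(u) = 1 + u^2.
# Residual of step n (0-based):  F_n = u_n - u_{n-1} + dt*kappa(u_n)*u_n.
# The full block Jacobian is lower bidiagonal (sub-diagonal entries -1);
# the "diagonal" variant drops that coupling, so time information propagates
# only through the residual, as described above.

def kappa(u):
    return 1.0 + u * u

def residual(u, u0, dt):
    prev = [u0] + u[:-1]
    return [u[n] - prev[n] + dt * kappa(u[n]) * u[n] for n in range(len(u))]

def newton_block(u0, N, dt, diagonal_only, tol=1e-12, max_it=100):
    u = [u0] * N                      # initial guess: constant over the block
    for it in range(1, max_it + 1):
        F = residual(u, u0, dt)
        if max(abs(f) for f in F) < tol:
            return u, it
        # dF_n/du_n = 1 + dt*(1 + 3*u_n^2)   (since kappa(u)*u = u + u^3)
        diag = [1.0 + dt * (1.0 + 3.0 * u[n] ** 2) for n in range(N)]
        du = [0.0] * N
        for n in range(N):
            rhs = -F[n]
            if not diagonal_only and n > 0:
                rhs += du[n - 1]      # sub-diagonal entry of J is -1
            du[n] = rhs / diag[n]
        for n in range(N):
            u[n] += du[n]
    return u, max_it
```

Both variants converge to the same block solution; in this toy setting the diagonal variant needs a few extra sweeps, mirroring the behavior reported above.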

Space-time formulation: Higher Order Time schemes
We next look at extending the space-time strategy to incorporate two families of higher-order time-steppers: the Θ-scheme and the Backward Difference Formula. We consider linear and nonlinear diffusion, and moreover, we also consider the treatment of the Allen-Cahn equation, which is a parabolic PDE with a lower-order nonlinearity, whose solution has evolving layers and for which adaptivity is particularly useful.
3.1. Θ-scheme: Linear equation. The semi-discrete form of the Θ-scheme (a generalization of the Euler Backward scheme) is as follows:

    ((u_{n+1} − u_n)/∆t, v) + (κ∇(Θu_{n+1} + (1 − Θ)u_n), ∇v) = (Θf_{n+1} + (1 − Θ)f_n, v).    (12)

The fully discrete matrix-vector representation is given by

    (M + Θ∆tK) u_{n+1} = (M − (1 − Θ)∆tK) u_n + ∆t (Θb_{n+1} + (1 − Θ)b_n),    (13)

where b_n is the load vector at time t_n. Again, it is straightforward to group and simultaneously solve for N time steps together. The corresponding matrix form (14) retains the lower-bidiagonal block structure: the diagonal blocks are M + Θ∆tK and the sub-diagonal blocks are −(M − (1 − Θ)∆tK), with the identity/initial-condition block as before.
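A minimal sketch of the Θ-scheme update on the scalar model problem u' = −λu (an illustrative stand-in for the full FEM system, with our own function names): Θ = 1 recovers Euler Backward (first order) and Θ = 1/2 gives the second-order Crank-Nicolson scheme.

```python
# Sketch: theta-scheme update for u' = -lam*u, i.e.
#   (u_{n+1} - u_n)/dt + lam*(theta*u_{n+1} + (1 - theta)*u_n) = 0.
# theta = 1 is Backward Euler; theta = 1/2 is Crank-Nicolson.
import math

def theta_step(u, dt, lam, theta):
    return u * (1.0 - (1.0 - theta) * lam * dt) / (1.0 + theta * lam * dt)

def integrate(u0, T, n_steps, lam, theta):
    dt, u = T / n_steps, u0
    for _ in range(n_steps):
        u = theta_step(u, dt, lam, theta)
    return u
```

Comparing against the exact decay exp(−λT) shows the expected first-order error for Θ = 1 (error roughly halves when the step is halved) and a much smaller error for Θ = 1/2.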
3.2. Θ-scheme: Nonlinear diffusion with variable coefficient. The corresponding weak form for this case is given as:

    ((u_{n+1} − u_n)/∆t, v) + Θ(κ(u_{n+1})∇u_{n+1}, ∇v) + (1 − Θ)(κ(u_n)∇u_n, ∇v) = (Θf_{n+1} + (1 − Θ)f_n, v).    (15)

The Jacobian is

    J(u^i_{n+1}) = M + Θ∆t [K(u^i_{n+1}) + (∂K/∂u)(u^i_{n+1}) u^i_{n+1}],    (16)

while the residual is given by

    F(u^i_{n+1}) = (M + Θ∆t K(u^i_{n+1})) u^i_{n+1} − (M − (1 − Θ)∆t K(u_n)) u_n − ∆t (Θb_{n+1} + (1 − Θ)b_n),    (17)

where b_n denotes the load vector at time t_n. Again, it is straightforward to group and simultaneously solve for N time steps together. The corresponding (diagonal) block Jacobian (see Remark 2) is block-diagonal with blocks J(u^i_{n+1}), n = 0, ..., N − 1 (18). As before, the upper index denotes the Newton-Raphson iteration.
3.3. Θ-scheme: Allen-Cahn equation. The Allen-Cahn equation is a semi-linear diffusion equation with a non-linear reaction term:

    ∂u/∂t − κ∇²u + f(u) = 0,    (19)

where we set f(u) = u(u² − 1). The initial condition is u(x, 0) = u_0, along with zero-flux conditions on the boundaries. The corresponding semi-discrete form is given as:

    (∂u_h/∂t, v) + (κ∇u_h, ∇v) + (f(u_h), v) = 0.    (20)

Using the Θ-scheme results in the fully discrete form:

    ((u_{n+1} − u_n)/∆t, v) + Θ[(κ∇u_{n+1}, ∇v) + (f(u_{n+1}), v)] + (1 − Θ)[(κ∇u_n, ∇v) + (f(u_n), v)] = 0.    (21)

Again, it is straightforward to group and simultaneously solve for N time steps together. The corresponding (diagonal) block Jacobian has blocks

    J(u^i_{n+1}) = M + Θ∆t [K + M_{f'}(u^i_{n+1})],    (22)

where M_{f'} is the mass-weighted reaction Jacobian with f'(u) = 3u² − 1. Alternative schemes for the Allen-Cahn equation and other phase-field models are described in, e.g., [13].
3.4. Backward difference formula based time-steppers. BDF-based time-steppers of order s utilize the solution at s previous time steps to construct the solution at the next time step. A general order-s BDF scheme is given as:

    (1/∆t) Σ_{j=0}^{s} α_j u_{n+1−j} = β g_{n+1},    (23)

where the left-hand side is the BDF representation of the time derivative ∂u/∂t in terms of the solution u_i at time point i, and the right-hand side collects all other terms. Here, α and β are known BDF coefficients [8]. A first-order (s = 1) BDF scheme is identical to the Euler Backward scheme described earlier. The simplest multi-step scheme is for s = 2, and for the linear diffusion equation it is given as:

    M (3u_{n+1} − 4u_n + u_{n−1})/(2∆t) + K u_{n+1} = b_{n+1}.    (24)

Again, it is straightforward to group and simultaneously solve for N time steps together. The corresponding space-time block equations (25) now contain two sub-diagonal mass blocks per row, since each step couples to the two previous steps. It is clear that higher-order multistep methods produce block matrices that have a larger bandwidth (see Figure 3). Nonlinear problems can be treated similarly.

Space-time error estimates

A central idea of this work is to develop a block space-time methodology that can be integrated with mesh adaptivity. This will enable targeted refinement of regions that exhibit variations in the corresponding block of time. Mesh adaptivity requires the definition of an indicator function that determines which regions of space require refinement/coarsening. In this work, we build on prior work and utilize standard residual-based error indicators; see, e.g., [4,26,25]. Alternative duality-based indicators for nonlinear parabolic problems are described in, e.g., [9,10,12].
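The bandwidth growth for multistep schemes can be sketched with a scalar (k = 1) coupling pattern; the `history` parameter below plays the role of s, and the first steps fall back to shorter formulas, mirroring the Euler Backward start-up noted in the caption of Figure 3 (the code is our own illustration, not the paper's assembly routine):

```python
# Sketch: sparsity pattern of the scalar (k = 1) space-time block matrix.
# Each extra history term of an order-s BDF scheme adds one sub-diagonal,
# so higher-order multistep schemes widen the block bandwidth.

def block_pattern(N, s):
    """Rows = time steps; step n couples to steps n-1 .. n-min(n, s)."""
    A = [[0] * N for _ in range(N)]
    for n in range(N):
        A[n][n] = 1
        for back in range(1, min(n, s) + 1):
            A[n][n - back] = 1
    return A

def bandwidth(A):
    return max(abs(i - j) for i, row in enumerate(A)
               for j, v in enumerate(row) if v)
```

With k > 1 unknowns per step, each 1 in the pattern becomes a k × k block, so the global bandwidth scales with s × k.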
Residual-based error indicators η_e are constructed for each spatial element, Ω_e, and consist of two terms, an interior residual, r_int, and a jump residual, r_jump [26]:

    η_e² = h_e² ‖r_int‖²_{Ω_e} + h_e ‖r_jump‖²_{∂Ω_e},    (26)

where h_e is the size of element Ω_e. For example, for the linear and quasi-linear diffusion equations, the detailed derivation of the interior and jump residuals is available in the work of Verfürth [26]. We refer the interested reader to that work and only show the key results here. Basically, the two terms are constructed (as the names suggest) from the definition of the residual,

    R(u_h)(v) = (f, v) − (∂u_h/∂t, v) − (κ∇u_h, ∇v).    (27)

This residual is decomposed into element-wise terms as:

    R(u_h)(v) = Σ_{Ω_e} [ ∫_{Ω_e} (f − ∂u_h/∂t + ∇ · (κ∇u_h)) v dx − ½ ∫_{∂Ω_e \ Γ_h} [κ∇u_h · n] v ds ],    (28)

where Γ_h is that part of the boundary with Dirichlet conditions imposed; thus r_int = f − ∂u_h/∂t + ∇ · (κ∇u_h) and r_jump = [κ∇u_h · n]. The interior and jump residuals for the Allen-Cahn equation are similarly defined, with r_int = −∂u_h/∂t + κ∇²u_h − f(u_h) and the same flux jump (31). Recall that we are using error indicators defined over a block of time: we extend the concept of an error indicator defined at one time step to an error indicator defined over a block of time steps. There are several choices of block indicator, two of them being the average value of the indicator over the block,

    η_e^avg = (1/N) Σ_{n=1}^{N} η_{e,n},    (33)

and its maximum value over the block,

    η_e^max = max_{n=1,...,N} η_{e,n}.    (34)

The former approach is advantageous for slow variations within the block and avoids frequent migration of elements between processors, which can negatively affect scalability. The latter approach is advantageous when there are rapid changes across a few timesteps; in this case the error indicator is not 'diffused' by timesteps where the solution changes slowly.
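The two block-aggregation choices can be sketched as follows; the marking strategy and the `frac` parameter are our own illustrative assumptions, not the paper's refinement criterion:

```python
# Sketch: aggregating per-element indicators eta[e][n] over a time block,
# then marking the worst elements for refinement.

def block_indicator_avg(eta_e):    # smooth evolution; fewer re-partitions
    return sum(eta_e) / len(eta_e)

def block_indicator_max(eta_e):    # rapid change in a few steps
    return max(eta_e)

def mark_for_refinement(eta, aggregate, frac=0.3):
    """Mark the top `frac` fraction of elements by aggregated indicator."""
    scores = sorted(((aggregate(e), i) for i, e in enumerate(eta)),
                    reverse=True)
    n_mark = max(1, int(frac * len(eta)))
    return sorted(i for _, i in scores[:n_mark])
```

An element with a large spike in only one timestep ranks highest under the max indicator but can be out-ranked under the average, which is exactly the 'diffusion' effect discussed above.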
Remark 3. In this work, we approximate the semi-discrete term ∂u/∂t in terms of its finite difference representation (Θ or BDF scheme). An implicit assumption is that the time steps are small enough that this approximation is valid. Ideally, one would choose a consistent representation in both space and time, i.e. using a finite element representation for time variations [21,15]. This provides several advantages in terms of mathematical elegance. We defer this development to a subsequent paper.

Implementation details
We utilize our in-house scalable, parallel FEM framework that is optimized for distributed memory computing. The FEM software library is implemented in C++ and uses object-oriented software principles. Linear algebra and parallel matrix and vector storage are all performed by the PETSc library [5]. PETSc modules (KSP, SNES) are used to solve (non)linear equations. The FEM library is dynamically linked to the Parallel Hierarchical Grid (PHG) library [2], which is a parallel mesh refinement framework. PHG uses a bisection-type algorithm [29], specifically newest vertex bisection, to refine/coarsen elements. PHG operates on simplex elements and produces conforming meshes after refinement. While our existing implementation was reasonably optimized, several software engineering principles had to be carefully implemented to ensure efficient execution of the block space-time problems. Some of the simpler changes that had substantial impact for the space-time formulation, but had minimal impact on the existing, standard iterative formulation, are: (a) moving calculations out of loops whenever possible, (b) executing calculations and storing results in arrays before loops, (c) conversion to 64-bit integer variables for storing matrix indices, which allows going beyond 4 billion unknowns, and (d) using local values instead of values obtained through pointer or reference.
We made software engineering decisions to ensure that the space-time implementation is compatible with our existing sequential FEM frameworks (see Remark 1). Basically, the space-time formulation is treated as the solution of a steady-state problem. Initial conditions are imposed as a boundary condition on the first degree of freedom. We discuss key aspects of the implementation next.

5.1. Memory interlacing and matrix bandwidth. We rearrange the vector of unknowns u to enumerate time points before looping over space, i.e. u = {u¹_1, u¹_2, ..., u¹_N, u²_1, u²_2, ..., u²_N, ..., u^k_N}, where the subscript refers to time and the superscript denotes space. This allows for more efficient assembly, because all data (coefficients in the system of equations) for a specific element share memory locality, thus preventing cache misses. Moreover, this approach is compatible with existing FEM frameworks: problems with multiple d.o.f.s per node are already supported, so we can use the standard procedures for system assembly and for imposing boundary conditions. This also has the advantage of reducing the matrix bandwidth. Table 1 enumerates the bandwidth for the space-time formulation (2D mesh with 100 × 100 quad elements, linear basis functions) using a Euler Backward formulation. As expected, the bandwidth increases linearly with the number of timesteps.
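The interlaced ordering can be sketched as an index map (function names are our own, purely illustrative):

```python
# Sketch: interlaced unknown ordering with k spatial nodes and N time points
# per block. All N time values of node 1 come first, then node 2, etc., so
# the time history of each node is contiguous in memory.

def interlaced_index(j, n, N):
    """Flat 0-based position of u_n^j (node j, time n; both 1-based)."""
    return (j - 1) * N + (n - 1)

def node_slice(j, N):
    """Contiguous [start, stop) range holding all time values of node j."""
    return (j - 1) * N, j * N
```

During assembly, an element touching nodes {j1, j2, ...} then reads/writes a few contiguous runs of length N rather than N scattered entries, which is the cache-locality benefit described above.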

5.2. Matrix access and storage.
We explored the use of block-structured matrix formats that are optimized for storing the global matrix, where the block size is a function of the number of timesteps. We use blocks of sizes ranging from 2 × 2 to N × N (where N is the number of timesteps in a block, B). Figure 4 plots the time required to assemble and solve a linear diffusion problem using the space-time formulation (Backward Euler). We can see that different non-zero block sizes can give observable savings in computational time. Note also that each of these blocks is itself quite sparse (see Figure 6). Careful identification of the non-zero pattern could potentially result in much larger savings. This is illustrated in the 'no-zero' column in Figure 4, where we avoid inserting zero elements into the global matrix. The memory requirement for this case is also substantially reduced (see Figure 5).

5.3. Compressed element matrix. Figure 6 shows the non-zero pattern within an elemental stiffness matrix. The size of this matrix is (nbf × N) × (nbf × N), where nbf is the number of basis functions (assuming 1 d.o.f.) and N is the number of time steps in the block, B. Clearly, as N increases, the sparsity of the elemental stiffness matrix improves. This suggests using compressed formats for storing the elemental stiffness matrices. This issue becomes more important with increasing block size.
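Under the assumption that a one-step scheme couples only adjacent time levels within the elemental matrix (consistent with the pattern sketched for Figure 2), its nonzero count and density can be estimated as follows (our own back-of-envelope sketch, not the paper's storage code):

```python
# Sketch: density of the (nbf*N) x (nbf*N) elemental space-time matrix.
# A scheme with `history` previous steps couples time level n to levels
# n-1 .. n-history, giving nbf^2 * (N + sum_{s=1..history}(N-s)) nonzeros;
# for a one-step scheme the density behaves like ~2/N for large N.

def element_nnz(nbf, N, history=1):
    diag = nbf * nbf * N
    off = sum(nbf * nbf * (N - s) for s in range(1, history + 1))
    return diag + off

def element_density(nbf, N, history=1):
    size = nbf * N
    return element_nnz(nbf, N, history) / (size * size)
```

This is why the elemental matrix becomes sparser as N grows, motivating the compressed element-matrix storage described above.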
5.4. Scaling. We performed scaling studies on two machines. Preliminary scaling was performed on TACC Stampede [3]. Stampede consists of 6,400 compute nodes, each equipped with two Intel E5-2680 8-core processors and 32 GB of memory. Each compute node has access to 250 GB of local storage and a Lustre filesystem with a write performance of 150 GB/s. The nodes are connected with an FDR InfiniBand network. Scaling results are shown in Figure 7. We also performed scaling on NCSA Blue Waters [1]. Blue Waters consists of 22,640 nodes, each with two AMD 6276 "Interlagos" processors, for a total of 362,240 computing cores. Each node has 64 GB of memory, which gives a total system memory of 1.476 PB. Compute nodes have access to a 26.4 PB Lustre filesystem that offers a total bandwidth of more than 1 TB/s. Figures 8, 9 and 10 show scaling for the linear, nonlinear and Allen-Cahn equations, respectively. We show good scaling up to 131,072 processors. The largest problem size used was ∼5.2 billion degrees of freedom; this is an 800³ mesh with 10 timesteps per block. The results illustrate reasonably good scaling for the space-time approach, and Figure 7 suggests that for larger numbers of processors the space-time approach performs better than the iterative approach. Specifically, beyond a threshold number of processors, the total time to solve a given problem is lower for the space-time approach compared to the iterative approach. We anticipate that for more complex problems (complex geometries, remeshing, I/O) the space-time approach may perform even better.

Numerical Examples
We illustrate our adaptive parallel-in-space-time framework on the linear and nonlinear diffusion equations and the Allen-Cahn equation.

6.1. Problem A - Linear Diffusion. We first consider the linear case and set κ = 1. We use the method of manufactured solutions to construct the forcing term in (1) to ensure an analytical solution, u, given by equation (35), with a = 0.2, b = 0.5, ω = 2π, d = 0.1. Thus, u is a rotating exponential hill centered at the midplane, rotating with a radius of 0.2 at an angular speed of 2π. The corresponding forcing term follows by substituting (35) into (1). The equation is solved in a region of dimensions [0, 1] × [0, 1] × [0, 1]. Every boundary face of the region has an essential boundary condition with the prescribed value of u computed from (35). We use a time step of 0.01 and solve for a block of 100 timesteps.
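The manufactured-solutions workflow can be sketched in 1D with our own toy solution u = sin(πx)e^{−t} (the paper's 3D rotating-hill solution (35) is not reproduced here): pick an exact u, derive f = u_t − κu_xx by hand, and verify that the PDE residual of the exact pair (u, f) vanishes up to the discretization error of the check.

```python
# Sketch: method of manufactured solutions in 1D (illustrative, not the
# paper's rotating-hill example).
import math

KAPPA = 1.0

def u_exact(x, t):
    return math.sin(math.pi * x) * math.exp(-t)

def f_manufactured(x, t):
    # f = u_t - kappa*u_xx = (kappa*pi^2 - 1) * sin(pi*x) * e^{-t}
    return (KAPPA * math.pi ** 2 - 1.0) * math.sin(math.pi * x) * math.exp(-t)

def pde_residual(x, t, h=1e-4):
    """Central-difference check of u_t - kappa*u_xx - f at a point."""
    u_t = (u_exact(x, t + h) - u_exact(x, t - h)) / (2.0 * h)
    u_xx = (u_exact(x + h, t) - 2.0 * u_exact(x, t)
            + u_exact(x - h, t)) / h ** 2
    return u_t - KAPPA * u_xx - f_manufactured(x, t)
```

With the exact solution known everywhere, discretization errors of the actual solver can then be measured directly, which is how the convergence plots in this section are produced.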
Figure 11 shows a cutaway of the refined spatial mesh after 20 refinement iterations. We remind the reader that each iteration consists of solving the space-time problem, constructing the time-averaged elemental refinement indicators, equation (33), and then refining the 3D mesh according to the indicators. Given that this is a moving source problem, it is clearly seen that there is refinement in the regions the source has passed through.
We compare spatial convergence rates for three implementations: (a) sequential time stepping with no spatial adaptivity, (b) the space-time implementation with no spatial adaptivity, and (c) the space-time implementation with spatial adaptivity. We plot convergence in Figure 12, where the error is ‖u − u_h‖₂. The first two implementations (obviously) overlap, with the adaptive mesh implementation showing a reduced error. All three curves show a slope of 2, as expected. Figure 13 shows time-step convergence, with the expected slopes of 1 and 2 for the Euler Backward and second-order Backward Difference formulas, respectively.

6.2. Problem B - Nonlinear Diffusion. For the nonlinear case, we set the coefficient κ(u) = 1 + 10u², and we choose an exact solution such that |u| ≤ 1, thus bounding κ. As before, we choose our analytical solution to be given by (35); the forcing term, f, follows accordingly. Figure 14 plots several time snapshots of the moving non-linear source problem, with the solution accurately tracking the moving source. In Figure 15, we plot spatial convergence for three implementations: (a) sequential time stepping with no spatial adaptivity, (b) the space-time implementation with no spatial adaptivity, and (c) the space-time implementation with spatial adaptivity (plotted against the average element size). Convergence rates follow along expected lines, with a slope of 2.
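The reported slopes can be checked from any two points on a log-log convergence curve using the standard observed-order formula (shown here as a small sketch):

```python
# Sketch: observed convergence order from two (h, error) pairs, as read off
# the log-log plots in Figures 12-15:  order = log(e1/e2) / log(h1/h2).
import math

def observed_order(h1, e1, h2, e2):
    return math.log(e1 / e2) / math.log(h1 / h2)
```

For example, an error that drops by a factor of 4 when h is halved gives an observed order of 2, matching the spatial slopes reported above.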
6.3. Problem C - Allen-Cahn. In this final example we solve the modified Allen-Cahn problem. The key physics (i.e. phase change) described by this non-linear equation essentially occurs in a highly localized region of the domain (on a surface of co-dimension 1). Adaptive refinement (and coarsening) is a very effective approach to accurately resolve this localized region. The equation is solved with zero-flux conditions on all boundaries. The initial condition represents an initial solid of radius r = 0.5 that is melting. Using symmetry arguments, we consider a single octant of the space ([0, 1] × [0, 1] × [0, 1]). We use a time step of 0.02 and a block size of 50 time-steps. Figure 16 shows snapshots of the melting sphere at various time points. Figure 17 illustrates adaptive mesh refinement across the time block. The figure shows the refined mesh for the first block of time (t = 0 to t = 1). Note that refinement is localized along the solidification front within the time block. Capturing this thin interface with a uniform mesh would have required 589,824 elements, instead of the 137,776 elements used here.

Conclusion
We present the formulation, implementation details and representative examples of a parallel-in-space-time based adaptive methodology for the solution of (linear and) non-linear time-dependent problems. The basic concept is to solve for large blocks of space-time unknowns instead of marching sequentially in time. The methodology combines a computationally efficient implementation of a parallel-in-space-time finite element solver with a posteriori space-time error estimates and a parallel mesh generator. We illustrate how this implementation is especially tailored for massively parallel computations, and show good scaling behavior up to 150,000 processors on the Blue Waters machine. This methodology enables scaling on next-generation multi-core machines by simultaneously solving for a large number of timesteps, and reduces computational overhead by refining spatial blocks that can track localized features. It also opens up the possibility of efficiently incorporating adjoint equations for error estimators and inverse design problems, since blocks of space-time are simultaneously solved and stored in memory. Our future work is focused on extending the space-time framework to utilize finite element basis functions in time (which enables formal derivation of space-time a posteriori error estimates), and subsequently implementing 4D finite elements to enable simultaneous space and time adaptivity.

Figure 1. Indexing of timesteps per block (B). The number of timesteps per block is equal to N.

Figure 2 (left) illustrates an elemental matrix considering N (= 10) time steps (i.e. 10 degrees of freedom per node) for a 2D quadrilateral element. Note the sparse structure of the elemental matrix as well as the resulting global matrix (for a 2 × 2 quad mesh) in Figure 2 (right).

Figure 2. Left: nonzero entries in an elemental matrix. Right: nonzero entries in the resultant global matrix for a 2 × 2 mesh of quadrilateral elements. Dark squares represent nonzero entries and numbers are node indices. The time block consists of 10 time steps.

Figure 3. View of the nonzero entries of an elemental matrix (2D quad with linear basis functions) for a block of space-time consisting of 2 timesteps + initial condition (left) and 9 timesteps + initial condition (right) with the second-order Backward Difference formula. The first timestep is calculated using the Euler Backward scheme.

Figure 4. Computational time for different block sizes in the block-based AIJ matrix format. The investigated example is the 2D linear diffusion problem using 100 × 100 quad elements with one block of 100 timesteps. The numbers in the legend describe the block size; 'aij' is the sparse matrix format without blocks, and 'no zeroes' is the special case where zero entries of the element matrix are not inserted.

Figure 5. Memory requirements for different block sizes in the block-based AIJ matrix format and with the sparse element matrix, for the elliptic problem on a grid of 100 × 100 elements with one block of 100 timesteps.
Figure 6. Nonzero pattern within an elemental stiffness matrix (see Section 5.3).

Figure 7. Speed-up on Stampede (top) and time requirements for the space-time and iterative formulations on the same machine (bottom).

Figure 10. Speed-up on the Blue Waters supercomputer for the Allen-Cahn problem.

Figure 12. Grid convergence for the linear heat equation. The error is ‖u − u_h‖₂ from the last time-step, versus average element size h.

Figure 13. Time convergence for the linear heat equation. The error is ‖u − u_h‖₂ from the last time-step, versus the size of the time-step.

Figure 14. Results of computations where the upper half of the computational domain is removed. Values of u change in time from 0 s (a) to 1 s (d).

Figure 15. Grid convergence for the nonlinear heat equation. The error is ‖u − u_h‖₂ from the last time-step, versus average element size h.

Table 1. Bandwidth size for the space-time formulation.
Figure 8. Speed-up on Blue Waters for the linear time-dependent diffusion problem.