Neural Control of Discrete Weak Formulations: Galerkin, Least-Squares & Minimal-Residual Methods with Quasi-Optimal Weights

There is tremendous potential in using neural networks to optimize numerical methods. In this paper, we introduce and analyse a framework for the neural optimization of discrete weak formulations, suitable for finite element methods. The main idea of the framework is to include a neural-network function acting as a control variable in the weak form. Finding the neural control that (quasi-) minimizes a suitable cost (or loss) functional then yields a numerical approximation with desirable attributes. In particular, the framework allows in a natural way the incorporation of known data of the exact solution, or the incorporation of stabilization mechanisms (e.g., to remove spurious oscillations). The main result of our analysis pertains to the well-posedness and convergence of the associated constrained-optimization problem. In particular, we prove, under certain conditions, that the discrete weak forms are stable, and that quasi-minimizing neural controls exist, which converge quasi-optimally. We specialize the analysis results to Galerkin, least-squares and minimal-residual formulations, where the neural-network dependence appears in the form of suitable weights. Elementary numerical experiments support our findings and demonstrate the potential of the framework.


Introduction
In recent years there has been tremendous interest in the merging of neural networks and machine-learning algorithms with traditional methods in scientific computing and computational science [24,17,27,39]. In this paper we demonstrate how neural networks can be utilized to optimize finite element methods.
In one of its most familiar mathematical forms, the finite element method is a discretization technique for partial differential equations (PDEs) based on a weak formulation using discrete subspaces, i.e., the exact solution u ∈ U is approximated by u_h ∈ U_h, which is the unique solution of the discrete problem:

  find u_h ∈ U_h such that b(u_h, v_h) = f(v_h) for all v_h ∈ V_h,   (1)

where U_h is a discrete subspace of the infinite-dimensional Hilbert or Banach space U (typically a Sobolev space on a domain Ω ⊂ R^d), V_h is a subspace of a Hilbert or Banach space V with dim V_h = dim U_h, b : U × V → R is a continuous bilinear form, f : V → R a continuous linear form, and the exact solution u satisfies b(u, v) = f(v) for all v ∈ V.¹ It is well known that the accuracy of u_h can be improved by enlarging U_h (e.g., by refining the underlying finite element mesh).² However, for a fixed value of h, the particular u_h defined by (1) may be very unsatisfactory. In fact, there is no reason why a certain quantity of interest of u_h is accurate at all,³ or why the approximation inherits certain qualitative features of the exact solution.⁴ Indeed, the discrete problem (1) is a rigid statement in the sense that it identifies a single element in U_h, irrespective of desired attributes, whereas there could be many other elements in U_h that are far superior.
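For concreteness, the following minimal sketch (ours, not taken from the paper) assembles and solves a discrete problem of the form (1) for a hypothetical 1-D Poisson model problem, −u″ = f on (0,1) with u(0) = u(1) = 0, using continuous piecewise-linear elements; the function names and the midpoint quadrature are illustrative assumptions.

```python
import numpy as np

def solve_galerkin_1d(N, f):
    """Galerkin FEM for -u'' = f on (0,1), u(0)=u(1)=0,
    with continuous piecewise-linear elements on a uniform mesh of N cells."""
    h = 1.0 / N
    nodes = np.linspace(0.0, 1.0, N + 1)
    # Stiffness matrix b(u_h, v_h) = int u_h' v_h' dx (interior nodes only).
    K = (np.diag(2.0 / h * np.ones(N - 1))
         + np.diag(-1.0 / h * np.ones(N - 2), 1)
         + np.diag(-1.0 / h * np.ones(N - 2), -1))
    # Load vector f(v_h) = int f v_h dx via the midpoint rule on each element.
    F = np.zeros(N - 1)
    for e in range(N):
        xm = 0.5 * (nodes[e] + nodes[e + 1])
        fe = f(xm) * h
        if e >= 1:
            F[e - 1] += 0.5 * fe      # contribution to the left hat function
        if e <= N - 2:
            F[e] += 0.5 * fe          # contribution to the right hat function
    u_interior = np.linalg.solve(K, F)
    return nodes, np.concatenate(([0.0], u_interior, [0.0]))

nodes, u_h = solve_galerkin_1d(16, lambda x: np.pi**2 * np.sin(np.pi * x))
```

Once the mesh (hence U_h) is fixed, this linear solve returns one specific u_h; the framework below is about steering which element of U_h is selected.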

Neural optimization of discrete weak forms
The objective of this work is to propose and analyse a framework for the neural optimization of discrete weak formulations to significantly improve quantitative and qualitative attributes of discrete approximations. In particular, we consider Galerkin, least-squares, and minimal-residual formulations.
The main idea of the framework is that it incorporates a neural-network function ξ as a control variable in the discrete test space V_h(ξ). That is, the approximation u_h = u_{h,ξ} now depends on ξ and solves the discrete problem:

  find u_{h,ξ} ∈ U_h such that b(u_{h,ξ}, v_h) = f(v_h) for all v_h ∈ V_h(ξ).   (2)

Then, in order to obtain a desired approximation u_{h,ξ}, we aim to find a neural-network function ξ that quasi-minimizes a desired cost (or loss) functional:⁵

  J(u_{h,ξ}) → quasi-min.   (3)

¹ When U_h = V_h, this is a Galerkin method; otherwise it is a Petrov–Galerkin method.
² Indeed, a priori error analysis reveals that ‖u − u_h‖_U ≤ C inf_{w_h ∈ U_h} ‖u − w_h‖_U, provided b(•, •) satisfies a discrete inf-sup condition on U_h × V_h; see, e.g., [38,19].
³ E.g., the value u_h(x_0) for some point x_0 ∈ Ω is generally quite distinct from u(x_0).
⁴ E.g., u_h may exhibit spurious oscillations, while u is monotone.
⁵ We also allow for the inclusion of a regularization term in the cost functional; see Section 2.1.
The notion of quasi-minimization is critical when aiming to minimize over a set of neural-network functions (i.e., the set of functions implemented by neural networks of a fixed architecture); see Section 2.2 for further details (in particular, Definitions 2.1 and 2.2). The quasi-minimization problem (3) is essentially a nonstandard PDE-constrained optimization, with the nonstandard part being the dependence of the state problem (2) on ξ via the discrete test space V_h(ξ). Importantly, V_h(ξ) will be parameterized by ξ in such a way as to ensure stability of the discrete problem (2). Moreover, as will become clear in the following sections, the basis functions in V_h(ξ) need not be computed explicitly, but equivalent formulations to (2) can be used, which instead incorporate ξ by means of suitable weight functions. These formulations essentially lead to a PDE-constrained optimization with a nonlinear control-to-state map; a schematic of the resulting optimization loop is sketched below.
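The following minimal sketch (ours, not the paper's implementation) illustrates the structure of such an optimization: a small ReLU network implements the control ξ, an externally supplied routine solves the weighted discrete state problem (2)/(4), and a generic gradient loop quasi-minimizes the reduced cost. The names solve_state and loss, the finite-difference gradient, and the one-dimensional setting are placeholder assumptions.

```python
import numpy as np

def relu_net(p, x, m=8):
    """One-hidden-layer ReLU network implementing the control xi; p packs (W1, b1, W2)."""
    W1, b1, W2 = p[:m], p[m:2*m], p[2*m:3*m]
    return np.maximum(np.outer(np.atleast_1d(x), W1) + b1, 0.0) @ W2

def reduced_cost(p, solve_state, loss, alpha=0.0):
    """j(p): solve the weighted discrete state problem for xi = relu_net(p, .)
    and evaluate the cost, plus an optional Tikhonov-type regularization."""
    xi = lambda x: relu_net(p, x)
    u_h = solve_state(xi)                     # state problem (2)/(4), user-supplied
    xs = np.linspace(0.0, 1.0, 201)
    reg = np.trapz(relu_net(p, xs)**2, xs)    # approximates ||xi||_{L2(0,1)}^2
    return loss(u_h) + 0.5 * alpha * reg

def quasi_minimize(j, p0, lr=1e-2, steps=500, eps=1e-6):
    """Plain gradient descent with finite-difference gradients
    (a stand-in for the automatic-differentiation optimizers used in practice)."""
    p = p0.astype(float).copy()
    for _ in range(steps):
        j0 = j(p)
        g = np.array([(j(p + eps * e) - j0) / eps for e in np.eye(p.size)])
        p -= lr * g
    return p
```

In the weighted formulations of Section 3, solve_state amounts to assembling and solving a linear system whose matrices depend on the weight ω(ξ), so the control-to-state map is nonlinear even though the state problem itself is linear.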

Potential of the methodology
There are two main benefits of having neural control of discrete weak forms:
• Incorporation of data: Knowledge of quantities of the exact solution can be taken into account in a natural way by setting, for example,
  J(w) := ½ |q(w) − q̄|²,
where q : U → R is a functional measuring the quantity of interest and q̄ ∈ R is known data. Minimizing such a J(•) ensures that the discrete solution u_h to (2) is data-driven in the sense that u_h becomes constrained by the data. We note that multiple quantities can be taken into account using, for example,
  J(w) := ½ Σ_i |q_i(w) − q̄_i|²,
or, more generally, using some operator Q : U → Z; see Section 2.
• Incorporation of stabilization mechanisms: Qualitative attributes of the discrete solution can be enhanced by minimizing a suitably-chosen J(•). In this way discrete solutions can be enforced to, e.g., satisfy an a priori known maximum principle, have monotone (or spurious-oscillation-free) behavior around discontinuities and layers, or have a certain discrete wave number (i.e., free from pollution). In the past decades, many different stabilized finite element methods have been proposed (and analyzed) that impose such attributes [21,10,26,20,15,40]. Within our framework such a method is naturally obtained after (quasi-) minimization (i.e., method (2) with ξ equal to the quasi-minimizing control). As an example, Guermond [21] advocates the L¹-minimization of the residual; in other words, within our framework one would choose J(u_{h,ξ}) := ‖f − B u_{h,ξ}‖_{L¹(Ω)}, where f − B u_{h,ξ} is the strong form of the residual.
The idea of using neural networks to parameterize the test space was initially proposed in our earlier work [8], where it was restricted to minimal-residual formulations within a parametric PDE setting. The current work presents significantly more general settings and formulations as well as analyses of their well-posedness and convergence.
While the above shows examples of J(•) corresponding to unsupervised learning (i.e., there is no need to know the exact solution u), when the original problem is parametric itself (e.g., a parametric PDE), supervised learning becomes meaningful. Indeed, in that case, the data may be the exact solution u_{λ_i} for certain parameters λ_i, i = 1, …, N_data. This then allows for the training of finite element discretizations with superior accuracy in quantities of interest even on very coarse meshes. We refer to our earlier work [8] for the methodology and illustrative examples in that case.

Main contributions: Well-posedness, convergent quasi-minimizers, weighted conforming formulations
Let us briefly outline the main contributions of this work. The first main contribution is the analysis of an abstract constrained-optimization problem associated with (3); see Section 2. In particular, we consider an abstract state problem equivalent to (2), but in the form of a mixed system with a ξ-dependent bilinear form. We prove, under suitable conditions, that the state problem is well-posed (uniformly with respect to ξ); see Proposition 2.9. Furthermore, we present differentiability conditions (on the ξ-dependence) that allow us to prove the existence of quasi-minimizers (within sets of neural-network functions, of some size n) to the associated constrained optimization (3), which converge quasi-optimally (as n → ∞); see Corollary 2.12 for details. We note that our analysis is based on a fundamental result for the quasi-minimization of strongly-convex and differentiable functionals (see Theorem 2.A), which is of independent interest and applies, e.g., to the analysis of deep Ritz methods [54,42,37] and PINN methods [48,35,11].
The second main contribution of this work is the application of our framework to certain weak formulations used by conforming finite element methods; see Section 3. In these applications, the neural-network control variable ξ will appear by means of suitable weights in the bilinear forms. In particular, we will analyse weighted least-squares, weighted Galerkin, and weighted minimal-residual formulations.
For weighted least-squares and weighted minimal-residual formulations, suitable conditions on the weights imply (via the abstract result of the first main contribution) stability of the discrete problem (uniformly in ξ). Furthermore, suitable differentiability conditions on the weights imply existence of (quasi-optimally) convergent quasi-minimizers of the associated constrained minimization.
On the other hand, for weighted Galerkin, it turns out that stability is not immediate, and may require constraints on ξ depending on the problem at hand. Therefore, neural control is far more convenient for least-squares and minimal-residual formulations, the fundamental reason being the inherent stability that comes with their underlying minimization principle.
We support our findings with numerical experiments in Section 4. While our theoretical results directly apply to any linear operator, we choose the advection-reaction PDE to illustrate various numerical aspects, viz., the incorporation of data (Section 4.1), the quasi-optimal convergence of quasi-minimizers (Section 4.2), and the incorporation of L¹-type stabilization (Section 4.3).

Related work
There are a number of works related to ours.
Optimizing numerical methods: Traditionally, the incorporation of known data or other desired attributes in numerical PDE approximations is achieved via the method of Lagrange multipliers, see e.g., Evans, Hughes & Sangalli [20], Kergrene, Prudhomme, Chamoin & Laforest [28], and references therein. More recently, neural networks have been proposed to learn the parameters that define a numerical method; see Ray & Hesthaven [45], Mishra [33] and others [2,16,53,47]. Interestingly, a recent learning methodology for adaptive mesh refinement has been proposed that ensures optimal convergence; see Bohn & Feischl [6]. Within the context of optimizing finite-element formulations, a minimal-residual framework that ensures stability was proposed in our previous work [8]. Our current work contributes to these developments by providing the analysis of a general framework for neural optimization of finite element methods.
Neural networks for PDEs: The use of neural networks for approximating directly the solution to PDEs has received wide-spread interest since the works by E & Yu [18], Sirignano & Spiliopoulos [49], Berg & Nyström [3] and Raissi, Perdikaris & Karniadakis [43], amongst others. Recently, there have been a number of ideas that propose an adaptive construction of neural-network approximations; see Ainsworth & Dong [1], Liu, Cai & Chen [31] and Uriarte, Pardo & Omella [52]. Neural networks can also be used to obtain the coefficients of the basis expansion used by a standard (linear) approximation [23,29].
Neural networks for inverse PDEs: In the context of inverse problems involving PDEs, the use of neural networks to represent unknown PDE coefficients (fields) and constitutive models has been explored by, e.g., Teichert, Natarajan, Van der Ven & Garikipati [50], Berg & Nyström [4] and Xu & Darve [55]. These works are similar to the current work in the sense that standard (finite element) methods are used to solve the PDE, while a neural network is embedded within the discrete formulation. We note that the analysis provided by our current work can be extended to those inverse problems.
Error analysis for neural-network approximations: There are a number of works containing a priori error analysis for neural-network based PDE approximations. For those related to the deep Ritz method, see Xu [54, Section 5], Pousin [42, Section 3], and Müller & Zeinhofer [37]. For those related to physics-informed neural networks (PINN) and least-squares methods, see Sirignano & Spiliopoulos [49, Section 7], Mishra & Molinaro [35,34], Pousin [42, Section 4] and Cai, Chen & Liu [11]. Recently, a posteriori error analysis has also been studied, in particular goal-oriented analysis using the dual-weighted residual (DWR) methodology; see, e.g., Roth, Schröder and Wick [46], Minakowski & Richter [32] and Chakraborty, Wick, Zhuang & Rabczuk [12]. We note that in our current work, while we have in mind the error analysis for neural-control approximations, the abstract analysis presented in Section 2 is essentially an extension of the above-mentioned a priori analysis to a certain class of problems involving a convex and differentiable cost functional.

Abstract framework
In this section we present the analysis of the abstract state equation (in the form of a mixed system) and the associated optimization problem. We essentially follow the classical theory of optimal control (PDE-constrained optimization) by Lions [30]; see also [25,51,7]. Our resulting optimization problem bears similarity to that of parameter identification of PDE coefficients; see Rannacher & Vexler [44] and references therein for its error analysis. While we present our abstract framework within Hilbert spaces (and using a quadratic cost), we note that extensions to Banach spaces are feasible, but not within the scope of the current work.

Discrete state problem and associated cost functional
Let X be a Hilbert space for the control variable, U and V be Hilbert spaces for trial and test functions, respectively, U_h ⊂ U be a discrete (finite element) subspace, and V̄ ⊆ V. In all that follows, we think of h (hence U_h) as being fixed. Given ξ ∈ X and f ∈ V̄* (the dual of V̄), we consider the discrete state problem given by:
  find (r, u_h) ∈ V̄ × U_h such that
    a(ξ; r, v) + b(u_h, v) = f(v)   for all v ∈ V̄,   (4a)
    b(w_h, r) = 0                   for all w_h ∈ U_h,   (4b)
where b(•, •) is a continuous bilinear form on U × V, i.e., b(•, •) ∈ L(U × V; R), and, for each ξ ∈ X, a(ξ; •, •) is a continuous bilinear form on V̄ × V̄, i.e., a(ξ; •, •) ∈ L(V̄ × V̄; R). To explicitly indicate the dependence of r and u_h on ξ, we use the notation: (r_ξ, u_{h,ξ}) = solution of (4a)-(4b) for a given ξ.
In Section 2.4, we demonstrate that (4a)-(4b) is equivalent to (2) for a particular choice of V_h(ξ); see Proposition 2.10. The discrete problem in (4a)-(4b) is essentially a general formulation, which for a specific choice of a(•; •, •) and V̄ reduces to a (weighted) Galerkin, least-squares or minimal-residual method; see Section 3.
Next, let Z be a Hilbert space, and let Q : U → Z be a linear continuous (observation) operator. Then, given an observation z_o ∈ Z and regularization parameter α ≥ 0, we consider the cost (or loss) functional J : U_h × X → R defined by:
  J(w_h, ξ) := ½ ‖Q w_h − z_o‖²_Z + (α/2) ‖ξ‖²_X.   (5)
The associated reduced cost functional j : X → R is then given by:
  j(ξ) := J(u_{h,ξ}, ξ) = j_1(ξ) + α j_2(ξ),   (7)
where j_1 : X → R is defined by j_1(ξ) := ½ ‖Q u_{h,ξ} − z_o‖²_Z and j_2(ξ) := ½ ‖ξ‖²_X. While ideally we would like to minimize j(•) over (the infinite-dimensional) X, we proceed by considering neural-network approximations.

Neural quasi-minimization
To accommodate neural optimization, we consider the subset M_n ⊂ X consisting of all functions implemented by neural networks of a fixed architecture parameterized by n. We shall simply refer to M_n as a set of neural-network functions, and we think of n as a measure of the size of the architecture (e.g., the total number of neurons, or total number of parameters). When aiming to minimize j(•), a significant complication is that the set M_n may not be closed (topologically) in X. Hence, even though j(•) may have an infimum on M_n, there may not be a minimizer in M_n. Therefore, one should not aim to completely minimize j(•), but instead use a relaxed notion of quasi-minimization as used by Shin, Zhang & Karniadakis [48] (for which the existence of an infimum implies the existence of a quasi-minimizer):

Definition 2.1 (Quasi-minimizers and quasi-minimizing sequences) Let j : X → R be a cost functional.
(i) Let δ_n > 0 and M_n ⊂ X be a subset of X (not necessarily closed in X). A function ξ_n ∈ M_n is said to be a quasi-minimizer of j(•) if the following holds true:
  j(ξ_n) ≤ inf_{η_n ∈ M_n} j(η_n) + δ_n.
(ii) Consider a sequence of subsets (M_n)_{n∈N} of X, with N being a strictly-increasing sequence of natural numbers. A sequence (ξ_n)_n, with ξ_n ∈ M_n, is said to be a quasi-minimizing sequence if the above condition holds true for all n ∈ N with δ_n > 0 such that δ_n → 0 as n → ∞.
In summary, the neural optimization problem that we consider is the following:

Definition 2.2 (The quasi-minimizing control problem)
The following statements are equivalent.
Reduced quasi-minimizing control problem: For j(•) given by (7), we aim to quasi-minimize j(•), i.e., given δ_n > 0, find a quasi-minimizer ξ_n ∈ M_n of j(•) in the sense of Definition 2.1.
Constrained quasi-minimizing control problem: For J(•, •) given by (5), we aim to quasi-minimize J(u_h, ξ) over (u_h, ξ) ∈ U_h × M_n subject to the state problem (4).

Example 2.3 (Need for quasi-minimizers) Let us discuss a simple example illustrating the non-existence of minimizers, hence the need for quasi-minimizers. Let χ_{[z,1]} denote the characteristic function of the subset [z, 1]. Consider a cost functional j(•) whose (unique) minimizer over X solves a first-order PDE (constant advection in the direction of the x_2-axis) with discontinuous data given by χ_{[z,1]}, which is a well-posed problem [5].
Let M_n be the set of two-layer neural-network functions Ω ⊂ R² → R using two neurons and ReLU activation in the hidden layer. An infimizing sequence (ξ_m)_m of j(•) in M_n can be constructed whose two ReLU neurons form an increasingly steep ramp approximating the discontinuity of χ_{[z,1]}; since every function in M_n is continuous, the infimum of j(•) is not attained in M_n. On the other hand, quasi-minimizers ξ_n do exist in M_n; in particular, ξ_m as constructed above is a quasi-minimizer for m large enough.
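A minimal numerical illustration (ours, a hypothetical 1-D simplification of this example with z = 0.3) shows the mechanism: a two-neuron ReLU ramp ξ_m approaches the characteristic function in L² as m → ∞, so the infimum 0 of the distance is approached but never attained by any fixed network in the family.

```python
import numpy as np

z = 0.3

def xi_m(x, m):
    # Two-neuron ReLU ramp: 0 for x < z, rises linearly to 1 over a width 1/m.
    return m * np.maximum(x - z, 0.0) - m * np.maximum(x - (z + 1.0 / m), 0.0)

def chi(x):
    return (x >= z).astype(float)            # characteristic function of [z, 1]

x = np.linspace(0.0, 1.0, 200001)
for m in [10, 100, 1000, 10000]:
    err = np.sqrt(np.trapz((xi_m(x, m) - chi(x))**2, x))
    print(m, err)   # decreases like (3m)^(-1/2): the infimum 0 is never attained
```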

Analysis of reduced control problem
We first proceed with the analysis of the reduced control problem (9). Let the state operators R_h : X → V̄ and S_h : X → U_h be defined by:
  R_h(ξ) := r_{h,ξ}  and  S_h(ξ) := u_{h,ξ},   (11)
where r_{h,ξ} and u_{h,ξ} are the first and second component, respectively, of the solution to the mixed system (4). Then the reduced cost j(•) given in (7) can be written as follows:
  j(ξ) = ½ ‖Q S_h(ξ) − z_o‖²_Z + (α/2) ‖ξ‖²_X.
Our main result depends on the following fundamental theorem, which is of independent interest:

Theorem 2.A (Differentiable, strongly-convex quasi-minimization) Let j : X → R be a cost functional. Assume that j(•) is Gâteaux differentiable with derivative j′ : X → X* being Lipschitz continuous, i.e., there is a constant L > 0 such that
  ‖j′(ξ_1) − j′(ξ_2)‖_{X*} ≤ L ‖ξ_1 − ξ_2‖_X   for all ξ_1, ξ_2 ∈ X.
Furthermore, assume that j(•) is strongly convex, i.e., there is a constant γ > 0 such that
  ⟨j′(ξ_1) − j′(ξ_2), ξ_1 − ξ_2⟩ ≥ γ ‖ξ_1 − ξ_2‖²_X   for all ξ_1, ξ_2 ∈ X.
Then the following hold true:
(i) j(•) has a unique minimizer ξ̄ ∈ X, which satisfies j′(ξ̄) = 0.
(ii) For any subset M_n ⊂ X, j(•) has a quasi-minimizer ξ_n ∈ M_n that satisfies (8).
(iii) Any quasi-minimizer ξ_n in M_n satisfies the following quasi-optimal error estimate:
  ‖ξ̄ − ξ_n‖²_X ≤ (L/γ) inf_{η_n ∈ M_n} ‖ξ̄ − η_n‖²_X + (2/γ) δ_n.   (14)
Proof See Appendix A.1.
We now analyse when our j(•) satisfies the assumptions of Theorem 2.A.
Theorem 2.B (Properties of the reduced cost) Assume that the state operator S_h(•) is Gâteaux differentiable, that S_h(•) and S_h′(•) are uniformly bounded on X, and that S_h′(•) is Lipschitz continuous. Then: (i) j_1, j_2, j : X → R are Gâteaux differentiable with j_1′, j_2′, j′ : X → X* Lipschitz continuous.
Proof The results of Theorem 2.B are the assumptions of Theorem 2.A.
Remark 2.5 (Quasi-optimal rates) The first part on the right-hand side of the quasi-optimality result (14) can be estimated in terms of n using results from neural-network approximation theory; see, e.g., Yarotsky [56], Gühring, Kutyniok and Petersen [22], and references therein. Such a result may be useful in finding a proper balance of δ_n as n → ∞. Alternatively, the choice of δ_n may be found through a proper a posteriori estimator, which seems to be an open problem.
Remark 2.6 (Condition on α) The proof of Theorem 2.B reveals that the condition that α is sufficiently large may be weakened if j_1 has additional structure (e.g., convexity). Indeed, convexity of j_1 guarantees that j will be strongly convex, with strong-convexity constant equal to α > 0. If that is the case, there is no need for Lipschitz continuity of j_1 in order to prove statement (iii) of Theorem 2.B; α > 0 alone is enough. Furthermore, statement (v) of Theorem 2.B then holds in a correspondingly improved form.

Remark 2.7 (Physics-informed neural networks (PINN)) Theorem 2.A can be applied to PINN [43] (for neural-network approximations to PDEs). Indeed, consider j(ξ) := ½ ‖f − Bξ‖²_L, where f − Bξ is an abstract residual in some abstract Hilbert space L (which may include the PDE residual, initial condition and boundary conditions, as in [35], as well as a data residual, as in [34]). If B : X → L is a linear operator, then the assumptions of Theorem 2.A (Lipschitz continuity and strong convexity) hold true.
Remark 2.8 (Deep Ritz method) Theorem 2.A can also be applied to the deep Ritz method [18]. Indeed, consider j(ξ) := ½ b(ξ, ξ) − f(ξ), where b ∈ L(X × X; R) is a coercive bilinear form and f ∈ X*. For such a j(•), the assumptions of Theorem 2.A (Lipschitz continuity and strong convexity) hold true.
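As a purely illustrative aside (ours, not from the paper), the following sketch evaluates a PINN-type loss of the kind in Remark 2.7 for the hypothetical 1-D operator Bξ = ξ′, with ξ a one-hidden-layer ReLU network; the finite-difference derivative and the trapezoidal approximation of the L²-norm are assumptions made only for the sketch.

```python
import numpy as np

def relu_net(p, x, m=8):
    """One-hidden-layer ReLU network; p packs (W1, b1, W2)."""
    W1, b1, W2 = p[:m], p[m:2*m], p[2*m:3*m]
    return np.maximum(np.outer(np.atleast_1d(x), W1) + b1, 0.0) @ W2

def pinn_loss(p, f, xs, h=1e-4):
    """j(xi) = 1/2 ||f - B xi||^2 with B xi = xi' (finite-difference derivative),
    the L2-norm approximated by the trapezoidal rule on the points xs."""
    dxi = (relu_net(p, xs + h) - relu_net(p, xs - h)) / (2.0 * h)
    res = f(xs) - dxi
    return 0.5 * np.trapz(res**2, xs)

# Example call (random initial parameters):
# j0 = pinn_loss(np.random.randn(24), lambda x: np.pi*np.sin(np.pi*x), np.linspace(0, 1, 101))
```

A gradient-based optimizer applied to pinn_loss over a fixed architecture then produces a quasi-minimizer in the sense of Definition 2.1.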

Analysis of constrained control problem
We now proceed with the analysis of the constrained control problem (10). We begin by providing conditions that guarantee the well-posedness of the state problem.
Proposition 2.9 (Stability of the state problem) Let a(ξ; •, •) and b(•, •) be as above. Then, the following statements hold true: (i) For each ξ ∈ X, problem (4) is well-posed (for any f ∈ V̄*) if and only if there exist constants α_h ≡ α_h(ξ) > 0 and β_h > 0 such that the inf-sup conditions (15a)-(15b) hold: an inf-sup condition for a(ξ; •, •) on the kernel of the adjoint of b, with constant α_h(ξ), and an inf-sup condition for b(•, •) on U_h × V̄, with constant β_h. (ii) If (15) is satisfied, then an a priori bound holds true for the solution u_h ∈ U_h of problem (4). (iii) If, additionally, a(ξ; •, •) is an inner product on V̄ whose induced norm satisfies the equivalence (16) with lower constant C_{1,ξ}, then α_h = (C_{1,ξ})² in (15a), and additionally, the improved a priori bound (17) holds true.
Proof See Appendix A.3.
To establish the equivalence between the mixed system (4) and the Petrov–Galerkin statement (2), let us define the operators A : X → L(V̄, V̄*) and B ≡ B_h ∈ L(U_h; V̄*) by:
  ⟨A(ξ)v, w⟩ := a(ξ; v, w)   (18a)   and   ⟨B w_h, v⟩ := b(w_h, v)   (18b)
for all v, w ∈ V̄ and w_h ∈ U_h. Note that the state equations (4a)-(4b) can then be written as follows:
  A(ξ) r + B u_h = f  in V̄*,   and   B* r = 0  in U_h*.   (19)

Proposition 2.10 (Equivalent Petrov–Galerkin problem) Assume the conditions of Proposition 2.9, including the well-posedness condition (15b). Instead of (15a), assume the stronger hypothesis (20): a full inf-sup condition for a(ξ; •, •) on V̄ × V̄, instead of just on the kernel. Let the test space V_h(ξ) be given by:
  V_h(ξ) := A(ξ)^{-*} B(U_h) = { A(ξ)^{-*} B w_h : w_h ∈ U_h }.   (21)
Then the state problem (4) is equivalent to the Petrov–Galerkin problem (2) with V_h(ξ) given by (21).
Proof See Appendix A.4.
Finally, we now present (differentiability) conditions on ξ → A(ξ) that guarantee the (differentiability) requirements on ξ → S_h(ξ) in Theorem 2.B and Corollary 2.4. Once in place, existence of (quasi-) minimizers and quasi-optimal convergence follow immediately for the constrained control problem.
To anticipate the connection between the derivatives A′ and S_h′ (as well as R_h′),²⁰ note that a formal differentiation of (19) (with r = R_h(ξ) and u_h = S_h(ξ)) with respect to ξ in the direction η ∈ X yields:
  [A′(ξ)η] R_h(ξ) + A(ξ) [R_h′(ξ)η] + B [S_h′(ξ)η] = 0,   and   B* [R_h′(ξ)η] = 0.
One may therefore expect that suitable conditions on A(•) will imply desired conditions on S_h(•) (and R_h(•)):

Proposition 2.11 (Differentiability of the state operators) Let R_h(•) and S_h(•) be the state operators as defined in (11), and let A(•) be as defined in (18a).

²⁰ Recall that the Gâteaux derivative of, e.g., A at ξ ∈ X in the direction η ∈ X is given by A′(ξ)η := lim_{t→0} [A(ξ + tη) − A(ξ)]/t.
Assume the conditions of Proposition 2.9, including the well-posedness conditions (15). Then, the following statements hold true: if A(•) and α_h^{-1}(•) are uniformly bounded on X, then R_h(•) and S_h(•) are also uniformly bounded on X; moreover, if A(•) is Gâteaux differentiable with A′(•) uniformly bounded and Lipschitz continuous on X, then R_h(•) and S_h(•) are Gâteaux differentiable with uniformly bounded, Lipschitz-continuous derivatives.
Proof See Appendix A.5.
Corollary 2.12 combines these results: under the assumptions of Propositions 2.9 and 2.11, and for α sufficiently large, the constrained control problem (10) has a quasi-minimizer in M_n that converges quasi-optimally to the unique minimizer in X.
Proof The results of Propositions 2.9 and 2.11, together with α sufficiently large, are the assumptions of Theorem 2.B, whose results are the assumptions of Theorem 2.A.

Conforming weak formulations with suitable control
In this section, we study various weighted versions of conforming weak formulations, viz., least-squares, Galerkin and minimal-residual formulations. The aim is to propose suitable ξ-dependent weighting within the weak forms, in order to be able to prove the assumptions of Propositions 2.9 and 2.11. By Corollary 2.12, we can then conclude that the corresponding constrained neural-control problem has the desired properties (existence of quasi-minimizers and quasi-optimal convergence).
In what follows, we often consider a positive weight function ω. We shall use the notation ω̃ := 1/ω to indicate the (multiplicative) inverse of ω.

Weighted least-squares formulations
Let d ∈ N and Ω ⊂ R^d be an open bounded domain. Let B : H_B → L²(Ω) be a linear differential operator in strong form, where U = H_B denotes the graph space H_B := {w ∈ L²(Ω) : Bw ∈ L²(Ω)}. We further assume that H_B is a Hilbert space when endowed with the inner product (v, w)_{H_B} := (v, w)_{L²(Ω)} + (Bv, Bw)_{L²(Ω)}, and that B is boundedly invertible from H_B onto L²(Ω). Then, given a weight function ω : L²(Ω) → L∞(Ω), a control ξ ∈ X = L²(Ω), and a conforming discrete finite element space U_h ⊂ H_B, we aim to find u_h ≡ S_h(ξ) ∈ U_h, which is the solution of the weighted least-squares problem:
  u_h := argmin over w_h ∈ U_h of ½ ∫_Ω ω(ξ) (f − B w_h)² dx.   (22)
The optimality condition of such a minimizer is:
  (ω(ξ)(f − B u_h), B w_h)_{L²(Ω)} = 0   for all w_h ∈ U_h.
In particular, notice that we can directly identify the test space in (2) as V_h(ξ) = ω(ξ) B(U_h). To establish the connection with the general mixed system (4), we set r = ω(ξ)(f − B u_h), so that (22) is equivalent to:
  (ω̃(ξ) r, v)_{L²(Ω)} + (B u_h, v)_{L²(Ω)} = (f, v)_{L²(Ω)}   for all v ∈ L²(Ω),
  (B w_h, r)_{L²(Ω)} = 0   for all w_h ∈ U_h.   (23)
Thus, in this case the bilinear forms a(ξ; •, •) and b(•, •) of (4) are given by
  a(ξ; r, v) := (ω̃(ξ) r, v)_{L²(Ω)}   and   b(w, v) := (B w, v)_{L²(Ω)},   (24)
with V̄ = V = L²(Ω).

Proposition 3.1 (Weighted least squares) Let ω̃ : L²(Ω) → L∞(Ω) be a differentiable map, such that for some positive constants ω̃_min, ω̃_max, ω̃_∞ and L_ω̃, the application ω̃(•) satisfies uniform lower and upper bounds ω̃_min and ω̃_max, its derivative ω̃′(•) is uniformly bounded by ω̃_∞, and ω̃′(•) is Lipschitz continuous with constant L_ω̃. Then, the following statements hold true: (i) The bilinear forms in (24) satisfy the inf-sup conditions (15), and thus the mixed problem (23) is well-posed.
(ii) The state operator S_h(•) (= u_h) of the mixed problem (23) is uniformly bounded on X = L²(Ω) and differentiable.
(iii) The derivative S_h′(•) is uniformly bounded on X = L²(Ω) and Lipschitz continuous.
Proof See Appendix A.6.

Remark 3.2 (Neural control of weighted least squares) Proposition 3.1 guarantees that the conditions of Propositions 2.9 and 2.11 are satisfied, hence Corollary 2.12 applies to the neural optimization of the above weighted least-squares formulation.

Weighted Galerkin formulations
Consider a Hilbert space U = V on Ω ⊂ R^d and a bilinear form b ∈ L(V × V; R) satisfying (for some constant β > 0) the standard inf-sup and non-degeneracy conditions (25). Given f ∈ V*, the well-known Babuška–Brezzi theory (see, e.g., [19]) ensures the existence of a unique solution u ∈ V of
  b(u, v) = f(v)   for all v ∈ V.   (26)
Now, given a weight function ω : L²(Ω) → W_+ (the space W_+ will be clarified later), a control ξ ∈ X = L²(Ω), and a conforming discrete subspace U_h ⊂ V, we consider the following weighted-Galerkin discretization of problem (26):
  find u_h ∈ U_h such that b(u_h, ω(ξ) v_h) = f(ω(ξ) v_h)   for all v_h ∈ U_h.   (27)
Notice that one can directly identify the test space in (2) as the set of functions v_h = ω(ξ) w_h for some w_h ∈ U_h. We will show next that problem (27) also admits an equivalent mixed formulation of the type (4), and therefore it fits the abstract setting of Section 2. First, we need to give sense to the weighted object ω(ξ)v_h ∈ V. Thus, we further consider an abstract Banach space W ≡ W(Ω) of measurable functions on Ω, such that for any w ∈ W, the multiplication operator M_w : V → V given by M_w v := wv, ∀v ∈ V, is a well-defined linear and continuous map.
It is easy to see that the Sobolev space W = W^{1,∞}(Ω) is a space of functions for which the multiplication operator M_w : H¹(Ω) → H¹(Ω) is a well-defined linear and continuous map, for all w ∈ W^{1,∞}(Ω). The latter is also true for Hilbert spaces V ⊂ L²(Ω) containing at most first-order (weak) derivatives in L²(Ω) (e.g., first-order graph spaces).
A particular subset of interest for us will be W_+ := { w ∈ W : ∃ w_min > 0 for which w_min ≤ w(x) ≤ 1/w_min, ∀x ∈ Ω }.
Notice that 1/w ∈ W_+ iff w ∈ W_+. We can then define M_w^{-1} := M_{1/w}, which is justified by the fact that M_w M_{1/w} = M_{1/w} M_w = Id_V (cf. (28)). The adjoint operators of M_w and M_w^{-1} will be denoted by M_w^* and M_w^{-*}, respectively. Using the relations (28), it is straightforward to see that the adjoint operators satisfy (M_w^*)^{-1} = M_w^{-*} = (M_{1/w})^*. We translate problem (27) into operator notation by means of the operator B ∈ L(V; V*) such that V ∋ w ↦ Bw := b(w, •) ∈ V*. Notice that such an operator is invertible thanks to conditions (25). Problem (27) then translates into finding u_h ∈ U_h such that M_{ω(ξ)}^*(B u_h − f) vanishes on U_h. Hence, by means of the adjoint relation, and since B is invertible, multiplying this last equation by M_{ω̃(ξ)}^*, using (29), (30), and the definition of r ∈ V, we arrive at the mixed form (31). Observe that (31) has the structure of (4) for V̄ := V = U, U_h := V_h, and a(ξ; •, •), b(•, •) as in (32). The next proposition establishes a sufficient condition for the well-posedness of (31), or equivalently (27).
Proposition 3.4 (Weighted Galerkin) Assume the well-posedness condition (33) on the weight. Then the mixed problem (31) (equivalently (27)) is well-posed, and the state operator S_h(•) is uniformly bounded on X and differentiable. Moreover, if the derivative of the weight map is uniformly bounded and Lipschitz-continuous, then also S_h′(•) is uniformly bounded and Lipschitz-continuous.
Proof See Appendix A.7.

Remark 3.5 (Neural control of weighted Galerkin) Proposition 3.4 guarantees that the conditions of Propositions 2.9 and 2.11 are satisfied, hence Corollary 2.12 applies to the neural optimization of the above weighted Galerkin formulation.

Remark 3.6 (Inconvenient condition for weighted Galerkin) While for the weighted least-squares method the conditions on the weight are explicit (recall Proposition 3.1), for weighted Galerkin the condition (33) is problem dependent. Furthermore, Example 3.7 shows it may require inconvenient constraints on ξ. It seems therefore much more convenient to have neural control of least-squares formulations, or of dual minimal-residual formulations, as we will see in Section 3.3.
Example 3.7 (Weighted Galerkin for Laplacian) Let us illustrate the difficulty of condition (33) using the elementary Laplacian.
In particular, let ω̃(x) = w_min + c y•(x − x_0) for some c ∈ R and y, x_0 ∈ R^d. Then there is a c > 0 for which the weighted bilinear form in (33) vanishes at some nonzero v, i.e., b(ω̃ v, v) = 0. This shows that (33) cannot be satisfied in general without additional conditions on ω̃. Indeed, from (34) a sufficient condition can be obtained. First notice that, for any v, the weighted form can be bounded from below in terms of ‖∇ω̃‖_{L²(Ω)}, where a Poincaré inequality is used. Therefore, the constraint C_Ω ‖∇ω̃‖_{L²(Ω)} < w_min is sufficient to guarantee (33). Unfortunately, since ω̃ = ω̃(ξ), such a condition translates into a constraint on ∇ξ, which may be very inconvenient to impose in practice.

Weighted discrete-dual minimal residual formulations
Let U_h ⊂ U and V_h ⊂ V be discrete subspaces, and assume the compatibility (inf-sup) condition (35) between U_h and V_h. For each ξ ∈ X, we consider an equivalent (weighted) inner product (•, •)_{V,ξ} on V, i.e., such that its induced norm satisfies (16).
The minimal-residual method that we consider is then: find (r, u_h) ∈ V_h × U_h such that
  (r, v_h)_{V,ξ} + b(u_h, v_h) = f(v_h)   for all v_h ∈ V_h,
  b(w_h, r) = 0   for all w_h ∈ U_h.   (36)
This has the structure of (4) for V̄ := V_h and a(ξ; r, v) := (r, v)_{V,ξ}.
As shown in [36, Theorem 4.1], the mixed formulation (36) is equivalent to minimizing the residual as measured by a discrete-dual norm:
  u_h = argmin over w_h ∈ U_h of sup_{v_h ∈ V_h} [f(v_h) − b(w_h, v_h)] / ‖v_h‖_{V,ξ}.   (37)
Because (•, •)_{V,ξ} and ‖•‖_{V_h,ξ} depend on ξ, we refer to the above as a weighted discrete-dual minimal-residual formulation.

Proposition 3.8 (Weighted discrete-dual residual minimization) Assume (35), and consider a parametrized set of equivalent inner products (•, •)_{V,ξ} whose induced norms satisfy (16) uniformly, for all ξ ∈ X and v ∈ V. Then, the following statements hold true: (i) The mixed discrete formulation (36) is well-posed.
(ii) The state operator S_h(•) is uniformly bounded on X and differentiable. (iii) If the map ξ ↦ (•, •)_{V,ξ} is differentiable with a derivative that is uniformly bounded and Lipschitz-continuous, then also S_h′(•) is uniformly bounded and Lipschitz continuous.
Proof See Appendix A.8.
Remark 3.9 (Neural control of weighted residual minimization) Proposition 3.8 guarantees that the conditions of Propositions 2.9 and 2.11 are satisfied, hence Corollary 2.12 applies to the neural optimization of the above weighted minimal-residual formulation.

Numerical results
In this section, we consider numerical examples for the advection-reaction PDE in 1-D and 2-D. We consider both weighted least squares and weighted residual minimization. We construct weight functions ω : L²(Ω) → L∞(Ω) that are based on algebraic expressions, i.e., for which ω(ξ)(x) = ω(ξ(x)) for x ∈ Ω. These are convenient expressions, but the price to pay is that ω′ : L²(Ω) → L(L²(Ω); L∞(Ω)) cannot be Lipschitz. We do not believe this to have a major impact, and we leave the construction of more complicated weight functions for future investigation. While using algebraic weight functions, we have not observed any undesirable numerical effects. In fact, our results in Section 4.2 do demonstrate quasi-optimal convergence, as expected from our current theory.

Weighted least-squares approach
Let Ω = (0, 1) ⊂ R and r > 0. Consider the advection-reaction problem
  u′ + r u = r  in Ω,   u(0) = 0.   (38)
Since the exact solution to (38) is u(x) = 1 − exp(−rx), we observe that u(x) → 1 when r → +∞, for all x > 0. Hence, for r > 0 sufficiently large, the exact solution has a boundary layer in the neighbourhood of x = 0. Let U_h ⊂ H¹_{(0}(Ω) := {w ∈ H¹(Ω) : w(0) = 0} be the conforming subspace of continuous piecewise-linear functions on the uniform mesh of N elements of size h = 1/N. We use the weighted least-squares method from (22), with the weight function ω(ξ) of (39), an algebraic expression of ξ whose range of variability is controlled by a constant M. It is well known that the standard least-squares solution (i.e., the one with ω(ξ) ≡ 1) will exhibit overshoots around the boundary layer. Aiming to remedy this situation, we choose a cost functional that measures the distance to the exact solution at the point value x = h; in fact, we consider
  j(ξ) := ½ |u_{h,ξ}(h) − u(h)|² + (α/2) ‖ξ‖²_{L²(Ω)}.
Let M_8 be the set of neural-network functions with one hidden layer, 8 neurons, and ReLU activation (see (40)). We then consider the neural optimization of j(•); see Definition 2.2.
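A minimal sketch (ours, not the paper's code) of how such a weighted least-squares system can be assembled for (38) with P1 elements; the weight passed in is a generic positive function of x, since the paper's specific expression (39) is not reproduced here, and the function names and quadrature choice are assumptions.

```python
import numpy as np

def weighted_ls_advection_reaction(N, r, omega, nq=4):
    """Weighted least-squares P1-FEM for u' + r*u = r on (0,1), u(0)=0,
    minimizing  (1/2) * int omega(x) * (u_h' + r*u_h - r)^2 dx.
    `omega` is a positive weight function of x (e.g. omega(x) = w(xi(x)))."""
    h = 1.0 / N
    nodes = np.linspace(0.0, 1.0, N + 1)
    ndof = N                                   # unknowns at nodes 1..N (node 0 fixed)
    K = np.zeros((ndof, ndof))
    F = np.zeros(ndof)
    gauss_x, gauss_w = np.polynomial.legendre.leggauss(nq)
    for e in range(N):
        x0, x1 = nodes[e], nodes[e + 1]
        xq = 0.5 * (x1 - x0) * gauss_x + 0.5 * (x0 + x1)
        wq = 0.5 * (x1 - x0) * gauss_w
        phi = np.array([(x1 - xq) / h, (xq - x0) / h])           # local P1 basis
        dphi = np.array([-np.ones_like(xq) / h, np.ones_like(xq) / h])
        Lphi = dphi + r * phi                                    # B*phi = phi' + r*phi
        om = omega(xq)
        dofs = [e - 1, e]                      # global dofs (node 0 excluded -> -1)
        for a in range(2):
            if dofs[a] < 0:
                continue
            F[dofs[a]] += np.sum(wq * om * r * Lphi[a])
            for b in range(2):
                if dofs[b] < 0:
                    continue
                K[dofs[a], dofs[b]] += np.sum(wq * om * Lphi[a] * Lphi[b])
    u = np.linalg.solve(K, F)
    return nodes, np.concatenate(([0.0], u))

# omega = 1 recovers the standard least-squares method; a xi-dependent
# algebraic weight would be plugged in instead during the training loop.
nodes, u_h = weighted_ls_advection_reaction(N=16, r=160, omega=lambda x: np.ones_like(x))
```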
For our first experiment, we choose a finite element space U_h consisting of N = 16 elements of size h = 1/16. We set r = 160 and α = 0. We compute least-squares approximations for several configurations of the weight function (39), varying the M constant. Figure 1 (left) shows that the weight needs to have enough room for variability (M = 100) in order to pull the cost functional down to zero. Figure 1 (right) shows that our strategy is effective in reducing the overshoots of the finite element solution.
For the second experiment of this section, we fix M = 100 and investigate variations of the α-parameter. Figure 2 (left) suggests that the L²-norm of ξ has to be able to reach high values (the case α = 0) in order to drive the cost functional to zero. This is also related to allowing the weight to have more variability. Figure 2 (right) shows the impact of α on reducing the overshoots of the finite element solution (the smaller α, the better).

Weighted discrete-dual residual minimization approach
This experiment has exactly the same configuration as the previous experiment in Section 4.1.1, except that S_h(ξ) is computed with the discrete-dual minimal-residual methodology. First, the approximation (trial) space U_h ⊂ L²(Ω) corresponds to the space of piecewise-constant functions over the mesh. Additionally, we make use of a discrete test space consisting of conforming piecewise-linear functions over the refined uniform mesh of 2N = 32 elements. The weighted discrete-dual residual-minimization formulation that computes S_h(ξ) is the mixed problem (36) with these choices of U_h and V_h and with the ξ-weighted test inner product. As in the previous Section 4.1.1, the computation of S_h is carried out for several configurations of the weight function ω(ξ) (see (39)), varying its M constant. Figure 3 (left) shows that larger values of M allow the cost functional to be driven down faster in the training procedure. Figure 3 (right) shows how the overshoots of the finite element solutions are controlled.
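The linear-algebra structure of such a solve is simple; the following sketch (ours, with generic matrices as assumptions) solves the saddle-point system corresponding to (36) once the weighted Gram matrix of V_h and the coupling matrix have been assembled.

```python
import numpy as np

def minres_mixed_solve(G, B, F):
    """Solve the discrete-dual minimal-residual saddle-point system (cf. (36)):
        G r + B u = F     (G = weighted Gram matrix [(phi_j, phi_i)_{V,xi}])
        B^T r     = 0     (B = [b(psi_j, phi_i)], F = [f(phi_i)])
    Returns (r, u); u is the minimal-residual approximation, r the
    discrete residual representative."""
    m, n = B.shape                       # m = dim V_h, n = dim U_h
    A = np.block([[G, B], [B.T, np.zeros((n, n))]])
    rhs = np.concatenate([F, np.zeros(n)])
    sol = np.linalg.solve(A, rhs)
    return sol[:m], sol[m:]

# Equivalent Schur-complement view: u solves (B^T G^{-1} B) u = B^T G^{-1} F,
# i.e. u minimizes the residual in the discrete-dual norm induced by G.
```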
The second experiment investigates variations of the α-parameter. Figure 4 (left) suggests that the smaller α, the faster the minimization of j(•). Figure 4 (right) shows the impact of α on reducing the overshoots of the finite element solution.

Convergence of artificial neural networks
Let Ω := (0, 1) ⊂ R be a one-dimensional domain and consider the simple advection problem
  u′ = f  in Ω,   u(0) = 0,   (42)
with f(x) := π sin(πx). Notice that the exact solution to (42) is u(x) = 1 − cos(πx). Let H¹_{(0}(Ω) := {w ∈ H¹(Ω) : w(0) = 0} and let U_h ⊂ H¹_{(0}(Ω) be the finite element subspace of continuous piecewise-linear functions on a uniform mesh consisting of N elements of size h = 1/N. We consider the weighted least-squares formulation (43), with the same type of weight function ω(ξ) as before. Let M_n be the set of neural-network functions with one hidden layer, n neurons, and ReLU activation. Consider the cost functional given by a weighted least-squares functional for (42) with the prescribed weight ω*(x) = 1 + sin(πx/2). Since the minimization of the cost functional and the discrete problem (43) are both weighted least-squares formulations of the same problem (42), we expect that ω(ξ_n) → ω* as n → +∞, which is confirmed in Figure 5 (left). Additionally, solving ω(ξ) = ω* for the target control, we obtain the function ξ* shown in Figure 5 (right). To initialize the minimization algorithm, we have chosen ξ_n^{(0)} ∈ M_n as the neural-network function that (linearly) interpolates ξ* on a uniform mesh of n − 1 subintervals of Ω (i.e., having n uniformly distributed nodal points). The space U_h has been fixed to N = 16 uniform elements.
In Figure 6, we plot the error ‖ξ* − ξ_n‖_{L²}, which confirms quasi-optimal convergence behaviour; indeed, the asymptotic rate is O(n^{−1/2}), which is expected for our single-hidden-layer ReLU neural-network approximations (continuous piecewise-linear polynomials).

L 1 -based controls
We now consider numerical experiments that incorporate a stabilization mechanism. We note that the employed cost functionals use an L¹-type norm, and hence do not fit within the currently presented theory. However, our numerics show that desirable quasi-minimizers have been computed.

Minimizing the total variation
In this section we work with exactly the same problem as in the previous Section 4.1.1, but we introduce a modification in the cost functional. Instead of minimizing the distance to the exact solution at a particular point value (supervised training), we take an unsupervised approach by minimizing the total variation of u_h (i.e., the L¹-norm of u_h′). Hence, we consider the cost functional
  j(ξ) := ‖u_{h,ξ}′‖_{L¹(Ω)} + (α/2) ‖ξ‖²_{L²(Ω)}.
For a fixed value of M = 100, Figure 7 (left) shows the behavior of the cost functional for different values of α, indicating that this value has to be chosen small enough to speed up the minimization.

The next experiment considers an advection-reaction problem with boundary conditions imposed at both endpoints, which can be viewed as the limiting case of a vanishing-viscosity regime (i.e., an equivalent problem having an extra −εu″ term that vanishes as ε → 0⁺). Of course, the exact solution that we want to approach (u(x) = 1 − e^{−x}) only satisfies one of the boundary conditions. However, any discrete solution in an H¹_0(Ω)-conforming space must satisfy both constraints. In this case, it is well known that the standard least-squares solution to this problem does not deliver satisfactory results. To remedy this drawback, we propose a cost functional that mimics the L¹ residual minimization as proposed in [21]; thus, our (unsupervised) cost functional will be
  j(ξ) := ‖f − B u_{h,ξ}‖_{L¹(Ω)} + (α/2) ‖ξ‖²_{L²(Ω)}.
We consider the weighted least-squares formulation for u_{h,ξ}, solved on a uniform mesh of N = 8 elements. For a fixed constant M = 1000 in the weight function (39), we compute the discrete solution for several values of the α-parameter. Large values of α allow only small values of ‖ξ‖_{L²}, and thus the weight becomes almost constant (close to the standard least-squares approach). On the other hand, small values of α allow for more variability of the weight, and thus we observe that we can recover a discrete solution mimicking the vanishing-viscosity case (see Fig. 8).
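For illustration only (ours, not the paper's implementation), the following sketch evaluates the two L¹-type quantities used as costs above for a continuous piecewise-linear u_h; the advection-reaction operator u ↦ u′ + r·u and all names are assumptions made for the sketch.

```python
import numpy as np

def total_variation_p1(u):
    """Total variation of a continuous piecewise-linear function given by its
    nodal values u, i.e. the L1-norm of its derivative: sum of |nodal jumps|."""
    return np.sum(np.abs(np.diff(u)))

def l1_residual_p1(nodes, u, r, f, nq=4):
    """L1-norm of the strong residual f - (u' + r*u) for a P1 function u
    (advection-reaction operator assumed for illustration)."""
    gauss_x, gauss_w = np.polynomial.legendre.leggauss(nq)
    total = 0.0
    for e in range(len(nodes) - 1):
        x0, x1 = nodes[e], nodes[e + 1]
        xq = 0.5 * (x1 - x0) * gauss_x + 0.5 * (x0 + x1)
        wq = 0.5 * (x1 - x0) * gauss_w
        du = (u[e + 1] - u[e]) / (x1 - x0)
        uq = u[e] + du * (xq - x0)
        total += np.sum(wq * np.abs(f(xq) - (du + r * uq)))
    return total
```

Either quantity (plus the regularization term) can be plugged in as the loss in the optimization loop sketched in the introduction.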
We approach (47) using a coarse (and over-constrained) finite element space of piecewise-linear functions. We use the weighted least-squares method with the same weight (39) and M = 1000. The cost functional j(•) for this case is again of the L¹ residual-minimization type above. The discrete neural-network space in which we minimize j(•) will be M_8 (see (40)). Results for the α = 0 case are depicted in Figure 9. We observe a strong correlation with the results in [21, Figure 9].

A Proofs
A.1 Proof of Theorem 2.A
(i) Strong convexity of j implies coercivity, i.e., j(ξ) → +∞ when ‖ξ‖_X → +∞. Moreover, j is continuous in the strong topology since it is differentiable. Additionally, we know that convexity plus continuity implies that j is weakly lower semicontinuous (see, e.g., [9, Corollary 3.9]). We thus satisfy all the hypotheses of the theorem of existence of minimizers for coercive and sequentially weakly lower semicontinuous functionals [13, Theorem 9.3-1]. Moreover, strong convexity ensures that such a (global) minimizer ξ̄ ∈ X is unique. Besides, global differentiability of j implies the first-order necessary optimality condition j′(ξ̄) = 0.
(ii) We know that j has a global lower bound. Thus, by the infimum property, for any δ_n > 0 there must exist ξ_n ∈ M_n such that j(ξ_n) ≤ inf_{η_n ∈ M_n} j(η_n) + δ_n.
(iii) Let ξ̄ ∈ X be the global minimizer and let ξ_n ∈ M_n satisfy (8). By the characterization of strong convexity we have, for all t ∈ (0, 1), the estimate (50). Thus, for all t ∈ (0, 1) and η_n ∈ M_n we obtain an intermediate bound. On the other hand, using the facts that j′ is L-Lipschitz and j′(ξ̄) = 0, we deduce [13, cf. proof of Thm. 7.7-3, page 488] the bound (51). Hence, combining (50) with (51), taking the limit when t → 1 and the infimum over all η_n ∈ M_n, we get the estimate γ ‖ξ̄ − ξ_n‖²_X ≤ L inf_{η_n ∈ M_n} ‖ξ̄ − η_n‖²_X + 2 δ_n, i.e., (14).

A.2 Proof of Theorem 2.B
We proceed to prove each one of the statements.
(i) Since Z and X are Hilbert spaces, the quadratic maps Z ∋ z ↦ ½‖z‖²_Z and X ∋ ξ ↦ ½‖ξ‖²_X are differentiable. On the other hand, S_h and Q are also differentiable (Q is linear), and thus j_1 is differentiable by means of the chain rule (see, e.g., [51, Theorem 2.20]), and j_1′ can be computed explicitly. Thus, we conclude that j_1′ is Lipschitz, where we have used the mean value theorem together with
• the boundedness of S_h′, with bounding constant M_{S′};
• the Lipschitz continuity of S_h′, with Lipschitz constant L_{S′};
• the boundedness of S_h, with bounding constant M_S.
Finally, by setting L_1 := ‖Q‖²_{L(U,Z)} M_{S′}² + L_{S′} M_S, it is straightforward to see that L_1 + α will be a Lipschitz constant for j′.
(ii) Just observe that ⟨j′(ξ_1) − j′(ξ_2), ξ_1 − ξ_2⟩ ≥ (α − L_1) ‖ξ_1 − ξ_2‖²_X. Thus, j is strongly convex whenever α > 0 is sufficiently large.
Next, let w_h ∈ U_h. Since A(ξ)* is surjective (by (20)), there exists a v_{w_h} ∈ V̄ such that A(ξ)* v_{w_h} = B w_h, i.e., v_{w_h} ∈ V_h(ξ).

A.5 Proof of Proposition 2.11
Let us start proving statements (i), (ii) and (iii) at the same time.
Recall the definition of the kernel space K := ker B* ⊂ V̄. For any ξ ∈ X, consider the restricted operator A(ξ)_K : K → K*, as well as the restriction f_K ∈ K*. Observe that the inf-sup condition (15) ensures that A(ξ)_K is a boundedly invertible linear operator. Thus, given a direction η ∈ X and t ∈ R, from the first equation of the mixed system (19) (restricted to K) we obtain (54). In particular, continuity of A(•) implies continuity of R_h(•). Moreover, using the inf-sup condition (15), it is clear that the bound (55) holds. Next, adding the term A(ξ)_K R_h(ξ + tη) on both sides of equation (54a), rearranging, and subtracting equation (54b), we obtain (56), from which, if A′(ξ)η exists, we infer that R_h(•) has a Gâteaux derivative. Finally, if A(•) is Gâteaux-differentiable at ξ, then using the inf-sup condition (15), the boundedness of the linear operator A′(ξ), and the estimate (55), we conclude that R_h(•) is Gâteaux-differentiable at ξ. Besides, if A′(•) and α_h^{-1}(•) are uniformly bounded on X, then R_h′(•) is uniformly bounded on X. Now we turn to S_h. From the mixed system (19) we deduce a relation for S_h; since B is boundedly invertible onto its closed range, we get the estimate (57). Moreover, if A(•) is Gâteaux-differentiable, then using the inf-sup condition (15) and the estimate (57) we get (59), which proves that S_h(•) is Gâteaux-differentiable. Besides, it is clear from (59) that ‖S_h′(•)‖_{L(X,U)} will be uniformly bounded on X whenever ‖A(•)‖_{L(V̄,V̄*)}, A′(•) and α_h^{-1}(•) are uniformly bounded on X.
(iv) Let us prove Lipschitz continuity. Using (56), observe that for any ξ_1, ξ_2, η ∈ X a corresponding bound holds.

A.6 Proof of Proposition 3.1
On the other hand, we are under the assumption that the operator B : H_B → V̄* is boundedly invertible. Hence, there must be a uniform constant β > 0 such that the required lower bound holds, which implies the second inf-sup condition in (15).
(ii) Uniform boundedness of S_h(•) is a consequence of Proposition 2.9(iii); indeed, in our particular case we get the corresponding uniform bound. To show differentiability of S_h(•), let us recall the operator A : X → L(V, V*) defined in Section 2.4, which in this particular case, given ξ ∈ L²(Ω), takes the form ⟨A(ξ) r, v⟩ = (ω̃(ξ) r, v)_{L²(Ω)}. Furthermore, we have a uniform bound on ‖A(ξ)‖. Since ω̃(•) is differentiable, it is straightforward to check that A(•) is also differentiable, and given ξ, η ∈ L²(Ω), we have ⟨[A′(ξ)η] r, v⟩ = (ω̃′(ξ)η r, v)_{L²(Ω)}. Moreover, we can verify that A′(•) is uniformly bounded and Lipschitz continuous. Thus, the differentiability of S_h(•) is a consequence of Proposition 2.11(ii).
(ii) Using the hypothesis of this statement and the estimate (17) in Proposition 2.9(iii), we get the uniform bound on S_h(•). (iii) This is a direct application of Proposition 2.11, noticing also that α_h^{-1}(ξ) ≤ (C_{1,ξ})^{-2}.

Figure 1: Point value control for weighted least-squares. Minimization of the cost functional for several values of M (left). Overshoot control of the discrete solutions (right).

Figure 2: Point value control for weighted least-squares. Minimization of the cost functional for several values of α (left). Overshoot control of the discrete solutions (right).

Figure 3: Point value control for weighted discrete-dual residual minimization. Optimization of the cost functional for several values of M (left). Overshoot control of the discrete solutions (right).

Figure 4: Point value control for weighted discrete-dual residual minimization. Optimization of the cost functional for several values of α (left). Overshoot control of the discrete solutions (right).

Figure 7: Total variation control. Minimization of the cost functional for several values of α (left). Overshoot control of the discrete solutions (right).

Figure 8: Discrete weighted least-squares solutions, with L¹ residual-minimization control, for several values of the α-parameter.

Figure 9: Overconstrained weighted least-squares for advection-reaction, with L¹ residual-minimization control. From left to right: exact solution; standard overconstrained least-squares; controlled weighted least-squares.
A.4 Proof of Proposition 2.10
Let (r, u_h) ∈ V̄ × U_h solve the state problem (4), or equivalently (19) in operator form. Testing with elements v_h ∈ V_h(ξ) and using (4b), one finds that u_h satisfies the Petrov–Galerkin problem (2).