A Local Search for Numerical Optimisation based on Covariance Matrix Diagonalisation

Pattern Search is a family of optimisation algorithms that improve upon an initial solution by performing moves along the directions of a basis of vectors. In its original definition, Pattern Search moves along the directions of each variable. Amongst its advantages, the algorithm does not require any knowledge of derivatives or an analytical expression of the function to optimise. However, the performance of Pattern Search is heavily problem dependent since the search directions can be very effective on some problems and lead to poor performance on others. The present article proposes a novel enhancement of Pattern Search that explores the space by using problem-dependent search directions. Some points are sampled within the basin of attraction and the diagonalisation of the covariance matrix associated with the distribution of these points is performed. The proposed Covariance Pattern Search improves upon an initial point by varying it along the directions identified by the eigenvectors of the covariance matrix.


Introduction
In the context of optimisation, a basin of attraction of a search algorithm is the set of points of the decision space that would converge to the same local optimum, see [15]. For example, if we consider a multimodal problem in the continuous domain, a gradient based algorithm would process some randomly sampled solutions by converging to the nearest local optimum. If two points converge to the same optimum, they belong to the same basin of attraction with respect to the gradient based optimiser. Thus, for a given search algorithm (with its variation operator) a decision space can be seen as mapped into several (and possibly overlapping) subsets, with each of them being a basin of attraction.
This concept is fundamental in optimisation, especially when complex, multivariate, and multimodal fitness landscapes are taken into consideration, see [19,31,28,1]. A popular strategy to address these issues is the use of multiple search operators within the same framework. This idea has been extensively applied over the past three decades in both continuous and combinatorial domains. Some of the keywords associated with this idea are Metaheuristics [9,25], Hyper-heuristics [2], Memetic Computing [19], Genetic Programming [11], Agent Systems [24], and Algorithm Portfolios [33,22].
In these frameworks a distinction between global and local search operators has been traditionally presented as part of the nomenclature to describe algorithmic operations. Whilst the distinction between global and local optima is mathematically rigorous, the distinction between global and local search is more subtle, especially in the context of heuristic optimisation where there is no theoretical guarantee of the detection of an optimum. More specifically, global and local search algorithms differ in purpose: whilst global search algorithms search for a candidate solution with the lowest (or highest) function value within the entire decision space, local search algorithms search for a candidate solution with the lowest (or highest) function value within a subset of the entire space, often referred to as a neighbourhood, see [10].
In mathematical optimisation, and with reference to the continuous domain that is the focus of this paper, the concepts of basins of attraction and local search are well-defined. When considering a minimisation problem, a basin of attraction is a set containing one point, or infinitely many contiguous points (e.g. a plateau or saddle region), with null gradient (unless the optimum is on the bounds), surrounded by points with non-null gradient and higher function values. Local search algorithms that calculate the gradient, e.g. descent methods, Newtonian methods, Quasi-Newtonian methods, and conjugate gradient methods, would (approximately) detect the local optimum in a given time-frame from any starting point in the basin of attraction, see [21].
When the derivatives are not available and a heuristic method is applied, the concepts of basins of attraction and local search become unclear. For example, the Nelder-Mead algorithm [18], which is often considered a local search algorithm [17], can search for the optimum in an area with only one local optimum, a larger area, or potentially the entire decision space (depending on the initial simplex).
In the absence of derivatives, one of the simplest ideas to optimise a function is to vary one design variable at a time by steps of the same magnitude and calculate the function value at each step. Then, when no increase or decrease in any one design variable improves upon the performance of the current best solution, the algorithm halves the step size and repeats the process until the steps are sufficiently small. This simple algorithm was employed in the 1940s in the Los Alamos laboratories by Fermi and Metropolis, see [8]. This idea has been refined and modified over the decades, and its most famous implementation is the Pattern Search (PS) algorithm by Hooke and Jeeves [14], which was proven to be convergent in [34]. This idea has been conceptualised in [29], where Pattern Search Algorithms are introduced as a family of optimisation algorithms, and the term Generalised Pattern Search was coined for a Pattern Search using any generic basis of vectors. The search logic of PS is used as part of many modern algorithms. One example is [30], where a greedy version of the Pattern Search is used within a three-algorithm portfolio for large scale problems. Furthermore, other recent examples of Pattern Search implementations have re-named the algorithm as "S" and put it into the context of Memetic Algorithms [5,7] and of a restarting scheme in [6].
The present article exploits this idea to propose a novel enhancement of Generalised Pattern Search. The proposed algorithm attempts to enhance the original scheme by using an alternative and more convenient reference system to move within the search space. This alternative system is provided by the eigenvectors of a covariance matrix of a set of points crowding a basin of attraction.
The remainder of this article is organised as follows. Section 2 introduces the notation, presents the naive Pattern Search and highlights its limitations. Section 3 presents the proposed algorithm including its theoretical justification, as well as its limitations. Section 4 provides the numerical results of this study. Finally, Section 5 gives the conclusions to this work.

Notation and background
In order to clarify the notation used, we refer to the minimisation of an objective function $f(x)$, where the candidate solution $x$ is a vector of $n$ design variables in a decision space $D \subset \mathbb{R}^n$. The Pattern Search (PS) processes a single solution and, whilst moving along the axes, improves upon its performance (objective function value). PS can be viewed as a simple deterministic local search algorithm which can be part of a more complex framework, see [3,5].
In this paper we will refer to the naive implementation of this idea proposed in one of the searchers in [30]. Starting from an elite solution $x$, this local search generates a trial solution $x_t$ by performing steps along each of the variables. For each design variable $i$, the trial solution

$$x_t = x - \rho \cdot e_i$$

is calculated, where $\rho$ is the exploratory radius, $\cdot$ indicates the product of a scalar and a vector, and $e_i$ is the $i$th versor, that is a vector composed of zeros and one 1 in the $i$th position. Subsequently, if $x_t$ outperforms $x$, the solution $x$ is updated (taking the value of $x_t$); otherwise a half-step in the opposite direction is performed:

$$x_t = x + \frac{\rho}{2} \cdot e_i.$$

This exploration is repeated for all the design variables and stopped when a prefixed budget is exceeded. For the purpose of this paper we will refer to the naive implementation used in [30,5]. The pseudo-code displaying the working principles of this PS implementation is given in Algorithm 1.

Algorithm 1 Pattern Search according to the implementation in [30]
INPUT x
while local budget condition do
    for i = 1 : n do
        explore along the versor e_i with radius ρ, as described above
    if no move improved upon x then halve ρ
end while
OUTPUT x
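For illustration purposes, a minimal Python sketch of this PS logic follows. It is a reading of the description above, not the exact code of [30]: the function names, the initial radius, and the placement of the radius-halving step are our assumptions.

```python
import numpy as np

def pattern_search(f, x, budget, rho=0.4):
    """Naive Pattern Search: perturb one design variable at a time along the axes."""
    x = np.asarray(x, dtype=float)
    fx = f(x)
    evals = 1
    n = len(x)
    while evals < budget:
        improved = False
        for i in range(n):
            # Full step in the negative direction of the i-th axis.
            xt = x.copy()
            xt[i] -= rho
            ft = f(xt); evals += 1
            if ft < fx:
                x, fx, improved = xt, ft, True
            else:
                # Half step in the opposite direction.
                xt = x.copy()
                xt[i] += rho / 2
                ft = f(xt); evals += 1
                if ft < fx:
                    x, fx, improved = xt, ft, True
        if not improved:
            rho /= 2  # shrink the exploratory radius when a full sweep fails
    return x, fx
```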

Limitation of Pattern Search
The naive PS algorithm, albeit easy to implement, is characterised by heavily problem-dependent performance due to its variation operators. Since the variables are perturbed/modified one by one, PS can lead to very good performance if the function is separable. In practice, PS can be successfully applied to various partly separable and non-separable problems (see [7]), but its performance would generally be poor. In order to illustrate this statement, let us consider the ellipsoid function in two dimensions shown in Fig. 1. The PS algorithm would quickly solve the problem in Fig. 1a from any starting point, since one of the directions of the search (one of the axes) coincides with the direction of maximum gradient. However, the PS algorithm would not be efficient in optimising the problem in Fig. 1b, since the search directions are not along the maximum gradient. The algorithm would halve the radius in the early stages of the search and move slowly towards the optimum. In the case of optimisation in higher dimensions, the difference in performance between the two scenarios would be significant.

The proposed Covariance Pattern Search
If we could move along the direction of maximum gradient for every problem, PS would work efficiently regardless of the problem. Unfortunately, in real-world optimisation, problems are unknown and there is no prior knowledge of the most convenient reference system. This consideration is not novel and is common to several algorithms for numerical optimisation, such as the Rosenbrock Algorithm, see [27], where the gradient is estimated adaptively and a new reference system is determined by Gram-Schmidt orthogonalisation.
On the basis of this classical consideration, the present article proposes a novel local search implementation of Pattern Search. This section outlines and discusses the proposal. Subsection 3.1 provides a theoretical justification for the method, Subsection 3.2 describes its implementation details, Subsection 3.3 presents a working example, and Subsection 3.4 highlights the limitations of our proposal.

Theoretical justification
Unlike the Rosenbrock Algorithm, the proposed method makes use of a pre-processing of the problem under investigation to determine the most convenient search directions. In order to understand the theoretical foundation of the proposed method, let us consider a population of $m$ vectors/candidate solutions $x^1, x^2, \ldots, x^m$ in an $n$-dimensional space. These points can be interpreted as a statistical distribution characterised by a mean vector $\mu$, whose elements are

$$\mu_j = \frac{1}{m} \sum_{i=1}^{m} x_j^i,$$

and a covariance matrix $C$, whose elements are

$$c_{j,k} = \frac{1}{m} \sum_{i=1}^{m} \left( x_j^i - \mu_j \right) \left( x_k^i - \mu_k \right).$$

Due to the commutative property of the product of numbers, it follows that $\forall j,k: c_{j,k} = c_{k,j}$, i.e. the covariance matrix is symmetric.
The vector $\mu$ represents the barycentre of the distribution. The covariance matrix $C$ describes the geometry of the distribution with respect to the reference system. More specifically, the diagonal elements measure the deviation of each design variable from the barycentre, whilst the extra-diagonal elements measure how two design variables vary together.
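As a concrete illustration (a minimal numpy sketch of these definitions; the sample below is a random placeholder), the mean vector and covariance matrix of a set of points can be computed as follows, and the resulting matrix is symmetric by construction:

```python
import numpy as np

# m points in an n-dimensional space, one per row (random illustration only).
points = np.random.randn(200, 2) @ np.array([[3.0, 0.0], [0.0, 1.0]])

mu = points.mean(axis=0)          # barycentre of the distribution
C = np.cov(points, rowvar=False)  # n x n covariance matrix

assert np.allclose(C, C.T)        # c_{j,k} = c_{k,j}: C is symmetric
```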
Let us consider an optimisation problem and let us imagine sampling $m$ points, with $m$ arbitrarily large, within the decision space $D$. With reference to Fig. 1a, let us imagine crowding the elliptic inner contour with points and calculating the mean vector and covariance matrix. It can be verified that $\mu$ would be the optimum and the covariance matrix $C$ would be diagonal. However, if we crowded the elliptic inner contour in Fig. 1b with points, we would observe that the corresponding covariance matrix $C$ would be full.
This observation can be extended to the general case: a basin of attraction that has the highest gradient along one of the axes is characterised by a diagonal covariance matrix of the points crowding it. Hence, we propose to use the reference system corresponding to a diagonal covariance matrix as the move directions for PS, in order to search along the highest gradient direction. That is, we propose to run PS along the directions identified by the eigenvectors of $C$, see [20]. These directions are the columns of a non-singular matrix $P$ such that

$$P^{-1} C P = D,$$

where $D$ is a diagonal matrix.

Observation 1. Since the covariance matrix $C$ is symmetric, it follows that:
1. $C$ is always diagonalisable, and hence there always exists a non-singular transformation matrix $P$ that diagonalises $C$;
2. each pair of eigenvectors of $C$ (columns of the matrix $P$) is orthogonal, hence the proposed search directions compose an orthogonal system of coordinates.
The matrix $P$ is the linear transformation that changes the variables of the problem. For example, the directions of the eigenvectors associated with the covariance matrix of the rotated ellipsoid in Fig. 1b are shown in Fig. 2.
Equivalently, the proposed algorithm is based on the consideration that nonseparability is not just a feature of the function to optimise: nonseparability is a feature of the function within its reference system. Hence, a (local) change in the reference system can (locally) lead to a much easier optimisation problem.

Algorithmic outline of the proposed method
Let us assume that we have available a reliable covariance matrix $C$ of the basin of attraction under investigation. The matrix $C$ would be diagonalised and the eigenvectors would be the columns of a transformation matrix

$$P = \left( p^1, p^2, \ldots, p^n \right).$$

In this article the diagonalisation is performed by the numerical method described and implemented in [20].
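In the sketches of this article's appendix-style examples we stand in for the numerical routine of [20] with numpy's eigendecomposition for symmetric matrices, which returns orthonormal eigenvectors as columns; this substitution is our implementation choice, not the routine used in the paper.

```python
import numpy as np

def transformation_matrix(C):
    """Return P whose columns are the eigenvectors of the symmetric matrix C."""
    eigvals, P = np.linalg.eigh(C)  # eigh exploits the symmetry of C
    return P

# P diagonalises C: P^{-1} C P = P^T C P is (numerically) diagonal.
C = np.array([[2.0, 1.0], [1.0, 2.0]])
P = transformation_matrix(C)
D = P.T @ C @ P
assert np.allclose(D - np.diag(np.diag(D)), 0.0, atol=1e-12)
```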
In order to better illustrate the relation between the change of coordinates and the diagonalisation of the covariance matrix, let us consider the two-dimensional ellipsoid function

$$f(x_1, x_2) = a z_1^2 + b z_2^2, \quad \text{with} \quad \begin{pmatrix} z_1 \\ z_2 \end{pmatrix} = \begin{pmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix},$$

where $\alpha$ is the rotation angle. Fig. 3 illustrates this function (with $a = 1$ and $b = 6$), for $\alpha = 0^{\circ}$ and $\alpha = 30^{\circ}$ respectively, a sample of points, the eigenvectors, and the directions identified by them. It can be observed that for $\alpha = 0^{\circ}$ the directions of the eigenvectors coincide with those of the reference axes, whilst for $\alpha = 30^{\circ}$ the axes appear rotated.
The new version of PS operates on a starting point $x$ and generates trial solutions for the $i$th design variable by computing

$$x_t = x - \rho \cdot P e_i,$$

where the product of the matrix $P$ by the versor $e_i$, $P e_i$, is equal to the $i$th eigenvector $p^i$. The pseudocode of the proposed Covariance Pattern Search (CPS) Algorithm is shown in Algorithm 2.

Algorithm 2 The proposed Covariance Pattern Search
INPUT x
INPUT the covariance matrix C
Process the covariance matrix C and calculate the transformation matrix P = (p^1, p^2, ..., p^n) whose columns are the eigenvectors of C
while local budget condition do
    for i = 1 : n do
        explore along the eigenvector p^i with radius ρ, as in Algorithm 1
    if no move improved upon x then halve ρ
end while
OUTPUT x
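A minimal Python sketch of CPS follows. It mirrors the PS sketch of Section 2 and only swaps the axis versors e_i for the eigenvector columns p^i of P; as before, the function names, initial radius, and radius schedule are our assumptions, and numpy's eigh stands in for the diagonalisation routine of [20].

```python
import numpy as np

def covariance_pattern_search(f, x, C, budget, rho=0.4):
    """CPS: Pattern Search along the eigenvectors of the covariance matrix C."""
    _, P = np.linalg.eigh(C)  # columns of P are the eigenvectors of C
    x = np.asarray(x, dtype=float)
    fx = f(x)
    evals = 1
    while evals < budget:
        improved = False
        for i in range(P.shape[1]):
            p_i = P[:, i]                 # search direction (replaces versor e_i)
            xt = x - rho * p_i            # full step along -p_i
            ft = f(xt); evals += 1
            if ft < fx:
                x, fx, improved = xt, ft, True
            else:
                xt = x + (rho / 2) * p_i  # half step in the opposite direction
                ft = f(xt); evals += 1
                if ft < fx:
                    x, fx, improved = xt, ft, True
        if not improved:
            rho /= 2
    return x, fx
```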

Working example
In order to highlight the difference in functioning between Algorithm 1 and Algorithm 2, and the benefits of the proposed CPS with respect to its standard counterpart, we have run both algorithms on the ellipsoid function above, in this case with the arbitrary values $a = 1$ and $b = 76$ and an arbitrarily chosen angle $\alpha$. We let both algorithms run until $\rho < 10^{-75}$, at which point the problem was considered solved. Whilst PS requires 16169 objective function evaluations, CPS solves the problem after only 2040 evaluations. Fig. 4 comparatively illustrates the functioning of the two algorithms. The proposed CPS moves along more convenient directions than the variable directions of PS, thus quickly reaching the optimum.
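A hedged reconstruction of this experiment is sketched below, reusing the pattern_search and covariance_pattern_search sketches above. The rotation angle, budgets, starting point, and level-set threshold (anticipating Section 3.4) are our arbitrary choices, so we do not expect to reproduce the exact evaluation counts reported above.

```python
import numpy as np

a, b, alpha = 1.0, 76.0, np.deg2rad(30)  # the rotation angle is our choice
R = np.array([[np.cos(alpha), -np.sin(alpha)],
              [np.sin(alpha),  np.cos(alpha)]])

def rotated_ellipsoid(x):
    z = R @ x
    return a * z[0] ** 2 + b * z[1] ** 2

# Covariance estimated from points crowding a low contour (see Section 3.4).
samples = np.array([x for x in np.random.uniform(-100, 100, (100000, 2))
                    if rotated_ellipsoid(x) < 1e3])
C = np.cov(samples, rowvar=False)

x0 = np.array([50.0, 50.0])
x_ps, f_ps = pattern_search(rotated_ellipsoid, x0, budget=20000)
x_cps, f_cps = covariance_pattern_search(rotated_ellipsoid, x0, C, budget=10000)
print(f_ps, f_cps)  # CPS typically reaches far lower values on this problem
```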

Limitations of the proposed Covariance Pattern Search
The most evident limitation of the proposed CPS is that a reliable covariance matrix $C$ of the points of the basin of attraction must be estimated. In the present paper, when a basin of attraction is identified, a set of points populating this basin is found by treating the optimisation problem as a level set problem: random points are generated and those whose objective function values fall below a threshold are saved in a data structure, see Fig. 3. These points are then used to calculate the covariance matrix $C$.
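A sketch of this level-set sampling is given below; the threshold value and the number of draws are placeholders, since the paper determines the threshold empirically for each problem.

```python
import numpy as np

def sample_basin(f, lower, upper, thr, num_draws=100000, rng=None):
    """Collect uniformly drawn points whose objective value falls below thr."""
    rng = np.random.default_rng() if rng is None else rng
    draws = rng.uniform(lower, upper, size=(num_draws, len(lower)))
    mask = np.array([f(x) for x in draws]) < thr
    return draws[mask]

# Covariance matrix of the points crowding the basin:
# C = np.cov(sample_basin(f, lower, upper, thr), rowvar=False)
```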
The major issue associated with this approach is that a suitable threshold is problem dependent and must be empirically determined in each case. Besides being inelegant, this procedure can also be computationally expensive in the high-dimensional domain. However, the initial computational effort may be paid off by a much faster solution to the optimisation problem, see Fig. 4.
The second limitation is that, for multimodal optimisation problems, promising basins of attraction must be identified before one basin is selected and CPS is executed. This limitation would make CPS unsuitable as a stand-alone algorithm and/or would require a preprocessing algorithm to estimate the locations of potential basins of attraction. However, CPS, just like other local search algorithms, can be embedded in a metaheuristic framework and can exploit the function calls performed by a global optimiser (or other local search algorithms) as part of the preprocessing, see [19].
Alternatively, CPS can be applied to a multimodal problem at the end of a preprocessing stage for multimodal optimisation, see e.g. the method in [26] or the intelligent sampling proposed in [23]. In the latter, an initial population of $m$ candidate solutions ($n$-dimensional vectors) is sampled within the decision space $D$. Each candidate solution undergoes the application, with a limited budget, of two local searchers. The solutions returned by the two local search algorithms are then processed by the K-means clustering algorithm, enhanced by a control on the silhouette to decide the correct number of clusters, as explained in [23]. Each cluster represents a basin of attraction and its candidate solutions are the sampled points on which the covariance matrix $C$ is computed.
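A sketch of the silhouette-controlled choice of the number of clusters follows, using scikit-learn's KMeans and silhouette_score as stand-ins; the actual procedure of [23] may differ in its details.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster_basins(points, k_max=10, random_state=0):
    """Cluster candidate solutions; pick k with the best silhouette score."""
    best_k, best_score, best_labels = 2, -1.0, None
    for k in range(2, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=random_state).fit_predict(points)
        score = silhouette_score(points, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    # One covariance matrix per cluster, i.e. per estimated basin of attraction.
    return [np.cov(points[best_labels == j], rowvar=False)
            for j in range(best_k)]
```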

Numerical Results
In order to experimentally demonstrate the effectiveness of CPS, and the idea inspiring it, we have tested CPS against PS in multiple scenarios. To simulate the local search conditions, we have considered and adapted a sample of the functions from the CEC 2013 benchmark (focussing on unimodal problems), see [16]. We have used two versions of the ellipsoid since the use of two versions was relevant to demonstrate the performance of the local search. Each problem has been scaled to 10, 30, and 50 dimensions and has been studied in $[-100, 100]^n$. Table 1 displays the functions and depicts the shape of the corresponding basins of attraction. For each problem, a shift and a rotation have been applied: with reference to Table 1, the variable

$$z = Q_k \cdot (x - o),$$

where $o$ is the shifting vector (the same used in [16]) and $Q_k$ is a rotation matrix (a randomly generated orthogonal matrix) set for the $k$th problem, see [4].
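A sketch of this shift-and-rotate construction is given below. Generating a random orthogonal matrix via the QR decomposition of a Gaussian matrix is a standard technique; the specific o and Q_k of [16,4], and the shift range used here, are not reproduced.

```python
import numpy as np

def make_shifted_rotated(f_base, n, rng=None):
    """Wrap a base benchmark function as x -> f_base(Q (x - o))."""
    rng = np.random.default_rng() if rng is None else rng
    o = rng.uniform(-80, 80, size=n)                  # shift vector (placeholder range)
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))  # random orthogonal matrix
    return lambda x: f_base(Q @ (np.asarray(x) - o))
```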
The PS in Algorithm 1 has been executed with a budget of 10000 × n function calls where n is the problem dimensionality. In order to guarantee a fair comparison, the budget of the proposed CPS in Algorithm 2 has been split into two parts: 5000 × n function calls have been used to build the covariance matrix C whilst 5000 × n function calls have been spent to execute the algorithm. Due to the nature of PS and CPS, i.e. deterministic local search, the bound handling has been performed by saturating the design variable to the bound. We preferred the saturation to the bound over the toroidal insertion or reflection [10] since the latter two mechanisms would be equivalent to the sampling of a point. This sampling would disrupt the gradient estimation logic of Pattern Search.
Although PS and CPS are deterministic algorithms, their performance can depend on the initial point and, for CPS, on the sampled points used to estimate the covariance matrix. Thus, for each scenario, we sampled 51 initial points within the entire domain and used them to execute PS and CPS. The average objective function values and standard deviations over the 51 runs have been calculated. The statistical significance of the results has been assessed by the application of the Wilcoxon rank-sum test, see [32,12]. In the Tables in this section, a "+" indicates that CPS significantly outperforms PS, a "-" indicates that PS significantly outperforms CPS, and a "=" indicates that there is no significant difference in performance. Regarding CPS, the threshold values used in this study are reported in Table 2.

Table 2. Thresholds thr in 10, 30, and 50 dimensions

n    f1        f2         f3        f4        f5        f6
10   10^4      10^9       5 × 10^8  10^9      10^9      10^4
30   5 × 10^4  5 × 10^11  2 × 10^9  2 × 10^9  10^8      10^5
50   10^5      5 × 10^13  5 × 10^9  2 × 10^9  5 × 10^7  3 × 10^5

Numerical results show that, with the exception of f1, the proposed CPS consistently outperforms the standard PS across the three dimensionalities under consideration. The problem f1 is characterised by a central symmetry. This means that the rotation is ineffective and any reference system would broadly perform in the same way. The use of the eigenvectors as search directions not only systematically improves upon PS but appears to solve some problems, such as f5 (discus).
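The "+"/"-"/"=" notation above can be produced with the rank-sum test available in scipy; in this sketch, the 0.05 significance level and the median-based direction check are our assumptions.

```python
import numpy as np
from scipy.stats import ranksums

def compare(cps_errors, ps_errors, alpha=0.05):
    """Return '+', '-', or '=' following the notation used in the tables."""
    _, p = ranksums(cps_errors, ps_errors)
    if p >= alpha:
        return "="  # no significant difference
    # Minimisation: the algorithm with the lower median error wins.
    return "+" if np.median(cps_errors) < np.median(ps_errors) else "-"
```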
Numerical results in Table 4 show that CPS maintains the same performance irrespective of the rotation matrix. These results allow us to conjecture that the proposed mechanism of optimising along the direction of the eigenvectors is effective and the generation of the covariance matrix is robust.
To further illustrate the functioning of CPS with respect to PS, Fig. 5 shows the performance trend of two specific runs.
The results in Fig. 5 show that CPS achieves results orders of magnitude better than PS. It must be noted that we represented the performance of CPS as a constant for the first half of the trend. This is due to the objective function evaluation budget used to calculate the covariance matrix and the corresponding eigenvectors, with the optimisation only occurring in the second part of the budget. At half of the budget, CPS starts the optimisation by exploiting the directions suggested by the preliminary phase. A dramatic decrease in the fitness value is evident in the illustration. When the proposed logic is considered within a local search/optimisation algorithm, it can be efficiently implemented by using the samples produced by a global searcher, whilst the local search can run with a short budget, just to exploit the abrupt improvement.
Finally, in order to highlight the potential and limitations of CPS, we have compared it against the Covariance Matrix Adaptation Evolution Strategy (CMAES), according to the implementation in [13] and the default parameters therein (initial σ set to one third of the domain). Table 5 displays the results of this comparison for the same problems as in Table 3.

Table 4. Average error avg ± standard deviation σ over 51 runs for the problems listed in Table 1

Table 5 shows that for about half of the problems CPS is competitive (f1 and f4) or outperforms CMAES (f5), whilst for some other problems CMAES achieves results that are orders of magnitude better than those of CPS. This happens especially in cases of steep gradients, such as the ellipsoid functions f2 and f3. However, this paper proposes a principle about search directions rather than a full algorithm. For example, unlike CMAES, CPS employs a naive search operation based on the simple halving of a radius. Nonetheless, the proposed logic can be exported to other schemes, and more sophisticated operators, e.g. an adaptive search radius, can improve upon the current performance.

Conclusion
This article provides a proof of principle of a conjecture: the search directions identified by the eigenvectors of the covariance matrix of a population of points filling a basin of attraction are efficient for searching for the local optimum. This idea has been tested on a naive implementation of Pattern Search, which has displayed a major systematic improvement over the standard version of Pattern Search for the problems considered in this study.
The main limitation of the proposed approach is that in order to build the covariance matrix a threshold value has to be set. This value is currently set manually. Our future work will include a protocol for setting this threshold to ensure that a sample of points representing the basin of attraction is detected.
Other future developments will include the extension of the proposed logic to more complex algorithms, such as the Hooke-Jeeves Pattern Search and population-based algorithms. Furthermore, the proposed algorithm will also be tested within frameworks composed of multiple algorithms, such as Memetic Algorithms and Hyper-heuristics.