A Local Search with a Surrogate Assisted Option for Instance Reduction

In data mining, instance reduction is a key data pre-processing step that simplifies and cleans raw data, by either selecting or creating new samples, before applying a learning algorithm. This usually yields a complex, large scale, and computationally expensive optimisation problem which has typically been tackled by sophisticated population-based metaheuristics. Unlike the recent literature, in order to accomplish this target, this article proposes the use of a simple local search algorithm and its integration with an optional surrogate assisted model. This local search, in accordance with variable decomposition techniques for large scale problems, perturbs an n-dimensional vector along the directions identified by its design variables, one by one. Empirical results on 40 small data sets show that, despite its simplicity, the proposed baseline local search on its own is competitive with more complex algorithms representing the state-of-the-art for instance reduction in classification problems. The use of the proposed local surrogate model reduces the number of computationally expensive objective function calls, while achieving test accuracy results that are overall comparable with those of its baseline counterpart.


Introduction
Data science is a discipline that studies methods to store and manage data with the aim of extracting knowledge from it [6]. A typical problem in data science is to have a very large raw data set which requires pre-processing, so that data mining techniques can learn from a more manageable data set that is free of noisy, redundant, or irrelevant samples. To this end, a normal practice consists of selecting some instances and discarding others, or creating artificial samples that better represent the original training data.
However, it is fundamental to properly select or generate those instances. Instance reduction techniques, either selection [7] or generation [28], must still allow the required knowledge to be extracted. In other words, we would like to simplify the original data set while keeping it as informative as the complete data set, or even more informative if noisy data is removed appropriately [14].
Instance reduction can be formulated as an optimisation problem and be addressed by search algorithms. The pure selection of instances can be seen as a binary space search problem [1]. The generation of new representative instances, however, can be expressed as a continuous space search problem. The latter approach turned out to be more flexible, but also more complex [30]. In both cases, Evolutionary Algorithms (EAs) have excelled in comparison with other approaches [7,28]. EAs for instance generation are based on optimising the location of a subset of instances [18,30].
Note that most instance reduction algorithms were originally designed to enhance the performance of the Nearest Neighbour (NN) classifier [5], but the resulting pre-processed data set could be used, in principle, by any classifier [1]. In this work, we focus on instance generation for NN classification, also known as prototype generation.
Two major challenges are associated with the instance reduction problem: the high dimensionality of the problem and the high cost of each objective function evaluation, which typically consists of classifying the training data. The first challenge is addressed by using an exploitative operator which can be embedded within heuristic frameworks. Some examples under the umbrella name of Memetic Algorithms are proposed in [8,9]. A comparison reporting the advantages of the extra local search is given in [20]. In the recent literature, these problems are being addressed by using distributed approaches on big data platforms [33], but population-based approaches still take a long time to pre-process the data. Thus, there is a need for simpler and faster, yet powerful, search algorithms.
This article also explicitly addresses the second challenge by proposing a technique to limit the cost of instance reduction within the optimisation process. More specifically, this article proposes the use of a local search algorithm for large scale problems and a surrogate (approximated) local model to reduce the number of objective function calls. To the best of our knowledge, this is the first local search proposed for instance generation, and the use of surrogate models has often been neglected. The proposed local search samples the points in its neighbourhood and makes use of them to build a multi-variable (local) linear model. The resulting surrogate assisted local search [11,26,25,22] alternates the use of the true objective function with the approximation given by the surrogate model. A mechanism has been implemented to ensure that wrong search directions suggested by the surrogate model do not mislead the search: the algorithm checks the promising points provided by the surrogate model before accepting a new base point.
The remainder of this article is organised as follows. Section 2 formulates instance reduction as an optimisation problem and explains why the problem is unavoidably large scale and why the calculation of the objective function is computationally expensive. Section 3 describes and justifies the proposed method; details about the implementation and the linear regression model are also included. Section 4 presents the experimental results. Finally, Section 5 provides the conclusions of this study.

Problem Formulation
Let TR be a training data set and TS a test set for a supervised classification problem. Both data sets can be viewed as a matrix whose rows are the instances and whose columns are the features. Each instance belongs to a class ω. For the TR set the class ω is known, while it is unknown for TS. The objective of an instance reduction algorithm is to provide a reduced set RS of instances, either selected or generated from the examples of TR, that still allows the representation of the data in TR. RS should be created to efficiently represent the distributions of the classes. The size of RS should be significantly smaller than that of TR, to minimise storage requirements and speed up the subsequent classification phase.
We may, equivalently, represent the matrix RS as a vector x of length n = i × m whose elements are the rows of RS arranged sequentially. The objective function f(x) measures how well the resulting RS exemplifies the original training data TR. To do so, in the literature, RS is inferred using the TR matrix as representative information of the problem, assuming that this will allow us to classify the elements of TS. In particular, this objective function simply calculates the classification accuracy (i.e. the number of correct classifications over the total number of instances classified) using RS as training data and TR as test data.
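To make this encoding concrete, the following is a minimal sketch in Python with NumPy; the helper names encode and decode are ours, not the paper's:

```python
import numpy as np

def encode(RS):
    # Arrange the i rows of the i x m matrix RS sequentially
    # into a vector x of length n = i * m.
    return RS.ravel()

def decode(x, m):
    # Recover the i x m matrix RS from x, given the number of features m.
    return x.reshape(-1, m)
```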

Computational Cost of the Objective Function
The exact computational cost of the objective function depends on the particular classifier being used. Most of the instance reduction literature has focused on improving the well-known NN classifier, because it is one of the classifiers most affected by the size of the training data.
Focusing on the NN rule as the base classifier, calculating the accuracy of RS consists of computing the Euclidean distance between every element of TR and every element of RS, and determining the closest instance in RS for each element of TR. The class label of the closest instance is used as the prediction.
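As an illustration, this objective function can be sketched as follows; this is a minimal sketch assuming NumPy/SciPy, where the function name nn_accuracy and the convention that the class labels of the RS prototypes are kept fixed are our own:

```python
import numpy as np
from scipy.spatial.distance import cdist

def nn_accuracy(RS_X, RS_y, TR_X, TR_y):
    # f(x): accuracy of the 1-NN rule using RS as training data and TR as
    # test data. The |TR| x |RS| distance matrix dominates the cost, which
    # is why every objective function call is expensive for large TR.
    d = cdist(TR_X, RS_X)          # Euclidean distances
    nearest = d.argmin(axis=1)     # closest prototype for each TR instance
    return (RS_y[nearest] == TR_y).mean()
```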
This intuitively shows that the cost of the objective function is very high when the size of TR is large. The complexity of instance reduction models is O((i · m)²) or higher, and the best performing methods are based on EAs [30].
Current research typically focuses on the use of divide-and-conquer approaches, implemented with big data technologies, to parallelise the execution of instance reduction approaches. We can also find an approximation strategy, called windowing [31], which estimates the fitness value of RS using a random subset of TR at every iteration of the search (this reduces the cost significantly, but could mislead the search). However, the use of more sophisticated surrogate models to reduce the number of evaluations of instance reduction algorithms has been neglected.
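A rough sketch of the windowing idea, reusing the nn_accuracy function from the previous sketch; the function name and signature are our own illustration, not the interface of [31]:

```python
import numpy as np

def windowed_fitness(RS_X, RS_y, TR_X, TR_y, window_size, rng=None):
    # Estimate the fitness of RS on a random window of TR, redrawn at each
    # call. Much cheaper than a full evaluation over TR, but the noisy
    # estimate may mislead the search.
    if rng is None:
        rng = np.random.default_rng()
    idx = rng.choice(len(TR_X), size=window_size, replace=False)
    return nn_accuracy(RS_X, RS_y, TR_X[idx], TR_y[idx])
```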

A Local Search for Instance Reduction
This section presents the proposed method, outlines its theoretical and implementation aspects and justifies the choices made. More specifically, Subsection 3.1 presents the structure of the baseline Local Search, Subsection 3.2 describes the multivariable linear model used in this study, Subsection 3.3 outlines the surrogate assisted technique to build and use the surrogate model with the original objective function, and finally, Subsection 3.4 provides a justification of the algorithmic choices made.

Baseline Local Search
The proposed algorithm is based on a greedy local search [13,21] of the family of Pattern Search algorithms [27]. The algorithm perturbs each variable of x one at a time and replaces the current best point with a better one as soon as an improved solution is found. Along the direction identified by each variable, the algorithm attempts to move one step in one orientation and then half a step in the opposite orientation if the first attempt fails. More specifically, the algorithm explores at first

x_t = x − ρ e_i

where the scalar ρ is the step-size (exploratory radius) defined by the user and e_i is the i-th versor, i.e. a vector composed of zeros and only a one in the i-th position. Then, if this exploration fails, the algorithm attempts to explore

x_t = x + (ρ/2) e_i.

Algorithm 1 shows the pseudocode of the baseline Local Search for Instance Reduction (LSIR) used in this study.

Algorithm 1 Baseline Local Search used for Instance Reduction (LSIR)
1: INPUT x
2: while local budget condition do
3:    for i = 1 : n do
4:        x_t = x − ρ e_i
5:        if f(x_t) ≤ f(x) then
6:            x = x_t
7:        else
8:            x_t = x + (ρ/2) e_i
9:            if f(x_t) ≤ f(x) then
10:               x = x_t
11:   if no improvement occurred during the sweep then
12:       ρ = ρ/2
For the experiments carried out in this paper, on the basis of preliminary tests, we employed a toroidal handling of the bounds: ∀i, if x_i falls outside [x_i^low, x_i^up], it is reinserted from the opposite end of the interval by the reassignment

x_i = x_i^low + (x_i − x_i^low) mod (x_i^up − x_i^low),

where the modulo operation is computed as a mod b = a − b⌊a/b⌋, and ⌊·⌋ indicates the truncation to the lower integer.
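Putting Algorithm 1 and the toroidal correction together gives the following compact Python sketch, under our own naming. The default ρ is an assumption; the budget of 100 × n calls and the ρ < 10⁻⁵ stopping rule follow the experimental setup described later; f is treated as a quantity to be minimised (e.g. 1 − accuracy) to match the f(x_t) ≤ f(x) acceptance rule:

```python
import numpy as np

def toroidal(v, lo, hi):
    # Reinsert a value that left [lo, hi] from the opposite end of the
    # interval (Python's % uses floor, i.e. truncation to the lower integer).
    return lo + (v - lo) % (hi - lo)

def lsir(x, f, xl, xu, rho=0.4, budget=None, eps=1e-5):
    # Sketch of the baseline LSIR (Algorithm 1); x is the flattened RS.
    n = len(x)
    budget = budget if budget is not None else 100 * n  # as in the experiments
    fx, calls = f(x), 1
    while calls < budget and rho > eps:
        improved = False
        for i in range(n):
            for step in (-rho, rho / 2.0):   # one step, then half step back
                xt = x.copy()
                xt[i] = toroidal(x[i] + step, xl[i], xu[i])
                fxt = f(xt)
                calls += 1
                if fxt <= fx:                # accept non-worsening moves
                    x, fx, improved = xt, fxt, True
                    break
                if calls >= budget:
                    return x, fx
        if not improved:
            rho /= 2.0                       # halve the exploratory radius
    return x, fx
```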

Linear Multivariable Surrogate Model
In order to approximate the objective function f and generate a surrogate function f̂, a multivariable linear regression with the least square method is implemented, see [10,12]. For the sake of clarity, we build a local surrogate linear model of the form

f̂(x) = c_0 + c_1 x_1 + c_2 x_2 + . . . + c_n x_n.

In order to identify the n + 1 parameters c_0, c_1, c_2, . . . , c_n, the least square method has been applied. The method processes a sample of n + 1 observation vectors x^1, x^2, . . . , x^{n+1} and the corresponding function values y^k = f(x^k). In order to find the parameters c_0, c_1, c_2, . . . , c_n, we have to minimise the following sum of squared residuals:

Δ = Σ_{k=1}^{n+1} (y^k − c_0 − c_1 x_1^k − . . . − c_n x_n^k)².

Thus, we have to calculate the partial derivatives of Δ with respect to c_0, c_1, . . . , c_n. The derivative with respect to c_0 and the derivative with respect to a generic coefficient c_i are, respectively,

∂Δ/∂c_0 = −2 Σ_k (y^k − c_0 − Σ_{j=1}^{n} c_j x_j^k),

∂Δ/∂c_i = −2 Σ_k x_i^k (y^k − c_0 − Σ_{j=1}^{n} c_j x_j^k).

By simultaneously equating the derivatives to 0, we obtain the system of linear equations Lc = ŷ. The solution of this system of linear equations is the set of parameters c which allows the construction of the surrogate model f̂(x).
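Assuming a NumPy implementation, solving this system (equivalently, the least squares problem) can be sketched as follows; the function names are ours:

```python
import numpy as np

def fit_surrogate(X, y):
    # X: (n+1) x n matrix of sampled points (one per row); y: their true
    # objective values. Prepending a column of ones accounts for c_0, and
    # lstsq returns the coefficient vector c = (c_0, c_1, ..., c_n).
    A = np.hstack([np.ones((X.shape[0], 1)), X])
    c, *_ = np.linalg.lstsq(A, y, rcond=None)
    return c

def surrogate(c, x):
    # Evaluate the linear model f_hat(x) = c_0 + c_1*x_1 + ... + c_n*x_n.
    return c[0] + c[1:] @ x
```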

The Proposed Surrogate Local Search for Instance Reduction
With reference to Algorithm 1, each exploration in the for loop samples at least n and at most 2n trial points x_t in the neighbourhood of the current best point x. The proposed Surrogate Assisted Local Search for Instance Reduction (SALSIR) exploits this logic by storing the visited points in a data structure Surr, a list where each entry is a point x and the corresponding value f(x). The data structure Surr is filled until it contains n entries; since the starting point is also inserted in Surr, n + 1 points are then available. These points are used to build a surrogate model f̂(x).
For the remaining function calls, the local search uses the surrogate model f̂(x) instead of the computationally expensive objective function f(x). However, to ensure that wrongly estimated search directions do not jeopardise the functioning of the algorithm, when a solution estimated by the surrogate model outperforms the current best solution, its actual objective function value is checked. This increases the cost of the algorithm and reduces the advantages of the surrogate model [19]; on the other hand, this strategy enhances the reliability of the search.
If the moves fail in all directions, the exploratory radius is halved and the search is repeated in a closer neighbourhood of x. The pseudocode of this algorithm is shown in Algorithm 2. We highlight that the main loop of the algorithm is divided into two parts: in the first, the surrogate model is built, while in the second, the surrogate model is used as an alternative to the objective function; whenever the surrogate suggests a promising point x_t, the true f(x_t) is calculated (to ensure that the surrogate does not mislead the search) and x is updated only if f(x_t) ≤ f(x).
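The interplay between the archive and the verification step can be sketched as follows, under our own naming and reusing the fit_surrogate and surrogate helpers from the previous subsection; this compresses the two phases of Algorithm 2 into a single trial evaluation and is not the paper's full pseudocode:

```python
import numpy as np

def evaluate_trial(xt, x, fx, f, surr_X, surr_y, n):
    # Phase 1: while Surr holds fewer than n + 1 points, use the true
    # objective f and archive the visited point.
    if len(surr_y) < n + 1:
        fxt = f(xt)
        surr_X.append(xt.copy())
        surr_y.append(fxt)
    else:
        # Phase 2: estimate f(x_t) with the linear surrogate; any promising
        # estimate is verified with the true objective so that a wrong
        # estimate cannot mislead the search.
        c = fit_surrogate(np.array(surr_X), np.array(surr_y))
        fxt = surrogate(c, xt)
        if fxt <= fx:
            fxt = f(xt)
    # Accept the trial point only if its (verified) value is not worse.
    return (xt, fxt) if fxt <= fx else (x, fx)
```

In a full implementation the model would be fitted once, when Surr is complete, rather than at every call; refitting here merely keeps the sketch compact.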

Motivation of the Proposed Design
This section justifies the algorithmic choices and, in particular, answers the following two questions.
1. Why did we choose this algorithmic structure for this problem?
2. Why did we choose a multivariable linear model as a surrogate?
To address the first question, we have to consider that the optimisation problem under examination, besides being computationally expensive, is large scale. In data science, it is very likely to have a large volume of data, and the matrix RS above can easily have hundreds if not thousands of rows.
For this reason, we selected a LS component that is especially suited for large problems: it is the main element of the algorithm proposed in [34], was then used as a LS in [36], and has been modified as a stand-alone LS within other frameworks, see e.g. [3,4].
Techniques that perturb the variables separately, like the one used in this article, are known to be effective for large scale problems, see [24,17,15]; this observation was also reported in the experimental study in [2]. Large scale problems are by no means easier than low-dimensional ones. However, since in practice the computational budget cannot grow exponentially with the problem dimensionality, only a very limited portion of the decision space is explored.
Under these experimental conditions, the algorithm "sees" the problem as separable: the average Pearson and Spearman coefficients of the variables approach zero, independently of the problem, as the dimensionality grows, see [2].
This study is one of the reasons behind the decision to use a linear surrogate model (the second question above).

From the perspective of the interaction among variables, a complex model is unnecessary in this highly multivariate domain, since both the objective function and the surrogate model (under the limited budget) would appear separable. The second reason is that, to our knowledge, there are no studies on the fitness landscape of the instance reduction problem. Although many algorithms have been proposed to achieve the reduction of instances, we do not yet know the features of the problem, e.g. how multimodal it is, nor how the landscape depends on the specific data set. Hence, the simple approach of using a local linear model is a natural choice, see [22,16]. In the present paper, we propose a local surrogate model that is designed to work in a limited portion of the decision space by using the neighbouring points visited by the local search algorithm, see [11,37,23,35].

Experimental Study
This section describes the experimental setup and presents the numerical results of our study. Subsection 4.1 provides a description of the experimental framework, while Subsection 4.2 displays, analyses, and interprets the results achieved against a number of algorithms for instance reduction previously proposed in the literature. Finally, Subsection 4.3 analyses the benefits and drawbacks of SALSIR with respect to its baseline counterpart.

Experimental Framework
For the proposed study, in order to test the viability of the use of a local search for instance reduction, we have chosen 40 small data sets from the KEEL data set repository [32] with fewer than 2,000 instances (based on [7,28]). Table 1 outlines the main features of these data sets. For each data set, the total number of examples (#Ex.), the number of attributes (#Atts.), and the number of classes (#ω) are shown. These data sets are partitioned using a ten-fold cross-validation scheme (10-fcv). For each data set, n can be computed as n = 0.9 × #Ex. × #Atts. In order to evaluate the proposed methods, the following two measures have been used:

Classification accuracy: Acc = N_cc / N_I, where N_cc is the number of correct classifications and N_I is the total number of instances classified. The classification is performed by the NN classifier using the resulting RS. The results on the training (TrainAcc) and test (TestAcc) partitions are reported.
Reduction rate: Red = 1 − size(RS)/size(TR), where size(RS) and size(TR) are the sizes of the reduced and training sets, respectively, that is, the number of rows of the two matrices. This index measures the reduction in storage requirements achieved by an instance reduction algorithm.

Various instance reduction methods representing the state-of-the-art have been used for comparison with the proposed LSIR and SALSIR. In this study, we focused on the family of positioning adjustment methods (see [28]), which are the best performing instance reduction methods in the literature and follow a working logic similar to that of the proposed local search algorithms.
In order to compare the methods, we used as a benchmark the NN rule employing the entire TR set for training. In addition, we compared against the entire set of positioning adjustment-based methods reviewed in [28]. We also included two advanced instance reduction algorithms: an incremental Differential Evolution (IPADE) [29] and a hybrid instance selection and instance generation algorithm (SSMA-SFLSDE), which is the current state-of-the-art according to [30]. Hence, 17 algorithms in total are considered in this study.
For the proposed LSIR method and its surrogate variant, the search is started with a random subset of 5% of the rows of TR, as suggested in [28]. Both LSIR and SALSIR are stopped either when ρ < 10⁻⁵ or when 100 × n objective function calls have been performed. Table 2 presents in greater detail the configuration parameters for IPADE, SSMA-SFLSDE, and the proposed methods. For the other comparison methods and their parameters, we used the setup suggested in https://sci2s.ugr.es/pr/pgtax/experimentation.

Results

Table 3 provides the average results of reduction rate, training accuracy, and test accuracy on the 40 data sets used in this paper. For each type of result, the algorithms are ranked from the best to the worst. The NN algorithm is highlighted in bold as the benchmark method. Table 3 shows that the proposed LSIR and SALSIR achieve the best and the third best training accuracy results, and are ranked fourth and fifth in terms of test accuracy. The reduction rates of LSIR and SALSIR are comparable with those of the other methods that use a reduction rate parameter of 5%. It must be remarked that, despite the low number of evaluations and a simple local search strategy, the proposed LSIR algorithm provides the highest training accuracy. This indicates that LSIR may be overfitting the training data sets. Particularly interesting is the comparison with PSO, which also starts from the same random set of instances (5%). PSO does not seem to find an RS that fits the training data as well; this turns out to be in its favour, as it reduces the overfitting of the training data and provides a higher test result. This may suggest that an even lower number of evaluations could prevent our algorithm from overfitting the data.

Table 4 presents the average test classification accuracy results (from the 10-fcv) for the proposed methods and the NN rule. The best result for each data set is highlighted in bold face. We can observe that for the majority of the data sets (29 out of 40), both proposed methods outperform the benchmark NN. This means that the methods are not only able to reduce the size of the training data by 95%, but are also able to improve the performance of the NN classifier. In the remaining cases, the data reduction process may deteriorate the performance of the NN algorithm (e.g. on the aut data set). This may be due to overfitting of the training data, or to cases where the (random) original selection of instances per class was not suitable for these data sets. Thus, our local search strategy could benefit from a preliminary instance selection step before optimising the location of the instances, as proposed in [30].
In order to understand the significance of the provided results, we applied the Wilcoxon test to establish a fair comparison with the state-of-the-art. Table 5 displays the result of this comparison. Table 5: Summary of the Wilcoxon test. The symbol • indicates that the method in the row outperforms the method in the column; the symbol ◦ indicates that the method in the column outperforms the method in the row. The upper and lower diagonals correspond to significance levels 0.9 and 0.95, respectively. Table 5 highlights that the hybrid SSMA-SFLSDE algorithm remains the best algorithm, outperforming all the other methods. However, it should be remarked that SSMA-SFLSDE is composed of two population-based metaheuristics (a binary search to select relevant instances and an adjustment of the positions based on differential evolution). The selection of an appropriate number of instances per class is a well-known issue for instance generation techniques [29,30], and the instance selection mechanism of SSMA-SFLSDE helps it reduce overfitting and improve test accuracy. In this preliminary study, LSIR and SALSIR are naively used without a careful selection of instances per class.
More generally, LSIR and SALSIR are remarkably simpler than all the metaheuristic-based algorithms used in this study, and perform only a local search of the decision space. Despite these limitations, we observe that the proposed methods are competitive with far more complex population-based metaheuristics such as PSO and IPADE. This study can be viewed as a stepping stone towards the generation of a hybrid algorithm that employs LSIR and SALSIR.

The Effect of the Surrogate
Regarding the performance of LSIR and SALSIR, we should make two considerations. On the one hand, Table 3 shows that LSIR appears to outperform SALSIR on both test and training accuracy. On the other hand, the training results suggest that the surrogate variant suffers from overfitting less than its baseline counterpart. However, the Wilcoxon test finds significant differences between the two algorithms (in the test phase) at a significance level of α = 0.9, with LSIR tending to outperform SALSIR. As an example of this fact, Fig. 1 shows the convergence plot on a single partition of the Bupa data set. We can observe that both algorithms progress steadily, but LSIR marginally outperforms SALSIR. This result was expected, since a surrogate assisted algorithm often performs somewhat worse than its counterpart using only the true objective function, see e.g. [11,19].
From the perspective of the computational saving, in the case depicted in Fig. 1, SALSIR saved 1991 evaluations with respect to LSIR. Since the purpose of a surrogate assisted algorithm is to reduce the number of objective function calls, Fig. 2 reports a histogram displaying the total number of evaluations and the number of evaluations saved by the surrogate variant. On average, around 15% of the evaluations have been saved.

Conclusion
This paper proposed a local search algorithm for addressing the large scale challenges posed by instance reduction problems. The proposed local search is also endowed with a local surrogate model to mitigate the computational cost generated by objective function calls. The proposed local search algorithm can potentially be used within optimisation frameworks such as portfolios, hyper-heuristics, and memetic algorithms. Numerical results indicate that the use of local search algorithms is a promising subfield of optimisation for addressing instance reduction problems. The proposed local search, despite its algorithmic simplicity, outperformed numerous classical algorithms for instance reduction and is competitive with sophisticated population-based metaheuristics representing the state-of-the-art. The comparison between the versions with and without the surrogate assisted model shows that the proposed surrogate design allows for an approximately 15% saving in the number of objective function calls, with a relatively small loss in accuracy. As future work, we will investigate the integration of the proposed local search within advanced instance reduction algorithms to address larger classification problems.