A Self-adaptive Multi-objective Feature Selection Approach for Classiﬁcation Problems

In classification tasks, feature selection (FS) can reduce the data dimensionality and may also improve classification accuracy, both of which are commonly treated as the two objectives in FS problems. Many meta-heuristic algorithms have been applied to solve FS problems, and they perform satisfactorily when the problem is relatively simple. However, once the dimensionality of the dataset grows, their performance drops dramatically. This paper proposes a self-adaptive multi-objective genetic algorithm (SaMOGA) for FS, which is designed to maintain a high performance even when the dimensionality of the dataset grows. The main concept of SaMOGA lies in the dynamic selection among five different crossover operators at different stages of the evolution process by applying a self-adaptive mechanism. Meanwhile, a search stagnation detection mechanism is also proposed to prevent premature convergence. In the experiments, we compare SaMOGA with five multi-objective FS algorithms on sixteen datasets. According to the experimental results, SaMOGA yields a set of well-converged and well-distributed solutions on most datasets, indicating that SaMOGA can guarantee classification performance while removing many features, and its advantage over its counterparts becomes more obvious as the dimensionality of the datasets grows.


Introduction
As a crucial branch of machine learning, classification has received great attention [1,2,3]. The models used for classification are often referred to as classifiers [4,5]. An instance, which serves as the input of a classifier, is composed of a set of features and its label. To better solve classification problems, a large number of features are often included; however, most of them are irrelevant or redundant in many cases, which may reduce classification accuracy and increase model complexity [6,7,8]. Classification is a fundamental task in machine learning [9,10] that enables a range of applications, for example in medicine [11,12,13,14], with numerous applications in diagnostics based on electroencephalogram [15,16,17]. Other studies, more broadly, focus on neurobiology [18,19].

1 Corresponding Author: F. Neri, School of Computer Science, Jubilee Campus, Wollaton Road, Nottingham NG8 1BB, UK; Email: ferrante.neri@nottingham.ac.uk. Y. Xue and H. Zhu equally contributed to this work and should be considered co-first authors.
In the early days, researchers either optimized only one objective, i.e., classification accuracy, or aggregated multiple objectives into a single objective for optimization [20]. These practices often cause weaknesses such as inferior optimization results. Recently, FS has been considered as a multi-objective optimization problem (MOP), see [21,22]. Meanwhile, metaheuristic techniques have become popular and have been applied in many fields [23,24,25,26,27]. Metaheuristic techniques include, for example, evolutionary computation (EC) techniques [28] such as genetic algorithms (GA) [29], ant colony optimization [30,36,37], particle swarm optimization (PSO) [31,32,33], differential evolution [34], artificial bee colony [35], pattern search [38], etc. The main reasons for their popularity are: (i) no prior knowledge of the problem is needed; (ii) the population-based search approach is capable of obtaining multiple solutions in a single run; (iii) they specialize in global search. Benefiting from these advantages, there have been many attempts to solve FS with EC techniques [39,40]. Among these methods, GA is often used due to its simple encoding scheme, excellent generalization and global search strategy [41].
Generally, most of the multi-objective FS methods based on GA use the non-dominated sorting genetic algorithm II (NSGA-II) [42,43]. However, most existing studies only use the original algorithm to solve FS, or merely study the population size, crossover probability and mutation probability of the algorithm. In particular, only a single crossover operator with a single mutation operator is usually used in the search process. As the number of features (NF) increases, the search space grows at a tremendous rate [44,45,46]. For instance, if NF is n, the number of all possible feature subsets is 2^n − 1. If only a single crossover operator is used for the entire evolutionary process, the superior genes of the parents may not be retained in the offspring in some cases, which can cause the algorithm to stall and end up in a local optimum. Indeed, the selection, crossover and mutation operations in genetic algorithms all play a very important role, and all three contribute to the final optimization result. In this paper, we focus only on the crossover operation and try to improve its behavior to obtain better algorithm performance.
In this work, we treat FS as a MOP, where the classification error and the solution size are considered as the two objective functions. To solve it, a self-adaptive multi-objective genetic algorithm (SaMOGA) is proposed. Different from previous studies based on NSGA-II for FS, in SaMOGA, five crossover operators with different search characteristics work in conjunction with one mutation operator by applying a self-adaptive mechanism. Concretely speaking, the performance of each crossover operator in the evolutionary process is recorded, and at different evolutionary stages, the self-adaptive mechanism selects the currently preferred crossover operator for the crossover operation based on their previous performance, followed by the mutation operation. Besides, a search stagnation detection mechanism (SSDM) is also proposed to detect search stagnation, with which the exploration can be carried out more effectively. To verify the efficiency of SaMOGA, we conducted experiments on sixteen datasets varying widely in dimensionality, and the obtained results are compared with five multi-objective FS algorithms.
The contributions of this work are listed as follows:
• The most appropriate crossover operator is applied at different stages using a self-adaptive mechanism.
• The self-adaptive mechanism and SSDM are combined to improve search capabilities.

Related Work
In this section, the concepts of multi-objective optimization and NSGA-II are first introduced. Subsequently, a brief review of FS approaches using multi-objective GAs is provided.

Multi-objective Optimization Problems
Some problems have multiple objectives to be optimized [47], and these objectives are often contradictory to each other; in other words, they cannot all reach their best values at the same time. Such problems are called MOPs [48,49,50,51]. A general multi-objective minimization problem can be expressed as follows:

min F(X) = (f_1(X), f_2(X), ..., f_m(X)), subject to X ∈ Ω, (1)

where X = (x_1, x_2, ..., x_n) is a solution to the problem that falls within the search space Ω, x_i denotes the i-th decision variable, and F(X) represents the vector of m objective functions, see [52,53,54].

NSGA-II
NSGA-II is one of the most classic evolutionary algorithms and is an improved version of the traditional genetic algorithm for tackling MOPs [55]. It incorporates the Pareto dominance relationship and the crowding distance mechanism to drive the evolution of the population so that a set of trade-off solutions, also named the Pareto front (PF), can be obtained. The key flow of the algorithm is briefly described as follows. Assume that the size of the parent population P_t in the t-th iteration is N. After undergoing the genetic operations, i.e., crossover and mutation, N offspring are generated that make up the offspring population Q_t. Then, these solutions are combined to form the combined population R_t, and the next generation of individuals is selected from R_t according to the following steps. Firstly, individuals in R_t are classified into multiple hierarchies by applying the non-dominated sorting procedure. Then, individuals are selected into the next generation in hierarchical order. As can be seen from Figure 1, individuals in H_1 - H_3 have been added to the next generation. When H_4 is about to be added, the size of the population would exceed the predetermined one, in which case the individuals with the larger crowding distance have priority in entering the next generation P_{t+1}.
The non-dominated sorting and crowding distance calculation are summarized as follows:

Non-dominated sorting
Firstly, for each solution p in the population, two properties are calculated: the domination count n_p, i.e., the number of solutions which dominate solution p, and S_p, i.e., the set of solutions dominated by solution p. Since the domination counts of the solutions in the first non-dominated front equal 0, for each solution p with n_p = 0, each member q of its S_p is visited and its domination count is reduced by 1; if the domination count of q becomes 0, it is put in a list Q. Thus, Q contains the solutions belonging to the second non-dominated front. After that, the process is repeated for the solutions in Q until the third front is found, and it continues until all fronts are found.
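The procedure above can be sketched in Python as follows (a minimal illustration, not the authors' Matlab implementation; `objs` is a list of objective tuples under minimization):

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def fast_non_dominated_sort(objs):
    """Return a list of fronts (lists of indices into objs), first front first."""
    n = len(objs)
    S = [[] for _ in range(n)]   # S[p]: solutions dominated by p
    n_dom = [0] * n              # n_dom[p]: how many solutions dominate p
    fronts = [[]]
    for p in range(n):
        for q in range(n):
            if dominates(objs[p], objs[q]):
                S[p].append(q)
            elif dominates(objs[q], objs[p]):
                n_dom[p] += 1
        if n_dom[p] == 0:
            fronts[0].append(p)
    i = 0
    while fronts[i]:
        nxt = []
        for p in fronts[i]:
            for q in S[p]:          # remove p's influence, collect the next front
                n_dom[q] -= 1
                if n_dom[q] == 0:
                    nxt.append(q)
        fronts.append(nxt)
        i += 1
    fronts.pop()                    # the last front is always empty
    return fronts
```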

Crowding distance calculation
To begin with, the population is sorted according to each objective function value in ascending order. Then, for each objective function, the boundary solutions are assigned an infinite distance value, and each intermediate solution is assigned a distance value equal to the absolute normalized difference of its two adjacent solutions. This calculation is repeated for the other objective functions. The crowding distance of a solution is the sum of its individual distances over all objectives.
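A minimal Python sketch of this calculation (our own illustration; `objs` holds the objective tuples of one front):

```python
def crowding_distance(objs):
    """Crowding distance for one front; objs is a list of objective tuples."""
    n = len(objs)
    m = len(objs[0])
    dist = [0.0] * n
    for k in range(m):
        # sort the front by the k-th objective
        order = sorted(range(n), key=lambda i: objs[i][k])
        lo, hi = objs[order[0]][k], objs[order[-1]][k]
        # boundary solutions get an infinite distance
        dist[order[0]] = dist[order[-1]] = float("inf")
        if hi == lo:
            continue
        # intermediate solutions: normalized gap between their neighbours
        for j in range(1, n - 1):
            dist[order[j]] += (objs[order[j + 1]][k] - objs[order[j - 1]][k]) / (hi - lo)
    return dist
```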

Multi-objective GAs for FS
FS is now commonly considered as a MOP in many works, and GAs are highly preferred to solve it in these studies. In [56], the fault classification error is minimized, and the accuracy of dissolved gas analysis and diagnosis of power transformers is improved by selecting the optimal feature subset and the optimal feature number. Labani et al. [57] considered the relevance of the text features to the target class and the correlation between the features as two objectives in text FS, and proposed a multi-objective algorithm named MORDC. Karasu and Saraç [58] used NSGA-II to find the optimal solutions for two different fitness functions, i.e., NF and classification accuracy. In [59], NSGA-II is used to obtain a set of Pareto-optimal solutions in different pattern recognition domains, with the number of used features and the classification error set as the two objectives, see [60]. Das et al. [61] proposed a multi-objective GA with a mutation pool to solve the FS problem; its two objective functions are based on rough set theory and multivariate mutual information so that the most precise and informative feature subsets can be obtained. In [62], the application of MOGA to FS based on different filter importance criteria is evaluated. Bouraoui et al. [63] introduced a novel approach to simultaneously optimize the kernel function, its parameters, the SVM parameters and FS for SVM classification based on NSGA-II. These works have verified the feasibility of using MOGA to solve FS and have achieved relatively decent results, yet most of them simply applied the NSGA-II framework without making any improvement. Therefore, this paper aims at improving NSGA-II for solving FS problems, so that the improved algorithm performs better at both improving the classification performance and cutting the features.

The Self-adaptive Multi-objective Genetic Algorithm (SaMOGA)

This section describes the main framework of SaMOGA, as well as the details of some important procedures. The multi-objective FS algorithm proposed in this study, i.e., SaMOGA, embeds the self-adaptive mechanism and SSDM to improve the performance of NSGA-II for solving FS problems. The flowchart of SaMOGA is shown in Figure 2. Meanwhile, Algorithm 1 demonstrates the pseudo-code of SaMOGA. Firstly, the vectors R and P, used to record the performance of the different operators in the crossover operator set, as well as N discrete-encoded individuals, are initialized. Then, the probability corresponding to each crossover operator is initialized (see lines 1-3 of Algorithm 1). Assuming Q crossover operators are used, each of them is assigned a probability of 1/Q. After that, the following procedures keep repeating until the stopping criterion is satisfied. It is worth noting that the second, fifth and last steps are where SaMOGA distinguishes itself from the standard genetic algorithm in terms of innovation.
• Retention of the PF of the current population. The SSDM proposed in this paper is based on the change of the PF between two adjacent generations during the evolution process, so before evolving the current population, its PF is stored in PF Lg. Meanwhile, the stagnation marker stagFlag is initialized to 0; 0 means the search is not stagnated, while 1 means the opposite (see lines 8-9 of Algorithm 1).
• Crossover operation. Based on the current probability assigned to each crossover operator, one of them is first selected using roulette wheel selection. Then, two parent individuals are randomly selected from the population Pop and the crossover operation is performed using the selected crossover operator. Note that individuals already selected are not selected again in the current generation (see lines 11-13 of Algorithm 1).
• Mutation operation. The uniform mutation operator is applied to the two offspring individuals generated by the crossover operation (see line 14 of Algorithm 1).
• Fitness evaluation. The two obtained offspring individuals are evaluated for fitness. Then, the number of fitness evaluations is increased by 2 (see line 15 of Algorithm 1).
• Operator evaluation. The purpose of this step is to evaluate the crossover operator used in this evolution step by comparing the two generated offspring with their parents (see line 16 of Algorithm 1). If the generated offspring are decent, the crossover operator is rewarded; otherwise a penalty is given. Both the reward and the penalty serve to update the probability assigned to the corresponding crossover operator. The specific procedures are illustrated in Section 3.6.

Chromosome Encoding
GAs encode a solution to the problem as a vector, also known as a chromosome. In the application to FS, there are roughly two general encoding schemes [64], i.e., the continuous encoding scheme and the discrete encoding scheme. Since the former often requires a threshold value for conversion, which is difficult to determine, in this paper we adopt the discrete encoding scheme to represent a chromosome. The length of each chromosome equals NF in the dataset and each locus takes the value 0 or 1, where 0 means the feature is unselected and 1 means the feature is selected. Suppose that six features are included in a dataset; if the coded chromosome is 101000, it denotes that the first and third features are selected while the remaining ones are unselected.
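The encoding and decoding steps can be illustrated with a short Python sketch (our own illustration of the scheme described above; the function names are ours):

```python
import random

def random_chromosome(num_features, rng=random):
    """A chromosome is a bit list: 1 = feature selected, 0 = unselected."""
    return [rng.randint(0, 1) for _ in range(num_features)]

def decode(chromosome):
    """Return the (0-based) indices of the selected features."""
    return [i for i, bit in enumerate(chromosome) if bit == 1]
```

For the example in the text, the chromosome 101000 decodes to the first and third features (indices 0 and 2).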

Objective Function
The objective function is the basis for evaluating the goodness of an individual. In this study, we use the wrapper-based FS approach, so the classification error is used as one of the objective functions, while NF, i.e., the solution size, is used as the second objective function in order to remove useless features.

Classification error
By calculating the quotient of the misclassified instances and the total instances, the classification error can be obtained as follows:

f_1(X) = (1/h) Σ_{i=1}^{h} N_Err^i / N_All^i, (2)

where X is a solution, h is the parameter of h-fold cross-validation, and N_Err^i and N_All^i are the number of misclassified instances and of all instances in the i-th fold, respectively.
The second objective, the solution size, counts the selected features:

f_2(X) = Σ_{i=1}^{D} x_i, (3)

where x_i denotes the i-th decision variable and D denotes the number of raw features.
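A hedged Python sketch of the two objectives (the classifier is abstracted behind a placeholder `error_fn`, since the paper's wrapper uses k-NN with h-fold cross-validation; `folds` is a list of (train, test) splits):

```python
def solution_size(chromosome):
    """Second objective: number of selected features (smaller is better)."""
    return sum(chromosome)

def classification_error(chromosome, folds, error_fn):
    """First objective: mean misclassification rate over the h folds.

    `error_fn(selected, train, test)` is a placeholder for the wrapper
    classifier; it must return (n_err, n_all) for the given fold.
    """
    selected = [i for i, bit in enumerate(chromosome) if bit]
    total = 0.0
    for train, test in folds:
        n_err, n_all = error_fn(selected, train, test)
        total += n_err / n_all
    return total / len(folds)
```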

Crossover Operator Set
Offspring inherit genes from the parents through crossover operations, and promising genes are highly expected to be inherited. Thus, the selection of the appropriate crossover operator is critical to the performance of the offspring in terms of fitness values. However, picking out the appropriate crossover operator is time-consuming. In this paper, we intend to apply crossover operators with different properties to compensate for this deficiency, so five popular and commonly used crossover operators, i.e., the single-point crossover operator [65], two-point crossover operator [66], uniform crossover operator [67], shuffle crossover operator [68] and reduced surrogate crossover operator [69], are adopted to form a crossover operator set for the crossover operation. A detailed description of them can be found in the referred papers. By applying different crossover operators at different stages of the evolution process, superior genes in the parents have a greater chance of being passed on to the offspring, resulting in better performance of the offspring individuals.
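For illustration, three of the five operators can be sketched on bit-list chromosomes as follows (our own minimal Python versions; the shuffle and reduced surrogate operators are omitted for brevity):

```python
import random

def single_point(p1, p2, rng=random):
    """Swap the tails of the two parents after one random cut point."""
    cut = rng.randrange(1, len(p1))
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def two_point(p1, p2, rng=random):
    """Swap the middle segment between two random cut points."""
    a, b = sorted(rng.sample(range(1, len(p1)), 2))
    return (p1[:a] + p2[a:b] + p1[b:],
            p2[:a] + p1[a:b] + p2[b:])

def uniform(p1, p2, rng=random):
    """Swap each gene independently with probability 0.5."""
    c1, c2 = [], []
    for g1, g2 in zip(p1, p2):
        if rng.random() < 0.5:
            g1, g2 = g2, g1
        c1.append(g1)
        c2.append(g2)
    return c1, c2
```

All three preserve, at every locus, the pair of genes carried by the parents; they differ only in which loci are exchanged.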

Search Stagnation Detection Mechanism
In the process of searching for a set of better solutions, search stagnation frequently occurs, i.e., the PF is not updated after one generation of evolution, which affects the convergence rate of the population. Therefore, the SSDM is proposed, which detects the change between the PF obtained in the previous generation and the PF obtained in the current generation. Taking a two-objective minimization problem as an example, four different scenarios exist in total, and specific examples are shown in Figure 3.
With each generation of the population, we expect the PF to move in the direction of the coordinate origin, or to complement itself. Figures 3(a)-3(c) show cases where offspring solutions push the last PF toward the origin or complement the solutions on it; these three cases are favorable since the PF is updating. When it comes to Figure 3(d), the last PF has neither moved nor been complemented, so it can be assumed that the search in this generation is stagnant. As mentioned in Section 3.1, SSDM is used as a complementary technique to the update of SPs, mainly because it is normal for the PF to remain unchanged for several generations in a late stage of population evolution, and using this criterion alone to determine whether the search is stagnated can disrupt the search direction of the population, leading to a reduction in the speed of convergence as well as in the likelihood of finding the global optimum.
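A possible Python sketch of the stagnation check (our interpretation of the four scenarios; `last_pf` and `current_pf` are lists of objective tuples under minimization):

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def search_stagnated(last_pf, current_pf):
    """stagFlag-style check: 1 if the PF neither moved toward the origin
    nor gained complementary points, else 0."""
    # the PF moved: some current point dominates a point of the last PF
    moved = any(dominates(c, l) for c in current_pf for l in last_pf)
    # the PF was complemented: a new, non-dominated point appeared
    complemented = any(c not in last_pf and
                       not any(dominates(l, c) for l in last_pf)
                       for c in current_pf)
    return 0 if (moved or complemented) else 1
```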

Crossover Operators Evaluation
The multiple crossovers are handled in an ensemble fashion [70] by means of a success-based adaptation similar to that of hyperheuristics [71].
Assume that Q operators are included in the crossover operator set, and the q-th one is selected for the crossover operation. After the crossover and mutation operations, the selected crossover operator is evaluated based on the performance of the children: it is rewarded if the produced children are promising, otherwise a penalty is given. We therefore employ two vectors for this purpose, defined as follows:

R = (R_1, R_2, ..., R_Q), P = (P_1, P_2, ..., P_Q), (4)

where R and P are initialized with all elements at zero.

Algorithm 2 crossoverOperatorEvaluation(Par, Ch, q)
Input: Par: list of two parents; Ch: list of two children; q: index of the selected crossover operator
Output: R, P
1: Compare the Pareto dominance relationship of the two parents
2: if one dominates the other, assume Par_1 ≺ Par_2 then
3:   for i = 1 to 2 do
4:     if Par_1 ≺ Ch_i then
5:       P_q ← P_q + 1
6:     else
7:       R_q ← R_q + 1
8:     end if
9:   end for
10: else
11:   for i = 1 to 2 do
12:     if Par_1 ⊀ Ch_i && Par_2 ⊀ Ch_i then
13:       R_q ← R_q + 1
14:     else
15:       P_q ← P_q + 1
16:     end if
17:   end for
18: end if

Based on the relationship in the objective space between the children generated by the crossover operator and the corresponding parents, we record information on the use of the crossover operator using the Pareto dominance relationship between them. The two possible scenarios are described below.
Scenario 1: One of the parents dominates the other. In this case, the two children merely have to be compared, in terms of Pareto dominance, with the superior parent. If a child is not dominated by it, R_q is increased by 1, otherwise P_q is increased by 1 (see lines 3-9 of Algorithm 2). Scenario 2: The two parents do not dominate each other. Slightly differently from the first case, the two children need to be compared with both parents. If a child is not dominated by either of the two parents, R_q is increased by 1, otherwise P_q is increased by 1 (see lines 11-17 of Algorithm 2).
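Algorithm 2 can be sketched in Python as follows (a minimal translation; `R` and `P` are the reward and penalty lists, `q` the operator index, and parents/children are objective tuples under minimization):

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def evaluate_operator(parents, children, q, R, P):
    """Reward/penalty bookkeeping for crossover operator q (Algorithm 2)."""
    p1, p2 = parents
    if dominates(p2, p1):
        p1, p2 = p2, p1          # make p1 the superior parent, if any
    if dominates(p1, p2):
        # Scenario 1: compare the children against the superior parent only
        for ch in children:
            if dominates(p1, ch):
                P[q] += 1
            else:
                R[q] += 1
    else:
        # Scenario 2: the parents are mutually non-dominated
        for ch in children:
            if not dominates(p1, ch) and not dominates(p2, ch):
                R[q] += 1
            else:
                P[q] += 1
```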
After a certain number of fitness evaluations, R and P are reinitialized. The reward and penalty recording is shown in Algorithm 2; the resulting update of the selection probabilities is shown in Algorithm 3.

Algorithm 3 Update of selection probabilities
1: if one or more elements of Pro equal 0 && stagFlag = 1 then
2:   Reinitialize Pro
3: else
4:   for q = 1 to Q do
5:     if R_q = 0 then
6:       R_q ← ε
7:     end if
8:     Update the probability of the operator q using Eq. 5
9:   end for
10:   Normalize the probabilities for all operators using Eq. 6
11: end if

Update of Selection Probabilities Assigned to Crossover Operators
After the preset number of fitness evaluations is reached, the SPs are updated based on R, P and stagFlag, which serves as a symbol indicating whether the PF of two adjacent generations has changed. For easy implementation, the number of fitness evaluations between SP updates, C, is set as a multiple of the population size: if the population size is N, then C = c × N, where c is a pre-defined constant. The specific steps are shown in Algorithm 3. First, in order to avoid the search stagnation caused by a certain crossover operator never having a chance to be selected, it is necessary to determine whether one or more elements of the vector Pro are 0 while stagFlag is 1. If so, Pro is reinitialized in the same way as mentioned before (see lines 1-2 of Algorithm 3). Otherwise, the SP of the q-th crossover operator is calculated as follows:

SP_q = R_q / (R_q + P_q), (5)

i.e., the SP of the q-th operator is calculated by dividing the reward R_q by the sum of the reward R_q and the penalty P_q. However, operator q may not be selected even once during the uFE fitness evaluations, resulting in both R_q and P_q being 0. Therefore, a very small positive number ε is assigned to R_q to avoid a division by zero.
Finally, the SP assigned to operator q is normalized as follows:

Pro_q = SP_q / Σ_{j=1}^{Q} SP_j. (6)
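Eqs. 5 and 6, together with the roulette wheel selection mentioned in Section 3.1, can be sketched as follows (a minimal Python illustration; the concrete ε value is an assumption):

```python
import random

def update_probabilities(R, P, eps=1e-6):
    """Eq. 5: SP_q = R_q / (R_q + P_q), with R_q := eps when R_q = 0;
    Eq. 6: normalize so the probabilities sum to 1."""
    sp = []
    for r, p in zip(R, P):
        r = r if r > 0 else eps
        sp.append(r / (r + p))
    s = sum(sp)
    return [v / s for v in sp]

def roulette_select(pro, rng=random):
    """Pick an operator index with probability proportional to pro."""
    u, acc = rng.random(), 0.0
    for q, v in enumerate(pro):
        acc += v
        if u <= acc:
            return q
    return len(pro) - 1
```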

Experimental Settings
In this section, the experimental datasets and classifier are first introduced, followed by the comparison algorithms and the setting of the relevant parameters; the metrics for evaluating the experimental results are described last. The proposed algorithm and its counterparts are implemented in the Matlab language. All the experiments are conducted on an Intel Core (TM) i5-9500 CPU with 8 GB of RAM and 1 TB of hard disk.

Datasets and Classifiers
Eighteen datasets are used to train and test the proposed algorithm and its counterparts. Full descriptions of them are available in the UCI Machine Learning Repository [72]. These datasets are composed of different numbers of features (NF in Table 1), labels (NL in Table 1) and instances (NI in Table 1). All datasets are split at random into training sets (70% of the raw datasets) and test sets (30% of the raw datasets). As can be seen from Table 1, the number of features ranges from 30 to 1300. In addition, the number of classes varies from 2 to 26 and the number of instances varies from 32 to 1080. Thus, the ability of the algorithms to solve the feature selection problem can be effectively evaluated with these comprehensive and complex datasets.
A learning algorithm is usually required in wrapper-based feature selection methods to evaluate the classification performance of a feature subset. In fact, many classifiers can be used for this purpose. In this paper, k-NN is applied due to its simplicity and promising classification performance. Since the division of the datasets has an impact on the generalization of the model, three-fold cross-validation is introduced to reduce the over-fitting and under-fitting problems [73].
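For illustration, a wrapper evaluation with a tiny k-NN and h-fold cross-validation might look as follows (a self-contained Python sketch, not the authors' Matlab code; the stride-based fold split is our simplification):

```python
from collections import Counter

def knn_predict(train_X, train_y, x, k=1):
    """Majority vote among the k nearest training instances (squared Euclidean)."""
    order = sorted(range(len(train_X)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]

def cv_error(X, y, selected, k=1, folds=3):
    """h-fold cross-validation error of k-NN on the selected feature subset."""
    Xs = [[row[i] for i in selected] for row in X]  # keep only selected features
    n, err = len(Xs), 0
    for f in range(folds):
        test_idx = set(range(f, n, folds))          # simple stride-based split
        tr_X = [Xs[i] for i in range(n) if i not in test_idx]
        tr_y = [y[i] for i in range(n) if i not in test_idx]
        for i in test_idx:
            if knn_predict(tr_X, tr_y, Xs[i], k) != y[i]:
                err += 1
    return err / n
```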

Comparison Algorithms and Parameter Setting
Five multi-objective algorithms for FS are adopted for comparison: NSGA-II [55], NSPSOFS [74], CMDPSOFS [74], SPEA2 [75] and MOEA/D [76]. All the algorithms have been implemented in Matlab by the authors on the basis of their original descriptions [77]. Among them, NSGA-II is one of the most classic MOEAs, and SaMOGA is an improvement on NSGA-II. NSPSOFS and CMDPSOFS are two multi-objective PSO algorithms proposed for feature selection. SPEA2 is based on the Pareto dominance relationship for fitness assignment and selection, and it applies the niche method and an elite mechanism; SPEA2 is known for its strong convergence ability. MOEA/D introduces a decomposition strategy into the evolutionary algorithm framework, dividing the multi-objective optimization problem into a set of single-objective sub-problems, which are then optimized simultaneously. For each algorithm, 30 independent runs are performed on each dataset, and all algorithms run the same number of fitness evaluations (nFE). The parameter setting of SaMOGA is listed in Table 2. The parameter values of the benchmark algorithms are set as in their referred papers.

Performance Metrics
Since a set of non-dominated solutions is obtained in the final results, this paper uses two popular performance metrics to evaluate them, namely the inverted generational distance (IGD) [78] and the hypervolume (HV) [79,80,81,82]. Generally speaking, the smaller the IGD values are, the better an algorithm performs, while the opposite holds for HV. Notably, both metrics evaluate the convergence as well as the diversity of the obtained Pareto solution sets.

Results
SaMOGA is compared with the five comparison algorithms in three aspects in this section, i.e., the IGD and HV metrics, the best PFs evolved on the training sets and the average PFs obtained on the test sets. IGD and HV are introduced to evaluate the convergence and diversity of the results achieved by these six algorithms on both the training and the test sets. Moreover, the Wilcoxon test [83] with a confidence level of 95% is also applied to the IGD and HV metrics to check whether the performance of SaMOGA is significantly different from that of its counterparts. Since the true Pareto front is unknown, we first combine all the solutions obtained by the algorithms and take the non-dominated solutions among them as the reference Pareto front to compute the IGD and HV values.
To perform the experiments over all the datasets above, SaMOGA required approximately six hours.

Analysis of IGD and HV
Table 3 - Table 6 show the mean values and standard deviations achieved by all comparison algorithms in terms of IGD and HV on the training and test sets. The best mean value for each dataset is presented in bold. In addition, the symbol "W" (which stands for Wilcoxon) indicates whether there is a significant difference between the compared algorithms and SaMOGA: "+" or "-" indicates that SaMOGA is significantly better or worse than the compared algorithm, while "=" means there is no significant difference between them.
From Table 3, it can be observed that SaMOGA obtains the minimum IGD values on sixteen out of eighteen training sets. It also achieves much smaller values than the other algorithms on DS08 - DS18. Note that a common characteristic of these datasets is the large number of features. For DS08, NSGA-II and NSPSOFS achieve similar results and are worse than the other four algorithms. Meanwhile, SaMOGA is slightly superior to SPEA2, followed by CMDPSOFS and MOEA/D. The case is similar for DS09. However, CMDPSOFS is more promising than SPEA2 on DS10 and DS11. As the NF of the dataset increases, SaMOGA is always in first place, while CMDPSOFS and SPEA2 achieve similar IGD values and are in second and third place.

In terms of IGD values on the test sets, as presented in Table 4, SaMOGA still obtains the minimum IGD value among all compared algorithms, and it is also significantly better than all compared algorithms on DS08 - DS18. On DS01, NSPSOFS achieves the smallest IGD value and is significantly better than SaMOGA. On DS02, MOEA/D obtains the best result, and SaMOGA achieves results similar to NSGA-II, NSPSOFS, CMDPSOFS and SPEA2. CMDPSOFS performs well on DS05 and DS06. It can be observed that SaMOGA is significantly better than CMDPSOFS and SPEA2 on the training sets of DS06 and DS07, but is not significantly different from them on the test sets of DS06 and DS07.
From Table 5, SaMOGA obtains the maximum HV values on fourteen out of sixteen training sets. SaMOGA performs well on DS08 - DS18. CMDPSOFS and SPEA2 are ranked in second and third place on DS12 - DS16, followed by MOEA/D, NSGA-II and NSPSOFS. On DS01, DS02 and DS05, SPEA2 achieves the biggest HV values among all comparison algorithms. Although SaMOGA achieves the biggest HV values on DS03 and DS04, it is not significantly different from SPEA2. From DS01 - DS02, we can observe that SPEA2 is very promising.
Table 6 presents the HV values on the test sets. From Table 6, SaMOGA obtains slightly worse HV values than on the training sets, but it is still able to achieve the biggest HV values on twelve out of sixteen test sets. In particular, its performance remains stable on test sets with a large number of features. On DS01 - DS05, SPEA2 achieves the biggest HV values twice, while CMDPSOFS and MOEA/D achieve the biggest HV values once each. SaMOGA is significantly superior to NSGA-II, NSPSOFS and MOEA/D on DS06, and it obtains a slightly bigger HV value than all compared algorithms on DS07.
From the IGD and HV values on both the training and test sets presented in Table 3 - Table 6, we can find that SaMOGA wins on most training and test sets, and it never loses on datasets with a relatively large number of features. Meanwhile, on datasets with a relatively small number of features, CMDPSOFS performs well in terms of the IGD metric while SPEA2 is good in terms of the HV metric. From these results, we can conclude that SaMOGA outperforms NSGA-II, NSPSOFS, CMDPSOFS, SPEA2 and MOEA/D on most datasets in terms of IGD and HV values, especially on those with a large number of features.

Analysis of Best PFs Evolved on Training Sets
In this subsection, we present the best PFs evolved on the training sets to analyze the convergence and diversity of the solutions obtained by the different algorithms. For each algorithm, we first merge all PFs obtained from the 30 independent runs to form a PF ensemble, and then perform the non-dominated sorting operation on it and select the non-dominated solutions to obtain the best PF. Considering the length of the article, we only present the plots of the best Pareto fronts for six training sets here, named DS01 Tr, DS02 Tr, DS07 Tr, DS08 Tr, DS10 Tr and DS11 Tr in Figure 4; the remaining ones are shown in the Appendix. Meanwhile, the number of raw features and the classification error obtained by using all of them for classification on each training set are presented at the bottom of the corresponding subfigures.
It can be seen from DS01 Tr and DS02 Tr that the final solutions obtained by SaMOGA are well-converged. Meanwhile, it achieves results similar to SPEA2 and CMDPSOFS in most cases. Among all comparison algorithms, the final solutions obtained by NSPSOFS are inferior: despite the wide distribution of its solutions, NSPSOFS struggles to achieve good convergence. On DS07 Tr, all comparison algorithms perform well and achieve the same results. It is also noticeable that NSGA-II and NSPSOFS tend to obtain solutions with a small classification error and a big solution size.
Meanwhile, SaMOGA outperforms the other five comparison algorithms on DS08 Tr, DS10 Tr and DS11 Tr. Notably, SaMOGA is good at searching out solutions that have a small solution size and a high classification error. Perhaps it is with these solutions that SaMOGA can ensure population diversity and promote population convergence to prevent stagnation. On DS08 Tr, which has 256 raw features, the gap between CMDPSOFS, SPEA2, MOEA/D and SaMOGA is not so obvious. However, MOEA/D starts to become quite inferior as NF increases. It can also be observed that CMDPSOFS and SPEA2 have better search ability: CMDPSOFS outperforms SPEA2 on DS10 Tr and DS11 Tr, while SPEA2 outperforms CMDPSOFS on the other eight training sets.
From the best PFs evolved on the training sets, we observe that the best PF obtained by SaMOGA matches or dominates those of its counterparts, and that its final solutions improve relative to the other algorithms as the number of features in the training sets grows. We can therefore conclude that SaMOGA has strong search capability, and its advantage over its counterparts becomes more obvious as NF grows.

Analysis of Average PFs on Test Sets
To evaluate the performance of the different algorithms on the test sets, the average PFs obtained on the test sets are analyzed. Owing to space constraints, we only present the plots of the average Pareto fronts for six test sets here, named DS01 Te, DS02 Te, DS07 Te, DS08 Te, DS10 Te and DS11 Te in Figure 4; the remaining plots are shown in the Appendix. Since each algorithm is run 30 times on each dataset, the 30 resulting PFs invariably contain solutions with the same solution size but different classification errors. To obtain the average PF of each algorithm on each dataset, we average the classification errors of solutions sharing the same solution size, so that each solution size corresponds to a single classification error. The PF obtained in this way is called the average PF. The number of raw features and the classification error obtained by using all of them for classification on each test set are given at the bottom of the corresponding subfigure.
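The averaging step above can be expressed compactly. Again, this is a sketch rather than the authors' code, under the same assumption that a solution is a (solution size, classification error) pair.

```python
from collections import defaultdict

def average_pareto_front(run_fronts):
    """Group the solutions of all runs by solution size and average their
    classification errors, so each size maps to exactly one mean error."""
    errors_by_size = defaultdict(list)
    for front in run_fronts:
        for size, error in front:
            errors_by_size[size].append(error)
    return sorted((size, sum(errs) / len(errs))
                  for size, errs in errors_by_size.items())
```

For instance, if two runs both contain a solution of size 1 with errors 0.25 and 0.75, the average PF reports the single point (1, 0.5).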
According to DS01 Te, DS02 Te and DS07 Te, all comparison algorithms achieve similar results, i.e., small classification errors and small solution sizes, and all of them except NSPSOFS can obtain a feature subset with only one feature on all low-dimensional datasets. In addition, on these three test sets, SaMOGA obtains solutions whose classification errors are smaller than, or only slightly higher than, those obtained using all features, while removing far more features, e.g., 93% of the features on DS01 Te and 94% on DS02 Te.
In terms of classification error, MOEA/D wins on DS08 Te; however, its performance deteriorates on the other datasets, where it achieves results similar to or worse than those of NSGA-II and NSPSOFS. In almost all cases, SaMOGA obtains classification errors similar to or smaller than those of CMDPSOFS and SPEA2, and smaller than those obtained without FS. From the perspective of solution size, NSPSOFS obtains smaller solution sizes than NSGA-II, yet larger ones than MOEA/D. On most test sets, CMDPSOFS and SPEA2 obtain feature subsets of similar size. Furthermore, SaMOGA achieves the feature subsets with the minimum solution size among all comparison algorithms on all test sets, and its strength in cutting features grows as NF increases. Most notably, SaMOGA attains a classification error of 0 with only one feature on DS11 Te. Overall, SaMOGA excels at removing irrelevant or redundant features while ensuring a low classification error.
On the basis of the results of this study, one future direction of our research is the hybridization of the proposed SaMOGA with modern paradigms in neural systems and classification, such as the Enhanced Probabilistic Neural Network [84], the Neural Dynamic Classification Algorithm [85], the Dynamic Ensemble Learning Algorithm [86], and the Finite Element Machine for Fast Learning [87].
Another direction of future research is the application of the proposed self-adaptive multi-objective logic to other algorithmic structures for optimization, such as distributed neural dynamic algorithms [88], spiral dynamic algorithms [89], the harmony search algorithm [90], the water drop algorithm [91], and central force metaheuristic optimization.

Conclusions
This paper considers FS as a MOP and proposes SaMOGA to handle it. By adopting the self-adaptive mechanism and SSDM simultaneously, SaMOGA performs a sufficient search of the search space to yield a set of solutions with small classification errors and solution sizes without becoming trapped in local optima. Experiments are conducted on sixteen datasets whose numbers of features range from 30 to 1300, and five multi-objective optimization algorithms for FS are adopted for comparison. The results reveal that SaMOGA is able to obtain a lower classification error using fewer features than the comparison algorithms. Meanwhile, SaMOGA obtains better results than the comparison algorithms on most training and test sets in terms of IGD and HV. The success of SaMOGA over the other multi-objective approaches resides in the flexibility afforded by its integrated set of crossover operators used in an ensemble fashion.
In future work, we intend to investigate whether individuals can be encoded with variable length to narrow the search space, and then apply SaMOGA to further improve classification performance.

Figure 1. The Brief Evolution Process of NSGA-II

Figure 4. Best and Average PFs Evolved on Training and Test Sets

• Elitist selection. Once all individuals in the current generation have undergone reproduction, the offspring population Pop new is produced. The two populations are combined into population Rec, on which non-dominated sorting and crowding-distance calculation are performed. N individuals are then selected according to the non-dominance hierarchy and crowding distance, and the PF of the new generation is stored in PF (see lines 19-21 of Algorithm 1).
• Update of the selection probability (SP) assigned to each crossover operator. After C fitness evaluations, it is determined whether PF Lg is the same as PF; if not, stagFlag is set to 1. The SPs are then updated using R, P and stagFlag, and the counter c is reset to 0 (see lines 22-30 of Algorithm 1).
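The elitist selection step described in the first bullet can be sketched as follows. This is an assumed NSGA-II-style implementation for illustration, not the authors' code; solutions are (solution size, classification error) pairs with both objectives minimized, and `n` is the population size N.

```python
def dominates(a, b):
    """True if a Pareto-dominates b (both objectives minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def nondominated_sort(pop):
    """Partition the population into fronts F1, F2, ... (lists of indices)."""
    fronts, remaining = [], list(range(len(pop)))
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(pop[j], pop[i]) for j in remaining if j != i)]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]
    return fronts

def crowding_distance(pop, front):
    """Per-objective normalized distance to each solution's neighbors;
    boundary solutions get infinite distance so they are always kept."""
    dist = {i: 0.0 for i in front}
    for m in range(len(pop[front[0]])):
        ordered = sorted(front, key=lambda i: pop[i][m])
        dist[ordered[0]] = dist[ordered[-1]] = float("inf")
        span = (pop[ordered[-1]][m] - pop[ordered[0]][m]) or 1.0
        for k in range(1, len(ordered) - 1):
            dist[ordered[k]] += (pop[ordered[k + 1]][m] - pop[ordered[k - 1]][m]) / span
    return dist

def elitist_selection(parents, offspring, n):
    """Combine parents and offspring into Rec, then fill the next generation
    front by front, truncating the last front by crowding distance."""
    rec = parents + offspring
    survivors = []
    for front in nondominated_sort(rec):
        if len(survivors) + len(front) <= n:
            survivors += front                    # the whole front fits
        else:
            dist = crowding_distance(rec, front)  # keep the most spread-out
            survivors += sorted(front, key=dist.get, reverse=True)[:n - len(survivors)]
            break
    return [rec[i] for i in survivors]
```

With parents `[(1, 0.5), (2, 0.3)]` and offspring `[(3, 0.1), (2, 0.4)]`, the dominated solution (2, 0.4) is discarded first, and for n = 2 the two boundary solutions of the first front survive.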

Table 1. Details about Used Datasets

Table 2. Parameter Setting of SaMOGA

Table 3. Results of Mean IGD Values on Training Sets

Table 4. Results of Mean IGD Values on Test Sets

Table 5. Results of Mean HV Values on Training Sets

Table 6. Results of Mean HV Values on Test Sets