Automated Design of Metaheuristics Using Reinforcement Learning Within a Novel General Search Framework

Metaheuristic algorithms have been investigated intensively to address highly complex combinatorial optimization problems. However, most metaheuristic algorithms have been designed manually by researchers of different expertise without a consistent framework. This article proposes a general search framework (GSF) to formulate in a unified way a range of different metaheuristics. With generic algorithmic components, including selection heuristics and evolution operators, the unified GSF aims to serve as the basis for analyzing algorithmic components for automated algorithm design. With the established new GSF, two reinforcement learning (RL)-based methods, a deep $Q$-network-based method and a proximal policy optimization-based method, have been developed to automatically design a new general population-based algorithm. The proposed RL-based methods are able to intelligently select and combine appropriate algorithmic components during different stages of the optimization process. The effectiveness and generalization of the proposed RL-based methods are validated comprehensively across different benchmark instances of the capacitated vehicle routing problem with time windows. This study makes a key step toward automated algorithm design with a general framework supporting fundamental analysis through effective machine learning.


I. INTRODUCTION
ADDRESSING highly complex combinatorial optimization problems (COPs) with various real-world constraints has proven to be one of the current research challenges in evolutionary computation. The current state of the art includes metaheuristic algorithms, which are successful in finding good-quality solutions within a reasonable computational time. However, most metaheuristic algorithms proposed in the literature only work for particular problem instances or at particular stages of problem solving, and rely heavily on the experience of human experts. In addressing this issue, automated algorithm design has attracted considerable attention recently from the research community [1], [2].
Toward automated algorithm design, the problem of designing metaheuristics is itself defined as a COP in [3], upon a search space of different decision variables, e.g., algorithm parameters, portfolios of algorithms, or algorithmic components. The research in this field can therefore be categorized into automated algorithm configuration, algorithm selection, and algorithm composition, based on the different types of decision variables considered in the search space of algorithms [3]. The first category aims to automatically configure the parameters of a specific type of algorithm. The second category focuses on selecting a candidate algorithm or combining several existing algorithms against problem/instance characteristics. In contrast to these two categories, by combining basic algorithmic components, automated algorithm composition aims to generate general algorithms, i.e., the generated algorithms do not follow the template of any specific search algorithm, such as genetic algorithms or particle swarm optimization.
Algorithm configuration can determine a well-performing parameter setting; however, it requires sufficient prior knowledge about which specific algorithm should be used. Algorithm selection addresses the limitation of the first category; however, it introduces the difficult problem of identifying the key characteristics of the problem. Automated algorithm composition aims to flexibly compose and generate new algorithms; however, some human expertise is still required to preselect candidate heuristics in existing frameworks. This study falls into the third category to investigate the elementary and basic components to automatically design search algorithms within a unified framework.
In the literature, reinforcement learning (RL) [4] has been used to automatically design algorithms by modeling the problem of algorithm design as a Markov decision process (MDP). RL is a learning technique in which an agent determines an optimal action at each state based on its interaction with the environment. Based on the rewards or punishments received after performing each selected action, it learns to intelligently select the action in the current state by forming state-action pairs through trial and error [5]. Some researchers have used the simplest tabular RL techniques, such as SARSA [4] and Q-learning (QL) [6], for evolutionary algorithm design. One research issue in applying tabular RL concerns the discretization of the continuous state space, which leads to unreliable results [7], [8]. In this research, neural network function approximation is adopted to address the above issue. In addition, there are few studies on RL techniques supporting the effective design of evolutionary algorithms to solve constrained COPs such as the capacitated vehicle routing problem with time windows (CVRPTW).
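For reference, the tabular QL mentioned above maintains a table of state-action values updated from observed rewards. The following is a minimal sketch of such an update, assuming a discretized state space; the class name, constants, and table sizes are illustrative choices, not taken from the cited works.

```java
import java.util.Random;

// Minimal tabular Q-learning sketch over a discretized state space.
// The state/action counts, alpha, gamma, and epsilon below are illustrative.
public class TabularQL {
    final double[][] q;          // Q-table: q[state][action]
    final double alpha = 0.1;    // learning rate
    final double gamma = 0.9;    // discount factor
    final double epsilon = 0.2;  // exploration rate
    final Random rng = new Random(42);

    TabularQL(int numStates, int numActions) {
        q = new double[numStates][numActions];
    }

    // Epsilon-greedy action selection in the given (discrete) state.
    int selectAction(int state) {
        if (rng.nextDouble() < epsilon) return rng.nextInt(q[state].length);
        int best = 0;
        for (int a = 1; a < q[state].length; a++)
            if (q[state][a] > q[state][best]) best = a;
        return best;
    }

    // One QL update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
    void update(int s, int a, double reward, int sNext) {
        double maxNext = Double.NEGATIVE_INFINITY;
        for (double v : q[sNext]) maxNext = Math.max(maxNext, v);
        q[s][a] += alpha * (reward + gamma * maxNext - q[s][a]);
    }
}
```

A Q-table of this form only remains tractable when the state space is small and discrete, which motivates the function-approximation methods adopted in this study.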
To support the automatic design of effective metaheuristic algorithms for COPs, a general search framework (GSF) is first established, within which learning techniques can be applied to the design space of algorithms and thus support automated algorithm design. At this stage of research, instead of studying all the algorithmic components, we only focus on investigating the key issue of automatic composition of key evolutionary operators which have the biggest impact on algorithm performance. RL is used in automated algorithm composition to reward or penalize combinations of key evolutionary operators based on their performance. The research work aims to make the following contributions.
1) A new GSF is established to formulate different single-solution-based and population-based algorithms. The unified GSF serves as the basis to analyze algorithmic components, generating effective search algorithms for CVRPTW automatically.
2) The automated algorithm composition process is formulated as an MDP. Two RL methods, deep Q-network (DQN) [9] and proximal policy optimization (PPO) [10], have been investigated within the proposed GSF to address the key issue of automatic selection and combination of the most efficient evolutionary operators during different stages of the evolution. Results on CVRPTW demonstrate the effectiveness of the trained policy compared to a search procedure without learning.
3) The generalization of the trained policy is further validated by applying it directly to new CVRPTW instances. In addition to the knowledge extracted and retained in the DQN and PPO models, the training time of the RL-based techniques is also justified by the time and expertise needed to develop new models and algorithms from scratch to tackle new problem instances.

The remainder of this article is structured as follows. Section II presents related work on existing automated algorithm design frameworks and RL techniques within these frameworks. Section III describes the proposed GSF and learning techniques. In Section IV, the optimization model of CVRPTW is described, and experimental results are analyzed on a benchmark dataset, whereas Section V presents the conclusions and discusses future research.

II. RELATED WORK
Most evolutionary algorithms and metaheuristics in the existing literature have been designed manually by researchers of different expertise, often with ad hoc algorithms chosen for the specific problems at hand. There is relatively little work on building GSFs to support effective algorithm design.

A. Existing Frameworks for Automated Algorithm Design
The algorithm design problem has been formally modeled as a COP, namely, the general COP (GCOP) model in [3]. Based on the fundamental difference in the decision space, automated algorithm design can be divided into three categories: 1) algorithm configuration; 2) algorithm selection; and 3) algorithm composition. A number of frameworks have been developed in the literature to support the task of automated algorithm design within these different categories.
Automated algorithm configuration aims to find a well-performing parameter setting of a target algorithm across a given set of problem instances. Frameworks built to support this task include ParamILS [11], which utilizes iterated local search; F-race [12] and irace [13], which both use a racing mechanism; and surrogate-based methods such as SPOT [14], SMAC [15], MIP-EGO [16], and Hyperopt [17].
In automated algorithm selection, a specific algorithm or a portfolio of algorithms is automatically chosen on an instance-by-instance basis. Frameworks developed include PAP [18], which integrates different evolutionary algorithms to solve numerical optimization problems, Hydra [19], which applies a configuration technique to portfolio-based algorithm selection, and machine learning-based algorithm selectors [20].
In automated algorithm composition, a set of heuristics is automatically combined to generate new algorithms to solve instances across different problem domains. The most investigated technique is hyper-heuristics [21], which is broadly concerned with intelligently selecting or generating heuristics. Frameworks developed include HyFlex [22], EvoHyp [23], SHH [24], etc. HyFlex explores a decision space of low-level heuristics (e.g., taking search operators from ten well-known techniques [25]) while EvoHyp adapts evolutionary algorithms as high-level strategies. SHH is specifically built for automatically combining different components of swarm intelligence algorithms [24]. In addition, some frameworks have been built within a template of specific metaheuristics, such as CMA-ES [26] and PSO-DE [27].
The recent fast growth of automated algorithm composition is due to its greater potential to generate more general search algorithms to solve complex COPs, as it is not restricted to a template of existing specific search algorithms. This study focuses on the automated algorithm composition problem, drawing on advanced RL for effective algorithm design.
Although existing automated algorithm composition frameworks (e.g., HyFlex, EvoHyp, and SHH) have been successfully used for solving a variety of COPs, several limitations remain. HyFlex relies on a set of predefined or problem-specific heuristics rather than basic algorithmic components, which limits its ability to generate more general and powerful search algorithms for a wider range of problems. EvoHyp predefines the selection operator and evolution operator, while SHH mixes these two types of operators. These frameworks thus operate on a reduced search space of algorithm design; as a result, some advantageous combinations of basic components may never be obtained or explored.
With the new standard in algorithm design, namely, GCOP established in [3], this research systematically investigates learning techniques within the unified GSF to underpin automated algorithm design.

B. Reinforcement Learning Within Automated Algorithm Composition
In automated algorithm composition, RL techniques such as SARSA [4], QL [6], and DQN [9] have been used to support the intelligent selection of the most appropriate heuristic operators. They utilize feedback on the performance of operators during different stages of the search process. The research in this field can be classified into two categories based on how the action space is defined.
The first category of RL techniques in the automated algorithm composition defines the operators in a specific type of search algorithms as the optional actions of the RL agent. In the literature, RL techniques are mostly applied to evolutionary algorithms, such as the genetic algorithm to select efficient mutation and crossover operators [7]. Results on the Traveling Salesman Problem and the 0-1 Knapsack Problem have demonstrated the superiority of this automated method [7], [28], [29], [30].
However, due to the complexity of RL techniques, most studies in this field [7], [28], [29] have only focused on using the simplest tabular RL methods, such as SARSA and QL. Few studies have investigated advanced techniques to handle the continuous state spaces when applying RL to select evolution operators [30]. There is a lack of research on advanced RL in effective and efficient automated algorithm design in evolutionary computation.
The second research category treats problem-specific heuristics as the optional actions of the RL agent. RL techniques are used as the high-level strategy to automatically combine different low-level heuristics in hyper-heuristics. Results of these RL-based approaches on unmanned aerial vehicles [31] and different COPs within the HyFlex framework [5] demonstrated the effectiveness of these methods.
In these studies, several search-dependent features have been used to represent the state. The number of features identified, however, is limited and insufficient for learning. Also, simple positive/negative reward schemes are used, which cannot accurately reflect the effects of the selected action. Furthermore, it is often not clear how the RL techniques within the hyper-heuristic framework have been devised, i.e., lack of clear definition on the three fundamental elements of RL, namely, the state, action, and reward scheme. There is still a large scope and gap in this area of research, as it is often challenging to reimplement the exact same method and subsequently replicate the results.
In this study, we apply two RL techniques with a neural network function approximator in the learning of automated algorithm composition. The state space with sufficient features for effective learning is carefully defined. The action space is defined as the basic algorithmic components to learn  reusable knowledge in the automated design of general search algorithms. Also, an effective reward scheme is defined to encourage the RL system to find efficient search policies. It should be noted that this study adopts an offline RL framework, in which the policy is trained offline but used in an online fashion. This is different from most of the RL-based automated algorithm composition methods in the literature.

III. THE PROPOSED GENERAL SEARCH FRAMEWORK AND LEARNING TECHNIQUES

A. General Search Framework
Evolutionary algorithms and metaheuristics in the literature follow a similar underlying philosophy of artificial evolution driven by selection and reproduction. The evolution and search process of a specific metaheuristic is distinguished and mainly depends on the selection heuristics and evolution operators.
Based on the analysis of the basic schemes of metaheuristic algorithms, a GSF has been developed, as illustrated in Fig. 1. The framework is composed of five modules as shown in Table I for updating the individuals and four archives as shown in Table III for storing the individuals. For each of these components, different settings, heuristics, or parameters can be chosen, as shown in Tables V-VII, to automatically compose and design different general search algorithms within the GSF. Algorithms represented by the combination of heuristics and operators are set as the output.
With respect to Initialization, although some problem-specific heuristics (h p ) have been developed, the majority of existing studies generally adopt a "purely at random" (h r ) strategy. The two most common criteria for Termination are computation time (h t ) and population convergence (h c ). Of the five modules presented in Fig. 1, Selection for Evolution, Evolution, and Selection for Replacement contribute more to the search performance. Therefore, they are discussed in detail in the following section.
The proposed GSF is able to formulate in a unified way a range of single-solution-based algorithms and population-based algorithms by setting different parameters for the modules and archives, as shown in Table II, e.g., different population sizes, the four archives, and heuristic sets in the Selection for Evolution module. This article focuses on applying RL to the automated design of population-based search algorithms. Table III shows the archives defined within GSF and Table IV presents the heuristic/operator set for each module. In the Selection for Evolution module, some individuals within the Current Population archive (A C ) are selected and stored in the Parent Population archive (A P ) using selection heuristics (H SE ). The population is updated or evolved by using the evolution operators (O E ) in the Evolution module, and the resulting individuals are stored in the Offspring Population archive (A O ). The Current Population archive is then regenerated by adopting the selection heuristics (H SR ) in the Selection for Replacement module. In addition, every individual has a Personal Archive (A i I ) to record its individual trajectory.
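To make this module/archive flow concrete, the following minimal sketch outlines one GSF generation acting on the Current, Parent, and Offspring Population archives. The interfaces and method names are our own illustration for this article, not the framework's actual API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative skeleton of one GSF generation: Selection for Evolution (H_SE),
// Evolution (O_E), and Selection for Replacement (H_SR) acting on the archives.
public class GsfGeneration {
    interface Individual { double fitness(); }  // fitness used by concrete heuristics
    interface SelectionHeuristic { List<Individual> select(List<Individual> archive, int n); }
    interface EvolutionOperator { List<Individual> evolve(List<Individual> parents); }

    List<Individual> currentPopulation = new ArrayList<>(); // archive A_C

    void step(SelectionHeuristic hSE, EvolutionOperator oE, SelectionHeuristic hSR) {
        // Selection for Evolution: fill the parent archive A_P from A_C.
        List<Individual> parents = hSE.select(currentPopulation, currentPopulation.size());
        // Evolution: apply the chosen operator to produce the offspring archive A_O.
        List<Individual> offspring = oE.evolve(parents);
        // Selection for Replacement: build the next A_C from the combined pool.
        List<Individual> pool = new ArrayList<>(currentPopulation);
        pool.addAll(offspring);
        currentPopulation = hSR.select(pool, currentPopulation.size());
    }
}
```

Different compositions of hSE, oE, and hSR within this loop correspond to different algorithms generated by the GSF.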

B. Basic GSF Modules
In GSF, the Selection for Evolution and Selection for Replacement modules select individuals using various heuristics based on the fitness of individuals in the population archives. Without loss of generality, all selection heuristics are set for solving optimization problems where the aim is to minimize the objective value.

1) Selection for Evolution: As Table V shows, h 1 , h 2 , and h 3 select a parent according to a probability related to its fitness, whereas h 4 , h 5 , and h 6 select a parent in a deterministic way rather than by probability.
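Purely as an illustration of these two styles (the exact definitions of h 1 -h 6 in Table V are not reproduced here), a probability-based heuristic for a minimization objective can weight individuals by inverse fitness, while a deterministic one can simply take the best individual:

```java
import java.util.List;
import java.util.Random;

// Illustrative selection heuristics for minimization; not the exact h_1..h_6 of Table V.
public class SelectionExamples {
    static final Random RNG = new Random(1);

    // Probability-based: roulette wheel on inverse fitness (smaller fitness -> higher probability).
    static int rouletteSelect(List<Double> fitness) {
        double[] weights = new double[fitness.size()];
        double total = 0.0;
        for (int i = 0; i < fitness.size(); i++) {
            weights[i] = 1.0 / (1.0 + fitness.get(i)); // assumes non-negative fitness values
            total += weights[i];
        }
        double r = RNG.nextDouble() * total, acc = 0.0;
        for (int i = 0; i < weights.length; i++) {
            acc += weights[i];
            if (r <= acc) return i;
        }
        return weights.length - 1;
    }

    // Deterministic: pick the index of the best (smallest) fitness.
    static int bestSelect(List<Double> fitness) {
        int best = 0;
        for (int i = 1; i < fitness.size(); i++)
            if (fitness.get(i) < fitness.get(best)) best = i;
        return best;
    }
}
```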
2) Selection for Replacement: After evolution, the population is updated by using the selection heuristic (h 7 , h 8 ) in the Selection for Replacement module, as shown in Table VI.
3) Evolution Operators: Evolution operators (O E ) in the Evolution module include O mutation , which operates upon one individual, and O crossover , which operates on multiple individuals. Regarding the CVRPTW, crossover operators are prone to producing infeasible solutions. Therefore, in this study, we focus on investigating various mutation operators, defined in Table VII, for solving CVRPTW. Note that these general basic operators (exchange, insert, remove, etc.) can be adapted accordingly to automatically design algorithms for different COPs.
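As a rough sketch of the kind of route-level moves listed in Table VII (the exact operator definitions are not reproduced here), an exchange move swaps two customers and an insert move relocates one customer within a route; a real CVRPTW operator would additionally check capacity and time-window feasibility afterwards.

```java
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative route-level mutation moves; simplified exchange/insert-style operators
// that ignore the feasibility checks a real CVRPTW operator would require.
public class MutationExamples {
    static final Random RNG = new Random(7);

    // Exchange: swap the customers at two random positions of a route.
    static void exchange(List<Integer> route) {
        if (route.size() < 2) return;
        int i = RNG.nextInt(route.size());
        int j = RNG.nextInt(route.size());
        Collections.swap(route, i, j);
    }

    // Insert: remove a customer from one position and reinsert it at another.
    static void insert(List<Integer> route) {
        if (route.size() < 2) return;
        int from = RNG.nextInt(route.size());
        Integer customer = route.remove(from);
        int to = RNG.nextInt(route.size() + 1);
        route.add(to, customer);
    }
}
```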

C. Reinforcement Learning for Automated Algorithm Composition in GSF
RL is a machine learning technique in which an intelligent agent takes actions based on a policy learned through trial-and-error interactions with the environment, with the aim of maximizing the total reward. The environment of RL is modeled as an MDP, which is composed of a set of possible states and a set of selectable actions. Each state-action pair is associated with an expected total reward value (Q-value).
With the established GSF, RL is used for automated algorithm composition as shown in Fig. 2. The actions are the selectable combinations of algorithmic components (i.e., evolution operators). The states are defined by different features of the search process, the solution, and the instance, as shown in Table IX. The automated algorithm composition process starts with the observation of the agent's current situation (a state) and the selection of a combination of algorithmic components (an action). Executing the selected algorithmic components (action), i.e., applying the chosen selection heuristic and evolution operator to the current population, leads to a new state of the optimization process (environment). A reward (or penalty) is then assigned to the selected action against the current state.
Tabular RL techniques, such as SARSA [4] and QL [6], have been used to select heuristic operators in the literature. However, a Q-table cannot efficiently represent the large or continuous state spaces that arise in algorithm design.

RL techniques can be roughly divided into value-based methods and policy-based methods based on their policy update mechanism [4]. To comprehensively verify the effectiveness of RL on automated algorithm design, a typical value-based method and a typical policy-based method are investigated within GSF in this research.
In value-based RL, DQN [9], the first deep RL method, is selected; the DQN-based method that automatically designs an algorithm within the GSF is named DQN-GSF. In policy-based RL, PPO [10], which outperforms other policy gradient approaches, is selected; the corresponding method is named PPO-GSF in this study. Table VIII shows the notations used in this study. The pseudocodes of DQN-GSF and PPO-GSF are shown in Algorithms 1 and 2, respectively.
Note that h 1 and h 8 are fixed in the Selection for Evolution and Selection for Replacement modules to address our key research issue, i.e., how to automatically design algorithms with the evolution operators which have the most impact on algorithm performance. With the newly established GSF, at this stage of research, the focus is on the key Evolution module, rather than on determining all the components in all modules simultaneously to find the best results within a reasonable computational time. With controlled experiments on the key module while fixing the other submodules, we can examine results that are due only to different settings in the Evolution module. From the preliminary experimental analysis, compared with the Evolution module, the Selection for Evolution and Selection for Replacement modules have a smaller impact on the algorithm performance. Therefore, the most commonly used components in the existing metaheuristic algorithms, i.e., h 1 in Selection for Evolution and h 8 in Selection for Replacement, are chosen for focused investigations.

Algorithm 1 Pseudocode of DQN-GSF
1: Initialize memory buffer D
2: Initialize evaluation action-value function Q network and target action-value function Q̂ network
3: Generate initial population, record the initial state s 0
4: for episode k = 1 to NoE do
5:   initialize the state s 0
6:   for timestep t = 1 to NoT do
7:     observe the current state s t by calculating the values of the state features in Table IX
8:     with probability ε select a random action a t ; with probability 1−ε select the action with the maximum Q-value: a t = arg max Q(s t , a t )
9:     select parents using a selection heuristic h i (i = 1, 2, ..., 6) from H SE (fixed as h 1 in this study)
10:    generate the offspring population by performing the selected action a t on state s t
11:    update the population using a selection heuristic h i (i = 7, 8) from H SR (fixed as h 8 in this study)
12:    observe reward r t based on (3) and (4) and the next state s t+1 , store experience (s t , a t , r t , s t+1 ) in D
13:    sample a random minibatch of J experiences (s j , a j , r j , s j+1 ) from D and calculate the loss (r j + γ max a j+1 Q̂(s j+1 , a j+1 ) − Q(s j , a j ))², where γ denotes the discount factor
14:    perform gradient descent with respect to the Q network to minimize the loss
15:    every N timesteps reset Q̂ = Q
16:  end for
17: end for
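A minimal Java sketch of the core logic in lines 7-15 of Algorithm 1 is given below. It assumes a placeholder QFunction interface standing in for the evaluation and target networks and abstracts away the GSF execution of lines 9-11; the class and method names are illustrative, not taken from the actual implementation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Random;

// Sketch of one DQN-GSF timestep: epsilon-greedy action selection, experience replay,
// TD-target computation, and periodic target-network synchronization.
public class DqnGsfStep {
    interface QFunction {
        double[] qValues(double[] state);                       // Q(s, .) for all actions
        void fitOne(double[] state, int action, double target); // one gradient step toward target
        QFunction copy();                                       // snapshot used as target network
    }
    static final class Experience {
        final double[] s, sNext; final int a; final double r;
        Experience(double[] s, int a, double r, double[] sNext) {
            this.s = s; this.a = a; this.r = r; this.sNext = sNext;
        }
    }

    final Deque<Experience> buffer = new ArrayDeque<>();
    final Random rng = new Random(0);
    QFunction evalNet, targetNet;
    double epsilon = 0.1, gamma = 0.99;
    int batchSize = 32, syncEvery = 50, bufferCapacity = 10_000, step = 0;

    // Line 8: epsilon-greedy selection of an evolution operator (action).
    int selectAction(double[] state) {
        double[] q = evalNet.qValues(state);
        if (rng.nextDouble() < epsilon) return rng.nextInt(q.length);
        int best = 0;
        for (int a = 1; a < q.length; a++) if (q[a] > q[best]) best = a;
        return best;
    }

    // Lines 12-15: store the experience, replay a minibatch, and sync the target network.
    void learn(Experience e) {
        buffer.addLast(e);
        if (buffer.size() > bufferCapacity) buffer.removeFirst();
        List<Experience> all = new ArrayList<>(buffer);
        for (int i = 0; i < Math.min(batchSize, all.size()); i++) {
            Experience ex = all.get(rng.nextInt(all.size()));
            double maxNext = Double.NEGATIVE_INFINITY;
            for (double v : targetNet.qValues(ex.sNext)) maxNext = Math.max(maxNext, v);
            // TD target: r + gamma * max_a' Q_target(s', a').
            evalNet.fitOne(ex.s, ex.a, ex.r + gamma * maxNext);
        }
        if (++step % syncEvery == 0) targetNet = evalNet.copy(); // reset Q̂ = Q
    }
}
```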
In DQN-GSF and PPO-GSF, the two RL techniques, DQN and PPO, are first applied over multiple episodes to train the policy within the GSF. After that, the trained policy is used to design the search algorithm online. The training process is the key research issue and is described in detail as follows.
As shown in Algorithm 1, DQN-GSF is trained at every timestep. Specifically, an action (an evolution operator) is either selected deterministically as the one with the largest Q-value for exploitation, or selected randomly for exploration (line 8, Algorithm 1). The designed search algorithm with the predefined selection heuristics is executed for one timestep (lines 9-11, Algorithm 1). The next state and reward are identified, and this experience (s t , a t , r t , s t+1 ) is stored in the memory buffer (line 12, Algorithm 1). After that, a minibatch of experiences is randomly sampled from the memory buffer to train the evaluation network (lines 13 and 14, Algorithm 1). The process is iterated at each timestep until the end of the episode. In the process, the target network parameters are periodically synchronized with the evaluation network parameters (line 15, Algorithm 1).

Algorithm 2 Pseudocode of PPO-GSF
1: Initialize memory buffer D
2: Initialize policy parameters θ 0 and value function parameters Φ 0
3: Generate initial population, record the initial state s 0
4: for episode k = 1 to NoE do
5:   for timestep t = 1 to NoT do
6:     observe the current state s t by calculating the values of the state features in Table IX
7:     select parents using a selection heuristic h i (i = 1, 2, ..., 6) from H SE (fixed as h 1 in this study)
8:     generate the offspring population by performing the selected action a t based on policy π k = π θ k
9:     update the population using a selection heuristic h i (i = 7, 8) from H SR (fixed as h 8 in this study)
10:    observe reward r t based on (3) and (4)
11:    collect experience (s t , a t , r t ) and save it in D
12:  end for
13:  update the policy parameters θ k+1 by maximizing the PPO objective in (1)
14:  fit the value function Φ k+1 based on (2)
15:  empty memory buffer D
16: end for

Unlike the value-based DQN-GSF, the policy-based PPO-GSF is trained every episode rather than every timestep. As shown in Algorithm 2, a series of actions is first selected according to the probabilities given by the policy π θ k , k = 1, 2, . . . , NoE, and the designed search algorithm with the predefined selection heuristics is correspondingly executed for one episode (lines 5-12, Algorithm 2). The policy is then updated by maximizing the PPO objective based on (1) (line 13, Algorithm 2), and the value function is fitted by the temporal-difference error based on (2) (line 14, Algorithm 2). Finally, the memory buffer is emptied (line 15, Algorithm 2), and a series of actions is selected based on the updated policy to perform the next episode of optimization (line 8, Algorithm 2). Following [10], the policy and value function updates are

$$\theta_{k+1} = \arg\max_{\theta}\ \hat{\mathbb{E}}_t\Big[\min\big(r_t(\theta)\,A^{\pi_{\theta_k}}(s_t,a_t),\ \mathrm{clip}(r_t(\theta),1-\epsilon,1+\epsilon)\,A^{\pi_{\theta_k}}(s_t,a_t)\big)\Big] \qquad (1)$$

$$\Phi_{k+1} = \arg\min_{\Phi}\ \frac{1}{NoT}\sum_{t}\big(V_{\Phi}(s_t)-\hat{R}_t\big)^2 \qquad (2)$$

where r t (θ) = π θ (a t |s t )/π θ k (a t |s t ) denotes the probability ratio, A π θ k (s t , a t ) is an estimator of the advantage function at timestep t, ε is a hyperparameter, and clip(r t (θ), 1 − ε, 1 + ε) denotes the modified surrogate objective obtained by clipping the probability ratio. The rewards-to-go R̂ t is calculated according to the trajectory τ: [(s 1 , a 1 , r 1 ), (s 2 , a 2 , r 2 ), . . . , (s t , a t , r t )]. Please refer to [10] for more detail about these two equations.

1) State Representation: Search-dependent features observe the search process, such as the total improvement over the initial solution. Solution-dependent features are associated with the solution encoding scheme; taking the TSP as an example, the encoding of a complete tour can be directly defined as the state. Instance-dependent features refer to instance-specific characteristics, such as the vehicle number or the vehicle capacity of the VRP.
When search-dependent or instance-dependent features are used to define the state space, the learned information can be transferred to other instances of the same problem, or even to other problems. In many cases, the solution-dependent features cannot be used to develop a general methodology since they are problem specific. Therefore, in this study, as shown in Table IX, four search-dependent features (f 1 -f 4 ) and four instance-dependent features (f 5 -f 8 ) are used to define the state space.
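As an illustration of how such features could be computed (the exact definitions of f 1 -f 8 in Table IX are not reproduced here, so the functions below are demonstration assumptions only):

```java
// Illustrative state-feature computations; the exact f_1..f_8 of Table IX may differ.
public class StateFeatures {
    // Search-dependent: relative improvement of the current best over the initial best.
    // Assumes positive fitness values, as in distance-based CVRPTW objectives.
    static double improvementOverInitial(double initialBestFitness, double currentBestFitness) {
        return (initialBestFitness - currentBestFitness) / initialBestFitness;
    }

    // Search-dependent: fraction of the optimization budget already consumed.
    static double elapsedFraction(int currentTimestep, int totalTimesteps) {
        return (double) currentTimestep / totalTimesteps;
    }

    // Instance-dependent: characteristics read directly from the CVRPTW instance.
    static double[] instanceFeatures(int numCustomers, int numVehicles, double vehicleCapacity) {
        return new double[] { numCustomers, numVehicles, vehicleCapacity };
    }
}
```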
2) Action Representation: In DQN-GSF and PPO-GSF, the set of possible actions in each state is defined by the set of evolution operators (O E ) in Table VII. Once an action is selected, it is applied to the whole population.
3) Reward Scheme: The reward scheme, which encourages the RL system to find efficient search policies, is very important for an RL method. In DQN-GSF and PPO-GSF, the reward is calculated based on the improvement of the fitness of the current population over the initial population, as shown in (3) and (4). When population fitness is optimized above a certain threshold, a larger reward is given for the same fitness improvement.
Two mechanisms are used in setting the reward: f 1 is normalized to increase the training efficiency, and a log function is used to assign a larger reward to the same fitness improvement in the later stage of the optimization process.
Many of the simple positive/negative reward schemes in the literature track the fitness improvement by counting the number of steps achieved successfully. The proposed reward scheme is designed to instead maximize the total fitness improvement itself, which is what really needs to be optimized. The proposed reward scheme not only reflects but also measures the positive/negative impact of the selected action. Moreover, it assigns a larger reward to the actions that lead to fitness improvements at the later stage of the optimization process, to address the issue that such improvements are usually very small at the final stage of evolution.
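Since (3) and (4) are not reproduced above, the following sketch only illustrates the two stated ingredients, a normalized fitness improvement and a log-based amplification once the total improvement passes a threshold; the threshold and scaling are our own illustrative choices, not the paper's exact reward function.

```java
// Illustrative reward along the lines described above: the reward grows with the normalized
// fitness improvement, and the same improvement earns more once a threshold is passed.
// The threshold and scaling below are arbitrary; Equations (3)-(4) define the actual scheme.
public class RewardSketch {
    static double reward(double initialFitness, double previousFitness, double currentFitness,
                         double threshold) {
        double stepImprovement = (previousFitness - currentFitness) / initialFitness; // normalized
        double totalImprovement = (initialFitness - currentFitness) / initialFitness;
        if (totalImprovement > threshold) {
            // Amplify late-stage gains, which are typically much smaller in absolute terms.
            return stepImprovement * (1.0 + Math.log(1.0 + totalImprovement / threshold));
        }
        return stepImprovement;
    }
}
```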
4) Episode Setting: An episode is defined as the whole optimization process. Since a time-based stopping criterion is used in this study, the period of each episode equals the given optimization time t max . An episode is divided into NoT timesteps, so the period of each timestep equals t max /NoT. For training purposes, the proposed DQN-GSF and PPO-GSF are executed for NoE episodes. For testing purposes, the designed DQN-GSF and PPO-GSF are executed for one episode.

IV. EXPERIMENTS AND DISCUSSION
The proposed RL-GSF methods within the novel GSF are investigated and evaluated on one of the most studied COPs, the CVRPTW, in this research. All experiments have been conducted on a computer with an Intel Xeon W-2123 CPU @ 3.60 GHz and 32.0 GB of memory. The RL-GSF methods are implemented in Java with IntelliJ IDEA 2020.3.3 as the development tool.
The experimental investigations aim to address two research issues: 1) the effectiveness of the new RL techniques to automatically generate a search algorithm to tackle the benchmark Solomon CVRPTW dataset and 2) the generalization of the trained policies to new problem instances. To analyze the influence of the Q-value function approximator on the learning models, two value-based RL-GSF methods with fitness improvement as the state definition, namely, QL-GSF with a Q-table and DQN-GSF with a neural network function approximator, are compared in Section IV-B1. To analyze the influence of the policy update mechanism on the learning models, DQN-GSF and PPO-GSF are assessed in Section IV-B2. The generalization of the trained policies across the same and different types of problem instances is assessed by directly applying the trained policies to new instances in Sections IV-C1 and IV-C2.

A. Problem Definition and Dataset
CVRPTW has been intensively tested as a benchmark problem in evaluating the performance of evolutionary and metaheuristic algorithms [32]. This article investigates the CVRPTW to gain a better understanding of the proposed RL-based automated algorithm design methodologies.
The CVRPTW can be mathematically formulated as follows [33].
A fleet of K vehicles is used to serve n customers. For customer v i with demand q i , the service start time b i must fall within the time window [e i , f i ], where e i and f i represent the earliest and latest times at which service may start, respectively. If a vehicle arrives at v i at time a i < e i , a waiting time w i = max{0, e i − a i } occurs. Consequently, the service start time is b i = max{e i , a i }. Each vehicle with a capacity Q travels on a route connecting a subset of customers, starting from the depot v 0 and ending within the schedule horizon [e 0 , f 0 ]. d ij represents the distance from customer v i to customer v j .
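Under these definitions, checking the feasibility of a single route amounts to propagating arrival times and accumulating demand along the route. The following minimal sketch assumes that travel times equal distances and that each customer has a given service duration; it illustrates the constraints just described and is not the paper's evaluation code.

```java
// Minimal feasibility check for one CVRPTW route, following the definitions above:
// arrival a_i, service start b_i = max(e_i, a_i) <= f_i, and accumulated demand <= capacity Q.
// Travel time equal to distance and a per-customer service duration are simplifying assumptions.
public class RouteCheck {
    static boolean feasible(int[] route, double[][] dist, double[] e, double[] f,
                            double[] demand, double[] serviceTime, double capacity) {
        double time = e[0];      // leave the depot (v_0) at the earliest opening time
        double load = 0.0;
        int prev = 0;            // depot index
        for (int v : route) {
            double arrival = time + dist[prev][v];
            double start = Math.max(e[v], arrival);   // b_i = max(e_i, a_i)
            if (start > f[v]) return false;           // time-window violation
            load += demand[v];
            if (load > capacity) return false;        // capacity violation
            time = start + serviceTime[v];
            prev = v;
        }
        // Return to the depot within the schedule horizon [e_0, f_0].
        return time + dist[prev][0] <= f[0];
    }
}
```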
Decision Variables: X k ij = 1 if the edge from v i to v j is assigned to the route of vehicle k; otherwise, X k ij = 0.

The first objective is to minimize the number of vehicles (5), while the second objective is to minimize the total traveled distance (6). Constraints (7)-(9) require every customer to be visited exactly once while ensuring that all customers are served. Constraints (10)-(12) define the route of vehicle k. Constraints (13) and (14) impose the customer time-window constraint and the vehicle capacity constraint, respectively. Constraint (15) defines the domain of the decision variable X k ij . We adopt the same evaluation function as in the literature, where the two objectives are transformed into a single objective with a weight [34], as shown in (16).

The Solomon benchmark dataset [35] consists of six sets of instances with different characteristics (C1, C2, R1, R2, RC1, and RC2). The instances differ with respect to the customers' geographical locations, vehicle capacity, and the density and tightness of the time windows. Customers in instance sets C1 and C2 are clustered geographically, while customers in instance sets R1 and R2 are randomly located. Instance sets RC1 and RC2 contain a mixture of random and clustered customers. The customer coordinates are identical for instances of the same type. The instances within one type differ with respect to the density and tightness of the time windows, i.e., the percentage of time-constrained customers and the width of the time windows.

B. Effectiveness of the Learning Models

1) Influence of the Q-Value Function Approximator on the Learning Models: QL, representing the tabular RL methods, and DQN, representing the function-approximation RL methods, are applied within the established GSF. The random algorithm is chosen as the baseline algorithm to demonstrate the performance of the RL methods. The Random-GSF method randomly selects algorithmic components within the established GSF during different stages of the optimization process without any learning, i.e., each algorithmic component has the same probability of being selected. The state space is defined by the total fitness improvement over the initial population fitness, i.e., f 1 in Table IX. As CVRPTW is an NP-hard problem with a finite fitness search space, the QL-GSF method needs to handle a large number of states. An approximation technique based on the concept of state aggregation [36], [37], [38] is used within QL-GSF to aggregate the state space into several disjoint categories.
From the preliminary analysis, in type-R1 and type-RC1 instances, the values obtained fall into the range [0.4, 0.6]. The range is slightly different in type-C, type-R2, and type-RC2 instances, observed as [0.3, 0.5] from experiments. The state space of type-R1 and type-RC1 instances is therefore aggregated into disjoint categories over this range, and likewise for the other instance types.

Apart from the Q-value function approximator, the experimental environment and parameter settings are identical for these three algorithms. In all algorithms, the population size, the number of timesteps NoT, and the predefined maximum running time of one episode t max are set to 100, 50, and 600 s, respectively. For training the policy, the number of episodes NoE is set to 500. For testing purposes, as shown in Tables X-XII, each learning algorithm is run ten times, and we collect the average best fitness (AVG), standard deviation (SD), the best fitness (BEST), and the GAP between BEST and the best-known solution in [39]. The published results (BEST) of two state-of-the-art manually designed algorithms, RT [40] and HG [41], are listed in all tables for comparison. These two manually designed algorithms are selected for the comparison since they report most of the relevant information in their published papers.
It should be noted that it is usually not possible to compare the design time of automated methods and manual design methods, since this information is usually not reported in the published papers; some of them only publish their results without the time. A direct comparison of the computational expenses between the proposed automated methods and the manual methods is also unfair due to the different computing platforms and implementation languages. However, the aim is not to develop a fast method but rather to automatically develop search algorithms that can produce state-of-the-art results with a higher degree of generality. The extra time can be compensated by solving different problem instances without redesigning or fine-tuning algorithms in the long term.

On the type-C instances, the results of BEST and GAP in Table XI demonstrate that these three methods can produce the current best-known solutions [39]. This type of instance can be solved by evolutionary search without any learning techniques. The different AVG and SD values indicate that the proposed RL-GSF methods, especially DQN-GSF, are more stable in automatically designing a search algorithm for solving type-C instances, with statistical significance (measured by the Wilcoxon rank-sum test with p < 0.05 and indicated by * in all the tables of results).
On the type-R and type-RC instances, as shown in Tables XI and XII, DQN-GSF achieves the best results among the three algorithms in most instances. QL-GSF is the second best, with a better AVG and a smaller GAP than Random-GSF in most instances. This indicates that learning-based models are more effective than the nonlearning search procedure.
In conclusion, a neural network function approximator outperforms the simple Q-table. With more features to define the state space, the effectiveness of the learning methods is likely to be further improved. However, the memory required by a simple Q-table to handle multiple features will increase and the amount of time required to explore each state to create the required Q-table becomes unrealistic. In comparison to a Q-table, a neural network is able to handle multiple features.
It can also be observed that Random-GSF shows comparable performance on type-C instances but poorer performance on type-R and type-RC instances. This indicates that learning mechanisms can help to find a better combination of the algorithmic components, obtaining better solutions. In the next section, two neural network-based RL-GSF methods, DQN-GSF and PPO-GSF, will be investigated further.
2) Influence of Policy Update Mechanisms on the Learning Models: For the value-based method DQN-GSF and the policy-based method PPO-GSF, apart from the policy update mechanism, all other parameters, such as the population size and the maximum running time, are identical to ensure a fair comparison.
The policies of PPO-GSF and DQN-GSF are gradually improved during the training process. A certain degree of randomness must be maintained to avoid being trapped in a local optimum. As a result, the reward curve rises with some fluctuation and is therefore smoothed using a sliding-window filter (moving average), as shown in (17), where x buff is the raw reward per episode, y buff = [1, . . . , 1] q is a vector of ones whose length is the smoothing factor q, and z buff = [1, . . . , 1] NoE is a vector of ones with the length of the whole training data. q is set to 5 in the experiment.

On the type-C instances, as illustrated in Fig. 3, PPO-GSF performs better than DQN-GSF in most instances. As shown in Table XIII, the AVG and SD also demonstrate the superiority of PPO-GSF over DQN-GSF. Again, both learning methods can produce the current best-known solutions [39].
On the type-R instances, as illustrated in Fig. 4, PPO-GSF outperforms DQN-GSF in terms of algorithm convergence and solution quality. On the type-RC instances, as illustrated in Fig. 5, PPO-GSF clearly outperforms DQN-GSF in all instances. In Table XV, in most type-RC instances, the solutions obtained by PPO-GSF and DQN-GSF are nondominated with respect to the best-known solutions identified by all the other metaheuristics in [39].

In conclusion, the experimental results show that both the PPO-GSF and DQN-GSF methods can support effective learning in GSF to automatically generate evolutionary algorithms for solving different types of CVRPTW instances. In particular, with a neural network approximator, the policy-based PPO-GSF is more effective than the value-based DQN-GSF. There are two main reasons. First, policy-based methods can learn stochastic policies, whereas value-based methods can only learn deterministic policies, so policy-based methods are more capable of exploring the environment. Second, PPO-GSF can ensure that the learned policy improves monotonically due to its effective value function optimization method, leading to better exploitation.

C. Generalization of the Learning Models
The training process of the RL-GSF models is very time-consuming. This section investigates the generality of the policies trained by the proposed RL-GSF models, potentially reducing this time by reusing the policies learned for automated algorithm design when solving new problem instances.
1) Generalization Across the Same-Type Instances: The policies trained on instance R101 by DQN-GSF and PPO-GSF are used to validate their generality on other type-R instances. The results in Table XVI of applying these policies to five other instances demonstrate a good degree of generalization. NV denotes the number of vehicles and TD denotes the total distance. Policies trained by DQN-GSF lead to a GAP of less than 2% on all instances except R202. With PPO-GSF, the GAP is less than 3% in all instances, obtaining results comparable to the best-known results in the literature [39].
2) Generalization Across Different-Type Instances: The generality of the policies trained on instance R101 by DQN-GSF and PPO-GSF is validated by directly applying them to type-C and type-RC instances with different features. The results in Table XVII of applying them to 12 other instances again demonstrate the generalization of the trained policies. For the type-C instances, all the GAP values are equal to 0, which means the trained policies of DQN-GSF and PPO-GSF can produce the current best-known solutions. On the type-RC instances, the trained policies also obtain high-quality solutions in most instances.

In conclusion, the experimental results show that the algorithms designed automatically by DQN-GSF/PPO-GSF are able to produce high-quality solutions for different problem instances, of the same and also of different types. This indicates that the proposed framework is reliable for different scenarios, which is the aim of automated algorithm design.

V. CONCLUSION
In this study, a GSF is first established to formulate different metaheuristics, including single-solution-based algorithms and population-based algorithms. RL methods, DQN and PPO, are devised within the established unified GSF to automatically design population-based algorithms by intelligently selecting appropriate combinations of algorithmic components (i.e., evolution operators) during different stages of the optimization process. The proposed models are shown to be able to effectively design algorithms within GSF by learning from interactions with the environment (the optimization process).
The performance of the two proposed RL models has been evaluated on different benchmark instances of the CVRPTW to investigate their effectiveness and generality. Regarding the effectiveness of the learning models, investigations of the Q-value function approximator and the policy update mechanism show that policy-based models with a neural network function approximator (i.e., PPO) are more suitable for automatically designing search algorithms. Regarding generality, the policies learned on one instance are applied across same-type and different-type instances. The results validate the generality of the trained policies of the DQN-GSF and PPO-GSF models. This provides promising evidence of learning reusable knowledge for designing algorithms based on the basic algorithmic components within the unified GSF.
For future work, the proposed GSF can be extended to support the automated design of multiobjective algorithms. Precise measures of population diversity in both the solution space and the objective space, as well as fitness landscape analysis on the search space of algorithm compositions, may further identify search-dependent features to better represent the state, enhancing the RL-based methods toward effective learning on algorithm design.

APPENDIX
The details of the neural networks used in DQN-GSF and PPO-GSF are shown in Figs. 6 and 7, respectively.
In Fig. 6, the state and action are taken as the input (1, 3) of the Q networks, i.e., the evaluation network and the target network, respectively. The parameters of the target network are replaced (11) by those of the evaluation network every n episodes. The output of the Q networks is a set of Q-values of all actions (2, 4). The action is decided by the maximal Q-value (5). The selected action a is executed (6) in the environment, called one step. The (s t , r t , a t , s t+1 ) generated at each step is stored in the replay buffer (7). The loss of the parameters of the evaluation network is calculated (8). After the update of the Q networks (10, 11), the data flows back to (1).

In Fig. 7, the state is taken as the input (1) of the actor neural network, the output of which is a probability distribution over all actions (2). The action is decided by the obtained probability distribution (5). The selected action a is executed (6) in the environment, called one step. The (s t , r t , a t , s t+1 ) generated at each step is stored in the replay buffer (7). The parameters of the critic neural network are updated (8) by minimizing the advantage function value. On the other hand, the state and reward are taken as the input (3) of the critic neural network, and its output, the advantage function value of the current state (4), guides the update direction of the actor neural network (9). Then, the data flows back to (1).
In RL, the learning rate, discount rate, and size of the neural network are the key hyperparameters of the algorithms. Specifically, the learning rate is adjusted adaptively: it is set to 0.002 at the beginning and halved when the output becomes stable. The discount rate is set to 0.99 so that the learned policy focuses more on sequential decisions. The topology of the network is set based on the complexity of the problem, and the numbers of layers and neurons are shown in Figs. 6 and 7.