Self-adaptation of Mutation Rates in Non-elitist Populations

The runtime of evolutionary algorithms (EAs) depends critically on their parameter settings, which are often problem-specific. Automated schemes for parameter tuning have been developed to alleviate the high costs of manual parameter tuning. Experimental results indicate that self-adaptation, where parameter settings are encoded in the genomes of individuals, can be effective in continuous optimisation. However, results in discrete optimisation have been less conclusive. Furthermore, a rigorous runtime analysis that explains how self-adaptation can lead to asymptotic speedups has been missing. This paper provides the first such analysis for discrete, population-based EAs. We apply level-based analysis to show how a self-adaptive EA is capable of fine-tuning its mutation rate, leading to exponential speedups over EAs using fixed mutation rates.


Introduction
An obstacle when applying Evolutionary Algorithms (EAs) is that their efficiency depends crucially, and sometimes unpredictably, on their parameter settings, such as population size and mutation rate. Parameter tuning [7], where the parameters are fixed before running the algorithm, is the most common way of choosing the parameters. A weakness of parameter tuning is that the optimal parameter settings may depend on the current state of the search process. In contrast, parameter control allows the parameters to change during the execution of the algorithm, e.g. according to a fixed schedule (as in simulated annealing), through feedback from the search, or via self-adaptation [7]. In continuous search spaces, adaptive parameters can be essential and advantageous (e.g. covariance-matrix adaptation [8]). In discrete spaces, it has been shown that changing the mutation rate as a function of the current fitness [2] can improve the runtime, and the 1/5-rule has been used to adapt the population size [5].
While previous studies have shown the benefit of adaptive parameters, they analysed only global parameters. Our focus is different: we look at the so-called "evolution of evolution", or true self-adaptation [7], in which the parameter is encoded in the genome of individual solutions. As far as we know, the existing studies on this topic in the EC literature are mostly experimental [1,7,12], or concern proving convergence of the population model in the limit [1], i.e. for infinite populations.
We study how mutation rates can evolve within a non-elitist population, where the mutation rate of each individual is encoded in its own genome. The rate at which the mutation rate mutates is specified by a strategy parameter p. In endogenous control, the strategy parameter is itself evolved [1,13]. Here, we consider exogenous control of the strategy parameter p, where the value of the parameter is fixed before the run. Our contribution is twofold. First, using LeadingOnes as a benchmark, we provide necessary and sufficient conditions, especially on p, for self-adaptation to work. Second, by making a small modification of the function, we show that self-adaptation is essential for optimising the modified function: a single mutation rate or uniform mixing of mutation rates requires exponential time, while self-adaptation is efficient. We also prove that a non-elitist EA can outperform the (µ+λ) EA.
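The self-adaptive variation step described above can be sketched as follows. This is only a minimal illustration under assumptions not stated in this section: the function name `self_adaptive_mutate` is hypothetical, the encoded rate is assumed to be adapted before the bits are mutated, and a switch is assumed to pick uniformly among the other rates in the set.

```python
import random


def self_adaptive_mutate(bits, chi, rates, p, rng):
    """Return a mutated copy of the individual (bits, chi).

    Assumption: the encoded mutation rate is adapted first, and the bits
    are then mutated with the (possibly new) rate.
    """
    # With probability p (the strategy parameter), switch to a different
    # rate chosen uniformly from the remaining rates in the set.
    if rng.random() < p:
        others = [r for r in rates if r != chi]
        if others:
            chi = rng.choice(others)
    n = len(bits)
    # Standard bitwise mutation: flip each bit independently with
    # probability chi / n.
    child = [1 - b if rng.random() < chi / n else b for b in bits]
    return child, chi


if __name__ == "__main__":
    rng = random.Random(1)
    # Illustrative values only: two rates and a small strategy parameter.
    print(self_adaptive_mutate([1, 1, 0, 0, 1, 0, 0, 0], 0.5, [0.5, 4.0], 0.05, rng))
```

With exogenous control, `p` is passed in unchanged for the whole run; only `chi` travels with the individual.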

Preliminaries
For any $n \in \mathbb{N}$, define $[n] := \{1, \dots, n\}$. The natural logarithm is denoted by $\ln(\cdot)$, and the logarithm to the base 2 by $\log(\cdot)$. For $x \in \{0,1\}^n$, we write $x(i)$ for the $i$-th bit value. The Hamming distance is denoted by $H(\cdot,\cdot)$ and the Iverson bracket by $[\cdot]$. Given a partition of a search space $\mathcal{X}$ into $m$ ordered "levels" $(A_1, \dots, A_m)$, we define $A_{\ge j} := \bigcup_{i=j}^{m} A_i$. A population is a vector $P \in \mathcal{X}^{\lambda}$, where the $i$-th element $P(i)$ is called the $i$-th individual. Given $A \subseteq \mathcal{X}$, we let $|P \cap A| := |\{i \mid P(i) \in A\}|$ be the number of individuals in population $P$ that belong to the subset $A$.
All algorithms considered here are of the form of Algorithm 1 [4]. A new population $P_{t+1}$ is generated by independently sampling $\lambda$ individuals from an existing population $P_t$ according to $p_{sel}$, and perturbing each of the sampled individuals by a variation operator $p_{mut}$. The selection mechanism $p_{sel}$ implicitly embeds a fitness function $g : \mathcal{Y} \to \mathbb{R}$.
We consider the standard bitwise mutation operator, where for any pair of bitstrings $x, x' \in \{0,1\}^n$ and any mutation rate $\chi \in (0, n]$, the probability of obtaining $x'$ from $x$ is $\Pr(x' = \mathrm{mut}(x, \chi)) = (\chi/n)^{H(x,x')} (1 - \chi/n)^{n - H(x,x')}$.
To model the parameter control problem, we assume that Algorithm 1 must choose mutation rates from a predefined set $\mathcal{M}$. Uniform mixing, denoted $p^{mix}_{mut}$, chooses the mutation rate $\chi$ uniformly at random from the set $\mathcal{M}$ every time an individual is mutated: $p^{mix}_{mut}(x) := \mathrm{mut}(x, \chi)$, where $\chi \sim \mathrm{Unif}(\mathcal{M})$.
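As a rough illustration of bitwise mutation and uniform mixing, the sketch below assumes the usual definition in which each bit is flipped independently with probability χ/n. All function names (`mutation_probability`, `mutate`, `mix_mutate`, `total_probability`) are hypothetical helpers introduced here, not part of the paper.

```python
import random
from itertools import product


def hamming(x, y):
    """Number of positions in which bitstrings x and y differ."""
    return sum(a != b for a, b in zip(x, y))


def mutation_probability(x, y, chi):
    """Probability that bitwise mutation with rate chi/n turns x into y."""
    n = len(x)
    h = hamming(x, y)
    rate = chi / n
    return rate ** h * (1 - rate) ** (n - h)


def mutate(x, chi, rng):
    """Flip each bit of x independently with probability chi/n."""
    n = len(x)
    return [1 - b if rng.random() < chi / n else b for b in x]


def mix_mutate(x, rates, rng):
    """Uniform mixing: draw a rate uniformly from `rates`, then mutate."""
    chi = rng.choice(rates)
    return mutate(x, chi, rng)


def total_probability(x, chi):
    """Sanity check: the probabilities over all offspring should sum to 1."""
    return sum(mutation_probability(x, list(y), chi)
               for y in product([0, 1], repeat=len(x)))
```

Summing `mutation_probability` over all 2^n possible offspring is a cheap check that the formula defines a probability distribution.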
Here, we focus on |M| > 1. It is known that such mixing of mutation operators can be beneficial [10,6].
We analyse the runtime of Algorithm 1 using the level-based theorem [3]. This theorem applies to any population-based process where the individuals in $P_{t+1}$ are sampled independently from the same distribution $D(P_t)$, where $D$ maps populations $P_t$ to distributions over the search space $\mathcal{X}$. In Algorithm 1, the map is $D = p_{mut} \circ p_{sel}$, i.e., the composition of selection and mutation.
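The population process described above, in which every offspring is produced independently by selection followed by mutation, can be sketched as below. This is not Algorithm 1 itself, which is not reproduced in this section. Binary tournament selection with replacement, LeadingOnes as the fitness function, and all function names are assumptions made for illustration.

```python
import random


def leading_ones(x):
    """Number of consecutive 1-bits at the start of x."""
    count = 0
    for b in x:
        if b != 1:
            break
        count += 1
    return count


def binary_tournament(pop, fitness, rng):
    """Pick two individuals uniformly with replacement, return the fitter."""
    a, b = rng.choice(pop), rng.choice(pop)
    return a if fitness(a) >= fitness(b) else b


def bitwise_mutation(x, chi, rng):
    """Flip each bit independently with probability chi/n."""
    n = len(x)
    return [1 - v if rng.random() < chi / n else v for v in x]


def next_population(pop, select, mutate, rng):
    """Sample len(pop) offspring independently: select, then mutate."""
    return [mutate(select(pop, rng), rng) for _ in range(len(pop))]


def run(n, lam, chi, max_generations, seed=0):
    """Return the first generation containing 1^n, or None if not reached."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(lam)]
    select = lambda P, r: binary_tournament(P, leading_ones, r)
    mutate = lambda x, r: bitwise_mutation(x, chi, r)
    for t in range(max_generations):
        if any(leading_ones(x) == n for x in pop):
            return t
        pop = next_population(pop, select, mutate, rng)
    return None
```

Because `next_population` keeps no parent, the process is non-elitist: the best individual survives only if it is selected and then left intact by mutation.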
We apply the negative drift theorem for populations [9] to obtain tail bounds on the runtime of Algorithm 1. For any individual $P_t(i)$ in Algorithm 1, where $t \in \mathbb{N}$ and $i \in [\lambda]$, define $R_t(i) := |\{j \in [\lambda] \mid I_t(j) = i\}|$, i.e., the number of times the individual was selected. We define the reproductive rate of the individual $P_t(i)$ to be $E[R_t(i) \mid P_t]$, i.e., the expected number of offspring from individual $P_t(i)$. Informally, the theorem states that if all individuals close to a given search point $x^* \in \mathcal{X}$ have reproductive rate below a certain threshold $\alpha_0$, then the algorithm needs exponential time to reach $x^*$. The threshold depends on the mutation rate. Here, we derive a variant of this theorem for algorithms that use multiple mutation rates. In particular, we assume that the algorithm uses $m$ mutation rates, where mutation rate $\chi_i/n$ for $i \in [m]$ is chosen with probability $q_i$. The proof of this theorem is similar to that of Theorem 4 in [9], and thus omitted.
Theorem 2. For any $x^* \in \{0,1\}^n$, define $T := \min\{t \mid x^* \in P_t\}$, where $P_t$ is the population of Algorithm 1 at time $t \in \mathbb{N}$. If there exist constants $\alpha_0, c, c', \delta > 0$ such that with probability $1 - e^{-\Omega(n)}$
- the initial population satisfies $H(P_0, x^*) \ge c'n$,
- for all $t \le e^{cn}$ and $i \in [\lambda]$, if $H(P_t(i), x^*) \le c'n$, then the reproductive rate of individual $P_t(i)$ is no more than $\alpha_0$,
- $\sum_{j=1}^{m} q_j e^{-\chi_j} \le (1 - \delta)/\alpha_0$, and $\max_j \chi_j \le \chi_{\max}$ for a constant $\chi_{\max}$,
then $\Pr(T \le e^{c''n}) = e^{-\Omega(n)}$ for a constant $c'' > 0$.
Proof (of Theorem 2). We apply Theorem 1 in [9]. The first condition holds immediately. We use the distance function $g(x) := H(x, x^*)$. Without loss of generality, we assume that $x^* = 1^n$, hence $g(x)$ is the number of 0-bits in $x$.
For the second condition, the drift of the process $\Delta(i)$ [...] and the number of 0-bits flipped, where $Q \sim q$. For $i < b(n)$, we use $\exp(np(e^{\kappa} - 1))$ as an upper bound on the mgf of a binomially distributed random variable with parameters $n$ and $p$, and get [...]. Noting that $e^{-\kappa} \le \ln(1 + \delta)/(4\chi_{\max})$, we get [...]. The second condition is then satisfied. The third and fourth conditions can be satisfied for any mutation rate $\chi/n$ for appropriate positive constants $\delta_2, \delta_3 \in (0, 1)$ and $D(n)$, as long as $\kappa(n) \ge \ln(2)$ (see the proof of Theorem 4 in [9]). □
Theorem 3. The runtime of Algorithm 1 with reproductive rate $\alpha_0$ and mutation rate $\chi_{high}/n \ge (\ln(\alpha_0) + \delta)/n$ for some constant $\delta > 0$ satisfies $\Pr(T \le e^{cn}) = e^{-\Omega(n)}$ on any function with a unique global optimum $x^*$, assuming that $H(P_0, x^*) \ge c'n$ for two constants $c > 0$ and $c' \in (0, 1)$.
For |M| = 2, we have the following general result, again due to Theorem 2.
Theorem 4. Consider Algorithm 1 with reproductive rate $\alpha_0$ and mutation rates $\chi_{low}/n$ and $\chi_{high}/n$. If there exist constants $\delta_1, \delta_2, \varepsilon > 0$ such that the EA chooses mutation rate $\chi_{high}$ with probability at least $\frac{\delta_1(1+\varepsilon)}{\delta_1+\delta_2}$, then $\Pr(T \le e^{cn}) = e^{-\Omega(n)}$ on any function with a unique optimum $x^*$, given that $H(P_0, x^*) \ge c'n$ for some constants $c', c > 0$.
Proof (of Theorem 4). We have [...], which by Theorem 2 implies the result.
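The mutation-rate condition of Theorem 2, $\sum_j q_j e^{-\chi_j} \le (1-\delta)/\alpha_0$, is easy to evaluate numerically. The helper below is a hypothetical sketch, not part of the paper; the values $\alpha_0 = 2$ and $\delta = 0.1$ used in the usage note are illustrative assumptions only.

```python
import math


def mixed_rate_condition(chis, qs, alpha0, delta):
    """Check sum_j q_j * exp(-chi_j) <= (1 - delta) / alpha0.

    chis: mutation parameters chi_j (the per-bit rate is chi_j / n).
    qs:   probabilities q_j with which each rate is chosen (must sum to 1).
    """
    if len(chis) != len(qs) or abs(sum(qs) - 1.0) > 1e-9:
        raise ValueError("qs must be a probability vector matching chis")
    lhs = sum(q * math.exp(-chi) for chi, q in zip(chis, qs))
    return lhs <= (1 - delta) / alpha0


if __name__ == "__main__":
    # Single rate chi = 1: the condition holds, so the lower bound applies.
    print(mixed_rate_condition([1.0], [1.0], alpha0=2.0, delta=0.1))
    # Single rate chi = 0.5: the condition fails.
    print(mixed_rate_condition([0.5], [1.0], alpha0=2.0, delta=0.1))
    # Uniformly mixing 0.5 with a high rate 3.0: the condition holds again.
    print(mixed_rate_condition([0.5, 3.0], [0.5, 0.5], alpha0=2.0, delta=0.1))
```

Under these illustrative values, mixing a rate that is harmless on its own with a sufficiently high rate brings the sum back under the threshold, which is the situation Theorem 4 addresses.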

Robust self-adaptation
The previous section showed how critically non-elitist EAs depend on having appropriate mutation rates. A slightly too high mutation rate $\chi_{high}$ can lead to an exponential increase in runtime. Uniform mixing of mutation rates can fail if the set of mutation rates $\mathcal{M}$ contains one such high mutation rate, even though the set also contains an appropriate mutation rate $\chi_{low}$.
Self-adaptation has a similar problem if the strategy parameter $p$ is chosen too high. However, we will prove for a simple, unimodal fitness function that for a sufficiently small strategy parameter $p$, self-adaptation becomes highly robust, and is capable of fine-tuning the mutation rate.
For the rest of this section, we consider a set of two mutation rates $\mathcal{M} = \{\chi_{low}, \chi_{high}\}$ which for arbitrary parameters $\ell \in [n]$ and $\varepsilon > 0$ are defined by 1 − [...]. By the previous section, if $\ell$ is chosen sufficiently small, and hence $\chi_{high}$ sufficiently high, then uniform mixing will fail on any problem with a unique optimum. In contrast, using a Chernoff and a union bound, the following lemma shows that individuals that have chosen $\chi_{high}$ will quickly vanish from a self-adapting population, and the population will be dominated by individuals choosing the appropriate mutation parameter $\chi_{low}$.
Proof (of Lemma 1). For an upper bound, we assume that search points in $B$ have higher fitness than search points outside $B$. The probability of producing a $B$-individual with (µ, λ)-selection is at most [...]. Hence, $Y_{t+1}$ is stochastically dominated by a random variable $Z \sim \mathrm{Bin}(\lambda, p_s)$. It now follows by a Chernoff bound that [...].
Proof. We partition the search space into the following $n + 2$ levels [...]. The special level $A_{-1}$ contains search points with too high mutation rate. We first estimate the expected runtime assuming that there are never more than $(3/4)\mu$ individuals in level $A_{-1}$. In the end, we will account for the generations where this assumption does not hold.
We now show that conditions (G1) and (G2) of the level-based theorem hold for the parameters $\gamma_0 := (1/8)(\mu/\lambda)$, $\delta := p\varepsilon$, and $z_j = \Omega(1/n)$. Assume that the current population has at least $\gamma_0\lambda = \mu/8$ individuals in $A_{\ge j-1}$ and $\gamma\lambda < \gamma_0\lambda$ individuals in $A_{\ge j}$, for $0 \le j \le n$ and $\gamma \in [0, \gamma_0)$. If $0 \le j \le \ell - 1$, then an individual can be produced in levels $A_{\ge j}$ if one of the $\gamma\lambda$ individuals in these levels is selected, and none of the first $j$ bits are mutated. Assuming in the worst case that the selected individual has chosen the high mutation rate, the probability of this event is at least $(\gamma\lambda/\mu)$ 1 − [...].
Condition (G3) holds for any population size $\lambda \ge c \ln(n)$ and a sufficiently large constant $c$, because $\gamma_0$ and $\delta$ are constants. It follows that the expected number of generations until the optimum is found is $t_1(n) = O(n \log(\lambda) + n^2/\lambda)$. By Markov's inequality, the probability that the algorithm has not found the optimum after $2t_1(n)$ generations is less than 1/2.
Finally, we account for the generations with more than $(3/4)\mu$ individuals in level $A_{-1}$. We call a phase good if after $t_0(n) = O(\log(\lambda))$ generations and for the next $2t_1(n)$ generations, there are fewer than $(3/4)\mu$ individuals in level $A_{-1}$. By Lemma 1, a phase is good with probability $1 - (t_0(n) + 2t_1(n)) \cdot e^{-\Omega(\lambda)} = \Omega(1)$, for $\lambda \ge c \ln(n)$ and $c$ a sufficiently large constant. By the level-based analysis, the optimum is found with probability at least 1/2 during a good phase. Hence, the expected number of phases required to find the optimum is $O(1)$. The theorem now follows by keeping in mind that each generation costs $\lambda$ evaluations. □
We have shown that the EA can self-adapt to choose the low mutation parameter $\chi_{low}$ when required. Nevertheless, uniform mixing of mutation rates with a sufficiently small $\chi_{low}$ could achieve the same asymptotic performance. Furthermore, naively picking a mutation rate from the beginning also has a constant probability of optimising the function in polynomial time. Our aim is therefore to show that there exists a setting for which all the above approaches, except self-adaptation, fail. To prove this, we have identified a problem $f_m$ where a high mutation rate is required in one part of the search space, and a low mutation rate is required in another part. For $1 \le m < n$, define $f_m(0^n) := m$ and $f_m(x) := \mathrm{LeadingOnes}(x)$ for all $x \ne 0^n$. We call the local optimum $0^n$ the peak, and assume that all individuals in the initial population are peak individuals. It is clear that the elitist (µ+λ) EA without any diversity mechanism will only accept a search point if it has at least $m$ leading 1-bits.
Theorem 6. Starting at $0^n$, the (µ+λ) EA has expected runtime $n^{\Omega(m)}$ on $f_m$.
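The function $f_m$ is small enough to write out directly. The sketch below follows the definition in the text, reading the second case as applying to all $x \ne 0^n$. The Python names `leading_ones` and `f_m` are illustrative choices.

```python
def leading_ones(x):
    """Number of consecutive 1-bits at the start of the bitstring x."""
    count = 0
    for b in x:
        if b != 1:
            break
        count += 1
    return count


def f_m(x, m):
    """Peaked LeadingOnes: the all-zeros string (the peak) has fitness m,
    every other string has its LeadingOnes value."""
    if not any(x):
        return m
    return leading_ones(x)


if __name__ == "__main__":
    m = 2
    print(f_m([0, 0, 0, 0], m))  # the peak
    print(f_m([0, 1, 1, 1], m))  # one step off the peak, no leading 1-bits
    print(f_m([1, 1, 1, 1], m))  # the global optimum
```

A search point that leaves the peak is worse than the peak until it has at least m leading 1-bits, which is why an elitist algorithm starting at 0^n has to produce m specific bits in one step.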
To reach the optimal search point more efficiently, it is necessary to accept worse individuals into the population, for example by using a non-elitist selection scheme. Since $f_m$ has a unique global optimum, either using only a too high mutation rate or uniformly mixing a correct mutation rate with a too high one can lead to exponential runtime, as discussed above. Analogously to the (µ+λ) EA, we also prove that using a too low mutation rate fails because the population is trapped on the peak (e.g. due to Theorem 2, individuals that have fallen off the peak have too low a reproductive rate to optimise $m$ leading 1-bits). Subsequent proofs use the two functions $q(i) := (1 - \chi_{low}/n)^i$ and $r(i) := (1 - \chi_{high}/n)^i$, which are the probabilities of not flipping the first $i \in [n]$ bits using mutation rate $\chi_{low}/n$ and $\chi_{high}/n$ respectively. Clearly, $q(i)$ and $r(i)$ are monotonically decreasing in $i$. We also use the function $\beta(\gamma) := 2\gamma(1 - \gamma/2)$, which is the probability that binary tournament selection chooses one of the $\gamma\lambda$ fittest individuals.
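The helper functions $q$, $r$ and $\beta$ used in the proofs can be written down directly from their definitions. In the sketch below, a single hypothetical function `no_flip_probability` covers both $q$ (with $\chi = \chi_{low}$) and $r$ (with $\chi = \chi_{high}$); the numerical rates in the usage lines are arbitrary illustrative values.

```python
def no_flip_probability(chi, n, i):
    """Probability that none of the first i bits flips at rate chi/n.

    Gives q(i) for chi = chi_low and r(i) for chi = chi_high.
    """
    return (1 - chi / n) ** i


def beta(gamma):
    """Probability that a binary tournament picks one of the gamma*lambda
    fittest individuals; algebraically equal to 1 - (1 - gamma)**2."""
    return 2 * gamma * (1 - gamma / 2)


if __name__ == "__main__":
    n = 100
    chi_low, chi_high = 0.5, 4.0  # illustrative values, not from the paper
    for i in (1, 10, 100):
        print(i, no_flip_probability(chi_low, n, i),
              no_flip_probability(chi_high, n, i))
    print(beta(0.1), beta(0.5))
```

The identity β(γ) = 1 − (1 − γ)² reflects that the tournament fails to return a top individual only if both sampled competitors lie outside the top γ-fraction.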
Proof (of Theorem 7). We will prove that with probability $1 - e^{-\Omega(\lambda)}$, all individuals during the first $e^{cn}$ generations have fewer than $m$ leading 1-bits, where $c > 0$ is a constant. Clearly, this stronger statement implies the theorem.
We now assume that the run is not a failure. Furthermore, we assume that the algorithm is optimising the function $g(x) := \min(m, f_m(x))$ instead of $f_m$. Clearly, the time to reach at least $m$ leading 1-bits is the same, whether the algorithm optimises $g$ or $f_m$. Assuming that there are more than $(\lambda/2)(1 + \delta/2)$ peak individuals, the reproductive rate of any non-peak individual is always less [...]. For non-peak individuals, the last $n - m$ bit-positions are irrelevant when the algorithm optimises $g$. We can therefore apply the negative drift theorem (Theorem 2) with respect to the algorithm limited to the first $m$ bit positions only. The variation operator in this algorithm flips each of the $m$ bits independently with probability $\chi'/m$, where $\chi' = \chi_{low}(m/n)$. Hence, we have $e^{-\chi'} < 1 = (1 - \delta/2)/\alpha_0$, and the conditions of the theorem are satisfied.
Our intuition is that with a sufficiently high mutation rate, some individuals fall off the peak and form a sub-population which optimises the LeadingOnes part of the problem. This will happen if the selective pressure is not too high. However, at the same time, the population should be able to reach the optimal search point $1^n$ after escaping the local optimum. Here we use the level-based technique to infer constraints on the mutation rates and the strategy parameter $p$. The proof idea follows closely from these observations. We will need the following result to limit the number of individuals in unfavourable portions of the search space, i.e. portions where too many individuals would prevent the algorithm from moving in the right direction.
Lemma 2. Given any subset $A \subset \mathcal{X}$, let $Y_t := |P_t \cap A|$ be the number of individuals in generation $t \in \mathbb{N}$ of Algorithm 1 with tournament size 2 that belong to subset $A$. If there exist three parameters $\rho, \sigma, \varepsilon \in (0, 1)$ such that $\Pr(p_{mut}(y) \in A) \le \rho$ for all $y \in A$ and $\Pr(p_{mut}(y) \in A) \le \sigma\gamma^* - \varepsilon$ for all $y \notin A$, where $\gamma^*$ : [...]
Proof (of Lemma 2). For an upper bound, we assume that all search points in $A$ have higher fitness than search points in $\mathcal{X} \setminus A$. The probability of selecting an individual in $A$ is therefore $\beta(Y_t/\lambda)$. The probability that any given offspring in generation $t + 1 \le e^{c\lambda} - 1$ belongs to subset $A$ is no more than [...]. Hence, $Y_{t+1}$ is stochastically dominated by the random variable $Z \sim \mathrm{Bin}(\lambda, p_s)$. It now follows by a Chernoff bound that [...]. The proof is completed by induction with respect to $t$ and a union bound. □

Proof (of Theorem 8).
We apply the level-based theorem with respect to a partitioning of the search space $\mathcal{X} = \{0, 1\}^n \times \mathcal{M}$ into the following $n + 2$ levels [...], where $\ell \in [n]$ is the unique integer such that 1 − [...]. We first estimate the expected runtime assuming that every population contains fewer than $\psi\lambda$ individuals in $A_{-1}$, and fewer than $\xi\lambda$ individuals in the set $B := \{(y, \chi_{high}) \mid \mathrm{Lo}(y) \ge \ell\}$, where $\psi := 123/250$ and $\xi := 1/5$. In the end, we will account for the generations where these assumptions do not hold. We begin by showing that condition (G2) of the level-based theorem holds for all levels.
Levels $m + 1 \le j < \ell$: The probability of mutating an individual from $A_{\ge j}$ into $A_{\ge j}$, pessimistically assuming that the selected individual uses the high mutation rate $\chi_{high}$, is at least $r(\ell - 1)(1 - p) + q(\ell - 1)p > r(\ell - 1)(1 - p) + q(n)p > (85/171)(1 - p) + (2/3)p = 1/2 + 1/180$. Hence, assuming that the current population has $\gamma\lambda$ individuals in $A_{\ge j}$ where $\gamma \in (0, \gamma_0)$, the probability of selecting one of these individuals and mutating it into $A_{\ge j}$ is at least [...] for some $\delta' > 0$, given that $\gamma_0$ is a sufficiently small constant. Note that the lower bound on $\beta(\gamma)$ here depends neither on $\psi$ nor on $\xi$, because in this setting the peak individuals have lower fitness than the individuals in $A_j$, and $B \subset A_{\ge j}$.
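The closing equality $(85/171)(1 - p) + (2/3)p = 1/2 + 1/180$ can be checked with exact rational arithmetic. The sketch below assumes $p = 1/20$, a value inferred from the constants appearing elsewhere in the proof rather than stated in this section. The bounds $85/171$ on $r(\ell - 1)$ and $2/3$ on $q(n)$ are taken from the text, and the function name is hypothetical.

```python
from fractions import Fraction

# Assumed strategy parameter (inferred from the constants in the proof).
p = Fraction(1, 20)
# Lower bounds taken from the text: r(l - 1) > 85/171 and q(n) > 2/3.
r_bound = Fraction(85, 171)
q_bound = Fraction(2, 3)


def stay_probability_bound(r_low, q_low, p):
    """Lower bound on keeping the first l - 1 bits intact when the parent
    uses the high rate: keep the rate (prob. 1 - p) and mutate at the high
    rate, or switch (prob. p) and mutate at the low rate."""
    return r_low * (1 - p) + q_low * p


bound = stay_probability_bound(r_bound, q_bound, p)
print(bound)                        # exact rational value
print(bound - Fraction(1, 2))       # margin above one half
```

The margin above 1/2 is what allows the selection term β(γ), which is close to 2γ for small γ, to yield a growth factor strictly larger than 1.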
Levels $\ell \le j \le n$: By the level-partitioning, any individual in these levels uses the low mutation rate $\chi_{low}$, and other individuals with at least $\ell$ leading 1-bits belong to the set $B$. Assume that the current population contains $\gamma\lambda$ individuals in levels $A_{\ge j}$, where $\gamma \in (0, \gamma_0)$. An individual in $A_{\ge j}$ can be produced by having a binary tournament with at least one individual from $A_{\ge j}$ and none of the at most $\xi\lambda$ individuals in $B$, not mutating any of the bits, and not changing the mutation rate. The probability of this event is at least $2\gamma(1 - \gamma_0/2 - \xi)\,q(n)(1 - p) \ge \gamma(4/5 - \gamma_0/2)(19/15) = \gamma(1 + 1/75 - (19/30)\gamma_0) > \gamma(1 + \delta')$ for some constant $\delta' > 0$, assuming that $\gamma_0$ is sufficiently small.
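The chain of inequalities above can be unpacked step by step. The worked version below assumes the bound $q(n) \ge 2/3$ and the value $\xi = 1/5$ used in the proof, together with $p = 1/20$, which is inferred from the constants in the proof rather than stated in this section.

```latex
\begin{align*}
2\gamma\left(1 - \tfrac{\gamma_0}{2} - \xi\right) q(n)\,(1-p)
  &\ge 2\gamma\left(\tfrac{4}{5} - \tfrac{\gamma_0}{2}\right)
       \cdot \tfrac{2}{3} \cdot \tfrac{19}{20} \\
  &= \gamma\left(\tfrac{4}{5} - \tfrac{\gamma_0}{2}\right)\tfrac{19}{15} \\
  &= \gamma\left(\tfrac{76}{75} - \tfrac{19}{30}\gamma_0\right)
   = \gamma\left(1 + \tfrac{1}{75} - \tfrac{19}{30}\gamma_0\right).
\end{align*}
```

Under these assumptions the final expression exceeds $\gamma$ exactly when $(19/30)\gamma_0 < 1/75$, i.e. when $\gamma_0 < 2/95$, which makes the requirement that $\gamma_0$ be "sufficiently small" concrete.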
We now show that condition (G1) of the level-based theorem is satisfied for a parameter $z = \Omega(1/n)$ in any level $j$. Assume that the current population contains at least $\gamma_0\lambda$ individuals in $A_{\ge j}$. Then, to create an individual in $A_{\ge j+1}$, it is sufficient to create a tournament of two individuals from $A_{\ge j}$, flip at most one bit, and either keep or switch the mutation rate. The probability of such an event is at least $\gamma_0^2 (\chi_{low}/n)(1 - \chi_{high}/n)^{n-1} p = \Omega(1/n)$. To complete the application of the level-based theorem, we note that since $\delta$ and $\gamma_0$ are constants, condition (G3) is satisfied when $\lambda \ge c \ln n$ for some constant $c$. Hence, under the assumptions on the number of individuals in level $A_{-1}$ and $B$ described above, the level-based theorem implies that the algorithm obtains the optimum in expected $t_1(n) = O(n \log(\lambda) + n^2/\lambda)$ generations. Furthermore, by Markov's inequality, the probability that the optimum has not been found within $2t_1(n)$ generations is less than 1/2.
To complete the proof, we justify the assumption that fewer than $\psi\lambda$ individuals belong to level $A_{-1}$, and fewer than $\xi\lambda$ individuals belong to $B$. We will show using Lemma 2 that starting with any population, these assumptions hold after an initial phase of $t_0(n) = O(\log(\lambda))$ generations. We call a phase good if the assumptions hold for the next $t_1(n) < e^{c\lambda}$ generations.
Similarly, the probability of not destroying a $B$-individual with mutation is, by definition of $\ell$, at most 1 − [...]. To create a $B$-individual from $\mathcal{X} \setminus B$, it is in the best case necessary to change the mutation rate from $\chi_{low}$ to $\chi_{high}$ and not mutate the first $\ell$ bit-positions. The probability of this event is $(1 - \chi_{high}/n)^{\ell}\, p \le \frac{85}{171} \cdot \frac{1}{20} = \frac{17}{684}$. Therefore, by Lemma 2 with respect to $\sigma := 3/20$ and the above value of $\rho$, for every generation $t$ where $t_0(n) < t < e^{c\lambda}$ and $t_0(n) = O(\log(\lambda))$, it holds that $\Pr(|P_t \cap B| \ge \xi\lambda) = e^{-\Omega(\lambda)}$, where $\xi := 1/5$.
To summarise, starting from any configuration of the population, a phase of length $t_0(n) + 2t_1(n) = O(n \log(\lambda) + n^2/\lambda)$ generations is good with probability $1 - e^{-\Omega(\lambda)}$. If a phase is good, then the optimum will be found by the end of that phase with probability at least 1/2. Hence, the expected number of phases required to find the optimum is $O(1)$, and the theorem follows, keeping in mind that each generation costs $\lambda$ function evaluations.
The initial population, including mutation rates, is sampled uniformly at random. Hence the (1/10)-ranked individual will have fitness close to 1 in the first generations. For $j \le 5$, i.e. early in the run, approximately half of the population chooses the low mutation rate. However, the population quickly switches to the higher mutation rate $\chi_{high}$ until the (1/10)-ranked individual in the population reaches a value of approximately $j \ge 60$, where the population switches to the lower mutation rate $\chi_{low}$. Almost all individuals choose $\chi_{low}$ for $j \ge 108$. These experimental results confirm that the population adapts the mutation rate according to the region of the fitness landscape currently searched.

Conclusion
This is the first rigorous runtime analysis of self-adaptation. We have demonstrated that self-adaptation with a sufficiently low strategy parameter can robustly control the mutation rates of non-elitist EAs in discrete search spaces, and that this automated control can lead to exponential speedups compared to EAs that use fixed mutation rates or uniform mixing of mutation rates.