A Multiobjective Evolutionary Approach Based on Graph-in-graph for Neural Architecture Search of Convolutional Neural Networks

Convolutional neural networks (CNNs) have achieved remarkable results in solving many problems, such as image classification 16.

This list does not mean to be exhaustive, since other methods exist that do not belong to any of the categories above, such as Monte Carlo Tree Search 90. The various NAS methods belonging to each category present advantages and disadvantages. Specifically, RL-based algorithms require a large computational time to perform the automatic design, even on medium-scale datasets such as CIFAR-10 and CIFAR-100 37. Unlike RL-based algorithms, gradient-based algorithms are usually very fast. However, their search logic makes them prone to converging to a local optimum, which may perform much worse than the desired optimal design. Moreover, a gradient-based search algorithm needs to construct a super network in advance, which should cover as much of the search space as possible. The construction of this super network requires substantial human intervention by an expert, see Refs. 15; 25. Although EAs are not theoretically guaranteed to converge to the global optimum of the problem, they are able to escape local optima. Also, they do not require a super network. Thus, EAs are often considered a viable compromise for NAS, since they are relatively fast and can be applied to NAS without human intervention or prior knowledge of the problem. One pioneering example is in Ref. 94. It is worth remarking that other search strategies have also been integrated into NAS methods, such as those in Refs. 57, 18, and 55.

This paper focuses on EAs for NAS. In the following subsections, some context is provided around the two major challenges of this approach: the encoding mechanism and the evaluation of the candidate solutions.

The encodings of candidate network architectures for NAS methods are broadly divided into two categories 38: direct encoding and indirect encoding. Indirect encoding was often used in early works, usually referred to as Neuroevolution, see Ref. 71, which is closely related to NAS. Neuroevolution uses evolutionary computation to optimize the structure and the parameters of neural networks at the same time 4; 27; 30; 26; 1, and many researchers still work on it 76; 77; 64; 32; 8. However, due to the limitations of the hardware of that time, neuroevolution could only be applied to small networks. Furthermore, due to the very large number of parameters in fully connected networks, direct encoding cannot be used to represent the whole network. Therefore, considerable effort was devoted to finding compact representations (i.e., indirect encodings) of the connections and weight parameters of neurons. Such search spaces are difficult to represent with direct encoding, so early researchers adopted indirect encoding to simplify the search space and represent individuals compactly.

In recent years, most NAS studies have been conducted on neural networks that, albeit complex, can be naturally schematised as interconnected blocks. This is the case, besides CNNs, of Generative Adversarial Networks (GANs) 28 and Recurrent Neural Networks (RNNs) 47. For networks of these types, direct encoding is an easy and natural option. For example, CNNs contain convolution blocks, pooling blocks, batch normalization operations, and sometimes activation functions. These blocks can be represented by a few parameters. A convolution block is fully described by the number of convolution kernels, the kernel size, stride, padding, dilation and groups (in fact, some of these parameters can be ignored depending on the actual search strategy and purpose). In most cases, pooling blocks, batch normalization operations and activation functions do not even require dedicated parameters; their position in the structure is enough to represent them.
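As an illustration of how few parameters suffice for a direct encoding, the sketch below represents a convolution block by a small set of integer hyperparameters. The class name and the to_layer helper are our own choices for illustration, not part of any specific NAS implementation.

```python
from dataclasses import dataclass
import torch.nn as nn

@dataclass
class ConvBlockGene:
    """Direct encoding of a convolution block by a handful of integers."""
    out_channels: int
    kernel_size: int
    stride: int = 1
    padding: int = 0
    dilation: int = 1
    groups: int = 1

    def to_layer(self, in_channels: int) -> nn.Conv2d:
        # The gene fully determines the layer; the weights are learned later.
        return nn.Conv2d(in_channels, self.out_channels, self.kernel_size,
                         stride=self.stride, padding=self.padding,
                         dilation=self.dilation, groups=self.groups)

# Pooling and batch-normalization blocks need no extra parameters:
# their position in the chromosome is enough to instantiate them.
gene = ConvBlockGene(out_channels=64, kernel_size=3, padding=1)
layer = gene.to_layer(in_channels=3)
```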

In the upper right part of the figure, the standard crossover is illustrated. The sequences G2 and G4 are swapped over and four sets of corresponding weights W5, W6, W7 and W8 are randomly initialized, thus generating new networks (indicated with a darker colour). In the lower right part of the figure, the weight inheritance method is illustrated. When the crossover occurs, the offspring solutions inherit the weights of the parent (the weights of that portion of the network). Thus, the first offspring solution is composed of G1 and G4 with the weights W1 and W4, while the second solution is composed of G3 and G2 with the weights W3 and W2.

In this section, we introduce the framework of the proposed NAS algorithm, namely the Multi-Objective Graph-in-graph Network (MOGIG-Net), whose flowchart is shown in Fig. 3.

This section first introduces the overall framework of the proposed algorithm and then describes the encoding mechanism, crossover, mutation, decoding method, evaluation, and environmental selection in detail.

2: Convert all genes in P0 to models, evaluate the fitness of the models by method 12, and record the fitness of each corresponding individual;
3: Record the fingerprint and fitness value of each individual;
4: t ← 0
5: while t < T do
6:   Q ← ∅
7:   if the length of Q < P then
8:     Randomly select two individuals, apply the crossover and mutation of Algorithms 7 and 8, and generate two offspring;
9:     Record the fingerprint of each offspring and add the two offspring to the population Q;
10:  end if
11:  Pt ← Pt ∪ Q
12:  Convert the genes of the individuals in Q to models, evaluate their fitness by method 12, and record the fitness of each individual;
13:  Sort Pt by the non-dominated sorting algorithm, retain the P best-performing individuals, and delete the remaining individuals to obtain Pt+1;
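A minimal Python sketch of this generational scheme is given below. We read steps 7-10 as filling Q until it contains P offspring, and the helpers evaluate, crossover_mutate, fingerprint and non_dominated_sort stand in for the corresponding procedures of the paper; they are placeholders of our own, not the authors' code.

```python
import random

def evolve(init_population, P, T, evaluate, crossover_mutate,
           fingerprint, non_dominated_sort):
    """Sketch of the generational loop: P individuals, T generations."""
    population = list(init_population)                         # P_0
    fitness = {fingerprint(ind): evaluate(ind) for ind in population}

    for t in range(T):
        offspring = []                                         # Q
        while len(offspring) < P:
            p1, p2 = random.sample(population, 2)              # random mating selection
            offspring.extend(crossover_mutate(p1, p2))         # two offspring per mating
        for child in offspring:
            fp = fingerprint(child)
            if fp not in fitness:                              # reuse cached evaluations
                fitness[fp] = evaluate(child)
        merged = population + offspring                        # P_t ∪ Q
        # non_dominated_sort is assumed to return individuals ordered by
        # Pareto front (ties broken by a secondary criterion); keep the best P.
        population = non_dominated_sort(merged, fitness)[:P]
    return population
```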

In this study, we use a graph structure to encode the architecture of the network. The search space contains up to 2^L possible candidate networks.
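For intuition on this bound: if the connection pattern of the graph is encoded as a binary string of length L (one bit per possible link between blocks), the number of distinct strings, and hence of candidate networks, is at most 2^L. The toy enumeration below, with names of our own choosing, makes this explicit.

```python
from itertools import product

L = 6  # toy number of encodable connections; real search spaces use far larger L
candidates = list(product([0, 1], repeat=L))   # every possible connection mask
assert len(candidates) == 2 ** L               # up to 2^L candidate networks
```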

Due to the encoding mechanism proposed in this paper, an ad-hoc crossover operator is proposed here to ensure that the offspring solutions meaningfully represent structures of neural networks 13. Furthermore, a meaningful chromosome must represent a connected graph.

The proposed crossover operator combines two chromosomes I and II by randomly selecting some blocks from the first and then filling the missing gaps with the genotype of the second, ensuring that the offspring is meaningful. Fig. 7 provides the implementation details of the crossover.

For chromosome I, two separators are randomly selected. Then the number n of separators between the two selected separators is calculated (line 6). Next, two separators in chromosome II are selected such that the number of separators between them is also n (line 7). Finally, the genes between the two separators are exchanged (line 8).
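One possible reading of this procedure in Python is sketched below. The chromosome layout (a flat list of genes with a dedicated separator symbol) and all function names are our assumptions; the sketch assumes each chromosome contains at least n + 2 separators so that a matching pair can always be found in chromosome II.

```python
import random

SEP = "|"  # separator symbol; the real encoding uses a dedicated gene value

def separator_positions(chrom):
    return [i for i, g in enumerate(chrom) if g == SEP]

def crossover(chrom_a, chrom_b, rng=random):
    """Swap the segments enclosed by two separators that contain the same
    number n of inner separators, so both offspring stay well-formed."""
    pos_a, pos_b = separator_positions(chrom_a), separator_positions(chrom_b)
    i, j = sorted(rng.sample(range(len(pos_a)), 2))    # two separators in chromosome I
    n = j - i - 1                                      # separators strictly between them
    # candidate separator pairs in chromosome II enclosing exactly n separators
    pairs_b = [(k, k + n + 1) for k in range(len(pos_b) - n - 1)]
    k, l = rng.choice(pairs_b)
    a_lo, a_hi = pos_a[i], pos_a[j]
    b_lo, b_hi = pos_b[k], pos_b[l]
    child_a = chrom_a[:a_lo + 1] + chrom_b[b_lo + 1:b_hi] + chrom_a[a_hi:]
    child_b = chrom_b[:b_lo + 1] + chrom_a[a_lo + 1:a_hi] + chrom_b[b_hi:]
    return child_a, child_b

# Toy usage with string genes; '|' marks block boundaries.
a = list("11|0101|001|11")
b = list("00|111|0|0110")
child_a, child_b = crossover(a, b)
```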

The mutation operation, outlined in Fig. 8, consists of the random flip from 0 to 1 or from 1 to 0 of a gene (except for the positions of the separators). Although the location of a mutation is limited to a single gene, even a small connection change affects all the feature maps downstream of it.

Fig. 9 represents the construction of the CNN from its chromosome. First, the CBj (in blue) are decoded. If a CBj is the same as that in the corresponding position of its parents, the module is copied from the parents. Otherwise, the module is generated according to the procedure illustrated in Fig. 5. Then, S is decoded and the corresponding connection is represented by an input array of each block (such as the two red arrows pointing to CB3). Finally, P is decoded and the corresponding position in each input array of each block is wrapped by an adaptive pooling (like the right sub-figure in Fig. 9). The connection method in the detailed structure depends on the method chosen before running the algorithm: if we use the residual structure, we add the connection directly; if we use the dense structure, we adjust the channels and merge them by using 1x1 convolution kernels to unify the channel number.

Fig. 9. Construction of a CNN from its chromosome: the blocks or connections are decided by the part of the encoding in the same colour. The green squares represent fixed structures. FC means a fully connected layer. P means a pooling layer. Blocks are built as in Fig. 5.

We also implemented a mechanism to handle encodings that would not represent a connected graph. In the left encoding, node 3 would have only inputs; thus, an output link is generated to guarantee connectivity. In the central encoding, node 3 would have only outputs; thus, an input link is generated to guarantee connectivity. In the right encoding, node 3 would be isolated; thus, the node is removed from the graph.

Furthermore, since the consistency of the image size outside the convolution blocks (i.e., in the macro structure) must be maintained, another countermeasure has been adopted. We also encode the positions where the feature maps are reduced (marked in bold in the figure), because we choose to add adaptive pooling before them. In this way, we can control the size of the input and the output of each layer by controlling the position of the adaptive pooling. The locations of the adaptive pooling and the combination of these channels are referred to as the detailed structure of the individual and are recorded separately.
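The snippet below sketches, under our own naming and assumptions, how such a decoded connection could be realised in PyTorch: inputs flagged by P are first wrapped by adaptive pooling so their spatial size matches the block, residual connections are added directly (assuming matching channel counts), and dense connections are concatenated and unified by a 1x1 convolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MergeInputs(nn.Module):
    """Sketch of how a block's decoded inputs could be combined."""

    def __init__(self, in_channels_list, out_channels, mode="residual"):
        super().__init__()
        self.mode = mode
        if mode == "dense":
            # 1x1 convolution that unifies the channel number after concatenation.
            self.unify = nn.Conv2d(sum(in_channels_list), out_channels, kernel_size=1)

    def forward(self, inputs, target_size, pooled_flags):
        # Inputs flagged in P are resized by adaptive pooling to the block's size.
        resized = [F.adaptive_avg_pool2d(x, target_size) if pooled else x
                   for x, pooled in zip(inputs, pooled_flags)]
        if self.mode == "residual":
            # Direct addition; assumes all inputs share channel count and size.
            return torch.stack(resized, dim=0).sum(dim=0)
        # Dense merge: concatenate along channels, then unify with the 1x1 conv.
        return self.unify(torch.cat(resized, dim=1))
```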

We divide the training set D into two parts: 80% forms the actual training set D_train, and the remaining 20% forms the validation set D_valid. When a new population of offspring solutions is generated, their performance must be assessed to select the population undergoing the following generation. The networks composing the new population are trained on the training set D_train. When the change in the training metric falls below a pre-arranged threshold, the learning rate is reduced accordingly. If the adjusted learning rate is smaller than a pre-arranged value, the training is stopped.
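One way to realise this stopping rule is sketched below with PyTorch's ReduceLROnPlateau scheduler; the optimizer choice, thresholds and patience values are placeholders of our own, not the values used in the paper.

```python
import torch

def train_until_lr_floor(model, train_loader, valid_loader, loss_fn,
                         base_lr=0.1, min_lr=1e-4, max_epochs=200):
    """Reduce the learning rate when the validation loss plateaus and stop
    once the learning rate falls below a pre-arranged floor."""
    opt = torch.optim.SGD(model.parameters(), lr=base_lr, momentum=0.9)
    sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, factor=0.5, patience=5)
    for epoch in range(max_epochs):
        model.train()
        for x, y in train_loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            val_loss = sum(loss_fn(model(x), y).item() for x, y in valid_loader)
        sched.step(val_loss)                        # shrink LR when the loss plateaus
        if opt.param_groups[0]["lr"] < min_lr:      # LR fell below the floor: stop
            break
```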

In our approach, we use weight inheritance to speed up the search. Since our crossover operation ensures that most of the modules of the network remain unchanged, the model constructed from a child directly inherits the corresponding weights from the model of its parent. This method, like weight sharing, allows a network model to reach a relatively high accuracy at an early stage of the evolution. In this way, we only need to continue training at a relatively small learning rate to achieve the best performance of each network.
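A minimal sketch of such weight inheritance is given below, assuming that each block module keeps a stable parameter name shared between parent and child models; only parameters whose name and shape are unchanged by crossover or mutation are copied, while newly created blocks keep their random initialization.

```python
import torch.nn as nn

def inherit_weights(child: nn.Module, parent: nn.Module) -> None:
    """Copy parent weights into the child for every parameter whose
    name and shape are unchanged by crossover/mutation."""
    parent_state = parent.state_dict()
    child_state = child.state_dict()
    inherited = {name: tensor for name, tensor in parent_state.items()
                 if name in child_state and child_state[name].shape == tensor.shape}
    child_state.update(inherited)          # keep random init for new/changed blocks
    child.load_state_dict(child_state)
```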

After the training, the accuracy q.acc (expressed as the error rate) of the network is assessed by means of the validation set D_valid. Furthermore, the model size in terms of the number of parameters q.params is also calculated. Both scores, q.acc and q.params, characterise the quality of the candidate CNN. Non-dominated sorting 50, which is often used to compare solutions in multi-objective optimization 93, is used to select among parent and offspring solutions the population undergoing the following generation. The condition for one individual to dominate another is to have a performance not worse than the other according to all objectives and to outperform it according to at least one objective.
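For the two objectives used here (validation error q.acc and parameter count q.params, both to be minimised), the dominance condition can be written as the small check below; the tuple layout is our own convention.

```python
def dominates(a, b):
    """a and b are (error_rate, n_params) pairs, both minimised.
    a dominates b iff it is no worse in every objective and
    strictly better in at least one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better

# Example: a smaller and more accurate network dominates a (hypothetical)
# larger, less accurate one.
assert dominates((14.38, 3.7e6), (15.20, 4.1e6))
```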
Fig. 11. The left subgraph is the macro structure without pooling layers. After executing line 11 of the algorithm in Fig. 6, the adaptive pooling is added at the specified location (centre subgraph). The right subgraph is the micro structure. Each block includes some convolution cells, and each cell consists of 3x3 and 1x1 convolution kernels, which do not change the size of the input.

The popular datasets considered in this study are CIFAR-10 and CIFAR-100, proposed by the Canadian Institute for Advanced Research 37. These two datasets are often used to verify the performance of network models. Each dataset comprises 60000 images, of which 50000 form the training set and 10000 the test set. Each image is a 3-channel colour image with a height and a width of 32 pixels. There are 10 categories in CIFAR-10 and 100 categories in CIFAR-100. Both CIFAR-10 and CIFAR-100 come from a larger dataset of 80 million tiny images. Therefore, to a certain extent, CIFAR-10 and CIFAR-100 can illustrate the predictive ability of a model.

Table 1 displays the results of MOGIG-Net and twenty-one NAS competitors on CIFAR-10 and CIFAR-100. The listed methods are divided into three design categories: human design, single-objective approaches and multi-objective approaches. For each NAS method considered in this study, the reference to its original implementation is reported. For each method, we report the values of the objectives of the proposed model, that is, the accuracy q.acc expressed in terms of percentage error on CIFAR-10 and CIFAR-100, and the complexity q.params expressed in millions of parameters of the network designed by the corresponding NAS method. We may observe that the proposed MOGIG-Net can efficiently detect networks which combine a relatively low number of parameters with a low percentage error. For example, none of the seventeen competitor NAS methods achieves an error rate of 14.38% on CIFAR-100 with only 3.7 million parameters. With respect to NSGA-Net 50, a recent NAS method considered the state of the art in the field, the proposed MOGIG-Net designed networks with comparable performance notwithstanding a lower number of parameters (approximately 10% fewer parameters).

Figures 13 and 14 display the solutions detected by the proposed MOGIG-Net and its competitors in the objective space considered in this study. To enhance the readability of the figures, we present a zoom around the non-dominated solutions.

We noticed that when the network structure is relatively large, the number of pooling operations in the detailed structure greatly affects the required training time and memory space. When the number of pooling operations is small and the network structure is large, the intermediate variables are very large and the training time is very long. The results in this study have been obtained after two weeks of computation.

Experimental results show that, for networks with similar structures, the accuracy of large models is higher than that of small models, and our method is no exception. The reason for this phenomenon is that the increase in the number of parameters appears to improve the generalization capability of the model. Therefore, the maximum accuracy that can be achieved with large models is higher than that of smaller models.