The Use of Probabilistic Systems to Mimic the Behaviour of Idiotypic AIS Robot Controllers

Previous work has shown that robot navigation systems that employ an architecture based upon the idiotypic network theory of the immune system have an advantage over control techniques that rely on reinforcement learning only. This is thought to be a result of intelligent behaviour selection on the part of the idiotypic robot. In this paper an attempt is made to imitate idiotypic dynamics by creating controllers that use reinforcement with a number of different probabilistic schemes to select robot behaviour. The aims are to show that the idiotypic system is not merely performing some kind of periodic random behaviour selection, and to try to gain further insight into the processes that govern the idiotypic mechanism. Trials are carried out using simulated Pioneer robots that undertake navigation exercises. Results show that a scheme that boosts the probability of selecting highly-ranked alternative behaviours to 50% during stall conditions comes closest to achieving the properties of the idiotypic system, but remains unable to match it in terms of all round performance.


INTRODUCTION
An artificial immune system (AIS) is a computational algorithm that attempts to mimic properties of the vertebrate immune system in order to solve complex problems. There are a number of different types, including clonal selection-based algorithms [4], negative selection-based algorithms [5] and idiotypic networks [6]. The last group is inspired by Jerne's network theory of the immune system [1], which asserts that suppression and stimulation between antibodies play an important role in the immune response.
Within the domain of mobile robotics, the idiotypic network has remained a popular choice of AIS, with most researchers opting to implement Farmer et al.'s computational model [2] of Jerne's theory. This is largely because the network of stimulation and suppression between antibodies (analogous to behaviours in these systems) is thought to provide a means of achieving a decentralized behaviour-selection mechanism for the robot. Initial results, for example [7][8][9][10][11][12][13], are certainly very encouraging, but lack any comparison with other systems that would establish the idiotypic advantage. Furthermore, the complex system dynamics are poorly understood. However, [3] compares the performance of a reinforcement learning (RL)-based idiotypic system with that of a system that uses RL only, and provides statistical evidence that the idiotypic system is able to complete a task faster and with fewer stalls than the RL scheme. The paper also attempts to analyze the performance of both control systems in order to explain the idiotypic advantage. It suggests that the RL-only system provides a strategy that is too greedy, always selecting the behaviour (antibody) that best matches the current environmental situation (antigen). In contrast, the idiotypic system is much more flexible, allowing behaviours that are not necessarily the best to flourish. In particular, the paper proposes that when the robot is stalled the idiotypic mechanism is able to increase the rate of antibody change autonomously so that alternative behaviours are used instead of those already tried.
Given these suggestions, it is possible to create robot controllers that attempt to mimic these properties using probabilistic behaviour selectors coupled with RL. Hence, the main aim of this paper is the construction and testing of such systems in order to facilitate further scrutiny of idiotypic dynamics. In a number of experiments, nine different probabilistic schemes compete with the idiotypic system in order to establish that the idiotypic mechanism is not merely performing the equivalent of random behaviour arbitration, but acting in a more intelligent way. Furthermore, the gradual use of more complex systems that apply probabilistic selection more intelligently should help to provide additional insight into the processes that govern idiotypic dynamics. The rest of this paper is arranged as follows. Section 2 provides some background information about Jerne's idiotypic network theory, Farmer's computational model of it, and the variation of the model used in [3]. Section 3 describes the architectures of the idiotypic and probabilistic systems that are used in this research, and illustrates the environments and problems used for testing them. The experimental procedures adopted are reported in Section 4. Sections 5 and 6 present and discuss the results obtained, and Section 7 concludes the paper.

BACKGROUND
In the vertebrate immune system antibodies play a central role in eliminating antigens (for example bacteria and viruses) from the body. In order to achieve this, an antibody's combining site (paratope) must be able to bind to a region of the antigen called the epitope. According to the clonal selection theory of Burnet [14], antibodies with paratopes that possess a good degree of match to a given antigen epitope pattern proliferate within the system, i.e., are cloned (increase in concentration) and are kept in circulation.
However, Jerne's idiotypic network theory [1] proposes that antibodies also possess a set of epitopes called idiotopes, and that these are the mechanism by which antibodies recognize each other. He suggests that antibody concentration levels are also influenced by inter-antibody activity, i.e. antibodies that are recognized by others are suppressed and reduce in concentration and those that do the recognizing are stimulated and increase in concentration.
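Farmer et al. [2] go on to suggest that the dynamics of an idiotypic AIS with L antigens [y_1, y_2, …, y_L] and N antibodies [x_1, x_2, …, x_N] can be modelled with the following equation:
$$\dot{C}(x_i) = b\left[\sum_{j=1}^{L} U(x_i, y_j)\,C(x_i)\,C(y_j) - k_1 \sum_{m=1}^{N} V(x_i, x_m)\,C(x_i)\,C(x_m) + \sum_{p=1}^{N} W(x_i, x_p)\,C(x_i)\,C(x_p)\right] - k_2\,C(x_i) \qquad (1)$$
where C represents concentration and b, k_1 and k_2 are constants.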
In Eq. (1) the first sum in the square bracket models antibody stimulation due to antigens, the second sum models inter-antibody suppression and the last sum models inter-antibody stimulation. The match specificities for these three kinds of interaction are given by some functions U, V and W respectively, and the term outside the brackets embodies the natural antibody death rate. Here, the equation models both background antibody communication (i.e. that between all antibodies) and also active antibody communication. The latter is the stimulation and suppression that takes place between the antigenic antibody α (that with the best match to the presenting antigen) and any other antibody that matches the presenting antigen (i.e. any competing antibody).
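The structure of Eq. (1) can be illustrated with a short sketch. It assumes the form described above: an antigen-stimulation sum, a suppression sum weighted by k_1 and a stimulation sum, all scaled by b, minus a death term k_2 C(x_i). The function name, the list-of-lists layout of U, V and W and the numbers in the usage note are hypothetical and are not taken from [2] or [3].

```python
def farmer_rate(i, c_ab, c_ag, U, V, W, b, k1, k2):
    """Rate of change of C(x_i) under the assumed form of Eq. (1).

    c_ab[m] is the concentration of antibody m (length N),
    c_ag[j] is the concentration of antigen j (length L),
    U[i][j], V[i][m] and W[i][p] are match specificities.
    """
    ci = c_ab[i]
    # Stimulation of antibody i by every antigen.
    antigen_stim = sum(U[i][j] * ci * c_ag[j] for j in range(len(c_ag)))
    # Suppression of antibody i by the other antibodies.
    suppression = sum(V[i][m] * ci * c_ab[m] for m in range(len(c_ab)))
    # Stimulation of antibody i by the other antibodies.
    stimulation = sum(W[i][p] * ci * c_ab[p] for p in range(len(c_ab)))
    # The term outside the bracket is the natural death rate.
    return b * (antigen_stim - k1 * suppression + stimulation) - k2 * ci
```

With two antibodies and one antigen, c_ab = [1.0, 2.0], c_ag = [1.0], U = [[0.5], [0.0]], V = [[0, 1], [0, 0]], W = [[0, 0], [1, 0]], b = 1, k1 = 0.5 and k2 = 0.1, the suppressed antibody 0 has a negative rate while the stimulated antibody 1 has a positive one.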
In [3] background communication is ignored for simplification and antigen concentrations are not needed since each environmental scenario (antigen) is ranked in order of importance and weighted accordingly, i.e. multiple antigens are allowed to present themselves but one is deemed dominant and given greater weighting. Furthermore, N × L matrices P and I, which represent the antibody paratope and idiotope respectively, are used.
The elements of P are the current RL scores, which reflect the degree of match between each antibody and antigen, and the elements of I are fixed disallowed antibody-antigen combinations. Antibody communication is hence simulated by comparing the paratope of α with the idiotope of the competing antibodies (i.e. those that have nonzero match to the set of presenting antigens) and vice versa. The model thus reduces to Eq. (2), with matching functions U, V' and W' given by Eqs. (3) to (5).
However, Eq. (2) cannot be evaluated as a whole since α must be found first, so it is broken down into constituent parts. First, antibody α is computed from Eq. (6), then the suppression and stimulation factors are calculated from Eqs. (7) and (8) respectively. Finally, the global match strength S_g is determined from Eq. (9). The concentration of each antibody is hence given by substituting Eq. (9) into Eq. (2), which gives Eq. (10), and this is transformed to Eq. (11) upon discretization. The antibody β selected for execution is that with the highest normalized concentration, as given by Eq. (12).
In [3] experiments are performed that vary the values of the suppression-stimulation balancing constant k_1 and the rate constant b using a simulated robot navigation exercise as a test bed. Results show that when k_2 is fixed at 0.05, the robot tends to perform best with b set approximately between 40 and 160 and k_1 set between 0.575 and 0.650. In this region α ≠ β (there is an idiotypic difference) approximately 20% of the time. The parameter k_2 governs how quickly the antibodies reach zero concentration. In other systems this might lead to their removal and replacement with alternatives, but this particular architecture uses a fixed number of antibodies that are never replaced, so k_2 is deliberately kept low.
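The last steps of the idiotypic selection (concentration update, normalization and choice of β) can be sketched as follows. Eqs. (10) to (12) are not shown in this text, so the update rule here is an assumption: a unit-step discretization of dC/dt = b S_g - k_2 C, followed by rescaling so that the total concentration is unchanged. The function name, the clamping at zero and the example numbers are hypothetical; only b = 80 and k_2 = 0.05 come from the paper.

```python
def update_and_select(conc, s_g, b=80.0, k2=0.05):
    """One assumed discrete step for all antibodies, then selection of beta.

    conc[i] is the current concentration of antibody i and s_g[i] its
    global strength of match. Returns (new concentrations, index of beta).
    """
    total_before = sum(conc)
    # Assumed update: growth weighted by b, decay weighted by k2.
    raw = [max(c + b * s - k2 * c, 0.0) for c, s in zip(conc, s_g)]
    total_raw = sum(raw)
    if total_raw == 0:
        raise ValueError("all concentrations decayed to zero")
    # Rescale so that the total number of clones stays constant.
    new_conc = [r * total_before / total_raw for r in raw]
    # Beta is the antibody with the highest normalized concentration.
    beta = max(range(len(new_conc)), key=lambda i: new_conc[i])
    return new_conc, beta
```

Under these assumptions, `update_and_select([700.0, 300.0], [1.0, 2.0])` selects antibody 0 even though antibody 1 has the higher strength of match, because the existing concentrations act as a memory of past use. Starting from equal concentrations of 500.0, the same strengths select antibody 1.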
It is worth noting that in [17] and [18] a slightly different idiotypic design is used that allows for only one presenting antigen per iteration, and more importantly employs a variable idiotope matrix with probabilistic components. This means that the idiotypic difference rate is much harder to predict for given values in the {b, k_1, k_2} space, and suggests that the findings in [3] may be altogether dependent on the choice of the fixed disallowed antibody-antigen combinations in the idiotope matrix. For this reason the same combinations used in the idiotope matrix in [3] are used here (see Section 3).

TEST ENVIRONMENT AND SYSTEM ARCHITECTURE
Throughout this research simulated Pioneer 3 robots are used with Player's Stage 2.0.3 simulator [15]. The virtual robots possess eight sonar sensors at the rear and a laser sensor at the front that spans 180°. For convenience this 180° sector is subdivided into six 30° subsectors 1 to 6, with 1 and 2 representing the left, 3 and 4 corresponding to the centre, and 5 and 6 corresponding to the right of the robot.
A frontal camera that can detect different coloured objects is also placed centrally, so that the robot is able to recognize cyan squares placed in the doorways. Its task is to use these as markers in order to navigate through the rooms in two different maze environments, M_1 and M_2, which are shown in Figures 1 and 2 respectively. Note that when a robot has passed a cyan marker its path back to the previous room is blocked off manually. As stated earlier, environmental information is modelled with antigens and robot behaviours are modelled as antibodies that possess a fixed action component, and an idiotope and paratope element value for each antigen. For this purpose, eight antigens and sixteen antibodies are created as detailed in Tables 1 and 2 respectively and as in [3], where justification for choosing them is also given.

Table 1
No.  Antigen           Priority  Detection condition
0    …                 2         Z_min < 0.55 m and R_min = 1 or 2 (-90° to -30°)
1    Object centre     2         Z_min < 0.55 m and R_min = 3 or 4 (-29° to 29°)
2    Object right      2         Z_min < 0.55 m and R_min = 5 or 6 (30° to 90°)
3    …                 …         …
4    …                 …         Z_av < 0.45 m
5    Stalled           4         Distance travelled = 0
6    Blocked behind    5         Distance travelled = 0 and E_av < 0.35 m
7    Door marker seen  1         A cyan marker has been detected by the camera

Table 1 shows the priority ranking of the antigens, with 0 the lowest (least urgent) and 5 the highest (most urgent). Detection of the various antigens is governed by several sensor reading metrics, which include the minimum and average laser readings Z_min and Z_av, the average rear sonar reading E_av, and the position of the minimum laser reading R_min. The maximum laser reading Z_max is also used by antibody 11 (see Table 2).
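The detection rules of Table 1 can be expressed compactly in code. The sketch below covers only the antigens whose conditions appear in this text (0, 1, 2, 5, 6 and 7) and leaves out antigens 3 and 4. The function names are hypothetical, and choosing the dominant antigen as the presenting antigen with the highest priority is an assumption based on the ranking described above.

```python
def presenting_antigens(z_min, r_min, e_av, distance, marker_seen):
    """Return [(index, priority), ...] for every antigen whose condition holds.

    z_min and e_av are in metres, r_min is the laser subsector (1 to 6)
    of the minimum reading, distance is the distance travelled.
    """
    object_near = z_min < 0.55
    stalled = distance == 0
    rules = [
        (0, 2, object_near and r_min in (1, 2)),  # minimum reading on the left
        (1, 2, object_near and r_min in (3, 4)),  # object centre
        (2, 2, object_near and r_min in (5, 6)),  # object right
        (5, 4, stalled),                          # stalled
        (6, 5, stalled and e_av < 0.35),          # blocked behind
        (7, 1, marker_seen),                      # door marker seen
    ]
    return [(index, priority) for index, priority, holds in rules if holds]


def dominant_antigen(presenting):
    """Assumed rule: highest priority wins, ties go to the lowest index."""
    if not presenting:
        return None
    return max(presenting, key=lambda pair: (pair[1], -pair[0]))[0]
```

For example, a robot that has not moved, with an obstacle 0.4 m ahead in subsector 3 and a mean rear sonar reading of 0.2 m, presents antigens 1, 5 and 6, and antigen 6 (blocked behind) is dominant.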
Ten control systems are created, of which nine are probabilistic. The other system, I_D, uses the idiotypic architecture described in Section 2, with b set at 80, k_1 set at 0.65, and k_2 set at 0.05, as these parameter values fall within the region of {b, k_1, k_2} space where performance is optimal in [3].
Next the sensors are read and the dominant antigen and antigen array (G) element values are determined so that S_1 (the degree of match to antigen) can be calculated for each antibody and α can be determined using Eq. (6). Following this, Eqs. (7) and (8) are used to calculate suppression and stimulation respectively and thus deduce the global strength of match S_g using Eq. (9). The concentration of every antibody in the system is then calculated using Eq. (11) and normalized using Eq. (13), so that the total number of antibody clones is kept constant to mimic the biology more closely and help prevent scaling problems. Note that the term concentration is used to mean the …
When an antibody is selected for execution it carries out its designated action and the result of that action (half a second later) is scored either positively or negatively using RL. This means that paratope element value P_βd is adjusted upon every iteration using Eq. (14), where τ is the positive or negative RL score awarded and d is the index of the dominant antigen. Further details on the particular RL scheme used here are provided in [3], and a general explanation of RL can be found in [16].
The nine probabilistic systems, R_1 to R_9, are summarized in Table 4. They use the same essential architecture as the idiotypic system (described above), except that they compute antibody α only, not β, i.e. they omit the suppression and stimulation calculations in Eqs. (7) and (8). Having calculated α from the RL scores of paratope matrix P using Eq. (6), systems R_1 to R_9 either use it or select an alternative antibody µ. The rate of µ selection and which alternative antibody is used both depend on pre-determined probability values.
The idiotypic system is therefore mimicked by using a number of systems with probability values that simulate an approximate overall µ rate of 20%. Note that it is P_αd that is scored using RL for the probabilistic systems, or P_µd when α is rejected in favour of alternative antibody µ. Also, for systems R_1 to R_9 concentrations play no role in selecting the antibody that will execute its action.
In the case of R_1 there is a 20% chance of choosing any other antibody apart from α, and these are selected with equal probability. System R_2 is similar with a 20% chance of not selecting α, but the alternative antibody is chosen based on probabilities derived from the paratope matrix, i.e. the RL scores representing the match between each antibody and the dominant antigen are used. The probability of selection ν of antibody x_i is given by:
$$\nu(x_i) = \frac{P_{id}}{\sum_{m=1}^{N} P_{md}} \qquad (15)$$
where N is the number of antibodies and d is the index of the dominant antigen. If α is chosen again the process repeats until µ is different to α.
Systems R_3, R_4 and R_5 also have a 20% µ rate, but when α is rejected R_3 always uses the antibody that is second-best and R_4 uses either the second or third best-matched antibody with equal probability. System R_5 uses either the second, third, or fourth best-matched antibody, but is twice as likely to use the second-best.
System R_6 considers whether the previously-used antibody was deemed successful by the RL. If it was regarded as successful then there is only a 14% chance of selecting the second, third or fourth best-matched antibody. If it was marked as unsuccessful by the RL, then there is probably a greater need for a different antibody, so the probability of not selecting α increases to 28%. In either case, bias is toward choosing the second best-matched antibody, rather than the third or fourth.
Systems R_7, R_8 and R_9 are similar but take into account whether the robot is currently stalled or was stalled on the previous iteration. This methodology is adopted as previous analysis of antibody selection in system I_D has shown that the idiotypic difference rate tends to increase to around 30% during stall conditions. With R_7, if there are no stall conditions then there is a 15% chance of not choosing α. Again, bias is toward the second best-matched antibody, with the third and fourth best-matched being only half as likely to be selected. However, if the robot is currently stalled or was stalled on the previous iteration there is probably a much stronger requirement for an alternative antibody, so the chance of not selecting α increases to 33% (bias is still toward the second-best antibody). Systems R_8 and R_9 work in the same way as R_7 but use 50% and 75% µ rates respectively when the robot is stalled and 13% and 2% µ rates otherwise. In systems R_6 to R_9 the probabilities are selected based on pre-trials to generate an approximate overall 20% observed µ rate.
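Two of the selection schemes described above can be sketched in a few lines. The first function follows the R_2 idea of drawing the alternative in proportion to its RL score, and the second follows the R_8 idea of a stall-dependent µ rate with a 2:1:1 bias toward the second best-matched antibody. This is an illustrative sketch and not the controllers' actual code: the function names are hypothetical, the scores are assumed to be positive, and at least four antibodies are assumed.

```python
import random


def ranked(scores):
    """Antibody indices ordered from best to worst RL score for the dominant antigen."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def select_roulette(scores, rng, mu_rate=0.20):
    """R_2-style: keep alpha, or draw an alternative in proportion to its score."""
    alpha = ranked(scores)[0]
    if rng.random() >= mu_rate:
        return alpha
    others = [i for i in range(len(scores)) if i != alpha]
    # Excluding alpha up front gives the same distribution as
    # redrawing until the chosen antibody differs from alpha.
    return rng.choices(others, weights=[scores[i] for i in others], k=1)[0]


def select_stall_aware(scores, stalled, rng, mu_free=0.13, mu_stall=0.50):
    """R_8-style: the mu rate rises when the robot is, or has just been, stalled."""
    order = ranked(scores)
    mu_rate = mu_stall if stalled else mu_free
    if rng.random() >= mu_rate:
        return order[0]  # alpha is used
    # Second best is twice as likely as third or fourth best.
    return rng.choices(order[1:4], weights=[2, 1, 1], k=1)[0]
```

Passing `mu_free=0.15, mu_stall=0.33` or `mu_free=0.02, mu_stall=0.75` to `select_stall_aware` gives the R_7 and R_9 settings, and calling it with the same value for both rates (0.20) reproduces the R_5 behaviour. A seeded `random.Random` instance makes runs repeatable.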

EXPERIMENTAL PROCEDURES
Each of the ten control systems is run twelve times in Maze World M_1, six times starting with paratope D_1 and the other six times starting with paratope D_2. For each run, the time taken to complete the course T is recorded along with the number of robot stalls σ. A stall represents a collision with an obstacle or the walls and is determined either by detecting that the robot has come to a complete standstill for more than one time-loop interval (antigen 5) or by recording standstill coupled with a rear-sonar reading of less than 0.35 m (antigen 6).
A fast robot that continually crashes or a careful robot that takes too long to complete the task is undesirable, so a fitness measure F, which combines T and σ, is computed for each run. This is given by Eq. (16), in which φ_1 is the ratio of the mean task time to the mean number of stalls over all 120 experiments in M_1.
Maze World M_2 represents a more difficult task for the robot as there are more rooms with more obstacles and there is generally less space for the robot to move around in. The idiotypic system and the best-performing probabilistic controller from the experiments with M_1 (i.e. that with the best fitness) are both used for robot navigation in M_2, six times starting with D_1 and six times using D_3. Again, T and σ are recorded for each run and F is calculated, this time using φ_2, the ratio of the mean task time to the mean number of stalls over all 24 experiments in M_2.
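A possible form of the fitness calculation is sketched below. The expression for F in Eq. (16) is not shown in this text, so the equally weighted combination F = (T + φσ)/2 used here is an assumption. It is chosen because φ converts stalls into time units, so that a run with the mean time and the mean number of stalls scores exactly the mean time. The function names and the numbers in the usage note are hypothetical.

```python
def stall_weight(times, stalls):
    """phi: ratio of the mean task time to the mean number of stalls over all runs."""
    mean_time = sum(times) / len(times)
    mean_stalls = sum(stalls) / len(stalls)
    # Undefined when no run in the world recorded a stall.
    return mean_time / mean_stalls


def fitness(task_time, n_stalls, phi):
    """Assumed form of F: equal weighting of time and time-scaled stalls.

    Lower values indicate a faster run with fewer collisions.
    """
    return (task_time + phi * n_stalls) / 2.0
```

With made-up task times of 100, 200 and 300 s and stall counts of 2, 4 and 6, φ is 50 s per stall, and the three runs score 100, 200 and 300 respectively.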
In both worlds, mean T, σ and F values are computed for each control system and are compared using a one-tailed t-test, with differences accepted as significant at the 99% level only. As another measure of task performance, runs with an above-average fitness for each world are counted as good and those with fitness in the bottom 10% of all runs in each world are counted as bad. In addition, the µ rate is noted for each run and the mean is calculated for each control system.

RESULTS
Table 5 shows the mean T, σ, F and µ values for each control system in each world, and also the percentage of good and bad runs. Table 6 displays the significant difference levels when each of the systems is compared to the idiotypic controller. The results show that none of the probabilistic controllers performs as well as I_D in Maze World M_1. The idiotypic system has the fastest completion time, the fewest stalls and the best fitness. All of these performances are significantly better than those of the probabilistic systems, except in the case of R_8 and when comparing σ values for system R_7 and T values for R_9. Furthermore, I_D has the highest percentage of good runs (92%) and has no bad runs. Probabilistic system R_2 also has no bad runs, but only 50% of its runs are considered good. Since system R_8 is second best in terms of fitness, it is used in Maze World M_2 for comparison with I_D. However, in this world both its σ and F values have significantly higher means than those observed with I_D, and the difference in T is almost significant. In addition, 33% of runs are deemed bad and only 25% are deemed good for system R_8. In contrast, 83% of the idiotypic system's runs are judged as good and none are judged as bad. All of the probabilistic systems show an overall mean µ rate of approximately 20%, which validates the probability choices.
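The one-tailed comparison at the 99% level that underlies Table 6 can be sketched as follows. The paper does not state which variant of the t-test is used, so a pooled-variance (Student's) test is assumed here. With twelve runs per system this gives 22 degrees of freedom, for which the tabulated one-tailed critical value at the 1% level is 2.508. The function names are hypothetical and any data passed in would be the per-run T, σ or F values.

```python
from math import sqrt
from statistics import mean, variance

# Tabulated one-tailed critical value, alpha = 0.01, 22 degrees of freedom.
T_CRIT_99_DF22 = 2.508


def pooled_t(sample_a, sample_b):
    """Student's t statistic for mean(sample_a) - mean(sample_b), equal variances assumed."""
    n_a, n_b = len(sample_a), len(sample_b)
    pooled_var = ((n_a - 1) * variance(sample_a)
                  + (n_b - 1) * variance(sample_b)) / (n_a + n_b - 2)
    return (mean(sample_a) - mean(sample_b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))


def significantly_greater(sample_a, sample_b, t_crit):
    """True if the mean of sample_a exceeds that of sample_b at the chosen level."""
    return pooled_t(sample_a, sample_b) > t_crit
```

To ask whether a probabilistic system stalls significantly more often than I_D, its twelve σ values would be passed as `sample_a`, the twelve I_D values as `sample_b`, and `T_CRIT_99_DF22` as the threshold. A different number of runs requires a different critical value.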

DISCUSSION
The results achieved provide strong empirical evidence that the idiotypic system possesses a highly intelligent form of behaviour selection that cannot easily be mimicked using simple probabilistic systems. In fact, I_D performs better even when a probabilistic system uses some form of inherent intelligence, for example basing the likelihood of antibody selection upon the current RL scores (as in R_2) or boosting the probability of selecting an alternative to α under certain conditions (R_6 to R_9).
The probabilistic controller that performs best in world M_1 (and therefore comes closest to mimicking idiotypic dynamics) is system R_8, which increases the theoretical µ rate (probability of selecting either the second, third, or fourth best-matched antibody) to 50% under stall conditions. However, the mean number of stalls is still significantly higher than for the idiotypic system in world M_2, which suggests that R_8 is less able to deal with more complex environments.
System R_7 has a theoretical µ rate of 33% during stall conditions, which is very close to the mean idiotypic difference rate recorded for I_D under these circumstances (31%). However, its performance is inferior to I_D and also to R_8, which increases the µ rate to 50% under stall conditions. This suggests that the idiotypic dynamics are doing more than merely raising the rate of antibody change when the robot is in difficulty.
Indeed, [3] proposes that it is the increased RL success rate of the antibodies chosen during stall conditions that contributes to an idiotypic robot's superior performance, and that the idiotypic process works by selecting antibodies of similar type to α. In other words, as well as raising the µ rate during stall conditions, the probabilistic systems also need a better mechanism for determining which alternative antibody should be selected.
Presently, only the current second, third, and fourth best-matched antibodies are considered, with the second-best being twice as likely to be selected as the third or fourth. This is a fundamental weakness in the probabilistic schemes, as an alternative antibody with a highly-ranked RL score for a particular antigen does not necessarily represent an antibody with similar properties to α.
Further research is therefore needed, in particular, a detailed examination of the alternative antibodies that are chosen under stall conditions in the idiotypic system, and how they rank in terms of matching to the presenting antigens. If a general pattern of selection could be identified and formalized into a probabilistic algorithm that approximates it, it might be possible to mimic the idiotypic dynamics much more closely.
However, it is still questionable whether such a system would be able to equal or better the performance of I_D. This is because the idiotypic mechanism is a dynamic process of continuous change, where the behaviour selected at a given time affects future selections, i.e. it represents a self-regulating system with feedback. In contrast, the probabilistic systems are only flexible in that they permit other antibodies to be chosen; in all other aspects they are inherently rigid.
Furthermore, feedback in the idiotypic system is driven by the use of concentrations in the choice of alternative antibody as well as global strength of match to antigen, which means that it provides a kind of memory feature for past selection as well as considering current environmental information.
In fact, it may be the balance between these two aspects that gives idiotypic robots their advantage. In Eq. (10) parameter b governs the weighting given to the global strength of match S_g when calculating new concentration values, and experiments that vary this parameter have shown that idiotypic robots show significantly better performance when b is within a certain region [3].
A probabilistic scheme that aims to imitate the dynamics of an idiotypic system accurately would therefore need to:
1. Incorporate some form of memory feature analogous to antibody concentrations that enables the system to record past antibody use.
2. Utilize a mechanism that gives weighted consideration to both the memory and the strength-of-match to antigen when selecting alternative antibodies. This would introduce feedback into the system and provide a more dynamic selection process.
3. Mimic the ideal idiotypic difference rates, both during stall conditions and when the robot is free.
4. Imitate the patterns of alternative antibody selection inherent in the idiotypic dynamics during stall conditions, ideally by using a method that favours antibodies with similar properties to α.
This research has currently addressed item 3) only, which might explain why R_8 came closest to reproducing the performance of I_D. System R_8's theoretical µ rate under stall conditions is greater (50%) than the corresponding idiotypic difference rate of I_D (30%). The greater chance of switching to an alternative antibody may have provided some form of compensation for lack of the other features, and may account for R_8's superior performance to R_7.
However, it should be noted that system R_9, which boosted the µ rate to 75% under stall conditions, was inferior in performance to both R_7 and R_8. This suggests that there may be an optimal µ rate under stall conditions for probabilistic systems that lack design specifications 1), 2) and 4). Future research will investigate this further by determining the optimal value, incorporating the missing design features, and examining any changes in the optimal value once these are in place.

CONCLUSIONS
This research has compared the performances of an idiotypic AIS robot control system with nine other control systems that select robot behaviour using probability functions. It has provided substantial empirical evidence that the idiotypic selection mechanism is superior to any of these systems, which suggests that the idiotypic dynamics are facilitating more intelligent behaviour selection.
The probabilistic system that comes closest to approximating these dynamics is one that boosts the likelihood of non-α selection (i.e. increases the µ rate) during stall conditions, although its performance is still inferior to the idiotypic system. This supports the notion that idiotypic behaviour arbitration incorporates an innate ability to recognize and respond effectively to situations in which the robot is trapped.
Further research will aim to study the patterns of alternative antibody selection within the idiotypic system during stall conditions, in particular the strength-of-match rankings of antibodies chosen instead of α. Study of these patterns might show how idiotypic systems are able to nominate more successful antibodies, and how the selection-mechanism is able to determine which ones have similar properties to α. This might enable a more accurate probabilistic model of the idiotypic system to be created.
Furthermore, a means of recording past antibody use is absent in the probabilistic systems constructed here, and may contribute to their inferior performance. Thus, an important aspect of future research will be the construction of a probabilistic algorithm that imitates this additional feedback feature. A detailed examination of the relationship between antibody concentrations, past use and time in the idiotypic system would greatly assist in this process. Knowledge gained from such a study could be beneficial in terms of improved I_D performance. It is likely that this would greatly assist when transferring the control algorithm between robotic platforms as detailed in [19].