The DCA: SOMe Comparison. A comparative study between two biologically-inspired algorithms

The Dendritic Cell Algorithm (DCA) is an immune-inspired algorithm, developed for the purpose of anomaly detection. The algorithm performs multi-sensor data fusion and correlation which results in a 'context aware' detection system. Previous applications of the DCA have included the detection of potentially malicious port scanning activity, where it has produced high rates of true positives and low rates of false positives. In this work we aim to compare the performance of the DCA and of a Self-Organizing Map (SOM) when applied to the detection of SYN port scans, through experimental analysis. A SOM is an ideal candidate for comparison as it shares similarities with the DCA in terms of the data fusion method employed. It is shown that the results of the two systems are comparable, and both produce false positives for the same processes. This shows that the DCA can produce anomaly detection results to the same standard as an established technique.

process called apoptosis. Dendritic cells (DCs) are sensitive to both signal types and have the ability to stimulate or suppress the adaptive immune system. DCs are the intrusion detection agents of the human immune system, policing the tissue for potential sources of damage in the form of signals and for potential culprits responsible for the damage in the form of 'antigen'. Antigens are proteins which can be 'presented' to the adaptive immune system by DCs, and can belong to pathogens or to the host itself.

The DCA incorporates danger-based DC biology to form an algorithm that is both truly bio-inspired and capable of performing anomaly detection. It is a population-based algorithm, in which multiple DCs are programmed to process signals and antigen. 'Signals' are mapped to context information, such as the behaviour of a monitored system, e.g. CPU usage or network traffic statistics. 'Antigens' are mapped to potential causes of the changes in behaviour, e.g. the system calls of a running program. The DCA correlates the antigen and signal information across the population of cells to produce a consensus coefficient value, which is assessed to determine anomalous antigen.

The DCA has been successfully applied to a subset of intrusion detection problems, focussing on port scan detection. Port scans are used to establish network layout and to uncover vulnerable computers. The detection of the scanning phase of an attack can be highly beneficial, as upon its detection the level of security can be increased across a network in response. The DCA has been applied to both ping scans and SYN scans, in real time and offline [27][25]. The algorithm produced high rates of true positives and low rates of false positives.

While the performance of the DCA on these problems appears to be good, thus far no direct comparison has been performed with another system on the same port scan data.
A rigorous comparison is necessary to truly demonstrate the capabilities of this algorithm. The signal processing component housed within the DCA bears some resemblance to the function of a neural network [13]. Given these superficial similarities, the obvious next step in the development of the DCA is to compare its performance to that of a neural-network-based system, such as a Self-Organizing Map (SOM) [47].

SOM is an unsupervised clustering method in which high-dimensional data is mapped to a lower-dimensional space to create a feature map. This map is constructed from training data and consists of a series of interconnected nodes. Upon analysis of the test data, incoming data items are matched against nodes in the map with similar characteristics. SOM uses a process similar to a single-layer neural network to generate the map, and a simple distance metric is used to match incoming test data to the most appropriate node. This technique can be used for anomaly detection, as the training data can consist of normal data items, with unclustered data representing a potential anomaly. SOM is an excellent choice for comparison as it has a history of application within computer security and can be manipulated to use input data similar to that used with the DCA.

The aim of this paper is to compare the DCA with a SOM. To achieve this aim, the two algorithms are applied to the detection of an outbound SYN-based port scan, using data captured from previous real-time experiments performed with the DCA [25]. The results of this comparison indicate that the DCA and SOM are equally effective at detecting SYN-based port scans, and appear to make similar false positive errors. As a baseline, a k-means clustering algorithm is applied to the signals in isolation.

This paper is structured as follows.
In Section 2, the relevant background and context is given regarding problems in computer security, how these problems relate to port scanning, and a summary of current port scan detection techniques. In Sections 3 and 4, descriptions are given of the DCA and SOM respectively, including details of their implementations. In Section 5, the two approaches are compared experimentally. In Section 6 we perform an analysis and comparison of the two systems based on the obtained results and debate their differences and similarities, further validated by a baseline series. In the final sections we discuss the results of these comparisons and present the implications for the future of the DCA.

2 Related Work

2.1 Overview

As this paper encompasses a variety of techniques and concepts, this section is subdivided into three parts. Firstly, the problems associated with port scans are described, followed by a description of current scan detection techniques. This is followed by the related computer security work in AIS, including the development of the DCA and the motivation for its development. The section concludes with a brief overview of the use of various SOM algorithms in computer security.

2.2 Port Scanning and Detection

2.2.1 Introduction to Port Scanning

Insider attacks are among the most costly and dangerous attacks performed against computerised systems, with a large proportion of known intrusions attributed to internal attackers [6].
This type of intrusion is defined by the attacker being a legitimate user of a system who behaves in an unauthorised manner. Such insider attacks have the potential to cause major disruption, given that a large number of networks do not employ internal firewalling and many security countermeasures focus on the detection of external intruders. Insiders frequently know and have access to network topology information. As insiders operate from within an organisation, this provides them with scope to abuse a weak link in the security chain, namely the end users. Having knowledge of, and relationships with, other network users brings with it the potential to coerce passwords from legitimate users for the purpose of gaining access across multiple machines on a network. This information can be used to steal sensitive data, to cause damage to the network or to disguise the identity of the true attacker.

Such attacks frequently involve multiple stages, the initial stage being information gathering. It is wise for an attacker to understand the network in question, to avoid wasting time trying to exploit machines which are not receptive to an attack. It is pointless attempting to attack a host which is no longer connected to the network! While scans are not an 'intrusion' in the classical sense, they are often a precursor to an actual attack, and evidence of sufficient scanning across a network can suggest that an attack may soon follow [39].

A port is a specific endpoint on a network, a virtual address forming part of a virtual circuit. It is important to note that a port in this context is an abstract concept, not to be confused with a physical port such as a serial port. Ports allow for the direct exchange of information between two hosts. A port is similar to a telephone number and is more specific than an IP address, as it provides a direct connection between two endpoints. Probing a port with a packet yields information on the state of the port and its host. If the scanned host is available, a port can be in one of three states, namely open, closed or filtered. Port scanning tools such as nmap [17] can be used to send packets to various ports on remote hosts to gain understanding of the status of the scanned host. The packets used to perform such probes can be of a number of types, including Internet Control Message Protocol (ICMP) ping, TCP, and UDP. According to Bailey-Lee et al. [4], TCP SYN scans are the most commonly observed scan, accounting for over half of all scans performed.

Additionally, the scans themselves can be performed in a number of different ways, varying the number of hosts scanned, the number of ports scanned, the IP address of the sender and the rate at which packets are sent.
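To make the three port states concrete, the following sketch probes a TCP port in roughly the way a full-connect scan does. This is a simplification: nmap's SYN scan sends half-open probes via raw sockets, which require elevated privileges. The function and the loopback demonstration are illustrative, not part of the original study.

```python
import socket

def probe_tcp_port(host: str, port: int, timeout: float = 1.0) -> str:
    """Classify a TCP port via a full-connect probe: a completed handshake
    means open, an RST means closed, and silence suggests a filtering
    firewall dropped the probe."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect((host, port))
        return "open"            # SYN answered with SYN/ACK
    except ConnectionRefusedError:
        return "closed"          # SYN answered with RST
    except socket.timeout:
        return "filtered"        # no reply at all
    finally:
        s.close()

# Demonstrate against a listener we control on the loopback interface.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))      # the OS picks a free port
listener.listen(1)
open_port = listener.getsockname()[1]

print(probe_tcp_port("127.0.0.1", open_port))   # open
listener.close()
```

A horizontal scan simply repeats such a probe for one port across many hosts; a SYN scan differs only in never completing the handshake.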
The combination of the number of hosts (IP addresses) scanned and the number of ports gives rise to three combinations of scan:

- Horizontal scans: a wide range of IP addresses is scanned, though only one port per host is scanned.

…to an incoming scan will be impaired for the detection of horizontal scans, as only a single connection is made to each host. This problem can be overcome through the detection of outbound port scans, as per extrusion detection [6].

To overcome some of these problems, a handful of systems have been developed as dedicated port scan detectors. For example 'Spice', developed by Staniford et al. [68], incorporates an anomaly probability score to dynamically adjust the duration of Y. This is useful for the detection of stealthier scans, which use randomised or slowed rates of scan packet sending. This detection technique is termed reverse sequential hypothesis testing. It is taken further in research performed by Jung et al.,
where it is combined with a network-based approach and is used to identify potentially malicious remote hosts in addition to detecting scanning activity.

In our previous research, the DCA is applied to the detection of various port scans. The DCA is implemented as a host-based system monitor, detecting the performance of an outbound port scan, in an attempt to overcome some of the problems with using static time windows. Initially the DCA is applied to the detection of simple ICMP ping scans [29], where the algorithm was used in real time and produced high rates of true positives and low rates of false positives. In addition, the DCA is used in the detection of a standard SYN scan, also in a real-time environment [27]. The results of this study show that the DCA shows promise as a successful port scan detector. However, the results presented were preliminary and, as the experiments were performed 'live' in real time, certain sensitivity analyses could not be performed. Therefore, it is necessary to take this investigation further and explore this application with more rigour.

Numerous computer security approaches are based on the principles of anomaly detection. This technique involves producing a profile of normal host behaviour, with any significant deviation from this profile presumed to be malicious or anomalous. Various AIS have been applied as anomaly detection algorithms within the field of computer security, given the obvious parallel between fighting computer viruses and a computer immune system [18]. The research of AIS in security has extended past the detection of viruses and has focussed on network intrusion detection [45].

The AIS algorithms used in security are generally based on the principles of "self-nonself discrimination". This is the immunological concept that the body has the ability to discriminate between self and nonself (i.e. foreign) protein molecules, termed antigen.
The natural mechanism by which the body learns this discrimination is termed negative selection. In this process, self-reactive immune cells are deleted during a 'training period' in embryonic development and infancy. This results in a tuned population of cells, poised to react against any threat which is deemed nonself. These principles are used to underpin the supervised negative selection algorithm. Negative selection itself is described eloquently in a number of sources, including the work of Hofmeyr and Forrest [32], Ji and Dasgupta [37] and Balthrop et al. [5].

Following its initial success in the detection of system call anomalies [32], …

The criticisms of negative selection have to some degree overwhelmed the positive aspects of its development. The question of how to overcome these problems remains at the forefront of AIS research, focussing on the incorporation of more advanced immunology.

An interdisciplinary approach is presented by Aickelin et al. [1], developed in 2003 through the Danger Project. Aickelin et al. believe that some of the problems shown with negative selection approaches can be attributed to its biological naivety. It is recognised that the negative selection algorithm is based on a naive model of central tolerance developed in the 1950s [12].

Aickelin et al. propose that, through close collaboration with immunologists, computer scientists will be able to develop more biologically realistic AIS which could potentially overcome the problems of false positives and scaling observed with negative selection [1]. The DCA is developed using this interdisciplinary approach [26], drawing inspiration from DCs, as it is now widely accepted that these cells are a major control unit in the human immune system.

The Danger Project brought innate immunology into the AIS spotlight, as the innate immune system has been shown to be responsible for initial pathogen detection [52]. From this emerged two streams of research based on innate principles: the Dendritic Cell Algorithm, and the libtissue system and its related algorithms [73]. The DCA will be explained in detail in Section 3, and is based on an abstract model of the behaviour of natural DCs.

…visualisation, clustering, data processing, reduction and classification. In more specific terms, SOM is an unsupervised learning algorithm based on the competitive learning mechanism with self-organizing properties.
Besides its clustering properties, SOM can also be classed as a method for multidimensional scaling and projection.

SOM algorithms were first applied to computer security applications almost ten years after the algorithm's inception [19]. The majority of existing research, however, is limited to anomaly detection, particularly network-based intrusion detection [16]. Some work has been done on host-based anomaly detection using Kohonen's algorithm, however such work is rare [35], which is surprising, …

…overflow attacks. However, as with the majority of anomaly detection systems, the algorithm struggles to recognise attacks which resemble normal behaviour, in addition to boundary-case behaviour, giving rise to false positives.

Buffer overflow [20] attack detection was also tackled by Rhodes [60] using a multilayer SOM monitoring payloads. Bolzoni [9] also looked at payload monitoring using SOM, employing a two-tier architecture system.

Gonzalez and Dasgupta [21] compared SOM against another AIS algorithm. Their Real-Valued Negative Selection algorithm is based on the original Negative Selection algorithm proposed by Forrest et al. [18], with the difference of using a new representation. The original Negative Selection algorithm has been applied to intrusion detection problems in the past and has received some criticism regarding its "scaling problems" [44]. Gonzalez …

DCs belong to the innate immune system, and do not have the adaptive capability of the lymphocytes of the adaptive immune system. DCs exist in three states of differentiation, immature, semi-mature and mature, which determine their exact function [52]. Modulations between the different states are dependent upon the receipt of signals while in the initial or immature state. Signals which indicate damage cause a transition from immature to mature. Those signals indicating good health in the monitored tissue cause a transition from immature to semi-mature.
The signals in question are derived from numerous sources, including pathogens, healthy dying cells, damaged cells and inflammation. Each DC has the capability to combine the relative proportions of input signals to produce its own set of output signals. Input signals processed by DCs are categorised based on their origin:

PAMPs: Pathogen-associated molecular patterns are proteins expressed exclusively by bacteria,

DCs are sensitive to changes in danger signal concentration. The presence of danger signals may or may not indicate an anomalous situation; however, the probability of an anomaly is higher than under normal circumstances.

Safe signals: Signals produced via the process of normal cell death, namely apoptosis. Cells must

Dendritic cells act as natural data fusion agents, producing various output signals in response to the receipt of differing combinations of input signal. The relative concentration of output signal is used to determine the exact state of differentiation, expressed by the production of two molecules, namely the mature and semi-mature output signals. During the immature phase DCs are exposed to varying concentrations of the input signals. Exposure to PAMPs, danger signals and safe signals causes the increased production of costimulatory molecules, and a resulting removal from the tissue and migration to a local lymph node.

DCs translate the signal information received in the tissue into a context for antigen presentation, i.e. the antigen is presented in an overall 'normal' or 'anomalous' context. The antigen collected while in the immature phase is expressed on the surface of the DC. Whilst in the lymph node, DCs seek out T-lymphocytes (T-cells) and attempt to bind expressed antigen with the T-cell's variable-region receptor.

T-cells with a high enough affinity for the presented antigen are influenced by the output signals of the DC. DCs exposed predominantly to PAMPs and danger signals are termed 'mature DCs'; they produce mature output signals, which activate the bound T-cells. Conversely, if the DC is exposed predominantly to safe signals, the cell is termed semi-mature and antigens are presented in a safe context, as little damage is evident when the antigen is collected. The balance between the signals is translated via the signal processing and correlation ability of these cells. The overall immune system response is based on the systemic average maturation state of the whole DC population, on a per-antigen basis. An abstract view of this process is presented in Figure 2.

The signal processing used to transform the input to interim output signals is shown in Figure 3, with the implications of each output signal given in Table 1. Costimulatory molecule (CSM) signal is … and therefore tighter coupling is given to the signal and antigen data. This effect is explored in more detail in Oates et al. [58], where a theoretical analysis is provided.

MCAV_x = Z_x / Y_x, where MCAV_x is the MCAV coefficient for antigen type x, Z_x is the number of mature-context antigen presentations for antigen type x, and Y_x is the total number of antigens presented for antigen type x.

The effectiveness of the MCAV is dependent upon the use of antigen types. This means that the input antigens are not unique in value, but belong to a population in themselves. For example, the ID value of a running program is used to form antigen, with an antigen generated every time the program sends an instruction to the low-level system. Therefore a population of antigens is used, linked to the activity of the program and all bearing the same ID number.
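The MCAV computation follows directly from its definition as the fraction of mature-context presentations per antigen type. The sketch below uses invented process names in place of real PID-based antigen types:

```python
from collections import Counter

def mcav(presentations):
    """Compute MCAV_x = Z_x / Y_x per antigen type, where presentations is a
    list of (antigen_type, context) pairs and context is either 'mature'
    (anomalous) or 'semi-mature' (normal)."""
    total = Counter(t for t, _ in presentations)                    # Y_x
    mature = Counter(t for t, c in presentations if c == "mature")  # Z_x
    return {t: mature[t] / total[t] for t in total}

# Hypothetical presentation log: the scanner is mostly presented in a
# mature context, the browser mostly in a semi-mature context.
log = ([("nmap", "mature")] * 9 + [("nmap", "semi-mature")] * 1
       + [("firefox", "mature")] * 2 + [("firefox", "semi-mature")] * 8)

print(mcav(log))   # {'nmap': 0.9, 'firefox': 0.2}
```

Thresholding these coefficients then separates anomalous antigen types from normal ones.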

To process signals, antigens and cells, the DCA uses two virtual compartments: tissue and lymph nodes. The tissue is used as storage for antigens and signals, and the lymph node is used for MCAV …

…The implementation has a client/server architecture which separates data collection, using clients, from data processing on a server, as shown in Figure 5.

Input data is processed using libtissue clients, which transform raw data into antigen and signals.

Algorithms can be implemented within the libtissue server, as it provides all the required components, such as the ability to define different cell types, specifying receptors, compartments and internal signals.

Antigen and signal sources can be added to libtissue servers, facilitating the testing of the same algorithm with a number of different data sources. Input data from the client are passed to and represented in a compartment contained on a server, known as the tissue compartment. This is a space in which cells, signals and antigen interact. Each tissue compartment has a fixed-size antigen store where collected antigens are placed. The tissue compartment also stores levels of signals, set by the input clients. …type. An overview of this is given in Figure 6.

The tissue update is a continuous process, whereby the values of the tissue data structures are refreshed. In this implementation, signals are updated at regular intervals, in our case at a rate of once per second. The update of antigen occurs on an event-driven basis, with antigen items updated in the tissue each time new raw data appears in the system. The updated signals provide the input signals for the population of DCs.
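As a rough sketch of this arrangement, the tissue compartment below refreshes signals on a periodic tick while storing antigen event-by-event in a fixed-size store. The store size and signal names are illustrative placeholders, not the values used by the libtissue implementation:

```python
import queue

class Tissue:
    """Sketch of a tissue compartment: signal levels are overwritten by a
    periodic update (once per second in the paper), while antigen arrives
    on an event-driven basis into a fixed-size store."""
    def __init__(self, antigen_store_size=500):
        self.signals = {}
        self.antigen = queue.Queue(maxsize=antigen_store_size)

    def update_signals(self, new_signals):
        # Periodic tick: refresh the current signal levels.
        self.signals.update(new_signals)

    def add_antigen(self, antigen_type):
        # Event-driven: one item per raw data event; a full store drops extras.
        if not self.antigen.full():
            self.antigen.put(antigen_type)

tissue = Tissue(antigen_store_size=2)
tissue.update_signals({"danger": 42.0, "safe": 10.0})
for pid in ["1234", "1234", "5678"]:   # third item exceeds the tiny store
    tissue.add_antigen(pid)
print(tissue.antigen.qsize(), tissue.signals["danger"])   # 2 42.0
```

DCs then read the current signal levels and sample antigen from this store on each cell cycle.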

The cell cycle is a discrete process, occurring at a user-defined rate of once per second in this implementation. …This procedure forms the required post-processing for use with this algorithm.

3.5 Parameters and Structures

The algorithm is described formally using the following terms.
Each DC m transforms each value of s(m) to o_p(m). In Equation 2, a specific example is given for use with four input signals, one signal per category, with some additional components.

Additionally, the j = 3 component implies that this signal category index is not summed with the other three signal categories, i.e. inflammation is not treated in the same manner as the other signals, as shown in this equation. The interrelationships between the weights, determined through practical immunology, are shown in Table 2.
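The general shape of this transformation can be sketched as a weighted sum over the first three signal categories, with inflammation (j = 3) kept out of the sum; here it is treated as a multiplicative amplifier, one common DCA formulation. The weight values below are illustrative placeholders, not those of Table 2:

```python
def process_signals(s, W):
    """Transform input signals s = [pamp, danger, safe, inflammation] into
    interim output signals o_p.  Categories j = 0..2 are combined as a
    weighted sum per output p; inflammation (j = 3) is excluded from the sum
    and instead amplifies all outputs."""
    inflammation = s[3]
    return [sum(W[p][j] * s[j] for j in range(3)) * (1 + inflammation)
            for p in range(len(W))]

# Three interim outputs (CSM, semi-mature, mature); hedged example weights
# chosen only so that safe signals suppress the mature output.
W = [
    [2.0, 1.0, 1.0],    # o_0: costimulatory molecules (CSM)
    [0.0, 0.0, 1.0],    # o_1: semi-mature
    [2.0, 1.0, -1.5],   # o_2: mature
]
print(process_signals([40, 10, 5, 0], W))   # [95.0, 5.0, 82.5]
```

With inflammation set to 1, every output doubles, matching its role as a general amplifier rather than a summed category.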

The tissue has containers for signal and antigen values, namely S and A. In this example version, … DCs sample R antigens from the tissue antigen vector A.

After the internal values of a DC are updated, o_0 is assessed against t_m, the cell's migration threshold.

If o_0 is greater than t_m, the DC is 'removed' from the tissue. Here, 'removed' means that the DC is de-allocated the receptors needed to sample the signal matrix and to collect antigen. On the next update cycle, the remaining output signals are checked and the analysis procedure is initiated.

In this implementation, each DC is assigned a random value for t_m, within a specified range.
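A minimal sketch of this migration mechanism follows; the threshold range, update values and antigen labels are invented for illustration:

```python
import random

class DendriticCell:
    """Sketch of the DC update and migration step: each cell draws a random
    migration threshold t_m from a given range, accumulates output signals
    and antigen, and migrates once its cumulative CSM output o_0 exceeds
    t_m."""
    def __init__(self, t_range=(100.0, 300.0)):
        self.t_m = random.uniform(*t_range)   # per-cell migration threshold
        self.o = [0.0, 0.0, 0.0]              # cumulative [o_0, o_1, o_2]
        self.antigen = []

    def update(self, interim_outputs, sampled_antigen):
        self.antigen.extend(sampled_antigen)
        for p, value in enumerate(interim_outputs):
            self.o[p] += value
        return self.o[0] > self.t_m           # True => migrate to lymph node

cell = DendriticCell(t_range=(100.0, 100.0))  # fixed threshold for the demo
assert cell.update([60.0, 0.0, 10.0], ["pid-1234"]) is False
assert cell.update([60.0, 0.0, 10.0], ["pid-1234"]) is True   # 120 > 100
print(cell.antigen)
```

Randomising t_m across the population is what spreads migration times, so different cells summarise signal context over different window lengths.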

Pseudocode for this specific instantiation of the DCA is given in Algorithm 1. This pseudocode shows both the update of the tissue and of the individual DCs. The stages of the algorithm are shown, namely initialisation, update and analysis. While this provides the detail of the DC update mechanisms, the pseudocode does not encapsulate the asynchronous nature of the update stages. As libtissue is a multithreaded framework, the three updates are controlled by three different processes; therefore, the three updates occur asynchronously. This architecture is particularly suited to real-time data processing, as updates occur as and when they are required.

Various properties of the brain were used as inspiration for a large set of algorithms and computational theories known as neural networks. Such algorithms have been shown to be successful; however, a vital aspect of biological neural networks was omitted in their development: the notion of self-organization and spatial organization of information within the brain. In 1981, Kohonen proposed a method which takes these two biological properties into account, presented in his SOM algorithm [46].

The SOM algorithm generates, usually, two-dimensional maps representing a scaled version of the n-dimensional data used as input to the algorithm. These maps can be thought of as "neural networks" in the same sense as SOM's traditional rivals, artificial neural networks (ANNs). This is due to the algorithm's inspiration from the way that mammalian brains are structured and operate in a data-reducing and self-organised fashion. Traditional ANNs originated from the functionality and interoperability of neurons within the brain. The SOM algorithm, on the other hand, was inspired by the existence of many kinds of "maps" within the brain that represent spatially organised responses.

An example from the biological domain is the somatotopic map within the human brain, containing a representation of the body, and its adjacent and topographically almost identical motor map, responsible for the mediation of muscle activity [47].

This spatial arrangement is vital for the correct functioning of the central nervous system [40]. This is because similar types of information (usually sensory information) are held in close spatial proximity to each other, in order for successful information fusion to take place, as well as to minimise the distance over which neurons with similar tasks communicate. For example, sensory information of the leg lies next to sensory information of the sole.

The fact that similarities in the input signals are converted into spatial relationships among the responding neurons provides the brain with an abstraction ability that suppresses trivial detail and maps only the most important properties and features along the dimensions of the brain's map [61].

4.2 SOM Algorithm Overview

As the algorithm represents the above-described functionality, it contains numerous methods that achieve properties similar to those of the biological system. The SOM algorithm comprises competitive learning, self-organization, multidimensional scaling, and global and local ordering of the generated map and its adaptation.

There are two high-level stages of the algorithm that ensure the successful creation of a map. The first stage is the global ordering stage, in which we start with a map of predefined size with neurons of random nature; using competitive learning and a method of self-organization, the algorithm produces a rough estimation of the topography of the map based on the input data. Once a desired number of input data items has been used for this estimation, the algorithm proceeds to the fine-tuning stage, where the effect of the input data on the topography of the map decreases monotonically with time, while individual neurons and their close topological neighbours are sensitised and thus fine-tuned to the present input.

The original algorithm developed by Kohonen … Adaptation is the step where the winning node is adjusted to be slightly more similar to the input x.

This is achieved by using a kernel function, such as the Gaussian function (h_ci) seen in Equation 4, as part of a learning process.
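The stimulus selection, response and adaptation steps, with a Gaussian kernel and monotonically decreasing learning rate and neighbourhood width, can be sketched as follows. The grid size, epoch count and decay schedules here are illustrative choices, not those used in the experiments:

```python
import numpy as np

def train_som(data, grid=(8, 8), epochs=2000, a0=0.5, s0=4.0, seed=0):
    """Minimal SOM sketch: pick a random stimulus, find the winning node by
    Euclidean distance, then pull the winner and its Gaussian neighbourhood
    towards the stimulus, with learning rate and neighbourhood width
    decaying over time."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    w = rng.random((rows * cols, data.shape[1]))           # random initial map
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)])
    for t in range(epochs):
        frac = t / epochs
        alpha, sigma = a0 * (1 - frac), s0 * (1 - frac) + 0.5
        x = data[rng.integers(len(data))]                  # stimulus selection
        c = np.argmin(np.linalg.norm(w - x, axis=1))       # response (winner)
        d2 = np.sum((coords - coords[c]) ** 2, axis=1)     # grid distances
        h = alpha * np.exp(-d2 / (2 * sigma ** 2))         # Gaussian kernel
        w += h[:, None] * (x - w)                          # adaptation
    return w

# Train on 'normal' data only; distance to the nearest map node (the
# quantization error of a single item) then acts as an anomaly score.
normal = np.random.default_rng(1).normal(0.3, 0.05, size=(500, 2))
som = train_som(normal)
def qerror(x, w): return np.min(np.linalg.norm(w - x, axis=1))
print(qerror(np.array([0.3, 0.3]), som) < qerror(np.array([0.9, 0.9]), som))
```

An item far from every trained node, i.e. one that fails to cluster, is the SOM analogue of an anomalous antigen type in the DCA.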

In the above function, α(t) denotes a "learning-rate factor" and σ(t) denotes the width of the neighbourhood affected by the Gaussian function. Both of these parameters decrease monotonically. …Stimulus selection, Response and Adaptation are repeated a desired number of times, or until a map of sufficient quality is generated. For our experiments this was set to a value suggested by Kohonen [48], who states that the number of steps should be at least 500 times the number of map units. For this reason, 100,000 epochs were used in our experiments. Another possible mechanism for the termination of the algorithm is the calculation of the quantization error, which is the mean of ||x − w_c|| over the training data. Once the overall quantization error falls below a certain threshold, the execution of the algorithm can stop, as an acceptable lower-dimensional representation of the input data has been generated.

5 Experimental Comparison

5.1 Scenarios

For the experiments in this paper, two data sets are compiled, collected using a system of signal collection scripts, with raw input signal data collected from the Linux /proc filesystem. One data set is termed passive normal (PN) and contains a SYN scan performed without normal processes invoked by a user, i.e. scan and shell processes only. The second data set is termed active normal (AN). This data set contains an identical SYN scan, but combined with simultaneous instances of normal programs used actively by a user throughout the duration of the session, i.e. scan, shell processes and a firefox web browser.

The AN data set is 7,000 seconds in duration, with 'normal' antigen generated by running a web

The PN data set is also 7,000 seconds in duration and comprises a SYN scan and its pseudo-… …to a binary signal, i.e. inflammation present (1) or not (0). All PAMP, danger and safe signals are normalised within a range of zero to 100. A sketch of the input signals throughout the duration of the two sessions is shown in Figures 7, 8 and 9 for the AN data set and in Figures 10, 11 and 12 for the PN data set.

To devise a set of appropriate signals, a number of preliminary experiments must be performed, in addition to the acquisition of knowledge regarding the effects of scanning and normal networking usage within a host. Initially, a plethora of system variables are monitored under a variety of situations. The signals used in this experiment are network-based attributes, as this kind of system data appears to be the most variable under scanning conditions. Once the candidate signals are selected, they are then categorised using the general principles of signal selection, i.e. PAMPs are signature-based, danger signals are normal at low values and anomalous at high values, etc. Following the categorisation, the raw values of signals are transformed into normalised signals.

PAMP-1 is the number of ICMP 'destination unreachable' (DU) error messages received per second. Scanning IP addresses attached to hosts which are firewalled against ICMP packets generates these error messages in response to probing. This signal is shown to be useful in detecting ping scans and may also be important for the detection of SYN scans, as an initial ping scan is performed to find running hosts.

In this experiment, the number of ICMP messages generated is significantly lower than observed with a ping scan. To account for this, normalisation of this signal includes multiplying the raw signal value by five, capped at a value of 100 (equivalent to 20 DU errors per second). This process is represented in Equation 6, where raw is the unmodified system data and signal represents the normalised output signal. These terms apply to all equations described within this section.

signal = min{100, raw * 5}    (6)

SS-1 is the rate of change of network packet sending per second. Safe signals are implemented to counteract the effects of the other signals, hopefully reducing the number of false positive antigen types.

12
High values of this signal are achieved if the rate is low, and vice versa. This implies that a large volume of packets can be legitimate, as long as the rate at which the packets are sent remains constant. The value for the rate of change can be calculated from the raw DS-1 signal value, though conveniently, the proc file system also generates a moving-average representation of the rate of change of packets per second (over 2 seconds). This raw value is normalised between values of 10 and 100 and inverted, so that the safe signal value decreases as the raw signal value increases. This normalisation process is described in Equation 11 (see also Table 3). Preliminary experiments have also shown that a moving average is needed to increase the sensitivity of this signal. This average is created over a 60 second period.
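As a concrete illustration, the two normalisations above can be sketched in Python. The PAMP-1 function follows Equation 6 directly; the SS-1 function is only one plausible reading of the clamp-and-invert scheme, since Equation 11 itself is not reproduced in this text; the 60-second moving average is a separate helper.

```python
from collections import deque

def normalise_pamp1(raw):
    # Equation 6: multiply the raw DU errors/sec by five, cap at 100,
    # so 20 or more DU errors per second saturates the signal.
    return min(100.0, raw * 5.0)

def normalise_ss1(raw, low=10.0, high=100.0):
    # One plausible reading of Equation 11 (an assumption, not the
    # paper's exact formula): clamp the raw rate-of-change value to
    # [low, high], then invert it so that a constant sending rate
    # (low rate of change) yields a high safe-signal value.
    clamped = max(low, min(high, raw))
    return high + low - clamped

class MovingAverage:
    # 60-second moving average used to increase SS-1 sensitivity.
    def __init__(self, window=60):
        self._values = deque(maxlen=window)

    def update(self, value):
        self._values.append(value)
        return sum(self._values) / len(self._values)
```

With this reading, a raw SS-1 value of 10 maps to a safe signal of 100, and a raw value of 100 maps to 10, giving the required inversion.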

The inflammatory signal is binary and based on the presence of remote root logins. If a root log-in is detected, the signal is set to one.

Table 4 Summary statistics of the frequency of system calls for the nmap and firefox processes.

Using multiple system calls with identical PIDs allows for the aggregate antigen sampling method, having multiple antigens per antigen type. This allows for the detection of active processes when changes in signal values are observed. This technique is a form of process anomaly detection, but the actual structure of the PID is not important in terms of its classification, i.e. no pattern matching is performed on the actual value of the PIDs: the PID is simply a label for the purpose of process identification.
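The aggregate sampling idea can be sketched as follows, assuming a hypothetical log format of (timestamp, pid, syscall) records; the PID is treated purely as an opaque label.

```python
from collections import Counter

def collect_antigen(syscall_log):
    # Each captured system call contributes one antigen carrying its
    # process ID, so an active process yields many antigens of the same
    # type. No pattern matching is performed on the PID value itself.
    return Counter(pid for _, pid, _ in syscall_log)

log = [
    (0, 1234, "connect"),   # hypothetical scanner-like process
    (0, 1234, "sendto"),
    (1, 5678, "read"),      # hypothetical browser-like process
]
counts = collect_antigen(log)   # antigen type 1234 occurs twice
```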

A graph of the frequency of system calls invoked per second for the AN data set is shown in Figure 13 for the nmap process and in Figure 14 for the firefox process. In these two figures, individual points represent the frequency of system calls per second, while the trendline represents a moving average over 100 points. Summary statistics of the system call data, generated across the entire session for both processes, are given in Table 4. The mean and median frequencies of system calls for the nmap process are higher than those of the firefox process.

To assess which process is more variable, the means are divided by the standard deviations, as shown in the summary table. This value is larger in the case of nmap than for firefox, indicating that, relative to the mean, the standard deviation of the firefox process is larger than that of the nmap process. The various proportions of input system calls are represented as a chart in Figure 15, which shows that the nmap process invokes the majority of system calls in the AN data set.

[…] browsing, chatting, file transfer and other activities performed by a standard user. Once the data is collected, it is combined into one data set, which is subsequently used as the input to the SOM algorithm. Input feature vectors are then selected from this data set at random and presented to the map for computation. The training results in a map, which can be seen in Figure 16.
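The training procedure described above, random input vectors pulling a grid of weight vectors towards the data, can be sketched minimally as follows; the grid size, learning rate and neighbourhood radius below are illustrative placeholders, not the parameter values of Table 8.

```python
import numpy as np

def train_som(data, grid=(10, 10), epochs=20, lr0=0.5, sigma0=3.0, seed=0):
    # data: (n, d) array of normalised signal vectors.
    rng = np.random.default_rng(seed)
    rows, cols = grid
    weights = rng.random((rows, cols, data.shape[1]))
    # grid coordinates, used by the neighbourhood function
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                  indexing="ij"), axis=-1)
    n_steps = epochs * len(data)
    for step in range(n_steps):
        x = data[rng.integers(len(data))]        # random input vector
        t = step / n_steps
        lr = lr0 * (1 - t)                       # decaying learning rate
        sigma = sigma0 * (1 - t) + 1e-3          # shrinking neighbourhood
        # best matching unit: node with minimal Euclidean distance to x
        dists = np.linalg.norm(weights - x, axis=2)
        bmu = np.unravel_index(np.argmin(dists), (rows, cols))
        # Gaussian neighbourhood around the BMU on the grid
        grid_d2 = np.sum((coords - np.array(bmu)) ** 2, axis=-1)
        h = np.exp(-grid_d2 / (2 * sigma ** 2))[..., None]
        weights += lr * h * (x - weights)
    return weights
```

After training, neighbouring nodes with dissimilar weight vectors mark cluster boundaries, which is what the brightness in the U-matrix style map of Figure 16 visualises.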

In this example, the brighter the colour, the greater the dissimilarity of neighbouring nodes, with the map representing four clusters. This is one of the maps generated throughout our experiments; ten runs were performed, both for training and detection.

The SOM algorithm itself cannot perform anomaly detection without further processing. A meaningful way has to be devised to classify a data item and decide whether it is anomalous or not. Thus, a method for process, rather than signal set, anomaly detection had to be developed.

For informational purposes, a simple form of anomaly detection can be performed on our data by classifying sets of signals only. This is done by first using the calculation in Equation 3, which determines the winning node within our trained map, which we call the Best Matching Unit (BMU). Once the BMU is found, the actual Euclidean distance between the currently observed vector of signals and the BMU is calculated. The most trivial anomaly detection is performed by choosing a threshold for this dissimilarity: if the currently observed item is too different from the BMU, it is deemed anomalous.
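The BMU-and-threshold scheme can be sketched as below; the threshold value is a tunable assumption, not a value taken from these experiments.

```python
import numpy as np

def bmu(weights, x):
    # Best Matching Unit: the map node whose weight vector has the
    # smallest Euclidean distance to the observed signal vector x.
    dists = np.linalg.norm(weights - x, axis=2)
    idx = np.unravel_index(np.argmin(dists), dists.shape)
    return idx, dists[idx]

def is_anomalous(weights, x, threshold):
    # Trivial anomaly decision: flag the item if it is too dissimilar
    # from its BMU in the trained map.
    _, distance = bmu(weights, x)
    return distance > threshold
```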

In order to perform process anomaly detection, antigen information has to be correlated with signals from the testing data. As the SOM is trained on signals only, antigens (PIDs) need to be correctly correlated with the right signals in order to link anomalous sets of signals to their respective initiators (processes). Initial correlation is performed by synchronising antigens with signals using timestamp information: any antigen with timestamp t is assigned the signal set at time t, where t is measured in seconds for the purpose of synchronisation. Once this synchronisation takes place, the signal set-antigen coupling is assessed for its anomaly level using the BMU technique described previously.
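The timestamp synchronisation step can be sketched as follows; the data shapes are illustrative assumptions.

```python
def correlate(antigens, signals):
    # antigens: list of (timestamp, pid) records; signals: dict mapping
    # a timestamp (in whole seconds) to the signal vector recorded then.
    # Each antigen is paired with the signal set from the same second.
    return [(pid, signals[t]) for t, pid in antigens if t in signals]

pairs = correlate([(0, 42), (1, 42), (9, 7)], {0: [0.1, 0.9], 1: [0.2, 0.8]})
# the antigen at t=9 has no matching signal set and is dropped
```

Each resulting (pid, signal set) pair can then be scored with the BMU distance, attributing anomalous signal sets to their initiating process.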

As explained later in this section, the output antigen from the DCA are 'segmented' into fixed […]

Table 5 Default parameter settings for the DCA, chosen following the sensitivity analysis performed previously [24]. Values shown are indexed from zero.

Higher values of MCAV are expected for the SYN scan process and its parent process, the ssh daemon, than for the firefox browser. It is expected that smaller values of z will yield an improvement in the precision and accuracy of the detection, though when z = 100 the system may be too sensitive and an element of tolerance to false positives could be lost. The variants of z used for the DCA are presented in Table 6.

Table 7 Weights used for signal processing, where j represents the input signal category, i represents an instance of a signal within signal category j and p is the corresponding output signal.

The DCA parameters used are those of Table 5, derived as a result of previous DCA sensitivity analysis [24]. Weights for the signal processing of these data are shown in Table 7. These weights provide a shorter time-window for the […]. The SOM parameters are given in Table 8 and are chosen based on recommended values as proposed by Kohonen [47].
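The segmentation of output antigen and the per-segment MCAV (the fraction of an antigen type's presentations made in the mature, i.e. anomalous, context) can be sketched as:

```python
def mcav_per_segment(presentations, z):
    # presentations: ordered (antigen_type, presented_as_mature) pairs,
    # as output by the DC population; z: fixed antigen segment size.
    segments = []
    for start in range(0, len(presentations), z):
        segment = presentations[start:start + z]
        total, mature = {}, {}
        for antigen, is_mature in segment:
            total[antigen] = total.get(antigen, 0) + 1
            mature[antigen] = mature.get(antigen, 0) + int(is_mature)
        # MCAV per antigen type: mature presentations / all presentations
        segments.append({a: mature[a] / total[a] for a in total})
    return segments
```

A process whose MCAV exceeds a chosen anomaly threshold within a segment (0.5 in the AN experiments below) is then classed as anomalous for that segment.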

Antigens are generated using system calls, captured through the use of strace and manipulated within the antigen tissue client. The normalisation of the input signals is implemented using the tissue client, antigen is processed using a separate tissue client, and data processing and the DCA itself are performed by the tissue server process. An initial run of this system is performed to collect the input data and to check for any potential coding errors. Input signals and antigens are collected and recorded in a logfile during the real-time runs. Analysis of the preliminary real-time results of the output antigen and empirical analysis of the input data indicate its suitability for use in these experiments.

The libtissue tcreplay client is used to perform the numerous runs of each data set. It is important to stress that the system is designed to work in real-time, though tcreplay is used to provide reproducibility of results, so that a rigorous analysis can be performed.

The results of the DCA applied to the passive normal data are presented in Figures 17-19 and in Tables 9 and 10. The results for an antigen segment size of z=100 are shown in Figure 17. While the pts process produces a high MCAV initially, between antigen segments 100 and 900 no antigens are presented for the pts process, as it is inactive at this point. As a trendline is required to clarify the results, this indicates that a higher value of z would be preferable for clearly assessing the presence of an anomalous process.

In Figure 18, the results of the PN data set are presented where z=1,000. As with the results presented in Figure 17, an initial spike of a high MCAV is shown, implying that the scan is in its initial stages. While the individual points on this graph are not as dense as in Figure 17, the additional spikes representing the latter stages of the scan are smaller in magnitude, though little difference in the initial MCAV for the nmap process is shown.

The results for z=10,000 are plotted in Figure 19, showing lower values for the pts process and a less […]

The size of the antigen segments (z) is 10,000. This data represents an average MCAV derived from across the ten runs performed. The trendlines represent moving averages per process of interest across ten data points.

The AN results produced by the DCA show similar features. Figure 20 shows the results for z=100, where the normal and anomalous processes run simultaneously through the monitored ssh daemon.

The results for antigen segments 0 to 500 are shown in Figure 21 for the sake of clarity. During this period, the majority of antigen presented belongs to the firefox process, and some modulation of the behaviour of the monitored system occurs, as seen in the initial 500 seconds of Figure 20. Despite these activities, the MCAVs presented in Figure 21 are all relatively low. This suggests that the DCA, using these particular signals, responds appropriately to normal processes in the absence of scanning activity.

In Figure 22, the results are presented for the AN data set where z=1,000. In comparison to Figure 20, the trendlines of the graph are observably similar. Figure 23 shows that an antigen segment size of z=10,000 produces observably different results to z=1,000. This is evident as the major spike peak seen in Figures 20 and 22 is missing in Figure 23. Additionally, the only process technically classed as 'anomalous' (MCAV above 0.5, chosen to reflect the proportion of nmap antigen in the input data) is the nmap scan, though only briefly. This implies that the larger size of z increases the rate of false negatives, as shown through the lower values in Figure 23 and in Tables 9 and 10.

[…] Figure 24. These trends are also evident in Figures 25 and 26.

Trendlines are added to each graph to represent a moving average per process. At the lowest level of granularity, z=1,800 (equivalent in the number of segments to DCA z=100), it is unclear exactly what the individual data points imply. Therefore, a larger segment size may be required, as was also found with the DCA. Again, sensitivity is lost when the size of z is large, as shown by the results presented in Table 11, where z is 1,800,000.

In a similar manner, the AN results for the SOM produce initially high coefficients for the nmap process. The results for z=1,800 are presented in Figure 27, which shows a major spike at the point of scan commencement (segments 400-700). Unlike the DCA, upon application of a trendline it appears that the response to the scan is not sustained, as three peaks are evident, as opposed to the single peak shown with the DCA. Also, the SOM produces high coefficient values for the firefox process, suggesting that discrimination between active anomalous and active normal processes cannot be completely achieved by either algorithm.

The graphs produced for z=18,000 and z=180,000 are shown in Figures 28 and 29 respectively. As with the PN results, the response to nmap decreases as the value of z increases. This is evident from both graphs and from Table 11. Unlike the DCA, which produced MCAVs for nmap consistently higher than those for firefox (Figure 23), with the SOM results both the nmap and firefox coefficients decrease at a similar rate, as exemplified in Figure 29. Statistical analysis is presented in the next section to verify these observations.

This data represents an average MBMU derived from across the ten runs performed. The trendlines represent moving averages per process of interest across 20 data points.

[…] Figure 20 with the other antigen segment sizes. The results of […]

This data represents an average MBMU derived from across the ten runs performed. The trendlines represent moving averages per process of interest across ten data points.

This data represents an average MBMU derived from across the ten runs performed. The trendlines represent moving averages per process of interest across 50 data points.

In three out of the four tested cases, the data series are significantly different. This indicates that z does have an influence on the results of the DCA. Some further work with this concept may prove fruitful, especially if dynamic antigen segment sizes, linked to process activity, are used. The demonstration of statistical significance implies that null hypothesis H1 can be rejected.

As the data is not normally distributed for either algorithm, Mann-Whitney tests are performed, with the results given in Table 13. These results show that the modification of z produces a statistically significant effect on the resultant anomaly values. Therefore, null hypothesis H1 is also rejected for the SOM, in addition to its rejection for the DCA.

This data represents an average MBMU derived from across the ten runs performed. The trendlines represent moving averages per process of interest across 20 data points.

This data represents an average MBMU derived from across the ten runs performed. The trendlines represent moving averages per process of interest across ten data points.

Table 12 The results of the Mann-Whitney test comparing the results of z=100 to the results of z=1,000, z=10,000 and z=100,000. A confidence interval of 95% is used, and data which are statistically significantly […]

A two-sided unpaired Mann-Whitney test is used to perform this comparison. As the sample size is in excess of 300 data points, a 99% confidence interval is deemed appropriate for this assessment.
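At the core of this comparison is the Mann-Whitney U statistic, which can be computed directly as below; in practice a p-value is then obtained from the null distribution of U, e.g. via scipy.stats.mannwhitneyu.

```python
def mann_whitney_u(xs, ys):
    # U counts, over all (x, y) pairs, how often an x outranks a y,
    # with ties counted as one half; the test then compares U against
    # its distribution under the null hypothesis that both samples
    # come from the same population.
    return sum((x > y) + 0.5 * (x == y) for x in xs for y in ys)
```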

The results of this comparison for the firefox process yield a p-value of 0.02, which at the given confidence interval implies that the two sets of results are not statistically significantly different. This implies that the algorithms produce similar results for active normal processes. For the firefox process, null hypothesis H2 cannot be rejected under these particular circumstances with these given data sets.

Upon performance of the same statistical test, the nmap process produces a p-value of 0.002, which shows that the two algorithms produce statistically significant differences in the detection of the scan process. To assess which system produces the better performance, an additional two-sided Mann-Whitney test is performed. The results of this test show that the DCA has the improved performance, producing a p-value of 0.0001. Therefore, null hypothesis H2 can be rejected for the nmap process, and the DCA shows the better performance on this occasion.

6.3 Baseline

To validate both sets of results and to ensure that both performances improve over a baseline, a k-means classifier is applied to the signal data. The classifier used belongs to the WEKA suite [65]. In this test, 52% of the signals were classed as belonging to one class and 48% to the other. This implies that the necessary discrimination cannot be achieved through classification on the basis of signals alone. It also shows that this data is non-trivial to classify, which adds value to the results produced for both the SOM and the DCA.

We have validated the use of the DCA as a serious competitor for anomaly detection applications.

Until this comparison, we were still uncertain as to the quality of results produced by the DCA. This comparison with the traditional SOM has shown that the DCA shows great promise as a successful AIS algorithm. The performance produced by the DCA demonstrates that the algorithm is capable of performing at a level comparable to a standard technique.