Theoretical formulation and analysis of the deterministic dendritic cell algorithm

As one of the emerging algorithms in the field of Artificial Immune Systems (AIS), the Dendritic Cell Algorithm (DCA) has been successfully applied to a number of challenging real-world problems. However, one criticism is the lack of a formal definition, which could result in ambiguity in understanding the algorithm. Moreover, previous investigations have mainly focused on its empirical aspects. Therefore, it is necessary to provide a formal definition of the algorithm, as well as to perform runtime analyses to reveal its theoretical aspects. In this paper, we define the deterministic version of the DCA, named the dDCA, using set theory and mathematical functions. Runtime analyses of the standard algorithm and the one with additional segmentation are performed. Our analysis suggests that the standard dDCA has a runtime complexity of O(n^2) for the worst-case scenario, where n is the number of input data instances. The introduction of segmentation changes the algorithm's worst-case runtime complexity to O(max(nN, nz)), for DC population size N and segment size z. Finally, two runtime variables of the algorithm are formulated based on the input data, to understand its runtime behaviour as guidelines for further development.


Introduction
Artificial Immune Systems (AIS) [7,18] are computer systems inspired by both theoretical immunology and observed immune functions, principles and models, which are applied to real-world problems. The human immune system, from which AIS draw inspiration, has evolved to protect the host from a wealth of invading micro-organisms. AIS are developed to provide similar defensive properties within a computing context. Initially, AIS were based on simple models of the human immune system. As noted by Stibor et al. [29], 'first generation' immune algorithms, such as negative selection and clonal selection, do not produce the same high-quality performance as the human immune system. These algorithms, negative selection in particular, are prone to problems with scalability and the generation of excessive false alarms when used to solve problems such as network-based intrusion detection. Recent AIS use more rigorous and up-to-date immunology and are developed in collaboration with modellers and immunologists. The resulting algorithms are believed to encapsulate the desirable properties of immune systems, including robustness, error tolerance, and self-organisation [7].
One such 'second generation' immune algorithm is the Dendritic Cell Algorithm (DCA) [10]. The algorithm is inspired by functions of the dendritic cells (DCs) of the innate immune system, while incorporating principles of a key novel theory in immunology, named the danger theory [21]. An abstract model of natural DC behaviour is used as the foundation of the developed algorithm. The DCA has been successfully applied to numerous security-related problems, including port scan detection [10], botnet detection [1] and as a classifier for robot security [25]. These applications belong to the area of anomaly detection, which is essentially one particular type of binary classification with an 'anomalous' class and a 'normal' class. According to the results of these applications, the DCA has shown not only good performance in terms of detection rate, but also the ability to reduce the rate of false alarms in comparison to other systems, such as Self Organising Maps (SOM) [13].
However, there are also issues concerning the DCA. One criticism is the lack of a formal definition, which could result in ambiguity in understanding the algorithm and thus lead to incorrect applications and implementations. It is pointed out in [28] that the DCA shares similarities with linear classifiers, since it employs a linear discriminant function for signal transformation. However, the DCA is not simply a collection of linear classifiers, as it performs classification based on the temporal correlation of a multi-agent DC population, rather than linear signal transformation. Signal transformation is used to identify whether any anomalies occurred in the past. Whether the identified anomalies can be correctly correlated with potential causes is determined by the effectiveness of the temporal correlation performed at the population level. As a first step, a formal definition should be provided to present the algorithm in a clear and accessible manner.
Previous investigations have mainly focused on its empirical aspects, evidenced by experimental results on a range of problem domains. Except for the geometry analysis of Stibor et al. [28] that was later extended in Oates's thesis [24], theoretical analysis of the DCA has barely been performed, and most theoretical aspects of the algorithm have not yet been revealed. Other immune inspired algorithms, such as negative and clonal selection algorithms, were theoretically presented in [30]. Elberfeld and Textor [9] theoretically analysed string-based negative selection algorithms, to show the possibility of reducing the worst-case runtime complexity from exponential to polynomial through compressing detectors. More recently, the work of Zarges [31,32] theoretically analysed one of the vital components of clonal selection based algorithms, namely inversely proportional mutation rates. Jansen and Zarges [19] performed a theoretical analysis of immune inspired somatic contiguous hypermutations for function optimisation. As a result, it is important to conduct a similar theoretical analysis of the DCA, to determine its runtime complexity and other algorithmic properties, in line with other AIS.
In this paper, we extend the work presented in [15], which involved formal specifications of a single-cell model at the behavioural level using interval temporal logic [23]. Note that the algorithm demonstrated in this work is the deterministic DCA (dDCA) [12], created by removing stochastic components for ease of analysis. Any subsequent statements about the DCA refer to the dDCA. The aim is to provide a clear and accessible definition of the DCA, as well as an initial theoretical analysis of the algorithm's runtime complexity and other algorithmic properties. As potential readers may not have a deep understanding of complicated formal methods such as the B-method [20], we use set theory and mathematical functions to specify the algorithm. From the formal definitions, theoretical analyses of the runtime complexity are performed, for the standard algorithm and an extended system with segmentation. Moreover, the formulations of two important runtime variables are included to present the algorithm's runtime behaviour, and to provide guidelines for future development. The paper is organised as follows: an overview of the DCA is given in Section 2, the formal definition is presented in Section 3, runtime analyses are shown in Section 4, the formulation of two runtime variables is described in Section 5, and finally conclusions and future work are presented in Section 6.

Biological Background
The DCA is inspired by functions of the dendritic cells (DCs) of the innate immune system, which forms part of the body's first line of defence against invaders. DCs exhibit the ability to combine a multitude of molecular information and to interpret this information for the T-cells of the adaptive immune system. This could lead to the induction of various immune responses against perceived pathogenic threats. Therefore, DCs are often seen as detectors responsible for policing different tissues, as well as inductive mediators for a variety of immune responses.
In general, two types of molecular information are processed by DCs, namely 'signal' and 'antigen'. Signals are collected by DCs from their local environment and consist of indicators of the health status of the monitored tissue. Throughout its lifespan, an individual DC will exist in one of three states, namely 'immature', 'semi-mature' and 'fully mature', as shown in Figure 1. In the initial immature state, DCs are exposed to a combination of signals, and perform phagocytosis to ingest substances from their surroundings. Based on the concentration of presented signals, DCs differentiate into either a 'fully mature' form to activate the adaptive immune system, or a 'semi-mature' form to suppress it. If a DC is exposed to a combination of signals generated from a healthy or steady-state tissue environment, such as one with no occurrence of tissue damage, it is more likely to become a semi-mature DC. Conversely, if a DC is presented with a combination of signals generated from a damaged tissue environment, such as one with unregulated cell death, it is more likely to differentiate into a fully mature DC. Natural DCs bind to and process many cytokine signals. In an abstract model of DC behaviour developed by Greensmith [10], the following categories are defined.
• PAMP: Pathogen-Associated Molecular Patterns are molecular signatures of pathogens which are recognised by Toll-Like Receptors (TLRs) on the surface of DCs; they are highly influential in the transition from the immature state to the fully mature state;
• Danger: danger signals are released by damaged tissue cells subject to necrosis (unregulated cell death); they have a lower effect than PAMPs on the maturation towards the fully mature state;
• Safe: safe signals are derived from cells that undergo apoptosis (programmed cell death); TNF-α (Tumour Necrosis Factor) is one candidate safe signal; they contribute to the maturation from the immature state to the semi-mature state.
During the immature state, DCs also collect debris in the tissues, which is subsequently combined with the environmental signals. Some of the 'suspicious' debris collected is known as antigens, which are proteins originating from potential invading entities. DCs combine the 'suspect' antigens with evidence in the form of signals to correctly instruct the adaptive immune system to respond, or become tolerant, to the presented antigens. For more detailed information regarding the underlying biological mechanisms, please refer to [10,21].

Algorithmic Details
The DCA was designed and developed based on an abstract DC model created by Greensmith [10]. It incorporates the functionality of DCs including data fusion, state differentiation and causal correlation. As per the natural system, there are two types of input data, namely 'antigen' and 'signal'. It is generally assumed that a certain causal relationship exists between the two data streams. Antigens are categorical values that can be various states of a problem domain or the entities of interest associated with a monitored system. Signals are represented as vectors of real-valued numbers, and they are measures of a monitored system's status within certain time periods. In real-world applications, antigens represent what is to be classified within a given problem domain. For instance, they can be process IDs in computer security problems [1,11], a small range of positions and orientations of robots [25], the proximity sensors of online robotic systems [22], or the time stamps of records collected in biometric data [17]. Signals represent the system context of a host or a measure of network traffic [1,11], the readings of various sensors in robotic systems [25,22], or the biometric data captured from a monitored automobile driver [17]. Signals are normally pre-categorised as 'PAMP', 'Danger' or 'Safe'. The semantics of these signal categories are listed as follows:
• PAMP: increases in value with the observation of anomalous behaviour; it is a confidence indicator of anomaly, which is usually presented as signatures of events that can definitely cause damage to the system;
• Danger: reflects potential anomalies; as the value increases, the confidence in the abnormal status of the monitored system increases accordingly;
• Safe: increases in value in conjunction with observed normal behaviour; this is a confidence indicator of normal, predictable or steady-state system behaviour.
Increases in the value of the safe signal suppress the effect of the PAMP and Danger signals within the algorithm, as observed in the natural system. This immunological property is incorporated within the DCA in the form of predefined weights for each signal category, for the transformation from input signals to output signals, which are the 'CSM' and 'K' signals.
The CSM signal reflects the amount of information a DC has processed, i.e. when to make decisions, while the K signal is a measure indicating the polarisation towards anomaly or normality, i.e. how to make decisions. The output signals are used to evaluate the status of the system monitored by the analysis component of the algorithm. Such a signal transformation process is displayed in Figure 2.
In order to achieve its detection ability, the DCA initialises a population of artificial DCs operating in parallel as detectors. Each DC is given a distinct limit on its lifespan, which creates a dynamic time window effect in the population [26]. This leads to the same signal and antigen data streams being processed by every DC, during different time periods across the analysed time series. A temporal correlation between signals and antigens is also performed by each DC internally, to capture the causal relationship within the data. As suggested in [12], to perform correct correlation, the signals are supposed to appear after the antigens, and the delay should be shorter than the time window created by each DC.
During detection, each individual DC updates its antigen profile by storing the sampled antigens internally. In the meantime, the output signals produced by the signal transformation are accumulated, to update the DC's lifespan and signal profile. The cumulative CSM is subtracted from the DC's lifespan, which gives the difference between the amount of information initially allowed for a DC and the amount that has been processed by the DC so far. This difference indicates whether the DC has processed sufficient information and is ready to make decisions. On the other hand, the cumulative K is added to its signal profile, to aggregate the polarisation towards anomaly or normality indicated by its tendency toward −∞ or +∞. As soon as the DC's lifespan reaches zero, it stops performing signal transformation and temporal correlation. The association between the cumulative K and the sampled antigens within the DC, termed 'processed information', is then accumulated by the analysis phase to produce the final detection results. Once a matured DC has presented the processed information, it is reset to its default form. Here, the population size is generally kept constant, but can be user specified. The entire process of the different steps of the DCA is illustrated in Figure 3.

Formalisation of the DCA
In this section, we formally define data structures and procedural operations of the DCA at the population level. Unlike the specifications of a single DC at the behavioural level in [15], here we focus on specifying the entire DC population using quantitative measures at the functional level. Instead of using more advanced and possibly more complex interval temporal logic [23], set theory and mathematical functions, e.g. addition, multiplication and recursion, are used for clarity. This aims to present the algorithm in a comprehensive way, which can be easily accessed by readers who may not be familiar with formal logic.

Data Structures
Define Signal ⊆ R^m and Antigen ⊆ N as the two types of input data. Within a discrete time space Time = {1, 2, ..., t, ...}, the input data can be defined as a function S : Time → Signal ∪ Antigen, and S(t) is a data instance at a time point t ∈ Time. Elements from Signal are input signal instances of the algorithm, and are represented as m-dimensional real-valued vectors. These are usually normalised into a non-negative range, e.g. [0, 1], as the input to the DCA. In many applications, m = 3 is the standard case, corresponding to the three input signal categories of the DCA as described in Section 2. Elements from Antigen are categorical identifiers of certain objects to be classified, and are often represented as natural numbers starting from one, where the order is ignored.
Define the weight matrix of signal transformation as W = (w_ij), where each entry w_ij ∈ R. The weight matrix W is used to transform the m-dimensional input signals to two categories of output signals, namely 'CSM' and 'K'. It is usually predefined by users and kept constant during runtime. Entries in the weight matrix are based on empirical results from the underlying immunology of natural DCs [10].
Let Population be an index set of DCs and N = |Population| be the population size (N = 100 is a popular choice). The index of a DC is i ∈ Population. The function of assigning the initial lifespan to a DC is defined as I : Population → R, where I(i) ≠ I(j) (i ≠ j ∈ Population), so that every DC receives a distinct lifespan. The function of initialising the antigen profile of the DC is defined as M : Population → (a_i1, a_i2, ..., a_ik, ...), where (a_i1, a_i2, ..., a_ik, ...) is a sequence storing the antigen instances sampled by a DC and a_ik ∈ Antigen. The initial signal profile of a DC is usually set to zero.
The output of each DC is stored as a pair (a_ik, r_i) ∈ Antigen × R in a list, where r_i is the signal profile of a DC when it reaches a termination condition. We also define π_1 and π_2 as projection functions to obtain the first and second elements of a pair respectively.

Procedural Operations
To access the data structures of the DCA, a series of one-step procedural operations are executed. Formally defining these operations is essential for the algorithm's runtime analysis. At the beginning (t = 1), the algorithm initialises all the DCs indexed by Population, through assigning the initial values of lifespans and signal profiles, named 'DC initialisation'. The value of I(i) depends on the distribution function used to generate the initial lifespans of DCs. Both uniform distribution and Gaussian distribution can be applied to generate I(i). The antigen profile of each DC is set as Null or empty, while the signal profile is set as zero.
Definition 1 (signal transformation). The signal transformation function O : Time → R^2 maps each time point t to the pair of output signals obtained from S(t). This operation is executed whenever S(t) ∈ Signal holds, and it performs the multiplication between a transposed 2 × m matrix and an m-dimensional vector to produce a two-dimensional vector of output signals, 'CSM' and 'K'. These are related to when and how to make decisions respectively. In the case that S(t) ∈ Antigen, the function returns a zero vector.
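The transformation in Definition 1 can be sketched as a small matrix-vector product. This is a minimal illustration rather than the authors' implementation: the weight values, the names `WEIGHTS` and `transform`, and the choice to represent antigens as Python integers and signals as lists of floats are all assumptions made for the example.

```python
from typing import List, Tuple, Union

# Illustrative weights only (an assumption, not values given in this paper):
# one row per output signal (CSM, K), one column per input (PAMP, Danger, Safe).
WEIGHTS: List[List[float]] = [
    [1.0, 1.0, 1.0],   # CSM: amount of information processed
    [1.0, 1.0, -2.0],  # K: safe signal suppresses PAMP and Danger
]

Signal = List[float]   # m-dimensional real-valued vector
Antigen = int          # categorical identifier


def transform(instance: Union[Signal, Antigen],
              weights: List[List[float]] = WEIGHTS) -> Tuple[float, float]:
    """Return (CSM, K) for a signal instance, or a zero vector for an antigen."""
    if isinstance(instance, int):          # S(t) is an antigen
        return (0.0, 0.0)
    csm = sum(w * s for w, s in zip(weights[0], instance))
    k = sum(w * s for w, s in zip(weights[1], instance))
    return (csm, k)
```

With these assumed weights, a signal with a high safe component yields a lower K than one dominated by PAMP or Danger, which mirrors the suppression effect described in Section 2.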
Definition 2 (lifespan update). The lifespan update function F : Time × Population → R is defined as
\[
F(t, i) =
\begin{cases}
I(i), & t = 1, \\
F(t-1, i) - \pi_1(O(t)), & t > 1 \text{ and } F(t-1, i) > 0, \\
I(i) - \pi_1(O(t)), & t > 1 \text{ and } F(t-1, i) \le 0.
\end{cases}
\]
When t = 1, the initial value of F is I(i), which is the initial lifespan of the DC with an index i. The CSM signal is repeatedly subtracted from it until the termination condition, F(t − 1, i) ≤ 0, is reached. The function is then reset to 'I(i) − π_1(O(t))' (not I(i)), because the function O(t) is executed on a regular basis, e.g. at every single time point t.

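Definition 3 (signal profile update). The signal profile update function G : Time × Population → R is defined as
\[
G(t, i) =
\begin{cases}
0, & t = 1, \\
G(t-1, i) + \pi_2(O(t)), & t > 1 \text{ and } F(t-1, i) > 0, \\
\pi_2(O(t)), & t > 1 \text{ and } F(t-1, i) \le 0.
\end{cases}
\]
When t = 1, the value of G is zero, which is the initial signal profile of the DC with an index i. It is repeatedly increased by the K signal until the termination condition is reached. The function is then reset to 'π_2(O(t))' (not 0), because the function O(t) is executed on a regular basis, e.g. at every single time point t ∈ Time.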
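The lifespan and signal profile updates of Definitions 2 and 3 can be traced for a single DC with a few lines of code. The sketch below is one reading of those definitions, assuming the state at t = 1 is the initial state and each later time point applies one (CSM, K) pair; the function name `run_cell` is hypothetical.

```python
from typing import List, Tuple


def run_cell(initial_lifespan: float,
             outputs: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Trace (lifespan, signal profile) of one DC over a stream of (CSM, K) pairs.

    The first entry is the state at t = 1; each later entry applies one update.
    """
    lifespan, profile = initial_lifespan, 0.0
    trace = [(lifespan, profile)]
    for csm, k in outputs:
        if lifespan > 0:
            # Termination condition not reached: keep accumulating.
            lifespan, profile = lifespan - csm, profile + k
        else:
            # Termination condition reached at the previous step: reset,
            # then immediately process the current output signals.
            lifespan, profile = initial_lifespan - csm, k
        trace.append((lifespan, profile))
    return trace
```

For example, `run_cell(2.0, [(1.0, 0.5), (1.0, 0.5), (1.0, -1.0)])` exhausts the lifespan after two signals and restarts from `2.0 - 1.0` on the third, so the reset value is never exactly the initial lifespan.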
Definition 4 (antigen profile update). The antigen profile update function H : Time × Population → (a_i1, a_i2, ..., a_ik, ...) maps a time point and a DC index to the antigen profile of that DC, where H is initially empty. As a new antigen instance arrives, it is sampled by the DC with an index i and its antigen profile is updated until the termination condition is reached. This function merely appends a list to another, which can be done in constant time regardless of the length of the lists, and is thus considered a one-step operation as well. It is performed individually by each DC, and the index of the DC selected to sample an incoming S(t) ∈ Antigen is defined as i ≡ θ mod N (i is congruent with θ modulo N), where θ is the number of antigen instances up to time t. This is termed the 'sequential sampling' rule.
Definition 5 (output record). Let r_i = G(t, i) s.t. F(t − 1, i) ≤ 0 be the signal profile of a DC, and L : N → Antigen × R denote the function that maps an index j ∈ N to an element of the output list. The output record function pairs every antigen a_ik in the antigen profile of the matured DC with r_i and appends the resulting pairs (a_ik, r_i) to the output list, where L(j) is the jth element of the list. The antigen profile often consists of multiple values, while the signal profile only contains one single value in the DC with an index i. This function essentially enumerates all the possible pairs and appends them to the output list, where each of them is assigned an index j. The list is then used to produce the final detection results in the analysis phase of the DCA.
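The sequential sampling rule can be illustrated in isolation. The sketch below assumes DCs are indexed 0 to N − 1 (so that the index is simply θ mod N), represents antigens as integers and signals as lists, and ignores lifespan resets; the function name `sequential_sampling` is hypothetical.

```python
from typing import Dict, List, Union


def sequential_sampling(stream: List[Union[int, List[float]]],
                        population_size: int) -> Dict[int, List[int]]:
    """Assign each antigen in the stream to a DC by the sequential sampling rule."""
    profiles: Dict[int, List[int]] = {i: [] for i in range(population_size)}
    theta = 0                                   # antigen instances seen so far
    for instance in stream:
        if isinstance(instance, int):           # S(t) is an antigen
            theta += 1
            # Appending one element is a constant-time operation.
            profiles[theta % population_size].append(instance)
    return profiles
```

Because θ only advances on antigen instances, consecutive antigens are spread evenly over the population regardless of how many signal instances arrive in between.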
Definition 6 (antigen counter). The antigen counter function C : N × Antigen → {0, 1} is defined as
\[
C(j, \alpha) =
\begin{cases}
1, & \pi_1(L(j)) = \alpha, \\
0, & \text{otherwise.}
\end{cases}
\]
Definition 7 (signal profile abstraction). The signal profile abstraction function R : N × Antigen → R is defined as
\[
R(j, \alpha) =
\begin{cases}
\pi_2(L(j)), & \pi_1(L(j)) = \alpha, \\
0, & \text{otherwise.}
\end{cases}
\]
In the two functions above, α ∈ Antigen is an antigen type. The function C is used to count the number of instances of antigen type α, and the function R is used to calculate the sum of all K values associated with antigen type α. These two operations are performed for every antigen type and involve scanning the sequence of L(j) in its entirety.
Definition 8 (anomaly metric calculation). Given that the number of input instances is equal to n, the anomaly metric calculation function K : Antigen → R is defined as
\[
K(\alpha) = \frac{1}{\beta} \sum_{j=1}^{n} R(j, \alpha), \qquad \beta = \sum_{j=1}^{n} C(j, \alpha).
\]
As Antigen ≠ ∅ and α ∈ Antigen, the minimum number of antigen instances is equal to one, and so is the minimum number of antigen types. Therefore, we have β ≥ 1. A threshold ε can be applied for further classification. The value of the threshold depends on the underlying characteristics of the dataset used. An antigen type α is classified as anomalous if K(α) > ε, and normal otherwise.
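The analysis phase described by Definitions 6 to 8 can be sketched as follows. The sketch deliberately scans the whole output list once per antigen type, as the definitions describe; the names `count`, `k_value`, `anomaly_metric` and `classify`, and the representation of the output list as Python tuples, are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

OutputList = List[Tuple[int, float]]   # pairs (antigen, signal profile)


def count(entry: Tuple[int, float], alpha: int) -> int:
    """Antigen counter: 1 if the entry holds antigen type alpha, else 0."""
    return 1 if entry[0] == alpha else 0


def k_value(entry: Tuple[int, float], alpha: int) -> float:
    """Signal profile abstraction: the entry's K if it holds alpha, else 0."""
    return entry[1] if entry[0] == alpha else 0.0


def anomaly_metric(output_list: OutputList, alpha: int) -> float:
    """Mean K over all entries of type alpha (alpha must occur at least once)."""
    beta = sum(count(entry, alpha) for entry in output_list)
    total = sum(k_value(entry, alpha) for entry in output_list)
    return total / beta


def classify(output_list: OutputList, threshold: float) -> Dict[int, str]:
    """Label every antigen type by comparing its metric with the threshold."""
    types = {antigen for antigen, _ in output_list}
    return {alpha: "anomalous" if anomaly_metric(output_list, alpha) > threshold
            else "normal" for alpha in types}
```

Each call to `anomaly_metric` touches every entry of the list, so classifying b antigen types over a list of a entries costs on the order of a·b steps.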

Analysis of Runtime Complexity
The Standard DCA
By combining the procedural operations of the DCA with for loops, while loops and if statements, the algorithm can be presented as in Algorithm 1. Previous applications of the DCA have shown that the runtime of the algorithm is relatively short and the consumption of computation power is also low [12]. However, theoretical analysis of the runtime complexity of the DCA, given a set of input data, has not yet been performed. Runtime analysis involves calculating the number of primitive operations or steps executed by an algorithm [5]. The analysis is often based on asymptotic theory, and its aim is to theoretically show the runtime complexity of an algorithm as a function of increasing input size n.
As mentioned previously, applications of the DCA belong to the area of anomaly detection. In AIS, one popular anomaly detection algorithm is known as the negative selection algorithm, which was shown to have an exponential runtime complexity [30]. An attempt at reducing the worst-case runtime complexity from exponential to polynomial was reported in [9]; however, this reduction is only applicable when the input feature space consists of bit strings instead of real numbers. Other popular anomaly detection algorithms are more or less derived from techniques in machine learning [4], e.g. K-Nearest Neighbour (KNN) with a runtime complexity of O(nd) [8], decision tree algorithms with an exponential runtime complexity [8], and support vector machines (SVM) with a runtime complexity of O(n^2 d) [3], where n is the number of input instances and d is the dimensionality. As a result, the subsequent runtime analysis of the DCA reveals whether the algorithm is competitive against other state-of-the-art anomaly detection algorithms.
Let a be the number of antigen instances within the input data, b = |Antigen| be the number of antigen types and N be the size of the DC population. According to previous applications [1,25,11,22,17], N is usually user defined and independent of the increase of data size n. However, we often assume that 1 ≤ N ≤ n. In order to make the following analyses more general, the population size N is considered a parameter of the algorithm. As the type of input data instances is either Antigen or Signal, if the number of antigen instances is equal to a, the number of signal instances is n − a. For ease of analysis, the algorithm is divided into three phases as follows:
1. Initialisation phase: Line 1 to Line 3;
2. Detection phase: Line 4 to Line 19;
3. Analysis phase: Line 20 to Line 26.
The calculation of runtime is performed phase by phase. Let T_1(n), T_2(n) and T_3(n) be the runtime of each phase respectively, so that T(n) = T_1(n) + T_2(n) + T_3(n) is the overall runtime of the algorithm. Details of all the primitive operations of the algorithm are listed in Table 1, including the line number and the description of each operation as well as the number of times an operation is executed, corresponding to Algorithm 1.
The initialisation phase is only executed once for the entire DC population at the commencement of the algorithm. Its runtime is independent of the number of input instances n, but is determined by the population size N. Therefore, the runtime of the initialisation phase is calculated as follows.
The runtime of the detection phase depends on the data size n, the number of antigen instances a, the number of signal instances n − a and the size of the DC population N. Thus the runtime of the detection phase is calculated as follows.
The runtime of the analysis phase depends on the size of the output list, which is equal to the number of antigen instances a, and on the number of antigen types b. The value of b is determined by the number of states or entities to classify within a problem domain. Here we merely focus on the worst-case scenario, which occurs if b = a, i.e. the number of antigen types is equal to the number of antigen instances. In general, we have 1 ≤ b ≤ a ≤ n. The runtime of the analysis phase is thus calculated as follows.
Theorem 1. The runtime complexity of the standard DCA is bounded by O(n^2), with respect to the data size n.
Proof.
O-notation provides an asymptotic upper bound on the runtime, which may or may not be asymptotically tight.
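At the asymptotic level, the bound of Theorem 1 can be outlined as below. This is only a sketch under the assumptions stated in this section (1 ≤ N ≤ n and 1 ≤ b ≤ a ≤ n, with every signal instance updating all N DCs and the analysis phase scanning the output list once per antigen type); it does not reproduce the exact step counts of Table 1.

```latex
\begin{align*}
T(n) &= T_1(n) + T_2(n) + T_3(n) \\
     &= O(N) + O(nN) + O(ab) \\
     &\subseteq O(n) + O(n^2) + O(n^2) \\
     &= O(n^2).
\end{align*}
```

The quadratic term of the detection phase only appears when N grows with n, whereas the quadratic term of the analysis phase appears whenever the number of antigen types grows with the number of antigen instances.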
As suggested by Theorem 1, the DCA has a worst-case runtime complexity of O(n^2), which is quadratic. As a result, the DCA is indeed competitive in terms of processing large-sized datasets while keeping the runtime complexity under control, when compared to state-of-the-art anomaly detection algorithms. According to previous applications [1,11,25,22,17,14], we often have N ≪ n. Such a premise makes the runtime complexity of the algorithm's initialisation and detection phases linear overall, while the analysis phase stays quadratic. This leads to the following work of modifying the analysis phase of the algorithm via the introduction of segmentation.

The DCA with Segmentation
Segmentation is introduced to adapt the algorithm to online analysis [16]. Instead of analysing the processed information in a single operation at the termination of the detection phase, the output list is partitioned into smaller segments and the analysis is performed within each segment. We postulate that segmentation could potentially generate finer grained results, as well as allowing analysis to be performed in parallel with the detection process. Here, we focus on the antigen based segmentation approach, as it is more favourable in actual applications [16]. One may think that the system with segmentation produces the final detection results much faster, as the analysis is performed during detection on a much smaller chunk of processed information. Based on the analysis of the standard DCA, it is possible to theoretically analyse the effect of segmentation on the algorithm's runtime complexity. Let z be a predefined segment size and 1 ≤ z ≤ n. A segment is generated once the size of the output list reaches z, and an analysis of the current batch of processed information in the output list is performed.
As a post-processing mechanism, segmentation only affects the analysis phase of the algorithm, but not the initialisation phase or detection phase. The search space of the analysis of a segment is determined by the value of z. The number of segments created is equal to ⌈n/z⌉, and they are indexed by {1, 2, ..., k, ..., ⌈n/z⌉}. Let a_k ≤ z and b_k ≤ z denote the number of antigen instances and the number of antigen types in the kth segment respectively. As a result, the runtime complexity of each segment at the analysis phase is O(a_k b_k), which is at most O(z^2).
Theorem 2. The runtime complexity of the DCA with segmentation is bounded by O(max(nN, nz)), with respect to the data size n, the DC population size N, and the segment size z.
Proof.
As shown in Theorem 2, the introduction of segmentation changes the overall runtime complexity of the algorithm to O(max(nN, nz)). Depending on the values of N and z, the runtime complexity can be either quadratic (when N or z grows in proportion to n) or linear (when both N and z are constants much smaller than n). This is very attractive for online detection tasks, as it provides a means of online analysis that continuously and periodically produces results during detection. Additionally, the DCA with segmentation produces significantly different and better results than the standard version [16]. Therefore, segmentation is an important and necessary addition to the DCA from a practical point of view. Thus far only static segmentation with a fixed segment size has been applied to the DCA. The effect of variable segment sizes on the detection performance still requires further investigation.
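The effect of a fixed segment size on the analysis phase can be illustrated with a short sketch. It assumes the output list is already available and is cut into consecutive chunks of at most z entries, which simplifies the online setting where a segment is analysed as soon as the list reaches size z; the function name `segment_metrics` is hypothetical.

```python
from typing import Dict, List, Tuple


def segment_metrics(output_list: List[Tuple[int, float]],
                    z: int) -> List[Dict[int, float]]:
    """Compute the mean K per antigen type within each segment of size z."""
    results: List[Dict[int, float]] = []
    for start in range(0, len(output_list), z):
        segment = output_list[start:start + z]
        types = {antigen for antigen, _ in segment}   # at most z types
        metrics: Dict[int, float] = {}
        for alpha in types:
            # Each type scans the segment once: at most z entries.
            values = [k for antigen, k in segment if antigen == alpha]
            metrics[alpha] = sum(values) / len(values)
        results.append(metrics)
    return results
```

Each segment costs at most on the order of z·z steps and there are roughly n/z segments, which is where the nz term in the bound of Theorem 2 comes from.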

Formulation of Runtime Properties
Two runtime variables of the DCA are assessed, as they can be used as quantitative indicators of the changes to the algorithm's runtime behaviour. They are the number of matured DCs (those which reach the termination condition and are reset) and the number of processed antigens respectively. The number of matured DCs indicates that the amount of processed information is related to signal instances. Conversely, the number of processed antigens implies that the amount of processed information is related to antigen instances. In this section, the formulation of the above properties is given, to build up the mathematical foundation of the algorithm. This is obtained with respect to a time interval [t_b, t_e] := {t_b, t_b + 1, t_b + 2, ..., t_e} ⊆ Time.

Number of Matured DCs
The number of matured DCs within a time interval is related to the reset frequency of the DC population, which indicates the work-load of the DC population.This can be used to determine whether the current setup of the current system should be altered.If the frequency of DC resetting is too high, most of the DCs become matured and get reset before they acquire a sufficient amount of information.As a result, the range of lifespans of the DC population should be extended, allowing more information to be obtained.In conduction with extending the range of lifespans of the DC population, it is necessary to also increase the size of the DC population, so that the lifespans do not become sparse.
This becomes crucial if the system is deployed online, as an online system is often required to perform continuous detection and adapt to the changes of real-time situations.The number of matured DCs in the DC population depends on the distribution function used for the generation of DC lifespans, in addition to the input data within the time interval of interest.To make the analysis manageable, two types of distributions for generating the initial DC lifespans are considered, namely uniform distribution [2] and Gaussian distribution [2].The calculations will be done through using the mean value of lifespans of the DC population and the mean value of CSM signals corresponding to all the input signal instances.They focus on the average number of matured DCs within a given time interval rather than the particular number per iteration.However, as the time interval is reduced, e.g. to the duration of one iteration, the two numbers could become approximate to each other.
Proposition 1 (uniform distribution).If the lifespans of the DC population are generated from an arithmetic series x i = x 1 + (i − 1)d, where x i is the nth element, x 1 is the first element and d is the interval between two successive elements, the number of matured DCs in the DC population δ can be calculated as follows.
Proof. By default, the ascending order of lifespans of the DC population corresponds to the order of its indices. As a result, if the size of the DC population is equal to $N$, the lifespan of the last DC, with index $i = N$, is given as $x_N = x_1 + (N - 1)d$. As demonstrated in Section 3, a DC matures as soon as its lifespan reaches zero through subtracting the CSM signals.
Here $\phi$ is the mean value of the CSM signals within the interval $[t_b, t_e]$ and $\mu_1$ is the mean lifespan of the DC population.
Uniform distribution is used in the dDCA [12] to generate the initial lifespans of the DC population. This produces a set of values that are uniformly distributed within a certain range. According to Proposition 1, if the parameters of the arithmetic series (the first element $x_1$ and the interval $d$) are given, the number of matured DCs within the time interval $[t_b, t_e]$ can be calculated accordingly.
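The behaviour described by Proposition 1 can also be checked empirically. The following is a minimal simulation sketch, not the closed-form expression of the proposition: it generates lifespans from the arithmetic series and counts maturation events over a sequence of CSM values. The function names are hypothetical, and two assumptions are made explicitly: every DC receives every CSM value, and a matured DC is reset to its initial lifespan.

```python
from typing import List, Sequence


def arithmetic_lifespans(x1: float, d: float, size: int) -> List[float]:
    """Initial lifespans x_i = x1 + (i - 1) * d for i = 1..size."""
    return [x1 + i * d for i in range(size)]


def count_matured(lifespans: Sequence[float], csm: Sequence[float]) -> int:
    """Count maturation events while the population consumes `csm`.

    Assumptions (illustrative, not the paper's formula): every DC
    receives every CSM value, and a matured DC is reset to its
    initial lifespan.
    """
    remaining = list(lifespans)
    matured = 0
    for signal in csm:
        for i, initial in enumerate(lifespans):
            remaining[i] -= signal
            if remaining[i] <= 0:  # lifespan exhausted: the DC matures
                matured += 1
                remaining[i] = initial  # reset for the next cycle
    return matured
```

Counts obtained this way on synthetic CSM sequences could be compared against the value predicted from the mean lifespan and the mean CSM signal.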
Proposition 2 (Gaussian distribution). If the lifespans of the DC population are generated from a Gaussian distribution $x \sim \mathcal{N}(\mu, \sigma^2)$, then the following formula holds.
$\Pr(\cdot)$ is the probability operator. If the sample size is $N$, the sample mean $\mu_2$ follows a Gaussian distribution, $\mu_2 \sim \mathcal{N}(\mu, \sigma^2 / N)$ [2]. The lower and upper bounds of the sample mean can be used to derive bounds on the number of matured DCs.
In practice, the Gaussian distribution has not been used for generating the lifespans of the DC population, but it has been of great interest [27] and would be a priority of future investigation. According to Proposition 2, if we know the mean ($\mu$) and variance ($\sigma^2$) of the Gaussian distribution from which the lifespans of the DC population are generated, the size of the DC population $N$, and the input data instances within the time interval $[t_b, t_e]$, we can show that, with probability 0.95, the number of matured DCs lies between the lower and upper bounds. This could provide sufficient information for adjusting the system according to real-time scenarios.
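As an illustration of the first step of this argument, the sketch below computes the central interval that contains the sample mean lifespan with a given probability, using the standard result that the mean of $N$ Gaussian samples has standard deviation $\sigma / \sqrt{N}$. It covers only the bounds on the sample mean; mapping them to bounds on the number of matured DCs requires the formula of Proposition 2. The function name and the example parameter values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def sample_mean_bounds(mu: float, sigma: float, size: int,
                       level: float = 0.95) -> Tuple[float, float]:
    """Central interval holding the sample mean with probability `level`.

    Assumes lifespans are independent draws from N(mu, sigma^2), so the
    sample mean of `size` draws follows N(mu, sigma^2 / size).
    """
    # Two-sided quantile of the standard normal (about 1.96 for 0.95).
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * sigma / sqrt(size)
    return mu - half_width, mu + half_width


# Hypothetical example: mean lifespan 100, standard deviation 10, 100 DCs.
lower, upper = sample_mean_bounds(100.0, 10.0, 100)
```

A larger population narrows the interval, since the half-width shrinks in proportion to $1 / \sqrt{N}$.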

Number of Processed Antigens
As demonstrated in [16], segmentation is effective for maintaining or even improving detection accuracy on large datasets. This may be because the number of processed antigens could determine whether an analysis of the current batch of processed information is required. Unlike input antigen instances, processed antigens are those presented by matured DCs. Investigating the relationship between the number of processed antigens and the input data is therefore essential for understanding the DCA, as well as for integrating segmentation with the algorithm. Additionally, a priori knowledge of the number of processed antigens, based on the input data, may facilitate choosing an appropriate segment size.

Moreover, two runtime variables are formulated, the number of matured DCs and the number of processed antigens. These show how the algorithm behaves within a given time interval based on the input data, without actually running the algorithm. As a result, the formulas of the two runtime variables can be used as indicators for adjusting the setup of the system according to real-time situations during detection. This is an important step towards understanding some of the potentially beneficial properties of the algorithm from a theoretical perspective, which could facilitate further investigation of the usefulness of these properties for anomaly detection problems.
This work gives application-independent insights into the algorithm, which can be used as guidelines for future development. One of the goals of future development of the DCA is to turn it into an automated and adaptive online detection system, and such a system has certain requirements to fulfil. Firstly, the system has to be computationally efficient. The analysis of the runtime complexity of the DCA shows that, even in the worst-case scenario, its runtime complexity is competitive with other popular anomaly detection algorithms. Secondly, the system should be able to adapt to real-time scenarios encountered during detection. This requires insight into how the algorithm behaves during runtime, which can be assessed from the two runtime variables. As a result, new components can be developed and integrated within the algorithm to adjust the system based on the assessment of these two runtime variables.
In terms of future work, the specifications can be further simplified and the algorithm can be presented using a functional programming approach [6], to reveal more algorithmic details. In addition, synthetic datasets generated from various probability density functions can be used to test the formulas defined in this paper. We can also investigate other properties of the algorithm, for example, the moving window effect created by each DC and the relationship between the size of the DC population and the detection performance. Different methods of generating the initial lifespans of the DC population should also be investigated, in addition to the relationship between the weight matrix and the detection performance.

Figure 1: A state-chart describing the three states of an individual DC.

Figure 2: An illustration of the signal transformation process of the DCA.

Figure 3: An illustration of different steps of the DCA, where the initialisation and analysis steps are performed at the population level and the rest of the steps (bounded within the two vertical lines) are performed at the individual DC level.

Table 1: Details of primitive operations of Algorithm 1, where N is the size of the DC population, n is the data size, a is the number of antigen instances, and b is the number of antigen types.