PAS3-HSID: a Dynamic Bio-Inspired Approach for Real-Time Hot Spot Identification in Data Streams

Hot spot identification is a relevant problem in a wide variety of areas such as health care, energy and transportation. A hot spot is defined as a region with a high likelihood of occurrence of a particular event. To identify hot spots, location data for those events is required, which is typically collected by telematics devices. These sensors are constantly gathering information, generating very large volumes of data. Current state-of-the-art solutions are capable of identifying hot spots from big static batches of data by means of variations of clustering or instance selection techniques that pre-process the original input data, providing the most relevant locations. However, these approaches do not address changes in hot spots over time. This paper presents a dynamic bio-inspired approach to detect hot spots in big data streams. This computational intelligence method is designed and applied to the transportation sector as a case study to identify incidents on the roads caused by heavy goods vehicles. We adapt an immune-based algorithm to account for the temporal aspect of hot spots, inspired by the idea of pheromones, and implement it using Apache Spark Streaming. Experimental results on real datasets with up to 4.5 million data points, provided by a telematics company, show that the algorithm is capable of quickly processing large streaming batches of data, as well as successfully adapting over time to detect hot spots. The outcome of this method is twofold: it reduces data storage requirements and demonstrates resilience to sudden changes in the input data (concept drift).

technique that is normally used to reduce the size of a dataset prior to it being used for data mining. This is achieved by removing data points that are redundant or noisy, leaving behind a smaller subset that is still representative of the original data, resulting in lower storage requirements and more efficient mining without compromising the accuracy of the results [20]. In the HSID context, the points remaining after instance selection are the hot spots.

An immune-inspired instance selection method, SeleSup [13,14,26], was successfully used in Figueredo et al. [15] to reveal hot spots. This method has an advantage over traditional clustering methods in that the number of 'cluster' centres is self-adaptive, and therefore no predefinition of the number of hot spots is required. However, the implementation of the algorithm shows reduced performance on datasets with millions of instances. The work done in Triguero et al. [31] aims to improve the performance of this algorithm by adapting it for implementation in Apache Spark. This implementation indicates the same hot spots for the datasets as the previous implementation, and also demonstrates an increase in performance for larger datasets, due to the distributed nature of the computation.

While the SeleSup method and its subsequent implementation in Spark perform well for large batch datasets, they are not suitable for HSID in a dynamic streaming environment. Our novel approach appropriately tackles the challenges of data streams, using instance selection as a technique. The next section discusses some of the existing instance selection methods for data streams in the literature.

2.2 Instance selection for data streams

Additional challenges become apparent when considering the application of instance selection to data streams, due to the dynamic nature of streams.
The instances stream and be able to update quickly as the distribution of the data changes over time (concept drift) [16]. As recently surveyed in [28], existing instance selection techniques do not cope well with the non-stationary characteristics of data streams.
Here, we discuss some current approaches and consider whether they could be applied to the hot spot problem.

Klinkenberg [22] compares multiple methods for handling concept drift by selecting the number of instances to be used. These include an adaptive time window, batch selection, and weighting instances with respect to their age. The experiments showed that batch selection, where batches of data that seem to include a large number of outliers are eliminated, performed best, closely followed by the adaptive time window. Weighting instances gave the lowest performance, although it was better than methods that did not adapt for concept drift. All of these methods use the as-

The instance-based learning on data streams (IBL-DS) algorithm proposed in [3] was developed to tackle the problem of concept drift for classification on data streams. This approach takes into account both the time that instances arrive, and the distance between instances to determine redundant or noisy points to remove. Older instances are also removed when the size of the case base will exceed a given

A different approach to instance selection for classification is to store only those instances that define the boundaries between classes, reducing the memory requirements of the model. One such example is presented in [35], where a data stream classification algorithm based on an artificial endocrine system is proposed. As the stream progresses, the maintained instances change, representing the evolving class boundaries. Although this mechanism works well for classification, it would not be suitable for hot spot identification, where there are no such boundaries to find.

In summary, existing instance selection techniques for data streams are not suitable for application to the hot spot identification problem. We require a method that, while adapting with respect to the most recently arrived instances, can also take into account previously established hot spots and incorporate them in the current set of hot spots in some way. It is also essential that the method does not rely on removing long-standing hot spots after a fixed time period, as these can be significant areas for HGV incidents. Instead, hot spots should be deleted based on an alternative measure of their importance.

This merges the values assigned to a key together, usually returning a single value per key. There are some cases for which Hadoop is not the most suitable choice, such as for iterative algorithms where data needs to be reused across computations, a task which it does not efficiently accomplish.

Other data processing frameworks exist that overcome these drawbacks. Apache Spark is one such example, introducing a distributed memory abstraction known as Resilient Distributed Datasets (RDDs) [33]. A Spark cluster consists of a driver node alongside multiple worker nodes, and RDDs allow data to be cached, or persisted, in the main memory of these nodes, resulting in more efficient data reuse. The Spark programming interface provides several MapReduce-like operations that can be applied to RDDs, such as map, reduce and filter. There are also methods for moving data between nodes. These include collect, which fetches all elements of an RDD back to the driver node, and broadcast, which sends a read-only variable to all nodes.
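To make the roles of these operations concrete, the following is a minimal plain-Python analogue, not Spark code. It assumes a dataset represented as a list of partitions (one list per worker), and the helper names rdd_map, rdd_filter, rdd_reduce and rdd_collect are hypothetical stand-ins introduced only for illustration.

```python
from functools import reduce

# Hypothetical stand-in for an RDD: a list of partitions, one list per worker.
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]


def rdd_map(parts, fn):
    """Apply fn to every element, partition by partition."""
    return [[fn(x) for x in part] for part in parts]


def rdd_filter(parts, pred):
    """Keep only the elements satisfying pred, partition by partition."""
    return [[x for x in part if pred(x)] for part in parts]


def rdd_reduce(parts, op):
    """Reduce each non-empty partition locally, then combine the partial results.

    Raises TypeError if every partition is empty.
    """
    partials = [reduce(op, part) for part in parts if part]
    return reduce(op, partials)


def rdd_collect(parts):
    """Bring all elements back to a single list, as on a driver node."""
    return [x for part in parts for x in part]


# Analogue of a broadcast variable: a read-only value used by every partition.
threshold = 4

squared = rdd_map(partitions, lambda x: x * x)
large = rdd_filter(squared, lambda x: x > threshold)
total = rdd_reduce(large, lambda a, b: a + b)
collected = rdd_collect(large)

print(total)      # 280
print(collected)  # [9, 16, 25, 36, 49, 64, 81]
```

The point of the sketch is the data movement pattern: map and filter work independently on each partition, reduce combines per-partition partial results, and only collect gathers everything in one place.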

Spark Streaming is an extension to Spark that treats data streams as a sequence of microbatches on which to perform computations [34]. It provides dis-

Here we present our immune-inspired, pheromone-based adaptive SeleSup algorithm (PAS3-HSID) for hot spot identification in data streams. This algorithm is based on the existing SeleSup HSID method [15], with the additional consideration of how to establish a set of hot spots that can change over time in response to incidents arriving. We assume that the stream is split into time intervals, and that incidents arriving within one interval are allocated to one batch that is processed at the end of that interval.
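The batching assumption can be illustrated with a short Python sketch. It assumes each incident is a (timestamp, latitude, longitude) tuple with the timestamp in seconds; this record layout and the function name split_into_batches are hypothetical and chosen only for the example.

```python
from collections import defaultdict


def split_into_batches(incidents, interval_seconds):
    """Group incidents into consecutive fixed-length time intervals.

    incidents: iterable of (timestamp_seconds, lat, lon) tuples (assumed layout).
    Returns a list of batches ordered by interval; intervals that received
    no incidents are omitted.
    """
    batches = defaultdict(list)
    for ts, lat, lon in incidents:
        interval_index = int(ts // interval_seconds)
        batches[interval_index].append((ts, lat, lon))
    return [batches[k] for k in sorted(batches)]


# Usage: five incidents spread over three daily intervals.
DAY = 86400
incidents = [
    (0, 52.95, -1.15),
    (10, 52.96, -1.16),
    (86400, 52.95, -1.15),
    (90000, 53.00, -1.20),
    (200000, 52.90, -1.10),
]
daily_batches = split_into_batches(incidents, DAY)
print([len(b) for b in daily_batches])  # [2, 2, 1]
```

In a streaming deployment the grouping would be done by the streaming framework itself; the sketch only shows which incidents end up in which batch.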

The algorithm is designed with three main requirements in mind:

- Identification of hot spots from streamed incident data, taking into account the temporal nature of this data.

- Reduction of the volume of data that needs to be stored at each interval of the stream. Instead of storing all incidents that arrive per interval, the hot spots identified must represent a reduction in this data, resulting in lower storage requirements.

- Suitability for parallelisation, to enable an implementation that can efficiently compute hot spots for large batches. This is required because there is the potential for data to be arriving in very large batches due to the quantity being generated through HGV telematics, which would result in poor performance from a sequential implementation.

We first explain the algorithm from a general perspective in Subsection 3.1, before providing specific details of our Spark-based implementation, designed to process large batches of incident data in parallel, in Subsection 3.2.

PAS3-HSID details
The PAS3-HSID algorithm works by maintaining a state of current hot spots between time intervals of a data stream. At each interval, the algorithm receives as input a batch of new incidents I to be reduced. Using these incidents, as well as the hot spots from the previous interval, an updated set of hot spots is produced. Figure 2 shows how the state is repeatedly updated and fed into PAS3-HSID to determine future hot spots.

The fitness values FV_1, FV_2, ..., FV_#HS are initialised to the number of incidents included within the respective hot spot when it is first discovered, similar to how fitness values are decided in [15]. The state is updated at each interval through a pheromone-based mechanism that alters the fitness values accordingly. Any hot spots with a fitness value below a given threshold are discarded, ensuring that the set of hot

The higher the pheromone value of an edge, the greater the probability of it being selected by ants at future iterations. Ants that generate good solutions will deposit larger amounts of pheromones than those that find worse solutions. In addition, an evaporation rate is set, so that the pheromone values will decrease over time.

We can apply the pheromone idea to the fitness values of hot spots. Fitness values must be increased at each interval in relation to the number of incidents added to each hot spot, similar to depositing pheromones on the edges of the graph in ACO.

Just as the edges that contribute to shorter paths receive more pheromone, hot spots that gain more incidents in a given interval will see their fitness value increase by a larger amount. We also require the fitness values to decrease over time, so that eventually hot spots will be removed after not gaining new incidents for some time.

This ensures that the current set of hot spots is truly representative of the present state of the roads, and is equivalent to the evaporation of pheromones.

The algorithm consists of three main stages, as shown in Algorithm 1 and Figure 3, that take place at each interval of the stream. Figure 4 illustrates the process of determining current hot spots from a set of incidents and pre-existing hot spots.

This value is initialised to zero at the start of every interval, and is incremented each time h reduces an incident in the current batch. It is then used later in Stage 3 when recalculating the fitness value of h. Note that it is not necessary to ensure that an incident is reduced by the closest hot spot, as we are not aiming to find a precise location for the hot spot centre; rather, we want to find the general areas of the road where there is a high frequency of incidents. Therefore, an incident is reduced by the first hot spot found that it is close enough to, with respect to the distance measure. This has the additional advantage of being generally faster than finding the closest hot spot, which is important in the context of processing big data streams.

to reduce the remainder of the incidents. This process is similar to that used at the start of the original method proposed in [15]

pheromone update formula in [10]:

resulting state will feed into the next stream interval to be used in the process of deciding the next set of hot spots.
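The three stages can be sketched in a few lines of Python. This is an illustrative sketch under several assumptions, not the authors' implementation: the distance measure is taken to be Euclidean, Stage 2 is replaced by a simple greedy grouping rather than the SeleSup procedure, and the Stage 3 update is assumed to follow the usual evaporate-then-deposit form, FV = (1 - decay) * FV + count, since the exact formula is not reproduced in this text. The names HotSpot, close_enough and process_interval are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class HotSpot:
    x: float
    y: float
    fitness: float
    new_count: int = 0  # incidents reduced by this hot spot in the current interval


def close_enough(h, incident, radius):
    """Assumed distance measure: Euclidean distance to the hot spot centre."""
    return math.hypot(h.x - incident[0], h.y - incident[1]) <= radius


def process_interval(state, incidents, radius, decay, del_th):
    """Process one batch; note that hot spots in `state` are updated in place."""
    # Stage 1: reduce incidents using existing hot spots (first match wins).
    for h in state:
        h.new_count = 0
    remaining = []
    for inc in incidents:
        for h in state:
            if close_enough(h, inc, radius):
                h.new_count += 1
                break
        else:
            remaining.append(inc)

    # Stage 2 (simplified stand-in, not SeleSup): greedily form new hot spots
    # whose fitness is initialised to the number of incidents they cover.
    new_spots = []
    for inc in remaining:
        for h in new_spots:
            if close_enough(h, inc, radius):
                h.fitness += 1
                break
        else:
            new_spots.append(HotSpot(inc[0], inc[1], fitness=1.0))

    # Stage 3 (assumed form): evaporate old fitness, then deposit new incidents.
    for h in state:
        h.fitness = (1.0 - decay) * h.fitness + h.new_count

    # Remove hot spots whose fitness is below the delete threshold.
    return [h for h in state + new_spots if h.fitness >= del_th]


# Usage: one existing hot spot, five incoming incidents.
batch = [(0.1, 0.0), (0.2, 0.1), (10.0, 10.0), (10.1, 10.0), (50.0, 50.0)]
result = process_interval([HotSpot(0.0, 0.0, 5.0)], batch,
                          radius=1.0, decay=0.3, del_th=1.9)
print([(h.x, h.y, round(h.fitness, 2)) for h in result])
```

In this example the existing hot spot absorbs two incidents, a new hot spot forms around the pair near (10, 10), and the isolated incident at (50, 50) is discarded because its fitness of 1 is below the delete threshold of 1.9.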

Further filtering on the hot spot state can then be performed, to produce a subset containing those hot spots with a fitness value greater than a given hot spot

spots is stored as an RDD and so is distributed across nodes. This is also true of the RDD containing the incidents for the present interval, which is created by reading from a streaming source. Here, we simply load newly arrived incidents from a text file, but any Spark input source could be used.

In order for all hot spots to be available at each node, they must first be collected

incidents. This information is used in Stage 3 to update the state.

We present two different implementations of Stage 2, a decision also taken in [31]. The first is a sequential version, that makes the assumption that the majority of incidents are reduced in Stage 1. This is tested later in the experimental study to establish if it is a valid assumption to make. Therefore, the set left over to be reduced is sufficiently small to collect back to the driver and operate on sequentially. Each

hot spot is therefore simply the number of incidents that it covers.

The final step for updating the state is to remove those hot spots with a fitness value less than a given deletion threshold, achieved using a filter operation.

cluster with larger batch sizes.

When characterising the behaviour of the algorithm, we discuss both the run-

We also aim to show the advantages of our pheromone-based algorithm in comparison to other HSID approaches. Due to the lack of methods available in the literature for HSID on big data streams, we are limited in the comparisons we can make. We therefore focus on the differences between PAS3-HSID, with its pheromone mechanism for determining hot spots and their relevance, and the original SeleSup HSID algorithm, without such a mechanism. We use two alternative ways of applying SeleSup HSID to the data for the comparison, namely:

- Applying SeleSup HSID to each dataset as a whole, allowing us to compare against a HSID method that does not account for hot spots changing over time. We refer to this approach as SeleSup-HSID-D.

- Applying SeleSup HSID to each streaming interval individually. This enables comparison with a method that should identify changes over time, but without a way of considering previous hot spots when establishing the current hot spot. We refer to this approach as SeleSup-HSID-I.

The parameters chosen for these experiments are displayed in Table 3. Mileage

The effect of setting different decay rates (0.1, 0.3 and 0.5) when the algorithm is applied to daily batches is shown in Figure 6, alongside the distributions of the original incidents. We can observe that for datasets with a more regular pattern of incidents, the number of hot spots that are identified increases throughout each week before decreasing over the weekends when there are naturally fewer incidents.

The method is also able to adapt quickly to the sudden changes in the irregular distribution of the speeding dataset. A decay rate of 0.1 seems to be suitable for the smaller datasets; however, for the contextual speeding data it results in a general increase in hot spots over time, suggesting that old hot spots are not forgotten quickly enough. A rate of 0.3 is able to handle a short period of time with very few incidents, such as the few days in early May (Figure 6d) where there was a sudden decrease in batch size for contextual speed; 0.3 resulted in a larger proportion of previous hot spots being retained over these days than 0.5, which lost the majority of all hot spots that were stored.
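A rough sense of why the decay rate matters can be obtained from a small calculation. The sketch below assumes that a hot spot receiving no new incidents has its fitness multiplied by (1 - decay) at every interval and is deleted once the fitness falls below the delete threshold; this multiplicative form is an assumption based on pheromone evaporation, and the function name intervals_until_deleted is hypothetical.

```python
def intervals_until_deleted(fitness, decay, del_th):
    """Count idle intervals until fitness drops below del_th.

    Assumes fitness is multiplied by (1 - decay) at each interval in which
    the hot spot gains no incidents (assumed evaporation rule).
    """
    if not 0.0 < decay <= 1.0:
        raise ValueError("decay must be in (0, 1]")
    if del_th <= 0.0:
        raise ValueError("del_th must be positive")
    n = 0
    while fitness >= del_th:
        fitness *= 1.0 - decay
        n += 1
    return n


# Usage: lifetime of a hot spot with fitness 10 and a delete threshold of 1.9.
for rate in (0.1, 0.3, 0.5):
    print(rate, intervals_until_deleted(10.0, rate, 1.9))
```

Under these assumptions, a hot spot with fitness 10 survives 16 idle intervals at a decay rate of 0.1, 5 at 0.3 and 3 at 0.5, which is in line with the observation that 0.1 forgets old hot spots slowly while 0.5 loses most of them during a few quiet days.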

The algorithm relies on two thresholds relating to the fitness of hot spots. The

The delete threshold directly impacts the hot spots that are maintained in the state between streaming batches. Figure 8 shows the effect of two delete thresholds (0.9 and 1.9) on both the number of hot spots in the state, and the number with a fitness greater than a hot spot threshold of 5. It can be seen that increasing the value of delTh to 1.9 considerably reduces those in the state, while having a relatively small impact on the number with FV > hsTh; this behaviour is consistent across all datasets. The main difference between these threshold values is that 0.9 will keep isolated incidents that could not be allocated to a hot spot within the interval in which they arrive, thus giving them a chance to become a hot spot later.

Alternatively, using 1.9 ensures that any isolated incidents are removed within the same interval that they arrive. From this, we can conclude that the majority of these isolated incidents do not subsequently become hot spots. We suggest a delete threshold of 1.9 so that such incidents are removed immediately, resulting in a smaller state being maintained between batches.

processed in a short time. In some cases, contextual speeding batches are processed quicker than harsh braking, despite having more than three times the number of incidents per batch. This is due to different mileages used to define hot spots for these incident types, a behaviour also observed in [31].

From the results presented here, we can conclude:

- When run on a single node, our Spark-based implementation can efficiently process batches containing tens of thousands of incidents.

The average number of hot spots found per interval is shown in Table 6. Despite

we do observe the beginning of a plateau, suggesting that using a greater number of partitions would not give much performance gain, at least for a dataset of this size. We can conclude that the Spark-based implementation of our proposed algorithm is capable of efficiently handling batches containing hundreds of thousands of incidents, and we advise employing the fully parallel implementation in such scenarios.

In this work we have presented an approach for vehicle hot spot identification in data streams, adapting an existing instance selection method, SeleSup, with a pheromone-based mechanism that ensures the hot spots found are reflective of the recent incident