Fast and Scalable Approaches to Accelerate the Fuzzy k-Nearest Neighbors Classifier for Big Data

One of the best-known and most effective methods in supervised classification is the k-nearest neighbors algorithm (kNN). Several approaches have been proposed to improve its accuracy, among which fuzzy approaches prove to be some of the most successful, most notably the classical fuzzy k-nearest neighbors algorithm (FkNN). However, these traditional algorithms fail to tackle the large amounts of data that are available today. There are multiple alternatives to enable kNN classification for big datasets, among which the approximate version of kNN known as the hybrid spill tree stands out. Nevertheless, the existing proposals of FkNN for big data problems are not fully scalable, because a high computational load is required to obtain the same behavior as the original FkNN algorithm. This article proposes the global approximate hybrid spill tree FkNN and the local hybrid spill tree FkNN, two approximate approaches that speed up runtime without losing quality in the classification process. The experimentation compares various FkNN approaches for big data on datasets of up to 11 million instances. The results show an improvement in runtime and accuracy over existing algorithms in the literature.


I. INTRODUCTION
The fuzzy k-nearest neighbor algorithm (FkNN) [1] was developed with the aim of alleviating the main weakness of the k-nearest neighbor algorithm (kNN) [2]. This weakness lies in considering all neighbors as equally important in the classification, which makes the kNN algorithm more vulnerable to noise at the class boundaries and degrades the classification.
In the experimental analysis by Derrac et al. [3], the classic FkNN algorithm stands out as one of the most effective approaches. FkNN is composed of two stages: class membership degree computation and classification. The first stage replaces the class label with a vector of membership degrees to each class, computed from the closest training instances. To find the nearest instances, it uses a similarity function, usually a distance function (Euclidean or Manhattan). The second stage computes the kNN using the membership degree information. Thus, it is possible to detect class borders with greater precision, being less affected by noise and improving on the kNN in most classification problems, with applications in medicine [4], spacecraft [5], and many other fields. Nowadays, FkNN and kNN are used in many areas of data mining. They serve as data preprocessing techniques [6] to deal with imperfect data [7] and with uncertainty in the classification process by means of aggregation operators [8]. Studies to improve the FkNN algorithm and its applications continue to appear in many areas, such as convergence [9] and runtime improvement [10]. There are some recent proposals that enhance the quality of the classic FkNN classifier: two based on evolutionary algorithms [11], [12] and one based on a parameter-independent fuzzy weighted kNN [13]. Nevertheless, these solutions tend to increase the computational complexity, making the algorithm less scalable for application to big data problems. For this reason, we focus on the classical FkNN algorithm.
In the big data environment [14], the kNN and FkNN algorithms have been key to solving different machine learning problems, such as fuzzy-rough-based NN classification [15], time-series forecasting [16], or data preprocessing to obtain quality data [17]. In this article, we focus on standard classification. When handling large datasets, the kNN and FkNN classifiers run into problems of runtime and memory consumption. There is an exact proposal of the kNN algorithm to address big data problems, called k-nearest neighbor-iterative Spark (kNN-IS) [18]. In addition to this exact version, there are also approximate variants that drastically reduce execution times: the metric-tree [19] and the spill-tree. Liu et al. [20] studied the metric-tree and spill-tree models and proposed the hybrid spill-tree model [21] (HS), a hybridization of the two with the aim of improving the runtime in big data.
Regarding the fuzzy approach, in [22] we investigated the feasibility of an exact approach to apply FkNN to big data, called global exact fuzzy k-nearest neighbors (GE-FkNN). Even though it is capable of scaling up to large datasets, the runtime of its first stage is considerably high, causing a bottleneck. Subsequently, Maillo et al. [23] presented a preliminary study on the use of approximate kNN search to reduce the execution time and alleviate this bottleneck.
The objective of this article is to design and develop an FkNN model capable of handling large datasets accurately and quickly. To do this, we use the Spark framework and take HS as the base algorithm, due to a balance between scalability and accuracy that improves on previous kNN proposals in the literature. The proposed algorithm is composed of the same two stages as the classical FkNN: membership degree and classification. The main difference lies in the first stage, which tackles the bottleneck with two different approaches.
1) Local hybrid spill tree FkNN (LHS-FkNN): The local approach divides the dataset into different parts and calculates the class membership degree within each partition, without considering the other partitions.
2) Global approximate hybrid spill tree FkNN (GAHS-FkNN): The global approach is based on the HS model. It builds a tree with the instances of the training set and distributes it among all the computation nodes, considering all the instances for the calculation of the class membership degree.
The second stage classifies the unseen samples of the test set using the class membership degrees calculated in the first stage. The classification stage is the same for both models, following an HS-based approach with a workflow similar to that of the first stage of GAHS-FkNN. The novelty of the proposal is the use of approximate kNN searches, in both local and global versions, achieving an accuracy and scalability that allow execution on large datasets through the MapReduce [24] paradigm and the Spark framework [25].
In order to study the performance of this model, experiments have been carried out on eight datasets with up to 11 million instances and 631 features. The experimental study analyzes the accuracy and runtime, comparing them with existing algorithms from the literature.
In addition, we have developed a software package with FkNN algorithms for big data, making use of in-memory native operations and distributed computing from Apache Spark. The developed algorithms can be found in the repository.¹

¹[Online]. Available: https://spark-packages.org/package/JMailloH/HS_FkNN

The rest of this article is structured as follows. Section II introduces the state of the art in the FkNN and hybrid spill-tree algorithms. Next, Section III details the proposals of the FkNN algorithm. Section IV describes the experimental study, and Section V includes multiple analyses of the results. Finally, Section VI concludes this article.

II. PRELIMINARIES
This section provides background knowledge of the FkNN algorithm (Section II-A), the hybrid spill-tree (Section II-B), and the big data technologies used (Section II-C).

A. Fuzzy k-Nearest Neighbors and Its Computational Complexity
FkNN needs a precomputation stage on the training set, which calculates the class membership degree. Afterward, FkNN calculates the nearest neighbors of each unseen instance and predicts the class with the highest membership degree. A formal notation for the FkNN algorithm is as follows.
Let $TR$ be a training set and $TS$ a test set, composed of $n$ and $t$ instances, respectively. Each instance $x_i$ is a vector $(x_{i1}, x_{i2}, x_{i3}, \ldots, x_{ij})$, where $x_{ij}$ is the value of the $j$th feature of the $i$th instance. For each instance of $TR$, its class $\omega$ is known; for the instances of $TS$, the class is unknown.
The FkNN has two stages: class membership degree computation and classification. The first stage calculates the kNN of each instance of $TR$ under a leave-one-out scheme, selecting the $k$ instances at the shortest distance. Then, it calculates the class membership degree according to (1), following the classical definition of Keller et al. [1]:

\[
u_j(x) =
\begin{cases}
0.51 + 0.49\,(n_j/k), & j = \omega\\
0.49\,(n_j/k), & \text{otherwise}
\end{cases}
\tag{1}
\]

where $\omega$ is the crisp class of $x$ and $n_j$ is the number of its $k$ nearest neighbors that belong to class $j$. The result of the first stage is $TR$ with the class label $\omega$ replaced by a membership vector to each class $(\omega_1, \omega_2, \ldots, \omega_l)$, where $l$ is the number of classes. This new set is called the fuzzy training set, FTR.

For each instance of $TS$, the classification stage calculates its kNN in the FTR. Thus, it obtains the membership vector of each neighbor and aggregates these vectors by applying

\[
u_j(x) = \frac{\sum_{i=1}^{k} u_{ij}\,\bigl(1/\lVert x - x_i\rVert^{2/(m-1)}\bigr)}{\sum_{i=1}^{k} \bigl(1/\lVert x - x_i\rVert^{2/(m-1)}\bigr)}
\tag{2}
\]

where $u_{ij}$ is the membership of neighbor $x_i$ to class $j$ and $m$ is the fuzzy parameter (usually $m = 2$). Finally, the class with the highest membership $u_j(x)$ is predicted.
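As a quick numerical illustration of (1) (our example, not taken from the article): let $k = 3$ and let $x$ be a training instance of class $\omega = 1$ whose three nearest neighbors belong to classes 1, 1, and 2. Then

\[
u_1(x) = 0.51 + 0.49 \cdot \tfrac{2}{3} \approx 0.84, \qquad
u_2(x) = 0.49 \cdot \tfrac{1}{3} \approx 0.16
\]

so the membership vector $(0.84, 0.16)$ keeps most of the weight on the crisp class while still encoding the proximity of class 2 at the boundary.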
The first stage of the FkNN, which is an extra stage compared with kNN, increases the computational complexity and raises two issues when dealing with big data problems.
1) Runtime: The complexity of computing the nearest neighbor of an instance is $O(n \cdot c)$, where $n$ is the number of instances of $TR$ and $c$ is the number of features. For more than one neighbor, it increases to $O(n \cdot \log(n))$. In addition, FkNN has an extra stage of computation for calculating the class membership degree (see the sketch after this list).
2) Memory consumption: To speed up the calculation, the $TR$ and $TS$ sets must be stored in main memory. However, when both sets are large, the available main memory is easily exceeded.
To alleviate these difficulties, we designed two approximate models based on the hybrid spill-tree, developed under the big data technologies of MapReduce and Apache Spark.
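To make the $O(n \cdot c)$ cost tangible, the following minimal Scala sketch (our illustration, not part of the article's package) scans the whole training set for every query, which is exactly the behavior the approximate structures below avoid:

```scala
// Brute-force kNN: every query scans all n training instances, each scan
// paying O(c) for the distance, hence O(n * c) per query.
object BruteForceKNN {
  type Instance = (Array[Double], Int) // (features, class label)

  def euclidean(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Returns the k training instances closest to the query.
  def kNN(tr: Seq[Instance], query: Array[Double], k: Int): Seq[Instance] =
    tr.sortBy { case (features, _) => euclidean(features, query) }.take(k)

  def main(args: Array[String]): Unit = {
    val tr = Seq((Array(0.0, 0.0), 0), (Array(1.0, 1.0), 1), (Array(0.2, 0.1), 0))
    println(kNN(tr, Array(0.1, 0.1), k = 2).map(_._2).mkString(",")) // prints "0,0"
  }
}
```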

B. Hybrid Spill-Tree: Approximate kNN Search
In the search for the nearest neighbor, two approaches can be followed: exact and approximate. The exact approach ensures that the instance identified as the closest really is the closest. To do this, it needs to compute the distance to all the samples in TR and select the one at the lowest distance. In the big data environment, reducing runtime and increasing scalability are very important factors, so the approximate approach is more relevant. It can be tackled from different perspectives. When a dataset has a high number of features, dimensionality reduction [26] is a way to speed up the distance computation. The locality-sensitive hashing algorithm [27] is a well-recognized method for reducing dimensionality through hash functions, generating collisions between similar instances. It requires a previous computation stage to build the hash functions, which reduces the scalability of the algorithm. When dealing with not so many features but a large number of instances, tree-based proposals achieve the best performance. Liu et al. [20] studied tree-based approaches and proposed the hybrid spill-tree algorithm (HS) [21] as the most promising algorithm to accelerate the kNN search.
The HS algorithm combines the metric-tree (MT), with its precise search, and the spill-tree (SP), with its fast search. The MT data structure organizes the dataset in a spatial hierarchy, performing a search that ensures the exact nearest instance is found. The MT is a binary tree whose root includes all of the samples and whose children each represent a subset of the elements. Fig. 1(a) illustrates how the elements are divided between the two children, selecting as the pivot of each child the furthest possible instance. The mean distance between the children defines the separation of these nodes. The tree has a depth of $O(\log(n))$. To search for the nearest instance, it keeps the candidate with the shortest distance, $C$, and its distance, $d$. If the distance to a branch is greater than $d$, the branch is pruned and the search continues. Once there is no branch in the tree at a distance of less than $d$, the search finishes and $C$ and $d$ are returned. Note that a backtracking operation over the structure ensures that $C$ is the nearest, returning exactly the nearest instance.
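The pruning rule just described can be captured in a few lines. The sketch below is our simplified reading of the MT search (pivot and radius summarize the ball enclosing each subtree), not the spark-packages implementation:

```scala
// Metric-tree search: keep the best candidate C and its distance d, and
// prune any branch whose ball cannot contain a point closer than d.
sealed trait MTNode
case class Leaf(points: Seq[Array[Double]]) extends MTNode
case class Inner(pivot: Array[Double], radius: Double,
                 left: MTNode, right: MTNode) extends MTNode

object MTSearch {
  def dist(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // best = (candidate C, distance d); start with (None, Double.PositiveInfinity).
  def nearest(node: MTNode, q: Array[Double],
              best: (Option[Array[Double]], Double)): (Option[Array[Double]], Double) =
    node match {
      case Leaf(points) =>
        points.foldLeft(best) { case ((c, d), p) =>
          val dp = dist(p, q)
          if (dp < d) (Some(p), dp) else (c, d)
        }
      case Inner(pivot, radius, left, right) =>
        if (dist(pivot, q) - radius >= best._2) best // prune this branch
        else nearest(right, q, nearest(left, q, best)) // backtracking: visit both
    }
}
```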
The SP data structure is a variation of the MT that performs an approximate search to speed up execution. The main difference with respect to the MT is that instances are shared between child nodes. Fig. 1(b) shows how the data are divided following the same procedure as the MT, but allowing a set of duplicate instances in the child nodes. The overlapping area depends on the τ parameter. When τ is 0, the result is an MT with no shared instances. If τ is too high, the depth of the tree can grow without bound because the overlap is too large. An SP does not backtrack to ensure that the nearest instance has been found, which reduces execution times. Moreover, thanks to the overlapping area, it obtains representative instances of the problem. A common characteristic of the MT and SP is that they perform a depth-first search whose cost depends on the number of features; thus, when the number of features increases, the runtime is higher.
The HS is proposed with the objective of achieving a balance between accuracy and runtime by merging the MT and SP models. To build an HS, it starts by building an SP; if the number of instances in the overlapping area is below the balance threshold (BT), the node remains an SP. If the repeated instances exceed BT, the node is rebuilt as an MT (see the sketch below). Fig. 2 shows an example of an HS, differentiating the MT nodes from the SP nodes. It is important to highlight the starting point for the development of this contribution, which is available in the library developed by the spark-packages community.²
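A minimal sketch of this construction rule (ours; `spillSplit` is a hypothetical helper that splits a set of instance indices into two children sharing the indices that fall in an overlap band of width τ):

```scala
// Hybrid spill-tree node construction rule: try an SP split first, and fall
// back to an MT node when the overlapping fraction exceeds the balance
// threshold BT (0.70 in the article).
object HSBuild {
  def buildNode(indices: Seq[Int], tau: Double, bt: Double,
                spillSplit: (Seq[Int], Double) => (Seq[Int], Seq[Int])): String = {
    val (left, right) = spillSplit(indices, tau) // hypothetical splitting helper
    val overlap = left.toSet.intersect(right.toSet).size.toDouble / indices.size
    if (overlap < bt) "SP node: approximate search without backtracking"
    else "MT node: rebuilt without overlap, exact search with backtracking"
  }
}
```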

C. Apache Spark and MapReduce Paradigm
The MapReduce programming paradigm [24] is used in the development of the algorithms proposed in this article. MapReduce aims to process large datasets by distributing data storage and execution across a cluster of computers.
The MapReduce implementation selected is Apache Spark [25], [28]. Spark parallelizes the computation transparently through a distributed data structure called the resilient distributed dataset (RDD). RDDs allow data structures to persist in main memory and be reused. Additionally, Spark was developed to cooperate with the Hadoop Distributed File System (HDFS) [29], [30]. With this configuration, Spark provides fault tolerance, data splitting, and job communication.
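As a minimal illustration of these mechanisms (ours; the HDFS path is hypothetical), the following snippet persists a dataset as an RDD and processes each partition independently, the same pattern used by the proposals in Section III:

```scala
import org.apache.spark.sql.SparkSession

object SparkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("HS-FkNN-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Read the training set from HDFS and keep it in main memory for reuse.
    val tr = sc.textFile("hdfs:///data/TR.csv") // hypothetical path
               .map(_.split(",").map(_.toDouble))
               .cache()

    // mapPartitions hands each worker its local block of instances at once.
    val perPartition = tr.mapPartitions(iter => Iterator(iter.size))
    println(s"instances per partition: ${perPartition.collect().mkString(", ")}")
    spark.stop()
  }
}
```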
MLlib [31] is the official machine learning library of Spark. It incorporates a large number of statistical techniques and algorithms in areas such as regression, classification, or clustering.

III. FAST AND SCALABLE FKNN CLASSIFIERS FOR BIG DATA
This section presents two approximate and distributed proposals for the FkNN algorithm, based on the HS method and implemented in Spark, to address big data problems. Two different approaches are proposed for the class membership degree stage: local and global. The local approach applies a divide-and-conquer scheme, where each partition does not know the instances of the other partitions. The global approach has knowledge of all the instances of TR and makes use of the HS algorithm. Section III-A describes the local approach, which performs the computation on each partition independently, without information about the other partitions of the dataset. Section III-B presents the global approach based on an HS, considering the totality of the data for the calculation of the class membership degree. Section III-C defines the classification stage, which is the same for both models and is based on the HS algorithm.

[Algorithm 1: class membership degree computation of LHS-FkNN (pseudocode).]

A. LHS-FkNN: Local Hybrid Spill Tree FkNN
The proposed local stage together with the classification stage is called LHS-FkNN. Fig. 3 shows the workflow of the class membership stage. To alleviate the bottleneck, the data are partitioned and distributed among the computation nodes. Subsequently, the membership degree is calculated independently within each partition. Finally, the results of each partition are joined, obtaining the FTR as the output.
Algorithm 1 shows the steps and Spark operations for calculating the class membership degree. It begins by reading the TR from HDFS and dividing it into #Maps parts. Subsequently, a Spark mapPartitions operation is used to calculate the class membership degree of each training set split (TRS_i) in a distributed manner. The membership calculation is represented in lines 6-12: for each instance y of each partition TRS_i, the kNN is computed and the class membership degree is obtained by applying (1). Once the membership of each partition has been obtained, the results are joined to form the FTR (line 3), which will be the input of the classification stage.
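The following Scala sketch is our simplified rendering of Algorithm 1 (not the published package): each partition computes the membership degrees of its own instances with a leave-one-out kNN restricted to that partition, and the 0.51/0.49 constants follow (1).

```scala
import org.apache.spark.rdd.RDD

object LocalMembership {
  type Inst = (Array[Double], Int) // (features, crisp class)

  def dist(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Output: (features, membership degree vector), i.e., the FTR of Fig. 3.
  def membershipStage(tr: RDD[Inst], k: Int,
                      numClasses: Int): RDD[(Array[Double], Array[Double])] =
    tr.mapPartitions { iter =>
      val part = iter.toArray // the local split TRS_i
      part.iterator.map { case (x, cls) =>
        // Leave-one-out kNN computed inside the partition only.
        val neigh = part.filter(_._1 ne x).sortBy(p => dist(p._1, x)).take(k)
        val m = Array.fill(numClasses)(0.0)
        neigh.foreach { case (_, c) => m(c) += 0.49 / k } // 0.49 * (n_j / k)
        m(cls) += 0.51                                    // own-class term of (1)
        (x, m)
      }
    }
}
```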

B. GAHS-FkNN: Global Approximate Hybrid Spill Tree FkNN
The global stage together with the classification stage is called GAHS-FkNN. Fig. 4 specifies the workflow of the membership stage, which follows an approximate scheme based on the HS. This approach aims to alleviate the bottleneck while considering the data globally to obtain quality membership degrees; thus, it prioritizes quality over scalability. As in the local approach, the output of this stage is the FTR.

[Algorithm 2: membership degree stage of GAHS-FkNN (pseudocode).]
Algorithm 2 shows Spark's instructions for the membership degree stage with the global approach. Lines 1-5 correspond to the model creation stage based on the HS, and the remaining lines correspond to the kNN and membership computation.
The model fit phase begins by reading the TR from HDFS. First, it takes a random subsample to construct an MT, as described in Section II-B (the authors recommend 0.2%). This MT is named the top tree (TT) and is used to estimate the value of the τ parameter and to partition the entire TR. The estimate of τ is the average distance between all the instances; to speed up this calculation, it is computed over the TT instances only.
The next step is to split the TR. To do this, the instances are distributed in the space taking the TT as reference. The value of τ defines the overlapping area. Construction starts by building an SP and checking whether the number of instances in the overlapping area is below 70%; otherwise, the node is rebuilt as an MT. During the search, the SP branches are faster because they do not backtrack in the tree, whereas those built as MT backtrack to ensure the nearest instance is found. The model construction stage ends by distributing the TT and the tree associated with the TR.
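A minimal sketch of the τ estimation, under our reading of the description above: the average pairwise distance computed over the TT sample only, so the quadratic cost depends on the sample size rather than on |TR|.

```scala
object TauEstimation {
  def dist(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Average distance over all pairs of the (small) top-tree sample.
  def estimateTau(ttSample: Array[Array[Double]]): Double = {
    val pairs = for {
      i <- ttSample.indices
      j <- (i + 1) until ttSample.length
    } yield dist(ttSample(i), ttSample(j))
    pairs.sum / pairs.size
  }
}
```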
The membership phase is shown in lines 6-10. For each TR instance, the kNN is calculated following the generated model. Algorithm 3 describes how to perform the kNN search with native Spark operations. Using a flatMap operation, the indexes of the nearest instances of TR are computed. The distances to the right and left nodes are calculated, and the search for the nearest instance continues through the node at the shorter distance. When a leaf node is reached, the index of the selected instance is returned.
With the neighbors, the class membership degree vector is calculated by (1) (line 8). The result of this phase is the FTR, which becomes the input of the classification stage.

C. Classification Stage
The proposed classification stage receives as input the FTR calculated in the previous stage, the TS, and the value of k. The TS is usually significantly smaller than the FTR; for this reason, the classification stage has a lower computational cost than the membership stage. In order to obtain better classification results, a global approach is followed, which considers all the instances for the decision making. However, it is approximate in nature, in order to speed up the runtime and obtain higher scalability. Fig. 5 shows the workflow of the HS-based classification stage. It has two distinct phases: model fit and classification. In the first, the tree is built and the instances are divided between the computation nodes. In the second, the kNN of the FTR are searched for, and the predicted class is returned as output according to the membership degree vectors.

[Algorithm 4: classification stage (pseudocode).]
Algorithm 4 shows the native Spark instructions for the classification stage. Lines 1-5 correspond to the model fit phase, and the remaining lines to the classification phase. Due to the similarity of its data flow with that of the HS-based class membership degree stage, only the differences are detailed.
The first difference is in the input datasets: in this case, the FTR and TS are used. The model fit phase is not affected, since the input variables are not modified and the distances between instances are maintained. Thus, the model is built with the same methodology, only replacing the class label of the TR with the membership degree vector of the FTR.
The FkNN calculation is the same as in the HS-based membership calculation stage, described in Algorithm 3. In contrast to the membership calculation, the kNN search returns the membership degree vector instead of the class label (line 7). Line 8 calculates the predicted class by applying (2), obtaining the predicted class of each TS instance as the final result.
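As a sketch of this last step (ours, assuming the usual fuzzifier m = 2, so the weight in (2) is the inverse squared distance), the aggregation and the final argmax can be written as follows:

```scala
object FuzzyVote {
  // neigh: (distance to the query, membership vector of that neighbour).
  // Returns the index of the class with the highest aggregated membership.
  def classify(neigh: Seq[(Double, Array[Double])]): Int = {
    val eps = 1e-12 // guard against zero distances
    val weights = neigh.map { case (d, _) => 1.0 / (d * d + eps) } // m = 2 in (2)
    val numClasses = neigh.head._2.length
    val agg = Array.tabulate(numClasses) { j =>
      neigh.zip(weights).map { case ((_, m), w) => m(j) * w }.sum / weights.sum
    }
    agg.indexOf(agg.max) // predicted class
  }
}
```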

IV. EXPERIMENTAL SETUP
This section presents the issues involved in the experimental framework. It presents the performance measures used (Section IV-A), the details of the datasets (Section IV-B), and the algorithms used with their respective parameters (Section IV-C). Finally, the hardware and software used for the experimentation phase are specified (Section IV-D).

A. Performance Measures
In this article, the quality and the efficiency of the models will be evaluated using the following metrics.
1) Accuracy: The most widely used metric in the literature [32], [33] is applied to evaluate the quality of the classifiers. It counts the number of correct classifications relative to the total number of instances. The experimentation is performed on classification problems with an appropriate class balance, where accuracy is a representative measure (a minimal computation sketch closes this section).
2) Runtime: Time consumed in computation, including the reads and network communications performed by Spark. In addition, the runtimes are reported separately for each of the two stages that compose the fuzzy algorithms studied, in order to analyze the time each of them requires.
To validate the results of the experiments, we have used a pairwise nonparametric statistical test based on the Dirichlet process, the Bayesian sign test [34]. The Bayesian sign test computes a distribution of the differences between the results of the two algorithms under comparison. A triangle is then constructed that determines, depending on where the majority of the distribution lies, whether there is a draw (rope region), a victory of the first algorithm (right region), or a victory of the second algorithm (left region). The statistical test and the graphs shown in the experiments have been generated with the R package rNPBST [35].
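Finally, the computation sketch announced above (ours): accuracy over a Spark RDD of (true label, predicted label) pairs.

```scala
import org.apache.spark.rdd.RDD

object Metrics {
  // Fraction of correct predictions over the whole test set.
  def accuracy(results: RDD[(Int, Int)]): Double = {
    val hits = results.filter { case (y, yPred) => y == yPred }.count()
    hits.toDouble / results.count()
  }
}
```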

B. Datasets
For the experimental study, we have selected eight datasets with a large number of instances. The ECBDL14 dataset comes from the competition described in [36]. Although it has an imbalance ratio greater than 45, we selected it to study the effect of a large number of features; since this article does not address imbalanced classification, it has been subsampled to an imbalance ratio of two. The Epsilon dataset has been taken from the LIBSVM repository [37]; it was artificially created for the Pascal Large Scale Learning Challenge [38]. This dataset was selected to analyze how a high number of features affects the proposed algorithms. The other six datasets have been extracted from the UCI repository [39]. Table I presents the number of instances, features, and classes (#ω). A fivefold cross-validation scheme is followed, with each fold composed of 80% training instances and the remaining 20% test instances.
In the MapReduce scheme, the number of instances processed in each worker depends on the number of instances of the dataset and the number of map tasks used in the execution. Table II lists the number of instances of TR and TS according to the number of map tasks.
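As an illustrative back-of-the-envelope calculation (ours, not the exact figures of Table II): for the largest dataset, with 11 million instances and the fivefold scheme (80% training), 128 map tasks leave each map with roughly

\[
\frac{0.8 \times 11{,}000{,}000}{128} \approx 68{,}750 \text{ TR instances.}
\]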

C. Algorithms and Parameters
The experimentation compares the proposed algorithms with other FkNN proposals and their crisp analogs. The algorithms used and their acronyms are presented in the following.
1) Global exact FkNN (GE-FkNN) [22]: Exact model of the FkNN algorithm to tackle big data problems, obtaining the same results as the original FkNN. Both of its stages are global and exact.
2) Local FkNN (L-FkNN): Proposal of the FkNN algorithm developed for this contribution. The first stage, responsible for calculating the class membership degree, is described in Section III-A. The second stage is global and exact, identical to that used by the GE-FkNN algorithm.
3) k-nearest neighbor-iterative Spark (kNN-IS) [18]: Exact proposal of the crisp kNN to tackle big data problems, obtaining the same results as the original kNN.
4) Hybrid spill-tree kNN (HS-kNN) [21]: Approximate proposal of the crisp kNN for big data. Although approximate, it considers all instances in the search.
The best-known FkNN parameter is the number of neighbors (k) considered in the classification; the experiments explore values of k equal to 3, 5, and 7. Models based on the HS algorithm need two parameters to build the model and speed up the search for the nearest instances. The first is the percentage of instances used to form the TT, which is then used to divide and distribute the data. The second is the BT, the admissible percentage of repeated instances between nodes of the tree, used to decide whether an SP or an MT is built. The study uses the optimal values recommended by the authors of HS: TT equal to 0.2% and BT equal to 70%.

D. Hardware and Software Used
All experiments have been performed on a cluster composed of 15 nodes: a master node and 14 computation nodes. All the nodes have the same configuration: 1) Processor: Intel Xeon CPU E5-2620 (2 GHz) ×2.

V. ANALYSIS OF RESULTS
In this section, we study the results compiled from the different experimental studies. Specifically, we analyze the accuracy of the proposals (Section V-A), their scalability (Section V-B), the behavior for higher values of k (Section V-C), and a comparison with the crisp kNN algorithms (Section V-D).

A. Accuracy Study
The accuracy study starts from the results in Table III, which compares the accuracy of the algorithms in relation to the number of neighbors (k), with the number of map operations equal to 128 for all datasets. The best result for each dataset and the best average result are highlighted in bold. Values that could not be obtained due to scalability problems are represented with the symbol "-" and counted as zero accuracy in the calculation of the mean. Fig. 6 presents the probability distribution of the differences between the GAHS-FkNN and LHS-FkNN algorithms obtained with the Bayesian sign test. Analyzing the table and figures presented, we can observe the following.
1) The GE-FkNN algorithm finds its scalability limit in its first stage, being unable to run for the ECBDL14-S and Higgs datasets.

B. Scalability Study
The scalability study starts from the results in Fig. 8, which compares the runtime of the membership stage and of the classification stage, in seconds, for each dataset, with values of k equal to 3, 5, and 7 and numbers of maps equal to 64, 128, and 256.
According to the figure shown, the following observations can be made.
1) The GAHS-FkNN algorithm is affected when dealing with a large number of features. This is because, with many features, the HS structure generates trees with very deep branches, resulting in higher runtimes. This can be observed in the Epsilon and ECBDL14-S datasets, where the runtimes obtained by LHS-FkNN are much lower than those of the GAHS-FkNN algorithm.
2) The LHS-FkNN scales with the number of maps thanks to its local first stage, achieving a performance associated with the hardware used.
3) The GAHS-FkNN obtains good runtimes without depending heavily on the hardware used, showing interesting behavior but limiting the scalability of the model.

C. Study for Higher Values of k
According to the results shown, the GE-FkNN and L-FkNN algorithms show a stagnation in accuracy for values of k equal to 3, 5, and 7, and their runtime is not drastically affected by k. However, the GAHS-FkNN and LHS-FkNN algorithms keep improving their results. For this reason, this section extends the values of k up to 51, studying the accuracy obtained by both algorithms on the eight datasets, setting the number of maps to 128. Fig. 9 presents the accuracy obtained for each dataset with the GAHS-FkNN and LHS-FkNN algorithms for values of k between 3 and 51. To facilitate the visualization of the results, two plots are shown, owing to the differences in accuracy between the datasets.
According to the figures presented, we can observe the following.
1) The accuracy obtained as a function of the value of k follows a similar pattern for both algorithms. Focusing on the datasets, high values of k improve the accuracy on the Higgs, Poker, Epsilon, and Susy datasets, whereas low values of k improve the accuracy on the Covtype, Watch-acc, Watch-gyr, and ECBDL14-S datasets. This is natural behavior for the FkNN algorithm, also observed on the classical datasets of the literature; the proposed algorithms therefore show the same behavior on large datasets.
2) Comparing GAHS-FkNN and LHS-FkNN in terms of accuracy, the differences are very small; where the difference is somewhat more pronounced, as in the Covtype and Epsilon datasets, LHS-FkNN is the clear winner.

D. Comparison With Crisp kNN Algorithms
As the influence of the number of maps has already been analyzed and was not found significant, this experiment is set to 128 maps in order to focus on the comparative study of crisp versus fuzzy approaches. Fig. 10 shows the total runtime of the kNN-IS, HS-kNN, GAHS-FkNN, and LHS-FkNN algorithms. To facilitate the study of the runtime, only the value k = 5 is presented. The results are shown in two figures, owing to the differences in the scales of the total runtime of each dataset. Table IV gives a comparison between the results obtained by the two proposed algorithms and the two crisp alternatives, exploring values of k equal to 3, 5, and 7.
According to the table and figure presented, the best results are obtained by the proposed algorithms. Although HS-kNN improves on the kNN-IS algorithm, it is always less accurate than the FkNN models, whose runtime is not excessively increased thanks to the optimization carried out in the classification stage of the GAHS-FkNN algorithm. The kNN-IS wins on the Epsilon, Watch-acc, and Watch-gyr datasets, possibly because these are datasets with clearly differentiated boundaries and low noise, where the classification problem is simpler than in the other datasets. Despite this, on average the proposed fuzzy models are clearly better.

VI. CONCLUSION
In this article, two MapReduce approaches were proposed to speed up the FkNN algorithm for big data problems. Thanks to the design and use of big data technologies, it is possible to run them on very large datasets. In order to study possible improvements, the proposed models were compared with the fuzzy and crisp versions from the literature. The GAHS-FkNN and LHS-FkNN algorithms achieve statistically equal results in terms of accuracy. On the one hand, the LHS-FkNN algorithm demonstrated very high scalability, dependent on the hardware facilities available, as well as highly accurate results. On the other hand, the GAHS-FkNN algorithm was less dependent on the hardware resources but more affected by a high number of features.
Thus, the use of LHS-FkNN is recommended when powerful hardware is available for the problem at hand, and when the number of features is high. The use of GAHS-FkNN is recommended when the number of features is not too high and there are hardware limitations.
A library has been generated with the algorithms used in this article and is available on the spark-packages platform.³ As future work, we aim to tackle the class imbalance problem through evolutionary undersampling techniques [40] capable of handling large datasets.