kNN-IS: an iterative spark-based design of the k-nearest neighbors classifier for big data

Maillo, Jesus; Ramirez, Sergio; Triguero, Isaac; Herrera, Francisco

doi:10.1016/j.knosys.2016.06.012

Authors

Jesus Maillo

Sergio Ramirez

Dr Isaac Triguero Velazquez (I.TrigueroVelazquez@nottingham.ac.uk), Associate Professor

Francisco Herrera

Abstract

The k-Nearest Neighbors classifier is a simple yet effective, widely renowned method in data mining. Applying this model directly in the big data domain is not feasible because of time and memory restrictions. Several distributed alternatives based on MapReduce have been proposed to enable this method to handle large-scale data. However, their performance can be further improved with new designs that fit newly emerging technologies.

In this work we provide a new solution to perform exact k-nearest neighbor classification based on Spark. We take advantage of its in-memory operations to classify large amounts of unseen cases against a large training dataset. The map phase computes the k nearest neighbors in different training data splits. Afterwards, multiple reducers determine the definitive neighbors from the lists obtained in the map phase. The key point of this proposal lies in the management of the test set, keeping it in memory when possible. Otherwise, it is split into a minimum number of pieces, applying a MapReduce per chunk and using the caching capabilities of Spark to reuse the previously partitioned training set. In our experiments we study the differences between Hadoop and Spark implementations with datasets of up to 11 million instances, showing the scaling-up capabilities of the proposed approach. As a result of this work, an open-source Spark package is available.
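
The map, reduce and test-chunking steps described in the abstract can be illustrated with a small single-machine sketch. This is not the authors' Spark package: it is plain Python that only mimics the data flow, and the function names (map_knn, reduce_knn, majority_vote, knn_is), the Euclidean distance, the round-robin splitting and the tie-breaking rule are all assumptions made for illustration.

```python
import heapq
import math


def map_knn(train_split, test_chunk, k):
    """Map phase (sketch): k nearest candidates within ONE training split."""
    candidates = []
    for x in test_chunk:
        # Euclidean distance is an assumption; the abstract names no metric.
        dists = [(math.dist(x, features), label) for features, label in train_split]
        candidates.append(heapq.nsmallest(k, dists, key=lambda t: t[0]))
    return candidates


def reduce_knn(cands_a, cands_b, k):
    """Reduce phase (sketch): merge two candidate lists, keep the k closest."""
    return [
        heapq.nsmallest(k, a + b, key=lambda t: t[0])
        for a, b in zip(cands_a, cands_b)
    ]


def majority_vote(neighbors):
    """Most frequent label among the neighbors (ties: first label seen wins)."""
    counts = {}
    for _, label in neighbors:
        counts[label] = counts.get(label, 0) + 1
    return max(counts, key=counts.get)


def knn_is(train, test, k, num_splits, chunk_size):
    """Classify `test` chunk by chunk against a training set split only once."""
    # The training set is partitioned once and reused for every test chunk,
    # standing in for the cached, previously partitioned training set.
    splits = [train[i::num_splits] for i in range(num_splits)]
    predictions = []
    for start in range(0, len(test), chunk_size):
        chunk = test[start:start + chunk_size]
        partial = [map_knn(split, chunk, k) for split in splits]
        merged = partial[0]
        for other in partial[1:]:
            merged = reduce_knn(merged, other, k)
        predictions.extend(majority_vote(n) for n in merged)
    return predictions


if __name__ == "__main__":
    train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
             ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
    test = [(0.2, 0.1), (5.5, 5.2), (0.9, 0.9)]
    print(knn_is(train, test, k=3, num_splits=2, chunk_size=2))
```

Because every split contributes its own k best candidates and the merge keeps the k closest overall, the neighbor distances match those of a search over the whole training set, so the number of splits and the chunk size affect resource use but not the selected distances. With tied distances the chosen neighbors may differ between configurations.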

Citation

Maillo, J., Ramirez, S., Triguero, I., & Herrera, F. (2017). kNN-IS: an iterative spark-based design of the k-nearest neighbors classifier for big data. Knowledge-Based Systems, 117, 3-15. https://doi.org/10.1016/j.knosys.2016.06.012

Journal Article Type	Article
Acceptance Date	Jun 12, 2016
Online Publication Date	Jun 14, 2016
Publication Date	Feb 1, 2017
Deposit Date	Jun 15, 2016
Publicly Available Date	Jun 15, 2016
Journal	Knowledge-Based Systems
Print ISSN	0950-7051
Electronic ISSN	1872-7409
Publisher	Elsevier
Peer Reviewed	Peer Reviewed
Volume	117
Pages	3-15
DOI	https://doi.org/10.1016/j.knosys.2016.06.012
Keywords	K-nearest neighbors; Big data; Apache Hadoop; Apache Spark; MapReduce
Public URL	https://nottingham-repository.worktribe.com/output/795290
Publisher URL	http://www.sciencedirect.com/science/article/pii/S0950705116301757
Contract Date	Jun 15, 2016