Jesus Maillo
kNN-IS: an iterative spark-based design of the k-nearest neighbors classifier for big data
Maillo, Jesus; Ramirez, Sergio; Triguero, Isaac; Herrera, Francisco
Authors
Sergio Ramirez
Dr ISAAC TRIGUERO VELAZQUEZ I.TrigueroVelazquez@nottingham.ac.uk
ASSOCIATE PROFESSOR
Francisco Herrera
Abstract
The k-Nearest Neighbors classifier is a simple yet effective widely renowned method in data mining. The actual application of this model in the big data domain is not feasible due to time and memory restrictions. Several distributed alternatives based on MapReduce have been proposed to enable this method to handle large-scale data. However, their performance can be further improved with new designs that fit with newly arising technologies.
In this work we provide a new solution to perform an exact k-nearest neighbor classification based on Spark. We take advantage of its in-memory operations to classify big amounts of unseen cases against a big training dataset. The map phase computes the k-nearest neighbors in different training data splits. Afterwards, multiple reducers process the definitive neighbors from the list obtained in the map phase. The key point of this proposal lies on the management of the test set, keeping it in memory when possible. Otherwise, it is split into a minimum number of pieces, applying a MapReduce per chunk, using the caching skills of Spark to reuse the previously partitioned training set. In our experiments we study the differences between Hadoop and Spark implementations with datasets up to 11 million instances, showing the scaling-up capabilities of the proposed approach. As a result of this work an open-source Spark package is available.
Citation
Maillo, J., Ramirez, S., Triguero, I., & Herrera, F. (2017). kNN-IS: an iterative spark-based design of the k-nearest neighbors classifier for big data. Knowledge-Based Systems, 117, 3-15. https://doi.org/10.1016/j.knosys.2016.06.012
Journal Article Type | Article |
---|---|
Acceptance Date | Jun 12, 2016 |
Online Publication Date | Jun 14, 2016 |
Publication Date | Feb 1, 2017 |
Deposit Date | Jun 15, 2016 |
Publicly Available Date | Jun 15, 2016 |
Journal | Knowledge-Based Systems |
Print ISSN | 0950-7051 |
Electronic ISSN | 1872-7409 |
Publisher | Elsevier |
Peer Reviewed | Peer Reviewed |
Volume | 117 |
Pages | 3-15 |
DOI | https://doi.org/10.1016/j.knosys.2016.06.012 |
Keywords | K-nearest neighbors; Big data; Apache Hadoop; Apache Spark; MapReduce |
Public URL | https://nottingham-repository.worktribe.com/output/795290 |
Publisher URL | http://www.sciencedirect.com/science/article/pii/S0950705116301757 |
Contract Date | Jun 15, 2016 |
Files
acceptedVersion.pdf
(1.5 Mb)
PDF
Copyright Statement
Copyright information regarding this work can be found at the following address: http://creativecommons.org/licenses/by-nc-nd/4.0
You might also like
Machine Learning Pipeline for Energy and Environmental Prediction in Cold Storage Facilities
(2024)
Journal Article
Local-global methods for generalised solar irradiance forecasting
(2024)
Journal Article
Hyper-Stacked: Scalable and Distributed Approach to AutoML for Big Data
(2023)
Presentation / Conference Contribution
Explaining time series classifiers through meaningful perturbation and optimisation
(2023)
Journal Article
Downloadable Citations
About Repository@Nottingham
Administrator e-mail: discovery-access-systems@nottingham.ac.uk
This application uses the following open-source libraries:
SheetJS Community Edition
Apache License Version 2.0 (http://www.apache.org/licenses/)
PDF.js
Apache License Version 2.0 (http://www.apache.org/licenses/)
Font Awesome
SIL OFL 1.1 (http://scripts.sil.org/OFL)
MIT License (http://opensource.org/licenses/mit-license.html)
CC BY 3.0 ( http://creativecommons.org/licenses/by/3.0/)
Powered by Worktribe © 2025
Advanced Search