Transforming big data into smart data: An insight on the use of the k-nearest neighbors algorithm to obtain quality data

Triguero, Isaac; Garcia-Gil, Diego; Maillo, Jesus; Luengo, Julian; Garcia, Salvador; Herrera, Francisco

doi:10.1002/widm.1289

Authors

Dr ISAAC TRIGUERO VELAZQUEZ I.TrigueroVelazquez@nottingham.ac.uk
ASSOCIATE PROFESSOR

Diego Garcia-Gil

Jesus Maillo

Julian Luengo

Salvador Garcia

Francisco Herrera

Abstract

The k-nearest neighbours algorithm is characterised as a simple yet effective data mining technique. The main drawback of this technique appears when massive amounts of data (likely to contain noise and imperfections) are involved, turning this algorithm into an imprecise and especially inefficient technique. These disadvantages have been the subject of research for many years, and among other approaches, data preprocessing techniques such as instance reduction or missing values imputation have targeted these weaknesses. As a result, these issues have turned into strengths, and the k-nearest neighbours rule has become a core algorithm to identify and correct imperfect data, removing noisy and redundant samples or imputing missing values, and so transforming Big Data into Smart Data, that is, data of sufficient quality to expect a good outcome from any data mining algorithm. The role of this smart data gleaning algorithm in a supervised learning context is investigated. This includes a brief overview of Smart Data, current and future trends for the k-nearest neighbour algorithm in the Big Data context, and the existing data preprocessing techniques based on this algorithm. We present the emerging Big Data-ready versions of these algorithms and develop some new methods to cope with Big Data. We carry out a thorough experimental analysis on a series of big datasets that provides guidelines on how to use the k-nearest neighbour algorithm to obtain Smart/Quality Data for a high-quality data mining process. Moreover, multiple Spark Packages have been developed that include all the Smart Data algorithms analysed.
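To make the idea of using the k-nearest neighbours rule to remove noisy samples concrete, the following is a minimal single-machine sketch of one classic kNN-based noise filter, Edited Nearest Neighbours: a sample is discarded when its label disagrees with the majority label of its k nearest neighbours. This is an illustration only, not the Spark implementation described in the article; the function names `knn_labels` and `enn_filter` and the toy data are assumptions introduced here.

```python
import math
from collections import Counter


def knn_labels(X, y, i, k):
    """Labels of the k nearest neighbours of sample i (the sample itself is excluded)."""
    dists = [(math.dist(X[i], X[j]), j) for j in range(len(X)) if j != i]
    dists.sort()  # ascending by distance, ties broken by index
    return [y[j] for _, j in dists[:k]]


def enn_filter(X, y, k=3):
    """Return the indices of samples whose label matches the majority of their k neighbours."""
    keep = []
    for i in range(len(X)):
        votes = Counter(knn_labels(X, y, i, k))
        majority_label = votes.most_common(1)[0][0]
        if majority_label == y[i]:
            keep.append(i)
    return keep


# Hypothetical toy data: two well-separated clusters, with one sample
# (index 4) carrying label "b" while sitting inside the "a" cluster.
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05),
     (1.0, 1.0), (1.1, 1.0), (1.0, 1.1), (1.1, 1.1)]
y = ["a", "a", "a", "a", "b", "b", "b", "b", "b"]

kept = enn_filter(X, y, k=3)
print(kept)  # index 4 is filtered out as noise
```

This brute-force version compares every pair of samples, so its cost grows quadratically with the number of samples, which is the inefficiency the abstract refers to when massive amounts of data are involved.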

Citation

Triguero, I., Garcia-Gil, D., Maillo, J., Luengo, J., Garcia, S., & Herrera, F. (2019). Transforming big data into smart data: An insight on the use of the k-nearest neighbors algorithm to obtain quality data. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 9(2), Article e1289. https://doi.org/10.1002/widm.1289

Journal Article Type	Article
Acceptance Date	Sep 26, 2018
Online Publication Date	Nov 28, 2018
Publication Date	2019-03
Deposit Date	Oct 19, 2018
Publicly Available Date	Nov 29, 2019
Journal	Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery
Electronic ISSN	1942-4795
Publisher	Wiley
Peer Reviewed	Peer Reviewed
Volume	9
Issue	2
Article Number	e1289
DOI	https://doi.org/10.1002/widm.1289
Public URL	https://nottingham-repository.worktribe.com/output/1176205
Publisher URL	https://onlinelibrary.wiley.com/doi/full/10.1002/widm.1289
Contract Date	Oct 19, 2018