Dr ISAAC TRIGUERO VELAZQUEZ I.TrigueroVelazquez@nottingham.ac.uk
ASSOCIATE PROFESSOR
Evolutionary undersampling for extremely imbalanced big data classification under Apache Spark
Triguero, Isaac; Galar, M.; Merino, D.; Maillo, Jesus; Bustince, H.; Herrera, Francisco
Authors
M. Galar
D. Merino
Jesus Maillo
H. Bustince
Francisco Herrera
Abstract
The classification of datasets with a skewed class distribution is an important problem in data mining. Evolutionary undersampling of the majority class has proved to be a successful approach to this issue. The task becomes even more difficult when the number of majority class examples is very large. In this scenario, the evolutionary model becomes impractical because of memory and time constraints. Divide-and-conquer approaches based on the MapReduce paradigm have already been proposed to handle this type of problem by dividing the data into multiple subsets. However, in extremely imbalanced cases, these models may suffer from a lack of density of the minority class in the subsets considered. To address this problem, in this contribution we provide a new big data scheme based on the emerging technology Apache Spark to tackle highly imbalanced datasets. We take advantage of its in-memory operations to diminish the effect of the small sample size. The key point of this proposal lies in the independent management of majority and minority class examples, which allows us to keep a higher number of minority class examples in each subset. In our experiments, we analyze the proposed model on several datasets with up to 17 million instances. The results show the effectiveness of this evolutionary undersampling model for extremely imbalanced big data classification.
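As a rough illustration of the idea of handling majority and minority class examples independently, the following plain-Python sketch splits only the majority class across subsets and copies the whole minority class into each one. This is an assumption made for illustration: the abstract does not state how the minority examples are distributed, and the sketch does not use Apache Spark. The function names `split_with_shared_minority` and `random_undersample` are hypothetical, and simple random undersampling stands in for the evolutionary search described in the paper.

```python
import random
from typing import List, Sequence, Tuple

# (features, label); label 1 marks the minority class (an assumption of this sketch).
Example = Tuple[Sequence[float], int]


def split_with_shared_minority(
    data: List[Example], num_subsets: int, minority_label: int = 1, seed: int = 0
) -> List[List[Example]]:
    """Split only the majority class; copy the whole minority class into each subset."""
    rng = random.Random(seed)
    minority = [ex for ex in data if ex[1] == minority_label]
    majority = [ex for ex in data if ex[1] != minority_label]
    rng.shuffle(majority)
    # Round-robin split of the shuffled majority examples across the subsets.
    chunks = [majority[i::num_subsets] for i in range(num_subsets)]
    return [chunk + minority for chunk in chunks]


def random_undersample(
    subset: List[Example], minority_label: int = 1, seed: int = 0
) -> List[Example]:
    """Stand-in for evolutionary undersampling: keep at most as many
    majority examples as there are minority examples in the subset."""
    rng = random.Random(seed)
    minority = [ex for ex in subset if ex[1] == minority_label]
    majority = [ex for ex in subset if ex[1] != minority_label]
    kept = rng.sample(majority, min(len(majority), len(minority)))
    return kept + minority
```

With 1,000 majority and 10 minority examples split into 4 subsets, each subset in this sketch holds 250 majority examples and all 10 minority examples, whereas a plain random split of the whole dataset would leave only two or three minority examples per subset on average.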
Citation
Triguero, I., Galar, M., Merino, D., Maillo, J., Bustince, H., & Herrera, F. Evolutionary undersampling for extremely imbalanced big data classification under Apache Spark. Presented at 2016 IEEE Congress on Evolutionary Computation (CEC)
Conference Name | 2016 IEEE Congress on Evolutionary Computation (CEC) |
---|---|
End Date | Jul 29, 2016 |
Acceptance Date | Mar 15, 2016 |
Publication Date | Jul 24, 2016 |
Deposit Date | Nov 22, 2016 |
Publicly Available Date | Nov 22, 2016 |
Peer Reviewed | Peer Reviewed |
Keywords | Big data, Sparks, Data mining, Data models, Biological cells, Proposals, Standards |
Public URL | https://nottingham-repository.worktribe.com/output/799743 |
Publisher URL | http://ieeexplore.ieee.org/document/7743853/ |
Additional Information | © 2016 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. |
Contract Date | Nov 22, 2016 |
Files
EUSspark.pdf (1.4 Mb, PDF)