Redundancy and Complexity Metrics for Big Data Classification: Towards Smart Data
Maillo, Jesus; Triguero, Isaac; Herrera, Francisco
Authors
Jesus Maillo
Dr Isaac Triguero Velazquez (Associate Professor), I.TrigueroVelazquez@nottingham.ac.uk
Francisco Herrera
Abstract
Knowing the descriptive properties of a dataset is widely recognized as important when tackling a data science problem. Having information about the redundancy, complexity and density of a problem allows us to decide which data preprocessing and machine learning techniques are most suitable. In classification problems, there are multiple metrics to describe the overlap of features between classes, class imbalance or separability, among others. However, these metrics may not scale well to big datasets, or may simply not be sufficiently informative in this context. In this paper, we provide a package of metrics for big data classification problems. In particular, we propose two new big data metrics, Neighborhood Density and Decision Tree Progression, which study density and accuracy progression by discarding half of the samples. In addition, we enable a number of basic metrics to handle big data. The experimental study carried out on standard big data classification problems shows that our metrics can quickly characterize big datasets. We identified a clear redundancy of information in most datasets, such that randomly discarding 75% of the samples does not drastically affect the accuracy of the classifiers used. Thus, the proposed big data metrics, which are available as a Spark-Package, provide a fast assessment of the shape of a classification dataset prior to applying big data preprocessing, toward smart data.
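The idea of tracking accuracy while repeatedly discarding half of the samples can be illustrated with a small sketch. This is not the authors' Spark-Package or their Decision Tree Progression metric: it assumes synthetic two-class Gaussian data, swaps in a simple nearest-centroid classifier in place of a decision tree, and uses hypothetical helper names (`make_data`, `fit_centroids`, `halving_progression`). Two halvings leave 25% of the training set, which mirrors the "discard 75%" setting mentioned in the abstract.

```python
import random


def make_data(n, seed):
    """Synthetic two-class data: two 2-D Gaussian blobs (an assumption for illustration)."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        label = i % 2
        center = 0.0 if label == 0 else 3.0
        data.append(((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label))
    return data


def fit_centroids(train):
    """Nearest-centroid 'training': compute the mean point of each class."""
    sums, counts = {}, {}
    for (x, y), label in train:
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    return {c: (sums[c][0] / counts[c], sums[c][1] / counts[c]) for c in sums}


def accuracy(centroids, test):
    """Fraction of test points whose closest class centroid matches their label."""
    hits = 0
    for (x, y), label in test:
        pred = min(
            centroids,
            key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2,
        )
        hits += pred == label
    return hits / len(test)


def halving_progression(train, test, steps, seed=0):
    """Record (training size, test accuracy), then randomly discard half; repeat."""
    rng = random.Random(seed)
    current = list(train)
    results = []
    for _ in range(steps + 1):
        results.append((len(current), accuracy(fit_centroids(current), test)))
        current = rng.sample(current, len(current) // 2)
    return results


train = make_data(4000, seed=1)
test = make_data(1000, seed=2)
progression = halving_progression(train, test, steps=2)
for size, acc in progression:
    print(f"{size:5d} training samples -> accuracy {acc:.3f}")
```

If the accuracy stays roughly flat as the training set shrinks, the discarded samples carried largely redundant information for this classifier, which is the kind of signal the abstract describes.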
Citation
Maillo, J., Triguero, I., & Herrera, F. (2020). Redundancy and Complexity Metrics for Big Data Classification: Towards Smart Data. IEEE Access, 8, 87918-87928. https://doi.org/10.1109/ACCESS.2020.2991800
Journal Article Type | Article |
---|---|
Acceptance Date | Apr 27, 2020 |
Online Publication Date | May 1, 2020 |
Publication Date | May 1, 2020 |
Deposit Date | May 13, 2020 |
Publicly Available Date | May 13, 2020 |
Journal | IEEE Access |
Electronic ISSN | 2169-3536 |
Publisher | Institute of Electrical and Electronics Engineers |
Peer Reviewed | Peer Reviewed |
Volume | 8 |
Pages | 87918-87928 |
DOI | https://doi.org/10.1109/ACCESS.2020.2991800 |
Keywords | General Engineering; General Materials Science; General Computer Science |
Public URL | https://nottingham-repository.worktribe.com/output/4429552 |
Publisher URL | https://ieeexplore.ieee.org/document/9083972 |
Files
Big Data Classification Towards Smart Data
(3.6 Mb)
PDF
Publisher Licence URL
https://creativecommons.org/licenses/by/4.0/