Hyper-Stacked: Scalable and Distributed Approach to AutoML for Big Data
Dave, Ryan; Angarita-Zapata, Juan S.; Triguero, Isaac
Authors
Ryan Dave
Juan S. Angarita-Zapata
Dr Isaac Triguero Velazquez (I.TrigueroVelazquez@nottingham.ac.uk), Associate Professor
Abstract
The emergence of Machine Learning (ML) has altered how researchers and business professionals value data. Although ML is applicable to almost every industry, considerable amounts of time are wasted creating bespoke applications and repetitively hand-tuning models to reach optimal performance. Some may desire the outcome, but the complexity of ML and a lack of knowledge in the field become a hindrance. This, in turn, has driven an increasing demand for the automation of the complete ML workflow (from data preprocessing to model selection), known as Automated Machine Learning (AutoML). Although AutoML solutions have been developed, Big Data is now seen as an impediment for large organisations with massive data outputs. Current methods cannot extract value from large volumes of data because they are tightly coupled to centralised ML libraries, which limits their scaling potential. This paper introduces Hyper-Stacked, a novel AutoML component built natively on Apache Spark. Hyper-Stacked combines multi-fidelity hyperparameter optimisation with the Super Learner stacking technique to produce a strong and diverse ensemble. Integration with Spark allows for a parallelised and distributed approach, capable of handling the volume and complexity associated with Big Data. Scalability is demonstrated through an in-depth analysis of speedup, sizeup and scaleup.
Citation
Dave, R., Angarita-Zapata, J. S., & Triguero, I. (2023, August). Hyper-Stacked: Scalable and Distributed Approach to AutoML for Big Data. Presented at 7th IFIP TC 5, TC 12, WG 8.4, WG 8.9, WG 12.9 International Cross-Domain Conference, CD-MAKE 2023, Benevento, Italy
Presentation Conference Type | Edited Proceedings |
---|---|
Conference Name | 7th IFIP TC 5, TC 12, WG 8.4, WG 8.9, WG 12.9 International Cross-Domain Conference, CD-MAKE 2023 |
Start Date | Aug 29, 2023 |
End Date | Sep 1, 2023 |
Acceptance Date | Aug 22, 2023 |
Online Publication Date | Aug 22, 2023 |
Publication Date | Aug 22, 2023 |
Deposit Date | Aug 29, 2023 |
Publicly Available Date | Jan 12, 2024 |
Publisher | Springer |
Volume | 14065 LNCS |
Pages | 82-102 |
Series Title | Lecture Notes in Computer Science |
Series Number | 14065 |
Series ISSN | 1611-3349 |
Book Title | Machine Learning and Knowledge Extraction |
ISBN | 9783031408366 |
DOI | https://doi.org/10.1007/978-3-031-40837-3_6 |
Keywords | AutoML; Big Data; Apache Spark; Supervised learning |
Public URL | https://nottingham-repository.worktribe.com/output/24585930 |
Publisher URL | https://link.springer.com/chapter/10.1007/978-3-031-40837-3_6 |
Files
Hyper Stacked (PDF, 2.8 MB)