Hyper-Stacked: Scalable and Distributed Approach to AutoML for Big Data

Dave, Ryan; Angarita-Zapata, Juan S.; Triguero, Isaac

doi:10.1007/978-3-031-40837-3_6

Authors

Ryan Dave

Juan S. Angarita-Zapata

Dr Isaac Triguero Velazquez, Associate Professor (I.TrigueroVelazquez@nottingham.ac.uk)

Abstract

The emergence of Machine Learning (ML) has altered how researchers and business professionals value data. Although ML is applicable to almost every industry, considerable time is wasted creating bespoke applications and repeatedly hand-tuning models to reach optimal performance. Some practitioners want these outcomes, but the complexity of ML and a lack of expertise in the field become a hindrance. This, in turn, has increased demand for automating the complete ML workflow (from data preprocessing to model selection), known as Automated Machine Learning (AutoML). Although AutoML solutions have been developed, Big Data is now seen as an impediment for large organisations with massive data outputs. Current methods cannot extract value from large volumes of data because they are tightly coupled to centralised ML libraries, which limits their scaling potential. This paper introduces Hyper-Stacked, a novel AutoML component built natively on Apache Spark. Hyper-Stacked combines multi-fidelity hyperparameter optimisation with the Super Learner stacking technique to produce a strong and diverse ensemble. Integration with Spark allows for a parallelised and distributed approach, capable of handling the volume and complexity associated with Big Data. Scalability is demonstrated through an in-depth analysis of speedup, sizeup and scaleup.
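
The three scalability measures named in the abstract can be illustrated with a minimal sketch. It assumes the definitions commonly used in the parallel data mining literature (the paper's exact formulation may differ), and the function names and runtimes are hypothetical, chosen only for illustration; they are not results from the paper.

```python
def _check(*runtimes: float) -> None:
    """Reject non-positive runtimes, which would make the ratios meaningless."""
    for t in runtimes:
        if t <= 0:
            raise ValueError("runtimes must be positive")


def speedup(t_base_cores: float, t_more_cores: float) -> float:
    """Fixed data, more cores: baseline runtime / runtime with more cores."""
    _check(t_base_cores, t_more_cores)
    return t_base_cores / t_more_cores


def sizeup(t_base_data: float, t_more_data: float) -> float:
    """Fixed cores, more data: runtime on larger data / runtime on baseline data."""
    _check(t_base_data, t_more_data)
    return t_more_data / t_base_data


def scaleup(t_base: float, t_scaled: float) -> float:
    """Cores and data grown by the same factor: baseline runtime / scaled runtime."""
    _check(t_base, t_scaled)
    return t_base / t_scaled


# Hypothetical runtimes in seconds (assumed values, not measurements).
print(speedup(100.0, 25.0))   # 4x cores on the same data
print(sizeup(100.0, 400.0))   # 4x data on the same cores
print(scaleup(100.0, 125.0))  # 4x cores and 4x data together
```

Under these assumed definitions, an ideal system shows a speedup equal to the factor by which cores increase, a sizeup no larger than the factor by which data increases, and a scaleup close to 1.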

Citation

Dave, R., Angarita-Zapata, J. S., & Triguero, I. (2023, August). Hyper-Stacked: Scalable and Distributed Approach to AutoML for Big Data. Presented at 7th IFIP TC 5, TC 12, WG 8.4, WG 8.9, WG 12.9 International Cross-Domain Conference, CD-MAKE 2023, Benevento, Italy

Presentation Conference Type	Edited Proceedings
Conference Name	7th IFIP TC 5, TC 12, WG 8.4, WG 8.9, WG 12.9 International Cross-Domain Conference, CD-MAKE 2023
Start Date	Aug 29, 2023
End Date	Sep 1, 2023
Acceptance Date	Aug 22, 2023
Online Publication Date	Aug 22, 2023
Publication Date	Aug 22, 2023
Deposit Date	Aug 29, 2023
Publicly Available Date	Jan 12, 2024
Publisher	Springer
Volume	14065 LNCS
Pages	82-102
Series Title	Lecture Notes in Computer Science
Series Number	14065
Series ISSN	1611-3349
Book Title	Machine Learning and Knowledge Extraction
ISBN	9783031408366
DOI	https://doi.org/10.1007/978-3-031-40837-3_6
Keywords	AutoML; Big Data; Apache Spark; Supervised learning
Public URL	https://nottingham-repository.worktribe.com/output/24585930
Publisher URL	https://link.springer.com/chapter/10.1007/978-3-031-40837-3_6