Hyper-Stacked: Scalable and Distributed Approach to AutoML for Big Data

Dave, Ryan; Angarita-Zapata, Juan S.; Triguero, Isaac

Authors

Ryan Dave

Juan S. Angarita-Zapata

Isaac Triguero

Abstract

The emergence of Machine Learning (ML) has altered how researchers and business professionals value data. Although ML is applicable to almost every industry, considerable amounts of time are wasted creating bespoke applications and repetitively hand-tuning models to reach optimal performance. Some practitioners want the outcomes ML can deliver but are hindered by its complexity and by a lack of knowledge in the field. This, in turn, has driven increasing demand for the automation of the complete ML workflow (from data preprocessing to model selection), known as Automated Machine Learning (AutoML). Although AutoML solutions have been developed, Big Data is now seen as an impediment for large organisations with massive data outputs. Current methods cannot extract value from large volumes of data because they are tightly coupled to centralised ML libraries, which limits their scaling potential. This paper introduces Hyper-Stacked, a novel AutoML component built natively on Apache Spark. Hyper-Stacked combines multi-fidelity hyperparameter optimisation with the Super Learner stacking technique to produce a strong and diverse ensemble. Integration with Spark allows for a parallelised and distributed approach, capable of handling the volume and complexity associated with Big Data. Scalability is demonstrated through an in-depth analysis of speedup, sizeup and scaleup.

Conference Name: 7th IFIP TC 5, TC 12, WG 8.4, WG 8.9, WG 12.9 International Cross-Domain Conference, CD-MAKE 2023
Conference Location: Benevento, Italy
Start Date: Aug 29, 2023
End Date: Sep 1, 2023
Acceptance Date: Aug 22, 2023
Online Publication Date: Aug 22, 2023
Publication Date: Aug 22, 2023
Deposit Date: Aug 29, 2023
Publicly Available Date: Jan 12, 2024
Publisher: Springer
Volume: 14065 LNCS
Pages: 82-102
Series Title: Lecture Notes in Computer Science
Series Number: 14065
Series ISSN: 1611-3349
Book Title: Machine Learning and Knowledge Extraction
ISBN: 9783031408366
DOI: https://doi.org/10.1007/978-3-031-40837-3_6
Keywords: AutoML; Big Data; Apache Spark; Supervised learning
Public URL: https://nottingham-repository.worktribe.com/output/24585930
Publisher URL: https://link.springer.com/chapter/10.1007/978-3-031-40837-3_6
