An Intelligent Toolkit for Benchmarking Data-Driven Aerospace Prognostics

Machine Learning (ML) has been widely applied to sensor data for predicting the Remaining Useful Life (RUL) of aircraft components, with promising results. A review of the literature, however, reveals a lack of consensus regarding the evaluation metrics adopted, the state-of-the-art methods employed for performance comparison, the approaches used to address data overfitting, and the statistical tests used to assess the significance of results. These weaknesses in the methodological approach to experimental design, results evaluation, comparison and reporting of findings can produce misleading outcomes and ultimately less effective predictors. Arbitrary choices of evaluation approaches for novel methods, the potential bias such choices introduce, and the lack of systematic replication and comparison of outcomes can affect the findings reported and misguide future research. For further advances in this area, there is therefore an urgent need for appropriate benchmarking methodologies to assist in evaluating novel methods and to produce fair performance rankings. In this paper we introduce an open-source, extensible benchmarking library to address this gap in aerospace prognostics. The library assists researchers in conducting a proper and fair evaluation of their novel ML RUL prediction models. In addition, it helps stimulate better practices and a more rigorous approach to experimental design across the field. Our library contains 13 state-of-the-art ML methods, 12 metrics for algorithm performance evaluation, and tests for statistical significance. To demonstrate the library's functionality, we apply it to gas turbine engine prognostic datasets.


I. INTRODUCTION
Prognostics and Health Management (PHM) systems have become increasingly important in aviation. Aircraft are now extensively equipped with sensors that constantly gather information about their status and possible faults. The ability to use these sensor data to accurately predict problems in aircraft parts facilitates intelligent health and maintenance management. In addition, the widespread adoption of data collection in aircraft has allowed the transition from Time-Based Maintenance (TBM), where maintenance is scheduled at fixed intervals, to Condition-Based Maintenance (CBM), where decisions are based on information collected via sensor monitoring [1], [2]. CBM has enabled the rapid development of data-driven methods for aerospace maintenance and stimulated research into predicting when aircraft components will fail.
Several studies employ Machine Learning (ML) techniques to perform Remaining Useful Life (RUL) prediction for aircraft components using publicly available datasets [3]-[8]. Across most of the research reviewed, there is a lack of methodological agreement with regard to: (1) the evaluation metrics employed to assess results, (2) the choice of state-of-the-art methods for performance comparison, (3) the strategies chosen to address data overfitting, and (4) the statistical methods appropriate for assessing the significance of results. These weaknesses in methodological approach can lead to misleading outcomes and ultimately misguide future studies. Additionally, arbitrary choices of evaluation methods for novel approaches can introduce reporting bias into performance evaluation.
To address these methodological gaps, we conduct an in-depth review of the evaluation methodologies used in data-driven aerospace prognostics and in similar machine learning tasks, such as regression, and select state-of-the-art methods for evaluating and comparing performance. We then develop a library in Python using the open-source Keras and Scikit-learn libraries, which allows researchers to evaluate their novel methods using a systematic, robust, fair, and reproducible methodology. This library has two main objectives: (1) to introduce an extensible, open-source data-driven toolkit for researchers, encouraging more systematic replication of data-driven prognosis models; and (2) to provide a robust methodology for the evaluation and comparison of novel methods. To achieve the first objective, we implement 13 existing state-of-the-art data-driven prognosis algorithms and optimise their hyperparameters using random search and cross-validation. For the second objective, we employ 12 evaluation metrics and a statistical assessment of the outcomes.
This paper is organised as follows. Section II reviews the methodologies for evaluating and benchmarking ML prognosis algorithms and introduces the datasets used to validate our library. Section III provides an overview of the library, the methods implemented, and the results of applying the library to the datasets. Section IV concludes the paper and outlines opportunities for future research.

II. LITERATURE REVIEW
To better understand the rationale behind our toolkit and methodology, we first give an overview of current efforts towards benchmarking RUL predictions and the existing gaps in the literature. We then identify the state-of-the-art ML prognosis methods and commonly used evaluation metrics, and contrast the performance evaluation approaches taken by different groups of authors. The objectives are (1) to draw attention to the lack of consensus regarding the methods adopted; and (2) to establish a common set of well-known methods for researchers to use, supporting a more rigorous approach across the field in the future.

A. Remaining Useful Life Prognostics Benchmarking
Prognosis benchmarking has been an under-explored area for aerospace RUL prediction models. To the best of our knowledge, the main investigation towards advancing the area is by Ramasso and Saxena [8]. The authors review and analyse an extensive list of studies applying intelligent prognosis methods to a well-known set of benchmark sensor data, the Commercial Modular Aero-Propulsion System Simulation (CMAPSS) datasets. They also list the methods employed for evaluating the models' performance. The main objective of their study is to provide clear guidelines for using the dataset so that different techniques can be compared consistently. Their findings reveal inconsistencies among authors in the selection of performance evaluation metrics for comparing results. There is, however, little literature on methodologies for benchmarking novel prognosis models.

B. Benchmarking Datasets: Commercial Modular Aero-Propulsion System Simulation
The most widely used, publicly available dataset for evaluating aerospace prognostic algorithms is CMAPSS [9]. It consists of four sub-datasets (Table I), generated from a high-fidelity simulation of a complex non-linear system that closely models a real aerospace engine. Each sub-dataset contains one training set and one test set with different operating conditions and fault patterns. The training set contains complete engine life-cycle data, i.e. run-to-failure trajectories, whereas the test set trajectories end some time before failure. The datasets consist of the engine unit number, the operating cycle number of each unit, the operating settings, and the raw sensor measurements.
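To make the data layout concrete, the following is a minimal sketch of loading one CMAPSS sub-dataset with pandas. The file name, the column layout (unit number, cycle, three operational settings and 21 sensor channels, space-separated) and the run-to-failure RUL labelling reflect the standard public release, but should be treated as assumptions rather than part of the toolkit itself.

```python
import pandas as pd

# Assumed column layout of the standard CMAPSS text files (space-separated).
cols = ["unit", "cycle", "op_setting_1", "op_setting_2", "op_setting_3"] + \
       [f"sensor_{i}" for i in range(1, 22)]

# File name assumed; adjust to the local copy of the sub-dataset.
train = pd.read_csv("train_FD001.txt", sep=r"\s+", header=None, names=cols)

# Training trajectories are run-to-failure, so a common RUL target is the
# number of cycles remaining until each unit's last recorded cycle.
train["RUL"] = train.groupby("unit")["cycle"].transform("max") - train["cycle"]
```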

C. Current Methodologies in Aerospace Prognostic Algorithms Evaluation
Evaluating a prognostic algorithm involves: (1) selecting the algorithms against which performance is compared, and (2) selecting the metrics used to evaluate the results. Across three surveys of data-driven approaches to prognostics [8], [10], [11], there is clear inconsistency in the selection of models for comparison and metrics for evaluation. To validate these findings, we review a broad list of studies employing data-driven prognostic methods and their preferred algorithms for model comparison and evaluation (Tables II and III).
Furthermore, Table III presents the metrics used in the literature for model evaluation. The most commonly used metrics are Absolute Error (AE) [12], Relative Error (RE) [20], Root Mean Squared Error (RMSE) [21], Mean Error (ME) [13], Mean Squared Error (MSE) [21], timeliness [20], False Positives (FP) [22], Median Absolute Error (MdAE) [23], Mean Absolute Deviation (MAD) [23], symmetric Mean Absolute Percentage Error (sMAPE) [23], False Negatives (FN) [22], training time [23], and test time [23]. These metrics evaluate different aspects of performance and together enable an in-depth understanding of a model. They fall into three major categories: (1) algorithmic performance metrics, (2) computational performance metrics, and (3) cost-benefit performance metrics [23]. Algorithmic performance metrics evaluate the accuracy of the model in predicting RULs. Computational performance metrics evaluate the time the model needs to run, which is essential for real-time monitoring and safety-critical prognosis. Cost-benefit metrics evaluate the economic value of the model. We also observe disagreement in the selection of these metrics. For instance, Yuan et al. [6] and Hinchi et al. [17] use LSTM for prognostics and employ RE and timeliness as evaluation metrics, while Wang et al. [16] use a bidirectional LSTM but employ RMSE and timeliness as their performance metrics. In addition, Gao et al. [12] use SVR with AE as the only evaluation metric, while Baptista et al. [13] also use SVR but report ME, RMSE, MdAE and training time.
Current evaluation methodologies clearly show a lack of consensus regarding the evaluation metrics adopted and the state-of-the-art methods employed for performance comparison and evaluation (see Table II and Table III). In addition, we observe an absence both of strategies for addressing data overfitting and of statistical tests on the RUL predictions for evaluating whether improvements in results are significant. We therefore introduce an intelligent toolkit that allows users to compare different machine learning algorithms using a multitude of evaluation metrics and a significance test, reducing reporting bias.

III. THE INTELLIGENT TOOLKIT
In this section, we provide an overview of the toolkit's components and the performance results obtained by applying the toolkit to the CMAPSS datasets.

A. Overview Of Toolkit's Evaluation Methodology
Our toolkit is implemented in the Python programming language using the open-source Keras (with a TensorFlow backend) and Scikit-learn libraries, which provide high-level, efficient ML and neural network functions for fast implementation of ML models. The toolkit is released under the GNU General Public License v3.0 on GitHub¹, which enables researchers to freely use and contribute to it. Figure 1 illustrates a flowchart of how the toolkit can be used to evaluate existing and novel prognosis models. The toolkit first automatically checks whether a new model and dataset have been defined by the user. If they are, the toolkit compares the new model with the existing built-in algorithms (described in Section III-B) on the new dataset and evaluates its performance (using the metrics defined in Section III-C). Subsequently, a Mann-Whitney-Wilcoxon non-parametric test evaluates the statistical significance of the differences between the new model's results and those of the built-in algorithms.

B. Machine Learning Algorithms
The library consists of 13 machine learning algorithms, along with their respective optimised hyperparameters for all CMAPSS datasets. The values of the optimised hyperparameters can be found on our GitHub². These data-driven algorithms comprise linear (SGD), kernel (SVR), tree-based (ET, RF, Boosting, GBR, Adaboost) and deep neural network (DNN, CNN, LSTM, GRU) models. We choose these algorithms because they are among the most widely used machine learning methods in the intelligent prognostics community [7], [25], [8]. To reduce overfitting, we optimise their hyperparameters using random search and 10-fold cross-validation. Researchers using our toolkit are required to optimise the hyperparameters of their own models to reduce overfitting, and to evaluate their algorithms against the optimised models in the toolkit.
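As an illustration of this optimisation step, the sketch below uses Scikit-learn's RandomizedSearchCV with 10-fold cross-validation on an SVR. The search space, iteration budget and synthetic placeholder data are assumptions for demonstration only; the hyperparameter ranges actually used by the toolkit are those published on our GitHub.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVR

# Synthetic placeholder data standing in for prepared sensor features and RUL targets.
X_train = np.random.rand(200, 24)
y_train = np.random.rand(200) * 100.0

# Hypothetical search space for the SVR hyperparameters.
param_distributions = {
    "C": loguniform(1e-1, 1e3),
    "gamma": loguniform(1e-4, 1e-1),
    "epsilon": loguniform(1e-3, 1e0),
}

search = RandomizedSearchCV(
    SVR(kernel="rbf"),
    param_distributions=param_distributions,
    n_iter=50,                           # number of random configurations sampled
    cv=10,                               # 10-fold cross-validation, as in the toolkit
    scoring="neg_mean_absolute_error",
    random_state=42,
    n_jobs=-1,
)
search.fit(X_train, y_train)
best_model = search.best_estimator_      # optimised model used for benchmarking
```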
Furthermore, to illustrate the effectiveness of random search optimisation, we examine the validation loss of the CNN with and without optimisation (Figure 2). Without optimisation (purple line), the validation loss of the CNN shows an upward trend after approximately 20 epochs that continues throughout training, which indicates overfitting. With optimisation (cyan line), the validation loss decreases gradually across the epochs, showing no sign of overfitting.
² Hyperparameters are available at https://github.com/divishrengasamy/intelligenttoolkit-prognostic
Fig. 2: Reduction of overfitting in the CNN after optimisation, shown using the validation loss. The purple line is the validation loss before optimisation and the cyan line the validation loss after optimisation. The red and green lines represent the training loss before and after optimisation, respectively.
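For readers who wish to reproduce this kind of diagnostic for their own models, the following is a small, self-contained Keras sketch (not the toolkit's CNN) showing how training and validation loss curves such as those in Figure 2 can be obtained; the network architecture and synthetic data are placeholders.

```python
import numpy as np
from tensorflow import keras
import matplotlib.pyplot as plt

# Placeholder data standing in for prepared sensor features and RUL targets.
X = np.random.rand(500, 24)
y = np.random.rand(500) * 100.0

# Minimal regressor; any compiled Keras model can be monitored the same way.
model = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(24,)),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# A validation split lets us track generalisation during training.
history = model.fit(X, y, epochs=50, batch_size=32,
                    validation_split=0.2, verbose=0)

plt.plot(history.history["loss"], label="training loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.xlabel("epoch")
plt.ylabel("MSE loss")
plt.legend()
plt.show()
```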

C. Performance Evaluation Metrics
For performance evaluation, we implement the coefficient of determination (R²), Absolute Error (AE), Relative Error (RE), Root Mean Squared Error (RMSE), Mean Error (ME), Mean Squared Error (MSE), timeliness, Median Absolute Error (MdAE), Mean Absolute Deviation (MAD), symmetric Mean Absolute Percentage Error (sMAPE), training time, and test time. These are among the most commonly used algorithmic and computational performance evaluation metrics for regression problems, and are therefore well suited to RUL prediction [7], [25], [8].
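As an illustration, the sketch below implements a few of these metrics in plain NumPy. The constants in the timeliness (asymmetric scoring) function follow the convention commonly used with CMAPSS, but are assumptions with respect to the toolkit's exact implementation.

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root Mean Squared Error
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

def median_absolute_error(y_true, y_pred):
    # Median Absolute Error (MdAE)
    return np.median(np.abs(y_pred - y_true))

def smape(y_true, y_pred):
    # Symmetric MAPE in percent; epsilon guards against division by zero.
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    return 100.0 * np.mean(np.abs(y_pred - y_true) / np.maximum(denom, 1e-8))

def timeliness(y_true, y_pred, a1=13.0, a2=10.0):
    # Asymmetric exponential penalty commonly used for RUL scoring:
    # late predictions (d > 0) are penalised more heavily than early ones.
    # a1 and a2 follow the usual CMAPSS convention (assumed here).
    d = y_pred - y_true
    return np.sum(np.where(d < 0, np.exp(-d / a1) - 1.0, np.exp(d / a2) - 1.0))
```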

D. Results And Outputs From Intelligent Toolkit
We apply the 13 optimised ML algorithms from Section III-B to all CMAPSS datasets. Table IV shows the evaluation of the models' performance on the CMAPSS FD001 dataset, with the emboldened values representing the best-performing algorithm for each metric (the performance evaluation results on the CMAPSS FD002, FD003 and FD004 datasets are available on our GitHub³ due to the page limit). We observe some disagreement as to which algorithm performs best. For instance, CNN1D performs best on the timeliness, MAE and R² metrics, while GRU performs best on RE, MAD, AE, MdAE and RMSE. In addition, CNN2D achieves the best result on the sMAPE metric, while DNN leads on ME. This disagreement arises because the metrics evaluate different facets of performance, such as accuracy, precision, robustness and timeliness [23], and all of them are therefore required for a fuller understanding of model performance.
Evaluation metrics alone do not tell us whether an improvement in results is significant relative to other methods, owing to uncertainties from sensor readings and the combination of different sensors [26]. We therefore use the Mann-Whitney-Wilcoxon non-parametric test [27] at a 5% significance level to determine the statistical significance of one algorithm's results compared with the others'. We choose the Mann-Whitney-Wilcoxon test because it does not assume normality of the results' distributions. Figure 3 presents a heat map of p-values (%) for pairwise comparison of the MAE results of the 13 algorithms applied to the CMAPSS FD001 dataset. The heat map clearly shows that there is no statistically significant difference in performance among Extra Trees, Adaboost Regressor, Bagging Regressor, RF, SVR, GBR and KNN. Similarly, the performance of LSTM is not statistically different from that of CNN on the MAE metric. In addition, Figure 4 displays box plots of the algorithms' MAE results over 10-fold cross-validation to show the variability of model performance. We observe a high degree of variability for all models except CNN and GRU, and outliers are present for both SGD and SVR. This high variability in validation score for SGD, SVR, ET, RF, Boosting, GBR, Adaboost and LSTM indicates that these models poorly capture the degradation process from the sensor measurements.
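The pairwise comparison behind Figure 3 can be sketched with SciPy's mannwhitneyu as shown below; the algorithm names and per-fold MAE values are illustrative placeholders rather than toolkit outputs.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Placeholder per-fold MAE values for three algorithms (10-fold CV).
errors = {
    "CNN1D": np.array([12.1, 11.8, 12.5, 12.0, 11.9, 12.3, 12.2, 11.7, 12.4, 12.0]),
    "GRU":   np.array([11.9, 12.2, 11.8, 12.1, 12.0, 11.9, 12.3, 12.1, 11.8, 12.2]),
    "SVR":   np.array([15.4, 16.1, 14.9, 17.2, 15.8, 16.5, 15.0, 16.9, 15.6, 16.3]),
}

names = list(errors)
p_values = np.ones((len(names), len(names)))
for i, a in enumerate(names):
    for j, b in enumerate(names):
        if i != j:
            # Two-sided test: are the two MAE distributions significantly different?
            _, p = mannwhitneyu(errors[a], errors[b], alternative="two-sided")
            p_values[i, j] = p

# Entries below 0.05 indicate a statistically significant difference at the 5% level.
print(np.round(p_values, 3))
```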
This toolkit enables researchers to benchmark their novel methods against other optimised machine learning algorithms by producing a table of results over a wide variety of evaluation metrics and a heat map illustrating the statistical significance of those results. These outputs (the table and the heat map) provide a better understanding of how a researcher's novel method performs in comparison with existing state-of-the-art data-driven methods, supporting further research in the area. The toolkit nevertheless has some limitations: it does not support automatic preprocessing or hyperparameter optimisation for newly added models, i.e. it uses default hyperparameters for new models.

IV. CONCLUSION AND FUTURE WORK
In this paper we have developed an intelligent toolkit that allows researchers to evaluate their novel methods using a systematic, robust, fair and reproducible methodology. The toolkit has two main objectives: (1) to introduce an extensible, open-source data-driven toolkit for researchers, encouraging more systematic replication of data-driven prognostic models; and (2) to provide a robust methodology for the evaluation and comparison of novel methods. We implemented 13 existing state-of-the-art machine learning models for prognosis, 12 evaluation metrics, and a statistical assessment for a more robust evaluation and comparison of the models. We then validated our toolkit by applying it to the four CMAPSS datasets. The results show the advantage of using diverse evaluation metrics, as the relative performance of the algorithms varies across metrics. The wide variety of evaluation metrics and data-driven prognostic algorithms in our toolkit thus provides a deeper understanding of how well a novel model predicts the remaining useful life of a component. For future work, we plan to extend the library to include benchmarking for fault and anomaly detection, providing a better overview of machine health monitoring. Finally, we intend to apply our toolkit to additional prognostic datasets from other domains to test its robustness.
ACKNOWLEDGEMENT
This work is funded by the INNOVATIVE doctoral programme. The INNOVATIVE programme is partially funded by the Marie Curie Initial Training Networks (ITN) action (project number 665468) and partially by the Institute for Aerospace Technology (IAT) at the University of Nottingham.