The need for operational reasoning in data‐driven rating curve prediction of suspended sediment

The use of data‐driven modelling techniques to deliver improved suspended sediment rating curves has received considerable interest in recent years. Studies indicate an increased level of performance over traditional approaches when such techniques are adopted. However, closer scrutiny reveals that, unlike their traditional counterparts, data‐driven solutions commonly include lagged sediment data as model inputs, and this seriously limits their operational application. In this paper, we argue the need for a greater degree of operational reasoning underpinning data‐driven rating curve solutions and demonstrate how incorrect conclusions about the performance of a data‐driven modelling technique can be reached when the model solution is based upon operationally invalid input combinations. We exemplify the problem through the re‐analysis and augmentation of a recent and typical published study, which uses gene expression programming to model the rating curve. We compare and contrast the previously published solutions, whose inputs negate their operational application, with a range of newly developed and directly comparable traditional and data‐driven solutions, which do have operational value. Results clearly demonstrate that the performance benefits of the published gene expression programming solutions are dependent on the inclusion of operationally limiting, lagged data inputs. Indeed, when operationally inapplicable input combinations are discounted from the models and the analysis is repeated, gene expression programming fails to perform as well as many simpler, more standard multiple linear regression, piecewise linear regression and neural network counterparts. The potential for overstatement of the benefits of the data‐driven paradigm in rating curve studies is thus highlighted. Copyright © 2011 John Wiley & Sons, Ltd.


INTRODUCTION
In this paper, we highlight arguments surrounding the operational applicability of data-driven discharge/ suspended sediment rating curve solutions that use lagged data as inputs. This is an issue that has received no comprehensive attention despite more than 10 years of data-driven, suspended sediment modelling research activities. Moreover, it is a highly significant issue as the development of any model that cannot actually be used inevitably raises questions about the value of the modelling that is being carried out and the approaches that are being used to do it. We define the operational applicability of a data-driven solution simply as the extent to which its input and parameter requirements limit its application with regard to common operational tasksa definition which is distinct from a solution's functional performance (i.e. a poorly performing solution may be operationally applicable, whereas a better performing solution is not). Indeed, this definition conforms to others, which emphasise the need for solutions rather than the search for hydrological knowledge per se (Wilby and Davies, 1997;p. 195). In this study, we repeat and augment the analysis of a typical, recently published datadriven sediment rating curve problem and, in so doing, exemplify how solutions that have no operational application, may nonetheless be evaluated as offering better functional performance than their operationally applicable counterparts. This can result in an overstatement of the data-driven paradigm's ability to provide better solutions to end-users. To this end, we replicate the basic modelling methodology of Aytek and Kisi (2008), who developed a set of data-driven sediment rating curve solutions for the Tongue River, USA, and augment this with a range of new, operationally applicable solutions for comparison. By repeating previously published methods, we include certain methodological weaknesses that we identify in the original work and which are highlighted at appropriate points in the text. However, these inclusions are justified by the comparative nature of the study, which requires the exact re-application of the original methodological approach if the new, additional solutions are to be adequately compared with the originals. The aim of this study is, therefore, not to develop a comprehensive critique or re-analysis of the original modelling methodology but to re-apply it so that an illustrative case study emerges, which demonstrates the importance of including both functional performance assessment and a higher-level rationale for the operational application of data-driven rating curves solutions. This paper is the third in a series by the authors that critically evaluates different aspects of data-driven methodologies used in suspended sediment modelling. In Mount and Abrahart (2011a), we examine how data series should be delivered to the modelling process (i.e. as concentration or load/logged or unlogged) and the impact that decision has on the form and consistency of outputs. In , we examine how contextual hydrological and dataset knowledge can be incorporated into the process of developing and selecting data-driven suspended sediment models and call for an end to blind model justification on the sole basis of goodness-of-fit metrics.
In recent years, practical opportunities for the implementation of data-driven modelling mechanisms have caught the attention of the hydrological modelling community, and this has included efforts to develop superior discharge/sediment rating curve solutions. The use of a data-driven model (DDM) in this way has been described as advanced curve fitting, in which the form of the response function is unrestricted by the a priori constraints imposed in conventional, empirical models (Mount and Abrahart, 2011a). Satisfactory combinations of simplicity and predictive power in both single-input single-output and multi-input single-output versions have been reported (cf Kisi, 2005;Zhu et al., 2007). The value of such models in suspended sediment forecasting and hindcasting has long been recognised, as most catchments lack the comprehensive suspended sediment monitoring equipment from which to quantify suspended sediment directly. Consequently, discharge-driven rating curves are often used as operational tools for estimating short and long-term suspended sediment yield in catchments and for interpolating missing values in suspended sediment records (e.g. Stott and Mount, 2007). However, the risks associated with their use, especially for estimating high-magnitude/low-frequency data, should not be underestimated (Walling, 1977;Walling and Webb, 1988).
In the data-driven paradigm, one or more input data sets that are presumed useful predictors of suspended sediment, are interrogated using machine learning and artificial intelligence algorithms with the optimal form of the response function being learned directly from patterns and structures that are present in the data sets. A comprehensive review of data-driven approaches in hydrology can be found in Solomatine et al. (2008). The standard conceptual and practical modelling processes used by data-driven modellers are shown in Figure 1, with an ordered set of more abstract reasoning elements informing the practical workflow of the modeller. Conceptual reasoning is undertaken to formulate the overall concept and goal of the modelling activity (e.g. process knowledge discovery versus curve fitting) and to determine the appropriateness of a data-driven versus knowledge-driven approach (Dubois et al., 2000). It also determines the extent to which evidence of causal hydrological relationships should be captured in the suspended sediment model's response function and predictor set selections and, therefore, the conceptual set of predictive drivers for which data will be required.
Procedural reasoning informs the decisions governing data set acquisition and any processing procedures applied to optimise the model's predictive power and Figure 1. The reasoning and practical modelling processes most commonly utilised in data-driven suspended sediment studies 3983 robustness. In a practical sense, it guides how each predictor is objectified as a concrete data set that can be delivered to the model as an input. This includes any raw data manipulations and transformations applied to improve the model's predictive power or robustness, such as temporal lagging (Ciğizoğlu, 2002;Ciğizoğlu and Kisi, 2006;Aytek and Kisi, 2008;Partal and Ciğizoğlu, 2008;Cobaner et al., 2009;), log-transformation (Alp and Ciğizoğlu, 2007;Mount and Abrahart, 2011a) and unit conversion to remove spurious correlation (McBean and Al-Nassri, 1988;Annandale, 1990;McBean and Al-Nassri, 1990a, b;Milhous, 1990;Nordin, 1990;Wahl, 1990). Functional reasoning directs the modeller's identification of an optimal data-driven modelling algorithm or technique, and/or the identification and selection of the optimum combination of input data sets delivered to it. In practice, the approaches tend to be rather ad hoc and partial in scope. The array of possible input combinations are seldom modelled exhaustively, and preferred algorithms and/or input combinations are usually identified through reference to one or more goodness-of-fit metrics.
Parallels can be drawn between the reasoning elements of data-driven suspended sediment models detailed in Figure 1, and many of those underpinning physically based hydrological models. However, at present, the reasoning used to inform the development of DDMs is too heavily biased towards functional elements. Indeed, papers assessing the relative performance of different data-driven algorithms, techniques and model configurations under different modelling scenarios are far more numerous than those which examine the influence that the modeller's reasoning processes may be having on the usefulness or correctness of their assessments, or the physical interpretation of their resultsissues that have received far more attention from physically based modellers (cf Beven, 1989). As a consequence, a number of key criticisms of the data-driven paradigm that apply to suspended sediment modelling remain largely unaddressed including a general (1) lack of justification for the model configuration and response function structures used (Minns and Hall, 1996;Babovic, 2005;Solomatine et al., 2008); (2) lack of established procedures to ensure robust model configurations and outputs (Abrahart et al., 2008a;Mount and Abrahart, 2011a); (3) lack of justification for using more complex, datadriven response functions over simpler empirical counterparts (Mount and Abrahart, 2011b) (4) over-reliance on simplistic, goodness-of-fit metrics in the identification of preferred models (Legates and McCabe, 1999;. (5) lack of operational applications for the model, i.e. can the resultant model actually be applied for an operational purpose? (Abrahart et al., 2008b;. The last of these is particularly important as, without a justification of its operational applicability, the practical value of a model becomes unclear and the conceptual reasoning underpinning the modelling goal becomes open to question. The problem is well exemplified by the publication of numerous suspended sediment studies in which the 'best' model input configuration incorporates current discharge together with some mix of lagged discharge and lagged sediment records, e.g. Ciğizoğlu (2002); Ciğizoğlu and Kisi (2006); Aytek and Kisi (2008); Partal and Ciğizoğlu (2008); Cobaner et al. (2009);.
The widespread use of lagged sediment as a predictor has, to date, been supported by functional reasoning, which asserts that DDMs produce better goodness-of-fit metrics if such inputs are used (e.g. Kisi, 2005;Aytek and Kisi, 2008). However, it is important to recognise that the use of past sediment records as an input to the modelling process not only reduces the problem to something approaching a near-linear modelling operation (in which case the need for using a DDM becomes questionable anyway), but also makes little operational sense. This is because a full set of observed measurements of suspended sediment is needed as an input to the model for the period being modelled minus the lag. Consequently, there would be no need to model the suspended sediment record as it would already, by definition, be known at least up until the period of the lag. Similarly, extrapolation of suspended sediment outputs beyond the period for which suspended sediment values are known is not supported. Indeed, problems even remain if the prediction of sediment response is made only within the period of the lag. This is because the standard inclusion of a current discharge input means that the related sediment response would have happened before it could be modellednegating the value of the prediction. Finally, model solutions cannot be transferred to similar rivers or to different periods, if no observed suspended sediment records are available for use as inputs to the sediment prediction model. The supposed advantages of both implicit and explicit DDMs are thus negated on several counts because the resultant models and/or equations are restricted to the original stations for which developmental and operational sediment data must be collected, and transferring uniquely-optimised solutions to different reaches or catchments makes neither conceptual nor operational sense.
Given the importance of such issues, the explicit inclusion of an operational reasoning element to inform the modeller about how a particular data-driven model configuration may restrict its operational value would appear essential. To this end, Figure 1 can be amended so that operational reasoning is explicitly included (Figure 2), and model input configurations, which result in models of little operational value, can be identified and rejected prior to any functional efforts to identify 'best' or 'preferred' models.
In this paper, we demonstrate how incorrect conclusions about the performance of a data-driven modelling technique can be reached when the model solution is based upon operationally invalid input combinations. This is achieved through a re-appraisal of two gene expression programming (GEP: Ferreira, 2001;2006) models developed for the Tongue River, USA, originally published by Aytek and Kisi (2008). Each model, developed on operationally invalid input combinations, is compared with a range of new GEP solutions, developed in an identical manner to the originals but with only operationally valid inputs included. In addition, we develop a range of new, operationally valid linear and non-linear model counterparts against which the overall performance of GEP solutions can be assessed. Consequently, using the context of the Tongue River, this paper examines (1) the impact of removing operationally invalid input combinations on the explicit outputs and goodness-offit metrics obtained by the GEP technique and (2) the extent to which GEP maintains its functional advantages over other increasingly sophisticated modelling approaches when operationally invalid model configurations are excluded from the analysis.
The paper progresses with a review of the Tongue River data sets, a re-examination of the input combinations used by Aytek and Kisi (2008) and identification of those combinations that lack operational application. We then repeat Aytek and Kisi's original analysis using only the input combinations deemed to be of operational value. The results from the original work and those obtained from the re-analysis are then compared, and the impact of rejecting the operationally invalid input combinations is highlighted. Finally, we compare the results obtained from the re-analysed GEP approach and those obtained from traditional log-log sediment rating curve, linear regression, piecewise linear regression and non-linear neural network (NN) counterparts.

Overview
The Tongue River has been a focus of several recent papers examining data-driven modelling approaches in hydrology (Kisi, 2004;Aytek and Kisi, 2008;Guven and Kisi, 2010). Of particular importance to this study is the paper by Aytek and Kisi (2008) in which an explicit formulation of the sediment-discharge relationship in the Tongue River is developed using GEP. The Tongue River watershed encompasses some 13 983 km 2 of Wyoming and Montana (Figure 3). This river is a tributary of the Yellowstone River; it rises in the Big Horn Mountains of Wyoming, flows through northern Wyoming and southeastern Montana and then empties into the Yellowstone River at Miles City, Montana. The river is 396 km long, and the watershed is for the most part rural. From elevations of 2400-3000 m, it drops to low, rugged mountains and badlands. Below the mountains, the stream runs through a long, narrow valley confined by high bluffs and terraces. HydroSolutions Inc. (2008) report that streamflow is driven by precipitation, although the relationship is complex. Variations in the pattern and timing of precipitation over the basin, and lag time between snowfall and snowmelt are some of the complicating factors. The river is fed by winter snow pack from the higher elevations of the Big Horn Mountains, by early snow runoff in the lower elevations of the drainage basin, and by ground water springs. The river rises in March and April due to snowmelt in the lower elevations, and again in June, as summer weather melts the higher elevation snow pack. In the plains region, with elevations from 900 to 1800 m above mean sea level, annual average precipitation ranges from 250 to 350 mm, and rainfall is the more dominant form. Average monthly precipitation is greatest from April through September, and maximum temperatures occur in July, whereas minimum values occur in January. About 75% of the annual precipitation falls as rain during April-September. May and June are usually the wettest months of the year.
In accordance with standard rating curve approaches, two datasets were used in Aytek and Kisi (2008). They comprise daily time series of streamflow (Q in m 3 s À1 ) and suspended sediment (S in ton day À1 ) records for an upstream station (UPST: station no: 6307830, situated below Brandenberg Bridge near Ashland) and a downstream station (DNST: station no: 6308500 at Miles City) on the Tongue River in Montana, USA. The reported upstream drainage area for UPST is 10 521 km 2 ; for DNST, it is 13 932 km 2 .

Data subsets for model training and testing
In the subsequent analyses, we replicate the data acquisition, processing and sub-setting procedures outlined in Aytek and Kisi (2008). Every available record for the two test stations was downloaded from the US Geological Survey web server (http://webserver.cr.usgs. gov/sediment) on 1 April 2011. The full set of downloaded material comprised daily records for 1 October 1975-30 September 1981 at UPST and for 31 August 1977-4 December 1985 at DNST. Modelling was performed on a common overlapping subset of the full available record: a 3-water-year period that spanned 1 October 1977-30 September 1980 (Set A) that was used  Inc., 2008) for model development/training purposes, and a 1-wateryear model testing period that spanned 1 October 1980-30 September 1981 (Set B) that was used for model testing. The record for 31 August 1977 was also used in the modelling processes to fill a missing record at tÀ1 for 1 September 1977. Time series plots and scatterplots for the UPST and DNST split sample datasets are provided in Figure 4.
The statistical parameters of daily stream flow and daily sediment for both stations are given in Table I. The discharge and suspended sediment load period for both stations, particularly for the training period (Set B), is observed to be highly variable, highly peaked and highly skewed. Large magnitudinal differences in sediment, but not in discharge, are observed to exist between the two gauging stations. Such differences reflect their relative positions in the catchment with regard to contributing sources. Large magnitudinal differences also exist between Set A (training data) and Set B (testing data) at each station such that final model assessment is performed on what equates to a minor, limited and somewhat unrepresentative fraction of a larger set of processes appertaining to that region. The largest events are part of the development period, thus sidestepping the need to perform difficult extrapolation tasks for peak sediment loads during testing, but also increasing the probability of overestimating low(er) sediment values. Similarly, the smallest magnitude events are unevenly distributed amongst the development and testing datasets. The minimum values for Set A at DNST are higher than those of its corresponding testing dataset, which could cause downward extrapolation difficulties in estimating lower sediment values. Figure 4 identifies a number of important outliers in the dataset, arising from some common set of non-linear processes that a DDM might be expected to encapsulate, and which should not simply be interpreted as measurement or sampling oddities. These extreme values have their origins in snowmelt processes that are unlikely to be captured by DDMs trained on relatively short periods and, in particular, on records which do not include local climatic and/or meteorological predictors. Indeed, Figure 4 shows low flow conditions, interrupted in March and April by lower elevation snowmelt, and again in June by higher elevation snowmelt. This produces high discharges, sediment flushing and associated large clockwise hysteresis loops. The result is that at DNST, two observed suspended sediment load values exceed 80 000 ton day À1 , whereas the remaining sediment values are below 50 000 ton day À1 . Similarly, at UPST, one observed suspended sediment load value exceeds 27 000 ton day À1 , whereas the other values are below 20 000 ton day À1 . Table II, which provides a cross-correlation matrix for Set A and Set B, also reveals potential problems. Higher correlation coefficients for all variable pairs are reported in Set B, in comparison to Set A, and this likely leads to artificially high testing scores. Moreover, the highest correlation against S is S tÀ1 and the correlation between S and S tÀ1 is much higher for Set B compared with Set A.
Consequently, there is a danger that autocorrelation in the predicted dataset is confused with causation, resulting in poorly constructed arguments for the inclusion of past sediment inputs as drivers.

MODEL INPUT CONFIGURATION SELECTIONS
In Aytek and Kisi's original paper, models are developed on seven input combinations comprising Q t , Q tÀ1 , S tÀ1 and S tÀ2 (Table III), although the remaining eight input combinations that are theoretically possible are not included in the analysis, and a justification for their lack of inclusion is not provided. Thus, conclusions drawn in the study should not be generalised but considered as observations, specific only to, and valid in, the particular cases which are listed (Aksoy et al., 2007). When operational reasoning is applied and the use of lagged suspended sediment inputs is thus discounted, the candidate input set is reduced to Q t and Q tÀ1 , resulting in only three possible input combinations: The first combination equates to concurrent input-output modelling in a similar manner to that of a traditional sediment rating curve approach. The second is, in effect, a lagged version of the traditional rating curve, and therefore, one would not expect its performance to exceed that of Q t . Consequently, we have discounted its use in this study, again on grounds of operational reasoning (the operational value of an inferior model configuration is difficult to justify). The third combination was not considered in the original paper despite its clear potential for improved modelling where complex hysteresis may exist in the record. Consequently, we do include it in our subsequent analyses.

TONGUE RIVER RE-ANALYSIS: MODELLING APPROACHES
In their original paper, Aytek and Kisi (2008) conclude that 'the results obtained with GEP models are better than those obtained using the conventional rating curve and multiple linear regression techniques'. They also assert that 'the results suggest that GEP may provide a superior alternative to the sediment rating curve and multiple linear regression techniques' (p. 297). Similar conclusions remain in more recent studies (cf Guven and Kisi, 2010). However, these conclusions are based on comparisons between rating curve and regression solutions and the best performing GEP solution, which is developed on operationally inapplicable input combinations (i.e. [Q t , Q tÀ1 , S tÀ1 ]). Consequently, there is a need to assess the relative performance of GEP solutions developed on operationally applicable input combinations with i) Aytek and Kisi's best performing GEP solution and ii) other operationally applicable linear and non-linear solutions.
To that end, Aytek and Kisi's preferred GEP solutions (Figures 3 and 5 and Equations 7 and 14 from the original paper), which were developed using inputs [Q t , Q tÀ1 , S tÀ1 ] were: (1) re-modelled with two new GEP solutions developed on inputs [Q t ] and [Q t , Q tÀ1 ], using the same basic methodology and settings used by Aytek and Kisi (2008) (2) re-modelled with three additional statistical and data-driven approaches that provide a range of linear and non-linear counterparts against which the relative performance of GEP solutions can be assessed.
The modelling approaches are detailed in the following paragraphs. Models developed on one input [Q t ] are labelled 1 models developed on two inputs [Q t , Q tÀ1 ] are labelled 2.    Aytek and Kisi (2008) were replicated and, unless their method indicated otherwise, default settings were adopted. However, some apparent inconsistencies in the methodological descriptions in their paper meant that some minor assumptions had to be made. Specifically, the original paper (p.290) reported that the function set used contained four basic arithmetic operators (+, À, *, /) and six basic mathematical functions (√, ln(x), log(x), e x , 10 x , power)yet, their final models, inter alia, contained abs(), x 2 and 3 √ functions. Similarly, the original paper (p.290) reported that the fitness function used was 'Absolute Error with Selection Range' yet, mean absolute error is subsequently reported as the fitness function on p. 291. Because neither issue could be resolved from the original paper, our method follows what is documented in the methodology on p.290 of Aytek and Kisi's paper. It should be noted that Aytek and Kisi's choice of the fitness function settings (SR = 100, p = 0.1) is of importance because major events that deliver large errors on specific models could fall outside the selection range (SR) and would therefore be treated as outliers and excluded from the fitness landscape that is used to evolve the solutions. Consequently, snow melt instances are not likely to be properly accommodated during the modelling process. Finally, no 'stopping condition' was reported in the original paper, so we opted to stop at 10 000 generations. Our settings are listed in Table IV.

Traditional modelling counterparts
Aytek and Kisi (2008) provided a bias-corrected (Ferguson, 1986) single-input single-output log-log sediment rating curve solution (e.g. Asselman, 2000) as a benchmark comparison for their transparent GEP  solutions. We include this counterpart in our analysis, accepting that it represents an important baseline comparator against which the potential benefits of more

Linear regression counterparts
Linear models (Raghuwanshi et al., 2006) should be used as benchmarks against which non-linear data-driven modelling methodologies are tested to establish the extent to which the numerical relationship that is presented for modelling is linear or near-linear. Without such comparators, it is impossible to properly justify the need for the application of more complex modelling methodologies, irrespective of theoretical arguments about the established nature of the underlying scientific process that is of interest See, 2007a, 2007b). Therefore, linear modelling solutions should always be used to (i) test the numerical strength of empirical relationships in each dataset and (ii) identify the nature and extent of residual non-linearities that might necessitate the adoption of a different sort of additional model to represent such factors (Curry and Morgan, 2003). For each station, as a baseline linear comparator, ordinary least squares (OLS) linear regression models were developed. Two different models were produced, providing S t output on inputs [Q t ] and [Q t , Q tÀ1 ], named OLS1 and OLS2, respectively. Equation parameters for the OLS regression counterparts are provided in Table V. In the present study, the 'Waikato Environment for Knowledge Analysis' Data Mining Toolkit v3.6 (WEKA: Witten and Frank, 2005) was also used to derive M5 Model Tree (M5MT: Quinlan, 1992) piecewise linear regression predictions of S t on inputs [Q t ] and [Q t , Q tÀ1 ], termed M5MT1 and M5MT2. The M5MT algorithm splits the input data into non-intersecting regions and thereafter fits a linear regression model to each of the data subsets. The size of the regions progressively narrows, producing increasingly complex piecewise models. Exhaustive search is used to examine all possible splits to select the one that delivers maximum reduction in statistical variance for the resulting models, and the splitting process terminates when no significant progress is achieved via further splitting. M5MT has been used to predict harbour basin sedimentation in the Port of Rotterdam (Bhattacharya and Solomatine, 2006), to model bed-load and total-load sediment transport in alluvial channels (Bhattacharya et al., 2007) and for modelling suspended sediment load in rivers (Janga Reddy and Ghimire, 2009). Upstream M5MT solution parameters are presented in Table VI, with downstream solution parameters given in Table VII.

Non-linear NN counterparts
Numerous different types of NN exist with feedforward solutions that are trained using the back propagation of error algorithm, providing one of the most widely used data-driven approaches in hydrological modelling. These include applications for sediment yield estimation (Abrahart and White, 2001). They therefore represent an established non-linear modelling benchmark against which more novel approaches can be compared.
NN models producing S t output on inputs [Q t ] and [Q t , Q tÀ1 ], named NN1 and NN2, were developed for each station using an in-house programme, written in Pascal, that has delivered sound performance on a number of previous occasions using similar settings, e.g. Dawson et al. (2002Dawson et al. ( , 2006. Ten different architectures were used to model each station: each setup comprising 1:{1,5}:1 and 2:{1,5}:1 configurations, in which the number of hidden units {1,5} used in a particular model is indicated by means of brackets, e.g. NN1(1). In all cases, prior to modelling, each dataset was standardised in a linear manner to a common range {0.1-0.9}. Each network configuration was thereafter trained using the traditional 'back propagation of error with momentum' algorithm. The algorithm parameters (objective function = sum squared error; learning rate = 0.1; momentum factor = 0.9; number of epochs = 20 000) are commensurate with the relatively simple NN architecture being used in this paper. Each model was assessed on root mean squared error for the training and testing datasets at 1000 epoch intervals such that a preferred solution could be selected. These configurations mirror those adopted in other hydrological modelling tasks of similar scope and scale.

EVALUATION METRICS
It is standard practice to assess the relative performance of a given model via reference to one or more quantitative evaluation metrics. However, the selection and use of these statistics is often problematic and it is sometimes difficult for readers or users to interpret how well a particular model reproduces the observed dataset or how well a model compares with other models (American Society of Civil Engineers, 1993; Legates and McCabe, 1999). To provide a transparent comparison with past modelling efforts, this paper includes identical metrics to Aytek and Kisi (2008) namely root mean square error (RMSE) and R-squared (RSqr). These two metrics are strongly influenced by a model's replication of lower frequency, medium and high magnitude events in the data. Although these comprise a relatively small proportion of the observations in the data set, they are arguably some of the most important events in an operational context as they may be responsible for delivering much of the sediment yield of a catchment. All evaluation statistics were generated using HydroTest (http:// www.hydrotest.org.uk): a standardised, open-access web site that performs the required numerical calculations (Dawson et al., 2007(Dawson et al., , 2010. This service supports a broad spectrum of quantitative tests and provides descriptive statistics for the comparison of observed and predicted datasets. It also documents the underlying equations upon which the calculations are based.

COMPARISON OF GEP SOLUTIONS
One of the key benefits of GEP is the production of an explicit set of parse tree diagrams, which provide an explicit documentation of the model solution. These can then be used to re-apply the solution to other data, or can be used as the basis for comparison to other GEP solutions. GEP expression tree diagrams for Aytek and Kisi's (2008) UPST solution, together with those for UPST-GEP1 and UPST-GEP2 are presented in Figure 5. Observed versus predicted plots for training and testing periods for each sub-expression, together with the combined response function, are presented in Figures 6  and 7. By plotting individual sub-expressions, it becomes possible to identify the importance of the contribution that each makes to the combined response function. Moreover, as not all inputs are necessarily utilised in each subexpression (e.g. see Aytek and Kisi Sub-ET 1 where lagged sediment is not included), it becomes possible to determine whether all inputs are implicated as main drivers of the combined response function.
The patterns presented in Figures 6 and 7 demonstrate that for all GEP solutions, one dominant sub-expression comprises the majority of the combined response function, with the other two sub-functions providing small, local adjustments. In the case of Aytek and Kisi's solution, Sub-ET 3 is the dominant sub-function and models the global trend in the data, particularly at the lower data ranges. The non-dominant expression tree Sub-ET 2 can be seen to have little influence on lowerrange data but particular influence on the upper-range data, where it acts to increase their values. Sub-ET 1 displays a similar pattern to Sub-ET 2 but in a much subdued manner. Importantly, both Sub-ET 3 and Sub-ET 2 include S tÀ1 as an input, and in so doing, lagged sediment is implicated as an important driver of their solution, particularly for modelling upper-range data. In the case of GEP1 and GEP2, Sub-ET 1 acts as the dominant sub-function with the other two providing negligible inputsuggesting that three sub-expression trees in solutions based solely on Q t and Q tÀ1 is unnecessarily complex. The inclusion of Q tÀ1 in Sub-ET 1 of GEP2 indicates that, when made available, lagged discharge is selected as an important driver of the model.
The solutions for the DNST gauge (Figures 8-10) follow a similar basic pattern to those for UPST, with a single sub-expression capturing much of the global trend in the data and, in so doing, comprising the majority of Figure 6. A + K (Aytek and Kisi, 2008), GEP1 and GEP2: analysis of GEP sub-expression and GEP combined function Set A (training period) outputs for UPST the combined response function in all solutions. Again, lagged sediment is included as an input to the dominant sub-expression of Aytek and Kisi's solution and to Sub-ET 2, both of which act to increase the two highest data points. Sub-ET 1 contributes little to Aytek and Kisi's combined response function, again indicating that the use of three sub-expressions is more complex than the modelling problem demands. GEP1 and GEP2 are seen to perform comparatively poorly, especially in their ability to model the two high magnitude data points. As in the UPST models, a single sub-expression tree is responsible for almost all of the response function. Evaluation metrics for the models are presented in Table VIII. The explicit solutions published by Aytek and Kisi (2008) make it possible to compute the full range of evaluation metrics, for both Set A and Set B data, despite the fact that only Set B results are reported in the original paper. In all cases, Aytek and Kisi's solution outperforms GEP1 and GEP2. The inclusion of Q tÀ1 in GEP2 is seen to enhance its performance compared with GEP1 but without the inclusion of a lagged sediment input; the evaluation metrics for GEP1 or GEP2 fail to equate to those of Aytek and Kisi's solution by a substantial margin.
The expression tree plots highlight the critical importance of lagged sediment in adequately modelling high magnitude instantaneous suspended sediment values. This is reflected strongly in the evaluation metrics, which are particularly influenced by the effectiveness with which upper-range data are modelled. Both RMSE and RSqr values indicate Aytek and Kisi's lagged suspended sediment model as performing consistently, and considerably better, especially in upper-range data. However, when operational reasoning is incorporated, and lagged suspended sediment is rejected as an input, the ability of GEP to model suspended sediment is seen to be significantly limited, with RMSE and RSqr statistics reduced. This therefore raises the important question of whether GEP maintains its performance advantage under such a scenario, a question which is addressed through the subsequent comparison of GEP1 and GEP2 solutions against a range of linear and non-linear counterparts.

COMPARISON OF GEP AND COUNTERPART SOLUTIONS
Comparison of GEP1 and GEP2 solutions to (i) traditional counterparts, (ii) linear counterparts and (iii) Figure 9. A + K (Aytek and Kisi, 2008), GEP1 and GEP2: analysis of GEP sub-expression and GEP combined function Set A (training period) outputs for DNST non-linear counterparts are presented in Tables IX, X and XI, XII, respectively. Comparisons between GEP1, GEP2 and BCLM indicate mixed results (Table IX). Metrics for the UPST gauge indicate that GEP2 offers a generally better solution than either GEP1 or BCLM, with it resulting in the best metric scores for all but Set B RMSE. Scores from the more complex Set A data highlight significant improvement in RMSE when GEP solutions are applied, irrespective of whether Q tÀ1 is included as an input, whilst RSqr scores are roughly comparable across the models. This improvement in RMSE is, however, not observed in the Set B data and there is relatively little difference between GEP1, GEP2 and BCLM scores. This is probably a result of set B lacking the outliers evident in set  (Table X), the evaluation metrics provide a clear picture. GEP performs similarly to its OLS counterparts with metrics for GEP2 and OLS2 (both including Q tÀ1 as an input), indicating that simple multiple linear regression provides a better model than the more complex GEP solution. This finding is in clear contradiction to Aytek and Kisi's (2008) paper, which identified the multiple linear regression solution as having inferior metric scores compared with its GEP counterpart. Whilst the advantages of a piecewise linear solution based solely on [Q t ] (M5MT1) is minimal, the piecewise solution based on [Q t , Q tÀ1 ] (M5MT2) has the best metric scores for both UPST and DNST gauges and for both Set A and Set B data. These results therefore indicate that either a multiple linear regression or a piecewise linear solution incorporating lagged discharge is preferable to its counterpart GEP solution.
The message emerging from our NN non-linear findings (Tables XI and XII) is not as well defined as that delivered by the linear modelling solutions (Tables IX  and X). For models based solely on [Q t ] (Table XI), a general pattern of marginal improvement in metric scores is evident in the NN solutions with those of NN1(3), suggesting it to be the best overall model for Set A data at the UPST and DNST gauge. For Set B data, the picture is less clear with GEP1 providing the best RSqr scores and NN1(2) providing the best RMSE scores at both UPST and DNST gauges. For solutions based on Q t and Q tÀ1 (Table XII), NN solutions result in better metric scores than their GEP counterparts, although it is clear that different NN configurations perform differently at UPST and DNST gauges, and for Set A and Set B data. Those NN solutions with high numbers of hidden units (NN2(4) and NN2 (5)) have the best metric scores on Set A data, and this reflects the relative complexity of that data set, which is best modelled by a more complex NN configuration. The more simple Set B data are well modelled by NNs with lower numbers of hidden units, particularly at the DNST gauge where the relationship between discharge and suspended sediment is least complex. It is important to note that the higher the number of hidden units in an NN, the greater the chance of overfitting. Normally, this is detected by ensuring a comparable degree of fit for the solution to training and testing data. However, this relies on the training and testing data sets containing mutually representative statistical and temporal patterns. The data split used by Aytek and Kisi (2008), and replicated here, has resulted in data sets that are not particularly representative of one another (Table I and Figure 4), and it is therefore difficult to discount overfitting in those NN solutions with higher numbers of hidden units. However, the fact that even the NN solutions with low numbers of hidden units (i.e. those less likely to be overfitted) result in better metric scores than their GEP counterparts is strong evidence of the preferential solutions delivered by NN approaches. In the context of the operationally applicable models for the     (Table XIII) and, in so doing, offer an operationally useful alternative with only minimal reduction in evaluation statistics. Of particular note in this regard are NN and piecewise linear regression solutions based on [Q t , Q tÀ1 ]. Whilst NN solutions lack the explicit outputs of their GEP counterparts and may be prone to overfitting, M5MTs have the advantage of being both explicit and robust.

CONCLUSIONS
This study highlights the fact that input configurations, which include lagged sediment, and which are commonly used in data-driven suspended sediment rating curve models, lack operational justification and have little operational value. Their inclusion in future studies should, therefore, either be explicitly justified with respect to the conceptual purpose of the modelling exercise, or rejected. It also highlights how sensitive the performance of GEP modelling approaches are to their input configurations in the case of both example data sets. These two factors support the authors' call for the inclusion of a proper appraisal of the operational applicablity of different input configurations to a data-driven modeller's workflow before any functional evaluations based on evaluation metrics are madean element that is currently missing.
The original input configurations used by Aytek and Kisi (2008) in their analysis of the Tongue River lacked operational application because of their inclusion of lagged sediment as an input. Their GEP models did indeed produce improved evaluation metric scores. Therefore, on the basis of the purely functional reasoning employed in their study, their claims that GEP 'may provide a superior alternative to the sediment rating curve and multiple linear regression techniques' are supported. However, the re-analysis presented here clearly demonstrates that, once operational reasoning is applied and operationally inapplicable input combinations are discounted, GEP fails to perform as well as simpler, more standard multiple linear regression, piecewise linear regression and NN counterparts. Indeed, the reported superiority of GEP in the Tongue River is shown to be dependent on the inclusion of an operationally invalid lagged sediment input.
The conclusion of the study is therefore clear. Operational reasoning should be an essential element in all data-driven suspended sediment modelling workflows to ensure that: (1) the resultant models can actually be used for operational purposes and (2) the performance of complex data-driven solutions are not overstated in comparison to their simpler model counterparts.