Fuzzy Integral Driven Ensemble Classification using A Priori Fuzzy Measures

Aggregation operators are mathematical functions that enable the fusion of information from multiple sources. Fuzzy Integrals (FIs) are widely used aggregation operators, which combine information in respect to a Fuzzy Measure (FM) which captures the worth of both the individual sources and all their possible combinations. However, FIs suffer from the potential drawback of not fusing information according to the intuitively interpretable FM, leading to non-intuitive results. The latter is particularly relevant when a FM has been defined using external information (e.g. experts). In order to address this and provide an alternative to the FI, the Recursive Average (RAV) aggregation operator was recently proposed which enables intuitive data fusion in respect to a given FM. With an alternative fusion operator in place, in this paper, we define the concept of ‘A Priori’ FMs which are generated based on external information (e.g. classification accuracy) and thus provide an alternative to the traditional approaches of learning or manually specifying FMs. We proceed to develop one specific instance of such an a priori FM to support the decision level fusion step in ensemble classification. We evaluate the resulting approach by contrasting the performance of the ensemble classifiers for different FMs, including the recently introduced Uriz and the Sugeno λ-measure; as well as by employing both the Choquet FI and the RAV as possible fusion operators. Results are presented for 20 datasets from machine learning repositories and contextualised to the wider literature by comparing them to state-of-the-art ensemble classifiers such as Adaboost, Bagging, Random Forest and Majority Voting.


I. INTRODUCTION
Aggregation operators are powerful mathematical methods for weighted multi-source information fusion. The weights of the sources (e.g. individual classifier outputs in the context of ensemble classification) are defined by a Fuzzy Measure (FM) [1], [2], which captures the worth of the individual sources and all their possible combinations. Many aggregation operators have been proposed in the literature, Fuzzy Integrals (FIs) [3] being the most commonly used (especially the Choquet Fuzzy Integral (CFI)).
Although FIs have been used widely, they suffer from potential drawbacks of producing non-intuitive results in respect to a commonly intuitively interpretable FM. For example, as shown in [4], the re-ordering inherent to the FI can lead to only partial exploitation of a FM and the aggregation of symmetrically mirrored inputs does not necessarily result in symmetrically mirrored outputs. Wagner et al. recently discussed these aspects in detail in [4] and introduced a family of aggregation functions called the Recursive Average (RAV), as an alternative to FIs for fusion applications designed to leverage (interpretable) FMs. With the RAV operator addressing some of the potential shortcomings of how FMs are used by FIs, there is renewed scope to revisit how FMs are generated in practical applications.
Currently, to construct a FM, three key approaches are common: 1) Expert-driven, 2) Algorithm-driven and 3) Optimisation-driven (see Section II-A). In this paper, we put forward a fourth category: FMs specified using externally available information, so-called a priori FMs. The key idea underpinning these FMs is that they are specified independently of the actual aggregation operator using external information, thus preserving their interpretability. Here, 'external information' is information available to inform the weighting of individual sources and their combinations which is independent from the actual data fusion step. In this sense, one could argue that expert-driven FMs are also a type of 'a priori' FM; nevertheless, as expert-driven FMs are a specific type with a long and well understood tradition/rationale, we argue that maintaining them as a separate category is both useful and serves clarity.
In the context of ensemble classification, aggregation operators are an extension of weighted ensemble algorithms. In the past decade, aggregation operator based ensemble classifiers have become popular due to their ability to express interactions between the classifiers [5], [6]. In this paper, we develop one specific instance of an a priori FM: one which captures the quality of individual classifiers (and their combinations) in order to enable a fusion-based ensemble classifier. While working on this paper, the authors became aware of recent work by Uriz et al. [7] which introduces a FM based on the same principle (i.e. a FM based on sub-classifier performance) in the context of imbalanced classification problems and traditional FIs.
In order to evaluate the potential of the proposed instance of an 'a priori' FM, also in respect to the recently introduced RAV aggregation operators, we conduct in-depth experiments contrasting the proposed FM with both the Sugeno and Uriz FMs, using the CFI and RAV as aggregation operators. Note that because of space limitations, in respect to the family of RAV operators, we focus only on the arithmetic RAV (i.e. p = 1) in this paper. Experiments are conducted using 20 datasets from machine learning repositories and contextualised to the wider literature by comparing them to state-of-the-art ensemble classifiers such as Adaboost, Bagging, Random Forest and Majority Voting.
In Section II we review the background on FMs, aggregation operators and the ensemble classification methods employed in this paper. In Section III the a priori FM is introduced, together with the a priori FM based ensemble classifier. Section IV presents and discusses the results of all experiments, followed by conclusions in Section V.

II. BACKGROUND

A. Fuzzy Measures
Fuzzy Measures (FMs) capture the worth of each individual information source (these worths are also called densities) and of all their possible combinations, i.e. every subset in the power set [1], [4]. Figure 1 shows the FM weight structure (also referred to as a lattice) for three sources.
Let X = {x_1, ..., x_n} be a discrete and finite set of information sources and g : 2^X → [0, 1] be a FM with the following properties: P1: boundary conditions, i.e. g(∅) = 0 and g(X) = 1; and P2: monotonicity, i.e. if A ⊆ B ⊆ X then g(A) ≤ g(B). For infinite X, a third property guarantees continuity; however, as X is finite and discrete in this paper, this property is not relevant here. In the context of ensemble classification, g(A) represents the weight or importance of the subset A.

To construct FMs there are three major approaches, as follows: 1) Expert-driven: While FMs can be defined by experts [4], with increasing lattice size (number of parameters), defining each parameter of the FM becomes practically infeasible, restricting this approach to applications with a limited number of sources. 2) Algorithm-driven: These methods compute the FM from the values of the individual sources by leveraging the FM's mathematical constraints. Examples include the Sugeno λ-measure, the S-decomposable measure and the K-additive measure [8]-[11]. 3) Optimisation-driven: These methods use algorithms such as gradient descent, evolutionary computation and quadratic programming to optimise the FMs in respect to a pre-defined aggregation operator and training data [1], [5], [12]. Note that this approach, while offering the potential of powerful and concise fusion operators, is also most at risk of generating non-generic FMs, i.e. FMs which are not independently interpretable, but are tuned to drive the specific aggregation operator (e.g. CFI) which they were trained for [4]. One of the most extensively used FMs is the Sugeno λ-measure, described in the next subsection.

1) Sugeno λ-measure: Since its introduction [13], the Sugeno λ-measure has been the most commonly used algorithmic FM in the literature.
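For concreteness, properties P1 and P2 can be checked mechanically for a candidate FM stored as a lookup table over the power set; a minimal sketch (the representation and helper names are ours, not from the paper):

```python
from itertools import combinations

def is_valid_fm(g, sources):
    """Check boundary (P1) and monotonicity (P2) for a FM given as a
    dict mapping frozensets of sources to weights in [0, 1]."""
    # P1: boundary conditions g(empty) = 0 and g(X) = 1
    if g[frozenset()] != 0 or g[frozenset(sources)] != 1:
        return False
    subsets = [frozenset(c) for r in range(len(sources) + 1)
               for c in combinations(sources, r)]
    # P2: adding a source to a subset must never decrease the measure
    for a in subsets:
        for s in sources:
            if s not in a and g[a] > g[a | {s}]:
                return False
    return True

# A small hand-specified FM over three sources (hypothetical values)
X = {'x1', 'x2', 'x3'}
g = {frozenset(): 0.0,
     frozenset({'x1'}): 0.3, frozenset({'x2'}): 0.4, frozenset({'x3'}): 0.2,
     frozenset({'x1', 'x2'}): 0.6, frozenset({'x1', 'x3'}): 0.5,
     frozenset({'x2', 'x3'}): 0.7, frozenset({'x1', 'x2', 'x3'}): 1.0}
print(is_valid_fm(g, X))  # True: this lattice is monotone with correct bounds
```

Lowering any subset's weight below one of its sub-sources (e.g. setting g({x1, x2}) = 0.1) makes the check fail, which is exactly the violation the a priori FM construction in Section III guards against.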
The Sugeno λ-measure centres on the following property:

g(A ∪ B) = g(A) + g(B) + λ g(A) g(B), for all A, B ⊆ X with A ∩ B = ∅,

where λ > −1. The unique value of λ can be obtained by solving the following polynomial equation:

λ + 1 = (1 + λ g_1)(1 + λ g_2) · · · (1 + λ g_n),

where g_i = g({x_i}).
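The polynomial above has a unique root λ > −1, λ ≠ 0 (with λ = 0 exactly when the densities sum to one); a minimal sketch finding it by bisection and building the full lattice via the property g(A ∪ {x}) = g(A) + g_x + λ g(A) g_x (helper names are ours):

```python
from itertools import combinations

def sugeno_lambda(densities, tol=1e-10):
    """Solve prod(1 + lam * g_i) = 1 + lam for lam > -1 by bisection."""
    def f(lam):
        prod = 1.0
        for gi in densities:
            prod *= 1.0 + lam * gi
        return prod - (1.0 + lam)
    s = sum(densities)
    if abs(s - 1.0) < tol:
        return 0.0                       # additive case: lam = 0
    # densities summing below 1 give lam > 0; above 1 give -1 < lam < 0
    if s < 1.0:
        lo, hi = tol, 1.0
        while f(hi) < 0:                 # expand until the root is bracketed
            hi *= 2.0
    else:
        lo, hi = -1.0 + tol, -tol
    for _ in range(200):                 # bisection on the bracketed root
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def sugeno_measure(densities):
    """Build g(A) for every subset using the lambda-measure recursion."""
    lam = sugeno_lambda(densities)
    g = {frozenset(): 0.0}
    n = len(densities)
    for r in range(1, n + 1):
        for c in combinations(range(n), r):
            a, x = frozenset(c[:-1]), c[-1]   # grow the subset one source at a time
            g[frozenset(c)] = g[a] + densities[x] + lam * g[a] * densities[x]
    return g

g = sugeno_measure([0.3, 0.4, 0.2])
print(round(g[frozenset({0, 1, 2})], 6))  # 1.0: g(X) = 1 exactly at the root
```

Since the densities sum to 0.9 < 1, the resulting λ is positive, inflating the worth of combinations relative to a simple sum.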
2) Uriz FM: Uriz et al. [7] proposed a method which learns the FM from the classification accuracies of individual classifiers. This method shares key aspects of the motivation in respect to using the performance of individual classifiers within an ensemble classification framework with the approach put forward in this paper, nevertheless follows a different approach in the generation of the actual FM as detailed in Section III.
For all A ⊆ X over the n classifiers (or features), Uriz's FM is composed in two steps. First, the uniform FM g_u is given by:

g_u(A) = |A| / n.

The second step makes use of the results of all the individual classifiers and of all classifier combinations, given by Acc_A, ∀A ⊆ X, scaling each uniform value by the relative performance of the corresponding subset:

g(A) = g_u(A) · Acc_A / MeanAcc_|A|,

where MeanAcc_|A| is the average of the results over all classifier combinations of the same cardinality |A|.

B. Aggregation Operators
Aggregation operators are mathematical functions that combine the information from multiple sources [1], [4]. There are many aggregation operators in the literature [14], but in this work we focus on commonly used FIs and the recently introduced RAV, explained in the next subsections.
1) Fuzzy Integrals: Fuzzy Integrals (FIs) are non-linear aggregation functions often used for information (evidence) fusion using the worth of each subset of sources (provided by a FM g) [1], [4]. The two most commonly used FIs in the literature are the Choquet Fuzzy Integral (CFI) [1], [2], [5], [6] and the Sugeno Fuzzy Integral (SFI) [3]. In this work the CFI is used, which is defined as follows: Choquet Fuzzy Integral: Let h : X → [0, ∞) be a real-valued function that represents the evidence or support of a hypothesis. The discrete Choquet Fuzzy Integral (CFI) [1]-[4] can be defined as:

CFI_g(h) = Σ_{i=1..n} h(x_(i)) [g(A_i) − g(A_(i−1))],    (1)

where the inputs are re-ordered such that h(x_(1)) ≥ h(x_(2)) ≥ ... ≥ h(x_(n)), A_i = {x_(1), ..., x_(i)} and g(A_0) = 0. More detail on the properties of FIs and the CFI can be found in [14].
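The sort-and-difference structure of the discrete CFI translates directly into code; a minimal sketch over a hand-specified FM (representation and values are illustrative, not from the paper):

```python
def choquet(h, g):
    """Discrete Choquet integral of evidence h = {source: value} w.r.t. a
    FM g given as a dict mapping frozensets of sources to weights."""
    # sort sources by decreasing evidence: h(x_(1)) >= ... >= h(x_(n))
    order = sorted(h, key=h.get, reverse=True)
    total, prev = 0.0, frozenset()       # g(A_0) = 0 via the empty set
    for x in order:
        cur = prev | {x}                 # A_i = {x_(1), ..., x_(i)}
        total += h[x] * (g[cur] - g[prev])
        prev = cur
    return total

# Hypothetical FM and evidence over three sources
g = {frozenset(): 0.0,
     frozenset('a'): 0.3, frozenset('b'): 0.4, frozenset('c'): 0.2,
     frozenset('ab'): 0.6, frozenset('ac'): 0.5, frozenset('bc'): 0.7,
     frozenset('abc'): 1.0}
h = {'a': 0.9, 'b': 0.5, 'c': 0.1}
print(choquet(h, g))  # 0.46 = 0.9*0.3 + 0.5*(0.6-0.3) + 0.1*(1.0-0.6)
```

Note that because of the re-ordering, only one maximal chain of the lattice contributes to any single fusion, which is precisely the partial exploitation of the FM discussed in [4].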
2) Recursive Average (RAV): The Recursive Average (RAV), an instance of the Recursive Weighted Power Mean introduced by Wagner et al. [4], is an aggregation operator that fuses information over a set of sources X in respect to a FM g, mathematically defined as follows:

RAV_p(X) = ( Σ_{j=1..n} [ g(B_j) / Σ_{k=1..n} g(B_k) ] RAV_p(B_j)^p )^(1/p),    (2)

where |p| > 0, B_j = X \ {x_j}, x_j ∈ X, and the recursion bottoms out at singletons with RAV_p({x_i}) = h(x_i). For all p, the RAV for a set of sources is recursively defined as the weighted average of its sub-sources, where the weight at each node is captured by the FM [4]. For particular values of p, the RAV adopts a specific averaging behaviour in a recursive manner, i.e. the recursive arithmetic average for p = 1, the recursive harmonic average for p = −1, the recursive quadratic average for p = 2, and so on. For simplicity and space, we focus solely on the RAV for p = 1 in this paper.
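A minimal recursive sketch of the arithmetic RAV (p = 1), following our reading of the definition in [4]: each set is averaged over its immediate sub-sources B_j, weighted by the normalised FM values of those sub-sources, with singletons returning their own evidence (names and the example FM are illustrative):

```python
def rav(h, g, sources=None):
    """Arithmetic (p = 1) Recursive Average of evidence h w.r.t. FM g.
    h maps sources to values; g maps frozensets of sources to weights."""
    if sources is None:
        sources = frozenset(h)
    if len(sources) == 1:
        (x,) = sources
        return h[x]                      # base case: a single source
    # weighted average over the immediate sub-sources B_j = A \ {x_j}
    subs = [sources - {x} for x in sources]
    weights = [g[b] for b in subs]
    norm = sum(weights)
    return sum(w * rav(h, g, b) for w, b in zip(weights, subs)) / norm

g = {frozenset(): 0.0,
     frozenset('a'): 0.3, frozenset('b'): 0.4, frozenset('c'): 0.2,
     frozenset('ab'): 0.6, frozenset('ac'): 0.5, frozenset('bc'): 0.7,
     frozenset('abc'): 1.0}
h = {'a': 0.9, 'b': 0.5, 'c': 0.1}
print(rav(h, g))
```

In contrast to the CFI sketch above, every path through the lattice contributes to the result, so no FM value is ignored regardless of the ordering of the inputs.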

C. Ensemble Classification
Ensemble Classification determines the class to which a new object belongs by integrating the results of multiple classifiers. Previous studies [1], [6], [15] concluded that aggregation operator based ensemble classifiers work well in a number of applications such as multi-criteria decision making (MCDM) [16], forensic science [17], software defect prediction [18], brain computer interfaces (BCI) [19], computer vision [20], [21] and explosive hazard detection [22]. Here, we use the application of ensemble classification to compare the a priori measure with the Sugeno λ-measure and the Uriz measure.
Algorithm 1: a priori FM
  inputs: accuracy values (Acc), normalisation factor (N_f)
  output: fuzzy measure (g_ap)

Importantly, the FM based ensemble classifiers are also compared with extensively used, state-of-the-art machine learning methods such as Random Forests, Bagging, Boosting and Majority Voting (briefly explained below). Finally, we also compare the a priori measure based classifier with DeFIMKL (Decision-level Fuzzy Integral Multiple Kernel Learning) [5], a state-of-the-art FI based ensemble classification algorithm.
Adaboost: Adaboost is a very popular ensemble classification algorithm which combines multiple weak classifiers to construct a strong classifier [23]. The algorithm trains by giving higher weights to misclassified data in subsequent iterations (of the classifier), while the weights of correctly classified instances are decreased. A weighted combination of all classifiers in the ensemble is used to predict the final result.
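The re-weighting just described can be sketched for one boosting round (a minimal illustration of the standard AdaBoost update, not the exact formulation of [23]; helper names are ours):

```python
import math

def adaboost_update(weights, correct, err):
    """One AdaBoost round: the classifier weight alpha grows as the weighted
    error err shrinks; misclassified samples are up-weighted, correct ones
    down-weighted, and the distribution is renormalised."""
    alpha = 0.5 * math.log((1 - err) / err)
    scaled = [w * math.exp(-alpha if c else alpha)
              for w, c in zip(weights, correct)]
    z = sum(scaled)                      # normalisation constant
    return [w / z for w in scaled], alpha

# Four equally weighted samples, one misclassified, weighted error 0.25
new_w, alpha = adaboost_update([0.25] * 4, [True, True, True, False], 0.25)
print([round(w, 3) for w in new_w])  # the misclassified sample now carries weight 0.5
```

After the update the misclassified sample carries half of the total weight, forcing the next weak learner to concentrate on it.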
Bagging: Bagging (or Bootstrap Aggregation) [24] ensemble algorithms are most commonly used in problems with high variance. The data for each classifier in the ensemble is selected by sampling with replacement. The final classification outcome is based on a majority-vote.
Majority Voting with SVM (MJSVM): Let x be an instance and S_i (i = 1, 2, ..., k) be Support Vector Machine (SVM) classifiers that output class labels m_i(x, c_j). For each class label c_j (where j = 1, ..., n) [15], the output of the final classifier y(x) for instance x is given by:

y(x) = argmax_{c_j} Σ_{i=1..k} m_i(x, c_j),

i.e. the class receiving the most votes across the ensemble.

Random Forest: Random Forests are among the most commonly used ensemble classifiers in the literature. One of the important parameters of this algorithm is the size of the trees: small trees suffer from high bias, while trees with more levels suffer from high variance [25].

DeFIMKL algorithm: The Decision-level Fuzzy Integral Multiple Kernel Learning (DeFIMKL) algorithm is a state-of-the-art FI-FM ensemble classification algorithm which aggregates kernels through the CFI with respect to a FM learned by a regularised quadratic programming approach [1]. Agrawal et al. [15] showed that DeFIMKL is the best-performing FI-FM ensemble classification algorithm, making it an informative comparison for the a priori FM based ensemble classifier.
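The majority vote itself can be sketched directly (illustrative; ties here are broken by the order in which labels are first seen):

```python
from collections import Counter

def majority_vote(predictions):
    """Final label y(x): the class receiving the most votes, where
    `predictions` holds one predicted label per classifier in the ensemble."""
    counts = Counter(predictions)
    return counts.most_common(1)[0][0]

print(majority_vote(['spam', 'ham', 'spam', 'spam', 'ham']))  # 'spam'
```

Unlike the FM based fusion schemes above, this rule weights every classifier equally and ignores interactions between them.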

III. A Priori FM FOR ENSEMBLE CLASSIFICATION
This section presents an instance of an a priori FM which uses the classification accuracies of all the individual classifiers and their combinations as external information. The steps of the ensemble classification algorithm using the a priori FM are also presented in this section.

A. Generating an A priori FM for Ensemble Classification
For all classifier combinations A ⊆ X over the n classifiers in the ensemble, the a priori FM g_ap is constructed as presented in Algorithm 1. The algorithm takes as input the classification accuracies Acc_i (i.e. the training set accuracies of the input dataset, where i ranges over all classifier combinations) and the normalisation factor N_f (the lower bound of the range over which the accuracies are normalised), and generates the a priori FM g_ap by running the following steps: S1: The input accuracies are normalised using the factor N_f = 50, i.e. the input accuracies are normalised between 50% and the maximum observed accuracy (MaxAcc).
The normalisation factor is chosen to be 50 (instead of 0) as the classifiers should ideally perform better than random guessing, i.e. their accuracies should exceed 50%. These normalised accuracies are then subtracted from one (to give the highest worth to the best accuracies and vice versa) and passed on to the second step.
S2: For the individual classifiers (e.g. g({x_1}), g({x_2}) and g({x_3}) in Fig. 1), the value obtained from the previous step is the final a priori FM value. For each combination of sources (e.g. all the combinations in Fig. 1), if the normalised measure value obtained from the previous step is greater than or equal to the values of all of its sub-sources, this normalised measure value becomes the final a priori FM value for that combination. Otherwise, if the normalised measure value is less than the measure value of any of its sub-sources, the maximum measure value among its sub-sources becomes the final a priori FM value, thus preserving the monotonicity property P2.
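Steps S1 and S2 can be sketched as follows (our reading of Algorithm 1, not the authors' code; the accuracy values and helper names are hypothetical, and S1 is implemented here so that the best accuracies receive the highest worth, matching the stated intent of the subtraction from one):

```python
from itertools import combinations

def a_priori_fm(acc, sources, nf=50.0):
    """Sketch of an a priori FM from subset accuracies `acc`
    (dict: frozenset -> accuracy in %). S1 normalises accuracies between
    nf (= 50%) and the best observed accuracy, so better-performing
    subsets receive higher worth; S2 enforces monotonicity by taking
    the maximum over each combination's sub-sources."""
    max_acc = max(acc.values())
    def nm(a):                           # S1: normalised worth in [0, 1]
        return max(0.0, (a - nf) / (max_acc - nf))
    g = {frozenset(): 0.0}
    subsets = [frozenset(c) for r in range(1, len(sources) + 1)
               for c in combinations(sources, r)]
    for a in subsets:                    # cardinality order: sub-sources exist
        worth = nm(acc[a])
        # S2: never drop below the measure of any immediate sub-source
        subs = [g[a - {x}] for x in a]
        g[a] = max([worth] + subs)
    g[frozenset(sources)] = 1.0          # boundary condition g(X) = 1
    return g

# Hypothetical training accuracies for three classifiers and combinations
acc = {frozenset('a'): 70.0, frozenset('b'): 80.0, frozenset('c'): 60.0,
       frozenset('ab'): 85.0, frozenset('ac'): 72.0, frozenset('bc'): 78.0,
       frozenset('abc'): 88.0}
g = a_priori_fm(acc, 'abc')
print(g[frozenset('ab')])  # worth of the {a, b} combination
```

Note how the combination {a, c}, whose accuracy (72%) sits below that of classifier b alone, still receives a measure value no smaller than any of its sub-sources.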

B. Ensemble Classification
Each test instance x can be classified using the following steps: 1) compute the decision value h(x) for all classifiers in the ensemble; 2) compute the aggregated value for x using the a priori FM g_ap in respect to the RAV (2) or the CFI (1).

IV. RESULTS AND DISCUSSION
In this section we present the experimental framework for the comparison of the FM based ensemble classifiers followed by results and discussion.

A. Experimental Framework
The a priori FM is compared with the Uriz and Sugeno FMs, for both the RAV and the CFI aggregation operators, on 20 benchmark datasets from the UCI machine learning repository [26], described in Table I. For consistency, the comparison with the DeFIMKL, MJSVM, Adaboost (with trees), Bagging and Random Forest ensemble classifiers is also presented for the same 20 datasets. All datasets are normalised to zero mean and unit standard deviation [27]. A binary ensemble classifier is used for comparison; as many of the datasets have more than two classes, the classes of such datasets are merged [28]. Table II and Table III present the results of the a priori, Uriz and Sugeno FM based ensemble classifiers for both the CFI and the RAV aggregation operators on the 20 UCI datasets. The first two columns report the performance of the Sugeno λ-measure based ensemble classification algorithm for the CFI and the RAV aggregation operators respectively. Similarly, the third and fourth columns capture the results for the Uriz FM based ensemble classification algorithms, and the fifth and sixth columns for the a priori FM based ensemble classifiers.

B. Results
The ensembles in Table II use the same base classifiers, i.e. five SVMs, whereas the ensembles reported in Table III use a mixed set of base classifiers. The comparison with the state-of-the-art ensemble classifiers is presented in Table IV.

TABLE II: Summary performance statistics of the a priori FM, the Uriz FM and the Sugeno λ-measure based ensemble classifiers for the same base classifiers. For clarity, the best classifiers are bold, performances statistically indistinguishable from the best are in italics and those statistically significantly worse than the best are underlined.

All tables report the mean and standard deviation of the accuracies over 100 runs. In each run, 80% of the data is randomly selected for training, i.e. constructing the FM, and the remaining 20% for testing. For every dataset, the accuracy of each algorithm was compared with the accuracy of the best algorithm (shown in bold) using a two-sample t-test (at p < .05). Results of classifiers which are statistically not different from the best algorithm are shown in italics, and those which are statistically significantly worse than the best are underlined.

TABLE IV: Summary performance statistics of the a priori FM based ensemble classifier in comparison with state-of-the-art ensemble classifiers; the same bold, italics and underline conventions apply.

C. Discussion
From Table II it can be observed that all the ensemble algorithms with the same base classifiers performed very well. Indeed, as shown in the table, nearly all classifiers achieve performance which is not statistically different from the best classifier (hence nearly all results are shown in italics). In the following, we refer to a classifier as achieving best performance when it produces results which are either the best or statistically not different from the best.
The CFI based ensemble classifiers performed very similarly for all three measures, achieving best performance in 16 out of 20 datasets. The RAV based ensemble algorithms also performed very similarly, achieving best performance in 15 datasets for the Sugeno and a priori measures, and in 16 datasets for the Uriz measure.
Overall, these algorithms outperformed the ensemble algorithms based on different base classifiers shown in Table III. Here, the CFI based ensemble classifiers achieved best performance in 2, 14 and 2 datasets for the Sugeno, Uriz and a priori measures respectively. Conversely, the RAV based ensemble algorithms achieved best performance in 16, 15 and 17 datasets for the Sugeno, Uriz and a priori measures respectively.
From Table IV it can be observed that DeFIMKL and MJSVM were the overall best classifiers, achieving best performance in 16 and 15 datasets respectively. The other classifiers, i.e. the CFI with the a priori FM, the RAV with the a priori FM, Adaboost with trees, Bagging and Random Forest, achieved the best accuracies in 7, 7, 6, 9 and 7 datasets respectively. DeFIMKL is the overall best FI based ensemble classifier, demonstrating the benefit of optimising the FM in respect to a specific operator (the CFI in this case). The a priori FM based classifiers (which do not employ optimisation towards a specific aggregation operator) overall achieve lower performance.

V. CONCLUSIONS
FIs are powerful aggregation operators, yet they suffer from potential drawbacks [4], resulting in non-intuitive outcomes when employed in respect to interpretable FMs, i.e. FMs which can be interpreted to provide meaningful insight into the value of information sources and their combinations.
The RAV aggregation operator was introduced as an alternative to FIs, specifically for cases where a FM is available which is independent from the aggregation operator, for example a FM which is specified using external information (cf. experts). In this paper we put forward a new category of a priori FMs to capture FMs which are based on external information rather than for example being optimised in respect to training data and one specific aggregation operator such as the CFI.
To illustrate and explore the concept of the a priori FM, we use the application of ensemble classifiers to compare a specific instance of an a priori FM with the well-established Sugeno λ-measure and the recently introduced Uriz FM, for both the CFI and the RAV aggregation operators. The ensemble classification algorithms are constructed for two sets of base classifiers: a set of five SVM classifiers and a mixed set of SVM, Decision Tree, Adaboost, Bagging and Neural Network classifiers. The three FMs, integrated with the different aggregation operators, were compared on 20 datasets from the UCI repository. Further, the a priori FM ensemble algorithm with the same base classifiers was also compared with DeFIMKL, MJSVM, Adaboost with trees, Bagging and Random Forest on the same 20 UCI datasets.
The results show that the specific instance of an a priori FM put forward in this paper is a robust way to construct FMs which perform well (with different aggregation operators) in the context of ensemble classification. The results highlight the strong value of leveraging available external information, rather than relying on FM generation approaches which focus solely on the densities (such as the Sugeno λ-measure), as well as the potential of alternative FM generation approaches which do not rely on training the FM in respect to one specific aggregation operator.
In the future, we will focus on exploring the robustness of a priori FMs when employed with a wider set of aggregation operators, targeting the Sugeno FI as well as the CFI, but also the RAV operator for different values of p. Further, as part of an expanded journal paper on a priori FMs, we have developed a priori FMs which leverage external information from the scientific literature (and not the performance of individual classifiers as done in this paper) to support improved fusion, showing a pathway for delivering strong information aggregation using interpretable and validatable FMs, with rich scope for further research.