Handling Uncertainty in Citizen Science Data: Towards an Improved Amateur-based Large-scale Classification

Citizen Science, traditionally known as the engagement of amateur participants in research, is showing great potential for large-scale processing of data. In areas such as astronomy, biology, or geo-sciences, where emerging technologies generate huge volumes of data, Citizen Science projects enable image classification at a rate not possible to accomplish by experts alone. However, this approach entails the spread of biases and uncertainty in the results, since the participants involved are typically non-experts in the problem and hold variable skills. Consequently, the research community tends not to trust Citizen Science outcomes, claiming a generalised lack of accuracy and validation. We introduce a novel multi-stage approach to handle uncertainty within data labelled by amateurs in Citizen Science projects. Firstly, our method proposes a set of transformations that leverage the uncertainty in amateur classifications. Then, a hybridisation strategy provides the best aggregation of the transformed data for improving the quality and confidence in the results. As a case study, we consider the Galaxy Zoo, a project pursuing the labelling of galaxy images. A limited set of expert classifications allows us to validate the experiments, confirming that our approach is able to greatly boost accuracy and classify more images with confidence.


Introduction
Connectivity is promoting the emergence of a great potential amongst virtual communities of people that share a common goal. In some cases, this goal may consist of making a significant contribution towards the solution of a complex scientific problem. Whereas, in the past, the analysis of these problems used to be restricted to a group of experts in the subject, today this is difficult when the processing of large amounts of data is required. In this context, Citizen Science refers to the development of scientific research assisted by amateur volunteers from the general public [15]. As a form of crowdsourcing [13], this practice is re-emerging, engaging the crowd in helping researchers complete highly time-consuming tasks for which no reliable automatic procedures are available yet, for example, the labelling of images [45], the detection of patterns in graphic data [49], or the transcription of handwritten texts [22].
Here we deal with classification problems in Citizen Science, which generally aim at the classification of huge collections of images according to a number of classes. These classes capture the interest of a specific research field, and identifying them correctly is the target of the participants. This sort of project involves, for instance, the recognition of structures in cell images [35], animal species in images taken in the savannah [2], or types of storms in actual data taken from meteorological satellites [23]. Amongst others, the nascent discipline of astroinformatics [3] has greatly benefited from the analysis of astronomical data in multiple projects, providing data analysis at a scale never reached in the past [31,37,6]. Nevertheless, many challenges arise when seeking the maximal profit of this large-scale analysis, regarding aspects such as the best use of expert classifications [44] or participants' engagement in this type of voluntary scientific contribution [42].
Citizen Science has also attracted the attention of data scientists. Some research has focused on mining the data using an off-line approach, that is, studying the results once the project has finished. These works have tested the capabilities of Data Mining (DM) and Machine Learning (ML) techniques, aiming at replicating amateurs' performance [5,18] or training ML algorithms for a certain problem [38,7]. Moreover, ML implementations are also being used for optimising amateurs' endeavours through the course of the project, following an on-line approach. This other framework encompasses the progressive training of new participants as they acquire experience in the problem, or the interaction between a ML classifier and new labelled data as it is generated by project participants [49].
Despite this, Citizen Science still arouses scepticism within the research community [39]. Even though it offers possibilities for research not possible to accomplish by experts alone, it is not universally accepted as a valid research method [10]. The reasons lie in the quality of results, which are often questioned because of several drawbacks involving the prevalence of biases and a lack of accuracy and validation [28]. Amateurs participating in Citizen Science projects exhibit a wide range of skills, and it is not guaranteed they hold any background in science or research. Moreover, there is always some degree of uncertainty in classification problems, which usually carry additional vagueness in the definition of the classes (type of birds, shape of galaxies, patterns in a graph, etc.). Consequently, depending on the problem and participants' expertise with the classification task, the confidence in amateur-labelled data varies, and Citizen Science results thus hold an intrinsic uncertainty.
The study of classification problems with uncertain labels has been developed through several approaches [12,30]. Nonetheless, when the uncertainty comes from a set of independent judgements on the object being classified, fuzzy logic provides a very suitable framework for a thorough study of this uncertainty [27]. Areas such as multi-criteria decision making and multi-expert decision making address the problem of providing a final decision when a set of independent judgements is available [21,46]. Several aggregation methods have been widely studied in the specialised literature, aiming to use a set of experts' individual preferences in such a way that reflects the properties contained in all individual contributions [14,47]. However, this kind of approach has not been extended to the case where a great number of non-expert opinions with widespread uncertainty in their final decisions is available. Moreover, classification problems covered by Citizen Science projects tend to produce additional vagueness related to the definition of the classes to be identified, a disparity either in the total number of votes received or in the confidence of amateur classifications, etc.
To the best of our knowledge, the enhancement of Citizen Science results by using this kind of method has not been fully addressed yet. In our preliminary study [25], we started investigating the potential and the issues derived from the employment of these results with two simple data transformations. In this work, we propose a novel approach that, based on our previous findings, uses the data produced in Citizen Science projects that deal with classification problems. We present a method to aggregate information about the prevalent uncertainty in this sort of data. We first identify three sources of uncertainty in Citizen Science data, which we address separately by a set of transformations that aim to enrich the original data. Then, we employ a hybridisation strategy that explores the most suitable combination of these individual transformations, providing more confident and accurate classifications. We eventually pursue a refinement of results, so that they become more trustworthy and maximise the utility and outreach of Citizen Science projects.
We consider as a case study the first edition of the Galaxy Zoo (GZ1) project [32], one of the very first successful implementations of Citizen Science using the Internet. GZ1 finished classifying nearly one million galaxy images with the help of more than 200,000 volunteers. However, these results did not consider at that time a substantial part of the information stored in the original data about the uncertainty in amateur classifications. Making an integrated use of the same original GZ1 dataset, our approach is able to provide more accurate classifications for a greater number of galaxies, improving the state of the art of the problem.
This document is organised as follows. In Section 2, we extend the background on Citizen Science and the management of the uncertainty with fuzzy logic. In Section 3, we introduce our approach for the improvement of Citizen Science data. Section 4 presents the set of experiments that test our method along with a discussion of results. Finally, in Section 5 we draw some conclusions and outline possible directions for future work.

Background
In this section, we further introduce the main materials covered in this work for a better comprehension. In Section 2.1, we first explain in more detail different aspects around the running of Citizen Science projects and review current trends in the specialised literature. After this, in Section 2.2 we take a deeper look at the field of fuzzy logic as a promising resource for the improvement of Citizen Science results.

Citizen Science: A brief overview
Citizen Science has been a common practice for many years. This form of citizen support to science developed by volunteers goes back in time to the eighteenth century. In those days, a few amateurs started making small but significant contributions by reporting observations about meteorology and ornithology [34]. Nowadays, the great advances in the Internet and Information Technologies have broadened the ways volunteers can develop these research-related activities, to the point that Citizen Science is being re-discovered by the scientific community as a valuable resource [40]. An increasing number of projects engage day by day significant numbers of individuals through the Internet in collecting and/or analysing data, with the support of many institutions from research and academia. The Zooniverse initiative is one of the main platforms for Citizen Science project development and management [41]. Currently, Zooniverse hosts more than 80 projects in topics such as space sciences, ecology, medicine and humanities, directing the joint effort of more than a million participants [20]. This has led to the publication of more than 250 scholarly articles (a complete list of references can be found at http://www.zooniverse.org/publications), validating the utility of Citizen Science for today's research.
There is a solid body of work devoted to the study of Citizen Science as a social phenomenon, emphasising different aspects such as the motivation of volunteers, challenges towards acquiring real research status, or its future prospects [15,9,40,17,10]. A shared claim within these works is the latent potential in the crowd as a valuable resource that should not be neglected by the scientific community. Nonetheless, two main concerns are raised by scientists: a generalised lack of accuracy and a proliferation of biases within the data coming from Citizen Science projects [28]. To overcome this, the development of proper tools to improve data accuracy, to control and minimise the impact of biases, and to validate final results has become crucial. These issues have been addressed from several approaches [1,48,11,8]. However, this body of work focuses on Citizen Science projects in the context of ecological sciences, which aim for data collection from natural environments at a massive scale. They ignore the difficulties covered in this work that arise when the target is the processing of data by large numbers of people. This is the case, for example, in projects coping with the classification of images.
Citizen Science projects usually involve one particular task around the processing of some sort of raw data. Once the project is released, participants interested in taking part are invited to complete the task, developing genuine data analysis. For a great number of projects, this has consisted of the classification of large collections of images. After some training is provided, amateurs are asked to classify the images displayed on the project website by choosing amongst a set of categories. These categories often hold a set of main classes, which get the major part of the votes and comprise the target of the classification problem. In addition, a Don't Know (DK) category is commonly offered, useful in case no class is clearly distinguishable, which ensures any image gets a vote every time it is shown. When the project is closed down, all clicks, conveniently recorded in a database, are made available to a team of experts in the problem for their follow-up study. This data normally includes the count of votes for each of the classes offered to participants, and not a final label for each of the objects in the original dataset, as shown in Table 1. Therefore, a suitable analysis of this data is key at this point to extract good results from the project. However, a thorough study of this problem from the data science perspective remains unexplored.
Table 1: Structure of the amateur-labelled data: for each Image ID, the total number of votes and the count of votes received by Class 1, Class 2, ..., Don't Know.

Off-line studies have trained ML classifiers on this sort of data, using either a set of features extracted from the image [5,29] or the whole image within a deep learning approach (that performs its own feature extraction) [18]. It has been shown that ML classifiers can achieve similar results to those obtained by a group of amateurs when these algorithms are trained using Citizen Science data [7]. However, these approaches do not address the intrinsic uncertainty, and tend also to replicate the biases present in the data. In addition to off-line approaches, on-line settings have recently been developed for optimising the interaction between humans and machines through the course of the projects [26,16]. These approaches involve ML systems that deal with the training of participants as they acquire expertise in the problem, the management of expert classifications, and the synergy between amateurs, experts and automated classifiers [49]. The operation of both kinds of approaches is outlined in Figure 1, where we highlight the interrelation amongst both potential roles of ML in reinforcing Citizen Science outcomes.
In the present work, we opt for an off-line approach that targets the inherent uncertainty in projects that tackle classification problems. Our aim is to help experts increase their accuracy and confidence in amateur-labelled data in order to improve the outcomes of this kind of project. To do so, our approach ensures an aggregation of information concerning this uncertainty that, as a form of data pre-processing, enables a better use of this data for either research or the training of ML algorithms. Multi-criteria and multi-expert decision making are well-studied categories of problems concerned with finding the best choice when a set of alternatives is available [43,19]. Eventually, an aggregation method is needed to combine individual criteria into a final decision, which is expected to contemplate all individual contributions. In data coming from Citizen Science projects we often encounter this scenario, where a set of opinions is available. However, while fuzzy models for decision making normally use information from a reduced number of experts on the problem, in the case of Citizen Science data the uncertainty is more extreme for two reasons: firstly, amateurs (in contradistinction to experts) hold a wide range of backgrounds and varying expertise on the task, meaning more vagueness in their opinions; secondly, the number of judgements that need aggregation is much larger than in other typical group decision making problems. For instance, standard medical decision making has modelled the aggregation of ∼50 experts [21], whereas a typical Citizen Science project engages up to hundreds of thousands of participants, each one providing several tens of opinions about a set of objects.
These approaches represent a valuable initial framework for studying a better use of Citizen Science data. A wide range of uncertainties is pervasive throughout the problem definition, the amateurs' set of judgements, and the process of aggregating these judgements to reach a final classification. Depending on the nature of the problem addressed, results provided by amateur participants can be aggregated using expert knowledge in the subject to take advantage of all the resources available. Pursuing this target, here we propose a way of aggregating additional information about the uncertainty in the voting process that, despite its simplicity, is able to improve current results.

A method for handling uncertainty in Citizen Science classification
In this section, we present our approach for handling the uncertainty spread within Citizen Science data. We consider the whole dataset obtained after the project has finished collecting votes from participants, taking an off-line approach. Firstly, in Section 3.1 we introduce basic notation and motivate the adequacy of the method by distinguishing three types of uncertainty present in this sort of data. Then, in Section 3.2 we present a set of mathematical transformations that aim to leverage each of these uncertainty types. After this, in Section 3.3 we explain a hybridisation strategy that explores the best way to concatenate the three transformation stages in order to obtain the most convenient aggregation procedure.

Motivation: Three sources of uncertainty within Citizen Science data
In this section, we introduce the basic problem related to the employment of Citizen Science results, as well as some notation about Citizen Science data taken in an off-line approach. Then, we provide brief explanations of the different ways the uncertainty is encountered within the data. This eases the later comprehension of the method.
This work focuses on Citizen Science projects that provide valuable aid to scientific research in solving a certain classification problem. The classification task, which is the core goal of the project, tends to involve the identification of a few classes across huge collections of images. However, as explained above, the task is developed with the help of a myriad of amateur participants. Hence, the output is not a final label for each of the images released during the project's run, but a variable set of independent amateur votes. Using the data generated by this process, here we propose a better use of these results, adopting an off-line approach and exploring how to leverage information about the uncertainty in amateur votes in a way that improves the quality of the final classifications.
In order to facilitate the subsequent data analysis, amateur votes are usually converted into scores, which are numbers in the unit interval calculated by dividing the number of votes in each category by the total number of votes received by the example. Thus, let $V = (v_1, v_2, \ldots, v_C)$ be the vote vector for an instance in the dataset, containing the votes for each of the categories defined in the problem, with $C$ the number of categories and $N = \sum_{i=1}^{C} v_i$ the total number of votes received by that object. We get the score vector $X = (x_1, x_2, \ldots, x_C)$ by computing $x_i = v_i / N$, for $i \in \{1, 2, \ldots, C\}$. The score vector is typically used to obtain a final classification for the object by simply applying a threshold: the category whose score is greater than or equal to the threshold is assigned to that example. This procedure allows the expert to adjust the confidence in the classification: the higher the threshold applied, the larger the consensus amongst the amateurs who labelled that object, and objects holding a greater consensus are expected to receive more accurate classifications. However, the selection of the threshold is arbitrary and, even more importantly, it does not take into account the total number of votes, $N$. On the one hand, all objects whose scores do not reach the threshold are left unlabelled (uncertain), making the process ineffective as we require higher confidence in the classifications. On the other hand, examples with similar scores may hold a totally divergent number of votes $N$, so we are neglecting a hidden disparity in confidence.
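As a minimal illustration of this procedure (with made-up vote counts), the following sketch computes the score vector and applies a threshold:

```python
import numpy as np

def scores_from_votes(votes):
    """Convert a vote vector V = (v_1, ..., v_C) into the score vector X = V / N."""
    votes = np.asarray(votes, dtype=float)
    return votes / votes.sum()

def classify(scores, threshold):
    """Return the index of the category whose score reaches the threshold,
    or None if the example stays unlabelled (uncertain)."""
    best = int(np.argmax(scores))
    return best if scores[best] >= threshold else None

# Made-up example: 30 votes for class 1, 6 for class 2, 2 Don't Know
x = scores_from_votes([30, 6, 2])
print(np.round(x, 3))    # [0.789 0.158 0.053]
print(classify(x, 0.8))  # None: the top score misses the 0.8 threshold
```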
The main issue derived from the employment of Citizen Science data is the prevalent uncertainty when a group of people provides a set of judgements about the same object. Amateur participants are not expected to agree in their classifications, and final labels depend on how this disagreement is handled. Additionally, we often encounter variability in the total number of votes received by each example, $N$. Our target here is to refine this amateur-labelled data in order to obtain better classifications, improving both the number of objects classified when applying a threshold and the quality of these classifications.
We distinguish three different sources of uncertainty within the data:
• We refer to Inherent Uncertainty (IU) as the uncertainty due to the variation across amateurs' votes. Given an example displayed on the website, each participant is asked to classify it by clicking on the most appropriate category according to their opinion at the time. Therefore, the final outcome is not a classification but a record of votes for each of the categories, whose spread tells us about the IU in that object. In case all participants have voted for the same category, this class holds a 1.0 score and the example presents zero IU. Conversely, if the votes are equally split across the categories, with scores equal to $1/C$, the IU reaches its highest value, accounting for the greatest uncertainty.
• We denote as Measured Uncertainty (MU) the uncertainty directly quantified by the DK category. This option is normally offered as a way of ensuring every example gets a vote every time it is shown to a participant. This count of votes represents a measure of the uncertainty in the classification: as an object holds a greater number of DK votes, $v_{DK}$, it is expected to entail more ambiguity in its labelling. Hence, an example with $v_{DK} = 0$ ideally holds zero MU, the MU growing as $v_{DK}$ takes on larger values.
• Lastly, we refer to Level of Confidence (LC) as the uncertainty caused by the variability in the total number of votes, $N$, received by each of the examples in the dataset. This quantity often follows an uneven distribution, which can provide an estimation of the confidence in the classifications with respect to the whole set of examples: given an example, the higher $N$ is in comparison with the rest of the set of objects (taking a metric, for instance, the mean number of votes, $\mu_N$), the greater our confidence in the set of scores for that example. Consequently, for scores similarly spread through the categories of the problem, the LC informs us about the degree of confidence we can expect with regard to each particular example.
The three sources of uncertainty are inevitably intertwined. The MU is part of the IU, which accounts for the spread of the votes through the whole set of categories, including the DK votes. The LC, in turn, is codified in the IU as well, since, given an example, we can trust a finer variability in the scores as more votes are available, that is, as $N$ reaches greater values. Here we do not aim to study these concepts in depth; we only set a concise conceptual framework for the explanation of the method, illustrated by the sketch below.
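To make these three notions concrete, the sketch below computes one simple indicator per uncertainty type. The specific formulas are illustrative choices of ours, consistent with the behaviour just described, and are not definitions taken from the method: normalised entropy for the IU (zero when unanimous, one when votes split equally), the DK vote fraction for the MU, and the total votes relative to the sample mean for the LC.

```python
import numpy as np

def inherent_uncertainty(scores):
    """Normalised entropy of the score vector: 0 when all votes agree,
    1 when the votes are equally split across the C categories."""
    x = np.asarray(scores, dtype=float)
    x = x[x > 0]
    return float(-(x * np.log(x)).sum() / np.log(len(scores)))

def measured_uncertainty(v_dk, n_total):
    """Share of Don't Know votes: 0 when v_DK = 0, growing with v_DK."""
    return v_dk / n_total

def level_of_confidence(n_total, mean_n):
    """Total votes N relative to the sample mean mu_N: values above 1
    indicate above-average confidence in the example's scores."""
    return n_total / mean_n

print(inherent_uncertainty([1.0, 0.0, 0.0]))   # 0.0 (unanimous)
print(inherent_uncertainty([1/3, 1/3, 1/3]))   # 1.0 (maximum IU)
```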

Three transformations for data refinement
In the following subsections, we explain the basis of the proposed method, consisting of three independent mathematical transformations to be applied to the original scores. These transformations are intended to aggregate information about the uncertainties summarised above that is not present per se in the set of scores obtained from amateur votes. For the sake of clarity, we label each one with a number tag (not related to any order or importance) and explain its application over the example data presented in Table 1. The method takes as input the whole set of vote and score vectors, V and X, respectively, for each example in the dataset, and provides a modified score vector. Using the new scores, we can apply a threshold to assign a final class to the example, as explained above.

Normalisation: Reinforcement of main classes
The first transformation ({1}) consists of the normalisation of a subset of the scores. In Citizen Science projects dealing with classification problems, we commonly find that a few of the classes available for voting cover the major part of the examples. These so-called main classes hold a greater importance with respect to the rest and represent the target of the problem, that is, to classify the sample according to these few main classes.
For example, participants may be asked to recognise either shapes of celestial objects, patterns in a graph, or types of animals in a picture of the savannah, all of these previously defined as canonical types. In addition, other secondary classes are offered, corresponding to minority (less common or rare) classes in the problem, or a Don't Know response for the extreme cases in which the amateur is not able to decide. These secondary classes may be of interest for other problems; in this work we focus on improving the classification of the main classes. Once the scores are computed, the minority classes tend to obtain negligible scores and therefore do not reach the threshold for the vast majority of the examples. However, these secondary scores contribute to lowering the main class scores, complicating the classification with a threshold.
Hence, the normalisation of the main scores is intended to remove the "noise" due to votes received by secondary classes. We also obtain a representation of the IU restricted to the target classes of the problem and independent of the total number of votes received by the example, $N$: all instances with an equal proportion of votes in the main categories are assigned identical scores after the normalisation.
Let $X = (x_1, x_2, \ldots, x_C)$ be the whole score vector; we select the main scores, getting a reduced score vector $X' = (x_1, x_2, \ldots, x_M)$, with $M$ the number of main categories ($M < C$). Once we have $X'$, the normalised score vector $Z = (z_1, z_2, \ldots, z_M)$ is obtained by computing $z_i = x_i / \sum_{j=1}^{M} x_j$. The normalisation of the main scores ensures that $\sum_{i=1}^{M} z_i = 1$ for every example. This also develops a cleaning of the main scores for a later aggregation of information about the MU and LC by the two other transformations.
Taking as example the data presented in Table 1, the normalised scores for this data are shown in Table 2, assuming this is a problem with two main classes: C1 and C2.
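A minimal sketch of transformation {1}, using illustrative score values for a problem with two main classes and a Don't Know category:

```python
import numpy as np

def normalise_main_scores(scores, n_main=2):
    """Transformation {1}: keep the main-class scores only and rescale them
    to sum to 1, removing the "noise" contributed by secondary classes."""
    x = np.asarray(scores, dtype=float)[:n_main]
    return x / x.sum()

# Illustrative scores for (C1, C2, Don't Know)
z = normalise_main_scores([0.55, 0.30, 0.15])
print(np.round(z, 3))   # [0.647 0.353]: z_1 + z_2 = 1 after normalisation
```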

DK votes shift: evaluation of Measured Uncertainty
The second transformation ({2}) modifies the main scores using the information held in DK votes. It aims to leverage the MU of the example by introducing a shift that favours one particular class and penalises the rest. In projects dealing with classification, we usually find an asymmetry in the main classes: one class is harder to identify than the rest. This occurs, for example, when the overall quality of the images is deficient because of multiple factors (images of natural environments affected by weather conditions, space images that depend on the distance, etc.), or biases emerge in amateurs' skills (for instance, a bias towards one particular class). Considering again the example data in Table 1, we demonstrate the application of this transformation {2} in Table 3.
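The following sketch conveys the idea behind transformation {2} under a simplifying assumption of ours: the shift is taken to be linear in the DK vote share and controlled by the parameter α used in Section 4.3; the actual functional form of the transformation may differ.

```python
import numpy as np

def dk_shift(main_scores, v_dk, n_total, harder_class=0, alpha=0.05):
    """Transformation {2}, illustrative form only: shift the main scores
    towards the class that is harder to identify, with a magnitude that
    grows with the number of Don't Know votes, v_DK (linear shift assumed)."""
    z = np.asarray(main_scores, dtype=float).copy()
    shift = alpha * v_dk / n_total          # assumed: proportional to the DK share
    z[harder_class] += shift                # favour the harder class...
    others = [i for i in range(len(z)) if i != harder_class]
    z[others] -= shift / len(others)        # ...and penalise the rest equally
    return np.clip(z, 0.0, 1.0)

print(dk_shift([0.6, 0.4], v_dk=8, n_total=40))   # [0.61 0.39]
```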

Votes boost: aggregation of the Level of Confidence

The third transformation ({3}) modifies the main scores according to the total number of votes, $N$, received by each example, aggregating information about the LC. The parameter γ works as a factor that adjusts the influence of the boost depending on the particularities of the problem. It is optimised using the original data and the available expert classifications. The $x_{min}$ and $x_{max}$ values are selected amongst the whole set of modified scores.
The application of this transformation {3} over the example data presented in Table 1 is illustrated in Table 4.
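Similarly, a hedged sketch of transformation {3}: we assume the boost scales each example's scores by its total number of votes $N$ relative to the sample mean, controlled by γ, followed by a min-max rescaling using the $x_{min}$ and $x_{max}$ values selected amongst the whole set of modified scores; the exact formula may differ.

```python
import numpy as np

def votes_boost(main_scores, n_votes, gamma=0.4):
    """Transformation {3}, illustrative form only: boost each example's scores
    according to its total votes N relative to the sample mean mu_N, then
    rescale with the minimum and maximum of the whole set of modified scores."""
    z = np.asarray(main_scores, dtype=float)           # shape (n_examples, M)
    n = np.asarray(n_votes, dtype=float)
    boosted = z * (1.0 + gamma * (n / n.mean() - 1.0))[:, None]
    x_min, x_max = boosted.min(), boosted.max()        # over all modified scores
    return (boosted - x_min) / (x_max - x_min)
```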

Hybridisation strategy
In this section, we introduce the final procedure leading to the target of the proposed method: exploring the best aggregation of information about the uncertainty in amateur classifications contained in the data. To this aim, we introduce a hybridisation strategy that operates with the three mathematical transformations explained above.
Each transformation tackles one particular expression of the intrinsic uncertainty present in the amateur-labelled data compiled after the project closure.
As we discuss in Section 3.1, this uncertainty can be split into three distinguishable types. Given the three transformations ({1} Normalisation, {2} DK votes shift, and {3} Votes boost), a combinatorial calculation yields a total of $\binom{3}{1} + \binom{3}{2} \cdot 2! + \binom{3}{3} \cdot 3! = 15$ different sequences, which we will denote explicitly from now on by the numerical sequence enclosed by braces.
The whole process is developed as depicted in Figure 3. Firstly, taking the amateur-labelled data as input, the method tests all hybrid transformations, where the modified scores of one transformation work as input to the next transformation in the sequence. A subset of expert classifications allows for the parameter optimisation and for assessing the sequences and ranking them in terms of their quality, using an adequate metric. At the end of the process, the ranking provides a set of improved scores (Refined Data) for their later use in obtaining final classifications for the objects classified by the crowd of amateurs.
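The enumeration of the 15 sequences can be sketched as follows; the `stages` mapping from tags to transformation callables is a hypothetical placeholder:

```python
from itertools import permutations

def all_sequences(tags="123"):
    """Enumerate every non-empty ordered sequence of distinct transformations:
    C(3,1) + C(3,2)*2! + C(3,3)*3! = 3 + 6 + 6 = 15 sequences in total."""
    seqs = []
    for r in range(1, len(tags) + 1):
        seqs.extend(permutations(tags, r))
    return ["{" + "".join(s) + "}" for s in seqs]

def apply_sequence(scores, sequence, stages):
    """Chain the stages of a sequence such as '{312}': the modified scores
    returned by one transformation feed the next one."""
    for tag in sequence.strip("{}"):
        scores = stages[tag](scores)
    return scores

print(len(all_sequences()))   # 15
print(all_sequences()[:5])    # ['{1}', '{2}', '{3}', '{12}', '{13}']
```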

Case study: Improving galaxy morphology classification with Citizen Science data
In this section, we illustrate the proposed method with a case study. We look upon the first edition of the Galaxy Zoo (GZ1) project [32], taking the data produced during the run of this project. First, in Section 4.1 we present the particular features of GZ1, concerning the running of the project and the  available data. After this, Section 4.2 introduces the two expert catalogues that allow for an assessment of the proposed approach. Then, we describe the experiments implemented for the testing of the method in Section 4.3, and finally we summarise and discuss the results in Section 4.4.

Galaxy Zoo
The GZ1 project constituted the very first successful implementation of a Citizen Science project using the Internet. For over a decade it has been bringing together myriads of little efforts from a huge community of amateurs committed to making a contribution to a classical astrophysical problem: the morphological classification of galaxies [24]. This long way has resulted in a list of publications that have represented great advances in astrophysical research [20], via the relaunching of the project in multiple editions as well. Since the first edition of the project, an application was made available on-line, by which any interested individual was able to sign up and start classifying galaxy images from the Sloan Digital Sky Survey (SDSS), one of the main databases of astronomical images compiled to date. GZ1 focused on disentangling the observed bimodality in galaxy morphologies that roughly divides the population between elliptical and spiral galaxies. The first launch caused a great impact, and after six months more than 100,000 volunteers had completed over 40 million classifications for a sample of nearly 900,000 galaxy images [31]. A sample of these images is shown in Figure 4. In the GZ1 project, participants were asked to classify galaxy images by choosing one of six categories: Elliptical, Clockwise Spiral, Anti-clockwise Spiral, Edge-on Spiral, Star / Don't Know, and Merger. Images shown held a common scaling of 423×423 pixels in order to provide a similar basis for all classifications [32]. In this edition, the classification was focused on the distinction between elliptical and spiral morphologies as main classes. However, there are multiple factors that complicate this classification problem. Whereas elliptical galaxies present spherical symmetry, spirals hold plane symmetry. Consequently, the orientation of the galaxy plays a fundamental role in the identification of its morphology. In addition, the quality of the image strongly depends on several factors such as the distance to the galaxy, and its physical size and brightness. This brings a huge multiplicity of grades of difficulty that is reflected in the uncertainty in amateur classifications.
At the time the project was closed, each image had received an average of ∼38 independent amateur classifications with a standard deviation of ∼14 votes, producing the amateur-labelled data of the problem. Then, the GZ1 team started analysing this data to evaluate the influence of biases in the classification task. This resulted in a thorough study by Bamford et al. [4], by which a (manual, expert) transformation of the scores obtained from amateurs' votes was developed. Referred to as the debiasing of the scores, it was intended to counter the tendency of classifying blurred images of spiral galaxies as elliptical.
As a result, the overall effect was to favour spiral classifications at the expense of elliptical ones. For this amendment, the three spiral sub-categories were joined, giving a combined spiral score (the addition of the Clockwise, Anti-clockwise, and Edge-on scores), which we will refer to as the Spiral score henceforth.
The GZ1 data was collected in a set of csv files and published. These files include the ID of the galaxy in the SDSS database, the location in the sky, the total number of votes received by the galaxy, the set of original scores for all categories, and the debiased scores for the main categories: Elliptical and Spiral. In addition, the GZ1 team provides final classifications, known as GZ1 flags. These are generated via a process that involves the application of a 0.8 threshold over the debiased scores. However, the debiasing of the scores required an additional parameter that was not available for the whole GZ1 dataset at the time. Therefore, the debiasing, and thus the GZ1 flags, were only computed for a portion of the GZ1 dataset. In the following, we will refer to this sample as the GZ1 subset, consisting of 667,944 galaxies.
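As an illustration of how this published data can be handled, the sketch below loads one of the csv files, reproduces the combined Spiral score, and applies GZ1-style flags. The filename and column names are placeholders assumed for illustration and should be checked against the published files.

```python
import pandas as pd

# Filename and column names are assumed placeholders; check the published files.
gz1 = pd.read_csv("GalaxyZoo1_data.csv")

# Combined Spiral score: the addition of the Clockwise, Anti-clockwise,
# and Edge-on sub-category scores, as described above.
gz1["p_spiral"] = gz1[["p_cw", "p_acw", "p_edge"]].sum(axis=1)

# GZ1-style flags: a 0.8 threshold applied over the debiased main scores.
gz1["flag_elliptical"] = gz1["p_el_debiased"] >= 0.8
gz1["flag_spiral"] = gz1["p_spiral_debiased"] >= 0.8
```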

Expert validation
To validate amateurs' performance through GZ1, the development team originally used two expert catalogues [32]. These two expert catalogues will operate as the ground truth needed for the comparison of results. On the one hand, the MOSES catalogue [36] includes 16,516 galaxies present in the GZ1 subset, all of them classified by a team of professional astronomers as elliptical.
On the other hand, the Longo catalogue [33] includes 25,190 galaxies, all labelled as spiral by another set of experts and part of the GZ1 subset as well. When both catalogues are compared, we find an overlap of 141 examples, which were removed for the consistency of results. After this adjustment is made, we take the joint expert catalogue, now composed of 41,424 galaxies from the GZ1 subset, which we will refer to as the validation subset. This part of the GZ1 data has both expert and amateur classifications, and is therefore used to validate the GZ1 flags. Also, as the available expert knowledge on the problem at hand, this subset plays a fundamental role in assessing the performance of our approach through the following experimental trials. From now on, we will take the validation subset as the ground truth of the problem.
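The catalogue sizes can be checked with a line of arithmetic: removing the 141 overlapping galaxies from each catalogue leaves the 41,424 galaxies of the validation subset.

```python
moses, longo, overlap = 16516, 25190, 141

# Remove the 141 galaxies present in both expert catalogues from each side:
validation_size = (moses - overlap) + (longo - overlap)
print(validation_size)   # 41424 galaxies in the validation subset
```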

Experimental setting
Here we present and explain the set of experiments executed for the testing of our approach. In the first place, we illustrate the performance of the three transformations taken independently (Section 4.3.1). After this, we test the hybridisation of the transformations over the GZ1 validation set (Section 4.3.2).
In GZ1 there are two sets of main scores: first, we have the original scores directly obtained from the final count of amateur votes, which we will refer to as raw scores. Also, we have the debiased scores obtained after the debiasing process explained above. These debiased scores serve as a comparison method proposed by the experts in [4], being a manual transformation of the raw scores. Here we consider both sets of scores independently for the evaluation of the experiments' results.
Similarly to the procedure followed by the GZ1 team, we apply a threshold over the scores to produce final classifications, measuring the accuracy (Acc) of the classified examples and the rejection rate (RR), the fraction of objects left unclassified, for a series of thresholds in the [0.5, 1.0] interval. Likewise, the application of this series over the scores enables us to check the trade-off between Acc and RR as the IU varies across the sample. That is to say: the higher the threshold, the larger the amount of uncertain galaxies, but the more accurate the classifications provided.
By using this set of thresholds, we compare the quality of the modified scores obtained after applying either a single transformation or any hybrid combination of them. This is done according to the expert validation explained above. To do this, we represent in an Acc-RR chart the (Acc, RR) points obtained for each of the thresholds in the [0.5, 1.0] interval. In addition, for the sake of making the comparison easy and quantitative, along with Acc and RR we consider a third metric: the Hypervolume (HV) [50] subtended by the set of (Acc, RR) points. Since we pursue a two-objective optimisation (we aim to maximise Acc and diminish RR), the HV enables a numerical comparison and ranking of different scores. For its calculation, we take as reference the optimum point (Acc, RR) = (1.0, 0.0).
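A minimal sketch of these metrics, assuming RR is the fraction of examples left unclassified by the threshold. Note one deliberate deviation: instead of the optimum point used in the text, the sketch takes the common worst-point reference (Acc, RR) = (0, 1), under which a larger HV indicates a better Acc-RR trade-off.

```python
import numpy as np

def acc_rr(scores, truth, threshold):
    """Accuracy over the classified examples and rejection rate, the
    fraction of examples whose top score misses the threshold."""
    top, pred = scores.max(axis=1), scores.argmax(axis=1)
    kept = top >= threshold
    rr = 1.0 - kept.mean()
    acc = float((pred[kept] == truth[kept]).mean()) if kept.any() else 0.0
    return acc, rr

def hypervolume(points, ref=(0.0, 1.0)):
    """Area of the Acc-RR region dominated by the points w.r.t. the
    worst-point reference (Acc=0, RR=1); larger is better."""
    pts = sorted(points, key=lambda p: (-p[0], p[1]))  # sweep from high Acc
    hv, rr_prev = 0.0, ref[1]
    for acc, rr in pts:
        if rr < rr_prev:                               # non-dominated point
            hv += (acc - ref[0]) * (rr_prev - rr)
            rr_prev = rr
    return hv

print(hypervolume([(0.9, 0.5), (0.8, 0.2)]))  # 0.69
```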

Hybridisation of transformations
After the testing of the single transformations presented in the previous section, in the following we explain the hybridisation of the transformations.
In order to extract and combine all the information present in the GZ1 data, here we propose a hybridisation strategy in two steps: (1) a preliminary trial using fixed parameter values, and (2) a second trial with parameter optimisation under a 70/30 validation scheme. In both steps we compute the whole set of 15 transformation sequences, from the single {1} to the triple {321}. In addition, since in GZ1 there are two primary scores, raw and debiased, we take both score types available and compute the whole set of transformation sequences over them. Hence, this hybridisation provides a total of 30 different sets of (Acc, RR) points to compare, after validating each of the final scores obtained with expert classifications (Section 4.2).
As a preliminary trial, we compute and rank the transformation sequences taking the same parameter values used in the previous section. We take α = 0.05 votes, β = 1.0 votes, and γ = 0.4, and restrict the application range to the interval (0.6, 0.9). Figure 10 shows this ranking of transformation sequences.
Following the parameter optimisation and the 70/30 validation, we complete a second trial computing the same set of hybrid transformations. Figure 11 shows the ranking obtained with this validation for parameter optimisation.

Discussion of results
We have completed two sets of experiments for the testing of the method.
Although the final goal is to obtain the best global transformation to be chosen amongst the set of hybrid sequences for the problem studied, the testing of the transformations alone illustrates how the method works. In broad terms, both experimental trials bring better trade-offs between Acc and RR with respect to the GZ1 benchmark (Table 6). In addition, classifications provided by the application of the proposed set of thresholds generally outperform the marks obtained by considering the original scores without modification. In the following, we highlight the most meaningful results in accordance with the experiments presented above:
• In GZ1 we have two sets of scores available, raw and debiased. As shown in Figures 5, 7 and 9, debiased scores reach better results compared with raw scores. This trend is maintained in the ranking of transformation sequences (Figures 10 and 11). The debiasing thus provides a useful manual transformation depending on the problem, which also enables here a comparison between our proposal and a comparable method developed by experts in the field.
• Taken independently, transformation {1} is the only one able to provide a simultaneous improvement of both raw and debiased scores. Transformation {2} worsens the Acc-RR marks for raw scores, and transformation {3} does not provide any improvement to the original scores, either raw or debiased. However, through the hybridisation process, it can be seen that these transformations do contribute to better marks when applied in combination.
These results confirm the potential behind this approach, which is able to find an adequate adjustment for the aggregation of information about the uncertainty present in the data, taking the form of either MU or LC, and hidden in the DK votes and the distribution of votes through the main categories, respectively.
This depends on the choice of metrics for the evaluation of results, and different metrics could lead to different optimal solutions. However, the results presented here show a wide margin of improvement using the proposed method, considering the state of the art of the problem, represented by the debiased scores computed by experts.

Conclusions and further work
In this paper, we proposed a novel approach for a better employment of the data generated in the course of Citizen Science projects that deal with classification problems. The main achievement of this approach is its ability to aggregate information about different types of uncertainty present in this sort of data: inherent uncertainty, due to the lack of consensus amongst participants that annotate the same example; uncertainty quantified by participants themselves and included as part of the data; and the uncertainty codified in the distribution of votes through the whole dataset for the main classes of the problem. Using this information, our method proposes three mathematical transformations that modify the original scores and a hybridisation of them that provides the best combined application in accordance with the available expert classifications for the problem. To test our approach, we have analysed as a case study one of the most representative Citizen Science projects to date, the Galaxy Zoo project.
We have presented two sets of experiments: the first one addresses the transformations alone, showing their performance in classifications generated using a threshold over the modified scores; the second implements the hybridisation of the three transformations, demonstrating the advantage of this procedure in exploring the most adequate blending of them depending on the problem at hand. As a result, the method has proven to enhance classification accuracy and diminish the amount of unclassified images, compared with an existing method and using expert classifications as ground truth.
For future work, we plan to extend this approach to more complex settings, such as projects involving classification problems with a large number of classes, or the aggregation of further information regarding, for instance, participants' and/or experts' expertise in the classification task. These frameworks will entail new analyses on the aggregation of this sort of data. Eventually, we aim to study the merging of all information available about the problem, pursuing the best results and utility of Citizen Science outcomes for science and research.