Galaxy Zoo Builder: Four Component Photometric decomposition of Spiral Galaxies Guided by Citizen Science

Multi-component modelling of galaxies is a valuable tool in the effort to quantitatively understand galaxy evolution, yet the use of the technique is plagued by issues of convergence, model selection and parameter degeneracies. These issues limit its application over large samples to the simplest models, with complex models being applied only to very small samples. We attempt to resolve this dilemma of"quantity or quality"by developing a novel framework, built inside the Zooniverse citizen science platform, to enable the crowdsourcing of model creation for Sloan Digitial Sky Survey galaxies. We have applied the method, including a final algorithmic optimisation step, on a test sample of 198 galaxies, and examine the robustness of this new method. We also compare it to automated fitting pipelines, demonstrating that it is possible to consistently recover accurate models that either show good agreement with, or improve on, prior work. We conclude that citizen science is a promising technique for modelling images of complex galaxies, and release our catalogue of models.


INTRODUCTION
Disc galaxies are complex objects, containing many different components, including a disc, disc phenomena (i.e. spiral arms, bars and rings) and central structures (bulges, bars). Decomposing disc galaxies into their component structures has become an important tool for extragalactic astronomers seeking to understand the formation and evolution of the galaxy population, ranging from analysing bulge and bar structure (Elmegreen & Elmegreen 1985, de Jong 1996, Gadotti 2011, Mendez-Abreu et al. 2016, Gao & Ho 2017, Kruk et al. 2018 to the secular evolution of disc galaxies (Lilly et al. 1998, Barden et al. 2005) and general galaxy assembly and evolution (Simard et al. 2002, Bamford et al. 2011, Lackner & Gunn 2012, Rampazzo et al. 2019.
These fully quantitative methods allow researchers to obtain structural parameters of galaxy sub-components, which has been used in a variety of astrophysical and cosmological research. For example, the stellar mass found in discs and bulges places strong constraints on the galaxy merger tree from ΛCDM N-body simulations (Parry et al. 2009, Hopkins web-browser environment2. We describe our method in Section 2, including details of the images and ancillary data from SDSS as well as the strategy used to obtain scientifically useful models from volunteer classifications. We provide consistency checks within our infrastructure and to other methods in Section 3 and discuss the efficacy and potential impact of our new method relative to existing methodologies in Section 4.

METHOD
This Section describes the Galaxy Builder project, and the sample of galaxies and syntetic images used in this paper to examine the project's output. The entire process is summarised in flowchart form in Figure 1, with appropriate sections referenced therein.

The Galaxy Builder Zooniverse project
Galaxy Builder is a citizen-science project built on the Zooniverse web platform. It asks volunteers to perform detailed photometric modelling of spiral galaxies (potentially including bulge, disc, bar and spiral arm components). A project of this kind, allowing volunteers to interact with and model data, had never been attempted inside the current Zooniverse web platform before, so this project involved designing and implementing a model rendering3 suite inside the existing Zooniverse front-end code-base. As with all citizen science solutions, we had to not only consider the accuracy of the resulting model, but also user experience and engagement in our design decisions.
The closest relative to this project within the Zooniverse ecosystem was the Galaxy Zoo: Mergers project (Holincheck et al. 2016). This project asked volunteers to help match the morphological properties of an image of merging galaxies to a plethora of restricted three-body simulations. Galaxy Zoo: Mergers required volunteers to download a Java applet to take part in model selection, while Galaxy Builder operates purely inside a web page.

Project Timeline and Development
The Galaxy Builder project was built inside the Zooniverse's (Simpson et al. 2014) Panoptes-Front-End4 codebase, using the React.js5 framework, as well as WebGL6 to enable low-latency photometric galaxy model rendering. Galaxy Builder entered a Zooniverse beta in late November 2017 and after some user experience improvements and code refactoring, the project was launched as an official Zooniverse project on the 24th of April 2018.
A major challenge during the development of the project was finding the right balance between keeping a simple and intuitive interface and workflow while also allowing the freedom and versatility to properly model galaxies. It was also a significant challenge to develop a compelling and simple tutorial for what is one of the most complex projects attempted on the Zooniverse platform. Feedback from expert users was essential to this process as part of the typical beta trial process for Zooniverse projects7.

The project interface
The Galaxy Builder project prompts volunteers to work through the step-by-step creation of a photometric model of a galaxy (described in detail in Section 2.4). A screenshot of the interface can be seen in Figure 2, where a residual image is shown. The interface presents a volunteer with three views, which they can switch between at any time: a r-band cutout image of a spiral galaxy (see Section 2.2), the galaxy model they have created so far, and the residual between their model and image (shown in blue and yellow).
The workflow is designed so that volunteers slowly subtract increasing amounts of light from the galaxy, as is illustrated in Figure 3. A tutorial is available that contains a step-by-step guide to completing a classification. At each step, volunteers are asked to first draw a simple isophote, and then make use of a series of sliders to adjust the parameters of the model component (see Section 2.4 for more information).  Figure 1. Flowchart detailing the entire Galaxy Builder process, from image creation, through classification collection using the Zooniverse, to model aggregation and fitting. Processes, manual input, data inputs and exports, and document exports are displayed distinctly. Colours distinguish between component-specific processes (disc in blue, bulge in orange, bar in green and spiral in red). Black nodes relate to the galaxy as a whole. Figure 2. The Galaxy Builder interface. The residual image is being shown, and the volunteer is on the "Disc" task. The drawn disc component (yellow) is offset from the galaxy image (blue) to demonstrate the positive and negative residuals. Where the image equals the model the residual is black. The dots below the residual image allow the user to switch images. The icons to the right allow panning and zooming of the image (rotation was not functional for this project). The icons to the bottom right of the image allow colour inversion of the galaxy cutout, flagging of the image as inappropriate, inspection of galaxy metadata (i.e. sky position, link to SDSS SkyServer), ability to save the image as a favourite and to add to a Zooniverse "collection". The Score shown in the bottom left of the image is calculated using Equation 1 and is a rough goodness-of-fit measure.
Volunteers are also guided by a "score", which is tied to the residuals and chosen to increase from zero to some arbitrary value depending on the galaxy; a less noisy and more easily modelled galaxy will have a higher maximum score. To map a residual image to a final score shown to volunteers we used where N is the total number of pixels, y is the cutout of the galaxy, normalized to a maximum value of 1 (y = cutout/max(cutout)), M is the model calculated by volunteers and A = 300 is an arbitrary choice of scaling chosen based on a handful of test galaxies. This score has the advantage of being easy (and fast) to generate from the residual image shown to volunteers (which was Arcsinh-scaled in a manner described by Lupton et al. 2004), however, it is overly sensitive to small deviations of the model from the galaxy.

Sample Selection: Images and Ancillary Data
As a proposed solution to the problem of fitting multi-component and complex galaxies, Galaxy Builder finds a niche with a sample of disc galaxies with spiral features. One such sample is the stellar mass-complete sample in Hart et al. (2017), which is a sample of relatively face-on spiral galaxies (b/a > 0.4) with and without bars and selected to be complete across stellar masses 9.45 < log(M /M ) < 11.05. The test sample we use for the Galaxy Builder project was therefore selected from the Hart et al. (2017) sample of relatively face-on spiral galaxies.
The morphological information required to select spiral galaxies came from the public data release of Galaxy Zoo 2 (Willett et al. 2013, hereafter GZ2). Each response to a GZ2 morphology question is allocated a p value ranging from 0 to 1, where 0 indicates no volunteers responded positively to that question and 1 indicates all volunteers who classified that galaxy responded positively (i.e. p bar = 0.5 would indicate 50% of volunteers said a bar was present in a galaxy). Photometric measurements used for selection came from the NASA-Sloan Atlas (Blanton et al. 2011, hereafter NSA). The stellar mass complete sample is constructed using the set of criteria detailed in Table 1.
The stellar mass-complete sample was split into smaller sub-samples, each containing 100 galaxies. In an iterative process, each sub-sample was chosen to contain the 60 lowest redshift unclassified galaxies, and 40 random unclassified galaxies. This was done to ensure we would have an early sample to work with given the a priori unknown rate at which volunteers would provide classifications. Due to time constraints, classifications were only collected for two unique sub-samples. The mass-redshift relation of galaxies in the stellar mass-complete sample from Hart et al. (2017) can be seen in Figure 4, with galaxies present in this work highlighted in red. Stellar Masses were calculated by Mendel et al. (2014).
In the first two sets of 100 galaxies, 1% of galaxies (i.e. 2 images) failed to run through the image preparation process, due to an error when attempting to montage multiple frames. The root cause of this error is unknown, but it  Hart et al. (2017) to create the stellar mass-complete sample of 6222 spiral galaxies.

Description Value
Face-on spiral morphological selection. GZ2 p features × p not edge on × p spiral ≥ 0.5 Redshift limits. 0.02 < z < 0.055 Relatively face-on galaxy selection using g-band axial ratio.
(b/a) g > 0.4 Mass limits for rough volume limited sample.
9.45 < log(M * /M ) ≤ 11.05 Mass limits for complete sample a  . Redshift against total galaxy stellar mass for all galaxies in the stellar mass-complete sample, with the 198 galaxies considered in this paper highlighted in red. The distribution of stellar masses is shown in the right panel for the total sample and for the galaxies considered here. It is evident that the galaxies for which we collected classifications are not complete in stellar mass, but it is possible to select a further subset that would be.
leaves a sample of 198 galaxies with images (the test sample, 98 of which are repeated in a validation subset) that are considered in this paper, in order to explain the method used and test the reliability of the models obtained.

Image and modelling metadata extraction
The galaxy data shown to volunteers in the Galaxy Builder project came in two forms: A grey-scale image cutout of the galaxy and a JSON file containing rendering information for the web-interface.
Both forms of data were obtained using a similar process: 2. This montage was cropped to four times the Petrosian radius of the galaxy.
3. The SExtractor software (Bertin & Arnouts 1996) was used to identify regions containing secondary sources (foreground stats, other galaxies) and generate a mask.

4.
A PSF was obtained from the relevant Sloan r-band psField file, extracted at the central position of the galaxy (Stoughton et al. 2002).
5. The JSON file was written containing the cut-out data and the 2D boolean mask obtained from the source extraction process. This file also contained other metadata needed for the rendering process (PSF, the size of the PSF array, and the width and height of the image).
6. Another JSON file containing simply the information used to render the volunteer's model (image size and PSF) was created.
7. An arcsinh-stretch was applied to the masked cutout (as described by Lupton et al. 2004). It was then saved as a grey-scale image.
The decision to use r-band images in our subject set was due to its higher signal-to-noise than other bands. Once a sub-sample had been created, the Zooniverse's panoptes-python-client8 was used to upload them as a subject-set to the Zooniverse.
The reprojection performed by Montage has a smoothing effect on the data, and thus does not conserve errors. We, therefore, create a separate stacked image, sigma image and corresponding pixel mask, using the same r-band corrected frames present in the montage. These images were not shown to volunteers but were used for model fitting and comparison.

Choice of Retirement limit
The number of independent answers needed to create reliable and reproducible aggregate classifications was not known at the start of this project. An initial experiment with collecting 10 classifications per galaxy demonstrated that this was insufficient; further experimentation with a diverse range of galaxy types (most with prominent spiral features including grand-design and flocculent arms) revealed 30 classifications per galaxy was sufficient.
The entire test sample of 198 galaxies was then presented to users, with 30 classifications collected per galaxy. In addition, one of the subsets was presented a second time, thus providing a validation subset to measure consistency between sets of 30 classifications on the same galaxies.
We also created 9 synthetic images of galaxies, containing various combinations of components available to volunteers and a spread of possible parameters. These synthetic galaxies were based off of a set of target galaxies from Galaxy Builder and designed to be as realistic as possible, including the addition of realistic noise and pixel masks. This set of synthetic images is shown in Figure 5 and was used to calibrate our aggregation and fitting methodology and thus is referred to as the calibration subset.

The Galaxy Model
Our chosen galaxy model was largely based on components described in Peng et al. (2002). The modelling code ignores masked regions identified as secondary sources by SExtractor. It over-samples the bulge, disc and bar components by a factor of five and performs PSF convolution using a PSF obtained from the relevant Sloan r-band psField file, extracted at the central position of the galaxy (Stoughton et al. 2002). The model created by a volunteer could be chosen from 1. One exponential, ellipsoidal disc.
2. One ellipsoidal Sérsic bulge, with n chosen by volunteers.
3. One Sérsic bar with a "boxiness" modifier (as described in Peng et al. 2002), with n and c chosen by volunteers.
4. Any number of freehand poly-line9 spiral arms, as described below.

Spiral Arm Model
Each spiral arm is modelled using a poly-line drawn by the volunteer. The brightness of a spiral arm at any point is given by the value of a Gaussian centred at the nearest point on any drawn poly-line, with volunteers able to choose the Gaussian width and peak brightness using sliders. Radial falloff was added by multiplying by the value of the previously added exponential disc, though volunteers could change the half-light radius of this falloff disc. Figure 5. Arcsinh-stretched images of the synthetic galaxies present in the calibration subset. These galaxies were designed to look as realistic as possible, while being described perfectly by the model available to volunteers.

Classification Aggregation Methodology
In this Section, we will use the galaxy UGC 4721, a two-armed barred spiral galaxy at z = 0.02086 classified by de Vaucouleurs et al. (1991) as SBcd, to illustrate the data reduction and aggregation methodology. For UGC 4721 we received 32 classifications, containing 28 discs, 24 bulges, 17 bars and 47 drawn spiral arm poly-lines (four classifications did not contain spirals, seven contained one spiral arm, fourteen contained two arms, six contained three arms and one contained four arms). These annotations can be seen in Figure 6, overlaid on the greyscale r-band image of the galaxy.

Aggregation of Volunteer Models
Aggregate model calculation was done on a component-by-component basis, rather than per classification, i.e. clustering of discs was performed independently to that of bulges, bars and spirals. We did not take into account any slider values, only the shape drawn by the volunteers. Disk classifications were doubled in effective radius to correct for a systematic error in disk size observed in the classifications received for the calibration subset. Model parameters were restricted to be within the limits shown in Table 2 (deemed to be the physically acceptable bounds). All components  were transformed from the coordinate space of the Montage-created images to the more accurate stacked images created for fitting. Clustering was performed using the Jaccard distance measure (also known as the intersect-over-union distance, or IOU distance), which is a simple metric determining the relative shared area of two sets: The algorithm chosen to perform clustering was the Density-Based Spatial Clustering of Applications with Noise (DBSCAN, Boonchoo et al. 2018) algorithm, due to its robustness and speed. We made use of Scikit-learn (Pedregosa et al. 2011) to implement the algorithm. In DBSCAN the core of a cluster is defined as a group of at least N min items that are all within a distance of each other. Additionally, any points within a distance of a cluster's core are also associated with the cluster. Note that some parameters were allowed to overflow when fitting, for instance an axis ratio greater than 1 (signifying a swap of major and minor axis) was allowed, and corrected for once fitting reached completion. This helped avoid the optimizer encountering parameter bounds and failing to converge. Component position angle (ψ) and spiral pitch angle (φ) were similarly unconstrained.

Disc, Bulge and Bar Clustering
Component Parameter Tuning Minimum Bound Tuning Maximum Bound We select the disc clustering hyperparameters such that a disc is clustered for all galaxies, and the bulge hyperparameters to most successfully recover the morphology of galaxies in the calibration subset. The value of used to cluster bars was tuned such that the aggregate model best agreed with GZ2 p bar (p bar < 0.2 implying no bar and p bar > 0.5 implying a definite bar). The values chosen for were 0.3, 0.4, 0.478 for the disc, bulge and bar; N min was set to 4 for all three of these components.
We define the aggregate component to be the shape that minimises the sum of Jaccard distances to each of the members of the cluster. For our example galaxy, UGC 4721, clustered and aggregate components can be seen in Figure  7.

Spiral Arm Clustering
To cluster drawn spiral arms, we define a custom separation measure to represent how far away one poly-line is from another. This measure was chosen to be the mean of the squared distances from each vertex in a poly-line to the nearest point (vertex or edge) of another poly-line, added to the mean of the squared distances from the second poly-line to the first. We make use of this separation measure inside the DBSCAN algorithm to cluster these drawn lines, after removing any self-intersecting drawn arms (as this was deemed an easy method to filter out "bad" classifications). Values of 0.001 and 4 were used for the and min_samples hyper-parameters respectively.
Once spiral classifications on a galaxy have been clustered into the physical arms they represent, the points are deprojected using the axis ratio and position angle of the aggregated disc. The deprojection method assumes a thin disc and stretches the ellipsoidal minor axis to match the major axis.
Deprojected points within each drawn poly-line are converted to polar coordinates and unwound to allow model fitting. These unwound points are then cleaned using the Local-outlier-factor algorithm (LOF, Breunig et al. 2000). For each drawn poly-line in the cluster, the LOF algorithm was trained on all points not in that arm, and then used to predict whether each point in the arm should be considered an outlier. In this way, we clean our data while respecting its grouped nature. The points removed as outliers for the example galaxy are shown in the bottom right panel of Figure 7.
For each arm cluster in each galaxy, a logarithmic spiral model was fitted using Bayesian Ridge Regression, performed using the Scikit-learn python package. A logarithmic spiral was chosen due to its simple form with a constant pitch angle. Hyperpriors on the noise parameter were chosen by fitting a truncated gamma distribution (Zaninetti 2014) to the spiral width slider values returned by volunteers (ignoring sliders left at the default or moved to the extremes of allowed values). Any logarithmic spirals within a distance of 0.0005 (given by the clustering metric) were deemed to be from the same arm and thus their classifications were merged and a log-spiral recalculated.
We do not assume that every arm in a galaxy has the same pitch angle. To obtain a single value for the pitch angle of a galaxy, we take the length-weighted average pitch angle of all arms detected in the galaxy (as used by Davis & Hayes 2014).
The galaxy model for UGC 4721 obtained through aggregation can be seen in the bottom left panel of Figure 8.

Error Estimation of Aggregate models
As all components in a cluster can be viewed as volunteers' attempts at modelling the true underlying component, the sample variance of the parameters of these shapes can be used as a measure of confidence in the parameters present in the aggregate result. These are highly sensitive to clustering hyperparameters, and are only valid for a component's position, size and shape. Figure 7 illustrates the variance in clustered shapes for our example galaxy (UCG 4721); we see a large variation in the clustered discs, and much closer agreement on the bulge and bar size and shape.

Model Fitting
The final step in creating Galaxy Builder models is a numerical fit to fine-tune parameters. This fitting was performed using the L-BFGS-b algorithm (Byrd et al. 1995), implemented in Scipy (Jones et al. 2001). We minimize a custom likelihood function that assumes Gaussian error on pixel values and incorporates the priors on parameters we obtain from clustering. The full fitting model and likelihood is detailed in Appendix A. We use the same model as used by volunteers in the online interface (with altered limits), with spiral arms restricted to being logarithmic spirals relative to the disc, and without the ability to change the relative falloff of spiral arms.
The model rendering and fitting code was written up using Google's JAX package (Bradbury et al. 2018), which allows GPU-optimization and automatic gradient calculation, enabling quick and accurate calculation of the jacobian matrix needed for the L-BFGS-b minimization algorithm.
We initially fit only for the brightnesses of components, and then simultaneously for all free parameters of all components. The result of the fit, including the final photometric model for UGC 4721, can be seen in 8. The secondary components have been accounted for well, and the model has a sensible reduced chi-squared value of 1.176, where we have approximated degrees of freedom as the number of unmasked pixels present in the galaxy image (similar to Galfit).
We use the errors described in Section 2.6 as parameter uncertainties, as we feel an approach based on the local curvature of the likelihood-space (as used by Galfit) would likely fall foul of the issues described in the introduction and thus be an under-estimate. This decision means we do not have uncertainties for some parameters.
We remove two models for which a fit did not converge.

RESULTS
In this Section we present Galaxy Builder models for 198 galaxies, from the aggregation of user classifications (aggregate models), and with parameters fine-tuned by a numerical fit (fitted models). We explore the consistency with which volunteers modelled galaxies, the accuracy of the aggregate models, and compare the aggregate and fitted models to comparable results in the literature.

The Calibration Set
The calibration subset was a set of nine synthetic galaxy images created from Galaxy Builder models, which were then re-run through the Galaxy Builder process. These galaxies were used to fine-tune clustering and fitting hyperparameters (See Section 2.5.1), as the ground truth was known. Our ability to recover morphology accurately is essential validation for our ability to recover good photometric models of galaxies.
The scatter between true and measured parameters is shown in Figure 9; these results highlight the importance of good priors to obtain accurate fits of complex photometric models. In more detail, the models recovered for the nine synthetic galaxy images demonstrate that: 1. Model parameters were generally recovered to a high degree of accuracy 2. We successfully recover all spiral arms present, and do not receive any false positives The spiral pitch angles obtained through aggregation vary by < 9°from the true values, with fitting improving this error slightly.
3. Volunteers systematically use elongated bulges to model bar components. This resulted in two false positives for bulge presence in the aggregate models. This feature (switching light between model components) is a common issue in all photometric fitting methods (Kruk et al. 2018).
4. The Jaccard metric is unstable to small changes in rotation for highly elliptical components (i.e. bars). This resulted in one false negative of bar presence in the aggregate model.
The fitting step for this subset of images highlighted the benefit of obtaining a rough starting point through clustering of user classifications; the method struggled to recover structural parameters for which we did not obtain such a starting point (Sérsic index and bar boxiness). These parameters are difficult to identify using gradient descent (Lackner & Gunn 2012), suggesting future work should attempt to obtain priors on these parameters from volunteers and make use of a more robust fitting algorithm.

Examination of Volunteer consistency
We aggregate two independent models for a set of 98 galaxies based on "original" or repeat ("validation") classifications, obtained with the same retirement limit (see Section 2.2 for more on this selection).
One of the simplest choices the volunteers have is whether to include a model component or not. Figure 10 illustrates the consistency with which volunteers made use of a component in their model for a galaxy. We see that volunteer classification is very consistent, with scatter as predicted by the Binomial uncertainty on the mean. Volunteers almost always make us of a disc and bulge (as seen in the calibration subset), and bulge, bar and spiral arm usage is consistent  within Binomial error. One common challenge is that some volunteers used a very ellipsoidal bulge and the ends of spiral arms to model light that other users modelled with a bar. This caused some scatter in aggregate models.
In the end, the aggregated validation model is identical to the original aggregated model in around 40% of galaxies. The most common changes are a missing bar component or a missing single spiral arm. This may suggest that more than 30 classifications should be collected per galaxy, or could be an artefact of the lack of consensus among volunteers for galaxies with difficult-to-determine components.
After selecting a component, the volunteer sets its shape and size. The variation in axial ratios and effective radii for the aggregate discs, bulges and bars are shown in Figure 11. The aggregate discs and bulges are consistent within errors, however, bars show more scatter. Bars are one of the most challenging components to aggregate consistently. This is partly because even a strongly barred galaxy with 30 classifications overall might receive only 15 or so drawn bars, and lower numbers of classifications result in more scatter. In addition, the aggregation method is more sensitive to rotation of highly elongated shapes. Both factors probably contribute to lower consistency in bar components.

Comparison to results in the literature
After having aggregated and fitted models for our galaxies, we examine how our models compare to other results in the literature. Part of the motivation for exploring the Galaxy Builder method was that there exists no published large sample of galaxies with four-component photometric fits. This means we can only make comparisons for individual or subsets of model components (e.g. just disc and bulge) and by design Galaxy Builder models will differ as we have  Figure 11. Comparison of component shape in aggregate models between the original and validation sets. Errors are obtained through the sample variance of clustered components, as detailed in Section 2.6. We see close agreement between aggregate components from the two sets, suggesting that the clustering method is robust to the scatter in classifications.
attempted to fit bulge-disc-bar-spiral models to all our galaxies. The reader is therefore cautioned against treating literature models as any kind of "ground truth" since deviation from these simple models is part of the goal of this project. We provide these comparisons not to check how well our models work, but to provide data on how they compare with other well known, but much simpler photometric models.

Comparison to Galaxy Zoo morphology
The simplest comparison we can make to external results is to examine whether our models respect the existing morphological classifications present in the literature. We make use of Galaxy Zoo 2 (Willett et al. 2013, hereafter GZ2) results, including the redshift debiasing described in Hart et al. (2016) and spiral properties calculated in Hart et al. (2016).
When comparing the probability of a volunteer's classification containing a bar component against a galaxy being classed as strongly-barred or as having no bar (as defined in Masters et al. 2010), we see reasonable agreement. Classifications of GZ2 strongly-barred galaxies (p bar > 0.5) are more likely to contain a bar than GZ2 unbarred galaxies (0.47 ± 0.15 vs. 0.29 ± 0.11). While there is some overlap in these probabilities, the Pearson correlation between GZ2's p bar and the bar likelihood in Galaxy Builder is 0.56, implying a significant correlation. We also note that GZ2 bar classifications exclude most weak bars (Kruk et al. 2017).
We also compare the number of spiral arms aggregated by Galaxy Builder with the responses to the GZ2 "number of arms" question (of which the possible responses were one, two, three, four, more than four or "Can't tell"). We attempt to account for the spread in volunteer answers to this question by binning responses, rather than using the mean or modal response. The results of this comparison can be seen in Figure 12. The area of each circle can be seen as the level of agreement between Galaxy Builder aggregate models and GZ2 classifiers, it is defined as Can't tell 1 2 3 4 4+ Galaxy Zoo 2 number of spiral arms where n k is the number of aggregate arms for galaxy k (out of N g galaxies), C k,m is the m-th answer for galaxy k (out of M k answers).
The circle with the largest area for each possible GZ2 response is highlighted, and agrees with the number of spiral arms aggregated here for m = 1, 2, 3, 4. No aggregate model contained more than four spiral arms, and when galaxies have an uncertain number of spiral arms (the "Can't tell" GZ2 response) we mostly do not aggregate any spiral arms.
It is not uncommon in Galaxy Builder for one spiral arm to have been broken into two smaller segments. We also occasionally identify two distinct clusters that represent the same physical arm. These two reasons account for a majority of cases where GZ2 classifications suggest a galaxy has two spiral arms and we have clustered a larger number. Improved project user experience would be crucial in correcting these errors.

Comparison to One-component fit -axis ratio
We compare the axis ratios of the discs of Galaxy Builder aggregate models (without fitting) to the axis ratio of a 2D Sérsic fit to the r-band SDSS image of each galaxy (as provided in the NSA catalog, Blanton et al. 2011). The resulting scatter is shown in Figure 13; for these untuned models there is an error of ∼ 0.1, consistent with our expected errors (derived in Section 2.6).
We observe a clustering of outlying values around b/a = 0.5. This is almost certainly due to the drawing tool ellipse having a default axis ratio of 0.5. Where this default is a "good enough" fit we hypothesise that volunteers are less likely to modify it, while if it needs to move a long way they find a more refined value. Overall we see that 36% of all Galaxy Builder disc axis ratio Figure 13. Difference between the axis ratios of the aggregated disc component (before fitting) to the results of an r-band Sérsic profile fit. Points between one-and two-sigma are highlighted as orange squares, points outside 2σ are shown as red stars. While the overall relationship is good, the increase prevalence of points outside 2σ is a clear indication of bias caused by the Galaxy Builder online user interface. disc components drawn by volunteers were left at the default axis ratio. We recommend that future projects should carefully consider their interface design to minimize this bias (e.g. forcing volunteers to draw both the major and minor axis), however, the fitting process we implement on the aggregate models successfully removes the bias, and the overall scatter does not change significantly. As we account for light in spiral arms and bars, we expect that disc axis ratios fit by Galaxy Builder should be more physical than those from models that do not account for how these non-axisymmetries can bias measurements of ellipticity.

Comparison to Disc-Bulge models
A strong motivation for performing multi-component modelling is the desire to measure the fraction of a galaxy's light being emitted by its central components (such as bulge fraction, defined as the ratio of bulge luminosity to total luminosity). Gao & Ho (2017) demonstrate that modelling secondary central components is essential for recovering an accurate measure of bulge fraction. The difficulty of measuring bulge fraction is further compounded by the complex Exponential + Sérsic (Simard, 2011) 0.00 0.25 0.50 0.75 1.00 Bulge fraction Exponential + Exponential (Lackner, 2012) 0.00 0.25 0.50 0.75 1.00 Bulge fraction Exponential + De Vaucoulers (Lackner, 2012) Figure 14. Scatter plots comparing the ratio of flux from central components (bulge and bar) to the total flux between fitted models from Galaxy Builder and two-component models in the literature. Our models are broadly consistent with their results, but should be more accurate for complex galaxies, as we account for galaxy bars.
degeneracies present in even two-component fits, meaning that many gradient-descent based solvers often fail to find the globally optimum solution (Robotham et al. 2016), especially when bulge Sérsic index is a free parameter. One of the largest catalogues of 2D multi-component fits is Simard et al. (2011), which performed simultaneous, twobandpass decompositions of 1,123,718 galaxies in the Legacy area of the SDSS DR7 using Gim2D. Three variations of models were fitted: a pure Sérsic model, an Exponential disc and de-Vaucouleurs bulge model (hereafter exp+deV), and an Exponential disc and a Sérsic bulge model (exp+S). Fitting was performed using the Metropolis algorithm, which is resilient to local minima and therefore suitable for the complex likelihood space of galaxy photometric modelling. Lackner & Gunn (2012) similarly fitted two models to SDSS main-sample galaxies: an exponential disc and exponential bulge (exp+exp), and an exponential disc and de Vaucouleurs bulge. They used a Levenberg-Marquardt gradient descent algorithm, with initial parameters taken from previous SDSS analysis.
We compare our central component fraction (the flux of the bulge and bar relative to the total model flux) to bulge fraction from Simard et al. (2011) where their analysis indicated genuine bulge+disc systems (P pS ≤ 0.32). We compare to Lackner & Gunn (2012) bulge fractions only when their model selection criteria determined that model was the best-fit model. We see a strong correlation with significant scatter (Figure 14). The relationship to exp+deV models appears to be less than 1:1, while the relationship to exp+exp models is greater than 1:1, highlighting the dependence of bulge fraction on Sérsic index. Taking Galaxy Builder results as ground truth implies that exp+deV puts too much light into the bulge, while exp+exp puts too little.
The amount of scatter (and lack of consistent 1:1 relationships) between bulge fractions between any two of the published two-component models is comparable to the scatter we see between any one of them and our more complex model. Bulge fractions for complex multi-component galaxies fit with any method should be used with caution.
Another comprehensive catalogue of 2D two-component fits is that of Meert et al. (2015), who fit identical models to Simard et al. (2011) on ∼ 7 × 10 5 galaxies imaged by SDSS, using Galfit and PyMorph (Vikram et al. 2010). They made use of a set of logical filters to distinguish between model fits, allowing them to identify cases where the model did not converge to a physically meaningful result. There is an overlap of 86 galaxy builder galaxy models with their "intermediate catalogue", and we see some scatter between measured parameters (see Figure 15). The modelling of spiral arms does not appear to impact measured disk parameters, with disk size and ellipticity showing strong agreement between the catalogues. We see significant scatter in bulge Sérsic index, especially when a bar is present. Total luminosity is not strongly affected by the addition of detail to the model. Kruk et al. (2018) performed multi-component (up to three), multi-band decompositions of a selection of SDSS galaxies, 23 of which were also classified in Galaxy Builder (with 16 in the repeated validation subset). Figure 16 compares the axis ratios and effective radii of bulges, discs and bars in Kruk et al. (2018) to those present in the fitted models. We see good consistency in effective radii of all components in the majority of galaxies. There is more scatter in the fit axis ratios of components. In particular, we observe many of the Galaxy Builder bulges reaching the imposed lower boundary. Comparing the central component fraction between Galaxy Builder models and those in Kruk et al. (2018), we see next to no scatter.  Meert et al. (2015, x-axis) and Galaxy Builder (y-axis). We note that adding spirals to a model does not strongly impact disc parameters, but the presence of a bar has a significant impact on bulge Sérsic index measurement.  Kruk et al. (2018). Discs, Bulges and Bars are shown as blue circles, orange stars and green squares respectively. The left panel compares components' effective radii, the right panel compares the component axis ratio. The components match well, with bulges showing the most scatter. Bulges in Galaxy Builder fit models often get stuck at the lower allowed value, despite the physically motivated initial conditions.

Comparison to Disc-Bulge-Bar-Spiral models
To the best of our knowledge, no photometric models exist for the Galaxy Builder sample that contain spiral arm structure. The closest comparable result is that produced by Gao & Ho (2017), however, the galaxies they used are not in the Sloan footprint.
In order to provide a comparison for our novel method of spiral parameter (pitch angle and amplitude) extraction, we compare the result of our galaxy length-weighted pitch angles to the relationship obtained by Hart et al. (2016) between GZ2 classification and galaxy pitch angle. Their fit was obtained by using the Zooniverse project Spiral Spotter to filter good vs bad spiral arm segments identified using an automated spiral arm detection and fitting tool, SpArcFiRe (Davis & Hayes 2014), whereas Galaxy Builder asks volunteers to provide their own opinion on spiral arm number, location and tightness. Galaxy Builder pitch angles are within the (large) uncertainties on the Hart et al. (2016) fit.
Many researchers (Davis & Hayes 2014, Díaz-García et al. 2019 to name a few) have noted that many galaxies show large inter-arm variations in pitch angle, suggesting that obtaining a single value of a galaxy's pitch angle is highly dependent on which arms have been identified. We plan to further explore this issue in future work.

SUMMARY AND CONCLUSIONS
In this paper, we present a novel method for modelling of galaxy images, Galaxy Builder, which was conceived with the goal of solving the "quality or quantity" dilemma facing galaxy image modelling, which, despite advances in computation, still typically requires significant human interaction to achieve quality fits. In future work, we use this sample to investigate spiral arm formation mechanisms.
Galaxy Builder leverages the power of crowdsourcing for the hardest to automate parts of image fitting, namely determining the appropriate number of model components to include, and finding regions of parameter space close to the global optima.
The use of a small sample of synthetic images to calibrate and test our model clustering and fitting code has demonstrated our ability to recover galaxy morphology in the majority of cases. For example, our spiral arm fitting recovered spiral pitch angles to within 9 deg. This set of 9 synthetic images revealed a systematic tendency for volunteers to incorporate more bulges and fewer bars than necessary for photometric models of strongly barred spirals. Future work might implement an improved clustering algorithm and an improved user interface to address the failures of bar model clustering we observed in a small fraction of galaxies.
Some parameters are not recovered well (bulge and bar Sérsic n, bar boxiness), we hypothesise that this is because a wide range of values fit the light profile well. As a result, we are unable to obtain reliable physical results with our optimization algorithm (gradient descent-based methods are subject to being trapped in local minima, or not converging for parameters with flat likelihoods). A solution to this would be performing a full Bayesian optimization with priors obtained from volunteer input, or using a more robust algorithm (such as Basin-Hopping; Wales & Doye 1998). This work is beyond the scope of the current study.
We have demonstrated our ability to obtain physically motivated models with comparable reduced chi-squared values (between 1 and 5) to results in the literature. We obtain errors on parameters where possible through the sample standard deviation of component clusters, which is less likely to be an under-estimate than approximations using the local curvature of the Likelihood-space.
We compare these new models to existing results in the literature. We find good agreement where the models or parameters are comparable, and suggest that where differences are found, Galaxy Builder should generally provide superior models because of the more realistic modelling of the galaxy morphologies.
Upcoming survey missions such as LSST (Ivezić et al. 2019) and Euclid (Laureijs et al. 2011, Amiaux et al. 2012) present a rich source of astrophysical data. However, the approach detailed in this paper will not be sufficient to deal with the volume of galaxies these surveys will image (twenty billion and two billion respectively, though a large proportion of these will not benefit from detailed photometric modelling). Tools such as Galaxy Builder may serve an important role in the generation of training catalogues for scalable machine learning techniques, in an analogous manner to that currently employed for visual morphological classification in Galaxy Zoo: Enhanced (Walmsley et al. 2020).
We were able to obtain aggregate models for 296 images with an average rate of one galaxy per day, and fit photometric models for 294 images. At the time of writing and to the best of our knowledge, the number of photometric models obtained here is still significantly larger than the largest sample obtained through purely computational pho-tometric fitting of a disc, bulge, bar and spiral arms in galaxies (10 galaxies, Gao & Ho 2017, who also included rings, disc-breaks and further components).
The software used to generate image cutouts; perform clustering and aggregation of volunteer models, and fit photometric models is available under a GNU general public licence on GitHub10. We hope that publishing this code with the paper promotes transparency and accountability in astrophysical software development. All models created as part of the Galaxy Builder project will be available on the Galaxy Zoo website11.
Any citizen science project is only as good as the volunteers who generously donate their time to it. We were incredibly fortunate to be able to make use of the wonderful pool of volunteers built by the Zooniverse, who in some cases contributed hundreds of detailed galaxy classifications to this project. We are optimistic about the potential of projects like Galaxy Builder to dramatically increase the ability of researchers to perform complex, labour-intensive modelling of galaxy photometry, leveraging the power of the crowd to perform the complex tasks best suited to humans, and computer algorithms for the final optimization.

ACKNOWLEDGEMENTS
We would like to thank our referee for their input; suggesting numerous ways to improve the clarity and content of this paper, and providing avenues of discussion that we had not previously covered. We would also like to thank Ross Hart for providing the target catalogue for the stellar mass-complete sample.
This publication made use of SDSS-I/II data. Funding for the SDSS and SDSS-II was provided by the Alfred P. Sloan Foundation, the Participating Institutions, the National Science Foundation, the U.S. Department of Energy, the National Aeronautics and Space Administration, the Japanese Monbukagakusho, the Max Planck Society, and the Higher Education Funding Council for England. The SDSS Web Site is http://www.sdss.org/.
This publication uses data generated via the Zooniverse.org platform, development of which is funded by generous support, including a Global Impact Award from Google, and by a grant from the Alfred P. Sloan Foundation. We would also like to thank the 2,340 volunteers who have submitted classifications to the Galaxy Builder project, especially user EliabethB, whose presence on the Galaxy Builder forum on top of a large number of galaxies modelled has been a huge help.
Montage is funded by the National Science Foundation under Grant Number ACI-1440620, and was previously funded by the National Aeronautics and Space Administration's Earth Science Technology Office, Computation Technologies Project, under Cooperative Agreement Number NCC5-626 between NASA and the California Institute of Technology.
This project was partially funded by a Google Faculty Research Award to Karen Masters (https://ai.google/research/ outreach/faculty-research-awards/), and Timothy Lingard acknowledges studentship funding from the Science and Technology Facilities Council (ST/N504245/1).

APPENDIX
A. MODEL FITTING Assume Normal priors on component parameters determined from clustering (µ x , µ y , q, Re), with the spread given by the spread in the clustered values. We, therefore, have that our final log-likelihood (to be maximised) is the sum of the Gaussian log-likelihood of the residuals given the pixel uncertainty and the Gaussian log-likelihood of the variation in parameters, given their uncertainty.
The model being rendered is the PSF-convolved sum of the separate components and outputs an (N x , N y ) image. The disc, bulge and bar are variations on the boxy Sérsic profile: where r ( ì P) = 1 q 0 0 1 cos ψ − sin ψ sin ψ cos ψ ì µ − ì P c .

(A2)
The disc is resticted to n = 1; c = 2, bulge to n ∈ (0.5, 6); c = 2 and bar to n ∈ (0.5, 6); c ∈ (0.5, 6). The Sérsic components are actually rendered at 5x the image resolution, and downsampled using the mean pixel brightness. This is a widely used method of approximating the true pixel value, which is an integration over the area of sky inside the pixel: for a pixel of size (δ x , δ y ), Spiral arms were restricted to be logarithmic with respect to the inclined, rotated disc. They were rendered in a similar manner to the online interface; using the nearest distance from a pixel to a calculated logarithmic spiral.
An inclined, rotated log spiral requires parameters brightness I s , spread s, minimum and maximum θ (θ min and θ max ), an amplitude A, pitch angle φ, position ì µ, position angle ψ and axis ratio q, where ì µ, ψ and q are inherited from the disc component.
The distance from a pixel to a logarithmic spiral is given by In practice the spiral distance was approximated using the distance to a poly-line with 200 vertices, as solving the above minimization for each pixel at each fitting step is computationally intractable. We also adjust A, θ min and θ max to account for the rotation of the disc component from its starting value, in order to prevent spirals inadvertently moving far from starting locations for face-on discs (which have poorly constrained position angles). These adjustments are A = Ae ∆ψ tan φ , θ min = θ min − ∆ψ, θ max = θ max − ∆ψ.
For the fit, we parametrize disc I e as the Sérsic total luminosity, given by L tot = I e R 2 e 2πn e b n (b n ) 2n Γ(2n).
Bulge (bar) I e is reparametrized as "bulge (bar) fraction", which we define as and is limited to be between 0 and 1. Disc luminosity is allowed to take any value greater than or equal to zero. Similarly, bulge and bar effective radius are reparametrized as their scale relative to the disc (R e = R e /R e, disc ). Bulge and bar are also restricted to have the same position.