Deep structure, long‐distance migration and admixture in the colour polymorphic land snail Cepaea nemoralis

Abstract Although snails of the genus Cepaea have historically been important in studying colour polymorphism, an ongoing issue is that there is a lack of knowledge of the underlying genetics of the polymorphism, as well as an absence of genomic data to put findings in context. We, therefore, used phylogenomic methods to begin to investigate the post‐glacial history of Cepaea nemoralis, with a long‐term aim to understand the roles that selection and drift have in determining both European‐wide and local patterns of colour polymorphism. By combining prior and new mitochondrial DNA data from over 1500 individuals with ddRAD genomic data from representative individuals across Europe, we show that patterns of differentiation are primarily due to multiple deeply diverged populations of snails. Minimally, there is a widespread Central European population and additional diverged groups in Northern Spain, the Pyrenees, as well as likely Italy and South Eastern Europe. The genomic analysis showed that the present‐day snails in Ireland and possibly some other locations are likely descendants of admixture between snails from the Pyrenees and the Central European group, an observation that is consistent with prior inferences from mitochondrial DNA alone. The interpretation is that C. nemoralis may have arrived in Ireland via long‐distance migration from the Pyrenean region, subsequently admixing with arrivals from elsewhere. This work, therefore, provides a baseline expectation for future studies on the genetics of the colour polymorphism, as well as providing a comparator for similar species.

genes that determine variation in colour morphs within species and their evolutionary history (Wellenreuther et al., 2014). These same methods have also enabled the subdiscipline of phylogeography to move away from the use of one or a few markers to studies that involve thousands of genetic markers (Dussex et al., 2020;Kotlik et al., 2018;Lucena-Perez et al., 2020). Such studies have revealed how connectivity between populations, or lack of, shapes, the genetic structure of species, and critically, how regions of the genome respond differently depending upon the nature of selection and the genetic architecture of a particular colour trait (e.g. Pardo-Diaz et al., 2012;Poelstra et al., 2014).
Historically, some of the most important animals in studying colour polymorphism have been the two snail species, Cepaea nemoralis and C. hortensis. One benefit of working with these species is that variation in colour and banding ( Figure 1) shows straightforward inheritance (Cook, 2017;Jones et al., 1977;Richards et al., 2013), being controlled by a series of at least nine loci, of which five make a single 'supergene' containing tightly linked colour, banding and other loci. In these two species, we now have some understanding of the pigments and shell proteome (Affenzeller et al., 2020;Mann & Jackson, 2014;Williams, 2017) and some knowledge underlying the development of the banding pattern (Jackson et al., 2021) and have begun to use genomic methods to try to map and identify the genes involved (Kerkvliet et al., 2017;Richards et al., 2013).
However, although ongoing studies on snails in general continue to provide evidence for the relative role of various forms of natural selection and random drift in promoting and maintaining variation, snails (and molluscs in general) are poorly represented in phylogeographic or population genomic studies. For example, most studies on terrestrial snails have been limited to using just a few genes, even those published recently (de Weerd et al., 2016;Ebbs et al., 2018;Neiber et al., 2017). In our own work on C. nemoralis, we sampled 880 individuals from 111 locations, but used only one genetic marker, a fragment of the mitochondrial gene cytochrome c oxidase F I G U R E 1 Top: European map summarizing the results of the mtDNA study on C. nemoralis, as well as the locations and abbreviation code for the samples used for the ddRAD study. Pies are coloured according to mtDNA lineage (see Figure S1); larger pies indicate n > 5 per site. Limits of the present-day distribution of C. nemoralis shown by the red dotted line. Data from Layton et al. (2019) indicated by asterisk. Bottom: Some of the C. nemoralis colour and banding morphs, from left: (1) yellow unbanded, white lip; (2) brown midbanded, white lip; (3) pink unbanded, normal lip; (4) yellow five-banded, white lip; (5) brown unbanded, normal lip; (6) pink five-banded, normal lip; (7) yellow three-banded (10 305), normal lip. Image credit: Angus Davison subunit 1 (Grindon & Davison, 2013). Otherwise, there are just a few notable exceptions (Haponski et al., 2019;Hirano et al., 2019;Miura et al., 2020;Neiber & Hausdorf, 2015b;Westram et al., 2018), with a few studies in Europe (Koch et al., 2017;Razkin et al., 2016;Xu & Hausdorf, 2021), and a single genome re-sequencing study (Chueca et al., 2021a). This is a problem because understanding colour polymorphism requires a contextual understanding of the history of populations in general, and more precisely, a gene-by-gene account of the impact that drift and selection have had upon the genome.
One issue for Cepaea snails is that Jones et al. (1977) questioned whether understanding the polymorphism is 'a problem with too many solutions?', which led to difficulties arguing the case for these snails as a study organism, or molluscan 'model' (Davison & Neiman, 2021). Actually, the misunderstood intention of Jones et al. (1977) was to emphasize the perfect case study provided by Cepaea. Simple explanations for phenotypic variation, including colour, are likely an exception, so they were making the point that it is important to study organisms for which polymorphism may be explained by a variety of processes, precisely because they are more representative. Given that present-day genomic technologies should allow us to uncover the relative contributions of each of these processes in making contemporary diversity, this aim should now be realizable. Unfortunately, a second general issue is that extracting DNA from snails is sometimes problematic (Adema, 2021), and snails in general have large and repetitive genomes, which are difficult to assemble. Advances in the understanding of the colour polymorphisms of snails have, therefore, tended to lag behind other species, especially invertebrates with relatively small genomes (Davison & Neiman, 2021).
In this study, we take advantage of a first draft genome sequence of C. nemoralis (∼3.5 Gb, 28 537 contigs with N50 of 333 kb; Saenko et al., 2021) and restriction site-associated DNA sequencing methods (ddRAD) to understand the phylogeographic and population genomic structure of C. nemoralis. By combining the genomic data with mtDNA sequences generated from three prior studies (Davison, 2000;Grindon & Davison, 2013;Layton et al., 2019) and new sequences from Italy, the proximate aim is to begin to understand the post-glacial history of the species. The ultimate aim is to provide context for subsequent studies on the evolutionary origins of the supergene and the relative roles that natural selection and drift may play in the establishment and loss of colour polymorphism in local populations.

| Sample collection
In previous work, we and others collected C. nemoralis from across Europe, euthanizing snails by freezing at −80°C upon arrival at the University of Nottingham (Davison, 2000;Grindon & Davison, 2013). In this study, we strategically sampled from this prior collection, including individuals from Germany, Belgium, France, Hungary, Denmark, Spain, UK and Ireland. We also used snails from a 2017/2018 collection (Ramos Gonzalez & Davison, 2021) from three valleys in the Pyrenees, Valle de Vielha, Valle de Jueu and Valle Noguera Ribagorzana (subsequently "Vie", "Jue" and "Rib").
For the mitochondrial DNA analysis, we combined the data from our own prior studies (Davison, 2000, Grindon & Davison, 2013 with data from another recently published work (Layton et al., 2019), the latter of particular benefit because the study had good coverage of sites in central France. In addition, we also added new data from samples from Italy, useful because the peninsula was a potential refugium during previous glaciations but poorly sampled in the original studies.
For the genomic analyses, individuals were selected to ensure a wide geographic spread and to include representatives of different colour morphs, as well as the major mitochondrial lineages reported in the prior studies ( Figure 1, Table 1).

| DNA methods
High-molecular-weight genomic DNA was extracted from frozen snail foot tissue, using methods similar to those described previously (Gonzalez et al., 2019;Richards et al., 2013). In brief, tissue slices were incubated at 65°C in lysis buffer (10 mM Tris, 0.1 M EDTA, 0.5% SDS), with proteinase K (0.2 mg/ml), with occasional inversion.
RNase A (80 μg/ml) was added, and then, each tube was incubated for one more hour. The mixture was cooled to room temperature and then extracted using the standard phenol/chloroform/iso-amyl ethanol (25:24:1) method, with ethanol precipitation. DNA quality was visually verified by agarose gel electrophoresis and then quantified and further checked using a NanoDrop 2000c spectrophotometer and a Qubit 2.0 Fluorometer. ddRAD libraries were generated by SNPsaurus (https://www.snpsa urus.com/), using the method of Russello et al. (2015) and 75 ng of input DNA. Using a service provided by SNPsaurus at the University of Oregon (https://www. snpsa urus.com/), one lane of a HiSeq4000 was then used to generate 150 base pair sequence reads, which were then demultiplexed and stored as compressed fastq files. In addition, a ∼600 base pair (b.p.) fragment of the mitochondrial gene COI was amplified by the polymerase chain reaction (PCR) for some individuals, using either the Folmer et al. (1994) primers, those of Gittenberger et al. (2004) or a combination of the two, as described previously (Grindon & Davison, 2013). The PCR product was then sequenced with forward primers (and reverse, if necessary) using BigDye™ Terminator version 3.0 Cycle Sequencing Kit.

| Analyses of mitochondrial DNA variation
The randomized axelerated maximum likelihood method, as implemented in raxml 8.2.12 (Stamatakis, 2014), was used with a partitioned data set (one for the 1st/2nd codon positions and another for the 3rd codon position) and produced a mitochondrial phylogeny, TA B L E 1 Summary showing the locations and phenotypes of the individual samples used for the analyses of genomic variation, including mitochondrial haplotype (Grindon & Davison, 2013); "group" reflects the clustering returned using Gaussian finite mixture modelling with rapid bootstrap resampling (100 replicates). The phylogeny was then used to place individual haplotypes into lineages, as previously described (Grindon & Davison, 2013).

| Quality checking, variant calling and filtering
fastqc v0.11.9 software was used to check the quality of the reads, and then, adapters were removed using trimmomatic 0.39 (Bolger et al., 2014). The fastq files were then aligned to the draft genome (Saenko et al., 2021) using the Burrows-Wheeler Alignment method (Li et al., 2009)

| Analyses of genomic variation
We used principal component analysis (Patterson et al., 2006), whole-genome phylogeny and admixture tests to analyse the relationships between the individuals.
First, as others have done (Bryc et al., 2010;Decker et al., 2014;Marques et al., 2022), principal components analysis (PCA) and Gaussian finite mixture modelling methods were used as an independent means of exploring the broad phylogenomic relationship between individuals, independent of models of sequence evolution.
For the PCA, but not other analyses, the filtered variant calling file was first pruned using plink v1.9 (Purcell et al., 2007), then linkagepruned sites were used to generate eigenvalues and vectors. The were estimated using vcftools (Danecek et al., 2011). To achieve this, genome-wide F st was used directly by plotting against geographic distance or transformed into a dissimilarity matrix, using the vegan 2.5-6 package in r (Jari Oksanen et al., 2018) and again plotted against geographic distance. Significance was tested using a Mantel test based on Spearman's rank correlation and 9999 permutations, again using vegan 2.5-6.
In the future-when whole-genome analyses are possible and the colour and banding loci have been finely mapped-we aim to understand how gene flow impacts upon different regions of the genome, for example, using a window-based approach. As a starting point, we estimated F st by contig and also estimated Mantel's r statistic for the F st for each contig and all pairwise comparisons. To achieve this, the same analyses as above were carried out on each of the individual contigs, again testing significance using a Mantel test.
To further understand the phylogenomic relationship between individuals, the concatenated biallelic SNP ddRAD data were used to generate a phylogeny, using raxml v8.2.9 with the GTRGAMMA model and 100 bootstrap replicates (Stamatakis, 2014).
The program admixture v1.3.0 was used to determine the best number of clusters (K), by computing the fivefold cross-validation error (Alexander & Lange, 2011), with the aim of then using treemix 1.13 (Pickrell & Pritchard, 2012) to investigate possible migration events.
To support these analyses, D statistics were used to explore evidence of introgression between Pyrenean and central European snails, using admixtools version 6.0 (Patterson et al., 2012) and admixr version 0.7.1 (Petr et al., 2019). Leave-one-out cross-validation was used to determine the best-supported value, with significance evaluated using block jackknife and the Z-score.
For the D statistics, the underlying principle (Ravinet & Meier, 2020) is

| Mitochondrial DNA
Here, we combined data from two prior studies (Davison, 2000;Grindon & Davison, 2013) with a new study (Layton et al., 2019). We also added 17 individuals from a further nine locations in Italy, making a combined total of 1521 individuals from 180 locations across Europe (Table S1). The revised mtDNA phylogeny ( Figure S1; rooted using the sister taxon C. hortensis) shows the same pattern as before ( Figure 1), with the new finding that Italy has several lineages that are not found anywhere else, except in Croatia and Hungary.
The majority of Europe (Figure 1) Table S3). There was a mean of 3.74 SNPs per contig and, as expected, a correlation between contig length and number of SNPs (r = 0.11, p < 0.001; Figure S2). There was no evidence for PCR duplication problems.

| Different genomic evolutionary history of the colour and geographic variation
A principal components analysis of the whole-genome data showed that individuals of C. nemoralis broadly cluster according to geography ( Figure 2) with PC1, PC2 and PC3 together explaining 37% of the variance (Table S4) are coloured according to these groupings. Figure 2 shows the clusters in relation to the principal component analysis; Figure 3 shows the geographic distribution of three clusters.    (Table S5). Specifically, we set out to test whether the Irish samples (Lisdoonvarna, Dublin) and some others (e.g. La Roche, Mull) show evidence of introgression from other populations. In general, there was no evidence for an excess of allele sharing (Table S5, Z-score between −3 and 3) and therefore, no evidence for introgression since their common ancestry. The main exception was a significant enrichment of shared sites between Segre in the Eastern Pyrenees and Lisdoonvarna in Ireland, irrespective of whichever central European site used (Z-score around −5 to −6). This is consistent with these West Irish samples being partly derived from snails in the Eastern Pyrenees (Table S5). There was also limited evidence that the Dublin population may be partly derived from Pyrenean snails and the same for samples from La Roche (Zscore < −3 in two comparisons each). These results are, therefore, consistent with Irish and possibly some French populations being derived from an introgression event between Pyrenean and central European populations of snails. In the future, it should be possible to better test for evidence of introgression, using analyses such as treeMix (Pickrell & Pritchard, 2012), but unfortunately, a relatively high degree of missing data precluded this analysis here.

| DISCUSS ION
Investigating the phylogeography of Cepaea nemoralis-and the relative degree of gene flow between populations-should be an important step towards understanding the relative roles that natural selection and drift have to play in the evolutionary history of the colour and banding polymorphism. In this study, we used a newly F I G U R E 4 Mid-point rooted RAxML phylogeny, based on an analysis of 8689 biallelic loci. Bootstrap support >70% is shown. Colours are in accordance with the clusters defined by Gaussian finite mixture modelling available draft genome (Saenko et al., 2021) and the ddRAD method, supplemented with a wide geographic survey of mtDNA data, with a primary aim to begin to understand the post-glacial history of the species, but also to provide context for future studies on the evolutionary origins of the supergene, and the roles that selection and drift have in determining the local colour polymorphism. Of course, these methods are now standard in many animal groups, but the study is relatively novel for snails, partly because molluscs in general lag behind (Davison & Neiman, 2021), with, for example, only a few other land snail species having a genome assembly (Chueca et al., 2021b;Guo et al., 2019;Liu et al., 2021). The reality is that although phylogeographic and genomic studies are relatively common in other invertebrates, there are remarkably few similar studies in snails or even the wider group of molluscs.
Broadly, analysis of the whole-genome data expands and adds nuance to the mtDNA survey. The analysis shows that Europeanwide patterns of differentiation are primarily due to multiple deeply divergent groups of snail. Geographically one group is dominant, being found over much of Central Europe (Figures 1 and 2 The results corroborate previous studies in that populations showed a high degree of genetic structuring, sometimes over short geographic distances (Davison & Clarke, 2000). For example, the maximal genomic distance was between snails in Hungary and the rest of Europe (F st > 0.3; Table 2; Figure 5). In comparison, samples from the UK, Denmark, Germany, France and Belgium were relatively homogenous, with an F st around 0.05 to 0.15. Previously, Ochman et al. (1983) showed using allozymes that the Pyrenees contain three deeply differentiated sets of populations, or 'molecular area effects', for which the geographic patterns did not correlate with the shell ground colour and banding frequencies. The genomic evidence also supports these findings, showing deep divergence between geographically close populations on the Atlantic coast of Northern Spain (e.g. Santander), and the Central Pyrenees (e.g. Vielha) with possible hybrid populations in both the Western and Eastern Pyrenees (e.g. Roncesvalles in West and Segre in East).

| Origins of the Irish Cepaea nemoralis
One of the key characteristics of most population genetic studies of land snails is that they show high population structure, and deeply diverged mitochondrial lineages, the latter frequently considerably <5% within species (Chiba, 1999;Thomaz et al., 1996), for which a variety of explanations have been put forward (Davison, 2006;Parmakelis et al., 2013). In Cepaea, one of the most striking findings from previous work was that some Irish C. nemoralis are most likely derived from a post-glacial colonization event from the Pyrenees (Grindon & Davison, 2013), in part because of identical mitochondrial haplotypes in the two locations.
Here, the genomic data provide further evidence for the longdistance colonization, but also show that the present-day snails in Ireland are likely descendants of admixture between Pyrenean snails and individuals from the population that colonized most of Central Europe. The snails from the West coast of Ireland show the greatest proportion of Pyrenean ancestry (Figure 7; Table S5), consistent with the observation that they also have some of the same shell characters, including large size and shells with a pale or white lip colour (Cameron, 1969;Cook & Peake, 1960, 1962. The analyses may also indicate, albeit a lesser signal, of this same admixture in snails from Dublin, Ireland, and the North-West of France (La Roche; Figure 7; Table S5). As there is also another mitochondrial DNA lineage present in Ireland (D), which may derive from Northern Spanish or West Pyrenean snails (directly or indirectly), then Ireland may have been colonized on at least two and perhaps three or more occasions.
Similar patterns of mixed ancestry have been reported for other species (especially those involving the 'Celtic fringe ';Brace et al., 2016;Kotlik et al., 2018;McDevitt et al., 2011;Searle et al., 2009). Altogether, these new results contribute to the ongoing debate regarding the uncertain origins of the Irish biota (e.g.

McDevitt et al., 2020).
Although not proven, we believe that the most likely scenario is that C. nemoralis snails arrived directly in Ireland from the Pyrenean region, intentionally or accidentally assisted by humans.
For the moment, the precise geographic origin remains unknownidentical mtDNA haplotypes are shared between multiple sites in the Pyrenees and Ireland, with mtDNA F st values not significantly different from zero (Grindon & Davison, 2013). In the genomic study here, a single site (Segre) in the Eastern Pyrenees falls most closely with the Irish snails. However, this population in itself could be a hybrid population, with the actual Irish ancestor elsewhere.
It is reasonable to suppose that snails of the same lineage could also have been distributed along the western fringes of France. Of use here, an inspection of the data in Layton et al. (2019) shows that snails of mtDNA lineage C are found in central and North Western France (see their Figure 1; Deux-Sevres, Paimpont), whereas previously lineage C was reported from the Pyrenean region and Ireland only (Grindon & Davison, 2013). This finding is, therefore, consistent with the genomic evidence (especially the admixture analyses) that indicates a possible hybrid origin of snails in North-West France (La Roche; Table S5). However, the specific mitochondrial C type haplotype found in France has not yet been found in Ireland. It is,  Kerney et al., 1980;Preece et al., 1986;Speller, 2006 Populations have since hybridized and introgressed. In perhaps a similar manner, founding populations of mice were likely transported by humans (Jones et al., 2013). Moreover, C. nemoralis are arguably 'synanthropic', liking disturbed habitats but also being consumed as food source, especially in Iberia, since at least 11 000 years be-

| Colour and genomic geographic variation
One of the characteristics of the genus Cepaea is that all, or nearly all, populations show some degree of colour or banding polymorphism (Cook, 2017;Jones et al., 1977). Nonetheless, large-scale and local surveys of the geographical distribution of shell ground colour have shown clear geographical patterns Cook, 2014;Davison et al., 2019;Ożgo & Schilthuizen, 2012;Ramos Gonzalez & Davison, 2021;Silvertown et al., 2011;Worthington et al., 2012). Future studies that aim to test trends in shell variation may, therefore, wish to take account of this knowledge in analyses.
For the moment, the whole-genome data enable a more nuanced understanding, for example, showing that large scale differentiation due to isolation by distance is mainly driven by about a third of the genome (Figure 3c).
To what extent is the lack of geographic differentiation for most loci due to neutral processes such as drift ( As our previous work showed that colour is 'indiscrete' or continuous, and not three separate colours in Cepaea , further sequence analysis of colour alleles will inform how selection has impacted upon the genome and more specifically, the nature of this selection.  on the basis of the evidence provided here, and the distribution of samples, it seems plausible that the central European population is derived from an as yet unsampled refugium in the Italian or alpine region, or perhaps adjacent Balkans-which would make the pattern most similar to that of the Chorthippus parallelus paradigm suggested by Hewitt (2004).

| The evolutionary history and future of Cepaea
In the future, we believe that a higher level assembly of the C. nemoralis genome should enable subsequent studies on the evolutionary origins of the supergene, and the relative roles that natural selection, recombination and drift may play in the establishment and loss of colour polymorphism in species as a whole and, more specifically, local populations. The same phylogeographic and population genomic questions should also be addressed in the sister species C. hortensis, which has a more northerly distribution; many records for C. hortensis in the Iberian Peninsula are arguably erroneous, instead of being pale-lipped C. nemoralis (Puente et al., 1998, Ramos Gonzalez & Davison, 2021; see also records in e.g. iNaturalist). This species also colonized North America, likely more than 6000 years ago (Pearce et al., 2010), and is long present in Iceland, and perhaps once present in Greenland (Johnson, 1906). It is not known how this species colonized these locations, but population genomics would reveal whether it was from one or several sources and whether a 'stepping-stone' was involved (e.g. Europe -Iceland -North America). Harinee Selvadurai, who did not seem to mind AD collecting snails whilst on honeymoon in Lisdoonvarna, Ireland.

CO N FLI C T O F I NTE R E S T
The authors declare no competing interests.