Whole genome duplication potentiates inter-specific hybridisation and niche shifts in Australian burrowing frogs Neobatrachus

Polyploidy has played an important role in evolution across the tree of life but it is still unclear how polyploid lineages may persist after their initial formation. While both common and well-studied in plants, polyploidy is rare in animals and generally less well-understood. The Australian burrowing frog genus Neobatrachus is comprised of six diploid and three polyploid species and offers a powerful animal polyploid model system. We generated exome-capture sequence data from 87 individuals representing all nine species of Neobatrachus to investigate species-level relationships, the origin and inheritance mode of polyploid species, and the population genomic effects of polyploidy on genus-wide demography. We resolve the phylogenetic relationships among Neobatrachus species and provide further support that the three polyploid species have independent autotetraploid origins. We document higher genetic diversity in tetraploids, resulting from widespread gene flow specifically between the tetraploids, asymmetric inter-ploidy gene flow directed from sympatric diploids to tetraploids, and current isolation of diploid species from each other. We also constructed models of ecologically suitable areas for each species to investigate the impact of climate variation on frogs with differing ploidy levels. These models suggest substantial change in suitable areas compared to past climate, which in turn corresponds to population genomic estimates of demographic histories. We propose that Neobatrachus diploids may be suffering the early genomic impacts of climate-induced habitat loss, while tetraploids appear to be avoiding this fate, possibly due to widespread gene flow into tetraploid lineages specifically. Finally, we demonstrate that Neobatrachus is an attractive model to study the effects of ploidy on the evolution of adaptation in animals.

Polyploidy has played an important role in evolution across the tree of life but it is 30 still unclear how polyploid lineages may persist after their initial formation. While both 31 common and well-studied in plants, polyploidy is rare in animals and generally less well-32 understood. The Australian burrowing frog genus Neobatrachus is comprised of six 33 diploid and three polyploid species and offers a powerful animal polyploid model system. 34 We generated exome-capture sequence data from 87 individuals representing all nine 35 species of Neobatrachus to investigate species-level relationships, the origin and 36 inheritance mode of polyploid species, and the population genomic effects of polyploidy 37 on genus-wide demography. We resolve the phylogenetic relationships among 38 Neobatrachus species and provide further support that the three polyploid species have 39 independent autotetraploid origins. We document higher genetic diversity in tetraploids, 40 resulting from widespread gene flow specifically between the tetraploids, asymmetric 41 inter-ploidy gene flow directed from sympatric diploids to tetraploids, and current 42 isolation of diploid species from each other. We also constructed models of ecologically 43 suitable areas for each species to investigate the impact of climate variation on frogs 44 Introduction 53 Polyploidy or whole genome duplications (WGDs) play important roles in ecology and 54 evolution (1, 2). Although polyploidization predominantly occurs in plants, polyploidy has also 55 played an important role in animal evolution. For instance, two ancient WGDs occurred early in 56 the vertebrate lineage (3), while more recent WGDs occurred in several animal groups, 57 including insects, molluscs, crustaceans, fishes, amphibians and reptiles (4-6). While the 58 majority of polyploid animals switch to different atypical modes of bisexual reproduction after 59 polyploid formation (4,7,8), amphibians, and more specifically anurans, are unique among 60 polyploid animals in exhibiting multiple independent occurrences of diploid and sexually 61 reproducing polyploid sister species (9). 62 While polyploidization has thus occurred frequently across the tree of life, the 63 evolutionary benefits of WGDs remain elusive. Although polyploids of hybrid origin 64 (allopolyploids) may benefit from heterosis due to increased genetic variation, instantaneous 65 shifts into intermediate or new ecological niches, and the redundancy of independently 66 segregating gene copies (2, 10), newly formed allopolyploids are simultaneously subject to 67 several disadvantages perhaps most prominently of which is their low abundance compared to 68 the established non-polyploids (10). Conversely, while autopolyploids retain many of the same 69 disadvantages as allopolyploids, the advantages of autopolyploidy are much less clear (11). 70 Here, we focus on a group of widely distributed, endemic Australian burrowing frogs, 71 Neobatrachus. This genus comprises six diploid (N. albipes, N. fulvus, N. pelobatoides, N. 72 pictus, N. sutor, N. wilsmorei; 2n=24) and three tetraploid (N. aquilonius, N. kunapalari, N. 73 sudellae; 4n=48) species (12,13), all characterised by bisexual reproduction (14). Tetraploid 74 species of Neobatrachus were suggested to have independent origins based on mitochondrial 75 DNA (mtDNA) (15) and to have originated through autotetraploidy rather than allotetraploidy, as 76 they exhibit tetrasomic inheritance and show high rates of tetravalent formation during meiosis 77 (14,16 It has also been observed that, while the Neobatrachus tetraploid species are distributed  88  sympatrically with some of the diploid species, they are also able to occupy more arid areas  89 across Australia (2,15). Generally, polyploidy has been associated with greater tolerance to 90 harsher conditions, but it is not clear whether whole-genome duplication broadly provides a 91 fitness advantage or simply is a consequence of elevated rates of unreduced gamete formation 92 (2,10,24,25), which might be more prone to occur in extreme environments. At the same time, 93 changing environments may contribute to amphibian extinction rates, which continue to increase 94 and are driven by many interdependent factors such as habitat loss, emergence and spread of 95 diseases, invasive species and pollution (26)(27)(28)(29). 96 In the current study, we use an anchored hybrid enrichment approach (AHE) (30-33) to 97 resolve the phylogenetic relationships among Neobatrachus species, and to assess fine-scale 98 intra-specific genetic population structure. We also quantify the extent of hybridization between 99 the nine Neobatrachus species with a particular focus on taxa with contrasting ploidies. Finally, 100 we combine population dynamics assessments with changes in ecologically suitable areas for 101 each species to describe population responses to climate changes. 102

103
Phylogenetic relationships, genetic population structure and gene flow in Neobatrachus 104 We first generated sequence data and alignments for 439 targeted orthologous nuclear loci of 105 87 Neobatrachus individuals spanning the entire genus as well as nine Heleioporus individuals 106 as outgroups (see Methods). We then built a species tree and gene trees from the sequenced 107 loci with ASTRAL-II (34) using RaxML (35) (Fig. 1A). These traditional bifurcating species-tree 108 methods reveal extensive conflict among genealogies, as well as between nuclear DNA (nDNA) 109 and mtDNA topologies, but they do all demonstrate that the three tetraploid species do not form 110 a monophyletic group (Fig.1A, Fig. S1, Fig. S2). Multidimensional scaling (MDS) of gene tree 111 topologies suggested that nuclear loci constitute either two or four topology clusters (Fig. S3 Diploid Neobatrachus species (top 6: N. pelobatoides,N. albipes,N. 145 wilsmorei,N. sutor,N. pictus,N. fulvus) are each assigned to separate clusters, while all three tetraploid 146 species (bottom 3: N. kunapalari, N. sudellae, N. aquilonius) show inter-species admixture.

148
To further assess the complex demographic history of Neobatrachus, we performed 149 TreeMix (38) modeling where species relationships are represented through a graph of 150 ancestral populations (Fig. 3A). The structure of the graph is inferred from allele-frequency data 151 and Gaussian approximation of genetic drift such that the branch lengths in the graph are 152 proportional to the amount of drift since population split. We sequentially added up to 15 153 migration events, showing saturation of the model likelihood at five additional migration edges 154 on average for 30 runs of Treemix, each with a different seed for random number generation 155 (Fig. 3E). We show an example of the inferred introgression events and the bifurcating graph for 156 the model with five migration events for the run that resulted in the highest maximum likelihood 157 ( Fig. 3A-D). Inferred migration events (Fig. 3B) indicate widespread directional introgression and 158 interploidy gene flow between the polyploid species, however, only two introgression events had 159 p-value lower than 0.05 in this particular run: from N. sudellae (4n) to N. kunapalari (4n) and 160 from N. sutor (2n) to N. kunapalari (4n). Since there was some variability in inferred migration 161 edges from run to run, to estimate the most frequently inferred migration events we summed the 162 significant inferred migration edges among 30 TreeMix runs with five events allowed (Fig. 3F). 163 Migration events were found most frequently from N. sudellae (4n) to N. kunapalari (4n) (19 of 164 30 runs) and from N. sutor (2n) to N. kunapalari (4n) (12 of 30 runs). Interploidy introgression 165 events were mostly asymmetric and from diploids to tetraploids, which corresponds with our 166 ADMIXTURE cluster assignment results (Fig. 1, Fig. 2). Inferred introgression events are 167 broadly congruent with clusters of conflicting gene-tree topologies (Fig. S3). Each tetraploid 168 Neobatrachus species (N. aquilonius, N. kunapalari, N. sudellae) is sister to a diploid species in 169 the TreeMix graphs ( Fig. 3A-B, tips highlighted in bold) as well as in the species trees ( Fig. 1), 170 which is consistent with previously suggested independent origins for the tetraploid species 171 (15). 172  Arabidopsis lyrata and A. arenosa could not hybridize, while tetraploidy seems to overcome the 305 endosperm-based hybridization barrier enabling gene flow between the two species (59, 60). 306 Neobatrachus species are widely distributed in Australia with tetraploid species 307 occurring more in the central (drier) area compared to diploids, which is reflected on the 308 principal component analysis of the climatic data for species occurrences (Fig. S8). Areas 309 occupied by different Neobatrachus species differ only slightly in their environmental 310 characteristics (Fig. 8). Worth mentioning is that climatic variables do not entirely describe 311 ecological niches, which could differ in other characteristics such as timing of breeding and 312 foraging, food source preference, etc. Nevertheless, ecological niche modelling based on 313 climate data may provide additional insights into population dynamic trends. Here, we applied 314 the MaxEnt modelling approach to the publicly available climate and occurrence data for all nine 315 Neobatrachus species, comparing the present and past suitable geographical areas. Most of the 316 Neobatrachus species showed substantial changes of the suitable areas comparing current and 317 past presence probabilities (Fig. S10-11). Interestingly, the estimated change in suitable habitat 318 areas and population genetics estimator of demographic trends (Tajima's D), obtained from 319 independent datasets, were correlated (Fig. 4C). Tetraploid species appear to be outliers from 320 the general trend, probably due to their mixed population structure: in this case, emergence of 321 rare alleles in the population due to migration events will affect Tajima's D estimator. Overall, it 322 appears that the species with greater shrinkage of suitable area since the last glacial maximum 323 had less negative median Tajima's D values, which suggests an ongoing shift from population 324 expansion to population contraction. 325

Conclusion and Outlook 326
Neobatrachus frogs represent a group of diploid and tetraploid species with a complex 327 ancestry. Our results, revealing gene flow between tetraploids and asymmetric inter-ploidy gene 328 flow, pose a number of important questions concerning the evolution of sexual polyploid 329 animals. Whole-genome sequencing data for Neobatrachus species would not only help to 330 refine the population structure and introgressive mixing in the genus, but also provide 331 information on potential adaptive effect of the introgressed regions. One could hypothesize that 332 a wide and potentially rapid spread of the tetraploids into new territories was facilitated by 333 introgression from the locally adapted diploids (61), and that more detailed sampling of the 334 tetraploids from parts of their distribution ranges remote from sympatry with diploids may reveal 335 evidence of early versus ongoing gene flow. Moreover, population-level genomic resequencing 336 of multiple diploid and tetraploid sister species could provide insight into the unique biology of 337 autotetraploid sexual animals and effects of the tetraploidization on their evolution. Following sequencing, we quality filtered the reads using the Casava high-chastity filter, 359 then demultiplexed the reads using the 8bp indexes with no mismatches tolerated. To increase 360 read length and correct for sequencing errors, we merged read pairs that overlapped by at least 361 18bp using the method of Rokyta et al. (67). This process also removed sequencing adapters. 362 We then performed a quasi 'de novo' assembly of the reads following Hamilton et al. (68), with 363 Pseudacris nigrita, and Gastrophryne carolinensis as references. In order to reduce the 364 potential effects of low level sample contamination, we retained only the assembly clusters 365 containing more than 61 reads. In order to produce phased haplotypes from the assembly 366 clusters, we applied the Bayesian approach developed by Pyron et al. (69), in which reads 367 overlapping polymorphic sites are used to identify the likely phase of allelic variants within each 368 locus. Because this approach was developed to accommodate any ploidy level, we were able to 369 isolate two or four haplotypes for diploid and tetraploid individuals, respectively. We determined 370 orthology for each locus using a neighbor-joining approach based on pairwise sequence 371 distances, as described in Hamilton et al. (68). We aligned homologous haplotypes using 372 MAFFT v7.023b (70), then auto-trimmed/masked the alignments following the approach of 373 Hamilton et al. (68), but with MINGOODSITES=12, MINPROPSAME=0.3, and 374 MISSINGALLOWED=48. Final alignments were visually inspected in Geneious R9 (Biomatters 375 Ltd., (71)) to ensure that gappy regions were removed and misaligned sequences were masked. 376 Whole mitochondrial genomes 377 Mitochondrial bycatch from exome capture methods can be an appreciable source of 378 mitogenomic data (72). To take advantage of these data, we assembled mitochondrial genomes 379 from raw reads using custom scripts built around existing assembly and alignment software. 380 Raw reads for each sample were assembled against the mitochondrial genome of Lechriodus 381 melanopyga (73), using a baiting and mapping approach implemented in MITObim (74). 382 Mapped reads from each individual genomes were then aligned using MUSCLE (75), and 383 adjusted manually. Entire concatenated mitogenomes were used to investigate sample 384 identities, and to build an initial phylogenetic tree. To estimate divergence dates between 385 species using mitochondrial data, we extracted 13 protein coding regions (CDS), and partitioned 386 them separately. We then trimmed the concatenated mitogenome phylogeny and CDS 387 alignments down to two representatives of each taxon except for Neobatrachus sutor, which 388 was not monophyletic (and therefore was trimmed down to four samples), and included two 389 species of Heleioporus and Lechriodus as outgroups. The resulting tree and alignments were 390 used as inputs for baseml and ultimately mcmctree, which we ran until we reached 20,000 post-391 burnin samples. 392

Phylogenetic analysis 393
To generate a molecular species tree, we started by reconstructing individual 394 genealogies for each of the 439 recovered loci. We analyzed two datasets, of which the first 395 included all samples (except triploids discussed in the Results section), while the second was 396 trimmed down to just two individuals per species. Results from the full sampling can be found in 397 the Supplementary Material (Fig. S2), and the finer sampling in the main text (Fig. 1). We used 398 RaxML (35) to simultaneously search for the best tree and apply 100 rapid bootstraps, 399 implementing the GTRGAMMA model of nucleotide evolution for each locus. In generating 400 species trees, coalescent methods have been shown to be more accurate than concatenation in 401 case of extensive incomplete lineage sorting, and so we used the shortcut coalescent method 402 ASTRAL III. Shortcut coalescent methods like ASTRAL take individual gene trees as input, and 403 are much more computationally efficient than full coalescent analyses. We used our RAxML-404 generated gene trees as input for ASTRAL, allowing us to make use of all our molecular data. 405 To address gene-tree incongruence and investigate possible conflicting signals in our 406 data, we used multidimensional scaling (MDS) to approximate the relative distances between 407 gene tree topologies (76). To prepare the data, we trimmed down gene trees to one sample per 408 species of Neobatrachus, and discarded loci missing any taxa, leaving us with 361 loci. We 409 started by simply visualizing gene-tree incongruence overlaying the topologies of all 361 loci in 410 DensiTree (Fig.1). We then calculated the pairwise distances between all gene trees using the 411 Robinson-Foulds metric, in the R package APE (77). We projected the tree distances into two 412 and three dimensions (representing tree topology space) using MDS, as visualizing and 413 interpreting more dimensions is difficult. To test if gene trees are uniformly distributed 414 throughout tree space, or clustered, we used the partitioning around the medoids algorithm as 415 implemented in the R package CLUSTER(78). We chose the optimum number of clusters (k), 416 using the gap statistic, calculated for each k = 1-10. Clusters of gene trees represent similar 417 topologies, and so we then summarized each cluster using ASTRAL, to identify consistent 418 differences in topology. 419 We estimated the evolutionary timescale of Neobatrachus frogs for the mitochondrial 420 DNAm, using 13 protein coding sequences, and a concatenated RAxML phylogeny of those loci 421 as input for mcmctree. Because no valuable fossil information for Neobatrachus is available, we 422 used four secondary calibrations from the most extensive anuran time-tree to date (36). We 423 applied these secondary calibrations as truncated Cauchy distributions with 2.5% above and 424 below the designated bounds, to the splits between (i) Myobatrachidae and Microhylidae 425 (lower=120, upper=140), Kalophrynus and remaining microhylids (lower=55, upper=66), 426 Cophixalus and Liophryne (lower=17, upper=24), and Microhyla and Kaloula (lower=40, 427 upper=55). This analysis produced an estimate of the divergence between Neobatrachus and 428 Heleioporus with a 95% CI of 25-45 Mya, which was used to inform the mcmctree analysis of 429 mtDNA coding sequences. We ran analyses until we reached 20,000 post-burnin samples. Neobatrachus individuals (excluding the outgroup species) to have missing data at each site. In 434 order to include tetraploid samples in ancestry assignment we randomly chose two alleles for 435 each site. We also applied minor allele frequency threshold of two percent. Ancestral population 436 assignment showed three local minima of cross-validation errors at K equals 3, 7 and 9 ( Fig.  437 S6), with K=7 being the lowest, which we chose for the subsequent analysis as the optimal 438 solution. 439 Misidentifications in the dataset 440 Both phylogenetic and admixture assignments suggested that several individuals had 441 been misidentified in the field, which is expected for morphologically similar species and in 442 particular for some of the diploid-tetraploid species pairs, e.g. N. fulvus and N. aquilonius (17). 443 Field sampling can be accompanied by a certain level of honest mistakes in species 444 identification, especially for sympatric species. However, a high level of incompletely sorted 445 polymorphisms in recently split lineages or recent hybridization events could also result in 446 uncertain positioning of an individual. We carefully curated the dataset and made a decision to 447 rename some of the misidentified samples or completely exclude them from the analyses based 448 on the amount of the missing data in the assembly, ploidy estimations from the sequencing 449 data, identification from mitochondrial sequences and on the clear placement in a different 450 clade. Below we describe our workflow for manual curation of the dataset to exclude or rename 451 uncertain individuals without compromising too much on the potentially real shared variation. 452 The We have also excluded from further analysis sample I5442, initially identified as N. 473 kunapalari, which was estimated to be a triploid and showed high levels of admixture between a 474 diploid N. wilsmorei and potentially N. kunapalari or N. sudellae. In fact, triploid individuals are 475 known in natural populations of Neobatrachus (19,21,22,82,83) , (Appendix 1), and could 476 provide an explanation for the gene flow between species of different ploidy through a "triploid 477 bridge", when a 3n individual formed in a cross between 2n and 4n individuals and could 478 produce a balanced haploid, diploid or triploid gametes and in the two later cases cross to a 479 diploid and produce 4n individuals that can backcross into the sympatric 4n population ( allotetraploids were modelled combining bam files between each diploid species (Fig. S4A). 487 Base frequencies distributions were produced using denoised algorithm from nQuire (79) with diploids or other autopolyploid species, or the process of diploidization in an autopolyploid 496 with tetrasomic inheritance (88). Other analyses in this study suggest extensive gene flow 497 between N kunapalari and diploids as well as tetraploids (Fig. 2, Fig. 3B,F) . In addition, the 498 mitochondrial phylogenetic tree and our hierarchical clustering of nucleotide diversity analysis 499 both support N. kunapalari as the oldest tetraploid lineage in this group (Fig. 1B, Fig. 4D). While 500 we cannot rule out the possibility that N. kunapalari was initially formed from the hybridization of 501 two closely related lineages, we believe extensive gene flow and the older lineage age of N. 502 kunapalari is a sufficient and more likely cause for its elevated intermediate allele frequencies 503 as compared to the other tetraploids. 504

Introgression inference 505
The graphs representing ancestral bifurcations and migration events were produced 506 using Treemix V.1.12 (38). Input data contained 5092 biallelic sites called at 439 loci among 9 507 Neobatrachus species with at least 20% of the data to be present at each species at each site. 508 Position of the root was set to N. pelobatoides as the nuclear species tree suggested (Fig. 1A,  509 black). To account for linkage disequilibrium we grouped SNPs in windows of size 10 using -k 510 flag. We also generated bootstrap replicates using -bootstrap flag and subsequently allowed up 511 to 15 migration events with flag -m. We ran TreeMix software with 30 different random number 512 generated seeds. For graph and residuals visualisation we used R script plotting_funcs.R from 513 the Treemix package. 514 Summary statistics and demographic tendencies 515 We calculated summary statistics (Fig. 4A-C)  background points and 10 replicates. We have trained the model on bioclimatic variables 537 (reduced to 6 in total, described earlier) averaged across conditions 1960-1990; and then 538 projected that model to the same set of environmental variables from the Last Glacial Maximum. 539 The average test AUC (area under the Receiving Operator Curve) for the replicate runs for all 540 the species was more than 0.9 (Supplementary Table 2), indicating a high performance of the 541 models. In order to estimate which bioclimatic variable is the most important in the models we 542 performed a jackknife test, where model performance was estimated without a particular 543 variable and only with this particular variable in turn (Fig. S9). 544 We used cloglog output format of MaxEnt, which gives an estimate between 0 and 1 of 545 probability of presence of the species in the area. In order to determine the relative change of 546 the suitable area we used the point-wise mean values from the 10 replicates for model 547 predictions on current and past climate (Fig. S10-11