Culture-enriched metagenomic sequencing enables in-depth profiling of the cystic fibrosis lung microbiota

Amplicon sequencing (for example, of the 16S rRNA gene) identifies the presence and relative abundance of microbial community members. However, metagenomic sequencing is needed to identify the genetic content and functional potential of a community. Metagenomics is challenging in samples dominated by host DNA, such as those from the skin, tissue and respiratory tract. Here, we combine advances in amplicon and metagenomic sequencing with culture-enriched molecular profiling to study the human microbiota. Using the cystic fibrosis lung as an example, we cultured an average of 82.13% of the operational taxonomic units representing 99.3% of the relative abundance identified in direct sequencing of sputum samples; importantly, culture enrichment identified 63.3% more operational taxonomic units than direct sequencing. We developed the PLate Coverage Algorithm (PLCA) to determine a representative subset of culture plates on which to conduct culture-enriched metagenomics, resulting in the recovery of greater taxonomic diversity—including of low-abundance taxa—with better metagenome-assembled genomes, longer contigs and better functional annotations when compared to culture-independent methods. The PLCA is also applied as a proof of principle to a previously published gut microbiota dataset. Culture-enriched molecular profiling can be used to better understand the role of the human microbiota in health and disease. The PLate Coverage Algorithm (PLCA) determines the culture plates required for culture-enriched metagenomics and enables the recovery of greater taxonomic diversity, better quality metagenome-assembled genomes and improved functional annotations compared to metagenomics alone, indicating its utility for other microbiomes, especially those dominated by host DNA.

The field of microbiology began with the visualization of microorganisms 1 and continued once we learned to control their growth. The advent of next-generation sequencing revolutionized microbiology by allowing microbial genomics and community analysis without the requirement of culture. These technologies have expanded our understanding of the human microbiome, identifying the effects that environment, diet and host genetics have on these complex communities [2][3][4][5]. Culture-independent studies have made crucial contributions to microbiome research. However, without bacterial culture, we lose the advantages of classical microbiology, resulting in a lack of mechanistic studies and an alternative focus on microbial 'dysbiosis' and microbiome diversity 6,7.
Although it is commonly stated that the majority of the human microbiota is unculturable, numerous studies conclusively counter this. In 1974, Finegold and others cultured ∼300 species from 40 faecal specimens using both aerobic and anaerobic culture 8 . More recently, Goodman and others cultured almost half of the human gut microbiota, recovering 316 operational taxonomic units (OTUs) of the 631 identified by culture-independent techniques 9 . Lagier and others cultured 340 species of bacteria from three stool samples 10 . Furthermore, recent studies recovered 88% of family-level OTUs 11 and 95% of all OTUs identified in faecal specimens 12 (importantly, both of these studies also identified more OTUs via culture-dependent than culture-independent methods). Other human-associated communities have been profiled with culture, including urine 13 , skin 14 , oral 15 and cystic fibrosis communities 16 .
Marker gene profiling (for example, 16S rRNA gene sequencing) provides a simple and rapid method to assess the taxonomic composition of a community. Although metagenomics can assess functional capacity in addition to taxonomy, the usefulness of these data is dependent on how well short-read sequences can be assembled into contigs. The quality of assembly can be impacted by the complexity of the community, the sequencing technology and/or the proportions of host DNA contamination [17][18][19]. Culture enrichment is uniquely positioned to improve metagenomic assembly by allowing the proliferation of microorganisms (thus abating host DNA contamination) on media that 'biologically bin' samples, thus decreasing their complexity. Coupling culture enrichment with computational approaches allows us to separate promiscuous and fastidious microorganisms to better resolve these communities. Here, we use sputum from cystic fibrosis as an example of a complex microbial community where host DNA can represent a large proportion of the community (>99%) [20][21][22][23]. We present a culture-enriched metagenomic strategy, which overcomes these limitations and improves metagenomic results, providing a more comprehensive profile of the microbial community.
In this study, we merge culture-dependent and -independent techniques to better understand microbial communities. Culture-enriched 16S rRNA gene sequencing established that 82.13% of all OTUs (representing 99.3% of the relative abundance) in the cystic fibrosis lung microbiota are culturable. Furthermore, culture enrichment increased OTU recovery when compared to culture-independent sequencing. We introduce the PLate Coverage Algorithm (PLCA), which uses 16S rRNA gene sequencing to optimize culture-enriched metagenomics, and show that culture-enriched metagenomics improves the recovery of metagenome-assembled genomes and produces more thorough functional annotations than direct metagenomic approaches. We identify the advantages of culture-enriched metagenomics: increased taxonomic and functional information due to a decrease in contaminating host DNA and the ability to computationally and biologically bin microbial species.

Results
In this study, we devised a strategy for culture-enriched metagenomic profiling (Fig. 1). This strategy is based on previous culture enrichment for amplicon-based sequencing conducted on the cystic fibrosis lung microbiome, which was able to culture 43 of 48 bacterial families identified in sputum 16. On collection, samples were immediately plated onto 13 different media types (Extended Data Fig. 1) under aerobic and anaerobic conditions. 16S rRNA gene sequencing was performed on the sputum sample (direct sequencing) as well as on the collective organisms grown in each medium/environment pairing (culture-enriched sequencing) for a total of 26 culture-enriched samples per sputum (Fig. 1). The OTU diversity in the direct sequencing and distribution in the culture-enriched sequencing was used in conjunction with the PLCA (details below) to determine a representative subset of culture-enriched plates that adequately represented the sample. Shotgun metagenomic sequencing was performed on the original sample and the culture-enriched subset.
The majority of the cystic fibrosis lung microbiota is culturable. We first identified the culturable fraction of the cystic fibrosis lung microbial community in 20 sputum samples from 10 patients. We defined an OTU to be culturable if it contained ≥10 reads and was recovered from ≥1 culture-enriched plate at a relative abundance of ≥0.01%. Across the dataset, an average of 82.13% (range, 64.6-100%) of OTUs identified by direct sequencing were culturable; this culturable fraction represents an average of 99.3% (range, 97.8-100%) of the relative abundance in the associated direct sequencing results (Fig. 2a). When the genus-level OTU taxonomic assignments were compared to a list of species previously identified via culture enrichment 24, and to previous culture enrichment of the cystic fibrosis lung 16 and gut 12, we identified 18 genera cultured in this study that had not previously been identified via large-scale culture enrichment (Supplementary Table 1). Of the OTUs that were never cultured across the dataset, we observed an over-representation of the Spirochaetes (7 of 7 OTUs identified in culture-independent sequencing) and the SR1 (1 of 1), and many members of the Tenericutes (7 of 22) and Saccharibacteria (2 of 3) phyla (Fig. 2b, blue ring). Together, these results indicate that most OTUs in the cystic fibrosis lung microbiota are culturable and that those OTUs that are not cultured are taxonomically restricted, and historically challenging to culture. Importantly, culture-independent methods do not distinguish DNA from viable versus non-viable organisms, so some bacteria not recovered may not be viable in these samples.
These results still hold if a more stringent definition of culturable is used (Extended Data Fig. 2a,b).
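The culturability criterion stated above (≥10 reads, and ≥1 culture-enriched plate at ≥0.01% relative abundance) can be sketched in a few lines. This is an illustrative sketch, not the authors' pipeline: the function names and the dictionary layout are hypothetical, and it assumes the ≥10-read requirement applies to an OTU's reads summed over all plates of one sputum sample.

```python
def culturable_otus(plate_counts, min_reads=10, min_rel_abund=0.0001):
    """Return OTUs meeting the culturability criterion.

    plate_counts: dict mapping plate name -> dict of OTU -> read count.
    Assumption: the read minimum applies to reads summed over all plates.
    """
    total_reads = {}
    passes_abundance = set()
    for counts in plate_counts.values():
        depth = sum(counts.values())
        if depth == 0:
            continue  # a plate with no reads cannot support any OTU
        for otu, n in counts.items():
            total_reads[otu] = total_reads.get(otu, 0) + n
            if n / depth >= min_rel_abund:  # 0.0001 is 0.01%
                passes_abundance.add(otu)
    return {o for o in passes_abundance if total_reads[o] >= min_reads}


def culturable_summary(direct_counts, plate_counts):
    """Fraction of direct-sequencing OTUs, and of their abundance, that is culturable."""
    cultured = culturable_otus(plate_counts)
    direct_otus = {o for o, n in direct_counts.items() if n > 0}
    shared = direct_otus & cultured
    frac_otus = len(shared) / len(direct_otus)
    frac_abund = sum(direct_counts[o] for o in shared) / sum(direct_counts.values())
    return frac_otus, frac_abund


# Toy counts, for illustration only.
direct = {"A": 900, "B": 90, "C": 10}
plates = {"p1": {"A": 5000, "B": 20}, "p2": {"B": 500, "D": 3}}
frac_otus, frac_abund = culturable_summary(direct, plates)
```

In the toy data, OTU D passes the abundance cutoff on one plate but has too few reads, and OTU C is never seen in culture, so neither counts as culturable.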
Culture enrichment increases OTU recovery. Culture-enriched 16S rRNA gene sequencing consistently recovered more OTUs than direct sequencing (Fig. 2b and Extended Data Fig. 2c-e). For example, in the direct sequencing of sample 1, 49 OTUs were recovered, 42 of which were also identified by culture enrichment (Fig. 2a); in addition to these 42 OTUs, a further 124 OTUs were identified by culture enrichment (Extended Data Fig. 2c). This enrichment in OTU recovery did not correlate with variability in the α-diversity of the original sample (Extended Data Fig. 2e). We hypothesized that the ability to enrich may be due to the recovery of low-abundance taxa. To test this hypothesis, we resequenced a sample to a depth 24 times deeper than the original direct sequencing (972,834 versus 41,199 reads) and rarefied at decreasing depths (range, 500,000-1,000 reads). We observed that the number of OTUs recovered only by culture decreases as the sequencing depth increases (Extended Data Fig. 3). These cultured OTUs were typical members of the cystic fibrosis lung microbiota, including Streptococcus sp., Prevotella sp. and Veillonella sp., indicating that culture allows for the enrichment of taxa present at low abundance in the original sample.
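The depth experiment described above can be illustrated with a minimal rarefaction sketch. It is not the QIIME routine used in the study: `rarefy` and `culture_only_otus` are hypothetical helper names, and the subsampling is plain random sampling of reads without replacement.

```python
import random


def rarefy(counts, depth, seed=0):
    """Subsample `depth` reads without replacement from an OTU -> count dict."""
    pool = [otu for otu, n in counts.items() for _ in range(n)]
    if depth > len(pool):
        raise ValueError("depth exceeds the number of reads available")
    rng = random.Random(seed)
    sub = {}
    for otu in rng.sample(pool, depth):
        sub[otu] = sub.get(otu, 0) + 1
    return sub


def culture_only_otus(direct_counts, cultured_otus, depth, seed=0):
    """OTUs found by culture but absent from direct sequencing at this depth."""
    seen = set(rarefy(direct_counts, depth, seed))
    return set(cultured_otus) - seen


# Toy example: B is rare in the direct sequencing, C is absent from it.
direct = {"A": 990, "B": 10}
cultured = {"A", "B", "C"}
at_full_depth = culture_only_otus(direct, cultured, 1000)
at_low_depth = culture_only_otus(direct, cultured, 10)
```

At full depth only the OTU that is truly absent from the direct sequencing remains culture-only; at shallow depths, rare OTUs such as B may also drop out of the direct profile, which is the pattern the experiment tests for.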
Culture enrichment's increase in OTU recovery is dependent on medium type and oxygen availability. The variety of media and environmental conditions used is important in capturing the diversity of microbial communities. The use of both anaerobic and aerobic conditions encourages the recovery of different taxa, as is evident in the taxonomic distribution and β-diversity relationships of the 16S rRNA gene sequencing results of direct and culture-enriched sequencing (Fig. 3a,b). For example, Veillonella sp. and Prevotella sp. were recovered exclusively under anaerobic conditions; conversely, Rothia sp. and Pseudomonas sp. were obtained at greater abundances in aerobic culture (Fig. 3a). The α-diversity of each culture condition (medium + aerobic/anaerobic environment) varied with each sample (Supplementary Fig. 2) and no single condition consistently best recapitulated the originating sputum sample (Extended Data Fig. 4). Hierarchical clustering of the taxa recovered from each culture condition further indicates the importance of using both selective and non-selective culture conditions (Fig. 3c and Extended Data Figs. 5 and 6). Although some organisms, such as Streptococcus sp., will grow under many conditions, others, such as Neisseria sp., Rothia sp. and Stenotrophomonas sp., were only recovered from a subset of culture conditions. Furthermore, there were also OTU-dependent differences in growth patterns within some genera (for example, Prevotella OTUs; Fig. 3d and Extended Data Fig. 6). Across the dataset, anaerobic culture recovered almost half of all cultured OTUs; furthermore, we nearly doubled the number of recovered OTUs by expanding our culture conditions beyond the four typically employed in a clinical laboratory (Fig. 3e).
Another advantage of culture enrichment is that it allows for the post hoc recovery of organisms of interest from frozen bacterial stocks (see Methods). As an example, Stenotrophomonas sp. were isolated from two bacterial stocks in which they accounted for low relative abundances (1.3 and 1.5%). To facilitate recovery, the stocks were replated on a medium type specific to Stenotrophomonas sp. (Fig. 3f). Taxonomic assignment of isolated colonies as Stenotrophomonas maltophilia was confirmed via full-length 16S rRNA gene sequencing (Supplementary Table 2).
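The comparison of culture-condition subsets (all conditions, aerobic only, or the four standard clinical conditions) amounts to taking the union of OTUs over the chosen plates. The sketch below uses invented toy data; `otus_recovered` is a hypothetical helper, the `.Aer`/`.Ana` suffix convention follows the condition names used in this article, and `CNA.Ana` is assumed here purely for illustration.

```python
def otus_recovered(plate_otus, conditions=None):
    """Union of OTUs over the chosen culture conditions (all conditions if None)."""
    names = plate_otus.keys() if conditions is None else conditions
    recovered = set()
    for name in names:
        recovered |= set(plate_otus.get(name, ()))
    return recovered


# Toy data: condition name -> taxa detected on that plate (illustrative only).
plate_otus = {
    "CBA.Aer": {"Pseudomonas", "Streptococcus"},
    "MAC.Aer": {"Pseudomonas"},
    "MSA.Aer": {"Staphylococcus"},
    "CHOC.Ana": {"Streptococcus", "Veillonella"},
    "CNA.Ana": {"Prevotella", "Veillonella"},
    "CNA.Aer": {"Rothia", "Stenotrophomonas"},
}
aerobic = [c for c in plate_otus if c.endswith(".Aer")]
clinical = ["CBA.Aer", "MAC.Aer", "MSA.Aer", "CHOC.Ana"]

all_otus = otus_recovered(plate_otus)
aerobic_otus = otus_recovered(plate_otus, aerobic)
clinical_otus = otus_recovered(plate_otus, clinical)
```

Comparing the sizes of the three sets gives the kind of summary plotted for the real dataset.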
Fig. 2b caption (fragment): 63.3% of OTUs across the dataset were identified only by culture-enriched sequencing (green ring). In contrast, 5.7% of OTUs were not cultured, including many Tenericutes (^) and Saccharibacteria (*) and all Spirochaetes (&) and SR1 (+). Similar results were obtained with a more stringent relative abundance cutoff (Extended Data Fig. 1a,b).
The PLCA informs culture-enriched metagenomic sequencing. The conditions used for culture enrichment are necessarily broad; the cystic fibrosis lung microbiota, like other human-associated communities, can be composed of a wide range of organisms, from common pathogens (for example, Pseudomonas, Staphylococcus and Haemophilus 25) to anaerobes (for example, Prevotella, Fusobacterium and Veillonella [26][27][28]) and emerging pathogens (Stenotrophomonas maltophilia and Achromobacter 29). Although the lung could harbour any of these organisms, an individual's microbiota is a unique subset of these possibilities. This means that while the variety of culture conditions used is necessary to capture the diversity across individuals, not every plate is needed to enrich the microbiota of a given sample. This is also true of other human-associated communities; the gut, for example, harbours many organisms that are prevalent across the population, whereas some species can be quite specific to the individual 30. It is not known a priori which subset of plates would best recapitulate a given community; however, we can use 16S rRNA gene sequencing to identify the taxonomic distribution across cultured plates, and to choose the subset of plates that best recapitulates the community on which to conduct culture-enriched metagenomic sequencing. As such, we implemented the PLCA, which determines the minimum number of culture-enriched plates necessary to capture the taxonomic diversity of a sample with culture-enriched metagenomics.
The PLCA is not specific to a particular set of culture conditions, or particular microbiome, but instead can be used with any community that contains a culturable majority.
Fig. 3 caption (fragment): In a, culture-enriched plates, indicated with circles as in Fig. 1, are displayed alphabetically (Extended Data Fig. 1). In b, the direct sequencing clusters with some aerobic samples due to the abundance of Pseudomonas sp., and not due to a lack of bacterial growth (Supplementary Fig. 1). c, A heatmap showing the maximum observed relative abundance (range 0-1) of each phylum across culturing conditions. Aerobic (Aer) and anaerobic (Ana) culture condition acronyms and recipes are provided in the Methods. Genus-level labelling is available in Extended Data Fig. 5. d, Within a genus, different OTUs can also have different culture preferences (Extended Data Fig. 5). e, The number of OTUs obtained from culture enrichment is compared to the number obtained if only aerobic culturing was used, or if culture was restricted to that of a standard clinical microbiology laboratory (CBA.Aer, MAC.Aer, MSA.Aer, CHOC.Ana). f, Cultured organisms can be recovered from frozen bacterial stocks. Here, Stenotrophomonas sp. were isolated from stocks with a relative abundance of 1.3% on CNA.Aer and 1.5% on TSy.Aer, respectively. Plates with low abundance of Stenotrophomonas sp. were purposefully chosen to indicate the power of this approach.
There are two versions of the PLCA. The de novo PLCA recapitulates the culture-enriched community independently of direct sequencing, whereas the adjusted PLCA focuses on the OTUs recovered from the direct sequencing results (Supplementary Fig. 3). That is to say, the user can decide between recovery of all cultured organisms (de novo PLCA) or preferential recovery of the abundant organisms from the original community (adjusted PLCA). The choice of version depends on whether the user is interested in the composition of the original sample (for example, when answering clinically relevant research questions) or in questions concerning sample biodiversity. Using these algorithms at different thresholds (see Methods) across the dataset highlights the uniqueness of these communities: no two samples have the same optimal plate set, and every culture condition in the de novo and adjusted outcomes is necessary for the culture-enriched metagenomic sequencing of at least one sample in the dataset (Fig. 4a,b and Extended Data Fig. 7).
Fig. 4 caption (fragment; plot axes: relative abundance (%) against cultured OTUs, or OTUs in direct sequencing, in decreasing abundance; legend: identified, not identified; sample labels S9 to S20): a, The number of plates and OTUs necessary to capture the taxonomic diversity above varying thresholds of a representative sample using de novo and adjusted PLCAs. The resultant OTUs are divided into those obtained above the threshold and those captured by consequence of being present on plates within the optimal plate set. A similar output for all other samples is available (Extended Data Fig. 6). b, The plate subsets for the de novo and adjusted PLCAs for each sample in the dataset. Each culture condition is represented with a grey dot, which is coloured in the samples (S) in which it is part of the PLCA optimal plate set. c, The number of identified OTUs obtained with the de novo (blue) and adjusted PLCA (orange) when applied to sample 1 with the thresholds (dotted lines) displayed in b. Because the aim of the de novo PLCA is to recover the most abundant cultured organisms, the OTUs identified by culture enrichment are displayed; in contrast, the adjusted PLCA aims to recover abundant OTUs from the original sample and thus the OTUs identified in the direct sequencing are shown.
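To make the plate-selection idea concrete, the sketch below chooses plates with a greedy set-cover heuristic. This is an assumption-laden illustration and not the authors' PLCA, whose details are given in their Methods: the function names, the way target OTUs are defined for the de novo and adjusted variants, and the greedy strategy itself (which approximates, rather than guarantees, a minimum plate set) are all hypothetical choices made here.

```python
def de_novo_targets(plate_abund, threshold):
    """Assumed de novo target set: OTUs at or above `threshold` on any plate."""
    return {o for ab in plate_abund.values() for o, a in ab.items() if a >= threshold}


def adjusted_targets(direct_abund, plate_abund, threshold, min_rel_abund=0.0001):
    """Assumed adjusted target set: abundant direct-sequencing OTUs that were cultured."""
    cultured = {o for ab in plate_abund.values() for o, a in ab.items() if a >= min_rel_abund}
    return {o for o, a in direct_abund.items() if a >= threshold} & cultured


def greedy_plate_cover(plate_abund, targets, min_rel_abund=0.0001):
    """Greedily pick plates until every coverable target OTU is on a chosen plate.

    plate_abund: dict plate -> dict OTU -> relative abundance (0-1).
    Returns (chosen plates in pick order, targets no plate can cover).
    """
    present = {
        p: {o for o, a in ab.items() if a >= min_rel_abund}
        for p, ab in plate_abund.items()
    }
    remaining = set(targets)
    chosen = []
    while remaining and present:
        # Sorting makes tie-breaking deterministic (alphabetical plate name).
        best = max(sorted(present), key=lambda p: len(present[p] & remaining))
        gained = present[best] & remaining
        if not gained:
            break  # nothing left that any plate can cover
        chosen.append(best)
        remaining -= gained
    return chosen, remaining


# Toy relative abundances, for illustration only.
plates = {
    "CBA.Aer": {"A": 0.9, "B": 0.1},
    "CHOC.Ana": {"B": 0.5, "C": 0.5},
    "MAC.Aer": {"A": 1.0},
}
direct = {"A": 0.98, "B": 0.015, "Z": 0.005}

de_novo_set, _ = greedy_plate_cover(plates, de_novo_targets(plates, 0.05))
adjusted_set, _ = greedy_plate_cover(plates, adjusted_targets(direct, plates, 0.01))
```

In this toy case the de novo targets need two plates, whereas the adjusted targets are covered by one, mirroring the idea that the two variants can yield different plate subsets for the same sample.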
Fig. 8). Following metagenomic co-assembly, binning and taxonomic assignment (see below), we compared the observed versus expected results of each algorithm. The de novo PLCA recovered 10 metagenomic bins with matching taxonomic assignments to the 10 OTUs above the PLCA threshold, in addition to a further 6 bins matching OTUs below the threshold (Fig. 4c and Fig. 9a). We also applied the PLCA to previously published culture enrichment data from the gut microbiota 12, establishing that this algorithm is not specific to the lung microbiome or a specific set of culture conditions (Extended Data Fig. 9b).
For each of the de novo and adjusted PLCA plate sets and the direct sequencing, we co-assembled shotgun metagenomic reads into contigs, which were binned and defined as metagenome-assembled genomes (MAGs) or non-MAG bins based on sequence composition (see Methods). Culture-enriched metagenomic sequencing resulted in 7 MAGs and 9 non-MAG bins (de novo PLCA) and 12 MAGs and 12 non-MAG bins (adjusted PLCA); in contrast, 1 MAG and 1 non-MAG bin were recovered from direct metagenomic sequencing (Fig. 5). When the direct sequencing metagenomic reads were mapped onto the culture-enriched metagenomic bins, significant coverage was only observed in 1 bin in both the de novo and adjusted PLCA results (Fig. 5, blue dots). Both direct sequencing metagenomic bins were taxonomically assigned as Pseudomonas sp.; in contrast, the culture-enriched bins were taxonomically diverse, spanning 14 genera (Fig. 5).
The increased taxonomic diversity obtained via culture enrichment directly translates into increased functional information about this microbial community (Fig. 6). Consistently, across clusters of orthologous groups (COGs), functional categories and predictions of virulence genes, phage, antibiotic resistance, clustered regularly interspaced short palindromic repeats (CRISPRs) and secondary metabolites, culture-enriched metagenomic sequencing provides a greater diversity and number of functional identifications. For example, genes contributing to the GacS/GacA two-component system were identified in Pseudomonas sp. bins from both culture enrichment datasets but not in the direct sequencing of the sputum sample. Previous research has shown that strains of P. aeruginosa lacking this system are less able to colonize in mouse models 31, indicating its importance for P. aeruginosa virulence. The detection of these genes in the culture-enriched data suggests that this system is present in the cystic fibrosis lung microbiota but is not identified in the direct metagenomic sequencing due to poor sequencing depth/assembly. Among the Pseudomonas sp. results, there were 10, 16 and 17 'perfect' hits against the Comprehensive Antibiotic Resistance Database (CARD) in the direct, de novo PLCA and adjusted PLCA datasets, respectively; among these, two beta-lactamases (OXA-50, PDC-1) and two repressors (nalD, nfxB) were found only in the culture-enriched sequencing. These Pseudomonas-specific results indicate that even when bins overlap in taxonomic assignment between techniques, culture enrichment yields more functional annotation; in addition, the functional characterizations of all other bins (15 and 23 in the de novo and adjusted PLCA, respectively) would not have been possible from direct sequencing alone.
Previous research has established the presence of heterogeneous populations of single species in cystic fibrosis airways (for example, in P. aeruginosa 32, Burkholderia cepacia complex 33 and Stenotrophomonas maltophilia 34) as well as in the human microbiome as a whole (for example, Bifidobacterium longum populations in the infant gut microbiome 35, various strains following faecal microbiota transplantation 36). As such, we calculated the genetic variability within each metagenomic bin by identifying haplotypes within open reading frames 37 (Fig. 6d). In some metagenomic bins, we identify a consistent and small number of open reading frame haplotypes, suggesting that the bin represents a single genomic population (that is, one strain). However, in most bins, we see great haplotype diversity indicative of heterogeneous populations (that is, multiple strains). As expected, the number of haplotypes per gene within these bins is diverse, indicating the known spectrum of evolutionary pressures within bacterial genomes 38. On average, more gene haplotypes were identified in the Pseudomonas sp. culture-enriched bins than in the direct sequencing; however, the direct sequencing identified a greater number of prevalent haplotype gene outliers. There was no correlation between bin completeness or MAG status and mean haplotype frequency.

Discussion
The decreasing cost and increasing availability of massively parallelized sequencing technology have revolutionized the way that we study the human microbiota, leading to an increased understanding of these communities and how they relate to health and disease. The gut has arguably been the most well-studied human microbiome, with numerous studies linking it to various diseases (for example, inflammatory bowel disease 39 and irritable bowel syndrome 40) and conditions (for example, obesity 41, pregnancy 42, metabolic 43 and mental health 44). This community lends itself well to these investigations; its composition can be approximated via faecal matter (readily available without intervention) and consists of a dense microbial community with little host DNA contamination. However, many other important human-associated communities have high levels of non-microbial DNA, including communities of the skin 45, tissue biopsies 19 and oral microbiome samples 46. The respiratory microbiome is a low biomass community, often contaminated with DNA from endothelial cells or from the DNA associated with an acute immune response in diseases such as asthma and cystic fibrosis 47. Because of the nature of such samples, metagenomic sequencing must be paired with the in silico removal of most sequencing reads due to host contamination, meaning that only the most abundant members of the community can be assembled into MAGs. We demonstrate that culture-enriched metagenomics, in conjunction with traditional, culture-independent sequencing, can improve the resolution of these communities.
Within this Article, we show that the cystic fibrosis lung microbiota is culturable, and that culture enrichment increases OTU diversity (Fig. 2). This follows directly from ref. 16, in which, using T-RFLP and 454 sequencing, it was possible to identify a culturable majority within the cystic fibrosis lung microbiota 16. Although culture enrichment consistently recapitulated >97% of the direct sequencing results, a few organisms were not cultured. Many, including members of the Spirochaetes, SR1, Tenericutes and Saccharibacteria, consist of organisms that are difficult to culture [48][49][50], but were also at low abundance in these samples. In contrast, the organisms recovered only by culture are common members of the cystic fibrosis lung microbiota 25, including Rothia, Prevotella, Veillonella, Fusobacterium and Streptococcus species. Selective media allow for the proliferation of low-abundance organisms, which can be below the level of detection of standard sequencing approaches. It is not uncommon, for example, for the cystic fibrosis lung microbiota to reach a density of at least 10⁸ c.f.u. ml⁻¹ (refs. [51][52][53]). If amplicon sequencing produces 50,000 reads per sample, an organism identified by a single read would equate to ≤0.002% relative abundance or 2 × 10⁴ c.f.u. ml⁻¹. This is not to dismiss culture-independent approaches; culture enrichment benefits from being combined with direct sequencing in order to maintain the relative abundance ratios of the original community and to recover important uncultured organisms.
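The detection-limit reasoning above is a one-line proportion, sketched here for readers who want to vary the inputs. The function name `single_read_limit` and its parameters are illustrative only; the key assumption is the total bacterial density supplied, since the equivalent density of a single-read organism scales linearly with it (a tenfold denser sample shifts the result tenfold).

```python
def single_read_limit(reads_per_sample, density_cfu_per_ml):
    """Relative abundance (%) and equivalent density of an OTU seen as one read.

    Assumes reads are proportional to cell numbers, which ignores 16S copy
    number and amplification bias.
    """
    rel_abund = 1 / reads_per_sample
    return rel_abund * 100, rel_abund * density_cfu_per_ml


# One read out of 50,000, for two assumed total densities (c.f.u. per ml).
pct_low, cfu_low = single_read_limit(50_000, 1e8)
pct_high, cfu_high = single_read_limit(50_000, 1e9)
```

The percentage depends only on sequencing depth, whereas the equivalent c.f.u. value depends on the total load assumed for the sample.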
To most effectively combine culture-independent and -dependent approaches, we designed the PLCA, which determines the culture-enriched plates whose metagenomic sequencing is needed to recapitulate the original microbial community, as determined by amplicon sequencing. We show that the PLCA recovers targeted OTUs as well as a substantial number of additional OTUs (Fig. 4). Importantly, PLCA-assisted culture-enriched metagenomics vastly improves the taxonomic and functional outputs of sequencing. The inability of traditional metagenomic sequencing to distinguish microorganism from host can result in the need for very high sequencing depths for samples with high host contamination. Combining culture with culture-independent sequencing is one way of mitigating host contamination due to culture's ability to enrich for viable microorganisms.
The PLCA is not specific to the cystic fibrosis microbiome and can be used on any microbial community in which most of its membership is culturable. The culture conditions chosen will impact the ability to culture the sample and the performance of the PLCA. Ideally, a combination of selective, non-selective and enrichment media should be used. Furthermore, the PLCA is also not specific to any one 16S rRNA gene processing pipeline. Here, we have used a 16S rRNA gene sequencing pipeline, which has been validated previously against other available approaches 54 ; however, 16S rRNA gene sequencing processing is a moving target, with new (or improved) methodologies constantly being published and compared/validated against already existing methods. Because the PLCA is agnostic to how the data are processed, it can be applied to datasets processed with any method. As the field continues to progress, the PLCA will only benefit from the improvements in taxonomic resolution available from these technological improvements.
The combination of culture enrichment with direct sequencing enhances the observed taxonomic diversity and provides greater insight into human-associated microbial communities. We have shown that culture-enriched metagenomics provides deeper resolution of these communities. With these data, we can better predict mechanisms of antimicrobial resistance, virulence factors and, in general, gain a better understanding of each organism's gene repertoire. Furthermore, having these organisms in culture means that we can carry out in vitro, mechanistic studies to better understand these communities in the context of human health and disease. Culture-enriched metagenomics, as exemplified here for cystic fibrosis sputum, provides an approach for the study of microbiome samples that are composed mostly of human rather than microbial DNA. This method can be applied to any community where the majority of its members are culturable.

Methods
Sputum collection and culture enrichment. On receiving informed consent, sputum samples were collected from 4 December 2013 to 6 October 2014 from willing participants visiting the Calgary Adult Cystic Fibrosis Clinic (ethical approval was granted by the Calgary Health Region Ethics Board, REB-24123). Two samples were collected from each patient (with the noted exceptions): one at the onset of pulmonary exacerbation (as defined by ref. 55 ) and a second during a follow-up appointment (one week to four months) following the resolution of symptoms and antibiotic discontinuation. In one case, a patient was not able to produce a follow-up sputum sample; in another, a patient experienced two exacerbations before a follow-up appointment, so three samples were collected.
After 3-5 days (aerobic) and 5-7 days (anaerobic) of growth, plates were imaged and growth acquired by adding 2 ml of BHI broth to each plate and lifting the colonies. A 1 ml sample of this broth was frozen directly for DNA extraction while the remaining 1 ml was frozen in skimmed milk (final concentration 10%) for any potential growth or re-isolation experiments. For the first few culture enrichment sample sets, plates with no visible growth were processed like any other plate (see below); however, we consistently were unable to obtain visible PCR products on a 2% agarose gel from those plates that did not have visible growth. Thus, any plate that resulted in no visible bacterial colonies was discarded and omitted from all downstream processing.
To demonstrate the reproducibility of the sputum sample collection and culture enrichment methods, we carried out an additional experiment where two sputum samples from each of three patients were collected in the clinic before and after physiotherapy (3 × 2 biological replicates). The consistency of these biological replicates indicated the similarity of sputum communities when collected in quick succession. These samples were then plated on six media in triplicate (6 × 3 technical replicates across six sputum samples, n = 108). The results demonstrate the consistency in replicate sputum samples and in culture enrichment (Extended Data Fig. 10).

DNA isolation and Illumina sequencing.
Genomic DNA was isolated from culture-enriched plates and sputum as previously described 58, with the exception of using lifted colonies/homogenized sputum as input instead of Copan swabs, as performed in ref. 16. Dilutions resulting from the same culture conditions were combined into one genomic DNA isolation for a maximum of 26 culture-enriched samples per sputum sample. The variable 3 (V3) region of the 16S rRNA gene was amplified using universal primers as adapted from refs. 58,59. The PCR reaction consisted of 5 pmol of each primer, 1 ng template DNA, 200 µM deoxynucleoside triphosphates, 1.5 mM MgCl₂ and 1 U Taq polymerase. The PCR protocol used was as follows: 95 °C for 5 min, followed by 30 cycles of 95 °C for 30 s, 50 °C for 30 s and 72 °C for 30 s, with a final 72 °C for 7 min. The presence of a PCR product was verified by electrophoresis (2% agarose gel). PCR products were sequenced using the Illumina MiSeq platform using 2 × 250 paired-end reads.
DNA from select culture-enriched samples and the sputum sample were sonicated to 300 bp and library preparations were made using the NEBNext DNA Library Prep Master Mix Set for Illumina (New England Biolabs) and sequenced using the Illumina HiSeq platform with 2 × 250 paired-end reads.
All sequencing results are publicly available (BioProject ID PRJNA503799).
16S rRNA sequence processing and analysis. 16S rRNA paired-end reads were processed using sl1p 54. Briefly, reads were trimmed of any remaining primers using cutadapt 60, and low-quality reads were discarded using sickle with a quality threshold of 30 (https://github.com/najoshi/sickle). Paired-end reads were assembled using PANDAseq 61.
OTUs were picked using AbundantOTU+ with a 97% clustering threshold 62 and chimaeras were removed using USEARCH 63 as implemented in QIIME 64. The Ribosomal Database Project classifier 65 was used to assign taxonomy against the 4 February 2011 release of the Greengenes database 66, and a phylogeny was created by pruning the Greengenes phylogeny to those taxa present in the dataset. OTU tables were created with QIIME 64. Any OTU that was not assigned a bacterial taxonomy, or that occurred only once across the full dataset (singleton), was culled. Any sample with <1,000 reads was discarded (Supplementary Table 5). This culling, combined with sequencing only plates with visible growth, left a total of 531 samples (20 sputum samples and 511 plates). The mean sequence depth across this dataset was 68,160 reads per sample (range 2,032-159,381), with a mean number of OTUs of 94.1 (range 10-311). Taxonomic summaries over multiple samples were produced by taking each taxon's maximum relative abundance across samples and normalizing the result to 100%. Principal coordinate analysis (PCoA) plots were calculated using phyloseq 67 and ggplot2 68 in R after proportional normalization 69. An OTU was considered present in the direct or culture-enriched sequencing if it had a relative abundance of >0.01% (all exceptions noted). Phylogenies were decorated with GraPhlAn 70. Heatmaps were generated with pheatmap 71. In the sequencing depth experiments shown in Extended Data Fig. 3, rarefaction was performed at varying depths using QIIME's alpha rarefaction function.
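The culling, presence and summary rules described above can be illustrated with a minimal Python sketch. It is not the sl1p or QIIME implementation; the function names and the nested-dictionary table layout are hypothetical, and it assumes that a singleton is an OTU with a total count of one across the dataset and that the 1,000-read cutoff is applied after OTU culling (the text does not specify the order).

```python
from typing import Dict, Set

# Hypothetical layout: sample name -> {OTU id: read count}
OtuTable = Dict[str, Dict[str, int]]


def cull(table: OtuTable, min_reads: int = 1000) -> OtuTable:
    """Drop singleton OTUs, then drop samples with fewer than min_reads reads."""
    totals: Dict[str, int] = {}
    for counts in table.values():
        for otu, n in counts.items():
            totals[otu] = totals.get(otu, 0) + n
    kept: OtuTable = {}
    for sample, counts in table.items():
        # Assumption: singleton = one read in total across the full dataset
        filtered = {o: n for o, n in counts.items() if n > 0 and totals[o] > 1}
        if sum(filtered.values()) >= min_reads:
            kept[sample] = filtered
    return kept


def relative_abundance(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert read counts to percentages of the sample total."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {otu: 100.0 * n / total for otu, n in counts.items()}


def present_otus(counts: Dict[str, int], threshold_pct: float = 0.01) -> Set[str]:
    """OTUs with a relative abundance strictly above threshold_pct (in %)."""
    return {o for o, a in relative_abundance(counts).items() if a > threshold_pct}


def max_abundance_summary(table: OtuTable) -> Dict[str, float]:
    """Maximum relative abundance of each OTU across samples, rescaled to 100%."""
    maxima: Dict[str, float] = {}
    for counts in table.values():
        for otu, a in relative_abundance(counts).items():
            maxima[otu] = max(maxima.get(otu, 0.0), a)
    total = sum(maxima.values())
    return {otu: 100.0 * a / total for otu, a in maxima.items()} if total else {}
```

Because the summary takes maxima rather than means, an OTU that dominates a single plate contributes as much as one that is abundant everywhere, which is why the values are rescaled to sum to 100%.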
Recovery of isolates from frozen culture-enriched stocks. Improved isolation of Stenotrophomonas maltophilia from frozen skimmed milk stocks of select plates was performed using a selective medium as described previously 72. Isolates were Sanger sequenced using the 8F (5′-AGAGTTTGATCCTGGCTCAG-3′) and 926R (5′-CCGTCAATTCCTTTRAGTTT-3′) primers to the 16S rRNA gene, resulting in a 900 nt product. The identity of the isolates was confirmed by comparison to the Human Oral Microbiome Database (HOMD) and to NCBI's 16S ribosomal RNA sequences (Bacteria and Archaea) Database.
The PLCA. Taking 16S rRNA gene sequencing results as input, the PLCA calculates, for each sample, the optimal subset of cultured plates that should be included in culture-enriched metagenomics in order to recapitulate the microbial community. The PLCA (Supplementary Fig. 3a,b) first identifies any OTU above the user-supplied relative abundance threshold that was only cultured on a single plate, and that plate is added to the 'plate set' for culture-enriched sequencing. Next, for all OTUs not already identified in the plate set, the plate with the most OTUs present above the threshold is added to the plate set. This continues until all OTUs are accounted for in the plate set and this list is output to the user. The PLCA incorporates a user-adjustable relative abundance threshold to determine which cultured OTUs the algorithm should include in the resulting plate set. In the adjusted PLCA, a second threshold determines the cutoff of OTU inclusion from the direct sequencing. Altering these thresholds results in varying plate and OTU recovery (including OTUs below the threshold, which are included as a consequence of being present on a plate that is part of the optimal plate set; Fig. 4a and Extended Data Fig. 7). The plate set for the adjusted PLCA is not always a direct subset of the de novo PLCA because, when only the organisms present in the sputum are considered, there is often a better combination of plates that minimizes the number of total plates needed for sequencing.
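The plate selection logic described above can be sketched in a few lines of Python. This is an illustrative reading of the text, not the authors' released code: the function name `plca`, its arguments and the input layout are hypothetical, 'above the threshold' is taken as strictly greater than, and the greedy step is assumed to count only OTUs not yet covered by the plate set. Passing `targets` (the OTUs found above the second threshold in direct sequencing) mimics the adjusted PLCA.

```python
from typing import Dict, Iterable, List, Optional, Set


def plca(
    plates: Dict[str, Dict[str, float]],
    threshold: float,
    targets: Optional[Iterable[str]] = None,
) -> List[str]:
    """Return an ordered plate set covering all wanted OTUs.

    plates    : plate name -> {OTU id: relative abundance in %} (hypothetical layout)
    threshold : relative abundance cutoff (%) for an OTU to count as cultured
    targets   : optional OTUs to restrict to (adjusted variant); None = de novo
    """
    # OTUs present above the threshold on each plate
    above: Dict[str, Set[str]] = {
        p: {o for o, a in otus.items() if a > threshold} for p, otus in plates.items()
    }
    wanted: Set[str] = set().union(*above.values())
    if targets is not None:
        wanted &= set(targets)

    plate_set: List[str] = []
    # Step 1: an OTU cultured on exactly one plate forces that plate into the set
    for otu in sorted(wanted):
        holders = [p for p in sorted(above) if otu in above[p]]
        if len(holders) == 1 and holders[0] not in plate_set:
            plate_set.append(holders[0])

    covered: Set[str] = set()
    for p in plate_set:
        covered |= above[p]
    remaining = wanted - covered

    # Step 2: greedily add the plate holding the most still-uncovered OTUs
    while remaining:
        best = max(sorted(above), key=lambda p: len(above[p] & remaining))
        plate_set.append(best)
        remaining -= above[best]
    return plate_set
```

For a toy input with plates P1 = {A, B}, P2 = {B, C}, P3 = {C, D} and P4 = {A}, the de novo call returns P3 (the only plate with D) followed by P1, whereas restricting the targets to B and C returns P2 alone. This reproduces the observation that the adjusted plate set need not be a subset of the de novo one.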

Metagenomic sequence processing and analysis.
Resultant Illumina paired-end reads from the 20 sputum samples and their associated adjusted PLCA plate sets (and one additional de novo PLCA plate set for comparison) were first processed with cutadapt to trim Illumina adapters and primers 60. Sickle, with a quality threshold of 30, was used to remove low-quality sequences (https://github.com/najoshi/sickle). Host-associated reads were removed from the direct sequencing data using DeconSeq 73. Metagenomic assembly of the direct sequencing reads was conducted using Megahit 74. A co-assembly of the culture-enriched reads (as determined by the de novo PLCA and/or adjusted PLCA) was also conducted using Megahit 74. The results of all assemblies were separately binned using MaxBin-2.2.1 75; MaxBin was chosen based on its performance in the CAMI Challenge 76. Quality was assessed with checkM 77, and taxonomic assignments for each bin were determined using KrakenUniq (formerly KrakenHLL) 78 and supplementary scripts (https://github.com/shekas3/BinTaxaAssigner). Assembly statistics, including mean/maximum contig length and N50 values, are provided in Supplementary Table 6. MAGs were defined as metagenomic bins with ≥70% completeness and <10% contamination as previously described 79; non-MAG bins are any bins that do not meet these criteria. Only contigs >1,000 bp were binned. Bowtie2 was used to map direct sequencing reads onto culture-enriched metagenomic bins. Even though extensive decontamination and quality control measures were performed, the direct sequencing suffered from considerable host DNA contamination, resulting in reads within these bins mapping with high stringency to the human genome. This resulted in bins from the direct sequencing that were much larger than the size of the closest reference genome (Supplementary Fig. 4). In contrast, culture-enriched bins with high 'contamination' indicated the inability of the binning algorithm to differentiate between closely related species. For example, b6 in de novo and b9 in adjusted had bin lengths almost double the closest reference and contained taxonomic signatures of two Streptococcus sp. (Supplementary Fig. 4).
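As a worked illustration of the bin criteria and assembly statistics named above, the following Python sketch applies the MAG definition (≥70% completeness, <10% contamination), the >1,000 bp contig filter and a standard N50 calculation. The function names are hypothetical, and the completeness and contamination values are assumed to be percentages such as those reported by checkM; none of this is taken from the authors' scripts.

```python
from typing import List, Sequence


def is_mag(completeness: float, contamination: float) -> bool:
    """MAG criteria from the text: >=70% completeness and <10% contamination."""
    return completeness >= 70.0 and contamination < 10.0


def binnable(contig_lengths: Sequence[int], min_len: int = 1000) -> List[int]:
    """Keep only contigs strictly longer than min_len bp, as only these were binned."""
    return [length for length in contig_lengths if length > min_len]


def n50(contig_lengths: Sequence[int]) -> int:
    """Length L such that contigs of length >= L hold at least half of all bases."""
    ordered = sorted(contig_lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input
```

A bin that passes `is_mag` can still be a merge of closely related species, as noted for the Streptococcus bins, so bin length relative to the closest reference remains a useful additional check.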
To compare the taxonomic composition of 16S rRNA gene and metagenomic sequencing, we used the Kraken2 (version 2.0.7) classifier 80 to classify metagenomic reads from each culture-enriched sample against the 2011 Greengenes database 66 .
To show that the PLCA was not specific to the cystic fibrosis lung microbiota, the algorithm was applied to a previously published culture enrichment study of the gut microbiota 12 . The de novo and adjusted PLCA were applied, with default thresholds, to sample IBS3, and the taxonomic recovery from culture-enriched metagenomic sequencing was predicted based on the 16S rRNA gene sequencing profiles of the resulting plate set.
Functional annotations of each metagenomic bin were performed with a variety of software. Virulence gene counts were determined by use of blastn (E value cutoff of 10⁻⁸) 81 against PATRIC's virulence factor library 82 and counting the number of hits to unique virulence genes per bin. COG functional category counts were determined using eggnog-mapper with default parameters 83,84. Phage counts were determined by Phaster 85 and predictions of antibiotic resistance genes were conducted with CARD (2.0.2) in conjunction with the RGI (4.1.0) 86. Secondary metabolites were predicted with PRISM 3 87. The presence of CRISPRs was determined using MinCED 88. The direct sequencing was plagued with host DNA contamination, resulting in excessive bin lengths and, possibly, abundant type I errors in the identified functionality of the community (Fig. 6, Supplementary Fig. 4 and Supplementary Table 7). All bar charts were made using R's ggplot2 68 and heatmaps were visualized using R's pheatmap 71. The haplotype diversity of the open reading frames of each bin was calculated using Hansel and Gretel 37.
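The per-bin virulence gene count can be sketched as follows. The sketch assumes that the blastn results were written in the standard 12-column tabular format (subject ID in column 2, E value in column 11); the text does not state the output format used, and the function name is hypothetical.

```python
from typing import Iterable


def count_unique_hits(blast_lines: Iterable[str], max_evalue: float = 1e-8) -> int:
    """Count distinct subject (virulence gene) IDs with E value <= max_evalue.

    Assumes BLAST tabular output: qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore (tab separated).
    """
    genes = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        sseqid, evalue = fields[1], float(fields[10])
        if evalue <= max_evalue:
            genes.add(sseqid)
    return len(genes)
```

Counting unique subject IDs rather than raw hits prevents a virulence gene that matches several contigs in the same bin from being counted more than once.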
Reporting Summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Code availability
All code developed by the authors is available under a GNU licence at http://github.com/fwhelan/PLCA and https://github.com/shekas3/BinTaxaAssigner.

Fig. 9 | The PLCA consistently recovers targeted OTUs and is not specific to the cystic fibrosis lung microbiota. a, When the adjusted PLCA was applied to the 20 samples in this cystic fibrosis dataset, it consistently recovered the targeted OTUs (orange), though some (gray) were not recovered as metagenomic bins due to inadequate sequencing depth or the inability to separate species into separate bins. The overlaying numbers represent the number of OTUs in each category. Additional species obtained as a consequence of being present on a plate with a targeted OTU are shown as yellow dots. b, The PLCA is not specific to the cystic fibrosis lung microbiota or to a particular set of culture conditions; here, we apply the PLCA to previously published culture-enriched gut microbiota data (ref. 12). Even though the culture conditions used by Lau et al. differ from those used in this study, the PLCA still predicts successful recovery of almost all species at abundances above the PLCA thresholds (dotted line).

Data
All sequencing results are publicly available (BioProject ID PRJNA503799).