Microbiome connections with host metabolism and habitual diet from 1,098 deeply phenotyped individuals

The gut microbiome is shaped by diet and influences host metabolism; however, these links are complex and can be unique to each individual. We performed deep metagenomic sequencing of 1,203 gut microbiomes from 1,098 individuals enrolled in the Personalised Responses to Dietary Composition Trial (PREDICT 1) study, whose detailed long-term diet information, as well as hundreds of fasting and same-meal postprandial cardiometabolic blood marker measurements were available. We found many significant associations between microbes and specific nutrients, foods, food groups and general dietary indices, which were driven especially by the presence and diversity of healthy and plant-based foods. Microbial biomarkers of obesity were reproducible across external publicly available cohorts and in agreement with circulating blood metabolites that are indicators of cardiovascular disease risk. While some microbes, such as Prevotella copri and Blastocystis spp., were indicators of favorable postprandial glucose metabolism, overall microbiome composition was predictive for a large panel of cardiometabolic blood markers including fasting and postprandial glycemic, lipemic and inflammatory indices. The panel of intestinal species associated with healthy dietary habits overlapped with those associated with favorable cardiometabolic and postprandial markers, indicating that our large-scale resource can potentially stratify the gut microbiome into generalizable health levels in individuals without clinically manifest disease. Analyses from the gut microbiome of over 1,000 individuals from the PREDICT 1 study, for which detailed long-term diet information as well as hundreds of fasting and same-meal postprandial cardiometabolic blood marker measurements are available, unveil new associations between specific gut microbes, dietary habits and cardiometabolic health.

Dietary contributions to health and chronic conditions, such as obesity, metabolic syndrome, cancer and cardiovascular disease, are of universal importance. Obesity and associated mortality/morbidity have risen dramatically over the past decades 1, with the gut microbiome implicated as one of several potentially causal human-environment interactions 2,3. Surprisingly, the details of the microbiome's role in obesity and cardiometabolic health have proven difficult to define reproducibly in large human populations 4, probably due to the complexity of habitual diets, the difficulty of measuring them at scale and disentangling them from other lifestyle variables 5,6 and the personalized nature of the microbiome 7.
To overcome these challenges, we launched the PREDICT 1 trial of diet-microbiome interactions in metabolic health 8. PREDICT 1 included >1,000 participants profiled pre- and post-standardized dietary challenges using intensive in-clinic biometric and blood measures, habitual dietary data collection, continuous glucose monitoring and stool metagenomics. The study was inspired by previous large-scale diet-microbiome interaction profiles, which identified gut microbiome configurations and microbial taxa associated with postprandial glucose responses 9,10, obesity-associated biometrics such as body mass index (BMI) and adiposity 11-13 and blood lipids and inflammatory markers 14,15.

Diversity of healthy plant-based foods in habitual diet shapes gut microbiome composition. We assessed the links between habitual diet and the microbiome using random forest models, each trained on quantitative microbiome features to predict each dietary variable from food frequency questionnaires (FFQs) (Methods). The performance of the models was quantified with receiver operating characteristic (ROC) area under the curve (AUC) for classification and correlation for regression (Methods). Several foods and food groups exceeded the 0.15 median Spearman correlation over bootstrap folds (denoted as ρ) between predicted and FFQ-estimated values (14.5%) and AUC > 0.65 (10.8%; Fig. 2a). The strongest association was for coffee (instant or ground) (ρ = 0.43, AUC = 0.8), with dose-dependent effects and validated in the US cohort (Fig. 2d). Tighter microbiome links were found for energy-adjusted nutrients (Fig. 2a), with almost one-third (Supplementary Table 2) showing correlations above 0.3.
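The two performance measures described above can be illustrated with a minimal sketch. This is not the authors' code: the function names and toy values are hypothetical, only the Python standard library is used, and for brevity the quartile AUC scores the regression predictions directly rather than training the separate classifier described in the text.

```python
from statistics import mean, median


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(observed, predicted):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(observed), average_ranks(predicted))


def auc(pos_scores, neg_scores):
    """Probability that a positive outscores a negative (ties count 0.5)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


def quartile_auc(observed, predicted):
    """AUC for separating the highest from the lowest observed quartile."""
    order = sorted(range(len(observed)), key=lambda i: observed[i])
    q = len(order) // 4
    low, high = order[:q], order[-q:]
    return auc([predicted[i] for i in high], [predicted[i] for i in low])


def median_rho_over_folds(folds):
    """Median Spearman correlation over (observed, predicted) test folds."""
    return median(spearman(obs, pred) for obs, pred in folds)


if __name__ == "__main__":
    # Toy FFQ-estimated intakes and model predictions for one test fold.
    obs = [1, 2, 3, 4, 5, 6, 7, 8]
    pred = [2, 1, 3, 4, 5, 6, 8, 7]
    print(spearman(obs, pred), quartile_auc(obs, pred))
```

In practice each fold's predictions would come from a model fitted on the training portion only, so that the reported median reflects held-out performance.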
We then summarized constituent foods into dietary indices (Supplementary Table 2), including the Healthy Food Diversity (HFD) index (incorporating dietary diversity and food quality) 20 and the alternate Mediterranean diet (aMED) score 22, all of which are associated with reduced risk of chronic disease 22-27. We demonstrated tight correlations between microbial composition and the HFD, the healthy and unhealthy plant-based diet indices (hPDI/uPDI) and the Healthy Eating Index (HEI) in the UK cohort (ρ between 0.31 and 0.37; Fig. 2a); the results were consistent in the US validation cohort, with ρ reaching 0.42 for HFD and 0.31 for aMED (Fig. 2e,f and Extended Data Fig. 3), highlighting the relationship between the microbiome and health-associated dietary patterns.
Figure caption (fragment): Supplementary Table 1. The '%E' label represents foods and nutrients normalized by the estimated daily energy intake in kcal.
Microbial species segregate into groups associated with more and less healthy plant- and animal-based foods. We proceeded to identify the specific microbial taxa most responsible for these diet-based community associations (Fig. 2b). After adjusting for age and BMI, we found 42 species (24% of those at >20% prevalence) significantly correlated with at least 5 dietary exposures (q < 0.2; Supplementary Table 5). This included expected associations (Extended Data Fig. 2), such as enrichment of the probiotic taxa Bifidobacterium animalis 28 and Streptococcus thermophilus with greater full-fat yogurt consumption (ρ = 0.22 for both). The strongest food-microbe association was between the recently characterized butyrate-producing Lawsonibacter asaccharolyticus 29 and coffee consumption (Fig. 2b). However, due to the low resolution of FFQ data, the complexity of dietary patterns, nutrient-nutrient interactions and clustering of healthy/less healthy food items, it is challenging to disentangle the independent associations of single foods with microbial species. At a broader level, we found clear segregation of species (Fig. 2b) into two distinct clusters with either more healthy plant-based foods (for example, spinach, seeds, tomatoes, broccoli) or less healthy plant-based (for example, juices, sweetened beverages, refined grains) and animal-based foods, as defined by the PDI 30 (Supplementary Table 4). Taxa linked to healthy plant-based foods (Fig. 2b,c and Extended Data Fig. 2) mostly included butyrate producers, such as Roseburia hominis, Agathobaculum butyriciproducens, Faecalibacterium prausnitzii and Anaerostipes hadrus, as well as uncultivated species predicted to have this metabolic capability (Roseburia bacterium CAG:182 and Firmicutes bacterium CAG:95).
Clades correlating with several less healthy plant-based and animal-based foods included several Clostridium species (Clostridium innocuum, Clostridium symbiosum, Clostridium spiroforme, Clostridium leptum, Clostridium saccharolyticum). The segregation of species according to animal-based healthy foods (for example, eggs, white and oily fish) or animal-based less healthy foods (for example, meat pies, bacon, dairy desserts), using a new categorization (Methods), was also distinct and overlapped with the taxa signatures for healthy and less healthy plant foods (Fig. 2c and Extended Data Fig. 2). The few foods that did not fit into the healthy cluster despite being classified as healthy plant foods were (ultra)-processed foods 31 (for example, sauces, baked beans; Extended Data Fig. 2). This emphasizes the importance of food quality (for example, highly processed versus unprocessed), food source (for example, plant versus animal) and food type (that is, not all plant foods are healthy) both in overall health and microbiome ecology.
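The q values quoted above are P values adjusted for the false discovery rate. A minimal sketch of one common adjustment, the Benjamini-Hochberg procedure, follows; that this is the specific procedure used in the study is an assumption here (see Methods), and the function name and P values are illustrative only.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values (q values), in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so each q value is the minimum of
    # p * m / rank over this rank and all larger ranks (keeps q monotone).
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


if __name__ == "__main__":
    # Hypothetical P values for one species against four dietary exposures.
    pvals = [0.01, 0.04, 0.03, 0.5]
    qvals = benjamini_hochberg(pvals)
    n_significant = sum(q < 0.2 for q in qvals)
    print(qvals, n_significant)
```

A species would then count toward the tally in the text if at least five of its exposure-level q values fell below the 0.2 threshold.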
The strongest microbiome-habitual diet associations are driven by poorly characterized microbes. Many of the strongest microbial associations with diet occurred with only recently isolated or still uncultured taxa, including five species defined using coabundance gene groups (CAGs) from metagenomics 32. Among indices, the hPDI significantly correlated with 60 of the 176 prevalent species, highlighting, together with the HFD (Fig. 2e), the impact of dietary diversity and quality on gut microbial responsiveness. Among other dietary indices and nutrients, we observed general concordance with the two sets of microbes associated with healthy and less healthy foods. A greater animal-based food score (definition in Supplementary Table 4) was associated with the healthy cluster (Fig. 2c and Extended Data Fig. 2), suggesting that a diet rich in healthier animal-based foods is associated with the more favorable diet-microbiome signature, although this may also reflect an overall healthier dietary pattern. The healthy and unhealthy PDIs, which differentially affect disease risk 25,30, also had distinct clusters, again emphasizing the oversimplification of conventional plant and animal-based food groupings. The taxa with the highest correlations in the two clusters are Firmicutes bacterium CAG:95 and C. leptum for healthy and unhealthy diet, respectively. The lack or paucity of cultivated representatives for these two taxa may explain why these links were previously overlooked 9,12. The US validation cohort generally confirmed these associations despite its smaller sample size: among the subset of derived pattern/index scores shared between the UK and US cohorts, of the 54 associations that were significant both in the UK cohort (false discovery rate (FDR) q < 0.2) and in the US cohort (P < 0.05), 70.4% were concordant.
Fig. 2 | Food quality, regardless of source, is linked to overall and feature-level composition of the gut microbiome. a, Specific components of habitual diet comprising foods, nutrients and dietary indices are linked to the composition of the gut microbiome with variable strengths as estimated by machine learning regression and classification models. Box plots report the correlation between the real value of each component and the value predicted by regression models across 100 training/testing folds (Methods). The circles denote the median AUC values across 100 folds for a corresponding binary classifier between the highest and lowest quartiles (Methods). NSP, non-starch polysaccharide. b, Single Spearman correlations adjusted for BMI and age between microbial species and components of habitual diet with the asterisks denoting significant associations (FDR q < 0.2). The 30 microbial species with the highest number of significant associations across habitual diet categories are reported. All indices of dietary patterns are reported, whereas only food groups and nutrients (energy-adjusted) with at least 7 associations among the top 30 microbial species are reported. Rows and columns are hierarchically clustered (complete linkage, Euclidean distance). Full heatmaps of foods and unadjusted nutrients are reported in Extended Data Fig. 2; the full set of correlations, P and q values are available in Supplementary Tables 5 and 6 for UK and US, respectively. c, Number of significant positive and negative associations (Spearman correlation, P < 0.2) between foods and taxa categorized by more and less healthy plant-based foods and more and less healthy animal-based foods according to the PDI. The taxa shown are the 20 species with the highest total number of significant associations regardless of category. d, The association between the gut microbiome and coffee consumption in UK participants is dose-dependent, that is, stronger when assessing heavy (for example, >4 cups per day) versus never drinkers, and was validated in the US cohort when applying the UK model. The reported ROC curves represent the performance of the classifier at varying classification thresholds with regard to the true positive rate (that is, recall) and the false positive rate (that is, 1 − specificity). e,f, Among general dietary patterns and indices, the HFD (e) and aMED (f) were validated in the US cohort, thus showing consistency between the two populations on these two important dietary indices. Other validations of the UK model applied to the US cohort are reported in Extended Data Fig. 3. The box plots show the first and third quartiles (boxes) and the median (middle line); the whiskers extend up to 1.5× the IQR.
Microbial indicators of obesity are reproducible across varied populations. Microbiome links to obesity have attracted much interest, although results have varied in human populations 3,4. Our machine learning approach (Methods) found visceral fat to be more strongly linked to gut microbial composition than BMI 33, a finding again validated in US participants (Fig. 3a). Some obesity-associated taxa were also indicators of poor dietary patterns after controlling for BMI (for example, Clostridium CAG:58, Flavonifractor plautii), whereas markers of lower visceral fat mass (for example, F. prausnitzii) were more strongly linked to healthier foods and patterns of intake, illustrating that diet and obesity microbiome signatures overlap but are not identical (Fig. 3b).
Of the 17 species surpassing q < 0.05, 3 had an (absolute) ρ > 0.1 in the US cohort and 2 of these were concordant with those in the UK cohort (Fig. 3c). Across harmonized independent datasets, all but two median associations were consistent with the PREDICT 1 UK signatures and 12 of the 14 were concordant despite different sample collection and DNA extraction methods. Microbiome models to predict BMI in the UK cohort were further validated in six independent datasets available in curatedMetagenomicData 34 (Methods). Despite interpopulation differences 11,35, the UK model improved cohort-specific cross-validation accuracy in most cases, on par with the leave-one-dataset-out (LODO) approach (Fig. 3d).
Fasting cardiometabolic markers associated with specific microbiome structures. To explore the connections between the gut microbiome and cardiometabolic health, we developed and evaluated microbiome-based machine learning models for each selected clinical and emerging cardiometabolic biomarker. We found modest concordance between microbiome models and several traditional clinical fasting cardiometabolic biomarkers (Fig. 4a) including blood pressure, lipids (triglycerides (TGs), total cholesterol, HDLC, low-density lipoprotein cholesterol (LDLC)), fasting glucose and glycosylated hemoglobin (percentage HbA1c) as well as a clinical prediction score estimating the latent 10-year risk of heart disease (atherosclerotic cardiovascular disease score) 36.
For other blood biomarkers (Fig. 1a), we found stronger correlations between the microbiome and an inflammatory surrogate (GlycA; Fig. 4a), circulating polyunsaturated fatty acids (PUFAs) (both omega-6 (fatty acid ω6/fatty acid) and total PUFA (PUFA/FA) to total fatty acid ratios, ρ = 0.3 and 0.32, respectively), as well as emerging lipid measures linked to host health, including HDL and very-low-density lipoprotein (VLDL) particle size (-D, ρ = 0.29 for both) and the lipid content of lipoprotein subfractions (including total lipids in very large HDL and total lipids in large HDL, ρ = 0.3 and 0.28, respectively). GlycA and VLDLD are associated with increased risk for metabolic syndrome, CVD and type 2 diabetes, whereas HDLD and its lipid constituents, omega-6 and total ...
Fig. 3 | Random forest machine learning models trained on microbial or functional profiles can predict obesity phenotypic markers, even on independent cohorts. a, Whole-microbiome machine learning models can assess personal factors with random forest regression (box plots and left-side y axis) using only taxonomic or functional (that is, pathway) microbiome features. Classification models (circles and right-side y axis) exceeded an AUC of 0.65 except for waist-to-hip ratio and smoking. b, We observed the highest correlations between the relative abundance of microbial species and age, BMI and visceral fat. The link between microbial features and visceral fat was of greater effect and more often significant than with traditional BMI. c, Using several independent datasets 34, we confirmed the correlations between single microbial species and BMI, with the blue points denoting significant associations at P < 0.05. The statistical test used was a two-sided z-test (Methods). d, The machine learning model for BMI trained on PREDICT 1 data was reproducible in several external datasets (Extended Data Fig. 5), achieving correlations with true values exceeding those obtained in the cross-validation of a single given dataset in five of seven cases. When the PREDICT 1 microbiome model was expanded to include other datasets (excluding those used for testing, that is, LODO approach), performance remained comparable, confirming the generalizability of the PREDICT 1 model on obesity-related indicators. The box plots show the first and third quartiles (boxes) and the median (middle line); the whiskers extend up to 1.5× the IQR.
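The leave-one-dataset-out (LODO) scheme referred to for Fig. 3d can be sketched as a simple index splitter: each dataset is held out in turn for testing while all remaining datasets form the training set. The function name and dataset labels below are hypothetical, and the model fitting step is left out.

```python
def lodo_splits(dataset_labels):
    """Yield (held_out, train_indices, test_indices) for each dataset.

    dataset_labels[i] names the dataset that sample i belongs to.
    """
    for held_out in sorted(set(dataset_labels)):
        train = [i for i, d in enumerate(dataset_labels) if d != held_out]
        test = [i for i, d in enumerate(dataset_labels) if d == held_out]
        yield held_out, train, test


if __name__ == "__main__":
    # Five samples drawn from three hypothetical cohorts.
    labels = ["cohort_A", "cohort_A", "cohort_B", "cohort_C", "cohort_B"]
    for held_out, train, test in lodo_splits(labels):
        print(held_out, train, test)
```

Because no sample from the held-out dataset is seen during training, the test score reflects transfer to an unseen population rather than within-cohort fit.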
Species-based predictors proved more accurate than pathway abundance profiles (Extended Data Fig. 4a), which is consistent with other reports 40 . Our primary findings were generally replicated in the US cohort (Fig. 4a), corroborating the existence of a strong, previously overlooked link between the gut microbiome and surrogates of cardiometabolic health.
The gut microbiome is a better predictor of postprandial TGs and insulin concentrations than of glucose levels. Fasting blood assays are standard for research and clinical investigations; however, individuals consume multiple mixed-nutrient meals throughout the day and spend most of their waking hours in the postprandial state, resulting in repeated elevations in circulating TG, glucose and related metabolites 8 . While postprandial glucose responses may, in part, be predicted by the gut microbiome 9 , real-life variations in both postprandial lipid and glucose-mediated metabolites have not been explored. We assessed them by considering the overall magnitude of the response by incremental AUC (iAUC), peak concentrations and change from fasting (that is, rise).
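The three response summaries named above (iAUC, peak and rise) can be made concrete with a short sketch. The trapezoidal rule with increments below the fasting value clipped to zero is one common convention and is an assumption here, as are the function names and the toy glucose series; the study's exact computation is described in its Methods.

```python
def incremental_auc(times_h, concentrations):
    """Trapezoidal area above the fasting (first) value.

    Increments below baseline are clipped to zero before integration,
    a simplification of conventions that interpolate baseline crossings.
    """
    baseline = concentrations[0]
    rises = [max(c - baseline, 0.0) for c in concentrations]
    area = 0.0
    for k in range(1, len(times_h)):
        width = times_h[k] - times_h[k - 1]
        area += width * (rises[k] + rises[k - 1]) / 2
    return area


def peak(concentrations):
    """Highest concentration reached in the series."""
    return max(concentrations)


def rise(concentrations, index=-1):
    """Change from fasting at a chosen time point (default: last)."""
    return concentrations[index] - concentrations[0]


if __name__ == "__main__":
    # Hypothetical glucose readings (mmol/l) at 0, 0.5, 1 and 2 h.
    times = [0.0, 0.5, 1.0, 2.0]
    glucose = [5.0, 7.0, 6.0, 4.5]
    print(incremental_auc(times, glucose), peak(glucose), rise(glucose, 1))
```

The same functions apply to a 0-6 h TG series by passing the corresponding time points.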
First, we measured postprandial TG, glucose, C-peptide, insulin and circulating metabolite concentrations at regular intervals (0-6 h) in the clinic after 2 sequential test meals (890 kcal, 50 g fat and 85 g carbohydrates at 0 h (breakfast) and 500 kcal, 22 g fat and 71 g carbohydrates at 4 h (lunch); Fig. 4b,c). Notably, we found that postprandial TG (0-6 h iAUC), insulin and C-peptide (both 0-2 h iAUC) responses were more strongly associated with the gut microbiome (ρ = 0.15, 0.2, 0.24, respectively; AUC > 0.65) compared with postprandial glucose (0-2 h iAUC) responses (ρ = 0.13 and AUC = 0.6; Fig. 4b), findings that were replicated in our US cohort (Fig. 4b-g). We also measured glucose concentrations during the 13-d at-home period 16 after isocaloric standardized meals with different macronutrient compositions (Supplementary Table 3). However, contrary to our clinic meal responses (Fig. 4b) and previous work 9, the glucose 0-2 h iAUCs after these meals did not achieve high correlations with the microbiome (all ρ < 0.07 and AUC < 0.59; Fig. 4c). While this may be dependent on meal composition and the effect of multiple meals consumed after stool collection, these results suggest that the microbiome is a stronger predictor of postprandial lipemia than glycemia.
Postprandial rises in lipid- and glucose-mediated measures are differentially predicted by the microbiome compared with fasting levels. Postprandial measures depend both on the corresponding fasting levels and meal-induced rise. Therefore, we compared the differential prediction accuracy of the gut microbiome for fasting levels, postprandial (peak) total levels and postprandial rises (Fig. 4h). For lipid- and glucose-mediated (clinic day) measures, despite a similar strength of association between peak (6 h), magnitude (iAUC) and fasting TG concentrations, the rise (6-0 h) was not similarly correlated (Fig. 4a-e,f). In contrast, the microbiome associations with glycemic measures were comparable between fasting, peak and rise (Fig. 4a-d).
Of particular interest were lipoprotein subfraction concentrations, composition and size (Extended Data Fig. 4b,c), which are remodeled postprandially into potentially atherogenic lipoproteins (for example, large VLDL particles, TG-enriched LDL and HDL particles) 41. These particles were predicted at comparable accuracy for both fasting and postprandial peak 6-h concentrations (Fig. 4a-e,f-h); notably, HDLD and VLDLD achieved modestly stronger correlations (ρ = 0.32 for both) postprandially (Fig. 4f). However, as with TG, we found that the microbiome was substantially less predictive for the postprandial rise in all lipid metabolite measures compared with fasting and postprandial peak concentration (Fig. 4a-e,f-h). For example, HDLD is closely associated with gut microbial composition at fasting and 6 h postprandially (ρ = 0.29 and 0.32; AUC = 0.71 and 0.72, respectively; Fig. 4a-e,f-h), but not with the rise (Fig. 4f). These differential associations suggest that the microbiome may influence postprandial lipid-mediated measures via effects on fasting measures.

Distinct microbial signatures discriminate between positive and negative metabolic health indices under fasting conditions.
Motivated by the observed potential of the gut microbiome to predict the fasting and postprandial levels of circulating metabolic markers, we next assessed the microbiome features driving these associations. Among three general risk indices of cardiovascular health (atherosclerotic cardiovascular disease, liver fat probability and insulin sensitivity or QUICKI; Fig. 4a), we found six species significantly and concordantly correlated with all three (negatively or positively, P < 0.05), hinting at a global underlying microbial signature of improved metabolic health. These taxa included Clostridium CAG:58 (higher cardiometabolic risk) and Haemophilus parainfluenzae (lower risk) that we had previously linked with healthy and less healthy dietary patterns (Fig. 2b).
We found similarly distinct separations between two opposing, clearly defined clusters of species either positively or negatively correlated with fasting cardiometabolic measures (Fig. 5a,b). Species correlated with positive markers included some prevalent taxa generally regarded as healthy (F. prausnitzii) but also eight uncultivated and undercharacterized bacteria. The positive cluster included many distinct genera, pointing at a rich functional diversity. In contrast, the cluster negatively correlated with positive markers included eight Clostridium species and the recurrent negatively connotated Ruminococcus gnavus and F. plautii. Large HDL particles (and their lipid compositions; Extended Data Figs. 6 and 7), which have strong inverse associations with cardiometabolic outcomes 38, were associated with the healthy cluster. Conversely, lipoproteins associated with an increased risk of CVD and type 2 diabetes (VLDL of all sizes and lipid composition) and atherogenicity 42 (small LDL, medium HDL and small HDL TG), were associated with the less healthy cluster (Extended Data Figs. 6 and 7).
Fig. 4 | ... Indices are grouped in nine distinct categories and the box plots report the correlation between the prediction of random forest regression models trained on microbial taxa or pathway abundances across 100 training/testing folds; the stars report the regressor performance when trained on the UK cohort and evaluated on the independent US validation cohort (left-side y axis). The circles denote the AUC values for the random forest classification (right-side y axis). b-f, Performance of our microbiome-based machine learning model in estimating postprandial absolute levels and postprandial increases in cardiometabolic markers. The stars denote the regression model results in our US validation cohort for postprandial measurements (not rises; Extended Data Fig. 4b,c). b, Random forest regression and classification performance in predicting postprandial metabolic responses for clinic meal 1 (breakfast) measured as iAUC at 6 h for TGs and iAUC at 2 h for glucose, C-peptide and insulin. c, Glycemic-mediated postprandial iAUCs at 2 h for the other meals (Supplementary Table 7). d, Glycemic-mediated markers of absolute levels versus rise. e, Postprandial inflammatory measures (concentration and rise). f, Postprandial lipoprotein measures (6 h concentration and rise). g, Overall agreement between random forest regression and classification tasks for the UK models applied to the independent US cohort. h, Random forest microbiome-based model performance with postprandial changes (concentrations and rise) in lipoprotein concentration, composition and size. Fasting and postprandial performance indices (correlation of the regressors' outputs) were more tightly linked to gut community structure than were their corresponding postprandial rises. The box plots show the first and third quartiles (boxes) and median (middle line); the whiskers extend up to 1.5× the IQR.
Circulating omega-6 and total PUFA were associated with the healthy cluster (Fig. 5a and Supplementary Table 5). Due to the lack of endogenous production of PUFA, circulating levels closely reflect dietary intake 43 and are linked to a reduced risk of chronic disease 38. In contrast, circulating monounsaturated fatty acids (MUFAs), which do not closely reflect dietary intake and unlike dietary MUFA have been linked to increased risk of chronic disease 38, were associated with the unhealthy cluster, with an undercharacterized Firmicutes species (CAG:170) and Clostridium bolteae responsible for the strongest negative and positive associations, respectively.
... exceeding interleukin-6 levels (5 and 16 significant associations; Fig. 5b,c). C. bolteae and R. gnavus correlated the most with increased fasting and postprandial inflammation, whereas H. parainfluenzae and Firmicutes bacterium CAG:95 were the strongest associations with reduced GlycA levels. VLDL lipoprotein subfractions (markers of adverse cardiometabolic effects) were also consistently associated with the less healthy cluster both at fasting and postprandially. Postprandial rises, rather than absolute postprandial levels, were in some cases uncoupled from the microbial associations with fasting markers (Fig. 5d). For example, change in GlycA (Fig. 5b) was differentially associated with clusters compared to fasting and postprandial levels (especially for F. plautii, Firmicutes bacterium CAG:95 and Firmicutes bacterium CAG:110), probably due to the small reduction in GlycA postprandially. Other immunological markers and some lipid and cholesterol levels paralleled this behavior (Extended Data Fig. 6), possibly reflecting postprandial lipoprotein remodeling 44.
We observed the same favorable versus unfavorable clustering of microbiome features when analyzing microbial pathways and gene families (Extended Data Fig. 8), supporting taxa segregation by their underlying biochemical activities. The strengths of microbe-blood marker associations were confirmed by random forest feature relevance analysis (Extended Data Fig. 9); importantly, they were confirmed in the US cohort. For the 209 microbe-index correlations that were significant both in the UK (q < 0.2) and US cohorts (P < 0.05), the concordance in the sign of the correlation reached 88.7% for the associations in fasting conditions and 96.1% postprandially.
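Sign concordance between cohorts, as reported above, is simply the share of jointly significant associations whose correlation has the same direction in both cohorts. A minimal sketch follows; the function name and correlation values are hypothetical, and the inputs are assumed to be already restricted to associations significant in both cohorts.

```python
def sign_concordance(uk_rho, us_rho):
    """Fraction of paired correlations that share the same sign.

    Pairs in which either correlation is exactly zero are skipped,
    since they have no direction to compare.
    """
    pairs = [(a, b) for a, b in zip(uk_rho, us_rho) if a != 0 and b != 0]
    same = sum((a > 0) == (b > 0) for a, b in pairs)
    return same / len(pairs)


if __name__ == "__main__":
    # Hypothetical species-marker correlations in the two cohorts.
    uk = [0.20, -0.10, 0.30, -0.40]
    us = [0.10, -0.20, -0.05, -0.30]
    print(f"{100 * sign_concordance(uk, us):.1f}% concordant")
```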
P. copri diversity and Blastocystis presence are markers of improved postprandial glucose responses. Some ecologically unusual microbes hypothesized to have population-scale health effects solely based on their presence or absence appeared in our microbial signatures 45. Among them, P. copri 45,46 has been the subject of conflicting reports on its role in glucose homeostasis 47,48, possibly due to subspecies diversity 49,50. In our data, P. copri was associated with beneficial cardiometabolic markers, being negatively correlated with estimated visceral fat (ρ = −0.11, P = 0.0006), fasting VLDL-D (ρ = −0.08, P = 0.011) and fasting GlycA (ρ = −0.14, P < 0.0001) among others (Supplementary Table 5). While almost no diet indices were associated with P. copri, postprandial rises in glucose (ρ = −0.11, P < 0.001) and polyunsaturated/omega-6 fatty acids (ρ = 0.15 and 0.14, respectively, and P < 0.001) were top-scoring correlations for this bacterium and were stronger than corresponding fasting and postprandial levels, in contrast with what we observed for the overall microbiome (Fig. 4a,b). P. copri was present in at least one of its subtypes 49 in 29.8% of the PREDICT 1 individuals and P. copri carriers had lower C-peptide (−9.2%, P = 0.002), insulin (−14%, P = 0.006) and TG levels (−3.2%, P = 0.003) compared to P. copri-negative individuals (Extended Data Fig. 10 and Supplementary Table 8). Similarly, postprandial glucose after breakfast was significantly less pronounced in individuals with P. copri (−20.4% glucose iAUC at 2 h, P = 0.002; Extended Data Fig. 10c) and visceral fat was significantly lower (−12.5%, P = 0.005; Extended Data Fig. 10a). This set of diverse associations supports the view that the presence of P. copri in the gut microbiome could be beneficial in glucose homeostasis and host metabolism.
Blastocystis is a unicellular eukaryotic parasite increasingly regarded as a commensal member of the gut microbiome 51-53. It shares with P. copri a limited prevalence in Western-lifestyle populations 53 and a high abundance when present. We found evidence that Blastocystis-positive individuals (25.7% in our cohort) also had favorable glucose homeostasis and lower estimated visceral fat (−15.7% glucose iAUC, −22.1% visceral fat, P < 0.002; Extended Data Fig. 10). The latter confirms that Blastocystis is less prevalent in individuals with high BMI, as suggested previously 53. Interestingly, the effect of the simultaneous presence of P. copri and Blastocystis (12% of individuals) appeared to further promote healthier metabolic function. Visceral fat was 17.3% lower on average (P < 0.005; Supplementary Table 8) for individuals positive for both P. copri and Blastocystis compared to individuals with only one or the other and 23.3% lower (P = 8.9 × 10⁻⁶) compared with individuals lacking both.
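The carrier versus non-carrier contrasts above are expressed as percentage differences. A minimal sketch of such a comparison is given below, assuming the percentage is taken between group means relative to the non-carrier mean; that choice, the function names and the toy values are assumptions, and the significance test used in the study is not reproduced here.

```python
from statistics import mean


def percent_difference(carriers, non_carriers):
    """Percent difference of the carrier mean relative to the non-carrier mean."""
    reference = mean(non_carriers)
    return 100.0 * (mean(carriers) - reference) / reference


def split_by_presence(values, present):
    """Split measurements into (carriers, non_carriers) by a presence flag."""
    carriers = [v for v, p in zip(values, present) if p]
    non_carriers = [v for v, p in zip(values, present) if not p]
    return carriers, non_carriers


if __name__ == "__main__":
    # Hypothetical visceral fat values and presence calls for one microbe.
    fat = [8.0, 12.0, 10.0, 12.5, 12.0, 13.0]
    present = [True, False, True, False, True, False]
    carriers, non_carriers = split_by_presence(fat, present)
    print(percent_difference(carriers, non_carriers))
```

The joint analysis in the text corresponds to applying the same contrast to individuals positive for both microbes against those with one or with neither.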
A clear microbial signature of cardiometabolic health levels consistent across diet, obesity indicators and cardiometabolic risks. We observed above a consistent set of microbial species that were strongly linked to (1) food indices reflecting different levels of healthy diets, (2) indicators of obesity and cardiometabolic health, (3) fasting circulating metabolites connected with cardiometabolic risk and (4) postprandial responses. To test the consistency of such a signature, we selected representative cardiometabolic health indicators from each category and ranked microbial species based on their correlation coefficient. We found remarkable agreement among microbes associated with different positive or negative indicators of cardiometabolic health (Fig. 6 and Supplementary Table 9).
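One way to make the ranking step concrete is to rank species within each indicator by correlation, flipping the direction for indicators where a higher value is unfavorable, and then average the ranks. The sketch below is an illustration under those assumptions; the aggregation by mean rank, the function name, the indicator names and all values are hypothetical rather than the study's exact procedure.

```python
def mean_rank_across_indicators(correlations, higher_is_better):
    """Rank species per indicator and return (mean_rank, species), best first.

    correlations: {indicator: {species: rho}}, same species in every indicator.
    higher_is_better: {indicator: bool}; False flips the ranking direction.
    Ties within an indicator are broken by species name order.
    """
    species = sorted(next(iter(correlations.values())))
    totals = {s: 0.0 for s in species}
    for indicator, rho_by_species in correlations.items():
        sign = 1 if higher_is_better[indicator] else -1
        # Rank 1 goes to the species most associated with the healthy direction.
        ordered = sorted(species, key=lambda s: -sign * rho_by_species[s])
        for rank, s in enumerate(ordered, start=1):
            totals[s] += rank
    n = len(correlations)
    return sorted((totals[s] / n, s) for s in species)


if __name__ == "__main__":
    correlations = {
        "HFD": {"sp_A": 0.30, "sp_B": -0.20, "sp_C": 0.10},
        "visceral_fat": {"sp_A": -0.25, "sp_B": 0.20, "sp_C": 0.00},
    }
    higher_is_better = {"HFD": True, "visceral_fat": False}
    for mean_rank, name in mean_rank_across_indicators(correlations, higher_is_better):
        print(name, mean_rank)
```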
In particular, Firmicutes bacterium CAG:95 was the uncultivated species with the most beneficial score. Of the health-associated microbial species, only F. prausnitzii and, partially, P. copri were already convincingly linked with health in previous investigations 54 . The beneficial signature also included Eubacterium eligens and H. parainfluenzae, without previous clear roles in health, and additional species without cultivated representatives such as Roseburia bacterium CAG:182, Oscillibacter sp. 57_20, Firmicutes bacterium CAG:170, Oscillibacter sp. PC13 and Clostridium sp. CAG:167. Species conversely consistent with indicators of poor overall health (Fig. 6) included the already discussed set of Clostridia (C. spiroforme, C. bolteae CAG:59, C. bolteae, Clostridium sp. CAG:58, C. symbiosum, C. innocuum and C. leptum) and the mucolytic microbes R. gnavus and F. plautii, again previously found to be associated with disease 55,56 . Overall, this set of 30 species serves as a marker of overall good or poor cardiometabolic health and dietary patterns in nondiseased human hosts.

Discussion
PREDICT 1 represents the first diet-microbiome study to identify both individual components of the microbiome and an overall gut microbial signature associated with multiple measures of dietary intake and cardiometabolic health. These signatures were reproduced across UK and US populations, across multiple previously published study populations and for multiple dietary and health indicators. Notably, microbiome signatures grouped both microbiome and dietary components into health-associated and anti-associated clusters, the latter in agreement with dietary quality and diversity scores 20,57 . The diversity and quality of a healthy diet (HFD and PDI) was particularly predictable by the microbiome, surpassing other indices including the Mediterranean diet previously linked with microbiome composition 58 . The segregation of favorable and unfavorable microbial clusters according to the heterogeneity of the food source (healthy or unhealthy animal or plant), quality (processed versus unprocessed) and dietary patterns highlights the importance of looking beyond nutrients and single foods in diet-microbiome research. The substantially greater detail and consistency in our results relative to previous diet-microbiome work 9,11-13,15 may be due to the quality in dietary recording, metagenomic profiling and the large sample size. However, given the limitations of FFQ dietary data, future diet-microbiome studies would benefit further from higher resolution dietary assessment methodologies, such as weighed food record data.
Several aspects of the consistent gut microbiome signatures across diet, obesity and cardiometabolic health measures are striking for their potential new epidemiology and microbial biochemistry. A surprising proportion of diet- or health-associated taxa in these results are largely uncharacterized or represented solely by metagenomic assemblies 5 . Other microbes found in this study to have dietary or cardiometabolic associations, such as Prevotella or Blastocystis spp., have been characterized in greater biochemical detail but their population structure in the human microbiome has only recently begun to be appreciated 49,53 . The latter in particular may be only one of many examples of nonbacterial microbiome members not amenable to most current high-throughput approaches but with unexpected and potentially key positive roles in humans.
Likewise, these new contributions of the gut microbiome to human dietary responses may help to explain some of the heterogeneity seen among previous population studies 4,9,59 . First, diet-microbiome-blood marker associations were overall strongest with regard to circulating lipid levels relative to glycemic indices. It is possible that the relative contribution of gut microbes is higher for circulating lipid levels than carbohydrate derivatives, through either direct processes or indirectly through gastrointestinal or systemic bile acid signaling 60 . Alternatively, host metabolism may play a greater role in circulating glucose and insulin levels relative to microbial bioactivity. The lipoprotein features most closely associated with the microbiome (such as total lipids in large HDL) are also more strongly associated with cardiovascular risk compared with typically measured lipids (for example, total cholesterol, HDLC, LDLC), suggesting that their utility as clinical biomarkers or as targets for beneficial gut microbiome manipulation warrants further investigation.
Overall, this is the first study to identify a shared diet-metabolic health microbial signature, segregating favorable and unfavorable taxa with multiple measures of both dietary intake and cardiometabolic health. As a resource, these results will aid both in the utilization of the gut microbiome as a biomarker for cardiometabolic risk and in strategies for reshaping the microbiome to improve personalized dietary health.

[…] shown. The rank of each microbe's correlation with individual indicators is written within cells when significant (P < 0.05). For each of the main categories of indices, we selected up to five representative markers (for 'personal' we considered only four since the remaining were highly correlated with visceral fat or not relevant in this context). Indices can be considered positive and negative depending on whether higher or lower values are a proxy for more or less healthy conditions. Partial correlations were computed using the pcor.test function (two-sided) with the parameter method = 'spearman' (Methods). Correlations and ranks are available in Supplementary Table 9. P values and FDR-adjusted P values are available in Supplementary Table 2.

Methods
The PREDICT 1 study. The PREDICT 1 clinical trial (NCT03479866) aimed to quantify and predict individual variations in metabolic responses to standardized meals. We integrated data from a cohort of twins and unrelated adults from the UK to explore genetic, metabolic, microbiome composition, meal composition and meal context data to distinguish predictors of individual responses to meals. We then validated these predictions in an independent cohort of adults from the USA. The trial was a single-arm, single-blinded intervention study that commenced in June 2018 and was completed in May 2019. Ethical approval for the study was obtained in the UK from the Research Ethics Committee and Integrated Research Application System (IRAS 236407) and in the USA from the Institutional Review Board (Partners Healthcare IRB 2018P002078). The trial was run in accordance with the Declaration of Helsinki (2013) and good clinical practice. Study procedures were only carried out after having received written informed consent from each participant. For the full protocol, see Berry et al. 16 . Briefly, 1,002 generally healthy adults from the UK (non-twins and identical (monozygotic) and nonidentical (dizygotic) twins) and 100 healthy adults from the USA (non-twins; validation cohort) were enrolled in the study (see Berry et al. 8 for the eligibility criteria) and completed the baseline clinic measurements. The study consisted of a 1-d clinical visit at baseline followed by a 13-d at-home period. At baseline (day 1), participants arrived fasted and were given a standardized metabolic challenge meal for breakfast (0 h; 86 g carbohydrate, 53 g fat) and lunch (4 h; 71 g carbohydrate, 22 g fat). Fasting and postprandial (9 time points; 0-6 h) venous blood was collected to determine the serum concentrations of glucose, TG, insulin, C-peptide (as a surrogate for insulin) and metabolomics (nuclear magnetic resonance). 
Stool samples, anthropometry and a questionnaire querying habitual diet, lifestyle and medical health were obtained at baseline. During the home phase (days 2-14), participants consumed standardized test meals in duplicate varying in sequence and in macronutrient composition while wearing digital devices to continuously monitor their blood glucose (continuous glucose monitor; CGM), physical activity and sleep. Capillary blood was collected using dried blood spot cards during the clinic visit and at home to analyze fasting and postprandial concentrations of TG and C-peptide. Participants were supported throughout the study with reminders and communication from study staff delivered through the Zoe study app. A second stool sample was collected at home by participants after completion of the study; all devices and samples were mailed back to study staff. To monitor compliance, all test meals consumed by participants were logged in the Zoe app (with an accompanying picture) and reviewed in real time by the study nutritionists. Only test meals that were consumed according to the standardized meal protocol (outlined in Berry et al. 8 ) were included in the analysis.
The recruitment criteria, meal intervention challenges, outcome variables and sample collection and analysis procedures relevant to this paper are described elsewhere 8,16 . The core characteristics of study participants at baseline were not significantly different between UK and US cohorts 8 .
Overview of microbiome sequencing and profiling. We performed deep shotgun metagenomic sequencing (mean 8.8 ± 2.2 gigabase pairs per sample) in stool samples from a total of 1,098 PREDICT 1 participants (UK, n = 1,001; USA, n = 97). From a random subset of these participants (n = 105), we additionally sequenced fecal metagenomes from a second stool sample collected 14 d after the first collection (Fig. 1a) for a total of 1,203 metagenomes. Computational analysis was performed using the bioBakery suite of tools 61 .

Microbiome sample collection. Participants were mailed a pre-visit study pack with a stool collection kit and relevant questionnaires and asked to collect an at-home stool sample at two time points (one before their in-person clinical visit on day 0 and the next at the conclusion of their home phase on day 14). Those who did not collect a sample before their in-person, baseline visit completed the collection as soon as possible during the home phase. Baseline samples in the UK were collected using the EasySampler Stool Collection Kit (ALPCO), whereas post-study samples, as well as the entirety of the US collection, were collected using the FECOTAINER stool sample kit (Excretas Medical BV). For baseline samples, one fresh unfixed sample was deposited into a sterile universal collection container (catalog no. L0263-10; Sarstedt Australia) and one into a tube containing DNA/RNA Shield buffer (catalog no. R1101; Zymo Research). Samples were stored at ambient temperature until returned to the study staff. Follow-up samples were collected similarly but only sampled into a DNA/RNA Shield buffer tube and sent by standard mail to study staff. On receipt in the laboratory, samples were homogenized, aliquoted and stored at −80 °C in QIAGEN PowerBeads 1.5-ml tubes.
This sample collection procedure was tested and validated internally comparing different storage conditions (fresh, frozen, buffer), different DNA extraction kits (PowerSoil Pro, FastDNA, Protocol Q, Zymo) and different sequencing technologies (16S ribosomal RNA, shotgun metagenomics and arrays) (data not shown).
DNA extraction and sequencing. DNA was isolated by QIAGEN Genomic Services using DNeasy 96 PowerSoil Pro from all day 0 (baseline) DNA/RNA Shield-fixed microbiome samples. A random subset of day 14 (end of at-home phase) samples (n = 105) were also extracted. Optical density measurement was done using spectrophotometer quantification (Tecan Infinite 200). Before library preparation and sequencing, the quality and quantity of the samples were assessed using the Fragment Analyzer system (Agilent Technologies) according to manufacturer's guidelines. Samples with a high-quality DNA profile were further processed. The NEBNext Ultra II FS DNA Module (catalog no. E7810S/L; New England Biolabs) was used for DNA fragmentation, end repair and A-tailing. For adapter ligation, the NEBNext Ultra II Ligation Module (catalog no. E7595S/L; New England Biolabs) was used. The quality and yield after sample preparation were measured with the Fragment Analyzer system. The size of the resulting product was consistent with the expected size of approximately 500-700 bp. Libraries were sequenced for 300-bp paired-end reads using the Illumina NovaSeq 6000 platform according to the manufacturer's protocols. The 1.1-nM library was used for flow cell loading. The NovaSeq control software NCS v.1.5 was used. Image analysis, base calling and quality checking were performed with the Illumina data analysis pipeline RTA3.3.5 and bcl2fastq v.2.20.
Metagenome quality control and preprocessing. All sequenced metagenomes were quality controlled using the preprocessing pipeline implemented in https://github.com/SegataLab/preprocessing. Preprocessing consisted of three main steps: (1) read-level quality control; (2) screening of contaminants, that is, host sequences; and (3) splitting and sorting of cleaned reads. Initial quality control involved the removal of low-quality reads (quality score <Q20), fragmented short reads (<75 bp) and reads with >2 ambiguous nucleotides. Contaminant DNA was identified using Bowtie 2 (ref. 67 ) with the --sensitive-local parameter, allowing confident removal of the phi X 174 Illumina spike-in and human-associated reads (hg19). Sorting and splitting allowed for the creation of standard forward, reverse and unpaired reads output files for each metagenome.
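The read-level filter described above can be illustrated with a minimal sketch. It is not the SegataLab preprocessing pipeline itself: the function names are hypothetical, and two details that the text does not specify are assumed here, namely that 'quality score <Q20' refers to the mean Phred score of a read and that qualities are encoded as Phred+33.

```python
from statistics import mean

# Thresholds taken from the text above.
MIN_MEAN_QUALITY = 20  # reads with quality below Q20 are removed
MIN_LENGTH = 75        # reads shorter than 75 bp are removed
MAX_AMBIGUOUS = 2      # reads with more than 2 ambiguous bases (N) are removed


def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string (Phred+33 encoding is an assumption)."""
    return [ord(ch) - offset for ch in quality_string]


def passes_read_qc(sequence, quality_string):
    """Return True if a read survives the three read-level criteria."""
    if len(sequence) < MIN_LENGTH:
        return False
    if sequence.upper().count("N") > MAX_AMBIGUOUS:
        return False
    # Assumption: the Q20 threshold is applied to the mean quality of the read.
    return mean(phred_scores(quality_string)) >= MIN_MEAN_QUALITY
```

Host and phi X 174 screening would follow this step and is not covered by the sketch.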
Microbiome taxonomic and functional potential profiling. The metagenomic analysis was performed following the general guidelines 68 and relying on the bioBakery computational environment 61 . Taxonomic profiling and quantification of the relative abundances of organisms in all metagenomic samples were performed using MetaPhlAn v.3.0 (ref. 62 ). The updated species-specific database of markers was built using 99,237 reference genomes representing 16,797 species retrieved from GenBank (January 2019). From this set of reference genomes, we extracted a total of 1,132,166 markers used to profile 13,393 species. This set of species also included 83 species defined by the CAG group approach 32 that were very genetically distinct from species represented by isolate genomes and for which the use of unique marker genes limited the potential issues of using metagenomic assemblies reconstructed over multiple samples. Compared to the previous version of the MetaPhlAn2 database (mpa_v20_m200), the updated database profiled 7,116 more species. Metagenomes were mapped internally in MetaPhlAn v.3.0 against the marker gene database with Bowtie 2 v.2.3.4.3 with the parameter 'very-sensitive'. The resulting alignments were filtered to remove reads aligned with a MAPQ value <5 (MAPQ reflects the estimated probability that a read is aligned correctly).
To estimate the microbiome species richness of an individual from the taxonomic profiles of PREDICT 1 participants, we computed two alpha diversity measures: the number of species found in the microbiome ('observed richness'); and the Shannon entropy estimation. We did not perform rarefaction before the alpha diversity calculations because of the low s.d. in sequencing depths and because we verified that sequencing depth was not correlated with the metadata of interest. Microbiome dissimilarity between participants (beta diversity) was computed using the Bray-Curtis dissimilarity on microbiome taxonomic profiles.
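For readers less familiar with these measures, the following sketch shows how observed richness, Shannon entropy and Bray-Curtis dissimilarity can be computed from species-level abundance profiles. The function names are illustrative, and the use of the natural logarithm for the Shannon entropy is an assumption, since the base is not stated in the text.

```python
import math


def observed_richness(abundances):
    """Number of species with non-zero abundance in one sample."""
    return sum(1 for a in abundances if a > 0)


def shannon_entropy(abundances):
    """Shannon entropy of one sample (natural log; the base is an assumption)."""
    total = sum(abundances)
    proportions = [a / total for a in abundances if a > 0]
    return -sum(p * math.log(p) for p in proportions)


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two profiles over the same species."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator
```

Both profiles passed to `bray_curtis` are expected to list the same species in the same order; identical profiles give 0 and profiles with no shared species give 1.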
Functional potential analysis of the metagenomic samples was performed using HUMAnN2 (v.0.11.2 and UniRef database release 2014-07) (ref. 63 ), which computed the pathway profiles and gene family abundances.
Metagenomic assembly. Metagenomic samples were processed to obtain MAGs following the procedure we used elsewhere 5 . In brief, we used MEGAHIT v1.2.9 (ref. 64 ) with the parameter --k-max 127 for assembly; assembled contigs ≥1.5 kilobases (kb) were considered for the binning step performed using MetaBAT2 v.2.14 (ref. 65 ) with the parameters -m 1500 --unbinned. Quality control of the obtained MAGs was performed using CheckM v.1.0.18 (ref. 66 ) using default parameters. High- and medium-quality microbial genomes were integrated into the existing database of >150,000 human MAGs.
Collection and processing of habitual diet information. Habitual diet information was collected using FFQs. For the UK, the European Prospective Investigation into Cancer and Nutrition (EPIC) FFQ was used; in the USA, the Harvard semiquantitative FFQ was used.
For the UK, we used an adapted 131-item EPIC FFQ that was developed and validated against pre-established nutrient biomarkers for the EPIC Norfolk 69 . The questionnaire captured average intakes in the past year. UK nutrient intakes were determined using the FETA software (v. 2.53) to calculate macro- and micronutrient data 70 73 . Nutrient intakes were estimated using the Harvard nutrient database (version SFFQ 043019; https://regepi.bwh.harvard.edu/health/nutrition/index.html). Submitted FFQs were excluded if more than 10 food items were left unanswered or if the total energy intake estimate derived from the FFQ as a ratio of the participant's estimated basal metabolic rate (determined by the Harris-Benedict equation 74 ) was more than 2 s.d. outside the mean of this ratio (<0.52 or >2.58).
The following dietary indices were calculated as described below and according to categorization listed in Supplementary Tables 2 and 4.
HFD index. The HFD index considers the number, distribution and health value of consumed foods. To obtain this index, FFQ foods were first aggregated into 15 food groups according to the HFD 20 . Health values were then derived from the German Nutrition Society dietary guidelines (https://www.dge.de/en/) and the weight of each food group was multiplied by its corresponding health value. Scores were divided by the maximum (health value = 0.26) to bound values between 0 and 1 before multiplication with the Berry index. The original HFD was used instead of the US-HFD for the following reasons: the original HFD gives greater emphasis to plant-based foods and less to meat than the US-HFD, which would more closely align with hypothesized microbiome-plant food/fiber interactions; and converting UK g per serving to US volume measures (as required for the US-HFD) would introduce additional error to the FFQ estimates.
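As an illustration of the calculation described above, the sketch below combines a health-weighted score with the Berry index. It rests on assumptions that are not spelled out in the text: the Berry index is taken in its usual form, one minus the sum of squared shares, the 'weight' of a food group is interpreted as its share of total intake by weight, and the health values in the usage note are hypothetical placeholders rather than the German Nutrition Society values.

```python
def berry_index(shares):
    """Berry index of diversity, 1 - sum(s_i^2), over food-group shares."""
    return 1.0 - sum(s * s for s in shares)


def hfd_index(grams_by_group, health_value_by_group, max_health_value=0.26):
    """Health-weighted diversity score in the spirit of the HFD index.

    grams_by_group: intake (g) per food group for one participant.
    health_value_by_group: health value per food group (hypothetical inputs).
    max_health_value: maximum health value used for normalization (0.26 in the text).
    """
    total = sum(grams_by_group.values())
    shares = {group: grams / total for group, grams in grams_by_group.items()}
    # Weight each group's share by its health value, then scale to the 0-1 range.
    health_score = sum(
        shares[group] * health_value_by_group[group] for group in shares
    ) / max_health_value
    return berry_index(shares.values()) * health_score
```

For example, a diet split evenly between a group with the maximum health value and a group with a health value of 0 has a Berry index of 0.5 and a normalized health score of 0.5, giving an index of 0.25. A diet consisting of a single food group scores 0 regardless of its health value.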
HEI 2010. The HEI 2010 (ref. 21 ) assesses the extent to which an individual's food intake aligns with the Dietary Guidelines for Americans 2010 (ref. 75 ) developed by the US Department of Agriculture. These guidelines cover a total of 12 food groups and nutrients. The HEI has 9 adequacy (encouraged) and 3 moderation (discouraged) components. Two features characterize its scoring: first, a density approach is used to set standards per 1,000 kcal; and second, the least restrictive standards are employed, that is, those that are easiest to achieve among recommendations that vary by energy level, sex and/or age. Total fruits, whole fruits, total vegetables, greens and beans, whole grains, dairy (lean portion only), total protein foods (lean portion of meat only), seafood and plant proteins and fatty acids ((PUFAs + MUFAs)/SFAs) are considered adequate, whereas refined grains, sodium and empty calories (considered added sugars, solid fats and alcohol above 13 g per 1,000 kcal) are considered detrimental and should be consumed in moderation. The index ranges from 0 (not in agreement with the guidelines) to 100 (completely in agreement with the guidelines).
PDI. Three versions of the PDI 30 were considered: the original PDI; the healthy hPDI; and the unhealthy uPDI. Eighteen food groups (amalgamated from the FFQ food groups; Supplementary Table 2) were assigned either positive or reverse scores after segregation into quintiles, as outlined in Supplementary Table 4 (ref. 30 ). Participants with an intake above the highest quintile for the positive score received a score of 5. Those below the lowest quintile intake received a score of 1. A reverse value was applied for the reverse scores. The scores for each participant were summed to create the final score. For the PDI, a positive score was applied to the healthy and less healthy/unhealthy plant foods and a reverse score was applied to the animal-based foods. For the hPDI, positive scores were applied to the healthy plant foods and a reverse score to the less healthy/unhealthy plant foods and animal-based foods. For the uPDI, a positive score was applied to the less healthy/unhealthy plant foods and a reverse score was applied to the healthy plant foods and animal-based foods.
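The quintile-based scoring can be sketched as follows. This is an illustrative reading of the description above, not the exact procedure used in the study: quintile cut points are computed with Python's `statistics.quantiles` (the 'inclusive' interpolation method is an assumption), and the food-group names in the usage note are hypothetical.

```python
from statistics import quantiles


def quintile_scores(intakes, reverse=False):
    """Score each participant 1-5 by cohort quintile of intake for one food group.

    Positive scoring gives 5 to the highest quintile; reverse scoring gives 5
    to the lowest quintile.
    """
    cuts = quantiles(intakes, n=5, method="inclusive")  # four cut points
    scores = [1 + sum(x > c for c in cuts) for x in intakes]
    return [6 - s for s in scores] if reverse else scores


def plant_based_index(intakes_by_group, reverse_groups):
    """Sum the per-group quintile scores for every participant.

    intakes_by_group: {food group: [intake of each participant]}.
    reverse_groups: food groups that receive reverse scores for this index.
    """
    n_participants = len(next(iter(intakes_by_group.values())))
    totals = [0] * n_participants
    for group, intakes in intakes_by_group.items():
        group_scores = quintile_scores(intakes, reverse=group in reverse_groups)
        for i, score in enumerate(group_scores):
            totals[i] += score
    return totals
```

The three indices then differ only in the set passed as `reverse_groups`: animal-based groups for the PDI, animal-based plus less healthy plant groups for the hPDI, and animal-based plus healthy plant groups for the uPDI.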
The aMED score. Adherence to the aMED diet was calculated by following the method outlined by Fung et al. 22 . Nine food/nutrient categories were included (Supplementary Table 4) and the score ranged from 0 to 9 (least to most Mediterranean). To form the categories, the weekly intake frequency of each assigned food was first multiplied by its amount in g per serving and then divided by seven to obtain g per day. Next, the food gram amounts were summed to give the category total. For all food categories and the fatty acid intake ratio, the median intake of each category was calculated. A score of 0 (no aMED) or 1 (aMED) was given for each category depending on whether the participant was above or below the median intake. For alcohol intake, a range was used for score assignment: intakes of 5-25 g d −1 (females) or 10-50 g d −1 (males) were assigned a score of 1, while those above or below this range were assigned a score of 0. Finally, the aMED score was generated by summing the category scores.
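A minimal sketch of this scoring scheme is given below. Which categories earn a point for being below rather than above the cohort median is passed in explicitly, because the text above does not list them; the category names in the usage note are hypothetical. The alcohol ranges are those stated in the text.

```python
from statistics import median

# Alcohol ranges (g per day) that earn one point, as stated in the text.
ALCOHOL_RANGE_G_PER_DAY = {"female": (5, 25), "male": (10, 50)}


def amed_scores(intakes_by_category, below_median_categories,
                alcohol_g_per_day, sexes):
    """Compute an aMED-style score for every participant.

    intakes_by_category: {category: [g per day (or ratio) for each participant]}.
    below_median_categories: categories scored 1 when intake is below the median
        (an input of this sketch; the assignment is not given in the text).
    alcohol_g_per_day, sexes: one entry per participant.
    """
    scores = [0] * len(sexes)
    for category, intakes in intakes_by_category.items():
        cutoff = median(intakes)
        for i, intake in enumerate(intakes):
            if category in below_median_categories:
                scores[i] += int(intake < cutoff)
            else:
                scores[i] += int(intake > cutoff)
    for i, (alcohol, sex) in enumerate(zip(alcohol_g_per_day, sexes)):
        low, high = ALCOHOL_RANGE_G_PER_DAY[sex]
        scores[i] += int(low <= alcohol <= high)
    return scores
```

With eight median-scored categories plus alcohol, the result ranges from 0 to 9, matching the range given above.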
Food groups. For individual analyses of food groups-microbe interaction, food groups were formed by aggregation of FFQ foods into the 18 PDI food groups plus margarine and alcohol (Supplementary Table 4).
Percentage of plants within the diet. The percentage of plants within the diet was calculated as the weight (g) of plant foods within the total weight (g) of the diet after adjustment of FFQ foods into quantities (g) per week.
Number of plant foods. For the number of plant foods, each plant food item within the FFQ above the value of 0 g was allocated a score of 1 and summed for each participant. For the total number of plants and the number of healthy and unhealthy plants, FFQ food items were allocated into groups according to the PDI food groupings.
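The two plant-related measures above (percentage of plants by weight and number of plant foods) reduce to simple arithmetic over the FFQ items, as in the sketch below. The item names and the `plant_items` set are hypothetical; in the study, items were allocated according to the PDI food groupings.

```python
def plant_diet_summary(grams_per_week, plant_items):
    """Return (percentage of diet weight from plants, number of plant foods eaten).

    grams_per_week: {FFQ item: quantity in g per week} for one participant.
    plant_items: set of FFQ items classified as plant foods.
    """
    total_weight = sum(grams_per_week.values())
    plant_weight = sum(
        grams for item, grams in grams_per_week.items() if item in plant_items
    )
    # Each plant item consumed in any amount above 0 g counts once.
    n_plant_foods = sum(
        1 for item, grams in grams_per_week.items()
        if item in plant_items and grams > 0
    )
    percent_plants = 100.0 * plant_weight / total_weight if total_weight > 0 else 0.0
    return percent_plants, n_plant_foods
```

Restricting `plant_items` to the healthy or unhealthy plant groups gives the corresponding counts of healthy and unhealthy plant foods.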
Collection and processing of fasting and postprandial markers. Venous blood samples were collected as outlined in the accompanying protocol paper 16 . Briefly, participants were cannulated and venous blood was collected at fasting (before a test breakfast) and at 9 time points postprandially (15, 30, 60, 120, 180, 240, 270, 300 and 360 min). Plasma glucose and serum C-peptide and insulin were measured at all time points. Serum TG was measured at hourly intervals and serum metabolomics (nuclear magnetic resonance by Nightingale Health using the 2020 platform) at 0, 4 and 6 h. Fasting samples were analyzed for lipid profile, thyroid-stimulating hormone, alanine aminotransferase, liver function panel and complete blood count analysis.
Continuous glucose monitoring on days 2-14 was measured every 15 min using Freestyle Libre Pro continuous glucose monitors (Abbott) fitted on the upper, nondominant arm at participants' baseline clinical visits. Given the CGM device requires time to calibrate once fitted to a participant, CGM data collected 12 h and onwards after activating the device were used for analysis.
Dry blood spot analysis of TG and C-peptide was completed by participants on the first 4 d of the home phase while consuming test meals. The time points were dependent on the test meal as described elsewhere 8,16 . Test cards were stored in aluminum sachets with desiccant once completed and placed in the refrigerator at the end of the study day or until participants mailed them back to the study site. Dry blood spot cards were frozen at −80 °C on receipt in the laboratory until being shipped to Vitas for analysis (Vitas Analytical Services).
Specific time points and increments for TG, glucose, insulin and C-peptide were selected for the current analysis to reflect the different pathophysiological processes for each measure as described in our protocol 16 . The incremental area under the postprandial TG (0-6 h), glucose (0-2 h) and insulin (0-2 h) curves (iAUCs) were computed using the trapezium rule 85 .
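The trapezium-rule calculation of an incremental area under the curve can be sketched as follows. How values that fall below the fasting baseline are handled is not stated above; this sketch clips such increments to zero at each time point, which is one common simplification and should be read as an assumption rather than the study's exact definition.

```python
def incremental_auc(times, values):
    """Incremental AUC above the fasting baseline using the trapezium rule.

    times: sampling times (for example, minutes), starting at fasting (0).
    values: concentrations at those times; values[0] is the fasting baseline.
    Increments below baseline are clipped to zero (an assumption of this sketch).
    """
    baseline = values[0]
    increments = [max(v - baseline, 0.0) for v in values]
    area = 0.0
    for i in range(1, len(times)):
        width = times[i] - times[i - 1]
        area += width * (increments[i] + increments[i - 1]) / 2.0
    return area
```

For glucose and insulin the time series would be restricted to 0-2 h, and for TG to 0-6 h, before calling the function.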
Detailed descriptions of sample collection, processing and analysis have been reported elsewhere 8,16 .
Machine learning. The machine learning framework employed was based on the scikit-learn Python package 86 . The machine learning algorithms used for the prediction and classification of personal, habitual diet, fasting and postprandial metadata are based on random forest regression and classification. We selected random forest-based methods a priori since they have been repeatedly shown to be particularly suitable and robust to the statistical challenges inherent to microbiome abundance data 40,87 . For both the regression and classification tasks, a cross-validation approach was implemented, which was based on 100 bootstrap iterations and an 80/20 random split of training and testing folds. To specifically avoid overfitting as a result of our twin population and their shared factors, we removed any twin from the training fold if their twin was present in the test fold.
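The twin-aware splitting step can be illustrated with a short sketch. The function and argument names are hypothetical, and the mapping `twin_of` (participant to co-twin) is an assumed input format; the sketch only shows the logic of one 80/20 split in which co-twins of test participants are dropped from training.

```python
import random


def twin_aware_split(sample_ids, twin_of, test_fraction=0.2, rng=None):
    """Randomly split samples 80/20, dropping co-twins of test samples from training.

    sample_ids: list of participant identifiers.
    twin_of: {participant: co-twin} for twins only (hypothetical input format).
    Returns (train_ids, test_ids).
    """
    rng = rng or random.Random()
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    test_ids = shuffled[:n_test]
    test_set = set(test_ids)
    # Keep a training sample only if its co-twin (if any) is not in the test fold.
    train_ids = [s for s in shuffled[n_test:] if twin_of.get(s) not in test_set]
    return train_ids, test_ids
```

Calling the function 100 times with different random states would correspond to the 100 iterations described above; the training fold can be slightly smaller than 80% whenever co-twins are removed.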
For the regression task, we trained a random forest regressor to learn the feature to predict and simple linear regression to calibrate the output for the test folds on the range of values in the training folds. From the scikit-learn package, we used the RandomForestRegressor with the parameters n_estimators = 1000, criterion = 'mse' and max_features = 'sqrt', and LinearRegression with default parameters. For the classification task, we divided the continuous features into two classes: the top and bottom quartiles. From the scikit-learn package we used the RandomForestClassifier function with the parameters n_estimators = 1000 and max_features = 'sqrt'.
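The conversion of a continuous variable into the two classes used for classification can be sketched as below. The function name is hypothetical, and the treatment of values lying exactly on a quartile boundary (included here) and the 'inclusive' quantile method are assumptions. Samples in the middle two quartiles are left out of the classification task.

```python
from statistics import quantiles


def top_vs_bottom_quartile(values):
    """Label samples in the bottom quartile as 0 and in the top quartile as 1.

    Returns a list of (sample index, class label); samples between the first
    and third quartiles are omitted.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    labelled = []
    for index, value in enumerate(values):
        if value <= q1:
            labelled.append((index, 0))
        elif value >= q3:
            labelled.append((index, 1))
    return labelled
```

The retained indices select the rows of the feature table that would be passed, together with the labels, to a classifier such as scikit-learn's RandomForestClassifier.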
We used random forest classification and regression on both species-level taxonomic relative abundance and functional potential profiles. For taxonomic abundances, we used the species-level relative abundances as estimated by MetaPhlAn v.3.0 (see above), normalized using the arcsin-sqrt transformation for compositional data. For functional profiles, we considered both raw relative abundance estimates of single microbial gene families and pathway-level relative abundance as provided by HUMAnN2.
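The arcsin-sqrt transformation mentioned above is a one-line operation on relative abundances. The sketch below assumes abundances expressed as fractions between 0 and 1; the `percent` flag is an added convenience for profiles reported on a 0-100 scale and is not taken from the text.

```python
import math


def arcsin_sqrt(relative_abundances, percent=False):
    """Apply the arcsin-sqrt transformation to relative abundances.

    relative_abundances: values in [0, 1], or in [0, 100] if percent=True.
    """
    scale = 100.0 if percent else 1.0
    return [math.asin(math.sqrt(a / scale)) for a in relative_abundances]
```

The transformation maps 0 to 0 and 1 to π/2 and stretches the low-abundance end of the scale, which helps stabilize the variance of proportions.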
As an additional control, we verified that when randomly swapping the target labels or values (classification and regression, respectively), the performances reflected a random prediction, that is, an AUC very close to 0.5 and a nonsignificant correlation approaching 0 between the real and predicted values.

Statistical analysis. Spearman correlations and partial Spearman correlations (reported with ρ in the text) were computed using cor.test from the stats R package (version 3.5.1) and pcor.test from the ppcor R package (version 1.1), respectively. Correlations and P values were computed for each pair of metadata and species; P values were corrected for FDR using the Benjamini-Hochberg procedure and are reported in the text as q values. We considered correlations significant at q < 0.2. Significant species were selected by ranking them according to their number of significant associations for the panel of metadata considered; then, the top 30 unique species were considered for each panel of metadata. In the heatmaps for partial correlations, the asterisk indicates that the correlation index for the corresponding species-metadata pair is significant at an FDR ≤ 0.2.
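For reference, the Benjamini-Hochberg adjustment that turns P values into q values can be written in a few lines. The analysis itself was carried out in R; this Python sketch only illustrates the standard procedure.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg FDR-adjusted P values (q values).

    The output is in the same order as the input.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values
```

A species-metadata pair would then be called significant when its q value falls below the chosen threshold (0.2 for the correlations described above).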
The contribution of metadata variables to microbiota community variation was determined by distance-based redundancy analysis (dbRDA) on species-level Bray-Curtis dissimilarity and Aitchison distance with the capscale function in the vegan R package (version 2.5.6) 88 . Correction for multiple testing (Benjamini-Hochberg, FDR) was applied and significance was defined at an FDR < 0.1. The cumulative contribution of metadata variables or metadata categories was determined by forward model selection on dbRDA (stepwise dbRDA) with the ordiR2step function in vegan, with variables that showed a significant contribution to microbiota community variation in the previous step. Because of the high consistency between the two distance functions, we performed the cumulative distribution analysis using the Bray-Curtis dissimilarity. Only metadata variables with <15% missing data and without high collinearity with other variables (Spearman ρ < 0.8) were used as input in the stepwise model.

Data validation on the US cohort and on the curatedMetagenomicData datasets.
As independent validation, we considered the publicly available datasets collected in the curatedMetagenomicData v.1.16.0 R package 34 . Of the 57 datasets available, we selected those that had samples with the following characteristics: (1) gut samples collected from healthy adult individuals at first collection (days_from_first_collection=0 or not applicable); (2) samples with age and BMI data available and BMI interquartile range (IQR) of these samples between 3.5 and 7.5 (±2 regarding the PREDICT 1 UK IQR of 5.5; Extended Data Fig. 5). Among the datasets with samples meeting the above criteria, only those with at least 50 samples were considered: CosteaPI_2017 (ref. 89 ) …

We used the previously selected validation datasets from curatedMetagenomicData in two analyses: one based on machine learning to verify the reproducibility of the machine learning model we trained using the PREDICT 1 UK samples; and the second to verify the species-level correlations found in the PREDICT 1 UK cohort. For the first task, we applied a regression algorithm to predict BMI and age. Three different cross-validation approaches were used. First, each dataset was used independently in 100 bootstrap iterations with an 80/20 random split of training and testing folds. Second, one more iteration was performed using the PREDICT 1 UK dataset as the training fold and each dataset as the testing fold. Third, a final prediction was made using a leave-one-dataset-out (LODO) approach, meaning that all datasets (PREDICT 1 UK, PREDICT 1 USA and the curatedMetagenomicData datasets) were considered together and each validation dataset was successively used as the test fold while all others were used for training. An additional validation using the curatedMetagenomicData datasets was performed by computing a pairwise Spearman correlation for each species in each curatedMetagenomicData dataset against BMI and age.
For each correlation, we selected the top associated species in PREDICT 1 UK (FDR, q ≤ 0.05) and reported their correlation in curatedMetagenomicData. For those species found also in the PREDICT 1 USA dataset, we also reported their correlation.
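The leave-one-dataset-out scheme used in the third cross-validation approach can be sketched as a simple generator of train/test splits. The function name and the dataset keys in the usage note are illustrative; model fitting is omitted.

```python
def leave_one_dataset_out(samples_by_dataset, validation_datasets):
    """Yield (held-out dataset, training samples, test samples) for each split.

    samples_by_dataset: {dataset name: list of sample identifiers}.
    validation_datasets: names of the datasets to hold out in turn.
    """
    for held_out in validation_datasets:
        test = list(samples_by_dataset[held_out])
        # All remaining datasets are pooled into the training fold.
        train = [
            sample
            for name, samples in samples_by_dataset.items()
            if name != held_out
            for sample in samples
        ]
        yield held_out, train, test
```

For instance, with keys such as "PREDICT1_UK", "PREDICT1_USA" and "CosteaPI_2017" (placeholder names), holding out "CosteaPI_2017" trains on the pooled PREDICT 1 samples and tests on the CosteaPI_2017 samples.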
Reporting Summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
The metagenomes are deposited in European Bioinformatics Institute European Nucleotide Archive under accession no. PRJEB39223. The non-metagenomic data used for analysis in this study are held by the Department of Twin Research at King's College London. The data can be released to bona fide researchers using our normal procedures overseen by the Wellcome Trust and its guidelines as part of our core funding. We receive around 100 requests per year for our datasets and have three meetings per month with independent members to assess proposals. The application can be found at https://twinsuk.ac.uk/resources-for-researchers/access-our-data/. This means that data need to be anonymized and conform to GDPR standards.

Code availability
Computational analyses were performed using the bioBakery suite of tools; species-level microbial abundances were computed using MetaPhlAn v.3.0 (https://github.com/biobakery/MetaPhlAn). Functional potential profiling was carried out with HUMAnN v.2.0 (https://github.com/biobakery/humann; Methods).

Fig. 1 | Alpha diversity linked with personal factors, habitual diet, fasting, and postprandial markers. a, Microbiome alpha diversity computed using the Shannon index, correlated with markers from the four categories: personal, habitual diet, fasting, and postprandial. Reported are the five strongest positive and negative Spearman correlations for each category with p < 0.05. All correlations and p-values are available in Supplementary Table 1. b, Inter-sample microbiome distances (beta diversity) were substantially lower, that is, closer, among samples from the same individuals (two weeks apart) than among samples from different individuals. Gut microbial communities in monozygotic twins were slightly more similar than in dizygotic twins (Mann-Whitney U test, two-sided p = 0.06), which, in turn, were more similar than those of unrelated individuals (p < 1e-12), even after adjusting for age (p < 1e-12). c, After excluding twin status (that is, non-twin vs. monozygotic vs. dizygotic twins) from the model, personal factors still accounted for the greatest proportion of variance explained in overall microbial diversity, followed by dietary habits, fasting and postprandial cardiometabolic blood markers (by cumulative stepwise dbRDA). d, Cumulative (left bars) and individual (right bars) contributions for each metadata variable based on Bray-Curtis dissimilarity. Box plots show first and third quartiles (boxes) and the median (middle line); whiskers extend up to 1.5× the interquartile range.

Fig. 4 | Performance of Random Forest (RF) regression and classification on microbiome functional potential in predicting fasting measurements, total cholesterol and triglycerides in different lipoproteins. The figure shows the performance of both RF regression and classification tasks trained on microbiome gene family profiles in predicting: a, the fasting measurements presented in Fig. 4a, sorted as in Fig. 4a; b, total cholesterol and c, triglycerides in lipoproteins of different sizes. For each lipoprotein, we considered its concentration at both fasting and postprandial (6 h) time points, and also the difference (rise) between the postprandial concentration and the fasting one. Box plots show the distribution of the Spearman correlations (left axis) between real and predicted values using RF regression. Box plots show first and third quartiles (boxes) and the median (middle line); whiskers extend up to 1.5× the interquartile range. Circles show the median AUC (right axis) of RF classification in predicting the bottom quartile of the distribution vs. the top quartile.

Fig. 6 | Pairwise partial Spearman correlations between bacterial species and total lipids and cholesterol in lipoproteins. a, The heatmap shows the species-level correlations with total lipids in lipoprotein variables at fasting, postprandial (6 h), and the difference (rise) between the postprandial and fasting concentrations. The 30 species with the highest number of significant associations (FDR ≤ 0.2) are shown. The asterisk indicates a significant correlation between species and metadata variable (two-sided t-test, FDR-corrected, q < 0.2). b, The heatmap shows the species-level correlations with total cholesterol in lipoprotein variables at fasting, postprandial (6 h), and the difference (rise) between the postprandial and fasting concentrations. The 30 species with the highest number of significant associations (FDR ≤ 0.2) are shown.
The asterisk indicates a significant correlation between species and metadata variable (two-sided t-test, FDR-corrected, q < 0.2). All correlations, p-values, and q-values are available in Supplementary Table 6.

Fig. 8 | Pairwise partial Spearman correlations between bacterial gene families and pathway abundances with clinical and metabolic risk scores, glycaemic and inflammatory measures, and lipoproteins. a, The heatmap shows gene family correlations with the set of metadata presented in Fig. 5a-c, reporting the top 2,000 genes, selected among those with at least 20% prevalence by their number of significant correlations (q < 0.2). Gene family correlations show the same clusters as the species-level correlations in Fig. 5a-c. b, The heatmap shows pathway abundance correlations with the set of metadata presented in Fig. 5a-c, reporting all the pathways with at least 20% prevalence (349 in total). Pathway abundance correlations show the same cluster structure as the species-level correlations in Fig. 5a-c.

Fig. 9 | Concordance of Random Forest scores with species-level partial correlations. Volcano plots of the scores assigned to each species by Random Forest and their partial correlation, showing an overall concordance between the two independent approaches. We considered the top 5 metadata variables for the six metadata categories: a, Foods, bacon (g) (corr. 0.49), garlic (g) (corr. 0.424), unsalted nuts (g) (0.422), dairy dessert

Fig. 10 | Prevotella copri and/or Blastocystis presence are indicators of a more favourable postprandial glucose response to meals. a-c, Differential analysis of visceral fat, HFD and glucose iAUC 2 h after standardised breakfast according to presence or absence of one or both of P. copri and Blastocystis. The analysis reveals that both these species are indicators of reduced visceral fat, good cholesterol and meal-driven increase of glucose. d,e, Differential analysis of C-peptide and triglycerides at different time points according to presence or absence of one or both of P. copri and Blastocystis. The distributions of the concentrations for C-peptide and triglycerides were typically lower when one or both are absent. An asterisk between two box plots represents a significant p-value (p < 0.05) according to the Mann-Whitney U test (two-sided, Supplementary Table 8). Box plots show first and third quartiles (boxes) and the median (middle line); whiskers extend up to 1.5× the interquartile range. P-values are available in Supplementary Table 8.
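
The presence/absence comparisons in Fig. 10 rely on a two-sided Mann-Whitney U test. As an illustration only, the following minimal Python sketch shows how such a comparison between two groups could be computed. It is not the authors' code: it substitutes the tie-corrected normal approximation with continuity correction for whatever implementation was actually used, the function name `mann_whitney_u` is hypothetical, and the example values are invented for demonstration rather than taken from the study.

```python
import math
from statistics import NormalDist


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U statistic of group x, two-sided p-value). Assumption: the
    normal approximation is used for simplicity; it is only reliable when
    both groups are reasonably large.
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both groups must be non-empty")
    n = n1 + n2
    # Pool the observations, remembering which group each came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])

    rank_sum_x = 0.0  # sum of midranks of the observations in x
    tie_term = 0      # sum of t^3 - t over groups of t tied values
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values occupy ranks i+1 .. j and share their average rank.
        midrank = (i + 1 + j) / 2
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        t = j - i
        tie_term += t**3 - t
        i = j

    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:
        # All values are identical: no evidence of a difference.
        return u_x, 1.0
    # Continuity-corrected z score, then a two-sided p-value.
    z = max(abs(u_x - mean_u) - 0.5, 0.0) / math.sqrt(var_u)
    p = 2 * (1 - NormalDist().cdf(z))
    return u_x, min(p, 1.0)


# Invented example values (arbitrary units), NOT data from the study:
# a marker measured in carriers vs. non-carriers of a given microbe.
carriers = [1.8, 2.1, 2.4, 1.9, 2.0, 2.6, 1.7]
non_carriers = [2.5, 2.9, 3.1, 2.7, 2.2, 3.4, 2.8]
u_stat, p_value = mann_whitney_u(carriers, non_carriers)
print(f"U = {u_stat}, two-sided p = {p_value:.4f}")
```

For small groups an exact test is preferable to the normal approximation; established statistics packages provide one, and any reanalysis should state which variant was applied.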