Genome sequencing analysis identifies new loci associated with Lewy body dementia and provides insights into the complex genetic architecture

The genetic basis of Lewy body dementia (LBD) is not well understood. Here, we performed whole-genome sequencing in large cohorts of LBD cases and neurologically healthy controls to study the genetic architecture of this understudied form of dementia and to generate a resource for the scientific community. Genome-wide association analysis identified five independent risk loci, whereas genome-wide gene-aggregation tests implicated mutations in the gene GBA. Genetic risk scores demonstrate that LBD shares risk profiles and pathways with Alzheimer’s and Parkinson’s disease, providing a deeper molecular understanding of the complex genetic architecture of this age-related neurodegenerative condition.


Introduction
Lewy body dementia (LBD) is a clinically heterogeneous neurodegenerative disease characterized by progressive cognitive decline, parkinsonism, and visual hallucinations 1 . There are no effective disease-modifying treatments available to slow disease progression, and current therapy is limited to symptomatic and supportive care. At postmortem, the disorder is distinguished by the widespread cortical and limbic deposition of α-synuclein protein in the form of Lewy bodies that are also a hallmark feature of Parkinson's disease. The vast majority of LBD patients additionally exhibit Alzheimer's disease co-pathology 2 . These neuropathological observations have led to the, as yet unproven hypothesis that LBD lies on a disease continuum between Parkinson's disease and Alzheimer's disease 3 . Though relatively common in the community, with an estimated 1.4 million prevalent cases in the United States 4 , the genetic contributions to this underserved condition are poorly understood.
The rapid advances in genome sequencing technologies offer unprecedented opportunities to identify and characterize disease-associated genetic variation. Here, we performed whole-genome sequencing in a cohort of 2,981 patients diagnosed with LBD and 4,391 neurologically healthy subjects. We analyzed these data using a genome-wide association study (GWAS) approach and gene aggregation tests, and we modeled the relative contributions of Alzheimer's disease and Parkinson's disease risk variants to this fatal neurodegenerative disease (see Fig. 1 for an analysis overview). Additionally, we created a resource for the scientific community to mine for new insights into the genetic etiology of LBD and to expedite the development of targeted therapeutics.

Genome-wide association analysis identifies new loci associated with LBD
After quality control, whole-genome sequence data from 2,591 patients diagnosed with LBD and 4,027 neurologically healthy subjects were available for study. Participants were recruited across 44 institutions/consortia and were diagnosed according to established consensus criteria. Using a GWAS approach, we identified five loci that surpassed the genome-wide significance threshold (Table 1, Fig. 2a). Three of these signals were located at known LBD risk loci within the genes GBA, APOE, and SNCA [5][6][7][8] . The remaining GWAS signals in BIN1 and TMEM175 represented novel LBD risk loci. Notably, these loci have been implicated in other age-related neurodegenerative diseases, including Alzheimer's disease (BIN1) and Parkinson's disease (TMEM175) 9,10 . Conditional analyses detected a second signal at the APOE locus (Supplementary Fig. 1 for regional association plots, Supplementary Fig. 2 for conditional association analyses). Subanalysis GWAS of pathologically defined LBD cases only versus control subjects identified the same risk loci (Fig. 2b). Finally, we replicated each of the observed risk loci in an independent sample of 970 European-ancestry LBD cases and 8,928 control subjects (Table 1).

Gene-level aggregation testing identifies GBA as a pleomorphic risk gene
The significant loci from our GWAS explained only a small fraction (1%) of the conservatively estimated narrow-sense heritability of LBD of 10.81% (95% confidence interval [CI]: 8.28% -13.32%, p-value = 9.17 x 10 -4 ). To explore whether rare variants contribute to the remaining risk of LBD, we performed gene-level sequence kernel association -optimized (SKAT-O) tests of missense and loss-of-function mutations with a minor allele frequency (MAF) threshold ≤ 1% across the genome 11 . This rare variant analysis identified GBA as associated with LBD ( Fig. 2c). GBA, encoding the lysosomal enzyme glucocerebrosidase, is a known pleomorphic risk gene of LBD and Parkinson's disease 5,12,13 , and our rare and common variant analyses confirm a prominent role of this gene in the pathogenesis of Lewy body diseases.

Functional inferences from colocalization and gene expression analyses
Most GWAS loci are thought to operate through the regulation of gene expression 14,15 .
Thus, we performed a colocalization analysis to determine whether a shared causal variant drives association signals for LBD risk and gene expression. Expression quantitative trait loci (eQTLs) were obtained from eQTLGen and PsychENCODE 16,17 , the largest available human blood and brain eQTL datasets. We found evidence of colocalization between the TMEM175 locus and an eQTL regulating TMEM175 expression in blood (posterior probability for H4 (PPH4) = 0.99;  Table 1). There was also colocalization between the association signal at the SNCA locus and an eQTL regulating SNCA-AS1 expression in the brain (PPH4 = 0.96; Fig.   3b; Supplementary Table 1). Interestingly, the index variant at the SNCA locus was located within the SNCA-AS1 gene, which overlaps with the 5'-end of SNCA and encodes a long noncoding antisense RNA known to regulate SNCA expression. Sensitivity analyses confirmed that these colocalizations were robust to changes in the prior probability of a variant associating with both traits (Supplementary Fig. 3).
We interrogated the effect of each SNP in the region surrounding SNCA-AS1 on LBD risk using our GWAS data and SNCA-AS1 expression using the PsychENCODE data ( Supplementary   Fig. 4a). All genome-wide significant risk SNPs in the locus had a negative beta coefficient, while the shared SNCA-AS1 eQTL had a positive beta coefficient. This negative correlation suggested that increased SNCA-AS1 expression is associated with reduced LBD risk (Spearman's rho = -0.42; p-value = 0.0012; Supplementary Fig. 4b).
Analysis of human bulk-tissue RNA-sequencing data from the Genotype-Tissue Expression (GTEx) consortium and single-nucleus RNA-sequencing data of the medial temporal gyrus from the Allen Institute of Brain Science 18,19 demonstrated that TMEM175 is ubiquitously expressed, whereas SNCA-AS1 is predominantly expressed in brain tissue ( Fig. 6).

LBD risk overlaps with risk profiles of Alzheimer's disease and Parkinson's disease
We leveraged our whole-genome sequence data to explore the etiological relationship between Alzheimer's disease, Parkinson's disease, and LBD. To do this, we applied genetic risk scores derived from large-scale GWASes in Alzheimer's and Parkinson's disease to individuallevel genetic data from our LBD case-control cohort 20,21 . We tested the associations of the Alzheimer's and Parkinson's disease genetic risk scores with LBD disease status, and with age at death, age at onset, and the duration of illness observed among the LBD cases.

Enrichment analysis identifies pathways involved in LBD
Pathway enrichment analysis of LBD, using a polygenic risk score based on the GWAS risk variants, found several significantly enriched gene ontology processes associated with LBD ( Fig. 6). These related to the regulation of amyloid-beta formation (adjusted p-value = 0.04), regulation of endocytosis (adjusted p-value = 0.02), tau protein binding (adjusted p-value = 1.85 x 10 -5 ), and others. Among these, the regulation of amyloid precursor protein, amyloid-beta formation, and tau protein binding have been previously implicated in the pathogenesis of Alzheimer's disease, while regulation of endocytosis is particularly important in the pathogenesis of Parkinson's disease 22,23 . These observations support the notion of overlapping diseaseassociated pathways in these common age-related neurodegenerative diseases.

Discussion
Our analyses highlight the contributions of common and rare variants to the complex genetic architecture of LBD, a common and fatal neurodegenerative disease. Specifically, our GWAS identified five independent genome-wide significant loci (GBA, BIN1, TMEM175, SNCA-AS1, APOE) that influence risk for developing LBD, whereas the genome-wide genebased aggregation tests implicated mutations in GBA as being critical in the pathogenesis of the disease. We further detected strong cis-eQTL colocalization signals at the TMEM175 and SNCA-AS1 loci, indicating that the risk of disease at these genomic regions is driven by expression changes of these particular genes. Finally, we provided definitive evidence that the risk of LBD is driven, at least in part, by the genetic variants associated with the risk of developing both Alzheimer's disease and Parkinson's disease.
We replicated all five GWAS signals in an independent LBD case-control dataset derived from imputed genotyping array data. Among these, GBA (encoding the lysosomal enzyme glucocerebrosidase), APOE (encoding apolipoprotein E), and SNCA (encoding α-synuclein) are known LBD risk genes [5][6][7] . In addition to these previously described loci, we identified a novel locus on chromosome 2q14.3, located 28 kb downstream to the BIN1 gene, which is a known risk locus for Alzheimer's disease 9 . BIN1 encodes the bridging integrator 1 protein that is involved in endosomal trafficking. The depletion of BIN1 reduces the lysosomal degradation of β-site APP-cleaving enzyme 1 (BACE1), resulting in increased amyloid-β production 24 . The direction of effect observed in LBD is the same as in Alzheimer's disease (Supplementary Table   3). The observed pleiotropic effects between LBD and Alzheimer's disease prompt us to speculate that mitigating BIN1-mediated endosomal dysfunction could have therapeutic implications in both neurodegenerative diseases. Likewise, the directions of effect for the BIN1 and APOE signals were the same as the directions detected in Alzheimer's disease (Supplementary Table 3) 29 . However, we observed a notably different profile at the SNCA locus in LBD compared to PD. Our GWAS and colocalization analyses implicated SNCA-AS1, a non-coding RNA that regulates SNCA expression, as the main signal at the SNCA locus. In contrast, the main signal in Parkinson's disease is detected at the 3'end of SNCA 30 . This finding suggests that the regulation of SNCA expression may be different in LBD compared to Parkinson's disease and that only specific SNCA transcripts that are regulated by SNCA-AS1 drive risk for developing dementia. Further, SNCA-AS1 may prove to be a more amenable therapeutic target than SNCA itself due to its neuronal specificity.
As part of this study, we created a foundational resource that will facilitate the study of molecular mechanisms across a broad spectrum of neurodegenerative diseases. We anticipate that these data will be widely accessed for several reasons. First, the resource is the largest whole-genome sequence repository in LBD to date. Second, the nearly 2,000 neurologically healthy, aged individuals included within this resource can be used as control subjects for the study of other neurological and age-related diseases. Third, we prioritized the inclusion of pathologically confirmed LBD patients, representing more than two thirds of the case cohort, to ensure high diagnostic accuracy among our case cohort participants. Finally, all genomes are of high quality and were generated using a uniform genome sequencing, alignment, and variantcalling pipeline. Whole genome sequencing data on this large case-control cohort has allowed us to undertake a comprehensive genomic evaluation of both common and rare variants, including immediate fine-mapping of association signals to pinpoint the functional variants at the TMEM175 and SNCA-AS1 loci. The availability of genome-sequence data will facilitate similar comprehensive evaluations of less commonly studied variant types, such as repeat expansions and structural variants.
Our study has limitations. We focused on individuals of European ancestry, as this is the population in which large cohorts of LBD patients were readily available. Recruiting patients and healthy controls from diverse populations will be crucial for future research to understand the genetic architecture of LBD. Another constraint is the use of short-read sequencing, rather than long-read sequencing applications, that limits the resolution of complex, repetitive, and GC-rich genomic regions 31 . Further, despite our large sample size, we had limited power to detect common genetic variants of small effect size, and additional large-scale genomic studies will be required to unravel the missing heritability of LBD.
In conclusion, our study identified novel loci as relevant in the pathogenesis of LBD. Our findings confirmed that LBD genetically intersects with Alzheimer's disease and Parkinson's disease and highlighted the polygenic contributions of these other neurodegenerative diseases to its pathogenesis. Determining shared molecular genetic relationships among complex neurodegenerative diseases paves the way for precision medicine and has implications for prioritizing targets for therapeutic development. We have made the whole-genome sequence data available to the research community. These genomes constitute the largest sequencing effort in LBD to date and are designed to accelerate the pace of discovery in dementia.

Cohort description and study design
A total of 5,154 participants of European ancestry (2,981 LBD cases, 2,173 neurologically healthy controls) were recruited across 17 European and 27 North American sites/consortia to create a genomic resource for LBD research (Supplementary Table 4). In addition to these resource genomes, we obtained convenience control genomes from (1)

Sequence alignment, variant calling
Genome sequence data were processed using the pipeline standard developed by the Centers for Common Disease Genomics (CCDG; https://www.genome.gov/27563570/). This standard allows for whole-genome sequence data processed by different groups to generate 'functionally equivalent' results 36 . The GRCh38DH reference genome was used for alignment, as specified in the CCDG standard. For whole-genome sequence alignments and processing, the Broad Institute's implementation of the functional equivalence standardized pipeline was used.
This pipeline, which incorporates the GATK (2016) Best Practices 37 , was implemented in the workflow description language for deployment and execution on the Google Cloud Platform.
Single-nucleotide variants and indels were called from the processed whole-genome sequence data following the GATK Best Practices using another Broad Institute workflow for joint discovery and Variant Quality Score Recalibration. Both Broad workflows for WGS sample processing and joint discovery are publicly available (https://github.com/gatk-workflows/broadprod-wgs-germline-snps-indels). All whole-genome sequence data were processed using the same pipeline.
After these quality control filters were applied, there were 6,651 samples available for analysis. Supplementary Fig. 7 shows quality control metrics.

Single-variant association analysis
We performed a GWAS in LBD (n = 2,591 cases and 4,027 controls) using logistic regression in PLINK (v.2.0) with a minor allele frequency threshold of >1% based on the allele frequency estimates in the LBD case cohort 42 . We used the step function in the R MASS package to determine the minimum number of principal components (generated from common single nucleotide variants) required to correct for population substructure 43 . The first two principal components in our study cohorts compared to the HapMap3 Genomic Resource Panel are shown in Supplementary Fig. 8a. Based on this analysis, we incorporated sex, age, and five principal components (PC1, PC3, PC4, PC5, PC7) as covariates in our model. Quantile-quantile plots revealed minimal residual population substructure, as estimated by the sample size-adjusted genome-wide inflation factor λ1000 of 1.004 (Supplementary Fig. 8b). The Bonferroni threshold for genome-wide significance was 5.0 x 10 -8 . A conditional analysis was performed for each GWAS locus by adding each respective index variant to the covariates (Supplementary Fig. 2).
For the LBD GWAS replication analysis, we obtained genotyping array data from two independent, non-overlapping, European-ancestry LBD case-control cohorts, totaling 970 LBD cases and 8,928 controls combined, as described elsewhere 44,45 . The data were cleaned by applying the same sample-and variant-level quality control steps that were used in the discovery genomes. We imputed the data against the NHLBI TopMed imputation reference panel under default settings with Eagle v.2.4 phasing [46][47][48] . Variants with an R 2 value < 0.3 were excluded. A meta-analysis of the two cohorts was performed with METAL under a fixed-effects model and variants that were significant in the discovery stage were extracted 49 .

Colocalization analyses
Coloc (v.4.0.1) was used to evaluate the probability of LBD loci and expression quantitative trait loci (eQTLs) sharing a single causal variant 50 . This tool incorporates a Bayesian statistical framework that computes posterior probabilities for five hypotheses: namely, there is no association with either trait (hypothesis 0, H0); an associated LBD variant exists but no associated eQTL variant (H1); there is an associated eQTL variant but no associated LBD variant (H2); there is an association with an eQTL and LBD risk variant but they are two independent variants (H3); and there is a shared associated LBD variant and eQTL variant within the analyzed region (H4). Cis-eQTLs were derived from eQTLGen (n = 31,684 individuals; accessed February 19, 2019) and PsychENCODE (n = 1,387 individuals; accessed February 20, 2019) 16,17 . For each locus, we examined all genes within 1Mb of a significant region of interest, as defined by our of one in the control subjects. Regression models were then applied to test for association with the risk of developing LBD (based on logistic regression) or the age at death, age at onset, and disease duration (linear regression), adjusting for sex, age (risk and disease duration only), and five principal components (PC1, PC3, PC4, PC5, PC7) to account for population stratification.

Polygenic risk score generation for pathway enrichment analysis
A genome-wide LBD polygenic risk score was generated using PRSice-2 57 . The polygenic risk score was computed by summing the risk alleles associated with LBD that had been weighted by the effect size estimates generated by performing a GWAS in the pathologically confirmed LBD cases and controls. This workflow identified the optimum p-value threshold (1 x 10 -4 in our dataset) for variant selection, allowing for the inclusion of variants that failed to reach genome-wide significance but that contributed to disease risk, nonetheless. After excluding variants without an rs-identifier, the remaining 122 variants were ranked based on their GWAS p-values, with the APOE, GBA, SNCA, BIN1 and TMEM175 genes added to the top five positions. The list was then analyzed for pathway enrichment using the g:Profiler toolkit (v.0.1.8) 58 . We defined the genes involved in the pathways and gene sets using the following databases: (i) Gene Ontology, (ii) Kyoto Encyclopedia of Genes and Genomes, (iii) Reactome, and (iv) WikiPathways 59,60 . Significant pathways and gene lists with a single gene or containing more than 1,000 genes were discarded. Significance was defined as a p-value of less than 0.05.
The g:Profiler algorithm applies a Bonferroni correction to the p-value for each pathway to correct for multiple testing.
Human single-nucleus RNA sequence data are available at the Allen Institute for Brain Science portal (portal.brain-map.org/atlases-and-data/rnaseq/human-mtg/smart-seq). Specificity values for the Allen Institute for Brain Science and GTEx data and the code used to generate these values are openly available at: https://github.com.RHReynolds/MarkerGenes.

Acknowledgments
We thank contributors who collected samples used in this study, as well as patients and families, whose help and participation made this work possible. We would like to thank Ms.       For each of the five loci, the variant with the lowest p-value is listed. The gene that is in closest proximity to the top variant at each locus is represented. The chromosomal position is shown according to hg38. Genome-wide significance was defined as a p-value < 5 x 10 -8 . Abbreviations: A1, other allele; A2, effect allele; EAF, effect allele frequency; OR, odds ratio; Chr, chromosome; CI, confidence interval.