RESULTS
Pathway analysis used a three-stage study design
Overall, we evaluated the involvement of 7296 pathways in the pathogenesis of ALS using a polygenic risk score approach (see Fig. 1A for the workflow of our analysis). To ensure the accuracy of our results, we divided the available ALS genomic data into three sections. The first of these independent datasets (hereafter known as the reference dataset) was a published genome-wide association study (GWAS) involving 12,577 ALS cases and 23,475 controls (6). We used the summary statistics from this reference dataset to define the weights of the risk allele so that greater importance was given to alleles with higher risk estimates.
Polygenic risk score analysis was used to identify (A) biological pathways and (B) cell types contributing to the risk of developing ALS. The human frontal cortex single-nucleus dataset was obtained from the North American Brain Expression Consortium (NABEC). The human FTC and hippocampus DroNc-seq was obtained from Habib et al. (19). The human Motor Cortex sNUC-seq dataset was obtained from the Allen Cell Types database (20). FTC, prefrontal cortex; M1, primary motor cortex.
These risk allele weights were then applied to our second dataset (also known as the training dataset) to generate a polygenic risk score estimate for each biological pathway. These training data consisted of individual-level genotype and phenotype data from 5605 ALS cases and 24,110 control subjects that were genotyped in our laboratory (7). We investigated the pathways defined by the Molecular Signatures Database, a compilation of annotated gene sets designed for gene set enrichment and pathway analysis. We focused our efforts on three collections within the Molecular Signatures Database that have been previously validated (8, 9). These were the hallmark gene sets (containing 50 pathways), the curated gene sets (1329 pathways), and the gene ontology gene sets (5917 pathways).
To ensure our results’ accuracy and control for type I error, we attempted replication of our findings in an independent cohort. For this, we used our third independent dataset (also known as the replication dataset) consisting of individual-level genotype and phenotype data from 2411 ALS cases and 10,322 controls that were also genotyped in our laboratory (7). The pathways that achieved significance in the training dataset [defined as a false discovery rate (FDR)–corrected P value of <0.05] were selected for replication. Then, we report the pathways that achieved significance in the replication dataset (defined as a raw P value of <0.05). While the replication cohort was required to ensure the accuracy of our results and to avoid overfitting, it was limited in size, raising concerns of rejecting true associations. For this reason, we reported the pathways that achieved significance in the replication dataset using a raw P value of <0.05 as the threshold for significance. The FDR-corrected P values are also shown in Table 1.
Beta estimates, standard errors, and P values are after Z transformation. SE, standard error; BP, biological process; CC, cellular component; MF, molecular function.
We applied a similar polygenic risk score approach to determine which cell types are associated with the ALS disease process (Fig. 1B). In essence, a cell type associated with a disease will display a pattern whereby more of the polygenic risk score variance is attributable to genes specifically expressed in that cell type. We applied a linear model to detect this pattern in our ALS data, using a P value of less than 0.05 as the significance threshold. This strategy has become a standard approach for this type of analysis (10).
Biological pathways driving the risk of ALS
We calculated the contribution to ALS risk of 7296 gene sets and pathways listed in the Molecular Signature Database (fig. S1). This genome-wide analysis identified 13 biological processes, 12 cellular component pathways, and 2 molecular function pathways with a significant risk associated with ALS in the training data (table S1). We independently confirmed a significant association with ALS risk in 13 of these pathways in our replication cohort. These pathways included (i) seven biological processes, namely, neuron projection morphogenesis, neuron development, cell morphogenesis involved in differentiation, cell part morphogenesis, cellular component morphogenesis, cell development, and cell projection organization (Fig. 2A and Table 1); (ii) four cellular components, namely, autophagosome, cytoskeleton, nuclear outer membrane ER membrane network, and cell projection (Fig. 2B and Table 1); and (iii) two molecular function terms, namely, ribonucleotide binding and protein N-terminus binding (Fig. 2C and Table 1).
The Forest plots show polygenic risk score estimates in the replication cohort for the (A) biological processes, (B) cellular components, and (C) molecular function pathways that were significant in the training cohort. The blue squares represent the significant terms in both the training and replication datasets. The heatmaps depict semantic similarity calculated by GOSemSim among the significant (D) biological processes, (E) cellular components, and (F) molecular function. The REVIGO algorithm was used to obtain the cluster representatives. The Forest plot displays the distribution of beta estimates across pathways, with the horizontal lines corresponding to 95% confidence intervals. Beta estimates for the polygenic risk score are after Z transformation.
Pathways central to ALS risk
There is significant functional overlap among the 13 pathways that we identified as significantly associated with ALS risk. We sought to more broadly summarize the significant pathways by removing redundant terms. To do this, we computed semantic similarity that is a measure of the relatedness between gene ontology terms based on curated literature. We used the REVIGO algorithm to obtain cluster representatives (11). Overall, our results resolved into three central pathways as being involved in the pathogenesis of ALS, namely, neuron projection morphogenesis, membrane trafficking, and signal transduction mediated by ribonucleotides (Fig. 2, D to F, and fig. S2).
Pathway analysis among patients carrying the pathogenic C9orf72 repeat expansion
We found that the C9orf72 gene was a member of 2 of our 13 significant pathways, namely, the autophagosome and cytoskeleton pathways. We explored whether C9orf72 was the main driver of these pathways. To do this, we calculated the polygenic risk score associated with these two pathways in C9orf72 expansion carriers compared to healthy individuals, and non-C9orf72 carriers compared to healthy individuals. These cohorts consisted of 666 patients diagnosed with ALS who were C9orf72 expansion carriers, 7040 patients with ALS who were noncarriers, and 34,232 healthy individuals.
Our analysis revealed that the cytoskeleton pathway remained significantly associated with ALS risk in C9orf72 expansion carriers and noncarriers. This finding indicated that this critical biological process is broadly involved in ALS’s pathogenesis (Fig. 3B). In contrast, only C9orf72 expansion carriers showed significant risk in the autophagosome genes (Fig. 3A), indicating that the C9orf72 locus mostly drives this pathway’s involvement in the pathogenesis of ALS and points to an autophagy-related mechanism underlying C9orf72 pathology.
The polygenic risk scores associated with (A) autophagosome and (B) cytoskeleton in ALS C9orf72 expansion carriers (n = 666) compared to healthy subjects (n = 34,232), and ALS noncarriers (n = 7040) compared to healthy subjects (n = 34,232) are shown in this figure. The upper panels depict the cumulative genetic risk score for each group. The lower panel shows the forest plots of the beta estimates with 95% confidence intervals. Beta estimates are based on the Z-score scale. Genetic risk score mean comparisons from the ALS-noncarriers group compared to the ALS-C9orf72 carriers via t test are summarized by asterisks, with * denoting a two-sided mean difference at P < 0.05.
ALS polygenic risk is due to genes other than known risk loci
We also examined the contributions of the five genetic risk loci known to be associated with ALS. These loci were reported in the most recent ALS GWAS (7) and included TNIP1, C9orf72, KIF5A, TBK1, and UNC13A (7). Rare variants in other known ALS genes were not included in the polygenic risk score analysis, as there was no evidence of association within those loci in the GWAS. To do this, we added these five loci as covariates in the analysis of the replication cohort. Our data show that autophagosome and cell projection were no longer significant. However, the other 11 pathways were still associated, suggesting that there are risk variants contributing to the risk of ALS within these 11 pathways that remain to be discovered (table S2).
Mendelian randomization nominates genes relevant to ALS pathogenesis
Most variants associated with a complex trait overlap with expression quantitative trait loci (eQTL), suggesting their involvement in gene expression regulation (12). We applied two-sample Mendelian randomization within the 13 significant pathways (shown in Table 1) to integrate summary-level data from a large ALS GWAS (7) with data from cis-eQTLs obtained from previous studies in blood (13) and brain (14–17). This approach identifies genes whose expression levels are associated with ALS because of a shared causal variant. We used multiple single-nucleotide polymorphisms (SNPs) belonging to the 13 significant pathways as instruments, gene expression traits as exposure, and the ALS phenotype as the outcome of interest (Fig. 4A). Our analyses identified six genes whose altered expression was significantly associated with the risk of developing ALS. These were ATG16L2, ACSL5, MAP1LC3A, MAPKAPK3, PLXNB2, and SCFD1 within blood (table S3). In addition, SCFD1 was significantly associated with ALS in brain-derived tissue (Fig. 4B). Supporting the veracity of our findings, SCFD1 variants have been previously associated with ALS risk in a large population study (6), and ACSL5 has recently been identified as an ALS gene in a multiethnic meta-analysis (18).
(A) Schematic representation of the parameters used for the analysis. (B) The Forest plot displays the beta estimates, with the 95% confidence intervals shown as horizontal error bars. The grid on the left indicates the pathway to which the gene belongs.
Cell types involved in the pathogenesis of ALS
We leveraged our large GWAS dataset to determine which cell types participate in the pathological processes of ALS. To do this, we generated a single-nucleus RNA sequencing (sNuc-seq) dataset using the human frontal cortex collected from 16 healthy donors. Each cell was assigned to 1 of 34 specific cell types based on the clustering of the sNuc-seq data (Fig. 5, A and B). We then determined a decile rank of expression for the 34 cell types based on the specificity of expression. For instance, the TREM2 gene is highly expressed only in microglia. Thus, the specificity value of TREM2 in microglia is close to 1 (0.87), and it is assigned to the 10th decile for this cell type. In contrast, the POLR1C gene is expressed widely across tissues. Consequently, it has a specificity value of 0.007, and it is assigned to the fourth decile of microglia and a similar low decile across other cell types.
(A) Unsupervised UMAP clustering identifies 34 cell types in the human cortex. (B) Heatmap representing the gene expression per cluster. (C) The y axis corresponds to the phenotypic variance explained by polygenic risk score (pseudo-R2), and the x axis depicts deciles 1 to 10. The color pictures show the significant cell types and the P values of the linear regression fit models. The gray pictures show the cell types that were not significantly associated with the disease. The regression line depicts the association between the variance explained by polygenic risk score (pseudo-R2, adjusted by prevalence) and the specificity decile in each cell type. The gray shading shows the 95% confidence interval of the regression model. AST, astrocyte; EC, endothelial cell; ExN, excitatory neuron; InN, inhibitory neuron; MGL, microglia; ODC, oligodendrocyte; OPC, oligodendrocyte precursor.
The premise of this type of analysis is that, for a cell type associated with a disease, more of the variance explained by the polygenic risk score estimates will be attributable to the genes more highly expressed in that cell type. To test this hypothesis, we applied linear regression models to detect a trend of increased variance with the top deciles, a pattern indicating that a particular cell type is involved in the pathogenesis of ALS (10). This approach identified two subtypes of cortical GABAergic interneurons (PVALB- and TOX-expressing neurons, and ADARB2- and RELN-expressing neurons) and oligodendrocytes (OPALIN-, FCHSD2-, and LAMA2-expressing oligodendrocytes) as associated with ALS risk (Fig. 5C).
To confirm these findings, we used an independent dataset consisting of droplet single-nucleus RNA-seq (DroNc-seq) from the human prefrontal cortex and hippocampus obtained from five healthy donors (19). Our modeling in this second human dataset confirmed our previous findings: PVALB-expressing GABAergic neurons and oligodendrocytes were significantly enriched in ALS risk (see fig. S3).
To explore whether our main cell type findings were reproducible across other brain areas vulnerable to ALS pathology, we used an independent dataset consisting of snRNA-seq data from the human primary motor cortex (20). Our analysis identified several subtypes of primary motor cortex cell types that were associated with ALS. These included nine subtypes of GABAergic interneurons [clusters InN.38 (PVALB- and TOX-expressing neurons), InN.8, InN.14, InN.24, InN.32 (ADARB2- and RELN-expressing neurons), InN.4, InN.15, InN.20, and InN.21] and oligodendrocytes (cluster OCD.11, OPALIN-expressing oligodendrocytes). In addition, oligodendrocyte precursor cells (cluster OPC.37) and glutamatergic neurons (cluster ExN.25) were implicated with ALS within this dataset (Fig. 6).
(A) Unsupervised UMAP clustering identifies 40 cell types in the human primary motor cortex. (B) The y axis corresponds to the phenotypic variance explained by the polygenic risk score (pseudo-R2), and the x axis depicts deciles 1 to 10. The color pictures show the significant cell types and the P values of the linear regression fit models. The gray pictures show the cell types that were not significantly associated with the disease. The regression line depicts the association between the variance explained by the polygenic risk score (pseudo-R2, adjusted by prevalence) and the specificity decile in each cell type. The gray shading shows the 95% confidence interval of the regression model.
Last, we attempted to replicate our findings in a well-validated dataset based on single-cell RNA-seq data obtained from mouse brain regions (10). The advantage of this nonhuman dataset is that it is based on single-cell RNA-seq, a difficult technique to apply to human neurons, but which captures transcripts missed by sNuc-seq that may be important for neurological disease (10). Like the human data, cortical parvalbuminergic interneurons again showed enrichment in ALS risk using the mouse dataset (see fig. S4). Oligodendrocytes were not significantly associated with ALS in the mouse datasets. One plausible explanation is that the genes specifically expressed in human oligodendrocytes overlap with the genes related to human neurological disease, but these genes are not enriched in the mouse oligodendrocytes (21).
As spinal cord degeneration is a hallmark of ALS, we attempted to replicate our findings in a mouse lumbar spinal cord single-nucleus dataset (22). This approach implicated three cell types in the pathogenesis of ALS: GABAergic interneurons (InN.13), astrocytes, and dorsal root ganglion neurons (see fig. S5). Similar to before, GABAergic interneurons were enriched within the mouse spinal cord dataset.
DISCUSSION
A striking aspect of our analysis is that it identified a relatively small number of biological pathways as central to the pathogenesis of ALS. Considering the clinicopathological and genetic heterogeneity across ALS, the finding of such a small quantity of universal themes is unexpected. Our results illustrate how multiple unrelated genetic causes can lead to a similar downstream outcome, namely, motor neuron degeneration. Unraveling how disruption of these three fundamental biological processes predisposes to ALS may yield therapeutic targets that are effective across all patients with ALS.
The importance of membrane trafficking in ALS has been widely reported (23). In contrast, although neuronal outgrowth has been explored in ALS (24), our identification of genetic risk underlying neuronal morphogenesis was previously unknown. The combination of membrane trafficking and neuronal morphogenesis may be a driving force of the disease pathogenesis. The defining feature of motor neurons is the length of their axons, projections that require specialized long-range transport and efficient cytoskeletal dynamics to maintain synaptic connections (25). Similarly, signal transduction mediated by ribonucleotides is a broad term encompassing ion channel transport that regulates signal transmission at synapses. Disruption of this process leads to hyperexcitability, a phenomenon that has been observed in patients with ALS (26). We speculate that broadly expressed genes lead to selective damage due to the high reliance of motor neurons on cellular transport, morphogenesis, and axonal ion channels compared to other cell types.
Our data did not detect biological pathways that have been previously implicated in the pathogenesis of familial ALS, such as nucleocytoplasmic transport (27) and excitotoxicity (28). These cellular processes may only operate in specific genetic forms of ALS, such as C9orf72– or SOD1-related cases. A more likely explanation is that rare and low-frequency variants not captured by our methodology significantly contribute to those pathways. For this reason, we cannot rule these biological processes out as relevant to the pathogenesis of ALS. Future analyses of more substantial datasets that include whole-genome sequencing data may implicate them.
One of our study’s strengths is that we could distinguish differential pathways operating in C9orf72 expansion carriers versus noncarriers. The autophagosome pathway was only significant in the analysis of the C9orf72 expansion carriers. The C9orf72 protein is a known regulator of autophagy; hence, it is not unexpected that a higher burden of ALS genetic risk was found within autophagy genes in C9orf72 expansion carriers versus noncarriers. This is the first time that autophagy-related processes have been implicated in C9orf72 biology from a genetic perspective. The hexanucleotide repeat expansion is known to influence the C9orf72 gene expression, irrespective of reported biology involving dipeptide repeats and toxic RNA species arising directly from the repeat expansion (29), which reinforces the importance of our findings. The C9orf72 protein was also recently found to play a role in neuronal and dendritic morphogenesis in ALS by promoting autophagy (30).
Our rigorous approach using multiple human and mouse transcriptomic data identified GABAergic interneurons and oligodendrocytes as the cell types central to ALS. These findings are consistent with published literature. For example, alteration in inhibitory signaling through GABAergic interneurons contributes to neural hyperexcitability, an early event in ALS pathogenesis (26). Oligodendrocytes from sporadic and familial SOD1 ALS exert a harmful effect on motor neurons by secreting toxic factors (31). Although these cell types were previously linked with toxicity in ALS, our study indicates that oligodendrocytes incorporate a significant proportion of ALS genetic risk. This initial finding supports the idea that these cells directly contribute to the disease pathogenesis rather than merely playing a secondary role in the disease progression.
Our results show the power of data-driven approaches to nominate aspects of the nervous system for additional scrutiny. Nevertheless, our study has limitations. Although we analyzed 78,500 individuals in the current study, our power to detect pathways remains limited. This lack of power primarily stems from the genetic architecture of ALS, which is known to conform to the rare disease–rare variant paradigm (32). By design, our pathway analysis focuses on common variants with a frequency greater than 1%, but we know that the contribution of common variants to ALS risk is modest (6). Furthermore, our approach is based on intragenic variants, although intergenic mutations can affect gene expression. We have overcome this power limitation by performing multiple rounds of replication in both the pathway analysis and cell type analysis to ensure accuracy and validity. The detected pathways and cell types represent potent aspects of the ALS disease process, but additional critical cellular mechanisms will undoubtedly be found using more extensive datasets. In addition, the datasets used in this study are from individuals of European ancestry, meaning that caution is required in generalizing to other populations.
Another limitation faced by the pathway analysis field, in general, is the lack of accurate and complete databases that genuinely capture the complexity of the neurobiology. We have used the Molecular Signatures Database to define the pathways in our analysis, although this collection is incomplete for neuronal and glial pathways. As our understanding evolves and more single-cell expression datasets become available, it may be worthwhile to reevaluate our GWAS data periodically. To facilitate this, we have made the programming code needed to perform the analysis publicly available. We also created an interactive, online resource that enables the research community to explore the contribution of pathways and cell types to ALS risk (https://lng-nia.shinyapps.io/ALS-Pathways/).
In conclusion, we demonstrate the utility of data-driven approaches to dissect the molecular basis of complex diseases such as ALS. Our stringent approach points to neuron projection morphogenesis, membrane trafficking, and signal transduction mediated by ribonucleotides as primary drivers of motor neuron degeneration in ALS. It also nominates cortical GABAergic interneurons and oligodendrocytes as central to the pathogenesis of this fatal neurological disease.