Using MAGMA on ME/CFS genetic data

forestglip

Moderator
Staff member
Posts moved from: Eccentric medium spiny neuron (eMSN)


I ran FUMA cell type analysis on all of the datasets generated from the Siletti 2023 brain study.

Data
  • Cell types dissected from 105 brain regions
  • 3172 total cell type/region combinations were tested with MAGMA using DecodeME GWAS-1 data.
    • 1135 level 1 cell types (General cell types, like "neuron" or "astrocyte")
    • 2037 level 2 cell types (More specific cell types, like "Upper_layer_intratelencephalic" or "Eccentric_medium_spiny_neuron")
Results
  • 24 cell types from the 3172 tests were significant after Bonferroni correction at P.adj<.05.
    • All of these were "neuron" from different areas of the brain.
  • No level 2 cell types were significant after Bonferroni correction
    • The level 2 cell types with the lowest p-values are "Upper_layer_intratelencephalic" from several regions of the brain.

DatasetDissectionCell_typeNGENESBETABETA_STDSEPP.adj.pdsP.adj
33_Siletti_CerebralCortex.MFG.A46_Human_2022_level1Cerebral cortex (Cx) - Middle frontal gyrus (MFG) - A46neuron166490.029850.0620670.00609174.84E-074.84E-060.0015342964
45_Siletti_CerebralNuclei.CEN_Human_2022_level1Amygdaloid complex (AMY) - Central nuclear group - CENneuron168560.0288320.0622750.00610371.17E-061.29E-050.0037055304
18_Siletti_CerebralCortex.SCG.A25_Human_2022_level1Cerebral cortex (Cx) - Subcallosal Gyrus (SCG) - Subgenual (subcallosal) division of MFC - A25neuron168230.0267290.0578930.00575561.72E-061.72E-050.0054663076
110_Siletti_Thalamus.LNC.VA_Human_2022_level1Thalamus (THM) - Lateral nuclear complex (LNC) - ventral anterior nucleus of thalamus - VAneuron163130.0234750.0546520.00516582.78E-063.05E-050.0088064236
12_Siletti_CerebralCortex.STG_Human_2022_level1Cerebral cortex (Cx) - Superior Temporal Gyrus - STGneuron166990.0250610.051490.00552082.84E-063.13E-050.0090154584
28_Siletti_CerebralCortex.AG.LEC_Human_2022_level1Cerebral cortex (Cx) - Anterior parahippocampal gyrus (AG) - Lateral entorhinal cortex - LECneuron168110.0252930.0564140.00563563.62E-063.62E-050.0114791508
53_Siletti_CerebralNuclei.GP.Gpi_Human_2022_level1Basal nuclei (BN) - Globus pallidus (GP) - Internal segment of globus pallidus - Gpineuron163420.0417670.0530060.00936574.14E-064.14E-050.0131244672
30_Siletti_CerebralCortex.ACC_Human_2022_level1Cerebral cortex (Cx) - Anterior cingulate cortex - ACCneuron167760.0253080.05620.00567684.16E-064.16E-050.0132056704
15_Siletti_CerebralCortex.Ig_Human_2022_level1Cerebral cortex (Cx) - Short insular gyri - Granular insular cortex - Igneuron167740.0264810.0569710.00601875.45E-065.45E-050.0173029428
31_Siletti_CerebralCortex.FuGt.TF_Human_2022_level1Cerebral cortex (Cx) - Occipitotemporal (fusiform) gyrus, temporal part (FuGt) - Temporal area TFneuron166620.0259820.0536190.00590655.48E-065.48E-050.0173768504

DatasetDissectionCell_typeNGENESBETABETA_STDSEPP.adj.pdsP.adj
82_Siletti_Midbrain.SC_Human_2022_level2Midbrain (M) - Superior colliculus and nearby nuclei - SCUpper_layer_intratelencephalic166190.0343770.0670730.00919689.31E-050.0021416680.295363952
81_Siletti_Midbrain.IC_Human_2022_level2Midbrain (M) - Inferior colliculus and nearby nuclei - ICUpper_layer_intratelencephalic167050.0232460.0542820.00644680.000156130.003278730.49524436
84_Siletti_Midbrain.PAG-DR_Human_2022_level2Midbrain (M) - Periaqueductal gray and Dorsal raphe nucleus - PAG-DRUpper_layer_intratelencephalic163660.0345050.0608230.009750.000201430.004230030.63893596
30_Siletti_CerebralCortex.ACC_Human_2022_level2Cerebral cortex (Cx) - Anterior cingulate cortex - ACCUpper_layer_intratelencephalic167760.0344490.081630.0100480.000304420.005783980.96562024
18_Siletti_CerebralCortex.SCG.A25_Human_2022_level2Cerebral cortex (Cx) - Subcallosal Gyrus (SCG) - Subgenual (subcallosal) division of MFC - A25Upper_layer_intratelencephalic168230.0367540.0844330.0108060.000336190.00672381
26_Siletti_CerebralCortex.LiG.V1C_Human_2022_level2Cerebral cortex (Cx) - Lingual gyrus (LiG) - Primary Visual Cortex - V1CUpper_layer_intratelencephalic165800.0343390.0717810.010570.000580710.009291361
12_Siletti_CerebralCortex.STG_Human_2022_level2Cerebral cortex (Cx) - Superior Temporal Gyrus - STGUpper_layer_intratelencephalic166990.0303580.0654240.00954710.000738330.01476661
31_Siletti_CerebralCortex.FuGt.TF_Human_2022_level2Cerebral cortex (Cx) - Occipitotemporal (fusiform) gyrus, temporal part (FuGt) - Temporal area TFUpper_layer_intratelencephalic166620.0326320.071130.0103580.000816310.014693581
83_Siletti_Midbrain.PTR_Human_2022_level2Midbrain (M) - Pretectal region - PTRUpper_layer_intratelencephalic166920.01920.0445680.00611020.000839830.018476261
15_Siletti_CerebralCortex.Ig_Human_2022_level2Cerebral cortex (Cx) - Short insular gyri - Granular insular cortex - IgUpper_layer_intratelencephalic167740.0325120.0729030.0103730.000862960.015533281

DatasetDissectionCell_typeNGENESBETABETA_STDSEPP.adj.pdsP.adj
38_Siletti_CerebralCortex.PRG.A35-A36_Human_2022_level2Cerebral cortex (Cx) - Perirhinal gyrus (PRG) - A35-A36Eccentric_medium_spiny_neuron167320.0308490.0655710.011420.00345710.06568491

Dataset: Dataset name
Dissection: Dissection associated with dataset name, from https://github.com/tanyaphung/scrnaseq_viewer/blob/main/resources/fumacelltype_datasets_master.csv
Cell_type: Cell type name
NGENES (v1.07)/OBS_GENES (v1.06): Number of genes used in the analysis.
BETA: Effect size
BETA_STD: Standardised effect size
SE: Standard error
P: P-value
P.adj.pds: Adjusted P-value per dataset
P.adj: Adjusted P-value across dataset

I believe the gene sets that tralfamadorian97 used are clusters defining cell types, but they aren't based on a specific location of the brain. There were 461 clusters in tralfamadorian's analysis, but 3172 tests in this region-specific analysis. So maybe when eMSNs from many parts of the brain were combined into a small number of clusters, it made the values less noisy and thus more significant.

Edit: I attached the results of the analysis.
 

Attachments

Last edited:
Another factor that may explain some of the discrepancies is how the RNAseq matrices are transformed.

A key step in applying MAGMA via RNAseq data is transforming the RNAseq matrix [whose (i,j) entry tells use how much RNA for gene i was detected in cell type j] into a specificity matrix [whose (i,j) entry tells us how specific gene i is for cell type j]. There are competing approaches, and different approaches can produce different MAGMA results.

- I used the specificity matrix generated by Duncan et al. for their paper. In the Duncan paper (see section "Gene Expression Data"), they describe producing the specificity matrix by log-transforming the RNSAseq matrix, then row-normalizing.

- I believe FUMA may use a different transformation. According to their documentation, FUMA log-transforms and adds an auxiliary average column as a control, but does not row-normalize.

To test this, we would want to fix the version of the Siletti dataset and compare different specificity matrix preparation strategies.
 
- I used the specificity matrix generated by Duncan et al. for their paper. In the Duncan paper (see section "Gene Expression Data"), they describe producing the specificity matrix by log-transforming the RNSAseq matrix, then row-normalizing.

- I believe FUMA may use a different transformation. According to their documentation, FUMA log-transforms and adds an auxiliary average column as a control, but does not row-normalize.
I'm going to have to read the Watanabe paper again to be sure I understand, but I wanted to get my thoughts out about this.

So I think I've been thinking about the MAGMA ranking problem wrong. I don't think p-values are comparable across datasets, or really even within a dataset, because of these techniques of specificity or controlling for the average. I think the goal of these analyses is just to identify reliable and unique MAGMA associations, but without any implication that you can compare the strengths of the associations. And I think this may partly explain why the boring "neuron" is so strongly significant, even more so than any of the more interesting specific cell types.

If I understand the procedure correctly: when the regression is done between expression of genes in a cell type and gene scores from GWAS, it is controlled for average expression of the genes in all cell types in the dataset. If we imagine that there is a real association with neurons but not astrocytes, this means if your dataset only has, say, 5 types of neuron, then you'd be controlling for neurons while testing for an association with neurons, so the p-value will not be very significant. If your dataset has one type of neuron and four types of astrocytes, then you'd be controlling mainly for astrocyte expression, which will not fit the GWAS scores well, and thus the neuron assocation can still be significant.

Going back to the actual FUMA analysis I did, here are the cell types included in the dataset that the most significant neuron finding came from:
From 33_Siletti_CerebralCortex.MFG.A46_Human_2022_level1:

* neuron
oligodendrocyte_precursor_cell
astrocyte
oligodendrocyte
pericyte
central_nervous_system_macrophage
endothelial_cell
fibroblast
vascular_associated_smooth_muscle_cell
leukocyte
Quite a wide assortment, where only one cell type out of ten is a neuron. The average expression across all these cell types will be nothing like neuron expression, and neuron can be very significant.

On the other hand, here are the cell types found in the dataset that the most significant specific neuron type (Upper_layer_intratelencephalic) came from:
From 82_Siletti_Midbrain.SC_Human_2022_level2:

* Upper_layer_intratelencephalic
Miscellaneous
* Thalamic_excitatory
* CGE_interneuron
* Deep_layer_intratelencephalic
* LAMP5_LHX6_and_Chandelier
* MGE_interneuron
Astrocyte
* Midbrain_derived_inhibitory
Oligodendrocyte
Oligodendrocyte_precursor
* Deep_layer_corticothalamic_and_6b
* Splatter
* Cerebellar_inhibitory
Committed_oligodendrocyte_precursor
* Upper_rhombic_lip
Ependymal
* Amygdala_excitatory
* Deep_layer_near_projecting
Microglia
Vascular
Choroid_plexus
Fibroblast
Assuming I correctly labeled all the neuron cell types in this dataset with a star, then 13/23 cell types in this dataset are neurons. The average expression that is controlled for will be much more "neuron-like" than in the other dataset, and will weaken the assocation with any given neuron cell type.

Even other level 2 datasets differ among each other. "10_Siletti_CerebralCortex.PaO.A43_Human_2022_level2" contains 18 total cell types, unlike 23 for the above dataset, and this will affect the average expression that is being controlled for.

All this is to say, it seems to me that if we want to zoom in on the potentially most interesting cell types, we don't want any of this controlling for average expression or specificity. Just a measure of the association of a cell type's expression with GWAS scores. This seems closer to making ranking of cell types actually mean something. I think MAGMA outputs a standardized beta, which might work for comparing between cell types. There may be an issue of if a cell type has outlier genes with extremely high expression, which I think might skew the standardized beta down, making a cell type look less interesting than it should look. In which case, maybe the expression values could be converted to ranks to make it robust?
 
I'm not able to comment on the new thread that discusses MAGMA (It says: "You have insufficient privileges to reply here.")
Using MAGMA on ME/CFS genetic data | Science for ME EDIT: Fixed now!

But having downloaded the FUMA processed files here, I think Forestglip is right.

FUMA calculated gene expression for each of the 31 supercluster cell types per dissection and then added a covariate to the MAGMA regression with the average gene expression per dissection. So the genes that stand out for a supercluster like 'eccentric medium spiny neuron' are relative to the brain region of the dissection, not the entire brain.

The approach by Duncan et al. 2025 likely used the 461 cell clusters and calculated relative expression of genes in each cluster compared to the entire Siletti dataset. I think this is the more interesting analysis for DecodeME.
 
Last edited:
Forestglip's FUMA cell type analysis of Siletti datasets level 2 has 2,037 results, of which 52 are for eMSN, exactly the same as for Paolo.
So those are the 31 superclusters measured in each dissection separately.

It's strange that eMSN stand out much more in Paolo's analysis but I think this is due to the addition of the MVP dataset. If I take the most significant result from Paolo's analysis, I got (my bolding):

Dataset "47_Siletti_CerebralNuclei.Cla_Human_2022_level2"
Cell_type "Eccentric_medium_spiny_neuron"
BETA "0.045468"
SE "0.01053"
P "7.9263e-06"
P_bon "0.01650256"
P_bh "0.01650256"
BETA_DME_1 "0.026689"
SE_DME_1 "0.010881"
P_DME_1 "0.0070937"
P_bon_DME_1 "1"
P_bh_DME_1 "0.3197161"
BETA_MVP "0.022589"
SE_MVP "0.010388"
P_MVP "0.014844"
P_bon_MVP "1"
P_bh_MVP "0.7023911"
P_HEAL2 "0.1347193"
P_bon_HEAL2 "1"
P_bh_HEAL2 "0.9212313"
GeneRatio "11/72"
BgRatio "893/8444"
FoldEnrichment "1.444631"
genes "ENSG00000119772/ENSG00000221914/...

This corresponds to forestglips analysis:
region "Siletti_CerebralNuclei.Cla"
cell_type "Eccentric_medium_spiny_neuron"
beta "0.025722"
beta_std "0.053982"
p "0.0070827"

There are slight differences but the p-value Paolo found for DecodeME alone (p = 0.007) is the same as what forestglip found in the file 'magma_celltype_47_Siletti_CerebralNuclei.Cla_Human_2022_level2.gsa.out'

The large MVP dataset also had a low p-value (0.0148) for eMSN, so that's probably why Paolo's meta-analysis had stronger results for eMSN.
 
The approach by Duncan et al. 2025 likely used the 461 cell clusters and calculated relative expression of genes in each cluster compared to the entire Siletti dataset. I think this is the more interesting analysis for DecodeME.
I agree that it probably more clearly highlights the more important cell types, but I think there are still issues if controlling for average expression across clusters.

We can imagine that eMSNs and glutamatergic neurons are both equally important. If 50 of the 461 clusters are similar to glutamatergic neurons but only 10 other clusters are similar to eMSNs, then the regression will condition out more of the signal for glutamatergic neurons, making it less significant just because of what kind of other clusters there are.

I don't know that that specific scenario is how it happened. Maybe it's the opposite, where eMSNs should be even more significant. The point is that the rankings are somewhat arbitrary, and I don't think this technique of controlling for average expression is optimal for identifying where the signal is strongest.
 
Was kind of scratching my head over why these analyses are controlling for ave expression.
I think it's because we kind of expect genes expressed in many tissues to significantly correlate to MAGMA scores, even if the tissues aren't actually important.

If, for example, the brain is important, then we might expect a gene highly expressed in brain, maybe like a serotonin receptor, to have a high GWAS p-value. But serotonin receptors are highly expressed in other places, like the small intestine. If there are enough similarities like this, then the small intestine will show up as significantly associated with GWAS scores.

So MAGMA controls for average expression of genes in all tissues, so that any given tissue is only significant if it associates with GWAS scores based on unique gene expression patterns.

Maybe that's good for if you don't want a false positive of small intestine. But it seems to me that if you instead don't add that covariate of average expression, then even if small intestine is spuriously significant, brain will still be more significant and have a larger effect size, suggesting it's a more promising possibility.
 
I suspect it's because without a covariate, pretty much all the signal will be for genes that are common in every cell, making it harder to notice the signals for genes that have more specific functions. It could be that it then picks up cells that don't have much specificity and mainly have genes involved in normal cell function.
 
Back
Top Bottom