Using MAGMA on ME/CFS genetic data

I'm hoping to make some expression files to do this based on either the 31 superclusters or 461 clusters, but not separated by dissection like FUMA does. I'm trying to base making these files on FUMA's scripts, using raw data for Siletti 2023 from CellxGene.
Perhaps the file provided by Duncan et al. 2025 paper might be useful? it gives a specificity score matrix file for Siletti et al 2023 atlas.
EDIT: so it's each cluster's gene expression relative to the total gene expression of the dataset. It doesn't work with a covariate.
 
Perhaps the file provided by Duncan et al. 2025 paper might be useful? it gives a specificity score matrix file for Siletti et al 2023 atlas.
EDIT: so it's each cluster's gene expression relative to the total gene expression of the dataset. It doesn't work with a covariate.
I was considering that, and I would guess it'd work fine that way too. It's basically what tralfamadorian did to identify independent signals. I like the idea of starting from real expression values though since it isn't affected by the composition of the rest of the brain.

Also, I think it might be good to use 31 superclusters instead of working with 461 clusters to have less multiple testing burden. Maybe just averaging the clusters into superclusters from Duncan would work.

I'm not sure what you mean by working with a covariate. I think you could still condition on various things with the specificity matrix.

Edit: I haven't thought about it very deeply, but maybe the specificity matrix gets you to more or less the same place as raw expression conditioned on average brain expression, which is what I was planning to do.
 
Last edited:
I'm not sure what you mean by working with a covariate
With covariate I meant a data column with the average gene expression in the dataset that you can choose to condition on or not. From what I remember, the Duncan et al. file only gives the relative gene expression in the dataset per cluster, so not the absolute gene expression for each cluster and for the total dataset separately.
 
With covariate I meant a data column with the average gene expression in the dataset that you can choose to condition on or not. From what I remember, the Duncan et al. file only gives the relative gene expression in the dataset per cluster, so not the absolute gene expression for each cluster and for the total dataset separately.
Got it, yeah, I don't think it's as necessary to condition on the whole dataset, since the values are already showing how much a gene is expressed in that cell type above and beyond the dataset in general, so we can already be somewhat confident that an association with a cell-type using this data isn't confounded by average expression in the dataset.

I think in theory, you could still just download something like GTEx bulk brain expression and attach it as another column to condition on.

Anyway, for the purpose of the interaction thing I was talking about, I think probably it'd work to follow the steps on that website you linked, which I think is what tralfamadorian did to identify 3 independent cell types, and then test interactions of each of those three cell types with a lot of gene sets to see if any significant interactions emerge. Maybe it's not worth doing all the pre-processing steps I was planning.
 
I'm hoping to see if I can more or less follow the advice of this paper about identifying important tissues/cell types: Conditional and interaction gene-set analysis reveals novel functional pathways for blood pressure (2018, Nature Communications, de Leeuw)
I'm probably not going to spend more time on that interaction idea. I did give it a try, and no interactions between any gene sets and upper intratelencephalic neurons were Bonferroni significant, but I'm not positive I did it right. The documentation in the paper and in the manual is too long and complicated for me to handle right now.
 
I finally managed to run MAGMA. My biggest stumbling block before was not being sure how to correctly add rsIDs to the variants, but I found out that the GWASLab software makes it easy.
I think the reason is that since that the dbSNP, which provides a database of rsID for SNPs, no longer provides a file for common variants since version 151.

So for their latest release, 157, you have to download and query a massive file vcf file that also includes rare variants and likely takes hours to match your GWAS summary data.

So what I think GWASlab does is to provide one of the smaller files with common variants from version 151 which can be downloaded here:
https://ftp.ncbi.nih.gov/snp/pre_build152/organisms/human_9606_b151_GRCh38p7/VCF/

Another option is to use MungeSumstats in R which includes the full VCF files up to version 155, but it also takes a long time to load it. This is what Paolo used for his meta-analysis.
 
A couple musings about how MAGMA might potentially be able to provide more interesting results:

1. Maybe it would be interesting to correlate MAGMA gene scores to protein levels instead of mRNA expression. It might be a better metric for how important a gene is to a cell, as the protein is the "final product". For example, maybe a lot of mRNA gets made from a particular gene, but only a small amount gets turned into the protein for some reason. I don't know if something like that happens, but it might be worth testing. I don't know if there are any protein expression datasets for various tissues or cell types that would be suitable for MAGMA.

2. MAGMA gene scores are made by combining p-values of SNPs within or near the genes. If there are 10 genes basically right on top of each other near a significant locus, this could lead to 9 out of 10 genes being spuriously significant just due to proximity to a causal gene.

So I wonder if it might help to do a weighted linear regression for the final test of association with expression, where each gene is weighted based on a "confidence" level in the gene. The confidence level would be determined based on how many genes are in the area. If there are 10 genes right next to each other, we basically wouldn't factor them in at all, because 9 out of 10 of them would have spuriously high gene scores. If one gene is not surrounded by any other genes, we can be pretty confident that it is the gene of interest, so it gets a lot of weight.

I think this would require changing the actual code of MAGMA, so it's not an easy thing to test.

The manual says they do somehow take the proximity between different genes into account, but this isn't a topic I'm familiar with, so I don't know if it achieves the same thing I'm describing:
The competitive gene-set analysis is implemented as a linear regression model on this gene- level data matrix, = 0 + + + , with the gene-set indicator variable and a matrix of covariates (such as gene size) to correct for. The residuals are modelled as multivariate normal with correlations set to the gene-gene correlations computed during the gene analysis. This is to account for the LD between genes in close proximity to each other, which would otherwise invalidate any statistical tests on the model.
 
Back
Top Bottom