A couple of musings on how MAGMA might give more interesting results:
1. It might be interesting to correlate MAGMA gene scores with protein levels instead of mRNA expression. Protein level might be a better measure of how important a gene is to a cell, since the protein is the "final product". For example, a gene might produce a lot of mRNA while only a small amount of it gets translated into protein for some reason. I don't know how often something like that happens, but it might be worth testing. I also don't know whether there are protein expression datasets across tissues or cell types that would be suitable for MAGMA.
2. MAGMA gene scores are computed by combining the p-values of SNPs within or near each gene. If 10 genes sit basically right on top of each other near a significant locus, 9 of the 10 could come out spuriously significant just because they are close to the causal gene.
So I wonder if it might help to do a weighted linear regression for the final test of association with expression, where each gene is weighted based on a "confidence" level in the gene. The confidence level would be determined based on how many genes are in the area. If there are 10 genes right next to each other, we basically wouldn't factor them in at all, because 9 out of 10 of them would have spuriously high gene scores. If one gene is not surrounded by any other genes, we can be pretty confident that it is the gene of interest, so it gets a lot of weight.
I think this would require changing the actual code of MAGMA, so it's not an easy thing to test.
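The weighting idea itself can at least be sketched outside MAGMA on toy data. The snippet below is only an illustration under my own assumptions, not anything MAGMA does: the function names `density_weights` and `weighted_regression`, the 500 kb window, and the rule "weight = 1 / number of genes in the window" are all made up for the example, and the gene positions, scores, and expression values are invented.

```python
import numpy as np

def density_weights(chrom, midpoint, window=500_000):
    """Hypothetical confidence weight: 1 / (genes within `window` bp on the
    same chromosome, the gene itself included)."""
    chrom = np.asarray(chrom)
    midpoint = np.asarray(midpoint, dtype=float)
    same_chrom = chrom[:, None] == chrom[None, :]
    close = np.abs(midpoint[:, None] - midpoint[None, :]) <= window
    # Pairwise comparison is O(n^2) in memory; fine for a toy example only.
    n_neighbours = (same_chrom & close).sum(axis=1)
    return 1.0 / n_neighbours

def weighted_regression(z, expression, weights):
    """Weighted least squares of gene score on expression.
    Returns (intercept, slope)."""
    z = np.asarray(z, dtype=float)
    expression = np.asarray(expression, dtype=float)
    sw = np.sqrt(np.asarray(weights, dtype=float))
    X = np.column_stack([np.ones_like(expression), expression])
    # Scaling rows by sqrt(weight) turns WLS into ordinary least squares.
    coef, *_ = np.linalg.lstsq(X * sw[:, None], z * sw, rcond=None)
    return coef[0], coef[1]

# Invented toy data: four genes clustered on chr1, one isolated gene on chr2.
chrom = ["1", "1", "1", "1", "2"]
midpoint = [1_000_000, 1_100_000, 1_200_000, 1_300_000, 5_000_000]
gene_z = [3.1, 2.9, 3.0, 3.2, 2.5]        # made-up gene scores
expression = [0.2, 1.5, 0.7, 0.1, 1.8]    # made-up expression values

weights = density_weights(chrom, midpoint)
# The four clustered genes get 0.25 each, the isolated gene gets 1.0.
intercept, slope = weighted_regression(gene_z, expression, weights)
print(weights, intercept, slope)
```

With this rule a cluster of 10 genes would not be dropped entirely; each gene would get a weight of 0.1, so the cluster as a whole counts about as much as one isolated gene. A stricter cutoff (weight 0 above some gene count) would be closer to "not factoring them in at all".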
The manual says they do somehow take the proximity between different genes into account, but this isn't a topic I'm familiar with, so I don't know if it achieves the same thing I'm describing:
The competitive gene-set analysis is implemented as a linear regression model on this gene-level data matrix, $Z = \beta_0 + S\beta_S + C\beta_C + \varepsilon$, with $S$ the gene-set indicator variable and $C$ a matrix of covariates (such as gene size) to correct for. The residuals $\varepsilon$ are modelled as multivariate normal with correlations set to the gene-gene correlations computed during the gene analysis. This is to account for the LD between genes in close proximity to each other, which would otherwise invalidate any statistical tests on the model.
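To get a feel for what a regression with correlated residuals does, here is a rough sketch of generalized least squares (GLS), which is the standard way to fit a linear model when the residual correlation matrix is known. This is my reading of the manual excerpt, not MAGMA's actual code: the function name `gls_fit`, the variable names, the toy correlation matrix `R` (three genes in one LD block with correlation 0.8), and all the numbers are assumptions for illustration.

```python
import numpy as np

def gls_fit(z, X, R):
    """GLS for z = X @ beta + e, with e ~ N(0, sigma^2 * R).
    Returns (beta, standard_errors)."""
    z = np.asarray(z, dtype=float)
    X = np.asarray(X, dtype=float)
    # Whiten with the Cholesky factor of R: inv(L) @ e has uncorrelated entries.
    L = np.linalg.cholesky(R)
    Xw = np.linalg.solve(L, X)
    zw = np.linalg.solve(L, z)
    beta, *_ = np.linalg.lstsq(Xw, zw, rcond=None)
    resid = zw - Xw @ beta
    dof = X.shape[0] - X.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(Xw.T @ Xw)
    return beta, np.sqrt(np.diag(cov))

# Invented toy data: 6 genes, the first 3 in one LD block (correlation 0.8).
R = np.eye(6)
R[:3, :3] = 0.8
np.fill_diagonal(R, 1.0)

S = np.array([1, 1, 0, 0, 1, 0], dtype=float)   # gene-set membership (made up)
C = np.array([1.0, 2.0, 0.5, 1.5, 3.0, 2.5])    # covariate, e.g. gene size (made up)
Z = np.array([2.8, 3.0, 2.6, 0.4, 1.9, 0.2])    # gene scores (made up)

X = np.column_stack([np.ones(6), S, C])
beta, se = gls_fit(Z, X, R)
print(beta, se)
```

In GLS, a block of highly correlated genes carries less independent information than the same number of uncorrelated genes, so it is at least similar in spirit to the weighting idea above. I'm not sure it is equivalent, though: it corrects the test for correlated gene scores, but it does not say which gene in a cluster is the causal one.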