Preprint Initial findings from the DecodeME genome-wide association study of myalgic encephalomyelitis/chronic fatigue syndrome, 2025, DecodeMe Collaboration

The main struggle seems to be either mapping the variants to rs ids or converting from GRCh38 to GRCh37, and it doesn't seem that either is a trivial task to do exactly right. I'm not sure we did the rs id mapping perfectly since we based it only on positions and not letters. Surprisingly hard to find good explanations.
It might not help with the practical problems (hard to know, because I don't understand any of it!), but there is a catalogue of GWAS studies at the European Bioinformatics Institute (the GWAS Catalog).

Thought I'd link it in case it hadn't already come up.
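
On the rs ID mapping problem mentioned above: a minimal sketch of matching on position and letters (alleles) rather than position alone. The reference table, the rs IDs and the function name are made up for illustration; a real table would come from a dbSNP-style file on the same genome build, and strand flips are not handled here.

```python
from typing import Dict, Optional, Tuple

# Hypothetical reference table: (chromosome, position, ref allele, alt allele) -> rs ID.
Reference = Dict[Tuple[str, int, str, str], str]


def lookup_rsid(ref_table: Reference, chrom: str, pos: int,
                a1: str, a2: str) -> Optional[str]:
    """Return an rs ID only if the position AND both alleles match.

    The two alleles are tried in both orders, because the effect allele
    in GWAS summary statistics is not always the reference allele.
    """
    for key in ((chrom, pos, a1, a2), (chrom, pos, a2, a1)):
        if key in ref_table:
            return ref_table[key]
    return None


# Two made-up variants at the same position with different alternate alleles.
reference: Reference = {
    ("1", 1000, "A", "G"): "rs_example_1",
    ("1", 1000, "A", "T"): "rs_example_2",
}

print(lookup_rsid(reference, "1", 1000, "A", "T"))
print(lookup_rsid(reference, "1", 1000, "T", "A"))
print(lookup_rsid(reference, "1", 1000, "A", "C"))
```

A position-only lookup could not tell the two example variants apart. Here the first two calls both resolve to rs_example_2 and the third returns None, because no variant with those letters exists at that position.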

 
I’ve gained a whole new level of appreciation for what you and other people in bioinformatics do.…
I appreciate it, though I’ll be honest it’s partially a mess of our own making. Not a month goes by where I don’t witness a PhD bioinformaticist doing something utterly baffling like hardcoding 16 separate subsets of one data table instead of creating one variable to split the table by…(and I wish that was an exaggeration not a real example)
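
To illustrate the point about splitting a table by one variable, here is a rough sketch with made-up rows and column names (none of this is from the real example):

```python
from collections import defaultdict
from typing import Dict, List

# Made-up table: each row is a dict, and "group" is the splitting variable.
rows: List[dict] = [
    {"sample": "s1", "group": "A", "value": 1.0},
    {"sample": "s2", "group": "B", "value": 2.5},
    {"sample": "s3", "group": "A", "value": 0.7},
    {"sample": "s4", "group": "C", "value": 3.1},
]


def split_by(table: List[dict], column: str) -> Dict[str, List[dict]]:
    """Split a table into one subset per distinct value of `column`."""
    subsets: Dict[str, List[dict]] = defaultdict(list)
    for row in table:
        subsets[row[column]].append(row)
    return dict(subsets)


subsets = split_by(rows, "group")
for name, subset in sorted(subsets.items()):
    print(name, len(subset))
```

The same function works whether the column has 3 distinct values or 16, which is the whole point: nothing needs to be hardcoded per subset.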
 
For those with strength, courage and coding skills:

I've noticed that the Dutch authors of MAGMA have a new method called FLAMES. It combines multiple approaches to finding the effector gene within a significant GWAS locus using a machine-learning framework.
Prioritizing effector genes at trait-associated loci using multimodal evidence | Nature Genetics
GitHub - Marijn-Schipper/FLAMES: FLAMES: Accurate gene prioritization in GWAS loci

Recently, integrative methods have been developed that aim to combine many different levels of functional data to predict the effector gene in a GWAS locus [1–4]. There are two main strategies to do so. The first strategy prioritizes genes using locus-based SNP-to-gene data. Examples of this are chromatin interaction mapping, quantitative trait loci (QTLs) mapping or selecting the closest gene to the lead variant. These annotations can then be combined by linear regression or machine learning to merge several types of SNP-to-gene data into a single prediction [1,3]. The second strategy assumes that all GWAS signal converges into shared, underlying biological pathways and networks. These methods prioritize genes in a locus based on gene features enriched across the entire GWAS [4]. However, no current method leverages these two strategies together to make well-calibrated predictions of the effector genes in a locus. We designed a new framework, called FLAMES. This framework integrates SNP-to-gene evidence and convergence-based evidence into a single prediction for each fine-mapped GWAS signal.

1755972195915.png

If I understand correctly, the distance to the gene is still one of the most important pieces of information.
In our benchmarks, we found that the only competitive method to FLAMES is selecting the closest gene to lead SNP if it also has the highest PoPS in the locus. Generally, this produces slightly higher precision (3.1% across all benchmarks) than FLAMES would yield, without the need for running the FLAMES annotation and scoring steps, reducing the computational complexity of gene prioritization. However, prioritizing genes using FLAMES at the recommended threshold will yield approximately 33.2% more correctly identified causal genes.
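
The simple baseline described in that quote (pick the closest gene to the lead SNP, but only if it also has the highest PoPS in the locus) can be sketched as below. This is only my reading of the rule, not code from the FLAMES repository; the `Gene` class, gene names and scores are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Gene:
    name: str
    distance_bp: int  # distance from the lead SNP to the gene (hypothetical units: base pairs)
    pops: float       # PoPS score; higher means stronger convergence-based evidence


def closest_gene_if_top_pops(genes: List[Gene]) -> Optional[str]:
    """Return the closest gene only when it also has the highest PoPS in the locus.

    Returns None when the two criteria disagree (or the locus has no genes),
    meaning the baseline makes no call for that locus.
    """
    if not genes:
        return None
    closest = min(genes, key=lambda g: g.distance_bp)
    top_pops = max(genes, key=lambda g: g.pops)
    return closest.name if closest.name == top_pops.name else None


# Invented loci: in the first the criteria agree, in the second they do not.
agreeing_locus = [Gene("GENE_A", 5_000, 1.8), Gene("GENE_B", 40_000, 0.3)]
conflicting_locus = [Gene("GENE_C", 2_000, 0.1), Gene("GENE_D", 90_000, 2.2)]

print(closest_gene_if_top_pops(agreeing_locus))
print(closest_gene_if_top_pops(conflicting_locus))
```

Because the rule abstains when the two criteria disagree, it makes calls in fewer loci, which fits the quote's trade-off of slightly higher precision against fewer correctly identified genes.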
 
This paper gives some data and background on eQTL not being as useful in identifying relevant genes as many anticipated.
The missing link between genetic association and regulatory function | eLife

Using diseases and genes where the mechanisms are relatively well understood, they found that GWAS hits fall in the regulatory region of those genes, as expected. Only a small fraction (around 8%) of those GWAS loci, however, show colocalization with eQTLs in the relevant tissue. In other words, the non-coding variants are real and likely regulatory, but current eQTL datasets and methods often fail to show which gene they regulate.

DecodeME mainly relied on eQTL colocalization to define Tier 1 genes, so I think it is worth exploring some other methods. I've mainly checked the closest protein-coding gene to the locus but there must be better methods (such as FLAMES above).
 
Something I’ve only really started to appreciate going through this process is quite what is being done and quite how clever and meaningful it is… and how those dismissing it are showing their own ignorance and how their ideas about genetics are decades out of date.

I’m still pretty ignorant; the complex statistics and biology are beyond me, but the arguments against what has been found just seem ludicrous if you spend a bit of time trying to understand what DecodeME has done. And will do.

I hadn’t appreciated quite how layered this is. Even after reading the paper and overviews. But just in what’s been done already we have:

GWAS - looks at SNPs (individual letter changes in the genome) and their individual significance, giving a p-value per genetic difference found
Gene based analysis - looks at the significance of genes, combining the p-values from all SNPs (all the genetic differences within and near a gene) to generate a p-value for that gene
Gene set analysis - looks at the significance of predefined groups of genes (a set, such as those known to be involved in a particular pathway), so generates a p-value for that set
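
To make the "combining the p-values from all SNPs" step concrete, here is a toy sketch using Fisher's method. This is a swapped-in textbook technique, not what the study's software does: Fisher's method assumes the SNPs are independent, whereas real gene-based tools have to account for correlation (LD) between nearby SNPs. The p-values are invented.

```python
import math
from typing import Sequence


def fisher_combined_p(p_values: Sequence[float]) -> float:
    """Combine independent p-values into one p-value with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null. For even degrees of freedom
    the upper tail has a closed form:
        P(chi2_2k > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    Every p-value must be strictly greater than 0.
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one p-value")
    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


# Invented SNP p-values for one hypothetical gene.
snp_p_values = [0.01, 0.01]
print(fisher_combined_p(snp_p_values))
```

With a single SNP the function just returns that SNP's p-value, and several modestly small p-values together give a smaller gene-level p-value than any one of them alone.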

So already there is a sort of movement from letters to words to sentences as we get closer to a meaning. Or thinking about layers and the treasure map analogy, perhaps a geological or mining way of looking at it may fit, searching for precious metals?

First we do an aerial view and see some signs, some formations on the surface which look interesting.
So then we go to that location and we get down and examine and analyse things and we find certain rock types which indicate where to go next.
So we start to dig, then once underground we find the different strata and seams which guide us in the right direction until…
Bam! We find what we’re after.

And those who said ‘oh well you didn’t find all the gold on the surface when you did that aerial reconnaissance’, well, they look silly don’t they.
 
For those with strength, courage and coding skills:

I've noticed that the Dutch authors of MAGMA have a new method called FLAMES. It combines multiple approaches to finding the effector gene within a significant GWAS locus using a machine-learning framework.
Looks interesting. I tried to see if I could do anything, but it's too much stuff I don't know how to do, like the part about creating credible set files.



In other news, based on a suggestion by @hotblack, I tried to use the UK Biobank reference panel for FUMA instead of the 1000 Genomes reference as I did before. It looks like that was the main reason my results were somewhat different from the paper's results.

Now the tissue enrichment is almost identical (first chart is DecodeME). In my previous post, the highest -log10 p-value was around 7, and now it's around 8.5 like the study. There are still two pairs of tissues that swapped positions, so it's not exactly the same, but the p-values are all now very close to the study's values.

1755986378991.png 1755986352491.png
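
For anyone not used to the -log10 scale on these charts, a quick conversion sketch (the function names are just for illustration):

```python
import math


def neg_log10(p: float) -> float:
    """Convert a p-value to the -log10 scale used on the enrichment charts."""
    return -math.log10(p)


def p_from_neg_log10(x: float) -> float:
    """Convert a -log10 value back to a plain p-value."""
    return 10.0 ** (-x)


before = p_from_neg_log10(7.0)   # roughly the top value in my earlier run
after = p_from_neg_log10(8.5)    # roughly the top value now
print(before, after, before / after)
```

So moving from about 7 to about 8.5 means the top p-value went from about 1e-7 to about 3e-9, roughly 30 times smaller.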

Here are the updated top ten gene sets:
1755986709965.png

Links to descriptions for these:

The first mention of synapse (GOCC_SYNAPTIC_MEMBRANE) moved down to rank 31 (out of 17,006 gene sets).

I also reran the cell-type analysis, testing the same brain region datasets as last time. Even more cell types are significant now!

There's something like a three-step process. First it shows all the cell-types that showed significant enrichment of the DecodeME genes:
1755988246645.png

Then it removes redundant cell-types from within a dataset if multiple cell-types from one dataset are very similar to each other:
1755988340614.png

Then it looks for redundant cell-types between different datasets. I don't really know how to interpret this, but if anyone wants to have a go, the FUMA tutorial describes this analysis:
1755988622720.png

It looks like it now includes neurons from two new areas of the cortex (and one from before is gone), GABAergic neuron from the cerebellum, neuron from white matter, neuron from cerebral nuclei, and many specific cells (mostly subtypes of excitatory neurons, and one subtype of oligodendrocyte) from the primary motor cortex.

But I think the last image is showing that a lot of these cell-types are very correlated with each other, so many non-interesting neurons might just be showing up because they're so similar to a cell-type of interest, not because they all play a part.

Edit: I probably wouldn't put too much stock in the top gene sets. When I plot all the p-values, the result looks like basically the uniform distribution you'd expect if almost all gene sets were not true effects. While there might be some real enriched gene sets in there, there are probably too many false positives to know which they are. Nothing is significant even with the less strict FDR threshold. It makes sense that they didn't report any significant gene sets.
1756001402358.png
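
For reference, the FDR idea mentioned in the edit above can be sketched with the Benjamini-Hochberg procedure. I'm assuming that is the kind of FDR correction meant; the p-values below are synthetic, not the real gene-set results.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is the minimum of
    # p * m / rank over its own rank and all larger ranks.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Synthetic "all null" case: 1000 evenly spaced p-values, i.e. a flat distribution.
m = 1000
uniform_p = [(i + 0.5) / m for i in range(m)]
uniform_q = benjamini_hochberg(uniform_p)
print(min(uniform_p), min(uniform_q))
```

In this flat case the smallest raw p-value looks impressive on its own (0.0005), but after correction nothing gets anywhere near a 0.05 threshold, which is the pattern described above.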
 
Looks interesting. I tried to see if I could do anything, but it's too much stuff I don't know how to do, like the part about creating credible set files.
I tried looking into this, and it seems like you can use FINEMAP to create the source file needed for credible sets. Then I ran into the problem of FINEMAP needing an LD matrix... which seems to come from a large file of people of European descent? This is all very far outside my wheelhouse. I'm surprised how fragmented all this software is; they weren't kidding when someone said these bioinformatics pipelines are all over the place.
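
As far as I understand it, the idea behind a credible set is simple once fine-mapping has given each SNP a posterior probability of being the causal one: take the most likely SNPs until their probabilities add up to the coverage you want. A rough sketch under that single-causal-variant assumption, with invented SNP names and probabilities (this is not the file format FLAMES or FINEMAP uses):

```python
from typing import Dict, List


def credible_set(posteriors: Dict[str, float], coverage: float = 0.95) -> List[str]:
    """Return the smallest set of SNPs whose posterior probabilities reach `coverage`.

    Assumes one causal variant in the locus, so the probabilities sum to about 1.
    If they never reach `coverage`, all SNPs are returned.
    """
    ranked = sorted(posteriors.items(), key=lambda item: item[1], reverse=True)
    chosen: List[str] = []
    cumulative = 0.0
    for snp, prob in ranked:
        chosen.append(snp)
        cumulative += prob
        if cumulative >= coverage:
            break
    return chosen


# Invented posterior probabilities for four SNPs in one hypothetical locus.
example_locus = {"snp_a": 0.60, "snp_b": 0.30, "snp_c": 0.06, "snp_d": 0.04}
print(credible_set(example_locus))
```

The hard part is not this step but getting the probabilities in the first place, which is where the LD matrix comes in.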
 
I tried looking into this, and it seems like you can use FINEMAP to create the source file needed for credible sets. Then I ran into the problem of FINEMAP needing an LD matrix... which seems to come from a large file of people of European descent? This is all very far outside my wheelhouse. I'm surprised how fragmented all this software is; they weren't kidding when someone said these bioinformatics pipelines are all over the place.
Yeah, I don't know where to start to find the right files.

It's all so interesting. It feels like there's so much hidden treasure in this data file of DNA, and all these free tools across the internet to analyze it. I'm just very lacking in the experience and energy departments, so most of it is frustratingly out of my reach, and I have to just wait for the smarter folks to give us more gems.
 
I am not going to be contributing for a couple of days. Basically Sonya is right. The study shows that ME/CFS picks out a real biological problem (or a cluster). The results are pretty much what we saw in the last advisory board. There are immune genes and nerve genes but there are also some unexpected things, which is good. There should be some mention of MHC but this turned out to be complicated and puzzling. I think it will prove relevant.

My understanding is that the main mitochondria-linked gene wouldn't explain "feeble mitochondria". If anything maybe the reverse, but I wonder if it may show that the metabolic clues we have had make sense in an unexpected way.

No doubt when I am on dry land you will have sorted it all out.
Can we say this for sure, though?

I understand that you can identify significantly different SNPs for an enormous variety of different groupings, e.g., socioeconomic status, and even political views, because there are always nonrandom patterns that determine who ends up in which category.
 
Can we say this for sure, though?

I understand that you can identify significantly different SNPs for an enormous variety of different groupings, e.g., socioeconomic status, and even political views, because there are always nonrandom patterns that determine who ends up in which category.

Yes, I think we can. If there are non-random patterns of gene variants that cause you to be in a group, that group represents the outcome of a real common biological process or cluster of processes. So low socioeconomic status is the real result of a real, partially genetically determined, set of processes.

It may not seem much of a step forward, but up until now a major proportion of the medical profession (along with the public) have taken the view that 'ME' (most have not even heard of ME/CFS) is an entirely bogus category arbitrarily allocated to people by themselves or others, like 'loser'. The results show that this isn't the case. The ME/CFS category defines the real adverse result of some genes (and other things) just as low economic status does.

I think the results will tell us much more than that, but until we have firmer evidence about exactly which genes are involved, which will hopefully come from rare allele studies, there are a lot of uncertainties.
 