Preprint Initial findings from the DecodeME genome-wide association study of myalgic encephalomyelitis/chronic fatigue syndrome, 2025, DecodeMe Collaboration

Re: ancestry.

From the FAQs,

"Why did you only use DNA from participants of European ancestry in this initial analysis?

This is because we had to closely match the ancestry of the study samples with those of the control samples from the UK Biobank, which were largely of European ancestry. This was to be sure that the differences we are identifying are more likely to be because of ME/CFS, and not because of differences in ancestry. An ongoing analysis uses all study samples from all ancestries."

From the supplementary methods,

"We inferred major group ancestries (African, American, East Asian, European, and South Asian) by PCA projection of DecodeME samples onto the 1000 Genome reference population samples (15) using bigsnpr (17) (Fig. S8). We completed a second round of PCA on the European ancestry cohort to calculate relatedness and principal components for downstream analyses. We ran the implementation of KING used in PLINK 2.0, on the subset of variants (‘in_Relatedness’) used by UKB that were present in DecodeME’s data (68,453 SNPs). This was because not subsetting the SNPs used would have severely inflated relatedness estimates (14)."

To translate, the ancestry was calculated from your DNA data, not in any other way.
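For anyone wondering what "calculated from your DNA" could look like in practice, here is a rough sketch. It is not the DecodeME or bigsnpr code: it assumes the principal-component (PC) coordinates have already been computed, and it uses a simple nearest-centroid rule that I picked for illustration. The coordinates, the function names and the `max_distance` cut-off are all made up.

```python
import math


def centroids(reference):
    """Average the PC coordinates of each labelled reference group."""
    result = {}
    for label, points in reference.items():
        dims = len(points[0])
        result[label] = tuple(
            sum(p[d] for p in points) / len(points) for d in range(dims)
        )
    return result


def assign_ancestry(sample_pcs, group_centroids, max_distance=None):
    """Return the label of the closest centroid, or None if it is too far away."""
    label, distance = min(
        ((lab, math.dist(sample_pcs, c)) for lab, c in group_centroids.items()),
        key=lambda pair: pair[1],
    )
    if max_distance is not None and distance > max_distance:
        return None  # sample does not sit clearly inside any reference group
    return label


# Hypothetical 2-D PC coordinates for three reference groups (illustration only).
reference = {
    "European": [(0.0, 0.0), (0.2, 0.1)],
    "African": [(5.0, 0.0), (5.2, -0.1)],
    "East Asian": [(0.0, 5.0), (0.1, 5.2)],
}
group_centroids = centroids(reference)
print(assign_ancestry((0.3, 0.2), group_centroids))
print(assign_ancestry((2.5, 2.5), group_centroids, max_distance=1.0))
```

The point is only that the group label comes from where a sample lands relative to reference samples in PC space, not from anything the participant reported.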
 
So the study found significant locations in the DNA, not necessarily significant specific genes. The trouble is figuring out which gene associated with a given location is the troublemaker in ME/CFS.

I understand that this is the case but I’ve been struggling to understand why.

From the preprint:

“Linking GWAS variants to causal genes that may provide biological insights and medical applications remains a challenge for the field (43).”

43. Qi T, Song L, Guo Y, Chen C, Yang J. From genetic associations to genes: methods, applications, and challenges. Trends Genet. 2024 Aug;40(8):642–67.

After reading various other things it still wasn’t clear to me, so I asked ChatGPT for a simple explanation:

ChatGPT:
1. Most GWAS hits are not in genes

• Around 90% of GWAS-associated SNPs fall in noncoding regions (DNA that doesn’t directly code for proteins).

• These regions often act as regulatory elements (enhancers, silencers, etc.) that can control genes far away — sometimes hundreds of thousands of base pairs away, even skipping over nearby genes.

So the nearest gene to a SNP is often not the one it regulates.

2. Linkage disequilibrium (LD) blurs the picture

• GWAS doesn’t pinpoint the exact causal variant — it identifies a block of variants that are inherited together due to LD.

• Any variant in that block could be the causal one, and they may each point to different potential genes.

It’s like knowing a crime happened in a group of five roommates — but not knowing which roommate did it.

3. Regulatory wiring is complex

• Gene regulation is 3D, not just linear along the DNA strand.

• DNA folds and loops, bringing faraway enhancers into contact with gene promoters.

• This means a SNP located near one gene in the genome’s “street map” might actually physically interact with a totally different gene in 3D space.

4. Cell-type and context specificity

• The effect of a variant often depends on which cell type or biological condition you look at.

• A SNP might regulate a gene in neurons but not in liver cells — so if you only look in the wrong cell type, you might see no signal.

5. Multiple genes can be affected

• Sometimes a single regulatory variant influences several genes at once.

• This makes “the” causal gene ambiguous — it could be one, several, or even all of them contributing to the trait.

In short:

Linking GWAS SNPs to genes is tricky because:

• Most hits are in noncoding regulatory DNA

• The causal variant is hidden among many linked ones

• Regulatory elements can act over long genomic distances in 3D

• Effects depend on cell type and context

• Multiple genes may be involved
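The linkage disequilibrium point (number 2) can be made concrete with a toy calculation. This is only a sketch with invented genotypes: it uses the squared Pearson correlation between genotype dosages, which is a common approximation of the LD measure r², and the SNP names are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov * cov / (var_x * var_y)


# Made-up dosages (0, 1 or 2 copies of the alternate allele) for 8 people.
genotypes = {
    "snp_a": [0, 1, 2, 1, 0, 2, 1, 0],
    "snp_b": [0, 1, 2, 1, 0, 2, 1, 0],  # always inherited together with snp_a
    "snp_c": [0, 1, 2, 1, 0, 2, 0, 1],  # mostly travels with snp_a
    "snp_d": [1, 0, 1, 2, 0, 1, 2, 0],  # largely independent of snp_a
}

for name, dosages in genotypes.items():
    print(name, round(r_squared(genotypes["snp_a"], dosages), 3))
```

If snp_a were the causal variant, snp_b would show exactly the same association with the trait and snp_c a very similar one, so the association statistics alone cannot say which of them (or which nearby gene) is responsible.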

That makes sense to me but is it accurate? Is there anything that anyone would correct or add?

Sorry for blocking up this thread with entry level questions but I wasn’t sure where else to ask, and I’m guessing that I may not be the only person reading this thread that doesn’t understand this aspect of GWAS.
 
I didn't know that stuff either and I'm really grateful that you posted it!
 
That makes sense to me but is it accurate? Is there anything that anyone would correct or add?
That seems pretty much right, but I'm no expert, so I wanted to find a more reputable source to back it up (these correspond to ChatGPT's points 1 and 2):

Prioritization of causal genes from genome-wide association studies by Bayesian data integration across loci (2025, PLOS Computational Biology)
Understanding which gene in a GWAS locus is responsible for the causal effect is a current challenge [1].

The challenge arises for two reasons.

First, SNPs identified by a GWAS are statistical associations, not causal mechanisms. Linkage disequilibrium creates large blocks of correlated SNPs or haplotypes. Methods that predict functional consequences of variants are helpful [2], but often statistical measures are insufficient to distinguish which SNPs in a block are responsible for a causal effect.

Second, even among causal variants, only a small fraction occur in protein-coding regions, and a small fraction of these cause amino acid changes that provide strong evidence implicating a particular gene. At the majority of loci, the causal variants occur in intergenic regions thought to regulate the expression of nearby genes, but without direct evidence from GWAS of which gene’s regulation is affected.
 
So isn't the argument simply: if you want large genetic studies of ME/CFS with recruitment based on diagnosis, you have to have clinicians dedicated to ME/CFS spread across the country? Instead of focusing on a possible shortcoming, should the response not be: Yes, now is the time to give us a GWAS where the diagnosis is made by dedicated clinicians who know what they are doing! We'd all happily sign up for that, please get the wheels moving now.
Some sort of... plan? That would... deliver better outcomes, uh? It would involve targets and work to address specific shortcomings such as this? There's an idea there, if only there were some... people... organized in some fashion who could... you know... think of something here? Ah well.
 
Going through the candidate genes PDF, SLC9C2 is mentioned, which I don’t think we’ve had much discussion of:

SLC9C2 (Tier 1)
• Protein: Sodium/hydrogen exchanger 11 (NHE-11). UniProt. GeneCards. The allele that increases the risk of ME/CFS is associated with decreasing SLC9C2 gene expression.
• Molecular function: Predicted to enable potassium:proton antiporter activity and sodium:proton antiporter activity. Exhibits testis-/sperm-restricted expression in humans (24) although some expression occurs in the thalamus and midbrain, and in the choroid plexus.
• Cellular function: Unclear, but it is localised to the acrosomal region of the head in mature sperm cells.
Which reminded me of the discussion of the Zhang paper, starting with this from @chillier:

As an aside, there's a couple of genes that seem to have high expression in spermatids (again from human protein atlas single cell with whatever problems that may or may not have). S100PBP and AKAP1 from precisionLife have very high spermatid expression specificity. ADCY10 from zhang et al as well for instance. Is there something that neuron function and spermatozoa have in common?

There was a fair bit of speculation about shared ion channels, results in PrecisionLife and if this was significant or not. I’m not sure what to make of it all but perhaps an interesting thread to pull on some more?
 
@richie this is a GPT5 answer to your question:


This is the analysis of the whole paper by the new GPT5 (hallucination is meant to be very low for health topics with this model). Is there anything new here?





This is the view of a customGPT I created:

Pathway-to-patient map (compressed view)

Pathway | Genes | Potential trial class | Low-risk now
Pain-synapse | CA10 | NR2B NMDA antagonists, PSD95 disruptors | PEM-aware pacing, sensory load control
NETosis / neutrophil tone | OLFM4 | Low-dose colchicine, DNase, PAD4 inhibitors | Oral hygiene, prompt infection care
Autophagy / ER-phagy | KLHL20, CCPG1 | ULK1 modulators, ER stress chaperones | Regular meals, avoid extreme fasting
Oxidative / membrane repair | PRDX6, PEBP1 | Lipid replacement therapy, antioxidants | Omega-3, choline, vitamin-C foods
Viral sensing | RABGAP1L, BTN2A2, ZNFX1 | TLR modulators, γδ T-cell agonists | Infection avoidance, vaccination as eligible

Thanks. I'm pretty familiar with the paper (which I had in advance so that I could write the blog). I'd say the ChatGPT5 summary is astonishingly good, certainly as a summary (as opposed to any speculations). The custom GPT one was more mixed, but had some good stuff.

I liked that they picked up subtle things: at least one of them spotted the paper saying the HLA results didn't quite add up and that the authors are going to do a new analysis. And that replication didn't really work, but there were problems with case definition variations in the replication cohorts.

ADDED: sorry, the stuff about pain and drug trials was junk, not sure which version of ChatGPT produced that. It's a very interesting clue, but no more than that at this stage.

Also, I think they missed that the genes identified are not nailed on, and need more work and, ultimately, experimental work.

Even so, it's still very impressive.

On AI generally, I had an extraordinarily helpful answer from the chatbot on the Visible app yesterday. Normally it spews junk, which you have to get through to connect to a real person. But it nailed a really tricky question. I suspect that was also ChatGPT 5 or similar.

I feel borderline redundant already. Don't fancy my chances after ChatGPT 6.

I wonder how ChatGPT will deal with debate in contentious areas?
 
That makes sense to me but is it accurate? Is there anything that anyone would correct or add?
Looks good to me. And the DecodeME team are well aware of these issues (and spell out the generic limitations of GWAS in the paper) - one reason the analysis took so long.

They have more analyses in the pipeline. I think fine mapping is one of these, like they did for the genetic signal for ME/CFS vs depression. This should further narrow down the candidate list of genes. I believe the summary statistics will allow other groups to do this kind of analysis too.
 
EDIT: these probably reflect artefacts that were excluded in the main analysis?

I'm looking at the DecodeME summary data, namely the file: gwas_1.regenie.gz taken from here: https://osf.io/rgqs3/files/osfstorage

When I sort by lowest p-value, however, I get 8320 rows with a p-value lower than 5 * 10^-8. Some go as low as 2.44 * 10^-62.
It looks like what we want is in the qced.var.gz file. It just has a list of SNPs, which I assume are those that passed QC. When I filter the main summary stats file (gwas_1.regenie.gz) to only include the SNPs in this list, then it looks like it matches the reported data:

Screenshot from 2025-08-09 08-00-27.png
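For anyone who wants to repeat that filtering step, here is a minimal sketch using only the Python standard library. It is not necessarily how I did it above. It assumes the summary statistics file is whitespace-delimited with a header row containing `ID` and `LOG10P` columns (the column names come from the README shipped with the data), and that the QC list has one variant ID per line. The function names are my own.

```python
import gzip
import math


def load_qc_ids(path):
    """Read the QC-passed variant IDs (one per line) into a set."""
    with gzip.open(path, "rt") as handle:
        return {line.strip() for line in handle if line.strip()}


def genome_wide_hits(sumstats_path, qc_ids, p_threshold=5e-8):
    """Yield (variant ID, -log10 p) for QC-passed variants with p below the threshold."""
    min_log10p = -math.log10(p_threshold)
    with gzip.open(sumstats_path, "rt") as handle:
        header = handle.readline().split()
        id_col = header.index("ID")
        logp_col = header.index("LOG10P")
        for line in handle:
            fields = line.split()
            if not fields or fields[id_col] not in qc_ids:
                continue  # drop variants that did not pass QC
            try:
                log10p = float(fields[logp_col])
            except ValueError:
                continue  # skip missing values such as "NA"
            if log10p > min_log10p:
                yield fields[id_col], log10p


# Example usage (file names as mentioned in this thread):
# qc_ids = load_qc_ids("qced.var.gz")
# hits = list(genome_wide_hits("gwas_1.regenie.gz", qc_ids))
# print(len(hits))
```

Working with -log10(p) directly avoids floating-point underflow for the very small p-values.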
 
It looks like what we want is in the qced.var.gz file. It just has a list of SNPs, which I assume are those that passed QC. When I filter the main summary stats file (gwas_1.regenie.gz) to only include the SNPs in this list, then it looks like it matches the reported data:
Oh yeah, I should have read the readme file first. It's got all the info that confirms that.
The shared summary statistics (*.regenie.gz) are the raw output from REGENIE v4.1, filtered on the fly only for minor allele frequency (MAF ≥ 0.01). Two additional files are provided: one listing the variants that passed QC (MAF ≥ 0.01, INFO ≥ 0.9 and dentist-like test) and the other with the imputation quality (INFO score) of each variant. The published GWAS were QCed; we therefore recommend that users keep only the QCed variants.

In all these files, the genomic positions reported are based on genome build GRCh38/hg38.

All GWAS fit the first 20 PCs and genetic sex. The latter is not used for the sex-separate analyses.

Files description

Summary statistics: *.regenie.gz
REGENIE (v4.1) outputs one file per phenotype, containing one line per variant after a header line (description below). The output is a single gzipped file.

Column Description
CHROM Chromosomes
GENPOS Genomic positions GRCh38 (bp)
ID Variant IDs formatted as chromosome:position:allele0:allele1
ALLELE0 Non-effect alleles (reference)
ALLELE1 Effect alleles (alternate)
A1FREQ Effect allele frequencies
A1FREQ_CASES Effect allele frequencies in cases
A1FREQ_CONTROLS Effect allele frequencies in controls (not to be used, as the calculation in REGENIE v4.1 is incorrect)
N Number of samples tested
N_CASES Number of cases tested
N_CONTROLS Number of controls tested
TEST Test performed, here additive “ADD”
BETA Estimated effect sizes (for allele 1)
SE Standard errors
CHISQ Chi-square test statistics
LOG10P −log10(p-value)
EXTRA Specifies if Firth corrections failed

List of variants that passed QC: gwas_qced.var
Gzipped file with one variant ID (chromosome:position:ref_allele:alt_allele) per line.

Imputation quality (INFO score) for all the variants: imputed.info.gz
Gzipped tab-separated file with one variant per line.
Header: chromosome, position, reference allele, alternate allele, imputation variant ID, INFO score

@ME/CFS Science Blog you were wondering which SNPs were imputed. The imputed.info.gz file might have that.

Also, I noted the issue with the reported frequencies not making sense. The readme says not to use A1FREQ_CONTROLS because it is incorrect.
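Going by the column descriptions in the README, a row of the summary statistics can be turned into more familiar numbers roughly like this. It is a sketch under one assumption of mine: that for a case-control analysis like this one, BETA is on the log-odds scale, so exp(BETA) is the odds ratio for ALLELE1. The function name and the example values are made up.

```python
import math

Z_95 = 1.959963984540054  # normal quantile for a two-sided 95% interval


def describe_effect(beta, se, log10p):
    """Convert BETA, SE and LOG10P into an odds ratio, 95% CI and p-value."""
    return {
        "odds_ratio": math.exp(beta),
        "ci_low": math.exp(beta - Z_95 * se),
        "ci_high": math.exp(beta + Z_95 * se),
        "p_value": 10 ** -log10p,  # LOG10P is -log10(p)
    }


# Hypothetical row values, for illustration only.
print(describe_effect(beta=0.08, se=0.014, log10p=8.2))
```

An odds ratio just above 1 with a tiny p-value is typical of GWAS hits: a small effect per allele, detected with high confidence because of the sample size.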
 
Yes, the only way to get any meaningful data was probably doing something like what DecodeME did. So there's no point in coming off defensive.

But if they want to complain about the data being unreliable because patients did not see a clinician as part of the study, I think there is now a pretty simple solution, or as Carson/Wessely/Garner once said, "Just do it better yourself": set up a small, well-run GWAS where people from the DecodeME cohort are invited to see a well-trained board of clinicians (similar to what happened in the intramural study) and only look at the significant genes. You'd only need a couple of hundred patients and it shouldn't take long. Wouldn't that work? Practically, the problem will probably be that you won't find enough clinicians who aren't biased in some way or who even want to do this.
Could you have designed DecodeME such that it checked a small sample of the less sick respondents as an accuracy validation exercise?
 
Was anyone else besides me surprised that 86.1% of the cohort reported "muscle pain"? Many years of following ME/CFS forums, webinars and conferences led me to believe the number would be much lower. I’m not saying this number is good or bad or inaccurate, just that it surprised me.
Same as me. Surprised as well. Perhaps a vestige of the “Myalgia” thing (i.e. biases as to who's active in the online “ME communities”).
 