Preprint Dissecting the genetic complexity of myalgic encephalomyelitis/chronic fatigue syndrome via deep learning-powered genome analysis, 2025, Zhang+

SNT Gatchaman

Dissecting the genetic complexity of myalgic encephalomyelitis/chronic fatigue syndrome via deep learning-powered genome analysis
Sai Zhang; Fereshteh Jahanbani; Varuna Chander; Martin Kjellberg; Menghui Liu; Katherine Glass; David Iu; Faraz Ahmed; Han Li; Rajan Douglas Maynard; Tristan Chou; Johnathan Cooper-Knock; Martin Jinye Zhang; Durga Thota; Michael Zeineh; Jennifer Grenier; Andrew Grimson; Maureen Hanson; Michael Snyder

Myalgic encephalomyelitis/chronic fatigue syndrome (ME/CFS) is a complex, heterogeneous, and systemic disease defined by a suite of symptoms, including unexplained persistent fatigue, post-exertional malaise (PEM), cognitive impairment, myalgia, orthostatic intolerance, and unrefreshing sleep. The disease mechanism of ME/CFS is unknown, with no effective curative treatments.

In this study, we present a multi-site ME/CFS whole-genome analysis, which is powered by a novel deep learning framework, HEAL2. We show that HEAL2 not only has predictive value for ME/CFS based on personal rare variants, but also links genetic risk to various ME/CFS-associated symptoms. Model interpretation of HEAL2 identifies 115 ME/CFS-risk genes that exhibit significant intolerance to loss-of-function (LoF) mutations.

Transcriptome and network analyses highlight the functional importance of these genes across a wide range of tissues and cell types, including the central nervous system (CNS) and immune cells. Patient-derived multi-omics data implicate reduced expression of ME/CFS risk genes within ME/CFS patients, including in the plasma proteome, and the transcriptomes of B and T cells, especially cytotoxic CD4 T cells, supporting their disease relevance. Pan-phenotype analysis of ME/CFS genes further reveals the genetic correlation between ME/CFS and other complex diseases and traits, including depression and long COVID-19.

Overall, HEAL2 provides a candidate genetic-based diagnostic tool for ME/CFS, and our findings contribute to a comprehensive understanding of the genetic, molecular, and cellular basis of ME/CFS, yielding novel insights into therapeutic targets. Our deep learning model also offers a potent, broadly applicable framework for parallel rare variant analysis and genetic prediction for other complex diseases and traits.


Link | PDF (Preprint: MedRxiv) [Open Access]
 
A large group of authors. The main team, led by Michael Snyder, is from the Department of Genetics, Center for Genomics and Personalized Medicine, Stanford University School of Medicine, Stanford, CA, USA, with others from elsewhere, including Maureen Hanson at Cornell. I'm very interested to see what they have to say.

The impact of this condition is far-reaching, with one in four patients becoming housebound or bedbound, and many of the most severely affected individuals requiring feeding tubes [3].
Not important, but while 'many of the most severely affected individuals requiring feeding tubes' can be true, depending on how you define 'most severely affected', it seems a bit misleading. I think I could count the number of people needing feeding tubes in my country on the fingers of one hand, but there are many more people that I would rate as 'severely affected'.

They are claiming a lot in the introduction, making me wonder, as I read it, whether they might have done better to spread these analyses across a couple of papers. (Paragraph breaks added.)
We provide a large whole-genome analysis of ME/CFS, powered by the novel deep learning framework HEAL2. Based on rare coding variants, HEAL2 predicted ME/CFS risk for individuals across multiple cohorts. This approach further identified 115 risk genes associated with ME/CFS exhibiting critical functional roles across various tissues and cell types, including the central nervous system (CNS) and immune cells.

Multi-omics data from ME/CFS patients validated the relevance of these findings, showing reduced expression of these genes in affected individuals.

Finally, our pan-phenotype analysis uncovered a genetic correlation between ME/CFS and other conditions, such as depression and long COVID-19.

Our study not only contributes to a deeper understanding of the genetic basis of ME/CFS, but also provides a powerful, generic framework for conducting rare variant analysis in other complex diseases. By deriving a genetic risk score and identifying some genetic risk factors associated with ME/CFS, this research holds the future potential to catalyze the development of precision diagnosis and effective therapies, providing much-needed hope for the millions of people living with this life-altering disease.

So, a clean discovery cohort of 247 cases and 192 controls, and a testing cohort of 36 cases and 21 controls. 115 risk genes is a lot to come out of a relatively small sample.
We conducted whole-genome sequencing (WGS) on the same platform for N = 1,075 individuals (N = 464 ME/CFS patients, N = 611 negative controls) from three ME/CFS cohorts (Fig. 1A; Methods): (1) a Stanford cohort that we assembled (N = 208 cases, N = 534 controls; see Methods), (2) a UK CureME cohort (N = 190 cases, N = 30 controls), and (3) a Cornell cohort (N = 66 cases, N = 47 controls). To increase the sample size of ME/CFS cases, we combined the Stanford and CureME cohorts into the discovery cohort and used the Cornell cohort as an independent testing cohort. After read alignment, variant calling, stringent quality controls (QCs), and ancestry analysis (Methods), we obtained the analysis-ready discovery cohort (N = 247 cases, N = 192 controls) and testing cohort (N = 36 cases, N = 21 controls). In particular, non-European and genetically-related individuals were excluded to control sampling bias.
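A quick sanity check on the quoted numbers can be sketched as below. The counts are copied from the quoted passage; the dictionary layout and variable names are just for illustration.

```python
# Counts copied from the quoted passage; the data layout is illustrative only.
sequenced = {
    "Stanford": {"cases": 208, "controls": 534},
    "CureME": {"cases": 190, "controls": 30},
    "Cornell": {"cases": 66, "controls": 47},
}
after_qc = {
    "discovery": {"cases": 247, "controls": 192},  # Stanford + CureME
    "testing": {"cases": 36, "controls": 21},      # Cornell
}

total_cases = sum(c["cases"] for c in sequenced.values())
total_controls = sum(c["controls"] for c in sequenced.values())
total_sequenced = total_cases + total_controls

total_after_qc = sum(c["cases"] + c["controls"] for c in after_qc.values())
retained_fraction = total_after_qc / total_sequenced

print(total_cases, total_controls, total_sequenced)
print(total_after_qc, round(retained_fraction, 2))
```

The totals match the paper's 464 cases and 611 controls. Fewer than half of the sequenced individuals (496 of 1,075) remain after QC and ancestry filtering.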
 
I'm sorry, I'm blundering about here. Don't read this post if you are looking for a succinct explanation and evaluation of this study.

If you feel like helping me work out what is going on, then perhaps read this post. I haven't read the whole study or even all of the Results yet. Perhaps it will all become clear later, especially in the Methods section that comes after the Results. And, obviously they were doing technical complicated data analysis. But, I think this could have been written better, to make it more accessible.
________

First, they identified the frequency of rare genetic variants by comparing the genetics in their samples with a non-Finnish European genetic database, keeping nearly 100,000 variants for analysis.
Given the moderate sample size of our cohorts, we used gnomAD [11] (version 4.1, Non-Finnish European [NFE]) to estimate allele frequencies (AFs), and retained 99,958 rare coding variants (NFE AF < 1%) for downstream machine learning analysis.
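The filtering step described in that quote might look roughly like the sketch below. The variant IDs and frequencies are invented, and the handling of variants missing from the reference panel is an assumption of the sketch, since the quoted text does not say what was done with them.

```python
# Hypothetical variant records: (variant_id, gnomAD NFE allele frequency, is_coding).
# The IDs and frequencies below are made up for illustration.
variants = [
    ("var_A", 0.0004, True),
    ("var_B", 0.0300, True),   # too common: AF >= 1%
    ("var_C", 0.0090, True),
    ("var_D", 0.0001, False),  # rare but non-coding
    ("var_E", None, True),     # absent from the reference panel
]

RARE_AF_THRESHOLD = 0.01  # "NFE AF < 1%" from the quoted passage


def is_rare_coding(af, is_coding, threshold=RARE_AF_THRESHOLD):
    """Keep coding variants whose reference allele frequency is below the threshold.

    Treating variants missing from the reference panel as rare is an assumption
    of this sketch, not something the quoted text states.
    """
    if not is_coding:
        return False
    return af is None or af < threshold


rare_coding = [vid for vid, af, coding in variants if is_rare_coding(af, coding)]
print(rare_coding)
```

The point is that "rare" here is defined against an external reference population, not against the study's own few hundred samples.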


Then they calculated an ME/CFS risk score based on the rare variants. I can't really understand what they did from the description in Results (Methods comes after). They built their model on a simulated dataset.
Based on a simulated nonlinear G2P dataset, we found that HEAL2 accurately predicted the disease from rare variants (area under the receiver operating characteristic curve [AUROC]: 0.891 [mean] ± 0.002 [95% CI]; area under the precision-recall curve [AUPRC]: 0.876 ± 0.001).
I'm not sure how they created the simulated dataset (from the discovery cohort or from both cohorts?).
Then they say "Similar results were obtained from an independent data set" - but, again, I don't know where this data set came from.
Then they say that they evaluated their model (HEAL2) against the discovery cohort.


They keep comparing the performance of the HEAL2 model with a HEAL model, saying the HEAL2 model is better, and I keep wondering why we should care. They explain the differences:
HEAL2 extends our previous method HEAL [12] in three aspects: (1) HEAL2 incorporates more comprehensive variant categories and functional scores; (2) HEAL2 employs an attention mechanism to improve model interpretation; (3) HEAL2 contains a non-linear GNN component based on the protein-protein interaction (PPI) network to capture the epistasis underpinning phenotypes, while HEAL is a linear model. Briefly speaking, for each individual HEAL2 first computes gene-level burden scores based on a variety of variant categories and functional scores using max and sum pooling (Fig. 1A; Methods). Next, HEAL2 conducts message passing of gene embeddings based on known PPIs. A sparse autoencoder (SAE)-based attention operation is then placed over gene embeddings after GNN to facilitate gene prioritization. Gene embeddings are pooled using attentions to construct a network embedding, which is used to compute the final ME/CFS risk score.
So, yeah... I'm assuming from something they say later that HEAL2 considers gene interactions (the protein-protein interaction network?) (see below), not just the presence or absence of a gene variant.
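To make the quoted pipeline a bit more concrete, here is a toy sketch of the three steps it names: gene-level burden by max and sum pooling, one round of message passing over a PPI network, and attention-weighted pooling into a single score. This is not the authors' implementation. Nothing is trained, the sparse autoencoder is left out, and the gene names, scores, network, and scoring rules are all invented.

```python
import math

# Toy data for ONE individual. Gene names, variant scores, and the network
# are invented for illustration; none of them come from the paper.
variant_scores = {
    "GENE1": [0.9, 0.2],  # functional scores of this person's rare variants
    "GENE2": [0.4],
    "GENE3": [],          # no rare variants in this gene
}
ppi_neighbors = {
    "GENE1": ["GENE2"],
    "GENE2": ["GENE1", "GENE3"],
    "GENE3": ["GENE2"],
}


def burden(scores):
    """Gene-level burden as (max pooling, sum pooling) over variant scores."""
    return (max(scores), sum(scores)) if scores else (0.0, 0.0)


def message_pass(embeddings, neighbors):
    """One round of averaging each gene's embedding with its PPI neighbours."""
    updated = {}
    for gene, emb in embeddings.items():
        group = [emb] + [embeddings[n] for n in neighbors[gene]]
        updated[gene] = tuple(sum(vals) / len(group) for vals in zip(*group))
    return updated


def attention_pool(embeddings):
    """Softmax attention over genes, then an attention-weighted sum."""
    logits = {g: sum(e) for g, e in embeddings.items()}  # stand-in for a learned scorer
    norm = sum(math.exp(v) for v in logits.values())
    attention = {g: math.exp(v) / norm for g, v in logits.items()}
    dim = len(next(iter(embeddings.values())))
    pooled = tuple(
        sum(attention[g] * embeddings[g][i] for g in embeddings) for i in range(dim)
    )
    return attention, pooled


gene_embeddings = {g: burden(s) for g, s in variant_scores.items()}
gene_embeddings = message_pass(gene_embeddings, ppi_neighbors)
attention, network_embedding = attention_pool(gene_embeddings)
risk_score = 1 / (1 + math.exp(-sum(network_embedding)))  # sigmoid of an untrained readout

print(attention)
print(round(risk_score, 3))
```

Two things this illustrates: a gene with no variants of its own (GENE3) still ends up with a non-zero embedding because of its neighbour, which is the "gene interactions" part, and each gene gets an attention weight, which is the kind of per-gene number the paper later uses to prioritize risk genes.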

Overall, HEAL2 exhibited better prediction performance (AUROC : 0.677 [mean] ± 0.007 [95% CI]; AUPRC: 0.727 ± 0.006; Fig. 1B, Supplementary Fig. 2A) than HEAL (AUROC: 0.668 ± 0.005; AUPRC: 0.716 ± 0.004; Fig. 1B, Supplementary Fig. 2A), suggesting that gene interactions might be a significant contributor to ME/CFS pathogenesis.
I'm not sure that those AUROCs are that great given the model is trained on the data, although perhaps genetic risk will never explain a very high percentage of ME/CFS risk. Also 0.677 (HEAL2) and 0.668 (HEAL) don't actually look like very different numbers to me. So, if the difference between the two models is the thing making the authors conclude that the gene interactions they have found are important, well, I'm not so sure.
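One way to look at how different those two numbers are is to compare the reported intervals directly. The sketch below assumes the "±" value is the half-width of the 95% confidence interval, which is how the quote reads but is not confirmed here.

```python
# Reported values from the quoted passage: (mean, half-width), assuming the
# "±" term is the half-width of the 95% confidence interval.
heal2_auroc = (0.677, 0.007)
heal_auroc = (0.668, 0.005)


def interval(mean, half_width):
    return (mean - half_width, mean + half_width)


def overlaps(a, b):
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


heal2_ci = interval(*heal2_auroc)
heal_ci = interval(*heal_auroc)
difference = heal2_auroc[0] - heal_auroc[0]

print(heal2_ci, heal_ci)
print(overlaps(heal2_ci, heal_ci), round(difference, 3))
```

Under that assumption the intervals are roughly 0.670 to 0.684 for HEAL2 and 0.663 to 0.673 for HEAL, so they overlap slightly. Overlapping intervals do not prove there is no difference, but a gap of 0.009 in AUROC is small.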


Of note, a logistic regression model using the first 10 principal components (PCs) as its features yielded nearly random prediction (AUROC: 0.518 ± 0.002; AUPRC: 0.586 ± 0.002; Fig. 1B, Supplementary Fig. 2A), indicating the population homogeneity of our cohort after QCs.
I don't think that I understand that. If they take a model with the 10 (gene variants?) that explain the most variation between the ME/CFS group and the control group, then that model has basically zero ability to differentiate the two groups? Do they mean that the population (including both the ME/CFS group and the control group) is homogeneous?
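On the principal components question: in genetics papers, PCs most likely mean the top axes of variation in the whole genotype matrix across all individuals, which mostly capture ancestry, rather than individual gene variants. Read that way, an AUROC near 0.5 says that ancestry alone cannot tell cases from controls, so the main model is probably not just picking up population differences between the groups. The simulated sketch below shows what AUROC looks like when a score is unrelated to case status and when it is not. All data are simulated and the one-standard-deviation shift is arbitrary.

```python
import random


def auroc(case_scores, control_scores):
    """Probability that a random case scores higher than a random control (ties count half)."""
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


rng = random.Random(0)

# Simulated "ancestry" score drawn from the SAME distribution for both groups,
# i.e. cases and controls are homogeneous on this axis.
cases_same = [rng.gauss(0.0, 1.0) for _ in range(1000)]
controls_same = [rng.gauss(0.0, 1.0) for _ in range(1000)]

# For contrast: cases shifted by one standard deviation on this axis,
# as could happen if the two groups were recruited from different populations.
cases_shifted = [rng.gauss(1.0, 1.0) for _ in range(1000)]

auroc_homogeneous = auroc(cases_same, controls_same)
auroc_stratified = auroc(cases_shifted, controls_same)

print(round(auroc_homogeneous, 3), round(auroc_stratified, 3))
```

So the answer to the last question appears to be yes: the claim is that cases and controls, taken together, look like one homogeneous population on the ancestry axes.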
 
Figure 1C shows the sensitivity and specificity of HEAL2 when trained on the discovery cohort and tested on the test cohort. The AUROC is 0.67. But the test cohort is only 36 cases and 21 controls. So, in order to identify 75% of the cases (sensitivity, true positives), the model will correctly identify only 50% of the controls (specificity, true negatives). It is something, but I'm not too sure how solid it is.
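To put rough numbers on that operating point, the sketch below converts the percentages into head counts for the testing cohort and adds an approximate uncertainty range for the specificity. The 75% and 50% figures are read off the curve by eye, and using a Wilson score interval is a choice made for this sketch, not something taken from the paper.

```python
import math

N_CASES, N_CONTROLS = 36, 21          # testing cohort sizes from the paper
SENSITIVITY, SPECIFICITY = 0.75, 0.50  # operating point read off Figure 1C (approximate)

true_positives = SENSITIVITY * N_CASES
false_negatives = N_CASES - true_positives
true_negatives = SPECIFICITY * N_CONTROLS
false_positives = N_CONTROLS - true_negatives


def wilson_interval(p, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion p observed in n samples."""
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (centre - half, centre + half)


spec_low, spec_high = wilson_interval(SPECIFICITY, N_CONTROLS)

print(true_positives, false_negatives, true_negatives, false_positives)
print(round(spec_low, 2), round(spec_high, 2))
```

That is about 27 of 36 cases flagged, with about half of the 21 controls also flagged. With only 21 controls, a specificity of 50% is compatible with anything from roughly 30% to 70%.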

HEAL2 risk score correlates with ME/CFS-relevant symptoms
Surprisingly, we observed that although detailed phenotype data were not seen by the model, HEAL2 risk score was still strongly correlated with many symptoms showing different aspects of ME/CFS manifestation and disease severity (Fig. 1D), such as unrefreshing sleep, brain fog, malaise after exertion, and muscle discomfort.
I'm not sure it is so surprising. Having the symptoms is correlated with having ME/CFS - because ME/CFS is defined as having the symptoms. It seems to me that Figure 1D is really just a measure of what symptoms are most characteristic of having the rest of the symptoms that characterise ME/CFS.
 
This looks too complicated for me to follow without the help of some other brains here better at this than me. I strongly suspect that there are useful data in here but it is a pity that they do not present findings in the abstract in a more transparent way. I understand what Chris Ponting is trying to do because he says so transparently. I will believe his results. I will believe these only if someone can explain to me why I should!!

But these people know what they are doing. Even if there is a bit of over-egging, I strongly suspect once we have this and the Precision Life results and DecodeME along with the Beentjes results we will start seeing what is really going on.
 
HEAL2 identifies 115 ME/CFS risk genes - page 6
For ME/CFS, HEAL2 prioritized 115 genes that presented consistently larger attention scores among patients compared to controls (q-value < 0.02, Storey-Tibshirani procedure [17]; Fig. 2B, Supplementary Table 2; Methods). We defined these 115 genes as HEAL2-identified ME/CFS risk genes throughout this study.


ME/CFS genes display functional diversity across human tissues and cell types
higher expression of ME/CFS genes spanning multiple tissues (adjusted P < 0.05, two-sided t-test with Bonferroni correction; Fig. 3A), including cerebral cortex, skeletal muscle, and colon. To obtain a finer-resolution, we further analyzed single-cell RNA-seq data [22] (Methods) and revealed higher expression of ME/CFS genes across various cell types (adjusted P < 0.05, two-sided t-test with Bonferroni correction; Fig. 3B), including neurons, smooth muscle cells, and immune cells. At the protein level [23] (Methods), we confirmed the higher expression of ME/CFS genes (P < 0.05, two-sided t-test; Fig. 3C) within the central nervous system (CNS). These results are consistent with tissues and organs affected by ME/CFS [24,25], implicating their causal roles in impacting the disease risk and symptoms.
That's quite a lot of tissues covered, and I'm not sure that we know enough about how all of the genes impact all of the tissues to be drawing useful conclusions. So, it's interesting that things like neurons, muscles, colon and immune cells are affected by the identified genes, but I don't think it's definitive. The authors say the results are consistent with tissues and organs affected by ME/CFS, although I'm not sure we can say yet what tissues and organs are affected.

To investigate the function of our ME/CFS genes at a systems level, we further carried out a network analysis [12,26–29] (Methods) by mapping 115 ME/CFS genes onto a protein-protein interaction (PPI) network [30]. We assessed the enrichment of ME/CFS genes within different gene modules of the PPI network. Notably, four gene modules (out of 1,261 modules) were identified to be significantly enriched with ME/CFS genes (false discovery rate [FDR] < 0.05, Fisher’s exact test; Fig. 4A and 4B, Supplementary Fig. 3). Gene ontology (GO) analysis showed that M9 genes were associated with proteasome function and particularly degradation of ubiquitinated proteins that are targeted for turnover (Fig. 4C), and M20 genes were linked to synaptic function (Fig. 4D). These findings reveal the functional diversity of ME/CFS risk genes.
I'm surprised that the text listed only two of the four gene modules. And Figure 4 only mentions those two gene modules, not the other two. I find the M20 gene module result interesting, with big hits on synapse function. A problem with synapse function could perhaps explain how both physical and mental exertion have effects in ME/CFS.
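For anyone wondering what "enriched, Fisher's exact test" means for a module, the one-sided version of the test asks how likely it is that at least this many of the 115 risk genes would fall into a module of that size by chance. A minimal sketch follows. Only the 115 genes and the 1,261 modules come from the paper; the network size, module size, and overlap are hypothetical placeholders, and the last line uses a crude Bonferroni-style adjustment where the paper reports FDR.

```python
from math import comb


def enrichment_p_value(total_genes, module_size, risk_genes, overlap):
    """One-sided Fisher's exact test: P(at least `overlap` risk genes land in the module).

    Equivalent to the upper tail of a hypergeometric distribution.
    """
    max_overlap = min(module_size, risk_genes)
    numerator = sum(
        comb(risk_genes, k) * comb(total_genes - risk_genes, module_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return numerator / comb(total_genes, module_size)


# Only the 115 risk genes and the 1,261 modules come from the paper.
# The network size, module size, and overlap are hypothetical placeholders.
TOTAL_GENES = 18000
MODULE_SIZE = 150
RISK_GENES = 115
N_MODULES = 1261

p_9_genes = enrichment_p_value(TOTAL_GENES, MODULE_SIZE, RISK_GENES, overlap=9)
p_1_gene = enrichment_p_value(TOTAL_GENES, MODULE_SIZE, RISK_GENES, overlap=1)
expected_overlap = MODULE_SIZE * RISK_GENES / TOTAL_GENES

print(expected_overlap, p_9_genes, p_1_gene)
print(p_9_genes * N_MODULES)  # crude Bonferroni-style adjustment for 1,261 tests
```

With these made-up sizes, about one risk gene per module is expected by chance, so one gene in a module means nothing while nine would be very unlikely. Whether the real modules look like this depends on their actual sizes, which are not given in the quoted text.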
 
Good to see that the CureME Biobank cohort was used.

There seems to be too much emphasis on trying to make a diagnostic marker out of this rather than focus on mechanisms. I get the impression that this approach is more scattershot than DecodeME and, as Hutan says, although it is intriguing to have brain, skin and prostate flagged up, I rather doubt prostate has much to do with it!

The implication of CD4 cytotoxic cells is intriguing. That is not a population we tend to think about much.
 
From the supplementary materials, the other two gene modules are
C15 with nucleoside phosphate biosynthetic process; c-GMP mediated signalling and NAD metabolic process
C18 with lots of interesting things like t-cell differentiation, protein dephosphorylation, stress-activated MAPK cascade, negative regulation of cell migration, response to molecule of bacterial origin, positive regulation of neuron death, sodium ion export across plasma membrane, intracellular potassium ion homeostasis
I wonder why these weren't mentioned in the results. I don't know if the gene modules are standard ones, or if this team has identified them?
 
ME/CFS genes are differentially expressed in multiple conditions - page 7
So, then they do a separate analysis, looking to see if the results from a 'previously generated plasma proteome dataset' fit with their identified 115 genes with loss of function and other issues. The proteome dataset was from a small number of samples - 20 cases, 20 controls.

They find 57 relevant proteins that have been measured.

We found that the protein levels in module M9 (Fig. 4A) were significantly decreased in ME/CFS patients versus controls (adjusted P < 0.002, normalized enrichment score (NES) = -1.75; Fig. 4E). Of the nine ME/CFS genes in M9, four proteins were measured. Two of these, PSMB4 and PSMB5 (components of the 20S core proteasome complex), were part of the leading edge subset (i.e., the proteins that contributed the most to the enrichment signal). The 115 ME/CFS genes and other modules did not show significant enrichment (Supplementary Fig. 4).
That certainly sounds interesting. Two out of the four proteins measured in the M9 gene module appear to be lower in people with ME/CFS compared to the controls. M9 was all about the proteasome, which breaks down proteins for reuse, including misfolded proteins. So, that fits with the idea that waste isn't getting efficiently cleared in cells.
 
continued - page 9
We also investigated the expression patterns of ME/CFS genes across multiple blood cell types within ME/CFS patients using an RNA-seq dataset [32] generated by Comella et al. Although no difference in expression of overall ME/CFS genes was observed (Supplementary Fig. 5A), ME/CFS genes within module M9 (Fig. 4A) were markedly down-regulated specifically in patient B cells and T cells compared to those from healthy samples (adjusted P < 0.05, two-sided Wilcoxon rank-sum test followed by Bonferroni correction; Fig. 4F; Methods), showing concordance with the protein dataset and underscoring their cell-type-specific disease relevance. As a negative control, no other module ME/CFS genes were differentially expressed within any patient cell types (P > 0.05, two-sided Wilcoxon rank-sum test; Supplementary Fig. 5B-D).

[Screenshot of Figure 4F]
 
I get the impression that this approach is more scattershot than DecodeME

I've been assuming that DecodeME will just pull out as interest-worthy the SNP differences between cases and controls that cross a certain threshold for statistical significance, with each SNP being treated independently in statistical terms from the other SNPs. But it sounds as though the analysis approach in this paper is different, if you think it's more scattershot?
 
I have increased confidence when different methods converge on the same results. I would like to point out the convergences here.

1) The paper mentions PTPN11 and GRB2. They both appear in Figure 2B. These have been found since 2018 (note that in one tweet, Michael Snyder, one of the authors, is tagged):

[Three screenshots attached]

2) The paper also mentions the proteasome system.

[Screenshot attached]

and from the document I circulated in 2018, specific mention of the proteasome and ubiquitin system. Note also the mention of protein degradation:

[Screenshot attached]


There is also mention of cholecystectomies, an association that even some patients have noticed and which I presented at EUROMENE in 2018. Regarding the paper, I am using o3 reasoning to post the below, and I look forward to comments about it:



[Two screenshots attached]
 
I've been assuming that DecodeME will just pull out as interest-worthy the SNP differences between cases and controls that cross a certain threshold for statistical significance, with each SNP being treated independently in statistical terms from the other SNPs. But it sounds as though the analysis approach in this paper is different, if you think it's more scattershot?
The basic process seems to be fairly common ML:
- Get a good dataset and filter out similarities that arise for other reasons, like being related
- Train a network on this dataset; you know the inputs and outputs and are searching for whatever commonalities exist in the data
- Once trained, use this network to see whether it can differentiate on fresh, unseen data
- If it works, analyse the network to understand what it has spotted and how it decides whether new input data is a match or not
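The train-then-test part of the steps above can be sketched with a deliberately trivial stand-in for the network: a single threshold on one simulated "burden score". The cohort sizes mirror the paper, but the data, the effect size, and the threshold classifier are all invented for illustration, and the last step (interpreting the model) is not covered.

```python
import random

rng = random.Random(42)


def simulate(n_cases, n_controls, shift):
    """Simulated (burden_score, label) pairs; cases have a higher mean score."""
    data = [(rng.gauss(shift, 1.0), 1) for _ in range(n_cases)]
    data += [(rng.gauss(0.0, 1.0), 0) for _ in range(n_controls)]
    return data


def fit_threshold(train):
    """'Training': put the decision threshold midway between the class means."""
    case_mean = sum(x for x, y in train if y == 1) / sum(1 for _, y in train if y == 1)
    ctrl_mean = sum(x for x, y in train if y == 0) / sum(1 for _, y in train if y == 0)
    return (case_mean + ctrl_mean) / 2


def accuracy(data, threshold):
    return sum((x > threshold) == bool(y) for x, y in data) / len(data)


# Cohort sizes mirror the paper; the effect size (shift) is invented.
discovery = simulate(247, 192, shift=2.0)
testing = simulate(36, 21, shift=2.0)

threshold = fit_threshold(discovery)          # step 2: train on discovery only
train_accuracy = accuracy(discovery, threshold)
test_accuracy = accuracy(testing, threshold)  # step 3: score unseen individuals

print(round(threshold, 2), round(train_accuracy, 2), round(test_accuracy, 2))
```

The key design point is that the testing cohort plays no part in choosing the threshold, so its accuracy is an honest estimate, though with only 57 people it will bounce around a lot from sample to sample.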

It looks like the point is that the input data is not just raw genetic data, as in the GWAS for DecodeME, but all this meta-analysis: information on the variants (loss of function, missense, etc.), potential protein-protein interactions, and so on. So potentially a layer above?

I guess I like this as I’ve been thinking recently about other approaches to looking for patterns in downstream data rather than genetic data itself. Something like using AlphaFold to understand all the proteins based upon genetic data and then looking for patterns in that.

That’s just my cursory understanding, based upon little knowledge and a bit of reading. Subject to change when new information and experts come along :)
 
We could do with some professional input here (no aspersions being cast on Hutan). Maybe @chillier, @Evergreen, @jnmaciuch, @MelbME, @DMissa @mariovitali ... can get their heads around this or even get a comment from @Chris Ponting if he has time?
Flattered to be included on that list but even if my brain weren't mired in sludge, my professional opinion would amount to "Oooh, I hope someone can explain this to me some day in a journal club."

I suggest emailing Zhang and Hanson and seeing if one of them might be open to coming on here and explaining it to us. It might be best if the request came from you, @Jonathan Edwards.

In the meantime, as well as the others already listed, @Simon M might have insights?
 