Preprint Dissecting the genetic complexity of myalgic encephalomyelitis/chronic fatigue syndrome via deep learning-powered genome analysis, 2025, Zhang+

How much do you need? Maybe a Fundraiser?
I’m meeting with a couple potential collaborators in the next few weeks to sort out exactly what would be feasible to do experimentally—the cost might vary quite a bit based on this. But I’m happy to keep folks on here updated as I know more!

Unfortunately I’m also limited by time and energy, since I’m still obligated to keep up with course work and research for the grant that’s actually funding me long term. Although I have more energy thanks to some medications and supplements, I’m still limited by my ME/CFS.

I’m very thankful for everyone’s support and excitement—I just want to make sure I don’t overpromise what I can do as one person in a given timeline.

But there are other smart people in the field who seem to be chasing down similar threads. Even if I’m exactly right, it’s entirely possible someone else will get there sooner with the proof!
 
Thanks! Unfortunately, blood samples will probably not be able to address the hypothesis I have in mind. Though it’s possible they could be used for something supplemental.
What sort of samples do you need? Do they exist in a biobank or would you need to collect fresh ones?

Have you thought of applying for funds to the ME charities? (Sorry, you may have already explained all this and I may have forgotten!)
 
What sort of samples do you need? Do they exist in a biobank or would you need to collect fresh ones?
I’m most interested in muscle. Whether they have to be fresh depends on a few details that haven’t been hammered out yet. My main concern is that for my particular hypothesis, participant selection is quite important. With something like a biobank, I may not be able to confirm that the participants are actually experiencing e.g. delayed flu-like PEM.

Have you thought of applying for funds to the ME charities? (Sorry, you may have already explained all this and I may have forgotten!)
I’ve reached out to SolveME but they’re not accepting applications. Currently running down other possibilities.
 
I’ve reached out to SolveME but they’re not accepting applications. Currently running down other possibilities.
I think the UK ME/CFS charities do international funding.
 
I think I still require assistance @forestglip @jnmaciuch @Jonathan Edwards.

I have only briefly skimmed things but my understanding is still the following:

We have a first cohort that is possibly very skewed (for instance, it's possible that there is a high number of ME/CFS patients from the UK but a very low number of controls from the UK). There is no information on what this cohort actually looks like.

Based on this cohort we are now looking at a list of genes that discriminate between ME/CFS status and controls (according to HEAL2) the most. It's possible that this list is just a reflection of the above skewing.

Now one can have some hope that HEAL2 can separate ME/CFS from controls despite a possible skewing, given that the independent Cornell cohort provided some decent separation. However, to me that gives no reason to think that, for example, the top 30 genes from the first cohort are of any relevance to the separation of the Cornell cohort, since that separation relies on the full HEAL2 analysis. Or was the list created by also looking at the top hits in that cohort? Surely the weighting there might be completely different?
 
Yes that's definitely a concern, which is why I was hoping to get a sense of which genes are driving the validation in the test cohort. If it was clear from the training weights that only a few top genes were driving most of the predictive power, that would be fairly straightforward since it would be difficult to recapitulate the score without those same top hits.

Unfortunately we seem to see more of a gradient in their attention [edit: scores], so the situation might be exactly as you describe. That's why, in the absence of more detailed data from their test validation, I've been trying to look at the list as a whole and find patterns that incorporate many of the top hits across their list (keeping in mind that some redundancy will be artifacts of the PPI network analysis in HEAL2).

It's less of a "these are the genes that drive ME/CFS" story and more of a "what is a common pathway that would be affected by loss of function at many of these various points?" story. For the time being, I'm interested in these results insofar as they might complement or flesh out some theories that I've already been forming. I'm hoping that cross-referencing with DecodeME will aid in pulling out what's generalizable from this study as well.
 

I agree with @jnmaciuch. For me a red herring of this study is the attempt to use genes as biomarkers to separate populations. There are all sorts of sampling problems. But they have pulled out genes that look as if they are pointing us to major areas of biology that must be involved in ME/CFS. Synapses is one. T cells is another, although there are questions about exactly where that is pointing.

The biggest worry for me is that cohorts of people diagnosed with ME/CFS tend to include two quite different groups, who get the same diagnosis for spurious reasons. One might be synapses and the other T cells. But even so, the diagnosis picks out these people and they are different from healthy controls. Fluge's group picked out HLA-C too. And they also picked out a neural structure gene. I don't think this is fluff. It is real. But forget about trying to identify patients by gene combinations.
 
I wanted to dig into that chart of phenotype associations (fig. 5: depression, COVID, etc). I realized the data they used is all on Genebass. I tried to figure out how to work with all >4500 phenotypes, but I couldn't figure out how to access the bulk data which is hosted in Google Cloud Storage. So instead I used the browser version of Genebass to download summary statistics for the few phenotypes that Zhang et al found were significant (e.g. here's the England COVID phenotype).

I was able to download the phenotype metadata file, though, which let me find the codes needed to look up any phenotype in the browser tool, since searching for things like "chronic fatigue syndrome" wasn't working on the website. This file is also hosted on Google Cloud, so it's not as straightforward as downloading from a link and requires a Google Cloud account with billing set up. It's a bit bigger than the file attachment size limit here, so if anyone needs it, I'll figure out how to share it.

So I downloaded the data for each phenotype they labeled as significant in figures 5A and 5B, plus a few other random ones as well as "chronic fatigue syndrome". What I think they did for a given phenotype is take the SKATO p-values for every single gene as one group. Then pick out only the 115 ME/CFS genes and use their p-values from the same dataset as the other group. Then do a one-sided Mann-Whitney test to compare the p-values of the ME/CFS genes versus all genes for that phenotype.

I did that and I got identical results (I don't know what they're showing on the x-axis, but the y-axis is what I plotted and it matches up). The red line is p=0.05, and everything above it is more significant.
[Image: p_skato.png]

I did the same thing for "P-Value Burden" and it's also identical to the significant items in figure 5B. Except for one thing: I got IBS as the most significant phenotype, and their chart didn't show IBS at all. I think maybe they cut off the top of the chart where it would have been. [Edit: or my method isn't totally identical in some way.]
[Image: p_burden.png]

Notably, CFS is far down the chart with both methods.
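
For anyone who wants to try the comparison described above, here is a minimal sketch of a one-sided Mann-Whitney test of a gene subset against all genes. It is only my reading of the procedure, not Zhang et al's code. The gene names and p-values are made up, the helper names (`midranks`, `mann_whitney_less`, `enrichment_p`) are hypothetical, and the p-value uses a plain normal approximation without tie or continuity correction, so it will differ slightly from a library routine such as `scipy.stats.mannwhitneyu(x, y, alternative="less")`.

```python
import math
from statistics import NormalDist


def midranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_less(x, y):
    """One-sided test of whether x tends to be smaller than y.

    Returns (U statistic for x, approximate p-value). Simplification:
    normal approximation, no tie or continuity correction.
    """
    n1, n2 = len(x), len(y)
    ranks = midranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return u1, NormalDist().cdf((u1 - mean) / sd)


def enrichment_p(pvals_by_gene, gene_set):
    """Compare p-values of gene_set against p-values of all genes."""
    subset = [p for g, p in pvals_by_gene.items() if g in gene_set]
    everything = list(pvals_by_gene.values())  # includes the subset, as described
    return mann_whitney_less(subset, everything)[1]


# Made-up data: 100 placeholder genes with evenly spread p-values.
toy_pvals = {f"GENE{i}": (i + 1) / 101 for i in range(100)}
toy_set = {f"GENE{i}" for i in range(10)}  # the 10 smallest p-values
p_value = enrichment_p(toy_pvals, toy_set)
print(p_value)
```

With real data, `pvals_by_gene` would hold one phenotype's per-gene SKATO (or burden) p-values and `gene_set` the 115 genes from the preprint.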

Edit: Note about Ranitidine. There seem to be two different phenotypes that mention Ranitidine. The one labeled "Medication for pain relief..." on my chart that is in the same location as their "Ranitidine" appears to represent several pain killer drugs, including ranitidine. It's not totally clear from the phenotype metadata file. The one labeled "Ranitidine" on my chart is specifically ranitidine.

Edit 2: Never mind about Ranitidine, I was looking at it wrong. They are both specifically Ranitidine, asked about using different methods. Here is what the two rows look like. In each row, the first two items are "description" and "description_more", and the third is "coding description".
- Treatment/medication code
- Code for treatment. Negative codes indicate free-text entry.

- ranitidine
- UK Biobank Assessment Centre > Verbal interview > Medications


- Medication for pain relief, constipation, heartburn
- ACE touchscreen question "Do you regularly take any of the following? (You can select more than one answer)" The following checks were performed: If code -7 was selected, then no additional choices were allowed. If code -1 was selected, then no additional choices were allowed. If code -3 was selected, then no additional choices were allowed. If the participant activated the Help button they were shown the message: Some over the counter medicines are known by other names. Please enter the corresponding name if you take any of the following REGULARLY (that is, most days of the week for the last 4 weeks): Aspirin: Alka Rapid Crystals, Alka-Seltzer XS, Anadin Extra, Anadin Original, Askit powders, Aspro Clear, Codis 500, Disprin, Disprin Extra Ibuprofen: Anadin Ultra, Anadin Ibuprofen, Cuprofen Plus, Nurofen, Solpaflex, Ibuleve Paracetamol: Anadin Extra, Hedex Extra, Panadol, Paracodol, Paramol, Solpadeine, Syndol, Veganin, Feminax, Midrid, Migraleve Codeine: Codis 500, Cuprofen Plus, Nurofen Plus, Panadol Ultra, Paracodol, Paramol, Solpadeine Max, Sopadeine Plus, Solpafelx, Syndol, Veganin, Feminax, Migraleve

- Ranitidine (e.g. Zantac)
- UK Biobank Assessment Centre > Touchscreen > Health and medical history > Medication

Edit 3: Turns out if I convert the metadata file from tsv to xlsx, it becomes much smaller and I can attach it. So it's here in case anyone needs it.
 

And in case anyone's curious about which of the 115 Zhang genes are most significant for depression, here are the 115 genes with their rankings out of the 18,358 total genes tested in depression. You can check here on the depression page, sort by P-Value SKATO, and verify that HOMER2 is the 44th most significant gene.
44 HOMER2
456 PSMB5
474 ENTPD8
591 ENTPD6
650 DLGAP1
893 SCAF1
919 PRKCZ
980 NFATC3
1023 STX10
1026 NODAL
1039 LEP
1210 GNRH1
1273 CREB3
1325 AGO1
1403 PSMC3
1513 TSC2
1756 CHMP3
1757 PRPF4B
1760 RAPGEF1
1802 NR3C2
1876 AHCYL2
2147 NT5C3B
2189 AMPD2
2452 NEDD9
2568 PSMB4
3007 DVL2
3553 CDC6
3904 RET
4319 PPP2R2B
4362 E2F6
4563 CAMK2A
4625 CACNA2D3
4769 PPCDC
4991 PANK2
5121 ADGRL2
5165 NOTCH1
5282 ACE
5346 ZC3H13
5468 HSF1
5469 PSMD7
5940 NCBP2
6174 RFK
6328 PDE4B
6460 DNMT3A
6474 PSMC5
6720 GABBR1
6805 CRKL
6836 DLGAP3
6906 STAM2
7146 AK2
7334 KRT5
7615 ENTPD5
7637 PTPN11
7709 MICALL2
7760 PANK1
7855 BAIAP2
7933 COASY
8009 AHCYL1
8096 SHARPIN
8233 NLGN1
8268 BUB3
8401 CREB5
8853 RBPJL
8880 CDC23
9003 TOP1
9157 CANT1
9209 IK
9228 HDAC1
9344 SF3B2
9367 BHMT
9514 PDYN
9557 INS
9817 PIK3CA
9872 DLGAP4
9948 CDC14A
9988 CHD8
10162 SMARCD3
10327 NMRK2
10547 IL12A
10603 RNF41
10652 NME1-NME2
10653 NME2
10785 HLA-C
10798 PSMB3
10928 CA2
11847 CDR2
12041 DLG2
12075 NME4
12168 HP
12311 NAMPT
12405 PPP2R2A
12429 NRAS
12704 ADCY10
12807 GDPD1
15313 NME1
15318 GRM1
15507 DNMT3B
15912 GDPD3
16025 ATP4B
16034 SYNGAP1
16544 SF1
16657 AK3
16699 PANK3
16895 NME3
16979 BNIP1
17197 PELP1
17439 NLGN2
17664 GALT
17781 ING3
17922 PARD6B
17976 PSMB7
18152 GRB2

Edit: You may notice that the Genebass page says "Mental health problems ever diagnosed by a professional" for the phenotype and not "Depression". There are several phenotypes for different conditions with this same name. The metadata file says the one with coding number 11 is "Depression" which is the one I linked to and used in the testing earlier.
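
A minimal sketch of how rankings like the ones above can be produced from a table of per-gene p-values, in case anyone wants to repeat this for another phenotype. The gene names and p-values here are placeholders, not Genebass data, and the function names are hypothetical. Ties are broken by input order, which may not match how the Genebass browser sorts.

```python
def rank_genes(pvals_by_gene):
    """Map each gene to its 1-based rank when sorted by ascending p-value."""
    ordered = sorted(pvals_by_gene, key=pvals_by_gene.get)
    return {gene: i + 1 for i, gene in enumerate(ordered)}


def ranks_for_subset(pvals_by_gene, genes_of_interest):
    """Return (rank, gene) pairs for the genes of interest, best rank first."""
    ranks = rank_genes(pvals_by_gene)
    # Genes missing from the table are skipped rather than raising an error.
    return sorted((ranks[g], g) for g in genes_of_interest if g in ranks)


# Placeholder p-values for five placeholder genes.
toy_pvals = {
    "GENE_A": 0.04,
    "GENE_B": 0.60,
    "GENE_C": 0.001,
    "GENE_D": 0.02,
    "GENE_E": 0.90,
}

for rank, gene in ranks_for_subset(toy_pvals, ["GENE_A", "GENE_C", "GENE_X"]):
    print(rank, gene)
```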
 
The rankings in the Biobank data for chronic fatigue syndrome for these 115 genes might be interesting as well:
70 NT5C3B
118 CAMK2A
216 ING3
559 GABBR1
646 NODAL
677 RFK
721 PDYN
1067 CREB5
1175 CACNA2D3
1214 COASY
1238 HSF1
1243 CANT1
1255 RBPJL
1296 DLG2
1369 CDC6
1805 DNMT3A
1847 NME1
2106 PPCDC
2323 SYNGAP1
2493 CDR2
2598 HOMER2
3513 BHMT
3808 DVL2
3832 ADCY10
3931 GDPD3
4016 ENTPD8
4133 SF3B2
4386 AMPD2
4419 PANK2
4422 SHARPIN
4478 IK
4793 STX10
4837 PRKCZ
4907 ACE
5052 NME1-NME2
5056 RAPGEF1
5116 NME2
5354 CREB3
5644 ENTPD5
5967 AK3
6229 PELP1
6332 ATP4B
6506 CDC14A
7117 ZC3H13
7392 MICALL2
7681 AK2
7697 GDPD1
7744 HLA-C
7783 CRKL
7831 PANK1
7975 GALT
8015 PSMC3
8145 HP
8194 ENTPD6
8360 SF1
8874 KRT5
8879 NFATC3
9093 NME4
9423 GNRH1
9797 GRM1
9803 STAM2
10416 NCBP2
10763 BAIAP2
10851 NME3
11021 DNMT3B
11066 AHCYL2
11077 NEDD9
11301 E2F6
11657 NLGN2
11787 ADGRL2
12170 PANK3
12335 CA2
12497 CDC23
12725 HDAC1
12795 PSMD7
12961 IL12A
13088 CHD8
13245 PSMB4
13307 PRPF4B
13442 SMARCD3
13715 PIK3CA
13931 NRAS
14057 BUB3
14270 GRB2
14768 NMRK2
14791 DLGAP4
14824 PSMC5
14870 AGO1
14930 PSMB5
15089 NAMPT
15286 BNIP1
15331 CHMP3
15347 DLGAP3
15531 PARD6B
15604 SCAF1
15870 PSMB7
16149 NOTCH1
16197 NR3C2
16283 PPP2R2B
16582 RET
16710 AHCYL1
16762 NLGN1
16796 DLGAP1
16953 LEP
17076 PTPN11
17142 RNF41
17245 PDE4B
17579 TOP1
17739 PPP2R2A
17847 INS
17964 PSMB3
18046 TSC2
 
find patterns that incorporate many of the top hits across their list (keeping in mind that some redundancy will be artifacts of the PPI network analysis in HEAL2).
Are you saying that the way their model works is, if, say, one DLGAP is very useful for the model classification, that makes it more likely for other DLGAP genes to also have high attention scores, even if potentially there isn't much difference between the cases and controls for the others?
 
More or less. I suspect that there are no particularly strong associations to begin with—I’d be quite surprised if one gene had more than 2-3 associated mutations present in the dataset at all.

Rather, what the algorithm seems to be doing is leveraging attention across neighborhoods of nodes. The more a gene is connected to other genes that also showed up more often in ME/CFS than control (even if it’s only an n=1 difference for any given gene), the more attention that node gets. As @EndME already mentioned, that’s probably the only way you could get any signal out of such a small dataset.

That’s where the bias in the protein-protein interaction reference dataset comes in. The more particular protein “neighborhoods” are studied in the literature, the more edges they’re going to have with other nodes, and the more chance there is to skew towards those well-characterized neighborhoods over other networks that might be equally relevant.

At least, that’s the sense that I get from reading through. I unfortunately haven’t had time to fully dig into the algorithm.
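
To make the reference-network bias concrete, here is a toy sketch. It is not HEAL2's attention mechanism. It assumes a much cruder rule (a node's score is its own signal plus the summed signal of its direct neighbors) purely to show how a densely annotated neighborhood can outscore a sparsely annotated one when every gene carries the same weak signal. All node names and edges are invented.

```python
from itertools import combinations


def neighborhood_scores(edges, signal):
    """Score each node as its own signal plus the summed signal of its neighbors."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return {
        node: signal.get(node, 0) + sum(signal.get(n, 0) for n in nbrs)
        for node, nbrs in neighbors.items()
    }


# "Well-studied" neighborhood: every pair of the four genes has a known edge.
well_studied = ["A1", "A2", "A3", "A4"]
edges = list(combinations(well_studied, 2))

# "Poorly studied" neighborhood: same size, but only a chain of known edges.
edges += [("B1", "B2"), ("B2", "B3"), ("B3", "B4")]

# Every gene has the same weak signal (e.g. one extra variant in cases).
signal = {gene: 1 for gene in well_studied + ["B1", "B2", "B3", "B4"]}

scores = neighborhood_scores(edges, signal)
for gene in sorted(scores):
    print(gene, scores[gene])
```

In this toy setup the A genes score higher than any B gene even though the per-gene signal is identical, simply because more edges are recorded for them.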
 