Preprint Dissecting the genetic complexity of myalgic encephalomyelitis/chronic fatigue syndrome via deep learning-powered genome analysis, 2025, Zhang+

Oh, I don't think this will tell us much. Most of the genes in those gene sets were not important, and the gene sets themselves might not be the best groupings of the genes that were important. Potentially, the overall gene sets/pathways themselves might provide clues, but I wouldn't do anything with specific genes in them, especially if the attention scores for those genes weren't high at all in the model.

And replication across gene sets isn't useful. It just means a gene is involved in more than one pathway, and we aren't sure which, if any, is the important one in ME/CFS.

OK makes sense. If you have any set of genes that appear to be important please tag me so I can have a look at them.
 
I do worry a bit how that particular finding might be misused by the BPS lot
Understandable. I’m sure we all have this concern to varying degrees. There are various things they could and probably will try.

But it’s good we have the environment here and some good scientists with whom we can explore it. And I think we’re best served by getting on with good science, which will ultimately beat the bad in both quality and quantity.
 
proteasome subunits consistently come up as dysregulated in our work in ME cell lines or primary cells from different tissues, have not looked yet at specific overlaps in subunits or directions between tissues, or at functional assays yet, but there's something here even if subtle. However it may just be the case that something as closely married to a central process such as protein translation would be expected to be affected by probably any flavour of aberrant or disrupted homeostasis. There are also a lot of these subunits, so coming up by chance is also possible (I can do the stats to determine the likelihood of this, but it's been less important than other things I am working on)
Are you at liberty to give more details on exactly what you're finding? I was initially inclined to believe that proteasome findings would be the result of general homeostatic stress like you mentioned, but recently I have more reason to believe they might be directly involved. It's just a hunch though, so I'm searching for more evidence to back it up
 
in the LCLs, proteasome subunits were predominantly upregulated (https://www.mdpi.com/1422-0067/22/4/2046 figure 8, damn but my old figures were ugly). We have gene expression data from other cell types too now, but this aspect hasn't been specifically analysed yet so I can't comment on it

are you thinking along the lines of autophagy or more specific to the proteasome?
 
Thanks! That's really interesting. More specific to the proteasome, or rather, failure to (adequately) degrade particular proteins post viral infection. I still have several details that need to be ironed out, so apologies for being vague.

It may not be the proteasome itself--I was initially thinking ubiquitin ligases might be at fault. But the other possibility is that failure to degrade certain proteins at a critical point is simply a byproduct of having too many proteins to degrade overall, in which case I would expect increased proteasome subunit expression as an attempt to compensate. Did you happen to see any other evidence of UPR?
 
Haven't noticed it standing out but haven't looked at the GSEA in the other datasets in detail yet, nor at UPR in the old data in more detail than published
 
@jnmaciuch check out the following thread reply on X/Twitter, which may be of interest regarding ubiquitin ligase:

https://twitter.com/user/status/1895026389646221621


Just a suggestion, but many of us simply don't use Twitter anymore, and find it annoying as Twitter greatly limits what you can read without logging in - it doesn't let us read the replies for example.

I know you have an account on Bsky but haven't used the account since 2023. Just a suggestion anyway.
 
Are you thinking generally @jnmaciuch or that this is an issue in specific places? Say, issues around degradation of neurotransmitters or other proteins by the proteasome around the synapse?
I think it’s more than likely that brain tissue is involved in some way, but the issue I’m toying with would not be exclusive to the brain or nerves. Again, sorry to be so vague; even for my idle speculations I prefer to have a couple of pieces of evidence to back it up before I put it out there.

thanks to @DMissa for the extra info!
 
I went ahead with my plan of running GSEA on the HEAL2 genes using protein clusters provided by STRING as gene sets, then using the most significant of these clusters to run a second GSEA on the Genebass data for ME/CFS in the BioBank. Nothing came out very significant (the lowest FDR was 0.220), but some of the top-ranked clusters might be interesting. I typed up the following with what I did, and it links to the GSEA results for HEAL2 and for the Genebass data, so you can see all the gene clusters that were included in both GSEAs.

Thanks to @jnmaciuch for some help getting started.

Abstract
The ME/CFS classification machine learning model, HEAL2, described in a preprint from Zhang et al (1) incorporated loss-of-function variants from a cohort of participants with and without ME/CFS, as well as the STRING database’s protein-protein interaction network in its training. The model included a mechanism which allowed for creating an attention score for every gene to allow for better interpretability.

Thus, to determine biological relevance of the model’s highest ranked genes, ranked gene set enrichment analysis (GSEA) was performed on the model’s gene list, using the attention score as the ranking metric. Local clusters of proteins provided by STRING were used as the gene sets of interest, as these clusters would plausibly correspond to the protein-protein interaction data incorporated during model training.

The clusters which passed a threshold of significance and enrichment were then used for a subsequent ranked GSEA using the SKAT-O p-values for genes from the Genebass database, which provide a measure of the association between loss-of-function variants in these genes and ME/CFS.

2 of 50 clusters met an FDR threshold of q<0.25 in this final GSEA.

Mapping gene names from HEAL2 to STRING-DB identifiers
STRING GSEA is able to automatically convert various gene identifiers to STRING’s own gene identifiers. However, mapping was performed using the STRING API beforehand to determine what proportion of provided genes could be converted to STRING IDs.

Using the information from Supplementary Table 2 from Zhang 2025, all 17759 provided genes were sent to the STRING API for mapping. 17533 genes (~98.7%) were returned with STRING identifiers, which were used in the subsequent analysis. The API did not return identifiers for the remaining 226 genes. Follow-up queries of the STRING API with these genes revealed that 64 of them were considered by STRING to be synonyms for already mapped genes (e.g. ZASP as a synonym for LDB3) and 162 could not be identified by STRING (e.g. ACP2). Of this last group, two appear to have been errors in data entry, as the gene fields contain dates instead of gene names.
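For anyone wanting to double-check the bookkeeping above, here is a minimal Python sketch. It only re-does the arithmetic on the counts reported in the post and does not query STRING. The helper names are hypothetical, and the date pattern is an assumption: the post does not say what the date-like entries looked like, so the regex targets the spreadsheet-style form (e.g. "1-Mar") that gene symbols are commonly mangled into.

```python
import re

# Counts as reported in the post; nothing here calls the STRING API.
TOTAL_GENES = 17759
MAPPED = 17533
SYNONYMS_OF_MAPPED = 64
UNIDENTIFIED = 162


def mapping_summary(total, mapped, synonyms, unidentified):
    """Check that the unmapped genes add up and return the coverage."""
    unmapped = total - mapped
    if unmapped != synonyms + unidentified:
        raise ValueError("unmapped genes do not add up to synonyms + unidentified")
    return {"unmapped": unmapped, "coverage": mapped / total}


# Assumed pattern: spreadsheet-style dates such as "1-Mar" or "2-Sep".
DATE_LIKE = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$",
    re.IGNORECASE,
)


def looks_like_date(symbol):
    """Flag entries in a gene column that look like a date, not a gene."""
    return bool(DATE_LIKE.match(symbol.strip()))


summary = mapping_summary(TOTAL_GENES, MAPPED, SYNONYMS_OF_MAPPED, UNIDENTIFIED)
print(f"{summary['coverage']:.1%} mapped, {summary['unmapped']} unmapped")
```

Running `looks_like_date` over the gene column before mapping would be one way to catch such entries early, if they do follow that format.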

STRING GSEA using HEAL2 genes
Using the STRING API, the 17533 genes, along with their attention scores, were used to run a ranked GSEA. A URL was returned which provides an interactive webpage for viewing the results. (2)

Using the settings on the GSEA results page, gene sets were filtered for false discovery rate (FDR) <= 0.01, enrichment score >= 1.0, and minimum count in gene set of 10. Gene sets were filtered for those enriched at the top of the input, which represent the gene sets associated with high attention scores. Further, gene sets with similarity (based on Jaccard index) >= 0.7 were merged (only the most significant of the similar clusters was displayed). This resulted in a filtered list of 52 enriched STRING local network clusters.
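The filtering and merging step can be sketched roughly as below. This is not STRING's actual implementation, which the post does not describe; it is a simple greedy version under stated assumptions: "minimum count" is treated as the number of genes in the cluster, and of any group of similar clusters only the one with the lowest FDR is kept. `Cluster`, `jaccard` and `filter_and_merge` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    name: str
    genes: frozenset
    fdr: float
    enrichment_score: float


def jaccard(a, b):
    """Jaccard index: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def filter_and_merge(clusters, max_fdr=0.01, min_score=1.0,
                     min_count=10, max_similarity=0.7):
    """Apply the thresholds, then drop clusters too similar to a better one."""
    passing = [
        c for c in clusters
        if c.fdr <= max_fdr
        and c.enrichment_score >= min_score
        and len(c.genes) >= min_count
    ]
    kept = []
    # Visit the most significant clusters first so they win any merge.
    for cluster in sorted(passing, key=lambda c: c.fdr):
        if all(jaccard(cluster.genes, k.genes) < max_similarity for k in kept):
            kept.append(cluster)
    return kept


# Toy data only; these are not real STRING clusters.
toy = [
    Cluster("A", frozenset(range(12)), 1e-5, 2.0),
    Cluster("B", frozenset(range(11)), 1e-3, 1.8),        # Jaccard with A = 11/12
    Cluster("C", frozenset(range(100, 112)), 5e-3, 1.5),
    Cluster("D", frozenset(range(200, 212)), 0.05, 1.5),  # fails the FDR cut
    Cluster("E", frozenset(range(300, 305)), 1e-4, 3.0),  # fewer than 10 genes
]
print([c.name for c in filter_and_merge(toy)])
```

In the toy data, B is dropped because it overlaps almost entirely with the more significant A, which mirrors the "only the most significant of the similar clusters" behaviour described above.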

The most enriched of these clusters was CL:22984, with an enrichment score of ~3.43 and an FDR of 7.50e-11. STRING describes it as representing “Neurexins and neuroligins”. (3)

GSEA on ME/CFS cases from Genebass
Using the biomaRt R library, the protein identifiers from the STRING clusters were converted to HGNC symbols to match with the data on ME/CFS from Genebass. Of 740 unique proteins, 7 did not automatically convert, and their corresponding HGNC symbols were looked up manually using the STRING web interface. The clusters were then written to a GMT file to act as gene sets for GSEA.
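As a rough illustration of the last step, a GMT file is plain tab-separated text: one gene set per line, with the set name, a description, and then the gene symbols. The sketch below is in Python rather than the R used above, and the function names are hypothetical. The example cluster lists only the two genes that are named elsewhere in this thread, not the full membership.

```python
def gmt_line(name, description, genes):
    """Build one GMT line: name, description, then de-duplicated genes."""
    unique_genes = sorted(set(genes))
    return "\t".join([name, description, *unique_genes])


def write_gmt(clusters, path):
    """Write a dict of {name: (description, genes)} to a GMT file."""
    with open(path, "w", encoding="utf-8") as handle:
        for name, (description, genes) in clusters.items():
            handle.write(gmt_line(name, description, genes) + "\n")


# Partial, illustrative membership only (two genes named in the thread).
example = gmt_line(
    "CL:23065",
    "Ionotropic glutamate receptor",
    ["GRIN1", "CACNG5", "GRIN1"],
)
print(example)
```

De-duplicating within each set is a cautious choice here, in case the identifier conversion maps two proteins to the same symbol.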

From the Genebass web interface, summary statistics were downloaded for predicted loss of function (pLoF) variants for chronic fatigue syndrome. (4) An RNK file containing all of the included genes and their corresponding SKAT-O p-values was created. Finally, GSEA v4.4.0 for Linux was used to run preranked GSEA with the created RNK and GMT files. Minimum gene set size was set to 10.
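An RNK file is a two-column, tab-separated list of gene and ranking score. The post does not say whether the p-values were transformed before ranking, so the sketch below makes an assumption: it uses -log10(p), a common choice that puts the smallest p-values at the top of the list. The function name and the toy p-values are hypothetical.

```python
import math


def rnk_lines(pvalues):
    """Turn {gene: p-value} into RNK lines sorted by descending -log10(p).

    Assumption: genes with a missing or out-of-range p-value are skipped.
    A p-value of exactly 0 would need a floor value rather than skipping.
    """
    scored = []
    for gene, p in pvalues.items():
        if p is None or not (0 < p <= 1):
            continue
        # abs(log10(p)) equals -log10(p) for 0 < p <= 1 and avoids "-0".
        scored.append((gene, abs(math.log10(p))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [f"{gene}\t{score:.6g}" for gene, score in scored]


# Made-up p-values purely for illustration.
toy_pvalues = {"GENE_A": 0.01, "GENE_B": 0.5, "GENE_C": None, "GENE_D": 1e-4}
for line in rnk_lines(toy_pvalues):
    print(line)
```

If raw p-values were written to the RNK file instead, the ordering would be reversed, with the least significant genes ranked first, so the direction of the ranking metric is worth checking when reading the reports.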

50 of 52 gene sets met the size threshold and were analyzed for enrichment. The GSEA report is accessible online.(5) Two gene sets met an FDR threshold of 0.25, though it should be noted that these gene sets are small (16 and 10 genes, corresponding to the following two clusters respectively) and they had very few leading edge genes (2 and 3 respectively).

* CL:23065, Ionotropic glutamate receptor, and Neurotransmitter receptor transport, postsynaptic endosome to lysosome (Normalized Enrichment Score = 1.39, FDR q-val = 0.220)
* CL:6643, PR-DUB complex, and Methyl-CpG-binding domain protein (NES=1.31, q=0.229)
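To make "enrichment score" and "leading edge genes" a bit more concrete, here is a minimal sketch of the weighted running-sum statistic that preranked GSEA is based on. It is not the GSEA software's code and it does not compute the normalized score or FDR, which need permutations. The function name and toy data are hypothetical.

```python
def enrichment_score(ranked, gene_set, weight=1.0):
    """Running-sum enrichment score and leading edge for one gene set.

    ranked: list of (gene, score) pairs sorted by descending score.
    Steps up at genes in the set (weighted by |score|**weight) and steps
    down at all other genes; the score is the largest deviation from zero.
    """
    hits = [gene in gene_set for gene, _ in ranked]
    n_miss = len(ranked) - sum(hits)
    hit_total = sum(
        abs(score) ** weight for (_, score), hit in zip(ranked, hits) if hit
    )
    if hit_total == 0 or n_miss == 0:
        raise ValueError("need at least one weighted hit and one miss")

    running, best, best_index = 0.0, 0.0, -1
    for index, ((_, score), hit) in enumerate(zip(ranked, hits)):
        if hit:
            running += abs(score) ** weight / hit_total
        else:
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best, best_index = running, index

    if best >= 0:
        # Positive score: set members at or before the peak.
        window = range(0, best_index + 1)
    else:
        # Negative score: set members from the trough onwards.
        window = range(best_index, len(ranked))
    leading_edge = [ranked[i][0] for i in window if hits[i]]
    return best, leading_edge


# Toy ranked list; gene names and scores are made up.
toy_ranked = [("g1", 5), ("g2", 4), ("g3", 3), ("g4", 2), ("g5", 1), ("g6", 0.5)]
es, edge = enrichment_score(toy_ranked, {"g1", "g2", "g6"})
print(round(es, 4), edge)
```

In the toy list, two top-ranked genes alone drive the score, which is the same pattern as the small clusters above with only 2 or 3 leading edge genes.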

--------

1. Zhang S, Jahanbani F, Chander V, Kjellberg M, Liu M, Glass KA, et al. Dissecting the genetic complexity of myalgic encephalomyelitis/chronic fatigue syndrome via deep learning-powered genome analysis [Internet]. medRxiv; 2025 [cited 2025 May 23]. p. 2025.04.15.25325899. Available from: https://www.medrxiv.org/content/10.1101/2025.04.15.25325899v2
2. Enrichment Results - STRING database [Internet]. [cited 2025 May 23]. Available from: https://version-12-0.string-db.org/cgi/globalenrichment?networkId=baSYcIb9YQhy
3. CL:22984 - STRING interaction network [Internet]. [cited 2025 May 23]. Available from: https://version-12-0.string-db.org/cgi/network?networkId=bd4ZcN9E0DnX
4. Genebass - Chronic fatigue syndrome [Internet]. [cited 2025 May 23]. Available from: https://app.genebass.org/gene/undef...?resultIndex=gene-manhattan&resultLayout=full
5. Genebass GSEA report [Internet]. [cited 2025 May 23]. Available from: https://glittery-tarsier-413b2d.netlify.app/gsea/cfs_string_clusters.gseapreranked.1748034397906/

Edit: Added links

Edit: I noticed now that the GSEA user guide says if you are doing gene_set permutation (which is what I did with preranked GSEA) as opposed to phenotype permutation (which can only be done if you have expression data of genes for individual samples), the FDR threshold should be 0.05, not 0.25. So basically a null finding here. Also, this analysis is not robust as I'm just learning the tools, so don't take these results at face value. But just wanted to share in case any methods or findings are useful to someone.
 
One thing I didn't mention in the above post, out of an abundance of statistical caution: When I first ran the GSEA on the Genebass dataset with the 52 clusters that were returned by STRING for the HEAL2 genes, I noticed that only 33 gene sets were tested in the analysis. I remembered that the GSEA software sets a default minimum gene set size of 15 genes for this reason:
When you run the gene set enrichment analysis, the GSEA software automatically normalizes the enrichment scores for variation in gene set size, as described in GSEA Statistics. Nevertheless, the normalization is not very accurate for extremely small or extremely large gene sets. For example, for gene sets with fewer than 10 genes, just 2 or 3 genes can generate significant results. Therefore, by default, GSEA ignores gene sets that contain fewer than 15 genes or more than 500 genes; defaults that are appropriate for datasets with 10,000 to 20,000 features. To change these default values, use the Max Size and Min Size parameters on the Run GSEA Page; however, keep in mind the possibility of inflated scorings for very small gene sets and inaccurate normalization for large ones.
I was curious if any of the smaller gene sets would be significant though and re-ran it with a minimum of 10 genes so that 50 clusters would be tested. Everything came out less significant, and those are the results I reported.

But on the first run with the recommended gene set size settings, this one had an FDR of 0.056: "Ionotropic glutamate receptor, and Neurotransmitter receptor transport, postsynaptic endosome to lysosome". Though only two genes in this gene set of size 16 are highly ranked in the Genebass CFS data of 18006 genes: CACNG5 (rank 39) and GRIN1 (229). (For some reason the rankings in the reports are off by one.)

Out of curiosity, I did GSEA with these same clusters of genes from HEAL2 with seven other phenotypes in the Genebass data. They're listed with links to their reports here, and were basically chosen at random from the phenotypes I tested earlier with just p values for genes. Only one of these seven had one or more gene sets with a comparable FDR to CFS: "Radiology of one body area (for < 20 minutes)" (a couple of gene sets around q=0.06). Depression had one gene set at q=0.158, and the rest didn't have any below an FDR of 0.2.
 
If there is proteasome dysfunction, why has it not been observed in tissue samples? Should it not lead to accumulation of proteins that can, eventually, be clearly seen?
If the proteasome is non-functional (or even severely dysfunctional), absolutely. Though you'd probably also present with much more serious and immediate health issues as well.

However, at a milder scale, accumulation of misfolded proteins happens in all sorts of situations, and activating different compensatory pathways to prevent this from killing the cell is a very ancient (and well studied) homeostatic response. So having less efficient proteasome function could still result in downstream problems if particular proteins are not being cleared out to the extent they ought to be. But it would still take a lot to get to the point of killing the cell or seeing truly abnormal amounts of misfolded proteins in tissue samples.
 
Quote from another thread, but thought it'd be good to post this response here.
I also must admit that though I'd be happy to attribute things to a T-cell response, I just don't think there is strong evidence for them above any other cell type. Zhang et al. gave a very weak indication for HLA-C only (and a loss of function mutation, not a gain of function, no less). I don't believe it was actually standalone in the STRING networks, though perhaps someone else could correct me on that.
I put the top 115 genes into STRING to see the connections between proteins. HLA-C only has evidence linking it to one other protein with medium confidence, PSMB5. The same graph in interactive form can be seen here: https://version-12-0.string-db.org/cgi/network?networkId=bxRiZn0HPfgk

[Attached image: STRING network graph of the top 115 genes]
 
Thanks for verifying! I wasn’t sure if I was misremembering. So there’s a chance that LOF mutations in HLA-C showed up in 2 or 3 individuals rather than just 1. If it was any more than that I’d probably expect the score to be much higher
 