Preprint Dissecting the genetic complexity of myalgic encephalomyelitis/chronic fatigue syndrome via deep learning-powered genome analysis, 2025, Zhang+

Discussion in 'ME/CFS research' started by SNT Gatchaman, Apr 17, 2025.

  1. mariovitali

    mariovitali Senior Member (Voting Rights)

    Messages:
    579
    OK makes sense. If you have any set of genes that appear to be important please tag me so I can have a look at them.
     
    Binkie4, Chestnut tree, Kitty and 3 others like this.
  2. hotblack

    hotblack Senior Member (Voting Rights)

    Messages:
    894
    Location:
    UK
    Understandable. I’m sure we all have this concern to varying degrees. There are various things they could and probably will try.

    But it’s good we have the environment here and some good scientists with which we can explore it. And ultimately I think we’re best served by getting on with good science and that will ultimately beat the bad in both quality and quantity.
     
    Binkie4, MeSci, Sasha and 4 others like this.
  3. jnmaciuch

    jnmaciuch Senior Member (Voting Rights)

    Messages:
    862
    Location:
    USA
    Are you at liberty to give more details on exactly what you're finding? I was initially inclined to believe that proteasome findings would be the result of general homeostatic stress like you mentioned, but recently I have more reason to believe they might be directly involved. It's just a hunch though, so I'm searching for more evidence to back it up
     
  4. DMissa

    DMissa Senior Member (Voting Rights)

    Messages:
    219
    Location:
    Australia
    In the LCLs, proteasome subunits were predominantly upregulated (https://www.mdpi.com/1422-0067/22/4/2046, figure 8; damn, but my old figures were ugly). We have gene expression data from other cell types too now, but this aspect hasn't been specifically analysed yet, so I can't comment on it.

    Are you thinking along the lines of autophagy, or something more specific to the proteasome?
     
  5. jnmaciuch

    jnmaciuch Senior Member (Voting Rights)

    Messages:
    862
    Location:
    USA
    Thanks! That's really interesting. More specific to the proteasome, or rather, failure to (adequately) degrade particular proteins post viral infection. I still have several details that need to be ironed out, so apologies for being vague.

    It may not be the proteasome itself--I was initially thinking ubiquitin ligases might be at fault. But the other possibility is that failure to degrade certain proteins at a critical point is simply a byproduct of having too many proteins to degrade overall, in which case I would expect increased proteasome subunit expression as an attempt to compensate. Did you happen to see any other evidence of UPR?
     
    Lilas, Kitty, hotblack and 2 others like this.
  6. DMissa

    DMissa Senior Member (Voting Rights)

    Messages:
    219
    Location:
    Australia
    Haven't noticed it standing out but haven't looked at the GSEA in the other datasets in detail yet, nor at UPR in the old data in more detail than published
     
    jnmaciuch, Kitty, hotblack and 2 others like this.
  7. hotblack

    hotblack Senior Member (Voting Rights)

    Messages:
    894
    Location:
    UK
    Are you thinking generally @jnmaciuch or that this is an issue in specific places? Say, issues around degradation of neurotransmitters or other proteins by the proteasome around the synapse?
     
    Kitty and Deanne NZ like this.
  8. mariovitali

    mariovitali Senior Member (Voting Rights)

    Messages:
    579
    Kitty and Deanne NZ like this.
  9. Snow Leopard

    Snow Leopard Senior Member (Voting Rights)

    Messages:
    4,081
    Location:
    Australia
    Just a suggestion, but many of us simply don't use Twitter anymore, and find it annoying as Twitter greatly limits what you can read without logging in - it doesn't let us read the replies for example.

    I know you have an account on Bsky but you haven't used it since 2023. Just a suggestion anyway.
     
    Robert 1973, bobbler, Sean and 6 others like this.
  10. hotblack

    hotblack Senior Member (Voting Rights)

    Messages:
    894
    Location:
    UK
    I’d echo that. Or post threads using threadreader or something without any barriers? I haven’t been on twitter for many years and won’t be signing up just to read things as you seem to need to do these days.
     
  11. jnmaciuch

    jnmaciuch Senior Member (Voting Rights)

    Messages:
    862
    Location:
    USA
  12. jnmaciuch

    jnmaciuch Senior Member (Voting Rights)

    Messages:
    862
    Location:
    USA
    I think it’s more than likely that brain tissue is involved in some way, but the issue I’m toying with would not be exclusive to the brain or nerves. Again, sorry to be so vague; even for my idle speculations I prefer to have a couple of pieces of evidence to back it up before I put it out there.

    Thanks to @DMissa for the extra info!
     
    Deanne NZ, MeSci, bobbler and 3 others like this.
  13. hotblack

    hotblack Senior Member (Voting Rights)

    Messages:
    894
    Location:
    UK
    Understood @jnmaciuch I look forward to hearing if/when you have something more to share :)
     
    MeSci, bobbler, Kitty and 1 other person like this.
  14. forestglip

    forestglip Senior Member (Voting Rights)

    Messages:
    2,462
    I carried out my plan of running GSEA on the HEAL2 genes using protein clusters provided by STRING as gene sets, then using the most significant of these clusters to run a second GSEA on the Genebass data for ME/CFS in the BioBank. Nothing came out very significant; the lowest was FDR = 0.220, but maybe some of the top-ranked clusters might be interesting. I typed up the following describing what I did, and it links to the GSEA results for HEAL2 and for the Genebass data, so you can see all the gene clusters that were included in both GSEAs.

    Thanks to @jnmaciuch for some help getting started.

    Abstract
    The ME/CFS classification machine learning model, HEAL2, described in a preprint from Zhang et al (1) incorporated loss-of-function variants from a cohort of participants with and without ME/CFS, as well as the STRING database’s protein-protein interaction network in its training. The model included a mechanism which allowed for creating an attention score for every gene to allow for better interpretability.

    Thus, to determine biological relevance of the model’s highest ranked genes, ranked gene set enrichment analysis (GSEA) was performed on the model’s gene list, using the attention score as the ranking metric. Local clusters of proteins provided by STRING were used as the gene sets of interest, as these clusters would plausibly correspond to the protein-protein interaction data incorporated during model training.

    The clusters which passed a threshold of significance and enrichment were then used for a subsequent ranked GSEA using the SKATO p values for genes from the Genebass database, which provide a measure of the association between loss-of-function variants in these genes and ME/CFS.

    2 of 50 clusters met an FDR threshold of q<0.25 in this final GSEA.

    Mapping gene names from HEAL2 to STRING-DB identifiers.
    STRING GSEA is able to automatically convert various gene identifiers to STRING’s own gene identifiers. However, mapping was performed using the STRING API beforehand to determine what proportion of provided genes could be converted to STRING IDs.

    Using the information from Supplementary Table 2 from Zhang 2025, all 17759 provided genes were sent to the STRING API for mapping. 17533 genes (~98.7%) were returned with STRING identifiers, which were used in the subsequent analysis. The API did not return identifiers for 226 genes. Follow-up queries of the STRING API with these remaining genes revealed that 64 of them were considered by STRING to be synonyms of already mapped genes (e.g. ZASP as a synonym of LDB3) and 162 could not be identified by STRING (e.g. ACP2). Of this last group, two appear to have been errors in data entry, as the gene fields contain dates instead of gene names.
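As an illustration of the bookkeeping described above, here is a minimal sketch of how a mapping result could be tallied into mapped, synonym, and unidentified groups. It is not the code used for the analysis. The function name, the input structures, the placeholder identifiers, and the date pattern (day-month strings such as "1-Mar") are all assumptions made for the example.

```python
import re

# Assumption: date-like entries look like "1-Mar" or "2-Sep".
DATE_LIKE = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$",
    re.IGNORECASE,
)


def summarize_mapping(input_genes, mapped, synonyms):
    """Tally the outcome of an identifier-mapping pass.

    input_genes: list of gene symbols sent for mapping
    mapped:      dict of symbol -> identifier returned on the first pass
    synonyms:    set of symbols later found to be synonyms of mapped genes
    """
    unmapped = [g for g in input_genes if g not in mapped]
    unidentified = [g for g in unmapped if g not in synonyms]
    n_mapped = len(input_genes) - len(unmapped)
    return {
        "total": len(input_genes),
        "mapped": n_mapped,
        "synonyms": len(unmapped) - len(unidentified),
        "unidentified": len(unidentified),
        "date_like": [g for g in unidentified if DATE_LIKE.match(g)],
        "percent_mapped": round(100 * n_mapped / len(input_genes), 1),
    }


# Toy input; the identifiers are placeholders, not real STRING IDs.
example = summarize_mapping(
    ["LDB3", "ZASP", "ACP2", "GRIN1", "1-Mar"],
    {"LDB3": "placeholder_id_1", "GRIN1": "placeholder_id_2"},
    {"ZASP"},
)
print(example)
```

The same arithmetic applied to the counts reported in the post (17533 mapped out of 17759) gives about 98.7%.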

    STRING GSEA using HEAL2 genes
    Using the STRING API, the 17533 genes, along with their attention scores, were used to run a ranked GSEA. A URL was returned which provides an interactive webpage for viewing the results. (2)

    Using the settings on the GSEA results page, gene sets were filtered for false discovery rate (FDR) <= 0.01, enrichment score >= 1.0, and minimum count in gene set of 10. Gene sets were filtered for those enriched at the top of the input, which represent the gene sets associated with high attention scores. Further, gene sets with similarity (based on Jaccard index) >= 0.7 were merged (only the most significant of each group of similar clusters was displayed). This resulted in a filtered list of 52 enriched STRING local network clusters.
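The filtering and merging step can be sketched roughly as follows. The thresholds are the ones stated above, but the greedy "keep the most significant, drop anything too similar to it" strategy is an assumption for illustration; how STRING actually merges similar gene sets is not described in the post. The dictionary keys and toy gene names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def filter_and_merge(gene_sets, max_fdr=0.01, min_score=1.0,
                     min_count=10, max_similarity=0.7):
    """gene_sets: list of dicts with keys 'id', 'fdr', 'score', 'genes'."""
    kept = [
        gs for gs in gene_sets
        if gs["fdr"] <= max_fdr
        and gs["score"] >= min_score
        and len(gs["genes"]) >= min_count
    ]
    kept.sort(key=lambda gs: gs["fdr"])  # most significant first
    representatives = []
    for gs in kept:
        # Keep a set only if it is not too similar to one already kept.
        if all(jaccard(gs["genes"], rep["genes"]) < max_similarity
               for rep in representatives):
            representatives.append(gs)
    return representatives


# Toy data: B overlaps A heavily, D fails the FDR cut, E is too small.
g = [f"g{i}" for i in range(11)]
h = [f"h{i}" for i in range(10)]
toy_sets = [
    {"id": "A", "fdr": 1e-5, "score": 2.0, "genes": g[:10]},
    {"id": "B", "fdr": 1e-3, "score": 1.5, "genes": g},
    {"id": "C", "fdr": 5e-3, "score": 1.2, "genes": h},
    {"id": "D", "fdr": 5e-2, "score": 3.0, "genes": h},
    {"id": "E", "fdr": 1e-6, "score": 3.0, "genes": h[:5]},
]
kept_ids = [gs["id"] for gs in filter_and_merge(toy_sets)]
print(kept_ids)
```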

    The most enriched of these clusters was CL:22984, with an enrichment score of ~3.43 and an FDR of 7.50e-11, and which is described by STRING as representing “Neurexins and neuroligins”. (3)

    GSEA on ME/CFS cases from Genebass
    Using the biomaRt R library, the protein identifiers from the STRING clusters were converted to HGNC symbols to match with the data on ME/CFS from Genebass. Of 740 unique proteins, 7 did not automatically convert, and their corresponding HGNC symbols were looked up manually using the STRING web interface. The clusters were then written to a GMT file to act as gene sets for GSEA.

    From the Genebass web interface, summary statistics were downloaded for predicted loss of function (pLoF) variants for chronic fatigue syndrome.(4) An RNK file containing all of the included genes and their corresponding SKATO p-values was created. Finally, GSEA v4.4.0 for Linux was used to run preranked GSEA with the created RNK and GMT files. Minimum gene set size was set to 10.
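For readers unfamiliar with the two input files: an RNK file is a two-column, tab-delimited list of gene and ranking score, and a GMT file has one gene set per line (name, description, then member genes, all tab-delimited). The sketch below builds both as text. One assumption to flag loudly: preranked GSEA sorts by the score, so the sketch converts p-values to -log10(p) to put the smallest p-values at the top. The post does not say whether the p-values were transformed, and the gene names, p-values, and function names here are placeholders.

```python
import math


def rnk_text(gene_pvalues):
    """Return RNK-formatted text: 'gene<TAB>score', highest score first.

    Assumption: score = -log10(p), so small p-values rank at the top.
    """
    rows = []
    for gene, p in gene_pvalues.items():
        if p is None or p <= 0:
            continue  # missing or non-positive p-values cannot be log-transformed
        rows.append((gene, -math.log10(p)))
    rows.sort(key=lambda row: row[1], reverse=True)
    return "".join(f"{gene}\t{score:.6g}\n" for gene, score in rows)


def gmt_text(gene_sets):
    """Return GMT-formatted text: 'name<TAB>description<TAB>gene1<TAB>gene2...'."""
    lines = []
    for name, (description, genes) in gene_sets.items():
        lines.append("\t".join([name, description, *genes]))
    return "\n".join(lines) + "\n"


# Placeholder values, not taken from Genebass or STRING.
rnk = rnk_text({"GENE_A": 0.001, "GENE_B": 0.5, "GENE_C": 0.01, "GENE_D": None})
gmt = gmt_text({"CLUSTER_1": ("toy cluster", ["GENE_A", "GENE_C"])})
print(rnk)
print(gmt)
```

Writing either string to disk with `open(path, "w")` gives files in the layout the GSEA desktop application expects for a preranked run.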

    50 of 52 gene sets met the size threshold and were analyzed for enrichment. The GSEA report is accessible online.(5) Two gene sets met an FDR threshold of 0.25, though it should be noted that these gene sets are small (16 and 10 genes, corresponding to the following two clusters respectively) and they had very few leading edge genes (2 and 3 respectively).

    * CL:23065, Ionotropic glutamate receptor, and Neurotransmitter receptor transport, postsynaptic endosome to lysosome (Normalized Enrichment Score = 1.39, FDR q-val = 0.220)
    * CL:6643, PR-DUB complex, and Methyl-CpG-binding domain protein (NES=1.31, q=0.229)
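To make the enrichment scores and "leading edge genes" above more concrete, here is a simplified sketch of the weighted running-sum statistic that GSEA-style methods use. It is not the GSEA implementation: it computes only the unnormalized enrichment score, and it omits the permutation step that produces the NES and FDR q-values. It also assumes each gene appears once in the ranked list. All gene names and scores are placeholders.

```python
def enrichment_score(ranked, gene_set, weight=1.0):
    """ranked: list of (gene, score) pairs sorted by score, highest first.

    Walk down the list, stepping up at genes in the set (in proportion to
    |score| ** weight) and down at genes outside it. The enrichment score
    is the largest deviation of the running sum from zero.
    Returns (enrichment_score, leading_edge_genes).
    """
    members = set(gene_set)
    hit_weights = [abs(s) ** weight for g, s in ranked if g in members]
    n_miss = len(ranked) - len(hit_weights)
    total_hit_weight = sum(hit_weights)
    if not hit_weights or n_miss == 0 or total_hit_weight == 0:
        return 0.0, []

    running, best, best_idx = 0.0, 0.0, -1
    for i, (gene, score) in enumerate(ranked):
        if gene in members:
            running += abs(score) ** weight / total_hit_weight
        else:
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best, best_idx = running, i

    if best >= 0:
        # Positive score: leading edge is the set members up to the peak.
        leading = [g for g, _ in ranked[:best_idx + 1] if g in members]
    else:
        # Negative score: leading edge is the set members after the trough.
        leading = [g for g, _ in ranked[best_idx:] if g in members]
    return best, leading


toy_ranked = [("A", 5.0), ("B", 4.0), ("C", 3.0),
              ("D", 2.0), ("E", 1.0), ("F", 0.5)]
top_es, top_edge = enrichment_score(toy_ranked, {"A", "B"})
bottom_es, bottom_edge = enrichment_score(toy_ranked, {"E", "F"})
print(top_es, top_edge)
print(bottom_es, bottom_edge)
```

A small gene set with only two or three leading-edge genes, as in the two clusters above, means the score is driven by very few highly ranked members.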

    --------

    1. Zhang S, Jahanbani F, Chander V, Kjellberg M, Liu M, Glass KA, et al. Dissecting the genetic complexity of myalgic encephalomyelitis/chronic fatigue syndrome via deep learning-powered genome analysis [Internet]. medRxiv; 2025 [cited 2025 May 23]. p. 2025.04.15.25325899. Available from: https://www.medrxiv.org/content/10.1101/2025.04.15.25325899v2
    2. Enrichment Results - STRING database [Internet]. [cited 2025 May 23]. Available from: https://version-12-0.string-db.org/cgi/globalenrichment?networkId=baSYcIb9YQhy
    3. CL:22984 - STRING interaction network [Internet]. [cited 2025 May 23]. Available from: https://version-12-0.string-db.org/cgi/network?networkId=bd4ZcN9E0DnX
    4. Genebass - Chronic fatigue syndrome [Internet]. [cited 2025 May 23]. Available from: https://app.genebass.org/gene/undef...?resultIndex=gene-manhattan&resultLayout=full
    5. Genebass GSEA report [Internet]. [cited 2025 May 23]. Available from: https://glittery-tarsier-413b2d.netlify.app/gsea/cfs_string_clusters.gseapreranked.1748034397906/

    Edit: Added links

    Edit: I noticed now that the GSEA user guide says if you are doing gene_set permutation (which is what I did with preranked GSEA) as opposed to phenotype permutation (which can only be done if you have expression data of genes for individual samples), the FDR threshold should be 0.05, not 0.25. So basically a null finding here. Also, this analysis is not robust as I'm just learning the tools, so don't take these results at face value. But just wanted to share in case any methods or findings are useful to someone.
     
    Last edited: May 24, 2025
  15. forestglip

    forestglip Senior Member (Voting Rights)

    Messages:
    2,462
    One thing I didn't mention in the above post, out of an abundance of statistical caution: When I first ran the GSEA on the Genebass dataset with the 52 clusters that were returned by STRING for the HEAL2 genes, I noticed that only 33 gene sets were tested in the analysis. I remembered that the GSEA software sets a default minimum gene set size of 15 genes for this reason:
    I was curious if any of the smaller gene sets would be significant though and re-ran it with a minimum of 10 genes so that 50 clusters would be tested. Everything came out less significant, and those are the results I reported.
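A rough sketch of why the number of tested gene sets changes with the minimum size setting: set size is counted after restricting each set to genes that are present in the ranked list, and sets outside the size limits are skipped. The function name and toy data are hypothetical, and the maximum of 500 is included as an assumed default.

```python
def sets_passing_size_filter(gene_sets, ranked_genes, min_size=15, max_size=500):
    """Keep gene sets whose overlap with the ranked list is within the limits.

    gene_sets:    dict of set name -> list of gene symbols
    ranked_genes: iterable of gene symbols present in the ranked list
    """
    universe = set(ranked_genes)
    passing = {}
    for name, genes in gene_sets.items():
        present = set(genes) & universe
        if min_size <= len(present) <= max_size:
            passing[name] = sorted(present)
    return passing


# Toy data: "mid" has 12 members, but only 10 are in the ranked list.
toy_gene_sets = {
    "big": [f"g{i}" for i in range(16)],
    "mid": [f"h{i}" for i in range(12)],
    "small": ["x1", "x2"],
}
toy_universe = [f"g{i}" for i in range(16)] + [f"h{i}" for i in range(10)]

tested_at_15 = sorted(sets_passing_size_filter(toy_gene_sets, toy_universe, min_size=15))
tested_at_10 = sorted(sets_passing_size_filter(toy_gene_sets, toy_universe, min_size=10))
print(tested_at_15, tested_at_10)
```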

    But on the first run with the recommended gene set size settings, this one had an FDR of 0.056: "Ionotropic glutamate receptor, and Neurotransmitter receptor transport, postsynaptic endosome to lysosome". Though only two genes in this gene set of size 16 are highly ranked in the Genebass CFS data of 18006 genes: CACNG5 (rank 39) and GRIN1 (229). (For some reason the rankings in the reports are off by one.)

    Out of curiosity, I did GSEA with these same clusters of genes from HEAL2 with seven other phenotypes in the Genebass data. They're listed with links to their reports here, and were basically chosen at random from the phenotypes I tested earlier with just p values for genes. Only one of these seven had one or more gene sets with a comparable FDR to CFS: "Radiology of one body area (for < 20 minutes)" (a couple of gene sets around q=0.06). Depression had one gene set at q=0.158, and the rest didn't have any below an FDR of 0.2.
     
    Last edited: May 27, 2025
  16. Hoopoe

    Hoopoe Senior Member (Voting Rights)

    Messages:
    5,497
    If there is proteasome dysfunction, why has it not been observed in tissue samples? Should it not lead to accumulation of proteins that can, eventually, be clearly seen?
     
    hotblack, bobbler and Kitty like this.
  17. jnmaciuch

    jnmaciuch Senior Member (Voting Rights)

    Messages:
    862
    Location:
    USA
    If the proteasome is non-functional (or even severely dysfunctional), absolutely. Though you'd probably also present with much more serious and immediate health issues as well.

    However, at a milder scale, accumulation of misfolded proteins happens in all sorts of situations, and activating different compensatory pathways to prevent this from killing the cell is a very ancient (and well studied) homeostatic response. So having less efficient proteasome function could still result in downstream problems if particular proteins are not being cleared out to the extent they ought to be. But it would still take a lot to get to the point of killing the cell or seeing truly abnormal amounts of misfolded proteins in tissue samples.
     
    hotblack and Hoopoe like this.
