Preprint: Dissecting the genetic complexity of myalgic encephalomyelitis/chronic fatigue syndrome via deep learning-powered genome analysis, 2025, Zhang+

Does the study say how many individuals carried a risk variant in at least one of these 115 risk genes? I'm still concerned that the number of genes is much too high, given the evidence we have on heritability.

We know that some of the inherited risk acts through the genetic signals identified by DecodeME, and these are likely to be around 90% non-coding, so different from the coding variants here. So perhaps we can expect about 10% of the heritability to be accounted for by coding variants. Hence my interest in the number of individuals this study identified as having risk variants.
 
I do not understand the methodology of this study in detail, but my guess is that the 'number of genes' is purely a statistical power issue and need not be reflected in total heritability calculations.

In theory, with an infinite sample, we would find that if there are 40,000 human genes, maybe 25,000 have variants that make you very slightly more likely to have ME/CFS and 15,000 have variants that make you slightly less likely; in fact, with various rare variants, you may well have overlap.

The puzzle for me is how on earth you get statistical significance for 100 genes with rare-variant analysis in a sample this size. But the senior author is a well-recognised researcher in the field, as I understand it.
 
The way I always assumed it worked, though I could be wrong, is that the analysis finds relatively few harmful variants in the actual sample.

If they found, say, that participants 1 and 2 have an LoF variant in DLGAP1 and participants 3 and 4 have an LoF variant in DLGAP2, then the machine learning model will say: these proteins are too similar for this to be a coincidence, so let's prioritize all the DLGAP proteins as well as related proteins. That way, if a new sample comes along where people instead have LoF variants in DLGAP3, the model will detect it.

Again, this is just the idea I've been working with; I don't understand their methods well enough to know for certain.
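
To make that concrete, here's a toy sketch of the kind of similarity-based prioritisation I mean. This is my guess at the flavour of the approach, not their actual pipeline, and the embeddings are random stand-ins for what a protein language model would produce:

```python
# Toy sketch of similarity-based gene prioritisation (NOT the authors'
# actual method): genes whose embeddings resemble those of genes carrying
# observed LoF variants get boosted scores, so an unseen paralogue like
# DLGAP3 would be flagged in a new sample.
import numpy as np

rng = np.random.default_rng(0)
genes = ["DLGAP1", "DLGAP2", "DLGAP3", "DLGAP4", "TP53", "BRCA1"]

# Stand-in embeddings: the DLGAP paralogues share a common core vector,
# so they are mutually similar, mimicking what a protein language model
# might capture about related proteins.
family_core = rng.normal(size=64)
embeddings = {
    g: family_core + 0.3 * rng.normal(size=64) if g.startswith("DLGAP")
    else rng.normal(size=64)
    for g in genes
}

observed_lof = {"DLGAP1", "DLGAP2"}  # genes actually hit in the sample

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Score each gene by its best similarity to any gene with an observed hit:
# DLGAP3/DLGAP4 come out near the top despite having no variants here.
scores = {g: max(cosine(embeddings[g], embeddings[h]) for h in observed_lof)
          for g in genes}
for g, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{g:8s} {s:+.3f}")
```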
 
Yup, that's the gist. The thing they're testing is not whether each specific gene is individually associated, but the likelihood that disease is associated with gene A and with so many of the other genes that gene A is known to interact with. So the rare variants themselves are probably only showing up in a small handful of the participants, but the actual test being conducted has more statistical power, because what you're assessing is gene A and everything in its close network (compared to a random walk, I'm assuming).

The assumption is that if you have a bunch of very weak signals, but a group of them have all been experimentally linked to the same biological pathway, you can increase your confidence that, for the members of that group, the associations are actually real rather than random noise.
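
To illustrate the power gain, here's a toy permutation test along those lines. All the numbers are made up, and this is definitely not the paper's exact machinery, just the pooling idea:

```python
# Toy permutation test: individually weak per-gene signals become
# significant when pooled over a predefined interaction network and
# compared against random gene sets of the same size.
import numpy as np

rng = np.random.default_rng(1)
n_genes = 5000
pvals = rng.uniform(size=n_genes)     # null p-values for most genes
network = np.arange(20)               # indices of gene A's close network (assumed known)
pvals[network] = rng.uniform(0.02, 0.3, size=network.size)  # weak but consistent signals

# Pooled statistic over the network: sum of -log10(p).
observed = -np.log10(pvals[network]).sum()

# Null distribution: the same pooled statistic over random gene sets.
n_perm = 10_000
null = np.array([
    -np.log10(pvals[rng.choice(n_genes, size=network.size, replace=False)]).sum()
    for _ in range(n_perm)
])
p_network = (1 + (null >= observed).sum()) / (n_perm + 1)

# No single gene survives genome-wide multiple testing, but the network does.
print(f"best single-gene p in the network: {pvals[network].min():.3f}")
print(f"network-level permutation p: {p_network:.5f}")
```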
 
Thanks for the explanation. I would still like to know how many individuals among the 247 PwME had LoF variants in the identified genes, and to see how that compares with what we know about heritability.

Also, given the method, we would expect the implicated genes to show a degree of consistency, as the other genes that a given gene interacts with are presumably likely to affect similar sorts of things.
 
I previously wrote this about the supplementary table:
Of this last group of genes, two appear to have been errors in data entry, as the gene fields have dates instead of genes.

I just came across a 2020 article that explains it:

The Verge: 'Scientists rename human genes to stop Microsoft Excel from misreading them as dates'
so when a user inputs a gene’s alphanumeric symbol into a spreadsheet, like MARCH1 — short for “Membrane Associated Ring-CH-Type Finger 1” — Excel converts that into a date: 1-Mar.
One study from 2016 examined genetic data shared alongside 3,597 published papers and found that roughly one-fifth had been affected by Excel errors.

Sure enough, the two lines in the table that have a date instead of a gene are "1-Mar" and "2-Mar", for the genes MARCH1 and MARCH2. The official names were changed at some point to MARCHF1 and MARCHF2 specifically to prevent this issue.
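
For anyone checking their own copies of tables like this, something along these lines will flag and repair the casualties. The mapping here only covers the two symbols in this table; a fuller version would also need SEPT1-12, DEC1 and friends:

```python
# Flag Excel date-mangled gene symbols in a supplementary table and map
# the known casualties back to their current HGNC names. The mapping is
# deliberately minimal: only the two symbols found in this table.
import re
import pandas as pd

EXCEL_DATE_TO_GENE = {"1-Mar": "MARCHF1", "2-Mar": "MARCHF2"}
DATE_LIKE = re.compile(r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$")

def repair_gene_column(series: pd.Series) -> pd.Series:
    # Report anything date-shaped, then substitute the known repairs.
    flagged = series[series.astype(str).str.match(DATE_LIKE)]
    for idx, val in flagged.items():
        print(f"row {idx}: '{val}' looks like an Excel-converted gene symbol")
    return series.replace(EXCEL_DATE_TO_GENE)

df = pd.DataFrame({"gene": ["DLGAP1", "1-Mar", "2-Mar", "TP53"]})
df["gene"] = repair_gene_column(df["gene"])
print(df)
```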

Edit: Sent an email letting the authors know.
 
Here are the silhouette analysis charts of each cluster around the peaks: some reasonable-looking clusters which seem to be refined, but also others which undergo quite a lot of fragmentation. Not sure what to make of this or where to try the enrichment steps to be most informative (maybe k = 5, 11 and 21). Will have a think and am open to suggestions.

[Silhouette analysis charts for k = 4, 5 and 6 (median log2 z-score)]

[Silhouette analysis charts for k = 10, 11 and 12 (median log2 z-score)]

[Silhouette analysis charts for k = 21, 22 and 23 (median log2 z-score)]
 
Sure enough, the two lines in the table that have a date instead of a gene are "1-Mar" and "2-Mar", for the genes MARCH1 and MARCH2.
Good spot, but not very reassuring that they, and other researchers, didn't know how to appropriately alter the formatting of cells in Excel.
 
EDIT: and do the leftmost points correspond to M9 (degradation of ubiquitinated proteins) and M20 (synaptic function)?
They’re just clusters of genes with similar patterns of tissue expression. I haven’t dug into what their functions are yet.

In each chart, the bars on the left are the silhouette scores of every gene in every cluster, while the right-hand side shows the overall score for each cluster as a whole.
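
If anyone wants to poke at this themselves, it's essentially scikit-learn's silhouette machinery applied to the genes x tissues matrix. A minimal sketch, assuming (from the filenames) a median log2, z-scored expression matrix, with random data as a stand-in:

```python
# Sketch of the silhouette analysis behind these charts: cluster a
# genes x tissues matrix for a given k, then compute a silhouette score
# per gene (the left-hand bars) and a mean score per cluster plus an
# overall score (the right-hand side).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 54))  # stand-in for z-scored expression (genes x tissues)

for k in (4, 5, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    per_gene = silhouette_samples(X, labels)  # one score per gene
    for c in range(k):
        print(f"k={k} cluster {c}: mean silhouette {per_gene[labels == c].mean():+.3f}")
    print(f"k={k} overall: {silhouette_score(X, labels):+.3f}")
```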
 
Good spot, but not very reassuring that they, and other researchers, didn't know how to appropriately alter the formatting of cells in Excel.
Excel has a mind of its own when it comes to formatting, especially interpreting things as dates and removing leading zeroes. There's a reason Excel isn't supposed to be used as a database or for analyses.

I've seen this happen to seasoned developers at massive businesses, and I've prevented a couple of instances in my own projects.
 
Agreed. I suspect this was just viewed as an easier solution than having every bioinformatician change their workflow when reading Excel files into and out of their scripts (which we're often forced to use no matter how many times we introduce lab mates and PIs to the concept of saving as CSV).
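
When a pipeline does have to pass through Excel or CSV, the defensive habit that's saved me is reading everything back as strings and failing fast on anything date-shaped. Toy data inline here; in practice the CSV would be the supplementary file:

```python
# Defensive load: dtype=str stops pandas' own type inference (and keeps
# leading zeros intact), then a regex check catches gene symbols that
# were already coerced to dates upstream by Excel.
import io
import re
import pandas as pd

DATE_LIKE = re.compile(r"^\d{1,2}-[A-Za-z]{3}$")

csv_text = "gene,sample_id\nTP53,007\n2-Mar,012\n"  # stand-in for a real table
df = pd.read_csv(io.StringIO(csv_text), dtype=str)

bad = df["gene"].str.match(DATE_LIKE, na=False)
if bad.any():
    print("date-mangled gene symbols at rows:", list(df.index[bad]))
print(df)  # sample_id '007' survives with its leading zeros
```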
 