Preprint: Dissecting the genetic complexity of myalgic encephalomyelitis/chronic fatigue syndrome via deep learning-powered genome analysis, 2025, Zhang+

Does the study say how many individuals carried a risk variant in at least one of these 115 risk genes? I'm still concerned that the number of genes is much too high, given the evidence we have on heritability.

We know that some of the inherited risk acts through the genetic signals identified by DecodeME, and these are likely to be around 90% non-coding, so different from the coding variants here. So perhaps we can expect about 10% of the heritability to be accounted for by coding variants. Hence my interest in the number of individuals this study identified as having risk variants.
 
I do not understand the methodology of this study in detail, but my guess is that the 'number of genes' is purely a statistical power issue and need not be reflected in total heritability calculations.

In theory, with an infinite sample, we would find that if there are 40,000 human genes, maybe 25,000 have variants that make you very slightly more likely to have ME/CFS and 15,000 have variants that make you slightly less likely; in fact, with various rare variants, you may well have overlap.

The puzzle for me is how on earth you get statistical significance for 100 genes with rare-variant analysis in a sample this size. But the senior author is a well-recognised researcher in the field, as I understand it.
 
The way I always assumed it worked, though I could be wrong, is that the analysis finds relatively few harmful variants in the actual sample.

If they found, say, that participants 1 and 2 have an LoF variant in DLGAP1 and participants 3 and 4 have an LoF variant in DLGAP2, then the machine learning model will say: these proteins are too similar for this to be a coincidence, so let's prioritize all the DLGAP proteins as well as related proteins. That way, if a new sample comes along where people instead have LoF variants in DLGAP3, the model will detect it.

Again, this is just the idea I've been working with; I don't understand their methods well enough to know for certain.
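
To make that concrete, here's a toy sketch of the kind of similarity-based prioritisation I mean. This is my guess at the flavour of the approach, not their actual pipeline, and the embeddings are random stand-ins for what a protein language model would produce:

```python
# Toy sketch of similarity-based gene prioritisation (NOT the authors'
# actual method): genes whose embeddings resemble those of genes carrying
# observed LoF variants get boosted scores, so an unseen paralogue like
# DLGAP3 would be flagged in a new sample.
import numpy as np

rng = np.random.default_rng(0)
genes = ["DLGAP1", "DLGAP2", "DLGAP3", "DLGAP4", "TP53", "BRCA1"]

# Stand-in embeddings: the DLGAP paralogues share a common core vector,
# so they are mutually similar, mimicking what a protein language model
# might capture about related proteins.
family_core = rng.normal(size=64)
embeddings = {
    g: family_core + 0.3 * rng.normal(size=64) if g.startswith("DLGAP")
    else rng.normal(size=64)
    for g in genes
}

observed_lof = {"DLGAP1", "DLGAP2"}  # genes actually hit in the sample

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Score each gene by its best similarity to any gene with an observed hit:
# DLGAP3/DLGAP4 come out near the top despite having no variants here.
scores = {g: max(cosine(embeddings[g], embeddings[h]) for h in observed_lof)
          for g in genes}
for g, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{g:8s} {s:+.3f}")
```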
 
Yup, that's the gist. The thing they're testing is not whether each specific gene is individually associated, but the likelihood that disease is associated with gene A and with so many of the other genes that gene A is known to interact with. So the rare variants themselves are probably only showing up in a small handful of the participants, but the actual test being conducted has more statistical power, because what you're assessing is gene A and everything in its close network (compared to a random walk, I'm assuming).

The assumption is that if you have a bunch of very weak signals, but a group of them have all been experimentally linked to the same biological pathway, you can increase your confidence that, for the members of that group, the associations are actually real rather than random noise.
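
To illustrate the power gain, here's a toy permutation test along those lines. All the numbers are made up, and this is definitely not the paper's exact machinery, just the pooling idea:

```python
# Toy permutation test: individually weak per-gene signals become
# significant when pooled over a predefined interaction network and
# compared against random gene sets of the same size.
import numpy as np

rng = np.random.default_rng(1)
n_genes = 5000
pvals = rng.uniform(size=n_genes)     # null p-values for most genes
network = np.arange(20)               # indices of gene A's close network (assumed known)
pvals[network] = rng.uniform(0.02, 0.3, size=network.size)  # weak but consistent signals

# Pooled statistic over the network: sum of -log10(p).
observed = -np.log10(pvals[network]).sum()

# Null distribution: the same pooled statistic over random gene sets.
n_perm = 10_000
null = np.array([
    -np.log10(pvals[rng.choice(n_genes, size=network.size, replace=False)]).sum()
    for _ in range(n_perm)
])
p_network = (1 + (null >= observed).sum()) / (n_perm + 1)

# No single gene survives genome-wide multiple testing, but the network does.
print(f"best single-gene p in the network: {pvals[network].min():.3f}")
print(f"network-level permutation p: {p_network:.5f}")
```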
 
Thanks for the explanation. I would still like to know how many individuals among the 247 PwME had LoF variants in the identified genes, and to see how that compares with what we know about heritability.

Also, given the method, we would expect the implicated genes to show a degree of consistency, as the other genes that a given gene interacts with are presumably likely to affect similar sorts of things.
 
I previously wrote this about the supplementary table:
Of this last group of genes, two appear to have been errors in data entry, as the gene fields have dates instead of genes.

I just came across a 2020 article that explains it:

The Verge: 'Scientists rename human genes to stop Microsoft Excel from misreading them as dates'
so when a user inputs a gene’s alphanumeric symbol into a spreadsheet, like MARCH1 — short for “Membrane Associated Ring-CH-Type Finger 1” — Excel converts that into a date: 1-Mar.
One study from 2016 examined genetic data shared alongside 3,597 published papers and found that roughly one-fifth had been affected by Excel errors.

Sure enough, the two lines in the table that have a date instead of a gene are "1-Mar" and "2-Mar", for the genes MARCH1 and MARCH2. The official names were changed at some point to MARCHF1 and MARCHF2 specifically to prevent this issue.
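
For anyone checking their own copies of tables like this, something along these lines will flag and repair the casualties. The mapping here only covers the two symbols in this table; a fuller version would also need SEPT1-12, DEC1 and friends:

```python
# Flag Excel date-mangled gene symbols in a supplementary table and map
# the known casualties back to their current HGNC names. The mapping is
# deliberately minimal: only the two symbols found in this table.
import re
import pandas as pd

EXCEL_DATE_TO_GENE = {"1-Mar": "MARCHF1", "2-Mar": "MARCHF2"}
DATE_LIKE = re.compile(r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$")

def repair_gene_column(series: pd.Series) -> pd.Series:
    # Report anything date-shaped, then substitute the known repairs.
    flagged = series[series.astype(str).str.match(DATE_LIKE)]
    for idx, val in flagged.items():
        print(f"row {idx}: '{val}' looks like an Excel-converted gene symbol")
    return series.replace(EXCEL_DATE_TO_GENE)

df = pd.DataFrame({"gene": ["DLGAP1", "1-Mar", "2-Mar", "TP53"]})
df["gene"] = repair_gene_column(df["gene"])
print(df)
```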

Edit: Sent an email letting the authors know.
 
Here are the silhouette analysis charts of each cluster around the peaks: some reasonable-looking clusters which seem to be refined, but also others which undergo quite a lot of fragmentation. Not sure what to make of this or where to try the enrichment steps to be most informative (maybe k = 5, 11 and 21). Will have a think and am open to suggestions.

[Silhouette analysis charts for k = 4, 5 and 6 (median log2 z-score)]

[Silhouette analysis charts for k = 10, 11 and 12 (median log2 z-score)]

[Silhouette analysis charts for k = 21, 22 and 23 (median log2 z-score)]
 
Sure enough, the two lines in the table that have a date instead of a gene are "1-Mar" and "2-Mar", for the genes MARCH1 and MARCH2.
Good spot, but not very reassuring that they, and other researchers, didn't know how to appropriately alter the formatting of cells in Excel.
 
EDIT: and do the leftmost points correspond to M9 (degradation of ubiquitinated proteins) and M20 (synaptic function)?
They’re just clusters of genes with similar patterns of tissue expression. I haven’t dug into what their functions are yet.

In each chart, the bars on the left are the silhouette scores of every gene in every cluster, while the right-hand side shows the overall score for each cluster as a whole.
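
If anyone wants to poke at this themselves, it's essentially scikit-learn's silhouette machinery applied to the genes x tissues matrix. A minimal sketch, assuming (from the filenames) a median log2, z-scored expression matrix, with random data as a stand-in:

```python
# Sketch of the silhouette analysis behind these charts: cluster a
# genes x tissues matrix for a given k, then compute a silhouette score
# per gene (the left-hand bars) and a mean score per cluster plus an
# overall score (the right-hand side).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 54))  # stand-in for z-scored expression (genes x tissues)

for k in (4, 5, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    per_gene = silhouette_samples(X, labels)  # one score per gene
    for c in range(k):
        print(f"k={k} cluster {c}: mean silhouette {per_gene[labels == c].mean():+.3f}")
    print(f"k={k} overall: {silhouette_score(X, labels):+.3f}")
```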
 
Good spot, but not very reassuring that they, and other researchers, didn't know how to appropriately alter the formatting of cells in Excel.
Excel has a mind of its own when it comes to formatting, especially interpreting things as dates and removing leading zeroes. There's a reason Excel isn't supposed to be used as a database or for analyses.

I've seen this happen to seasoned developers at massive businesses, and I've prevented a couple of instances in my own projects.
 
Agreed. I suspect this was just viewed as an easier solution than having every bioinformatician change their workflow when reading Excel files into and out of their scripts (which we're often forced to use no matter how many times we introduce lab mates and PIs to the concept of saving as CSV).
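
When a pipeline does have to pass through Excel or CSV, the defensive habit that's saved me is reading everything back as strings and failing fast on anything date-shaped. Toy data inline here; in practice the CSV would be the supplementary file:

```python
# Defensive load: dtype=str stops pandas' own type inference (and keeps
# leading zeros intact), then a regex check catches gene symbols that
# were already coerced to dates upstream by Excel.
import io
import re
import pandas as pd

DATE_LIKE = re.compile(r"^\d{1,2}-[A-Za-z]{3}$")

csv_text = "gene,sample_id\nTP53,007\n2-Mar,012\n"  # stand-in for a real table
df = pd.read_csv(io.StringIO(csv_text), dtype=str)

bad = df["gene"].str.match(DATE_LIKE, na=False)
if bad.any():
    print("date-mangled gene symbols at rows:", list(df.index[bad]))
print(df)  # sample_id '007' survives with its leading zeros
```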
 