Preprint Identification of Novel Reproducible Combinatorial Genetic Risk Factors for [ME] in [DecodeME Cohort] and Commonalities with [LC], 2025, Sardell+

ME/CFS Science Blog · Dec 5, 2025

I suppose their combinatorial analysis can be useful to get new or clearer findings, but in this case, it seems to have made things more complicated and muddled.

Their disease signatures map to 2,311 genes, while humans only have approximately 20,000-25,000 protein-coding genes.

mariovitali · Dec 5, 2025

I am still working on connecting the list of genes from this paper with previous research efforts. Some new findings:

NLGN1 : Appears on the Snyder study (HEAL2)

CYP7B1 : The node of this gene appears on the network analysis (2017) - towards the center bottom :

CH25H : Also identified by previous work (see also what I mention about ubiquitination (Snyder et al.) below):

Source : https://www.healthrising.org/blog/2023/10/21/ai-driven-chronic-fatigue-syndrome-clues/

UGGT1 : A gene directly linked to N-Linked glycosylation. I believe we will be seeing more of N-Linked glycosylation in the (hopefully near) future.

I also believe that Glutamate excitotoxicity is something that needs to be looked at for sure.

Simon M · Dec 5, 2025

EndME said:
In the end they get 22,411 double-refined signatures.

EndME said:
Now due to quality control they could only use 10 569 of those cases

I'm not even beginning to keep up, but how is it possible to have more than twice as many double-refined signatures as cases?

mariovitali · Dec 5, 2025

Relevant thread on X. Note the association of CH25H, CYP7B1 (2 of core genes identified) with Epstein-Barr virus

https://threadreaderapp.com/thread/1997005812838076483.html

mariovitali · Dec 11, 2025

So I used the DAVID tool to enter the core genes identified by the study and I am providing the results of a Pathway analysis :
First, KEGG pathway analysis. Of particular interest are the Glutamatergic synapse and GABAergic synapse entries. cc @ME/CFS Science Blog

Next Reactome :

Of interest: NR1H3 (= LXRa, tagging @MelbME), NR1H2 (the LXRb receptor), Transmission across chemical synapses, Netrin-mediated repulsion signals and O-Linked glycosylation.

Finally here is a functional annotation set from DAVID :

(Cell?) membrane appears at the top. Note the Polar residues entry (which I believe relates directly to amino acids) and my personal "favorite", which was also identified in the paper by Xiao et al., related to L-Asparagine (a key part of N-Linked glycosylation) and the potential role of glycoproteins in general.

EDIT : Removed entries related to FXR which are not part of the shown results

Chandelier · Dec 17, 2025

Chronic fatigue syndrome seems to have a very strong genetic element

The largest study so far into the genetics of chronic fatigue syndrome, or myalgic encephalomyelitis, has implicated 259 genes – six times more than those identified just four months ago

www.newscientist.com

(Paywall)

AI Summary:

The largest genetic study to date suggests that chronic fatigue syndrome, also known as myalgic encephalomyelitis (ME/CFS), has a strong genetic component. Researchers identified links to more than 250 genes, six times more than were reported just four months earlier. The findings may help explain why some people develop ME/CFS after infection while others do not, and may support future treatment development.

ME/CFS is a chronic and often disabling condition. A core symptom is post-exertional malaise, in which even small amounts of activity lead to prolonged exhaustion. Although infections often trigger the illness, its underlying causes remain unclear.

The study analysed genomic data from over 10,500 people diagnosed with ME/CFS, comparing it with data from people without the condition in the UK Biobank. Instead of examining single genetic variants, the researchers looked at groups of interacting variants (single nucleotide polymorphisms, or SNPs). They identified more than 22,000 such groups associated with ME/CFS risk, and found that having more of these groups increased a person’s likelihood of developing the condition.

The variants were mapped to 2311 genes, of which 259 showed the strongest and most common links to ME/CFS. This represents a substantial increase compared with earlier studies and supports previously identified genomic regions.

The researchers also compared the genetic findings with those from long covid studies. About 42 per cent of genes linked to long covid overlapped with those linked to ME/CFS, suggesting the two conditions partially overlap genetically, though differences in analysis limit firm conclusions.

Andy · Dec 19, 2025

Chandelier said:
Chronic fatigue syndrome seems to have a very strong genetic element

The largest study so far into the genetics of chronic fatigue syndrome, or myalgic encephalomyelitis, has implicated 259 genes – six times more than those identified just four months ago

www.newscientist.com

(Paywall)

AI Summary:

Archived full version of the article

Andy · Dec 19, 2025

And the recording of the recent webinar

V.R.T. · Dec 19, 2025

How are people feeling about this study now the dust has settled?

From my perspective it seems like we are no closer to understanding PL's methodology.

I am interested in the (non Ampligen) repurposing opportunities, but more so in what these data can show us combined with other genetic data. Obviously it's all very far over my head. But do we think this is a useful study, or one that muddies the waters somewhat?

forestglip · Dec 19, 2025

V.R.T. said:
From my perspective it seems like we are no closer to understanding PL's methodology.

It seems to me that the important parts are laid out in their papers, just that it's somewhat of a complicated process. I don't have the energy to try to go through it and parse it, but maybe this summary of their method I had claude.ai write will be helpful.

This is from giving the AI the methods from the thread paper and from their 2022 paper.

Edit: Had an outline from Claude, but I think it made a mistake in describing the validation method, so here is ChatGPT's outline instead:

Step 1: Prepare the data (standard methods)
- Collect genetic data from people with ME and from healthy controls
- Remove low-quality samples and genetic markers
- Limit analysis to people with similar ancestry to avoid false signals
- Split the data into separate groups:
  - Discovery (to find signals)
  - Refinement (to check them)
  - Test (to confirm results)

Step 2: Search for genetic patterns (custom / proprietary)
- Look for small groups of genetic variants that tend to appear together in people with ME
- These groups can contain 1, 2, 3, or more variants
- The search is guided by rules that focus on patterns common in cases but rare in controls
- Only patterns with strong statistics and seen in enough people are kept
- This step uses a custom algorithm owned by PrecisionLife

Step 3: Check against random data (standard idea, custom implementation)
- Randomly shuffle who is labeled as “case” or “control”
- Repeat the same pattern-finding process many times
- See how often strong-looking patterns appear by chance
- Remove real-data patterns that look similar to those commonly found in random data
- Patterns are judged by strength and frequency, not by exact genetic makeup

Step 4: Initial disease signatures
- The remaining patterns are called “disease signatures”
- These are still only candidates and not yet trusted

Step 5: Test patterns in new groups (mostly standard methods)
- Check whether each signature also appears in a different group of people with ME
- Remove patterns that do not repeat
- Remove genetic variants that do not show consistent effects
- Remove patterns that do not add new information

Step 6: Final disease signatures
- Patterns that pass all previous checks
- These appear consistently across multiple independent groups

Step 7: Group related patterns (custom / proprietary)
- Combine overlapping patterns into networks
- Identify key genetic variants that appear in many patterns
- Measure how strongly each network is linked to ME
- This grouping logic is part of PrecisionLife’s platform

Step 8: Link genes and biology (standard methods)
- Match genetic variants to nearby genes
- Use public databases to learn what those genes do
- Look for shared biological processes (immune system, nerves, metabolism, etc.)

Step 9: Test the overall genetic signal (standard methods)
- Count how many final patterns each person has
- Test whether people with more patterns are more likely to have ME
- Confirm this in a group of people never used earlier

Step 10: Compare with other studies (standard methods)
- Compare results with traditional genetic studies (GWAS)
- Check overlap with genes linked to related conditions like long COVID
- Use results to suggest possible biological explanations
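
The label shuffling in Step 3 of the outline above is the general idea behind a permutation test. A minimal sketch, assuming a toy dataset and a simple "frequency in cases minus frequency in controls" score. The data, the score and the function names are illustrative assumptions, not PrecisionLife's actual algorithm:

```python
import random

def pattern_score(has_pattern, is_case):
    """Difference in pattern frequency between cases and controls."""
    cases = [p for p, c in zip(has_pattern, is_case) if c]
    controls = [p for p, c in zip(has_pattern, is_case) if not c]
    return sum(cases) / len(cases) - sum(controls) / len(controls)

def permutation_p_value(has_pattern, is_case, n_permutations=2000, seed=0):
    """Fraction of label shuffles that score at least as high as the real labels."""
    rng = random.Random(seed)
    observed = pattern_score(has_pattern, is_case)
    labels = list(is_case)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(labels)  # randomly reassign who is "case" or "control"
        if pattern_score(has_pattern, labels) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)

# Made-up data: 1 means the person carries the variant combination.
is_case = [True] * 50 + [False] * 50
has_pattern = [1] * 20 + [0] * 30 + [1] * 5 + [0] * 45

observed = pattern_score(has_pattern, is_case)
p_value = permutation_p_value(has_pattern, is_case)
print(f"observed difference = {observed:.2f}, permutation p = {p_value:.4f}")
```

A pattern that is rarely matched by shuffled data gets a small p-value. According to the outline, the real pipeline compares patterns by strength and frequency rather than testing one fixed pattern like this.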

Edit: And this might still not be easy to understand. My goal was to make sure people don't think it's all just a black box, or that we have to trust that whatever they're doing behind the scenes is right. There might be some secret parts, like how they select which combinations to even test since it's impossible to test all of them, but that is more like a preparation step for the actual analysis that is described.
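
To give a rough sense of why testing every combination is out of reach, here is a quick count. The figure of 500,000 variants is an assumed round number for illustration only, not the number used in the paper:

```python
from math import comb

n_variants = 500_000  # assumed round number, not taken from the paper

# Number of distinct ways to pick 1, 2, 3 or 4 variants from the full set.
for size in (1, 2, 3, 4):
    print(f"combinations of {size} variant(s): {comb(n_variants, size):.3e}")
```

The count grows so quickly with combination size that some kind of guided search is needed, whatever the details of that search are.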

V.R.T. · Dec 19, 2025

forestglip said:
It seems to me that the important parts are laid out in their papers, just that it's somewhat of a complicated process. I don't have the energy to try to go through it and parse it, but maybe this summary of their method I had claude.ai write will be helpful.

This is from giving the AI the methods from the thread paper and from their 2022 paper.

Edit: Had an outline from Claude, but I think it made a mistake in describing the validation method, so here is ChatGPT's outline instead:

Edit: And this might still not be easy to understand. My goal was to make sure people don't think it's all just a black box, or that we have to trust that whatever they're doing behind the scenes is right. There might be some secret parts, like how they select which combinations to even test since it's impossible to test all of them, but that is more like a preparation step for the actual analysis that is described.

thank you, this is helpful

jnmaciuch · Dec 19, 2025

V.R.T. said:
How are people feeling about this study now the dust has settled?

From my perspective it seems like we are no closer to understanding PL's methodology.

I am interested in the (non Ampligen) repurposing opportunities, but more so in what these data can show us combined with other genetic data. Obviously it's all very far over my head. But do we think this is a useful study, or one that muddies the waters somewhat?

My perspective as a student who has been studying bioinformatics for a couple years is that new methods for slicing and dicing various types of big data come out every month—some of them end up significantly outperforming other methods and becoming the new standard tool, but most of them end up forgotten. A few of them try to make money out of their tool and those usually aren’t the ones that become widely adopted.

As @forestglip says, this method isn’t a black box or magic. It’s comparable to a clever thesis project for someone doing a bioinformatics PhD.

If we were in a situation where we had several potentially efficacious treatments for ME/CFS and a lot of heterogeneity in responders/non-responders, I can see how a tool like this could eventually become useful if it was appropriately standardized and trained on different populations. At present, you can consider it one more study alongside DecodeME, Zhang et al. and others. If this method points in the same direction as other studies, it makes you more confident that the finding was not just an artifact of the algorithm’s particular method of slicing and dicing. But I am not sure if it gives us anything far and above what DecodeME already provided.

hotblack · Dec 23, 2025

I was interested in understanding what the Precision Life data was telling us. Not in validating the data or their methods, but working on the assumption that it is valid and then just understanding what was in it, what the story behind it was. In write-ups and presentations we hear about the combinatorial approach, specific clusters being identified and potential drug targets, but I don't really understand what is being identified, or pointed towards. We've just got this big list of genes.

I wanted to know, do these 259 genes show any patterns of tissue expression or underlying mechanisms?

In short, I’ve now got Python scripts which try to help answer this by taking the data, looking at tissue expression (using GTEx V10 data), performing hierarchical clustering, and then running functional enrichment using the STRING and Enrichr databases.

hotblack · Dec 23, 2025

A longer but hopefully readable explanation…

An approach that I have seen used is looking at expression of genes in tissues, finding genes which are more expressed together in particular tissues. Genes that are expressed together like this often seem to be linked in terms of mechanisms, or so I’ve read. So I took the gene list and took data from GTEx, which is about expression of genes in different tissues in the body, and merged that to get a grid, a matrix of this list of genes and how much those genes are expressed in different tissues in the human body.

Then I used something called Z-score normalisation. This changes the values in the matrix so that, rather than an absolute value for how much a gene is expressed in a tissue, each value is relative, but relative just for that gene. It looks at each gene and asks: 'In this specific tissue, is the gene expressed more or less than its average across all tissues?' It then calculates a Z-score, which is the distance from that average measured in standard deviations. So, if a gene has a Z-score of +2 in the liver, it means its expression there is two standard deviations above its own average. And it does that for every single gene in every tissue.
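
As a small illustration of per-gene Z-scores, here is a sketch using made-up expression values. The gene names, tissues, numbers and the choice of population standard deviation are all assumptions for the example, not taken from the actual scripts:

```python
from statistics import mean, pstdev

# Made-up expression values (e.g. TPM) for illustration only.
expression = {
    "GENE_A": {"liver": 50.0, "lung": 10.0, "brain": 10.0, "muscle": 10.0},
    "GENE_B": {"liver": 2.0, "lung": 2.0, "brain": 8.0, "muscle": 4.0},
}

def zscore_per_gene(row):
    """Rescale one gene's values to mean 0 and standard deviation 1 across tissues."""
    values = list(row.values())
    mu = mean(values)
    sigma = pstdev(values)  # population SD; sample SD is another common choice
    if sigma == 0:
        return {tissue: 0.0 for tissue in row}  # flat gene: no tissue stands out
    return {tissue: (value - mu) / sigma for tissue, value in row.items()}

z = {gene: zscore_per_gene(row) for gene, row in expression.items()}
for gene, row in z.items():
    print(gene, {tissue: round(score, 2) for tissue, score in row.items()})
```

After this step GENE_A and GENE_B are on the same scale even though their raw values differ a lot, which is what lets a clustering step compare their patterns rather than their absolute levels.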

So then you've got this matrix and then you can run clustering analysis on that, which looks for patterns in the data and gives you clusters of genes which share a similar pattern of expression. For example, if genes ABC and DEF both have Z-scores around +2 in the liver and -2 in the lung, they might be put into the same cluster.

You can then use statistical methods (something called silhouette value) to measure how coherent these clusters are, how ‘matched’ a gene is to other genes in that cluster and therefore how matched the cluster is overall. We can also visualise this with heatmaps of the z-scores and by plotting silhouette values per cluster and gene. This gives you a way of telling if you’ve split the genes into the right number of clusters. That decision isn’t automatic but a judgement call, albeit one guided by the data.
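
To make the silhouette value concrete, here is a hand-rolled sketch on made-up Z-score profiles with hand-assigned clusters. The genes, numbers, cluster assignments and use of Euclidean distance are assumptions for illustration. A library such as scikit-learn provides this calculation ready-made:

```python
from math import dist
from statistics import mean

# Made-up Z-score profiles (liver, lung, brain) and hand-assigned clusters.
profiles = {
    "GENE_A": (2.0, -1.0, -1.0),
    "GENE_B": (1.8, -0.8, -1.0),
    "GENE_C": (-1.0, -1.0, 2.0),
    "GENE_D": (-0.9, -1.1, 2.0),
    "GENE_E": (0.4, -0.2, 0.8),  # sits between the two groups
}
cluster_of = {"GENE_A": 0, "GENE_B": 0, "GENE_E": 0, "GENE_C": 1, "GENE_D": 1}

def silhouette(gene):
    """(b - a) / max(a, b): a = mean distance to own cluster, b = to nearest other cluster."""
    own = cluster_of[gene]
    same = [g for g in profiles if cluster_of[g] == own and g != gene]
    if not same:
        return 0.0  # usual convention for single-member clusters
    a = mean(dist(profiles[gene], profiles[g]) for g in same)
    b = min(
        mean(dist(profiles[gene], profiles[g]) for g in profiles if cluster_of[g] == other)
        for other in set(cluster_of.values()) - {own}
    )
    return (b - a) / max(a, b)

scores = {gene: silhouette(gene) for gene in profiles}
for gene, score in sorted(scores.items()):
    print(f"{gene}: {score:+.2f}")
print(f"mean silhouette: {mean(scores.values()):+.2f}")
```

Values near +1 mean a gene sits comfortably in its cluster. A value near zero or below, as for the deliberately misplaced GENE_E, suggests it matches another cluster at least as well.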

One big caveat to all of this is that I've never done anything like this before. I didn't really understand what I was doing going into it and am only starting to now. I just started sort of blindly throwing things around and reading bits and using some of the agentic coding tools to help with scripts to explore the data. So I've basically been winging it and making it up as I go along. I've learned a lot and I've found it interesting, but do take it all with a pinch of salt. My methodology or implementation of it may not be sound. So please take this with a suitable level of caution and skepticism.

But I thought it would be useful to share the methods and get people’s thoughts. If people think it worthwhile I’ll share the final steps of the process, some more data and results. The clustering itself seems interesting but the biological side particularly will need people with a better understanding of all this to interpret I think.

A bit more background reading as well as a more technical methodology write up (written by an LLM fed all the python scripts used) attached.

Gene co-expression network - Wikipedia

en.wikipedia.org

Hierarchical clustering - Wikipedia

en.wikipedia.org

AgglomerativeClustering

scikit-learn.org

Silhouette (clustering) - Wikipedia

en.wikipedia.org

Simon M · Dec 23, 2025

hotblack said:
A longer but hopefully readable explanation…

This is a really interesting approach, and I'm looking forward to seeing what you come up with.

Just one question: are gene expression levels normally distributed? I thought they weren't, and if not, are Z scores appropriate? Thought I'd toss that in, but I am seriously out of my depth.

hotblack · Dec 23, 2025

Simon M said:
are Z scores appropriate?

Thanks Simon! In all honesty I don’t know. I’m hoping people with a background and better understanding in all this can tell me.

My scripts do allow analysis of mean or median TPM values, with or without Z-score normalisation, so people could try different approaches if more appropriate.

forestglip · Dec 23, 2025

hotblack said:
Thanks Simon! In all honesty I don’t know. I’m hoping people with a background and better understanding in all this can tell me.

My scripts do allow analysis of mean or median TPM values, with or without Z-score normalisation, so people could try different approaches if more appropriate.

I'd assume a log transformation before doing the standardization would be good so that the expression data isn't heavily skewed. https://www.researchgate.net/post/W...d-why-do-we-do-it-in-gene-expression-analysis

Edit: Spoke too soon, I looked at the methodology file and I think you do that:

Transformation: Prior to clustering, the aggregated expression matrix ($N_{\text{genes}} \times M_{\text{tissues}}$) undergoes a $\log_2(x + 1)$ transformation to stabilize variance across the dynamic range of expression levels.
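
A tiny sketch of what that transformation does to skewed values. The TPM numbers are made up so the result is easy to read:

```python
from math import log2
from statistics import mean, median

# Made-up TPM values for one gene across tissues; one tissue dominates.
tpm = [0.0, 1.0, 3.0, 7.0, 1023.0]

# The +1 keeps zero expression defined, since log2(0) is not.
log_tpm = [log2(value + 1) for value in tpm]

raw_ratio = mean(tpm) / median(tpm)
log_ratio = mean(log_tpm) / median(log_tpm)
print("raw:       ", tpm, f"mean/median = {raw_ratio:.1f}")
print("log2(x+1): ", log_tpm, f"mean/median = {log_ratio:.1f}")
```

On the raw scale the single high tissue drags the mean far above the median. After the log transform the two are much closer, so a later Z-score step is less dominated by one extreme value.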

hotblack · Dec 25, 2025

Some charts of clustering as a precursor to some more explanation. These show the relationship between the number of clusters (k) and two statistical measures: the Sum of Squared Errors (SSE) and the Silhouette Coefficient. Both are for the same underlying data, one without z-score normalisation and one with.

hotblack · Dec 25, 2025

forestglip said:
Spoke too soon, I looked at the methodology file and I think you do that

Yeah. It’s one of the things that is an option in my scripts. There are quite a few options for exploring the data in different ways (mean or median, log transformation or not, z-score or not). This wasn’t clever design. It was the result of an iterative process of failures, of exploration and trying things, realising they were wrong, adding something else, repeat…

hotblack · Dec 26, 2025

A continuation of my above posts. I’ll try to explain a bit of the process I went through and the theory, then how I decided on the number of clusters to look at.

In clustering theory, there's both the silhouette (a measure of cohesion, or how similar a data point is to other data points in its cluster and how different it is from those in other clusters) and the elbow method, here using the Sum of Squared Errors (where you plot the within-cluster variation of data points and look for the point where adding more clusters no longer reduces variation by much).

That’s what these charts show in Silhouette Coefficient and SSE. So as I understand it, we should look for a point where the first measure is high and the second has started to level off, so you’re getting coherent clusters and diminishing returns for adding more clusters.
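
A minimal sketch of the elbow idea, using made-up 2-D points and a hand-rolled average-linkage agglomerative clustering to keep it dependency-free. The points and the linkage choice are assumptions for illustration, and the actual scripts may be configured differently:

```python
from math import dist
from statistics import mean

# Made-up 2-D points forming three tight groups (stand-ins for gene profiles).
points = [(0, 0), (0, 1), (1, 0), (10, 0), (10, 1), (11, 0), (0, 10), (0, 11), (1, 10)]

def average_linkage(a, b):
    """Mean pairwise distance between two clusters of point indices."""
    return mean(dist(points[i], points[j]) for i in a for j in b)

def agglomerate(k):
    """Merge the two closest clusters repeatedly until k clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        pairs = [(a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))]
        a, b = min(pairs, key=lambda ab: average_linkage(clusters[ab[0]], clusters[ab[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return clusters

def sse(clusters):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for members in clusters:
        centroid = tuple(mean(points[i][d] for i in members) for d in range(2))
        total += sum(dist(points[i], centroid) ** 2 for i in members)
    return total

sse_by_k = {k: sse(agglomerate(k)) for k in range(1, 6)}
for k, value in sse_by_k.items():
    print(f"k={k}: SSE={value:.2f}")
```

With these toy points the SSE falls steeply up to k=3 and only slightly after that, which is the "elbow". Real expression data rarely gives such a clean bend, hence the judgement call.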

The scripts also create heat maps which allow you to visually assess groups of genes, and show dendrograms (diagrams which illustrate the arrangement of the clusters), as well as plots of gene and cluster silhouette values.

Why cluster at all? The idea is if genes seem to have similar patterns of expression in different tissues, it's plausible that they're more likely to be involved in the same pathways than genes which have different patterns.

Why not cluster into 2 groups, that seems to have a high silhouette? I think this largely just separates things into brain and non-brain, which doesn’t tell us much. We need a greater number of clusters to tell us more.

So then the final step once you have these clusters, these different lists of related or potentially related or co-expressed genes, is to use some of the tools like STRING DB or Enrichr to identify what they may be doing.

These take lists of genes and query big databases and give you an idea of what protein-protein interactions, biological mechanisms, pathways or diseases these gene groups are involved in.

And as far as I know, an advantage of doing this with the clusters rather than the whole gene set is that you’re increasing specificity. Rather than saying, 'Okay, what processes are these 259 genes involved in?' You're saying, 'What processes are these five or 10 or 20 genes involved in?' So you can be much more specific and get a better and hopefully more statistically significant answer.

And these databases, these queries, one of the nice things is they return measures of statistical significance. So they return p-values and false discovery rates. So then the idea is for each cluster, we get a report of what genes are in that cluster and the results from these databases and the statistics for everything to aid interpretation.
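
For a feel of where such p-values and false discovery rates can come from, here is a generic over-representation sketch: a hypergeometric test followed by a Benjamini-Hochberg correction. The term names and counts are made up, and STRING and Enrichr each document their own exact statistics, so treat this as an illustration of the general idea only:

```python
from math import comb

def hypergeom_p(overlap, cluster_size, term_size, background):
    """P(at least `overlap` term genes in a random cluster of `cluster_size` genes)."""
    upper = min(cluster_size, term_size)
    favourable = sum(
        comb(term_size, k) * comb(background - term_size, cluster_size - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / comb(background, cluster_size)

def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values (false discovery rates) in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Made-up numbers: a 10-gene cluster tested against three hypothetical terms.
background = 20_000
cluster_size = 10
terms = {"term_X": (4, 100), "term_Y": (1, 500), "term_Z": (2, 50)}  # (overlap, term size)

p_values = [hypergeom_p(o, cluster_size, size, background) for o, size in terms.values()]
fdr = benjamini_hochberg(p_values)
for name, p, q in zip(terms, p_values, fdr):
    print(f"{name}: p = {p:.2e}, FDR = {q:.2e}")
```

The correction matters because each cluster is tested against many terms at once, so some small p-values are expected by chance alone.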

I generate a report at the end of this which includes the more statistically significant results from the queries. But there are also links to the complete data from the analysis for people who need it.

Some more background and references

Determining the number of clusters in a data set - Wikipedia

en.wikipedia.org

Elbow method (clustering) - Wikipedia

en.wikipedia.org

Dendrogram - Wikipedia

en.wikipedia.org

Gene Ontology overview

geneontology.org

https://maayanlab.cloud/Enrichr/help#background
