Preprint: Identification of Novel Reproducible Combinatorial Genetic Risk Factors for [ME] in [DecodeME Cohort] and Commonalities with [LC], 2025, Sardell+

I suppose their combinatorial analysis can be useful to get new or clearer findings, but in this case, it seems to have made things more complicated and muddled.

Their disease signatures map to 2,311 genes, while humans only have approximately 20,000–25,000 protein-coding genes.
 
I am still working on connecting the list of genes from this paper with previous research efforts. Some new findings:


NLGN1: Appears in the Snyder study (HEAL2)

CYP7B1: The node for this gene appears in the network analysis (2017), towards the centre bottom:

network_clean.jpeg


CH25H: Also identified by previous work (see also what I mention below regarding ubiquitination, Snyder et al.):

Screenshot 2025-12-05 at 15.25.01.png

Source : https://www.healthrising.org/blog/2023/10/21/ai-driven-chronic-fatigue-syndrome-clues/

UGGT1: A gene directly linked to N-linked glycosylation. I believe we will be seeing more of N-linked glycosylation in the (hopefully near) future.

I also believe that Glutamate excitotoxicity is something that needs to be looked at for sure.
 
So I entered the core genes identified by the study into the DAVID tool, and I am providing the results of a pathway analysis.
First, the KEGG pathway analysis. Of particular interest are the Glutamatergic synapse and GABAergic synapse entries. cc @ME/CFS Science Blog


KEGG Pathway.png
Next, Reactome:

Of interest: NR1H3 (= LXRa, tagging @MelbME), transmission across chemical synapses, NR1H2 (= the LXRb receptor), netrin-mediated repulsion signals, and O-linked glycosylation.


reactome pathway.png


Finally here is a functional annotation set from DAVID :


(Cell?) membrane appears at the top. Note the Polar residues entry (I believe directly related to amino acids) and my personal "favorite", which was also identified in the paper by Xiao et al.: L-Asparagine (a key part of N-linked glycosylation) and the potential role of glycoproteins in general.

functional_annotations_DAVID.png
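For anyone wanting to reproduce this outside DAVID's web interface: pathway enrichment of this kind boils down to an over-representation test. Here is a minimal sketch of the statistic using scipy's hypergeometric distribution (all gene and pathway counts below are made up for illustration, not taken from the actual DAVID output):

```python
from scipy.stats import hypergeom

def enrichment_p(study_hits, study_size, pathway_size, background_size):
    """One-sided over-representation p-value:
    P(X >= study_hits) for X ~ Hypergeom(background, pathway, study)."""
    return hypergeom.sf(study_hits - 1, background_size, pathway_size, study_size)

# Toy numbers (hypothetical): a 259-gene core list, a 20,000-gene background,
# a 100-gene pathway, 8 of whose members appear in our list.
p = enrichment_p(study_hits=8, study_size=259, pathway_size=100,
                 background_size=20000)
print(f"p = {p:.3g}")
```

With those toy counts only about 1.3 overlapping genes would be expected by chance, so 8 hits gives a very small p-value; DAVID additionally applies multiple-testing corrections (e.g. Benjamini-Hochberg) across all pathways tested.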

EDIT : Removed entries related to FXR which are not part of the shown results
 
Last edited:
(Paywall)

AI Summary:

Chronic fatigue syndrome seems to have a very strong genetic element

The largest study so far into the genetics of chronic fatigue syndrome, or myalgic encephalomyelitis, has implicated 259 genes – six times more than those identified just four months ago


The largest genetic study to date suggests that chronic fatigue syndrome, also known as myalgic encephalomyelitis (ME/CFS), has a strong genetic component. Researchers identified links to more than 250 genes, six times more than were reported just four months earlier. The findings may help explain why some people develop ME/CFS after infection while others do not, and may support future treatment development.

ME/CFS is a chronic and often disabling condition. A core symptom is post-exertional malaise, in which even small amounts of activity lead to prolonged exhaustion. Although infections often trigger the illness, its underlying causes remain unclear.

The study analysed genomic data from over 10,500 people diagnosed with ME/CFS, comparing it with data from people without the condition in the UK Biobank. Instead of examining single genetic variants (single nucleotide polymorphisms, or SNPs) one at a time, the researchers looked at combinations of interacting variants. They identified more than 22,000 such combinations associated with ME/CFS risk, and found that carrying more of them increased a person's likelihood of developing the condition.

The variants were mapped to 2,311 genes, of which 259 showed the strongest and most common links to ME/CFS. This represents a substantial increase compared with earlier studies and supports previously identified genomic regions.

The researchers also compared the genetic findings with those from long covid studies. About 42 per cent of genes linked to long covid overlapped with those linked to ME/CFS, suggesting the two conditions partially overlap genetically, though differences in analysis limit firm conclusions.
 
How are people feeling about this study now the dust has settled?

From my perspective it seems like we are no closer to understanding PL's methodology.

I am interested in the (non Ampligen) repurposing opportunities, but more so in what these data can show us combined with other genetic data. Obviously it's all very far over my head. But do we think this is a useful study, or one that muddies that waters somewhat?
 
From my perspective it seems like we are no closer to understanding PL's methodology.
It seems to me that the important parts are laid out in their papers, just that it's somewhat of a complicated process. I don't have the energy to try to go through it and parse it, but maybe this summary of their method I had claude.ai write will be helpful.

This is from giving the AI the methods from the thread paper and from their 2022 paper.

Edit: Had an outline from Claude, but I think it made a mistake in describing the validation method, so here is ChatGPT's outline instead:
Step 1: Prepare the data (standard methods)
- Collect genetic data from people with ME and from healthy controls
- Remove low-quality samples and genetic markers
- Limit analysis to people with similar ancestry to avoid false signals
- Split the data into separate groups:
  - Discovery (to find signals)
  - Refinement (to check them)
  - Test (to confirm results)

Step 2: Search for genetic patterns (custom / proprietary)
- Look for small groups of genetic variants that tend to appear together in people with ME
- These groups can contain 1, 2, 3, or more variants
- The search is guided by rules that focus on patterns common in cases but rare in controls
- Only patterns with strong statistics and seen in enough people are kept
- This step uses a custom algorithm owned by PrecisionLife

Step 3: Check against random data (standard idea, custom implementation)
- Randomly shuffle who is labeled as “case” or “control”
- Repeat the same pattern-finding process many times
- See how often strong-looking patterns appear by chance
- Remove real-data patterns that look similar to those commonly found in random data
- Patterns are judged by strength and frequency, not by exact genetic makeup

Step 4: Initial disease signatures
- The remaining patterns are called “disease signatures”
- These are still only candidates and not yet trusted

Step 5: Test patterns in new groups (mostly standard methods)
- Check whether each signature also appears in a different group of people with ME
- Remove patterns that do not repeat
- Remove genetic variants that do not show consistent effects
- Remove patterns that do not add new information

Step 6: Final disease signatures
- Patterns that pass all previous checks
- These appear consistently across multiple independent groups

Step 7: Group related patterns (custom / proprietary)
- Combine overlapping patterns into networks
- Identify key genetic variants that appear in many patterns
- Measure how strongly each network is linked to ME
- This grouping logic is part of PrecisionLife’s platform

Step 8: Link genes and biology (standard methods)
- Match genetic variants to nearby genes
- Use public databases to learn what those genes do
- Look for shared biological processes (immune system, nerves, metabolism, etc.)

Step 9: Test the overall genetic signal (standard methods)
- Count how many final patterns each person has
- Test whether people with more patterns are more likely to have ME
- Confirm this in a group of people never used earlier

Step 10: Compare with other studies (standard methods)
- Compare results with traditional genetic studies (GWAS)
- Check overlap with genes linked to related conditions like long COVID
- Use results to suggest possible biological explanations
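The label-shuffling idea in Step 3 can be illustrated in miniature. This is not PrecisionLife's actual code, just a toy permutation test on simulated genotypes, with a deliberately simple "pattern strength" statistic:

```python
import numpy as np

rng = np.random.default_rng(0)

def pattern_score(genotypes, labels):
    """Toy 'pattern strength': case-frequency minus control-frequency
    of carrying every variant in the pattern (all columns == 1)."""
    carriers = genotypes.all(axis=1)
    return carriers[labels == 1].mean() - carriers[labels == 0].mean()

# Simulated data (hypothetical): 200 people, a 2-variant pattern
# planted in 40 of the 100 cases.
labels = np.array([1] * 100 + [0] * 100)
genotypes = rng.integers(0, 2, size=(200, 2))
genotypes[:40] = 1  # plant the pattern in 40 cases

observed = pattern_score(genotypes, labels)

# Null distribution: shuffle who counts as case/control, re-score.
null = np.array([pattern_score(genotypes, rng.permutation(labels))
                 for _ in range(1000)])
p_perm = (null >= observed).mean()
print(f"observed = {observed:.2f}, permutation p = {p_perm:.3f}")
```

The real pipeline evaluates many candidate patterns at once and compares whole distributions of scores against the shuffled runs, but the principle is the same: a pattern only survives if it looks much stronger in the real labels than in the shuffles.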

Edit: And this might still not be easy to understand. My goal was to make sure people don't think it's all just a black box, or that we have to trust that whatever they're doing behind the scenes is right. There might be some secret parts, like how they select which combinations to even test since it's impossible to test all of them, but that is more like a preparation step for the actual analysis that is described.
 
Last edited:
It seems to me that the important parts are laid out in their papers, just that it's somewhat of a complicated process. I don't have the energy to try to go through it and parse it, but maybe this summary of their method I had claude.ai write will be helpful.

This is from giving the AI the methods from the thread paper and from their 2022 paper.

Edit: Had an outline from Claude, but I think it made a mistake in describing the validation method, so here is ChatGPT's outline instead:


Edit: And this might still not be easy to understand. My goal was to make sure people don't think it's all just a black box, or that we have to trust that whatever they're doing behind the scenes is right. There might be some secret parts, like how they select which combinations to even test since it's impossible to test all of them, but that is more like a preparation step for the actual analysis that is described.
thank you, this is helpful
 
How are people feeling about this study now the dust has settled?

From my perspective it seems like we are no closer to understanding PL's methodology.

I am interested in the (non Ampligen) repurposing opportunities, but more so in what these data can show us combined with other genetic data. Obviously it's all very far over my head. But do we think this is a useful study, or one that muddies that waters somewhat?
My perspective as a student who has been studying bioinformatics for a couple years is that new methods for slicing and dicing various types of big data come out every month—some of them end up significantly outperforming other methods and becoming the new standard tool, but most of them end up forgotten. A few of them try to make money out of their tool and those usually aren’t the ones that become widely adopted.

Like @forestglip says this method isn’t a black box or magic. It’s comparable to a clever thesis project for someone doing a bioinformatics PhD.

If we were in a situation where we had several potentially efficacious treatments for ME/CFS and a lot of heterogeneity in responders/non-responders, I can see how a tool like this could eventually become useful if it was appropriately standardized and trained on different populations. At present, you can consider it one more study alongside DecodeME, Zhang et al. and others. If this method points in the same direction as other studies, it makes you more confident that the finding was not just an artifact of the algorithm’s particular method of slicing and dicing. But I am not sure if it gives us anything far and above what DecodeME already provided.
 
I was interested in understanding what the Precision Life data was telling us. Not in validating the data or their methods, but working on the assumption that it is valid and then just understanding what was in it, what the story behind it was. In write-ups and presentations we hear about the combinatorial approach, specific clusters being identified and potential drug targets, but I don't really understand what is being identified, or pointed towards. We've just got this big list of genes.

I wanted to know: do these 259 genes show any patterns of tissue expression or underlying mechanisms?

In short, I've now got Python scripts which help try to answer this by taking the data, looking at tissue expression (using GTEx v10 data), performing hierarchical clustering, and then running functional enrichment using the STRING and Enrichr databases.
 
A longer but hopefully readable explanation…

An approach that I have seen used is to look at the expression of genes across tissues, finding genes which are more highly expressed together in particular tissues; genes that group this way often seem to be linked mechanistically, or so I've read. So I took the gene list and data from GTEx, which records how much each gene is expressed in different tissues of the body, and merged them to get a grid, a matrix of this list of genes and how much those genes are expressed in different tissues of the human body.
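That merge step can be sketched in pandas. The table below is a hypothetical mini-version of the GTEx median-TPM file (the real file is a .gct with many more tissues and different column names), just to show the shape of the operation:

```python
import pandas as pd

# Hypothetical mini-version of a GTEx median-TPM table: one row per
# gene, one column per tissue (values invented for illustration).
gtex = pd.DataFrame({
    "gene":  ["NLGN1", "CYP7B1", "CH25H", "UGGT1", "ACTB"],
    "Brain": [12.0, 1.5, 0.2, 8.0, 900.0],
    "Liver": [0.1, 30.0, 5.0, 9.0, 800.0],
    "Lung":  [0.3, 0.8, 2.5, 7.5, 850.0],
})

core_genes = ["NLGN1", "CYP7B1", "UGGT1"]  # stand-in for the paper's list

# Keep only the study genes: a gene x tissue expression matrix.
matrix = (gtex[gtex["gene"].isin(core_genes)]
          .set_index("gene")
          .loc[core_genes])
print(matrix)
```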

Then I used something called Z-score normalisation. What this does is change the values in the matrix so that, rather than an absolute value of how much a gene is expressed in a tissue, each value is relative, but relative just for that gene. It looks at each gene and asks: 'In this specific tissue, is the gene expressed more or less than its average across other tissues?' It then calculates a Z-score, which is measured in standard deviations. So, if a gene has a Z-score of +2 in the liver, its expression there is two standard deviations above its own average. And it does that for every single gene in every tissue.
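That per-gene Z-scoring is essentially a one-liner on a gene × tissue table. A toy example (invented values):

```python
import pandas as pd

# Toy gene x tissue TPM matrix (values hypothetical).
matrix = pd.DataFrame(
    [[12.0, 0.1, 0.3],
     [1.5, 30.0, 0.8]],
    index=["NLGN1", "CYP7B1"],
    columns=["Brain", "Liver", "Lung"],
)

# Z-score each gene (row) across tissues: (x - row mean) / row std.
z = (matrix.sub(matrix.mean(axis=1), axis=0)
           .div(matrix.std(axis=1, ddof=0), axis=0))
print(z.round(2))
```

After this, every row has mean 0 and standard deviation 1, so a liver-specific gene and a brain-specific gene become comparable even if their absolute expression levels differ by orders of magnitude.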

So then you've got this matrix, and you can run a clustering analysis on it, which looks for patterns in the data and gives you clusters of genes that share a similar pattern of expression. For example, if genes ABC and DEF both have Z-scores around +2 in the liver and -2 in the lung, they might be put into the same cluster.
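The clustering step can be sketched with scipy's hierarchical clustering on a toy z-scored matrix (gene names and values hypothetical):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Z-scored expression for 4 genes across 3 tissues (toy values):
# the first two rows are "liver-high", the last two "brain-high".
z = np.array([
    [-0.7,  1.4, -0.7],   # GENE_A
    [-0.6,  1.3, -0.7],   # GENE_B
    [ 1.4, -0.7, -0.7],   # GENE_C
    [ 1.3, -0.6, -0.7],   # GENE_D
])

# Average-linkage clustering on Euclidean distances, cut into 2 clusters.
Z = linkage(z, method="average", metric="euclidean")
clusters = fcluster(Z, t=2, criterion="maxclust")
print(clusters)
```

The cut point (here `t=2`) is exactly the "how many clusters" judgement call: the dendrogram in `Z` encodes every possible split, and `fcluster` just picks one level to cut at.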

You can then use statistical methods (something called the silhouette value) to measure how coherent these clusters are: how 'matched' a gene is to the other genes in its cluster, and therefore how well-matched the cluster is overall. We can also visualise this with heatmaps of the Z-scores and by plotting silhouette values per cluster and gene. This gives you a way of telling whether you've split the genes into the right number of clusters. That decision isn't automatic but a judgement call, albeit one guided by the data.
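The silhouette check itself can be sketched with scikit-learn on a toy z-scored matrix (gene values hypothetical). Values near +1 mean a gene sits comfortably inside its cluster; values near 0 or below suggest it is on a boundary or misassigned:

```python
import numpy as np
from sklearn.metrics import silhouette_score, silhouette_samples

# Toy z-scored gene x tissue matrix: two tight, well-separated clusters.
z = np.array([
    [-0.7,  1.4, -0.7],
    [-0.6,  1.3, -0.7],
    [ 1.4, -0.7, -0.7],
    [ 1.3, -0.6, -0.7],
])
labels = np.array([1, 1, 2, 2])

overall = silhouette_score(z, labels)     # coherence of the clustering overall
per_gene = silhouette_samples(z, labels)  # how well each gene fits its cluster
print(round(overall, 2), per_gene.round(2))
```

In practice you would compute `overall` for several candidate numbers of clusters and compare, which is the data-guided part of the judgement call described above.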

One big caveat to all of this is that I've never done anything like this before. I didn't really understand what I was doing going into it and am only starting to now. I just started blindly throwing things around, reading bits, and using some of the agentic coding tools to help with scripts to explore the data. So I've basically been winging it and making it up as I go along. I've learned a lot and found it interesting, but my methodology, or my implementation of it, may not be sound. So please take it all with a suitable level of caution and skepticism.

But I thought it would be useful to share the methods and get people’s thoughts. If people think it worthwhile I’ll share the final steps of the process, some more data and results. The clustering itself seems interesting but the biological side particularly will need people with a better understanding of all this to interpret I think.

A bit more background reading, as well as a more technical methodology write-up (written by an LLM fed all the Python scripts used), is attached.




 

Attachments

Thanks Simon! In all honesty I don’t know. I’m hoping people with a background and better understanding in all this can tell me.

My scripts do allow analysis of mean or median tpm values and with or without z-score normalisation, so people could try different approaches if more appropriate.
I'd assume a log transformation before doing the standardization would be good so that the expression data isn't heavily skewed. https://www.researchgate.net/post/W...d-why-do-we-do-it-in-gene-expression-analysis

Edit: Spoke too soon, I looked at the methodology file and I think you do that:
Transformation: Prior to clustering, the aggregated expression matrix (N genes × M tissues) undergoes a log2(x + 1) transformation to stabilize variance across the dynamic range of expression levels
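For anyone unfamiliar, that transform is just:

```python
import numpy as np

tpm = np.array([0.0, 1.0, 7.0, 1023.0])  # raw TPM values (hypothetical)
log_tpm = np.log2(tpm + 1)               # +1 keeps zeros defined; log compresses the range
print(log_tpm)  # [ 0.  1.  3. 10.]
```

A gene expressed 1000x more than another ends up only ~10 units higher rather than 1000, so a handful of very highly expressed genes can't dominate the clustering distances.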
 