Preprint Identification of Novel Reproducible Combinatorial Genetic Risk Factors for [ME] in [DecodeME Cohort] and Commonalities with [LC], 2025, Sardell+

Apologies for the long posts but it seemed important to outline the process and I’ve been hesitant to share any biological data until the context had been explained in case it was all rubbish!

Some configuration details and then the results:

The source data from GTEx (I use v10 for this analysis) has TPM values (transcripts per million) for different genes and tissue samples. After processing, this gives the mean and median TPM values and the number of samples the data comes from.

I started looking at mean values but read it’s better to use median to avoid skewing from outliers in the data. So that’s what this analysis uses.
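As a rough illustration of why the median is less sensitive to outliers than the mean, here is a minimal Python sketch. The TPM values are made up for the example (they are not GTEx data), and `summarise_tpm` is a hypothetical helper name, not something from the repository.

```python
from statistics import mean, median

# Hypothetical TPM values for one gene in one tissue (not real GTEx data).
# The last sample is an extreme outlier.
tpm_samples = [4.0, 5.0, 5.5, 6.0, 250.0]


def summarise_tpm(values):
    """Return the mean, median and sample count for a list of TPM values."""
    return {"mean": mean(values), "median": median(values), "n": len(values)}


summary = summarise_tpm(tpm_samples)
print(summary)
```

A single outlier sample pulls the mean far above the typical values, while the median stays with the bulk of the samples.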

Similarly I moved to using z-scores to look at relative rather than absolute values for tissue expression to find genes which are likely working together or co-expressed.
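To show what a z-score does here, the sketch below standardises each gene's per-tissue values as (value - mean) / standard deviation. It is only an illustration under assumptions: the tissue names and numbers are invented, the population standard deviation is used, and no log transform is applied. The actual pipeline may differ on all of these.

```python
from statistics import mean, pstdev


def zscore_across_tissues(median_tpm):
    """Z-score one gene's per-tissue values: (x - mean) / standard deviation."""
    values = list(median_tpm.values())
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        # A gene with identical expression everywhere has no relative pattern.
        return {tissue: 0.0 for tissue in median_tpm}
    return {tissue: (x - mu) / sigma for tissue, x in median_tpm.items()}


# Hypothetical median TPM values: gene_b is expressed 100x higher than gene_a
# but has the same relative pattern across tissues.
gene_a = {"brain": 2.0, "blood": 4.0, "muscle": 6.0}
gene_b = {"brain": 200.0, "blood": 400.0, "muscle": 600.0}

z_a = zscore_across_tissues(gene_a)
z_b = zscore_across_tissues(gene_b)
print(z_a)
print(z_b)
```

Both genes end up with the same z-score profile despite very different absolute expression, which is why z-scores suit a search for genes with similar tissue patterns.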

I’m clustering with both 5 and 13 clusters as these both show up as statistically meaningful and I think this gives us different perspectives on the data. We have a broad view at 5 telling us about larger scale biological systems and then a narrower view at 13 giving us a closer look at possible mechanisms and a more granular view of relevant genes/gene sets. To me, there are potentially some really interesting stories to tell about some of these clusters and how they could relate to ME/CFS. And I think that's what the PrecisionLife data is trying to tell us.
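For anyone unfamiliar with how a cluster count can be judged "statistically meaningful", here is a small pure-Python sketch of the silhouette score, the measure behind the silhouette plots in the report. It is an illustration only: the points and labels are invented, Euclidean distance is assumed, and the clustering algorithm that produces the labels is left out because it is not specified here.

```python
from math import dist
from statistics import mean


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention: singleton clusters score 0
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's distance to itself is 0, so divide by n - 1)
        a = sum(dist(point, other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            mean(dist(point, other) for other in members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return mean(scores)


# Hypothetical 2-D points forming three tight groups.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (10, 1)]
labels_k3 = [0, 0, 1, 1, 2, 2]  # one label per group
labels_k2 = [0, 0, 0, 0, 1, 1]  # first two groups merged

score_k3 = silhouette_score(points, labels_k3)
score_k2 = silhouette_score(points, labels_k2)
print(score_k3, score_k2)
```

Scores range from -1 to 1, with higher values meaning points sit closer to their own cluster than to the nearest other one. Comparing the score across candidate cluster counts is one way values like 5 and 13 can stand out.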

There are more tweaks that could be made to this process and the presentation of the data, things like adding different stats and querying different gene set libraries available from STRING-db and Enrichr that may shed more light and be useful to add to the report. But I think I’ve taken this as far as I can.

What’s really needed is for people with a better understanding and experience in this field to take a look. My methods or implementation need checking, they may be flawed. And even if they’re valid, the biology needs careful interpretation.

The summary report shows a selection of the results from Enrichr and STRING-db based upon the cutoffs shown. Links to the full data are included. The silhouette analysis plots and individual cluster heat maps are probably the most visually interesting, while the global Enrichr and STRING-db terms have wider detail for biological interpretation.

Please treat this with caution and remember the caveats mentioned.

Maybe PrecisionLife already have better ways of explaining this. I think their approach uses data from these enrichment databases, so I’m less doing something new than peering into part of their process? It would be great if they could offer some insight and explain more.

Hopefully this piques someone’s interest though. My main hope is that what I’ve tried is (a) not all nonsense and (b) someone better versed in this sort of thing can take it up. All the code is in a git repository so others should be able to run/reproduce/modify as needed and are free and very welcome to do so.

Many thanks to @forestglip for proofreading, questions and corrections
Also to @jnmaciuch and @Simon M for encouragement
And to Claude, ChatGPT and mostly Gemini CLI for Python wrangling assistance
 
Did you use only the silhouette measure for optimal K? Did you run others as well (e.g. NbClust)?
Mainly silhouette, I also looked at SSE and the heat-maps visually. I've never heard of NbClust. Looks like an R package and I’m not at all familiar with R. Would be really interested to hear what you find using it though.
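For reference, SSE here means the sum of squared distances from each point to its cluster centroid. A minimal sketch, assuming Euclidean distance and using invented points and labels:

```python
from math import dist


def sse(points, labels):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        # Centroid: the coordinate-wise mean of the cluster's members.
        centroid = tuple(sum(coord) / len(members) for coord in zip(*members))
        total += sum(dist(p, centroid) ** 2 for p in members)
    return total


# Hypothetical points forming two well-separated pairs.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
sse_k1 = sse(points, [0, 0, 0, 0])  # everything in one cluster
sse_k2 = sse(points, [0, 0, 1, 1])  # one cluster per pair
print(sse_k1, sse_k2)
```

SSE generally keeps falling as more clusters are added, so it is usually read by looking for the point where the drop levels off rather than by taking the minimum.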

(Btw, for some reason half of your message didn’t show up for me, looks like it’s formatted as black text?)
 
@hotblack Thanks, I do see my message correctly. Unfortunately I do not have time to do this, but in my experience the silhouette coefficient usually gave good results in optimising K. It would be great if multiple methods were pointing to the same K obviously but...

So I went through the results. Again, this could be a coincidence, but I noticed the term "Heparan Sulfate Proteoglycan Metabolic Process" in the Enrichr annotation enrichment results. This came up so many times in my analyses (2017):

Note also how many concepts are shown which we have seen before : LXR, CD8+ T Cells,
https://algogenomics.blogspot.com/2017/09/sulfation-revisited-dhea-and-syndecans.html

When I was sick, I had a test that was looking at how well my pituitary reacts to a substance, due to subclinical hypothyroidism. For this they were giving me heparin. For two days I did not have any ME symptoms. Then after 2 days the problems began again. I will never forget it. It took me years to understand the connection, that there was something with heparin.

Note also that CA10 (DecodeME) is associated with Heparan Sulfate. We are definitely on the right track, I will look further into your results. Thank you for this work.
 
From the New Scientist article:

“In August, the researchers behind DecodeME also identified variants in eight regions of the genome, including the 43 genes that contribute to ME/CFS risk, but they were unable to replicate all of them in independent datasets. PrecisionLife, however, rediscovered all eight regions, supporting the idea of being true risk factors for the condition.”

Is that accurate? Have PrecisionLife replicated the DecodeME results using different datasets? I’ve not kept up but I thought they had only analysed the DecodeME data using different methods.
 
Is that accurate? Have PrecisionLife replicated the DecodeME results using different datasets
A couple of the same specific genes showed up in both. But that’s not surprising and not replication I don’t think. And it really depends what they mean by regions in the article.

After looking at it more, my feeling/understanding is that PrecisionLife’s results support the findings of DecodeME. They implicate some similar broad biological areas, i.e. neurological, immune and possibly some specific processes. It appears to me to be, at best, more crosses on the treasure map pointing to something.
 
I think they also relied on the UK BioBank, but it was mainly based on DecodeME, so I wouldn't call it replicating in independent datasets.
“We identified 22,411 double-refined signatures, comprised of 7,555 SNPs mapped to 2,311 genes, that are consistently associated with increased odds of ME in multiple DecodeME and UKB cohorts.”

Here's the bit about overlap with the 8 original DecodeME loci, which is based on SNPs in double refined signatures, which were created based on DecodeME.
“The double-refined signatures map to 13 of the 32 Tier 1 or 2 DecodeME GWAS study genes including genes located in 6 of the 8 GWAS loci (Table 4, Supplementary Table 8). The double-refined signatures also contain SNPs that map to the remaining 2 loci but are not located within protein-coding genes, potentially representing non-coding variants that affect expression of one or more Tier 1 or 2 genes. Of the Tier 1 or 2 GWAS genes, only OLFM4 was included among the 259 prioritized genes, along with DCC, which is linked to chronic pain.”
 
DCC is the one that came up recently in fibro, right? And OLFM4 seems like a significant one too. It's good we can be a little more confident about them.
 
Do you think it would be fair to say that both analyses are pointing to the same broad genetic loci then @forestglip ?
In the sense that they found significant SNPs in the same 8 areas, sure. But since it's not really new data, I'd say that the replication mainly adds confidence about their method being reliable (and thus potentially having more confidence in the other findings), rather than adding much confidence about these 8 specific loci being meaningful for ME/CFS.
 
I'd say that the replication mainly adds confidence about their method being reliable (and thus potentially having more confidence in the other findings), rather than adding much confidence about these 8 specific loci being meaningful for ME/CFS.
Thanks, nice way of looking at it.

I suppose that could strengthen confidence in their own replication attempts and crossovers with Long Covid in this and other studies (using All of Us and Sano GOLD data as well as UKB and DecodeME).
 
But since it's not really new data, I'd say that the replication mainly adds confidence about their method being reliable (and thus potentially having more confidence in the other findings), rather than adding much confidence about these 8 specific loci being meaningful for ME/CFS.
I thought more about it, and came to the conclusion that this might be an inaccurate interpretation. Maybe it does add some confidence about the loci - seeing them come up again with a totally different method.
 
Of the Tier 1 or 2 GWAS genes, only OLFM4 was included among the 259 prioritized genes, along with DCC, which is linked to chronic pain.
I think they named the wrong gene. OLFM4 isn't one of the 259 core genes in this paper's Extended Table 3, as far as I can tell. On the other hand, CSE1L is a DecodeME tier 1 gene that is one of this study's core genes.

And while DCC is one of the PrecisionLife core genes, it wasn't a tier 1 or 2 gene in DecodeME, but the sentence kind of implies it was. It was a gene at a less significant locus in DecodeME.

Will see if I can find a contact method to let them know. [Edit: sent]

For reference, mariovitali copied the 259 core PrecisionLife genes into an earlier post.
 