Preprint Identification of Novel Reproducible Combinatorial Genetic Risk Factors for [ME] in [DecodeME Cohort] and Commonalities with [LC], 2025, Sardell+

Apologies for the long posts but it seemed important to outline the process and I’ve been hesitant to share any biological data until the context had been explained in case it was all rubbish!

Some configuration details and then the results:

The source data from GTEx (I use v10 for this analysis) has TPM values (transcripts per million) for different genes and tissue samples. After processing, this gives the mean and median TPM values and the number of samples the data comes from.

I started looking at mean values but read it’s better to use median to avoid skewing from outliers in the data. So that’s what this analysis uses.
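To illustrate why the median is less sensitive to outliers, here is a minimal sketch using Python's standard library. The TPM numbers are made up for illustration and are not taken from GTEx or from the repository.

```python
from statistics import mean, median

# Hypothetical TPM values for one gene in one tissue across five samples.
# The last sample is an outlier. These numbers are invented for illustration.
tpm_samples = [4.0, 5.0, 5.0, 6.0, 80.0]

summary = {
    "mean_tpm": mean(tpm_samples),      # pulled upwards by the outlier
    "median_tpm": median(tpm_samples),  # stays with the bulk of the samples
    "n_samples": len(tpm_samples),
}
print(summary)
```

Here a single extreme sample drags the mean well away from where most samples sit, while the median barely notices it.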

Similarly I moved to using z-scores to look at relative rather than absolute values for tissue expression to find genes which are likely working together or co-expressed.
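As a rough sketch of the idea (not the code from the repository), a z-score per gene across tissues could be computed like this. The gene and tissue values are hypothetical, and the choice of population standard deviation with no log transform is an assumption made only for this example.

```python
from statistics import mean, pstdev

def zscore_across_tissues(median_tpm_by_tissue):
    """Convert one gene's median TPM per tissue into z-scores across tissues.

    Assumption for this sketch: population standard deviation, no log transform.
    """
    values = list(median_tpm_by_tissue.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Gene is expressed identically everywhere: no relative pattern to show.
        return {tissue: 0.0 for tissue in median_tpm_by_tissue}
    return {tissue: (v - mu) / sigma for tissue, v in median_tpm_by_tissue.items()}

# Two hypothetical genes with very different absolute expression levels
# but the same relative pattern across tissues.
gene_a = {"brain": 10.0, "muscle": 20.0, "liver": 30.0}
gene_b = {"brain": 1000.0, "muscle": 2000.0, "liver": 3000.0}

z_a = zscore_across_tissues(gene_a)
z_b = zscore_across_tissues(gene_b)
print(z_a)
print(z_b)
```

The two genes differ a hundredfold in absolute TPM but end up with matching z-score profiles, which is what lets genes with a similar tissue pattern group together.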

I’m clustering at both K=5 and K=13 as these both show up as statistically meaningful and I think this gives us different perspectives on the data. We have a broad view at 5 telling us about larger scale biological systems, and then a narrower view at 13 giving us a closer look at possible mechanisms and a more granular view of relevant genes/gene sets. To me, there are potentially some really interesting stories to tell about some of these clusters and how they could relate to ME/CFS. And I think that's what the PrecisionLife data is trying to tell us.
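For anyone unfamiliar with how a silhouette score compares different numbers of clusters, here is a small pure-Python sketch. It is not the code from the repository: the points and the two candidate labellings are invented, Euclidean distance is an assumption, and in practice a library routine such as scikit-learn's `silhouette_score` would normally be used on the real z-score profiles.

```python
from math import dist
from statistics import mean

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance.

    Needs at least two distinct labels. Singleton clusters score 0 by convention.
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the distance from the point to itself is 0, so it adds nothing).
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            mean(dist(point, q) for q in members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return mean(scores)

# Toy data: two tight, well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

labels_k2 = [0, 0, 0, 1, 1, 1]  # matches the natural grouping
labels_k3 = [0, 0, 1, 2, 2, 2]  # splits one tight group unnecessarily

score_k2 = silhouette_score(points, labels_k2)
score_k3 = silhouette_score(points, labels_k3)
print(f"K=2: {score_k2:.3f}  K=3: {score_k3:.3f}")
```

A higher average score means points sit closer to their own cluster than to the nearest other one, so values of K that stand out in the silhouette analysis are the ones worth a closer look.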

There are more tweaks that could be made to this process and the presentation of the data, things like adding different stats and querying the other gene set libraries available from STRING-db and Enrichr, which may shed more light and be useful to add to the report. But I think I’ve taken this as far as I can.

What’s really needed is for people with a better understanding and experience in this field to take a look. My methods or implementation need checking, they may be flawed. And even if they’re valid, the biology needs careful interpretation.

The summary report shows a selection of the results from Enrichr and STRING-db based upon the cutoffs shown. Links to the full data are included. The silhouette analysis plots and individual cluster heat maps are probably the most visually interesting, while the global Enrichr and STRING-db terms have wider detail for biological interpretation.

Please treat this with caution and remember the caveats mentioned.

Maybe PrecisionLife already have better ways of explaining this. I think their approach uses data from these enrichment databases, so I’m less doing something new as peering into part of their process? It would be great if they could shed some insight and explain more.

Hopefully this piques someone’s interest though. My main hope is that what I’ve tried is (a) not all nonsense and (b) someone better versed in this sort of thing can take it up. All the code is in a git repository so others should be able to run/reproduce/modify as needed and are free and very welcome to do so.

Many thanks to @forestglip for proof reading, questions and corrections
Also to @jnmaciuch and @Simon M for encouragement
And to Claude, ChatGPT and mostly Gemini CLI for python wrangling assistance
 
Did you use only the silhouette measure to find the optimal K? Did you run others as well (e.g. NbClust)?
Mainly silhouette, I also looked at SSE and the heat-maps visually. I've never heard of NbClust. Looks like an R package and I’m not at all familiar with R. Would be really interested to hear what you find using it though.

(Btw, for some reason half of your message didn’t show up for me, looks like it’s formatted as black text?)
 
@hotblack Thanks, I do see my message correctly. Unfortunately I do not have time to do this, but in my experience the silhouette coefficient usually gives good results when optimising K. It would obviously be great if multiple methods were pointing to the same K, but..

So I went through the results. Again, this could be a coincidence, but I noticed the term "Heparan Sulfate Proteoglycan Metabolic Process" in the Enrichr annotation enrichment results. This came up so many times in my analyses (2017):

Note also how many concepts are shown which we have seen before: LXR, CD8+ T Cells,
https://algogenomics.blogspot.com/2017/09/sulfation-revisited-dhea-and-syndecans.html

When I was sick, I had a test that looked at how well my pituitary reacts to a substance, because of subclinical hypothyroidism. For this they were giving me heparin. For two days I did not have any ME symptoms. Then after 2 days the problems began again. I will never forget it. It took me years to understand the connection, that there was something with heparin.

Note also that CA10 (DecodeME) is associated with Heparan Sulfate. We are definitely on the right track; I will look further into your results. Thank you for this work.
 