hotblack
Senior Member (Voting Rights)
Apologies for the long posts, but it seemed important to outline the process, and I’ve been hesitant to share any biological data until the context had been explained, in case it was all rubbish!
Some configuration details and then the results:
The source data from GTEx (I use v10 for this analysis) has TPM (transcripts per million) values for different genes and tissue samples. After processing, this gives the mean and median TPM values and the number of samples they come from.
I started by looking at mean values but read that it’s better to use the median to avoid skew from outliers in the data, so that’s what this analysis uses.
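To make the aggregation step and the mean-versus-median point concrete, here is a minimal sketch using only the Python standard library. The gene name, tissue names and TPM numbers are made up for illustration (they are not GTEx values), and `summarise` is a hypothetical helper, not necessarily how the repository does it.

```python
from collections import defaultdict
from statistics import mean, median

# Made-up (gene, tissue, TPM) rows purely for illustration; not real GTEx values.
samples = [
    ("GENE_A", "Liver", 10.0),
    ("GENE_A", "Liver", 14.0),
    ("GENE_A", "Liver", 90.0),  # one outlying sample
    ("GENE_A", "Whole Blood", 2.0),
    ("GENE_A", "Whole Blood", 4.0),
]


def summarise(rows):
    """Group TPM values by (gene, tissue) and report mean, median and sample count."""
    grouped = defaultdict(list)
    for gene, tissue, tpm in rows:
        grouped[(gene, tissue)].append(tpm)
    return {
        key: {"mean": mean(values), "median": median(values), "n": len(values)}
        for key, values in grouped.items()
    }


summary = summarise(samples)
for (gene, tissue), stats in summary.items():
    print(gene, tissue, stats)
```

In the toy liver data the single outlying sample pulls the mean well above the other two values, while the median stays with the bulk of the samples, which is the reason for preferring the median here.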
Similarly, I moved to using z-scores to look at relative rather than absolute tissue expression values, to find genes which are likely working together or co-expressed.
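As a rough illustration of what the z-score step does, the sketch below standardises each gene's median TPM across tissues. The gene and tissue names and the numbers are invented, `zscore_row` is a hypothetical helper, and the choice of population standard deviation (`pstdev`) with no prior log transform is an assumption; the actual pipeline may differ.

```python
from statistics import mean, pstdev


def zscore_row(values):
    """Standardise one gene's values across tissues: (value - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population SD; a sample SD would also be defensible
    if sigma == 0:
        # A gene with identical expression in every tissue has no relative pattern.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Invented median TPM values: GENE_B is expressed 100x lower than GENE_A
# but has the same tissue pattern.
median_tpm = {
    "GENE_A": {"Liver": 200.0, "Muscle": 100.0, "Whole Blood": 300.0},
    "GENE_B": {"Liver": 2.0, "Muscle": 1.0, "Whole Blood": 3.0},
}

z = {
    gene: dict(zip(tissues, zscore_row(list(tissues.values()))))
    for gene, tissues in median_tpm.items()
}

for gene, profile in z.items():
    print(gene, {tissue: round(score, 3) for tissue, score in profile.items()})
```

The two invented genes end up with the same z-score profile despite very different absolute expression, which is what lets clustering group genes by tissue pattern rather than by overall expression level.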
I’m clustering into both 5 and 13 clusters, as these both show up as statistically meaningful and I think this gives us different perspectives on the data. We have a broad view at 5, telling us about larger-scale biological systems, and then a narrower view at 13, giving us a closer look at possible mechanisms and a more granular view of relevant genes/gene sets. To me, there are potentially some really interesting stories to tell about some of these clusters and how they could relate to ME/CFS. And I think that's what the PrecisionLife data is trying to tell us.
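For anyone unfamiliar with how a cluster count gets judged "statistically meaningful", the report uses silhouette analysis. Below is a minimal pure-Python sketch of the mean silhouette score for a given cluster assignment. The z-score profiles and the two candidate labelings are invented toy data, Euclidean distance is an assumption, and this says nothing about which clustering algorithm the repository actually uses.

```python
import math
from statistics import mean


def euclidean(a, b):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (closer to 1 is better)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the point's own cluster
        a = mean(euclidean(points[i], points[j]) for j in own)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            mean(euclidean(points[i], points[j]) for j in members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return mean(scores)


# Invented z-score profiles for six hypothetical genes across three tissues.
points = [
    (1.4, -0.7, -0.7), (1.3, -0.6, -0.7),
    (-0.7, 1.4, -0.7), (-0.6, 1.3, -0.7),
    (-0.7, -0.7, 1.4), (-0.7, -0.6, 1.3),
]
labels_k3 = [0, 0, 1, 1, 2, 2]  # three tight pairs
labels_k2 = [0, 0, 1, 1, 1, 1]  # two of the pairs lumped together

score_k3 = silhouette_score(points, labels_k3)
score_k2 = silhouette_score(points, labels_k2)
print(f"k=3: {score_k3:.3f}  k=2: {score_k2:.3f}")
```

In this toy case the three-cluster assignment scores higher than the two-cluster one. Running the same comparison over a range of cluster counts on real data is how candidate values such as 5 and 13 would stand out.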
There are more tweaks that could be made to this process and to the presentation of the data, such as adding different stats and querying other gene set libraries available from STRING-db and Enrichr, which may shed more light and be useful to add to the report. But I think I’ve taken this as far as I can.
What’s really needed is for people with a better understanding of and more experience in this field to take a look. My methods or implementation need checking; they may be flawed. And even if they’re valid, the biology needs careful interpretation.
The summary report shows a selection of the results from Enrichr and STRING-db based upon the cutoffs shown. Links to the full data are included. The silhouette analysis plots and individual cluster heat maps are probably the most visually interesting, while the global Enrichr and STRING-db terms have wider detail for biological interpretation.
Please treat this with caution and remember the caveats mentioned.
Maybe PrecisionLife already have better ways of explaining this. I think their approach uses data from these enrichment databases, so I’m not so much doing something new as peering into part of their process? It would be great if they could shed some light and explain more.
Hopefully this piques someone’s interest though. My main hope is that what I’ve tried is (a) not all nonsense and (b) someone better versed in this sort of thing can take it up. All the code is in a git repository so others should be able to run/reproduce/modify as needed and are free and very welcome to do so.
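GitHub - s4mehotblack/ClusterME: Extract, cluster and analyse gene expression patterns using GTEx data and STRING-db and Enrichr APIs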
Many thanks to @forestglip for proofreading, questions and corrections
Also to @jnmaciuch and @Simon M for encouragement
And to Claude, ChatGPT and mostly Gemini CLI for Python wrangling assistance