A longer but hopefully readable explanation…
One approach I have seen used is to look at how genes are expressed across tissues and find genes that tend to be expressed together in particular tissues. From what I've read, genes that are co-expressed like this are often linked mechanistically. So I took the gene list and merged it with data from GTEx, which measures gene expression in different tissues of the human body. The result is a grid, a matrix of genes by tissues, where each value is how much that gene is expressed in that tissue.
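As a rough sketch of that merging step, assuming the expression data is already loaded as a table with one row per gene and one column per tissue. The gene names, tissue names and numbers here are made up for illustration and are not real GTEx values, and the actual scripts may well do this differently:

```python
import pandas as pd

# Hypothetical stand-in for a GTEx-style table: rows are genes,
# columns are tissues, values are expression levels.
expression = pd.DataFrame(
    {
        "Liver": [120.0, 3.0, 45.0, 0.5],
        "Lung": [10.0, 80.0, 40.0, 0.7],
        "Brain": [5.0, 2.0, 50.0, 30.0],
    },
    index=["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
)

# Hypothetical gene list; GENE_X is deliberately absent from the table.
gene_list = ["GENE_A", "GENE_C", "GENE_X"]

# Split the list into genes that are in the table and genes that are not.
found = [g for g in gene_list if g in expression.index]
missing = [g for g in gene_list if g not in expression.index]

# The gene-by-tissue matrix for just the genes of interest.
matrix = expression.loc[found]
print(matrix)
print("Not found in expression table:", missing)
```

Reporting the genes that could not be matched is worth doing, since gene names and identifiers often differ between sources.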
Then I used something called Z-score normalisation. This changes the values in the matrix so that, rather than an absolute measure of how much a gene is expressed in a tissue, each value is relative, but relative only to that gene. For each gene it asks: 'In this specific tissue, is the gene expressed more or less than its average across all tissues?' The Z-score answers that in units of standard deviation: it is the gene's expression in that tissue minus its mean across tissues, divided by its standard deviation across tissues. So, if a gene has a Z-score of +2 in the liver, its expression there is two standard deviations above its own average. This is done for every gene in every tissue.
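A minimal sketch of per-gene Z-scoring, using a made-up matrix with genes as rows and tissues as columns. Using the population standard deviation (NumPy's default) is my assumption here; the real scripts may use the sample version:

```python
import numpy as np

# Hypothetical expression matrix: rows are genes, columns are tissues.
tissues = ["Liver", "Lung", "Brain", "Heart"]
expr = np.array([
    [100.0, 10.0, 10.0, 10.0],  # strongly expressed in liver only
    [1.0, 2.0, 3.0, 4.0],       # low overall, rising across tissues
])

# Mean and standard deviation of each gene across its own tissues.
mean = expr.mean(axis=1, keepdims=True)
std = expr.std(axis=1, keepdims=True)  # population SD (ddof=0), an assumption

# Z-score: how many standard deviations each value sits from that gene's mean.
z = (expr - mean) / std

for row in z:
    print(dict(zip(tissues, np.round(row, 2))))
```

After this step every gene has mean 0 and standard deviation 1 across tissues, so a highly expressed gene and a weakly expressed one can be compared by pattern rather than by absolute level. A gene with identical expression in every tissue has a standard deviation of zero and would need to be dropped or handled separately before dividing.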
Once you have this matrix, you can run a clustering analysis on it. This looks for patterns in the data and gives you clusters of genes that share a similar pattern of expression. For example, if genes ABC and DEF both have Z-scores around +2 in the liver and -2 in the lung, they might be put into the same cluster.
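To make the ABC/DEF example concrete, here is a small sketch using scikit-learn's AgglomerativeClustering. The values are invented in the spirit of the example rather than computed from real data, GHI and JKL are hypothetical extra genes, and the choice of Ward linkage and two clusters is an assumption for illustration:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

genes = ["ABC", "DEF", "GHI", "JKL"]

# Made-up Z-score-like values. Columns: Liver, Lung, Brain.
z = np.array([
    [2.0, -2.0, 0.0],   # ABC: high in liver, low in lung
    [1.9, -2.1, 0.2],   # DEF: nearly the same pattern as ABC
    [-1.5, 0.1, 1.4],   # GHI: low in liver, high in brain
    [-1.4, -0.1, 1.5],  # JKL: nearly the same pattern as GHI
])

# Hierarchical clustering that repeatedly merges the closest groups
# until the requested number of clusters remains.
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(z)

for gene, label in zip(genes, labels):
    print(gene, "-> cluster", label)
```

The cluster numbers themselves are arbitrary labels; what matters is which genes end up sharing one.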
You can then use a statistical measure called the silhouette value to assess how coherent these clusters are: how well ‘matched’ each gene is to the other genes in its own cluster compared with genes in the nearest other cluster, and therefore how well matched the cluster is overall. Values run from -1 to +1, with higher meaning a better fit. We can also visualise this with heatmaps of the Z-scores and by plotting silhouette values per cluster and per gene. This gives you a way of telling whether you’ve split the genes into a sensible number of clusters. That decision isn’t automatic; it is a judgement call, albeit one guided by the data.
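A sketch of how silhouette values can guide the choice of cluster count, using scikit-learn's silhouette_score and silhouette_samples. The data is synthetic (three invented groups of ten genes with a little random noise), and the range of cluster counts tried is an arbitrary assumption:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(0)

# Three made-up expression patterns across five tissues.
centres = np.array([
    [2.0, -1.0, 0.0, -1.0, 0.0],
    [-1.0, 2.0, -1.0, 0.0, 0.0],
    [0.0, -1.0, -1.0, 2.0, 0.0],
])

# Ten synthetic genes per pattern, each with small random noise added.
z = np.vstack([c + rng.normal(scale=0.2, size=(10, 5)) for c in centres])

# Mean silhouette value for each candidate number of clusters.
scores = {}
for k in range(2, 7):
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(z)
    scores[k] = silhouette_score(z, labels)
    print(f"k={k}: mean silhouette = {scores[k]:.3f}")

# Take the best-scoring k as a starting point, not a final answer.
best_k = max(scores, key=scores.get)
labels = AgglomerativeClustering(n_clusters=best_k).fit_predict(z)

# Per-gene silhouette values show which genes fit their cluster poorly.
per_gene = silhouette_samples(z, labels)
print("Lowest per-gene silhouette:", round(float(per_gene.min()), 3))
```

On real data the scores are rarely this clear-cut, which is where the judgement comes in: the per-gene values and the heatmaps are worth looking at alongside the averages.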
One big caveat to all of this is that I've never done anything like this before. I didn't really understand what I was doing going into it and am only starting to now. I started out more or less blindly, reading bits and pieces and using some of the agentic coding tools to help with scripts to explore the data. So I've basically been winging it and making it up as I go along. I've learned a lot and found it interesting, but my methodology, or my implementation of it, may not be sound. So please take all of this with a pinch of salt and a suitable level of caution and skepticism.
But I thought it would be useful to share the methods and get people’s thoughts. If people think it's worthwhile, I’ll share the final steps of the process, plus some more data and results. The clustering itself seems interesting, but I think the biological side in particular will need people with a better understanding of all this to interpret.
A bit more background reading is attached, along with a more technical methodology write-up (written by an LLM fed all the Python scripts used).
en.wikipedia.org
en.wikipedia.org
Gallery examples: Agglomerative clustering with different metrics Plot Hierarchical Clustering Dendrogram Comparing different clustering algorithms on toy datasets A demo of structured Ward hierarc...
scikit-learn.org
en.wikipedia.org