How do you envision people using this database @forestglip? I was thinking, if I were a researcher who had just completed a genetic study and found some statistically significant genes, I would want to know what the previous literature looked like.
Yeah, that's one use case: quickly find if and where a gene was significant and how often. You could see whether this is a new finding, in which case it might just be chance, or if it's been significant ten times already, in which case it might make sense to dig into this finding specifically.
Now your website will instantly be able to tell me how often my genes had been significant previously, but it won't be able to tell me how often they were not significant, which arguably might be just as important. I'm not sure how that could be handled without introducing even more work. What are your thoughts on this?
Yes, non-significant findings would arguably add a lot of value. If TNF has been significant more than any other gene, but not significant many more times than that, then it might be a case of researchers targeting it over and over because they're interested in it, with all the significant findings just being chance.
But adding this kind of information adds similar practical hurdles as adding the direction of change. What if a study says "We tested TNF in the plasma and it was not significantly higher than controls. We also tested how much TNF was released by macrophages upon stimulation, and it was not significantly more than controls. Finally, we tested the cerebrospinal fluid and found that TNF was significantly higher in ME/CFS." Do I save TNF three times, twice as non-significant?
I suppose it's an option. But for a few reasons, I probably won't incorporate that:
- I'm tired of coding, and it would require a significant rewrite of the part of the website that was most annoying to write: the search tool that matches genes to HGNC before inserting. It would need to allow adding other data along with a gene when inserting.
- It would slow down data insertion. I think there's value in trying to have as large a breadth as possible of the study landscape, as opposed to getting perfect snapshots of each study. Currently, I barely have to think while saving genes. I notice a gene name, which stands out because it's all-caps, and I look for language that says something like "significant", then I copy and paste. It'd probably at least double the required time (and make it less pleasant of an experience) if I had to add something about yes or no for significance for every gene.
- The data on non-significant genes is often not reported. One example is untargeted proteomics studies. They might say they tested 6000 proteins, but only report the 100 that were significant. Or take a GWAS: it might list the ten genes that were significant. Do I then store all 20,000 other genes in the human genome as non-significant every time there's a GWAS?
- Finally, because I don't think this data is necessary for the main use case I envisioned:
How do you envision people using this database @forestglip?
Although targeted lookups, as you described, are one way to use it, I envision the tool's main value as being a hypothesis generator itself. The browse page already offers a crude version of this for just seeing which genes have been significant the most times. But what I think could be valuable is using the connections between related genes.
I was inspired by the discussion from the Zhang study, which led to learning about GSEA (gene set enrichment analysis), which was an attempt by researchers to deal with the huge amount of data arising from the human genome, where most genes' effects on disease are too small to be significant without a massive sample size. But if, for example, a study finds moderate effect sizes in 100 genes, where most are totally unrelated, but 5 of those genes happen to be very related (e.g. NLGN1, NLGN2, NLGN3, NLGN4X, NLGN4Y), that's a good sign that the NLGN-related genes aren't coming up just by chance. Considering related genes together allows for finding signals that would be invisible if considering each gene on its own.
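To put a rough number on that intuition, here's a minimal sketch of an over-representation test. This is a simpler cousin of GSEA, not GSEA itself. The counts (about 20,000 genes, 100 hits, a family of 5) come from the example above; the function name is made up, and it assumes the 100 hits would otherwise be drawn at random from all genes, which real studies won't perfectly satisfy.

```python
from math import comb

def overrep_p_value(total_genes, set_size, n_hits, overlap):
    """Chance that at least `overlap` of `n_hits` randomly drawn genes
    land in a gene set of `set_size` (hypergeometric upper tail)."""
    max_overlap = min(set_size, n_hits)
    # Count the ways to draw k genes from the set and the rest from outside it.
    favorable = sum(
        comb(set_size, k) * comb(total_genes - set_size, n_hits - k)
        for k in range(overlap, max_overlap + 1)
    )
    return favorable / comb(total_genes, n_hits)

# All 5 NLGN genes showing up among 100 hits out of ~20,000 genes.
p = overrep_p_value(total_genes=20_000, set_size=5, n_hits=100, overlap=5)
print(f"{p:.2e}")
```

Under those assumptions the probability comes out on the order of 1e-12, so five related genes showing up together would be very hard to explain by chance alone, even if no single one of them stands out.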
I'm not exactly sure how this website's database could be used in a similar way, but here's a couple rough ideas:
- Just GSEA, using the number of studies that said a gene was significant as the measure that goes along with each gene. Normally a pre-ranked GSEA gives each gene a score representing biological importance that could be based on a p-value or an effect size, and then the algorithm sees if any groups of genes with high scores are over-represented in any gene sets (clusters of related genes curated by others, like the NLGN genes above). I'm not sure if a standard GSEA would be optimal here, but I think something similar could be done using study count as the metric.
- Create a network using protein-protein interaction scores from the STRING database. STRING is a curated database of scores between proteins, representing how related they are. It includes subscores like a score for how co-expressed two proteins are (if gene A is higher, do we always see gene B as higher too?) and a subscore for how often the proteins are mentioned together in published papers.
- For example, here's a visualization of the protein relatedness between NLGN1, NLGN2, and TNF:

- Out of a maximum of 1, STRING gives the connection between NLGN1 and NLGN2 a score of .704, represented by the thick connecting line. TNF only has a small connection of .154 to NLGN1 and nothing to NLGN2, so it's clear that the NLGN genes are more related to each other than to TNF.
- So what I imagine is that you create a network using all the genes from the Sick Genes database, where you not only incorporate the STRING score between genes, but you also include a score for each individual gene based on how many studies it was significant in. You might imagine it as something like the above image, but where the size of each gene circle represents the number of studies.
- A sophisticated algorithm would highlight clusters based on scores between genes and scores for individual genes. If all of the NLGN genes are showing up as significant in many studies, and since they would have strong connections between them, I'd hope for the algorithm to pick out this cluster as one of the most interesting, even if individual NLGN genes aren't showing up a lot more often than many other genes.
- Why would information about non-significant findings for a gene not be as important for this use case? It might be easier to illustrate, so here's a rough mockup of how I imagine a network based on the database would look:

- Circle size represents how many studies a gene was significant in, based on the Sick Genes database. The connections between circles represent how related the genes are, based on data from a database like STRING. Thick red line means very related and thin blue line means weakly related.
- I envision the algorithm highlighting the cluster of NLGN genes above as most interesting because they all have been significant in many studies, plus they are strongly related to each other.
- TNF is the biggest circle, so one might think that it's very interesting. But the genes it is most related to, CHUK and TRAF2, have not shown up much more than most other genes. This might indicate that researchers are often seeing TNF as significant not because it is a real effect, but because they keep testing it over and over. If it was a real effect, you might expect related genes to also show up a lot in untargeted proteomics studies (though it is possible TNF is working alone). [Edit: Also possible that TNF is interacting with other genes in a novel way that hasn't yet been characterized so the connection wouldn't be in STRING.] We can be fairly confident that it's not a real effect without even knowing how often it's been not significant.
- The IGHG genes have strong connections, but haven't shown up much more than other genes, so that's a sign that they are not actually important here.
- (All data described about specific genes here is a made up example.)
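A very crude stand-in for that kind of algorithm could look like the sketch below. It isn't a real network-clustering method, just a hypothetical "neighborhood score": a gene's own study count multiplied by the study counts of its neighbors, weighted by connection strength. All study counts and most edge scores are invented for illustration; only the .704 and .154 values are the STRING scores quoted above. The name `neighborhood_scores` is made up too.

```python
from collections import defaultdict

# Made-up study counts, in the spirit of the mockup (not real Sick Genes data).
study_counts = {
    "NLGN1": 6, "NLGN2": 5, "NLGN3": 6, "NLGN4X": 5,
    "TNF": 12, "CHUK": 1, "TRAF2": 1,
    "IGHG1": 1, "IGHG2": 1,
}

# (gene_a, gene_b, relatedness score between 0 and 1).
# Only .704 and .154 are taken from the post; the rest are invented.
edges = [
    ("NLGN1", "NLGN2", 0.704),
    ("NLGN1", "NLGN3", 0.7),
    ("NLGN2", "NLGN3", 0.7),
    ("NLGN1", "NLGN4X", 0.7),
    ("NLGN3", "NLGN4X", 0.7),
    ("TNF", "NLGN1", 0.154),
    ("TNF", "CHUK", 0.9),
    ("TNF", "TRAF2", 0.9),
    ("IGHG1", "IGHG2", 0.9),
]

def neighborhood_scores(counts, edges):
    """Score each gene as its own count times the weighted counts of its neighbors."""
    neighbors = defaultdict(list)
    for a, b, weight in edges:
        neighbors[a].append((b, weight))
        neighbors[b].append((a, weight))
    scores = {}
    for gene, count in counts.items():
        # Support is high only when strongly connected genes also show up often.
        support = sum(weight * counts[other] for other, weight in neighbors[gene])
        scores[gene] = count * support
    return scores

scores = neighborhood_scores(study_counts, edges)
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for gene, score in ranking:
    print(f"{gene:7s} {score:6.1f}")
```

With these invented numbers, every NLGN gene ends up ranked above TNF even though TNF has the largest study count, and the IGHG genes land at the bottom despite their strong connection. That's the behavior I'd hope for, though a proper method would need to find clusters rather than just score single genes.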
I have no idea if I'll be able to do the more sophisticated analysis above on my own, but maybe others who do know how to do that might become interested in that in the future.