Personal project: Sick Genes, a website for compiling significant gene findings from studies on ME/CFS and other conditions

A lot of parts I wasn't sure about, like what differentially accessible here means:

I assume they are referring to differential chromatin accessibility, where the difference is either between PwME and controls, or between before and after CPET. From the paper:
Thus, we used ATAC-seq to compare chromatin accessibilities of TM (and TN) cells in the same cohort of patients and healthy controls profiled by RNA-seq, identifying 67,189 consensus chromatin-accessible regions (ChARs) across all conditions. Accessible regions were enriched at transcriptional start sites, as expected for reliable ATAC-seq profiles (SI Appendix, Fig. S4A).

That is a rather technical paper and I agree it's difficult to understand. Well done for trying.

According to AI, in simple terms, chromatin accessibility refers to how "open" or "loose" the DNA is within a cell's nucleus. To understand this, it helps to know that our DNA isn't just a free-floating strand; it's tightly wound and packaged with proteins, primarily histones, into structures called chromatin.
Open vs. Closed Chromatin:
  • Open (Accessible) Chromatin: Imagine a loosely wound spool of thread. In this state, the DNA is more exposed and available.
  • Closed (Inaccessible) Chromatin: This is like a tightly wound, compact spool. The DNA is hidden and unavailable.
 
I see, thanks. So what I'm getting is that the DNA is a bit looser or denser near these genes in ME/CFS. That's a type of finding I'd never seen before. But I think it makes sense to add it along with mutations and abundance, since it's a finding that differs from controls and refers to a change related to a specific gene.
 
Edit: I think I'll post things when recording genes that aren't very straightforward on this thread like above, just in case anyone has any suggestions. But I think it's okay if it's not perfect. I might miss some genes or mistakenly mark some genes as significant based on a misunderstanding, but I don't think it's a big deal if the vast majority of genes recorded are correct. Just a bit of noise in the data.
Good insight!!
A bit of noise on paper is OK.
 
I suppose in those borderline cases it’s probably best to cast the net wider and have more in there than less? It can always be narrowed down later, when someone digs more into a paper/when experts are using it and provide feedback that a gene isn’t relevant/appropriate for inclusion.
 
How do you envision people using this database @forestglip? I was thinking, if I was a researcher that just completed a genetic study and found some statistically significant genes, I would want to know what the previous literature looks like. Now your website will instantly be able to tell me how often my genes had been significant previously, but it won't be able to tell me how often they were not significant, which arguably might be just as important. I'm not sure how that could be handled without introducing even more work. What are your thoughts on this?
 
How do you envision people using this database @forestglip? I was thinking, if I was a researcher that just completed a genetic study and found some statistically significant genes, I would want to know what the previous literature looks like.
Yeah, that's one use case: quickly find if and where a gene was significant and how often. You could see whether this is a new finding, in which case it might just be chance, or if it's been significant ten times already, in which case it might make sense to dig into this finding specifically.

Now your website will instantly be able to tell me how often my genes had been significant previously, but it won't be able to tell me how often they were not significant, which arguably might be just as important. I'm not sure how that could be handled without introducing even more work. What are your thoughts on this?
Yes, non-significant findings would arguably add a lot of value. If TNF has been significant more often than any other gene, but non-significant many more times than that, then it might be a case of researchers targeting it over and over because they're interested in it, with all the significant findings just being chance.

But adding this kind of information raises practical hurdles similar to those of adding the direction of change. What if a study says "We tested TNF in the plasma and it was not significantly higher than controls. We also tested how much TNF was released by macrophages upon stimulation, and it was not significantly more than controls. Finally, we tested the cerebrospinal fluid and found that TNF was significantly higher in ME/CFS." Do I save TNF three times, twice as non-significant?
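
To make that option concrete, here is a minimal sketch in Python of what saving one row per measurement might look like. The `Finding` fields, the `study_1` ID, and the `tally` helper are all hypothetical names for illustration, not how the site stores data today.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One tested measurement of one gene in one study (hypothetical schema)."""
    study_id: str      # made-up identifier for the study
    gene: str          # HGNC symbol
    context: str       # tissue or assay the gene was measured in
    significant: bool  # did it differ significantly from controls?


# The TNF example from the paragraph above: three rows for one study.
findings = [
    Finding("study_1", "TNF", "plasma", False),
    Finding("study_1", "TNF", "macrophage release after stimulation", False),
    Finding("study_1", "TNF", "cerebrospinal fluid", True),
]


def tally(rows):
    """Return {gene: (significant_count, non_significant_count)}."""
    counts = {}
    for row in rows:
        sig, nonsig = counts.get(row.gene, (0, 0))
        if row.significant:
            counts[row.gene] = (sig + 1, nonsig)
        else:
            counts[row.gene] = (sig, nonsig + 1)
    return counts


print(tally(findings))  # {'TNF': (1, 2)}
```

Every row needs a context and a yes/no flag, which is the extra bookkeeping per gene that the question is about.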

I suppose it's an option. But for a few reasons, I probably won't incorporate that:
  • I'm tired of coding, and it would require a significant rewrite of the part of the website that was most annoying to write: the search tool that matches genes to HGNC before inserting. It would have to allow adding other data along with a gene when inserting.
  • It would slow down data insertion. I think there's value in trying to have as large a breadth as possible of the study landscape, as opposed to getting perfect snapshots of each study. Currently, I barely have to think while saving genes. I notice a gene name, which stands out because it's all-caps, and I look for language that says something like "significant", then I copy and paste. It'd probably at least double the required time (and make it less pleasant of an experience) if I had to add a yes or no for significance for every gene.
  • The data on non-significant genes is often not reported. One example is untargeted proteomics studies. They might say they tested 6000 proteins, but only report the 100 that were significant. Or a GWAS. They might report only the ten genes that were significant. Do I then store all 20,000 other genes in the human genome as non-significant every time there's a GWAS?
  • Finally, because I don't think this data is necessary for the main use case I envisioned:

How do you envision people using this database @forestglip?
Although targeted lookups, as you described, are one way to use it, I envision the tool's main value as being a hypothesis generator itself. The browse page already offers a crude version of this, just showing which genes have been significant the most times. But what I think could be valuable is using the connections between related genes.

I was inspired by the discussion of the Zhang study, which led me to learn about GSEA. GSEA was an attempt by researchers to deal with the huge amount of data arising from the human genome, where most genes' effects on disease are too small to be significant without a massive sample size. But if, for example, a study finds moderate effect sizes in 100 genes, where most are totally unrelated, but 5 of those genes happen to be very related (e.g. NLGN1, NLGN2, NLGN3, NLGN4X, NLGN4Y), that's a good sign that the NLGN-related genes aren't coming up just by chance. Considering related genes together allows for finding signals that would be invisible if considering each gene on its own.
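
As a rough back-of-the-envelope check on that intuition, here is a sketch using a simple over-representation (hypergeometric) test. This is not GSEA itself, just a simpler relative of it. It assumes about 20,000 genes in total, that the five NLGN genes make up the whole gene set, and that the 100 hits would otherwise be picked uniformly at random. The function name `hypergeom_tail` is made up for this example.

```python
from math import comb


def hypergeom_tail(population, set_size, draws, hits):
    """P(at least `hits` genes from a set of `set_size` appear among
    `draws` genes picked uniformly at random from `population` genes)."""
    total = comb(population, draws)
    upper = min(set_size, draws)
    favourable = sum(
        comb(set_size, k) * comb(population - set_size, draws - k)
        for k in range(hits, upper + 1)
    )
    return favourable / total


# Assumed numbers: 20,000 genes, 5 NLGN genes, 100 genes with moderate effects,
# and all 5 NLGN genes among those 100.
p = hypergeom_tail(population=20_000, set_size=5, draws=100, hits=5)
print(f"chance of all 5 NLGN genes landing in the 100 by luck: {p:.3g}")
```

Under those assumptions the probability comes out on the order of 1e-12, which is why a cluster of related genes can stand out even when no single gene does.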

I'm not exactly sure how this website's database could be used in a similar way, but here are a couple of rough ideas:
  • Just run GSEA, using the number of studies that found a gene significant as the score for each gene. Normally a pre-ranked GSEA gives each gene a score representing biological importance, which could be based on a p-value or an effect size, and then the algorithm checks whether high-scoring genes are over-represented in any gene sets (clusters of related genes curated by others, like the NLGN genes above). I'm not sure if a standard GSEA would be optimal here, but I think something similar could be done using study count as the metric.
  • Create a network using protein-protein interaction scores from the STRING database. STRING is a curated database of scores between proteins, representing how related they are. It includes subscores like a score for how co-expressed two proteins are (if gene A is higher, do we always see gene B as higher too?) and a subscore for how often the proteins are mentioned together in published papers.
    • For example, here's a visualization of the protein relatedness between NLGN1, NLGN2, and TNF:
    • [Image: 1752843322631.png]
    • Out of a maximum of 1, STRING gives the connection between NLGN1 and NLGN2 a score of .704, represented by the thick connecting line. TNF only has a small connection of .154 to NLGN1 and nothing to NLGN2, so it's clear that the NLGN genes are more related to each other than to TNF.
    • So what I imagine is that you create a network using all the genes from the Sick Genes database, where you not only incorporate the STRING score between genes, but you also include a score for each individual gene based on how many studies it was significant in. You might imagine it as something like the above image, but where the size of each gene circle represents the number of studies.
    • A sophisticated algorithm would highlight clusters based on scores between genes and scores for individual genes. If all of the NLGN genes are showing up as significant in many studies, and since they would have strong connections between them, I'd hope for the algorithm to pick out this cluster as one of the most interesting, even if individual NLGN genes aren't showing up a lot more often than many other genes.
  • Why would information about non-significant findings for a gene not be as important for this use case? It might be easier to illustrate, so here's a rough mockup of how I imagine a network based on the database would look:
  • [Image: New Project(3).png]
  • Circle size represents how many studies a gene was significant in, based on the Sick Genes database. The connections between circles represent how related the genes are, based on data from a database like STRING. Thick red line means very related and thin blue line means weakly related.
  • I envision the algorithm highlighting the cluster of NLGN genes above as most interesting because they all have been significant in many studies, plus they are strongly related to each other.
  • TNF is the biggest circle, so one might think that it's very interesting. But the genes it is most related to, CHUK and TRAF2, have not shown up much more than most other genes. This might indicate that researchers are often seeing TNF as significant not because it is a real effect, but because they keep testing it over and over. If it was a real effect, you might expect related genes to also show up a lot in untargeted proteomics studies (though it is possible TNF is working alone). [Edit: Also possible that TNF is interacting with other genes in a novel way that hasn't yet been characterized, so the connection wouldn't be in STRING.] We can be fairly confident that it's not a real effect without even knowing how often it's been not significant.
  • The IGHG genes have strong connections, but haven't shown up much more than other genes, so that's a sign that they are not actually important here.
  • (All data described about specific genes here is a made up example.)
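
For the first rough idea above (something GSEA-like with study count as the score), here is a minimal sketch in Python. It uses the plain, unweighted running-sum statistic rather than a full GSEA: no weighting by score and no permutation test for significance. All counts, the `GENE_A` to `GENE_D` placeholders, and the function name `enrichment_score` are made up for illustration.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score.

    Walk down the ranked list, stepping up for genes in the set and down
    for genes outside it, and return the largest deviation from zero.
    """
    in_set = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(in_set)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the list without covering it")
    running, best = 0.0, 0.0
    for hit in in_set:
        running += 1 / n_hits if hit else -1 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Made-up study counts: how many studies reported each gene as significant.
study_counts = {
    "TNF": 12, "NLGN1": 6, "NLGN3": 6, "NLGN2": 5, "GENE_A": 3,
    "GENE_B": 2, "CHUK": 1, "TRAF2": 1, "GENE_C": 1, "GENE_D": 0,
}
# Rank genes from most to fewest studies (ties keep insertion order).
ranked = sorted(study_counts, key=study_counts.get, reverse=True)

es_nlgn = enrichment_score(ranked, {"NLGN1", "NLGN2", "NLGN3"})
es_tnf = enrichment_score(ranked, {"TNF", "CHUK", "TRAF2"})
print(f"NLGN set: {es_nlgn:.3f}, TNF set: {es_tnf:.3f}")
```

In this toy data the NLGN set scores much higher than the TNF set, because all of its members sit near the top of the ranking instead of just one. With real data, many genes would tie on study count, so how ties are ordered would need some thought.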
I have no idea if I'll be able to do the more sophisticated analysis above on my own, but maybe others who do know how to do that might become interested in that in the future.
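
For what it's worth, a much cruder version of the network idea can be sketched in a few lines of Python. This is only a heuristic I'm assuming for illustration, not an established algorithm: it scores a hand-picked cluster as the geometric mean of its study counts times the average relatedness between its members. The clusters, the counts, and every edge score except NLGN1-NLGN2 (0.704) and TNF-NLGN1 (0.154) are made up.

```python
from itertools import combinations
from statistics import geometric_mean

# Made-up study counts (the circle sizes in the mockup).
study_counts = {
    "NLGN1": 6, "NLGN2": 5, "NLGN3": 6,
    "TNF": 12, "CHUK": 1, "TRAF2": 1,
    "IGHG1": 1, "IGHG2": 1,
}

# Relatedness scores between 0 and 1. Only the NLGN1-NLGN2 and TNF-NLGN1
# values come from the STRING example above; the rest are invented.
edges = {
    frozenset(("NLGN1", "NLGN2")): 0.704,
    frozenset(("NLGN1", "NLGN3")): 0.7,
    frozenset(("NLGN2", "NLGN3")): 0.7,
    frozenset(("TNF", "NLGN1")): 0.154,
    frozenset(("TNF", "CHUK")): 0.9,
    frozenset(("TNF", "TRAF2")): 0.9,
    frozenset(("CHUK", "TRAF2")): 0.8,
    frozenset(("IGHG1", "IGHG2")): 0.9,
}


def cluster_score(genes):
    """Geometric mean of study counts times mean pairwise relatedness.

    The geometric mean is low unless every member shows up often, so one
    heavily studied gene cannot carry a cluster on its own.
    """
    pairs = list(combinations(genes, 2))
    mean_edge = sum(edges.get(frozenset(pair), 0.0) for pair in pairs) / len(pairs)
    return geometric_mean(study_counts[gene] for gene in genes) * mean_edge


# Hand-picked candidate clusters; a real analysis would have to find these.
clusters = {
    "NLGN": ["NLGN1", "NLGN2", "NLGN3"],
    "TNF": ["TNF", "CHUK", "TRAF2"],
    "IGHG": ["IGHG1", "IGHG2"],
}
ranked = sorted(clusters, key=lambda name: cluster_score(clusters[name]), reverse=True)
print(ranked)  # ['NLGN', 'TNF', 'IGHG']
```

The NLGN cluster comes out on top because all of its members have been significant often and are strongly connected, while TNF's neighbours and the IGHG genes drag their clusters down.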
 
I suppose in those borderline cases it’s probably best to cast the net wider and have more in there than less? It can always be narrowed down later, when someone digs more into a paper/when experts are using it and provide feedback that a gene isn’t relevant/appropriate for inclusion.
That might be the best way to go. I guess I was thinking if a study reports that 5 genes are really interesting and have really big effect sizes, but 4000 total are significant, then including all 4000 really waters down the information from the small number that are really interesting.

I suppose it fits the criteria of my website to just add all 4000, but I wasn't sure if I was reading the study wrong because that's so many. Maybe it's not that strange though. They basically tested every gene multiple times, once each in about a dozen different cell types. So if different genes are significant in each cell type, then when combined that could be a ton of unique genes.
 
Would you consider making it open source? You'd still have to review PRs, but it could be an option.
Yeah, I might do that. This is the first somewhat complete coding project I've done, and I've never collaborated with others in coding, so I'm not sure how that would go. I imagine people who are much more experienced than me sending PRs that I can't understand at all, so I'd be nervous about adding them. But I could give it a try.
 
GitHub works well for sharing the code. Just because someone sends a PR it doesn't mean you have to implement it! But it could be a way for others here to help you on certain tasks.

To start with you could make it a private repository and play with the features yourself. And then if/when happy you can add some volunteer coder members you choose to help with the coding while still keeping it private for a select few. At least to get your feet wet and build confidence. Or not.

Regarding help with STRING you might want to reach out to @paolo who put together tools for this recent paper.
 
Oh yeah, I'm familiar. I don't think I have an issue with just making the code public immediately.

Regarding help with STRING you might want to reach out to @paolo who put together tools for this recent paper.
https://www.s4me.info/threads/towar...ed-gene-prioritization-2025-maccallini.43656/
Thanks for the suggestion! Maybe I'll reach out once I have at least like a couple hundred studies saved.