Personal project: Sick Genes, a website for compiling significant gene findings from studies on ME/CFS and other conditions

A lot of parts I wasn't sure about, like what differentially accessible here means:

I assume they are referring to differential chromatin accessibility, where "differential" means either PwME vs. controls or before vs. after CPET. From the paper:
Thus, we used ATAC-seq to compare chromatin accessibilities of TM (and TN) cells in the same cohort of patients and healthy controls profiled by RNA-seq, identifying 67,189 consensus chromatin-accessible regions (ChARs) across all conditions. Accessible regions were enriched at transcriptional start sites, as expected for reliable ATAC-seq profiles (SI Appendix, Fig. S4A).

That is a rather technical paper and I agree it's difficult to understand. Well done for trying.

According to AI in simple terms, chromatin accessibility refers to how "open" or "loose" the DNA is within a cell's nucleus. To understand this, it helps to know that our DNA isn't just a free-floating strand; it's tightly wound and packaged with proteins, primarily histones, into structures called chromatin.
Open vs. Closed Chromatin:
  • Open (Accessible) Chromatin: Imagine a loosely wound spool of thread. In this state, the DNA is more exposed and available.
  • Closed (Inaccessible) Chromatin: This is like a tightly wound, compact spool. The DNA is hidden and unavailable.
 
I assume they are referring to chromatin accessibility. From the paper:


According to AI in simple terms, chromatin accessibility refers to how "open" or "loose" the DNA is within a cell's nucleus. To understand this, it helps to know that our DNA isn't just a free-floating strand; it's tightly wound and packaged with proteins, primarily histones, into structures called chromatin.
Open vs. Closed Chromatin:
  • Open (Accessible) Chromatin: Imagine a loosely wound spool of thread. In this state, the DNA is more exposed and available.
  • Closed (Inaccessible) Chromatin: This is like a tightly wound, compact spool. The DNA is hidden and unavailable.
I see, thanks. So what I'm getting is that the DNA is a bit looser or denser near these genes in ME/CFS. That's a type of finding I'd never seen before. But I think it makes sense to add it along with mutations and abundance, since it's a finding that differs from controls and refers to a change related to a specific gene.
 
Edit: When I'm recording genes that aren't very straightforward, like above, I think I'll post them on this thread in case anyone has any suggestions. But I think it's okay if it's not perfect. I might miss some genes or mistakenly mark some genes as significant based on a misunderstanding, but I don't think it's a big deal if the vast majority of genes recorded are correct. Just a bit of noise in the data.
Good insight!!
A bit of noise on paper is OK.
 
I suppose in those borderline cases it’s probably best to cast the net wider and have more in there than less? It can always be narrowed down later, when someone digs more into a paper/when experts are using it and provide feedback that a gene isn’t relevant/appropriate for inclusion.
 
How do you envision people using this database @forestglip? I was thinking, if I were a researcher who had just completed a genetic study and found some statistically significant genes, I would want to know what the previous literature looks like. Now your website will instantly be able to tell me how often my genes had been significant previously, but it won't be able to tell me how often they were not significant, which arguably might be just as important. I'm not sure how that could be handled without introducing even more work. What are your thoughts on this?
 
How do you envision people using this database @forestglip? I was thinking, if I were a researcher who had just completed a genetic study and found some statistically significant genes, I would want to know what the previous literature looks like.
Yeah, that's one use case: quickly find if and where a gene was significant and how often. You could see whether this is a new finding, in which case it might just be chance, or if it's been significant ten times already, in which case it might make sense to dig into this finding specifically.

Now your website will instantly be able to tell me how often my genes had been significant previously, but it won't be able to tell me how often they were not significant, which arguably might be just as important. I'm not sure how that could be handled without introducing even more work. What are your thoughts on this?
Yes, non-significant findings would arguably add a lot of value. If TNF has been significant more often than any other gene, but has been non-significant many times more often still, then it might be a case of researchers targeting it over and over because they're interested in it, with all the significant findings just being chance.

But adding this kind of information raises practical hurdles similar to those of adding the direction of change. What if a study says "We tested TNF in the plasma and it was not significantly higher than controls. We also tested how much TNF was released by macrophages upon stimulation, and it was not significantly more than controls. Finally, we tested the cerebrospinal fluid and found that TNF was significantly higher in ME/CFS." Do I save TNF three times, twice as non-significant?

I suppose it's an option. But for a few reasons, I probably won't incorporate that:
  • I'm tired of coding, and it would require a significant rewrite of the part of the website that was most annoying to write: the search tool that matches genes to HGNC before inserting. It would have to allow for adding other data along with a gene when inserting.
  • It would slow down data insertion. I think there's value in trying to have as large a breadth as possible of the study landscape, as opposed to getting perfect snapshots of each study. Currently, I barely have to think while saving genes. I notice a gene name, which stands out because it's all-caps, and I look for language that says something like "significant", then I copy and paste. It'd probably at least double the required time (and make it less pleasant of an experience) if I had to add something about yes or no for significance for every gene.
  • The data on non-significant genes is often not reported. One example is untargeted proteomics studies. They might say they tested 6000 proteins, but only report the 100 that were significant. Or take a GWAS. They might report only the ten genes that were significant. Do I then store all 20,000 other genes in the human genome as non-significant every time there's a GWAS?
  • Finally, because I don't think this data is necessary for the main use case I envisioned:

How do you envision people using this database @forestglip?
Although targeted lookups, as you described, are one way to use it, I envision the tool's main value as being a hypothesis generator itself. The browse page already offers a crude version of this for just seeing which genes have been significant the most times. But what I think could be valuable is using the connections between related genes.

I was inspired by the discussion of the Zhang study, which led me to learn about GSEA. GSEA was an attempt by researchers to deal with the huge amount of data arising from the human genome, where most genes' effects on disease are too small to be significant without a massive sample size. But if, for example, a study finds moderate effect sizes in 100 genes, where most are totally unrelated, but 5 of those genes happen to be very related (e.g. NLGN1, NLGN2, NLGN3, NLGN4X, NLGN4Y), that's a good sign that the NLGN-related genes aren't coming up just by chance. Considering related genes together allows for finding signals that would be invisible if considering each gene on its own.
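
The intuition in the example above can be checked with a simple over-representation test. This is a simpler cousin of GSEA, not GSEA itself, and the numbers are assumptions chosen to match the example: a background of 20,000 genes, 100 genes flagged by the study, and a gene set of 5 related genes that all appear among the 100. The function name `overrep_pvalue` is made up for this sketch.

```python
from math import comb


def overrep_pvalue(background: int, set_size: int, hits: int, overlap: int) -> float:
    """Hypergeometric tail probability.

    Chance that at least `overlap` of the `hits` flagged genes fall inside a
    gene set of `set_size` genes, if the hits were drawn at random from
    `background` genes.
    """
    total = comb(background, hits)
    upper = min(set_size, hits)
    favourable = sum(
        comb(set_size, k) * comb(background - set_size, hits - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / total


# Assumed numbers: 20,000 genes, 100 flagged, all 5 genes of the set flagged.
p = overrep_pvalue(background=20_000, set_size=5, hits=100, overlap=5)
print(f"p = {p:.3g}")  # far below 0.05, so unlikely to be chance
```

With these made-up inputs the probability of seeing all five related genes by chance is tiny, which is the sense in which the set as a whole carries a signal that no single gene does.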

I'm not exactly sure how this website's database could be used in a similar way, but here's a couple rough ideas:
  • Just run GSEA, using the number of studies that found a gene significant as the score for each gene. Normally a pre-ranked GSEA gives each gene a score representing biological importance that could be based on a p-value or an effect size, and then the algorithm sees if any groups of genes with high scores are over-represented in any gene sets (clusters of related genes curated by others, like the NLGN genes above). I'm not sure if a standard GSEA would be optimal here, but I think something similar could be done using study count as the metric.
  • Create a network using protein-protein interaction scores from the STRING database. STRING is a curated database of scores between proteins, representing how related they are. It includes subscores like a score for how co-expressed two proteins are (if gene A is higher, do we always see gene B as higher too?) and a subscore for how often the proteins are mentioned together in published papers.
    • For example, here's a visualization of the protein relatedness between NLGN1, NLGN2, and TNF:
    • [Image: STRING network of NLGN1, NLGN2, and TNF]
    • Out of a maximum of 1, STRING gives the connection between NLGN1 and NLGN2 a score of .704, represented by the thick connecting line. TNF only has a small connection of .154 to NLGN1 and nothing to NLGN2, so it's clear that the NLGN genes are more related to each other than to TNF.
    • So what I imagine is that you create a network using all the genes from the Sick Genes database, where you not only incorporate the STRING score between genes, but you also include a score for each individual gene based on how many studies it was significant in. You might imagine it as something like the above image, but where the size of each gene circle represents the number of studies.
    • A sophisticated algorithm would highlight clusters based on scores between genes and scores for individual genes. If all of the NLGN genes are showing up as significant in many studies, and since they would have strong connections between them, I'd hope for the algorithm to pick out this cluster as one of the most interesting, even if individual NLGN genes aren't showing up a lot more often than many other genes.
  • Why would information about non-significant findings for a gene not be as important for this use case? It might be easier to illustrate, so here's a rough mockup of how I imagine a network based on the database would look:
  • [Image: mockup network of genes, with circles sized by study count]
  • Circle size represents how many studies a gene was significant in, based on the Sick Genes database. The connections between circles represent how related the genes are, based on data from a database like STRING. Thick red line means very related and thin blue line means weakly related.
  • I envision the algorithm highlighting the cluster of NLGN genes above as most interesting because they all have been significant in many studies, plus they are strongly related to each other.
  • TNF is the biggest circle, so one might think that it's very interesting. But the genes it is most related to, CHUK and TRAF2, have not shown up much more than most other genes. This might indicate that researchers are often seeing TNF as significant not because it is a real effect, but because they keep testing it over and over. If it was a real effect, you might expect related genes to also show up a lot in untargeted proteomics studies (though it is possible TNF is working alone). [Edit: Also possible that TNF is interacting with other genes in a novel way that hasn't yet been characterized so the connection wouldn't be in STRING.] We can be fairly confident that it's not a real effect without even knowing how often it's been not significant.
  • The IGHG genes have strong connections, but haven't shown up much more than other genes, so that's a sign that they are not actually important here.
  • (All data described about specific genes here is a made up example.)
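
A very rough sketch of the cluster idea described above, far simpler than the "sophisticated algorithm" it hopes for. Everything here is an assumption for illustration: the study counts are made up, the edge scores are made up except the 0.704 and 0.154 values quoted earlier, the 0.4 edge threshold is arbitrary, and the scoring rule (edge score times the smaller study count of the two genes) is just one simple choice.

```python
from collections import defaultdict

# Made-up study counts (circle sizes in the mockup).
study_counts = {
    "NLGN1": 6, "NLGN2": 5, "NLGN3": 6,
    "TNF": 9, "CHUK": 1, "TRAF2": 1,
    "IGHG1": 1, "IGHG2": 1,
}

# STRING-like relatedness scores (line thickness). Only 0.704 and 0.154
# come from the post; the rest are invented.
edge_scores = {
    ("NLGN1", "NLGN2"): 0.704, ("NLGN1", "NLGN3"): 0.8, ("NLGN2", "NLGN3"): 0.75,
    ("TNF", "CHUK"): 0.9, ("TNF", "TRAF2"): 0.9,
    ("IGHG1", "IGHG2"): 0.9,
    ("NLGN1", "TNF"): 0.154,
}


def find_clusters(edges, threshold=0.4):
    """Connected components, using only edges at or above the threshold."""
    neighbours = defaultdict(set)
    for (a, b), score in edges.items():
        if score >= threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)
    seen, clusters = set(), []
    for start in neighbours:
        if start in seen:
            continue
        stack, members = [start], set()
        while stack:
            gene = stack.pop()
            if gene in members:
                continue
            members.add(gene)
            stack.extend(neighbours[gene] - members)
        seen |= members
        clusters.append(frozenset(members))
    return clusters


def cluster_score(members, edges, counts):
    """Sum over internal edges of edge score x the smaller study count."""
    return sum(
        score * min(counts.get(a, 0), counts.get(b, 0))
        for (a, b), score in edges.items()
        if a in members and b in members
    )


ranked = sorted(
    find_clusters(edge_scores),
    key=lambda c: cluster_score(c, edge_scores, study_counts),
    reverse=True,
)
for cluster in ranked:
    print(sorted(cluster), round(cluster_score(cluster, edge_scores, study_counts), 2))
```

Using the smaller of the two study counts on each edge means a single frequently tested gene like TNF cannot carry a cluster on its own, while a group where every member shows up often, like the NLGN genes here, ranks first.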
I have no idea if I'll be able to do the more sophisticated analysis above on my own, but maybe others who do know how to do that might become interested in that in the future.
 
I suppose in those borderline cases it’s probably best to cast the net wider and have more in there than less? It can always be narrowed down later, when someone digs more into a paper/when experts are using it and provide feedback that a gene isn’t relevant/appropriate for inclusion.
That might be the best way to go. I guess I was thinking if a study reports that 5 genes are really interesting and have really big effect sizes, but 4000 total are significant, then including all 4000 really waters down the information from the small number that are really interesting.

I suppose it fits the criteria of my website to just add all 4000, but I wasn't sure if I was reading the study wrong because that's so many. Maybe it's not that strange though. They basically tested every gene multiple times, once each in about a dozen different cell types. So if different genes are significant in each cell type, then when combined that could be a ton of unique genes.
 
Would you consider making it open source? You'd still have to review PRs, but it could be an option.
Yeah, I might do that. This is the first somewhat complete coding project I've done, and I've never collaborated with others in coding, so I'm not sure how that would go. I imagine people who are much more experienced than me sending PRs that I can't understand at all, so I'd be nervous about adding them. But I could give it a try.
 
Yeah, I might do that. This is the first somewhat complete coding project I've done, and I've never collaborated with others in coding, so I'm not sure how that would go. I imagine people who are much more experienced than me sending PRs that I can't understand at all, so I'd be nervous about adding them. But I could give it a try.
GitHub works well for sharing the code. Just because someone sends a PR it doesn't mean you have to implement it! But it could be a way for others here to help you on certain tasks.

To start with you could make it a private repository and play with the features yourself. And then if/when happy you can add some volunteer coder members you choose to help with the coding while still keeping it private for a select few. At least to get your feet wet and build confidence. Or not.

Regarding help with STRING you might want to reach out to @paolo who put together tools for this recent paper.
 
GitHub works well for sharing the code. Just because someone sends a PR it doesn't mean you have to implement it! But it could be a way for others here to help you on certain tasks.

To start with you could make it a private repository and play with the features yourself. And then if/when happy you can add some volunteer coder members you choose to help with the coding while still keeping it private for a select few. At least to get your feet wet and build confidence. Or not.
Oh yeah, I'm familiar. I don't think I have an issue with just making the code public immediately.

Regarding help with STRING you might want to reach out to @paolo who put together tools for this recent paper.
https://www.s4me.info/threads/towar...ed-gene-prioritization-2025-maccallini.43656/
Thanks for the suggestion! Maybe I'll reach out once I have at least like a couple hundred studies saved.
 
I took a long break from adding studies to my app after I let perfect become the enemy of good, and became overwhelmed and overworked.

Basically, I was trying to create all-encompassing criteria that could apply to every study I came across, and as you can see they were getting quite complicated, with new points continually being added (don't worry about reading all this):
Genes to Include
  • Gene product levels: Level of a gene product is significantly associated with having the phenotype in question (the phenotype(s) assigned to the Study Cohort being edited)
  • Coding mutations: Gene where a mutation within the coding region is significantly associated with the phenotype
  • Regulatory mutations: Gene where a mutation near the coding region is significantly associated with the phenotype, and the paper suggests or mentions the nearby gene as potentially interesting for this reason
  • Cell type markers: A gene that identifies a cell type, where the cell type is significantly increased or decreased (e.g., CD4+ T cells), even if the gene itself wasn't directly measured
  • Genes altered after in vitro stimulation: A gene that has a significantly different change after in vitro stimulation than in healthy controls.
Example where IFNG and TNF would be saved for both ME/CFS and long COVID cohorts [1]:
"We designed a classic ICS assay to provide a direct measure of the functional capabilities of magnet-enriched fresh CD8 T-cells in a format that would be easy to adapt to clinical testing. These functional ICS assays showed that CD8 T-cells of ME/CFS and Long COVID patients had a significantly diminished capacity to produce both cytokines, IFNγ or TNFα, after PMA stimulation when compared to HC as seen in representative FACS plots (Fig. S1) and following statistical analysis of multiple individuals from each group (Fig. 1)."
  • Gene product shape or activity different: Example [2]:
    Recordings of TRPM3 ion channel currents were obtained from freshly isolated NK cells from HC, post-COVID-19 condition patients, and ME/CFS patients using the whole-cell patch-clamp electrophysiological technique. Endogenous TRPM3 function was rapidly and reversibly activated by application of 100 μM PregS. We found a significant difference among three groups of ionic current amplitude after PregS stimulation (p < 0.0001).
  • Methylation differences: If methylation that relates to a specific gene is significantly different, the gene can be added.
  • Antibodies to a specific gene product: If a case group has significantly increased or decreased levels of autoantibodies to a gene product, the gene in question should be added.
  • Severity associations: Any of the above but associated with the severity/amount of the phenotype instead of with another group
  • Symptom change associations: If a clinical trial finds a correlation between a gene product and the amount that symptoms improved, the gene can be added.
  • Longitudinal studies: If a gene is significant at any timepoint in a longitudinal study, include it
  • Ambiguous cases: When authors cannot discriminate between similar genes (e.g., IGHV3-23/30), include all mentioned genes
  • Two non-healthy groups where they only differ by a phenotype: If two non-healthy groups are compared with the only difference being the addition of one or more phenotypes in one group, the finding can be saved for that phenotype difference. For example, if a study finds that those with fibromyalgia have higher IL-10 than those with fibromyalgia+ME/CFS, then IL-10 can be saved for ME/CFS because that is the only way the groups differ. (As the next section says, if the groups were fibromyalgia versus ME/CFS, then they each have one phenotype the other group does not have and thus the findings should not be saved.)
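
The last rule in the list above is essentially a set comparison, which can be sketched as follows. This is a minimal illustration, assuming each group is described by a set of phenotype names and healthy controls by an empty set; the function name `attributable_phenotypes` is hypothetical and not part of the website's code.

```python
def attributable_phenotypes(group_a: set[str], group_b: set[str]) -> set[str]:
    """Phenotypes a finding can be saved for, or an empty set if none.

    A finding is attributable only when one group's phenotypes are a strict
    superset of the other's; the extra phenotypes are the difference.
    """
    if group_a < group_b:  # strict subset
        return group_b - group_a
    if group_b < group_a:
        return group_a - group_b
    return set()  # identical groups, or each has something the other lacks


# Fibromyalgia vs. fibromyalgia + ME/CFS: the groups differ only by ME/CFS.
print(attributable_phenotypes({"fibromyalgia"}, {"fibromyalgia", "ME/CFS"}))
# Fibromyalgia vs. ME/CFS: each has a phenotype the other lacks, so nothing.
print(attributable_phenotypes({"fibromyalgia"}, {"ME/CFS"}))
```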



Genes to NOT Include
  • Protein complexes: Do not include constituent genes when a protein complex is significantly associated with the phenotype (e.g., TSH)
  • External database predictions: Genes predicted to be of interest based on predictions made using external protein-protein or gene-gene interaction databases (e.g., GSEA or STRING network analysis)
  • Rare mutations without controls: Rare mutations related to a gene, but without a comparison group to determine significance
  • Disease vs. disease only: Do not store genes from studies that only compare groups that have health conditions without healthy controls
  • Complex protein assemblies: Constituent genes of protein complexes with 4 or more components
  • Machine learning identified: A gene predicted to be of interest based on machine learning algorithms, including genes that were not individually significant but were part of a gene panel that significantly discriminated between groups in ML analysis



Statistical Significance Criteria
  • Genes are primarily limited to those that pass a p-value threshold defined in the paper, or 0.05 if none is specified.
  • Exception: If authors mention a finding because it approaches significance (e.g., p=0.053) and believe it may be important, it can be added.
  • Multiple testing correction: If authors performed multiple test correction, use the adjusted p-value. If not, use the nominal p-value.
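
The three rules above could be written as one small check. A minimal sketch, assuming "passes a threshold" means strictly below it; the function and parameter names are made up for illustration.

```python
from typing import Optional

DEFAULT_ALPHA = 0.05  # used when the paper does not define a threshold


def passes_significance(
    nominal_p: float,
    adjusted_p: Optional[float] = None,
    paper_alpha: Optional[float] = None,
    authors_flag_near_miss: bool = False,
) -> bool:
    """Apply the criteria: paper threshold (or 0.05), adjusted p if available,
    plus the exception for near-significant findings the authors highlight."""
    alpha = paper_alpha if paper_alpha is not None else DEFAULT_ALPHA
    p = adjusted_p if adjusted_p is not None else nominal_p
    return p < alpha or authors_flag_near_miss


print(passes_significance(0.03))                                # nominal p only
print(passes_significance(0.03, adjusted_p=0.2))                # fails after correction
print(passes_significance(0.053, authors_flag_near_miss=True))  # exception applies
```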

It was becoming frustrating trying to make sure I was fitting every finding correctly into this framework. Also, I had a list of the ~8000 studies that mention ME/CFS from PubMed, and it was becoming overwhelming to go through them, especially due to there being probably 50-100 studies that have no relevant genes for every one that might have something. I was getting exhausted and dispirited from trying to parse them all.

I was worrying too much about making sure every gene included was perfectly correct for some potential future statistical project, and was forgetting the main reason I made the app in the first place: a more systematic way to cross-check genes in studies.

So I started from a blank slate again, with simpler criteria:
On this site, I am compiling interesting gene findings from studies on ME/CFS.

Basically, I add findings that a paper's authors propose as potentially interesting or which can be inferred by a reader as potentially interesting, and which can be described by the name of a single gene.

Genes added here could be based on genes nearest to or predicted to be affected by significant variants in a GWAS of ME/CFS, genes which are differentially expressed in ME/CFS, protein products of genes which are up- or down-regulated in ME/CFS, differential methylation associated with a gene, or other findings.

The goal is to be able to look up any given genes of interest and see how often and in which papers these genes have previously been mentioned as potentially promising.

Some decisions on whether to include a gene or not are slightly arbitrary because of the vast diversity of methodologies and types of findings in studies. Any interpretation of gene findings on this website should be based on reading the actual studies.

Basically, I want to save any genes that we would be likely to post in a study thread here and say "these genes might be interesting". This way, when a new genetic or transcriptomic study comes out, we don't have to go search through the text, figures, and S4ME posts of old studies trying to check if the genes had previously been mentioned (not to mention there's a good chance we would miss the gene if an old study used a different name for the gene).
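
The point about old studies using a different name for a gene is the part a lookup tool has to handle. Here is a minimal sketch of the idea, not the site's actual search tool: the alias table is tiny and hand-written for illustration (a real tool would load approved symbols, previous symbols, and aliases from HGNC), and the study titles are placeholders.

```python
# Greek letters are mapped before upper-casing, so "TNFα" becomes "TNFA".
GREEK = str.maketrans({"α": "A", "β": "B", "γ": "G"})

# Illustrative subset only; not a real HGNC export.
OFFICIAL = {"TNF", "IFNG", "IL10", "IGHV3-23"}
ALIAS_TO_OFFICIAL = {"TNFA": "TNF", "TNF-ALPHA": "TNF", "IFN-GAMMA": "IFNG"}

# Placeholder index of saved findings: official symbol -> study titles.
studies_by_gene = {"TNF": ["Study A", "Study B"], "IFNG": ["Study A"]}


def to_official(name: str):
    """Map a gene name as written in a paper to an official symbol, or None."""
    key = name.strip().translate(GREEK).upper()
    # Try the name as written first, then with hyphens and spaces removed,
    # so "IL-10" matches IL10 while "IGHV3-23" still matches exactly.
    for candidate in (key, key.replace("-", "").replace(" ", "")):
        if candidate in OFFICIAL:
            return candidate
        if candidate in ALIAS_TO_OFFICIAL:
            return ALIAS_TO_OFFICIAL[candidate]
    return None


def lookup(name: str) -> list:
    """Studies in which the gene was previously saved, under any known name."""
    return studies_by_gene.get(to_official(name), [])


print(lookup("TNFα"))       # found via the alias table
print(lookup("tnf-alpha"))  # same gene, different spelling
print(lookup("IL-10"))      # known gene, no studies saved yet
```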

So no hard limits on significance or convoluted criteria about types of findings. If the authors say a gene might be relevant to ME/CFS based on their results, or if one would assume it might be relevant, I'll add it. If a study gives over 4000 significant genes (which at least one transcriptomics study did), I might add only the genes they talk about in the text.

Instead of going one by one through PubMed results, this time I'm going one by one in papers posted to the ME/CFS Research forum, going backwards from present day. But the priority will be adding new studies as they come out, with existing studies being a secondary concern that I'll try to slowly work on adding.

If anyone thinks there are any really important studies that it'd be good to save the genes for, and which aren't yet on the studies page, feel free to post in this thread or DM me.

It's still at the same domain, sickgenes.xyz, with 20 studies saved so far after restarting. I moved the previously logged data to old.sickgenes.xyz.
 
Great to see you find a way to get back to this @forestglip and thank you for all the thought and work you’ve put in. And for sharing your thinking.

It’s always such a difficult balance to reach, isn’t it? I was thinking similarly on my recent projects and used to see the same dilemma often with professional projects. But for those of us with ME/CFS… it’s even harder.
 