Personal project: Sick Genes, a website for compiling significant gene findings from studies on ME/CFS and other conditions

forestglip

Moderator
Staff member
For the past month or so I've been working on a project. It's a website for logging genes that were significant from studies on ME/CFS, meant to make it easier to compare genes that are replicated in multiple studies. You can take a look at https://sickgenes.xyz

[Screenshot of the website]



In essence, it's like the tag system on this forum. Each study gets tags representing all the genes that were significant.

One of the advantages over simple tags is that I use standardized names for genes. Different studies can refer to the same gene by different names, and the website wouldn't know they were the same if they were just stored as written. So I made use of the HGNC database, which creates official names for every gene, and also has aliases for each gene. I made a search tool to plug in the name of a gene from a study to match it to the official HGNC symbol before saving.
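A minimal sketch of this kind of alias lookup, assuming the HGNC data has already been loaded into a list of records. The record layout, the function names, and the two sample entries are illustrative assumptions (the entries should be checked against the HGNC download), not the site's actual code:

```python
from collections import defaultdict

# Illustrative records only; a real table would be loaded from the HGNC download.
HGNC_RECORDS = [
    {"symbol": "TNF", "aliases": ["TNFSF2"], "previous": ["TNFA"]},
    {"symbol": "EIF2AK4", "aliases": ["GCN2"], "previous": []},
]


def build_index(records):
    """Map every upper-cased name (symbol, alias, previous symbol) to official symbols."""
    index = defaultdict(set)
    for rec in records:
        for name in [rec["symbol"], *rec["aliases"], *rec["previous"]]:
            index[name.upper()].add(rec["symbol"])
    return index


def resolve(name, index):
    """Return a sorted list of candidate official symbols for a name taken from a paper."""
    return sorted(index.get(name.strip().upper(), set()))


INDEX = build_index(HGNC_RECORDS)
print(resolve("gcn2", INDEX))  # ['EIF2AK4']
print(resolve("TNFA", INDEX))  # ['TNF']
print(resolve("FOO", INDEX))   # []
```

Returning a list rather than a single symbol leaves room for aliases that point at more than one gene, so the person entering the data can pick the right one before saving.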

Another advantage is that the data can be manipulated more easily than with forum tags. For example, I made a page that lists the genes in order of how many studies they were significant in. Presumably, those genes that appeared in the most studies might be the best leads. I'm also thinking about the possibility of using the STRING protein-protein interaction network to find the most interesting genes based on clusters of related genes that were found multiple times, instead of based on single genes. The data can also be downloaded by anyone to explore in any other way. [Edit: still need to add the functionality to download data.]
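The "genes ordered by number of studies" page could be sketched roughly like this. The input shape (a dict from study ID to a list of official symbols) and the placeholder gene names are assumptions for illustration:

```python
from collections import Counter


def rank_genes(study_genes):
    """Rank genes by the number of distinct studies in which they were significant."""
    counts = Counter()
    for genes in study_genes.values():
        # Use a set so a gene listed twice in one study is only counted once.
        counts.update(set(genes))
    # Most studies first; ties broken alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


example = {
    "study_a": ["GENE1", "GENE2", "GENE2"],
    "study_b": ["GENE2", "GENE3"],
    "study_c": ["GENE1", "GENE2"],
}
print(rank_genes(example))  # [('GENE2', 3), ('GENE1', 2), ('GENE3', 1)]
```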



I currently have a list of all 8000+ studies that come up for "ME/CFS" (and associated terms) in PubMed, and I'm going through them to find the ones that had results that include genes, and then I store the significant genes on this website. It might take some time to store findings from all the relevant published studies, but once I do, it shouldn't be hard to keep up with new studies.

To make things manageable, I'm keeping it simple by lumping genetic, transcriptomic, and proteomic studies together. So if a gene is significant because people with ME/CFS have a mutation in it, it gets added. Or if it is significant because the protein it encodes is found at different levels in ME/CFS, it gets added. And I also don't make any distinction between untargeted studies on thousands of genes versus studies that tested one or two genes.

I'm also not saving any information about direction, significance, or effect size, again to keep things simple.



Studies also get "disease" tags, which can be "ME/CFS", but can also include tags like "CCC" or "Fukuda" if the diagnostic criteria were specified, in order to allow for more targeted analysis. And I can also save genes that were significant in other diseases, though for now I'm focusing on getting through all ME/CFS studies.



I might at some point consider adding functionality to make adding gene findings more collaborative, to make things go faster, especially if there's interest in doing this for other diseases. But that would add a lot of complexity to the website because of the need for user accounts that get tied to the genes each user adds, and it'd require documenting the exact criteria for deciding which genes to save.



Many studies present interesting questions for deciding if reported genes make sense to save for a given disease, and I'm still basically making up the rules as I go for these. A couple examples:
  • If a gene is significant between two different diseases, as opposed to between a disease and a healthy group, do I store the gene as significant for both diseases? Or neither? I've decided on both for now.
  • If a study says more CD8 cells, do I store the CD8 gene as significant? What is significant here is having more of the entire cells, not the CD8 gene which is just one tiny component. Still, I decided to store genes like CD8 as significant in these cases because CD8 technically is more abundant, even if only in the context of lymphocytes.
  • Do I store the genes from a study like the Zhang machine learning paper? They used STRING to predict genes that might be important in ME/CFS, so it is possible that many of these genes were not even strikingly different between the groups that they studied. I'm leaning towards not saving these kinds of findings. (I'm also not storing preprints for now.)
Link: https://sickgenes.xyz
 
Interesting project and a lot of work!

Would it not make sense to also store if something is "up" or "down" rather than just either?
It might, but it adds a lot more complexity. Partly that's the extra code needed to store that kind of information, but mainly it's that a single study might find a gene both up and down in different parts of the study. For example, it might be up in the CSF and down in the plasma, and I couldn't think of a good way to store that, except maybe storing it twice. It'd also require discriminating between mutation findings, which don't have a direction, and abundance findings.

I'm trying to keep things as simple as possible while still being potentially valuable so that I can more quickly store lots of studies. I think the marginal benefit from adding up or down isn't worth it at this point. Especially if I'm not adding location of the finding, like CSF or plasma. Maybe a gene being up in plasma is known to cause it to be down in CSF, so in this case you wouldn't want to just look for genes that are going the same direction.
 
I wonder if it would be possible for others to help out? Spread the workload and crowd source the effort.
Theoretically, yeah, I would like to make it so that I can give other people accounts to be able to add studies and genes from those studies.

That's the part I'm most nervous about though because of grey areas in deciding what genes make sense to save, and I don't want there to be too much variety in the types of findings that get saved. I listed some of those grey areas in the first post, but maybe we can all come to some consensus on the best way to go for these types of things. Maybe I'll try to write up my current criteria for saving findings here first.
 
Current procedure and criteria:

Add study
First add a study by typing in the DOI, which looks up the rest of the info about the study, then click Save.
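A rough sketch of a DOI lookup like this, assuming the metadata comes from the public Crossref REST API (the thread doesn't say which service the site actually uses). The function names and the returned field names are made up for illustration:

```python
import json
import urllib.parse
import urllib.request

_DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw):
    """Strip whitespace and common URL or 'doi:' prefixes from a pasted DOI."""
    doi = raw.strip()
    for prefix in _DOI_PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.strip()


def parse_crossref(message):
    """Pull a few fields out of the 'message' object of a Crossref works response."""
    titles = message.get("title") or [""]
    journals = message.get("container-title") or [""]
    date_parts = message.get("issued", {}).get("date-parts") or [[None]]
    year = date_parts[0][0] if date_parts[0] else None
    return {
        "doi": message.get("DOI", ""),
        "title": titles[0],
        "journal": journals[0],
        "year": year,
    }


def fetch_study(raw_doi):
    """Look up a DOI on Crossref. This makes a network call and is not run here."""
    url = "https://api.crossref.org/works/" + urllib.parse.quote(normalize_doi(raw_doi))
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_crossref(json.load(resp)["message"])
```

Keeping the parsing separate from the network call means the parsing can be checked against a saved response without hitting the API.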

Add study cohort
On the study detail page, click Add study cohort. A study cohort describes the group of interest. For example, if a study compared ME/CFS to healthy controls, create a study cohort with the tag "ME/CFS". If the ME/CFS group all fulfilled CCC criteria, then add a tag for that as well for this study cohort.

If a study looked at multiple groups of people, like ME/CFS and long COVID and migraine, then make a different study cohort for each. If the same group has multiple diseases, for example both ME/CFS and migraine, then make one study cohort with tags for both diseases. Every person in the study cohort should have the disease or fit the descriptor; otherwise, don't include that tag (e.g. don't add a tag for CCC criteria if only most people in the group fulfilled it).

Add gene findings
Once a study cohort has been added to the study page, click Add gene findings. Add all the genes that the paper found to be significant for that study cohort. The criteria for determining which genes to include are below.

Make sure you include genes that are reported in figures and supplementary data, as well as in the main text.
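The study / study cohort / gene findings structure described above could be modelled along these lines. This is only a sketch: the class names, fields, placeholder DOIs, and gene names are assumptions, not the site's real schema:

```python
from dataclasses import dataclass, field


@dataclass
class StudyCohort:
    tags: set                                  # e.g. {"ME/CFS", "CCC"}
    genes: set = field(default_factory=set)    # official symbols significant for this cohort


@dataclass
class Study:
    doi: str
    cohorts: list = field(default_factory=list)


def genes_for_tags(studies, required_tags):
    """Return {gene: set of DOIs} for cohorts that carry all of the required tags."""
    required = set(required_tags)
    found = {}
    for study in studies:
        for cohort in study.cohorts:
            if required <= set(cohort.tags):
                for gene in cohort.genes:
                    found.setdefault(gene, set()).add(study.doi)
    return found


studies = [
    Study("10.0000/example-1", [StudyCohort({"ME/CFS", "CCC"}, {"GENE1", "GENE2"})]),
    Study("10.0000/example-2", [
        StudyCohort({"ME/CFS"}, {"GENE2"}),
        StudyCohort({"Long COVID"}, {"GENE3"}),
    ]),
]
print(genes_for_tags(studies, {"ME/CFS"}))
print(genes_for_tags(studies, {"ME/CFS", "CCC"}))
```

Attaching the genes to the cohort rather than to the study is what makes a more targeted query possible, such as only findings from cohorts where everyone met CCC.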

Criteria
Findings to add:

  • Mutation in a gene which is associated with having a disease.
  • Mutation near a gene, which the authors suggest may be associated with that gene.
  • More or less expression of the gene as mRNA or as a protein.
  • Expression or mutations associated with severity of a disease, even in absence of a control group.
Grey areas:
  • Genes that are predicted to be of interest based on STRING network.
    • Don't save
  • Genes predicted to be of interest from machine learning based solely on the data from the study
    • Save
  • Genes named in pathways from pathway analysis of other genes in a study. For example, if GSEA found that the pathway "Response of GCN2" is significant, but the protein GCN2 on its own is not significant.
    • Don't save
  • Genes significant between different disease groups instead of between disease group and healthy group.
    • Save for cohorts for both disease groups
  • Genes that describe a cell type (e.g. CD8 T cells)
    • Save
  • Rare mutation in a gene in a single patient
    • Don't save
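If these rules were ever written down for other contributors, one option would be a plain lookup table. The category keys below are made-up labels for the items in the lists above, and the values mirror the current decisions:

```python
# Hypothetical labels for the finding types listed above; True means "save".
SAVE_RULES = {
    "mutation_in_gene": True,
    "mutation_near_gene": True,
    "expression_change": True,
    "severity_association": True,
    "string_network_prediction": False,
    "ml_prediction_from_study_data": True,
    "pathway_name_only": False,
    "between_disease_groups": True,   # saved for the cohorts of both disease groups
    "cell_type_marker": True,
    "single_patient_rare_mutation": False,
}


def should_save(finding_type):
    """Return the current decision, or raise for a type that has no rule yet."""
    try:
        return SAVE_RULES[finding_type]
    except KeyError:
        raise ValueError(f"No rule for finding type: {finding_type!r}") from None


print(should_save("cell_type_marker"))            # True
print(should_save("string_network_prediction"))   # False
```

Raising an error for an unknown type, instead of silently defaulting, forces a new grey area to be decided explicitly.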

----

A study I'm not sure whether it's right to save: https://www.sciencedirect.com/science/article/pii/S1807593225000304

They found a mutation in a gene that's likely responsible for this one person's illness, so I'd probably save that gene. Where it's questionable is that they refer to it as being misdiagnosed as ME/CFS. Can that be stored as ME/CFS since she fulfills the criteria, but the authors think the label does not apply to her because they found a specific cause?
 
That makes sense. Thanks for sharing and for outlining your process.

Cooperative efforts, and certainly supporting them, can be a lot of work. And I understand the difficulty over criteria; ultimately, the quality of the data in the database is the valuable part.

Maybe there’s scope for a well trained team of mini-glips down the line.

Hope you can be proud of what you’ve achieved and take some time to rest. Well done, impressive achievement.
 
Thank you.

In my head, I'm imagining this website as kind of a proof of concept that eventually convinces people who have much more expertise in web development and biomedical research than me to take over or make a better version.

I think I can get through all ME/CFS studies on my own in a few months, since I can quickly dismiss most of them when the title or abstract isn't relevant. The proportion that have gene findings is pretty small.
 
That looks fantastic @forestglip :thumbsup::thumbsup: A lot of work, but I'm going to be referring back to this quite a bit, I think.
I hope you and others can find it useful. The way we've been doing it here of just making a post with all the genes in a thread for a study to be able to search for them is good, but I thought we could probably complement that with a more systematic approach.

Great having the link to the S4ME thread too. You might like to strip the trailing page segment, e.g. "/page-2" (when present), from the URL, to link to the start of each thread? (Unless it's deliberate.)
Oh, I actually try to remove those when I add the link. I guess some slipped by. I could probably add some code that removes the extra bit if it accidentally gets added.
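Something like the following could do that cleanup. The function name and the example URL are made up, and it assumes the page suffix always looks like "/page-N" at the end of the path:

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Matches a trailing "/page-<number>" segment, with or without a final slash.
_PAGE_SUFFIX = re.compile(r"/page-\d+/?$")


def thread_start_url(url):
    """Drop a trailing page segment, plus any query or fragment, from a thread URL."""
    parts = urlsplit(url.strip())
    path = _PAGE_SUFFIX.sub("/", parts.path)
    # Assumption: the query and "#post-..." fragment are not needed for a thread link.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


print(thread_start_url("https://www.s4me.info/threads/example-thread.12345/page-2"))
# https://www.s4me.info/threads/example-thread.12345/
```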
 
Can that be stored as ME/CFS since she fulfills the criteria, but the authors think the label does not apply to her because they found a specific cause?

I'd definitely include that study. It's clear she has ME/CFS. The authors seem to be arguing that ME/CFS requires no alternative cause to explain symptoms and they found the cause in this specific patient.

There's a hint that they think ME/CFS is psychobehavioural, so that finding a biological cause means it's not ME/CFS. They don't state this, and I may be over-interpreting, but they write a BPS-adjacent intro:

Chronic Fatigue Syndrome (CFS), also known as Myalgic Encephalomyelitis (ME) […] The cause of ME/CFS is not known, but it is thought to be triggered by a combination of factors. As there is no biomarker to confirm the diagnosis, it can only be made by ruling out other health problems with similar symptoms.

Regardless, extending their logic, when we do find out the mechanism(s) of ME/CFS - suddenly no more ME/CFS?
 
I'd definitely include that study. It's clear she has ME/CFS. The authors seem to be arguing that ME/CFS requires no alternative cause to explain symptoms and they found the cause in this specific patient.
Thanks, yes, that's what I was leaning towards. I didn't know if I should take a purist approach of just accepting whatever authors say, or allowing for a bit of interpretation.
 
In my head, I'm imagining this website as kind of a proof of concept that eventually convinces people who have much more expertise in web development and biomedical research than me to take over or make a better version.
Wish I could help, seems like a fun and useful project. I'm a software engineer but not much of one at my current severity. :emoji_sweat_smile:
 
Just went through this paper, which is quite intense and full of terms I don't understand:

Transcriptional reprogramming primes CD8+ T cells toward exhaustion in Myalgic encephalomyelitis/chronic fatigue syndrome

I stored all the genes from the text and images that appear to be differentially expressed in ME/CFS. A lot of parts I wasn't sure about, like what differentially accessible here means:
Differential accessibility analysis found 471 upregulated and 1,477 downregulated ChARs in ME TM cells (Fig. 3A). We associated ChARs with their nearest genes and aggregated them to generate gene-level analyses. Using this approach, we found 43 upregulated genes and 192 downregulated genes between the two cohorts (Fig. 3A). ME TM cells exhibited decreased accessibility at genes associated with chemokine and cytokine signaling (CCR4, CCL2, CSF2, IL3; Fig. 3B), and decreased accessibility for genes involved in NF-κB/TNFα signaling and KRAS signaling (Fig. 3C), suggesting that ME TM cells are refractory to activation to facilitate peripheral tolerance (46).
But I stored those genes for now.

There's also a supplementary Dataset 1:
We sought to interrogate the transcriptional programs of case and control lymphocytes at the single cell level at baseline. We identified dysregulation across multiple clusters, with the greatest signal in CD8+ T cell subsets and γδT cells (Fig. 1B and Dataset S1).
It has around 4000 unique genes that have an adjusted p-value below .05. That seems like a lot. That's around one tenth to one fifth of all human genes. So I'm not sure if I should just add all of these or if I'm misunderstanding something.
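A small sketch of how a count like this could be checked, assuming the dataset is exported as rows with a gene symbol and an adjusted p-value per cluster. The column names and the toy rows are assumptions, and the two denominators are just the totals implied by "one fifth" and "one tenth":

```python
def significant_genes(rows, alpha=0.05):
    """Collect unique gene symbols with at least one adjusted p-value below alpha."""
    return {row["gene"] for row in rows if row["p_adj"] < alpha}


# Toy rows: the same gene can appear once per cluster.
rows = [
    {"cluster": "c1", "gene": "GENE1", "p_adj": 0.01},
    {"cluster": "c2", "gene": "GENE1", "p_adj": 0.03},
    {"cluster": "c1", "gene": "GENE2", "p_adj": 0.20},
    {"cluster": "c2", "gene": "GENE3", "p_adj": 0.04},
]
hits = significant_genes(rows)
print(len(hits))  # 2 unique genes, even though 3 rows pass the threshold

# Fraction of all genes, for two assumed totals.
for total in (20_000, 40_000):
    print(total, 4000 / total)  # 0.2 and 0.1
```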

---

Edit: I think I'll post things when recording genes that aren't very straightforward on this thread like above, just in case anyone has any suggestions. But I think it's okay if it's not perfect. I might miss some genes or mistakenly mark some genes as significant based on a misunderstanding, but I don't think it's a big deal if the vast majority of genes recorded are correct. Just a bit of noise in the data.
 