For the past month or so I've been working on a project. It's a website for logging genes that were significant from studies on ME/CFS, meant to make it easier to compare genes that are replicated in multiple studies. You can take a look at https://sickgenes.xyz
Screenshot:

In essence, it's like the tag system on this forum. Each study gets tags representing all the genes that were significant.
One advantage over simple tags is that I use standardized gene names. Different studies can refer to the same gene by different names, and the website wouldn't know they were the same if the names were just stored as written. So I use the HGNC database, which assigns an official symbol to every gene and also lists aliases for each one. I made a search tool that takes a gene name as written in a study and matches it to the official HGNC symbol before saving.
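To illustrate the idea, here is a minimal sketch of alias matching in Python. This is not the site's actual code: the record layout, the function names, and the three sample records are assumptions for illustration, and a real table would be built from an HGNC download rather than typed in by hand. It returns a list because one written name can sometimes match more than one official symbol.

```python
from collections import defaultdict

# Toy records for illustration only; check any real mapping against HGNC itself.
HGNC_RECORDS = [
    {"symbol": "TNF", "aliases": ["TNFSF2", "DIF", "TNF-alpha"], "previous": ["TNFA"]},
    {"symbol": "NR3C1", "aliases": ["GR"], "previous": ["GRL"]},
    {"symbol": "IL6", "aliases": ["IL-6", "BSF2"], "previous": ["IFNB2"]},
]


def normalize(name):
    """Compare names case-insensitively and ignore surrounding whitespace."""
    return name.strip().upper()


def build_index(records):
    """Map every known name (official, alias, previous) to its official symbol(s)."""
    index = defaultdict(set)
    for rec in records:
        for name in [rec["symbol"], *rec["aliases"], *rec["previous"]]:
            index[normalize(name)].add(rec["symbol"])
    return index


def resolve(name, index):
    """Return the official symbols a written name could refer to (may be empty)."""
    return sorted(index.get(normalize(name), set()))


INDEX = build_index(HGNC_RECORDS)
print(resolve("tnf-alpha", INDEX))  # ['TNF']
print(resolve("GR", INDEX))         # ['NR3C1']
print(resolve("NOTAGENE", INDEX))   # []
```

An empty list means the name needs a manual look, and a list with more than one entry means the name is ambiguous and someone has to pick.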
Another advantage is that the data can be manipulated more easily than with forum tags. For example, I made a page that lists the genes in order of how many studies they were significant in. Presumably, those genes that appeared in the most studies might be the best leads. I'm also thinking about the possibility of using the STRING protein-protein interaction network to find the most interesting genes based on clusters of related genes that were found multiple times, instead of based on single genes. The data can also be downloaded by anyone to explore in any other way. [Edit: still need to add the functionality to download data.]
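The ranking page boils down to counting, for each gene, how many studies it showed up in. A rough sketch in Python, assuming the data is available as a mapping from study to a set of official symbols (the study IDs and gene names below are made-up placeholders, not real findings):

```python
from collections import Counter

# Placeholder data: study identifier -> genes reported as significant.
STUDY_GENES = {
    "study_a": {"GENE_A", "GENE_B", "GENE_C"},
    "study_b": {"GENE_A", "GENE_B"},
    "study_c": {"GENE_A"},
}


def rank_genes(study_genes):
    """Return (gene, study_count) pairs, most frequently reported first."""
    counts = Counter()
    for genes in study_genes.values():
        counts.update(set(genes))  # set() so a gene counts once per study
    # Descending by count, then alphabetical so ties come out in a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


print(rank_genes(STUDY_GENES))
# [('GENE_A', 3), ('GENE_B', 2), ('GENE_C', 1)]
```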
I currently have a list of all 8000+ studies that come up for "ME/CFS" (and associated terms) in PubMed, and I'm going through them to find the ones that had results that include genes, and then I store the significant genes on this website. It might take some time to store findings from all the relevant published studies, but once I do, it shouldn't be hard to keep up with new studies.
To make things manageable, I'm keeping it simple by lumping genetic, transcriptomic, and proteomic studies together. So if a gene is significant because people with ME/CFS have a mutation in it, it gets added. Or if it is significant because the protein it encodes is found at different levels in ME/CFS, it gets added. And I also don't make any distinction between untargeted studies on thousands of genes versus studies that tested one or two genes.
I'm also not saving any information about direction, significance, or effect size, again to keep things simple.
Studies also get "disease" tags, which can be "ME/CFS" but can also include tags like "CCC" or "Fukuda" if the diagnostic criteria were specified, to allow for more targeted analysis. I can also save genes that were significant in other diseases, though for now I'm focusing on getting through all the ME/CFS studies.
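As a sketch of what "more targeted analysis" could look like, the snippet below filters studies by disease tag before collecting their genes. The `Study` class, its field names, and the sample entries are hypothetical placeholders, not the site's real schema or real findings.

```python
from dataclasses import dataclass, field


@dataclass
class Study:
    title: str
    disease_tags: set = field(default_factory=set)
    genes: set = field(default_factory=set)


# Placeholder studies for illustration only.
STUDIES = [
    Study("example 1", {"ME/CFS", "CCC"}, {"GENE_A", "GENE_B"}),
    Study("example 2", {"ME/CFS", "Fukuda"}, {"GENE_A"}),
    Study("example 3", {"ME/CFS"}, {"GENE_C"}),
]


def genes_for_tags(studies, required_tags):
    """Collect genes from studies that carry every one of the required tags."""
    required = set(required_tags)
    genes = set()
    for study in studies:
        if required <= study.disease_tags:  # subset test: all required tags present
            genes |= study.genes
    return sorted(genes)


print(genes_for_tags(STUDIES, {"CCC"}))     # ['GENE_A', 'GENE_B']
print(genes_for_tags(STUDIES, {"ME/CFS"}))  # ['GENE_A', 'GENE_B', 'GENE_C']
```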
I might at some point consider adding functionality to make adding gene findings more collaborative, to make things go faster, especially if there's interest in doing this for other diseases. But that would add a lot of complexity to the website because of the need for user accounts that get tied to the genes each user adds, and it'd require documenting the exact criteria for deciding which genes to save.
Many studies raise interesting questions about whether the reported genes make sense to save for a given disease, and I'm still basically making up the rules as I go. A few examples:
- If a gene is significant between two different diseases, as opposed to between a disease and a healthy group, do I store the gene as significant for both diseases? Or neither? I've decided on both for now.
- If a study reports more CD8 cells, do I store the CD8 gene as significant? What is significant here is having more of the cells as a whole, not the CD8 gene, which is just one tiny component. Still, I decided to store genes like CD8 as significant in these cases because CD8 technically is more abundant, even if only in the context of lymphocytes.
- Do I store the genes from a study like the Zhang machine learning paper? They used STRING to predict genes that might be important in ME/CFS, so it is possible that many of these genes were not even strikingly different between the groups that they studied. I'm leaning towards not saving these kinds of findings. (I'm also not storing preprints for now.)