I don't think this is 'just' clutching at straws. The relation to NEGR1 is very uncertain, but the p value threshold for the GWAS, as I understand it, has to be set very stringently for multiplicity reasons, so there are going to be lots of type 2 errors missing genuine links. I think the odds of there being a genuine risk factor in this DNA segment are very high. And being quite well off the NEGR1 gene itself is not an indication that NEGR1 expression is not the risk mediator. It may well not be, but if it is, it would be foolish to have ignored it.
To some extent we are clutching at straws for everything here but maybe a bit more clutching at dry grass on a bank, where some of the time it comes away in your hand but other times it gives you a lift up one step!
Could you spell this out a bit more for my slow brain:
Are you saying you looked around NEGR1 and found DecodeME hits (i.e. SNPs with relatively low p-value?) in locations where it would make sense for a promoter or enhancer to be, though none have been identified there yet by researchers?
If you look at the LocusZoom for NEGR1 you can see there’s a little arrow on the label for NEGR1 pointing from right to left. That indicates it is read in that direction, so a promoter (that starts the read) would be off in that direction. And that’s where the hits are in DecodeME.
But exactly where doesn’t seem to be an exact science.
I’m not sure if there’s a data mismatch somewhere, more info in GeneCards than GenomeBrowser, or maybe I’m missing some of these uncharacterised locations in my new scripts or just don’t understand this well enough…
If you look at the LocusZoom for NEGR1 you can see there’s a little arrow on the label for NEGR1 pointing from right to left. That indicates it is read in that direction, so a promoter (that starts the read) would be off in that direction. And that’s where the hits are in DecodeME.
As promoters are typically immediately adjacent to the gene in question, positions in the promoter are designated relative to the transcriptional start site, where transcription of DNA begins for a particular gene (i.e., positions upstream are negative numbers counting back from -1, for example -100 is a position 100 base pairs upstream).
I could easily have this very wrong, but my impression is that promoter sequences are usually very very close to the start of the gene sequence.
I think those 73.00, 73.50 numbers on the x axis are millions of base pairs, so the area where there are good hits (logp of around 7) is a really long way in base pair terms from the start of gene NEGR1, i.e. 72,300,000 is approximately the start of NEGR1 and the good hits are at 73,000,000 to 73,500,000.
It's possible that the DecodeME Chromosome 1 good hits are enhancers for NEGR1, but then it's also possible that they are enhancers for other genes.
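To put rough numbers on the promoter convention and the distances discussed above, here is a minimal sketch. It assumes NEGR1 is read right to left (minus strand), as the LocusZoom arrow suggests, and uses the approximate start of 72,300,000 quoted in this thread rather than an exact annotated TSS. The function name `tss_relative` is just an illustrative helper, not part of any tool mentioned here.

```python
def tss_relative(pos: int, tss: int, strand: str) -> int:
    """Position relative to the TSS: negative is upstream, and there is no position 0."""
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    # On the minus strand the gene is read right to left, so larger
    # genomic coordinates lie upstream of the TSS.
    offset = pos - tss if strand == "+" else tss - pos
    # The TSS itself is +1; the base immediately upstream is -1.
    return offset + 1 if offset >= 0 else offset


# Assumptions taken from the thread, not from an annotation database:
NEGR1_TSS_APPROX = 72_300_000  # approximate start of NEGR1 read off the plot
NEGR1_STRAND = "-"             # inferred from the right-to-left arrow

for hit in (73_000_000, 73_500_000):
    print(hit, tss_relative(hit, NEGR1_TSS_APPROX, NEGR1_STRAND))
```

With these inputs the good hits come out at roughly -700,000 to -1,200,000, i.e. far further upstream than the few hundred base pairs where a promoter would normally be expected, which fits the enhancer reading better than the promoter one.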
Yes, that’s what I was thinking and trying to dig into.
I think I’ve noticed a problem in that my scripts look for all TFBS in a region rather than for a gene. This could explain the difference I was seeing and limit my results for enhancers away from the main gene, I’ll investigate and see if there’s a better approach or what results I can get by increasing the window around a gene…
Edit: the API only allows region based searches, I can easily increase the search window around the gene then filter results down… Or look at trying to get the entire genehancer data file and filter it locally to be more comprehensive, that looks possibly better but is a job for later I think, although one I’m interested in looking into.
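A rough sketch of the "widen the window, then filter down" idea, purely for illustration. The record layout (`start`, `end`, `target_genes`) and the toy entries are made up for this example and are not the real GeneHancer or GenomeBrowser format; the actual API call is left out entirely.

```python
from typing import Dict, List, Tuple

Record = Dict[str, object]


def widen(start: int, end: int, flank: int) -> Tuple[int, int]:
    """Extend a gene's region by `flank` base pairs on both sides."""
    return max(1, start - flank), end + flank


def in_window(rec: Record, window: Tuple[int, int]) -> bool:
    """True if the regulatory element overlaps the search window."""
    return rec["start"] <= window[1] and rec["end"] >= window[0]


def elements_for_gene(records: List[Record], window: Tuple[int, int], gene: str) -> List[Record]:
    """Keep only elements in the window that list `gene` as a target."""
    return [r for r in records if in_window(r, window) and gene in r["target_genes"]]


# Toy placeholder data (hypothetical coordinates and gene names).
toy_records: List[Record] = [
    {"id": "E1", "start": 900, "end": 950, "target_genes": ["GENE_A"]},
    {"id": "E2", "start": 2_500, "end": 2_600, "target_genes": ["GENE_A", "GENE_B"]},
    {"id": "E3", "start": 2_700, "end": 2_800, "target_genes": ["GENE_B"]},
    {"id": "E4", "start": 9_000, "end": 9_100, "target_genes": ["GENE_A"]},
]

window = widen(start=1_000, end=2_000, flank=1_000)
hits = elements_for_gene(toy_records, window, "GENE_A")
print(window, [r["id"] for r in hits])
```

The point of filtering by target gene after the region search is that a wider window pulls in elements belonging to neighbouring genes too, so the window size alone no longer decides what gets counted for the gene of interest.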
And so people are assuming that those hits between 73.00 and 73.50 are affecting the expression of the NEGR1 gene.
How much of a long shot is that? Is it just clutching at straws?
Is it at least as likely that the hits are affecting other genes? For example, KRT8P21 is in that range of peaks - below the purple diamond.
Good question. To restate what's at issue here: if we have a significant locus, and we have no other information, what's the likelihood that the nearest protein-coding gene (NEGR1 in this case) is the pathogenic gene implicated by the significant variant?
I realize I rather uncritically accepted this claim made a few months ago, and maybe read too much into it, without digging into it much:
In the lecture, they mention that the closest gene is certainly not always the causal one but it is significantly more likely to be so than further away genes. So perhaps it would be worthwhile to highlight the closest 1-2 genes for each of the hits, as these are more likely to be relevant than others.
I'm not using the following as evidence, since I haven't dug into the papers cited, but rather just pasting potential leads to follow to learn more about this. These are tweets from someone named Eric Fauman.
Dr. Fauman is a senior scientific director at Pfizer. He received his Ph.D. from the University of California, San Francisco for work on protein structure determination by X-ray crystallography and structure-based drug design. For the past 20 years, he has been a computational biologist at Pfizer, working with disease biologists to identify and prioritize drug targets to address unmet medical needs. He currently leads a team of computational biologists and geneticists dedicated to Computational Target Validation, working across multiple disease areas. Dr. Fauman is particularly interested in the intersection of genetics and quantitative molecular traits, and has published on the analysis and interpretation of molecular and protein quantitative trait loci.
While it is true that the gene closest to a GWAS peak is not always the causal gene, it is also true that it usually is.
In fact, we can quantify how often we should expect the causal gene to be the closest gene, and that number is about 70%
3 papers from 2021 help pin this down:
These papers provide 3 independent approaches to quantifying the distribution of ordinal rank for the causal gene from a lead GWAS SNP
Here I'm defining distance to the "gene body" (TSS-TES)
At least in ABCmax the lead variant has been fine-mapped.
closest gene: 70%-76%
As far as the question of, if we're talking about closest gene, why NEGR1 and not the non-coding KRT8P21, here's at least what that same researcher has to say:
And I do talk a lot about the closest gene, but almost always I mean the closest protein coding gene. I think it is very rare for the causal gene to be not a protein coding gene.
I don't know how right he is, and haven't looked at those three papers. I assumed there would be a good paper that was focused specifically on how good of a prediction "closest protein-coding gene to a GWAS hit" is, and there might be, but I was having trouble finding one that clearly laid it out, though my search wasn't very extensive.
It seems to me that any type of such evidence is going to be affected by selection bias (i.e. the nearest gene is often seen to be the true causal gene, but only because it's easiest to identify the causal gene if it's the nearest gene) and also affected by there not really being a good ground truth for well-validated connections between very many complex trait variants and which specific genes they affect.
Also, at least one of the papers cited above is based on how often genes known to be affected by pQTLs (variants which are known to affect expression of a protein) are the closest gene to the variant. That may wrongly assume that variants affecting complex traits have similar properties to pQTL variants; as a paper posted elsewhere showed, complex trait variants may be quite different from QTLs.
I'm not sure about these things though, just my sense for why this may be hard to know for sure, and why you might be right that it may be premature to focus too much on NEGR1.
But with regard to the idea of ME/CFS Science Blog quoted above, where if we suspect ME/CFS is a nervous system disease, and one of the nearest genes is a nervous system gene, then maybe it's worth giving it a little extra weight:
[edit: see context below regarding the quote - I was thinking of the wrong statement]
To illustrate with an extreme example: Assume we were studying a condition of fingernail disease and the nearest gene to a significant variant is a gene whose only known function is in the growth of fingernails. In that case, I think focusing on that gene as a good candidate would make sense.
Whether we can be confident enough yet about (1) ME/CFS being primarily a neurological disease, or about (2) how well NEGR1's function would fit into that as opposed to other nearby genes, I'm not sure. I think other genetic evidence like DecodeME's MAGMA and the Zhang paper's synapse enrichment, makes (1) somewhat likely, and NEGR1 already being implicated in other brain-related diseases, such as depression and anxiety, makes (2) somewhat likely.
And I do talk a lot about the closest gene, but almost always I mean the closest protein coding gene. I think it is very rare for the causal gene to be not a protein coding gene.
Interestingly, the gene that causes lack of pain, which my friends at UCL were interested in, codes for a long non-coding RNA, not a protein.
But with regard to the idea of ME/CFS Science Blog quoted above, where if we suspect ME/CFS is a nervous system disease, and one of the nearest genes is a nervous system gene, then maybe it's worth giving it a little extra weight:
To clarify: I didn't highlight NEGR1 much in my blog article because the gene was quite far from the DecodeME hit.
I didn't assume ME/CFS is a nervous system disease and tried looking at potential genes from that angle. Instead, I focused on the protein-coding genes closest to DecodeME signals with little competition (few other protein coding genes around). These have a reasonable chance of being involved in ME/CFS pathology without relying on gene expression data. Following this approach, there were quite a lot of genes involved in neural development and communication such as CA10, SHISA6, SOX6, LRRC7, and DCC.
if we suspect ME/CFS is a nervous system disease, and one of the nearest genes is a nervous system gene, then maybe it's worth giving it a little extra weight:
To illustrate with an extreme example: Assume we were studying a condition of fingernail disease and the nearest gene to a significant variant is a gene whose only known function is in the growth of fingernails. In that case, I think focusing on that gene as a good candidate would make sense.
I agree with this. But something I don't understand is why we are seemingly more hesitant about the immune system genes - e.g. OLFM4 or BTN2A2 likely being the genuine article, when we broadly suspect a brain immune signalling loop and these genes implicate things like interferons and T cells - things that have a high chance of being involved in such a loop.
Oh sorry, I did mischaracterize the quote I posted. I was thinking about this quote while writing that (which to be clear, seems like a somewhat reasonable idea to me):
I wonder if we should interpret the likelihood of possible genes in light of this MAGMA analysis: those that are not expressed in the brain might be less likely to be a relevant gene compared to those who are highly expressed in the brain (Figure 4 In the paper)?
But something I don't understand is why we are seemingly more hesitant about the immune system genes - e.g. OLFM4 or BTN2A2 likely being the genuine article, when we broadly suspect a brain immune signalling loop and these genes implicate things like interferons and T cells - things that have a high chance of being involved in such a loop.
I won't speak for others - maybe it would be good to focus in on those. I just like the nervous system angle based on the genetic evidence seeming to provide stronger evidence towards that (MAGMA from DecodeME and the model from Zhang 2025, to be clear).
Oh sorry, I did mischaracterize the quote I posted. I was thinking about this quote while writing that (which to be clear, seems like a somewhat reasonable idea to me):
No problem. Just wanted to clarify because without context it could perhaps be misunderstood as if I started with the neural hypothesis to look for genes that fit the hypothesis, etc.
One issue is that there are a lot of other genes nearby the DecodeME signal, making it less certain which genes are involved in ME/CFS. For genes like CA10 or DCC this is less of a problem.
Agree that NEGR1 is not one of the strongest clues.
On the other hand: none of the candidate genes has good certainty, but we can be pretty certain that some of them will be involved in ME/CFS pathology. So it's like working with 10 candidate genes knowing that perhaps 7 will be flukes that we're misinterpreting, but the other 3 are relevant. I think that justifies exploring these genes and their potential implications, even though the evidence for each of them is still quite uncertain.
Thanks @forestglip and others for that helpful discussion. For sure, there will be the 'drunk man looking for his car keys under the street lamp' sort of bias applying to this. If everyone starts looking at the nearest protein coding gene as the one affected by an enhancer, then it is a lot more likely that relationships between enhancers and their nearest gene will be found.
To flip to the other side, an argument why the NEGR1 gene could be relevant, looking at Forestglip's map:
there are a lot of logp=3 significant hits in half of NEGR1 and it's a big gene, and so, relatively speaking, quite a lot of the DecodeME participants may have one of a wide range of variants of the gene that affects the risk of ME/CFS. There are a few hits, very close by, in the promoter region too. And there are those logp=6 and 7 hits, again across quite a stretch of base pairs, in the further away region where enhancers of NEGR1 might be found - and we know that enhancers can have a very big effect on the function of the gene - potentially considerably more than some of the gene variants and promoters.
There always seems to be about 100 interesting directions to go down next, and so much to learn, and since we jump on whatever other people are talking about on here attention tends to get focused on just a few angles at a time. But personally I'm very interested in the immune system genes too. I'm sure we'll be talking about them lots more.
Good question. To restate what's at issue here: if we have a significant locus, and we have no other information, what's the likelihood that the nearest protein-coding gene (NEGR1 in this case) is the pathogenic gene implicated by the significant variant?
...
I don't know how right he is, and haven't looked at those three papers. I assumed there would be a good paper that was focused specifically on how good of a prediction "closest protein-coding gene to a GWAS hit" is, and there might be, but I was having trouble finding one that clearly laid it out, though my search wasn't very extensive.
It seems to me that any type of such evidence is going to be affected by selection bias (i.e. the nearest gene is often seen to be the true causal gene, but only because it's easiest to identify the causal gene if it's the nearest gene) and also affected by there not really being a good ground truth for well-validated connections between very many complex trait variants and which specific genes they affect.
I wonder how much of it comes from the bias that people used to look at/look for/study protein-coding genes and considered non-coding DNA "junk". I'm not saying he's wrong. I'm just wondering if and how the bias might be affecting current approaches and theories.
On whether NEGR1 is actually the closest gene:
The end of NEGR1 closest to the best hit is roughly at base pair 72,282,000
The best hit is roughly at base pair 73,200,000
The end of LRRIQ3 (another protein coding gene) is roughly at base pair 74,030,000.
So, I think LRRIQ3 is closer. It's even more so, if you take the mid point of a gene, rather than the closest end, because NEGR1 is a big gene and LRRIQ3 isn't. If we are just looking at the closest protein coding gene as the one most likely to be affected, LRRIQ3 is the one.
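The comparison can be checked with a few lines of arithmetic. This is only a sketch using the rough positions quoted in this post (nearest gene edge to the best hit), not exact annotated coordinates, and `nearest_gene` is just an illustrative helper name.

```python
from typing import Dict, Tuple


def nearest_gene(hit: int, gene_edges: Dict[str, int]) -> Tuple[str, Dict[str, int]]:
    """Return the gene whose nearest edge is closest to the hit, plus all distances."""
    distances = {gene: abs(hit - edge) for gene, edge in gene_edges.items()}
    return min(distances, key=distances.get), distances


# Rough positions from this thread (edge of each gene nearest the best hit).
BEST_HIT = 73_200_000
NEAREST_EDGES = {
    "NEGR1": 72_282_000,
    "LRRIQ3": 74_030_000,
}

closest, distances = nearest_gene(BEST_HIT, NEAREST_EDGES)
for gene, dist in distances.items():
    print(f"{gene}: {dist:,} bp from the best hit")
print("closest:", closest)
```

On these numbers NEGR1's nearest edge is about 918,000 bp from the best hit and LRRIQ3's is about 830,000 bp, so LRRIQ3 is closer by roughly 88,000 bp. The midpoint comparison would need full gene coordinates, which aren't given here.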
Can I make a plausible story for LRRIQ3 being the target gene rather than NEGR1? Yep
There's a table of GWAS hits for LRRIQ3 here.
Possibly the IQ in LRRIQ3 is for intelligence? Because the strongest specific association of LRRIQ3 reported is for mathematical ability. 'Participation in health studies' is in the top 5 associations. An association with neurodegenerative diseases is also in that top 5; an association with pain has been found.
We don't have a thread for LRRIQ3 yet.
I’m not sure if there’s a data mismatch somewhere, more info in GeneCards than GenomeBrowser, or maybe I’m missing some of these uncharacterised locations in my new scripts or just don’t understand this well enough…
Another update: as well as the window issues I mentioned above, there is a data (well, version) mismatch, with GenomeBrowser using an older version of the GeneHancer database. Hopefully my plan to get the full up-to-date data and run it locally will help, if they give it to me (you need to request it; they say it’s free for academic use, but it’s hit or miss whether we count as academics!)
But for now here’s the info I have for NEGR1 and for LRRIQ3 for comparison (both using GenomeBrowser database and a larger search window)
It’s not ideal yet, as it was only meant to be an exploratory script… so looking at the GeneCards page may be best, but LRRIQ3's regulatory elements all seem to be in the other direction. Sort by TSS distance (kb); -ve is to the left and +ve to the right on LocusZoom, AFAIU. Compare to NEGR1, which has enhancers 900kb in either direction.
The end of NEGR1 closest to the best hit is roughly at base pair 72,282,000
The best hit is roughly at base pair 73,200,000
The end of LRRIQ3 (another protein coding gene) is roughly at base pair 74,030,000.