Genetics: Chromosome 17 CA10

That's an amazing match with the 'Ease of getting up in the morning' gene. Very impressive investigation FG.


I'm not understanding why the x axis in the neck shoulder pain chart is different. Can you help me understand?
It's using an older assembly/coordinate system called GRCh37. The more recent one that DecodeME uses is called GRCh38. Defining where exactly a SNP is on a chromosome isn't an exact science, so as they learn more, they make updates to the positions.

"Liftover" is converting the coordinates to a different assembly. I used liftover for the "Getting up in the morning" GWAS to make it match DecodeME.

On gnomAD, if you look up a SNP, it'll tell you where that SNP is in the other assembly. For example, here's the page for the lead SNP of the neck and shoulder pain GWAS in GRCh37 coordinates, as reported in Table 2: https://gnomad.broadinstitute.org/variant/17-50259142-A-C?dataset=gnomad_r2_1

Partway down the page, it says:
Liftover
This variant lifts over to the following GRCh38 variant:
 
The trait most significantly associated with this SNP is "Ease of getting up in the morning", which would make sense as being related to ME/CFS.
Thank you for finding/sharing this. These results are so fascinating. I hope I get less foggy soon so I can read more of the details.

One question for you forestglip (and maybe @jnmaciuch if you're not too busy) -- do we have any idea how much of the similarity in 'shape' of the DecodeME stats and the ease-of-getting-up stats is due to variants at all those elevated positions being inherited together? i.e. would we expect that all the dots I circled in orange tend to be inherited together? (In that case I guess any condition where one of the dots is elevated you'd expect to see them all elevated?)

Untitled.jpg
(Crop from post #57)

(Maybe this question translates to 'is this what linkage disequilibrium looks like in summary statistics?')
 
(Maybe this question translates to 'is this what linkage disequilibrium looks like in summary statistics?')

I think that is probably right at first pass.
It intrigued me that if you have a batch of variants tightly linked - so that they produce a flat line of linkages - you cannot simply ask 'did SNP 45 or SNP 52 cause the risk' because 'cause' in this situation is a complicated concept dependent on what options are biologically available. If these variants always go hand in hand, even if you can in theory trace a transcription factor binding being critical at a particular point it may not be legitimate to attribute cause to any one variant (which might be a variant not in your SNP library but closely linked to those in the line I guess).

The others still have a clearer understanding of this than I do so let's see what they say.
 
(Maybe this question translates to 'is this what linkage disequilibrium looks like in summary statistics?')
Yes, this is showing linkage disequilibrium. The following plot actually shows the strength of LD between each of the variants in the plot with the lead variant (purple diamond).
1776877146232.png
The variants in red have very strong LD with the lead variant, so would be expected to show up very often in people that have the lead variant. Hence, they are just about as significant as the lead variant. The yellow variants have less LD with the lead variant, so the significance might be further off, as we see here.
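To make the colour scale a bit more concrete, here is a minimal sketch of how an LD value (r²) between a lead variant and a neighbouring variant could be estimated from genotype dosages (0, 1 or 2 copies of the allele per person). The genotype vectors are made-up toy data, not taken from DecodeME or any real panel, and real tools usually work from phased haplotypes in a reference panel rather than this simple genotype correlation.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return (cov * cov) / (var_x * var_y)

# Toy dosages for ten hypothetical people (assumed values, for illustration only).
lead = [0, 1, 2, 1, 0, 2, 1, 0, 1, 2]   # lead variant
tight = [0, 1, 2, 1, 0, 2, 1, 0, 1, 2]  # always inherited with the lead variant
loose = [0, 1, 1, 1, 0, 2, 0, 0, 2, 1]  # only partly tracks the lead variant

print("tight r2:", round(r_squared(lead, tight), 3))
print("loose r2:", round(r_squared(lead, loose), 3))
```

A variant like `tight` would be drawn in red and would carry almost the same association signal as the lead variant, while one like `loose` would sit in a cooler colour with a weaker signal.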

(In that case I guess any condition where one of the dots is elevated you'd expect to see them all elevated?)
Yes, if the causal variants in the two studies were two different high LD "red" variants, the plots would probably look pretty similar. In theory, coloc helps to mathematically determine the probability that there is a shared variant based on the overall pattern, which may subtly change even if the other study's causal variant is a high LD "red" variant.

With a set of variants that have near perfect LD with each other (LD~1), I would think it's probably very difficult to determine which one is causal.

I guess it may be wise to leave open the possibility that the "getting up in the morning" study has a different causal variant that is just in high LD with ME/CFS's causal variant.
 
For the PheWAS analysis I did a few posts ago (looking up which other traits have significant associations at a variant), I used GWAS Atlas. They have 4,756 GWAS datasets, and they have not added any datasets since 2019.

I found out that IEU Open GWAS also has PheWAS functionality, and they have over 10 times as many GWAS datasets (50,069), so it seemed worth looking for associations there too.

Unfortunately, the Open GWAS website doesn't have a simple PheWAS lookup using the web browser like GWAS Atlas, and instead it requires using their API. But they provide Python and R packages for accessing the API, and at least the Python package is fairly straightforward. I haven't tried the R package.



Keep in mind that the variant being significant in multiple traits doesn't necessarily indicate a shared causal variant between ME/CFS and each of these traits. Though it at least shows that a causal variant is likely in the same general region near CA10 in the two traits.

The most significant in this case is a reaction time trait. The link in the table doesn't take you directly to the BioBank details for the trait, but here are the details for that first trait, including a little video of what the cognitive test is: https://biobank.ndph.ox.ac.uk/ukb/field.cgi?id=404

Basically, it's a "game" where two cards are shown on a screen, and the participant presses a button as quickly as possible if the two cards match. This trait is how long it takes to press the button (the duration is counted even if the participant got it wrong).

The direction of effect is as expected: the T allele is associated with increased risk of ME/CFS, and it is also associated with longer duration to press the button.

Here are all associations with p<1e-6, sorted starting from most significant:
| id | trait | chr | position (GRCh37) | rsid | ea | nea | eaf | beta | se | p | n |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ukb-b-19373 | Duration to first press of snap-button in each round | 17 | 50260366 | rs34626694 | T | C | 0.328517 | 0.0151902 | 0.00217161 | 2.70023e-12 | 459281 |
| ukb-b-16287 | Mean time to correctly identify matches | 17 | 50260366 | rs34626694 | T | C | 0.328492 | 0.0146137 | 0.00218135 | 2.09991e-11 | 459523 |
| ukb-b-6306 | Overall health rating | 17 | 50260366 | rs34626694 | T | C | 0.328511 | 0.0103603 | 0.00159992 | 9.3994e-11 | 460844 |
| ukb-b-2772 | Getting up in morning | 17 | 50260366 | rs34626694 | T | C | 0.32854 | -0.0109886 | 0.00170231 | 1.09999e-10 | 461658 |
| ukb-b-9130 | Pain type(s) experienced in last month: None of the above | 17 | 50260366 | rs34626694 | T | C | 0.328525 | -0.00636032 | 0.00108326 | 4.30002e-09 | 461857 |
| ukb-a-10 | Getting up in morning | 17 | 50260366 | rs34626694 | T | C | 0.330974 | -0.0112884 | 0.00198473 | 1.28911e-08 | 336501 |
| ukb-a-251 | Overall health rating | 17 | 50260366 | rs34626694 | T | C | 0.330974 | 0.0102797 | 0.0018765 | 4.30229e-08 | 336020 |
| ukb-b-929 | Frequency of tiredness / lethargy in last 2 weeks | 17 | 50260366 | rs34626694 | T | C | 0.32834 | 0.0101412 | 0.00186099 | 5.1e-08 | 449019 |
| ukb-b-18335 | Wheeze or whistling in the chest in last year | 17 | 50260366 | rs34626694 | T | C | 0.328588 | 0.00487348 | 0.000904747 | 7.19996e-08 | 453959 |
| ukb-b-8746 | Illnesses of siblings: High blood pressure | 17 | 50260366 | rs34626694 | T | C | 0.32859 | 0.00542589 | 0.00101614 | 9.29994e-08 | 364661 |
| ukb-a-199 | Mean time to correctly identify matches | 17 | 50260366 | rs34626694 | T | C | 0.330974 | 0.0134514 | 0.00255376 | 1.38535e-07 | 335139 |
| ukb-b-17595 | Medication for pain relief, constipation, heartburn: Paracetamol | 17 | 50260366 | rs34626694 | T | C | 0.328477 | 0.00467076 | 0.000916774 | 3.50002e-07 | 457547 |
| ebi-a-GCST90029014 | Smoking status | 17 | 50260366 | rs34626694 | T | C | 0.328868 | 0.00734503 | 0.00146598 | 3.79997e-07 | 468170 |
| ukb-d-20116_0 | Smoking status: Never | 17 | 50260366 | rs34626694 | T | C | 0.331077 | -0.00626354 | 0.00123574 | 4.00876e-07 | 359706 |
| ukb-b-18596 | Pain type(s) experienced in last month: Neck or shoulder pain | 17 | 50260366 | rs34626694 | T | C | 0.328525 | 0.00470001 | 0.000933439 | 4.79999e-07 | 461857 |
| ebi-a-GCST90012794 | Participation in an health questionnaire (not invited vs invited) | 17 | 50260366 | rs34626694 | T | C | 0.330099 | 0.0047584 | 0.000952877 | 6.1e-07 | 451097 |
| ukb-a-472 | Pain type(s) experienced in last month: Neck or shoulder pain | 17 | 50260366 | rs34626694 | T | C | 0.330974 | 0.00538691 | 0.00108634 | 7.09741e-07 | 336650 |
| ukb-b-4063 | Number of self-reported non-cancer illnesses | 17 | 50260366 | rs34626694 | T | C | 0.328506 | 0.00897475 | 0.00182012 | 8.19993e-07 | 462933 |
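As a rough sanity check on how the beta, se and p columns relate, here is a small sketch that recomputes a z-score (beta / se) and a two-sided normal p-value for two rows of the table. It assumes a simple Wald test, which may not be exactly how the original studies computed their p-values, so expect approximate rather than exact agreement. The sign of z also gives the direction of effect for the T allele.

```python
import math

def z_and_p(beta, se):
    """Wald z-score and two-sided p-value under a standard normal."""
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# (beta, se) copied from the table above for two of the traits.
rows = {
    "ukb-b-19373 (snap-button duration)": (0.0151902, 0.00217161),
    "ukb-b-2772 (getting up in morning)": (-0.0109886, 0.00170231),
}

for name, (beta, se) in rows.items():
    z, p = z_and_p(beta, se)
    print(f"{name}: z = {z:.2f}, p = {p:.2e}")
```

The recomputed values should land close to the reported p-values, with a positive z for the reaction time trait and a negative z for ease of getting up.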

I also did the PheWAS with the other 7 DecodeME variants, and will post the results on their respective threads. [Edit: Not all of them had other significant traits.]

Edit: Note, if you want to look up the details page of a UK BioBank trait (ID starts with "ukb"):
- Go to the Open GWAS page for the trait linked in the table
- In the "note" row of the table, there's a number. For example, for the most significant trait above, it says "404" in the "note" field.
- Replace the number at the end of the following URL with the number for the trait: https://biobank.ndph.ox.ac.uk/ukb/field.cgi?id=404
 
Here are all associations with p<1e-6, sorted starting from most significant:
Thanks for doing this.

One issue I see is that we don't have a standardized measure of effect size to filter these. I suspect that in a lot of the things that come up like smoking, overall health and cognitive tests, the DNA has only very minor effects and that they only come up because these traits could be tested with enormous sample sizes. So a large proportion of all possible DNA variants and genes are likely associated with them.
 
One issue I see is that we don't have a standardized measure of effect size to filter these. I suspect that in a lot of the things that come up like smoking, overall health and cognitive tests, the DNA has only very minor effects and that they only come up because these traits could be tested with enormous sample sizes. So a large proportion of all possible DNA variants and genes are likely associated with them.
Yes good point. We can at least see the total sample size for studies in the table above, and they're all around 350,000 to 450,000 for this group of datasets. Though unbalanced group sizes in binary traits, for example, could make a study's effective sample size smaller, making the findings less directly comparable to each other.

(It's also possible the betas might all be on a standardized scale, making them more easily comparable as effect sizes, since they're all around the same order of magnitude, though I'm not sure.)

One thing is that the Manhattan plots can be viewed for some of these traits to see how significant a given locus is compared to others. For example, for "Duration to first press of snap-button in each round", there's a link on the side to a plot on the Genotype-Phenotype Map website. I copied the plot and made an arrow pointing to the area near the CA10 locus (labeled as 17:52176967 A/G):
1776950040040.png

Yes, there are a lot of genome-wide significant loci thanks to the large sample size. But it looks like this CA10 locus is in maybe the top 10 or 20 most significant, so not nothing, I think.

Edit: Converted image to thumbnail because the huge text for the title was a bit overwhelming.
 
Open GWAS has another tool called the Genotype-Phenotype Map where it can do actual colocalization testing against many GWAS datasets using uploaded summary stats.

I uploaded the DecodeME GWAS-1 summary stats. The results are on the following page, indicating which traits colocalize (share a causal variant) with ME/CFS at various DecodeME loci: https://gpmap.opengwas.io/trait.html?id=a3069ff7-f52c-0c62-fde3-c38e99f0cd91

For example, here are the traits that appear to colocalize with ME/CFS at the CA10 locus (note the ME/CFS row is for the uploaded data):
| Trait | Data type | Gene | Tissue | Cis/Trans | P-value |
|---|---|---|---|---|---|
| ME/CFS | | | | | 2.11e-9 |
| CA10 Whole blood cg04881814 methQTL | Methylation | CA10 | Whole blood | cis | 0.00e+0 |
| CA10 Whole blood cg07398767 methQTL | Methylation | CA10 | Whole blood | cis | 4.19e-14 |
| CA10 Whole blood cg08605326 methQTL | Methylation | CA10 | Whole blood | cis | 4.10e-19 |
| CA10 Whole blood cg20552747 methQTL | Methylation | CA10 | Whole blood | cis | 1.87e-15 |
| Gastroesophageal reflux disease | Phenotype (Disease Of Digestive System) | | | | 1.69e-9 |
| Participation in an health questionnaire (not invited vs invited) | Phenotype (Physiological Measures) | | | | 1.20e-7 |
| Mean time to correctly identify matches | Phenotype (Behavioural Measures) | | | | 2.10e-11 |
| Illnesses of siblings: None of the above (group 1) | Phenotype (Environmental Measures) | | | | 1.20e-7 |
| Medication for pain relief, constipation, heartburn: Paracetamol | Phenotype (Physiological Measures) | | | | 4.60e-8 |
| Wheeze or whistling in the chest in last year | Phenotype (Physiological Measures) | | | | 7.20e-8 |
| Pain type(s) experienced in last month: Neck or shoulder pain | Phenotype (Disease Of Musculoskeletal System And Connective Tissue) | | | | 4.80e-12 |
| Getting up in morning | Phenotype (Behavioural Measures) | | | | 8.10e-11 |
| Time spent watching television (TV) | Phenotype (Behavioural Measures) | | | | 5.50e-8 |
| Overall health rating | Phenotype (Physiological Measures) | | | | 2.60e-12 |
| Illnesses of siblings: High blood pressure | Phenotype (Disease Of Circulatory System) | | | | 3.00e-8 |

Info column for this locus says:
Candidate Variant: 17:52176967 A/G
LD Region: EUR/17/49918934-53040813

Note: They ask for sample size, and I gave a sample size of 58792, which is the effective sample size of DecodeME calculated using a common formula for unbalanced studies. For example, described on this page:
Note: Often the effective sample size is defined as 4nϕ(1−ϕ), because that quantity tells what would be the total sample size (cases + controls) in a hypothetical study that has equal number of cases and controls and whose power matches the power of our current study.
The formula is Neff = 4 * [total sample size] * [proportion of cases] * [proportion of controls]. For DecodeME, that's Neff = 4 * 275488 * 0.056550558 * 0.943449442 ≈ 58792

For illustration, a study with 10 cases and 1,000,000 controls doesn't have anywhere near the statistical power of a study with 500,000 cases and 500,000 controls, so n=1,000,010 wouldn't be useful for the first study. The effective sample size calculated with the above formula in this hypothetical study would be about 40.
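For anyone who wants to check the arithmetic, here is a minimal sketch of the 4nϕ(1−ϕ) calculation, using the DecodeME total and case proportion quoted above and the hypothetical 10 cases vs 1,000,000 controls study. The function name is just something I made up for illustration.

```python
def effective_sample_size(total_n, case_fraction):
    """Neff = 4 * n * phi * (1 - phi) for a case-control study."""
    return 4.0 * total_n * case_fraction * (1.0 - case_fraction)

# DecodeME figures as given in the post: total sample size and proportion of cases.
decodeme_neff = effective_sample_size(275488, 0.056550558)

# Hypothetical unbalanced study: 10 cases and 1,000,000 controls.
hypothetical_total = 10 + 1_000_000
hypothetical_neff = effective_sample_size(hypothetical_total, 10 / hypothetical_total)

print(round(decodeme_neff))      # close to 58792
print(round(hypothetical_neff))  # close to 40
```

A perfectly balanced study has a case fraction of 0.5, in which case the formula simply returns the total sample size.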
The formula for effective sample size is sometimes written in a different but mathematically equivalent form, such as on the following page:

Neff = 4 / (1/[num cases] + 1/[num controls])
 