Low-Dose Naltrexone restored TRPM3 ion channel function in Natural Killer cells from long COVID patients, 2025, Martini et al

Here's how I understand it. Assuming the null hypothesis is true, you're equally likely to get any p value between 0 and 1.

https://davidlindelof.com/how-are-p-values-distributed-under-the-null/


It doesn't actually matter what the sample size is; p values are always uniformly distributed under the null hypothesis. So there's a 1 in 10,000 chance of getting a p value of 0.9999 or higher if there's no real difference between the groups. Such a high p value isn't an indicator that the two groups are similar.
Agree.

It's an indicator that the means of the two groups are extraordinarily close considering the high variance in the groups, and such a situation should only happen by chance about 1 in 10,000 times. That, or there could have been an error in the analysis, like comparing one group to itself by accident.
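A quick way to convince yourself of the uniform-distribution point is to simulate it. A minimal Python sketch, assuming a plain two-sample t-test and made-up group sizes (nothing here comes from the paper):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sim, n = 1_000_000, 10

# Two groups drawn from the *same* distribution, so the null is true by construction
a = rng.normal(0, 1, size=(n_sim, n))
b = rng.normal(0, 1, size=(n_sim, n))
p = stats.ttest_ind(a, b, axis=1).pvalue

print((p > 0.9999).mean())                        # ~0.0001, i.e. about 1 in 10,000
print(np.histogram(p, bins=10, range=(0, 1))[0])  # roughly equal counts: a flat histogram
```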

To be honest, I don't know (not judging your statement in any way) if that conclusion can be drawn directly from this test.
Usually the null hypothesis being true isn't talked about in frequentist statistics, since that's a Bayesian idea and things get "funky".

This covers the same topic:
https://stats.stackexchange.com/que...e-for-or-interpretation-of-very-high-p-values
 
I don't know much about Bayesian statistics, and I don't see any new insights on that page (though I don't totally understand all the probability terminology). But one comment has a similar view:
About usefulness of very large p-values, I've got p-values near 1 in t-tests that failed to meet the normality of means assumption. In general, a p-value=1 should be seen at least as a warning that something might be wrong.
 
For what it's worth (as non-simulated further validation of your earlier analysis), when I do differential gene expression analyses where I'm running ~10K tests, a portion of those will always come up as p>0.99 (and that is with a test that does not assume normality). Spot checking my most recent analysis, it was around 180 out of 13000 comparisons, so roughly 1%.

Which just speaks to your earlier point of equal likelihood of any p-value under the null hypothesis. But my understanding was that for [edit: any one specific] test, it will never tell you anything other than whether you can reject the null hypothesis. The logic of the test is not reciprocal in that way.

I've also gotten a 0.999 p-value when I was just doing a single comparison and it seemed unlikely that any of the assumptions were violated. I think it is sometimes just a luck of the draw thing, [Edit: though >0.9999 being reported twice in the results seems to indicate it's not just luck of the draw unless we're all witnessing a once-in-a-lifetime event. I agree it's most likely an assumptions thing]
 
But my understanding was that for a specific test, it will never tell you anything other than whether you can reject the null hypothesis. The logic of the test is not reciprocal in that way.
That's my understanding as well.

I've also gotten a 0.99 p-value when I was just doing a single comparison and it seemed unlikely that any of the assumptions were violated. I think it is sometimes just a luck of the draw thing, though I agree that assumptions should be checked anyways.
Yeah, definitely not impossible that they're the lucky 1 in 10,000. (Though technically even less of a chance than that since it's ">.9999" which could be any number between that and 1.)

Out of curiosity I searched "p>.9999" and there are plenty of papers, though I suppose with millions of papers that have been written, that's to be expected.
 
Sorry, I think I added my edit right after you quoted me. I had just realized that they reported >0.9999 twice, which makes this extremely unlikely to be a luck of the draw thing.
 
I was thinking about that. I agree that makes it even less likely, but I guess there is still the possibility that the results from the two tests are extremely correlated with each other, in which case the p values should also be similar. I don't know anything about these tests, though. But I'm guessing the correlation would have to be very, very high for this to work out, and it'd probably make sense to look for other explanations.
 
1/10,000 * 1/10,000 = 1/100,000,000

They ran more than two comparisons, though, so the chance that at least two of the p-values land above 0.9999 (i.e. above 1 − 1/10,000) purely by luck is somewhat higher than that.
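To put rough numbers on it, here's a small scipy sketch, treating the comparisons as independent (which they may well not be):

```python
from scipy import stats

q = 1 / 10_000   # chance of a single p-value landing above 0.9999 under the null

print(q ** 2)    # 1e-08: two specific tests both doing it, i.e. 1 in 100,000,000

# If k comparisons were run in total, the chance that at least two of them
# come out above 0.9999 purely by luck (binomial, assuming independence):
for k in (2, 5, 10, 20):
    print(k, 1 - stats.binom.cdf(1, k, q))
```

Even with 20 comparisons it stays on the order of one in a million, so luck alone remains a poor explanation.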
 
But my understanding was that for [edit: any one specific] test, it will never tell you anything other than whether you can reject the null hypothesis. The logic of the test is not reciprocal in that way.
I've been going down a p value rabbit hole the past couple days because it annoys me when something that seems like it should be intuitive isn't. This page explaining p values is excellent if you're interested.

But anyway, specifically regarding your quote, which earlier I agreed with, here's a relevant quote from that page:
In the context in which a low p-value is evidence against the null hypothesis (that is, when the statistical power of the test is held constant), having a high p-value is indeed evidence in favor of the null hypothesis, because a high p-value is more likely to occur if the null hypothesis is true than if it is false. It's not necessarily very strong evidence, but the law of conservation of expected evidence requires it to be nonzero. If you walk in the woods and see no humans, that is weak evidence towards there being no humans on the planet, and the more of the planet you explore while still seeing no humans, the stronger and stronger the evidence becomes.
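That "more likely to occur if the null hypothesis is true than if it is false" part is easy to check by simulation. A rough sketch (the 0.5 SD effect size and n = 20 per group are arbitrary choices, just to have some alternative to compare against):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sim, n = 200_000, 20

def sim_pvalues(true_diff):
    a = rng.normal(0, 1, size=(n_sim, n))
    b = rng.normal(true_diff, 1, size=(n_sim, n))
    return stats.ttest_ind(a, b, axis=1).pvalue

p_null = sim_pvalues(0.0)   # no real difference
p_alt = sim_pvalues(0.5)    # a modest real difference (0.5 SD)

# High p-values are more common when the null is true than when it's false,
# so seeing one shifts the odds (weakly) towards the null:
print((p_null > 0.9).mean())   # ~0.10
print((p_alt > 0.9).mean())    # noticeably smaller
```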
 
Thanks for the link! That's interesting; I suppose that makes sense now that I think about it. My intuition still makes me cautious about whether it's valid to make inferences about anything other than rejecting the null hypothesis. I'll have to sit with that a bit more.
 
My intuition still makes me cautious about whether it’s valid to make inferences about anything other than rejecting the null hypothesis.
Oh yeah, I don't take it as much more than an interesting fact that if p=.99, the null hypothesis is at least slightly more likely to be correct than if p=.50. I think you'd probably have to do much more math to quantify whether that's to a degree that's useful for any given test.

Edit: Though I'm not totally sure what I said is true. I didn't dig much deeper into high p values, just thought the quoted part might be interesting.
 
Side note @forestglip since you mentioned you were unfamiliar with Bayesian statistics:

our thought process about the two >0.9999 p values is exactly the intuition behind Bayes’ theorem.

Given that we’re seeing two >0.9999 p-values in a research paper (data), and knowing how likely it is to get p > 0.9999 to begin with (prior probability), is it more likely that what we’re seeing is a result of 1) a random happenstance or 2) an error in the statistical analysis (posterior probability)?

Apologies if I’m explaining something you already know, I just thought it was a neat, very intuitive example so it would be worthwhile pointing out
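For what it's worth, here's that intuition written out as a toy Bayes' theorem calculation. The prior and the likelihood under "error" are completely made-up numbers, only there to show the mechanics:

```python
# Toy Bayes' theorem calculation: is "analysis error" or "pure chance" the better
# explanation for two p-values > 0.9999? All inputs here are invented for illustration.

p_error = 0.01                    # assumed prior: 1% of analyses contain this kind of error
p_chance = 1 - p_error
lik_error = 0.5                   # assumed: given an error (e.g. a group compared to itself),
                                  # p-values near 1 are quite likely
lik_chance = (1 / 10_000) ** 2    # two independent p-values both > 0.9999 under the null

posterior_error = (lik_error * p_error) / (lik_error * p_error + lik_chance * p_chance)
print(posterior_error)            # ~0.999998: the error explanation dominates
```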
 
Thanks, basically all I know is that Bayes' incorporates prior knowledge about how likely something is to occur from before even running the experiment. I hear the word often and it seems interesting. So many things I want to learn about and too little energy and time but it's in the queue!
 
But anyway, specifically regarding your quote, which earlier I agreed with, here's a relevant quote from that page:

"having a high p-value is indeed evidence in favor of the null hypothesis, because a high p-value is more likely to occur if the null hypothesis is true than if it is false."

I have a problem with that statement (if the "evidence" is supposed to be meaningful evidence).

This is from a well known consensus paper:

6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

Researchers should recognize that a p-value without context or other evidence provides limited information. For example, a p-value near 0.05 taken by itself offers only weak evidence against the null hypothesis. Likewise, a relatively large p-value does not imply evidence in favor of the null hypothesis; many other hypotheses may be equally or more consistent with the observed data.

https://www.tandfonline.com/doi/epdf/10.1080/00031305.2016.1154108?needAccess=true
 
That does make sense, thanks for that paper.
It really is a rabbit hole.

The best way, if one wants a deep understanding, is imo literally to "do the math" from the beginning and forget the intuition.
In short: write it down mathematically and deduce what you want via known theorems.
Unfortunately I can't do that kind of deep thinking anymore due to symptoms (hello 24/7 severe headache).
And hypothesis testing of the kind medicine needs isn't common in my domain (physics).
 
Now published:

Low-Dose naltrexone restored TRPM3 ion channel function in natural killer cells from long COVID patients

Etianne Martini Sasso, Natalie Eaton-Fitch, Peter Smith, Katsuhiko Muraki, Sonya Marshall-Gradisnik

Introduction
Long COVID is a multisystemic condition that includes neurocognitive, immunological, gastrointestinal, and cardiovascular manifestations, independent of the severity or duration of the acute SARS-CoV-2 infection. Dysfunctional Transient Receptor Potential Melastatin 3 (TRPM3) ion channels are associated with the pathophysiology of long COVID due to reduced calcium (Ca2+) influx, negatively impacting cellular processes in diverse systems. Accumulating evidence suggests the potential therapeutic benefits of low-dose naltrexone (LDN) for people suffering from long COVID. Our study aimed to investigate the efficacy of LDN in restoring TRPM3 ion channel function in natural killer (NK) cells from long COVID patients.

Methods
NK cells were isolated from nine individuals with long COVID, nine healthy controls, and nine individuals with long COVID who were administered LDN (3–4.5 mg/day). Electrophysiological experiments were conducted to assess TRPM3 ion channel functions modulated by pregnenolone sulfate (PregS) and ononetin.

Results
The findings from this current research are the first to demonstrate that long COVID patients treated with LDN have restored TRPM3 ion channel function and validate previous reports of TRPM3 ion channel dysfunction in NK cells from individuals with long COVID not on treatment. There was no significant difference in TRPM3 currents between long COVID patients treated with LDN and healthy controls (HC), in either PregS-induced current amplitude (p > 0.9999) or resistance to ononetin (p > 0.9999).

Discussion
Overall, our findings support LDN as a potentially beneficial treatment for long COVID patients by restoring TRPM3 ion channel function and reestablishing adequate Ca2+ influx necessary for homeostatic cellular processes.

Link | PDF (Front. Mol. Biosci.) [Open Access]
 
So I think the p>.9999 (which actually shows up 3 times in the full text) is a result of multiple test adjustment.

Paper said:
Statistical comparisons between groups for noncategorical variables (agonist and antagonist amplitudes) were conducted using the independent nonparametric Kruskal–Wallis test (Dunn’s multiple comparisons). Categorical variables (sensitivity to ononetin) were analyzed using Fisher’s exact test (Bonferroni method).

Bonferroni adjustment can be done by dividing the p value threshold at which you would call a result significant by the number of tests, but as described here, you can also keep the p=.05 threshold and instead multiply the calculated p value by the number of tests. If the starting p value multiplied by the number of tests is greater than 1 (e.g. p=.45 and three tests were done), the software would return p=1.0000. The authors might have changed it to p>.9999.

https://www.ibm.com/support/pages/calculation-bonferroni-adjusted-p-values
Statistical textbooks often present Bonferroni adjustment (or correction) in the following terms. First, divide the desired alpha-level by the number of comparisons. Second, use the number so calculated as the p-value for determining significance. So, for example, with alpha set at .05, and three comparisons, the LSD p-value required for significance would be .05/3 = .0167.

SPSS and some other major packages employ a mathematically equivalent adjustment. Here's how it works. Take the observed (uncorrected) p-value and multiply it by the number of comparisons made. What does this mean in the context of the previous example, in which alpha was set at .05 and there were three pairwise comparisons? It's very simple. Suppose the LSD p-value for a pairwise comparison is .016. This is an unadjusted p-value. To obtain the corrected p-value, we simply multiply the uncorrected p-value of .016 by 3, which equals .048. Since this value is less than .05, we would conclude that the difference was significant.

Finally, it's important to understand what happens when the product of the LSD p-value and the number of comparisons exceeds 1. In such cases, the Bonferroni-corrected p-value reported by SPSS will be 1.000. The reason for this is that probabilities cannot exceed 1. With respect to the previous example, this means that if an LSD p-value for one of the contrasts were .500, the Bonferroni-adjusted p-value reported would be 1.000 and not 1.500, which is the product of .5 multiplied by 3

And a page describing the other multiple comparison method they used, Dunn's, which uses the same adjustment:
Multiply the uncorrected P value computed in step 2 by K. If this product is less than 1.0, it is the multiplicity adjusted P value. If the product is greater than 1.0 the multiplicity adjusted P value is reported as > 0.9999
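A minimal sketch of that multiply-and-cap behaviour (plain numpy, mimicking the reporting convention described in the quotes above rather than any particular package's internals; the raw p values are made up):

```python
import numpy as np

def bonferroni_adjust(pvals):
    """Multiply each raw p-value by the number of comparisons, capping at 1."""
    pvals = np.asarray(pvals, dtype=float)
    return np.minimum(pvals * len(pvals), 1.0)

raw = [0.016, 0.45, 0.80]        # three pairwise comparisons (made-up values)
for p_adj in bonferroni_adjust(raw):
    # GraphPad-style display: anything that caps at 1 is shown as "> 0.9999"
    print("> 0.9999" if p_adj >= 1.0 else f"{p_adj:.4f}")
# 0.0480
# > 0.9999
# > 0.9999
```

(statsmodels' multipletests with method='bonferroni' should give the same capped values, if you'd rather not roll your own.)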
 
I think they may have used non-independent samples, which artificially decreased the p-values, though. There were 9 people per group, but they used multiple cells per person, which are expected to be correlated with each other. Technically, you could get a p value as low as you want by using a very high number of cells from each person. [Edit: More accurately: greatly increase the chances of a low p value.]
In the electrophysiological experiments, we included nine participants in each group and analyzed recordings from 61, 65, and 63 independent cells for PregS effects from long COVID, HC, and long COVID receiving LDN groups, respectively. In addition, to assess ononetin effects in the presence of PregS, 52 independent recordings from NK cells in the long COVID group, 53 in NK cells from HC, and 53 recordings from NK cells in the long COVID group receiving LDN.

I don't see any indication in the paper's description of methods that they accounted for correlated samples.

See explanation of the issue of pseudoreplication here: Pseudoreplication in physiology: More means less (Eisner, 2021, Journal of General Physiology)

Edit: This is regarding the p values they reported that are very low:
As reported in earlier studies, we confirmed a reduction in ononetin amplitude (p = 0.0021) and the number of cells sensitive to ononetin (p < 0.0001) when compared to the HC and long COVID group. In contrast, NK cells from the long COVID group receiving LDN had a significant elevation in amplitude (p = 0.0005) and sensitivity (p < 0.0001) to ononetin compared with the long COVID group.
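Here's a toy simulation of the pseudoreplication point above: 9 participants per group with 7 cells each, between-participant spread equal to within-participant spread, and no true group difference anywhere. The numbers and the plain Mann–Whitney test are illustrative choices, not the paper's setup:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sim, n_subj, cells_per_subj = 2_000, 9, 7

def sample_group():
    # Cells from the same participant share that participant's own baseline,
    # so they are correlated with each other. No true group effect anywhere.
    subj_means = rng.normal(0, 1, n_subj)
    return subj_means[:, None] + rng.normal(0, 1, (n_subj, cells_per_subj))

fp_cells = fp_subjects = 0
for _ in range(n_sim):
    a, b = sample_group(), sample_group()
    # Pseudoreplicated: every cell treated as an independent observation (n = 63 per group)
    fp_cells += stats.mannwhitneyu(a.ravel(), b.ravel()).pvalue < 0.05
    # One summary value per participant (n = 9 per group)
    fp_subjects += stats.mannwhitneyu(a.mean(axis=1), b.mean(axis=1)).pvalue < 0.05

print(fp_cells / n_sim)      # false-positive rate well above the nominal 0.05
print(fp_subjects / n_sim)   # roughly 0.05
```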
 
While we are talking about statistical intuitions, I find adjustment for multiple comparisons to feel very weird.

You take a bunch of p-values and just multiply them by a big number. It's effective at making lots of possibly significant results go away. A lot of bathwater gets thrown out, and an unknown number of babies too.

If you called your experiments different experiments and published them in separate papers, nobody would demand adjustment for multiple comparisons. If they are done in one batch, it's expected.

This notable plot from Hanson is generated by ~~Bonferroni~~ (edit: not Bonferroni, actually another kind of multiple comparison) adjustment. The distribution of the fold change of a lot of the metabolites on the x-axis is similar, but the q-value on the y-axis of the right-side plot is made zero by the ~~Bonferroni~~ adjustment. (I was so flabbergasted by these two plots that I taught myself multiple comparison adjustment, downloaded the data and reran the analysis; the plots are accurate. But I do find myself wondering what meaning is left after adjustment.)

[Attached image: the two plots, metabolite fold change on the x-axis against q-value on the y-axis, before and after adjustment]
 
I'm not sure if there's an intuitive meaning for the actual value of p values after Bonferroni adjustment.

What Bonferroni does is maintain the same rate of false positives. When you do a single test, there's a 5% chance of getting a p-value below 0.05 when the null is actually true. So in 100 studies that are each testing something [edit: that is not a real effect], you'd expect around 5 studies to be false positives.

But if you do many tests at once in a study, you're increasing the number of false positives in that one study. For example, if you test 100 things [edit: that in reality have no effect] at once in a study, there's no longer only a 5% chance of reporting a false positive. Now you're almost guaranteed to report [edit: at least one] positive result. Bonferroni adjusts the p value threshold or the p value itself so that there is still only a 5% chance of reporting one or more false positives.

It does make it harder for real positive results to cross the significance threshold, but it's a tradeoff to not report lots of false positive results as well.
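That trade-off is easy to see in a quick simulation (100 tests per "study", all of them true nulls; the counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
n_sim, n_tests = 20_000, 100

# Under the null, p-values are uniform, so we can just draw them directly
p = rng.uniform(0, 1, size=(n_sim, n_tests))

print((p < 0.05).any(axis=1).mean())             # ~0.99: almost every study reports something
print((p < 0.05 / n_tests).any(axis=1).mean())   # ~0.05: Bonferroni keeps the familywise rate at 5%
```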
 