Cardiopulmonary and metabolic responses during a 2-day CPET in [ME/CFS]: translating reduced oxygen consumption [...], Keller et al, 2024

Discussion in 'ME/CFS research' started by Nightsong, Jul 5, 2024.

  1. forestglip

    forestglip Senior Member (Voting Rights)

    Messages:
    721
    I'm getting the same Cohen's d value with and without the outliers, but different t-test and Mann-Whitney values.

    I highlighted the results for workload absolute difference at AT because it matches the t-test and Cohen's d values from your first calculations.

    With outliers included:
    Screenshot from 2024-09-11 08-38-53.png

    Outliers removed:
    Screenshot from 2024-09-11 08-38-45.png

    That's with Jamovi. Same results with Python on the full dataset:
    Code:
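    from scipy.stats import mannwhitneyu, ttest_ind

    # df is the full dataset; cohend is a helper defined elsewhere (not shown here)
    mecfs_df = df[df['phenotype'] == 'MECFS']
    hc_df = df[df['phenotype'] == 'HC']

    # Effect size (Cohen's d)
    es = cohend(mecfs_df['AT_wkld_diff_percentage'], hc_df['AT_wkld_diff_percentage'])

    # Mann-Whitney U test
    U1, mw_p_val = mannwhitneyu(mecfs_df['AT_wkld_diff_percentage'], hc_df['AT_wkld_diff_percentage'])

    # Welch's t-test (unequal variances)
    t_stat, tt_p_val = ttest_ind(mecfs_df['AT_wkld_diff_percentage'], hc_df['AT_wkld_diff_percentage'], equal_var=False)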
    With PI-026 removed (but outliers included), the Mann-Whitney p-value goes from 0.048 to 0.044: (note the second metric is different from above)
    upload_2024-9-11_8-44-58.png
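
    The `cohend` helper called in the code above is not shown in the post. A minimal sketch of one common definition, assuming the pooled-standard-deviation form of Cohen's d was meant (the function body is an assumption, not the poster's code):

```python
import math
import statistics


def cohend(a, b):
    """Cohen's d using the pooled standard deviation (assumed definition)."""
    a, b = list(a), list(b)
    n1, n2 = len(a), len(b)
    # Sample variances (n - 1 in the denominator)
    var1, var2 = statistics.variance(a), statistics.variance(b)
    pooled_sd = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd
```

    The sign depends on the order of the arguments: swapping the two groups flips it, so a sign difference between two people's results may only reflect which group was passed first.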
     
    Last edited: Sep 11, 2024
    Kitty likes this.
  2. ME/CFS Skeptic

    ME/CFS Skeptic Senior Member (Voting Rights)

    Messages:
    3,960
    Location:
    Belgium
    Apologies for the wrong p-values, not sure what went wrong there.

    Here's what I got for values at AT and with PI-026 excluded. The first row looks very similar to your results for Work at AT (although for some reason, some figures are a bit different).

    upload_2024-9-11_15-51-58.png

    One thing that strikes me is that working with scores on the outcome scale or with percentages makes a lot of difference. For the outcome scores, the results of a t-test and a Mann-Whitney test are roughly the same, but when working with percentages there's an enormous difference.

    Recalculating everything to percentages seems to inflate some of the outliers. In the original score difference for Work at AT, this wasn't that big of a problem.

    upload_2024-9-11_15-55-47.png
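
    A toy illustration of why converting to percentages can inflate some values: the same absolute drop is a much larger percentage when the day-1 value is small. The numbers are invented for illustration and are not taken from the Keller data.

```python
def abs_and_pct_change(day1, day2):
    """Return (absolute change, percentage change relative to day 1)."""
    diff = day2 - day1
    return diff, 100.0 * diff / day1


# Hypothetical workloads in watts, invented for illustration only.
low_baseline = abs_and_pct_change(20.0, 10.0)    # small day-1 value
high_baseline = abs_and_pct_change(100.0, 90.0)  # large day-1 value
```

    Both participants drop by 10 W, but one shows a 50% decline and the other 10%.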

    This all makes the analysis quite complex.

    Conceptually I think the percentage scores are the most correct. Using the original scales is like giving extra weight to the participants that had a large value on day 1 (as if they matter more) and there is no good reason to do this. Unfortunately, that means that Cohen's d is not a reliable effect size here and that we probably need to use Mann-Whitney U.

    Here's what I got for values at max (EDIT: not AT), with PI-026 excluded and with exclusion of the 10 patients who did not meet the criteria for peak values.

    upload_2024-9-11_16-13-43.png

    Time and VO2 seem significant but don't make it after multiplicity correction because 20 tests were done (I have removed VO_t because it is basically the same as VO2 and distorts the p-value correction).

    upload_2024-9-11_16-24-40.png
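
    A minimal sketch of what such a correction does, assuming a plain Bonferroni adjustment (the post does not say which method was used). The two raw p-values are the ones quoted in this thread for VO2 and Time at max, and 20 is the number of tests mentioned above.

```python
def bonferroni(p_values, n_tests, alpha=0.05):
    """Return (adjusted p-value, significant?) for each raw p-value."""
    results = []
    for p in p_values:
        adjusted = min(1.0, p * n_tests)  # cap the adjusted p-value at 1
        results.append((adjusted, adjusted < alpha))
    return results


# Raw p-values quoted in the thread for VO2 and Time at max; 20 tests in total.
adjusted = bonferroni([0.005, 0.012], n_tests=20)
```

    Under this assumption the adjusted p-values are about 0.10 and 0.24, so neither passes the 0.05 cutoff. Put differently, a raw p-value would need to be below 0.05 / 20 = 0.0025.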
     
    Last edited: Sep 11, 2024
    Kitty likes this.
  3. forestglip

    forestglip Senior Member (Voting Rights)

    Messages:
    721
    Nice, thanks. Which differences are you talking about? The three stats for both metrics seem to match.

    Did you exclude those 10 from AT? Is that necessary? I'm not sure that not hitting peak affects their AT values.

    Still not much is significant before corrections. Just wkld, VE_VCO2, and PETCO2.
     
    Kitty and ME/CFS Skeptic like this.
  4. ME/CFS Skeptic

    ME/CFS Skeptic Senior Member (Voting Rights)

    Messages:
    3,960
    Location:
    Belgium
    Small differences like:
    Cohen_d_difference: I got: -0.129, you got: 0.12646
    P_Welch_Difference: I got: 0.424, you got: 0.432
    etc.

    forestglip wrote: "Did you exclude those 10 from AT? Is that necessary? I'm not sure that not hitting peak affects their AT values."
    No sorry, typo, the second overview I posted was for maximum values.

    For the max values VO2 (p=0.005) and Time (p=0.012) also point to an effect that was almost significant after multiplicity correction.

    Apologies for the repeated errors. The analysis is more complex than I thought and I have overextended myself a bit trying to understand it all.
     
    Kitty and forestglip like this.
  5. forestglip

    forestglip Senior Member (Voting Rights)

    Messages:
    721
    Oh, you mean for absolute difference? Does your first chart show both absolute and percentage? There are two of each stat in every row. The percentage ones seem to match. For absolute, I think mine match your first chart in a previous post.

    No worries, I've been focusing a lot on this too as it's pretty fun and my brain's kind of moving slow now.

    Based on the Mann-Whitney and the plot for AT wkld percentage and fairly consistent previous studies, I'd feel pretty confident saying there's almost definitely some effect between the groups for this metric. Without those outliers, it looks a lot like a clear shift in the whole group down by about 10-15%. Though might be worth checking the matched groups too.

    VO2 at AT not being significant is kind of surprising.

    I'd like to look into those other effect measurements that can handle outliers I posted about too, not sure how complex that would be.
     
    Last edited: Sep 11, 2024
    Kitty likes this.
  6. ME/CFS Skeptic

    ME/CFS Skeptic Senior Member (Voting Rights)

    Messages:
    3,960
    Location:
    Belgium
    I found one that I find quite intuitive: the Common Language Effect Size (CLES). If you were to randomly take one participant from the ME/CFS group and one from the HC group, how often would the ME/CFS patient have the lower value?

    If this were random noise with no equal values, it would be 50/50.

    In the case of VO2_max, I found a value of 64%. So if one were to compare a random ME/CFS patient with a random healthy control, the former would have the lower value 64% of the time. For Work_AT I found a value of 59%.

    upload_2024-9-11_22-48-34.png
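
    A minimal sketch of the CLES as described here: compare every participant in one group with every participant in the other and count how often the first has the lower value. Counting ties as half a win is a common convention and an assumption here; the function name is hypothetical.

```python
def cles_lower(group_a, group_b):
    """Share of all (a, b) pairs where a < b; ties count as half."""
    wins = 0.0
    for a in group_a:
        for b in group_b:
            if a < b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(group_a) * len(group_b))
```

    This is the Mann-Whitney U statistic divided by the number of pairs, so it is on the same footing as the Mann-Whitney test used elsewhere in the thread.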
     
    Kitty and forestglip like this.
  7. forestglip

    forestglip Senior Member (Voting Rights)

    Messages:
    721

    Very cool. I tried out Cliff's delta, which seems to be a popular non-parametric alternative to Cohen's d for effect size. I used all the default arguments in R.

    https://en.wikipedia.org/wiki/Effect_size

    https://openpublishing.library.umass.edu/pare/article/1977/galley/1980/view/
    The function also provides the upper and lower range of the 95% confidence interval of the effect size.

    Rule of thumb for interpreting:
    Interpretation | Cohen’s d | Cliff’s delta (δ)
    Negligible | <0.20 | <0.15
    Small | 0.20 | 0.15
    Medium | 0.50 | 0.33
    Large | 0.80 | 0.47
    upload_2024-9-11_19-38-50.png

    I wanted to visualize the relative distributions (outliers excluded) in a way that looks a little less disorganized than the swarm plot. Here are density plots and histograms for AT workload and max VO2:

    AT workload percentage difference
    AT_wkld_diff_percentage_histogram.png AT_wkld_diff_percentage_kde.png

    max VO2 percentage difference
    max_VO2_diff_percentage_histogram.png max_VO2_diff_percentage_kde.png

    Easier to see the leftward "shift" when it's a thumbnail, than zoomed in with all the spikes, I think.
     
    Last edited: Sep 12, 2024
    Kitty and Murph like this.
  8. Murph

    Murph Senior Member (Voting Rights)

    Messages:
    139
    So just to clarify what you two are finding: there is a small difference between ME/CFS and healthy controls on a 2-day CPET? Is that the top line?
     
    Kitty likes this.
  9. forestglip

    forestglip Senior Member (Voting Rights)

    Messages:
    721
    Edit: I decided I don't feel comfortable making any conclusions about this without more experience in statistics. I'm happy playing with the data to see if anything interesting stands out, but I can't say anything definitive.
     
    Last edited: Sep 12, 2024
    Kitty and Murph like this.
  10. ME/CFS Skeptic

    ME/CFS Skeptic Senior Member (Voting Rights)

    Messages:
    3,960
    Location:
    Belgium
    It's similar to CLES, I believe. While CLES is the number of wins (group A > group B) divided by the total number of possible comparisons, Cliff's delta seems to be (wins - losses) divided by the total number of possible comparisons. I get the same values as you using this formula.

    I think that visually it means how much one distribution sticks out of the other one.
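
    The formula described above can be sketched as follows. This is a plain-Python illustration, not the R function used earlier in the thread, and the function name is hypothetical.

```python
def cliffs_delta(group_a, group_b):
    """(wins - losses) / all pairs, where a win is a > b and a loss is a < b."""
    wins = sum(1 for a in group_a for b in group_b if a > b)
    losses = sum(1 for a in group_a for b in group_b if a < b)
    return (wins - losses) / (len(group_a) * len(group_b))
```

    If ties are counted as half a win in the CLES, the two measures are linked by delta = 2 * CLES - 1, so a CLES of 0.64 would correspond to a delta of about 0.28 in absolute value.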
     
    Kitty and forestglip like this.
  11. ME/CFS Skeptic

    ME/CFS Skeptic Senior Member (Voting Rights)

    Messages:
    3,960
    Location:
    Belgium
    One other thing that I've been looking at is Winsorizing the data, meaning cutting off the data at a certain percentile and replacing the values outside the limit with the value at that percentile, on both sides of the data.
    Winsorizing - Wikipedia

    I tried different cutoffs at 1%, 2.5% and 5%. For the VO2_max data I think the 1% cutoff is most appropriate as it seems to deal with the most extreme outliers. Higher percentiles distort the data.
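
    A minimal sketch of percentile-based Winsorizing as described above, assuming NumPy and its default (linearly interpolated) percentiles. The data are invented, and the function name is hypothetical.

```python
import numpy as np


def winsorize_percentile(values, pct):
    """Clip values below the pct-th and above the (100 - pct)-th percentile."""
    x = np.asarray(values, dtype=float)
    lower, upper = np.percentile(x, [pct, 100.0 - pct])
    return np.clip(x, lower, upper)


# Invented data with one extreme value on each side.
data = [-80.0] + list(range(-10, 11)) + [120.0]
clipped = winsorize_percentile(data, 5.0)
```

    The two extreme values are pulled in to the cut points while the middle of the distribution is left unchanged. `scipy.stats.mstats.winsorize` is an existing alternative, though it may pick slightly different cut points.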

    upload_2024-9-12_16-54-27.png
    upload_2024-9-12_16-54-38.png
    upload_2024-9-12_16-54-43.png

    For workload_AT 1% wasn't sufficient, I think 2.5% works better. But even at this point, the effect size is really small.

    upload_2024-9-12_16-56-9.png
    upload_2024-9-12_16-56-45.png
    upload_2024-9-12_16-56-51.png

    I also checked and while there is a small effect for Workload_AT in the full cohort, it isn't there for the matched pairs (Mann-Whitney p = 0.346, CLES = 0.55, Cohen's d stays below 0.2 even after Winsorizing at 10%).

    upload_2024-9-12_17-1-12.png

    The results for VO2_max are very similar in the matched pairs, so I think that is pretty much the only consistent effect visible in the data.
     
    Kitty and forestglip like this.
  12. ME/CFS Skeptic

    ME/CFS Skeptic Senior Member (Voting Rights)

    Messages:
    3,960
    Location:
    Belgium
    There seems to be a moderate effect size for VO2 at peak values (not VO2 at the anaerobic threshold) with ME/CFS patients having lower values than controls, 63% of the time.

    The associated p-value is 0.005 but the authors tested more than 20 outcomes, at both the maximal and the anaerobic threshold, and in the matched and full cohort. So after controlling for multiple tests, this difference is not statistically significant and could have happened just by chance.

    None of the other values seem to show a notable difference, although there is a consistent trend that ME/CFS patients have lower values. One could perhaps argue that it is no coincidence that VO2peak shows up with the highest effect size, as this has been highlighted in previous studies, but a recent meta-analysis showed that VO2peak had a small effect that was not significant.

    It's workload at the anaerobic threshold that has been consistently reported to be reduced in ME/CFS patients but that was not found in this study. For the full cohort there is perhaps a small effect, but this is not seen in the matched pairs cohort.
     
    Murph, Kitty and forestglip like this.
  13. forestglip

    forestglip Senior Member (Voting Rights)

    Messages:
    721
    I was having a hard time finding a detailed explanation of how all the metrics in a CPET are determined. I'm too tired to read dense papers on protocol, so my understanding could be wrong for some or all of this.

    For VO2 at max, what exactly is that? Is that the instantaneous oxygen consumption rate at the moment one attains at least two of those max criteria? That would seem a bit more objective, but my impression is that it is the highest VO2 achieved, period. So even if both groups achieve max criteria, if one group feels a bit more motivated for whatever reason, they push a little harder and get a higher VO2. If my understanding is right, this feels more like a psychological measure than physiological. Same for all metrics at peak if that's how it is measured.

    I'm more interested in anaerobic threshold since it seems like a fairly direct measure of the body's ability to efficiently do work.

    With that too, I'm confused about how workrate at AT is measured. Is it instantaneous at the moment of AT? If so, it seems that could introduce a fair bit of randomness. For example, if someone's workrate is constant and just by chance they significantly slow down one second before AT, it probably won't be enough time to delay AT, so the workrate will seem lower than reality. Same if someone's workrate is very volatile. It just depends what point on the ups and downs of workrate you happen to measure.

    VO2 seems like it'd be more smoothed out even if the participant was changing speed a lot since it'd be delayed from changes in speed, unlike work, and would probably have less variability.

    This seems to align with the data, where -50 to 50% changes in workrate at AT are commonly seen, while VO2 is only commonly seen within -20 to 20% change.

    I wonder if it would make more sense to calculate work maybe over five seconds up to the point of AT, which would reduce randomness.
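
    A rough sketch of that idea, assuming the work rate is available as time-stamped samples. The trace below is invented, the function name is hypothetical, and nothing here reflects how the study actually derived its values.

```python
def mean_before(samples, at_time, window=5.0):
    """Average the values sampled in the `window` seconds up to `at_time`.

    `samples` is a list of (time_in_seconds, value) pairs.
    """
    in_window = [v for t, v in samples if at_time - window < t <= at_time]
    if not in_window:
        raise ValueError("no samples in the averaging window")
    return sum(in_window) / len(in_window)


# Invented 1 Hz work-rate trace with a dip in the final second before AT.
trace = [(t, 100.0) for t in range(1, 10)] + [(10, 60.0)]
instantaneous = trace[-1][1]
smoothed = mean_before(trace, at_time=10)
```

    In this toy trace the instantaneous value at AT is 60 W, while the five-second average is 92 W, so a momentary dip has much less influence.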

    Also, from what I understand, with a bike CPET, the participant can't be forced to maintain a constant workrate. The paper says "the prescribed pedal rate of 50–80 rpm", so that introduces more variability. If some participants have different rates, I think that could significantly change what metrics at AT look like.

    A treadmill CPET seems much more controlled. You can set an exact workrate by setting a speed and the participant has to maintain that exact speed or fall off the back.

    A more robust peak metric seems like it'd be useful too.
     
    Last edited: Sep 12, 2024
    Kitty and ME/CFS Skeptic like this.
  14. ME/CFS Skeptic

    ME/CFS Skeptic Senior Member (Voting Rights)

    Messages:
    3,960
    Location:
    Belgium
    I don't know for sure but I suspect that the measuring device automatically averages multiple values over a short period (for example 10 seconds) and that becomes the score.
     
    Kitty and forestglip like this.
  15. ME/CFS Skeptic

    ME/CFS Skeptic Senior Member (Voting Rights)

    Messages:
    3,960
    Location:
    Belgium
    We haven't posted or discussed this yet, but I think most statisticians would use a linear mixed-effects model for this. That way they can account for the repeated measures (day 1 and day 2) of participants but still test for differences between groups.

    I've tried this using the following model:
    model = Lmer('VO2 ~ phenotype * Study_Visit + (1 | ParticipantID)', data=df_max)

    upload_2024-9-12_21-50-34.png
    The last line, the interaction, is what interests us, and the results are the same as what we got by calculating the differences per participant and then comparing patients and controls: -0.93 mean difference and p = 0.03. This doesn't account for outliers though.
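
    To see why the interaction term matches the difference of per-participant differences, here is a sketch on invented data. It uses plain least squares with NumPy (fixed effects only), not the mixed model above. When every participant has both visits, the interaction point estimate should coincide, and the random intercept mainly matters for the standard error and p-value.

```python
import numpy as np

# Invented VO2 values: rows are participants, columns are (day 1, day 2).
hc = np.array([[30.0, 30.5], [25.0, 24.0], [28.0, 28.5]])
mecfs = np.array([[22.0, 20.0], [26.0, 25.5], [19.0, 17.5], [24.0, 23.0]])

# Route 1: per-participant day 2 - day 1 difference, then compare group means.
diff_in_diff = (mecfs[:, 1] - mecfs[:, 0]).mean() - (hc[:, 1] - hc[:, 0]).mean()

# Route 2: least-squares fit of VO2 ~ group * visit (fixed effects only).
rows, y = [], []
for group, block in ((0.0, hc), (1.0, mecfs)):
    for day1, day2 in block:
        for visit, value in ((0.0, day1), (1.0, day2)):
            # Columns: intercept, group, visit, group x visit interaction
            rows.append([1.0, group, visit, group * visit])
            y.append(value)
coef, *_ = np.linalg.lstsq(np.array(rows), np.array(y), rcond=None)
interaction = coef[3]
```

    Both routes give -1.25 for this toy data set.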
     
    Kitty and forestglip like this.
  16. forestglip

    forestglip Senior Member (Voting Rights)

    Messages:
    721
    That's interesting, but getting a bit out of my depth and brain bandwidth to really grasp at this point. Looks like the p-value isn't super low, and we've tested about a million things already, so with correction not amazing, I think.
     
    Kitty likes this.
  17. forestglip

    forestglip Senior Member (Voting Rights)

    Messages:
    721
    I made a couple little charts of all 2-day CPET studies with controls from MEpedia showing percent difference between means of the days for each group for VO2peak and workload at AT.

    Notes:
    • Lien 2019 not included because no numbers are given, I only see graphs of points.
    • Nelson 2019, Davenport 2020, and Keller 2024 used sedentary controls.
    • Both van Campen studies used idiopathic chronic fatigue controls.
    • All participants used a cycle ergometer, except in VanNess 2007, where 2 out of 6 ME/CFS participants (and 0 out of 6 controls) used the Bruce treadmill protocol.
    • Keller et al (VO2peak matched), 2024 is a subset of participants from Keller et al, 2024.
    VO2_peak (ml_kg_min)_dumbbell.png
    wkld_AT (W)_dumbbell.png
     
    Last edited: Sep 14, 2024
    Murph and ME/CFS Skeptic like this.
  18. forestglip

    forestglip Senior Member (Voting Rights)

    Messages:
    721
    I think there might be an error here. For Nelson, it shows ME decreased by 0.11 and CTL decreased by 0.38. But the study shows ME increased by 0.1 and CTL increased by 0.4. I'm guessing they requested the data from the authors to get more decimals, but the signs are switched. I checked one other metric from VanNess 2007 and that seems consistent with the meta analysis.
     
    Last edited: Sep 13, 2024
  19. ME/CFS Skeptic

    ME/CFS Skeptic Senior Member (Voting Rights)

    Messages:
    3,960
    Location:
    Belgium
    Nice visualization, thanks. One suggestion: it might be more intuitive if both graphs have the same scale, so that the differences in VO2 and wkld can be compared. Now they look the same size but VO2 has a scale from -20 to 10 and wkld from -50 to 20.
     
    Last edited: Sep 14, 2024
    forestglip likes this.
  20. ME/CFS Skeptic

    ME/CFS Skeptic Senior Member (Voting Rights)

    Messages:
    3,960
    Location:
    Belgium
    I think the Nelson 2019 study is interesting. Even though it is quite small, it is one of the few that used an appropriate analysis comparing both the testing difference (CPET2-CPET1) and group difference (MECFS versus HC) at the same time. They also suggest using percentages.
    The only difference they found was for workload at the AT, which went from 87.8 to 72.5, a percentage difference of the means of -17.43%.
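
    A quick check of that figure, plus a reminder that the percentage change of the group means is not the same quantity as the mean of each participant's percentage change. The two group means are the ones quoted above; the two-participant example is invented.

```python
def pct_change(day1, day2):
    """Percentage change from day 1 to day 2."""
    return 100.0 * (day2 - day1) / day1


# Group means for workload at AT as quoted in the post.
nelson = pct_change(87.8, 72.5)

# Invented participants to show that the two summaries can differ.
day1 = [50.0, 150.0]
day2 = [25.0, 150.0]
change_of_means = pct_change(sum(day1) / len(day1), sum(day2) / len(day2))
mean_of_changes = sum(pct_change(a, b) for a, b in zip(day1, day2)) / len(day1)
```

    The quoted means reproduce the -17.43% figure. In the invented example the change of the means is -12.5%, while the mean of the individual changes is -25%, so it matters which of the two a study reports.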

    They also did a ROC analysis with an optimal threshold of −9.8% (sensitivity=68.8%, specificity=100%). In other words, none of the healthy controls had a decrease of about 10% or more, while roughly two-thirds of the ME/CFS patients did, and they suggested that this might become a biomarker.
    Yet in the data by Keller et al., almost 40% of controls had a decline of 9.8% or more in work at the anaerobic threshold, while only 53% of ME/CFS patients did.

    I tried to do a ROC analysis for the Keller data to find the optimal threshold and got these results. A change of -14.89% best separated the groups. Approximately 70% of HC are above this threshold and 30% below it; 50% of MECFS are above and 50% below.
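
    A minimal sketch of one way to pick such a threshold, assuming the criterion is Youden's J (sensitivity + specificity - 1). The post does not say which criterion was used. The data are invented and the function name is hypothetical.

```python
def best_threshold(patients, controls):
    """Return (threshold, sensitivity, specificity) maximising Youden's J.

    A participant is classified as a patient when the value is <= threshold.
    """
    best = None
    for threshold in sorted(set(patients) | set(controls)):
        sensitivity = sum(v <= threshold for v in patients) / len(patients)
        specificity = sum(v > threshold for v in controls) / len(controls)
        j = sensitivity + specificity - 1.0
        if best is None or j > best[0]:
            best = (j, threshold, sensitivity, specificity)
    return best[1:]


# Invented percentage changes in workload at AT, for illustration only.
patients = [-30.0, -22.0, -15.0, -12.0, 1.0, 3.0]
controls = [-16.0, -8.0, -4.0, 0.0, 2.0, 6.0]
threshold, sensitivity, specificity = best_threshold(patients, controls)
```

    For this toy data the best cut is at -12%, with four of six patients at or below it and five of six controls above it.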


    upload_2024-9-14_10-20-22.png


    upload_2024-9-14_10-20-27.png
     
    forestglip likes this.
