Rethinking the treatment of chronic fatigue syndrome—A reanalysis and evaluation of findings from a recent major trial of graded exercise and CBT

Discussion in 'Psychosomatic research - ME/CFS and Long Covid' started by Carolyn Wilshire, Feb 6, 2018.

  1. Subtropical Island

    Subtropical Island Senior Member (Voting Rights)

    Messages:
    2,056
    Non-complaining sick people = higher rates of ‘cure’ that can be claimed.

    So, no, not useful for society, resources, govt funding, etc etc (those who fund the research).
    But useful for the people doing the ‘research’: we got funded, we did something, it got published as a ‘success’, we’ll get more funding for ‘research into effective treatments’. (My inner cynic wants to say: better outcome for these ‘researchers’ than a cure would be).

    The scammed are not just patients but funding bodies.

    The sad thing is that when you have been scammed, it’s often very hard to ‘give up’ and admit that all your ‘good work’ has been a fail. Easier to rationalise.
    This is human.

    What we need is modern scientific method: where every negative result adds to our body of knowledge.
    The old quote about the filament for lightbulbs: (something like) not 100 failures but 100 things we now know don’t work well.


ETA What I'm trying to say here is that studies like PACE can have real value. We all need to appreciate (especially the people involved in publishing PACE) the value of a result that confirms the negative of your hypothesis, or yields a null result. If mistakes are made, we need to review them and learn from them.
    A conclusion that something is not significantly effective is a useful conclusion. A conclusion that there are better ways to run a trial is also a useful conclusion - so long as you make future trials better.

This is what is so brilliant about this reanalysis: we are looking at what was really found, and what was not. The ONLY way to make progress.
     
    Last edited: Feb 15, 2018
  2. Tom Kindlon

    Tom Kindlon Senior Member (Voting Rights)

    Messages:
    2,254
So it looks like Larun has moved from the -2.12 (not statistically significant) finding to the -2.55 (statistically significant) finding? Or am I reading it incorrectly? I don't think the latter finding/data is mentioned anywhere else in the review.
     
  3. Esther12

    Esther12 Senior Member (Voting Rights)

    Messages:
    4,393
    I don't think I'd noticed that before.

    "-2.55 (-4.99 to -0.11; p=.040) for 0123 scoring"

    Those are the figures from the FINE RR: http://www.bmj.com/rapid-response/2011/11/02/fatigue-scale-0

So Larun is contrasting those "effect estimates" with her "unadjusted analysis", which is what was used in the Cochrane review?

    So it seems that the difference was a result of adjustments made in the FINE analysis? I've forgotten how FINE data was analysed now.
     
    Luther Blissett and Simon M like this.
  4. Evergreen

    Evergreen Senior Member (Voting Rights)

    Messages:
    363
Thanks for that really clear explanation, @Valentijn. Wow. So that's 88 patients whose bimodal scores indicate abnormal fatigue being counted as recovered under the Likert switch (a quick sketch of the two scoring schemes follows the reference below). I'm guessing their bimodal scores hadn't reduced by 50% either.

    For people like me who need to see these things again and again, from the PACE protocol:
    "We will use the 0,0,1,1 item scores [bimodal] to allow a possible score of between 0 and 11. A positive outcome will be a 50% reduction in fatigue score, or a score of 3 or less, this threshold having been previously shown to indicate normal fatigue [27]."

    Reference 27 was
Chalder T, Berelowitz G, Hirsch S, Pawlikowska T, Wallace P, Wessely S, Wright D: Development of a fatigue scale. J Psychosom Res 1993, 37:147-153.
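
For anyone who wants to see the arithmetic, here is a minimal sketch of the two scoring schemes. The response pattern is hypothetical; the bimodal threshold of 3 is from the protocol quote above, and the Likert threshold of 18 is the "normal range" cut-off used in the published recovery criteria (my assumption about the relevant comparison, not part of the quote).

```python
# Sketch: how the same 11 CFQ answers score under bimodal vs Likert scoring.
# Assumes the standard 11-item CFQ with 4 response options per item (0-3).
# Thresholds: bimodal <= 3 = "normal fatigue" (protocol); Likert <= 18 =
# published "normal range" (assumption, from the published recovery criteria).

def score_bimodal(responses):
    """Protocol scoring: options 0,1 -> 0; options 2,3 -> 1 (range 0-11)."""
    return sum(1 if r >= 2 else 0 for r in responses)

def score_likert(responses):
    """Published scoring: options scored 0,1,2,3 as-is (range 0-33)."""
    return sum(responses)

# Hypothetical response pattern: "2" on six items, "1" on five items.
patient = [2] * 6 + [1] * 5

b, l = score_bimodal(patient), score_likert(patient)
print(f"bimodal = {b} (> 3, i.e. abnormal fatigue under the protocol)")
print(f"Likert  = {l} (<= 18, i.e. inside the published 'normal range')")
```

So a patient can sit well above the protocol's abnormal-fatigue threshold (bimodal 6) while still falling inside the published "normal range" (Likert 17).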
     
    Luther Blissett, Inara and Valentijn like this.
  5. Evergreen

    Evergreen Senior Member (Voting Rights)

    Messages:
    363
Thanks for that link, @Esther12. My FINE folder is filling up.

    This playing with numbers until they tell the story you want makes me suspicious that the SF36-PF may find itself usurped by a measure that is deemed more sensitive to change, because it often stubbornly suggests little or no change in physical function.

    Although this reminds me of a piece I was rereading yesterday from Collin & Crawley's 2017 paper, where they seem to be suggesting that objective measures will show more improvement than subjective:

Objective measures haven't quite played ball so far, though, have they? (Reference 26 is a study of a small sample of Australians who did a 12-week CBT/GET/graded cognitive activity intervention and reportedly improved on objective measures of cognitive performance but not on subjective ones.)

    I find it baffling that there's no awareness of how flimsy "patient satisfaction" is. It's like going on a date and judging its success based on whether the person says "I had a nice time tonight" at the end rather than whether they call to arrange a second date, and show up for the second date, and arrange a third. Very few people end a date by saying "I don't like you. I have no intention of ever having any contact with you ever again." And they certainly wouldn't if the person they went on a date with was in charge of their healthcare and the gateway to their only source of income.
     
    Hutan, Webdog, large donner and 13 others like this.
  6. Simon M

    Simon M Senior Member (Voting Rights)

    Messages:
    995
    Location:
    UK
Comparison of protocol primary outcomes with published primary outcomes (did the switching matter?)

    I thought I'd bring things together to show the impact of changing primary outcomes, using the results from the new paper. Spoiler alert: the published outcomes always look better than protocol-specified ones.

    First, how the authors revised reporting of primary outcomes:
The protocol looked at the proportion of patients who improved on the CFQ, the SF36, or both, with a clinically meaningful difference requiring a 2-3x higher improvement rate than in the control group.

The published version switched to measuring mean (average) score differences between whole groups (e.g. CBT vs control group), rather than how many patients improved, and so no longer required both CFQ and SF36 to improve. It also changed from bimodal scoring of the CFQ (0-11) to Likert scoring (0-33), which is more sensitive to small changes.

A "clinically useful difference" was set at 0.5 SD, which worked out at -2 points for the CFQ and +8 points for the SF36. The smallest possible changes on these scales are 1 and 5 points respectively, so the new definition of "clinically useful" effectively means "more than the smallest possible positive change".
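
A minimal sketch of that arithmetic. Note the baseline SDs below (roughly 4 for the Likert CFQ, 16 for the SF36) are back-calculated from the -2 and +8 thresholds just quoted, not taken from the PACE papers themselves:

```python
# Sketch: the published "clinically useful difference" (CUD) of
# 0.5 * baseline SD. The SDs are back-calculated from the -2 and +8
# thresholds quoted above, not quoted from the PACE papers.

def clinically_useful_difference(baseline_sd):
    return 0.5 * baseline_sd

print(clinically_useful_difference(4))   # ~2 points on the 0-33 CFQ
print(clinically_useful_difference(16))  # ~8 points on the 0-100 SF36
```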

    Overall results

[Attached image: PACE-primary-published-vs-p.jpg]

So while the protocol results showed that only half of the outcomes were statistically significant (3/6), all of the published results were comfortably statistically significant. And instead of 2/6 being clinically meaningful for the protocol, most (3/4) were for the published primary outcomes.

    More info in the thumbnail:
[Attached image: PACE-primary-pub-v-proto-2.jpg]


Curiously, the PACE Lancet paper didn't come out and say that the effect of CBT on CFQ scores reached a clinically useful difference but the effect on SF36 scores did not. Instead, the start of the discussion section includes this:
Note the eliding of the more positive secondary outcome of proportion improving, and the post hoc outcome of normal range, with the primary outcomes. More details in the box.
    To sum up:
• The protocol results are much worse than those published: CBT is not an effective overall treatment for CFQ fatigue and SF36 function. GET is, but the results are weak (at 20% improvement vs 10% of controls - see the quick NNT sketch after this list).
    • In any case, the long-term results show that these improvements don't persist. This new paper unpicks the PACE authors' optimistic theory that the non-difference at long-term was because people had CBT/GET after the trial to 'catch up'. (The difference between groups also disappeared amongst those that did not have further treatment.)
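
A rough way to read that 20% vs 10% figure: the number needed to treat (NNT) is the reciprocal of the absolute difference in improvement rates. This is my own illustrative arithmetic (as with the NNT mentioned later in the thread), not a result reported in the paper:

```python
# Sketch: NNT implied by a 20% improvement rate for GET vs 10% for
# controls (figures quoted above; the NNT is illustrative arithmetic,
# not a result reported in the paper).

get_rate, control_rate = 0.20, 0.10
nnt = 1 / (get_rate - control_rate)
print(nnt)  # 10.0 -> roughly ten patients treated for one extra improver
```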
    The authors of the new paper put this nicely into context:
     
    Last edited: Feb 16, 2018
  7. Simon M

    Simon M Senior Member (Voting Rights)

    Messages:
    995
    Location:
    UK
Here are a couple of graphs that reveal how far the PACE authors moved the goalposts for the primary outcomes, by comparing the protocol primary outcome of overall improver rates from this new paper with the published improver rates and the published recovery rates.

    Both definitions are based largely on self-reported fatigue and function.

    Far more patients “improved” with the published improver definition compared with the protocol version.
[Attached image: PACE-primary-vs-publshed-im.jpg]

The published improver rates are now a secondary outcome, but they are based on how many patients improved by a clinically useful difference on both of the revised primary outcome measures.

Surprisingly, similar rates of patients "recovered" under the published definition as improved overall under the protocol one.

[Attached image: PACE-primary-vs-published-r.jpg]

Note that it won't be all the same people in both the improved and recovered groups. The improvers would include those who had an initial low score, e.g. SF36=30, and substantial improvements, while the "recovered" group will include those who had a high initial score, e.g. SF36=60 (which already meets the recovery definition for function), and relatively minor improvements.

    Neither group needed to improve on objective outcomes.
     
    Last edited: Feb 18, 2018
  8. Evergreen

    Evergreen Senior Member (Voting Rights)

    Messages:
    363
Thanks so much for putting this information in such an easy-to-grasp format, @Simon M - really impactful. Gobsmacking stuff.

    And this is the clincher, really, isn't it:

     
  9. Carolyn Wilshire

    Carolyn Wilshire Senior Member (Voting Rights)

    Messages:
    103
Whether or not their 'committee' gave them approval is neither here nor there from the perspective of evaluating the research. If you change what you specify in the protocol, you need a good reason, one that's a whole lot better than 'our mates on the committee agreed so it was okay'.

All that matters from the point of view of the science is that they changed various outcomes and analyses they promised to do in the trial protocol, and that these changes are not scientifically justified - whoever did or did not agree with them at the time is entirely irrelevant.
     
    Last edited: Feb 19, 2018
  10. Carolyn Wilshire

    Carolyn Wilshire Senior Member (Voting Rights)

    Messages:
    103
    That's a good point, @Sasha.
     
  11. Carolyn Wilshire

    Carolyn Wilshire Senior Member (Voting Rights)

    Messages:
    103
    No, and that is something for people concerned with ethics and patients' rights. From the point of view of the science, which was the only thing under scrutiny in the current paper, this has little bearing.
     
    Last edited: Feb 19, 2018
  12. Adrian

    Adrian Administrator Staff Member

    Messages:
    6,563
    Location:
    UK

This has been their basic defence for the changes. My point is I think they pulled the wool over their mates' eyes with the stats plan: not specifying reasons, just slipping in changes.
     
  13. Carolyn Wilshire

    Carolyn Wilshire Senior Member (Voting Rights)

    Messages:
    103
Yes, in fact we present this analysis in the paper, and we obtained the same results as you did. We actually corrected for 5 AND for 6 total comparisons (because the protocol specifies 6 and the stats plan specifies 5), and presented results for both. The conclusions were the same either way, and accorded with yours.

I'm not sure who suggested this, but yes, Bonferroni is pretty conservative, and FDR (false discovery rate) correction is more lenient and probably slightly preferable (the related Holm-Bonferroni step-down procedure is also less strict than plain Bonferroni). The main results might have just passed the FDR threshold.

Bonferroni is the only method of correction described or used in any of the PACE papers, including the stats plan. So I think it's a reasonable assumption that that's what they would have used. I agree there's not much in it - the results are borderline, and they get through on some thresholds but not others. The truth is probably that people self-rated a little better on the CFQ and/or the SF36 physical function scale after GET and CBT. But there is some value in showing that the results are less impressive than they appear after the outcome switch. It showed that researcher outcome selection introduced a source of bias into the study.
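
For readers following along, here is a minimal sketch of the two corrections being discussed. The p-values are placeholders for illustration, not the trial's:

```python
# Sketch: Bonferroni vs Holm-Bonferroni on a handful of comparisons.
# The p-values below are placeholders, not PACE results.

def bonferroni(p_values, alpha=0.05):
    """Reject H0 where p <= alpha / m (a p-value equal to the threshold
    counts as significant, per the convention discussed in this thread)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def holm(p_values, alpha=0.05):
    """Step-down Holm-Bonferroni: compare the i-th smallest p-value
    against alpha / (m - i), stopping at the first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject

p = [0.004, 0.010, 0.014, 0.030, 0.045, 0.200]  # placeholders
print(bonferroni(p))  # threshold 0.05/6 ≈ 0.0083: only 0.004 passes
print(holm(p))        # 0.004 and 0.010 pass (0.010 <= 0.05/5 = 0.010)
```

The sketch shows why Holm is the more lenient of the two: it relaxes the threshold at each step down the sorted list, so borderline results can pass Holm while failing plain Bonferroni.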
     
    Last edited: Feb 19, 2018
  14. Simon M

    Simon M Senior Member (Voting Rights)

    Messages:
    995
    Location:
    UK
    Thanks for these replies.
    Maybe my post wasn't clear - I'm quoting the figures from your paper. The NNT was my own, and I thought the protocol wasn't clear as to whether or not the "clinically important difference" measure applied to just the overall improvers, or also applied to improvers on CFQ or SF36 alone.

Your figures give a p value of 0.010 for GET overall, which I'm assuming does pass the stats plan version (i.e. marginally under the 0.05/5 = 0.010 threshold) but not the protocol version (6 contrasts, plus Bonferroni as per the stats plan).


I asked because I thought it's just the sort of defence that the PACE authors might advance. But I wasn't sure if using FDR, or other less-conservative measures than Bonferroni, was normal with such a small number of comparisons: I thought you would only consider that for >10 comparisons, but I have little experience and was hoping you could throw some light on it.

Also, I realise now that the 6 comparisons of the protocol include the APT ones, and without those I couldn't apply the Holm-Bonferroni correction properly myself.

Well, that's the huge value of the paper! The paper talks about "statistically unreliable results and modest effects" - I was just trying to get at what we can say in the most concrete terms that lay people might understand. And I thought that using the more generous stats plan version was less open to any defence from the PACE authors.

As you say in the conclusion, these are "modest effects" on self-report measures. Though there appears to be no overall effect of CBT even on self-report, and only a modest overall effect of GET.

    By the way, the link to the PACE primary outcomes on the Wolfson site doesn’t work any more (ref 22), at least not today.
     
    Last edited: Feb 19, 2018
  15. Barry

    Barry Senior Member (Voting Rights)

    Messages:
    8,420
    When I read about the PACE trial and its illusory outcomes, I always come back to the one thing I know, like I really really know ...

    I first met my wife in 1977, and we have been together ever since; she went down with ME in 2006. She has always had a very strong drive to do as much as she is capable of, physically and mentally, and she has never ever lost that. We used to do lots of walking in the country, my wife loves gardening, she does quilting and is currently doing a distance-learning course on it, having already done a good many previously. My wife is very self-motivated, and as physically fit as her ME allows her to be (I cannot imagine she is deconditioned to any significant degree, if at all).

So what is the discrepancy between what I know must be true, and what the PACE authors claim to be true? For me it always comes back to the one cardinal issue we all know and agree on - the discrepancy between subjective and objective outcomes, and the PACE authors' pathological faith in the former and rejection of the latter. This tells me, intuitively, that the difference is very significant. And the PACE data mean the science tells us the same, no matter what the authors might try to hide behind.

    I think if my wife had participated in the PACE trial, it is highly likely she would have self-reported much the same as others. Why? Because the person she would have felt she was most letting down, if she didn't demonstrate the will to live up to the goals and expectations she had been convinced to set for herself, would have been - herself. It would not simply have been about letting others down but letting herself down. And I've a feeling the PACE investigators may have exploited that personality trait in the participants.

    I know we clarify here that the outcomes under discussion are self-reported, but I'd hate for us to lose sight of how very significant that is.
     
    Last edited: Feb 19, 2018
    Hutan, Solstice, TiredSam and 16 others like this.
  16. Carolyn Wilshire

    Carolyn Wilshire Senior Member (Voting Rights)

    Messages:
    103
    The paper reports results based on both models - 6 comparisons versus 5 comparisons. So readers can actually see what difference this made.

But can I just say again: the protocol is the protocol is the protocol. It's fine to produce a stats plan that elaborates on the protocol, but that stats plan cannot contradict the protocol in any way. Well, because it's not the protocol.
     
    Last edited: Feb 19, 2018
  17. Carolyn Wilshire

    Carolyn Wilshire Senior Member (Voting Rights)

    Messages:
    103
    Yes, Bruce Levin was very emphatic on the point that values that equal the critical value of p count as significant. I looked it up, and this is definitely correct.
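
In code terms, a trivial sketch that makes the convention concrete (the figures are the p = 0.010 and 5-comparison stats plan discussed above):

```python
# Sketch: the decision rule discussed here. A p-value that equals the
# critical value counts as significant, i.e. reject when p <= alpha,
# not p < alpha.

alpha = 0.05 / 5   # Bonferroni threshold for the stats plan's 5 comparisons
p = 0.010
print(p <= alpha)  # True: significant at the corrected threshold
```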
     
    Luther Blissett and Simon M like this.
  18. Carolyn Wilshire

    Carolyn Wilshire Senior Member (Voting Rights)

    Messages:
    103
    This is brilliant, @Evergreen!:laugh:
     
  19. large donner

    large donner Guest

    Messages:
    1,214
    We are not even friends on benefits with them.
     
    Last edited: Feb 19, 2018
  20. Carolyn Wilshire

    Carolyn Wilshire Senior Member (Voting Rights)

    Messages:
    103
I've only seen FDR used in the context of many tests (although I admit I may not have paid much attention to what researchers use in areas outside my own). FDR is very common in neuroscience work, where you perform thousands of tests, so correcting appropriately - neither too much nor too little - really matters. But lately there have been calls that FDR is too lenient and controls poorly for false positives. People are now recommending permutation thresholding as the gold standard there (no-one is recommending Bonferroni in that field; it is considered too conservative when applied over so many tests, and likely to miss genuine effects).
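
For anyone curious what permutation thresholding looks like, here is a minimal sketch of the max-statistic variant. Everything in it is illustrative: random data, two hypothetical groups, many simultaneous t-tests.

```python
# Sketch: permutation ("max-statistic") thresholding, as used in
# mass-univariate settings such as neuroimaging. Entirely illustrative:
# random data, two groups, many simultaneous two-sample t-tests.
import numpy as np

rng = np.random.default_rng(0)
n_per_group, n_tests = 20, 1000
a = rng.normal(size=(n_per_group, n_tests))  # group A measurements
b = rng.normal(size=(n_per_group, n_tests))  # group B measurements

def t_stats(x, y):
    """Two-sample t statistics, one per test (equal group sizes)."""
    diff = x.mean(axis=0) - y.mean(axis=0)
    se = np.sqrt((x.var(axis=0, ddof=1) + y.var(axis=0, ddof=1)) / len(x))
    return diff / se

observed = t_stats(a, b)

# Null distribution of the *maximum* |t| across all tests, built by
# shuffling group labels. Thresholding at its 95th percentile controls
# the family-wise error rate without Bonferroni's conservatism.
pooled = np.vstack([a, b])
max_null = []
for _ in range(500):
    idx = rng.permutation(len(pooled))
    pa, pb = pooled[idx[:n_per_group]], pooled[idx[n_per_group:]]
    max_null.append(np.abs(t_stats(pa, pb)).max())

threshold = np.quantile(max_null, 0.95)
print(f"max-|t| threshold: {threshold:.2f}")
print(f"tests exceeding it: {(np.abs(observed) > threshold).sum()}")
```

The appeal is that the threshold adapts to the correlation structure of the data, rather than assuming all tests are independent the way Bonferroni implicitly does.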

The other consideration here is that this is a clinical trial. So its purpose is different from just advancing knowledge; its purpose is to demonstrate treatment effectiveness. So you would want to choose a correction method that is bulletproof to the possibility that some of your results might have been false positives. This is just the situation where you might choose Bonferroni. Then no one gets on your case about whether your outcomes are real or not.
    I don't have access to any stats software right now, but I've attached a file with the raw data in it for overall improvement rates (protocol-specified definition). It has two data columns, depending upon whether you go for intention-to-treat (counting drop-outs as non-improvers) or available cases (excluding dropouts from the analysis).
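
To make the two denominators concrete, a sketch with made-up counts (the real numbers are in the attached file):

```python
# Sketch: intention-to-treat vs available-cases improvement rates.
# The counts below are hypothetical, for illustration only.

improved, dropouts, randomised = 25, 15, 160  # made-up figures

itt_rate = improved / randomised                     # drop-outs = non-improvers
available_rate = improved / (randomised - dropouts)  # drop-outs excluded

print(f"ITT: {itt_rate:.1%}, available cases: {available_rate:.1%}")
```

The same number of improvers always yields a higher available-cases rate, since dropouts shrink the denominator.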

    How very interesting!
     

    Attached Files:

    Luther Blissett, Simon M and Barry like this.
