Rethinking the treatment of chronic fatigue syndrome—A reanalysis and evaluation of findings from a recent major trial of graded exercise and CBT

Discussion in 'Psychosomatic research - ME/CFS and Long Covid' started by Carolyn Wilshire, Feb 6, 2018.

  1. Esther12

    Esther12 Senior Member (Voting Rights)

    Messages:
    4,393
    Thanks Tom. That link now works for me too, when yesterday it had some error message instead of the image. Maybe it had just glitched out?
     
  2. Tom Kindlon

    Tom Kindlon Senior Member (Voting Rights)

    Messages:
    2,254
    Seems like there were:
    from: https://sites.google.com/site/pacefoir/pace-ipd_foia-qmul-2014-f73.xlsx
    Readme file: https://sites.google.com/site/pacefoir/pace-ipd-readme.txt
     
  3. Simon M

    Simon M Senior Member (Voting Rights)

    Messages:
    995
    Location:
    UK
    I'm bravely/recklessly going to try to help (the simpler part comes first in this post, with more complex stuff further down). It's simplest if we use the correction method specified in the stats plan: 5 contrasts and the Bonferroni method of correction. Here's a handy summary of the results, showing what is statistically significant:

    [Attached image: PACE-primary-stats-plan.jpg, a summary table of the results under the stats plan correction]

    According to this, GET has a statistically significant effect on the overall rate of improvers (improving on SF36 and CFQ) but CBT does not.

    CBT has a stat sig effect on the rate of fatigue improvers, but not SF36 improvers. Conversely, GET has a stat sig effect on SF36 improvers, but not on fatigue.

    For overall and CBT/CFQ, the stat sig effect also just matches the 2x rate of improvers set as a "clinically important difference", marked in my table by the green border. The protocol said between 2 and 3x, and since it was exactly 2 it isn't technically between two and three, but that seems to be pushing the argument a bit.

    (However, that 2x threshold might apply only to the overall rate of improvers; the protocol seemed a bit ambiguous.)

    To sum up: using the stats plan correction method, GET had a statistically significant effect on the overall improvement rate and was at the bottom end of a "clinically important difference". CBT had no stat sig effect overall.

    CBT had a sig, clinically important effect on CFQ alone; GET had a sig effect on SF36 score but it wasn't clinically important.

    You could add that even the GET result means you have to treat 10 patients to get one overall improver, and other analysis in the paper shows that these improvements vs no treatment don't last.
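    For anyone who wants to check that arithmetic, here's a minimal sketch in Python (the 10% and 20% overall improver rates are the rough figures I quote further down the thread, not exact trial values):

    p_smc = 0.10   # rough overall improver rate, no treatment (SMC alone)
    p_get = 0.20   # rough overall improver rate, GET

    rate_ratio = p_get / p_smc    # 2.0 - sits exactly on the "2x" lower bound
    arr = p_get - p_smc           # absolute risk reduction: 0.10
    nnt = 1 / arr                 # number needed to treat: 10 patients per extra improver

    print(f"rate ratio {rate_ratio:.1f}, NNT {nnt:.0f}")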
    =====

    Assuming I've got my numbers right, that's the key point. It gets a bit more complicated looking at the protocol.

    The protocol specifies 6 contrasts but no method for statistical correction. They would have had to choose something, and there was a choice to make. Bonferroni is the strictest option and the most obvious choice, but there are other methods, which might make the GET overall result significant. I'd appreciate @Carolyn Wilshire's opinion on this, both for accuracy and plausibility. (See quote box for a boring exploration of this.)
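    To make that concrete, here's a minimal sketch using statsmodels; the six p-values are placeholders I've made up to show how a borderline contrast can flip between methods, not the trial's actual values:

    from statsmodels.stats.multitest import multipletests

    # Six made-up p-values standing in for the six protocol contrasts.
    p_values = [0.004, 0.009, 0.020, 0.150, 0.300, 0.600]

    for method in ("bonferroni", "holm"):
        reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method=method)
        print(method, list(reject))

    # Bonferroni tests every p against 0.05/6 (about 0.0083), so only 0.004 survives here.
    # Holm tests the smallest p against 0.05/6, the next against 0.05/5, and so on,
    # so 0.009 also survives - Holm is uniformly more powerful than Bonferroni.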

    So here are the protocol results:

    [Attached image: PACE-primary-protocol.jpg, a summary table of the results for the protocol's six contrasts]

    Neither therapy is effective overall, though there is a stat sig and (just) clinically important effect on fatigue and a stat sig (but not clinically important) effect on SF36 function. So there's little to shout about here, and it's wildly unimpressive compared with the results trumpeted by the PACE authors.

    Finally, here are the basic figures used in the analysis (attached as a thumbnail). I'm happy to share my Excel file with anyone who wants it.
     


  4. Barry

    Barry Senior Member (Voting Rights)

    Messages:
    8,420
    Many thanks @Simon M. Of course all these deliberations about improvers, be they statistically significant or clinically important, are subject to the overriding caveat that they are self-reported and incorporate significant favourable bias. If that bias could be corrected for (i.e. if the outcomes were assessed objectively), then it all becomes insignificant.
     
  5. Esther12

    Esther12 Senior Member (Voting Rights)

    Messages:
    4,393
    So that would be a 'yes' then?

    Thanks @Simon M

    I was also unclear on whether it counts as their pre-specified analysis if we apply the statistical analysis plan's method for handling multiple comparisons to the protocol's primary outcomes, given that the same plan also changed those primary outcomes.

    I also liked Wilshire's point [edit: in a post earlier in this thread] that 2x is not 'between' 2x and 3x, but it also feels a bit cheeky to use that in a debate (unless they first try to claim that they had reached this prespecified criterion for clinical significance).

    I feel like this is all good ammo for challenging any attempt to claim that the primary outcomes from their protocol support the primary findings reported in the 2011 Lancet paper, but that for simple illustrations of the problems with PACE we're still best off focussing on claims of 'recovery'.
     
  6. Valentijn

    Valentijn Guest

    Messages:
    2,275
    Location:
    Netherlands
    That would seem to support the idea that the "improvements" are due to bias built into the therapies. CBT is about denying your fatigue, so fatigue questionnaire scores improve. GET is about believing you can do more, so physical functioning questionnaire scores improve. The effect on the SF36-PF was even more pronounced with the Lightning Process, which is extremely heavy-handed in pushing patients to believe that they can do more.

    These clever buggers have reinvented the subjective placebo effect all by themselves :rolleyes: I suspect when they've completely run out of biomedical road to drive their psychosomatic cures on, they'll switch to pondering how insane patients are to report improvement when they have none, and act like they've invented the wheel in the process :D
     
  7. Simon M

    Simon M Senior Member (Voting Rights)

    Messages:
    995
    Location:
    UK
    Or a no, if we go with the stats plan approach.

    As you and I both noted, the same plan also changed the primary outcomes, but I can see no way the PACE authors could argue against its method for correcting for multiple comparisons: the multiple-comparison problem applies equally to the protocol primary outcomes.

    My concern is with using the 6 contrasts of the protocol while applying the Bonferroni correction from the stats plan - that kind of mix-and-match approach might be open to challenge - hence my question to @Carolyn Wilshire (@Tom Kindlon).

    Protocol with stats plan correction for multiple comparisons says:
    "CBT has no overall effect, GET does and reaches the threshold for clinical importance (but the effect doesn't persist long-term) and CBT has an effect on fatigue only - again on the margin of clinical importance."

    Also, you need to treat 10 patients to get one overall GET improver, and the overall improver rates are low: 10% for no treatment, 20% for GET (self-report scoring, not real improvement).

    These are poor results.

    That's not bad?

    I'm not sure that argument would impress a neutral. Saying it's the "margins of clinical importance" might be a better approach.

    UPDATE: By my calculation, while the overall effect of GET is between two and three, that for CBT on CFQ is fractionally below 2 (as is the non-significant effect of CBT overall).

    If you look at the percentage improvements below, that's true of CBT, but for GET both CFQ and SF36 improve; the CFQ effect just falls below the margin for significance. As Carolyn said, they would need to test the improvement in fatigue against the improvement in SF36, which they didn't, and it probably would not be significant.

    [Attached image: PACE-primary-core-data2.jpg, the underlying percentage-improvement figures used in the analysis]

    We are into picky territory here, but I think it's important to establish what we can say robustly.
     
  8. Evergreen

    Evergreen Senior Member (Voting Rights)

    Messages:
    363
    I'm tortoising my way through your paper - congrats to all authors on having it accepted for publication.


    Building on @Sean's comment on @Carolyn Wilshire's comment about when the decision to diverge from the original protocol definition of improvement was made -


    Just stumbled on my copy of the FINE trial paper, open with a particular sentence underlined:


    'In accordance with our protocol, "improvement" was defined as scoring less than 4 on the fatigue scale or improving by 50% or more or scoring 75% [sic?] or more on the SF-36 physical functioning scale'


    The bolded piece is the same as the PACE protocol, as in Table 1 of your paper. The FINE protocol was published in 2006, the PACE protocol in 2007.


    FINE was, essentially, a negative trial. It was published 23 April 2010 and had been accepted 8 February 2010.


    In your paper you state that May 2010 was when PACE made changes:


    'However, in May 2010, several months after data collection was complete, this primary outcome measure was replaced with two continuous measures: fatigue and physical function ratings on the two scales described above (see [13,14] for details). According to the researchers, the changes were made “before any examination of outcome data was started...” [13, p. 25].'


    So the changes to PACE were made shortly after the FINE trial was published, i.e. after the publication of a negative trial which used a similar protocol definition of improvement.


    In Table 1 of your paper you show that PACE changed their definition of improvement to “At least an 8 point increase in the 100-point SF-36 physical function scale” and “At least a 2 point decrease on the 33-point CFQ”.


    In FINE, by 70 weeks the pragmatic rehab arm had changed by about 13 points on the SF-36 PF, the supportive listening arm by about 5 and the GP treatment as usual arm changed by about 10 points.


    In FINE, by 70 weeks the pragmatic rehab arm was the only one to change by almost 2 points on the Chalder Fatigue Scale (NB bimodal scoring, not equivalent to 2 points on the Likert scoring used in PACE); the other two arms changed by less than 1 point.
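    Putting those SF-36 numbers against PACE's revised 8-point improvement threshold makes the comparison explicit (a toy check only; the CFQ figures can't be compared this way because of the bimodal/Likert difference):

    # Approximate 70-week SF-36 PF changes in FINE, from the figures above.
    sf36_change = {"pragmatic rehab": 13, "supportive listening": 5, "GP treatment as usual": 10}
    THRESHOLD = 8   # PACE's revised "at least an 8 point increase"

    for arm, change in sf36_change.items():
        print(arm, "improved" if change >= THRESHOLD else "not improved")
    # On these rough figures, even the treatment-as-usual arm clears the revised threshold.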


    I’m aware I am probably making observations that have been made, hey, probably multiple times, eloquently, in published form by people on this thread – sorry to anybody whose toes/shoulders I’m inadvertently stepping on!

    Edit: I've edited to reflect the typo picked up by @Tom Kindlon: the pragmatic rehab arm in FINE increased by about 13 points on the SF-36 PF, not 3 as my original post stated. I've also pointed out that FINE used bimodal scoring on the Chalder Fatigue Scale rather than the Likert scoring used in PACE - again, thanks to @Tom Kindlon! I recommend having him handy when brainfogged.
     
  9. Tom Kindlon

    Tom Kindlon Senior Member (Voting Rights)

    Messages:
    2,254
    The increase on the SF-36 physical function for the pragmatic rehab arm was 29.84 to 43.27 i.e. about 13, not 3.
     
  10. Tom Kindlon

    Tom Kindlon Senior Member (Voting Rights)

    Messages:
    2,254
    Those scores you quote for FINE are for bimodal scoring, not the 33-point CFQ.
     
  11. Tom Kindlon

    Tom Kindlon Senior Member (Voting Rights)

    Messages:
    2,254
    Just to point out that in this case "Wilshire" refers to a point made in this thread, not a point in the paper itself.
     
  12. Evergreen

    Evergreen Senior Member (Voting Rights)

    Messages:
    363
    You're absolutely right, a typo, I've edited my post to correct this. Thanks!
     
  13. Evergreen

    Evergreen Senior Member (Voting Rights)

    Messages:
    363
    Again, you're right, and I've edited my post to point this out. Thanks!

    Just had a look at the PACE protocol paper, where they explain using a bimodally scored Chalder Fatigue Questionnaire as a primary efficacy measure, and a Likert-scored CFQ as a secondary efficacy measure. In the PACE 2011 paper they use Likert scoring as a primary efficacy measure, stating the reason for the switch as "to more sensitively test our hypotheses of effectiveness." I don't follow the logic there. Can you or anyone explain to me why the switch makes sense? And does the FOIA dataset indicate that findings would have been different if the bimodal scoring of the CFQ had been retained as a primary efficacy measure?

    From the PACE protocol paper https://bmcneurol.biomedcentral.com/articles/10.1186/1471-2377-7-6:
    "Primary outcome measures – Primary efficacy measures
    ...The 11 item Chalder Fatigue Questionnaire measures the severity of symptomatic fatigue [27], and has been the most frequently used measure of fatigue in most previous trials of these interventions. We will use the 0,0,1,1 item scores to allow a possible score of between 0 and 11. A positive outcome will be a 50% reduction in fatigue score, or a score of 3 or less, this threshold having been previously shown to indicate normal fatigue [27].
    ...
    Secondary outcome measures – Secondary efficacy measures
    1. The Chalder Fatigue Questionnaire Likert scoring (0,1,2,3) will be used to compare responses to treatment [27]."
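    For anyone unfamiliar with the two schemes, here's a minimal sketch of how the same 11 answers score each way (the responses are made up, chosen to straddle the two thresholds):

    # CFQ: 11 items, each answered 0 ("less than usual") to 3 ("much more than usual").
    # Bimodal scoring maps 0,1,2,3 -> 0,0,1,1 (total 0-11); Likert keeps 0,1,2,3 (total 0-33).
    responses = [2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1]  # fatigue "more than usual" on 5 of 11 items

    bimodal = sum(1 for r in responses if r >= 2)  # = 5: fails the protocol's "3 or less"
    likert = sum(responses)                        # = 16: passes the recovery paper's "18 or less"

    print(bimodal, likert)

    So a patient still reporting fatigue on five of the eleven items can clear the revised Likert threshold while failing the original bimodal one, which is exactly the pattern @Valentijn lays out below.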
     
  14. dave30th

    dave30th Senior Member (Voting Rights)

    Messages:
    2,447
    They made the change because the FINE trial had found null results with the bimodal scoring but significant results in a post-hoc analysis using Likert. They didn't mention that reasoning in PACE, of course. It made no sense not to provide both analyses, since they were already providing the Likert as a secondary analysis anyway. They obviously figured they might get a significant finding by switching to the Likert and then they could hide the bimodal finding that might turn out to provide null results, like it did in FINE. They've never provided a satisfactory answer to why they did this and never, as far as I've seen, acknowledged that they did it specifically because they saw the FINE findings. That, of course, would have required them to mention FINE and point out that it basically had null results. They managed not to mention that anywhere in PACE as well.
     
  15. Valentijn

    Valentijn Guest

    Messages:
    2,275
    Location:
    Netherlands
    I think the issue is that the threshold on the Likert scale wasn't directly comparable to the threshold on the bimodal scale. The result was that it did lower the bar a bit for recovery, though I'm not sure how much practical effect that had by itself. I can take a look at my copy of the data set later if there's no definitive answer posted yet (it might have been addressed in one of the publications).
     
  16. Valentijn

    Valentijn Guest

    Messages:
    2,275
    Location:
    Netherlands
    For recovery, the protocol CFQ bimodal threshold was a score of 3 or less. In the published recovery paper, they changed that to a Likert score of 18 or less. Only 89 patients (out of the 607 for whom there are CFQ scores at 52 weeks) qualified as recovered on the CFQ using the bimodal scoring, and that increased to 177 with the Likert scoring. So it doubled the number of patients crossing the CFQ threshold. Those who were added under the Likert scheme scored bimodally as follows: 8 (1 patient), 7 (18 patients), 6 (26 patients), 5 (20 patients), and 4 (23 patients).

    It went from 31 in CBT crossing the threshold to 60.
    It went from 30 in GET crossing the threshold to 51.
    It went from 17 in APT crossing the threshold to 34.
    It went from 11 in SMC crossing the threshold to 32.
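    For anyone with the FOIA spreadsheet who wants to reproduce this, a minimal sketch (the column names are my guesses - check pace-ipd-readme.txt, linked earlier in the thread, for the real variable names):

    import pandas as pd

    df = pd.read_excel("pace-ipd_foia-qmul-2014-f73.xlsx")

    # Hypothetical column names for the 52-week CFQ scores and the trial arm.
    wk52 = df.dropna(subset=["cfq_bimodal_52wk", "cfq_likert_52wk"]).copy()
    wk52["bimodal_ok"] = wk52["cfq_bimodal_52wk"] <= 3    # original protocol threshold
    wk52["likert_ok"] = wk52["cfq_likert_52wk"] <= 18     # revised recovery threshold

    print(wk52[["bimodal_ok", "likert_ok"]].sum())        # should come to 89 and 177 if the guesses are right
    print(wk52.groupby("trial_arm")[["bimodal_ok", "likert_ok"]].sum())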
     
  17. Evergreen

    Evergreen Senior Member (Voting Rights)

    Messages:
    363

    Thanks @dave30th. Hm.

    I think the fact that Alison Wearden, lead author of the FINE trial, is listed in the PACE trial 2011 paper as being on the PACE trial group (as an “observer”) is noteworthy (p.835). While you would assume that people doing such similar work at the same time in the same field would be very much aware of each other, that explicit link between the trials is interesting.

    I can’t see any mention of a post-hoc analysis using Likert in the FINE trial paper itself – am I missing something (I do skip over things thanks to brain fog) or was this mentioned somewhere else? Or do we know this because someone has the FINE data and has done the post-hoc analysis? (I did note this line in the FINE trial paper: “Data sharing: We will be happy to make our dataset available to researchers, once we have finished reporting our findings. Please contact the corresponding author.”)

    Yes, I saw that FINE was also not mentioned in GETSET’s review of the literature:

    “Research in context

    Evidence before this study

    We searched PubMed, PsychINFO, and the Cochrane Library from database inception until Aug 1, 2016, without language restrictions, for full reports of randomised controlled trials, systematic reviews, and meta-analyses using the search terms “chronic fatigue syndrome”, “myalgic encephal*”, “self-help”, “self-management”, “self-care”, and “self-instruction”. We excluded trials of adolescents, education, and group interventions…[mentions Chalder study]. After excluding studies in which participants had unexplained fatigue but were not diagnosed with chronic fatigue syndrome…[mentions Knoop study and Tummers study]. We found no trials of self-help management for chronic fatigue syndrome based on guided exercise therapy principles."



    The FINE trial’s title was “Nurse led, home based self help treatment for patients in primary care with chronic fatigue syndrome: randomised controlled trial”. The reason for its exclusion from the GETSET lit review is not clear to me. Patients did have 5 face-to-face sessions and 5 phone calls - maybe this was why? Another team might have made the reason for excluding such a large, relevant trial explicit.

    @Valentijn I found some discussion of it in relation to the 2013 PACE recovery paper in @Carolyn Wilshire @Tom Kindlon et al’s 2016 paper “Can patients with chronic fatigue syndrome really recover…”:

    “Our analyses show that in PACE, changing this threshold [Evergreen: from no more than 3 on bimodal scoring of the Chalder Fatigue Questionnaire to 18 or below on Likert] doubled the number of patients who qualified as ‘recovered’ on this criterion (the total recovered rose from 15% to 29%). 16 of the new qualifying cases reported continuing fatigue on seven out of the 11 CFQ items, and one case even reported fatigue on 8 of the 11 items. These scores indicate considerably greater levels of fatigue than the maximum score of 3 specified on the original protocol. Finally, and perhaps most worryingly, seven of the PACE participants themselves fulfilled this new recovery criterion upon trial entry”

    I’d be interested to know how it affected improvement rates in the main 2011 paper – it seems like it would have inflated them, but if anyone knows where this is discussed please direct me to it!
     
  18. Esther12

    Esther12 Senior Member (Voting Rights)

    Messages:
    4,393
    It was actually in this linked rapid response: http://www.bmj.com/rapid-response/2011/11/02/fatigue-scale-0

    The FINE authors have since used these results in presentations, although they were never peer-reviewed. Their figures are also contradicted by the Cochrane review on exercise therapy, which (after being challenged by Robert Courtney over their use of results that did not seem to have been reported elsewhere, despite claiming to have used only results from published papers) reported having access to the FINE data, but said that their analysis showed the Likert results were not significant.
     
  19. dave30th

    dave30th Senior Member (Voting Rights)

    Messages:
    2,447
    I haven't understood that point of the Cochrane reviewers--why are they saying the Likert findings are not significant when the post-hoc analysis from the FINE team said they were? I haven't looked closely at the numbers, but where is that contradiction coming from?
     
  20. Esther12

    Esther12 Senior Member (Voting Rights)

    Messages:
    4,393
    It could be that someone has just made an error, as they're claiming to have different results from the same data, or it could be that in the FINE RR adjustments were made that allowed them to reach statistical significance (Larun's response mentions "our unadjusted analysis"). It's hard to say (for me anyway - maybe others are more able to make an informed judgement), and those involved do not seem keen on explaining. [edit: Kindlon later pointed out that Larun's response to Courtney describes the figures from the FINE RR as an "effect estimate", and contrasts that with "our unadjusted analysis", for which there was not a statistically significant effect]
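    To illustrate how adjusted and unadjusted analyses of the same outcomes can land on opposite sides of p = 0.05, here's a minimal sketch with made-up data (nothing below is the actual FINE data or model):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(42)
    n = 100
    treatment = rng.integers(0, 2, n)     # 0 = control, 1 = treatment
    baseline = rng.normal(20.0, 5.0, n)   # baseline fatigue score
    outcome = 0.9 * baseline - 2.0 * treatment + rng.normal(0.0, 4.0, n)

    # Unadjusted: outcome ~ treatment. Adjusted: outcome ~ treatment + baseline.
    unadjusted = sm.OLS(outcome, sm.add_constant(treatment)).fit()
    adjusted = sm.OLS(outcome, sm.add_constant(np.column_stack([treatment, baseline]))).fit()

    print(f"unadjusted treatment p = {unadjusted.pvalues[1]:.3f}")
    print(f"adjusted treatment p   = {adjusted.pvalues[1]:.3f}")

    # Adjusting for a strong baseline predictor soaks up residual variance, so the
    # adjusted p-value is typically much smaller; with a borderline effect, one
    # analysis can be "significant" while the other is not.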

    Regardless, as Courtney points out, their use of this data clearly contradicts the claim in their review that "For this updated review, we have not collected unpublished data for our outcomes...".

    Copy of the relevant bits from the Cochrane review for those interested.

    From Cochrane review:

    Larun response to Kindlon:

    Courtney comment:

    Larun reply to Courtney:

     
