Rethinking the treatment of chronic fatigue syndrome—A reanalysis and evaluation of findings from a recent major trial of graded exercise and CBT

I think it was 13% who would have met one of the criteria at entry (specifically the SF-36 or CFQ ones). Clearly no one met the CGI ones, as that is a 'how much better do you feel after the trial?' measure, or, if you didn't tell them, 'how much better did the assessor think you felt after the trial?'

The Oxford criteria one is just strange: the way they introduced the thresholds, they change quite a lot. I can't remember if there were non-Oxford participants who were also able to meet the trial criteria at the end.
Seems like there were:
from: https://sites.google.com/site/pacefoir/pace-ipd_foia-qmul-2014-f73.xlsx
Readme file: https://sites.google.com/site/pacefoir/pace-ipd-readme.txt
 
Does this mean that we can/can't say "using the prespecified analysis for the trial's primary outcome there was no significant treatment effect for [CBT and/or GET]"?

Or do we once again have to deal with niggling complexities which prevent a nice simple statement (ideally one suited to those of us not used to discussing Bonferroni correction)?
I'm bravely/recklessly going to try to help (the first part of this post, more complex stuff further down). It's simplest if we use the correction method specified in the stats plan: 5 contrasts and the Bonferroni method of correction. Here's a handy summary of the results, showing what is statistically significant:

[Image: PACE-primary-stats-plan.jpg]

According to this, GET has a statistically significant effect on the overall rate of improvers (improving on SF36 and CFQ) but CBT does not.

CBT has a stat sig effect on the rate of fatigue improvers, but not SF36. Conversely GET had a stat sig effect on SF36 improvers, but not on fatigue.
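For anyone not used to the Bonferroni mechanics, here is a minimal sketch of the correction with the stats plan's 5 contrasts: each test has to clear 0.05/5 = 0.01. The p-values below are placeholders for illustration, not the trial's actual figures.

```python
# Bonferroni: divide the family-wise alpha by the number of contrasts.
# The stats plan specified 5 contrasts, so each test must clear 0.05/5 = 0.01.
alpha = 0.05
n_contrasts = 5
threshold = alpha / n_contrasts  # 0.01

# Placeholder p-values, chosen only to illustrate the mechanics
p_values = {"GET overall": 0.008, "CBT overall": 0.030}
for contrast, p in p_values.items():
    verdict = "significant" if p < threshold else "not significant"
    print(f"{contrast}: p = {p} -> {verdict} at corrected threshold {threshold}")
```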

For overall and CBT/CFQ, the stat sig effect also just matches the 2x rate of improvers set as a "clinically important difference", marked in my table by the green border. The protocol said between 2 and 3x, and since it was exactly 2 it isn't technically between two and three, but that seems to be pushing the argument a bit.

(However, that 2x threshold might apply only to the overall rate of improvers; the protocol seemed a bit ambiguous.)

To sum up: using the stats plan correction method, GET had a statistically significant effect on the overall improvement rate and was at the bottom end of a "clinically important difference". CBT had no stat sig effect overall.

CBT had a sig, clinically important effect on CFQ alone; GET had a sig effect on SF36 score but it wasn't clinically important.

You could add that even the GET result means you have to treat 10 patients to get one overall improver, and other analysis in the paper shows that these improvements vs no treatment don't last.
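A quick worked version of that number-needed-to-treat arithmetic, using the approximate improver rates quoted further down the thread (about 10% with no treatment vs about 20% with GET):

```python
# NNT = 1 / absolute difference in improver rates
# (~10% no-treatment vs ~20% GET, the approximate rates quoted in this thread)
control_rate = 0.10
get_rate = 0.20
nnt = 1 / (get_rate - control_rate)
print(f"NNT = 1 / ({get_rate} - {control_rate}) = {nnt:.0f}")  # -> 10
```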
=====

Assuming I've got my numbers right, that's the key point. It gets a bit more complicated looking at the protocol.

The protocol specifies 6 contrasts but no method for statistical correction. They would have had to choose something, but would have had a choice of methods. Bonferroni is the strictest option and the most obvious choice, but there are other methods, which might make the GET overall result significant. I'd appreciate @Carolyn Wilshire's opinion on this, both for accuracy and plausibility. (See quote box for a boring exploration of this.)

The problem with using the stats plan's Bonferroni approach is that the plan has been criticised for being late and for changing the outcome measures, so it can't be taken as definitive (hey, maybe they went ultra-strict to make up for watering down the outcomes). Certainly it's reasonable to apply Bonferroni, but it might be reasonable to use other approaches too.

One option would be the Bonferroni-Holm method, which applies successively easier thresholds to each p value, starting at the fully corrected one for the smallest p value and ending up with just p<0.05. By my calculation, this method would make the GET overall result significant along with the GET SF36 one, but none of the CBT results.
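A minimal sketch of that Holm step-down procedure, with placeholder p-values chosen only to illustrate the pattern described here, not the trial's actual figures:

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Labels that survive Holm's step-down procedure.

    The i-th smallest p-value (0-indexed, m tests) must clear alpha / (m - i);
    the first failure stops the procedure, so all larger p-values fail too.
    """
    m = len(p_values)
    passed = []
    for i, (label, p) in enumerate(sorted(p_values.items(), key=lambda kv: kv[1])):
        if p >= alpha / (m - i):
            break
        passed.append(label)
    return passed

# Placeholder p-values for the 6 protocol contrasts -- illustration only
p_values = {"GET SF36": 0.002, "GET overall": 0.009, "GET CFQ": 0.040,
            "CBT CFQ": 0.020, "CBT overall": 0.060, "CBT SF36": 0.300}
print(holm_bonferroni(p_values))  # -> ['GET SF36', 'GET overall']
```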
So here are the protocol results:

[Image: PACE-primary-protocol.jpg]

Neither therapy is effective overall, though there is a stat sig and (just) clinically important effect on fatigue and a stat sig (but not clinically important) effect on SF36 function. So little to shout about here, and wildly unimpressive compared with the results trumpeted by the PACE authors.

Finally, here are the basic figures used in the analysis (click on the thumbnail). I'm happy to share my excel file with anyone who wants it.
 

Attachments

  • PACE-primary-core-data.jpg
Many thanks @Simon M. Of course all these deliberations about improvers, be they statistically significant or clinically important, are subject to the overriding caveat that they are self-reported and incorporate significant favourable bias. If that could be corrected for (i.e. checked against objective measures) then it all becomes insignificant.
 
do we once again have to deal with niggling complexities which prevent a nice simple statement

So that would be a 'yes' then?

Thanks @Simon M

I was also unclear on whether it counts as their pre-specified analysis if we use the method for accounting for multiple comparisons from their statistical analysis plan on the primary outcomes from their protocol, when that same plan also changed those primary outcomes.

I also liked Wilshire's point [edit: in a post earlier in this thread] that 2x is not 'between' 2x and 3x, but it also feels a bit cheeky to use that in a debate (unless they first try to claim that they had reached this prespecified criterion for clinical significance).

I feel like this is all good ammo for challenging any attempt to claim that the primary outcomes from their protocol support the primary findings reported in the 2011 Lancet paper, but that for simple illustrations of the problems with PACE we're still best off focussing on claims of 'recovery'.
 
CBT has a stat sig effect on the rate of fatigue improvers, but not SF36. Conversely GET had a stat sig effect on SF36 improvers, but not on fatigue.
That would seem to support that "improvements" are due to bias built in to the therapies. CBT is about denying your fatigue, so fatigue questionnaire scores improve. GET is about believing you can do more, so physical functioning questionnaire scores improve. The effect on the SF36-PF was even more pronounced with the Lightning Process, which is extremely heavy-handed on pushing patients to believe that they can do more.

These clever buggers have reinvented the subjective placebo effect all by themselves :rolleyes: I suspect when they've completely run out of biomedical road to drive their psychosomatic cures on, they'll switch to pondering how insane patients are to report improvement when they have none, and act like they've invented the wheel in the process :D
 
So that would be a 'yes' [to niggling complexities re Bonferroni] then?
Or a 'no', if we go with the stats plan approach.

As you and I both noted, the same plan also changed the primary outcomes, but I can see no way the PACE authors could argue against its plan for correcting multiple comparisons: the multiple comparison problem applies equally to the protocol primary outcomes.

My concern is using the 6 contrasts of the protocol and applying the Bonferroni correction from the stats plan - that kind of mix-and-match approach might be open to challenge - hence my question to @Carolyn Wilshire (@Tom Kindlon).

Protocol with stats plan correction for multiple comparisons says:
"CBT has no overall effect, GET does and reaches the threshold for clinical importance (but the effect doesn't persist long-term) and CBT has an effect on fatigue only - again on the margin of clinical importance."

Also, you need to treat 10 patients to get one overall GET improver, and the overall improver rates are low: 10% for no treatment, 20% for GET (self-report scoring, not real improvement).

These are poor results.

That's not bad?

I also liked Wilshire's point that 2x is not 'between' 2x and 3x, but it also feels a bit cheeky to use that in a debate (unless they first try to claim that they had reached this prespecified criterion for clinical significance).
I'm not sure that argument would impress a neutral. Saying it's the "margins of clinical importance" might be a better approach.

UPDATE: By my calculation, while the overall effect of GET is between two and three, that for CBT on CFQ is fractionally below 2 (as is the non-significant effect of CBT overall).

That would seem to support that "improvements" are due to bias built in to the therapies. CBT is about denying your fatigue, so fatigue questionnaire scores improve. GET is about believing you can do more, so physical functioning questionnaire scores improve.
If you look at the percentage improvements below, that's true of CBT, but both CFQ and SF36 improve for GET; the CFQ effect just falls below the margin for significance. As Carolyn said, they would need to test the improvement in fatigue against the improvement in SF36, which they didn't do, and it probably would not be significant.

[Image: PACE-primary-core-data2.jpg]
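As an aside on what such a contrast looks like in practice, here is a minimal sketch of an arm-vs-arm comparison of improver rates, with made-up counts roughly matching the 20% vs 10% figures quoted above (the trial's own analysis used its own modelling; this is only the simplest form of the comparison):

```python
from scipy.stats import fisher_exact

# Made-up counts for illustration: improvers vs non-improvers in two arms,
# roughly matching the ~20% (GET) and ~10% (no treatment) rates quoted above.
get_improvers, get_total = 32, 160
smc_improvers, smc_total = 16, 160

table = [[get_improvers, get_total - get_improvers],
         [smc_improvers, smc_total - smc_improvers]]
odds_ratio, p_value = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.4f}")
```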

We are into picky territory here, but I think it's important to establish what we can say robustly.
 
And after other critical papers, like Wiborg and FINE.

I'm tortoising my way through your paper - congrats to all authors on having it accepted for publication.


Building on @Sean’s comment on @Carolyn Wilshire’s comment about when the decision to diverge from the original protocol definition of improvement was made -

Just stumbled on my copy of the FINE trial open with a particular sentence underlined, which was:

'In accordance with our protocol, "improvement" was defined as scoring less than 4 on the fatigue scale or improving by 50% or more or scoring 75% [sic?] or more on the SF-36 physical functioning scale'

The bolded piece is the same as the PACE protocol, as in Table 1 of your paper. The FINE protocol was published in 2006, the PACE protocol in 2007.

FINE was, essentially, a negative trial. It was published 23 April 2010 and had been accepted 8 February 2010.

In your paper you state that May 2010 was when PACE made changes:

'However, in May 2010, several months after data collection was complete, this primary outcome measure was replaced with two continuous measures: fatigue and physical function ratings on the two scales described above (see [13,14] for details). According to the researchers, the changes were made “before any examination of outcome data was started...” [13, p. 25].'

So the changes to PACE were made shortly after the FINE trial was published, i.e. after the publication of a negative trial which used a similar protocol definition of improvement.

In Table 1 of your paper you show that PACE changed their definition of improvement to “At least an 8 point increase in the 100-point SF-36 physical function scale” and “At least a 2 point decrease on the 33-point CFQ”.

In FINE, by 70 weeks the pragmatic rehab arm had changed by about 13 points on the SF-36 PF, the supportive listening arm by about 5 and the GP treatment as usual arm changed by about 10 points.

In FINE, by 70 weeks the pragmatic rehab arm was the only one to change by almost 2 points on the Chalder Fatigue Scale (NB bimodal scoring, not equivalent to 2 points on likert scoring used in PACE), the other two arms changed by less than 1 point.

I’m aware I am probably making observations that have been made, hey, probably multiple times, eloquently, in published form by people on this thread – sorry to anybody whose toes/shoulders I’m inadvertently stepping on!
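A sketch of that revised improvement definition from Table 1, encoded in code (my own encoding; 'improver' here means meeting both thresholds, as in Simon M's overall-improver tables above, and the example changes are made up):

```python
def improved(sf36_change, cfq_likert_change):
    # Revised PACE definition per Table 1: at least an 8-point increase on
    # the 100-point SF-36 PF and at least a 2-point decrease on the 33-point CFQ.
    return sf36_change >= 8 and cfq_likert_change <= -2

print(improved(10, -3))  # made-up changes clearing both thresholds -> True
print(improved(5, -1))   # made-up changes clearing neither -> False
```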

Edit: I've edited to reflect the typo picked up by @Tom Kindlon: the pragmatic rehab arm in FINE increased by about 13 points on the SF-36 PF, not 3 as my original post stated, and to point out that FINE used bimodal scoring on the Chalder Fatigue Scale rather than the likert scoring used in PACE, again, thanks to @Tom Kindlon! I recommend having him handy when brainfogged.
 
In Table 1 of your paper you show that PACE changed their definition of improvement to “At least an 8 point increase in the 100-point SF-36 physical function scale” and “At least a 2 point decrease on the 33-point CFQ”.

[..]

In FINE, by 70 weeks the pragmatic rehab arm was the only one to change by almost 2 points on the Chalder Fatigue Scale, the other two arms changed by less than 1 point.
Those scores you quote for FINE are for bimodal scoring, not the 33-point CFQ.
 
Esther12 said:
I also liked Wilshire's point that 2x is not 'between' 2x and 3x, but it also feels a bit cheeky to use that in a debate (unless they first try to claim that they had reached this prespecified criterion for clinical significance).

I'm not sure that argument would impress a neutral. Saying it's the "margins of clinical importance" might be a better approach.

UPDATE: By my calculation, while the overall effect of GET is between two and three, that for CBT on CFQ is fractionally below 2 (as is the non-significant effect of CBT overall).
Just to point out in this case "Wilshire" just refers to a point in this thread, not a point in the paper itself.
 
Those scores you quote for FINE are for bimodal scoring, not the 33-point CFQ.
Again, you're right, and I've edited my post to point this out. Thanks!

Just had a look at the PACE protocol paper, where they explain using a bimodally scored Chalder Fatigue Questionnaire as a primary efficacy measure, and a Likert scored CFQ as a secondary efficacy measure. In the PACE 2011 paper they use Likert scoring as a primary efficacy measure, stating the reason for the switch as "to more sensitively test our hypotheses of effectiveness." I don't follow the logic there. Can you or anyone explain to me why the switch makes sense? And does the FOIA dataset indicate that findings would have been different if the bimodal scoring of CFQ had been retained as a primary efficacy measure?

From the PACE protocol paper https://bmcneurol.biomedcentral.com/articles/10.1186/1471-2377-7-6:
"Primary outcome measures – Primary efficacy measures
...The 11 item Chalder Fatigue Questionnaire measures the severity of symptomatic fatigue [27], and has been the most frequently used measure of fatigue in most previous trials of these interventions. We will use the 0,0,1,1 item scores to allow a possible score of between 0 and 11. A positive outcome will be a 50% reduction in fatigue score, or a score of 3 or less, this threshold having been previously shown to indicate normal fatigue [27].
...
Secondary outcome measures – Secondary efficacy measures
1. The Chalder Fatigue Questionnaire Likert scoring (0,1,2,3) will be used to compare responses to treatment [27]."
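To make the two scoring schemes concrete, here is a minimal sketch (the item responses are invented for illustration): the same 11 answers give a 0-11 score under bimodal (0,0,1,1) scoring and a 0-33 score under Likert (0,1,2,3) scoring.

```python
# Each of the 11 CFQ items is answered on a 4-option scale (0-3).
# Bimodal scoring maps the options 0,0,1,1; Likert scoring maps them 0,1,2,3.
# These responses are invented purely to illustrate the arithmetic.
responses = [2, 3, 1, 2, 2, 0, 1, 3, 2, 2, 1]  # 11 items

bimodal = sum(1 if r >= 2 else 0 for r in responses)  # range 0-11
likert = sum(responses)                               # range 0-33

print(f"bimodal score: {bimodal} / 11")  # protocol: positive outcome at 3 or less
print(f"likert score:  {likert} / 33")   # recovery paper later used 18 or less
```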
 
They made the change because the FINE trial had found null results with the bimodal scoring but significant results in a post-hoc analysis using Likert. They didn't mention that reasoning in PACE, of course. It made no sense not to provide both analyses, since they were already providing the Likert as a secondary analysis anyway. They obviously figured they might get a significant finding by switching to the Likert and then they could hide the bimodal finding that might turn out to provide null results, like it did in PACE. They've never provided a satisfactory answer to why they did this and never, as far as I've seen, acknowledged that they did it specifically because they saw the FINE findings. That, of course, would have required them to mention FINE and point out that it basically had null results. They managed not to mention that anywhere in PACE as well.

oops--I meant to say above, like it did in FINE.
 
And does the FOIA dataset indicate that findings would have been different if the bimodal scoring of CFQ had been retained as a primary efficacy measure?
I think the issue is that the threshold on the Likert scale wasn't directly comparable to the threshold on the bimodal scale. The result was that it did lower the bar a bit for recovery, though I'm not sure how much practical effect that had by itself. I can take a look at my copy of the data set later if there's no definitive answer posted yet (it might have been addressed in one of the publications).
 
For recovery, the protocol threshold on the bimodal CFQ was a score of 3 or less. In the published recovery paper, they changed that to a Likert score of 18 or less. Only 89 patients (out of the 607 for whom there are CFQ scores at 52 weeks) qualified as recovered on the CFQ using the bimodal scoring, and that increased to 177 with the Likert scoring. So it doubled the number of patients crossing the CFQ threshold. Those who were added under the Likert scheme scored bimodally as follows: 8 (1 patient), 7 (18 patients), 6 (26 patients), 5 (20 patients), and 4 (23 patients).

It went from 31 in CBT crossing the threshold to 60.
It went from 30 in GET crossing the threshold to 51.
It went from 17 in APT crossing the threshold to 34.
It went from 11 in SMC crossing the threshold to 32.
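For anyone wanting to reproduce that kind of count from the FOIA dataset, a minimal sketch of the comparison, assuming a table with one row per participant (the column names and values here are hypothetical, not the dataset's real labels):

```python
import pandas as pd

# Hypothetical column names and invented values for illustration only;
# the real FOIA spreadsheet uses its own labels.
df = pd.DataFrame({
    "arm": ["CBT", "GET", "APT", "SMC"] * 3,
    "cfq_bimodal_52wk": [3, 7, 2, 5, 1, 3, 8, 2, 4, 3, 6, 1],
    "cfq_likert_52wk": [12, 20, 10, 17, 8, 14, 25, 9, 16, 13, 21, 7],
})

# Protocol threshold (bimodal <= 3) vs recovery-paper threshold (Likert <= 18)
by_bimodal = df[df["cfq_bimodal_52wk"] <= 3].groupby("arm").size()
by_likert = df[df["cfq_likert_52wk"] <= 18].groupby("arm").size()
print(by_bimodal, by_likert, sep="\n")
```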
 
They made the change because the FINE trial had found null results with the bimodal scoring but significant results in a post-hoc analysis using Likert. They didn't mention that reasoning in PACE, of course. It made no sense not to provide both analyses, since they were already providing the Likert as a secondary analysis anyway. They obviously figured they might get a significant finding by switching to the Likert and then they could hide the bimodal finding that might turn out to provide null results, like it did in PACE. They've never provided a satisfactory answer to why they did this and never, as far as I've seen, acknowledged that they did it specifically because they saw the FINE findings. That, of course, would have required them to mention FINE and point out that it basically had null results. They managed not to mention that anywhere in PACE as well.

oops--I meant to say above, like it did in FINE.


Thanks @dave30th . Hm.

I think the fact that Alison Wearden, lead author of the FINE trial, is listed in the PACE trial 2011 paper as being on the PACE trial group (as an “observer”) is noteworthy (p.835). While you would assume that people doing such similar work at the same time in the same field would be very much aware of each other, that explicit link between the trials is interesting.

I can’t see any mention of a post-hoc analysis using Likert in the FINE trial paper itself – am I missing something (I do skip over things thanks to brain fog) or was this mentioned somewhere else? Or do we know this because someone has the FINE data and has done the post-hoc analysis? (I did note this line in the FINE trial paper: “Data sharing: We will be happy to make our dataset available to researchers, once we have finished reporting our findings. Please contact the corresponding author.”)

Yes, I saw that FINE was also not mentioned in GETSET’s review of the literature:

“Research in context

Evidence before this study

We searched PubMed, PsychINFO, and the Cochrane Library from database inception until Aug 1, 2016, without language restrictions, for full reports of randomised controlled trials, systematic reviews, and meta-analyses using the search terms “chronic fatigue syndrome”, “myalgic encephal*”, “self-help”, “self-management”, “self-care”, and “self-instruction”. We excluded trials of adolescents, education, and group interventions…[mentions Chalder study]. After excluding studies in which participants had unexplained fatigue but were not diagnosed with chronic fatigue syndrome…[mentions Knoop study and Tummers study]. We found no trials of self-help management for chronic fatigue syndrome based on guided exercise therapy principles."



The FINE trial’s title was “Nurse led, home based self help treatment for patients in primary care with chronic fatigue syndrome: randomised controlled trial”. The reason for exclusion from the GETSET lit review is not clear to me. Patients did have five face-to-face sessions and five phone calls - maybe this was why? Another team might have made the reason for excluding such a large relevant trial explicit.

The result was that it did lower the bar a bit for recovery, though I'm not sure how much practical effect that had by itself. I can take a look at my copy of the data set later if there's no definitive answer posted yet (it might have been addressed in one of the publications).

@Valentijn I found some discussion of it in relation to the 2013 PACE recovery paper in @Carolyn Wilshire @Tom Kindlon et al’s 2016 paper “Can patients with chronic fatigue syndrome really recover…”:

“Our analyses show that in PACE, changing this threshold [Evergreen: from no more than 3 on bimodal scoring of Chalder Fatigue Questionnaire to 18 or below on Likert] doubled the number of patients who qualified as ‘recovered’ on this criterion (the total recovered rose from 15% to 29%). 16 of the new qualifying cases reported continuing fatigue on seven out of the 11 CFQ items, and one case even reported fatigue on 8 of the 11 items. These scores indicate considerably greater levels of fatigue than the maximum score of 3 specified on the original protocol. Finally, and perhaps most worryingly, seven of the PACE participants themselves fulfilled this new recovery criterion upon trial entry”

I’d be interested to know how it affected improvement rates in the main 2011 paper – it seems like it would have inflated them, but if anyone knows where this is discussed please direct me to it!
 
I can’t see any mention of a post-hoc analysis using Likert in the FINE trial paper itself – am I missing something (I do skip over things thanks to brain fog) or was this mentioned somewhere else? Or do we know this because someone has the FINE data and has done the post-hoc analysis? (I did note this line in the FINE trial paper: “Data sharing: We will be happy to make our dataset available to researchers, once we have finished reporting our findings. Please contact the corresponding author.”)

It was actually in this linked rapid response: http://www.bmj.com/rapid-response/2011/11/02/fatigue-scale-0

The FINE authors have since used these results in presentations, although they were never peer-reviewed, and their figures are contradicted by the Cochrane review on exercise therapy, which (after being challenged by Robert Courtney on its use of results that did not seem to have been reported elsewhere, despite claiming to have used only results from published papers) reported having access to FINE data, but said that its analysis showed the Likert results were not significant.
 
I haven't understood that point of the Cochrane reviewers--why are they saying the Likert findings are not significant when the post-hoc analysis from the FINE team said they were? I haven't looked closely at the numbers, but where is that contradiction coming from?

It could be that someone has just made an error, as they're claiming to have different results from the same data, or it could be that in the FINE RR adjustments were made that allowed them to reach statistical significance (Larun's response mentions "our unadjusted analysis"). It's hard to say (for me anyway - maybe others are more able to make an informed judgement), and those involved do not seem keen on explaining. [edit: Kindlon later pointed out that Larun's response to Courtney describes the figures from the FINE RR as an "effect estimate", and contrasts that with "our unadjusted analysis" for which there was not a statistically significant effect]

Regardless, as Courtney points out, their use of this data clearly contradicts the claim in their review that "For this updated review, we have not collected unpublished data for our outcomes...".

Copy of the relevant bits from the Cochrane review for those interested.


Larun response to Kindlon:

"Bimodal versus Likert scoring in Wearden et al. 2010
To enable pooling of as many studies as possible in a mean difference meta-analyses, we used the 33-scale results reported by Wearden. You suggest that the decision to use the 33-point fatigue scores in our analysis may bias the results because there is no statistically significant difference at the 11-point data at 70 weeks. This statement suggests that there is a statistically significant difference when using the 33-point data, but if you look into analysis 1.2 that is not the case. At 70 week we report MD -2.12 (95% CI -4.49 to 0.25) for the FINE trial, i.e. not statistically significant."

Courtney comment:

"Query re use of post-hoc unpublished outcome data: Scoring system for the Chalder fatigue scale, Wearden 2010.

I would like to highlight what appears to be a discrepancy within the Cochrane review [1] with respect to the analysis of data from Wearden 2010 [2,3].

Throughout the Cochrane review (please see details below), the impression is given that only protocol-defined and published data or outcomes were used for the Cochrane analysis of the Wearden 2010 study.

However, this does not appear to be the case and, to the best of my knowledge, instead of using protocol-defined or published data, the Cochrane analyses of fatigue for the Wearden 2010 study, appears to have used an alternative unpublished set of data.

The relevant analyses of fatigue in the Cochrane review are: Analyses: 1.1, 1.2, 2.1 and 2.3. Each of these analyses states that the “0,1,2,3” scoring system was used for the Chalder fatigue questionnaire. This scoring system is known as the Likert scoring system and uses a fatigue scale of 0-33 points.

However, to the best of my knowledge, data or analyses using this scoring system were not proposed in the Wearden 2010 trial protocol [3], and were not included in Wearden 2010 [2], and have not previously been formally (i.e. via peer review) published by Wearden et al. A post-hoc informal analysis using this data has been informally released by Wearden et al. as a BMJ Rapid Response comment [4].

In the Cochrane review, the analyses using the 0,1,2,3 scoring system contradict text within the section “Characteristics Of Studies”, in relation to Wearden 2010: Under “Outcomes”, it is stated that Chalder fatigue was measured using the 0,0,1,1 scoring system using a scale from 0-11 points: “Fatigue (Fatigue Scale, FS; 11 items; each item was scored dichotomously on a 4-point scale (0, 0, 1 or 1)”.

Wearden 2010 pre-specified Chalder fatigue questionnaire scores as a primary outcome at 70 weeks, and as a secondary outcome immediately after treatment at 20 weeks. The scoring, in both cases, used the 0,0,1,1 system, with a scale of 0-11. This scoring system was described both in the trial protocol [3] and the main results paper published in 2010 [2].

The Likert (0,1,2,3) scoring system was neither proposed in the trial protocol, nor formally published, and so the Likert scores should be considered post-hoc. Even if it is argued that the Chalder fatigue questionnaire (irrespective of the scoring system) was pre-defined as a primary outcome measure, data using the Likert scoring system was neither proposed nor published and so the data itself must surely be considered to be post-hoc. The outcome analyses using the Likert data must be considered post-hoc.

Simply changing a scoring system may, at first glance, appear not to be a significant or major adjustment, however, we do not know what difference it made because a sensitivity analysis has not been published.

I cannot find any explanation within the Cochrane review that explains why the Cochrane review has replaced pre-defined published data with an unpublished and post-hoc set of data.

Is it normal practice for a Cochrane meta-analysis to selectively ignore the pre-defined primary outcome data for a trial, and to selectively include and analyse post-hoc data? I wonder if some clarity could be shed on this situation?

I suggest that the post-hoc data are replaced with the original published data. Otherwise, the post-hoc data should be clearly labelled as such and the risk of bias analysis amended accordingly; and an explanation should be included in the review explaining why an apparently adequate pre-defined set of data has been replaced with an apparent novel set of post-hoc data.

Also, I suggest that any discrepancies that I will outline below, should be corrected where necessary; Either the analyses (1.1, 1.2, 2.1 and 2.3) should be amended or the description of the data should be amended so it is not incorrectly labelled as protocol-defined and published data with a “low risk” of bias.

Discrepancies within the text of the Cochrane Analysis.

Please note that all page numbers used below are pertinent to the current version (version 4) of the Cochrane review in PDF format.

1. On page 28 of the Cochrane review [1], in section “Potential biases in the review process”, under the heading “Potential bias in the review process”, in relation to the review in general, it is stated that: "For this updated review, we have not collected unpublished data for our outcomes..." However, as explained above, this is not the case for the Wearden 2010 fatigue data for which unpublished data has been used in the Cochrane analysis.

2. On page 45 of the review, in section “Characteristics Of Studies”, specifically in relation to Wearden 2010 [2,3], it is stated that only protocol-defined outcomes were used: "all relevant outcomes are reported in accordance with the protocol". "Selective reporting (reporting bias)" is rated as "low risk". However, as explained above, this is not the case, because the Wearden 2010 fatigue data (used in the Cochrane analysis) was not proposed in the protocol. If the data is post-hoc, then the “low-risk” category will need to be revised.

3. On page 44 of the review, in section “Characteristics Of Studies”, in relation to Wearden 2010 [2,3], under “Outcomes”, it is stated that Chalder fatigue was measured using the 0,0,1,1 scoring system using a scale from 0-11 points: “Fatigue (Fatigue Scale, FS; 11 items; each item was scored dichotomously on a 4-point scale (0, 0, 1 or 1)”. Wearden 2010 did indeed use the 0,0,1,1 scoring system for the Chalder fatigue scale: This scoring system was proposed in the trial protocol and published with the main outcome data in Wearden 2010. However, as explained above, this scoring system has not been used in the Cochrane analysis.

4. If figures 2 and 3 also contain discrepancies, after any amendments to the review, then they should be amended accordingly.

There may be other related discrepancies and inaccuracies in the text that I haven’t noticed.

I thank the Cochrane team in advance for giving this submission careful consideration, and for making amendments to the analysis, and providing explanations, where appropriate. I hope you will agree that clarity, transparency and accuracy in relation to the analysis is paramount.

References:

1. Larun L, Brurberg KG, Odgaard-Jensen J, Price JR. Exercise therapy for chronic fatigue syndrome. Cochrane Database Syst Rev. 2016; CD003200.

2. Wearden AJ, Dowrick C, Chew-Graham C, et al. Nurse led, home based self help treatment for patients in primary care with chronic fatigue syndrome: randomised controlled trial. BMJ. 2010; 340:c1777.

3. Wearden AJ, Riste L, Dowrick C, et al. Fatigue Intervention by Nurses Evaluation – The FINE Trial. A randomised controlled trial of nurse led self-help treatment for patients in primary care with chronic fatigue syndrome: study protocol. BMC Med. 2006; 4:9.

4. Wearden AJ, Dowrick C, Chew-Graham C, et al. Fatigue scale. BMJ Rapid Response. 2010. http://www.bmj.com/rapid-response/2011/11/02/fatigue-scale-0 (accessed April 16, 2016).

Larun reply to Courtney:

Dear Robert Courtney

Thank you for your detailed comments on the Cochrane review ’Exercise Therapy for Chronic Fatigue Syndrome’. We have the greatest respect for your right to comment on and disagree with our work.

We take our work as researchers extremely seriously and publish reports that have been subject to rigorous internal and external peer review. In the spirit of openness, transparency and mutual respect we must politely agree to disagree.

The Chalder Fatigue Scale was used to measure fatigue. The results from the Wearden 2010 trial show a statistically significant difference in favour of pragmatic rehabilitation at 20 weeks, regardless whether the results were scored bi-modally or on a scale from 0-3. The effect estimate for the 70 week comparison with the scale scored bi-modally was -1.00 (CI-2.10 to +0.11; p =.076) and -2.55 (-4.99 to -0.11; p=.040) for 0123 scoring. The FINE data measured on the 33-point scale was published in an online rapid response after a reader requested it. We therefore knew that the data existed, and requested clarifying details from the authors to be able to use the estimates in our meta-analysis. In our unadjusted analysis the results were similar for the scale scored bi-modally and the scale scored from 0 to 3, i.e. a statistically significant difference in favour of rehabilitation at 20 weeks and a trend that does not reach statistical significance in favour of pragmatic rehabilitation at 70 weeks. The decision to use the 0123 scoring did does not affect the conclusion of the review.
Regards,

Lillebeth Larun
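For anyone trying to follow the confidence-interval figures in that reply: a 95% CI that crosses zero corresponds to p >= 0.05 for that comparison. A tiny sketch using the two 70-week estimates Larun quotes above:

```python
# Mean differences and 95% CIs as quoted in Larun's reply above.
estimates = {
    "70wk bimodal": (-1.00, -2.10, 0.11),   # (MD, CI low, CI high)
    "70wk likert":  (-2.55, -4.99, -0.11),
}
for label, (md, lo, hi) in estimates.items():
    # significant at the 5% level exactly when the CI excludes zero
    sig = "significant" if (lo > 0 or hi < 0) else "not significant"
    print(f"{label}: MD {md} (95% CI {lo} to {hi}) -> {sig}")
```

This also makes the contradiction discussed earlier in the thread visible: the FINE rapid-response figure (-2.55, p=.040) excludes zero, while the -2.12 (95% CI -4.49 to 0.25) figure Larun quoted to Kindlon does not; presumably one is adjusted and one is not.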
 