Trials we cannot trust: investigating their impact on systematic reviews and clinical guidelines in spinal pain, 2023, O'Connell et al

EndME

Highlights
  • A group of trials with trust concerns had major impacts on the results of systematic reviews and clinical guidelines.
  • They substantially impacted effect sizes and influenced the conclusions and recommendations drawn.
  • There is a need for a greater focus on the trustworthiness of studies in evidence appraisal.
Abstract

We previously conducted an exploration of the trustworthiness of a group of clinical trials of cognitive behavioural therapy (CBT) and exercise in spinal pain. We identified multiple concerns in eight trials, judging them untrustworthy. In this study, we systematically explored the impact of these trials (“index trials”) on results, conclusions and recommendations of systematic reviews and clinical practice guidelines (CPGs).

We conducted forward citation tracking using Google Scholar and the citationchaser tool, searched the Guidelines International Network (GIN) library and National Institute of Health and Care Excellence (NICE) archive to June 2022 to identify systematic reviews and CPGs. We explored how index trials impacted their findings. Where reviews presented meta-analyses, we extracted or conducted sensitivity analyses for the outcomes pain and disability, to explore how exclusion of index trials affected effect estimates.

We developed and applied an 'Impact Index' to categorise the extent to which index studies impacted their results. We included 32 unique reviews and 10 CPGs. None directly raised concerns regarding the veracity of the trials. Across meta-analyses (55 comparisons), removal of index trials reduced effect sizes by a median 58% (IQR 40 to 74). 85% of comparisons were classified as highly, 3% as moderately, and 11% as minimally impacted. Nine out of 10 reviews conducting narrative synthesis drew positive conclusions regarding the intervention tested. Nine out of 10 CPGs made positive recommendations for the intervention(s) evaluated. This cohort of trials, with concerns regarding trustworthiness, has substantially impacted the results of systematic reviews and guideline recommendations.

Perspective
We found that a group of trials of CBT for spinal pain with concerns relating to their trustworthiness have had substantial impacts on the analyses and conclusions of systematic reviews and clinical practice guidelines. This highlights the need for a greater focus on the trustworthiness of studies in evidence appraisal.

https://www.jpain.org/article/S1526-5900(23)00467-4/fulltext
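
To make concrete what the "sensitivity analyses" mentioned in the abstract amount to, here is a minimal sketch in Python with invented numbers (not the authors' code or data): pool standardised mean differences (SMDs) with inverse-variance weights, drop the index trial, re-pool, and report the percentage reduction in the pooled effect.

```python
# A minimal sketch of the kind of sensitivity analysis described in the abstract.
# All numbers below are invented for illustration; this is not the authors' code or data.

def pooled_smd(trials):
    """Fixed-effect, inverse-variance pooled SMD. Each trial is a (smd, se) tuple."""
    weights = [1.0 / se ** 2 for _, se in trials]
    return sum(w * smd for (smd, _), w in zip(trials, weights)) / sum(weights)

# Hypothetical comparison: three ordinary trials plus one implausibly large "index" trial.
ordinary = [(0.15, 0.12), (0.22, 0.10), (0.10, 0.15)]
index = [(1.30, 0.11)]  # outlier effect of the kind the paper flags

with_index = pooled_smd(ordinary + index)
without_index = pooled_smd(ordinary)
reduction = 100 * (with_index - without_index) / with_index

print(f"Pooled SMD with index trial:    {with_index:.2f}")
print(f"Pooled SMD without index trial: {without_index:.2f}")
print(f"Reduction in pooled effect:     {reduction:.0f}%")
```

With these made-up numbers the pooled effect drops by roughly two-thirds once the outlier is removed, the same order of impact as the median 58% reduction reported across the 55 comparisons.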
 
You need to be able to give useless studies zero weight in reviews, instead of a very small weight.

This is what GRADE fails on. If a study is too open to bias to be interpretable it should score zero. There is no logical justification for just 'downgrading' a pip or two.

Or even negative weight.

And this as well. The PACE study provides us with very robust evidence for CBT and GET not producing a cost-effective benefit. If there is any benefit it is too small to be worth it. Since it was a big trial whose authors did their darnedest to get a positive result, it should carry very significant negative weight.
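
On the zero-weight point: meta-analytic weights are usually derived from standard errors alone, so GRADE downgrading changes only the certainty rating, not the pooled estimate. A rough sketch with made-up numbers (nothing here is from PACE or the paper):

```python
# Inverse-variance weighting: a large trial gets a small standard error and hence a big
# weight, regardless of how biased it is. Downgrading certainty doesn't change this;
# only exclusion (zero weight) removes its pull on the pooled estimate.

standard_errors = {
    "large biased trial": 0.08,
    "small trial A": 0.20,
    "small trial B": 0.25,
}

weights = {name: 1.0 / se ** 2 for name, se in standard_errors.items()}
total = sum(weights.values())

for name, w in weights.items():
    print(f"{name}: {100 * w / total:.0f}% of the total weight")
```

With these illustrative numbers the big trial carries about 80% of the total weight, however its certainty is graded.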
 
You need to be able to give useless studies zero weight in reviews, instead of a very small weight.
Somehow doesn't seem to have the same impact, when comparing the response to NICE vs. to IQWIG, which simply discarded all but 3 trials for being too biased to use. Or maybe since the conclusions remain mostly the same, they just don't care about their work being rated as too poor to even consider.

I'm really puzzled by the different response considering that grading them as not even worth rating, carrying no weight at all, is obviously worse than being rated as very low and carrying little weight.
Or even negative weight.
Which is another puzzling part of this. IQWIG rated identical trials, from mostly the same people, with mostly the same process and methodology, and obviously the exact same "treatments", as worthless, which should have cast doubt on similar conclusions from the 3 low-quality trials that they did find acceptable. Same food, from the same kitchen, cooked by the same staff using the same ingredients and equipment. This is all weird: absolute silence on a similar situation.
 
This is what GRADE fails on. If a study is too open to bias to be interpretable it should score zero. There is no logical justification for just 'downgrading' a pip or two.

I don't think GRADE has to fail on this - it depends entirely on how it is used. Quoting my comment from elsewhere:
While @ME/CFS Skeptic... I disagree that this is an accurate representation of how GRADE works. If assessors decide that an evaluation of an outcome in a trial really is fatally flawed, then the trial should not be included in the assessment of the evidence for that outcome. But, an unblinded trial with subjective outcomes is not necessarily fatally flawed. It can tell us something. It might tell us that there is no effect, even with all the biases stacked in favour of the intervention. It might tell us that an effect was reported, but it probably falls within the range of a placebo effect. Or it might tell us that there was an amazing result that exceeds the likely placebo response, and that this intervention shows real promise and should be investigated further, that it is worthwhile doing more expensive and difficult trials that are blinded and/or have objective outcomes.

So, if a trial is completely unreliable, for example the researcher sat at a table and just made up the data, it can and should be excluded from an analysis.

But, a trial with subjective outcomes and no blinding is not completely unreliable. In some cases, a subjective outcome is actually something we want to know and report on.

As I say above, a subjective trial with no blinding might tell us that the intervention has no effect, even though all the biases are stacked in its favour. Or that trial and all the others with the same design might show "benefits" that can be identified as likely to be within the benefit range expected from a hyped placebo. Then we know that the benefit isn't big enough to be real. There should be a step in the evaluation process that comes to a sensible conclusion on what magnitude of benefit is real and relevant and lasts long enough.

Alternatively, what if an intervention has a fantastically useful outcome? A trial with subjective outcomes (do you think the treatment solved the problem?) and no blinding might result in nearly 100% positive outcomes, and almost all of the participants continued to use it at followup. In that case, the result is so far in excess of what you would expect with a placebo treatment that you can conclude that the treatment is probably useful.

The thing is, any evaluation system has to be used by people who are thinking properly about what they are doing. I think it is entirely possible to use the GRADE approach to make a useful analysis. As I've noted elsewhere, the main benefit of GRADE and the other tools it is often used with is that they provide a transparent structure for assessing trials, for reporting, and to some extent standardising, how the assessor thought about things.
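
For what it's worth, here's a minimal sketch of that interpretive step for a single unblinded trial with a subjective primary outcome. The thresholds are assumptions made up for illustration; in a real appraisal the "expected bias/placebo" band would have to be argued for explicitly.

```python
# Classify the effect reported by an unblinded trial with subjective outcomes.
# The thresholds below are illustrative assumptions, not published constants.

def interpret_unblinded_subjective(effect_smd,
                                   bias_band=0.3,   # assumed range attributable to bias/placebo alone
                                   promising=0.8):  # assumed "too big to be bias alone"
    if effect_smd <= 0.0:
        return "No benefit despite biases favouring the intervention: an informative null."
    if effect_smd <= bias_band:
        return "Within the range expected from bias/placebo alone: no usable evidence of benefit."
    if effect_smd >= promising:
        return "Exceeds a plausible bias/placebo response: worth a blinded or objective-outcome trial."
    return "Ambiguous: a real effect cannot be separated from bias without better-designed trials."

for d in (-0.05, 0.20, 0.50, 1.20):
    print(f"SMD {d:+.2f}: {interpret_unblinded_subjective(d)}")
```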
 
You need to be able to give useless studies zero weight in reviews, instead of a very small weight.


Indeed, the point made in the abstract that 9 out of 10 of these reviews (with 85% of comparisons highly impacted) drew positive conclusions, effectively voting for the treatment, makes it really obvious this isn't just an issue of bias and selectivity in those that end up being reported. As far as I can see there is nothing stopping ten studies being done in a row until the right answer finally gets fudged 'just enough' from the poor methods, and the previous 9 never need to be reported.

NOTE: therefore none of those ever have to be included in 'the review', so it's nonsense to suggest these would be somehow useful, given they've been selected out from a pile of others people wouldn't see.

Even if you somehow couldn't design a single trial with any kind of objective measures to triangulate [which I think is just an excuse for not wanting to 'do the work'], that is just playing the probabilities while keeping bad methods that you choose not to improve, instead focusing on the outcome that works for your conflict of interest.

To say that this being the habit in an area makes it 'not science' is surely an understatement?

There is nothing stopping good quality qualitative research from taking place to add meat to the bones, so these arguments of justification seem beyond weak, almost like trying it on. Add in the huge risk of perceived coercion, particularly in the area of ME/CFS, and particularly when the research is done by those in the BPSM camp (their beliefs are in behavioural treatments like 'remove support' and 'make life hard', in 'secondary gains' nonsense, and their own sales patter writing people off as psychosomatic tends to ruin access to medical and other support), and I think that very much needs to be accounted for. I still don't understand why their methods aren't required to account for how they both made people safe and made them feel they were definitely safe to give whatever the truth is rather than 'the right answer'.
 
By far the biggest flaw and biasing element in this methodology: change the people and you change the outcome. That's voting with a few extra steps. A cornerstone of science is, well, exactly the opposite of that.
I guess, but all science has similar problems. As this paper says:
RCTs are a human product and so are influenced by biases in human behaviour.
And meta-analyses can compound the bias.

I think it is possible to do a review with rigour, to lay out your assumptions so clearly that others can challenge them, and to conduct sensitivity analyses to see whether assuming something else makes a difference.


Evidence-based medicine (EBM) has numerous tools and methods to assess and manage both quality and bias in research concerned with the conduct and reporting of trials, but there are few methods addressing the important question of trustworthiness of data. Trustworthiness incorporates research integrity and governance, including transparent pre-registration of protocols, appropriate ethical approval and transparent data stewardship, and potential research misconduct.[1] The latter might include fabrication or falsification of research results, or plagiarism.[2]
If untrustworthy trials are not identified and removed during the development process of reviews and CPGs, then the conclusions and recommendations of those reviews and guidelines are at risk of being incorrect, with potentially major impact on patient care. This issue is compounded by an academic and publishing system that is generally slow, inefficient and inconsistent in dealing with scientific error, issues of misconduct and research integrity,[3] and where mistakes are often uncorrected, raising the likelihood of negative impact.[4]

The problems with the studies they identified as flawed are summarised:
Key concerns included issues of research governance (lack of study pre-registration, no documentary confirmation of relevant ethical approvals, a lack of sharing of data upon request), distributions of baseline variables that appeared unlikely in the context of random allocation, data anomalies (in particular, duplicate or highly similar results data across unique studies), low to no attrition of participants in some studies and implausible results (extremely large effect sizes diverging from the wider literature).
The authors don't seem to have identified the problem of subjective outcomes in unblinded trials, which surely must have applied to most of these studies.

Here's the chart of the changes in effect size for each of the meta-analysis timepoints, with and without the trials. The right-hand scale is an estimated Number Needed to Treat.
[Figure from the paper: effect sizes for each meta-analysis comparison with and without the index trials]

Twelve out of 40 statistically significant effects (at the p<0.05 threshold) became non-significant after the removal of studies of interest.
Look how the standardised mean difference tends to zero. And that's before taking into account the bias from subjective outcomes in unblinded trials.
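
The excerpts quoted here don't say exactly how the right-hand NNT scale was derived from the SMDs; one common approximation is the Kraemer-Kupfer conversion, sketched below to show how quickly the NNT blows up as the SMD shrinks towards zero.

```python
# Approximate conversion from a standardised mean difference (SMD) to a number needed
# to treat (NNT), using the Kraemer-Kupfer formula NNT = 1 / (2*Phi(d/sqrt(2)) - 1).
# This may not be the conversion the authors used; it is shown for illustration only.
import math

def smd_to_nnt(d):
    auc = 0.5 * (1.0 + math.erf(d / 2.0))  # Phi(d / sqrt(2))
    return 1.0 / (2.0 * auc - 1.0)

for d in (0.8, 0.5, 0.2, 0.1):
    print(f"SMD {d:.1f} -> NNT ~ {smd_to_nnt(d):.0f}")
```

Under this approximation an SMD of 0.8 corresponds to an NNT of roughly 2 to 3, while SMDs of 0.1 to 0.2 correspond to NNTs of roughly 9 to 18, which is why effects shrinking towards zero matter clinically and not just statistically.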

A number of the reviews noticed that certain trials were pulling the results in a favourable direction. Some explained this away with speculation that the interventions were somehow superior, rather than checking for the flaws that these authors found.
In 15 of these reviews, authors commented on the fact that studies of interest were outliers in their sample, had very large effects, and/or introduced heterogeneity to the analyses. Of these, the authors of five reviews speculated that the dose, intensity and/or aspects of the content of the intervention in those trials might explain the observed divergence, while the other reviews either did not offer an explanation or stated that the heterogeneity was unexplained.

Same story with the clinical practice guidelines:
Nine CPGs presented narrative syntheses for comparisons that included index trials and one[53] conducted a de novo meta-analysis. The interventions of interest were described as multimodal, multidisciplinary or biopsychosocial,[49,51,55,56,57] CBT combined with exercise,[54] behavioural treatment,[48] cognitive therapy[50] or general exercise.[52] Table 6 summarises the CPG analyses that included the index trials. All but one made positive recommendations for interventions for which index trials had informed the synthesis. No guideline raised any concerns regarding the veracity of the index trials.
 
In many cases, the exclusion of index trials changed the pooled effects of meta-analyses from moderate-to-large to small or very small effect sizes. These new effect estimates are of questionable clinical significance and, in some cases, excluding index trials shifted effects from statistically significant to non-significant.

We identified a number of CPGs from a range of countries and organisations that included at least one of the index trials and used them to formulate their recommendations. All CPGs made positive recommendations for the interventions for which index trials informed the syntheses. Due to the varied approaches to reporting in CPGs and the dominance of narrative approaches to syntheses, it is often not possible to ascertain the specific contribution of index trials to their conclusions and recommendations. In most included CPGs, it is reasonable to infer that the positive reported findings of the index trials contributed to recommendations that favoured psychological or multimodal therapies. In specific examples, it is clear that the index trials were crucial to such clinical recommendations. The NICE 2016 guideline[55] clearly shows that two of the index trials[7,9] were included in the three trials whose evidence was used for a de novo economic analysis that drove a recommendation for multidisciplinary biopsychosocial rehabilitation for low back pain. It is not unreasonable to speculate that without the index trials such a recommendation would not have been considered appropriate. That the evidence in the NICE guideline[5] was directly used in the formulation of the Belgian (KCE)[57] guideline further extends that impact.
 
Ah, this is interesting:
Neil O'Connell, the first author
Neil is a Reader in the Physiotherapy Division of the Department of Health Sciences. He divides his time between teaching and research and previously worked as a musculoskeletal physiotherapist. Neil's research interests focus on the evidence based management of persistent pain and he has published extensively in this area. He also leads and teaches modules on clinical research methods and evidence based practice for pre- and post-graduate clinicians.

Neil was the Co-ordinating editor for the Cochrane Pain, Palliative and Supportive Care (PaPaS) group from 2020-23 and is a member of Cochrane's central editorial board. He was a member of the Guideline Development Group for the UK's National Institute of Health and Care Excellence (NICE) 2016 clinical guideline on the management of low back pain and sciatica and was a specialist committee member for the NICE Quality Standard on that topic. Neil is the current Chair of the International Association for the Study of Pain (IASP) Methods, Evidence Synthesis and Implementation Special Interest Group (MESIGIG).
 
On Amanda C de C Williams (the final author)
an interview with transcript
https://integrativepainscienceinsti...-does-it-help-with-dr-amanda-c-de-c-williams/
In this episode, we’re discussing the different types of psychological therapies available for the treatment of chronic pain. Do they help? Are they safe? How much confidence can we place in them and what we should further investigate regarding this topic as we move forward? My expert guest this episode is Dr. Amanda Williams.
We discussed the findings from her paper called Psychological Therapies for the Management of Chronic Pain in Adults, which can be found in the August 2020 Cochrane Library of Systematic Reviews. The paper updates the literature regarding the effectiveness of different kinds of psychological therapy, including traditional Cognitive Behavioral Therapy, Acceptance and Commitment Therapy, and Behavioral Therapy.

Given that, most of our trials were CBT. They tend to be the larger trials. They more than made it to our gate. They showed in general small but definite improvements in disability and distress, both straight after treatment and that 6 to 12 months follow up. Some of the results were not quite clear at follow-up and it matters that the effects last. This was against treatment as usual. If you compare with another treatment, you get much less effective because you’ve got two strong things going against each other, but most people’s choice is not, “Would you like this psychological treatment or that one?” It’s, “Would you like this or nothing?” That’s perhaps more likely aligned.

CBT looked good, but the changes are small. These are average changes. They’re made up of people who changed a lot. Some people who changed not at all and a few who get worse. A few trials of Behavior Therapy made it through. They are mostly earlier trials. An awful lot of small trials disappeared. They didn’t look good at all. Nothing much changed. Our confidence in the findings, which has to do with things about the quality of evidence was much lower than it was for CBT. The surprise to a lot of people, including us was only five trials of ACT made it through the gate. There are an awful lot of small trials back, which is a bit surprising given that the more trials tend to be bigger. We weren’t seeing any improvement on where we were. The quality of evidence was low that we couldn’t show that’s published when it’s added. It won’t change the results in a negative direction. That was a bit surprising given the strength of taking up of ACT by practitioners all over the place.
 
In addition to Neil O'Connell's, there's more of the authors' involvement with Cochrane and NICE:
Thanks MSEspe! Interesting to read those links. There's a huge Cochrane involvement, and, from Eccleston, some extraordinary expressions of prejudice about CFS:
Chronic fatigue patients appear to demonstrate a high level of self-criticism and negative perfectionism (Luyten et al., 2011). People are highly critical and demanding in a self-defeating pattern that fuels ruminating thoughts of failure, and so depression. This pattern can be self-perpetuating as people fail to reach unrealistically high targets, which reinforces a belief of personal inadequacy. Specific beliefs are that the fatigue is uncontrollable, likely to lead to catastrophic outcomes, and that further activity will lead to physical damage. All appear to play a part in maintaining a lack of engagement with activity (Lukkahatai and Saligan, 2013).

Other strong beliefs, perhaps fuelled by the social context of CFS as a contested illness, are beliefs about the causes of the disease and what needs to be done. Simply put: if one believes strongly that exercising is not only going to be fatiguing but that it will cause damage, then one is unlikely to exercise. Further, one is likely to consider anyone who is suggesting exercise to be at least unwise and perhaps unkind. Ironically, the person with CFS can appear to observers to be passive and resting. The opposite is normally true. People with fatigue tend to be actively ruminating about possible solutions, and often desperate for change, but may be engaging in self-defeating attempts to achieve unachievable solutions.

Edit to add the thread on the source:
Embodied: The psychology of physical sensation (2015) by C. Eccleston
 
I had a quick look at the Cochrane review of psychological therapies for pain. The outcome measures were all subjective, i.e. questionnaires.
It only found evidence of small improvements with CBT and none with BT or ACT. I would interpret that as CBT being more focused than the others on changing unhelpful beliefs, so it is more likely to have patients reporting improvement even when there is no real change. But of course they don't interpret it that way.
 
People with fatigue tend to be actively ruminating about possible solutions, and often desperate for change,

Real mystery why patients might be desperately seeking solutions to their serious life-trashing health problems, especially when the professional experts are botching it badly.

Can't imagine why patients would do that.
 
I guess, but all science has similar problems. As this paper says:
Definitely, but this problem is massively amplified in the context of open-label trials with biased subjective outcomes that mostly amount to "we're experts, we say so", multiplied again by the fact that what is being evaluated is a black box made up of a bunch of different things and involves direct 1-on-1 attempts at manipulating the outcomes. No one knows or checks the substance of the trials, the actual "intervention". What they do is rate books by their cover. It mostly ends up being all bunched up together under the "everything but the kitchen sink" label of non-pharmaceutical interventions.

What this means is that even if the results say something, they absolutely cannot be applied in the tyrannical fashion that the biopsychosocial model is forced onto us, taking what is, at the very best, some minor subjective benefit in 1/7 of participants and turning it into "this is a complete cure, and despite being a pragmatic trial, where causality cannot be inferred, we believe this proves that the participants are only suffering from psychological issues".

There are so many more layers in the biopsychosocial ideology compared to other disciplines. They may be the same issues, but they are amplified 100x at every single point, from the design to the evaluation of services delivered without any actual reliable evidence, since those services were designed with the intent of being implemented straight away, assuming they could not possibly fail.

If it was merely a suggestion that patients could ignore without any issues, then OK. But without fail patients keep being told to "just exercise" no matter how many times they report that it makes them severely ill, with the insistence, without any evidence, that they are anxious, afraid, or other schoolyard taunt-level nonsense.
 