Who Agrees That GRADE is (a) unjustified in theory and (b) wrong in practice?

Jonathan Edwards · Mar 4, 2021

The idea of GRADE to provide a recipe for making decisions for people who are not themselves capable of making such decisions on their own is a flawed and dangerously counterproductive idea in a medical context.

The pseudo-arithmetic structure of allocating evidence to 'grades' has no purpose other than to sound standardised. Standardisation in decision-making by definition makes it less precise.

The proper process is for people with enough experience and skill in logic to view the evidence available and decide what its implications for recommended management are in one integrated decision step. Any intermediate steps of forcing information into grading levels and using arbitrary rules for moving up and down grading levels is logically invalid and bound to interfere with, rather than assist, a decision.

It should be possible for a randomised controlled trial that has fatal flaws that make it uninterpretable to be downgraded to uninterpretable (no need for very low or grade 1 or anything) on the basis of any one flaw that is enough to reach that judgment. GRADE does not allow this and so is highly likely to produce false conclusions.

It is interesting to see that both Cochrane and NICE use GRADE but NICE does not trust the Cochrane use of GRADE so re-does it. At NICE I can see the practical reason for using GRADE. Technical staff use GRADE to prepare a provisional analysis which is then reviewed by a committee. The technical staff have no experience of trials so will need something like GRADE. I do not see why the committee needs to make use of GRADE. I think it would be fair to ask technical staff to search for studies and document a list of features but I do not think there is any merit in asking them to grade, since I don't think grading comes in to this.

For Cochrane the worry is that nobody oversees the use of GRADE by the review team. There does not seem to be any place for anything like GRADE here. Admittedly Cochrane reviews go out to peer review but we have seen how problematic that is.

It would be easy to think that because GRADE has been arrived at by a consensus of 'experts' that is must be as good an approach as any. However, by definition those who choose to see themselves as experts suited to the construction of such a set of rules will be those who do not see that the exercise is pointless and invalid in decision-making theory terms. Those who can see that the exercise is doomed will not volunteer to be on the committee. It may be worth remembering that at least in the UK you get a pay rise for sitting on committees but not for just doing your job well, despite the fact that if you are sitting on a committee you cannot be doing the job you are paid to do.

FMMM1 · Mar 4, 2021

Jonathan Edwards said:
I will fill out a first post shortly. In brief:

The idea of GRADE to provide a recipe for making decisions for people who are not themselves capable of making such decisions on their own is a flawed and dangerously counterproductive idea in a medical context.

The pseudo-arithmetic structure of allocating evidence to 'grades' has no purpose other than to sound standardised. Standardisation in decision-making by definition makes it less precise.

The proper process is for people with enough experience and skill in logic to view the evidence available and decide what its implications for recommended management are in one integrated decision step. Any intermediate steps of forcing information into grading levels and using arbitrary rules for moving up and down grading levels is logically invalid and bound to interfere with, rather than assist a decision.

It should be possible for a randomised controlled trial that has fatal flaws that make it uninterpretable to be downgraded to uninterpretable (no need for very low or grade 1 or anything) on the basis of any one flaw that is enough to reach that judgment. GRADE does not allow this and so is highly likely to produce false conclusions.

Yea surely a medical Doctor who is unsure of a diagnosis can set out their views and ask a colleague(s) for their views?

Black boxes (like GRADE) are rightly concerning - the fact that this one requires you to give a +ve value to data that should be discarded, means that it isn't fit for purpose. Bit disappointing that the great, and the good, have been touting it and are still trying to defend it.

Excuse me for not keeping up i.e. you folks have already done this. I Googled "cochrane insurance medicine + GRADE" and yes:
"The evaluation for quality of evidence of cost or economic outcomes through GRADE: A survey of Cochrane reviews"
[https://abstracts.cochrane.org/2017...-medicine-evaluation-cochrane-reviews-and-new]

Old joke "if you can't be part of the solution then make money out of the problem" - tasteless in this context.

I wouldn't blame anyone using strong language re this.

Jonathan Edwards · Mar 4, 2021

The point about giving some sort of positive value to evidence, however flawed, is also a salient one, yes. Any system like GRADE should recognise evidence that makes it highly likely that there was no effect as for PACE. GRADE deals with this by noting the consistency of any positive finding but that misses the opportunity to make use of strong evidence for no effect from individual studies.

arewenearlythereyet · Mar 4, 2021

I can’t say I am well read enough on the detail of GRADE but I would say that whatever system is used to put a ‘value’ on research should determine first what the purpose of the ‘grade’ is.

Pace is a bad trial in terms of flawed methodology so you could discount it completely or can you use it to provide proof that even flawed it shows that CBT and GET doesn’t work (based on the principle that a negative result is as useful as a positive one)? So I guess it all hangs on what your objective is ....do a thorough search of all known research and use it to establish what facts exist?

In this case it’s a bit moot since all the evidence we have says that we don’t know very much and there isn’t a lot of anything other than what little we have tried so far doesn’t work.

One thing I used to do when doing a literature search ahead of pitching for a research grant (food not medical) was to initially group past research in terms of quality/strength just so I could weigh things up. This was good because you could quickly filter out the wheat from the chaff and spot ‘career publishing’ by the same authors and genuine replication etc. but also negative results that showed what ideas had been disproved.

I can see that grouping evidence might be useful initially to establish a base and even to demonstrate at a high level what you are dealing with, but that’s probably where it ends.

The next bit (insight) should be based on skill, common sense and consensus. I.e free thought not some second rate algorithm that assumes that people are incapable of learning a skill

Hoopoe · Mar 4, 2021

The lowest GRADE certainty rating is "very low", described as "The true effect is probably markedly different from the estimated effect".

Does that really accurately describe the worst possible scenario? It's as if only positive results are possible with this system, a bit like a questionnaire where the only allowed options are varying degrees of improved health, and there is no option to indicate a lack of improvement or deterioration.

If we come up with a scale like this, it should start with something like "high certainty of no effect", followed by "uninterpretable".

And looking at something like PACE, it's somewhere between uninterpretable and disproving any claims of meaningful treatment effects.

Trish · Mar 4, 2021

What is the best source of a good summary of how GRADE works?

I found this one on a website called Cochrane Training:
https://training.cochrane.org/grade-approach

This one looks like it's the GRADE group website:
https://www.gradeworkinggroup.org/

And this is an article called 'What is Grade' under BMJ best practice:
https://bestpractice.bmj.com/info/toolkit/learn-ebm/what-is-grade/

Any suggestions of which one I should start with to try to understand GRADE?

Jonathan Edwards · Mar 4, 2021

strategist said:
The lowest GRADE certainty rating is "very low", described as "The true effect is probably markedly different from the estimated effect".

And this is very odd wording. It seems to indicate either a lack of understanding of probability or a hidden assumption that the result was 'fiddled'. If the accuracy of the estimated effect is just plain uncertain the true effect is most probably something like it but might be quite different. If there is a conclusion that the true effect is probably markedly different' I think there has to be an assumption that the estimate is biased and in practice that is always biased towards a positive effect, unless you are dealing with someone trying to disprove homeopathy perhaps.

Jonathan Edwards · Mar 4, 2021

Trish said:
Any suggestions of which one I should start with to try to understand GRADE?

I think the BMJ What is GRADE is a reasonable place to get an overview. The GRADE manual is long, although it is reasonably easy to locate various aspects.

Jonathan Edwards · Mar 4, 2021

Looking at the BMJ 'What is GRADE' and the opening para of 'How does it work?' the following sentence is interesting:

An overall GRADE quality rating can be applied to a body of evidence across outcomes, usually by taking the lowest quality of evidence from all of the outcomes that are critical to decision making. (my bolding)

To me, the confusions involved in what GRADE is trying to do are apparent straight away. It is not clear whether the idea is to decide whether or not there is an effect or to decide what size it is, apparently assuming that there is one. Certainty and quality are also seen as interchangeable. The whole thing looks like a fail on a probability exam paper.

ME/CFS Science Blog · Mar 4, 2021

Well said.

Jonathan Edwards said:
It should be possible for a randomised controlled trial that has fatal flaws that make it uninterpretable to be downgraded to uninterpretable (no need for very low or grade 1 or anything) on the basis of any one flaw that is enough to reach that judgment. GRADE does not allow this and so is highly likely to produce false conclusions.

This seems to be one of the main issues: that GRADE does not believe in a fatal flaw that makes 'evidence' totally unreliable. The only way to rate something as very low quality is if a trial suffers from several different flaws. I haven't seen any arguments why this would be the case in the real world.

Suppose for example that there is an interpretation problem on a questionnaire: patients indicate they got better when in fact they meant something else. The GRADE approach, if I understand correctly, only offers the possibility the downgrade the quality of evidence a little bit even though the data is totally useless.

The problem is that GRADE is now so often used that this is seen as the correct and neutral way to rate quality of evidence. If you deviate from it, by arguing there is a fatal flaw that makes evidence fully unreliable, then you're arbitrary, not neutral or biased etc.

I think the Hanbook said something like: we're not trying to tell you how to rate evidence, merely how to make your decision transparent. The fact that you can only downgrade evidence two times for risk of bias, from strong to low quality, shows that this isn't really the case.

It would be interesting to have a comparison of how evidence is rated with and without GRADE. I suspect that the approach with GRADE will result in evidence being rated higher quality than the approach without GRADE.

cassava7 · Mar 4, 2021

It seems that previous discussions on GRADE have picked up your points @Jonathan Edwards.

Jonathan Edwards said:
The pseudo-arithmetic structure of allocating evidence to 'grades' has no purpose other than to sound standardised. Standardisation in decision-making by definition makes it less precise.

The proper process is for people with enough experience and skill in logic to view the evidence available and decide what its implications for recommended management are in one integrated decision step. Any intermediate steps of forcing information into grading levels and using arbitrary rules for moving up and down grading levels is logically invalid and bound to interfere with, rather than assist, a decision.

From a 2014 editorial by Malmivaara [1] (bolding mine):

Another concern is related to the conceptual heterogeneity of the eight items in calculating the decisive level for confidence in the point estimate. When conceptually different entities are summarized together, defining the precise nature of that confidence in the evidence becomes impossible. In addition, presentations of these different concepts obtain similar weights (− 1 or − 2, zero, + 1, + 2), which are then summarized to reach the final estimate for confidence in the results. I think that the conceptual and empirical basis for this weighting, the suggested threshold values for obtaining each weight, as well as the way of simply summarizing together values representing different concepts have not been justified in a satisfactory manner in the articles describing the GRADE method. In a paper on the overall ratings of confidence in effect estimates, the potential for arriving at a non-plausible grading of evidence in summing up the points of each of the validity criteria has been addressed (25). In these cases it is suggested that the gestalt of the confidence in estimates of effect is considered before arriving at the final decision. I think that the need for this recent additional guidance illustrates the problems related to the conceptual heterogeneity of the eight items and to the method for weighting these items and finally calculating the level of evidence.
It should be possible for a randomised controlled trial that has fatal flaws that make it uninterpretable to be downgraded to uninterpretable (no need for very low or grade 1 or anything) on the basis of any one flaw that is enough to reach that judgment. GRADE does not allow this and so is highly likely to produce false conclusions.

Jonathan Edwards said:
It should be possible for a randomised controlled trial that has fatal flaws that make it uninterpretable to be downgraded to uninterpretable (no need for very low or grade 1 or anything) on the basis of any one flaw that is enough to reach that judgment. GRADE does not allow this and so is highly likely to produce false conclusions.

Malmivaara does not suggest creating an uninterpretable "grade" but arguably hinges towards it; similarly to the thread on Busse et al.'s response where you mentioned that GRADE only allows rating down up to 2 grades at a time (not 3), he suggests that should be changed.

Finally the points are summarized together in order to reach the final rating of the quality of evidence. Randomized controlled trials start from a high level and observational studies start from a low level.

My concern in the calculation of the final confidence in the evidence is, as described above, firstly related to the potential of all the eight items to reflect accurately the trustworthiness of the causal relationships. In my opinion, only three of the criteria— risk of bias, inconsistency of the findings, and publication bias— are valid for an assessment of the grade of evidence in systematic reviews. Decreasing the grades of evidence based on these three criteria is thus justified—although there is a need to consider the weighting of these criteria. If the risk of bias in all randomized trials is very high indeed, decreasing the grade of evidence by only two grades (e.g. evidence quality changes from high level to low level) may not be enough, but rather the most appropriate decision could be to decrease the grade of evidence by three grades (e.g. evidence quality move from high level to very low level). An obvious example of this is a situation where all the 12 relevant quality items for a randomized trial (24) are considered, but no trial meets any of these or all trials include fatal flaws (e.g. more than 50% loss to follow-up).

His editorial is well worth a read, he analyzes the possible issues with each criterion in GRADE separately.

The aim of this article is, firstly, to describe the conceptual meaning of each of the eight GRADE criteria, and to consider their ability to increase or decrease confidence in estimates of outcome of a systematic review. A second aim is to consider the conceptual homogeneity of the GRADE criteria, the rationale for weighting the GRADE criteria, and the rationale for summarizing the values decided for each criterion in order to reach the overall rating of confidence in the effect estimate.

[1] Malmivaara A. Methodological considerations of the GRADE method. Ann Med. 2015;47(1):1-5. doi:10.3109/07853890.2014.969766

Jonathan Edwards · Mar 4, 2021

Michiel Tack said:
I think the Handbook said something like: we're not trying to tell you how to rate evidence, merely how to make your decision transparent.

Interesting: That would seem to make a mockery of Busse et al.'s claim that GRADE had been disastrously misapplied. Clearly the GRADE people think they are telling others how decisions should come out. Moreover, if GRADE is being used by technical staff who have no experience of the psychology of trials in real life then there must be a tacit assumption that they are expecting GRADE to guide them to the right conclusion.

Jonathan Edwards · Mar 4, 2021

I was hoping someone would disagree with me to point out where I am arguing badly. But still time for that!

FMMM1 · Mar 4, 2021

Jonathan Edwards said:
The point about giving some sort of positive value to evidence, however flawed, is also a salient one, yes. Any system like GRADE should recognise evidence that makes it highly likely that there was no effect as for PACE. GRADE deals with this by noting the consistency of any positive finding but that misses the opportunity to make use of strong evidence for no effect from individual studies.

Yes the "beauty" of what they do is note the remarkable consistency of studies which do not have objective outcome measures, and are unblinded, -- they then ignore the question is this the Hawthorne effect [https://en.wikipedia.org/wiki/Hawthorne_effect]. If you show an interest in people they respond positively Hawthorne effect.

If these guys were doing card tricks on a street corner we would have a grudging respect --- they are involved in health care and they are coming up with dodgy black boxes [GRADE] and supporting research which is fundamentally flawed.

cassava7 · Mar 4, 2021

Much like @Jonathan Edwards' criticisms, Irving et al. (2017) wrote a critical review that is not specific to GRADE but that focuses on it in 5 points: "(1) lack of information on validity and reliability, (2) poor concurrent validity, (3) may not account for external validity, (4) may not be inherently logical, (5) susceptibility to subjectivity" [1].

Norris and Bero (2016) highlight some of the same problems [2]:

Although 1 of the key strengths of GRADE is that it makes judgments transparent, implementation of the approach can be challenging. Various groups disagree on how to assess a body of evidence and the meaning and suitability of GRADE terminology (2). This assessment comes down to a single determination: high-, moderate-, low-, or very-low-quality evidence. Guideline panels can view this assessment in an oversimplified or overly precise manner, or they can ignore it altogether and base recommendations on other considerations.

The challenges of implementing GRADE affect its reliability. Although persons with extensive experience using GRADE have “substantial” interrater reliability when assessing the quality of a body of evidence (3), experienced evaluators (using a modified GRADE approach) have low interrater reliability when assessing complex bodies of evidence consisting of different study designs (4). An expert panel achieved only fair interrater agreement on the strength of recommendations based on a body of evidence (5).

The predictive validity of GRADE assessments of the certainty in effect estimates (the degree to which the assessment of high-, moderate-, low- or very-low-quality evidence predicts whether the effect estimate will change with further research) is reported as limited (6). When the quality of evidence was low, the data predicted that future studies had a smaller effect on the estimate of effect than anticipated by GRADE experts, whereas high-quality evidence was influenced more by new data than anticipated.

But the response to their concerns from the US GRADE Network seems to be that, to improve inter-rater reliability, raters should receive training on GRADE and use the GRADEpro software [2, Comments]:

First, we question the applicability of the study by Gartlehner et al. on the reproducibility of the GRADE framework.(2) This study focused on the unique approach used by Evidence-Based Practice centers, was based on systematic review authors with unclear training in GRADE methodology and did not employ GRADEpro software (https://gradepro.org). Other studies have shown that GRADE is reproducible when users, even those without expertise, received appropriate training.(3)

2. Gartlehner G, Dobrescu A, Evans TS, Bann C, Robinson KA, Reston J, et al. The predictive validity of quality of evidence grades for the stability of effect estimates was low: a meta-epidemiological study. J Clin Epidemiol. 2016;70:52-60.
3. Kumar A, Miladinovic B, Guyatt GH, Schunemann HJ, Djulbegovic B. GRADE guidelines system is reproducible when instructions are clearly operationalized even among the guidelines panel members with limited experience with GRADE. J Clin Epidemiol. 2016;75:115-8.

Kavanagh (2009), who has a similar position as @Jonathan Edwards on GRADE, comes to the same conclusion after repeating the issues above (external and internal consistency, not inherently logical, lack of validation although this may have evolved since, potential for bias) [3]:

What Should Replace GRADE?

A key question that arises when a system is questioned is: what is the alternative? There is a very good alternative to using the GRADE system to rate clinical guidelines: clinicians and organizations should use published guidelines while considering the clinical context, the credentials, and any conflicts of interest among the authors, as well as the expertise, experience, and education of the practitioner. If in the future a guideline grading system is shown to improve outcome and is without harm, it could usefully be incorporated into clinical practice.

[1] Irving M, Eramudugolla R, Cherbuin N, Anstey KJ. A Critical Review of Grading Systems: Implications for Public Health Policy. Eval Health Prof. 2017;40(2):244-262. doi:10.1177/0163278716645161 (free access: Sci-hub link)

[2] Norris SL, Bero L. GRADE Methods for Guideline Development: Time to Evolve?. Ann Intern Med. 2016;165(11):810-811. doi:10.7326/M16-1254

[3] Kavanagh BP. The GRADE system for rating clinical guidelines. PLoS Med. 2009;6(9):e1000094. doi:10.1371/journal.pmed.1000094

Jonathan Edwards · Mar 4, 2021

I see now what 'transparent' is supposed to mean - to have the reasoning explicit. But GRADE does not do this. It just requires that you say you downgraded one pip for bias and one pip for indirectness or whatever. Does it require you to say what your reasons are? I think it would be better simply to have a rule at NICE and Cochrane that reasons for evaluations must be given in full.

Jonathan Edwards · Mar 4, 2021

Thanks for the literature @cassava7. So I am re-inventing the wheel, but maybe if the wheel has been forgotten by the people that matter that is not so bad! Also sense a degree of polite restraint in some of the critique that I would want to blow away.

I don't understand quite what Kavanagh means here:
There is a very good alternative to using the GRADE system to rate clinical guidelines: clinicians and organizations should use published guidelines while considering the clinical context, the credentials, and any conflicts of interest among the authors, as well as the expertise, experience, and education of the practitioner.

What published guidelines should clinicians use? It seems not GRADE, but what then? The final conclusion seems to be not to have grading rules until they are of proven benefit and safety.

Kitty · Mar 4, 2021

Jonathan Edwards said:
The final conclusion seems to be not to have grading rules until they are of proven benefit and safety.

Clinical trial, then?

Jonathan Edwards · Mar 4, 2021

Having read Kavanagh I think I understand the sentence about guidelines. He is suggesting that clinicians should look at a recommendation in a guideline and then themselves judge the evidence on the basis of reading all the papers.

What is slightly puzzling about Kavanagh's account is that it is about grading recommendations rather than evidence. With organisations like NICE recommendations are heavily coloured by cost-effectiveness and resource considerations. And the clinician does not have the opportunity to make up their own mind and act on it. On a broader front for GPs Kavanagh's suggestion is no use because they do not have the time. What would make more sense would be to say that the clinicians issuing the guidelines should make up their own minds on the basis of reading the papers.

Jonathan Edwards · Mar 4, 2021

The other useful thing about Kavanagh is that he/she makes it clear that GRADE is not itself evidence-based. I would like to get more detail on that. Presumably some sort of testing process has been done but it sounds as if when it has that things have turned out inconsistent.

Who Agrees That GRADE is (a) unjustified in theory and (b) wrong in practice?

Senior Member (Voting Rights)

Senior Member (Voting Rights)

Senior Member (Voting Rights)

Senior Member (Voting Rights)

Senior Member (Voting Rights)

Moderator

Senior Member (Voting Rights)

Senior Member (Voting Rights)

Senior Member (Voting Rights)

Senior Member (Voting Rights)

Senior Member (Voting Rights)

Senior Member (Voting Rights)

Senior Member (Voting Rights)

Senior Member (Voting Rights)

Senior Member (Voting Rights)

Senior Member (Voting Rights)

Senior Member (Voting Rights)

Senior Member (Voting Rights)

Senior Member (Voting Rights)

Senior Member (Voting Rights)