Basic questions on terms and methodology used in clinical trials

MSEsperanza

Senior Member (Voting Rights)
As the title indicates -- a thread for lay people like me who have no or only superficial knowledge of trial methodology and statistics to ask some basic questions and hopefully get answers from more knowledgeable forum members.
 
Last edited:
From the wording on [edit] some of the questionnaires used in the PACE trial it appears that these weren't filled in by the participants themselves but by study staff in a face-to-face situation with the participant.

So study staff read the questions to the patients and fill in their answers.

I wonder how common that is/was in studies in general, and who fills in the questionnaires in such settings?

Does 'blinding of assessors' mean only those who actually analyze the data are blinded to the trial arms, or also those who sit together with the patients to fill in the answers?

Edit: So my question mainly is: Is it possible that therapists or other staff who also see the patients on other occasions actually fill in the questionnaires together with the patient? And could the investigators in that case still say that the assessors were blinded?


Edit 2: For clarification see the edit in the first line and this post on the PACE trial discussion -- and also:

Thanks @Pustekuchen

So Research Nurses were not blinded to the trial arms and the investigators state that "all our primary and secondary outcomes are therefore either self-rated or objective in order to minimise observer bias."

But I think that is all very 'relative', because the Research Nurses seem to also have the role of observers, e.g. with the 6-minute walking test and other assessments they filled in [*], and they could also easily influence even how the patients filled in the self-rated questionnaires:

"Those participants who cannot attend clinic will be offered home assessments (or failing this assessment by telephone or by post). Before second and consequent RN assessments, self-rated measures will be posted to the participant prior to the visit and checked for completion at assessment by the RN."

"When the participant does not attend a research interview, the RN will send the self-rated questionnaires to the participant's home address, with a stamped addressed envelope. If questionnaires are not received back within a week/ the RN will arrange to visit the participant at home and oversee completion of the questionnaires."

Figure 9: 10.2.2 Table of research assessments by time points, PACE Protocol, p. 52-53
 
Last edited:
Does 'blinding of assessors' mean only those who actually analyze the data are blinded to the trial arms, or also those who sit together with the patients to fill in the answers?

With these trials I don't think anything can be taken for granted as meaning what it seems to mean. Assessors are normally those who collect and record outcomes face to face with patients. In a situation in which the patient knows what sort of treatment they had, one can more or less assume the assessor will get to know, or suspect, pretty well.

Blinding is always difficult, and quite problematic even for drug trials with apparently identical placebos. All you need is for a clue to be available and the whole thing is busted.
 
I wonder how common that is/was in studies in general, and who fills in the questionnaires in such settings?
Reportedly this is common in IAPT. I think that wherever particular outcomes are expected, by cheating if necessary, this is likely common practice. It's not as if it makes much of a difference anyway: the very act of limiting outcomes to those biased questionnaires is already halfway to having them filled in by the people whose job performance depends on getting the expected answers.

Everything BPS has this in bulk. Hell, it includes entirely redefining the patients' problems; filling in questionnaires barely rates as suspicious at this point. It's pretty much the whole thing, by proxy of selecting questions and answers that narrow everything down to issues of no relevance to the patient. If the questions relevant to patients aren't asked, and the available answers don't include the ones patients would give if the option were there, it's pretty much as if the questionnaire had already been filled in.

This approach is very common in politics. The questions take the form of: "Do you support the end of all good things from the horrible opposition candidate, or maybe this particular policy which vaguely sounds good but is only a talking point for electoral purposes?"

Same with answering questions that weren't asked. Or pointing out irrelevant issues as distraction. The overlap with politics is just absurd.
 
Last edited:
IAPT is a UK NHS system for providing cheap, easy-to-access psychological therapy using under-qualified therapists. It costs billions and claims a high success rate, but has a high drop-out rate, and people who have looked into it say it's not fit for purpose.
 
This is a bit off-topic but still...

I wonder why quite a few of the more recent trials investigating CBT for illnesses and conditions often labelled as MUS, and also for co-morbid fatigue, depression and anxiety (alleged or real) in established biomedical illness, failed to produce results as -- allegedly -- 'good' as those of the previous trials.

I'm aware that in their papers the investigators still mostly try to twist the -- even by their own standards -- null results, and continue to twist them when replying to justified critique, e.g. by stating that they reported the null results in other parts of the paper, just not in the abstract -- or something along these lines.

So my question is:

Do the more recent trials have better trial methodology? Do they use different outcomes? Are they larger than the previous ones?

Also, is there a similar trend with trials investigating exercise?

@dave30th @ME/CFS Skeptic
 
Last edited:
I wonder why quite a few of the more recent trials investigating CBT for illnesses and conditions often labelled as MUS, and also for co-morbid fatigue, depression and anxiety (alleged or real) in established biomedical illness, failed to produce results as -- allegedly -- 'good' as those of the previous trials.
Do you have concrete examples?

Remember that, despite their flaws and biases, PACE, FINE, GETSET, FITNET etc. also did not produce good results, especially at long-term assessments.
 
Do you have concrete examples?

Sorry, that was just from memory, but some examples should be here:
Some recent examples, all covered by David Tuller and various co-authors, are summarized here:

"In the academic realm, the Journal of Health Psychology published a rebuttal I co-authored with a colleague of a high-profile paper by Professor Trudie Chalder, one of the lead PACE authors, in the Journal of the Royal Society of Medicine. Another journal, Psychological Medicine, published correspondence I co-wrote with a very smart patient to yet another example of misleading research from Professor Trudie Chalder [on transdiagnostic CBT for persistent physical symptoms].

"*The Journal of Psychosomatic Research published a correction of a study from Professor Peter White, another lead PACE author, based on my complaint. The study, a follow-up to a trial of a self-help graded exercise therapy course, found no benefits for the intervention, but the highlights section failed to mention this inconvenient fact. Now it does."
. "

And more of the same sort -- about:

  • Non-epileptic Seizures:
https://www.virology.ws/2020/06/26/...-commentary-promotes-eminence-based-medicine/

  • Irritable Bowel Syndrome:
https://www.virology.ws/2020/01/28/trial-by-error-more-on-the-mahana-therapeutics-deal/


If I remember correctly, there were more examples like these, including a study led by Sharpe on cancer patients.


Remember that, despite their flaws and biases, PACE, FINE, GETSET, FITNET etc. also did not produce good results, especially at long-term assessments.
Yes, I realize that -- but if I understood correctly, some of the more recent results are still much weaker even by these types of studies' standards, and the investigators' claims even bolder in relation to the actual data?
 
Perhaps the outcome "dissociative seizure frequency" is less subjective and less likely to be affected by response bias than a fatigue questionnaire.

In the IBS trial, patients only received web-based or telephone CBT, which may produce weaker response bias than face-to-face CBT.

On the other hand, I noticed a standardised mean difference of 0.65 for telephone CBT for IBS-SSS at 12 months, which is bigger than what was found in the PACE and FINE trials if I recall correctly.
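
For anyone unfamiliar with the term: a standardised mean difference (often reported as Cohen's d) is just the difference in group means divided by the pooled standard deviation. A minimal sketch with made-up numbers (not the actual trial data), just to show what a value of 0.65 corresponds to:

Python:
# Standardised mean difference (Cohen's d): difference in means divided by the
# pooled standard deviation. All numbers below are invented for illustration.
import math

def smd(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """SMD of group A versus group B, using the pooled SD."""
    pooled_sd = math.sqrt(((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2))
    return (mean_a - mean_b) / pooled_sd

# Hypothetical IBS-SSS scores at 12 months (lower = less severe):
# control arm mean 280 (SD 100, n = 100), CBT arm mean 215 (SD 100, n = 100)
print(smd(280, 100, 100, 215, 100, 100))  # prints 0.65

By the usual rule of thumb an SMD of 0.2 is called small, 0.5 medium and 0.8 large -- though in an unblinded trial with subjective outcomes part of such a difference can of course be response bias.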

investigators' claims even bolder in relation to the actual data?
Not sure about this, given that the PACE trial authors claimed that GET and CBT could help patients recover from ME/CFS.
 
the investigators' claims even bolder in relation to the actual data?

I do sometimes wonder if part of the problem is that not enough people actually interrogate the data.

If enough studies publish puffed up results that are more narrative than data-based, it could start to appear to purse-string holders (who may not have any expertise in the condition being researched) as if there's something in what they're saying.
 
Perhaps the outcome "dissociative seizure frequency" is less subjective and less likely to be affected by response bias than a fatigue questionnaire.

In the IBS trial, patients only received web-based or telephone CBT, which may produce weaker response bias than face-to-face CBT.

On the other hand, I noticed a standardised mean difference of 0.65 for telephone CBT for IBS-SSS at 12 months, which is bigger than what was found in the PACE and FINE trials if I recall correctly.

Not sure about this, given that the PACE trial authors claimed that GET and CBT could help patients recover from ME/CFS.

Thanks.

So most likely not a trend towards better trial methodology in psychosomatic research on so-called MUS and functional illness, but probably just a consequence of the increasing number of studies on diverse conditions -- so also an increased probability of incoherent data, plus an increased probability that they will use more objective measures for specific illnesses where an absence of objective measures would be too apparent an omission?
 
Last edited:
I'm trying to understand the following paragraph in a methods handbook on the evaluation of statistical significance:

"A range of aspects should be considered when interpreting p-values. It must be absolutely clear which research question and data situation the significance level refers to, and how the statistical hypothesis is formulated.

"In particular, it should be evident whether a one- or two-sided hypothesis applies [61] and whether the hypothesis tested is to be regarded as part of a multiple hypothesis testing problem [713].

Does anyone feel up to giving examples of one-sided and two-sided hypotheses, and of a multiple hypothesis testing problem?


Source and context:
Institute for Quality and Efficiency in Health Care (IQWiG) / General Methods -
https://www.iqwig.de/methoden/general-methods_version-6-1.pdf



[Institute for Quality and Efficiency in Health Care (IQWiG), General Methods, p. 170]


9.3 Specific statistical aspects

9.3.1 Description of effects and risks


The description of intervention or exposure effects needs to be clearly linked to an explicit outcome variable. Consideration of an alternative outcome variable also alters the description and size of a possible effect. The choice of an appropriate effect measure depends in principle on the measurement scale of the outcome variable in question. For continuous variables, effects can usually be described using mean values and differences in mean values (if appropriate, after appropriate weighting). For categorical outcome variables, the usual effect and risk measures of 2x2 tables apply [45]. Chapter 10 of the Cochrane Handbook for Systematic Reviews of Interventions [161] provides a well-structured summary of the advantages and disadvantages of typical effect measures in systematic reviews. Agresti [10,11] describes the specific aspects to be considered for ordinal data.


It is essential to describe the degree of statistical uncertainty for every effect estimate. For this purpose, the calculation of the standard error and the presentation of a confidence interval are methods frequently applied. Whenever possible, the Institute will state appropriate confidence intervals for effect estimates, including information on whether one- or two-sided confidence limits apply, and on the confidence level chosen. In medical research, the two-sided 95% confidence level is typically applied; in some situations, 90% or 99% levels are used. Altman et al. [19] give an overview of the most common calculation methods for confidence intervals. In order to comply with the confidence level, the application of exact methods for the interval estimation of effects and risks should be considered, depending on the particular data situation (e.g. very small samples) and the research question posed. Agresti [12] provides an up-to-date discussion on exact methods.


9.3.2 Evaluation of statistical significance

With the help of statistical significance tests it is possible to test hypotheses formulated a priori with control for type 1 error probability. The convention of speaking of a “statistically significant result” when the p-value is lower than the significance level of 0.05 (p < 0.05) may often be meaningful. Depending on the research question posed and hypothesis formulated, a lower significance level may be required. Conversely, there are situations where a higher significance level is acceptable. The Institute will always explicitly justify such exceptions. A range of aspects should be considered when interpreting p-values. It must be absolutely clear which research question and data situation the significance level refers to, and how the statistical hypothesis is formulated. In particular, it should be evident whether a one- or two-sided hypothesis applies [61] and whether the hypothesis tested is to be regarded as part of a multiple hypothesis testing problem [713]. Both aspects, whether a one- or two-sided hypothesis is to be formulated, and whether adjustments for multiple testing need to be made, are a matter of repeated controversy in scientific literature [240,430].


[Institute for Quality and Efficiency in Health Care (IQWiG), General Methods, p. 171]

Regarding the hypothesis formulation, a two-sided test problem is traditionally assumed. Exceptions include non-inferiority studies. The formulation of a one-sided hypothesis problem is in principle always possible, but requires precise justification. In the case of a one-sided hypothesis formulation, the application of one-sided significance tests and the calculation of one-sided confidence limits are appropriate. For better comparability with two-sided statistical methods, some guidelines for clinical trials require that the typical significance level should be halved from 5% to 2.5% [371]. The Institute generally follows this approach. The Institute furthermore follows the central principle that the hypothesis formulation (one- or two-sided) and the significance level must be specified clearly a priori. In addition, the Institute will justify deviations from the usual specifications (one-sided instead of two-sided hypothesis formulation; significance level unequal to 5%, etc.) or consider the relevant explanations in the primary literature.

If the hypothesis investigated clearly forms part of a multiple hypothesis problem, appropriate adjustment for multiple testing is required if the type I error is to be controlled for the whole multiple hypothesis problem [53]. The problem of multiplicity cannot be solved completely in systematic reviews, but should at least be considered in the interpretation of results [48]. If meaningful and possible, the Institute will apply methods to adjust for multiple testing. In its benefit assessments (see Section 3.1), the Institute attempts to control type I errors separately for the conclusions on every single benefit outcome. A summarizing evaluation is not usually conducted in a quantitative manner, so that formal methods for adjustment for multiple testing cannot be applied here either.

The Institute does not evaluate a statistically non-significant finding as evidence of the absence of an effect (absence or equivalence) [17]. For the demonstration of equivalence, the Institute will apply appropriate methods for equivalence hypotheses. In principle, Bayesian methods may be regarded as an alternative to statistical significance tests [670,671]. Depending on the research question posed, the Institute will, where necessary, also apply Bayesian methods (e.g. for indirect comparisons, see Section 9.3.8).
 
Does anyone feel up to giving examples of one-sided and two-sided hypotheses, and of a multiple hypothesis testing problem?
It notes that "Regarding the hypothesis formulation, a two-sided test problem is traditionally assumed. Exceptions include non-inferiority studies". So almost everything is a two-sided hypothesis.

I think it's two-sided in the sense that you're asking whether a treatment is superior or inferior to the other, so you look for statistical significance in two directions -- whether it's statistically significantly better, or statistically significantly worse.

I think a non-inferiority study is one-sided, because there you're only interested in whether the treatment is inferior to the other or not. A non-inferiority study has a different design, where you look at whether the effect of the treatment you're interested in is close enough to the other to count as "non-inferior", so you're just looking to see whether it is statistically significantly inferior or not.

There's also an explanation of these ideas here (minus the part about non-inferiority studies).
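
To make that concrete, here is a rough sketch (made-up data, nothing to do with any real trial) of the same two groups compared with a two-sided test ("do the means differ in either direction?") and a one-sided test ("is the treatment group better?"):

Python:
# Two-sided vs one-sided t-test on invented data. When the observed difference
# lies in the hypothesised direction, the one-sided p-value is half the
# two-sided one -- which is why guidelines often halve the significance level
# (5% -> 2.5%) for one-sided tests.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
treatment = rng.normal(loc=52, scale=10, size=50)  # hypothetical outcome scores
control = rng.normal(loc=48, scale=10, size=50)

t, p_two_sided = stats.ttest_ind(treatment, control, alternative='two-sided')
t, p_one_sided = stats.ttest_ind(treatment, control, alternative='greater')

print(p_two_sided, p_one_sided)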

If you're testing for an effect in two directions rather than just one, there are two chances of finding an effect purely by chance, so the significance level has to be shared between the two directions. That is also why the guidelines quoted above ask for the usual 5% level to be halved to 2.5% when a one-sided test is used, so that the two approaches stay comparable.

A multiple hypothesis testing problem arises, for instance, when you have multiple primary outcomes in a trial (and therefore you're testing multiple hypotheses). You're then more likely to find an apparent effect by chance, so you need to adjust for that by applying a stricter standard of statistical significance (to avoid a type 1 error, which just means a false positive).
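
A rough sketch of why that matters, again with purely made-up data: with 10 outcomes and no true effect at all, the chance of at least one 'significant' result at p < 0.05 is roughly 40%, and a simple Bonferroni correction (testing each outcome at 0.05/10) brings the overall false-positive risk back down to about 5%:

Python:
# Multiple testing with no true effects: both groups are drawn from the same
# distribution, so any "significant" outcome is a false positive.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_outcomes, alpha = 10, 0.05

# Probability of at least one false positive across 10 independent tests
print(1 - (1 - alpha) ** n_outcomes)  # about 0.40

# Bonferroni correction: only call an outcome significant if p < alpha / n_outcomes
p_values = [stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
            for _ in range(n_outcomes)]
print([p < alpha / n_outcomes for p in p_values])  # usually all False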
 
Just for further info about that question:

From the wording on [edit] some of the questionnaires used in the PACE trial it appears that these weren't filled in by the participants themselves but by study staff in a face-to-face situation with the participant.

So study staff read the questions to the patients and fill in their answers.

I wonder how common that is/was in studies in general, and who fills in the questionnaires in such settings?

Does 'blinding of assessors' mean only those who actually analyze the data are blinded to the trial arms, or also those who sit together with the patients to fill in the answers?

Edit: So my question mainly is: Is it possible that therapists or other staff who also see the patients on other occasions actually fill in the questionnaires together with the patient? And could the investigators in that case still say that the assessors were blinded?

Cross-posting from the PACE trial discussion thread:

Thanks @Pustekuchen

So Research Nurses were not blinded to the trial arms and the investigators state that "all our primary and secondary outcomes are therefore either self-rated or objective in order to minimise observer bias."

But I think that is all very 'relative', because the Research Nurses seem to also have the role of observers, e.g. with the 6-minute walking test and other assessments they filled in [*], and they could also easily influence even how the patients filled in the self-rated questionnaires:

"Those participants who cannot attend clinic will be offered home assessments (or failing this assessment by telephone or by post). Before second and consequent RN assessments, self-rated measures will be posted to the participant prior to the visit and checked for completion at assessment by the RN."

"When the participant does not attend a research interview, the RN will send the self-rated questionnaires to the participant's home address, with a stamped addressed envelope. If questionnaires are not received back within a week/ the RN will arrange to visit the participant at home and oversee completion of the questionnaires."

Figure 9: 10.2.2 Table of research assessments by time points, PACE Protocol, p. 52-53


(Apologies @rvallee for any confusion, and also for giving you a false alert a couple of days ago: I couldn't find the related information at the time, but shortly after posting a question quoting a post of yours I found the info myself and posted it here. :ill: :sleeping:)
 
I'm trying to understand the following paragraph in a methods handbook on the evaluation of statistical significance:

"A range of aspects should be considered when interpreting p-values. It must be absolutely clear which research question and data situation the significance level refers to, and how the statistical hypothesis is formulated.

"In particular, it should be evident whether a one- or two-sided hypothesis applies [61] and whether the hypothesis tested is to be regarded as part of a multiple hypothesis testing problem [713].

Does anyone feel up to giving examples of one-sided and two-sided hypotheses, and of a multiple hypothesis testing problem?


Source and context:

The following page is quite good at explaining this: https://stats.oarc.ucla.edu/other/m...nces-between-one-tailed-and-two-tailed-tests/

Scroll down to the section on when to use a one-tailed test and the paragraph below it (when not to use one); it's about halfway down. I didn't want to copy too many paragraphs (not on the ball enough right now about what's allowed), but I think at least these three are relevant to your question:

When is a one-tailed test appropriate?
Because the one-tailed test provides more power to detect an effect, you may be tempted to use a one-tailed test whenever you have a hypothesis about the direction of an effect. Before doing so, consider the consequences of missing an effect in the other direction. Imagine you have developed a new drug that you believe is an improvement over an existing drug. You wish to maximize your ability to detect the improvement, so you opt for a one-tailed test. In doing so, you fail to test for the possibility that the new drug is less effective than the existing drug. The consequences in this example are extreme, but they illustrate a danger of inappropriate use of a one-tailed test.
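
A toy simulation of the danger described in that last paragraph (entirely invented numbers, not from any real trial): here the new drug is actually worse, but a one-sided test that only asks "is the new drug better?" cannot flag that, while a two-sided test can:

Python:
# The new drug is truly worse (lower mean score, where higher = better), yet
# the one-sided "new drug is better" test returns a large, unremarkable p-value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
new_drug = rng.normal(loc=45, scale=10, size=60)
old_drug = rng.normal(loc=50, scale=10, size=60)

t, p_better = stats.ttest_ind(new_drug, old_drug, alternative='greater')
t, p_either = stats.ttest_ind(new_drug, old_drug, alternative='two-sided')

print(p_better)  # large: the one-sided test sees nothing to report here
print(p_either)  # much smaller: the two-sided test can pick up that the drugs differ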
 
Thanks.

So most likely not a trend towards better trial methodology in psychosomatic research on so-called MUS and functional illness, but probably just a consequence of the increasing number of studies on diverse conditions -- so also an increased probability of incoherent data, plus an increased probability that they will use more objective measures for specific illnesses where an absence of objective measures would be too apparent an omission?

Sorry for muddled thinking and wording -- post now too old to delete.

Still, I think it would be worthwhile to do a review of trial design, outcomes (subjective/objective), and reporting on outcomes in the field of research on treatments for ME/CFS, MUS, and maybe also comorbid depression in 'established' medical disease.

Perhaps it would even make sense to include both behavioral and drug interventions?

Difficult to think of a proper hypothesis -- as I probably have too many questions and assumptions both about trial methodology and about certain proponents at once. (see the discussion with @ME/CFS Skeptic above and also the questions I posted on a more recent members-only thread.)

But maybe 'just' an extended and contextualized version of the Tack et al paper on bias due to a lack of blinding in ME/CFS treatment trials and Jonathan Edwards' NICE expert testimony?

(Context = how is the bias due to reliance on subjective outcomes in unblindable trials discussed and addressed by other researchers in various fields, including psychology, (neuro-)psychiatry and physical therapy?).
 
Last edited:
About the terms 'prospective' and 'retrospective' in observational studies in epidemiology.

Not up to wording my question properly, so I'll just leave this quote from the STROBE Initiative here.

Strengthening the Reporting of Observational Studies in Epidemiology (STROBE):

"We recommend that authors refrain from simply calling a study ‘prospective' or ‘retrospective' because these terms are ill defined [29].

"One usage sees cohort and prospective as synonymous and reserves the word retrospective for case-control studies [30].

"A second usage distinguishes prospective and retrospective cohort studies according to the timing of data collection relative to when the idea for the study was developed [31].

" A third usage distinguishes prospective and retrospective case-control studies depending on whether the data about the exposure of interest existed when cases were selected [32].

"Some advise against using these terms [33], or adopting the alternatives ‘concurrent' and ‘historical' for describing cohort studies [34].

" In STROBE, we do not use the words prospective and retrospective, nor alternatives such as concurrent and historical. We recommend that, whenever authors use these words, they define what they mean. Most importantly, we recommend that authors describe exactly how and when data collection took place."


Source:
Vandenbroucke JP, von Elm E, Altman DG, Gøtzsche PC, Mulrow CD, Pocock SJ, Poole C, Schlesselman JJ, Egger M; STROBE Initiative. Strengthening the Reporting of Observational Studies in Epidemiology (STROBE): explanation and elaboration. PLoS Med. 2007 Oct 16;4(10):e297. doi: 10.1371/journal.pmed.0040297.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2020495/
 
Related to the discussion on developing assessment tools for monitoring disease activity/impact/disability --

Why is the term 'psychometric' used for scales that also measure physical symptoms/symptom burden/disability?

Is a simple visual analog pain rating scale or a symptom diary also a psychometric tool?


See discussion on research for a new clinical assessment toolkit in NHS ME/CFS specialist services here.

And a paper mentioned there on developing a patient-reported-outcome-measure scale here.
 