Assessing Functioning in adolescents with Chronic Fatigue Syndrome: Psychometric properties and Factor Structure of SSAS & SF36 PF, 2020, Loades

Discussion in 'Psychosomatic research - ME/CFS and Long Covid' started by Dolphin, Feb 22, 2020.

  1. Lucibee

    Lucibee Senior Member (Voting Rights)

    Messages:
    1,498
    Location:
    Mid-Wales
    I mean *lack of* linearity is going to be a problem. The authors themselves noted that the SF36 PF subscale split into 2 distinct factors. As the items within each factor group are fairly well correlated, they will have to assume that each score on each type of factor will correlate/match at least on severity. You don't want someone scoring 20 on one set of factors being equivalent to a score of 40 on the other set. But I don't know how you would account for that without there being a set of standard objective measures you can test that against.

    And then there's linearity of scale. Is the difference between a score of 10 and 20 the same as the difference between 80 and 90?

    If these scales are simply used as a rough idea of how disabled someone is, or as a summary measure of a population, then there's not so much of a problem. However, if you are using them to make direct comparisons between people, then there might be, because one person's score may not be directly equivalent to another's.
     
  2. ME/CFS Skeptic

    ME/CFS Skeptic Senior Member (Voting Rights)

    Messages:
    4,002
    Location:
    Belgium
    It seems that these are problems with all questionnaires, not only the ones used here (and probably with other outcome measures as well). I suspect it's difficult to get around this problem. You can't measure physical function directly, so one will have to use questions and a scoring system that approximates it as well as possible.
     
    Invisible Woman and Lucibee like this.
  3. Trish

    Trish Moderator Staff Member

    Messages:
    55,414
    Location:
    UK
    There is also the potentially dramatic effect of persuasion. All it needs is for the therapist to persuade people to interpret differently the level of difficulty they have with each item on the SF-36 scale.

    Take someone with mild to moderate ME who says they have some difficulty with half the items (5x5), and a lot of difficulty with the other half (5x0). Their score is 25.

    Give them a course of therapy that persuades them that what they are experiencing is normal aches and pains and tiredness, perhaps comparing themselves with people who are bedbound, and getting them to focus on how much more they can do than that.

    Persuade them that their 'some difficulty' is normal and should be classed as no difficulty (5x10) and their idea of a lot of difficulty is exaggeration, and is really just some difficulty (5x5). Their score is now 75.

    Miracle cure!
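    The arithmetic above can be sketched as a quick calculation. (This uses the informal per-item point values from the post, 0 for "a lot of difficulty", 5 for "some", 10 for "none", summed over the 10 PF items; it is not the official SF-36 scoring algorithm.)

    ```python
    # Sketch of the scoring shift described above. Per-item values follow
    # the post, not the official SF-36 transformation:
    # "a lot of difficulty" = 0, "some difficulty" = 5, "no difficulty" = 10,
    # summed over the 10 physical functioning items.

    POINTS = {"a lot": 0, "some": 5, "none": 10}

    def pf_score(answers):
        """Total score for a list of 10 per-item answers."""
        return sum(POINTS[a] for a in answers)

    before = ["some"] * 5 + ["a lot"] * 5   # pre-therapy interpretation
    after  = ["none"] * 5 + ["some"] * 5    # same impairment, reframed

    print(pf_score(before))  # 25
    print(pf_score(after))   # 75
    ```

    Nothing about the patient's actual capacity changes between the two rows; only the mapping from experience to answer category shifts.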

    As Lucibee says, as a general indicator of population and individual levels of disability, SF-36 is useful. It is also useful as an indicator of disability levels between different patient populations, and the full SF-36, with all the other things like social and emotional functioning, is useful in homing in on where an individual's or patient group's main area of difficulty lies.

    But using individual patients' changes in SF-36 PF over time, such as in a clinical trial, is fraught with traps, which as far as I'm concerned means all the fancy statistical analyses in the world won't make it a reliable or valid measure of how ill or how disabled someone with ME is and whether the treatment has been effective. It's far too subjective.
     
  4. ME/CFS Skeptic

    ME/CFS Skeptic Senior Member (Voting Rights)

    Messages:
    4,002
    Location:
    Belgium
    I don't think I agree with this. The problem you sketched is mostly a problem of clinical trial design, where you have to control for other factors that might influence how participants fill in the questionnaire, through blinding, an adequate control condition, etc. In a standard blinded RCT with a drug and placebo group, this shouldn't be an issue.

    I think I prefer something like the SF-36 PF subscale because it asks patients directly what (specific) activity they can or can't do, which I think will result in less subjective and more reliable answers than if you ask them to rate the severity of a symptom or impairment on a larger scoring scale. The SF-36 PF subscale is also not too long and therefore easily interpretable. There is this recent trial on intranasal mechanical stimulation, where the authors reported improvements on an ME symptom rating scale. But that's a scale that asks about multiple symptoms, using a 5-point scale from 0-4 (none, light, moderate, severe, very severe). That makes the result more difficult to interpret, and it probably makes the issues of non-linearity you sketched above even more problematic.

    In short, I don't see the problem with using the SF-36 as an outcome measure in treatment trials for ME/CFS or to measure disability compared to other patient groups.
     
  5. Snow Leopard

    Snow Leopard Senior Member (Voting Rights)

    Messages:
    3,860
    Location:
    Australia
    Yes, the key issue of lack of linearity is that if you were to, say, worsen slightly overall, despite improving on a particular question, you may end up getting the same or an improved score despite the worsening. This means that overall, the set of questions does not fulfil the requirements of being a scale.
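    The masking effect described here can be illustrated with a small sketch, again using the informal 0/5/10 per-item values from earlier in the thread (not the official SF-36 scoring): a patient worsens on two items but improves a lot on one, and the summed score is unchanged.

    ```python
    # Hypothetical illustration: summing item scores can hide worsening.
    # Per-item values follow the thread's informal scheme, not the
    # official SF-36 scoring algorithm.

    POINTS = {"a lot": 0, "some": 5, "none": 10}

    def pf_score(answers):
        return sum(POINTS[a] for a in answers)

    # Worse on the first two items, much better on the third,
    # unchanged on the remaining seven.
    baseline  = ["none", "none", "a lot"] + ["some"] * 7
    follow_up = ["some", "some", "none"]  + ["some"] * 7

    print(pf_score(baseline))   # 55
    print(pf_score(follow_up))  # 55 - identical total despite the changes
    ```

    The total is the same at both time points even though the patient got worse on two of the three items that changed, which is the sense in which the summed questionnaire fails to behave as a proper scale.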
     
  6. ME/CFS Skeptic

    ME/CFS Skeptic Senior Member (Voting Rights)

    Messages:
    4,002
    Location:
    Belgium
    Perhaps things would be clearer if you could give an example of an outcome measure that doesn't have this problem, because I still don't seem to get it.

    If you ask patients to rate their physical functioning on a scale from 1-10, patients could be making the same consideration as you sketched above. They could reflect on how most aspects of their physical functioning (for example walking, lifting things etc.) got worse but that one aspect got a lot better (for example getting up from bed) and so give a score that is the same or an improvement, despite worsening on most aspects of physical functioning.
     
  7. Snow Leopard

    Snow Leopard Senior Member (Voting Rights)

    Messages:
    3,860
    Location:
    Australia
    Some are much less likely to have this problem, but regardless, the point is that PROMS used on their own lead to problems of interpretation in prospective studies (including clinical trials).
     
    alktipping and ME/CFS Skeptic like this.
  8. ME/CFS Skeptic

    ME/CFS Skeptic Senior Member (Voting Rights)

    Messages:
    4,002
    Location:
    Belgium
    I suppose that short questionnaires that focus on a particular issue (not overall impairment) and don't have large scoring ranges would be better in this regard.

    I agree it's often better to have a combination of objective measurements and questionnaires for what you want to measure, but that also makes trials more costly and difficult to do.

    Suppose a researcher has some reason to think drug X will provide symptom relief for brain fog in ME/CFS and he wants to do a small blinded RCT to test it. In such cases, I think a relatively short questionnaire that asks patients about specific cognitive issues would be the preferred primary outcome measure. If the questionnaire hasn't got any notable issues (like questions that don't make sense or ceiling effects) I don't think it would cause many problems of interpretation, to be honest. Given the many things that can go wrong with such studies, the imperfect linearity of the scale is probably going to be low on the list of things to worry about.

    I write this as someone who has no particular expertise in this subject and who would like to know more about it, so apologies for my frankness.
     
  9. Trish

    Trish Moderator Staff Member

    Messages:
    55,414
    Location:
    UK
    The example you gave where this scale might be a useful measure of change with treatment was a double blinded trial of a medication. I agree subjective measures like this can be useful in that context because the blinding and lack of psychological persuasion mean the scores are more likely to be consistent. (And it's certainly more useful as a measure of ME severity than the ridiculous CFQ).

    But that's a world away from the situation where it is usually used for ME - unblinded psychological trials. The fact that Chalder and co, some of the worst offenders in this regard, are using fancy stats here to pretend they have proved the measure is reliable and valid without giving the contexts in which this might be true is troubling.

    I contend that any claim to reliability and validity in the context of the trials that group carry out is just plain wrong. The reliability goes out the window when persuasion is involved.

    Objective measures like employment/school attendance, 2-day CPET, actometers, fitness and cognitive tests, and tests of stamina would be preferable.
     
    Last edited: Feb 23, 2020
    Amw66, alktipping and ME/CFS Skeptic like this.
  10. Sly Saint

    Sly Saint Senior Member (Voting Rights)

    Messages:
    9,922
    Location:
    UK
    I don't see why they don't use standard physical fitness tests. Particularly as they are so certain that exercise can't cause any harm.

    eg I found this study
    Reliability of health-related physical fitness tests in
    European adolescents. The HELENA Study
    https://s3.amazonaws.com/academia.e...5a5d21f377334f6f0fbee8f3d5960f353980f3c531292

     
    MEMarge and ME/CFS Skeptic like this.
  11. ME/CFS Skeptic

    ME/CFS Skeptic Senior Member (Voting Rights)

    Messages:
    4,002
    Location:
    Belgium
    I think we fully agree on this Trish, it's just that I would label these problems as issues with trial design (bias) rather than the reliability or validity of questionnaires.

    I think that the term reliability has a specific meaning in this context, namely that if the same patient fills in the same questionnaire, he gets more or less consistent results; otherwise the questionnaire is not considered reliable. It's also useful to know how the individual questions of the questionnaire correlate with each other and with other, more objective outcome measures. Those are basic properties of questionnaires that need to be checked and could be useful for future researchers who want to test drug therapies in ME/CFS. In short, I didn't interpret this study as an attempt to claim that the results found by this group in randomized trials are reliable - because that's another question, one about bias and trial design.
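    A minimal sketch of test-retest reliability in the sense described here: the same patients complete the same questionnaire twice, and we check how strongly the two sets of scores correlate. (The scores below are hypothetical; real studies typically report an intraclass correlation rather than a plain Pearson correlation.)

    ```python
    # Test-retest reliability sketch with made-up SF-36 PF scores.
    # A high correlation between the two administrations is taken as
    # evidence of reliability in this narrow sense.

    def pearson(x, y):
        """Pearson correlation coefficient between two equal-length lists."""
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = sum((a - mx) ** 2 for a in x) ** 0.5
        sy = sum((b - my) ** 2 for b in y) ** 0.5
        return cov / (sx * sy)

    time1 = [25, 40, 55, 30, 70, 45]  # hypothetical scores, first administration
    time2 = [30, 35, 60, 30, 65, 50]  # same patients, two weeks later

    r = pearson(time1, time2)
    print(round(r, 2))  # close to 1, i.e. consistent retest scores
    ```

    The catch raised in this thread is that a high value here, measured with no intervention in between, says nothing about whether scores remain comparable once an intervention has changed how patients interpret the questions.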

    I think these have the same issues we were discussing.

    The hours of employment might decrease without the patient doing significantly worse (they could, for example, be doing more unpaid work).

    Cognitive tests generally correlate poorly with the cognitive problems that patients report (I tend to believe patients more than the tests).

    Many ME/CFS patients have relatively normal CPET (a stamina test) results despite being severely ill. The 2-day CPET studies only show a consistent decline for workload at the ventilatory threshold and the studies are too small to say this is robust (or what it actually means) while most other measurements (like VO2 or maximal workload) look relatively normal.

    Actimeters have a lot of problems too: are patients wearing them consistently, are these influenced by simple wrist movements etc. etc.
     
    Theresa and Trish like this.
  12. Snow Leopard

    Snow Leopard Senior Member (Voting Rights)

    Messages:
    3,860
    Location:
    Australia
    One of the points Trish and I are trying to make is that the reliability of these questionnaires tested outside the context of prospective studies does not indicate their reliability within the context of a prospective study.

    This is not merely an issue of trial design, but a problem with patient rated outcome measures themselves.

    CPETs are not stamina tests. Having said that, many participants (in my opinion) don't reach a true VO2max on the tests, indicated by relatively low maximum heart rates for their age. VO2max itself is simply a measure of how much oxygen can be delivered to the muscles, and therefore a measure of cardiovascular fitness. The maximal workload on CPET tests is below 50% of the power that participants can put out maximally for 8 seconds (maximal voluntary contraction), or even for 30 seconds on a bicycle (Wingate test). Which is to say, VO2max occurs well below maximal muscle drive. Given this, it should not be surprising that VO2max itself is not a useful measure for the illness.

    http://www.scielo.br/scielo.php?script=sci_arttext&pid=S0100-879X2015000300261
     
    Last edited: Feb 23, 2020
    MEMarge, ME/CFS Skeptic and Trish like this.
  13. ME/CFS Skeptic

    ME/CFS Skeptic Senior Member (Voting Rights)

    Messages:
    4,002
    Location:
    Belgium
    Apologies in advance for being difficult, but I think this helps in understanding things.
    But how would you test the reliability of an outcome measure over a long period of time? If you get a different score, how can you know whether the difference is due to an unreliable questionnaire or an actual change in the patient's condition? Are there outcome measures other than PROMS that have been tested and shown to be reliable in this way?

    I suspect that the trial design typical of GET/CBT studies also distorts observer-reported outcomes or things like the 6-minute walking test. It's not solely an issue of PROMS. And when bias is properly controlled for in a blinded RCT I see little reason to think that the prospective reliability of PROMS is an issue compared to other outcome measures.
     
  14. Trish

    Trish Moderator Staff Member

    Messages:
    55,414
    Location:
    UK
    Isn't that why the PACE researchers scrapped the end-of-trial actigraphy - they had found out from other studies that the patients reporting being able to be more active on SF-36 were actually not any more active?
     
    MEMarge, Sean, alktipping and 2 others like this.
  15. ME/CFS Skeptic

    ME/CFS Skeptic Senior Member (Voting Rights)

    Messages:
    4,002
    Location:
    Belgium
    But in that case, patients were actively encouraged to interpret their symptoms differently, so they were primed to fill in the questionnaire differently.

    I think if you were to do a simple prospective study with both actimeters and the SF-36 PF subscale, it would be rather difficult to determine the reliability of one based on the other. If there was a significant divergence, I would doubt which of the two is the more reliable measure.

    It's like the cognitive tests: these sometimes correlate poorly with patient reports, but in my view, that doesn't mean that the patient reports are unreliable.
     
    Trish likes this.
  16. Trish

    Trish Moderator Staff Member

    Messages:
    55,414
    Location:
    UK
    When this sort of argument is made, I always think of the example of the asthma study:
    https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4351653/
    https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4351653/figure/F2/
    It's a great pity we don't have an easily administered accurate objective test as in asthma, but until we do, I don't think we can take for granted the suggestion that any questionnaire is reliable.

    Anyway, I'll bow out of this now. Thanks for an interesting discussion. I think we basically agree that ideally objective testing is best, but we will have to agree to disagree about how reliable SF-36 is likely to be for ME studies, blinded or not.
     
  17. Lucibee

    Lucibee Senior Member (Voting Rights)

    Messages:
    1,498
    Location:
    Mid-Wales
    The first step is to be aware that there is a problem. But most studies won't even acknowledge that.

    Saying, "we don't have anything better, so they'll just have to do" is not good enough.

    I'm not against them being used at all, but I do think they should come with much, much stronger warnings about how their use under certain circumstances may affect the interpretability of a study.

    Warnings such as, "this measure may correlate well with more objective measures at baseline, but not as an outcome measure", should be setting off alarm bells as to why that may be, and that maybe the interventions used are affecting the measurement tool (the patient themselves) in ways that are unanticipated and haven't been controlled for.

    If all the sphygmomanometers in one arm of a trial on a blood pressure treatment were being recalibrated, while the other arm was being left alone, you wouldn't hesitate to declare that trial to be flawed. Yet we accept that happening routinely in psychological trials because "that's the way it's always been done."

    Granted, we don't know how much an intervention that aims to change a patient's perception of their symptoms by endorsing a positive spin will affect how they will modify their SF36 answers at endpoint, but even if it is only a small amount, that's important, particularly if it means they are no longer reliably reporting how their symptoms affect them. But standard validity and reliability testing is not going to tell you that.
     
    Amw66, rvallee, Sly Saint and 8 others like this.