The MSIS-29 and SF-36 as outcomes in secondary progressive MS trials, 2022, Strijbis et al

Hutan · Feb 21, 2024

Abstract
Background:
Patient-reported outcome measures (PROMs) are often used in clinical research, but little is known about their performance as longitudinal outcomes.

Methods:
We used data from ASCEND, a large SPMS trial (n = 889), to investigate changes on the Short Form Health Survey 36 (SF-36 v2) and the Multiple Sclerosis Impact Scale (MSIS-29) over 2 years of follow-up.

Results:
PROM scores changed little over the 2 years of follow-up. In contrast to physical disability measures, there was no consistent trend in PROM change: significant worsening occurred about as often as improvement. Using a 6-month confirmation reduced the number of both worsening and improvement events without altering their relative balance. There was no clear difference in worsening events in groups based on population characteristics, nor was there a noticeable effect using different thresholds for clinically significant change.

Conclusion:
We found little consistent change in MSIS-29 and SF-36 over 2 years of follow-up in people with SPMS. Our findings show a disconnect between disability worsening and PROM change in this population. Our findings raise caution about the use of these PROMs as primary outcome measures in SPMS trials and call for a critical reappraisal of the longitudinal use of these measures in SPMS trials.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9315187/

Hutan · Feb 21, 2024

I note the last two sentences of the abstract:

Our findings show a disconnect between disability worsening and PROM change in this population.

Our findings raise caution about the use of these PROMs as primary outcome measures in SPMS trials and call for a critical reappraisal of the longitudinal use of these measures in SPMS trials.

Hutan · Feb 21, 2024

The paper and related issues are discussed in a brief communication, perhaps an editorial:
The value of patient-reported outcome measures for multiple sclerosis
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9326853/

Some excerpts:

Clinical trials and regulatory agencies historically have focused on clinical outcomes including clinician rated disability measures, biomarkers, and imaging. These outcomes often incompletely measure the experience of patients with their disease. To address these gaps, patient-reported outcome measures (PROMs) were developed to quantify the patient perspective. Although the unique benefit of PROMs has been appreciated there remains uncertainty regarding interpretation and clinical impact of these measures.1

The primary finding of this study was minimal and inconsistent change was observed over 2 years (MSIS-29 physical 50.8 vs 50.5; MSIS psychological 39.1 vs 36.7; SF-36 physical 33.3 vs 33.5; SF-36 mental 47.0 vs 47.7 for all baseline and week 96 scores), compared to consistent worsening on EDSS and T25FW. Given the inconsistent change the authors cautioned the use of PROMs as a primary outcome in SPMS trials and suggest the need for a re-evaluation of how these measures are utilized longitudinally.

This study highlights the limitations in our current use of PROMs in all forms of MS.

Response shift occurs when an individual’s internal standard changes over time; therefore the magnitude of change detected can be obscured due to this adaptive process. This phenomenon has been well described in diseases with unpredictable disease courses and proposed as an explanation for the discrepancy between disability level and patient-reported HRQOL4. A study in MS reported 20% of patients demonstrated response shift, with a higher percentage in progressive types, leading to misleading interpretations for SF-36 change scores.5 Lack of group-level change could also be attributed to how patients internalize and respond to PROMs. Individuals’ ratings reflect their frame of reference, whether comparing their health to people with the same diagnosis or others without their diagnosis, or whether their expectations are based on their past health or their ideal health. A study of physical and mental HRQOL in MS demonstrated these appraisal processes explained a large amount of variation in individual scores, and concluded HRQOL should not be assessed without measuring appraisal in MS patients.6 These observations suggest a universal threshold for meaningful change in PROMs is unlikely to be effective.

Minimal important differences (MIDs) have also been shown to vary depending on baseline severity and domain. For patients with good HRQOL at baseline, there will be less ability to reach MID thresholds.

However, the authors of the editorial suggest that perhaps different aspects of a patient's experience are being measured. Unfortunately they suggest

PROMs provide measurement of outcomes not well assessed by current objective neurological outcomes such as EDSS, and provide complementary yet distinct information. The role of PROMs in clinical trials will likely be to focus on features such as mental health, fatigue, and pain that are difficult to quantify with a neurological exam.9

So, despite knowing very well about the problems with subjective outcomes, the authors of the editorial still think that subjective outcomes are okay for mental health, fatigue and pain. These three sorts of outcomes are surely among the most vulnerable to response shift, especially when rehabilitation professionals are precisely encouraging that change in the frame of reference. And we have seen that the tools used for measuring these outcomes are often extremely vague on what the baseline comparator is - is it when you were last well? what is typical for a person your age? two months ago? when the trial started? Respondents' assumed baseline comparators can even change within one study, creating a complete mess in the data.

I assume that the acceptance of subjective outcomes stems partly from a belief that there are no good objective outcomes for mental health, fatigue and pain. Of course, it does not make sense to accept a misleading outcome just because you can't think of a good one. Surely, these internally experienced problems are reflected in outcomes that, with care, can be objectively measured - e.g. percentage of days when the participant left the house, steps walked, quantity of pain relief consumed?

bobbler · Feb 21, 2024

Hutan said:
I note the last two sentences of the abstract:

Our findings show a disconnect between disability worsening and PROM change in this population.

Our findings raise caution about the use of these PROMs as primary outcome measures in SPMS trials and call for a critical reappraisal of the longitudinal use of these measures in SPMS trials.

Agreed. I don't know whether I encountered this paper specifically but over a year ago I did a bit of slow browsing to look into the SF-36 and discovered that it was never invented to be used the way we often see it utilised. More on this below.

It is very interesting thought to look at it being used in the context of measuring changes in individuals in a condition where there can be better objective and subjective triangulation used to compare it.

On the other hand I was intrigued by I think it was @rvallee noting that despite this, it turns out to be potentially one of the better ones for showing the level of disability experienced in a certain context (apologies @rvallee it seemed very valid at the time, but I can't remember what the specific context was that it was being cited).

So I'm quite fascinated by it. And its origins (including the intricacy of thought into the design), how it has been used and what that resulted in etc.

Apologies as the rest of this is top of my head from memory, so is a bit of an approximation with no-doubt gaps:

I believe it was intended to be for population-level public health type surveillance: comparing large groups, not 'measuring' individuals. AND that it has two arms that the design suggested needed to be totalled separately for two separate scores, a mental health one and a physical health one.

In fact you then have questions about appropriate/assumed 'weightings' if someone does just then go ahead and roll one into the other, if it hasn't been designed specifically to be used that way. And I would assume that from an information perspective for most individual conditions then having the scores separate would be just as useful - even if the two areas might interact. Particularly if you were (by the sounds of this it is not recommended) were using it in a longitudinal time-series sense for populations, because you could see which time-wise came first or 'knock-on effects'.

Anyway, the first point, about it being for populations carries with it lots of possible indicators of where there may be blindspots or pitfalls in it being used for individual's assessments. ie if someone is using it in clinical practice on has to wonder whether rather than monitoring an individual's progress/deterioration/response to treatment it is simply monitoring that population of patients in the general.

I struggle to think of a more inappropriate illness to do this for [than ME/CFS] due to the different severities, situations, lack of treatments and ergo drop-out rates/discharge 'because we have nothing for you/you are to ill to participate'. And that is before you turn a clinic into on of 'fatigue' or 'persistent symptoms'.

But, then, when you look at the results in context for some of the ME/CFS things, it isn't the worst of all the scales in fact when you start picking away at validity etc- interestingly. I say this not to suggest copying it as a starting point, because I think given the 'PEM/PESE' breakthrough and how counterintuitive that makes ME/CFS to many illness paradigms and certainly to a lot of old paradigms it has been slotted under that relate more to mental healthy type ways of doing things. But because it's sometimes interesting to see where things might have 'got it right', and am fascinated to see how a scale computes out stuff and its accuracy/usefulness relates to what is 'put in'.

bobbler · Feb 21, 2024

@Hutan

Response shift occurs when an individual’s internal standard changes over time; therefore the magnitude of change detected can be obscured due to this adaptive process. This phenomenon has been well described in diseases with unpredictable disease courses and proposed as an explanation for the discrepancy between disability level and patient-reported HRQOL4. A study in MS reported 20% of patients demonstrated response shift, with a higher percentage in progressive types, leading to misleading interpretations for SF-36 change scores.5 Lack of group-level change could also be attributed to how patients internalize and respond to PROMs. Individuals’ ratings reflect their frame of reference, whether comparing their health to people with the same diagnosis or others without their diagnosis, or whether their expectations are based on their past health or their ideal health. A study of physical and mental HRQOL in MS demonstrated these appraisal processes explained a large amount of variation in individual scores, and concluded HRQOL should not be assessed without measuring appraisal in MS patients.6 These observations suggest a universal threshold for meaningful change in PROMs is unlikely to be effective.

This is, I think, the same issue that is referred to by Lisbeth Utens in her letter to Kuut, Knoop et al re: their CBT for long covid when she mentions the potential for using partners and parents as measurers ("a multi-informant response") - which struck me as interesting and I pondered the pros, cons and why it hasn't been used before.

Particularly given I imagine in the arena of mental health 'multi-informant' is often used due to ideas about 'lack of insight' and the like.

Which is about 2/3 of the way down the following thread:
Efficacy of cognitive behavioral therapy targeting severe fatigue following COVID-19: results of a randomized controlled trial 2023, Kuut, Knoop et al | Page 16 | Science for ME (s4me.info)

"Furthermore, Kuut et al. (2023) report that they use validated lists, but the complex fatigue and stimulus processing problems with post-COVID can actually be measured sensitively and validly with five and eight items respectively on a generic scale? Finally: why is no proxy measure included?

Partners and parents can provide very useful information, because it can be very difficult for patients to register small changes themselves during a long disease process with stimulus processing and concentration problems. A multi-informant approach is often recommended (Achenbach et al., 2005). Measuring with more specific post-COVID items would have been desirable and possible. In the trial design article also mentioned the use of several self-developed items.

Knoop's comment in his response: “symptoms can only be assessed by asking the patient” is, in my opinion, a simplification of reality. Proxy reporting is also essential for post-COVID children, for example."

bobbler · Feb 21, 2024

When thinking about PROMS for ME/CFS I think in particular the following part of the above in the 'Response Shift' para is interesting to bear in mind, when looking 'big picture' about the format/approach:

Lack of group-level change could also be attributed to how patients internalize and respond to PROMs.

Individuals’ ratings reflect their frame of reference, whether comparing their health to people with the same diagnosis or others without their diagnosis, or whether their expectations are based on their past health or their ideal health.

A study of physical and mental HRQOL in MS demonstrated these appraisal processes explained a large amount of variation in individual scores, and concluded HRQOL should not be assessed without measuring appraisal in MS patients

By which I mean that instead of thinking 'questionnaire', with familiar questions and large 'load' that are then weighted and calculated

something that forces or checks/helps an 'overall judgement' that is based on 'thinking afresh' and providing new 'points of reference' might be more useful/accurate. It might be that people are quite good at measuring their own ‘ew threshold of energy’ or pain or exhaustion as an 'overall' better than a questionnaire attempting to use a battery to weight component parts (which might vary across individuals in weighting), but just need a method that provide 'checks and balances' to that 'overall' judgement that accounts for these 'response shifts/adapting'.

There is also perhaps an argument to say that because of the nature and situational context for ME/CFS, combined with there perhaps being a few 'objective' signs/hints/pieces of info that might keep those with SPMS more 'in reality check' - meaning those with ME/CFS might be even worse at judging because they don't have those check-ins and measures to 'calibrate'.

I'm not keen on the 'multi-informant' idea [suggested by Lisbeth Utens in her letter in the comment above] for ME/CFS for a multitude of reasons including dynamic issues, and our vulnerability in workplace, family etc and that power-dynamic it could involve or affect (unless 'volunteered'/chosen by patient). But maybe there are lateral versions that could be inspired from this, that could utilise some of these suggestions to aid accuracy?

I don't know whether this is a useful piece to muse on, particularly thinking for instance about the situation where someone things they are 'doing really well spinning the plates/riding the fine line on the threshold of doability'. And whether when I was doing it there were things slowly dropping off that the right questions might have pointed out, or if it really was all great until the adrenaline stopped and the cumulative effect of just a bit too much each day hit.

EDIT: the other scenario to accommodate is a crash of various sorts that we might not yet (for various reasons including being a novice to the illness in experience of deteriorations) have considered could not just be a bad few weeks of ‘still getting over that cold’. But that/those dips could perhaps be encapsulated a bit as well in detailed scales that maybe have an a,b,c (almost functioning like a half measure to indicate I think I’m still moderate but…). Anyway I don’t know why the same questions/text/tests that you might ask someone to do to ‘help’ then think afresh each time what their level is on an overall scale wouldn’t also help here.

Now I know this next bit is more related to PROMs ideas for ME/CFS and might be more another thread but taking these lessons re: frames of references etc above..

Is there a way that you can avoid a 'computed score' that doesn't go without a 'sense-check' that it feels about right, whilst also avoiding the 'expectations based on past or ideal health' type issues above? A lot of the things that feel relevant to ME/CFS admittedly might be things that apps nowadays might inform people to help them calibrate anyway. But I don't know how good they are at getting people to realise they are 'at a new level' rather than 'in a crash'. Or - without research into that - how relevant that distinction is anyway/we just kid ourselves.

By which I'm thinking that the 'output' would be to ask a pwme to perhaps just select one very specific increment on what could be quite a large scale that has a lot of specifics . I'm also trying to think how introducing these could actually work or translate. And what would be valid or relevant. Vs questions about all the component symptoms or areas of ME and then a total perhaps being computed from that. And the frames of reference being spelled out/made clear snd having tools to help that seem to be one thing above hints at trying to work on.

This would indeed make selecting exactly the right box a lot of work. However, people would narrow down their approx zone quite quickly (we normally don't change that much) by defined limitations. And maybe could feel confident in their precise selection more easily by finding more exact comparators to their own situation - ie fresh frames of reference (rather than negotiating with what 'mostly homebound' does and doesn't encompass, it could almost have like market research 'profiles' that people can drop-down to check as they narrow-in on where they think they fit of e.g. Jenny who does still does her full-time job, but from home, and cannot schedule more than 2hrs of conversation in a day/10hrs in a week but can keep set hours wash, dress, watch films/socialise low key at weekend vs 'Jilly: works full-time with 3 days in the office with the day a mixture of meetings and quiet work, and can only do this and showers only in preparation for those days', and these should be worked up from/with actual patients in their terms that mean something, not guesses/cliches from HCPs).

Or you could add other home-based simple tasks to triangulate an assessment - knowing some would know they can't do that at all without needing to attempt it, and some could just be a case of self-monitoring what the categorisation means they already probably do. Maybe steps or reaction tests, but the difficulty there is good day/bad day/what you've done in the days leading up I guess so would need instructions.

But if the issue is comparing to past or ideal health then could something work directly from a list of things people might include (I guess 'meaningful items'). Did you struggle to cook that meal/watch that film in the past month. Have a conversation with someone close to you on a good day about a news item when sat upright and time it asking them to flag when your 'performance' starts to decline.

For those less severe when looking at PEM and threshold, seeing whether if they spend a weekend resting then their first day at work or doing an activity they might already have to do but which tends to cause PEM or fatiguability (agreed as appropriate to level), which day and how much their symptoms then hit. I imagine that if it was a clinic PROM then most people could actually negotiate with their employer if needed to ensure they could complete this. These things might then help to inform someone's own self-judgement/calibrate

Anyway, I just think it is an interesting point about frames of reference, and we are all aware that we assume our own norms and adapt naturally as things get worse without noticing we've been adapting for a year and now have to admit we actually can't really do a food shop in the supermarket anymore (hence eg direct questions: 'have you actually been able to complete a grocery shop in the last 6months'), or you now can only shower for x amount of time, are better wake-up calls and could be used as some sort of 'self-calculator'. If we are thinking about energy-threshold.

But, importantly, which could be of a formats so could provide an output that is a visual representation showing eg 'time able to shower', 'time able to talk seriously', 'time able to small talk', and so on. So that by that point by glance someone could have helped them, but they can see/feel if it seems about right. OR, if they are doing it themselves they can look at the 'overall' which might be represented visually as activity/energy in a week/fortnight and think 'I'm missing something' or 'no, that's more than I can do'.

Again a way of aiding people to work out which one, but then be able to check the 'overall' adds up themselves which hopefully works towards removing those response shifts due to adapting to deterioration.

And to keep things fresh or accessible (by providing choices in tasks) researchers could have for such 'self-calculator' tasks, groups of patients (of different severities, in case some of these task do vary by severity) grouping tasks of similar 'strenuousness/energy level/payback' so that the tasks rotates rather than being a familiar battery tempting those filling it in to think they know the answer.

Of course... it's all very hard and I haven't perfected anything at all here, these are just incomplete 'bits' of thoughts, and we've lots of different measures to try and look at eg. PEM is different to 'threshold'/severity measures (have you actually got worse) and so. It is just sparking my imagination that maybe if some of these sources of error have been highlighted there could be method-type ways to help control them/build in some triangulation. And of course it could be that it is useful to compare these things with eg the objective patterns and if people use apps etc test if those are getting there.

rvallee · Feb 22, 2024

Hutan said:
I note the last two sentences of the abstract:

Our findings show a disconnect between disability worsening and PROM change in this population.

Our findings raise caution about the use of these PROMs as primary outcome measures in SPMS trials and call for a critical reappraisal of the longitudinal use of these measures in SPMS trials.

Generally speaking, ask bad questions and you will get bad answers.

Most PROMs use very weird questions that are highly ambiguous. If not all of them, at least some of them, and those are generally the ones that skew everything. They're basically ink blot questions where whoever fills them has to wonder 8 ways what the question even means or how it applies to them.

And that's before how difficult it is to assign numbers to qualitative things. And that's before you consider that a 1-5 scale has a 20% fuzziness baked into it since it's usually impossible to really say whether any answer of 3 could rather be a 2 or a 4. And then there's the fact that using a 1-10 or a 1-100 scale doesn't really add any precision since, again, they're qualitative properties. And that's before you consider the slipperiness of how answers change based on adjustments to disability, where a 4 today would be a 2 to a former healthy self. Even in this paper they understand response shift but can't really give it its proper importance as a fundamental problem.

Unless the questions can be put into things that can be compared, my 4/5 is definitely not a healthy person's 4/5, or even my formerly healthy self's 4/5, I don't think that PROMs carry much value besides asking a single 1-10 scale of general well-being. They have to ask questions that relate to objective things, questions that are unambiguous that will give unambiguous answers. And drop the damn biopsychosocial nonsense about feeling anxious and all that crap. That's just esoteric woo with even less scientific value than Meyers-Briggs or a Dungeons and Dragons character attributes.

At some point the industry has to realize that they spent decades trying this and are no further than they were several decades before they started. But they don't seem to have the psychological flexibility for it. Oh, the irony.

rvallee · Feb 22, 2024

To address these gaps, patient-reported outcome measures (PROMs) were developed to quantify the patient perspective.

And that's really the thing. They don't quantify those. They ask to assign numbers to qualitative properties. That's a completely different thing from quantifying something. It's as different as quantum mechanics is to classical mechanics, and psychology is basically unable to make the leap towards the incompatibility of both paradigms.

Assigning numbers to qualitative properties does not make them quantitative. It doesn't work like that. Science is first about properly categorizing things, which medicine is massively failing where it comes to psychosomatic ideology, then comes measurement. And they're not measuring anything. Measuring something has a specific meaning in science, and qualitative ratings are not measurements. They are the equivalent of what cold-warm-hot is to temperature where the centigrade scale is a true measurement.

If you're not measuring, you're not doing science. Period. It sucks but that's how it is, and pretending otherwise has effectively set back medical science for decades in the areas where they pretend that clowning around with fuzzy numbers assigned to qualitative properties is equal to measurement.

Kitty · Feb 23, 2024

Patients will only be able to give useful answers if the questions are related to their real lives. They can't compare themselves with someone else, or accurately rate symptoms on meaningless scales—they can only say how the illness impacts their ability to do the things they need and want to do, and to enjoy things (or not).

The MSIS-29 and SF-36 as outcomes in secondary progressive MS trials, 2022, Strijbis et al

Hutan

Moderator

Hutan

Moderator

Hutan

Moderator

bobbler

Senior Member (Voting Rights)

bobbler

Senior Member (Voting Rights)

bobbler

Senior Member (Voting Rights)

rvallee

Senior Member (Voting Rights)

rvallee

Senior Member (Voting Rights)

Kitty

Senior Member (Voting Rights)