Developing a blood cell-based diagnostic test for ME/CFS using peripheral blood mononuclear cells, 2023, Xu, Morten et al

First ever diagnostic test for chronic fatigue syndrome sparks hope - Advanced Science News

Formally known as myalgic encephalomyelitis/chronic fatigue syndrome (ME/CFS), this condition is generally characterized by persistent and unexplained fatigue, though it also presents with a myriad of symptoms that can vary between patients and fluctuate over time.

“When people say it’s tiredness, it’s not tiredness like, ‘Oh, I’m ready for bed’,” continued Polgreen, an active member of the Oxfordshire ME Group for Action (OMEGA). “It’s exhaustion, it’s a lack of energy, it’s absolutely heavy, and you’re just not able to do anything.”

When we spoke with other members of the OMEGA organization, they shared similar ordeals. “It’s incredibly debilitating and frustrating,” said James Charleson, a patient with ME/CFS who got in touch with us through OMEGA via email. “I used to be very physically and mentally active and now I have to be very prudent with how I expend physical and mental energy.

“In a typical day, I have to spend at least 50% of my waking hours lying down, resting. The grogginess and brain fog mean that days often pass me by without me really being aware of it.”

The experiences of individuals like Polgreen and Charleson highlight the deep impact that chronic fatigue syndrome has on the lives of those affected by this debilitating disorder. One of the greatest challenges is the fact that no definitive diagnostic tests or treatment options exist.

“Having something that nobody can put a finger on, something nebulous, something people don’t understand is distressing,” said Polgreen. “I’d just rather know than live with uncertainty.”

But now, there is hope that this could one day change with news of a new diagnostic test that can, for the first time, accurately identify hallmarks of chronic fatigue syndrome in blood cells.

The study was carried out by researchers at the University of Oxford led by Karl Morten and Wei Huang, who published their findings in Advanced Science. The test has an accuracy rate of 91% and could be a much-needed beacon of hope for many.

https://www.advancedsciencenews.com...est-for-chronic-fatigue-syndrome-sparks-hope/
 
I can see how you feel but I have been carefully communicating with detailed debate on these issues on this forum for six years now. I am not herding anybody into anything, just maybe pointing out that the reality is that most publications on disease mechanisms and diagnostic tests these days are crap - 95%. Back in the 1970s that was not the case but now almost everything is 'me too'. It is not difficult to produce a good quality abstract if you have something valid to say. If you read a journal volume from 1969 you will find most of the articles are saying something that is still considered valid fifty years later and you can see that in the abstract.

The point I was making was a basic theoretical one that scientific progress in these matters is not about labelling or 'legitimacy'. It is about having an understanding of what is going on. Patients are entitled to want something that gives legitimacy but in reality true legitimacy only comes from something that increases our understanding. The rheumatoid factor test is neither particularly specific nor sensitive but it led us to an understanding of rheumatoid disease that allowed us to control it. That was because it gave us an understanding that we could work on and finally see what it was we needed to do.

If I am criticising from a position of expertise I'm afraid I cannot help that. I spent fifty years working in this area and came to see all the pitfalls. Devising effective treatments for RA had nothing to do with having a 'diagnostic test' - there wasn't one - it had everything to do with tests that told us what was going on.

I appreciate the reply. Please understand that I do not criticise your expertise or your intentions. I apologise if it sounded like I was accusing you of herding people into things; I instead meant that comments read in isolation (as they will appear to a casual reader) will likely be interpreted without the context behind them.

I understand your long-running and thoughtful contributions to the forum and I appreciate them. The point was mostly just that not everybody who stumbles upon this will have the benefit of having read all of those contributions, nor in order. So each time a point is raised from a position of expertise, it needs to be communicated with some detail to avoid people misconstruing it.

In any case I'm not trying to lecture or rant away here, just clarify my previous comment's intention. I hope it comes across well. As always I appreciate and value our interactions and learn something from your wealth of experience each time.

I'm particularly glad that this has sparked some great discussion around not only the value of a diagnostic marker but the utility of its mechanistic relevance. Needed to be addressed I think.

I think another thing to mention here is that clinical judgement *is* lacking in so many cases where clinicians are not properly informed about the disease, which is much of where my contention is coming from. So that's why I'm not sold on clinical judgement as a counterargument. With better education and policy, sure.

As a general comment I will also defend the apparent lack of impact of much modern science when compared with work done 50 years ago by saying that I think all of the low hanging fruit has long been exhausted. I do not mean this in a snarky way. It's just harder to find transformationally-new things when so much foundational work has already been done and done so well.

I should also say that my entire research philosophy and efforts are centred around mechanism, so I get it. And a biomarker associated with mechanism is undeniably so much more useful than one without. I'm just thinking in terms of a potentially more rapidly reachable stopgap to reduce the skepticism that people face, even if only partially. Knowing what people go through with dismissal or disbelief absolutely crushes my heart. I've experienced it myself, so I understand it completely. We need to start ending it asap. So I remain open minded to all efforts to do so even if 90+% of my own work is probably more concerned with mechanism than diagnostics.

This turned into a disjointed stream of consciousness typed from the lab bench but I hope it makes the contention clear.

Best, dan
 
Slides from a recent presentation by Prof Morten; the second half is mostly about this work.

Near the end there is a slide about their next step on this project:

Next step towards developing a diagnostic test: a validation study (4-5 months), 2024
3 groups: ME/CFS (Mild/Moderate patients, n = 40), Multiple Sclerosis (n = 40) & Healthy controls (n = 40)
Samples: UK ME/CFS biobank
3 Centres: Oxford, Limerick & Canada (Alain Moreau) using the same samples
Study design: for each group, 30 in the training set, 5 in the test set, and 5 never-tested samples will test the model

This seems like a much more sensible approach than what they did in the paper. Good to see them trying to replicate in multiple centres, though I'm not sure how valuable this is compared to just testing more samples instead (or 40 different samples in each centre). I suspect the results from this paper won't replicate, but maybe they'll find something useful all the same.
 
Why didn't Morten look for a single measure to differentiate PWME from healthy people? Is this a step towards that? Could Decode ME participants be recruited?
Good point, e.g. those who participated in the Nath NIH intramural study (let's forget Walitt!) could be tested to see whether they were considered positive (for ME/CFS) on the basis of this (Raman spectroscopy) "test" - ditto Decode ME participants.
 
The usefulness of their classifier can only be demonstrated if they then apply it to a completely independent set of test data and see if they get good separation. If they don't do this step then it's basically useless.

It seems like they did have a test set - 20% of the samples - and this is what the accuracy figures describe:
For better clinical relevance on the predictive perspective, we also looked at the performance of the ensemble learner for the three-class classification tasks, that is, classifying each single-cell spectrum as either MS, ME/CFS, or HC (Figure 5B). The model achieved high performance on the independent test set with a sensitivity of 91% and specificity of 93% for the ME/CFS group; a 90% sensitivity and 92% specificity for the MS group. The overall accuracy on the test set was 91% (87–93% at a 95% confidence interval) (Figure 5B).

[Figure 5 from the paper]

The figures above are called confusion matrices (good chance @chillier you already know what this is if you are proficient in R, describing for others).

The squares on the diagonal lines going from bottom left to top right are correct guesses by the model on never before seen data. All other squares are incorrect guesses. The darker a square, the more samples the model guessed that way for.

The diagonal is much darker (and the numbers for correct guesses higher) in both tasks (separating into HC, MS, and ME, as well as separating HC, MS, and three severities of ME).

So the accuracy seems very promising here on unseen samples.
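For anyone who wants to play with this, here's a minimal sketch of how those numbers are computed; the labels and counts below are made up, not the paper's data, and confusion_matrix plus the sensitivity/specificity arithmetic are just the standard scikit-learn way of doing it:

```python
# Toy illustration only: invented labels, not the paper's data.
import numpy as np
from sklearn.metrics import confusion_matrix

classes = ["HC", "ME/CFS", "MS"]
y_true = np.array(["HC"] * 40 + ["ME/CFS"] * 40 + ["MS"] * 20)
y_pred = y_true.copy()
y_pred[:3] = "ME/CFS"    # pretend 3 healthy controls were misclassified
y_pred[40:44] = "MS"     # and 4 ME/CFS samples were called MS

cm = confusion_matrix(y_true, y_pred, labels=classes)
print(cm)  # rows = true class, columns = predicted class; diagonal = correct calls
print("overall accuracy:", np.trace(cm) / cm.sum())

for i, name in enumerate(classes):
    tp = cm[i, i]
    fn = cm[i, :].sum() - tp
    fp = cm[:, i].sum() - tp
    tn = cm.sum() - tp - fn - fp
    sensitivity = tp / (tp + fn)   # of the true cases, how many were caught
    specificity = tn / (tn + fp)   # of the non-cases, how many were correctly excluded
    print(f"{name}: sensitivity {sensitivity:.2f}, specificity {specificity:.2f}")
```

In scikit-learn's printout the correct calls sit on the top-left to bottom-right diagonal; the paper's figure presumably just draws the vertical axis the other way up, which is why the diagonal runs bottom left to top right there.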

Though it's also possible to "train to the test set": if they routinely tested the model on this set and tweaked the model until the results on the test set looked good, potentially by chance. Ideally, the test set should only be used once at the very end to verify the model. I don't know if they did this.
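To make "only used once" concrete, this is roughly the protocol I have in mind: carve off the test set first, do all the tuning against the training portion (for example via cross-validation), and score the test set a single time at the end. A toy sketch with a placeholder dataset and classifier, not anything from the paper:

```python
# Illustrative protocol only; the estimator and parameter grid are placeholders.
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=50, n_classes=3,
                           n_informative=10, random_state=0)

# 1. Carve off the test set first and don't look at it again until the end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# 2. All tuning happens inside cross-validation on the training portion only.
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

# 3. One final, single evaluation on the untouched test set.
print("test accuracy:", search.score(X_test, y_test))
```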

Edit: Actually, reading through more comments, I'm not as excited about the accuracy after learning the test set isn't completely separate people from the training set.
 
Edit: Actually, reading through more comments, I'm not as excited about the accuracy after learning the test set isn't completely separate people from the training set.

I asked the Morten Group about this on Twitter:

https://twitter.com/user/status/1796166890789667137


Though I'm not too hopeful they'll respond as they didn't respond to a similar tweet about their test set methodology from over a year ago:

https://twitter.com/user/status/1644410334097309696
 
I asked the Morten Group about this on Twitter:

https://twitter.com/user/status/1796166890789667137


Though I'm not too hopeful they'll respond as they didn't respond to a similar tweet about their test set methodology from over a year ago:

https://twitter.com/user/status/1644410334097309696


They posted this:

https://twitter.com/user/status/1803670724780904601


(I accidentally deleted the question they are responding to. Basically asking if they split their original study by individuals or simply by cells.)

As I say in response, I think I might have misinterpreted what they meant by "samples" in the paper when they said "the train and test sets contained a balanced number of samples from five groups of MS, Severe ME, Moderate ME, Mild ME, and HC"; I had thought it meant individual cells.

But this seems to say samples refer to people, which I missed:
After quality control to remove spectra with low signal‐to‐noise ratios, measurements of 98 samples [98 is the number of people] yielded a total number of 14 600 Raman spectra from 2155 single cells.

They also said:

https://twitter.com/user/status/1803671399120195685


So we'll get a better idea of the power of the original model since they'll test on a totally new group.
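If it helps make "by individuals vs by cells" concrete, here is a rough sketch of the two kinds of split; the participant IDs, cell counts and spectra are invented, and GroupShuffleSplit is just one standard way of keeping a person's cells on one side, not necessarily what they did:

```python
# Illustration only: fake per-cell spectra, ~20 cells per invented participant.
import numpy as np
from sklearn.model_selection import train_test_split, GroupShuffleSplit

rng = np.random.default_rng(0)
n_people, cells_per_person = 98, 20
patient_id = np.repeat(np.arange(n_people), cells_per_person)   # one ID per spectrum
X = rng.normal(size=(n_people * cells_per_person, 100))          # stand-in spectra

# Split "by cells": spectra from the same person can land in both sets.
cell_train, cell_test = train_test_split(np.arange(len(X)), test_size=0.2, random_state=0)
overlap = set(patient_id[cell_train]) & set(patient_id[cell_test])
print("people appearing on both sides (by-cell split):", len(overlap))   # almost all of them

# Split "by people": every spectrum from a given person stays on one side.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(gss.split(X, groups=patient_id))
print("people appearing on both sides (by-person split):",
      len(set(patient_id[train_idx]) & set(patient_id[test_idx])))       # 0
```

The reason the distinction matters is that cells from the same person are correlated, so a by-cell split can leak person-specific signal into the test set and flatter the accuracy.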
 
One of the authors, Jiabao Xu, presented about this and other Raman spectroscopy-related research at the recent PRIME project webinar (from 38:54 to 1:07:32, link to thread).

She says they went on to try a similar method in lupus. They split up lupus patients by organs affected (cardiopulmonary, renal, etc), and trained a model to classify these subgroups and healthy controls using Raman spectroscopy data from PBMCs. Slide from 52:56:
[slide image]

I'm not exactly sure what the difference is between the "single cell level" and "patient level" results, but for the latter, it classified all five groups perfectly.

The host, referring to a comment from a viewer of the webinar:
These results give them some pause. Usually when there's 100% accuracy, this is due to an effect of overfitting, so perhaps unseen data was still being used in the training.

Another comment:
I would be totally convinced of its accuracy if it were tested on a truly held-out dataset that your model had never seen. Is there any intention to do this soon? If not, what would you need? PRIME might know someone to contact.
 
I'm not exactly sure what the difference is between the "single cell level" and "patient level" results, but for the latter, it classified all five groups perfectly.

I think it might be the number of cells? So they first identify a cell in one patient's sample that they think is a rotten grape, then look for other rotten grapes that sample contains. Those all go forward to the analysis stage, but they're all from the same patient—and the AI model didn't see that person's cells during training (I think that's what the presenter meant by an independent 'hot dog/not hot dog' test).

I'm not certain I've got it right (don't understand the subject very well), but that's what it appeared to be saying.

One of the things that interested me was the apparent clear separation in PsA between pain symptoms and fatigue symptoms. If that's real, it may be important for all kinds of disease.
 
I'm not exactly sure what the difference is between the "single cell level" and "patient level" results, but for the latter, it classified all five groups perfectly.
I assume it’s the difference between individually classifying the spectra of single cells (multiple cases per participant treated separately) vs. pooling multiple cells per participant. I would guess the latter is probably being confounded by a difference in cell frequencies, same as other metabolic studies. But I could be wrong
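If the patient-level number does come from pooling, one simple version is a majority vote over each participant's cells, something like the sketch below (made-up predictions; the vote is my guess at the aggregation, since the slide doesn't say how they pooled):

```python
# Illustration only: fake per-cell predictions for three invented patients.
from collections import Counter

per_cell_predictions = {
    "patient_A": ["ME/CFS", "ME/CFS", "HC", "ME/CFS", "ME/CFS"],
    "patient_B": ["HC", "HC", "MS", "HC"],
    "patient_C": ["MS", "ME/CFS", "MS", "MS", "HC", "MS"],
}

# Patient-level call = the most common class among that patient's cells.
for patient, cells in per_cell_predictions.items():
    label, votes = Counter(cells).most_common(1)[0]
    print(f"{patient}: {label} ({votes}/{len(cells)} cells agree)")
```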
 
I assume it’s the difference between individually classifying the spectra of single cells (multiple cases per participant treated separately) vs. pooling multiple cells per participant.
I was thinking it was maybe that, but it's the same number of predictions in both result figures. It seems like it'd be weird to still count multiple cells from one prediction as different predictions when describing the results.
 
I was thinking it was maybe that, but it's the same number of predictions in both result figures. It seems like it'd be weird to still count multiple cells from one prediction as different predictions when describing the results.
I went back to the 2023 ME/CFS Raman spectroscopy paper to see if that might clarify and it just made me more confused—it looks like they sorted individual cells and then averaged them per patient. I am not sure if that would be analogous to the “single cell” or “patient level” measurement here. They also didn’t seem to assess which cell types they were measuring, and didn’t use any [edit: independent] test cohort validation either.

[Edit: they report an “independent” test set in the paper but it is a split from the original cohort, not a new cohort]

I would be shocked if this is the first time that test cohort validation has been suggested to them. If they knew enough to use ensemble ML, they should be well informed about how to assess for overfitting. I don’t think they even used cross-validation in their model training. How does this even fly?
 
I don’t think they even used cross-validation in their model training.
She talks about cross-validation in the video and I see this in the paper:
A cross-validation approach was used to enable all samples to enter the independent test set at least once, therefore, making the best use of the sample pool. The final performance measurements were reported based on averages on all cross-validation results.
Though my knowledge of ML training/testing practices is pretty limited.
 
Wait a minute I have to check the numbers on the original paper. They say it’s a 20% test split from 98 samples, but in their confusion matrix they have instances of 1% of the cohort being misclassified in a particular way, which would be 0.2 of a person. Also their workflow diagram seems to suggest that the test set also underwent model training, rather than being used to predict new cases after already being trained (though maybe that’s just a bad design choice that doesn’t reflect what was actually done?). I’ll do some more digging and put my concerns in the paper’s thread to avoid derailing this one.

She talks about cross-validation in the video and I see this in the paper:

Though my knowledge of ML training/testing practices is pretty limited.
whoa hold on cross validation should not include the test set??? That definitionally means the test set is not independent and was used in model training. I’m really hoping the presenter just mixed up some words. I haven’t watched the presentation yet but I will make time for it because something is seriously not adding up.
 
I’ll do some more digging and put my concerns in the paper’s thread to avoid derailing this one.
Are we talking about a different paper? Isn't this the thread?

They say it’s a 20% test split from 98 samples, but in their confusion matrix they have instances of 1% of the cohort being misclassified in a particular way
whoa hold on cross validation should not include the test set??? That definitionally means the test set is not independent and was used in model training. I’m really hoping the presenter just mixed up some words. I haven’t watched the presentation yet but I will make time for it because something is seriously not adding up.
My very limited understanding of cross-validation was that the setup was this:

Create some number of folds from the dataset. Train with some folds, test with the last fold. You have test results for one fold. Repeat five times with a different fold used as test set. Combine the prediction metrics in some way.

So I think the results for each test set would technically be "independent" in this case? The model is trained without seeing the individuals in the test set.

And each of the 100 participants got to be a part of a test set once.
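In code, the setup I'm describing would look roughly like this; the data and classifier are placeholders, and using a group-aware splitter to keep each participant's cells together is my assumption rather than something the paper states:

```python
# Sketch of fold-wise cross-validation where each participant is held out once.
# Data, classifier, and the use of GroupKFold are placeholders/assumptions.
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_people, cells_per_person = 100, 15
groups = np.repeat(np.arange(n_people), cells_per_person)   # participant ID per cell
X = rng.normal(size=(len(groups), 60))                      # stand-in spectra
y = rng.integers(0, 3, size=n_people)[groups]               # one class per participant

fold_scores = []
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    fold_scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

# Each participant's cells land in exactly one test fold; the reported number
# is the average over folds, not the performance of one final model.
print("mean accuracy across folds:", np.mean(fold_scores))
```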
 
Are we talking about a different paper? Isn't this the thread?
Ah my bad, [edit: I was jumping between tabs and] I thought this was the thread for the presentation conference.


My very limited understanding of cross-validation was that the setup was this:

Create some number of folds from the dataset. Train with some folds, test with the last fold. You have test results for one fold. Repeat five times with a different fold used as test set. Combine the prediction metrics in some way.

So I think the results for each test set would technically be "independent" in this case? The model is trained without seeing the individuals in the test set.

And each of the 100 participants got to be a part of a test set once.
That’s not a true independent test set; they’re only reporting the effects of training the model essentially 5 times, not how an internally cross-validated trained model performs on truly unseen data. A model using such fine-grained data as this can still be expected to massively overfit in k-fold cross-validation, especially if you don’t have a cumulative final model at the end of it, so you actually need an additional test cohort that isn’t touched in training at all.

I will come back to explain more—sorry, short on time today (and this is an area where terminology gets very confusing because the same word will be used for different meanings, apologies!).
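As a very quick sketch of the distinction in the meantime (toy data and a placeholder classifier, nothing from their pipeline): cross-validation inside the development cohort only gives you an internal estimate, whereas the convincing number comes from applying the final model once to a cohort that never entered training at all.

```python
# Sketch only: placeholder data/classifier illustrating internal CV vs. an
# external cohort that is never touched during development.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_dev, y_dev = rng.normal(size=(200, 60)), rng.integers(0, 3, size=200)   # development cohort
X_ext, y_ext = rng.normal(size=(60, 60)), rng.integers(0, 3, size=60)     # external cohort

model = RandomForestClassifier(random_state=0)

# Internal estimate: cross-validation within the development cohort only.
internal = cross_val_score(model, X_dev, y_dev, cv=5)
print("internal CV accuracy:", internal.mean())

# Final check: fit once on all development data, evaluate once on the
# untouched external cohort. This is the number that tests generalisation.
model.fit(X_dev, y_dev)
print("external cohort accuracy:", model.score(X_ext, y_ext))
```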
 