Circulating cell-free RNA signatures for the characterization and diagnosis of myalgic encephalomyelitis/chronic fatigue syndrome, 2025, Gardella+

I don't really understand what is being done here, from the abstract and the figure shown. They seem to be trying to guess which cells are spilling their RNA the most in patients and controls? I wonder if it could all be explained by activity levels, via circulation times, cell senescence, lack of normal diapedesis into tissues being moved, etc.
I agree it was a confusing step of the analysis. I always thought that the interesting part of cfRNA was being able to easily capture transcripts from non-PBMCs (potentially even from tissues) without having to go into the tissues. If you're deconvolving transcripts against a PBMC data set, this basically gives you no more information than you could have gotten from a basic PBMC RNA-seq, which they already did several years ago. And you're just disregarding whatever transcripts don't originate from PBMCs.

It’s not like sequencing cfRNA is any more “diagnostically accessible” than sequencing PBMCs, even if there was a strong signature in their findings.
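To spell out the deconvolution point: reference-based deconvolution just solves for the mixture of reference cell types that best reconstructs the bulk profile. Here's a minimal sketch using non-negative least squares on made-up numbers (illustrative only; the paper's actual method and reference will differ):

```python
# Minimal reference-based deconvolution sketch (illustrative, synthetic data).
# Estimates what fraction of a bulk cfRNA profile is attributable to each
# cell type in a reference matrix.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
n_genes, n_cell_types = 500, 6

# Hypothetical reference: mean expression of each gene in each cell type
# (rows = genes, columns = cell types), e.g. built from a PBMC atlas.
reference = rng.gamma(shape=2.0, scale=50.0, size=(n_genes, n_cell_types))

# Hypothetical bulk cfRNA profile: a noisy mixture of the reference columns.
true_fractions = np.array([0.40, 0.25, 0.15, 0.10, 0.07, 0.03])
bulk = (reference @ true_fractions + rng.normal(0, 5, size=n_genes)).clip(min=0)

# Non-negative least squares: weights >= 0 minimising ||reference @ w - bulk||
weights, _ = nnls(reference, bulk)
fractions = weights / weights.sum()  # normalise to proportions

print(np.round(fractions, 3))  # roughly recovers true_fractions
```

Anything in the bulk profile that isn't spanned by the reference columns just ends up in the residual, which is exactly the problem with restricting the reference to PBMCs.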

Is there a reason why a higher-speed platelet-poor spin was not used? It's notable that the fraction itself strongly influenced classification success, and also that HBB (haemoglobin beta) was among the top LASSO coefficients.
Since the cfRNA sequencing is done on plasma, platelets would already have been filtered out. But the most likely explanation for the findings is platelet rupture during sample processing, which would have caused intracellular mRNA from platelets to enter the plasma fraction at much higher abundance. They said that there was no difference by sample site, though it could have been a difference in how many samples from each group were handled by someone who didn’t notice signs of rupturing.

The other option is that there is some biological difference that makes platelets more likely to rupture in ME/CFS, which would have been interesting to know, but it seems they didn't consider it. I'm surprised they didn't mention this in the text; it's the first question I or any of my colleagues would jump to.
 
I'm no machine learning expert and no statistician, and I don't have access to this exact paper, but we're seeing this approach a lot in ME/CFS research. In other papers that I did have access to, there were often serious concerns about this approach (not just about the biology behind things, but about its basic merits). It typically had the flavour of designing "the best lockpick by optimizing it for one lock" and then reporting "we can classify ME/CFS in a test set with high accuracy".

The approach:
Split your data into two sets, a training set and a test set, and choose the model that does best on your test set.

My concerns:
Overfitting to the test set (primarily because the set of models to choose from has become large and easily accessible to everyone).

The standard fix:
Split into a training set, a validation set and a test set. Use the training set to train the models. Use the validation set to tune the models, select features and choose a model. Then, once everything is finalised, run the chosen model on the test set exactly once. (There are also fancier approaches, such as nested cross-validation.)
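For concreteness, here's roughly what that looks like with scikit-learn on synthetic data (a minimal sketch of the protocol, not anything from the paper): model selection uses only cross-validated scores within the training data, and the test set is scored once at the end.

```python
# Sketch of the "standard fix": all model/hyperparameter selection happens
# inside the training data (via cross-validation); the held-out test set is
# evaluated exactly once, after the model is frozen. Data here is synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=200, n_features=500, n_informative=10,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

candidates = {
    "logreg": GridSearchCV(LogisticRegression(max_iter=5000),
                           {"C": [0.01, 0.1, 1, 10]}, cv=5, scoring="roc_auc"),
    "rf": GridSearchCV(RandomForestClassifier(random_state=0),
                       {"n_estimators": [100, 300]}, cv=5, scoring="roc_auc"),
}

# Pick the winning model using ONLY cross-validated scores on the training set.
for name, search in candidates.items():
    search.fit(X_train, y_train)
best_name = max(candidates, key=lambda n: candidates[n].best_score_)

# The test set is used once, to report performance of the already-chosen model.
final_model = candidates[best_name].best_estimator_
test_auc = roc_auc_score(y_test, final_model.predict_proba(X_test)[:, 1])
print(best_name, round(test_auc, 3))
```

The key property is that nothing about the test set can influence which model gets reported.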

This seems fairly basic to me, so I'm surprised to see it happen and wonder if I misinterpreted something. Happy to hear anybody's thoughts, especially on what was specifically done here.
 
I think platelet-derived cfRNA was decreased in ME/CFS.

EDIT: so that would mean that the platelets are less likely to rupture in ME/CFS?
Ah sorry--yes I suppose? I have no idea what biological difference would result in less rupturing in ME/CFS though. So I think it is more likely to be a confounder in sample processing.
 
Regarding the train/test approach you describe, this is what their methods state:
The sample metadata and count matrices were first split into a training set (70%) and a test set (30%), which were evenly partitioned across phenotype, sex, and test site. We repeated the split 100 times, for 100 different iterations of the train and test groups. Within each of the 100 training sets, feature selection was performed using differential abundance analysis which selected genes with a base mean of greater than 50 and a P-value less than 0.001. These genes were used to train 15 machine learning algorithms using fivefold cross-validation and grid search hyperparameter tuning. For each iteration of each model, accuracy, sensitivity, and AUC-ROC were used to measure test performance.
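As far as I can follow, that procedure amounts to something like the sketch below (illustrative scikit-learn code, not their actual pipeline; a simple univariate filter stands in for the differential-abundance step, and only two of the 15 model families are shown):

```python
# Rough sketch of the described pipeline (illustrative only): repeated 70/30
# Monte Carlo splits, feature selection inside each training split, models
# tuned with 5-fold grid search, and test metrics recorded for every
# (split, model) pair.
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import roc_auc_score

def monte_carlo_runs(X, y, n_splits=100, seed=0):
    """X: numpy array (samples x genes), y: binary labels."""
    results = []  # one row per (split, model) pair
    splitter = StratifiedShuffleSplit(n_splits=n_splits, test_size=0.3,
                                      random_state=seed)
    models = {
        "lasso_logreg": (LogisticRegression(penalty="l1", solver="liblinear"),
                         {"clf__C": [0.1, 1, 10]}),
        "extratrees": (ExtraTreesClassifier(random_state=seed),
                       {"clf__n_estimators": [100, 300]}),
    }
    for split_id, (tr, te) in enumerate(splitter.split(X, y)):
        for name, (clf, grid) in models.items():
            # Feature selection is refit within each training split only.
            pipe = Pipeline([("select", SelectKBest(f_classif, k=50)),
                             ("clf", clf)])
            search = GridSearchCV(pipe, grid, cv=5, scoring="roc_auc")
            search.fit(X[tr], y[tr])
            auc = roc_auc_score(y[te], search.predict_proba(X[te])[:, 1])
            results.append((split_id, name, auc))
    return results
```

Every (split, model) pair gets its own test score, which is exactly the table of results the concern above is about: it matters whether what gets reported is the average over splits or the best run.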

And the relevant results text:
Unsupervised clustering based on the differentially abundant features demonstrated separation between cases and controls (Fig. 2B). To build machine learning classifiers, we used Monte Carlo cross-validation by repeatedly partitioning the dataset into training (70%) and test (30%) sets, while balancing for phenotype, sex, and test site. This process was repeated 100 times, generating 100 unique train–test splits (Materials and Methods). From each training set, we selected features using differential abundance criteria (P-value < 0.001 and base mean > 50) and trained 15 machine learning models (Fig. 2C). This approach yielded approximately 1,500 models. We evaluated performance based on test set accuracy, sensitivity, specificity, and area under the receiver operating characteristic curve (AUC-ROC), assessing both the average metrics and the best-performing seed for each model (Fig. 2D).
As expected, tree-based algorithms such as ExtraTrees, Random Forest (RF), and C5 exhibited strong training performance, reflecting their capacity to fit complex patterns. However, these models tended to overfit the training data as evidenced by poor performance on the test set. Models with robust performance demonstrated high accuracy and AUC-ROC values for both the training and test sets. Across all models, each sample was included in the test set approximately 450 times (~30% of all iterations).
We observed variability in individual classification rates, with some samples classified correctly >90% of the time, while others classified correctly as low as 11% of the time (Fig. 2E). This result suggests that certain samples possessed unique features which drove consistent correct or incorrect classification.

So yes, it does seem like the first scenario you describe.
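To make the worry concrete, here's a tiny synthetic illustration (nothing to do with their data): with pure-noise features and 100 random 70/30 splits of a single model, the average test accuracy sits near chance, but the single best split can look respectable. With 15 model families on top of 100 splits, the pool of runs from which a "best-performing seed" can be picked is even larger.

```python
# Why reporting the best test-set result across many runs is optimistic:
# with pure-noise features (no real signal), the *average* test accuracy
# hovers around chance, but the *best* of 100 random splits looks better.
# Purely synthetic; numbers are illustrative.
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2000))   # 60 samples, 2000 noise "genes"
y = np.repeat([0, 1], 30)         # two balanced, meaningless groups

accs = []
splitter = StratifiedShuffleSplit(n_splits=100, test_size=0.3, random_state=0)
for tr, te in splitter.split(X, y):
    # naive univariate feature selection, done on the training split only
    scores = np.abs(X[tr][y[tr] == 1].mean(0) - X[tr][y[tr] == 0].mean(0))
    top = np.argsort(scores)[-50:]
    clf = LogisticRegression(max_iter=5000).fit(X[tr][:, top], y[tr])
    accs.append(accuracy_score(y[te], clf.predict(X[te][:, top])))

print(f"mean test accuracy: {np.mean(accs):.2f}")  # roughly chance (~0.5)
print(f"best test accuracy: {np.max(accs):.2f}")   # noticeably higher
```

So any "best-performing" figures drawn from a pool of ~1,500 runs need to be read with that in mind.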
 
I think your concerns are definitely warranted @EndME. I've gotten to learn under several machine learning experts over the past few years, and the thing they drill into me is that the test set cannot be used for picking the best model. If you do that, you're basically just using it as another validation set rather than an actual test set. Any time I had to choose between models, it was always done based on cross-validation in the training cohort; the test set doesn't even get touched until we know which model we're moving forward with for downstream analysis. [Edit: and, frankly, that's why my current PI prefers unsupervised models over anything else if there's a chance they'll turn up a relevant signal]

I am also generally quite skeptical about using big data models for ME/CFS diagnosis. The primary and most useful purpose of big data fishing expeditions is to pick out patterns with potential biological relevance that we didn't know to look for. That's why all my projects are collaborations with clinicians and specialized biologists: I find the patterns, I theorize about what they could mean, they tell me whether that makes sense or whether there's something I missed, and then it's up to them to validate the phenomenon directly.

I truly don't think we're ever going to get a good diagnostic marker this way. Sure, it can discriminate between ME/CFS and healthy controls, but it's probably going to get very hairy once you start throwing in any other diagnosis encountered in a clinical setting. If you're searching for a biomarker, it needs to be a biomarker of the illness, not a set of hundreds of genes that are different in ME/CFS compared to a selected group of controls driven by an unknown biological process.
 