Pipeline-optimized machine learning for [CFS] diagnosis: A lightweight, interpretable model using blood biochemical and metabolomic data, 2026, Li+

forestglip

Pipeline-optimized machine learning for chronic fatigue syndrome diagnosis: A lightweight, interpretable model using blood biochemical and metabolomic data

Li, Junrong; Cao, Hanyu; Zhu, Zirun; Zhai, Xiaobing; Xing, Abao; Zeng, Shuowen; Luo, Gang; Sha, Yuyang; Li, Peng; Li, Kefeng

[Line breaks added]

Introduction and Background
Chronic fatigue syndrome (CFS) is a debilitating multisystem disorder with persistent fatigue and functional impairment, yet remains underdiagnosed due to symptom heterogeneity and the lack of objective biomarkers.

Developing a lightweight, interpretable diagnostic model requires systematic optimization of the entire analytical pipeline—from control group selection to biomarker identification and model construction.

Method
We developed a comprehensive pipeline optimization framework using UK Biobank metabolomic and blood biochemical data (1137 CFS cases; 66,838 controls). Unlike previous studies, our control group included both healthy individuals and patients with CFS-overlapping conditions.

We employed stratified bootstrap sampling (1000 iterations) instead of traditional random sampling to ensure balanced covariate distributions between cases and controls. Our systematic approach compared 7 missing value imputation methods, 9 feature selection techniques, and 11 machine learning/deep learning models.

Feature selection incorporated collinearity exclusion and sequential forward selection to identify the 10 most influential biomarkers. Model evaluation extended beyond standard metrics (ROC-AUC, accuracy, sensitivity, specificity, F1-score, NPV, and PPV) to include Matthews Correlation Coefficient (MCC) for comprehensive performance assessment.

We enhanced model interpretability through both Mendelian randomization (MR) for causal inference and SHAP (SHapley Additive exPlanations) analysis for feature contribution quantification. Clinical utility was evaluated using decision curve analysis (DCA), with additional validation through Spearman's correlation and restricted cubic spline (RCS) analyses examining biomarker relationships with core CFS symptoms.

Results
The optimized pipeline yielded a lightweight model combining Bayesian Principal Component Analysis (BPCA) imputation, NearMiss undersampling, and random forest classification using only 10 biomarkers plus three covariates (BMI, age, and gender).

This model achieved exceptional diagnostic performance (accuracy = 0.939, ROC-AUC = 0.979, MCC = 0.878, Balanced Performance Score = 0.859 across 11 metrics), effectively discriminating CFS from both healthy controls and overlapping conditions.

DCA demonstrated substantial net clinical benefit across a wide threshold range (0.01–0.98), confirming strong clinical applicability. MR analysis established causal relationships for six biomarkers (urea, total protein, glucose, total bilirubin, leucine, vitamin D; P < 0.05).

SHAP-based interpretability analysis, corroborated by Spearman's correlation and RCS analyses, revealed that elevated glucose and leucine levels exacerbated CFS symptoms, providing mechanistic insights aligned with personalized risk directionality.

Conclusion
Through systematic pipeline optimization—from stratified control selection to comprehensive model comparison and multi-faceted interpretability analysis—we developed a lightweight, highly interpretable CFS diagnostic model using exclusively objective biomarkers.

To ensure reproducibility, this methodology was implemented via the ClinMetML framework. Our model effectively discriminates CFS from both healthy individuals and overlapping conditions, providing a robust, cost-effective foundation for early diagnosis and personalized therapeutic management.

Web | DOI | Computational Biology and Chemistry | Paywall
 
Authors are from the Centre for Artificial Intelligence Driven Drug Discovery, Faculty of Applied Sciences, Macao Polytechnic University, Macau, and Guangdong

Feature selection incorporated collinearity exclusion and sequential forward selection to identify the 10 most influential biomarkers. Model evaluation extended beyond standard metrics (ROC-AUC, accuracy, sensitivity, specificity, F1-score, NPV, and PPV) to include Matthews Correlation Coefficient (MCC) for comprehensive performance assessment.

We enhanced model interpretability through both Mendelian randomization (MR) for causal inference and SHAP (SHapley Additive exPlanations) analysis for feature contribution quantification. Clinical utility was evaluated using decision curve analysis (DCA), with additional validation through Spearman's correlation and restricted cubic spline (RCS) analyses examining biomarker relationships with core CFS symptoms.
I don't know what that means. But I'm not seeing anything about having a training sample and a validation sample there. And, I think the UK Biobank's CFS labelling is pretty unreliable.
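For anyone else puzzling over the less familiar terms in that quote, here is a minimal sketch of two of them: the Matthews Correlation Coefficient and the "net benefit" quantity that decision curve analysis plots against the threshold probability. These are the standard textbook definitions, not code from the paper, and the counts in the example are made up for illustration.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


def net_benefit(tp, fp, n, threshold):
    """Net benefit of acting on the model's positives at a given threshold probability."""
    weight = threshold / (1.0 - threshold)  # how heavily a false positive is penalised
    return tp / n - (fp / n) * weight


def net_benefit_treat_all(prevalence, threshold):
    """Net benefit of the reference strategy that labels everyone positive."""
    weight = threshold / (1.0 - threshold)
    return prevalence - (1.0 - prevalence) * weight


# Hypothetical balanced sample: 100 cases and 100 controls, 10 errors in each group.
tp, fn, fp, tn = 90, 10, 10, 90
n = tp + fn + fp + tn

print(mcc(tp, fp, tn, fn))
print(net_benefit(tp, fp, n, threshold=0.5))
print(net_benefit_treat_all(prevalence=0.5, threshold=0.5))
```

A decision curve repeats the net benefit calculation across a range of thresholds and compares the model against "label everyone" and "label no one" (net benefit zero). Both quantities depend on the sample they are computed on, so they say nothing by themselves about performance on a different cohort.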
MR analysis established causal relationships for six biomarkers (urea, total protein, glucose, total bilirubin, leucine, vitamin D; P < 0.05).
Hmm.

Our model effectively discriminates CFS from both healthy individuals and overlapping conditions, providing a robust, cost-effective foundation for early diagnosis and personalized therapeutic management
I think they have got a bit ahead of themselves. Perhaps their model is very good on this particular sample. But surely this needs to be validated on another sample, preferably a really well characterised one, before claims are made about its success at identifying CFS?
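To make the validation point concrete, here is a rough sketch of the ordering I would hope to see: hold out a test portion first, and only then rebalance the training portion. The abstract does not say how (or whether) this was done, so this is not the authors' procedure. Plain random undersampling stands in for NearMiss, the function names are my own, and the labels are toy numbers.

```python
import random


def stratified_split(labels, test_fraction, seed=0):
    """Return (train_idx, test_idx), keeping the class ratio in both parts."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


def undersample_majority(indices, labels, seed=0):
    """Randomly drop samples until every class matches the smallest class size."""
    rng = random.Random(seed)
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    n_min = min(len(members) for members in by_class.values())
    kept = []
    for cls in sorted(by_class):
        kept.extend(rng.sample(by_class[cls], n_min))
    return sorted(kept)


# Toy labels (made up): 20 cases (1) among 1000 controls (0).
labels = [1] * 20 + [0] * 1000

# Split first, so the held-out part keeps the original class imbalance.
train_idx, test_idx = stratified_split(labels, test_fraction=0.25)

# Rebalance only the training part; the test part is left untouched.
balanced_train_idx = undersample_majority(train_idx, labels)

print(len(balanced_train_idx), len(test_idx))
```

The held-out part keeps the original ratio of cases to controls, so metrics computed on it reflect the imbalance a real screening setting would have. An independent, well-characterised cohort would still be a stronger test than any split of the same dataset.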


I'm trying hard to keep an open mind, but I'm pretty sure that we have seen others have a go at this sort of thing with the same data, and not really come up with much. If someone has access to this paper, perhaps they can have a look at what is said about the biomarkers - are they from blood or urine? are they low or high?
 
If someone has access to this paper, perhaps they can have a look at what is said about the biomarkers - are they from blood or urine? are they low or high?

From blood, and most of the relationships are U-shaped if I'm understanding correctly. (In the following, a higher SHAP value predicts higher chance of CFS diagnosis in the dataset.)

Our analysis reveals that total protein (TP), vitamin D (VD), total bilirubin (TBIL), urea, and alkaline phosphatase (ALP) are the five biomarkers with the highest contribution to the prediction of CFS risk (Fig. 5a, b). Higher TP, urea, alanine aminotransferase (ALT), and body mass index (BMI) were associated with increased CFS risk, while VD and creatinine levels act as protective factors for CFS.
The SHAP dependence plot uncovered complex nonlinear relationships between biomarkers and CFS risk. Most features exhibit a U-shaped curve in their impact on CFS risk (Fig. 5d). For instance, within the range of 70.51–73.51 g/dL, TP exerted a protective effect (negative SHAP values). However, outside this range (<70.51 g/dL or >73.51 g/dL), CFS risk increased significantly. Furthermore, CFS risk decreased when creatinine levels exceeded 0.07 mg/dL, indicating its protective role.

[Attached images: 1000098849.jpg, 1000098851.jpg]