Pipeline-optimized machine learning for chronic fatigue syndrome diagnosis: A lightweight, interpretable model using blood biochemical and metabolomic data
[Line breaks added]
Introduction and Background
Chronic fatigue syndrome (CFS) is a debilitating multisystem disorder with persistent fatigue and functional impairment, yet remains underdiagnosed due to symptom heterogeneity and the lack of objective biomarkers.
Developing a lightweight, interpretable diagnostic model requires systematic optimization of the entire analytical pipeline—from control group selection to biomarker identification and model construction.
Method
We developed a comprehensive pipeline optimization framework using UK Biobank metabolomic and blood biochemical data (1137 CFS cases; 66,838 controls). Unlike previous studies, our control group included both healthy individuals and patients with CFS-overlapping conditions.
We employed stratified bootstrap sampling (1000 iterations) instead of traditional random sampling to ensure balanced covariate distributions between cases and controls. Our systematic approach compared 7 missing value imputation methods, 9 feature selection techniques, and 11 machine learning/deep learning models.
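The abstract does not spell out how the stratified bootstrap was implemented. As a rough illustration only, the sketch below draws controls with replacement inside each covariate stratum so that the control sample mirrors the case distribution. The function name, the `stratum_of` key function, the `ratio` parameter, and the toy records are all assumptions made for this example, not details from the paper.

```python
import random
from collections import Counter, defaultdict


def stratified_bootstrap_controls(cases, controls, stratum_of, ratio=1, seed=0):
    """Draw controls with replacement so each stratum matches the case counts.

    cases, controls: lists of records.
    stratum_of: hypothetical function mapping a record to a stratum key,
                for example (sex, age band).
    ratio: number of controls drawn per case within each stratum.
    """
    rng = random.Random(seed)
    case_counts = Counter(stratum_of(c) for c in cases)

    # Group the available controls by stratum.
    pools = defaultdict(list)
    for ctrl in controls:
        pools[stratum_of(ctrl)].append(ctrl)

    sample = []
    for stratum, n_cases in case_counts.items():
        pool = pools.get(stratum)
        if not pool:
            continue  # no controls available in this stratum
        # Sampling with replacement is what makes this a bootstrap draw.
        sample.extend(rng.choices(pool, k=n_cases * ratio))
    return sample


# Toy data (invented for illustration).
cases = [{"sex": "F", "age_band": "50-59"}] * 3 + [{"sex": "M", "age_band": "60-69"}]
controls = [
    {"sex": s, "age_band": a, "id": i}
    for i, (s, a) in enumerate(
        [("F", "50-59")] * 5 + [("M", "60-69")] * 5 + [("M", "40-49")] * 5
    )
]
key = lambda r: (r["sex"], r["age_band"])

# 1000 bootstrap replicates, one seed per iteration.
replicates = [
    stratified_bootstrap_controls(cases, controls, key, seed=i) for i in range(1000)
]
```

Each replicate has the same stratum counts as the case group, so a model fitted on any replicate sees cases and controls with matching covariate distributions.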
Feature selection incorporated collinearity exclusion and sequential forward selection to identify the 10 most influential biomarkers. Model evaluation extended beyond standard metrics (ROC-AUC, accuracy, sensitivity, specificity, F1-score, NPV, and PPV) to include Matthews Correlation Coefficient (MCC) for comprehensive performance assessment.
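For readers unfamiliar with MCC, it is computed from the four cells of the confusion matrix and ranges from -1 (total disagreement) to +1 (perfect prediction). A minimal sketch of the standard formula follows; the function name and the convention of returning 0.0 when the denominator is zero are choices made for this example.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0.0 when a row or column of the matrix is empty.
    return numerator / denominator if denominator else 0.0
```

Unlike accuracy, MCC uses all four cells, so it stays informative when classes are heavily imbalanced, as with 1137 cases against 66,838 controls.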
We enhanced model interpretability through both Mendelian randomization (MR) for causal inference and SHAP (SHapley Additive exPlanations) analysis for feature contribution quantification. Clinical utility was evaluated using decision curve analysis (DCA), with additional validation through Spearman's correlation and restricted cubic spline (RCS) analyses examining biomarker relationships with core CFS symptoms.
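Decision curve analysis compares strategies by their net benefit at a given threshold probability. The sketch below uses the standard net benefit definition (true positives per patient minus false positives per patient weighted by the threshold odds); the function names and toy values are assumptions for illustration and are not taken from the paper.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating everyone whose predicted risk >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy: classify every individual as positive."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Toy labels and predicted probabilities (invented for illustration).
y_true = [1, 1, 0, 0]
y_prob = [0.9, 0.8, 0.2, 0.1]
model_nb = net_benefit(y_true, y_prob, threshold=0.5)
treat_all_nb = net_benefit_treat_all(y_true, threshold=0.5)
```

A decision curve plots these quantities over a range of thresholds; a model is considered useful at thresholds where its net benefit exceeds both the treat-all and treat-none (net benefit 0) strategies.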
Results
The optimized pipeline yielded a lightweight model combining Bayesian Principal Component Analysis (BPCA) imputation, NearMiss undersampling, and random forest classification using only 10 biomarkers plus three covariates (BMI, age, and gender).
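The abstract does not say which NearMiss variant was used. As an illustration of the general idea, the sketch below implements NearMiss version 1, which keeps the majority-class samples whose mean distance to their k nearest minority-class samples is smallest. The function name, the choice of k, and the toy points are assumptions for this example.

```python
import math


def near_miss_1(majority, minority, n_keep, k=3):
    """Keep the n_keep majority points closest, on average, to their
    k nearest minority points (NearMiss version 1)."""

    def mean_dist_to_nearest_minority(x):
        dists = sorted(math.dist(x, m) for m in minority)
        nearest = dists[:k]
        return sum(nearest) / len(nearest)

    ranked = sorted(majority, key=mean_dist_to_nearest_minority)
    return ranked[:n_keep]


# Toy 2-D points (invented for illustration).
minority = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
majority = [(0.5, 0.5), (5.0, 5.0), (10.0, 10.0), (1.0, 1.0)]
kept = near_miss_1(majority, minority, n_keep=2)
```

Undersampling this way retains the controls that lie nearest the cases in feature space, which are the hardest ones to separate.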
This model achieved exceptional diagnostic performance (accuracy = 0.939, ROC-AUC = 0.979, MCC = 0.878, Balanced Performance Score = 0.859 across 11 metrics), effectively discriminating CFS from both healthy controls and overlapping conditions.
DCA demonstrated substantial net clinical benefit across a wide threshold range (0.01–0.98), confirming strong clinical applicability. MR analysis established causal relationships for six biomarkers (urea, total protein, glucose, total bilirubin, leucine, vitamin D; P < 0.05).
SHAP-based interpretability analysis, corroborated by Spearman's correlation and RCS analyses, revealed that elevated glucose and leucine levels exacerbated CFS symptoms, providing mechanistic insights aligned with personalized risk directionality.
Conclusion
Through systematic pipeline optimization—from stratified control selection to comprehensive model comparison and multi-faceted interpretability analysis—we developed a lightweight, highly interpretable CFS diagnostic model using exclusively objective biomarkers.
To ensure reproducibility, this methodology was implemented via the ClinMetML framework. Our model effectively discriminates CFS from both healthy individuals and overlapping conditions, providing a robust, cost-effective foundation for early diagnosis and personalized therapeutic management.
Web | DOI | Computational Biology and Chemistry | Paywall
Li, Junrong; Cao, Hanyu; Zhu, Zirun; Zhai, Xiaobing; Xing, Abao; Zeng, Shuowen; Luo, Gang; Sha, Yuyang; Li, Peng; Li, Kefeng

