Benchmarking large language models for cell-free RNA diagnostic biomarker discovery
Abstract
Large language models can synthesize biomedical knowledge, parse vast amounts of data, and generate code, positioning them as promising tools for biomarker discovery from high-throughput omics data.
Here, we benchmark six models from OpenAI, Anthropic, and Google on plasma cell-free RNA datasets spanning three clinical cohorts: Kawasaki disease versus multisystem inflammatory syndrome in children, active tuberculosis versus symptomatic respiratory controls, and myalgic encephalomyelitis/chronic fatigue syndrome versus sedentary controls. We evaluate literature-guided nomination of diagnostic gene panels for downstream machine learning and autonomous construction of end-to-end classifiers from raw count matrices to held-out test predictions.
Despite prompt adherence issues, model-nominated panels recapitulate canonical immune pathways and outperform random panels across cohorts, even matching differential gene expression baselines in the tuberculosis cohort. End-to-end automation proves feasible but is model- and task-dependent.
One model approaches conventional performance for Kawasaki disease versus multisystem inflammatory syndrome in children, whereas performance decreases for tuberculosis and myalgic encephalomyelitis/chronic fatigue syndrome cohorts.
These findings delineate current capabilities and limitations of large language models in diagnostics and open a path for their future use in biomarker discovery.
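The comparison of model-nominated panels against random panels can be illustrated with a minimal sketch. This is not the authors' pipeline: the synthetic data, the mean-expression panel score, and the function names (`auc`, `panel_score`, `compare_to_random`, `make_toy_data`) are all assumptions made for illustration, and the study itself used downstream machine learning with held-out test sets rather than this simplified score.

```python
import random


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def panel_score(expr, panel):
    """Score each sample as the mean expression over the panel's genes (an assumed, simplistic classifier)."""
    return [sum(sample[g] for g in panel) / len(panel) for sample in expr]


def compare_to_random(expr, labels, panel, n_random=200, seed=0):
    """Return the panel's AUC and the fraction of random same-size panels that match or beat it."""
    rng = random.Random(seed)
    genes = list(range(len(expr[0])))
    observed = auc(panel_score(expr, panel), labels)
    null = [
        auc(panel_score(expr, rng.sample(genes, len(panel))), labels)
        for _ in range(n_random)
    ]
    frac_at_least = sum(a >= observed for a in null) / n_random
    return observed, frac_at_least


def make_toy_data(n_samples=60, n_genes=200, informative=(0, 1, 2, 3, 4), shift=1.5, seed=1):
    """Synthetic expression matrix (hypothetical data): informative genes are shifted upward in cases."""
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n_samples)]
    expr = [
        [rng.gauss(shift if (y == 1 and g in informative) else 0.0, 1.0) for g in range(n_genes)]
        for y in labels
    ]
    return expr, labels


if __name__ == "__main__":
    expr, labels = make_toy_data()
    observed, frac = compare_to_random(expr, labels, panel=[0, 1, 2, 3, 4])
    print(f"panel AUC: {observed:.3f}; fraction of random panels >= panel: {frac:.3f}")
```

Drawing the random panels at the same size as the nominated panel keeps the comparison fair, since larger panels tend to score higher by chance alone.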
Web | DOI | PDF | Nature Communications | Open Access
Gaudio, Hunter A.; Bliss, Andrew; Loy, Conor J.; Eweis-LaBolle, Daniel; Gardella, Anne E.; De Vlaminck, Iwijn