A generalist medical language model for disease diagnosis assistance, 2025, Liu et al.

SNT Gatchaman

A generalist medical language model for disease diagnosis assistance
Liu, Xiaohong; Liu, Hao; Yang, Guoxing; Jiang, Zeyu; Cui, Shuguang; Zhang, Zhaoze; Wang, Huan; Tao, Liyuan; Sun, Yongchang; Song, Zhu; Hong, Tianpei; Yang, Jin; Gao, Tianrun; Zhang, Jiangjiang; Li, Xiaohu; Zhang, Jing; Sang, Ye; Yang, Zhao; Xue, Kanmin; Wu, Song; Zhang, Ping; Yang, Jian; Song, Chunli; Wang, Guangyu

The delivery of accurate diagnoses is crucial in healthcare and represents the gateway to appropriate and timely treatment. Although recent large language models (LLMs) have demonstrated impressive capabilities in few-shot or zero-shot learning, their effectiveness in clinical diagnosis remains unproven.

Here we present MedFound, a generalist medical language model with 176 billion parameters, pre-trained on a large-scale corpus derived from diverse medical text and real-world clinical records. We further fine-tuned MedFound to learn physicians’ inferential diagnosis using a chain-of-thought approach based on a self-bootstrapping strategy, and introduced a unified preference alignment framework to align it with standard clinical practice. Extensive experiments demonstrate that our medical LLM outperforms other baseline LLMs and specialized models in in-distribution (common diseases), out-of-distribution (external validation) and long-tailed distribution (rare diseases) scenarios across eight specialties. Further ablation studies indicate the effectiveness of key components in our medical LLM training approach.

We conducted a comprehensive evaluation of the clinical applicability of LLMs for diagnosis, comprising an artificial intelligence (AI) versus physician comparison, an AI-assistance study and a human evaluation framework. Our proposed framework incorporates eight clinical evaluation metrics, covering capabilities such as medical record summarization, diagnostic reasoning and risk management.

Our findings demonstrate the feasibility of using the model to assist physicians with disease diagnosis as part of the clinical workflow.

Link | PDF (Nature Medicine)
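
The abstract does not spell out the preference alignment objective. As a generic illustration of what preference alignment over physician-preferred versus rejected diagnostic rationales typically involves, here is a minimal sketch of the standard direct preference optimization (DPO) loss; this is a stand-in for the technique family, not the paper's "unified preference alignment framework", and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss (Rafailov et al., 2023): push the policy to prefer
    the chosen completion over the rejected one, relative to a frozen
    reference model. Inputs are summed per-sequence log-probabilities."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy tensors standing in for log-probs of a preferred vs. a rejected
# diagnostic rationale under the policy and the frozen reference model.
loss = dpo_loss(torch.tensor([-10.0]), torch.tensor([-12.0]),
                torch.tensor([-11.0]), torch.tensor([-11.5]))
print(loss.item())
```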
 
We expanded our experiments to examine the performance of the LLMs in diagnosing rare diseases characterized by long-tailed distributions [24]. Previous models have shown effectiveness in identifying common diseases [25], but their performance tends to decline in classifying rarer diseases in few-shot or zero-shot scenarios. As illustrated in Fig. 3a, the diseases follow a long-tailed distribution, with common diseases covering 99% of the population and the remaining 1% comprising a wide variety of less common diseases. To evaluate the adaptability of the LLMs in diagnosing a broad spectrum of conditions, we used a zero-shot learning setting on the MedDX-Rare dataset, which includes 2,105 rare diseases drawn from the long-tailed distribution across eight specialties (Fig. 3b and Extended Data Fig. 6a). Bar plots in Fig. 3c illustrate the Top-3 accuracy of MedFound-DX-PA for each fine-grained rare disease within each specialty, and radar plots show the overall performance of each specialty across diseases (as detailed in Methods). MedFound-DX-PA excelled across all specialties, ranging from 77.4% (95% CI: 76.8%, 78.0%) to 84.4% (95% CI: 83.9%, 84.9%), with an average of 80.7% (95% CI: 80.1%, 81.2%) (Fig. 3c). GPT-4o achieved the second-best performance, ranging from 57.2% (95% CI: 56.5%, 57.9%) to 63.1% (95% CI: 62.4%, 63.8%), with an average of 59.1% (95% CI: 58.4%, 59.8%). This trend was also observed in the Top-1 macro accuracy (Extended Data Fig. 6b).
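
For context on the headline numbers, the sketch below computes Top-3 accuracy with a percentile-bootstrap 95% CI, one conventional way to produce metrics in the form reported above. The paper reports macro (per-disease) accuracy for the radar plots; this sketch shows the simpler pooled version, and all function and variable names are illustrative, not taken from the authors' code.

```python
import numpy as np

def top_k_accuracy(ranked_preds, labels, k=3):
    """Fraction of cases whose true diagnosis appears in the top-k predictions.

    ranked_preds: list of lists, each ordered from most to least likely diagnosis.
    labels: list of true diagnosis labels, one per case.
    """
    hits = [label in preds[:k] for preds, label in zip(ranked_preds, labels)]
    return float(np.mean(hits))

def bootstrap_ci(ranked_preds, labels, k=3, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for top-k accuracy."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample cases with replacement
        scores.append(top_k_accuracy([ranked_preds[i] for i in idx],
                                     [labels[i] for i in idx], k=k))
    lo, hi = np.percentile(scores, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return top_k_accuracy(ranked_preds, labels, k=k), (lo, hi)

# Toy usage: the true diagnosis is ranked 1st, 3rd and 4th, respectively.
preds = [["flu", "cold", "covid"],
         ["asthma", "copd", "pneumonia"],
         ["gout", "oa", "ra", "lupus"]]
labels = ["flu", "pneumonia", "lupus"]
acc, (lo, hi) = bootstrap_ci(preds, labels, k=3)
print(f"Top-3 accuracy: {acc:.3f} (95% CI: {lo:.3f}, {hi:.3f})")
```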

The code is available for scientific research and non-commercial use on GitHub at https://github.com/medfound/medfound.

The pre-trained models are publicly available (https://huggingface.co/medicalai/MedFound-7B, https://huggingface.co/medicalai/MedFound-176B).
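
Since the weights are on the Hugging Face Hub, a minimal loading sketch using the standard transformers API might look like the following; the prompt format and generation settings here are placeholder assumptions, so check the repository's model card for the authors' recommended usage.

```python
# Minimal sketch: load MedFound-7B with the standard Hugging Face transformers API.
# Assumes the repo follows the usual causal-LM conventions; the prompt and
# generation settings below are placeholders, not the authors' recommendations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "medicalai/MedFound-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Patient note: ...\nProvide a differential diagnosis with reasoning:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```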
 
But NB —

Medical large language models are vulnerable to data-poisoning attacks (2025)
Alber, Daniel Alexander; Yang, Zihao; Alyakin, Anton; Yang, Eunice; Rai, Sumedha; Valliani, Aly A.; Zhang, Jeff; Rosenbaum, Gabriel R.; Amend-Thomas, Ashley K.; Kurland, David B.; Kremer, Caroline M.; Eremiev, Alexander; Negash, Bruck; Wiggan, Daniel D.; Nakatsuka, Michelle A.; Sangwon, Karl L.; Neifert, Sean N.; Khan, Hammad A.; Save, Akshay Vinod; Palla, Adhith; Grin, Eric A.; Hedman, Monika; Nasir-Moin, Mustafa; Liu, Xujin Chris; Jiang, Lavender Yao; Mankowski, Michal A.; Segev, Dorry L.; Aphinyanaphongs, Yindalon; Riina, Howard A.; Golfinos, John G.; Orringer, Daniel A.; Kondziolka, Douglas; Oermann, Eric Karl

The adoption of large language models (LLMs) in healthcare demands a careful analysis of their potential to spread false medical knowledge. Because LLMs ingest massive volumes of data from the open Internet during training, they are potentially exposed to unverified medical knowledge that may include deliberately planted misinformation.

Here, we perform a threat assessment that simulates a data-poisoning attack against The Pile, a popular dataset used for LLM development. We find that replacing just 0.001% of training tokens with medical misinformation yields harmful models that are more likely to propagate medical errors. Furthermore, we discover that corrupted models match the performance of their corruption-free counterparts on the open-source benchmarks routinely used to evaluate medical LLMs. Using biomedical knowledge graphs to screen medical LLM outputs, we propose a harm mitigation strategy that captures 91.9% of harmful content (F1 = 85.7%).

Our algorithm provides a unique method to validate stochastically generated LLM outputs against hard-coded relationships in knowledge graphs. In view of current calls for improved data provenance and transparent LLM development, we hope to raise awareness of emergent risks from LLMs trained indiscriminately on web-scraped data, particularly in healthcare where misinformation can potentially compromise patient safety.

Link | PDF (Nature Medicine) [Open Access]
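
The mitigation described above screens LLM outputs against a biomedical knowledge graph. The toy sketch below illustrates the underlying idea of validating extracted (subject, relation, object) claims against hard-coded relationships; the triples, names and extraction step are all invented for illustration and do not reproduce the authors' pipeline.

```python
# Toy illustration of knowledge-graph screening: extract (subject, relation,
# object) claims from an LLM's output and flag any triple that is absent from
# a trusted biomedical knowledge graph. All names and triples are invented;
# a real system would use an ontology such as UMLS and a proper
# relation-extraction step rather than a hand-written set.

TRUSTED_TRIPLES = {
    ("metformin", "treats", "type 2 diabetes"),
    ("aspirin", "interacts_with", "warfarin"),
}

def screen_claims(extracted_triples):
    """Return claims not supported by the knowledge graph (for human review)."""
    return [t for t in extracted_triples if t not in TRUSTED_TRIPLES]

# Suppose a (hypothetical) extraction step pulled these claims from model output:
claims = [
    ("metformin", "treats", "type 2 diabetes"),   # supported -> passes
    ("metformin", "treats", "viral pneumonia"),   # unsupported -> flagged
]
for subj, rel, obj in screen_claims(claims):
    print(f"Flagged unverified claim: {subj} {rel} {obj}")
```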


(Not sure what this all means when the "deliberately planted misinformation" is widely understood to be the gold standard of evidence-based medicine.)
 