Multi-scale data improves performance of machine learning model for long COVID identification
BACKGROUND
Long COVID affects a substantial proportion of the over 778 million individuals infected with SARS-CoV-2, yet predictive models remain limited in scope. While existing efforts, such as the National COVID Cohort Collaborative (N3C), have leveraged electronic health record (EHR) data for risk prediction and identification, accumulating evidence points to additional contributions from social, behavioral, and genetic factors.
METHODS
Using a diverse cohort of SARS-CoV-2-infected individuals (n > 17,200) from the NIH All of Us Research Program, we investigated whether integrating EHR data with survey-based and genomic information improves model performance.
RESULTS
Our multi-scale approach outperforms the EHR-only model, achieving an area under the receiver operating characteristic curve (AUROC) of 0.748 (95% CI: 0.741, 0.755) versus 0.736 (95% CI: 0.730, 0.741). Among the top predictors, active-duty service status and self-reported fatigue are the most informative survey features.
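For readers unfamiliar with the metric, the following is a minimal sketch of how an AUROC and a percentile-bootstrap 95% confidence interval can be computed for a set of model scores. It is illustrative only: the abstract does not state how the reported intervals were obtained, and the function names (`auroc`, `bootstrap_ci`), the number of resamples, and the seed are assumptions introduced here, not the authors' implementation.

```python
import random


def auroc(labels, scores):
    """AUROC via the Mann-Whitney U statistic (ties get average ranks).

    labels: list of 0/1 outcomes; scores: list of model scores.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the run of tied scores starting at position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank for the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative")
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def bootstrap_ci(labels, scores, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUROC (hypothetical helper, assumed settings)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

Applied to the scores of two models on the same individuals, this gives one point estimate and one interval per model. For the values reported above, the absolute difference is 0.748 - 0.736 = 0.012, a relative gain of about 1.6%.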
CONCLUSIONS
These findings highlight the importance of incorporating multi-scale data to improve risk stratification and inform personalized interventions for long COVID. However, the relative increase in accuracy is modest, and the cost of collecting genetic and survey data should be considered before implementation.
PLAIN LANGUAGE SUMMARY
Long COVID is a chronic condition triggered by infection with SARS-CoV-2. It is often difficult to diagnose, as many of its symptoms overlap with those of other diseases. Most computer programs developed to identify long COVID use only a patient’s electronic health record history, even though research has shown associations between long COVID and both health survey and genetic data. Here, we aimed to create a program using all three of these data sources and to compare its ability to identify long COVID with that of a version that uses only patient electronic health records. We find that the program using all three sources identifies long COVID more accurately than the version using only electronic health records. This suggests that using these additional data can help institutions better identify who has long COVID. However, the increase in accuracy is small, and collecting additional data could be costly.
Web | DOI | PDF | Nature Communications Medicine | Open Access
Guardo, Christopher; Xinmeng, Zhang; Gangireddy, Srushti; Chao, Yan; Kerchberger, V Eric; Dickson, Alyson L; Pfaff, Emily R; Master, Hiral; Yi, Xin; Basford, Melissa; Chute, Christopher G; Tran, Nguyen K; Mancuso, Salvatore; Syed, Toufeeq Ahmed; Zhongming, Zhao; QiPing, Feng; Haendel, Melissa; Lunt, Christopher; Harris, Paul A; Lang, Li; Ginsburg, Geoffrey S; Denny, Joshua C; Roden, Dan M; Wei-Qi, Wei