Multi-scale data improves performance of machine learning model for long COVID identification, 2026, Guardo et al.

SNT Gatchaman · Wednesday at 11:24 AM

Multi-scale data improves performance of machine learning model for long COVID identification

Guardo, Christopher; Xinmeng, Zhang; Gangireddy, Srushti; Chao, Yan; Kerchberger, V Eric; Dickson, Alyson L; Pfaff, Emily R; Master, Hiral; Yi, Xin; Basford, Melissa; Chute, Christopher G; Tran, Nguyen K; Mancuso, Salvatore; Syed, Toufeeq Ahmed; Zhongming, Zhao; QiPing, Feng; Haendel, Melissa; Lunt, Christopher; Harris, Paul A; Lang, Li; Ginsburg, Geoffrey S; Denny, Joshua C; Roden, Dan M; Wei-Qi, Wei

BACKGROUND
Long COVID affects a substantial proportion of the over 778 million individuals infected with SARS-CoV-2, yet predictive models remain limited in scope. While existing efforts, such as the National COVID Cohort Collaborative (N3C), have leveraged electronic health record (EHR) data for risk prediction and identification, accumulating evidence points to additional contributions from social, behavioral, and genetic factors.

METHODS
Using a diverse cohort of SARS-CoV-2-infected individuals (n > 17,200) from the NIH All of Us Research Program, we investigated whether integrating EHR data with survey-based and genomic information improves model performance.

RESULTS
Our multi-scale approach outperforms the EHR-only model’s area under the receiver operating curve of 0.736 (95% CI: 0.730, 0.741), achieving an area of 0.748 (0.741, 0.755). Among the top predictors, active-duty service status and self-reported fatigue are the most informative survey features.

CONCLUSIONS
These findings highlight the importance of incorporating multi-scale data to improve risk stratification and inform personalized interventions for long COVID. However, the relative increase in accuracy is modest, and the cost of collecting genetic and survey data should be considered before implementation.

PLAIN LANGUAGE SUMMARY
Long COVID is a chronic condition triggered by infection with SARS-CoV-2. Often it is difficult to diagnose long COVID, as many of the condition’s symptoms overlap with other diseases. Most computer programs developed to identify long COVID use only a patient’s electronic health record history, even though research has shown associations of long COVID to health survey and genetic data. Here, we aimed to create a program using all three of these data sources and assess the program’s ability to identify long COVID versus a version that only uses patient electronic health records. We find that the program using all three sources shows a more accurate identification of long COVID than the version using only electronic health records. This suggests that using this data can help institutions better identify who has long COVID. However, the increase in accuracy is small, and collecting additional data could be costly.
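To put "small" in numbers, here is a quick sketch of the arithmetic using only the AUC figures quoted in the abstract. The variable names are made up for illustration, and this is not the authors' analysis, just the difference between the two reported values.

```python
# AUC values and 95% CIs as quoted in the abstract.
# Variable names are illustrative assumptions, not from the paper.
ehr_auc, ehr_ci = 0.736, (0.730, 0.741)
multi_auc, multi_ci = 0.748, (0.741, 0.755)

# Absolute gain in AUC from adding survey and genomic data.
abs_gain = round(multi_auc - ehr_auc, 3)

# Gain relative to the EHR-only baseline, as a percentage.
rel_gain_pct = round(100 * (multi_auc - ehr_auc) / ehr_auc, 2)

# Gap between the lower bound of the multi-scale CI and the
# upper bound of the EHR-only CI (0 means the intervals just touch).
ci_gap = round(multi_ci[0] - ehr_ci[1], 3)

print(f"absolute gain: {abs_gain}")
print(f"relative gain: {rel_gain_pct}%")
print(f"gap between intervals: {ci_gap}")
```

On the reported figures, the gain is about 0.012 AUC, or roughly 1.6% relative to the EHR-only model, and the two confidence intervals meet at 0.741.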

Web | DOI | PDF | Nature Communications Medicine | Open Access

SNT Gatchaman · Wednesday at 11:32 AM

I don't understand what they're trying to do here. It reads as if they're trying to predict the diagnosis of Long COVID and finding that it tends to happen with patients that report fatigue.

The five highest contributing features are the proportion of outpatient visits per month a participant had after their COVID19 diagnosis, a diagnosis for dyspnea or shortness of breath, military active-duty service status reported in a survey, self-reported fatigue from survey data, and a participant’s age.

Maybe it's useful to record that LC is common in fit healthy active duty military members. Doubt that'll have any bearing on the thinking of the fear-avoidance people.

In addition, social and behavioral factors—often captured through surveys—have been shown to influence both acute and long-term COVID-19 outcomes. Findings from AoU, for example, revealed significant reductions in physical activity (e.g., step count) following the pandemic, with disproportionate impacts on socioeconomically disadvantaged populations.

"Social and behavioural factors" carries a bit of an implication there. This all seems to be framed as: "pandemic happened, people freaked out what with the years-long lockdowns and the fact that everybody died (except it was just the flu), so I guess that's why no-one exercises any more."

SNT Gatchaman · Wednesday at 11:50 AM

Figure 2 highlights

chr19:4719431:G:A_A
and
chr10:79946568:AG_G

It's late, so I've only Googled quickly, but I think the first one (on chr 19) is DPP9, which came up in Integrative Genome-Wide Association Studies of COVID-19 Susceptibility and Hospitalization Reveal Risk Loci for Long COVID (2025)

The second (on ch 10) might be surfactant D if the Google search listing is meaningful. That might be a predisposing factor for more severe acute respiratory disease, if so.

rvallee · Wednesday at 5:30 PM

I don't really understand the value of this. It all depends on patient-submitted information involving health care, and so cannot be predictive in the sense that it still needs a consult where someone accurately records the case presentation, and then they could make a diagnosis anyway. Obviously the active duty stuff is worthless outside of research purposes. Good grief.

In addition, social and behavioral factors—often captured through surveys—have been shown to influence both acute and long-term COVID-19 outcomes. Findings from AoU, for example, revealed significant reductions in physical activity (e.g., step count) following the pandemic, with disproportionate impacts on socioeconomically disadvantaged populations.

If a reduction in functioning captured through reduced activity is considered behavioral, then why bother assigning meaning to words? Here they clearly don't mean it the same way as "sickness behavior" but in an actual lifestyle/wellness way. Is a similar reduction in activity caused by cancer also behavioral? Although is the reduction in activity social or behavioral? Who knows?! And why physical? It's been 6 years, it's not credible to continue to argue this is about not being able to jog at the same pace as before, people have to reduce all activities, including work and socializing. Which still does not make it a social factor!

What about people walking less because they're paraplegic? Is this behavioral? When does it become medical? This all looks close to completely arbitrary to me. We need words to have coherent meanings or there is literally no point having a discussion. This feels like playing a game with a 4-year-old who changes the rules every damn second: what used to be the safe base is now lava, and actually they win by being tagged more, not less.

This is about as useful as "someone made a medical appointment" as a predictor of health problems. I mean, sure, yeah, correct, but, useful?
