NATURAL: End-To-End Causal Effect Estimation from Unstructured Natural Language Data

Discussion in 'Other health news and research' started by forestglip, Jul 24, 2024 at 4:49 PM.

  1. forestglip

    forestglip Senior Member (Voting Rights)

    Messages:
    492
    NATURAL: End-To-End Causal Effect Estimation from Unstructured Natural Language Data

    "Does semaglutide protect kidney health? Do later school times promote the well-being of children? These types of question about cause and effect drive decisions across medicine, policy, and business. Randomized controlled trials (RCTs) are the most trusted mechanism to answer causal questions by estimating treatment effects. Unfortunately, clinical trials take several years and millions of dollars to maybe approve a drug, and RCTs are often infeasible for many critical policy questions. Yet, a sudden outbreak or pandemic gives us mere days and a handful of potential facts to make vital decisions. Our aim is to bolster the sources of causal information available to us and accelerate the extraction of useful insights from them.

    Observational studies offer pre-trial insights but demand structured data. Meanwhile, a wealth of information lies untapped in online forums. For instance, thousands with diabetes, migraines, or Long Covid share their treatment experiences on dedicated subreddits. This data is rich, diverse, and accessible – but unstructured. In this work, we introduce a pipeline that turns unstructured text data like this into causal insights.

    NATURAL is a large-language-model-based pipeline that turns unstructured text data into meaningful treatment effects. We used social media data to test its performance against real-world RCTs comparing several diabetes and migraine drugs. For clinical trials, NATURAL predicted average treatment effects (ATEs) that fell within three percentage points of their ground truth counterparts! This suggests that unstructured text data is indeed a rich source of causal effect information, and NATURAL is a first step towards an automated pipeline to tap this resource.

    We are excited to expand its data sources as well as applications:
    1. Using online forum conversations to better understand policy interventions.
    2. Estimating individualized treatment effects on-demand from electronic health records.
    3. Prioritizing clinical trial investment for neglected diseases, based on real lived experiences.
    4. Repurposing drugs and uncovering hidden potential in existing medications.
    5. Detecting rare adverse effects of drugs via safety monitoring in large, diverse populations."
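
For readers unfamiliar with the term, the average treatment effect (ATE) mentioned in the quoted summary can be written down in a couple of lines. This is the standard textbook definition, not notation taken from the paper: the symbols Y(1), Y(0), \tau and \hat{\tau} are my own choice, and the binary outcome is an assumption (it is what makes "percentage points" a natural unit).

```latex
% Y(1), Y(0): potential outcomes with and without treatment (binary, assumed)
\tau = \mathbb{E}[Y(1) - Y(0)] = P(Y(1) = 1) - P(Y(0) = 1)

% "Within three percentage points" of the trial result then reads as
\lvert \hat{\tau}_{\text{text}} - \tau_{\text{RCT}} \rvert \le 0.03
```

Here \hat{\tau}_{\text{text}} stands for the estimate obtained from forum text and \tau_{\text{RCT}} for the effect reported by the corresponding randomized trial.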
     
    Sean, Mij and Hutan like this.
  2. forestglip

    forestglip Senior Member (Voting Rights)

    Messages:
    492
    End-To-End Causal Effect Estimation from Unstructured Natural Language Data

    Nikita Dhawan, Leonardo Cotta, Karen Ullrich, Rahul G. Krishnan, Chris J. Maddison

    Abstract
    Knowing the effect of an intervention is critical for human decision-making, but current approaches for causal effect estimation rely on manual data collection and structuring, regardless of the causal assumptions. This increases both the cost and time-to-completion for studies. We show how large, diverse observational text data can be mined with large language models (LLMs) to produce inexpensive causal effect estimates under appropriate causal assumptions. We introduce NATURAL, a novel family of causal effect estimators built with LLMs that operate over datasets of unstructured text. Our estimators use LLM conditional distributions (over variables of interest, given the text data) to assist in the computation of classical estimators of causal effect. We overcome a number of technical challenges to realize this idea, such as automating data curation and using LLMs to impute missing information. We prepare six (two synthetic and four real) observational datasets, paired with corresponding ground truth in the form of randomized trials, which we used to systematically evaluate each step of our pipeline. NATURAL estimators demonstrate remarkable performance, yielding causal effect estimates that fall within 3 percentage points of their ground truth counterparts, including on real-world Phase 3/4 clinical trials. Our results suggest that unstructured text data is a rich source of causal effect information, and NATURAL is a first step towards an automated pipeline to tap this resource.
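
To make "classical estimators of causal effect" a bit more concrete, here is a minimal sketch of one such estimator, inverse propensity weighting (IPW), in its self-normalized form. It is not the authors' code. It assumes the LLM step has already turned each post into a binary treatment, a binary outcome, and an estimated probability of receiving the treatment; the function names `ipw_ate` and `within_points` are hypothetical, and the numbers in the usage note are made up for illustration.

```python
from typing import Sequence


def ipw_ate(
    treatment: Sequence[int],
    outcome: Sequence[float],
    propensity: Sequence[float],
) -> float:
    """Self-normalized inverse-propensity-weighted estimate of the ATE.

    treatment[i]  : 1 if person i took the treatment, 0 otherwise
    outcome[i]    : observed outcome for person i (e.g. 1 = improved)
    propensity[i] : estimated P(treatment = 1 | covariates of person i)
    """
    if not (len(treatment) == len(outcome) == len(propensity)):
        raise ValueError("all inputs must have the same length")

    num_treated = den_treated = 0.0
    num_control = den_control = 0.0
    for t, y, e in zip(treatment, outcome, propensity):
        if not 0.0 < e < 1.0:
            raise ValueError("propensities must lie strictly between 0 and 1")
        if t == 1:
            w = 1.0 / e            # rare treated cases get more weight
            num_treated += w * y
            den_treated += w
        else:
            w = 1.0 / (1.0 - e)    # rare untreated cases get more weight
            num_control += w * y
            den_control += w

    if den_treated == 0.0 or den_control == 0.0:
        raise ValueError("need at least one treated and one untreated record")

    # Difference between the weighted mean outcomes of the two groups.
    return num_treated / den_treated - num_control / den_control


def within_points(estimate: float, truth: float, points: float = 3.0) -> bool:
    """True if two effects (as proportions) differ by at most `points` percentage points."""
    return abs(estimate - truth) * 100.0 <= points
```

As a usage example, `ipw_ate([1, 1, 0, 0], [1, 0, 0, 0], [0.5, 0.5, 0.5, 0.5])` reduces to a plain difference in group means because every weight is equal, and `within_points(0.10, 0.12)` checks whether an estimate of 10% lands within three percentage points of a trial result of 12%. With real forum data the hard part is everything before this function: deciding which posts to include and getting trustworthy treatment, outcome, and propensity values out of the text.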

    Link | PDF (ArXiv)
     
    Peter Trewhitt and Hutan like this.
  3. forestglip

    forestglip Senior Member (Voting Rights)

    Messages:
    492
    Peter Trewhitt and Hutan like this.
  4. Hutan

    Hutan Moderator Staff Member

    Messages:
    28,035
    Location:
    Aotearoa New Zealand
    I see the benefits. Of course there are potential harms too. Social media records can easily be the subject of campaigns. For example, consider all the supposed testimony of recovery as a result of the Lightning Process, while testimony of harm from the therapy on social media is possibly limited by people feeling ashamed that they undertook the therapy and/or failed to improve.
     
  5. Creekside

    Creekside Senior Member (Voting Rights)

    Messages:
    1,074
    This might be useful only for the first few uses. Then when people figure out how to benefit from manipulating social media, it becomes useless.
     
    LJord, forestglip and Peter Trewhitt like this.
  6. forestglip

    forestglip Senior Member (Voting Rights)

    Messages:
    492
    Maybe for drawing conclusions directly from the model output, but I think it could stay very valuable as a hypothesis generating tool. Even if 9 out of 10 outputs are incorrect, it still has the potential to unearth and focus in on treatments people are talking about in the corners of the internet that no researcher will ever see. Then they can go and do RCTs and prove these one way or the other.
     
    Peter Trewhitt likes this.
  7. rvallee

    rvallee Senior Member (Voting Rights)

    Messages:
    13,026
    Location:
    Canada
    I think there is a lot of potential value to this, but I always saw it more in the context of patient cohorts. Social media is still a good starting point. Medical research needs to be a lot more flexible, or outside of biomedical research it will simply stall completely, if that hasn't already happened.
     
    Trish, forestglip and Peter Trewhitt like this.
