NATURAL: End-To-End Causal Effect Estimation from Unstructured Natural Language Data

forestglip · Jul 24, 2024

NATURAL: End-To-End Causal Effect Estimation from Unstructured Natural Language Data

"Does semaglutide protect kidney health? Do later school times promote the well-being of children? These types of question about cause and effect drive decisions across medicine, policy, and business. Randomized controlled trials (RCTs) are the most trusted mechanism to answer causal questions by estimating treatment effects. Unfortunately, clinical trials take several years and millions of dollars to maybe approve a drug, and RCTs are often infeasible for many critical policy questions. Yet, a sudden outbreak or pandemic gives us mere days and a handful of potential facts to make vital decisions. Our aim is to bolster the sources of causal information available to us and accelerate the extraction of useful insights from them.

Observational studies offer pre-trial insights but demand structured data. Meanwhile, a wealth of information lies untapped in online forums. For instance, thousands with diabetes, migraines, or Long Covid share their treatment experiences on dedicated subreddits. This data is rich, diverse, and accessible – but unstructured. In this work, we introduce a pipeline that turns unstructured text data like this into causal insights.

NATURAL is a large-language-model-based pipeline that turns unstructured text data into meaningful treatment effects. We used social media data to test its performance against real-world RCTs comparing several diabetes and migraine drugs. For clinical trials, NATURAL predicted ATEs that fell within three percentage points of their ground truth counterparts! This suggests that unstructured text data is indeed a rich source of causal effect information, and NATURAL is a first step towards an automated pipeline to tap this resource.

We are excited to expand its data sources as well as applications:

Using online forum conversations to better understand policy interventions.
Estimating individualized treatment effects on-demand from electronic health records.
Prioritizing clinical trial investment for neglected diseases, based on real lived experiences.
Repurposing drugs and uncovering hidden potential in existing medications.
Detecting rare adverse effects of drugs via safety monitoring in large, diverse populations."

forestglip · Jul 24, 2024

End-To-End Causal Effect Estimation from Unstructured Natural Language Data

Nikita Dhawan, Leonardo Cotta, Karen Ullrich, Rahul G. Krishnan, Chris J. Maddison

Abstract
Knowing the effect of an intervention is critical for human decision-making, but current approaches for causal effect estimation rely on manual data collection and structuring, regardless of the causal assumptions. This increases both the cost and time-to-completion for studies. We show how large, diverse observational text data can be mined with large language models (LLMs) to produce inexpensive causal effect estimates under appropriate causal assumptions. We introduce NATURAL, a novel family of causal effect estimators built with LLMs that operate over datasets of unstructured text. Our estimators use LLM conditional distributions (over variables of interest, given the text data) to assist in the computation of classical estimators of causal effect. We overcome a number of technical challenges to realize this idea, such as automating data curation and using LLMs to impute missing information. We prepare six (two synthetic and four real) observational datasets, paired with corresponding ground truth in the form of randomized trials, which we used to systematically evaluate each step of our pipeline. NATURAL estimators demonstrate remarkable performance, yielding causal effect estimates that fall within 3 percentage points of their ground truth counterparts, including on real-world Phase 3/4 clinical trials. Our results suggest that unstructured text data is a rich source of causal effect information, and NATURAL is a first step towards an automated pipeline to tap this resource.
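As a rough illustration of what a "classical estimator of causal effect" can look like once each post has been reduced to structured values, here is a minimal sketch of inverse propensity weighting (IPW). This is not the authors' code: the `Record` type, its fields, and `ipw_ate` are hypothetical names, and the sketch assumes the treatment, outcome, and propensity score for each report have already been extracted or estimated somehow (in the paper, that is the part LLMs assist with).

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Record:
    """One self-reported experience (hypothetical structure, for illustration)."""
    treated: bool      # True if the person took the treatment of interest
    outcome: float     # 1.0 if the outcome occurred, 0.0 otherwise
    propensity: float  # assumed estimate of P(treated | covariates), strictly in (0, 1)


def ipw_ate(records: Sequence[Record]) -> float:
    """Inverse-propensity-weighted estimate of the average treatment effect."""
    if not records:
        raise ValueError("need at least one record")
    total = 0.0
    for r in records:
        if not 0.0 < r.propensity < 1.0:
            raise ValueError("propensity must lie strictly between 0 and 1")
        if r.treated:
            # Treated reports are up-weighted by 1 / P(treated | covariates).
            total += r.outcome / r.propensity
        else:
            # Untreated reports are up-weighted by 1 / P(untreated | covariates).
            total -= r.outcome / (1.0 - r.propensity)
    return total / len(records)


# Toy data, invented purely to exercise the function.
toy = [
    Record(treated=True, outcome=1.0, propensity=0.5),
    Record(treated=True, outcome=1.0, propensity=0.5),
    Record(treated=False, outcome=0.0, propensity=0.5),
    Record(treated=False, outcome=0.0, propensity=0.5),
]
print(ipw_ate(toy))
```

The weighting corrects for who chose the treatment only if the propensities are accurate and all relevant confounders are captured, which is the kind of causal assumption the abstract refers to.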

Link | PDF (ArXiv)

forestglip · Jul 24, 2024

Twitter thread from author

https://twitter.com/user/status/1816115607290429786

One tweet: "NATURAL was also inspired by my journey with #LongCovid. We need more investment, more clinical trials, more attention. Can we accelerate this by using the experiences that patients are sharing on forums?"

Hutan · Jul 24, 2024

I see the benefits. Of course there are potential harms too. Social media records can easily be the subject of campaigns. For example, consider all the supposed testimony of recovery as a result of the Lightning Process, while testimony of harm from the therapy on social media is possibly limited by people feeling ashamed that they undertook the therapy and/or failed to improve.

Creekside · Jul 25, 2024

This might be useful only for the first few uses. Then when people figure out how to benefit from manipulating social media, it becomes useless.

forestglip · Jul 25, 2024

Creekside said:
This might be useful only for the first few uses. Then when people figure out how to benefit from manipulating social media, it becomes useless.

Maybe for drawing conclusions directly from the model output, but I think it could stay very valuable as a hypothesis-generating tool. Even if 9 out of 10 outputs are incorrect, it still has the potential to unearth and focus in on treatments people are talking about in corners of the internet that no researcher will ever see. Then researchers can go and do RCTs and prove these one way or the other.

rvallee · Jul 25, 2024

I think there is a lot of potential value to this, but I always saw it more in the context of patient cohorts. Social media is still a good starting point. Medical research needs to be a lot more flexible, or outside of biomedical research it will simply stall completely, if that hasn't already happened.
