Idea: Web app to compile all ME/CFS study test results

Discussion in 'General ME/CFS discussion' started by forestglip, Sep 6, 2024.

  1. forestglip

    forestglip Senior Member (Voting Rights)

    Messages:
    789
    I downloaded every abstract from PubMed for the search term "chronic fatigue syndrome" (in quotes). I wrote a script to send every abstract, one by one, to the Claude API and ask whether it is original research on ME/CFS. Here's the prompt:

    I've sent 121 abstracts so far as a test. It's a bit expensive: there are about 8,400 abstracts, and it'll cost about $40-50 to get responses for all of them. There's another Claude model that is 10 times cheaper, but the answers I was getting from it weren't making much sense. This model seems pretty good at making decisions. A lot of the cost is the length of the prompt above, so I might have to figure out a way to shorten it without losing accuracy.
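    Roughly, the script is just a loop like this (a sketch, not my exact code: the model name and the abstracts dict are placeholders):

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    PROMPT = "..."  # the classification prompt above

    def classify_abstract(abstract):
        # Ask Claude for a YES/NO/MAYBE judgement plus a short explanation
        message = client.messages.create(
            model="claude-3-5-sonnet-20240620",  # placeholder: whichever Sonnet version is current
            max_tokens=200,
            messages=[{"role": "user", "content": PROMPT + "\n\nAbstract:\n" + abstract}],
        )
        return message.content[0].text

    # abstracts is assumed to be a dict of {pmid: abstract text} built from the PubMed download
    labels = {pmid: classify_abstract(text) for pmid, text in abstracts.items()}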

    Anyway, here are the results of the first few. I attached a text file with all the abstract responses I've gotten so far.

     


    hotblack likes this.
  2. hotblack

    hotblack Senior Member (Voting Rights)

    Messages:
    271
    Location:
    UK
    Interesting, are you using Sonnet or Opus? If you don’t mind rate limits, maybe it’s worth seeing what Gemini comes up with; I’ve been doing some simple experiments (unrelated to this) with it. Gemini 1.5 Flash may not be powerful enough, and the free-tier limits on 1.5 Pro are low enough that it would take too long, but 1.0 Pro may fit the bill if it produces good enough results and you can schedule the work over a week.
    https://ai.google.dev/pricing

    And perhaps an obvious or silly question, but have you tried any of the local models to see if they’re up to the task? It depends on what hardware you’ve got access to. I’ve only experimented with small (2B) models for basic tasks, which wouldn’t be enough here, but maybe if you can run the larger models?
     
    Last edited: Sep 7, 2024
  3. forestglip

    forestglip Senior Member (Voting Rights)

    Messages:
    789
    This was with Sonnet. Opus would cost about 5 times more.

    Interesting, looks like for the free tier of 1.0 Pro, I can do all 8,400 in about 10 hours. I'm not too hopeful it'll be much better than Claude Haiku, which was pretty bad, but I'll try with the same 121 abstracts.
     
    hotblack likes this.
  4. forestglip

    forestglip Senior Member (Voting Rights)

    Messages:
    789
    I don't have a GPU on my computer. I once tried the biggest model that could comfortably run on my laptop, and it was both pretty bad at understanding compared to the biggest ones and incredibly slow.
     
    hotblack likes this.
  5. forestglip

    forestglip Senior Member (Voting Rights)

    Messages:
    789
    Oh I wasn't looking carefully at the rate limits. There's also a daily limit of 1500 requests, so it would be closer to a week. But it's also a lot cheaper if I just want to pay and do it quickly. Somewhere around $5-10 for all of them, which isn't that bad.

    Anyway, I ran the same 121 abstracts through Gemini 1.0 Pro. It had a different opinion on 28 of them, 9 of which were a flip from a YES to a NO or vice versa. For the rest, one of the models said MAYBE.
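    The comparison itself is trivial once both sets of answers are saved; something like this (claude_labels and gemini_labels here stand for dicts of {pmid: YES/NO/MAYBE} parsed from the responses):

    # Find every abstract where the two models disagree, and the subset
    # where the disagreement is a hard YES/NO flip rather than a MAYBE.
    mismatches = {
        pmid: (claude_labels[pmid], gemini_labels[pmid])
        for pmid in claude_labels
        if claude_labels[pmid] != gemini_labels[pmid]
    }
    flips = {pmid: pair for pmid, pair in mismatches.items() if "MAYBE" not in pair}
    print(len(mismatches), "disagreements,", len(flips), "YES/NO flips")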

    Gemini said this for one: "YES: The study is a meta-analysis of six randomized controlled trials, pooling data to investigate whether cognitive behavioral therapy (CBT) effectiveness is moderated by depressive symptoms in patients with ME/CFS. This analysis is original research and focuses specifically on ME/CFS."

    Even though the prompt explicitly says: "NO: If it describes a review, meta-analysis, other non-original research, or does not specifically focus on CFS/ME/ME-CFS."

    I think it might have to be a better model, or there will be a lot of mistakes like this.

    Here are the first five that didn't match, along with each model's explanation. There's a character limit in these posts, so all 28 are in an attached text file.

     


    Last edited: Sep 7, 2024
    hotblack likes this.
  6. forestglip

    forestglip Senior Member (Voting Rights)

    Messages:
    789
    Oh, I can filter that list of studies down on PubMed to just clinical trials. Does that include everything that would have tested something? Not sure. But it comes down from 8,426 to 516. Much better.
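    If anyone wants to reproduce the search programmatically, Biopython's Entrez wrapper can do it; the filter below is my guess at a query that approximates the "Clinical Trial" checkbox on the PubMed website:

    from Bio import Entrez

    Entrez.email = "you@example.com"  # NCBI asks for a contact address

    # Same search, restricted by publication type
    query = '"chronic fatigue syndrome" AND "clinical trial"[Publication Type]'
    handle = Entrez.esearch(db="pubmed", term=query, retmax=10000)
    pmids = Entrez.read(handle)["IdList"]
    print(len(pmids))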

    My plan is to try the embedding approach. After having Claude tell me which of these match the criteria, as above, I'll try to get the full text of each somehow, send that to Claude, and ask it to list every test and result in detail. Depending on how long the papers are and how many there are, that step might still turn out very expensive; we'll see.
     
    hotblack likes this.
  7. forestglip

    forestglip Senior Member (Voting Rights)

    Messages:
    789
    Oops, I had used Gemini 1.5 Flash previously, not 1.0 Pro. I just ran it again with 1.0 Pro, and it's worse: 37 mismatches out of 120. Here are just a few in case anyone is interested, but I'm going to just use Claude Sonnet.

    Edit: Oh right, I need to filter for more than clinical trials to get tests like serotonin levels. But no filter seems to include the deep phenotyping study. I'll probably have to filter backwards by downloading them all and eliminating the ones that are tagged as reviews, commentaries, etc.

    Edit 2: No, this doesn't seem as straightforward as I had hoped. A couple main issues:

    1. I don't think this will work for interventions, at least not for getting a nice binary result on the heatmap to show whether an intervention made things better or worse, since there can be multiple outcomes per intervention. Maybe it would be okay just for observational studies.

    2. It's expensive to give it full studies, and it's not very good at following instructions perfectly if the text is really long. It was about 5 cents for just the methods and results sections of a random study. If I'm doing 1,000 studies, that's $50, or more if there are significantly longer ones.

    I may have burned out my brain for a while too. So I'll leave this alone for now, I think. I still think there might be something cool if I could get a bunch of data like this:

    <test>SF-36 Physical Function score</test><result>increased</result>
    <test>xanthine metabolism compounds in urine samples</test><result>increased</result>

    And then make a map that groups similar items together (e.g. serotonin would sit closer to dopamine than to a symptom questionnaire) and shows items that are increased as one color, like blue dots, and items that are decreased as red dots. If you see a lot of blue dots or red dots clumped together in one spot, you can zoom in and see that many somewhat similar tests have gotten the same result.
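    A rough sketch of how that map could be built, assuming the <test>/<result> pairs have already been extracted (the loader and the embedding model name are placeholders):

    from sentence_transformers import SentenceTransformer
    import umap
    import matplotlib.pyplot as plt

    # tests is a list of test names, results a parallel list of "increased"/"decreased"
    tests, results = load_extracted_pairs()  # hypothetical loader for the <test>/<result> data

    # Embed the test names so semantically similar tests end up close together,
    # then project the embeddings down to 2D for plotting.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(tests)
    coords = umap.UMAP(n_neighbors=15, min_dist=0.1).fit_transform(embeddings)

    colors = ["blue" if r == "increased" else "red" for r in results]
    plt.scatter(coords[:, 0], coords[:, 1], c=colors, s=10)
    plt.show()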

    I'm not even sure this would work as well as I hope. Anyway, maybe a project for the future or for someone else.
     
    Last edited: Sep 7, 2024
    Nightsong, hotblack and alktipping like this.
  8. hotblack

    hotblack Senior Member (Voting Rights)

    Messages:
    271
    Location:
    UK
    @forestglip Thanks for sharing your progress, comparisons and results. I’m not surprised your brain is a little fried; mine has been just from following! There are some really interesting ideas here.
     
    alktipping and forestglip like this.
  9. hotblack

    hotblack Senior Member (Voting Rights)

    Messages:
    271
    Location:
    UK
    alktipping and forestglip like this.
  10. Nightsong

    Nightsong Senior Member (Voting Rights)

    Messages:
    587
    On cost: you don't actually have to use the provided APIs. When ChatGPT first came out I wrote a quick Python script to interact with it using browser instrumentation (Selenium/ChromeDriver with a few modifications) - much cheaper!

    It also occurs to me that the hallucination risk might be reduced by using ensemble methods (e.g. a consensus of multiple LLMs) or cross-verification (where one LLM evaluates the output of another LLM for correctness).
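    A consensus check could be as simple as only trusting labels every model agrees on; classify() here is just a stand-in for whatever wrapper you use around each provider's API:

    def consensus_label(abstract, models):
        # Query each model and keep the label only if they all agree;
        # anything contested gets flagged for manual review.
        votes = [classify(abstract, model) for model in models]
        if len(set(votes)) == 1:
            return votes[0]
        return "REVIEW"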

    Lots of interesting ideas on this thread. I've no energy to take on a project like this but hope someone picks it up and runs with it.
     
    alktipping, hotblack and forestglip like this.
  11. kasi-leko

    kasi-leko New Member

    Messages:
    1
    It seems that the PubMed format has a field PT ("publication type") that indicates if the article is a review: https://pubmed.ncbi.nlm.nih.gov/help/#pt (the list: https://pubmed.ncbi.nlm.nih.gov/help/#publication-types), so you don't have to use an LLM for that. This would also answer your question 1.
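    For reference, a sketch of what that looks like with Biopython's MEDLINE parser (the exclusion list is only an example of publication types one might drop):

    from Bio import Entrez, Medline

    Entrez.email = "you@example.com"

    # pmids: a list of PubMed IDs, e.g. from an earlier search.
    # Fetch MEDLINE-format records and drop anything whose PT (publication type)
    # field marks it as non-original research.
    EXCLUDE = {"Review", "Systematic Review", "Meta-Analysis", "Comment", "Editorial", "Letter"}

    handle = Entrez.efetch(db="pubmed", id=pmids, rettype="medline", retmode="text")
    records = Medline.parse(handle)
    originals = [r for r in records if not EXCLUDE & set(r.get("PT", []))]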

    If you really want to make something that works, you will have to label some data manually. This is absolutely necessary if you at least want to know how well your extraction system works. Incidentally, you could use the labelled data to train a classifier using one of the transformer models specifically trained on medical data (for example Med-BERT, ClinicalBERT, etc). The advantages of classifiers are that they don't rely on text generation, which is prone to hallucinations no matter what; their performance is quantifiable (as opposed to using an LLM with no manually labelled data); and they're cheaper than querying an LLM.
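    A rough outline of that fine-tuning step with Hugging Face transformers, assuming a small hand-labelled set of abstracts (the model choice and field names are only illustrative):

    from datasets import Dataset
    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)

    # labelled is assumed to be a list of {"text": abstract, "label": 0 or 1} dicts
    dataset = Dataset.from_list(labelled).train_test_split(test_size=0.2)

    tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
    model = AutoModelForSequenceClassification.from_pretrained(
        "emilyalsentzer/Bio_ClinicalBERT", num_labels=2)

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, padding="max_length")

    dataset = dataset.map(tokenize, batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="cfs-classifier", num_train_epochs=3),
        train_dataset=dataset["train"],
        eval_dataset=dataset["test"],
    )
    trainer.train()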

    As for measurements, you probably want to train a dedicated NER model, for the same reasons as above. Of course, you could always try an LLM and ask it to extract the relevant information into JSON format, as long as you have some manually labelled data in hand to evaluate the LLM outputs.
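    Whichever route you take, the evaluation against the hand-labelled set is straightforward; roughly, with gold and predicted as dicts mapping each paper to a list of {"test": ..., "result": ...} items:

    def evaluate(gold, predicted):
        # Compare extracted (test, result) pairs against the hand-labelled gold set
        # and report precision and recall over all papers.
        tp = fp = fn = 0
        for pmid, gold_items in gold.items():
            gold_set = {(d["test"].lower(), d["result"]) for d in gold_items}
            pred_set = {(d["test"].lower(), d["result"]) for d in predicted.get(pmid, [])}
            tp += len(pred_set & gold_set)
            fp += len(pred_set - gold_set)
            fn += len(gold_set - pred_set)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return precision, recall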
     
    hotblack and forestglip like this.
