Can Large Language Models (LLMs) like ChatGPT be used to produce useful information?

Some of these posts have been moved from the DecodeME preprint thread.


The LLM isn't actually detecting any real hype, it's just generating some text about 'hype' because similar text strings are often found in the sources that it's been trained to mimic.
Yeah LLMs are very good at confidently spouting bollocks.
 
[Image: model reliability benchmark chart]
My impression of ChatGPT is that it's good at being optimistic, seeing connections, and overinterpreting things. Likely because it's imitating hypothesis papers that are in this style. It's a bit like a Cort Johnson with a PhD in biology.
Interpretation does need to be taken with a slight pinch of salt, but it’s also important to understand that “AI” isn’t a static entity. The model out this week has very low error rates on health benchmarks: on the graph it’s essentially the difference between last week’s ChatGPT (white) and this week’s (pinks). The middle graph shows health information accuracy/reliability.

AI on its own now has lower diagnostic error rates than human doctors (interestingly, lower than human doctor plus AI, which should really be the best combination).
[Image: health benchmark comparison chart]




Context is important though: in my prompt I had asked it what was hyperbole, so that it wasn’t unrealistically optimistic. There can definitely be a tendency to guess what the user wants and be sycophantic, and here it might have overcorrected.
 
It does need to be taken with a slight pinch of salt, but it’s also important to understand that “AI” isn’t a static entity. The model out this week has very low error rates on health benchmarks, and AI on its own now has lower diagnostic error rates than human doctors (interestingly, lower than human doctor plus AI, which should really be the best combination). Let me know if you’d like me to find the references.

Context is important though: in my prompt I had asked it what was hyperbole, so that it wasn’t unrealistically optimistic. There can definitely be a tendency to guess what the user wants and be sycophantic, and here it might have overcorrected.
Of course. It all depends what we count as “low error rates on health”. I can see that making sense for objective findings on scans or whatever. But in ME/CFS, where we barely know anything, I dunno if it means much.
 
The model out this week has very low error rates on health benchmarks, and AI on its own now has lower diagnostic error rates than human doctors (interestingly, lower than human doctor plus AI, which should really be the best combination). Let me know if you’d like me to find the references.
I like that it's good at summarising information. I'm not sure how good it is at creating hypotheses. It's useful for sure, but maybe not for all tasks.

I can believe that it might be better than doctors at diagnosing illness, but this is presumably in a controlled setting for which it has been optimized.
 
Of course. It all depends what we count as “low error rates on health”. I can see that making sense for objective findings on scans or whatever. But in ME/CFS, where we barely know anything, I dunno if it means much.

I’m basing it on hearing about these.
[Image: model reliability benchmark chart]

[Image: health benchmark comparison chart]


We dismiss the potential of AI for ME at our peril, I think. It contains enormous volumes of medical information about the human body, more than any human specialist can hold, and can potentially spot connections we are missing. This will include false positives, but the suggestions are worth investigating if there may be a true positive within the bundle.
 
I’m basing it on hearing about these.
[Image: model reliability benchmark chart]

[Image: health benchmark comparison chart]


We dismiss the potential of AI for ME at our peril, I think. It contains enormous volumes of medical information about the human body, more than any human specialist can hold, and can potentially spot connections we are missing. This will include false positives, but the suggestions are worth investigating if there may be a true positive within the bundle.
But here it makes sense for objective findings. We need to be quite careful, though. Who decides what’s good “management reasoning”? Depending on who is testing the AI, they could count BPS things as the “right answer”.

So it’s always worth thinking more critically about these “AI tests”.

I’m not one to think LLMs are worth dismissing (as others will have seen, given my other opinions on the forum). But I think we should be very careful with these tests of LLMs, because who decides what is truth, and which answer is better than another? It can become very murky with things like ME.

And of course the AI absorbs the language we use. Talk about “CFS” or “CFS/ME” and it will give you BPS views; talk about “ME” and it will give you biomedical ones.
 
After using ChatGPT for a while, I've been impressed with how useful it is for some things (like math questions, or summarising information). Of course, I also asked it about hypotheses based on the genes highlighted by DecodeME and that's where I started having the feeling that it was probably capable of generating superficially credible hypotheses out of almost anything. I decided to test this.

I used https://molbiotools.com/randomgenesetgenerator.php to generate a short list of random genes, then told ChatGPT that these genes were from a study of ME/CFS, that I wanted to generate a hypothesis based on them, and that I was interested in lipid metabolism and PEM (the first two things that came to my mind). (A rough sketch of this setup, in code, follows at the end of this post.) This was the answer (I left out the last part about suggested experiments and sources):
Nice — good gene list. Below I’ll translate those genes into concrete, testable hypotheses that link lipid metabolism to post-exertional malaise (PEM) in ME/CFS, then give specific experiments/analyses you can run next. I’ll call out the handful of high-impact (load-bearing) facts with literature pointers.




Short summary of spotlighted gene functions (one line each)


  • PEX16 — peroxisome biogenesis protein; required for peroxisome membrane assembly (directly relevant to peroxisomal β-oxidation of very-long-chain fatty acids). [UniProt; genecards.org]
  • IFNA1 — type-I interferon (IFN-α); chronic IFN signaling reprograms cellular metabolism and can suppress de novo lipid synthesis while altering lipid uptake/handling. [PMC +1]
  • PKIA — potent inhibitor of PKA (cAMP-dependent protein kinase); PKA activity controls lipolysis, mitochondrial phosphorylation cascades and many metabolic responses. [UniProt; Cenabiotech]
  • CCPG1 — ER-phagy cargo receptor linking ER stress to selective autophagy; ER/ER-phagy influences lipid handling, lipoprotein processing and organelle quality control. [PMC; PubMed]
  • CYTH4 (cytohesin-4) — ARF-GEF family member involved in vesicle trafficking; localizes to lipid droplets and membranes — plausible role in lipid droplet dynamics and membrane trafficking of lipids. [proteinatlas.org; genecards.org]
  • HLA-G — immunomodulatory non-classical MHC I; can shape chronic immune tone and tolerance (could modulate inflammatory drivers of metabolic change). [PMC; Frontiers]
  • PIWIL4, CASC2, LINC01132, TTTY3 — (PIWIL4 piwi family; CASC2 & LINC01132 lncRNAs; TTTY3 Y-linked lncRNA). These are regulators (epigenetic, post-transcriptional, microRNA sponge roles) that could affect expression of metabolic genes or immune regulators. [UniProt; PMC; BioMed Central; Cenabiotech]



Mechanistic hypotheses (ranked & testable)


Below each hypothesis I list why it’s plausible and how it links to lipid metabolism → PEM.


1) Peroxisomal insufficiency (PEX16)


  • Why plausible: PEX16 is required for peroxisome biogenesis; defective peroxisomes reduce peroxisomal β-oxidation of very-long-chain fatty acids and plasmalogen synthesis (membrane lipids). Loss of peroxisomal capacity can cause accumulation of unusual lipids and reduce substrate supply to mitochondria under high-demand (exercise) states. [UniProt; genecards.org]
  • PEM link: After exertion, tissues rely more on fatty acid oxidation. If peroxisomal upstream processing is impaired, fuel flux to mitochondria is disrupted, causing delayed recovery, metabolite accumulation (lipotoxic intermediates) and immune activation → PEM.

2) Chronic type-I interferon signaling (IFNA1)


  • Why plausible: Type-I IFNs downregulate de-novo lipid/cholesterol synthesis and alter lipid uptake profiles; chronic IFN exposure can change immune cell and tissue metabolism. [PMC +1]
  • PEM link: IFN-driven metabolic shifts may blunt flexible lipid utilization during exertion and provoke pro-inflammatory lipid mediators (eicosanoids), worsening post-exertional symptoms.

3) Blunted PKA signaling (PKIA)


  • Why plausible: PKIA inhibits PKA. PKA activation (via cAMP) promotes lipolysis (hormone-sensitive lipase), mitochondrial responses and adrenergic metabolic adaptation. If PKIA is increased, the normal cAMP/PKA response to exertion may be blunted. [UniProt; Cenabiotech]
  • PEM link: During/after exertion the inability to mobilize fatty acids quickly could cause early energy deficit, increased reliance on anaerobic glycolysis, and later malaise.

4) Impaired ER-phagy (CCPG1)


  • Why plausible: CCPG1 mediates ER-selective autophagy and helps maintain ER proteostasis; ER dysfunction impacts lipid biosynthesis and VLDL assembly and can secondarily impair mitochondrial function. [PMC; PubMed]
  • PEM link: With exertional stress, increased protein/lipid folding demands or impaired ER turnover could tip cellular energetics toward failure and trigger systemic symptoms.

5) Altered lipid droplet trafficking (CYTH4)


  • Why plausible: CYTH4 is an ARF-GEF family member; proteomic/localization data show association with lipid droplets and vesicle compartments — it could regulate lipid droplet dynamics and trafficking of lipid transporters. [proteinatlas.org; genecards.org]
  • PEM link: Poor mobilization or aberrant sequestration of lipids could limit fuel availability during exercise and promote inflammatory lipid signaling afterwards.

6) Immune and regulatory modulation (HLA-G and lncRNAs)


  • Why plausible: HLA-G can dampen or reshape immune responses; lncRNAs (CASC2, LINC01132) are known regulators of signaling pathways and may modulate metabolic gene expression or microRNA networks that control lipid enzymes. TTTY3 (Y-linked lncRNA) is less characterized but could have sex-specific regulatory effects. [PMC +1; BioMed Central; Cenabiotech]
  • PEM link: Combinations of immune tolerance/activation and regulatory lncRNA changes could determine whether metabolic hits translate into symptomatic PEM.



Integrated (multi-hit) hypothesis — short


A multi-hit model where peroxisomal insufficiency (PEX16) + altered signaling (PKIA, CYTH4, CCPG1) + chronic immune reprogramming (IFNA1, HLA-G, lncRNAs) converge to produce impaired lipid mobilization/oxidation and accumulation of lipid intermediates, which causes an energetic crisis and immune activation after exertion → PEM.

You decide whether this is superficially credible.

In the human body, everything is connected to a lot of other things, so it's easy to find connections between things.
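
For anyone who wants to reproduce this (as mentioned above), here's a minimal sketch of the setup, assuming the official `openai` Python client; the gene pool, prompt wording and model name are illustrative stand-ins, not the exact ones I used:

```python
# Minimal sketch of the random-gene-set test; all names are illustrative.
# Assumes the official `openai` client and OPENAI_API_KEY in the environment.
import random

from openai import OpenAI

# Stand-in for the molbiotools random gene set generator: sample gene
# symbols from a small hypothetical pool.
GENE_POOL = ["PEX16", "IFNA1", "PKIA", "CCPG1", "CYTH4",
             "HLA-G", "PIWIL4", "CASC2", "LINC01132", "TTTY3"]
genes = random.sample(GENE_POOL, 5)

prompt = (
    f"These genes were highlighted in a study of ME/CFS: {', '.join(genes)}. "
    "I'm interested in lipid metabolism and post-exertional malaise (PEM). "
    "Please generate a hypothesis linking these genes to those features."
)

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",  # model name is an assumption
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```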
 
Absolutely this. These tools can create plausible-sounding results from anything.

They’re great when used responsibly for anything which is verifiable, as they can be trained and tuned to ensure they behave. That’s why they often make good tools for information retrieval, summarising, code completion/assistance and solving mathematics.

But for anything speculative? Absolute garbage. And often dangerously so.

One great writer, particularly if you have a technical leaning, which I think you do, is https://simonwillison.net/

Edit: this also reminds me of one great use: creating dummy data for testing systems. If you need to populate a database or something, give it the structure and get an LLM to do it!
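
For example, a rough sketch, assuming the official `openai` Python client (the table schema here is invented):

```python
# Illustrative sketch: hand an LLM a table schema and ask for dummy rows.
# Assumes the official `openai` client; the schema below is made up.
from openai import OpenAI

schema = """
CREATE TABLE patients (
    id INTEGER PRIMARY KEY,
    name TEXT,
    date_of_birth DATE,
    diagnosis TEXT
);
"""

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",  # model name is an assumption
    messages=[{
        "role": "user",
        "content": "Generate 10 realistic dummy rows as SQL INSERT "
                   f"statements for this table:\n{schema}",
    }],
)
print(response.choices[0].message.content)
```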
 
They’re great when used responsibly for anything which is verifiable, as they can be trained and tuned to ensure they behave. That’s why they often make good tools for information retrieval, summarising, code completion/assistance and solving mathematics.
Yes, they're very good at these things.

It's just like Wikipedia was always good when it came to uncontroversial topics that had definitive answers, while being unreliable and inevitably biased when it came to other things.
 
Its strength tends to be knowledge mass, and linking knowledge together.

Not reasoning or critically assessing knowledge.

If you’re stuck on something and are good with prompts, it can help you see different ways of looking at things and make new connections. Potentially valuable for genetic data. But this needs:
(1) Strong verification
(2) The expectation that 29 outputs out of 30 will be useless or illogical.
 
Yes, they're very good at these things.

It's just like Wikipedia was always good when it came to uncontroversial topics that had definitive answers, while being unreliable and inevitably biased when it came to other things.
Yeah, and in turn LLMs are good at reading Wikipedia. Meta-pedia? I guess that’s another name for Llama!
 
I was always surprised at how bad it was at finding sources. Even if you told it the exact name and address of the source, it still managed to hallucinate an entirely different source that was in some way "sensible". I'm sure that's the easiest thing to fix, and it should sensibly be fixed even for public users quite soon. I'm guessing the top verification methods are already far ahead of what people see when they use ChatGPT.
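
(By verification I mean even something as basic as this sketch: fetch the cited URL and check that the claimed title actually appears. It uses the `requests` library; the URL and title are placeholders.)

```python
# Rough sketch of basic source verification: fetch a model-cited URL and
# check the claimed title appears on the page. Placeholders throughout.
import requests

def source_checks_out(url: str, claimed_title: str) -> bool:
    """True if the page loads and contains the claimed title."""
    try:
        page = requests.get(url, timeout=10)
        page.raise_for_status()
    except requests.RequestException:
        return False  # dead link or blocked request: treat as unverified
    return claimed_title.lower() in page.text.lower()

print(source_checks_out("https://example.org/preprint/123",
                        "Title of the cited paper"))
```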
 
The more I look at Wikipedia, the more I think anything that isn’t a “big” article reads as if some biased guy wrote his opinion and added a bunch of references to make it look smart.

That might just be a reflection of the type of Wikipedia articles I read, though.
 
I was always surprised at how bad it was at finding sources. Even if you told it the exact name and address of the source, it still managed to hallucinate an entirely different source that was in some way "sensible". I'm sure that's the easiest thing to fix, and it should sensibly be fixed even for public users quite soon. I'm guessing the top verification methods are already far ahead of what people see when they use ChatGPT.

I think it’s got a lot to do with copyright, and with which websites block bots trying to access them.
 
I think it’s got a lot to do with copyright, and with which websites block bots trying to access them.
I'm not too sure about that. The sources can come from the same public domain (say, a public preprint repository), and sometimes it finds them; other times it'll just repetitively hallucinate "sensibly". I'd place larger bets on it being an LLM property when run with insufficient verification.
 
I'm not too sure about that. The sources can come from the same public domain (say, a public preprint repository), and sometimes it finds them; other times it'll just repetitively hallucinate "sensibly". I'd place larger bets on it being an LLM property when run with insufficient verification.
It could also be rate limits. Say a website allows 50 requests per IP address every 5 minutes; ChatGPT's servers use up the quota and are blocked until it resets. (I'm pretty sure this is a common defense against DDoS attacks.)
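
Something like this toy sketch (the numbers come from my example above and are purely hypothetical for any real site):

```python
# Toy fixed-window rate limiter: 50 requests per IP per 5 minutes.
# Purely illustrative of the defense described, not any real site's code.
import time
from collections import defaultdict

WINDOW_SECONDS = 5 * 60
MAX_REQUESTS = 50
hits = defaultdict(list)  # ip -> timestamps of requests in the window

def allow_request(ip: str) -> bool:
    now = time.time()
    # Keep only timestamps still inside the window, then count them.
    hits[ip] = [t for t in hits[ip] if now - t < WINDOW_SECONDS]
    if len(hits[ip]) >= MAX_REQUESTS:
        return False  # this IP is blocked until the window clears
    hits[ip].append(now)
    return True
```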
 