Artificial intelligence in medicine and science

rvallee · Nov 29, 2023

Although large language models (LLMs) have been the main innovation in AI in recent years, the last year has been largely spent on optimizing prompting strategies, basically how to get LLMs to reason with their data set. How you ask a question is a critical requirement to get a valid answer, LLMs need to be directed on how to think.

Microsoft published research today showing a model that achieves 90% on all dimensions of the standard medical board certification testing, largely through prompting optimization in the form of a model called MedPrompt.

Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine
https://arxiv.org/abs/2311.16452

Generalist foundation models such as GPT-4 have displayed surprising capabilities in a wide variety of domains and tasks. Yet, there is a prevalent assumption that they cannot match specialist capabilities of fine-tuned models. For example, most explorations to date on medical competency benchmarks have leveraged domain-specific training, as exemplified by efforts on BioGPT and Med-PaLM. We build on a prior study of GPT-4's capabilities on medical challenge benchmarks in the absence of special training. Rather than using simple prompting to highlight the model's out-of-the-box capabilities, we perform a systematic exploration of prompt engineering. We find that prompting innovation can unlock deeper specialist capabilities and show that GPT-4 easily tops prior leading results for medical benchmarks. The prompting methods we explore are general purpose, and make no specific use of domain expertise, removing the need for expert-curated content. Our experimental design carefully controls for overfitting during the prompt engineering process. We introduce Medprompt, based on a composition of several prompting strategies. With Medprompt, GPT-4 achieves state-of-the-art results on all nine of the benchmark datasets in the MultiMedQA suite. The method outperforms leading specialist models such as Med-PaLM 2 by a significant margin with an order of magnitude fewer calls to the model. Steering GPT-4 with Medprompt achieves a 27% reduction in error rate on the MedQA dataset over the best methods to date achieved with specialist models and surpasses a score of 90% for the first time. Beyond medical problems, we show the power of Medprompt to generalize to other domains and provide evidence for the broad applicability of the approach via studies of the strategy on exams in electrical engineering, machine learning, philosophy, accounting, law, nursing, and clinical psychology.

Also an article from Microsoft (who built MedPrompt) on their approach to prompting optimization: https://www.microsoft.com/en-us/research/blog/the-power-of-prompting/, which is also capable of reaching passing grades on several other professional certifications:

rvallee · Dec 4, 2023

Towards Accurate Differential Diagnosis with Large Language Models
https://arxiv.org/abs/2312.00164

An accurate differential diagnosis (DDx) is a cornerstone of medical care, often reached through an iterative process of interpretation that combines clinical history, physical examination, investigations and procedures. Interactive interfaces powered by Large Language Models (LLMs) present new opportunities to both assist and automate aspects of this process. In this study, we introduce an LLM optimized for diagnostic reasoning, and evaluate its ability to generate a DDx alone or as an aid to clinicians. 20 clinicians evaluated 302 challenging, real-world medical cases sourced from the New England Journal of Medicine (NEJM) case reports. Each case report was read by two clinicians, who were randomized to one of two assistive conditions: either assistance from search engines and standard medical resources, or LLM assistance in addition to these tools. All clinicians provided a baseline, unassisted DDx prior to using the respective assistive tools. Our LLM for DDx exhibited standalone performance that exceeded that of unassisted clinicians (top-10 accuracy 59.1% vs 33.6%, [p = 0.04]). Comparing the two assisted study arms, the DDx quality score was higher for clinicians assisted by our LLM (top-10 accuracy 51.7%) compared to clinicians without its assistance (36.1%) (McNemar's Test: 45.7, p < 0.01) and clinicians with search (44.4%) (4.75, p = 0.03). Further, clinicians assisted by our LLM arrived at more comprehensive differential lists than those without its assistance. Our study suggests that our LLM for DDx has potential to improve clinicians' diagnostic reasoning and accuracy in challenging cases, meriting further real-world evaluation for its ability to empower physicians and widen patients' access to specialist-level expertise.

The bottom tier isn't very useful, all clinicians will make use of resources to assist in diagnosis, but the upper one is quite significant: LLMs alone do better than LLM-assisted clinicians.

Andy · Dec 7, 2023

Is AI leading to a reproducibility crisis in science?

"During the COVID-19 pandemic in late 2020, testing kits for the viral infection were scant in some countries. So the idea of diagnosing infection with a medical technique that was already widespread — chest X-rays — sounded appealing. Although the human eye can’t reliably discern differences between infected and non-infected individuals, a team in India reported that artificial intelligence (AI) could do it, using machine learning to analyse a set of X-ray images1.

The paper — one of dozens of studies on the idea — has been cited more than 900 times. But the following September, computer scientists Sanchari Dhar and Lior Shamir at Kansas State University in Manhattan took a closer look2. They trained a machine-learning algorithm on the same images, but used only blank background sections that showed no body parts at all. Yet their AI could still pick out COVID-19 cases at well above chance level.

The problem seemed to be that there were consistent differences in the backgrounds of the medical images in the data set. An AI system could pick up on those artefacts to succeed in the diagnostic task, without learning any clinically relevant features — making it medically useless."

https://www.nature.com/articles/d41586-023-03817-6

SNT Gatchaman · Dec 7, 2023

Eric Topol at TED: Can AI catch what doctors miss? (14 mins)

"And in the medical community the thing that we don't talk much about are diagnostic medical errors."

rvallee · Dec 12, 2023

Performance of Large Language Models on a Neurology Board–Style Examination
JAMA Neurology: https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2812620

Key Points

Question What is the performance of large language models on neurology board–style examinations?

Findings In this cross-sectional study, a newer version of the large language model significantly outperformed the mean human score when given questions from a question bank approved by the American Board of Psychiatry and Neurology, answering 85.0% of questions correctly compared with the mean human score of 73.8%, while the older model scored below the human average (66.8%). Both models used confident or very confident language, even when incorrect.

Meaning These findings suggest that with further refinements, large language models could have significant applications in clinical neurology.

Abstract

Importance Recent advancements in large language models (LLMs) have shown potential in a wide array of applications, including health care. While LLMs showed heterogeneous results across specialized medical board examinations, the performance of these models in neurology board examinations remains unexplored.

Objective To assess the performance of LLMs on neurology board–style examinations.

Design, Setting, and Participants This cross-sectional study was conducted between May 17 and May 31, 2023. The evaluation utilized a question bank approved by the American Board of Psychiatry and Neurology and was validated with a small question cohort by the European Board for Neurology. All questions were categorized into lower-order (recall, understanding) and higher-order (apply, analyze, synthesize) questions based on the Bloom taxonomy for learning and assessment. Performance by LLM ChatGPT versions 3.5 (LLM 1) and 4 (LLM 2) was assessed in relation to overall scores, question type, and topics, along with the confidence level and reproducibility of answers.

Main Outcomes and Measures Overall percentage scores of 2 LLMs.

Results LLM 2 significantly outperformed LLM 1 by correctly answering 1662 of 1956 questions (85.0%) vs 1306 questions (66.8%) for LLM 1. Notably, LLM 2’s performance was greater than the mean human score of 73.8%, effectively achieving near-passing and passing grades in the neurology board examination. LLM 2 outperformed human users in behavioral, cognitive, and psychological–related questions and demonstrated superior performance to LLM 1 in 6 categories. Both LLMs performed better on lower-order than higher-order questions, with LLM 2 excelling in both lower-order and higher-order questions. Both models consistently used confident language, even when providing incorrect answers. Reproducible answers of both LLMs were associated with a higher percentage of correct answers than inconsistent answers.

Conclusions and Relevance Despite the absence of neurology-specific training, LLM 2 demonstrated commendable performance, whereas LLM 1 performed slightly below the human average. While higher-order cognitive tasks were more challenging for both models, LLM 2’s results were equivalent to passing grades in specialized neurology examinations. These findings suggest that LLMs could have significant applications in clinical neurology and health care with further refinements.

rvallee · Dec 12, 2023

ChatGPT outperforming human doctors in behavioral, cognitive, and psychological–related questions is both hilarious and not the least bit surprising. This is by far the weakest area in all medicine, and the gap will only grow wider.

AIs don't read between the lines or see thinly veiled language with its alternative meaning. Instead they consider the bulk of what's out there, and the bulk of it disagrees with the "special menu" that we get served from with a wink and a smirk.

tmrw · Dec 15, 2023

„Google is rolling out new AI models for health care. Here’s how doctors are using them“

„KEY POINTS

Google on Wednesday announced a suite of new health-care AI models called MedLM.

The move marks Google’s latest attempt to monetize health-care industry AI tools, as competition for market share remains fierce between competitors like Amazon and Microsoft.

Google said it plans to introduce health-care-specific versions of Gemini, the company’s newest and “most capable” AI model, to MedLM in the future.“

https://www.cnbc.com/2023/12/13/how-doctors-are-using-googles-new-ai-models-for-health-care.html

One interesting point in this article:

When Google announced Med-PaLM 2 in March, the company initially said it could be used to answer questions like “What are the first warning signs of pneumonia?” and “Can incontinence be cured?” But as the company has tested the technology with customers, the use cases have changed, according to Greg Corrado, head of Google’s health AI.

Corrado said clinicians don’t often need help with “accessible” questions about the nature of a disease, so Google hasn’t seen much demand for those capabilities from customers. Instead, health organizations often want AI to help solve more back-office or logistical problems, like managing paperwork.

My bolding. So the one area where AI probably could actually make a difference for patients, like pwME, and the doctors are not interested in it. Who would have thought?

rvallee · Dec 21, 2023

Discovery of a structural class of antibiotics with explainable deep learning
https://www.nature.com/articles/s41586-023-06887-8

(Paragraph breaks added for legibility, seriously what's up with academia and illegible walls of text?)

The discovery of novel structural classes of antibiotics is urgently needed to address the ongoing antibiotic resistance crisis1,2,3,4,5,6,7,8,9. Deep learning approaches have aided in exploring chemical spaces1,10,11,12,13,14,15; these typically use black box models and do not provide chemical insights. Here we reasoned that the chemical substructures associated with antibiotic activity learned by neural network models can be identified and used to predict structural classes of antibiotics.

We tested this hypothesis by developing an explainable, substructure-based approach for the efficient, deep learning-guided exploration of chemical spaces. We determined the antibiotic activities and human cell cytotoxicity profiles of 39,312 compounds and applied ensembles of graph neural networks to predict antibiotic activity and cytotoxicity for 12,076,365 compounds.

Using explainable graph algorithms, we identified substructure-based rationales for compounds with high predicted antibiotic activity and low predicted cytotoxicity. We empirically tested 283 compounds and found that compounds exhibiting antibiotic activity against Staphylococcus aureus were enriched in putative structural classes arising from rationales.

Of these structural classes of compounds, one is selective against methicillin-resistant S. aureus (MRSA) and vancomycin-resistant enterococci, evades substantial resistance, and reduces bacterial titres in mouse models of MRSA skin and systemic thigh infection. Our approach enables the deep learning-guided discovery of structural classes of antibiotics and demonstrates that machine learning models in drug discovery can be explainable, providing insights into the chemical substructures that underlie selective antibiotic activity.

rvallee · Jan 12, 2024

Towards Conversational Diagnostic AI
https://arxiv.org/abs/2401.05654

At the heart of medicine lies the physician-patient dialogue, where skillful history-taking paves the way for accurate diagnosis, effective management, and enduring trust. Artificial Intelligence (AI) systems capable of diagnostic dialogue could increase accessibility, consistency, and quality of care. However, approximating clinicians' expertise is an outstanding grand challenge. Here, we introduce AMIE (Articulate Medical Intelligence Explorer), a Large Language Model (LLM) based AI system optimized for diagnostic dialogue.
AMIE uses a novel self-play based simulated environment with automated feedback mechanisms for scaling learning across diverse disease conditions, specialties, and contexts.

We designed a framework for evaluating clinically-meaningful axes of performance including history-taking, diagnostic accuracy, management reasoning, communication skills, and empathy. We compared AMIE's performance to that of primary care physicians (PCPs) in a randomized, double-blind crossover study of text-based consultations with validated patient actors in the style of an Objective Structured Clinical Examination (OSCE).

The study included 149 case scenarios from clinical providers in Canada, the UK, and India, 20 PCPs for comparison with AMIE, and evaluations by specialist physicians and patient actors. AMIE demonstrated greater diagnostic accuracy and superior performance on 28 of 32 axes according to specialist physicians and 24 of 26 axes according to patient actors. Our research has several limitations and should be interpreted with appropriate caution. Clinicians were limited to unfamiliar synchronous text-chat which permits large-scale LLM-patient interactions but is not representative of usual clinical practice. While further research is required before AMIE could be translated to real-world settings, the results represent a milestone towards conversational diagnostic AI.

rvallee · Jan 12, 2024

Blog post from Google AI about the above paper:

AMIE: A research AI system for diagnostic medical reasoning and conversations
https://blog.research.google/2024/01/amie-research-ai-system-for-diagnostic.html

Inspired by this challenge, we developed Articulate Medical Intelligence Explorer (AMIE), a research AI system based on a LLM and optimized for diagnostic reasoning and conversations. We trained and evaluated AMIE along many dimensions that reflect quality in real-world clinical consultations from the perspective of both clinicians and patients. To scale AMIE across a multitude of disease conditions, specialties and scenarios, we developed a novel self-play based simulated diagnostic dialogue environment with automated feedback mechanisms to enrich and accelerate its learning process. We also introduced an inference time chain-of-reasoning strategy to improve AMIE’s diagnostic accuracy and conversation quality. Finally, we tested AMIE prospectively in real examples of multi-turn dialogue by simulating consultations with trained actors.

Comparison is somewhat limited in that to account for the limitations of the AI system, participating PCPs only interacted through a chat system, rather than the in-person conversation that is typical in health care. However the benefits of remote asynchronous conversation not being limited by physician time and clinical space are still massively significant, and it's only a matter of time before real-time conversation is available, more likely this year than not.

rvallee · Jan 19, 2024

US FDA clears DermaSensor's AI-powered skin cancer detecting device
https://www.reuters.com/business/he...ered-skin-cancer-detecting-device-2024-01-17/

The FDA clearance is based on a study which showed that the device had a 96% sensitivity in detecting skin cancers. A negative result through the device had a 97% chance of being benign, according to the company.

When brought in contact with skin, the device emits light and captures the wavelengths of light reflecting off cellular structures beneath the skin's surface.

It subsequently utilizes an algorithm to analyze the reflected light and detect the presence of skin cancer.
...
Company CEO Cody Simmons said the device will be priced through a subscription model at $199 a month for five patients or $399 a month for unlimited use.

DermaSensor is currently commercially available in Europe and Australia.

rvallee · May 7, 2024

Use of a Large Language Model to Assess Clinical Acuity of Adults in the Emergency Department
https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2818387

Key Points

Question Can a large language model (LLM) accurately assess clinical acuity in the emergency department (ED)?

Findings This cross-sectional study of 251 401 adult ED visits investigated the potential for an LLM to classify acuity levels of patients in the ED based on the Emergency Severity Index across 10 000 patient pairs. The LLM demonstrated accuracy of 89% and was comparable with human physician classification in a 500-pair subsample.

Meaning These findings suggest that LLMs could accurately identify higher-acuity patient presentation when given pairs of presenting histories extracted from patients’ first ED documentation.

Results From a total of 251 401 adult ED visits, a balanced sample of 10 000 patient pairs was created wherein each pair comprised patients with disparate ESI acuity scores. Across this sample, the LLM correctly inferred the patient with higher acuity for 8940 of 10 000 pairs (accuracy, 0.89 [95% CI, 0.89-0.90]). Performance of the comparator LLM (accuracy, 0.84 [95% CI, 0.83-0.84]) was below that of its successor. Among the 500-pair subsample that was also manually classified, LLM performance (accuracy, 0.88 [95% CI, 0.86-0.91]) was comparable with that of the physician reviewer (accuracy, 0.86 [95% CI, 0.83-0.89]).

Conclusions and Relevance In this cross-sectional study of 10 000 pairs of ED visits, the LLM accurately identified the patient with higher acuity when given pairs of presenting histories extracted from patients’ first ED documentation. These findings suggest that the integration of an LLM into ED workflows could enhance triage processes while maintaining triage quality and warrants further investigation.

rvallee · May 9, 2024

This week, DeepMind released AlphaFold v3, which has improved its accuracy in predicting protein function, but also enabled simulation of interactions between proteins and ligands (small molecules), ions (single atoms), as well as with DNA and RNA. Many (most?) drugs are ligands, so the impact on drug development will be huge. Especially as this keeps improving, whatever v3 can do, v4 will do better, and v5 even better, and so on.

rvallee · May 10, 2024

rvallee said:
This week, DeepMind released AlphaFold v3, which has improved its accuracy in predicting protein function, but also enabled simulation of interactions between proteins and ligands (small molecules), ions (single atoms), as well as with DNA and RNA. Many (most?) drugs are ligands, so the impact on drug development will be huge. Especially as this keeps improving, whatever v3 can do, v4 will do better, and v5 even better, and so on.

Dr Unumatz agrees with this, it could be incredibly revolutional.

Generative AI combined with AlphaFold 3 will be so transformative in drug discovery that we are likely to see a massive acceleration in speed at a fraction of the cost of conventional approaches. A revolution in medicine has now begun!

We are working on very big metabolomics and immunology papers for #MECFS and #LongCovid - we believe it will contain many actionable targets, this could be a great case to screen for drugs using AI. Maybe I should write a grant to do that experiment or hope others will

My understanding, and I may be wrong here, is that this model is able to tell if and where a molecule will bind to a protein, one of the hardest steps in drug development. It acts as a lock and key with a humungous number of possible combinations, most aren't a fit. Knowing the site where a molecule binds tells what function it activates on the target protein, which is also part of AlphaFold v3, at least in part, by comparing to a database of known proteins and inferring from how that part of that protein reacts. That database will only improve over time, as obviously it depends on other parts of the protein based on how it's folded.

This has the potential to speed up drug development by thousands of times or more. You can only get improvements like this with information technologies. Nothing beats this kind of rapid progress. This isn't 2x, or even 20x, it can be millionsx.

Amw66 · May 19, 2024

This is mind blowing : I took photos of old and recent blood tests (in Greek) and sent them to #chatGPT 4-o. It extracted correctly all tests and associated results, identified which tests had abnormal values and finally commented on associations between abnormal findings.

Amw66 · May 19, 2024

rvallee said:
Dr Unumatz agrees with this, it could be incredibly revolutional.

My understanding, and I may be wrong here, is that this model is able to tell if and where a molecule will bind to a protein, one of the hardest steps in drug development. It acts as a lock and key with a humungous number of possible combinations, most aren't a fit. Knowing the site where a molecule binds tells what function it activates on the target protein, which is also part of AlphaFold v3, at least in part, by comparing to a database of known proteins and inferring from how that part of that protein reacts. That database will only improve over time, as obviously it depends on other parts of the protein based on how it's folded.

This has the potential to speed up drug development by thousands of times or more. You can only get improvements like this with information technologies. Nothing beats this kind of rapid progress. This isn't 2x, or even 20x, it can be millionsx.

It's the equivalent of electricity

glennthefrog · May 20, 2024

what can I say, this picture represents what I think about the issue of doctors and AI:

glennthefrog · May 21, 2024

I have hashimoto's thyroiditis and hypothyroidism, while being treated with 200mg of synthetic hormone. I uploaded my thyroid function test to GTP4o. My T4 and T3 levels are normal, but the TSH is near zero, at a value of 0,07uU/mL. It correctly identified that I was now hyperthyroidism due to consuming an excess amount of synthetic hormone. My GP, seeing me in person and with the same test, considered the test was normal and that nothing needed to be modified. A week later I went on my own to a private endocrinologist and she diagnosed me with hyperthyroidism and lowered my dose, which, according to my new tests, was the right choice, and it needs to be lowered even further. What I mean is that, in 5 minutes and completely free, GTP4o surpassed my GP, and this isn't even a language model that is optimized for medical tasks, imagine what future medical models will be able to do!

Jaybee00 · Jun 17, 2024

https://www.nytimes.com/2024/06/17/...ytcore-ios-share&referringSource=articleShare

rvallee · Jun 21, 2024

Video about AI use in cancer detection from youtuber Coldfusion. Great channel for technology enthusiasts, and always pleasant to listen.

Artificial intelligence in medicine and science

Senior Member (Voting Rights)

Senior Member (Voting Rights)

Senior Member (Voting rights)

Senior Member (Voting Rights)

Senior Member (Voting Rights)

Senior Member (Voting Rights)

Established Member (Voting Rights)

Senior Member (Voting Rights)

Senior Member (Voting Rights)

Senior Member (Voting Rights)

Senior Member (Voting Rights)

Senior Member (Voting Rights)

Senior Member (Voting Rights)

Senior Member (Voting Rights)

Senior Member (Voting Rights)

Senior Member (Voting Rights)

Established Member (Voting Rights)

Established Member (Voting Rights)

Senior Member (Voting Rights)

Senior Member (Voting Rights)