LLMs Fail at Doctor-Patient Conversations – Study
Introduction
Large Language Models (LLMs) such as GPT-4 and Med-PaLM 2 have shown promising capabilities in medical diagnostics, patient communication, and clinical decision-making. However, their integration into healthcare settings demands rigorous evaluation of their reliability, accuracy, and ethical soundness. A recent study published in Nature Medicine proposes the Conversational Reasoning Assessment Framework for Testing in Medicine (CRAFT-MD) to systematically assess LLMs in clinical scenarios.
The CRAFT-MD Framework
CRAFT-MD evaluates LLMs through four key components (a minimal sketch of the multi-turn evaluation loop follows this list):
- Case Vignettes – Simulated patient cases are presented to LLMs to test their diagnostic accuracy.
- Multi-Turn Conversations – LLMs interact with virtual patients in extended dialogues, mimicking real clinical encounters.
- Single-Turn Conversations – The model provides diagnostic and treatment recommendations based on isolated queries.
- Summarized Conversations – LLMs synthesize and summarize patient interactions to assess clarity and coherence.
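To make the multi-turn setting concrete, here is a minimal sketch of such an evaluation loop: one LLM role-plays the patient from the case vignette, a second acts as the doctor, and the final diagnosis is graded. The `chat()` stub, the prompts, the turn budget, and the exact-match grading are all illustrative assumptions, not the study's exact protocol.

```python
# Minimal sketch of a CRAFT-MD-style multi-turn evaluation loop.
# Assumptions: `chat()` is a stand-in for any chat-completion API, and the
# prompts, turn budget, and exact-match grading are illustrative only.

def chat(messages: list[dict]) -> str:
    """Hypothetical LLM call: takes OpenAI-style messages, returns reply text."""
    raise NotImplementedError("plug in a real model client here")

def run_multi_turn_case(vignette: str, true_diagnosis: str,
                        max_turns: int = 10) -> bool:
    # The model under test plays the doctor...
    doctor = [{"role": "system", "content":
               "You are a physician. Ask one focused question per turn. "
               "When confident, answer 'DIAGNOSIS: <condition>'."}]
    # ...while a second LLM role-plays the patient, grounded in the vignette
    # but instructed not to volunteer everything at once.
    patient = [{"role": "system", "content":
                "You are the patient described below. Answer only what is "
                "asked; do not reveal the diagnosis.\n" + vignette}]

    patient_says = "Hello doctor, I haven't been feeling well."
    for _ in range(max_turns):
        doctor.append({"role": "user", "content": patient_says})
        doctor_says = chat(doctor)
        doctor.append({"role": "assistant", "content": doctor_says})

        if doctor_says.startswith("DIAGNOSIS:"):
            guess = doctor_says.removeprefix("DIAGNOSIS:").strip()
            # Exact match is a deliberate simplification of the study's grading.
            return guess.lower() == true_diagnosis.lower()

        patient.append({"role": "user", "content": doctor_says})
        patient_says = chat(patient)
        patient.append({"role": "assistant", "content": patient_says})

    return False  # no diagnosis committed within the turn budget
```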
Key Findings and Analysis
1. Diagnostic Accuracy and Performance Disparities
The study tested several LLMs, including GPT-4, GPT-3.5, Mistral-v2-7b, and LLaMA-2-7b, across the evaluation settings. Results showed a notable decline in accuracy when shifting from multiple-choice questions (MCQs) to free-response questions (FRQs). GPT-4's free-response accuracy, for instance, was 0.334 on case vignettes and 0.399 on summarized conversations, both well below its multiple-choice performance and indicative of format-dependent variability in its reasoning.
The drop was even steeper in dynamic, patient-interactive scenarios. GPT-4, which scored 82% on structured MCQs, plummeted to 26% accuracy in simulated multi-turn patient conversations, highlighting the challenges of open-ended reasoning and adaptive questioning.
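Part of this gap is mechanical: an MCQ answer can be graded by exact match, while free-response and conversational answers require tolerant grading. The sketch below contrasts the two; the containment check is a crude stand-in for the study's actual grading procedure, and all names are illustrative.

```python
# Contrast between MCQ and free-response (FRQ) grading.
# The containment check is a crude illustrative stand-in for the
# study's grading procedure; accepted_answers is an assumed fixture.

def score_mcq(predicted: str, correct: str) -> bool:
    # MCQ grading is trivially exact: "B" either matches or it doesn't.
    return predicted.strip().upper() == correct.strip().upper()

def score_frq(predicted_text: str, accepted_answers: list[str]) -> bool:
    # FRQ grading must tolerate paraphrase and surrounding prose.
    text = predicted_text.lower()
    return any(answer.lower() in text for answer in accepted_answers)

def accuracy(per_case_results: list[bool]) -> float:
    # Reported accuracies (e.g. 0.334, 0.399) are means of per-case scores.
    return sum(per_case_results) / len(per_case_results) if per_case_results else 0.0
```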
2. Deficiencies in Patient Interaction and Information Gathering
The ability to elicit a relevant patient history is a cornerstone of effective medical practice, yet the study found that even the most advanced LLMs frequently failed to gather complete medical histories in simulated patient conversations. GPT-4, the best-performing model, retrieved the necessary history in only 71% of cases, indicating that AI models remain weak at adaptive questioning.
This highlights a critical issue: while AI models excel at synthesizing structured data, they struggle in real-world clinical dialogues, where patients may not volunteer key symptoms unless prompted with specific follow-up questions. A simple coverage metric over the finished transcript, sketched below, makes this failure mode measurable.
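The sketch below quantifies history-taking as the fraction of a case's required findings that actually surfaced in the dialogue. The findings list and naive keyword matching are illustrative assumptions, a crude stand-in for more careful (model- or expert-based) grading.

```python
# Sketch: measuring medical-history coverage in a finished transcript.
# The findings list and keyword matching are illustrative assumptions,
# a crude stand-in for more careful (model- or expert-based) grading.

def history_coverage(transcript: str, required_findings: list[str]) -> float:
    """Fraction of required history items that surfaced in the dialogue."""
    text = transcript.lower()
    elicited = [f for f in required_findings if f.lower() in text]
    return len(elicited) / len(required_findings)

example_transcript = (
    "Doctor: When did the rash start? Patient: About two weeks ago. "
    "Doctor: Does sunlight make it worse? Patient: Yes, photosensitivity is bad."
)
case_findings = ["two weeks ago", "photosensitivity", "joint pain"]

print(f"history coverage: {history_coverage(example_transcript, case_findings):.0%}")
# -> history coverage: 67%  (the model never asked about joint pain)
```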
3. Specialty-Based Performance Variations
LLMs demonstrated higher accuracy in dermatology cases due to well-structured textual descriptions, whereas they struggled significantly with complex internal medicine cases, where nuanced questioning and multi-factorial reasoning were required.
Furthermore, when multi-modal inputs (such as images for dermatological cases) were introduced, performance declined even further. This suggests that current LLMs lack robust multi-modal reasoning capabilities, which are essential for real-world clinical decision-making.
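For reference, attaching an image to a case query typically looks like the following. This uses an OpenAI-style vision message as one common convention; the helper name and file path are placeholders, not the study's pipeline.

```python
# Sketch: attaching a dermatology image to a case query.
# Uses an OpenAI-style vision message as one common convention; the
# helper name and file path are placeholders, not the study's pipeline.
import base64

def build_multimodal_query(vignette_text: str, image_path: str) -> list[dict]:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": vignette_text},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }]
```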
Challenges and Considerations for Clinical AI
- Lack of Adaptive Reasoning: Unlike human physicians, who refine their questions based on patient responses, AI models rely on pre-trained heuristics and often fail to ask crucial follow-up questions.
- Bias and Ethical Concerns: AI models trained on imbalanced datasets risk propagating diagnostic biases, particularly for underrepresented populations.
- Regulatory and Deployment Hurdles: The clinical application of LLMs requires strict validation and oversight, including compliance with HIPAA and FDA regulations.
- Medical-Legal Liability: If AI-driven diagnostics lead to incorrect or delayed diagnoses, determining accountability becomes a significant legal and ethical challenge.
The Future of Clinical LLMs
To improve their clinical viability, future iterations of LLMs must integrate real-time physician oversight, explainable AI mechanisms, and continuous validation against medical gold standards. Moreover, enhancements in multi-modal learning (text, images, lab data) are crucial to improving diagnostic robustness.
Ultimately, while AI shows promise as a clinical assistant, it remains far from replacing human physicians, especially in complex diagnostic tasks requiring contextual reasoning and interpersonal communication.
—
Reference:
Johri S, Jeong J, Tran BA, Schlessinger DI, Wongvibulsin S, Barnes LA, Zhou HY, Cai ZR, Van Allen EM, Kim D, Daneshjou R, Rajpurkar P. An evaluation framework for clinical use of large language models in patient interaction tasks. Nat Med. 2025 Jan;31(1):77-86. doi: 10.1038/s41591-024-03328-5. Epub 2025 Jan 2. PMID: 39747685.