AI models struggle in real-world medical conversations
Artificial intelligence tools like ChatGPT are touted for their promise to reduce the workload of clinicians by triaging patients, collecting medical histories and even making preliminary diagnoses.
These tools, known as large-language models, are already being used by patients to understand their symptoms and medical test results.
But while these AI models perform impressively on standardized medical tests, how well do they perform in situations that more closely mimic the real world?
Not so great, according to the results of a new study led by researchers at Harvard Medical School and Stanford University.
For their analysis, published January 2nd in Nature Medicine, the researchers designed an evaluation framework, or test, called CRAFT-MD (Conversational Reasoning Assessment Framework for Testing in Medicine) and deployed it on four large-language models to see how well they worked in settings that closely mimic actual interactions with patients.
All four large-language models performed well on medical exam-style questions, but their performance deteriorated when they were involved in conversations that more closely mimicked real-world interactions.
This gap, the researchers said, underscores a two-fold need: first, to create more realistic evaluations that better gauge whether clinical AI models are fit for real-world use, and second, to improve these tools' ability to make diagnoses from more realistic interactions before they are deployed in the clinic.
Assessment tools like CRAFT-MD, the research team says, can not only gauge AI models' real-world readiness more accurately but also help optimize their performance in the clinic.
“Our work reveals a striking paradox: While these AI models excel at medical exams, they struggle with the basic ins and outs of a doctor's visit. The dynamics of medical conversations, the need to ask the right questions at the right time, piece together scattered information, and reason based on symptoms, present unique challenges that go well beyond answering multiple-choice questions. As we move from standardized testing to these natural conversations, even the most sophisticated AI models show significant drops in diagnostic accuracy.”
Pranav Rajpurkar, senior author of the study, assistant professor of biomedical informatics at Harvard Medical School
A better test to check AI performance in practice
Currently, developers test the performance of AI models by asking them to answer multiple-choice medical questions, typically derived from the national exam for graduating medical students or from tests that residents take as part of their certification.
“This approach assumes that all relevant information is presented clearly and succinctly, often using medical terminology or buzzwords that simplify the diagnostic process, but in the real world this process is far more messy,” said Shreya Johri, co-first author of the study and a doctoral candidate in the Rajpurkar Lab at Harvard Medical School. “We need a testing framework that better reflects reality and therefore can better predict how well a model would work.”
CRAFT-MD was developed to be just such a more realistic testing framework.
To simulate real-world interactions, CRAFT-MD evaluates how well large-language models can gather information about symptoms, medications, and family history and then make a diagnosis. An AI agent poses as a patient and answers questions in a conversational, natural style. Another AI agent grades the accuracy of the final diagnosis provided by the large-language model under test. Human experts then score each encounter on the ability to gather relevant patient information, diagnostic accuracy when information is presented in scattered form, and adherence to instructions.
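A minimal sketch of how such an agent-based evaluation loop could be wired together is shown below. This is not the authors' implementation: the turn limit, the stopping rule, and the `doctor`, `patient`, and `grader` callables are assumptions that stand in for whatever model-calling functions would actually be used.

```python
# Minimal sketch of a CRAFT-MD-style evaluation loop (illustrative, not the
# authors' code). `doctor` is the model under test, `patient` is the
# patient-agent grounded in a clinical vignette, and `grader` checks the
# final diagnosis against the vignette's ground truth.
from typing import Callable, Dict, List

Message = Dict[str, str]  # e.g. {"role": "doctor", "content": "..."}

def run_encounter(
    doctor: Callable[[List[Message]], str],
    patient: Callable[[List[Message]], str],
    max_turns: int = 10,  # assumed turn budget
) -> List[Message]:
    """Alternate doctor questions and patient answers until the doctor
    commits to a diagnosis or the turn budget runs out."""
    transcript: List[Message] = []
    for _ in range(max_turns):
        question = doctor(transcript)
        transcript.append({"role": "doctor", "content": question})
        if question.lower().startswith("final diagnosis:"):
            break
        answer = patient(transcript)
        transcript.append({"role": "patient", "content": answer})
    return transcript

def grade(grader: Callable[[str, str], bool],
          transcript: List[Message], truth: str) -> bool:
    """Ask the grader agent whether the doctor's final message matches the truth."""
    final = transcript[-1]["content"] if transcript else ""
    return grader(final, truth)

# Toy usage with stub agents; real use would wrap LLM API calls.
if __name__ == "__main__":
    doctor = lambda t: "Final diagnosis: migraine" if t else "How long has the headache lasted?"
    patient = lambda t: "About two days, mostly on one side."
    grader = lambda final, truth: truth.lower() in final.lower()
    print(grade(grader, run_encounter(doctor, patient), "migraine"))  # True
```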
The researchers used CRAFT-MD to test four AI models, both proprietary (commercial) and open-source versions, on 2,000 clinical vignettes covering conditions common in primary care and across 12 medical specialties.
All AI models showed limitations, particularly in their ability to conduct clinical conversations and reason based on information provided by patients. This, in turn, compromised their ability to take medical histories and make an appropriate diagnosis. For example, the models often struggled to ask the right questions to gather a relevant patient history, missed critical information during history taking, and had difficulty synthesizing scattered information. The accuracy of these models declined when they were presented with open-ended information rather than multiple-choice answers. The models also performed worse in back-and-forth exchanges, which make up most real-world conversations, than when working from summarized conversations.
Recommendations for optimizing the performance of AI in practice
Based on these findings, the team offers a set of recommendations for both the developers who design AI models and the regulators tasked with evaluating and approving these tools.
These include:
- Using open-ended conversational questions that more accurately reflect unstructured doctor-patient interactions in the design, training, and testing of AI tools (an illustrative sketch follows this list)
- Assessing models for their ability to ask the right questions and to extract the most essential information
- Designing models capable of following multiple conversations and integrating information from them
- Designing AI models capable of combining textual data (notes from conversations) with non-textual data (images, EKGs)
- Developing more sophisticated AI agents that can interpret nonverbal cues such as facial expressions, tone of voice, and body language
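To make the first recommendation concrete, the snippet below contrasts an exam-style, multiple-choice item with an open-ended conversational opener. The vignette text, options, and field names are hypothetical illustrations, not taken from the study's dataset.

```python
# Hypothetical contrast between an exam-style item and an open-ended,
# conversational opener (illustrative only; not CRAFT-MD data).

exam_style_item = {
    "stem": ("A 45-year-old presents with two days of fever, productive cough, "
             "and right-sided chest pain that worsens on inspiration. "
             "What is the most likely diagnosis?"),
    "options": ["A. Pulmonary embolism", "B. Community-acquired pneumonia",
                "C. Gastroesophageal reflux", "D. Costochondritis"],
    "answer": "B",
}

# An open-ended evaluation offers no answer choices and only a vague opening
# complaint; the model must elicit the rest of the history itself.
conversational_opener = {
    "patient_message": "I've been feeling feverish and my chest hurts when I breathe in.",
    "expected_behaviors": [
        "ask about cough, duration, and relevant risk factors",
        "synthesize the scattered answers into a working diagnosis",
    ],
}
```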
In addition, both AI agents and human experts should be included in the evaluation, the researchers recommend, as relying solely on human experts is labor-intensive and expensive. For example, CRAFT-MD was faster than human raters, processing 10,000 interviews in 48 to 72 hours, plus 15 to 16 hours of expert assessment. In contrast, human-based approaches would require extensive recruitment and an estimated 500 hours for patient simulations (nearly 3 minutes per conversation) and approximately 650 hours for expert assessments (nearly 4 minutes per conversation). Using AI assessors as a first choice has the added benefit of eliminating the risk of exposing real patients to unverified AI tools.
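As a rough sanity check on those workload figures, the per-conversation times quoted above scale up as follows (a back-of-the-envelope calculation, not an additional result from the paper):

```python
# Back-of-the-envelope check of the quoted human-effort estimates.
conversations = 10_000
simulation_hours = conversations * 3 / 60  # ~3 minutes per simulated conversation
grading_hours = conversations * 4 / 60     # ~4 minutes per expert assessment

print(f"patient simulations: ~{simulation_hours:.0f} hours")  # ~500 hours
print(f"expert assessments:  ~{grading_hours:.0f} hours")     # ~667 hours; the article rounds to ~650
```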
The researchers expect that CRAFT-MD itself will also be regularly updated and optimized to incorporate improved patient AI models.
“As a physician and scientist, I am interested in AI models that can effectively and ethically improve clinical practice,” said study co-senior author Roxana Daneshjou, assistant professor of biomedical data science and dermatology at Stanford University. “CRAFT-MD creates a framework that better reflects real-world interactions, helping to advance the field when it comes to testing the performance of AI models in healthcare.”
Sources:
Johri, S., et al. (2025). An evaluation framework for clinical use of large language models in patient interaction tasks. Nature Medicine. doi.org/10.1038/s41591-024-03328-5.