Possibilities and limitations of using a large language model to respond to patient messages

A new study from researchers at Mass General Brigham shows that large language models (LLMs), a type of generative AI, can help reduce physician workload and improve patient education when used to compose responses to patient messages. The study also found limitations of LLMs that may affect patient safety, suggesting that careful monitoring of LLM-generated communications is essential for safe use. Results published in The Lancet Digital Health emphasize the need for a measured approach to LLM implementation.
Increasing administrative and documentation requirements have led to an increase in physician burnout. To streamline and automate physician workflows, electronic health record (EHR) vendors have adopted generative AI algorithms to help physicians compose messages to patients. However, the efficacy, safety and clinical impact of their use were unknown.
“Generative AI has the potential to offer the best of both worlds, reducing the burden on the clinician while better educating the patient. However, based on our team's experience working with LLMs, we have concerns about the potential risks associated with integrating LLMs into messaging systems. As LLM integration into EHRs becomes more common, our goal in this study was to identify relevant benefits and shortcomings.”
Danielle Bitterman, MD, corresponding author, faculty member in the Artificial Intelligence in Medicine (AIM) program at Mass General Brigham and a physician in the Department of Radiation Oncology at Brigham and Women’s Hospital
For the study, researchers used OpenAI's GPT-4, a large language model, to generate 100 scenarios about patients with cancer, each with an accompanying patient question. No questions from actual patients were used. Six radiation oncologists first answered the questions manually; GPT-4 then generated its own responses to the same questions. Finally, the LLM-generated responses were given to the same radiation oncologists for review and editing. The radiation oncologists did not know whether GPT-4 or a human had written the responses, and in 31 percent of cases they assumed that an LLM-generated response had been written by a human.
On average, physician-authored responses were shorter than LLM-authored responses. GPT-4 tended to include more patient education but was less directive in its instructions. Physicians reported that LLM assistance improved their perceived efficiency, and they judged LLM-generated responses to be safe 82.1 percent of the time and acceptable to send to a patient without further editing 58.3 percent of the time. The researchers also identified shortcomings: if left unedited, 7.1 percent of LLM-generated responses could pose a risk to the patient, and 0.6 percent could pose a risk of death, most often because the GPT-4 response failed to urge the patient to seek immediate medical attention.
Of note, the LLM-drafted, physician-edited responses were more similar in length and content to the LLM-generated responses than to the physicians' manual responses. In many cases, physicians retained the LLM-generated educational content, suggesting that they found it valuable. While this could promote patient education, the researchers emphasize that over-reliance on LLMs may also pose risks given their demonstrated shortcomings.
The emergence of AI tools in healthcare has the potential to positively transform the continuum of care, and it is imperative to balance their potential for innovation with a commitment to safety and quality. Mass General Brigham is a leader in the responsible use of AI and conducts in-depth research on new and emerging technologies to support the incorporation of AI into healthcare delivery, workforce support, and administrative processes. Mass General Brigham is currently leading a pilot project to integrate generative AI into the electronic health record to author responses to patient portal messages and is testing the technology in a number of outpatient practices across the health system.
Going forward, the study authors will examine how patients perceive LLM-based communication and how patients' racial and demographic characteristics influence LLM-generated responses based on known algorithmic biases in LLMs.
“Keeping a human in the loop is an essential safety step when it comes to using AI in medicine, but it is not a one-size-fits-all solution,” Bitterman said. “As providers come to rely more and more on LLMs, we may miss errors that could result in patient harm. This study shows the need for systems to monitor the quality of LLM output, training for clinicians to appropriately oversee that output, greater AI literacy for both patients and clinicians, and, at a fundamental level, a better understanding of how to address the errors that LLMs make.”
Sources:
Chen, S., et al. (2024) The effect of using a large language model to respond to patient messages. The Lancet Digital Health. doi.org/10.1016/S2589-7500(24)00060-8.