AI in Medicine: Revolutionary Tools, Uncertain Results
Can AI really revolutionize healthcare? A systematic review uncovers the hidden gaps in patient benefit and barriers to meaningful clinical integration.
In a recent study published in The Lancet Regional Health – Europe, a group of researchers evaluated the benefits and harms of artificial intelligence (AI)-based algorithmic decision-making (ADM) systems used by healthcare professionals compared with standard care, focusing on patient-relevant outcomes.
Background
Advances in AI have enabled systems to outperform medical experts in tasks such as diagnosis, personalized medicine, patient monitoring, and drug development. Despite these advances, it remains unclear whether improved diagnostic accuracy and performance metrics translate into tangible patient benefits, such as reduced mortality or morbidity.
Current research often prioritizes analytical performance over clinical outcomes, and many AI-based medical devices are approved without supporting evidence from randomized controlled trials (RCTs).
Furthermore, the lack of transparency and of standardized assessments of the harms associated with these technologies raises ethical and practical concerns. This highlights a critical gap in AI research and development: further assessments focused on patient-relevant outcomes are needed to ensure meaningful and safe integration into healthcare.
About the study
Limited external validation: Most AI systems evaluated were developed based on internal data, with few studies reporting external validation, raising concerns about their generalizability to different patient populations.
This systematic review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines to ensure methodological rigor. Searches were conducted in the Medical Literature Analysis and Retrieval System Online (MEDLINE), the Excerpta Medica Database (EMBASE), PubMed, and Institute of Electrical and Electronics Engineers (IEEE) Xplore, covering the 10 years up to March 27, 2024, the period in which AI-based ADM systems became relevant in health research. The search combined terms related to AI, machine learning (ML), decision algorithms, healthcare professionals, and patient outcomes.
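To make the search step concrete, the following is a minimal sketch of how such a query could be run programmatically against PubMed via Biopython's Entrez interface. The Boolean terms, date window, and contact email are illustrative assumptions only; the review's actual search strings are documented in its PROSPERO preregistration.

```python
# Illustrative sketch, assuming hypothetical search terms; this does not
# reproduce the review's actual search strategy.
from Bio import Entrez

Entrez.email = "reviewer@example.org"  # NCBI requires a contact address

# Hypothetical Boolean query combining AI/ML terms with outcome terms
query = (
    '("artificial intelligence"[Title/Abstract] OR "machine learning"[Title/Abstract] '
    'OR "decision support"[Title/Abstract]) '
    'AND ("patient outcome"[Title/Abstract] OR mortality[Title/Abstract] '
    'OR morbidity[Title/Abstract])'
)

# Restrict to a 10-year window ending March 27, 2024, as in the review
handle = Entrez.esearch(
    db="pubmed",
    term=query,
    mindate="2014/03/27",
    maxdate="2024/03/27",
    datetype="pdat",   # filter on publication date
    retmax=100,        # number of PMIDs to return
)
record = Entrez.read(handle)
handle.close()
print(f"{record['Count']} records matched; first PMIDs: {record['IdList'][:5]}")
```

In a real review, queries like this would be run per database, deduplicated, and then screened by hand, as described below.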
Eligible studies used intervention or observational designs and evaluated AI decision-support systems developed with or leveraging ML. Studies had to report patient-relevant outcomes such as mortality, morbidity, length of hospital stay, readmission, or health-related quality of life. Exclusion criteria included studies without preregistration, studies without a standard-of-care control, and studies focused on robotics or other systems unrelated to AI-based decision-making. The protocol for this review was preregistered in the International Prospective Register of Systematic Reviews (PROSPERO), and all changes were documented.
The reviewers screened titles, abstracts, and full texts against predefined criteria. Data extraction and quality assessment were carried out independently using standardized forms. Risk of bias was assessed with the Cochrane Risk of Bias 2 (RoB 2) tool and the Risk Of Bias In Non-randomized Studies of Interventions (ROBINS-I) tool to account for potential confounding, while reporting transparency was assessed against the Consolidated Standards of Reporting Trials – Artificial Intelligence (CONSORT-AI) extension and the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis – Artificial Intelligence (TRIPOD-AI) framework.
Data extracted included study settings, design, intervention and comparison details, patient and professional demographics, algorithm characteristics, and outcome measures. Studies were also classified by AI system type, clinical area, prediction objectives, and regulatory and funding information. The analysis also examined whether the unique contributions of AI systems to the results were isolated and validated.
Study results
Underrepresented specialties: While psychiatry and oncology studies were well represented, other specialties such as critical care and pulmonology remain underrepresented, potentially distorting the broader applicability of the results.
The systematic review included 19 studies (18 RCTs and one prospective cohort study), selected after screening 3,000 records. These studies were conducted in different regions: nine in the United States, four in Europe, three in China, and the remainder distributed worldwide. Fourteen studies took place in hospitals, three in outpatient clinics, one in a nursing home, and one in a mixed setting.
The studies covered a range of medical specialties, including oncology (4 studies), psychiatry (3 studies), internal hospital medicine, neurology and anesthesiology (2 studies each), as well as individual studies in diabetology, pulmonology, intensive care and other specialties.
The mean number of participants across all studies was 243, with a mean age of 59.3 years. The proportion of women averaged 50.5%, and 10 studies reported racial or ethnic composition, with a median of 71.4% white participants. Twelve studies described the intended healthcare professionals, such as nurses or primary care providers, and nine detailed training protocols ranging from short introductions to the platform to multi-day supervised sessions.
The AI systems differed in type and function. Seven studies used systems for real-time monitoring and predictive alerts, six used treatment-personalization systems, and four integrated multiple functions. Examples included algorithms for glycemic control in diabetes, personalized psychiatric care, and venous thromboembolism monitoring. Development data sources ranged from large internal datasets to pooled multi-institutional data, applying various ML models such as gradient boosting, neural networks, Bayesian classifiers, and regression-based models, as sketched below. Despite these developments, external validation of the algorithms was limited in most studies, raising concerns about their generalizability to broader patient populations.
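For readers unfamiliar with these model families, here is a minimal sketch of a gradient-boosting risk model of the kind several reviewed systems employed. The synthetic dataset, feature count, and alert threshold are all hypothetical; nothing here reproduces any study's actual algorithm or data.

```python
# Minimal sketch, assuming synthetic stand-in data: a gradient-boosting
# model that scores patient risk and raises predictive alerts.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for tabular clinical features (vitals, labs, history),
# with a rare adverse event as the positive class
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           weights=[0.9, 0.1], random_state=0)

# A split like this is only *internal* validation; the review notes that
# few systems were validated on external patient populations.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

model = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05,
                                   max_depth=3, random_state=0)
model.fit(X_train, y_train)

risk = model.predict_proba(X_test)[:, 1]  # predicted event probability
print(f"Internal AUROC: {roc_auc_score(y_test, risk):.3f}")

# A monitoring system might alert above a tuned threshold (hypothetical value)
alerts = risk >= 0.30
print(f"Alerts raised for {alerts.sum()} of {len(risk)} test patients")
```

The caveat in the comments is the crux of the review: a strong internal AUROC says nothing by itself about whether acting on the alerts improves mortality, morbidity, or quality of life.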
The risk of bias was assessed as low in four RCTs, moderate in seven, and high in a further seven, while the cohort study carried a serious risk of bias. Adherence to the CONSORT-AI and TRIPOD-AI guidelines varied: three studies achieved full compliance, while adherence in the others ranged from low to high. Most studies conducted before the introduction of these guidelines showed moderate adherence, although explicit references to the guidelines were rare.
The results showed a mix of benefits and harms. Twelve studies reported patient-relevant benefits, including reductions in mortality, improved depression and pain management, and improved quality of life. However, only eight studies included standardized harm assessments, and most of them failed to document adverse events comprehensively. Although six AI systems received regulatory approvals, the relationships between regulatory status, study quality, and patient outcomes remained unclear.
Conclusions
This systematic review highlights the lack of high-quality studies assessing patient-relevant outcomes of AI-related ADM systems in healthcare. While benefits were consistently shown in psychiatry, other areas reported mixed results with limited evidence of improvements in mortality, anxiety, and hospitalizations. Most studies lacked balanced harm-benefit assessments and failed to isolate the unique contributions of AI.
The findings highlight the urgent need for transparent reporting, robust validation practices, and standardized frameworks to guide the safe and effective integration of AI into clinical environments.