The AI system matches diagnostic accuracy while reducing medical costs
In a new study, Microsoft's AI-powered diagnostic system outperformed experienced doctors in solving the most challenging medical cases faster, cheaper and more accurately.
Study: Sequential diagnosis with language models. Image credit: MetamorWorks/Shutterstock.com
*Important Notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and therefore should not be considered conclusive, used to guide clinical practice or health-related behaviors, or treated as established information.
A recent study on the arXiv preprint server compared the diagnostic accuracy and resource expenditure of AI systems with those of clinicians on complex cases. The Microsoft AI team demonstrated how artificial intelligence (AI) can be used efficiently in medicine to address diagnostic challenges that doctors must decipher.
Sequential diagnosis and language models
Doctors often diagnose illness through a clinical reasoning process of step-by-step, iterative questioning and testing. Even with limited initial information, clinicians narrow down the possible diagnoses by questioning the patient and confirm them through biochemical testing, imaging, biopsy, and other diagnostic procedures.
Resolving a complex case requires a comprehensive skill set: identifying the most informative questions or tests to pursue next, keeping testing costs in check to avoid increasing patient burden, and recognizing when the evidence suffices for a confident diagnosis.
Several studies have demonstrated the strong performance of language models (LMs) on medical licensing exams and highly structured diagnostic vignettes. However, most LMs have been evaluated under artificial conditions that are drastically different from real-world clinical environments.
Most LM diagnostic assessments take the form of a multiple-choice quiz, with the diagnosis selected from a predefined answer set. Collapsing the sequential diagnostic cycle in this way risks overestimating model competence on static benchmarks. In addition, such diagnostic models risk indiscriminate test ordering and premature diagnostic closure. There is therefore an urgent need for an AI evaluation built around the sequential diagnostic cycle, one that improves diagnostic accuracy while reducing testing costs.
About the study
To overcome the above-mentioned shortcomings of LM-based clinical diagnosis, the scientists developed the Sequential Diagnosis Benchmark (SDBench), an interactive framework for evaluating diagnostic agents (human or AI) through realistic sequential clinical encounters.
To assess diagnostic accuracy, the current study used weekly cases published in the New England Journal of Medicine (NEJM), the world's leading medical journal. This journal typically publishes case notes of Massachusetts General Hospital patients in a detailed, narrative format. These cases are among the most diagnostically challenging and intellectually demanding in clinical medicine and often require multiple specialists and diagnostic tests to confirm a diagnosis.
SDBench transforms 304 cases from the NEJM clinicopathological conference (CPC) series (2017-2025) into stepwise diagnostic encounters. Case data included clinical presentations and definitive diagnoses ranging from common diseases (e.g., pneumonia) to rare disorders (e.g., neonatal hypoglycemia). On the interactive platform, diagnostic agents decide what questions to ask, what tests to order, and when to commit to a diagnosis.
The Information Gatekeeper is a language model that reveals clinical details from a comprehensive case file only when explicitly queried. It may also provide additional case-consistent findings for tests not described in the original CPC narrative. Once a final diagnosis was made based on the information obtained from the gatekeeper, the clinical assessment was scored against the actual diagnosis. In addition, the cumulative cost of all requested diagnostic tests was estimated using real-world prices. By jointly assessing diagnostic accuracy and diagnostic cost, SDBench indicates how close we are to providing high-quality care at a sustainable cost.
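The gatekeeper-and-agent loop described above can be illustrated with a minimal sketch. This is a hypothetical toy, not the study's implementation: the case record, test names, and prices are invented for illustration, and the gatekeeper here is a plain dictionary lookup rather than a language model.

```python
# Toy sketch of an SDBench-style sequential-diagnosis episode: a diagnostic
# agent queries a gatekeeper that reveals case details only on explicit
# request, while the cost of ordered tests accumulates.

TEST_COSTS = {"cbc": 30, "chest_ct": 1200, "biopsy": 2500}  # illustrative prices

class Gatekeeper:
    def __init__(self, case_file):
        self.case_file = case_file   # full hidden case record
        self.cost = 0                # cumulative cost of ordered tests

    def ask(self, question):
        # Reveal only the detail that was explicitly requested.
        return self.case_file.get(question, "Not specified in the case record.")

    def order_test(self, test):
        self.cost += TEST_COSTS.get(test, 0)
        return self.case_file.get(test, "Result unremarkable.")

case = {
    "presenting_complaint": "fever and weight loss for 3 weeks",
    "chest_ct": "mediastinal lymphadenopathy",
    "diagnosis": "lymphoma",
}

gk = Gatekeeper(case)
print(gk.ask("presenting_complaint"))   # free history-taking question
print(gk.order_test("chest_ct"))        # ordered test adds to the bill
final_diagnosis = "lymphoma"
correct = final_diagnosis == case["diagnosis"]
print(f"correct={correct}, total cost=${gk.cost}")
```

Scoring an episode on both axes at once, was the final diagnosis correct, and how much did the ordered tests cost, is what distinguishes this setup from a multiple-choice benchmark.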
Study results
The current study analyzed the performance of all diagnostic agents on SDBench. AI agents were evaluated on all 304 NEJM cases, while physicians were evaluated on a held-out test subset of 56 cases. The study found that AI agents outperformed doctors on this subset.
Physicians practicing in the US and UK, with a median of 12 years of clinical experience, achieved 20% diagnostic accuracy at an average cost of $2,963 per case on SDBench, highlighting the inherent difficulty of the benchmark. Physicians spent an average of 11.8 minutes per case, asking 6.6 questions and ordering 7.2 tests. GPT-4o outperformed physicians in both diagnostic accuracy and cost, and commercially available off-the-shelf models offered varying trade-offs between accuracy and cost.
The current study also introduced the MAI Diagnostic Orchestrator (MAI-DxO), a system designed to partner with doctors, which demonstrated higher diagnostic efficiency than both human physicians and commercial language models. Compared with commercial LMs, MAI-DxO achieved higher diagnostic accuracy while cutting estimated medical costs by more than half. For example, the off-the-shelf o3 model achieved 78.6% diagnostic accuracy at $7,850 per case, while MAI-DxO achieved 79.9% accuracy at just $2,397, or 85.5% at $7,184.
MAI-DxO achieved this by simulating a virtual panel of "doctor agents" with distinct roles in hypothesis generation, test selection, cost awareness, and error checking. Unlike a basic AI prompt, this structured orchestration allowed the system to work iteratively and efficiently.
MAI-DxO is a model-agnostic approach that demonstrated accuracy gains across various language models, not just the o3 base model.
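To make the panel idea concrete, here is a hypothetical sketch of how role-specialized agents might interact in one diagnostic loop. All role functions, test names, and prices are invented stand-ins; in the real system each role would be played by a prompted language model, not a hand-coded rule.

```python
# Toy sketch of a virtual-panel orchestration loop: hypothesis, test-selection,
# cost-awareness, and error-checking roles take turns until the panel is
# confident enough to commit to a diagnosis.

COSTS = {"sputum_culture": 150, "chest_ct": 1200}   # illustrative prices

def propose_hypotheses(findings):
    # Hypothesis agent: rank candidate diagnoses from current findings.
    return ["pneumonia", "tuberculosis"] if "cough" in findings else ["unknown"]

def select_test(budget_left):
    # Test-selection + cost-awareness agents: cheapest affordable test.
    affordable = {t: c for t, c in COSTS.items() if c <= budget_left}
    return min(affordable, key=affordable.get) if affordable else None

def confident(hypotheses, evidence):
    # Error-checking agent: commit only once some evidence is in hand
    # and the differential is narrow.
    return len(evidence) >= 1 and len(hypotheses) <= 2

findings, evidence, budget = {"cough"}, [], 2000
for _ in range(5):                      # bounded diagnostic rounds
    hyps = propose_hypotheses(findings)
    if confident(hyps, evidence):
        break
    test = select_test(budget)
    if test is None:
        break
    budget -= COSTS[test]
    evidence.append(test)

print("diagnosis:", hyps[0], "| tests:", evidence, "| budget left:", budget)
```

The design point is the separation of concerns: no single agent both proposes tests and approves spending, which is one plausible reason structured orchestration curbs indiscriminate test ordering better than a single prompt.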
Conclusions and future prospects
The results of the current study show the higher diagnostic accuracy and cost-effectiveness of AI systems when they proceed iteratively and deliberately. SDBench and MAI-DxO provide an empirical foundation for advancing AI-assisted diagnostics under realistic constraints.
In the future, MAI-DxO needs to be validated in clinical settings where disease prevalence and presentation reflect everyday practice rather than rare, curated cases. Furthermore, larger interactive medical benchmarks extending beyond 304 cases are required. Incorporating imaging and other sensory modalities could also improve diagnostic accuracy without compromising cost-effectiveness.
However, the authors note important limitations. NEJM CPC cases are selected for their difficulty and do not reflect everyday clinical presentations. The study did not include healthy patients or measure false-positive rates. Additionally, diagnostic cost estimates are based on US prices and may vary worldwide.
The models were also tested on a retained test set of recent cases (2024-2025) to assess generalization and avoid overfitting, as many of these cases were released after the training cutoff for most models.
The paper also raises a broader question: should we compare AI systems to individual doctors or to full medical teams? Because MAI-DxO mimics multi-specialist collaboration, the comparison may be closer to team-based care than to individual practice.
However, the research suggests that structured AI systems like MAI-DxO may one day support or augment clinicians, particularly in settings where access to specialists is limited or expensive.
Sources:
- Preliminary scientific report.
Nori, H., et al. (2025). Sequential Diagnosis with Language Models. arXiv. https://arxiv.org/abs/2506.22405