Google Gemini and Bard pass the ophthalmology exam
A study from Canada evaluates how Google Gemini and Bard perform on ophthalmology board practice questions, with comparisons across countries and subspecialties.

In a study recently published in the journal Eye, researchers from Canada evaluated the performance of two artificial intelligence (AI) chatbots, Google Gemini and Bard, on ophthalmology board exam practice questions.
They found that both tools achieved acceptable response accuracy and performed well in the field of ophthalmology, although there were some differences between countries.
Background
AI chatbots such as ChatGPT (short for Chat Generative Pre-trained Transformer), Bard, and Gemini are increasingly being used in the medical field. Their performance continues to evolve across exams and disciplines.
While ChatGPT-3.5 achieved up to 64% accuracy on Step 1 and Step 2 questions from AMBOSS and the NBME (National Board of Medical Examiners), newer versions such as ChatGPT-4 have shown improved performance.
Google's Bard and Gemini provide answers based on diverse cultural and linguistic training and may tailor information to specific countries. However, responses can vary by region, and further research is needed to ensure consistency, particularly in medical applications where accuracy is critical to patient safety.
In the present study, researchers sought to evaluate the performance of Google Gemini and Bard using a series of practice questions designed for the Ophthalmology Board certification exam.
About the study
The performance of Google Gemini and Bard was evaluated using 150 text-based multiple-choice questions from EyeQuiz, an educational platform for medical professionals specializing in ophthalmology.
The portal provides practice questions for various exams, including the Ophthalmic Knowledge Assessment Program (OKAP), National Board exams such as the American Board of Ophthalmology (ABO) exam, and certain postgraduate exams.
Questions were manually categorized and data were collected using Bard and Gemini versions available on November 30 and December 28, 2023, respectively. Accuracy, explanation provision, response time, and question length were evaluated for both tools.
Secondary analyses included assessing performance in countries other than the United States (US), including Vietnam, Brazil, and the Netherlands, using virtual private networks (VPNs).
Statistical tests, including chi-square and Mann-Whitney U tests, were conducted to compare performance across countries and chatbot models. Multivariable logistic regression was used to examine factors that influence correct responses.
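As a rough illustration of how this kind of comparison could be set up (this is not the authors' code, and the per-question results below are simulated placeholders rather than study data), the following Python sketch runs a chi-square test on accuracy counts and a Mann-Whitney U test on response times:

```python
# Hypothetical sketch of the comparisons described above; all data are simulated.
import numpy as np
from scipy.stats import chi2_contingency, mannwhitneyu

rng = np.random.default_rng(0)

# 1 = correct, 0 = incorrect for 150 questions (placeholder data, ~71% accuracy)
bard_correct = rng.binomial(1, 0.71, size=150)
gemini_correct = rng.binomial(1, 0.71, size=150)

# Chi-square test on the 2x2 table of correct/incorrect counts per chatbot
table = [
    [bard_correct.sum(), 150 - bard_correct.sum()],
    [gemini_correct.sum(), 150 - gemini_correct.sum()],
]
chi2, p_acc, _, _ = chi2_contingency(table)

# Mann-Whitney U test on response times in seconds (placeholder data)
bard_times = rng.normal(7.1, 2.7, size=150)
gemini_times = rng.normal(7.1, 2.8, size=150)
u_stat, p_time = mannwhitneyu(bard_times, gemini_times)

print(f"accuracy comparison: chi2={chi2:.2f}, p={p_acc:.3f}")
print(f"response-time comparison: U={u_stat:.0f}, p={p_time:.3f}")
```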
Results and discussion
Bard and Gemini responded promptly and consistently to all 150 questions, without errors due to high demand. In the primary analysis with the US versions, Bard responded in 7.1 ± 2.7 seconds on average, while Gemini responded in 7.1 ± 2.8 seconds.
In the primary analysis using the US versions of the chatbots, both Bard and Gemini achieved 71% accuracy, answering 106 out of 150 questions correctly. Bard provided explanations for 86% of its answers, while Gemini provided explanations for all of its answers.
Bard performed best in orbital and plastic surgery, while Gemini showed superior performance in general ophthalmology, orbital and plastic surgery, glaucoma, and uveitis. However, both tools struggled in the cataract and lens, and refractive surgery categories.
In the secondary analysis with Bard accessed from Vietnam, the chatbot answered 67% of the questions correctly, similar to the US version. However, the Vietnam version of Bard chose different answers for 21% of the questions compared to the US version.
Gemini accessed from Vietnam answered 74% of the questions correctly, also similar to the US version, although its answer selections differed from the US version for 15% of the questions. In both cases, some questions answered incorrectly by the US versions were answered correctly by the Vietnam versions, and vice versa.
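To make this regional comparison concrete, here is a minimal, hypothetical sketch (with invented answer letters, not study data) of how answer-choice agreement between two versions of a chatbot could be computed:

```python
# Hypothetical sketch of comparing answer choices between two regional versions
# of the same chatbot; the answers below are invented placeholders.
us_answers = ["A", "C", "B", "D", "A", "B"]   # US version's choices
vn_answers = ["A", "C", "D", "D", "B", "B"]   # Vietnam version's choices
answer_key = ["A", "C", "D", "B", "A", "B"]   # correct answers

# Questions where the two versions chose different answers
discordant = [i for i, (u, v) in enumerate(zip(us_answers, vn_answers)) if u != v]
print(f"answer choices differ on {len(discordant) / len(us_answers):.0%} of questions")

# Questions each version got right that the other got wrong
vn_only = [i for i in discordant if vn_answers[i] == answer_key[i]]
us_only = [i for i in discordant if us_answers[i] == answer_key[i]]
print(f"correct only in Vietnam version: {vn_only}; correct only in US version: {us_only}")
```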
The Vietnam versions of Bard and Gemini explained 86% and 100% of their answers, respectively. Bard performed best in retinal and vitreous surgery and orbital and plastic surgery (80% accuracy), while Gemini performed better in corneal and external diseases, general ophthalmology and glaucoma (87% accuracy each).
Bard struggled most with the cataract and lens category (40% accuracy), while Gemini struggled with pediatric ophthalmology and strabismus (60% accuracy). Gemini's performance in Brazil and the Netherlands was relatively worse than that of the US and Vietnam versions.
Despite the promising results, the study's limitations include the small number of questions, reliance on a publicly available question bank, the unexplored effects of user prompts, internet speed, and website traffic on response times, and occasional incorrect explanations from the chatbots.
Future studies could examine chatbots' relatively unexplored ability to interpret eye images. Further research is needed to address the limitations and explore additional applications in this area.
Conclusion
In summary, the study demonstrated satisfactory performance by Bard and Gemini on ophthalmology practice questions, while differences between the US and Vietnam versions highlight possible response variability related to user location.
Future evaluations tracking the improvement of AI chatbots and comparisons between ophthalmology residents and AI chatbots could provide valuable insights into their effectiveness and reliability.
Sources:
- Mihalache, A., et al. (2024). Google Gemini and Bard artificial intelligence chatbot performance in ophthalmology knowledge assessment. Eye. doi: https://doi.org/10.1038/s41433-024-03067-4. https://www.nature.com/articles/s41433-024-03067-4