Original Article
What matters when generative artificial intelligence enters the clinic: a clinician-centered evaluation of ChatGPT, Gemini and Copilot in radiation oncology
Abstract
Background: Generative artificial intelligence (GenAI), particularly large language models (LLMs), promises to revolutionize healthcare, yet the practical utility of LLMs in clinical settings remains uncertain. To date, LLMs have largely been evaluated using algorithm-centered benchmarks that are generally tailored for engineering, not clinical practice. This study pioneers the evaluation of LLM utility in radiation oncology (RO) by introducing a provider-centered approach based on ecologically situated use cases and by developing a novel evaluation framework focused on clinical safety and usability.
Methods: Four ecologically situated use cases [Letter of Medical Necessity (LOMN) generation, radiology report summarization, consult note summarization, medical decision-making (MDM)] were identified through provider needs assessment. Three publicly available commercial LLMs (ChatGPT, Google Gemini, Microsoft Copilot) were evaluated using de-identified clinical data. A multidisciplinary team of AI experts and RO physicians (ROPs) developed customized, multi-dimensional evaluation measures focusing on safety and usability. ROPs assessed LLM outputs, generated from standardized prompts, against these measures. The evaluation was repeated after 6 months to validate measure robustness.
Results: No single LLM excelled across all tasks. ChatGPT performed best overall in readability and in minimizing hallucinations, scoring in the highest tier across most use cases. Gemini showed strengths in lay-friendliness but exhibited greater variability across dimensions. Copilot led in citation accuracy yet consistently scored lower in readability. LLMs performed adequately on summarization tasks (averaging ≥4/5 on accuracy), but outputs for complex tasks such as MDM scored notably lower and often required significant verification.
Conclusions: These commercially available LLMs (ChatGPT, Gemini, Copilot) exhibit variable performance and require substantial human oversight for safe application in RO. They are not currently suitable for autonomous clinical use, primarily due to concerns regarding reliability and the burden of verification. This study offers ecologically situated use cases and pioneers a validated, provider-centered framework to guide LLM evaluation and to support realistic assessment of LLM use in clinical workflows.

