What matters when generative artificial intelligence enters the clinic: a clinician-centered evaluation of ChatGPT, Gemini and Copilot in radiation oncology
Highlight box
Key findings
• To assess the performance of commercially available large language models (LLMs) when tasked with practical applications in radiation oncology (RO), we developed multi-dimensional evaluation measures focusing on safety and usability in this context.
• Using this framework, LLMs demonstrated variable performance across 4 practical use cases tested, excelling in summarization tasks but requiring substantial verification for complex tasks like medical decision-making.
What is known and what is new?
• Previous methods to benchmark LLM performance have focused on sanitized cases or synthetic datasets that do not mirror the complexities of real-world clinical scenarios.
• This is the first evaluation of ecologically situated use cases for LLMs in RO, paired with practically grounded evaluation criteria.
What is the implication, and what should change now?
• Our results argue against autonomous clinical use of LLMs in this context and suggest that significant human oversight may be required to verify reliability and validity of output in most use cases. Future work should include evaluation of a broader range of use cases and LLMs, using both general and medically focused algorithms.
Introduction
As generative artificial intelligence (GenAI) shapes healthcare, it brings both transformation and challenges. Healthcare’s global GenAI market topped $1.8 billion in 2023 (1) with 85% of leading healthcare companies exploring or already having adopted GenAI capabilities in 2024. In the UK, 22% of health professionals use GenAI (2). Proliferation and hype around large language models (LLMs) have created splashy headlines, with claims of LLMs outperforming doctors in diagnosis (3), passing the United States Medical Licensing Examination (USMLE) with over 60% accuracy (4), and even surpassing human physicians in patient-preferred responses (5). Yet, providers remain skeptical, arguing these demonstrations often rely on unrealistic or synthetic tasks, questioning their real-world reliability (6-9). Polling from Pew Research suggests the majority of Americans are growing more uncomfortable with unregulated AI use in healthcare (10).
To harness technology effectively, it is imperative to identify practical use cases and evaluate them in an actionable manner (11,12). Otherwise, the industry risks succumbing to the perennial trap of “when you have a hammer, everything looks like a nail”. AI missteps in healthcare abound, from IBM Watson Health’s $5 billion failure to the inability of coronavirus disease 2019 (COVID-19) AI models to deliver during the pandemic because of flawed use cases, datasets, and assumptions (13,14). These failures highlight the peril of mismatched use cases and improper AI evaluation.
Typically, benchmarking of LLMs in healthcare has concentrated on algorithmic performance using sanitized, toy use cases or synthetic datasets that do not capture the complexities of real clinical environments. The benchmark metrics are often tailored for engineers, not providers (15). Benchmarks often lean on simplistic or hypothetical use cases, like simulated patient encounters that lack the messiness of clinical reality (15,16). Only 5% of reviewed studies used real patient data, with most focusing on generic healthcare applications or internal medicine (17). A study using real patient cases from the MIMIC-IV database revealed that state-of-the-art LLMs perform significantly worse than physicians in diagnosis, fail to follow either diagnostic or treatment guidelines, and struggle to integrate into clinical workflows (18). These findings underscore the need for more realistic, comprehensive evaluations of LLMs in healthcare settings before clinical deployment.
Despite shedding light on the computational affordances, algorithmic benchmarks often fail to capture how useful the technology is to its users; they fail to address the sociotechnical gap of LLMs in health. The sociotechnical gap is an inherent property of all technologies: it is the divide between (I) what the technology allows; and (II) what people want. Failure to address or minimize the sociotechnical gap is a leading cause of technological failures (19,20).
To address and minimize this gap, there is a need to take a human-centered AI approach, one that is provider centered and practically grounded. In this paper, going beyond the algorithm-centered LLM evaluation in healthcare, we take a human (provider)-centered approach. The paper critically examines the utility of LLMs in radiation oncology (RO) tasks that are both workflow-sensitive and crucial to the quality of clinical work. Specifically, the paper addresses the following questions: what use cases do providers deem desirable for LLM assistance? How can we evaluate these models to ensure medical safety and actionable utility from a provider’s perspective? How do LLMs perform on ecologically-situated tasks reflecting real clinical demands?
We evaluated the performance of 3 publicly available commercial, free-access LLMs (ChatGPT, Google Gemini, Microsoft Copilot) across four ecologically situated use cases with de-identified clinical data. Note that “publicly available” does not mean open-source or open-weights. These are closed-source, proprietary models offered to the public through free and paid plans. Given their ubiquity, the initial focus centers on these models. Each use case identified varying degrees of appropriateness for integration into RO workflows. These use cases, reflecting a pragmatic focus on provider-perceived safety and utility, include:
- Generating a Letter of Medical Necessity (LOMN);
- Summarizing a radiology report;
- Creating a consult note summary (although lay-oriented summaries chiefly aid patients, ROs must still draft or vet these sections for the electronic health record and patient portals. Providers in our needs-assessment emphasized that automating such routine summarization would reclaim clinician time, reduce documentation fatigue, and thereby help mitigate burnout, especially because our evaluation shows LLMs perform these tasks with acceptable accuracy);
- Formulating a medical decision-making (MDM) argument.
Due to the absence of practically grounded and actionable LLM evaluation frameworks of such use cases in RO, we developed customized evaluation measures for each use case through close collaboration between clinicians and AI experts. As such, the framework balances clinical relevance with technical feasibility. While specialized fine-tuned LLMs are used in studies (21-23), our work focuses on publicly available commercial models due to their accessibility and widespread use. To ensure the robustness of our evaluation, we conducted assessments at two time points over 6 months, capturing the impact of model updates and validating the stability of our evaluation measures.
To our knowledge, this study is the first to evaluate LLMs in RO using ecologically grounded use cases and human (provider)-centered practical assessment criteria. This pragmatic inquiry aims to address and minimize the sociotechnical gap of LLMs in RO, advancing technology design to serve those who deliver care. This paper makes the following contributions:
- Ecologically situated use cases: we present four real-world use cases, each reflecting varied appropriateness levels for LLM integration in RO.
- Evaluation measures: we develop a set of robust criteria to assess LLMs in RO and validate them by repeating the study over 6 months, ensuring robustness of measures to model updates. These criteria comprise a framework that is specifically tailored for formatively evaluating LLMs within the context of RO, although it may provide guidance for evaluating use cases elsewhere in medicine.
- Performance comparison: formative results illustrate the relative strengths and limitations of three widely accessible, publicly available commercial LLMs for each use case.
Methods
Scoping appropriate use cases and prompting LLMs
To evaluate LLMs in RO, a non-trivial challenge is to find appropriate use cases. Currently, there are no established guidelines on relevant use cases for LLMs in RO. Instead of following the trend of studying AI vs. humans (doctors) (3,5), we adopted an ecologically-situated and pragmatic approach where humans use AI to achieve practical tasks. A needs assessment was conducted with 8 expert radiation oncologist physicians (ROP) to generate a list of uses that (I) would make a positive difference to a provider if AI-assisted; and (II) are at the right level of complexity that AI-assisted automation could save time. Using scenario-based design (SBD) (24-26) as a methodology, a multidisciplinary team {2 ROPs, AI expert [a human-centered AI expert (computer science professor) with pioneering work in explainable AI and language technologies along with a background in human-computer interaction], and medical physicist} identified four ecologically-situated tasks (as listed in the Introduction) that would be helpful to an RO and could reasonably be completed using free-access, commercial LLMs: LOMN, radiology note summary, consult note summary, MDM. The prompts were developed by the two ROs on our expert panel, with the goal of reflecting the manner in which providers would be expected to conduct such queries in real-world practice. Detailed prompts and outputs for each use case are available online: https://cdn.amegroups.cn/static/public/jmai-2025-1-251-1.pdf.
The index cases for the MDM, LOMN, medical summaries, and radiology reports were derived from real clinical encounters to ensure ecological validity. To meet Health Insurance Portability and Accountability Act (HIPAA)-compliant standards for deidentification as per the US Health and Human Services Privacy Rule, all potentially identifying attributes (including age, tumor origin, geographic location, radiologic and laboratory details, and specific dates) were substantially modified. As a result of this rigorous deidentification, the project was not classified as Human Subjects Research and was exempt from Institutional Review Board (IRB) review.
How evaluation measures were developed
Evaluating the curated use cases required new measures, as no existing frameworks captured the specific safety and contextual demands of RO. General-purpose LLM benchmarks focus on algorithmic performance, overlooking critical factors like safety and usability essential to RO. Through 18 collaborative iterations lasting over 30 hours, a team of human-centered AI experts, ROPs, and physicists developed an evaluation framework tailored to RO needs, operationalizing safety and usability per use case. Categories for LLM evaluation comprising the framework were identified by employing Braun and Clarke’s 6-step thematic analysis strategy, completing the checklist proposed by Ahmed et al., to enhance precision and reliability of the process (27). The Likert scale for each category was developed through this same iterative approach and was customized for that use case to maximize actionability, with the guiding question, “How might this measure be actionably useful in practice to a provider?” For example, in the LOMN use case, dimensions like accuracy [hallucination (examples of hallucinations are available online: https://cdn.amegroups.cn/static/public/jmai-2025-1-251-1.pdf)] ensured safety, while readability and understandability supported usability. Some categories and scales carry over to all use cases (e.g., accuracy, comprehensiveness) while some are tailored for a specific one (e.g., persuasiveness for LOMN). Likert scale components are listed in Tables 1-4.
Table 1
| Dimension | Likert scale | Justification |
|---|---|---|
| Comprehensiveness: how comprehensively did the LLM answer the question as directed? | Includes letter format with medical necessity statement, patient history, treatment rationale, citations; 1 point for each. 1 = no criteria met; 2 = one criterion met; 3 = two criteria met; 4 = three criteria met; 5 = four criteria met | The prompt directly requests or implies inclusion of specific elements key to a standard LOMN; missing elements could lead to rejection of the LOMN case |
| Readability: how cohesively readable was the response? | 1 = Totally incohesive information; 2 = Moderately incohesive information; 3 = Somewhat incohesive information (e.g., bullets); 4 = Sentences but still not fully cohesive; 5 = Sentences that are cohesive | Output format and subsequent cohesiveness of its content affect readability of the LOMN |
| Persuasiveness: how persuasive was the output? | 1 = Uncertain: avoids conclusory statement, does not connect new content to case detail; 2 = Tentative: weakly links to details, suggests ideas without commitment; 3 = Permissive: offers multiple options as valid; 4 = Assertive: presents a conclusion but notes other possibilities; 5 = Certain: forceful argument without ambiguity | The extent to which the output was neutrally or persuasively presented impacts the anticipated efficacy of the LOMN |
| Output accuracy: how accurate and relevant was the output (e.g., citations)? | 1 = No citations; 2 = All hallucinations and/or no relevance; 3 = Mixed relevant real citations and/or some hallucinations; 4 = All real citations, less than 50% relevant; 5 = All real citations, more than 50% relevant | Accuracy and relevancy of citations affect the extent to which user oversight and corrections would be required |
A strong LOMN convincingly justifies necessity using patient-specific facts and supporting medical literature. This task used a real case description of a de-identified patient. Case history relevant to supporting a higher complexity treatment delivery modality, SBRT, was provided but not explicitly noted to be significant to supporting this treatment type. This task tested the ability to generate accurate and reasonable argumentation for a non-standard treatment regimen. It also tested the ability to provide its own evidence for its reasoning. LLM, large language model; LOMN, Letter of Medical Necessity; SBRT, stereotactic body radiation therapy.
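As an illustration only, the comprehensiveness and output accuracy scales in Table 1 can be written as simple scoring rules. The sketch below is not the instrument used by the raters, who scored by judgment and consensus. It assumes that the scale levels are listed in ascending order from 1 to 5; the element labels, function names, and the handling of exactly 50% relevance are hypothetical choices made for this example.

```python
from typing import Iterable

# Hypothetical labels for the four LOMN elements named in Table 1.
LOMN_ELEMENTS = (
    "letter_format_with_necessity_statement",
    "patient_history",
    "treatment_rationale",
    "citations",
)


def comprehensiveness_score(elements_present: Iterable[str]) -> int:
    """Return 1 when no element is present, up to 5 when all four are."""
    present = set(elements_present)
    met = sum(1 for element in LOMN_ELEMENTS if element in present)
    return 1 + met


def citation_accuracy_score(n_real: int, n_hallucinated: int, n_relevant: int) -> int:
    """Map citation counts to the 1-5 output accuracy scale for the LOMN.

    n_relevant counts the real citations that a rater judged relevant.
    """
    if min(n_real, n_hallucinated, n_relevant) < 0 or n_relevant > n_real:
        raise ValueError("counts must be non-negative and n_relevant <= n_real")
    if n_real + n_hallucinated == 0:
        return 1  # no citations
    if n_real == 0 or n_relevant == 0:
        return 2  # all hallucinations and/or no relevance
    if n_hallucinated > 0:
        return 3  # mix of relevant real citations and hallucinations
    # All citations are real. Exactly 50% relevant is not defined in Table 1;
    # this sketch assigns it to the lower level (an assumption).
    return 5 if n_relevant / n_real > 0.5 else 4
```

Under these assumptions, a letter containing only a patient history and citations would score 3 on comprehensiveness, and an output with four real citations of which three are relevant would score 5 on output accuracy.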
Table 2
| Dimension | Likert scale | Justification |
|---|---|---|
| Comprehensiveness: how comprehensively did the LLM answer the question as directed? | Includes summary of key elements of the report: comments on findings for prostate and nodes; comments on bones; comments on incidental findings; accurately reports on changes; 1 point each. 1 = no criteria met; 2 = one criterion met; 3 = two criteria met; 4 = three criteria met; 5 = four criteria met | To effectively summarize the case radiology report, inclusion of key elements of the source report is required |
| Readability: how cohesively readable was the response? | 1 = Totally incohesive information; 2 = Moderately incohesive information; 3 = Somewhat incohesive information (e.g., bullets); 4 = Sentences but still not fully cohesive; 5 = Sentences that are cohesive | Output format and subsequent cohesiveness of its content affect readability of the radiology report summary |
| Lay understandability: how lay-friendly and understandable was the output? | 1 = Jargon-heavy, clinical tone: uses complex terminology and a formal, technical style that is difficult for most people to understand; 2 = Mostly jargon: predominantly uses specialized language with some effort to simplify; 3 = Mixed: mixes technical terms with lay-friendly explanations, aiming for clarity without oversimplification; 4 = Conversational tone: uses everyday language and a casual style that is easy for most people to understand, minimal use of jargon; 5 = Highly accessible: completely avoids jargon, using simple, clear language and an engaging tone suitable for a general audience | The extent to which jargon and technical writing style are used impacts its lay interpretability |
| Output accuracy: how accurate and relevant was the output (e.g., citations)? | 1 = All hallucinations and/or no relevance; 2 = A lot of hallucinations; 3 = Some hallucinations; 4 = Very few hallucinations; 5 = No hallucinations/did not make anything up | Presence of errors or hallucinations affects the extent to which user oversight and corrections would be required |
This task used a real radiology report of a de-identified patient. Reports assist in contextualizing disease progression and treatment effects, but are limited by indeterminate findings, and simultaneous findings of disease progression, regression, and stability. This task tested the ability to generate a summary based on an actual radiology report that demonstrated areas of disease progression and improvement, with non-relevant findings, and to interpret these to demonstrate that disease progression is conclusively a poor finding. LLM, large language model.
Table 3
| Dimension | Likert scale | Justification |
|---|---|---|
| Comprehensiveness: how comprehensively did the LLM answer the question as directed? | Includes workup including primary response, nodal response, treatment options, side effects, next steps; 1 point for each. 1 = no criteria met; 2 = one criterion met; 3 = two criteria met; 4 = three criteria met; 5 = four criteria met | To effectively summarize the case consult note, inclusion of key elements of the source consult note is required |
| Readability: how cohesively readable was the response? | 1 = Totally incohesive information; 2 = Moderately incohesive information; 3 = Somewhat incohesive information (e.g., bullets/sentence fragments); 4 = Sentences but still not fully cohesive; 5 = Sentences that are cohesive | Output format and subsequent cohesiveness of its content affect readability of the consult note summary |
| Lay understandability: how lay-friendly and understandable was the output? | 1 = Jargon-heavy, clinical tone: uses complex terminology and a formal, technical style that is difficult for most people to understand; 2 = Mostly jargon: predominantly uses specialized language with some effort to simplify; 3 = Mixed: mixes technical terms with lay-friendly explanations, aiming for clarity without oversimplification; 4 = Conversational tone: uses everyday language and a casual style that is easy for most people to understand, minimal use of jargon; 5 = Highly accessible: completely avoids jargon, using simple, clear language and an engaging tone suitable for a general audience | The extent to which jargon and technical writing style are used impacts its lay interpretability |
| Output accuracy: how accurate and relevant was the output (e.g., citations)? | 1 = All hallucinations and/or no relevance; 2 = A lot of hallucinations; 3 = Some hallucinations; 4 = Very few hallucinations; 5 = No hallucinations/did not make anything up | Presence of errors or hallucinations affects the extent to which user oversight and corrections would be required |
This task used a real patient consult note of a de-identified patient. This task tested the ability to summarize and translate a medical report to be understood with actionable conclusions by a lay person with no medical knowledge. LLM, large language model.
Table 4
| Dimension | Likert scale | Justification |
|---|---|---|
| Comprehensiveness: how comprehensively did the LLM answer the question as directed during the first query? | Includes summary of key elements of the report: breadth of options (conventional, IMRT, SBRT), intention of treatment (reference to oligometastatic, aggressiveness of treatment, performance status, prognosis), impact of spine location, dose selection details (i.e., spinal cord safety); 1 point each. 1 = no criteria met; 2 = one criterion met; 3 = two criteria met; 4 = three criteria met; 5 = four criteria met | The prompt directly requests or implies inclusion of specific elements key to MDM; missing elements could hinder optimal decision-making |
| Readability: how cohesively readable was the response? | 1 = Totally incohesive information; 2 = Moderately incohesive information; 3 = Somewhat incohesive information (e.g., bullets); 4 = Sentences but still not fully cohesive; 5 = Sentences that are cohesive | Output format and subsequent cohesiveness of its content affect readability of the MDM output |
| Coherent resolution: how necessary were follow-up questions and how logically consistent were their answers? | 1 = Multiple follow-up questions required, responses are inconsistent between queries; 2 = Multiple follow-up questions required, responses are consistent between queries; 3 = Some follow-up questions required, responses are inconsistent between queries; 4 = Some follow-up questions required, responses are consistent between queries; 5 = No follow-up questions required | If additional queries are required to fully address elements of the prompt, the number of additional queries needed and the consistency of output recommendations between answers impact overall interpretation and trust in the output |
| Output accuracy: how accurate and relevant was the output (e.g., citations)? | 1 = No citations; 2 = All hallucinations and/or no relevance; 3 = Mixed relevant real citations and/or some hallucinations; 4 = All real citations, less than 50% relevant; 5 = All real citations, more than 50% relevant | Accuracy and relevancy of citations affect the extent to which user oversight and corrections would be required |
This task used the actual case details of a de-identified patient. This task tested the ability to create RO medical decisions including treatment technique, fractionation schedule, and dose constraints. Case history relevant to supporting a higher complexity treatment delivery modality, SBRT, was provided but not explicitly noted to be significant to supporting this treatment type. This task tested the ability to generate accurate and reasonable argumentation for a non-standard treatment regimen. It also tested the ability to provide its own evidence for its reasoning. In the initial iteration of the test, all models performed poorly, offering broad or generalized recommendations. Standardized follow-up prompts were created to constrain answers after the initial response, with specific emphasis on accurate and safe recommendations for delivering an SBRT treatment for a metastatic lesion. All follow-up prompts were templated and queried in sequence for each LLM. IMRT, intensity-modulated radiation therapy; LLM, large language model; MDM, medical decision-making; RO, radiation oncology; SBRT, stereotactic body radiation therapy.
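The coherent resolution scale in Table 4 combines two rater judgments: how many follow-up questions were needed (none, some, or multiple) and whether the responses stayed consistent between queries. A minimal sketch of that mapping is given below, assuming the five levels are listed in ascending order from 1 to 5. The enum and function names are hypothetical, and the boundary between "some" and "multiple" follow-ups is left to the rater because the table does not define it.

```python
from enum import Enum


class FollowUps(Enum):
    """Rater's judgment of how many follow-up questions were needed."""
    NONE = "none"
    SOME = "some"
    MULTIPLE = "multiple"


# (follow-up level, responses consistent between queries) -> Likert score
_SCORES = {
    (FollowUps.MULTIPLE, False): 1,
    (FollowUps.MULTIPLE, True): 2,
    (FollowUps.SOME, False): 3,
    (FollowUps.SOME, True): 4,
}


def coherent_resolution_score(follow_ups: FollowUps, consistent: bool) -> int:
    """Return the 1-5 coherent resolution score for one LLM session."""
    if follow_ups is FollowUps.NONE:
        # Consistency between queries is not applicable with a single query.
        return 5
    return _SCORES[(follow_ups, consistent)]
```

Note that, as written in the table, the number of follow-ups dominates consistency: a session needing multiple follow-ups with consistent answers (2) still scores below a session needing some follow-ups with inconsistent answers (3).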
This study adopts a formative, exploratory approach and does not test pre-specified hypotheses. Rather, it seeks to develop and validate practically grounded evaluation criteria through provider-centered inquiry. The evaluation dimensions were derived from the provider needs assessment and the iterative thematic analysis described above; for instance, hallucination detection was prioritized given its direct patient safety implications, while lay-friendliness was included to reflect real-world communication demands in oncology. The complete evaluation instruments and detailed rationale for each dimension are available online: https://cdn.amegroups.cn/static/public/jmai-2025-1-251-1.pdf.
Using a structured repeated-measures evaluation protocol with standardized prompts, two ROPs prompted three commercial LLMs (Gemini, Copilot, ChatGPT) for all use cases. The prompts contained de-identified text from patient notes, case details, and radiology reports. Prior to evaluation, ROPs reviewed scoring criteria together to align on each dimension and its Likert scales. All LLMs were prompted on the same date to minimize model update confounds. Each query was performed as a new session in the given LLM, and the model’s memory was cleared after each iteration. ROPs first scored independently, then discussed discrepancies until consensus; initial agreement exceeded 80%, no scores varied by more than 2 points, and consensus was always achieved. Each scoring session took approximately 2 hours including discussion. To validate measure robustness, the identical process was repeated approximately 6 months later following model updates to all three LLMs (8/24/2024 and 3/10/2025). Evaluators scored outputs directly from each LLM’s native interface in real time to replicate authentic clinical use and allow follow-up queries, precluding blinding to the source model. While this design choice may introduce bias from rater preferences, it was done to prioritize ecological validity consistent with our pragmatic study design. Tables 1-4 illustrate the dimensions for each use case along with their justification for inclusion in the evaluation framework. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. Ethics approval for this study was deemed exempt as it does not involve human subjects.
Statistical analysis
Agreement between ROP scores noted above was calculated as (exact agreement/total cases) ×100.
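For concreteness, the agreement calculation can be sketched as follows. The function names are hypothetical, and the scores in the usage note are invented for illustration; they are not study data. The second helper reports the largest absolute difference between raters, which corresponds to the check that no scores varied by more than 2 points.

```python
from typing import Sequence


def percent_exact_agreement(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Percentage of items on which both raters gave the same Likert score."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must score the same non-empty set of items")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100.0 * matches / len(rater_a)


def max_discrepancy(rater_a: Sequence[int], rater_b: Sequence[int]) -> int:
    """Largest absolute difference between the two raters on any item."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must score the same non-empty set of items")
    return max(abs(a - b) for a, b in zip(rater_a, rater_b))
```

With illustrative scores `[4, 3, 5, 2, 5]` and `[4, 3, 4, 2, 5]`, the raters agree exactly on four of five items, so `percent_exact_agreement` gives 80.0 and `max_discrepancy` gives 1. Exact percent agreement does not correct for chance agreement, which is a known limitation of this simple measure.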
Results
Across all use cases, there was no clear winner—no LLM performed best across all clinical use cases and evaluation dimensions (Figures 1-4 and Table 5). Performance varied by task complexity, with summarization tasks yielding the strongest and most consistent results, while MDM proved the most challenging. Our repeat assessment at 6 months revealed similar overall performance trends, supporting the robustness of the evaluation measures despite model updates to all three LLMs. Below, we report results organized by use case.
Table 5
| Task | Performance |
|---|---|
| LOMN | ChatGPT and Gemini excelled in comprehensiveness, readability, and persuasion, while Copilot outperformed them in citational accuracy/hallucinations (Figure 1) |
| Radiology report summarization | In most categories, Copilot trailed behind ChatGPT and Gemini, which performed similarly (Figure 2) |
| Consult note summarization | All models avoided hallucinations and consistently generated reasonably comprehensive reports. The greatest performance variance occurred in lay understandability, where Copilot ranked the lowest (Figure 3) |
| Medical decision-making | This was the only use case with two sequential prompts—the second asking for a de novo treatment recommendation without external suggestions. ChatGPT and Gemini performed similarly across most categories. Copilot excelled in citation accuracy but lagged in comprehensiveness and response consistency, where Gemini also struggled (Figure 4) |
LLM, large language model; LOMN, Letter of Medical Necessity.
LOMN
Figure 1 demonstrates LLM performance across 4 scoring dimensions for the LOMN task at both time points. At T1, Gemini and ChatGPT tied on comprehensiveness (4/5) while Copilot lagged (3/5). Gemini led in readability (4/5) over ChatGPT (3/5) and Copilot (2/5). Persuasiveness was moderate for Gemini and ChatGPT (both 3/5), with Copilot lower (2/5). However, Copilot substantially outperformed on output accuracy (citation quality: 5/5 vs. ChatGPT at 2/5 and Gemini at 1/5). By T2, all three LLMs improved on comprehensiveness and readability: comprehensiveness converged at 5/5 across all models, and readability rose to 5/5 for both Gemini and ChatGPT (Copilot: 4/5). Persuasiveness converged at 3/5 for all three models. Notably, Copilot’s citation advantage disappeared: output accuracy converged at 2/5 for all three LLMs, a drop of 3 points for Copilot. This pattern suggests that while model updates improved surface-level output quality, the ability to generate persuasive argumentation and accurate citations, arguably the most clinically important dimensions for an LOMN, remained limited.
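The change between time points can be tabulated directly from the LOMN scores reported above. The sketch below uses only those reported values; the data layout and function name are hypothetical conveniences, not part of the study protocol.

```python
DIMENSIONS = ("comprehensiveness", "readability", "persuasiveness", "output_accuracy")

# LOMN scores as reported in the text, in the order given by DIMENSIONS.
LOMN_SCORES = {
    "Gemini": {"T1": (4, 4, 3, 1), "T2": (5, 5, 3, 2)},
    "ChatGPT": {"T1": (4, 3, 3, 2), "T2": (5, 5, 3, 2)},
    "Copilot": {"T1": (3, 2, 2, 5), "T2": (5, 4, 3, 2)},
}


def score_changes(scores):
    """Return the T2 minus T1 difference per model and dimension."""
    return {
        model: {
            dim: t2 - t1
            for dim, t1, t2 in zip(DIMENSIONS, by_time["T1"], by_time["T2"])
        }
        for model, by_time in scores.items()
    }


if __name__ == "__main__":
    for model, deltas in score_changes(LOMN_SCORES).items():
        print(model, deltas)
```

Laid out this way, the only negative change for the LOMN task is Copilot's output accuracy (5/5 to 2/5), while persuasiveness is unchanged for Gemini and ChatGPT and rises by one point for Copilot.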
Radiology report summarization
Figure 2 shows LLM performance across 4 scoring dimensions for the radiology report summarization task at both time points. This was the strongest-performing use case overall. Output accuracy was high across all three LLMs at both time points (5/5 for all at T1; 5/5 for all at T2), indicating minimal hallucination risk for this task. At T1, ChatGPT led comprehensiveness (5/5) over Gemini (4/5), while Copilot lagged considerably (2/5). Gemini led in both readability (5/5) and lay understandability (5/5), followed by ChatGPT (4/5 on both); Copilot scored 1/5 on both dimensions. By T2, comprehensiveness improved to 5/5 across all three LLMs. Readability also improved, with Gemini and ChatGPT reaching 5/5 and Copilot rising to 4/5. Lay understandability showed a notable shift: Gemini dropped from 5/5 to 2/5, while ChatGPT improved to 5/5 and Copilot improved substantially from 1/5 to 4/5. The consistently high output accuracy across both time points suggests that summarization of structured radiology reports is a relatively safe application with minimal hallucination risk.
Consult note summarization
Figure 3 demonstrates LLM performance across 4 scoring dimensions for the consult note summarization task at both time points. Results were notably stable across time points. Comprehensiveness was identical at both T1 and T2 (4/5 for all three LLMs). Output accuracy was also consistently high (5/5 for all three at T1; Gemini and ChatGPT at 5/5 with Copilot at 4/5 at T2), consistent with a lower hallucination risk for summarization tasks. Readability was stable: ChatGPT scored 4/5 at both time points, while Gemini and Copilot each scored 3/5 at both. Lay understandability remained the most variable dimension: ChatGPT consistently scored highest (4/5 at both time points), while Gemini (2/5 at both) and Copilot (1/5 at both) remained weak, indicating that their outputs would require substantial revision to be accessible to patients.
MDM
Figure 4 shows LLM performance across 4 scoring dimensions for the MDM task at both time points. This was the weakest and most variable use case. At T1, Copilot led comprehensiveness (4/5) over Gemini and ChatGPT (both 3/5). Readability was strongest for ChatGPT and Copilot (both 4/5), with Gemini lower (3/5). Copilot dominated output accuracy at T1 (5/5 vs. 3/5 for both Gemini and ChatGPT). However, coherent resolution (measuring the need for follow-up queries and logical consistency) was the critical weak point: Gemini and Copilot both scored 1/5, indicating multiple follow-up questions were required with inconsistent responses, while ChatGPT was moderate (3/5). By T2, results shifted substantially. ChatGPT improved in comprehensiveness (5/5) and coherent resolution (4/5), emerging as the strongest overall performer. Gemini improved moderately across dimensions (comprehensiveness 4/5, coherent resolution 2/5). Copilot’s performance diverged: while readability improved to 5/5, comprehensiveness collapsed to 1/5 and coherent resolution remained at 1/5, suggesting model updates may have degraded complex clinical reasoning capabilities. This use case most clearly demonstrated that LLM outputs for complex tasks added more verification burden than value for providers: time spent checking for hallucinations, inconsistencies, and following up with additional queries offset any potential efficiency gains.
Cross-cutting findings
Several patterns emerged across use cases. First, summarization tasks (radiology and consult note) yielded consistently high output accuracy (4–5/5), while generative tasks (LOMN and MDM) showed more variable and often lower citation accuracy. Second, ChatGPT was the most balanced performer overall, maintaining strong readability (3–5/5) and lay understandability (4–5/5) across tasks while avoiding major performance collapses. Third, Copilot showed the widest performance swings: leading in citation accuracy for LOMN at T1 (5/5) and MDM at T1 (5/5), but scoring as low as 1/5 on readability, lay understandability, and comprehensiveness in other contexts. Fourth, Gemini demonstrated strengths in readability and comprehensiveness for summarization but was inconsistent on lay understandability (ranging from 5/5 to 2/5 across tasks and time points). Fifth, persuasiveness for LOMN and coherent resolution for MDM, dimensions that capture higher-order clinical reasoning, remained limited for most LLMs (persuasiveness never exceeded 3/5, and only ChatGPT reached 4/5 on coherent resolution, at T2), underscoring that current models struggle most with tasks requiring argumentation and sustained logical consistency. This performance gap between summarization and complex reasoning tasks highlights that the appropriateness of LLM integration varies substantially by task type and complexity.
Discussion
To our knowledge, this study is the first evaluation of ecologically situated use cases for LLMs in RO, paired with practically grounded evaluation criteria. This framework emerged from close collaboration between clinicians and AI experts, balancing clinical relevance with technical feasibility. The use cases stemmed from a pragmatic approach, prioritizing perceived safety and usefulness from the provider’s perspective. Assessments were conducted at two distinct points over a 6-month period, which included model updates, confirming the resilience of the evaluation measures against changes in model performance. The study also demonstrated that there is no one LLM to rule them all: no publicly available LLM outperformed all others.
Effective use of LLMs in RO
The use of generative AI, particularly LLMs, is inevitable in RO (28). Effective use of LLMs requires a practical, applicable understanding of their performance (29,30). The 4 use cases serve as a springboard for how publicly available LLMs may be used in a safe and useful way. While fine-tuned advanced LLMs can provide better performance, they will still have weaknesses. The contributions of this paper, namely the use cases and the evaluation framework, can be used to highlight those blind spots to ensure their safety and usefulness.
This study offers a practical framework for assessing LLM performance in terms of both safety and usability, shedding light on their strengths and blind spots—critical for accountable, high-stakes decision-making in RO. Given the inherent opacity of LLMs, fully “opening” the black box is not feasible. Instead, users need evaluation methods that reveal limitations and effectively communicate them to future users to mitigate risks. The evaluation presented here provides a formative path toward that goal.
The study highlights a tension between automation’s efficiency and the burden of verification. Since LLMs inevitably hallucinate (31,32), users must always double-check accuracy. This mirrors the challenge of working with an assistant prone to mistakes—sometimes, it is easier to do the task yourself than to verify and correct errors. Thus, the trade-off between automation’s benefits and oversight costs must be carefully weighed.
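One hedged way to make this trade-off concrete is a break-even condition. The symbols are illustrative assumptions rather than quantities measured in this study: $t_{\text{manual}}$ is the time to complete the task unaided, $t_{\text{prompt}}$ the time to prompt the LLM and read its output, $t_{\text{verify}}$ the time to check that output, $p_{\text{err}}$ the probability that the output needs correction, and $t_{\text{fix}}$ the time to correct it.

```latex
\[
t_{\text{prompt}} + t_{\text{verify}} + p_{\text{err}}\, t_{\text{fix}} < t_{\text{manual}}
\]
```

Under these assumptions, LLM assistance saves time only when the inequality holds. For tasks where $t_{\text{verify}}$ or $p_{\text{err}}$ is large, as the MDM results suggest, the left-hand side can exceed the time needed to do the task directly.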
Recommendations & takeaways
For medical professionals: this work offers a practical framework for assessing LLM performance in terms of both safety and usability. The evaluation measures are model agnostic; thus, they can be used as an analytic tool to achieve a baseline evaluation of the LLM of your choice. The work also establishes a formative benchmark for three popular, publicly available commercial LLMs, helping users make informed decisions about which model best suits each use case. For example, all 3 LLMs had few to no issues with hallucinations in summarization tasks compared with the MDM task; this insight can be used to scope which uses are relatively safer. Beyond the measures, we present curated use cases practically grounded in an ROP’s workflow (avoiding sanitized or unrealistic scenarios), which can guide others on what to use LLMs for. Even if your use case differs, you can find the closest match among the four use cases (e.g., summarization) and assess LLMs using the provided evaluation measures.
For LLM developers and technologists: going beyond algorithm-centered benchmarking, this work makes LLM evaluation more human-centered. Algorithmic benchmarks do not capture how useful the technology is because they do not adequately address the sociotechnical gap (20). A human-centered evaluation framework grounded in practical use cases and measures can serve as a guiding tool to develop more practically grounded evaluation of LLMs that are likely to make a positive difference to a medical practitioner’s workflow.
Limitations and future work
Due to the stochastic, non-deterministic nature of LLMs, an inherent limitation is the inability to ascertain whether the same question (prompt) will yield the same output when asked twice, which makes direct comparison challenging. To mitigate the overall variance of the outputs, we prompted all LLMs on the same days and ensured the outputs were similar (even if not exactly the same). Our scoring framework also purposefully did not include a dimension for accuracy of recommendations for the MDM use case. This is because we selected an MDM clinical scenario in which there was no single established, data-driven best practice. As such, a range of clinical interventions could be potentially acceptable. In these contexts, safety may also be poorly defined, as it is often weighed against factors such as patient preference that are not easily captured using ground truth metrics. We elected to employ such a use case to reflect the common real-world scenario in which MDM involves conflicting data or patient-specific features that strain the limits of easily searchable guidelines. This is in contrast to prior studies that have tested LLMs in less-taxing synthetic tasks such as passing the USMLE board exams (4), where a correct answer can be more concretely established. An additional limitation arises from the use of different Likert scales for each use case. While this approach was selected to enhance specificity and meaning, it also introduces inconsistency and may hinder comparability between domains and use cases.
It is noted that ChatGPT and Copilot are both built on the same underlying technology (OpenAI GPT-4/GPT-5) and thus may be expected to provide similar results. However, these LLMs do not use the same application programming interface, and our results suggest different performance despite potentially similar data and training sets.
Also of importance, generative models are inherently stochastic, so some randomness in their output is inevitable. Unlike classical machine learning, where one could fix a “seed” to get the same results repeatedly, there exists no avenue to do so for such models from the user side. Given this inherent technical limitation, evaluations will carry an element of uncertainty that one cannot control for. Drawing from pragmatic design principles, the goal was to create ecologically situated scenarios in the framework that represent real-world events, which carry this randomness. Even with the randomness, there is robustness in the data given the repeated measures taken within and between each time point.
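The contrast with seeded randomness can be illustrated with a minimal sketch using Python's standard library. The function `noisy_scores` is a hypothetical stand-in for any stochastic process; it is not a model of an LLM and not part of the study's methods.

```python
import random


def noisy_scores(seed=None, n=5):
    """Draw n pseudo-random scores on a 1-5 scale (hypothetical stand-in)."""
    rng = random.Random(seed)  # seed=None draws the seed from system entropy
    return [rng.randint(1, 5) for _ in range(n)]


# With a fixed seed, two runs are identical by construction.
run_a = noisy_scores(seed=42)
run_b = noisy_scores(seed=42)
print(run_a == run_b)  # True

# Without a seed, the caller has no control over the sequence,
# so repeated runs will generally differ.
unseeded = noisy_scores()
```

The consumer chat interfaces evaluated here resemble the unseeded call from the user's point of view, which is why repeated measures, rather than exact reproduction, were used to characterize performance.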
Although outside the scope of this article, we would be remiss not to highlight the host of ethical concerns that are associated with LLMs. These include issues related to data safety and patient privacy that arise when sharing real patient data in LLM queries as well as biases that can be perpetuated by LLMs due to limitations of their training sets and algorithms. Such biases may lead to output that exacerbates existing discriminatory practices or is inaccurate (33). While not expressly built to evaluate such biases, these themes may be captured in our framework under “Comprehensiveness” and “Output Accuracy” categories. Additionally, only a single task was tested per use case scenario, which may amplify the influence of model stochasticity on individual results. The repeated evaluation at 6 months, which yielded consistent performance trends across all three LLMs, provides some mitigation against this concern. Nevertheless, future studies should employ multiple tasks per scenario to more robustly characterize stable model performance across use cases.
Lastly, we did not seek to address a potentially invaluable use of LLMs in medicine: extracting information from electronic health records. While this may be of key use to researchers, we instead focused on use cases that would be of primary utility in clinical practice. However, similar analyses have sought to evaluate the performance of LLMs in data extraction from unstructured and semi-structured electronic health records (34).
In the future, beyond safety and usefulness, the field will require explainable LLMs to ensure accountability and mitigate liability concerns. This transparency is needed both within the algorithm and through studies like ours that evaluate safety and usability from outside the algorithm using use case evaluation. This is an open area of research where the field of Human-centered Explainable AI (HCXAI) (35) can be informative and design methods such as Seamful XAI could promote proactive mitigation of pitfalls (26,36). Future work should also apply the evaluation framework to more use cases in RO and across more public LLMs (e.g., Perplexity, Claude) and fine-tuned ones (e.g., Med-PaLM). Future summative evaluations could incorporate independent raters to strengthen external validity and further validate the evaluation framework. In addition, future studies could benchmark LLM‑generated LOMNs against payer‑specific policies (e.g., EviCore) once stable mapping resources are in place.
Conclusions
In this provider-centered evaluation aimed at assessing the usefulness of LLMs in RO, concerns around hallucinations and lack of explainability significantly limit the utility of LLMs in standard clinical practice. However, given their widespread adoption, clinicians will likely employ them for various purposes and to differing degrees. The proposed framework offers a means for uncovering the relative strengths and weaknesses of LLMs in this context, helping to define the boundaries of appropriate LLM use in RO and facilitating safe application of such systems in this clinical context. Future efforts such as incorporating explainable AI may further enhance safety, usefulness, and accountability of medical applications of LLMs.
Acknowledgments
None.
Footnote
Data Sharing Statement: Available at https://jmai.amegroups.com/article/view/10.21037/jmai-2025-1-251/dss
Peer Review File: Available at https://jmai.amegroups.com/article/view/10.21037/jmai-2025-1-251/prf
Funding: None.
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://jmai.amegroups.com/article/view/10.21037/jmai-2025-1-251/coif). The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. Ethics approval for this study was deemed exempt as it does not involve human subjects.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Global Market Insights Inc. Generative AI in Healthcare Market Size & Share 2024-2032. 2024. [Accessed April 16, 2025]. Available online: https://www.gminsights.com/industry-analysis/generative-ai-in-healthcare-market
- Blease CR, Locher C, Gaab J, et al. Generative artificial intelligence in primary care: an online survey of UK general practitioners. BMJ Health Care Inform 2024;31:e101102. [Crossref] [PubMed]
- Naughton J. If AI can provide a better diagnosis than a doctor, what’s the prognosis for medics? 2024. [Accessed April 16, 2025]. Available online: https://www.theguardian.com/commentisfree/2024/nov/30/if-ai-can-provide-a-better-diagnosis-than-a-doctor-whats-the-prognosis-for-medics
- Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health 2023;2:e0000198. [Crossref] [PubMed]
- Ayers JW, Poliak A, Dredze M, et al. Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum. JAMA Intern Med 2023;183:589-96. [Crossref] [PubMed]
- Craig Spencer MD MPH [@Craig_A_Spencer]. LLM Skepticism from Doctors--that talk of AI ‘replacing’ doctors is wildly off the mark. Twitter. 2023. [Accessed April 16, 2025]. Available online: https://x.com/Craig_A_Spencer/status/1671374003087241216
- Laura Heacock, MD [@heacockmd]. LLM Skepticism Doctors: I saw a slew of Twitter articles talking about how doctors can all be replaced by ChatGPT and how we should all become prompt engineers. Twitter. 2023. [Accessed April 16, 2025]. Available online: https://x.com/heacockmd/status/1639252174449397762
- Steven Phillips, MD [@StevePhillipsMD]. LLM Skepticism Doctors: AI creeps ever closer to replacing doctors. This is so dangerous. Twitter. 2023. [Accessed April 16, 2025]. Available online: https://x.com/StevePhillipsMD/status/1645455654923165696
- Spencer C. AI can’t replicate this key part of practicing medicine. STAT. 2023. [Accessed April 16, 2025]. Available online: https://www.statnews.com/2023/07/18/medical-ai-doctors-chatgpt/
- Tyson A, Pasquini G, Spencer A, Funk C. 60% of Americans Would Be Uncomfortable With Provider Relying on AI in Their Own Health Care. Pew Research Center. 2023. [Accessed April 17, 2025]. Available online: https://www.pewresearch.org/science/2023/02/22/60-of-americans-would-be-uncomfortable-with-provider-relying-on-ai-in-their-own-health-care/
- Chow JCL, Li K. Large Language Models in Medical Chatbots: Opportunities, Challenges, and the Need to Address AI Risks. Information 2025;16:549.
- Chow JCL, Li K. Developing Effective Frameworks for Large Language Model-Based Medical Chatbots: Insights From Radiotherapy Education With ChatGPT. JMIR Cancer 2025;11:e66633. [Crossref] [PubMed]
- Chakravorti B. Why AI Failed to Live Up to Its Potential During the Pandemic. Harvard Business Review. [Accessed April 17, 2025]. Available online: https://hbr.org/2022/03/why-ai-failed-to-live-up-to-its-potential-during-the-pandemic
- O’Leary L. How IBM’s Watson Went From the Future of Health Care to Sold Off for Parts. Slate. 2022. [Accessed April 17, 2025]. Available online: https://slate.com/technology/2022/01/ibm-watson-health-failure-artificial-intelligence.html
- Chen RJ, Lu MY, Chen TY, et al. Synthetic data in machine learning for medicine and healthcare. Nat Biomed Eng 2021;5:493-7. [Crossref] [PubMed]
- Goncalves A, Ray P, Soper B, et al. Generation and evaluation of synthetic patient data. BMC Med Res Methodol 2020;20:108. [Crossref] [PubMed]
- Bedi S, Liu Y, Orr-Ewing L, et al. Testing and Evaluation of Health Care Applications of Large Language Models: A Systematic Review. JAMA 2025;333:319-28. [Crossref] [PubMed]
- Hager P, Jungmann F, Holland R, et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat Med 2024;30:2613-22. [Crossref] [PubMed]
- Ackerman MS. The intellectual challenge of CSCW: The gap between social requirements and technical feasibility. Human–Computer Interact 2000;15:179-203.
- Ehsan U, Saha K, De Choudhury M, et al. Charting the Sociotechnical Gap in Explainable AI: A Framework to Address the Gap in XAI. Proc ACM Hum-Comput Interact 2023;7:1-32.
- Liu Z, Wang P, Li Y, et al. RadOnc-GPT: A Large Language Model for Radiation Oncology. arXiv:2309.10160v3 [Preprint]. 2023. Available online: https://doi.org/10.48550/arXiv.2309.10160
- Dennstädt F, Hastings J, Putora PM, et al. Exploring Capabilities of Large Language Models such as ChatGPT in Radiation Oncology. Adv Radiat Oncol 2024;9:101400. [Crossref] [PubMed]
- Wang P, Liu Z, Li Y, et al. Fine-tuning open-source large language models to improve their performance on radiation oncology tasks: A feasibility study to investigate their potential clinical applications in radiation oncology. Med Phys 2025;52:e17985. [Crossref] [PubMed]
- Rosson MB, Carroll JM. Scenario based design. In: Sears A, Jacko JA, editors. Hum-Comput Interact. Boca Raton (FL): CRC Press; 2009:145-162.
- Ehsan U, Liao QV, Muller M, et al. Expanding Explainability: Towards Social Transparency in AI systems. Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (CHI '21). New York (NY): Association for Computing Machinery; 2021:82.
- Ehsan U, Liao QV, Passi S, et al. Seamful XAI: Operationalizing Seamful Design in Explainable AI. Proc ACM Hum-Comput Interact 2024;8:1-29.
- Ahmed SK, Mohammed RA, Nashwan AJ, et al. Using thematic analysis in qualitative research. J Med Surg Public Health 2025;6:100198.
- Piras A, Mastroleo F, Colciago RR, et al. How Italian radiation oncologists use ChatGPT: a survey by the young group of the Italian association of radiotherapy and clinical oncology (yAIRO). Radiol Med 2025;130:453-62. [Crossref] [PubMed]
- Carriero S, Cannella R, Cicchetti F, et al. AI Revolution in Radiology, Radiation Oncology and Nuclear Medicine: Transforming and Innovating the Radiological Sciences. J Med Imaging Radiat Oncol 2025;69:649-59. [Crossref] [PubMed]
- Piras A, Morelli I, Colciago RR, et al. The continuous improvement of digital assistance in the radiation oncologist's work: from web-based nomograms to the adoption of large-language models (LLMs). A systematic review by the young group of the Italian association of radiotherapy and clinical oncology (AIRO). Radiol Med 2024;129:1720-35.
- Xu Z, Jain S, Kankanhalli M. Hallucination is Inevitable: An Innate Limitation of Large Language Models. arXiv:2401.11817v2 [Preprint]. 2025. Available online: https://doi.org/10.48550/arXiv.2401.11817
- Hicks MT, Humphries J, Slater J. ChatGPT is bullshit. Ethics Inf Technol 2024;26:38.
- Chow JCL, Li K. Ethical Considerations in Human-Centered AI: Advancing Oncology Chatbots Through Large Language Models. JMIR Bioinform Biotechnol 2024;5:e64406. [Crossref] [PubMed]
- Ntinopoulos V, Rodriguez Cetina Biefer H, Tudorache I, et al. Large language models for data extraction from unstructured and semi-structured electronic health records: a multiple model performance evaluation. BMJ Health Care Inform 2025;32:e101139. [Crossref] [PubMed]
- Ehsan U, Riedl MO. Human-Centered Explainable AI: Towards a Reflective Sociotechnical Approach. In: Stephanidis C, Kurosu M, Degen H, et al., editors. HCI International 2020 - Late Breaking Papers: Multimodality and Intelligence. HCII 2020. Lecture Notes in Computer Science. Cham: Springer; 2020:449-66.
- Ehsan U, Riedl MO. Explainability pitfalls: Beyond dark patterns in explainable AI. Patterns (N Y) 2024;5:100971. [Crossref] [PubMed]
Cite this article as: Ehsan U, Jang BS, McNutt TR, Alcorn SR. What matters when generative artificial intelligence enters the clinic: a clinician-centered evaluation of ChatGPT, Gemini and Copilot in radiation oncology. J Med Artif Intell 2026;9:53.





