Brief Report

Comparative evaluation of ChatGPT versions in anesthesia patient education for total hip arthroplasty

Andrew D. Fisher1, Gabrielle Fisher1, Renuka George1, Ellen Hay1, Christopher D. Wolla1, Bethany J. Wolf2

1Department of Anesthesia and Perioperative Medicine, Medical University of South Carolina, Charleston, SC, USA; 2Department of Public Health Sciences, Medical University of South Carolina, Charleston, SC, USA

Correspondence to: Andrew D. Fisher, MD. Department of Anesthesia and Perioperative Medicine, Medical University of South Carolina, 167 Ashley Avenue, Suite 301, MSC 912, Charleston, SC 29464, USA. Email: Fisheran@musc.edu.

Abstract: The integration of artificial intelligence (AI) into patient education provides new opportunities for enhancing physician-patient communication and patient health outcomes. The conversational abilities of large language models (LLMs) powered by natural language processing (NLP) are uniquely suited to interacting with users in an approachable and accessible way. Applied to the medical realm, LLMs have the potential to reduce gaps in patients’ health literacy or medical knowledge and improve health outcomes. Here, we examine the potential of GPT, an LLM developed by OpenAI, to serve as a tool for patient education in the subspecialty of anesthesia. This study evaluates the accuracy, completeness, and readability of responses from three GPT variants, GPT-3.5, GPT-4, and a prompt-engineered GPT-4 (GPT-4P), to common anesthesia questions related to total hip arthroplasty. Accuracy and completeness of responses were assessed by three regional anesthesia experts using 6-point Likert scales. Readability and word counts were analyzed using online tools. Across the three models, GPT-3.5 showed the highest overall accuracy and completeness, while GPT-4P had the best readability. On pairwise comparison, no single model consistently outperformed the others across all metrics. All models provided responses that were more accurate than inaccurate and more complete than incomplete. Beyond these overall findings, the results are heterogeneous and warrant further study. Although GPT-3.5 performed best in completeness and accuracy on the global Type III F-tests, its pairwise advantage over GPT-4P was significant only for completeness. Studying prompt language and expanding evaluator pools would be logical next steps in examining this rapidly expanding aspect of patient care.

Keywords: Artificial intelligence (AI); GPT; anesthesiology; medical education


Received: 05 January 2024; Accepted: 21 March 2024; Published online: 31 May 2024.

doi: 10.21037/jmai-24-3


Introduction

Strong physician-patient communication is key to building relationships and delivering effective healthcare. Technological aids currently play an increasing role in health education, ranging from formal institutional resources (1) to informal patient-initiated Internet research. However, these approaches can leave gaps in patient knowledge due to reliance on incorrect or incomplete resources. With the popularization of publicly available large language models (LLMs) with massive knowledge bases, a solution to this knowledge gap may be on the horizon.

The recent surge in artificial intelligence (AI)-driven natural language processing (NLP) owes much of its momentum to OpenAI’s GPT series, whose third iteration, GPT-3, launched in June 2020. These LLMs are built on a transformer architecture designed for natural language understanding and generation (2), and their ascent is evident in ChatGPT, the chatbot powered by GPT, reaching 100 million monthly users within two months of its launch (3). Many are using ChatGPT for daily tasks including email composition, language translation, and as a data reference source.

Current models have passed challenging academic examinations, including the United States Medical Licensing Examination (USMLE), Law School Admission Test (LSAT), Scholastic Assessment Test (SAT), and Graduate Record Examination (GRE), demonstrating a vast and versatile knowledge base (4,5). Clinicians have also taken note of GPT’s potential, leading to research into its applicability as a healthcare resource. Areas previously explored include LLMs’ ability to answer questions pertaining to cardiology (6,7), ophthalmology (8), hematology (9), rheumatology (10), and general medical inquiries (11). Despite positive results, the knowledge base of an LLM should not be presumed to be equal across different subspecialties. Notably, ChatGPT was found to have unequal performance across subspecialties in two separate studies (12,13).

Assessing the capabilities of ChatGPT through thoughtful analysis is critical for determining its applicability in specialized medical fields. Numerous studies have investigated the comparative performance of these models in non-patient-facing tasks (11,14-19). A smaller subset of work has focused on patient-facing interactions (20-22). None have explored how the different available versions of ChatGPT perform in patient-facing communication in the realm of anesthesiology.

Here, the authors examined how iterations of ChatGPT performed when asked anesthesia-specific questions for total hip arthroplasty (THA). The accuracy, completeness, readability, and word count of responses from GPT-3.5, GPT-4, and a prompt-engineered GPT-4 (GPT-4P) to common patient questions regarding anesthesia for THA were evaluated.


Methods

The investigators (A.D.F., G.F., R.G., C.D.W., and E.H.) collaborated to construct 30 questions (Table 1) encompassing preoperative, intraoperative, and postoperative anesthesia-related questions frequently expressed by patients undergoing THA (Figure 1), based on their clinical experience caring for patients undergoing these procedures. Topics covered by the questions include fasting guidelines, risks of neuraxial procedures, implications of blood thinning medication for neuraxial placement, postoperative nausea and vomiting, and postoperative cognitive decline. On November 5, 2023, the questions were input into GPT-3.5, GPT-4, and GPT-4P, resulting in 90 distinct submissions across the three versions. For GPT-4P, the prompt and question were submitted simultaneously, with the prompt preceding the question. To ensure unbiased responses and prevent the models from learning from prior inputs, each question was presented in a new chat session. Questions were posed a single time to each model, as the authors felt repeated queries would increase the time burden of rating the responses without adding significantly to the evaluation of the models.

Table 1

Expert curated patient questions

Preoperative | Intraoperative | Postoperative
… when is the last time I can eat solid food? | … what are the options for anesthesia? | … how long does it take for general anesthesia to wear off?
… which medications can I take the morning of surgery? | … how is it decided if I receive spinal, epidural, or general anesthesia? | … will I receive pain and nausea medication in the recovery room?
… can I receive medication for anxiety before the surgery? | … what are the risks to receiving a spinal or epidural? | … are there options if oral opioid pain medication causes extreme nausea?
… does my weight influence my risk of anesthesia? | … can you explain the process of how a spinal or epidural is done? | … what are the risks of post procedure delayed cognitive recovery?
… does having asthma influence my anesthesia plan? | … can you explain what a peripheral nerve block is and why it may be needed? | … can anything be done to avoid delirium or confusion?
… if I have a cough or recent respiratory illness, does this influence my anesthesia? | … if I get a spinal or epidural, will I be awake for the surgery? | … will the anesthesia allow participation in physical therapy?
… can I still have my daily cup of coffee that morning? | … does my blood thinning medication influence my anesthesia plan? | … how long would the spinal or epidural make my legs feel numb?
… does my chronic pain medication influence my ability to safely receive anesthesia? | … I have had back surgery. Does this influence my anesthesia plan? | … with a history of opioid addiction in the past, would it be possible to avoid opioids entirely?
… does a family history of allergy to anesthesia influence my ability to have a safe experience? | … how will I be monitored when I am under anesthesia? | … if I use a CPAP at night for sleep apnea, should I bring it with me?
… I have sleep apnea. Does this affect my anesthesia plan? | … how common is awareness under anesthesia and how is it prevented? | … if I use a CPAP at night for sleep apnea, should I bring it with me?

A complete list of the 30 patient questions. Note that these questions begin mid-sentence. Each question was preceded by the phrase “For my hip replacement surgery”. CPAP, continuous positive airway pressure.

Figure 1 Flow chart of study design. The sequence from question creation, input into ChatGPT versions, data collection and data analysis.

The prompt used in GPT-4P was constructed by authors A.D.F. and G.F. in accordance with OpenAI’s prompt engineering guide, including defining a role, the tone of communication, and instructions specific to the task (23). The resulting prompt was as follows: “You are an anesthesiologist with years of experience. You are here to answer patient questions from the text below. Be kind, compassionate, and precise. Do not provide direct medical advice. Acknowledge when you do not have an answer to a question.”

Guides outlining prompting strategies in the medical space are available, yet no gold standard for medical prompt engineering exists (24,25). For this reason, the authors favored the concise single prompt outlined above over a complex, verbose one. GPT-4’s architecture includes enhancements that allow for more nuanced interactions with, and increased performance from, prompt engineering (26), so the authors applied the engineered prompt only to the GPT-4 model.
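The questions were submitted through the ChatGPT web interface rather than programmatically. Purely as an illustration of the procedure described above (a fresh, context-free session per question, with the engineered prompt prepended only for GPT-4P), a minimal sketch using the OpenAI chat API is shown below; the model identifiers, example question, and helper function are assumptions for illustration and were not part of the study.

```python
# Illustrative sketch only: the study used the ChatGPT web interface, not the API.
# Model identifiers, the example question, and the helper function are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "You are an anesthesiologist with years of experience. You are here to answer "
    "patient questions from the text below. Be kind, compassionate, and precise. "
    "Do not provide direct medical advice. Acknowledge when you do not have an "
    "answer to a question."
)

def ask(model: str, question: str, engineered: bool = False) -> str:
    """Submit one question as a fresh, single-turn request (no prior chat history)."""
    content = f"{PROMPT} {question}" if engineered else question
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content

question = "For my hip replacement surgery, when is the last time I can eat solid food?"
for model, engineered in [("gpt-3.5-turbo", False), ("gpt-4", False), ("gpt-4", True)]:
    print(ask(model, question, engineered))
```

Because each request carries no conversation history, it mirrors the new-chat-session condition described above.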

Each response (the supplementary material is available at https://cdn.amegroups.cn/static/public/jmai-24-3-1.pdf) was provided to three regional anesthesia fellowship-trained attending anesthesiologists, who were given instructions to rate the responses’ accuracy and completeness, along with explanations of the Likert scales used (Figure 2). Likert scales were chosen for accuracy and completeness based on previous literature (11,22,28). Raters were blinded to the model providing the response. Responses were manually randomized to mitigate the potential for bias, and all raters were given the responses in an identical order. The survey was constructed and resulting data were collected and managed using REDCap electronic data capture tools hosted at the Medical University of South Carolina (29). Responses were entered into Readable.com (27), an online tool which calculates Flesch-Kincaid grade level using sentence and word length. Scores are equivalent to the U.S. grade level of education required to understand the text (Figure 2), with a higher score indicating lower readability.
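Readable.com does not publish its exact implementation, but the standard Flesch-Kincaid grade level formula on which such scores are based is:

Flesch-Kincaid grade level = 0.39 × (total words / total sentences) + 11.8 × (total syllables / total words) − 15.59

Longer sentences and more syllables per word therefore raise the computed grade level.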

Figure 2 Expert reviewer instructions, explanation of Likert scores for accuracy and completeness, and readability calculation details. (A) Instructions were given to expert reviewers at the time of response evaluation, including Likert scale explanations for accuracy and completeness scoring. (B) F-K grade level calculation, used to evaluate readability, and its interpretation (27). F-K, Flesch-Kincaid.

Differences in accuracy and completeness by GPT version were evaluated using a linear mixed model approach. Models included a fixed effect for GPT version and random effects for reviewer and question, to account for correlation between responses to the same question and between evaluations of the same response by different reviewers. Differences between GPT models were estimated from a series of linear contrasts. P values for pairwise comparisons between GPT models within each outcome were Bonferroni adjusted to address multiple testing (multiplied by 3 to account for the three pairwise comparisons). The covariance structure from the model was also used to estimate the intraclass correlation between raters’ scores of the same responses, to examine consistency of scores between raters. Model assumptions were checked graphically and were found to be reasonable. Differences in readability and word count were evaluated similarly but excluded a random reviewer effect, as there was only one assessment for each response. All analyses were conducted in SAS v 9.4 (SAS Institute, Cary, NC, USA).
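As a sketch only (the exact model specification was not reported), one formulation consistent with the description above, for the rating Y_ijk given by reviewer j to the response of GPT version i to question k, is:

Y_ijk = μ + α_i + b_j + c_k + ε_ijk, with b_j ~ N(0, σ²_reviewer), c_k ~ N(0, σ²_question), and ε_ijk ~ N(0, σ²_residual),

where α_i is the fixed effect of GPT version. Under this formulation, the correlation between two raters’ scores of the same response is σ²_question/(σ²_reviewer + σ²_question + σ²_residual), which is one plausible way the reported inter-rater correlations could be obtained from the fitted variance components.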


Results

Table 2 provides the mean [95% confidence interval (CI)] for accuracy, completeness, and readability by model. The P values are provided for the Type III F-tests examining the hypotheses that at least one model differs from another and for the Bonferroni adjusted pairwise comparisons between models. Figure 3 shows the distributions of accuracy, completeness, readability, and word count scores by model.

Table 2

Accuracy, completeness, readability, and word count by version of ChatGPT

Outcome | GPT-3.5, mean (95% CI) | GPT-4, mean (95% CI) | GPT-4P, mean (95% CI) | P value (global Type III F-test) | P value, 3.5 vs. 4 | P value, 3.5 vs. 4P | P value, 4 vs. 4P
Accuracy | 4.98 (4.74, 5.22) | 4.59 (4.35, 4.83) | 4.76 (4.52, 4.99) | 0.02 | 0.01 | 0.30 | 0.65
Completeness | 4.66 (4.35, 4.96) | 3.80 (3.50, 4.10) | 3.66 (3.35, 3.96) | <0.001 | <0.001 | <0.001 | >0.99
Readability | 14.3 (13.5, 15.1) | 15.3 (14.5, 16.1) | 13.7 (12.9, 14.5) | 0.02 | 0.22 | 0.80 | 0.01
Word count | 357.6 (328.5, 386.7) | 182.8 (153.7, 211.8) | 173.1 (144.0, 202.2) | <0.001 | <0.001 | <0.001 | 0.57

Mean (95% CI) for accuracy, completeness, readability, and word count by version of ChatGPT estimated from a series of linear mixed models including fixed effects for GPT model and random effects for reviewers and questions. The P values are for the global Type III F-tests examining the hypotheses that at least one version of GPT differs from another and for Bonferroni adjusted pairwise comparisons between versions of ChatGPT. CI, confidence interval.

Figure 3 Violin plots illustrating the distributions of accuracy (A), completeness (B), readability (C), and word count (D) for responses to 30 questions on anesthesia and perioperative care for total hip arthroplasty, categorized by version of ChatGPT used. For accuracy and completeness (A,B), the mean of the three raters’ scores was calculated for each question, and these 30 averages are displayed as individual points on the violin plot for each GPT model assessed. For readability and word count (C,D), each point represents the value for a single response. In all panels, the width of the plot at a given level indicates the distribution of values, showing common versus rare scores.

There was a significant difference in accuracy between the models (Type III F-test P=0.02). GPT-3.5 had the highest overall accuracy with a mean of 4.98 (95% CI: 4.74, 5.22) out of 6, followed by GPT-4P with a mean of 4.76 (95% CI: 4.52, 4.99), and lastly GPT-4 with a mean of 4.59 (95% CI: 4.35, 4.83). In pairwise comparisons, GPT-4 had a significantly lower mean accuracy compared to GPT-3.5 (P=0.01), but there was no significant difference between GPT-3.5 and GPT-4P or between GPT-4 and GPT-4P (P=0.30 and P=0.65, respectively) (Figure 3A). We also examined agreement between raters within question and version. The intraclass correlation between raters for accuracy was 0.34, which suggests modest agreement.

There was a significant difference in completeness between the models (Type III F-test P<0.001). GPT-3.5 had the highest overall completeness with a mean score of 4.66 (95% CI: 4.35, 4.96) out of 6. GPT-4 and GPT-4P had similar mean scores of 3.80 (95% CI: 3.50, 4.10) and 3.66 (95% CI: 3.35, 3.96), both significantly lower than GPT-3.5 (P<0.001 for both pairwise comparisons) (Figure 3B). The intraclass correlation between raters for completeness was 0.44, which, though slightly higher than for accuracy, still suggests modest agreement.

There was a significant difference in readability between the models (Type III F-test P=0.02). GPT-4 had the poorest overall readability with a mean Flesch-Kincaid grade level of 15.3 (95% CI: 14.5, 16.1), followed by GPT-3.5 with a mean of 14.3 (95% CI: 13.5, 15.1), and lastly GPT-4P with a mean of 13.7 (95% CI: 12.9, 14.5). In pairwise comparisons, GPT-4 had significantly poorer readability on average compared to GPT-4P (P=0.01), but there was no significant difference between GPT-3.5 and GPT-4 or between GPT-3.5 and GPT-4P (P=0.22 and P=0.80, respectively) (Figure 3C). GPT-4 also exhibited more variability in readability compared to the other two versions.

There was a significant difference in the number of words in the responses between the models (Type III F-test P<0.001). GPT-3.5 had the highest overall mean word count with an average of 357.6 (95% CI: 328.5, 386.7) words per response. GPT-4 and GPT-4P had similar mean word counts of 182.8 (95% CI: 153.7, 211.8) and 173.1 (95% CI: 144.0, 202.2), which were both significantly lower than GPT-3.5 (P<0.001 for both pairwise comparisons) (Figure 3D).


Discussion

On collective assessment of their scores, GPT-3.5, GPT-4, and GPT-4P all demonstrated high capability in terms of accuracy and completeness. Each model proved quite accurate, with mean scores consistently surpassing 4.5 out of 6. For completeness, performance varied more than for accuracy, with a full point separating the best and worst mean scores; even with this variability, responses were rated more complete than incomplete. The readability scores were consistently high, denoting a reading level above high school graduation. This is partially explained by the subject matter: three- to four-syllable words like anesthesia, medications, and operation appear frequently in the responses, which greatly increases the Flesch-Kincaid score. Including a maximum reading grade level in the prompt was considered, but test responses using alternative questions were felt to be overly simplistic and would likely have unblinded the raters.

Comparative analysis of the three GPT models studied yielded some intriguing outcomes. GPT-4 never outperformed GPT-3.5 or GPT-4P on pairwise analysis. In pairwise comparison of GPT-3.5 with GPT-4P, their performances were similar: GPT-3.5 provided more complete information (4.66 vs. 3.66, P<0.001) and more verbose responses (P<0.001), but there was no significant difference in accuracy or readability. The superiority in completeness of GPT-3.5 over GPT-4 and GPT-4P could be attributed to its significantly longer responses, which may provide a more comprehensive answer to questions. These findings were somewhat unexpected to the authors, who had predicted that newer versions of GPT would mark an advancement in language processing and uniformly enhance every aspect evaluated; the absence of a single model outperforming the others across all metrics was therefore notable. This suggests that advancements in model versions do not linearly correlate with enhancements in all dimensions of performance, highlighting the complexity of language model optimization.

Considering these findings, the best model for patient use appears to be GPT-3.5. While GPT-3.5 and GPT-4P demonstrated similar performance in this study, requiring the additional step of entering a prompt creates a hurdle; if that hurdle does not yield significantly superior performance, it is difficult to justify. Furthermore, given GPT-3.5’s higher completeness scores, it seems counterintuitive to recommend a less complete resource. Though the verbosity and readability of GPT-3.5’s responses may cause trepidation in some practitioners, the primary strength of this technology lies in its interactive capability: if a response is too expansive, complicated, or confusing, a patient can simply ask for clarification or simplification.

Placing these results in the context of existing research, this study is the first to compare GPT-3.5 and GPT-4 in the anesthesia setting, and it revealed that enhancements in GPT versions do not guarantee superior performance in highly specialized care settings like anesthesiology. Consistent with existing literature (7,8,11,28,30-32), our findings suggest GPT models generally provide accurate information when formally analyzed and show considerable potential as a health knowledge resource for patients. Given the need for highly accurate information in the patient-facing medical setting, further refinement and validation are required before their application in patient education can be fully realized.

It is important to note areas where ChatGPT has underperformed to ensure that integration into practice is seamless and positive. These models perform suboptimally with negation, where the denial or contradiction of a statement can lead to misunderstandings, as well as with contextual and cultural nuances (33). In addition, biases embedded in the information the models provide may perpetuate preexisting biases (34). Hallucination, in which an LLM provides information that is not factual and is not grounded in its input data, poses significant risks in the medical context, as accuracy of responses is critical. Notably, there were no instances of hallucination in any of the models tested here. Lastly, incorporating these technologies into medical care underscores the critical importance of privacy and confidentiality. Ensuring that this integration does not compromise a patient’s trust or privacy is paramount, highlighting the need for careful and secure management of information.

Future areas for research include expanding the number and spectrum of evaluators, as well as integrating patient perspectives on LLM outputs. Given the relatively similar performances of GPT-3.5 and GPT-4P, formal analysis of prompt engineering in a medical context could also be explored. Future studies may also include responses generated by humans alongside LLMs to compare LLM outputs more accurately with the current standard.

As the field of AI continues to integrate into medicine, thoughtful implementation of this technology is crucial to capture its potential and avoid missteps. We hope this study sheds light on the feasibility and effectiveness of integrating LLMs into medical practice for patient education, offering a foundation for informed decision-making and successful adoption.


Acknowledgments

The authors used REDCap to collect data for this study.

Funding: This work was partially supported by NIH/NCATS grant (UL1 TR001450).


Footnote

Peer Review File: Available at https://jmai.amegroups.com/article/view/10.21037/jmai-24-3/prf

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://jmai.amegroups.com/article/view/10.21037/jmai-24-3/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This article does not contain any studies with human participants performed by any of the authors. Institutional IRB review deemed the project IRB exempt. Informed consent was waived as no patients were involved.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.


References

  1. Johns Hopkins Medicine. Anesthesia. Accessed November 10, 2023. Available online: https://www.hopkinsmedicine.org/health/treatment-tests-and-therapies/types-of-anesthesia-and-your-anesthesiologist
  2. Brown T, Mann B, Ryder N, et al. Language models are few-shot learners. NIPS '20: Proceedings of the 34th International Conference on Neural Information Processing Systems 2020;33:1877-901.
  3. Hu K. ChatGPT sets record for fastest-growing user base - analyst note. Reuters. 2023. Accessed October 11, 2023. Available online: https://www.reuters.com/technology/chatgpt-sets-record-fastest-growing-user-base-analyst-note-2023-02-01/
  4. OpenAI. GPT-4 Technical Report. arXiv:2303.08774.
  5. Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health 2023;2:e0000198. [Crossref] [PubMed]
  6. Sarraju A, Bruemmer D, Van Iterson E, et al. Appropriateness of Cardiovascular Disease Prevention Recommendations Obtained From a Popular Online Chat-Based Artificial Intelligence Model. JAMA 2023;329:842-4. [Crossref] [PubMed]
  7. Kassab J, El Dahdah J, Chedid El Helou M, et al. Assessing the Accuracy of an Online Chat-Based Artificial Intelligence Model in Providing Recommendations on Hypertension Management in Accordance With the 2017 American College of Cardiology/American Heart Association and 2018 European Society of Cardiology/European Society of Hypertension Guidelines. Hypertension 2023;80:e125-7. [Crossref] [PubMed]
  8. Momenaei B, Wakabayashi T, Shahlaee A, et al. Appropriateness and Readability of ChatGPT-4-Generated Responses for Surgical Treatment of Retinal Diseases. Ophthalmol Retina 2023;7:862-8. [Crossref] [PubMed]
  9. Kumari A, Kumari A, Singh A, et al. Large Language Models in Hematology Case Solving: A Comparative Study of ChatGPT-3.5, Google Bard, and Microsoft Bing. Cureus 2023;15:e43861. [Crossref] [PubMed]
  10. Krusche M, Callhoff J, Knitza J, et al. Diagnostic accuracy of a large language model in rheumatology: comparison of physician and ChatGPT-4. Rheumatol Int 2024;44:303-6. [Crossref] [PubMed]
  11. Ayers JW, Poliak A, Dredze M, et al. Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum. JAMA Intern Med 2023;183:589-96. [Crossref] [PubMed]
  12. Grigorian A, Shipley J, Nahmias J, et al. Implications of Using Chatbots for Future Surgical Education. JAMA Surg 2023;158:1220-2. [Crossref] [PubMed]
  13. Thirunavukarasu AJ, Hassan R, Mahmood S, et al. Trialling a Large Language Model (ChatGPT) in General Practice With the Applied Knowledge Test: Observational Study Demonstrating Opportunities and Limitations in Primary Care. JMIR Med Educ 2023;9:e46599. [Crossref] [PubMed]
  14. Luk DWA, Ip WCT, Shea YF. Performance of GPT-4 and GPT-3.5 in generating accurate and comprehensive diagnoses across medical subspecialties. J Chin Med Assoc 2024;87:259-60. [Crossref] [PubMed]
  15. Luke WANV, Seow Chong L, Ban KH, et al. Is ChatGPT 'ready' to be a learning tool for medical undergraduates and will it perform equally in different subjects? Comparative study of ChatGPT performance in tutorial and case-based learning questions in physiology and biochemistry. Med Teach 2024; Epub ahead of print. [Crossref] [PubMed]
  16. Rizzo MG, Cai N, Constantinescu D. The performance of ChatGPT on orthopaedic in-service training exams: A comparative study of the GPT-3.5 turbo and GPT-4 models in orthopaedic education. J Orthop 2024;50:70-5. [Crossref] [PubMed]
  17. Takagi S, Watari T, Erabi A, et al. Performance of GPT-3.5 and GPT-4 on the Japanese Medical Licensing Examination: Comparison Study. JMIR Med Educ 2023;9:e48002. [Crossref] [PubMed]
  18. Rosoł M, Gąsior JS, Łaba J, et al. Evaluation of the performance of GPT-3.5 and GPT-4 on the Polish Medical Final Examination. Sci Rep 2023;13:20512. [Crossref] [PubMed]
  19. Roos J, Kasapovic A, Jansen T, et al. Artificial Intelligence in Medical Education: Comparative Analysis of ChatGPT, Bing, and Medical Students in Germany. JMIR Med Educ 2023;9:e46482. [Crossref] [PubMed]
  20. Aiumtrakul N, Thongprayoon C, Arayangkool C, et al. Personalized Medicine in Urolithiasis: AI Chatbot-Assisted Dietary Management of Oxalate for Kidney Stone Prevention. J Pers Med 2024;14:107. [Crossref] [PubMed]
  21. Nov O, Singh N, Mann D. Putting ChatGPT's Medical Advice to the (Turing) Test: Survey Study. JMIR Med Educ 2023;9:e46939. [Crossref] [PubMed]
  22. Currie G, Robbie S, Tually P. ChatGPT and Patient Information in Nuclear Medicine: GPT-3.5 Versus GPT-4. J Nucl Med Technol 2023;51:307-13. [Crossref] [PubMed]
  23. OpenAI. Prompt Engineering. Accessed January 31, 2024. Available online: https://platform.openai.com/docs/guides/prompt-engineering/strategy-test-changes-systematically
  24. Wang J, Shi E, Yu S, et al. Prompt engineering for healthcare: Methodologies and applications. arXiv:2304.14670.
  25. White J, Fu Q, Hays S, et al. A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT. arXiv:2302.11382.
  26. Ouyang L, Wu J, Jiang X, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 (NeurIPS 2022) 2022;35:27730-44.
  27. Readable. Measure the Readability of Text - Text Analysis Tools. Accessed November 10, 2023. Available online: https://readable.com/text/
  28. Goodman RS, Patrinely JR, Stone CA Jr, et al. Accuracy and Reliability of Chatbot Responses to Physician Questions. JAMA Netw Open 2023;6:e2336483. [Crossref] [PubMed]
  29. Harris PA, Taylor R, Thielke R, et al. Research electronic data capture (REDCap)--a metadata-driven methodology and workflow process for providing translational research informatics support. J Biomed Inform 2009;42:377-81. [Crossref] [PubMed]
  30. Patnaik SS, Hoffmann U. Quantitative evaluation of ChatGPT versus Bard responses to anaesthesia-related queries. Br J Anaesth 2024;132:169-71. [Crossref] [PubMed]
  31. Segal S, Saha AK, Khanna AK. Appropriateness of Answers to Common Preanesthesia Patient Questions Composed by the Large Language Model GPT-4 Compared to Human Authors. Anesthesiology 2024;140:333-5. [Crossref] [PubMed]
  32. Mootz AA, Carvalho B, Sultan P, et al. The Accuracy of ChatGPT-Generated Responses in Answering Commonly Asked Patient Questions About Labor Epidurals: A Survey-Based Study. Anesth Analg 2024;138:1142-4. [Crossref] [PubMed]
  33. Jang M, Lukasiewicz T. Consistency Analysis of ChatGPT. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing; 2023.
  34. Ray PP. ChatGPT: A comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope. Internet of Things and Cyber-Physical Systems 2023;3:121-54. [Crossref]
doi: 10.21037/jmai-24-3
Cite this article as: Fisher AD, Fisher G, George R, Hay E, Wolla CD, Wolf BJ. Comparative evaluation of ChatGPT versions in anesthesia patient education for total hip arthroplasty. J Med Artif Intell 2024;7:19.
