Quality assessment of ChatGPT-3.5—generated patient information leaflets in vascular surgery

Mark Twyford; Brian Fahey; Colum Keohane; Daniel Westby; J. Donagh Healy; Stewart W. Walsh

doi:10.21037/jmai-2025-136

Original Article

Mark Twyford¹, Brian Fahey², Colum Keohane³, Daniel Westby³, J. Donagh Healy⁴, Stewart W. Walsh³

¹Department of Vascular Surgery, St James’s Hospital, Dublin, Ireland; ²Royal College of Surgeons in Ireland, Dublin, Ireland; ³Department of Vascular Surgery, University Hospital Galway, Galway, Ireland; ⁴Department of Vascular Surgery, Tallaght University Hospital, Dublin, Ireland

Contributions: (I) Conception and design: M Twyford, B Fahey, JD Healy, SW Walsh; (II) Administrative support: M Twyford, B Fahey; (III) Provision of study materials or patients: M Twyford, B Fahey, JD Healy, SW Walsh; (IV) Collection and assembly of data: M Twyford, B Fahey, JD Healy, SW Walsh; (V) Data analysis and interpretation: All authors; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Mark Twyford, FRCS, Mch. Department of Vascular Surgery, St James’s Hospital, James’s Street, Dublin 8, D08 NHY1, Ireland. Email: marktwyford@rcsi.ie.

Background: Generative Pre-trained Transformer (ChatGPT) is a generative artificial intelligence (AI) model developed by OpenAI (San Francisco, CA, USA). It generates responses based on the input received and can solve problems and complete tasks using reinforcement techniques and machine learning from online sources. Despite the growing use of AI in healthcare, its application in vascular surgery remains limited. This study aimed to evaluate ChatGPT’s ability to generate patient information related to digital subtraction angiography (DSA). Fifteen commonly asked patient questions regarding DSA were identified, and ChatGPT-3.5 was used to produce responses. Additionally, the model was tasked with generating a complete patient information leaflet for DSA. The outputs were assessed for readability, informational quality, and appropriateness.

Methods: Fifteen questions were entered into ChatGPT-3.5, which also generated a complete patient information leaflet for DSA. The readability of the outputs was evaluated using two standardized scoring systems: the Flesch-Kincaid Reading Ease Score (FRES) and the Gunning Fog Index (GFI). The quality of the responses was assessed using the DISCERN tool, and their appropriateness was rated on a Likert scale.

Results: The readability analysis using the FRES yielded an average score of 31.21 (range, 16.29–53.57), corresponding to a college-level reading ability. The mean Flesch-Kincaid Grade Level (FKGL) was 13.54 (range, 9.37–16.60). The patient information leaflet generated by ChatGPT-3.5 scored 41.30 on the Reading Ease scale, also indicating a college-level reading age. Using the GFI, the average score for the responses was 15.91 (range, 12.24–20.84), equivalent to the reading level of a college junior or senior, while the patient information leaflet scored 14.43. These findings suggest that the content is written at a significantly higher level than the recommended 6th–8th grade reading level for patient education materials. The quality of the responses, assessed using the DISCERN tool, averaged 41, which is considered “fair”, whereas the patient information leaflet scored 34, classified as “poor”. Despite this, the majority of the content was rated as factually accurate and generally appropriate for the clinical context.

Conclusions: Using ChatGPT-3.5 to generate responses to patient questions and create a patient information leaflet resulted in content written at a significantly higher reading level than recommended for patient education. While individual responses demonstrated generally good appropriateness and achieved fair DISCERN scores, the overall quality and readability decreased when generating a complete leaflet. These findings suggest that ChatGPT-3.5 performs better with discrete questions than with comprehensive tasks. Care and caution should be exercised when considering the use of such tools for patient-facing materials.

Keywords: Angiography; artificial intelligence (AI); Generative Pre-trained Transformer (ChatGPT); patient information leaflet

Received: 23 May 2025; Accepted: 12 September 2025; Published online: 10 June 2026.

Highlight box

Key findings

• Both the responses of ChatGPT-3.5 to direct questions and the patient information leaflet it produced were written at a significantly higher reading level than recommended.

What is known and what is new?

• ChatGPT has been used to address common patient questions in other specialties, but little has been done in vascular surgery.

• This study attempted to generate an acceptable patient information leaflet using ChatGPT-3.5.

What is the implication, and what should change now?

• Care and caution should therefore be exercised when using these tools.

• Future work should compare the responses of ChatGPT-3.5 with those of newer iterations.

• The inputs should be repeated after a reasonable period of time and compared with the original outputs to see how much the model has “learned”.

Introduction

In recent years the use of artificial intelligence (AI) has grown exponentially. Its applications include addressing workforce shortages, improving productivity, fostering innovation, supporting the development of new products and services, and mitigating supply chain disruptions (1). In medicine, AI has been used to support both prognosis and diagnosis, as well as to answer prescription-related queries, including those concerning drug costs and efficacy (2).

Second generation AI refers to the current iteration which thinks, plans and solves problems and tasks by itself. In medicine, AI has the potential to enhance decision-making, reduce costs and generate predictions based on prior trends. Negative aspects include ethical concerns, limited understanding, as well as issues related to privacy and data security (3).

Generative Pre-trained Transformer (ChatGPT) is a generative AI model developed by OpenAI (San Francisco, CA, USA) that generates responses based on the input received. It can solve problems and complete tasks using reinforcement techniques and machine learning from online sources. It can then improve its own efficiency and respond to the user’s needs. It also has the ability to understand and mimic natural language (4,5). ChatGPT has already been used in the writing of medical articles and academic manuscripts (6).

Hurley et al. recently explored the utility of ChatGPT in addressing common patient queries regarding shoulder stabilization surgery (7). Fifteen commonly asked questions were selected and ChatGPT was used to generate responses. These responses were evaluated for accuracy, quality and readability using the JAMA benchmark criteria, DISCERN score, Flesch-Kincaid Reading Ease Score (FRES) and Flesch-Kincaid Grade Level (FKGL). They concluded that the responses given by the programme were generally of high quality; however, the source of the information provided was often unclear. They also concluded that a high reading level was needed to comprehend the information provided.

There has been little work regarding the role of ChatGPT in vascular surgery. Current efforts are primarily focused on developing tools that enhance patient care, streamline administrative and clinical workflows, support medical education and research, and enable the use of virtual assistants (8). Athavale et al. employed ChatGPT to address questions with regard to chronic venous disease patient management while Haidar et al. utilized the tool to generate patient information leaflets for common vascular procedures (9,10).

Digital subtraction angiography (DSA) is a well-established diagnostic and therapeutic modality in patients with leg ulceration (11). The aim of this study was to select 15 questions commonly asked by patients undergoing DSA and to generate responses to them using ChatGPT. We then asked ChatGPT-3.5 to generate a patient information leaflet for a patient undergoing DSA. The generated responses were then assessed for readability, quality of information and appropriateness.

Methods

Question generation

Two consultant vascular surgeons (J.D.H. and S.W.W.) from two different vascular centres in Ireland (University Hospital Galway and Tallaght University Hospital) and two vascular surgery trainees (M.T. and B.F.) were asked to devise 15 common questions that patients undergoing DSA might typically ask. These were agreed upon and the list was compiled. These commonly asked questions are represented below as questions 1 to 15 throughout this paper (Table 1).

Table 1

Compiled list of questions asked of ChatGPT

No.	Questions
1	What is a digital subtraction angiogram?
2	Do I need a digital subtraction angiogram?
3	Does a digital subtraction angiogram always work?
4	Should I get a digital subtraction angiogram?
5	Can my ulcer be managed without a digital subtraction angiogram?
6	What will happen after my digital subtraction angiogram?
7	Will my symptoms improve after my digital subtraction angiogram?
8	What are the risks of a digital subtraction angiogram?
9	Will I lose my leg after digital subtraction angiogram?
10	Does a digital subtraction angiogram hurt?
11	Can I drive after a digital subtraction angiogram?
12	When can I return to work after a digital subtraction angiogram?
13	When can I fly after digital subtraction angiogram?
14	Do I have to stay in hospital overnight after a digital subtraction angiogram?
15	Do I need to stop any medications before digital subtraction angiogram?
16	Can you make a patient information leaflet for a patient undergoing a digital subtraction angiogram?

ChatGPT, Generative Pre-trained Transformer.

ChatGPT was then asked to create a patient information leaflet for a patient undergoing DSA. This request is represented as question 16 throughout this paper.

ChatGPT input

Each of the 15 prompts was entered once into ChatGPT-3.5 by two independent users (M.T. and B.F.) on the same date (3 May 2024), giving two separate outputs per question. The two sets of outputs were found to be the same. No attempts were made to regenerate responses to control for variation. ChatGPT-3.5 was also asked to generate a patient information leaflet. The responses were then analyzed for readability and quality as outlined below. The full responses of the programme are provided in Appendix 1.

Readability

The readability of each of the generated responses was assessed using two separate tools: the Flesch-Kincaid score and the Gunning Fog Index (GFI) (Figure 1) (12,13). These are readability tests designed to indicate how difficult a passage of language is to understand. The Flesch-Kincaid score ranges from 0 to 100, with higher scores indicating greater ease of reading; for example, a score of 100 suggests the text is easily understood by an average 11-year-old student, while a score closer to 0 indicates content suitable for university-level readers. It uses average sentence length and word complexity to produce a numerical score. The GFI ranges from 6, which is readable by someone in the 6th grade, to 17, which would be readable by a college graduate. It is calculated from the average sentence length (total words divided by total sentences) and the proportion of complex words (complex words divided by total words).

Figure 1 Flesch-Kincaid and Gunning Fog Index.

The Flesch-Kincaid score and reading level were calculated using the “charactercalculator” website (14). Each response was entered verbatim into the calculator and the reading score, reading level, grade level score and reading note were calculated.

The GFI was calculated in a similar fashion using the calculator on the “gunning-fog-index” website (15).
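As an illustration of how these indices are derived, the following is a minimal Python sketch of the published Flesch Reading Ease, Flesch-Kincaid Grade Level and Gunning Fog formulas. It is not the code used by the websites cited above: the syllable counter (`count_syllables`) and the tokenisation in `text_stats` are simplified, hypothetical helpers, so scores computed this way may differ somewhat from those produced by the online calculators.

```python
import re


def count_syllables(word: str) -> int:
    """Rough vowel-group heuristic (an assumption; real calculators use their own rules)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing 'e' as silent unless the word ends in 'le'.
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def text_stats(text: str):
    """Return (sentences, words, syllables, complex_words) for a passage."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # Gunning Fog treats words of three or more syllables as "complex".
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)
    return len(sentences), len(words), syllables, complex_words


def flesch_reading_ease(sentences: int, words: int, syllables: int) -> float:
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(sentences: int, words: int, syllables: int) -> float:
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def gunning_fog(sentences: int, words: int, complex_words: int) -> float:
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))
```

Longer sentences and a higher share of polysyllabic words lower the Reading Ease score and raise both grade-level indices, which is why dense procedural terminology tends to push generated text toward college-level scores.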

Quality analysis

The quality of the responses was evaluated using the DISCERN criteria (16), a standardised method for assessing the quality of healthcare-related information. It comprises three sections: reliability, quality, and overall evaluation. Sixteen questions are used as discriminators. A score from 1 to 5 is given for each question and the scores are totalled to give a final score. Total scores are classed as excellent (63 to 75), good (51 to 62), fair (39 to 50), poor (27 to 38) or very poor (16 to 26). Two assessors (M.T. and B.F.) independently assessed each question. Discrepancies were discussed and a consensus was reached.

Appropriateness

The appropriateness of each response was graded according to the information contained in the passage. A Likert scale was used, ranging from 1 (completely disagree) to 5 (completely agree) with the statement that the response was appropriate (17).

Statistical analysis

Data were recorded and analysed using Microsoft Excel (Microsoft Corporation, Redmond, WA, USA). Descriptive statistics were used to summarise the data. Continuous variables are presented as mean and range.

Ethical considerations

This study did not involve the use of real patient data or interaction with human participants. All questions were devised by clinical professionals, and ChatGPT was used to simulate responses to these generic, hypothetical queries. No identifiable or sensitive information was collected or analyzed. As such, ethical approval was not required for this study.

Results

Readability

Flesch-Kincaid Reading Score

The full results are displayed in Table 2. The average reading score was 31.21 (range, 16.29–53.57), which corresponds to a college-level readership. Eight of the responses were at college graduate level (very difficult to read), 6 were at college level (difficult to read) and 1 was at 10th–12th grade level (fairly difficult to read). The average grade level score was 13.54 (range, 9.37–16.60). For the patient information leaflet the reading score was 41.30, again corresponding to a college-level reading age and difficult to read.

Table 2

Combined Flesch-Kincaid and Gunning Fog Index

Commonly asked patient questions and ChatGPT-generated patient information leaflet	Flesch-Kincaid reading score	Flesch-Kincaid grade level score	Flesch-Kincaid reading level	Gunning Fog Index
What is a digital subtraction angiography?	19.17	14.19	College graduate, very difficult	16.80
Do I need a digital subtraction angiogram?	23.03	14.23	College graduate, very difficult	16.79
Does a digital subtraction angiogram always work?	16.29	15.30	College graduate, very difficult	16.52
Should I get a digital subtraction angiogram?	25.53	14.75	College graduate, very difficult	17.35
Can my ulcer be managed without a digital subtraction angiogram?	25.55	13.70	College graduate, very difficult	15.71
What will happen after my digital subtraction angiogram?	39.68	12.39	College, difficult	14.42
Will my symptoms improve after my digital subtraction angiogram?	27.02	14.50	College graduate, very difficult	15.80
What are the risks of a digital subtraction angiogram?	33.86	12.38	College, difficult	14.54
Will I lose my leg after digital subtraction angiogram?	21.20	16.60	College graduate, very difficult	20.84
Does a digital subtraction angiogram hurt?	53.57	9.37	10th to 12th grade, fairly difficult	12.24
Can I drive after a digital subtraction angiogram?	42.45	11.52	College, difficult	15.23
When can I return to work after a digital subtraction angiogram?	44.39	12.82	College, difficult	13.99
When can I fly after digital subtraction angiogram?	39.11	12.85	College, difficult	14.70
Do I have to stay in hospital overnight after a digital subtraction angiogram?	30.10	14.54	College, difficult	17.82
Do I need to stop any medications before digital subtraction angiogram?	27.30	13.96	College graduate, very difficult	15.97
Can you make a patient information leaflet for a patient undergoing a digital subtraction angiogram?	41.30	12.90	College, difficult	14.43

ChatGPT, Generative Pre-trained Transformer.
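The descriptive reading levels in Table 2 follow from the numerical reading scores. The sketch below maps each reading score for the 15 question responses to a band and counts the bands. The cut-offs are the conventional Flesch Reading Ease bands and are an assumption here; the exact boundaries applied by the online calculator were not reported.

```python
from collections import Counter

# Conventional Flesch Reading Ease bands as (lower bound, label); assumed, not taken from the calculator.
FLESCH_BANDS = [
    (90.0, "5th grade, very easy"),
    (80.0, "6th grade, easy"),
    (70.0, "7th grade, fairly easy"),
    (60.0, "8th to 9th grade, plain English"),
    (50.0, "10th to 12th grade, fairly difficult"),
    (30.0, "College, difficult"),
    (10.0, "College graduate, very difficult"),
    (0.0, "Professional, extremely difficult"),
]


def flesch_band(score: float) -> str:
    """Return the descriptive band for a Flesch Reading Ease score."""
    for lower, label in FLESCH_BANDS:
        if score >= lower:
            return label
    return FLESCH_BANDS[-1][1]  # scores below 0 fall into the hardest band


# Reading scores for questions 1 to 15, copied from Table 2.
question_scores = [19.17, 23.03, 16.29, 25.53, 25.55, 39.68, 27.02, 33.86,
                   21.20, 53.57, 42.45, 44.39, 39.11, 30.10, 27.30]

band_counts = Counter(flesch_band(s) for s in question_scores)
for label, n in band_counts.most_common():
    print(f"{n:2d}  {label}")
```

Under these assumed cut-offs, the counts are 8 responses at college graduate level, 6 at college level and 1 at 10th to 12th grade level.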

GFI

The full results are also displayed in Table 2. The average score was 15.91 (range, 12.24–20.84), which is equivalent to the reading level of a college junior or senior. The highest score was 20.84 (college graduate) and the lowest was 12.24 (high school senior). The patient information leaflet’s score was 14.43.
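The summary statistics can be reproduced from the Gunning Fog Index column of Table 2. The study used Microsoft Excel; the Python sketch below is only an equivalent illustration, and the helper name `summarise` is hypothetical.

```python
from statistics import mean

# Gunning Fog Index for questions 1 to 15, copied from Table 2.
gfi_scores = [16.80, 16.79, 16.52, 17.35, 15.71, 14.42, 15.80, 14.54,
              20.84, 12.24, 15.23, 13.99, 14.70, 17.82, 15.97]


def summarise(values):
    """Return (mean rounded to 2 decimal places, minimum, maximum)."""
    return round(mean(values), 2), min(values), max(values)


gfi_mean, gfi_min, gfi_max = summarise(gfi_scores)
print(f"GFI mean {gfi_mean:.2f} (range, {gfi_min:.2f}-{gfi_max:.2f})")
```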

Quality analysis

The DISCERN tool was used to analyse the quality of the responses as a whole. All the responses to the questions were assessed as a single entity and an overall score was given. An overall rating of 3 indicated that the patient information leaflet and the generated responses had potentially important, but not serious, shortcomings. Table 3 shows the results in full. The total score given was 41, with an average score of 2.73 per question. This is considered fair using the DISCERN scoring system. The information leaflet scored lower at 34 and would therefore be considered poor. As can be seen from Table 3, in both cases no aims were mentioned, no indication was given of the sources or date of the information, and information regarding the impact on quality of life was lacking. Additionally, the patient information leaflet gave no information on alternative treatments or on what would happen if the treatment was not used.

Table 3

DISCERN scores

Question	Rating generated questions	Ratings generated patient information leaflet
1. Are the aims clear?	1	1
2. Does it achieve its aims?	n/a	n/a
3. Is it relevant?	5	5
4. Is it clear what sources of information were used to compile the publication (other than the author or producer)?	1	1
5. Is it clear when the information used or reported in the publication was produced?	1	1
6. Is it balanced and unbiased?	2	3
7. Does it provide details of additional sources of support and information?	1	1
8. Does it refer to areas of uncertainty?	3	5
9. Does it describe how each treatment works?	5	5
10. Does it describe the benefits of each treatment?	4	2
11. Does it describe the risks of each treatment?	4	5
12. Does it describe what would happen if no treatment is used?	3	1
13. Does it describe how the treatment choices affect overall quality of life?	2	1
14. Is it clear that there may be more than one possible treatment choice?	3	1
15. Does it provide support for shared decision-making?	3	4
16. Based on the answers to all of the above questions, rate the overall quality of the publication as a source of information about treatment choices	3	3
Total	41	34
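The scoring arithmetic can be sketched as follows, shown here for the question-response column of Table 3 only. The band limits are those given in the Methods; the function name `discern_band` is hypothetical, and item 2 is omitted because it was recorded as n/a.

```python
# DISCERN item ratings for the question responses (item number -> rating), copied from Table 3.
question_ratings = {
    1: 1, 3: 5, 4: 1, 5: 1, 6: 2, 7: 1, 8: 3, 9: 5,
    10: 4, 11: 4, 12: 3, 13: 2, 14: 3, 15: 3, 16: 3,
}

# Bands as stated in the Methods: (lowest total, highest total, label).
DISCERN_BANDS = [
    (63, 75, "excellent"),
    (51, 62, "good"),
    (39, 50, "fair"),
    (27, 38, "poor"),
    (16, 26, "very poor"),
]


def discern_band(total: int) -> str:
    """Classify a total DISCERN score using the bands above."""
    for low, high, label in DISCERN_BANDS:
        if low <= total <= high:
            return label
    raise ValueError(f"total {total} is outside the defined bands")


total = sum(question_ratings.values())
average = round(total / len(question_ratings), 2)
print(total, average, discern_band(total))
```

For the question responses this gives a total of 41, an average of 2.73 across the 15 rated items, and a classification of fair.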

Appropriateness

The majority of the information was regarded as appropriate and factual. The responses gave detail and even potential alternatives. All question responses received a score of 5 except for the question on flying, which was given a score of 2 because the response suggested that flying might be possible the day after the procedure, albeit with the caveat to discuss this with a healthcare provider. The patient information leaflet was also given a score of 2.

Many of the responses advocated discussion between the patient and their healthcare provider. It was also noted that if pain was experienced postoperatively, medical assistance should be sought as it could be a sign of a potential complication. The full scores are listed in Table 4.

Table 4

Appropriateness of questions

No.	Question	Score
1	What is a digital subtraction angiography?	5
2	Do I need a digital subtraction angiogram?	5
3	Does a digital subtraction angiogram always work?	5
4	Should I get a digital subtraction angiogram?	5
5	Can my ulcer be managed without a digital subtraction angiogram?	5
6	What will happen after my digital subtraction angiogram?	5
7	Will my symptoms improve after my digital subtraction angiogram?	5
8	What are the risks of a digital subtraction angiogram?	5
9	Will I lose my leg after digital subtraction angiogram?	5
10	Does a digital subtraction angiogram hurt?	5
11	Can I drive after a digital subtraction angiogram?	5
12	When can I return to work after a digital subtraction angiogram?	5
13	When can I fly after digital subtraction angiogram?	2
14	Do I have to stay in hospital overnight after a digital subtraction angiogram?	5
15	Do I need to stop any medications before digital subtraction angiogram?	5
16	Can you make a patient information leaflet for a patient undergoing a digital subtraction angiogram?	2

Discussion

The aim of this study was to evaluate the responses that ChatGPT generated to questions commonly asked by patients undergoing DSA, and to evaluate a patient information leaflet that it produced. Readability, quality and appropriateness of the information provided were assessed.

With regard to readability, a high level of education is needed to assimilate the information. On both the Flesch-Kincaid score and the GFI, a college-level education is needed to understand the passages fully. The American Medical Association and the National Institutes of Health recommend that patient materials be written at the sixth and eighth grade reading level, respectively (18). Lower literacy can directly affect healthcare outcomes (19). This study indicates that ChatGPT’s responses are written at a significantly higher reading level than recommended: all but one are at college level or above, and the lowest reading level measured was 10th to 12th grade. This is consistent with Hurley et al.’s findings regarding ChatGPT and shoulder stabilization surgery (7). Haidar et al. reported similar results when they evaluated ChatGPT and three common vascular procedures. They found a Flesch-Kincaid Reading Ease score of 33.3, which indicated poor readability of the AI-generated responses, and a mean GFI of 16.7, similar to our averages of 15.91 for the responses and 14.43 for the information leaflet (10). At present, it would seem that the level of literacy needed to fully understand information generated by ChatGPT may be too high. Ayyaswami et al. evaluated the readability of online cardiovascular disease-related health education materials and found that 99.5% of these articles were written beyond the American Medical Association’s recommended reading level (20). This may be relevant, as ChatGPT draws its responses from information available online.

The DISCERN score for the responses was similar to that in Haidar et al.’s study, which reported a mean score of 50.3, also rated fair. However, Hurley et al. reported a higher DISCERN score of 60, which is classed as good (7). Overall, it would appear the quality of the information was reasonable. However, the lower score of 34 for the information leaflet may indicate that ChatGPT responds better to direct questioning. In our analysis the responses scored very poorly on aims, data sources and provision of additional resources. As Hurley et al. also observed, the responses in this study made numerous references to discussion with the patient’s healthcare provider (7).

As seen from the results, most of the responses were deemed appropriate. The only items that scored 2 were the question on flying and the leaflet. The leaflet scored poorly because it gave no alternatives to treatment and did not mention further management possibilities. The risks were not discussed in depth and there was no mention of returning to work or flying.

It should also be noted that online materials in general may have a higher reading level than recommended. San Norberto et al. reviewed patient information material on venous thrombosis from 7 medical societies (European Society for Vascular Surgery, Society for Vascular Medicine, Society for Vascular Surgery, Vascular Society for Great Britain and Ireland, Australia and New Zealand Society for Vascular Surgery, Canadian Society for Vascular Surgery and the American Heart Association). They reported a median Flesch Reading Ease of 56.10, corresponding to a “fairly difficult” reading level for all 58 recommendations (21). Similarly, Haidar et al. found a mean Flesch-Kincaid Reading Ease score of 59.1 and a mean GFI of 12.8 for patient information leaflets from The Circulation Foundation, versus 33.3 and 16.7 for the ChatGPT responses; the Circulation Foundation leaflets were therefore of “fairly difficult” readability, with a GFI equivalent to a high school senior (10). However, this is still higher than recommended.

Yacob et al. looked at the information gleaned from Wikipedia from the perspective of medical education. They reported mean FRES, FKGL and GFI values of 30.5, 13.8 and 16.6, respectively, with a mean DISCERN score of 52.9 (good). Readability was similar to that in our study; however, their DISCERN score was more than 10 points higher. In their conclusion they mention that Wikipedia articles are written with the general population in mind (22).

Limitations

ChatGPT itself may be seen as a limitation. Biswas et al. reviewed the role of ChatGPT in public health. They highlighted its potential drawbacks, including limited accuracy of the data, bias and limitations, lack of context, limited engagement and lack of any direct contact with health professionals. They concluded that it should be used only after careful consideration and implemented cautiously (23). We also accessed the programme on a specific day. It is constantly updating and learning, so our results might differ if it were accessed at a later date. In addition, ChatGPT-3.5 was used as opposed to GPT-4. In several medical studies GPT-4 has been shown to generate more accurate responses. Compared with GPT-3.5, it draws on feedback from a substantial number of users and experts to inform its training, as well as insights from previous models (24). This increased accuracy was further established by Takagi et al., who demonstrated that GPT-4 outperformed GPT-3.5 in the Japanese Medical Licensing Examination in terms of overall accuracy and on more difficult and disease-specific questions. GPT-3.5 did not pass the 2023 exam whereas GPT-4 did (25). According to internal evaluations, GPT-4 is 40% more likely to produce factual responses and significantly outperforms GPT-3.5 in many tests such as the Uniform Bar Exam and the Medical Knowledge Self-Assessment Programme (26).

As mentioned previously, ChatGPT does not state the sources of its information or when that information was produced. There may also be bias arising from the surgeons’ experience when the questions were being compiled.

There are also ethical considerations in using these programmes. Levkovich et al. used both GPT-3.5 and GPT-4.0 to evaluate suicide risk in a hypothetical patient. They concluded that GPT-4 produced similar results to mental health professionals, whereas GPT-3.5 underestimated suicide risk, particularly in severe cases (27). In addition, as Haleem et al. describe in their discussion, according to OpenAI, ChatGPT may sometimes react to damaging instructions or display discriminatory behaviour, and may occasionally compose plausible-sounding but incorrect or nonsensical responses (28). These issues are of concern.

There have, however, been some successes using ChatGPT in medicine. Günay et al. demonstrated that GPT-4 was more successful in answering both everyday and more complex ECG questions. When GPT-4 was compared with cardiologists, it performed better on everyday questions, and the answers were very similar as the questions increased in difficulty (29). Leypold et al. posed 9 plastic surgery scenarios to GPT-4 and compared its responses with those of 3 board-certified plastic surgeons. They concluded that GPT-4 was able to give viable treatment options, analyse complex clinical scenarios and address comorbidities (30).

Bajwa et al. examined how AI has transformed, is transforming and may transform medicine (31). In 10 years’ time AI systems will have learned a vast amount of new information and should be able to deliver patient-centred and accurate medical information. Future challenges such as data quality and access, infrastructure, and safety and regulation will need to be overcome.

Biswas asked the programme a series of questions including “What is Chat GPT?” and “Will Chat GPT replace human medical writer?”. He then also asked it to produce an essay on “the opinion of the radiology resident or fellow in clinical radiology or imaging science on issues specific to trainee experiences”. It was able to produce an essay placing itself in the shoes of the fellow/resident, which reads as if a human wrote it (6). Haider et al. devised two questionnaires and garnered responses from ChatGPT. The first set included 20 questions on non-complex medical and administrative matters. The second set included 20 complex medical questions which required expertise in the topic. They then assessed the responses of two versions, ChatGPT-3.5 and ChatGPT-4. ChatGPT-4 performed better on both occasions. They did not assess the readability level in this case (10).

Haidar et al. also recently looked at 3 common vascular procedures: endovascular aortic repair, endovenous laser ablation and femoro-popliteal bypass. They used ChatGPT to elicit responses on these procedures and evaluated their readability and quality. They then compared this to patient information leaflets from the Circulation Foundation UK. They concluded that AI-generated information was poor in both readability of the text and quality of information (10).

Koťátková et al. also highlighted that it is not only patients and medical professionals who may seek medical information (32). Lawyers and researchers may also need access to the information. AI can be used to simplify and condense information which may previously have been filled with jargon and complex medical terms. It would seem that more recent versions of chatbots are more accurate and precise in their answers.

There are also limitations to both the Flesch-Kincaid score and the GFI. The Flesch-Kincaid score focuses on syllable count; however, some words may have more syllables but are in common usage (14). Both also fail to ascertain whether the vocabulary used is familiar to the average patient.

Conclusions

In conclusion, both the responses of ChatGPT-3.5 to direct questions and the patient information leaflet it produced were written at a significantly higher reading level than recommended. Care and caution should therefore be exercised when using these tools.

Areas of further research would include comparing the responses of ChatGPT-3.5 with those of GPT-4.0, and repeating the inputs after a reasonable period of time and comparing the outputs with the original responses to see how much the model has “learned”.

Acknowledgments

As mentioned, ChatGPT-3.5 was used to elicit responses from 15 questions. We also then asked it to make its own patient information leaflet. AI was not used for any other areas in this text. This manuscript will be used as part of Mark Twyford’s MD thesis.

Footnote

Peer Review File: Available at https://jmai.amegroups.com/article/view/10.21037/jmai-2025-136/prf

Funding: None.

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://jmai.amegroups.com/article/view/10.21037/jmai-2025-136/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This study did not involve the use of real patient data or interaction with human participants. All questions were devised by clinical professionals, and ChatGPT was used to simulate responses to these generic, hypothetical queries. No identifiable or sensitive information was collected or analyzed. As such, ethical approval was not required for this study.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.

References

1. McKendrick J. AI Adoption Skyrocketed Over the Last 18 Months. Available online: https://hbr.org/2021/09/ai-adoption-skyrocketed-over-the-last-18-months
2. Basu K, Sinha R, Ong A, et al. Artificial Intelligence: How is It Changing Medical Sciences and Its Future? Indian J Dermatol 2020;65:365-70.
3. Vodanović M, Subašić M, Milošević D, et al. Artificial Intelligence in Medicine and Dentistry. Acta Stomatol Croat 2023;57:70-84.
4. Cheng K, He Y, Li C, et al. Talk with ChatGPT About the Outbreak of Mpox in 2022: Reflections and Suggestions from AI Dimensions. Ann Biomed Eng 2023;51:870-4.
5. Yu H. Reflection on whether Chat GPT should be banned by academia from the perspective of education and teaching. Front Psychol 2023;14:1181712.
6. Biswas S. ChatGPT and the Future of Medical Writing. Radiology 2023;307:e223312.
7. Hurley ET, Crook BS, Lorentz SG, et al. Evaluation High-Quality of Information from ChatGPT (Artificial Intelligence-Large Language Model) Artificial Intelligence on Shoulder Stabilization Surgery. Arthroscopy 2024;40:726-731.e6.
8. Lareyre F, Nasr B, Chaudhuri A, et al. Comprehensive Review of Natural Language Processing (NLP) in Vascular Surgery. EJVES Vasc Forum 2023;60:57-63.
9. Athavale A, Baier J, Ross E, et al. The potential of chatbots in chronic venous disease patient management. JVS Vasc Insights 2023;1:100019.
10. Haidar O, Jaques A, McCaughran PW, et al. AI-Generated Information for Vascular Patients: Assessing the Standard of Procedure-Specific Information Provided by the ChatGPT AI-Language Model. Cureus 2023;15:e49764.
11. Faglia E, Mantero M, Caminiti M, et al. Extensive use of peripheral angioplasty, particularly infrapopliteal, in the treatment of ischaemic diabetic foot ulcers: clinical results of a multicentric study of 221 consecutive diabetic subjects. J Intern Med 2002;252:225-32.
12. Kincaid JP, Fishburne RP Jr., Rogers RL, et al. Derivation of New Readability Formulas (Automated Readability Index, Fog Count And Flesch Reading Ease Formula) For Navy Enlisted Personnel. Institute for Simulation and Training 1975.
Soliman L, Soliman P, Gallo Marin B, et al. Craniosynostosis: Are Online Resources Readable? Cleft Palate Craniofac J 2024;61:1228-32. [Crossref] [PubMed]
Flesch Kincaid Calculator. Available online: https://charactercalculator.com/flesch-reading-ease/
Gunning Fog Index. Available online: http://gunning-fog-index.com/fog.cgi
Charnock D, Shepperd S, Needham G, et al. DISCERN: an instrument for judging the quality of written consumer health information on treatment choices. J Epidemiol Community Health 1999;53:105-11. [Crossref] [PubMed]
Chan SM, Gardezi M, Satam K, et al. Virtual vascular surgery interest group during the coronavirus disease 2019 pandemic. J Vasc Surg 2023;77:279-285.e2. [Crossref] [PubMed]
Rooney MK, Santiago G, Perni S, et al. Readability of Patient Education Materials From High-Impact Medical Journals: A 20-Year Analysis. J Patient Exp 2021;8:2374373521998847. [Crossref] [PubMed]
Dewalt DA, Berkman ND, Sheridan S, et al. Literacy and health outcomes: a systematic review of the literature. J Gen Intern Med 2004;19:1228-39. [Crossref] [PubMed]
Ayyaswami V, Padmanabhan D, Patel M, et al. A Readability Analysis of Online Cardiovascular Disease-Related Health Education Materials. Health Lit Res Pract 2019;3:e74-80. [Crossref] [PubMed]
San Norberto EM, García-Rivera E, Revilla Á, et al. Readability of patient educational materials in venous thrombosis: analysis of the 2021 ESVS guidelines and comparison with other medical societies information. Int Angiol 2022;41:149-57. [Crossref] [PubMed]
Yacob M, Lotfi S, Tang S, et al. Wikipedia in Vascular Surgery Medical Education: Comparative Study. JMIR Med Educ 2020;6:e18076. [Crossref] [PubMed]
Biswas SS. Role of Chat GPT in Public Health. Ann Biomed Eng 2023;51:868-9. [Crossref] [PubMed]
Lim ZW, Pushpanathan K, Yew SME, et al. Benchmarking large language models’ performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard. EBioMedicine 2023;95:104770. [Crossref] [PubMed]
Takagi S, Watari T, Erabi A, et al. Performance of GPT-3.5 and GPT-4 on the Japanese Medical Licensing Examination: Comparison Study. JMIR Med Educ 2023;9:e48002. [Crossref] [PubMed]
Available online: https://openai.com/index/gpt-4-research
Levkovich I, Elyoseph Z. Suicide Risk Assessments Through the Eyes of ChatGPT-3.5 Versus ChatGPT-4: Vignette Study. JMIR Ment Health 2023;10:e51232. [Crossref] [PubMed]
Haleem A, Javaid M, Singh RP. An era of ChatGPT as a significant futuristic support tool: A study on features, abilities, and challenges. BenchCouncil Transactions on Benchmarks Standards and Evaluations 2022;2:100089.
Günay S, Öztürk A, Özerol H, et al. Comparison of emergency medicine specialist, cardiologist, and chat-GPT in electrocardiography assessment. Am J Emerg Med 2024;80:51-60. [Crossref] [PubMed]
Leypold T, Schäfer B, Boos A, et al. Can AI Think Like a Plastic Surgeon? Evaluating GPT-4’s Clinical Judgment in Reconstructive Procedures of the Upper Extremity. Plast Reconstr Surg Glob Open 2023;11:e5471.
Bajwa J, Munir U, Nori A, et al. Artificial intelligence in healthcare: transforming the practice of medicine. Future Healthc J 2021;8:e188-94. [Crossref] [PubMed]
Koťátková A, Miralles Hernández M. Artificial intelligence at the service of patients: Using ChatGPT to make medical reports easier to understand. Mètode Science Studies Journal 2025;15:e28177.

doi: 10.21037/jmai-2025-136
Cite this article as: Twyford M, Fahey B, Keohane C, Westby D, Healy JD, Walsh SW. Quality assessment of ChatGPT-3.5—generated patient information leaflets in vascular surgery. J Med Artif Intell 2026;9:49.
