Original Article

Evaluating the accuracy of large language models in pulmonary medicine scientific writing

Sana Dang1 ORCID logo, Michelle Melfi1, Phillip J. Gary1,2 ORCID logo

1Department of Internal Medicine, SUNY Upstate Medical University, Syracuse, NY, USA; 2Division of Pulmonary, Critical Care, and Sleep Medicine, SUNY Upstate Medical University, Syracuse, NY, USA

Contributions: (I) Conception and design: S Dang, PJ Gary; (II) Administrative support: All authors; (III) Provision of study materials or patients: All authors; (IV) Collection and assembly of data: S Dang, M Melfi; (V) Data analysis and interpretation: S Dang, M Melfi; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Phillip J. Gary, MD. Division of Pulmonary, Critical Care, and Sleep Medicine, SUNY Upstate Medical University, 750 E Adams St, Syracuse, NY 13210, USA; Department of Internal Medicine, SUNY Upstate Medical University, Syracuse, NY, USA. Email: garyp@upstate.edu.

Background: Large language models (LLMs) have vast potential applications in healthcare across clinical, educational, and research settings. The tendency of these models to fabricate plausible-sounding but factually incorrect information, referred to as artificial hallucination, is a significant limitation to their use. Studies evaluating the accuracy of these models in scientific writing across various subspecialties have demonstrated a variable yet alarming frequency of hallucinations in LLM-generated output. This study aimed to evaluate the accuracy of LLMs in scientific writing specific to the field of pulmonary medicine.

Methods: Five commonly encountered conditions relevant to pulmonary medicine were selected: asthma, chronic obstructive pulmonary disease, interstitial lung disease, pneumonia, and lung adenocarcinoma. For each condition, a new artificial intelligence (AI) chatbot was created in Google Gemini and tasked with providing ten research articles with complete citations. Two independent reviewers appraised each citation for accuracy by comparing it to the original article, using Google and PubMed searches for verification.

Results: A high degree of agreement was observed between reviewers (κ=0.9277). Of the 50 total citations, only one (2%), relevant to pneumonia, was found to be completely accurate across all citation components. Article title had the highest accuracy with 30 correct results (60%), while DOI performed the poorest with only two correct results (4%).

Conclusions: Only 2% of the citations generated by Google Gemini were found to be completely accurate across all citation components. A high degree of fabrication of plausible-sounding but factually incorrect results, or artificial hallucinations, was observed. Continued vigilance and rigorous efforts are therefore necessary to validate the authenticity of responses generated by LLMs, especially as they pertain to the medical literature.

Keywords: Artificial intelligence (AI); large language models (LLMs); pulmonary medicine


Received: 19 August 2024; Accepted: 25 November 2024; Published online: 20 February 2025.

doi: 10.21037/jmai-24-287


Highlight box

Key findings

• Only 2% of the citations generated by Google Gemini were found to be completely accurate across all citation components.

What is known and what is new?

• The incidence of artificial hallucinations in pulmonary medicine references was noted to be greater than or comparable to that reported in prior studies in other medical fields.

• These findings are concerning, especially given the presumed prowess of these tools in scientific writing and their widespread availability.

What is the implication and what should change now?

• Continued vigilance and rigorous efforts are necessary to validate the authenticity of responses generated by LLMs.

• Continued efforts in hallucination mitigation technologies, including the development of domain-specific LLMs, are necessary for reliable use in medicine.


Introduction

The emergence of multimodal large language models (LLMs) like OpenAI’s ChatGPT (1) and Google Gemini (2) (previously Google Bard), which offer easily accessible, user-friendly interfaces to generative artificial intelligence (GAI), has led to widespread and rapidly growing efforts to utilize these tools across various industries. The potential applications of LLMs in healthcare in clinical, educational, and research settings have been widely discussed. Optimistic claims about LLMs revolutionizing the way patients and healthcare providers access information have been blunted by concerns for the potential harms of artificial intelligence (AI) (3-6).

The impact of the natural language processing and generative capabilities of LLMs on scientific writing has quickly been realized. ChatGPT is already listed as an author on several academic publications (7). It has even been proposed that the field consider the advantages of elevating ChatGPT to the status of chief editor of reputable medical journals (8). LLMs may be particularly beneficial in helping with research and manuscript preparation for non-native English-speaking authors (9). AI-generated content has been shown to score higher in quality assessments when compared with the writing of experienced researchers (10). The performance and generality of these models are expected to increase as current “emerging” GAI, thought to be equal to or somewhat better than an unskilled human, advances to “competent” GAI, expected to be at least as capable as the 50th percentile of skilled adults (11).

While exploring the application of LLMs in scientific writing, several authors have been confronted with a major limitation: the tendency of these LLMs to produce plausible-sounding but incorrect or nonsensical answers, which has been acknowledged by OpenAI as a concern (12). These “artificial hallucinations” refer to the phenomenon of machines, such as chatbots, generating seemingly realistic experiences that do not correspond to any real-world input (13). This becomes particularly concerning given that it is often difficult for humans, as well as AI detection systems, to reliably distinguish original from AI-generated writing (14). Given their growing popularity, it is important to consider the consequences of countless learners and healthcare providers intentionally or inadvertently utilizing these multimodal LLMs to aid in clinical research and decision making.

A deeper analysis of AI-generated references has revealed an alarming frequency of hallucinations. In 2023, a study published as a precautionary anecdote noted that merely 10% of the references related to head and neck research cited by ChatGPT were completely accurate (15). Another study in the same field revealed that ChatGPT-4 demonstrated higher accuracy of references compared to its predecessor, ChatGPT-3.5 (16). In global health research, ChatGPT was observed to have inaccuracies in 94% of the generated references, yet it outperformed Google Bard which was noted to have fabricated the entirety of its responses (17). Given the increasing ubiquity and vast potential applications of these models, the clinical and academic implications of artificial hallucinations in pulmonary and critical care medicine would be significant.

While reviewing the available literature, we noticed a lack of comprehensive data on the reliability of LLMs in pulmonary medicine research. This study thus seeks to investigate the accuracy of references generated by Google Gemini (2) relevant to pulmonary medicine to further inform the utilization of LLMs in scientific writing.


Methods

Study design

The study was designed to assess the accuracy of LLM-generated references specific to the field of pulmonary medicine. In line with previously published studies in various other subspecialties (15,17), the authors chose to analyze 10 research articles relevant to each of 5 pulmonary medicine keywords, generating a total of 50 references for analysis. When ChatGPT-4 was initially applied to search for pertinent articles, the LLM output included fewer than the requested number of references, along with the recommendation to “search for articles on these topics in academic databases such as PubMed, Google Scholar, or your institution’s library”. The authors felt this restricted its application to our project and rendered the previously described key findings less relevant to the current capabilities of LLMs in scientific writing. Thus, this analysis was conducted using Google Gemini.

The study focused on five commonly encountered and frequently researched conditions relevant to pulmonary medicine: asthma, chronic obstructive pulmonary disease, interstitial lung disease, pneumonia, and lung adenocarcinoma. For each keyword, a new chatbot was created in Google Gemini and prompted to provide “10 research articles with complete Vancouver style citations including DOI”. On review of the generated references for relevance to the researched keywords, all 50 were deemed appropriate and included in the study. Institutional Review Board approval was not necessary as no direct patient data were assessed.
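Although the study interacted with the Gemini chatbot directly through its web interface, the prompting design can be approximated programmatically. The following is a minimal, illustrative sketch using Google’s publicly available Python SDK; the package, model name, and placeholder API key are assumptions for illustration and do not reflect the authors’ actual workflow:

import google.generativeai as genai  # assumed SDK (pip install google-generativeai)

genai.configure(api_key="YOUR_API_KEY")  # placeholder credential
model = genai.GenerativeModel("gemini-pro")  # assumed model identifier

conditions = [
    "asthma",
    "chronic obstructive pulmonary disease",
    "interstitial lung disease",
    "pneumonia",
    "lung adenocarcinoma",
]

for condition in conditions:
    # A fresh chat session per keyword mirrors the study's use of a new chatbot for each condition.
    chat = model.start_chat()
    response = chat.send_message(
        f"Provide 10 research articles on {condition} with complete "
        "Vancouver style citations including DOI"
    )
    print(response.text)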

Citation review

The citations generated by Google Gemini were appraised by comparing them to the original articles for accuracy in the article title, list of authors, journal name, year of publication, and DOI. Each component of the citation was scored on a binary system (1 for correct and 0 for incorrect) by two independent reviewers using the search tool on Google and PubMed for verification. Any discrepancies were jointly analyzed by the two reviewers and subsequently resolved by appropriately classifying the generated citations as accurate or erroneous prior to computation of descriptive statistics.

Statistical analysis

The scores for all citations were compiled. Inter-rater reliability was tested using Cohen’s Kappa. Descriptive statistics were computed, and the Chi-square test was used to compare results between the citation components tested for accuracy. Statistical analysis was performed using SPSS statistical software (Version 22.0).
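As an illustration of this analysis pipeline, the sketch below reproduces the same two tests with standard Python libraries. The reviewer scores and contingency counts shown are hypothetical placeholders rather than the study data, and the sketch approximates, rather than replicates, the authors’ SPSS procedure:

from sklearn.metrics import cohen_kappa_score
from scipy.stats import chi2_contingency

# Hypothetical binary scores (1 = correct, 0 = incorrect) assigned by each reviewer
# to the same sequence of citation components.
reviewer_1 = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
reviewer_2 = [1, 0, 1, 1, 0, 0, 1, 1, 1, 1]
kappa = cohen_kappa_score(reviewer_1, reviewer_2)  # Cohen's Kappa for inter-rater reliability

# Hypothetical contingency table: one row per citation component,
# columns give the counts of correct and incorrect results.
counts = [
    [28, 22],
    [20, 30],
    [25, 25],
    [22, 28],
    [18, 32],
]
chi2, p_value, dof, expected = chi2_contingency(counts)  # Chi-square test across components
print(f"kappa = {kappa:.4f}, chi-square = {chi2:.2f}, P = {p_value:.3f}")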


Results

Inter-rater reliability for the two reviewers

A high degree of agreement was observed between the two reviewers (κ=0.9277). The discrepancies were jointly reviewed and resolved prior to computation of descriptive statistics.

Accuracy of citations

Upon reviewing the generated references, fabrications and errors were found in all citation components. The LLM generated incorrect article titles, authors, years of publication, journal names, and DOIs, many of which did not correspond to the original article or simply did not exist on meticulous search of the PubMed and Google Scholar databases.

Of the citation components reviewed, article title was found to have the highest accuracy with 30 correct results (n=30, 60%). This was followed by year of publication (n=18, 36%), journal name (n=15, 30%), and list of authors (n=4, 8%). DOI performed the poorest with only two correct results (n=2, 4%).

Of the diagnoses chosen for this study, interstitial lung disease had the lowest accuracy, yielding only four verifiable article titles, with the remaining components of its citations being incorrect. The results of the analysis for each citation component are shown in Table 1 and did not differ significantly between the various categories (P=0.31).

Table 1

Stratified analysis of the responses generated by Google Gemini across five citation components

Keyword conditions       Article title   List of authors   Journal name   Year of publication   DOI
Asthma                   7               0                 5              4                     0
COPD                     4               3                 3              3                     0
Pneumonia                7               1                 3              7                     1
ILD                      4               0                 0              0                     0
Lung adenocarcinoma      8               0                 4              4                     1
Total correct            30              4                 15             18                    2
Percentage correct (%)   60              8                 30             36                    4

COPD, chronic obstructive pulmonary disease; ILD, interstitial lung disease.

Of the 50 total citations generated by Google Gemini, only one citation (n=1, 2%) relevant to pneumonia was found to have accurate article title, list of authors, journal name, year of publication, and DOI when compared with the original article.


Discussion

The primary purpose of this study was to evaluate the accuracy of LLMs in scientific writing, specifically through careful appraisal of citations generated by Google Gemini relevant to pulmonary medicine. We found only 2% of the references to be entirely accurate, which limits the reliability of LLM-generated output and necessitates rigorous cross-verification, especially given the presumed prowess of these tools in scientific writing and their widespread availability.

Previous studies

Kumar et al. utilized Google Bard in their literature review on the clinical profile of patients presenting to the emergency room with chest pain from myocardial infarction. Upon asking “What is the incidence of ventricular fibrillation (VF) in acute myocardial infarction (AMI) as per the FAST-MI trial?”, they were provided with a fabricated citation on the “Feasibility of Angioplasty in ST-Elevation Myocardial Infarction”. Knowing that FAST-MI stands for “French registry of acute ST-elevation or non-ST-elevation myocardial infarction” (18), the authors were alerted to the need for rigorous cross-checking of LLM responses.

With more niche research inquiries, information may become more difficult to access and the models may fabricate more data to support their responses. For instance, Alkaissi and McFarlane attempted to write about homocysteine-induced osteoporosis, only to be presented with an intermixed set of validated facts and false claims, supported by a set of fabricated references (13).

Bhattacharyya et al. also utilized ChatGPT to write 30 short medical papers, each with at least three references, based on standardized prompts. Of the 117 references cited in this collection, only 7% were noted to be accurate and authentic (19). Furthermore, LLMs may not provide complete citations, as evidenced in the study by Athaluri et al., in which 69 of the 178 citations generated by ChatGPT did not include a DOI (20).

Incomplete, incorrect, and fabricated responses have been evident across various LLMs at all stages of development. Walters and Wilder utilized ChatGPT-3.5 and ChatGPT-4 to produce short literature reviews that included data cited from 636 references. ChatGPT-4 was noted to have a lower incidence of fabrications than ChatGPT-3.5 (18% vs. 55%, respectively) (21).

Dhane et al., working in the field of global health research, analyzed two sets of 50 references, one generated by Google Bard and one by ChatGPT, and found that Google Bard fabricated the entirety of its references, while ChatGPT had inaccuracies in 94% of the generated citations (17). A similar study in otolaryngology by Wu et al. noted that merely 10% of the references related to head and neck research cited by ChatGPT were completely accurate (15). By comparison, in our experience with Google Gemini and a similar sample size, only 2% of the generated references were found to be accurate across all citation components.

Hallucination mitigation

LLMs are exposed to vast amounts of online multimodal data during training, allowing them to display impressive language fluency. Unfortunately, this also leads to extrapolation of information, making generated responses sensitive to undisclosed or misunderstood biases in the training data, misinterpretation of ambiguous prompts, or modification of information to align superficially with the input (22). In addition, the temperature of an LLM affects its output and the extent of hallucination: temperature is inversely related to the degree of confidence an LLM places in its most likely response. ChatGPT, for instance, with a temperature of 0.7, generates more diverse responses but leaves room for artificial hallucinations (23).
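For illustration, the generic temperature-scaled sampling step used by many LLMs is sketched below; higher temperatures flatten the probability distribution over candidate tokens, making less likely (and potentially hallucinated) continuations more probable. The scores are hypothetical and the code is a simplified approximation, not ChatGPT’s or Gemini’s actual decoding implementation:

import numpy as np

def sample_token(logits, temperature=0.7, rng=np.random.default_rng(0)):
    # Scale the raw scores: low temperature sharpens the distribution toward the
    # most likely token; high temperature flattens it, admitting unlikely tokens.
    scaled = np.asarray(logits) / temperature
    probs = np.exp(scaled - scaled.max())  # softmax, shifted for numerical stability
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs), probs

# Hypothetical scores for four candidate tokens.
logits = [4.0, 2.5, 1.0, 0.5]
for t in (0.2, 0.7, 1.5):
    _, probs = sample_token(logits, temperature=t)
    print(t, np.round(probs, 3))  # the top token's probability shrinks as temperature rises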

These insights have catalyzed the development and advancement of hallucination mitigation techniques (22), with the hope of improving the reliability of these models. It has been hypothesized that LLMs trained almost exclusively on data from a particular field of intended application may show higher accuracy and efficiency in carrying out domain-specific tasks. BioMedLM (24), a GPT-style model trained on biomedical abstracts and papers, along with various novel GPTs available through OpenAI, has shown promising results when tested on biomedical question answering tasks, but these models require thorough exploration of their generation capabilities and limitations prior to widespread use.
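As an illustration of how such a domain-specific model might be queried locally, the sketch below uses the Hugging Face Transformers library; the model identifier is an assumption based on the publicly released BioMedLM checkpoint, the prompt is a hypothetical example, and the same pattern applies to any locally hosted causal language model:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "stanford-crfm/BioMedLM"  # assumed public checkpoint identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Pirfenidone is indicated for the treatment of"  # example prompt, pulmonary medicine context
inputs = tokenizer(prompt, return_tensors="pt")
# Greedy decoding (do_sample=False) keeps the output deterministic,
# analogous to lowering the temperature discussed above.
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))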

While it has been established that LLMs can be beneficial in creating academic manuscripts, reviewing literature, identifying research questions, providing an overview of the current state of a field, and assisting with tasks such as formatting and language review (25), the authors’ experience of artificial hallucinations highlights the need for meticulous verification of responses generated by LLMs in order to maintain the integrity of the scientific literature.

Strengths and limitations

The alarming incidence of hallucinations observed here was greater than or comparable to that evidenced in prior studies in other medical fields using various LLMs (15,17). A high degree of agreement was noted between the two reviewers (κ=0.9277), adding to the strengths of the analysis. One limitation we acknowledge is the small number of citations reviewed. Nonetheless, the persistence of plausible-sounding but factually incorrect information in responses generated by this recently rebranded LLM is concerning.


Conclusions

When evaluating the accuracy of LLMs in scientific writing by reviewing citations generated by Google Gemini relevant to pulmonary medicine, only 2% of citations were found to be completely accurate across all citation components, demonstrating a high degree of artificial hallucination. Continued vigilance and rigorous efforts are needed to validate the authenticity of responses generated by LLMs, especially as they pertain to the medical literature.


Acknowledgments

None.


Footnote

Peer Review File: Available at https://jmai.amegroups.com/article/view/10.21037/jmai-24-287/prf

Funding: None.

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://jmai.amegroups.com/article/view/10.21037/jmai-24-287/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. IRB approval and informed consent were not required as no human subjects were involved.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.


References

  1. ChatGPT. Available online: https://chatgpt.com
  2. Gemini - chat to supercharge your ideas. Available online: https://gemini.google.com
  3. Thirunavukarasu AJ, Ting DSJ, Elangovan K, et al. Large language models in medicine. Nat Med 2023;29:1930-40. [Crossref] [PubMed]
  4. Rane N, Choudhary S, Rane J. Gemini or ChatGPT? Capability, Performance, and Selection of Cutting-Edge Generative Artificial Intelligence (AI) in Business Management. Studies in Economics and Business Relations 2024;5:40-50. [Crossref]
  5. Liebrenz M, Schleifer R, Buadze A, et al. Generating scholarly content with ChatGPT: ethical challenges for medical publishing. Lancet Digit Health 2023;5:e105-6. [Crossref] [PubMed]
  6. Du D, Paluch R, Stevens G, et al. Exploring patient trust in clinical advice from AI-driven LLMs like ChatGPT for self-diagnosis. arXiv preprint arXiv:2402.07920. 2024 Feb 2.
  7. Ide K, Hawke P, Nakayama T. Can ChatGPT Be Considered an Author of a Medical Article? J Epidemiol 2023;33:381-2. [Crossref] [PubMed]
  8. Salvagno M, Taccone FS. Artificial intelligence is the new chief editor of Critical Care (maybe?). Crit Care 2023;27:270. [Crossref] [PubMed]
  9. Hwang SI, Lim JS, Lee RW, et al. Is ChatGPT a "Fire of Prometheus" for Non-Native English-Speaking Researchers in Academic Writing?. Korean J Radiol 2023;24:952-9. [Crossref] [PubMed]
  10. Huespe IA, Echeverri J, Khalid A, et al. Clinical Research With Large Language Models Generated Writing-Clinical Research with AI-assisted Writing (CRAW) Study. Crit Care Explor 2023;5:e0975. [Crossref] [PubMed]
  11. Morris MR, Sohl-dickstein J, Fiedel N, et al. Levels of AGI for Operationalizing Progress on the Path to AGI. arXiv preprint arXiv:2311.02462. 2023 Nov 4.
  12. Introducing ChatGPT. Available online: https://openai.com/blog/chatgpt
  13. Alkaissi H, McFarlane SI. Artificial Hallucinations in ChatGPT: Implications in Scientific Writing. Cureus 2023;15:e35179. [Crossref] [PubMed]
  14. Gao CA, Howard FM, Markov NS, et al. Comparing scientific abstracts generated by ChatGPT to real abstracts with detectors and blinded human reviewers. NPJ Digit Med 2023;6:75. [Crossref] [PubMed]
  15. Wu RT, Dang RR. ChatGPT in head and neck scientific writing: A precautionary anecdote. Am J Otolaryngol 2023;44:103980. [Crossref] [PubMed]
  16. Frosolini A, Franz L, Benedetti S, et al. Assessing the accuracy of ChatGPT references in head and neck and ENT disciplines. Eur Arch Otorhinolaryngol 2023;280:5129-33. [Crossref] [PubMed]
  17. Dhane AS, Sarode S, Sarode G, et al. A reality check on chatbot-generated references in global health research. Oral Oncol Rep 2024;10:100246. [Crossref]
  18. Kumar M, Mani UA, Tripathi P, et al. Artificial Hallucinations by Google Bard: Think Before You Leap. Cureus 2023;15:e43313. [Crossref] [PubMed]
  19. Bhattacharyya M, Miller VM, Bhattacharyya D, et al. High Rates of Fabricated and Inaccurate References in ChatGPT-Generated Medical Content. Cureus 2023;15:e39238. [Crossref] [PubMed]
  20. Athaluri SA, Manthena SV, Kesapragada VSRKM, et al. Exploring the Boundaries of Reality: Investigating the Phenomenon of Artificial Intelligence Hallucination in Scientific Writing Through ChatGPT References. Cureus 2023;15:e37432. [Crossref] [PubMed]
  21. Walters WH, Wilder EI. Fabrication and errors in the bibliographic citations generated by ChatGPT. Sci Rep 2023;13:14045. [Crossref] [PubMed]
  22. Tonmoy SMTI, Zaman SMM, Jain V, et al. A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models. arXiv preprint arXiv:2401.01313. 2024 Jan 2.
  23. Beutel G, Geerits E, Kielstein JT. Artificial hallucination: GPT on LSD? Crit Care 2023;27:148. [Crossref] [PubMed]
  24. Bolton E, Venigalla A, Yasunaga M, et al. BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text. arXiv preprint arXiv:2403.18421. 2024 Mar 27.
  25. Salvagno M, Taccone FS, Gerli AG. Can artificial intelligence help for scientific writing? Crit Care 2023;27:75. [Crossref] [PubMed]
doi: 10.21037/jmai-24-287
Cite this article as: Dang S, Melfi M, Gary PJ. Evaluating the accuracy of large language models in pulmonary medicine scientific writing. J Med Artif Intell 2025;8:47.
