Brief Report

GPT-4 accurately classifies the clinical actionability of emergency department imaging reports

Kevin Tang1, Julia Reisler2, Christian Rose3, Brian Suffoletto3, David Kim3

1Albert Einstein College of Medicine, Bronx, NY, USA; 2Department of Computer Science, Stanford University, Stanford, CA, USA; 3Department of Emergency Medicine, Stanford University, Stanford, CA, USA

Correspondence to: Kevin Tang, BS. Albert Einstein College of Medicine, 1300 Morris Park Ave, Bronx, NY 10461-1900, USA. Email: kevin.tang@einsteinmed.edu.

Abstract: Emergency department (ED) imaging volumes are increasing, necessitating prioritization of studies based on the probability of detecting actionable findings. We evaluated the performance of Generative Pretrained Transformer 4 (GPT-4), a large language model (LLM), in identifying clinically actionable findings in ED imaging reports. We analyzed 1,000 de-identified clinical impressions from the 20 most frequently ordered imaging studies at Stanford Health Care between 2020 and 2023. Two emergency physicians independently labeled each report for “clinically actionable findings” (likely to change ED management or disposition), with disagreements resolved by a third physician. GPT-4 labeled the same impressions using iterative prompt engineering. We calculated the accuracy, sensitivity, and specificity of GPT-4’s classifications compared to the adjudicated physician rating. The two physician raters agreed on actionability for 89.9% of studies. GPT-4 detected clinical actionability with 90.4% accuracy (84.0–98.0%, by study type), 88.8% sensitivity (75.0–100.0%), and 91.5% specificity (77.3–100.0%) compared to the adjudicated physician rating. GPT-4 accurately identified the clinical actionability of ED imaging studies, exhibiting slightly greater agreement with the adjudicated physician rating than the two physician raters did with one another. This approach allows labeling of large retrospective corpora, enabling development of data-driven tools to estimate pretest probability of actionable findings and prioritize imaging studies accordingly. LLMs can mimic physician gestalt in reducing high-dimensional data to a single meaningful measure, producing opportunities in clinical practice and research.

Keywords: Generative artificial intelligence (generative AI); emergency department (ED); imaging; Generative Pretrained Transformer 4 (GPT-4); large language models (LLMs)


Received: 06 April 2024; Accepted: 03 July 2024; Published online: 30 July 2024.

doi: 10.21037/jmai-24-100


Introduction

In the emergency department (ED), imaging studies must be ordered and prioritized based on the predicted probability of detecting actionable findings that will change patient management. This is particularly important as ED imaging volumes continue to increase (1), since the resulting delays can lead to adverse outcomes. Despite these challenges, the overall acuity and actionability of emergency imaging (i.e., findings that alter treatment, disposition, further testing, or specialist consultation) have received far less attention than the diagnosis of specific findings.

Artificial intelligence (AI) has the potential to enhance diagnostic accuracy and streamline clinical and operational processes in the ED (2). Large language models (LLMs), such as Generative Pretrained Transformer 4 (GPT-4), have recently demonstrated promise in various clinical tasks, including documentation (3) and clinical decision support (4).

In this study, we aimed to evaluate the performance of GPT-4 in identifying clinically actionable findings in ED imaging reports. We hypothesized that GPT-4 could accurately classify the actionability of these reports with performance comparable to that of emergency physicians. If successful, this approach could enable the development of data-driven tools to prioritize imaging studies based on the pretest probability of actionable findings. Demonstrating the ability of LLMs to reduce the high-dimensional data in ED imaging reports to a single, clinically meaningful measure would highlight their potential to support clinical decision-making and streamline ED operations.


Methods

We studied the clinical impressions of imaging studies performed on adult ED patients at Stanford Health Care between 2020 and 2023. The source radiology reports were a mix of structured and unstructured free text, with an average length of 50 words. To ensure patient privacy and confidentiality, we removed all protected health information (PHI) prior to analysis using the Stanford & Penn Medical Imaging and Data Resource Center de-identifier (5), followed by manual review for any retained PHI.
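To support the manual-review step described above, a minimal screening helper of the kind sketched below could flag impressions that still contain PHI-like patterns for reviewer inspection. This is a hypothetical illustration written for this report, not the MIDRC de-identifier or the study's actual review procedure.

```python
# Hypothetical screening aid for manual PHI review: flag impressions that still contain
# common PHI-like patterns (dates, phone numbers, long digit runs). This is NOT the
# MIDRC de-identifier used in the study, only an illustrative post-hoc check.
import re

PHI_PATTERNS = {
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "id_number": re.compile(r"\b\d{7,}\b"),  # MRN/accession-like digit runs
}

def flag_possible_phi(impression: str) -> list[str]:
    """Return the names of PHI-like patterns detected in a de-identified impression."""
    return [name for name, pattern in PHI_PATTERNS.items() if pattern.search(impression)]
```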

We randomly sampled 50 unique impressions for each of the 20 most frequently ordered imaging studies over the study period, for a total of 1,000 imaging reports. As in analogously designed studies (6), two emergency physicians independently reviewed the 1,000 de-identified impressions, and labeled whether each contained one or more “clinically actionable findings”, defined as a finding likely to change management or disposition of the patient in the ED. Disagreements between the two physician raters were resolved by independent review by a third emergency physician.
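For illustration, the sampling step could be implemented as follows, assuming the report corpus is held in a pandas DataFrame with hypothetical columns study_type and impression; this sketch is not the study's code.

```python
# Sketch of the stratified sampling step: 50 unique impressions for each of the
# 20 most frequently ordered study types (1,000 reports in total).
# Column names 'study_type' and 'impression' are illustrative assumptions.
import pandas as pd

def sample_impressions(reports: pd.DataFrame, n_per_study: int = 50,
                       top_k: int = 20, seed: int = 0) -> pd.DataFrame:
    top_studies = reports["study_type"].value_counts().nlargest(top_k).index
    return (
        reports[reports["study_type"].isin(top_studies)]
        .drop_duplicates(subset="impression")                  # unique impressions only
        .groupby("study_type", group_keys=False)
        .apply(lambda g: g.sample(n=n_per_study, random_state=seed))
        .reset_index(drop=True)
    )
```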

GPT-4 labeled the same impressions, rating whether each report contained findings likely to alter ED management or disposition. We accessed GPT-4 through the OpenAI application programming interface (API), using the December 15, 2023 version. To optimize GPT-4's performance on this task, we employed iterative prompt engineering, drawing on best practices from the literature (7,8). We used few-shot learning, providing GPT-4 with a small set of labeled examples to demonstrate the task; chain-of-thought prompting, encouraging GPT-4 to explain its reasoning step by step; and self-correction, prompting GPT-4 to review and refine its own outputs. The final prompt is provided in Appendix 1.
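A minimal sketch of one such classification call, using the OpenAI Python client, is shown below. The system prompt and few-shot examples are illustrative paraphrases of the approach rather than the study prompt (which appears in Appendix 1), and the model identifier is a placeholder for the GPT-4 version used.

```python
# Minimal sketch of per-impression classification via the OpenAI chat completions API.
# The prompt text and few-shot examples are illustrative, not the study prompt
# (Appendix 1); the model identifier is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are an emergency physician. Given the impression section of an ED imaging "
    "report, decide whether it contains one or more clinically actionable findings, "
    "i.e., findings likely to change ED management or disposition. Think step by step, "
    "review your reasoning, and answer on the final line with exactly "
    "'ACTIONABLE' or 'NOT ACTIONABLE'."
)

FEW_SHOT = [  # illustrative labeled examples, not those used in the study
    {"role": "user", "content": "Impression: Acute pulmonary embolism in the right lower lobe."},
    {"role": "assistant", "content": "New PE would change ED management.\nACTIONABLE"},
    {"role": "user", "content": "Impression: No acute intracranial abnormality."},
    {"role": "assistant", "content": "Normal study; no change to management.\nNOT ACTIONABLE"},
]

def classify_impression(impression: str) -> bool:
    """Return True if the model labels the impression as clinically actionable."""
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",  # placeholder model identifier
        temperature=0,
        messages=[{"role": "system", "content": SYSTEM_PROMPT},
                  *FEW_SHOT,
                  {"role": "user", "content": f"Impression: {impression}"}],
    )
    final_line = response.choices[0].message.content.strip().splitlines()[-1]
    return final_line.strip().upper().startswith("ACTIONABLE")
```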

We calculated the agreement of physician raters, and produced a single adjudicated physician rating by resolving discrepancies with a third rater. We calculated the accuracy, sensitivity, specificity, F1 score, positive predictive value (PPV), and negative predictive value (NPV) of GPT-4’s classifications compared to the adjudicated physician rating.
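These metrics follow standard definitions; a brief sketch of the calculations, assuming the adjudicated, GPT-4, and individual physician labels are available as binary (0/1) lists, is shown below.

```python
# Sketch of the agreement and performance calculations. Inputs are 0/1 labels
# (1 = actionable); variable names are illustrative.
from sklearn.metrics import cohen_kappa_score, confusion_matrix

def evaluate(adjudicated: list[int], gpt4: list[int],
             rater1: list[int], rater2: list[int]) -> dict:
    kappa = cohen_kappa_score(rater1, rater2)             # inter-rater agreement
    tn, fp, fn, tp = confusion_matrix(adjudicated, gpt4).ravel()
    sensitivity = tp / (tp + fn)
    ppv = tp / (tp + fp)
    return {
        "physician_kappa": kappa,
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "f1": 2 * ppv * sensitivity / (ppv + sensitivity),
        "ppv": ppv,
        "npv": tn / (tn + fn),
    }
```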


Results

The two physician raters agreed on the clinical actionability of 89.9% of studies, with a Cohen’s kappa of 0.800, ranging by study type from 0.661 (“XR ankle”) to 0.956 (“US lower extremity veins deep vein thrombosis bilateral”). Discrepancies were resolved by the third physician rater, producing the adjudicated physician rating.

By this measure, 40.8% of imaging studies contained actionable findings. “CT chest angiography w IV contrast pulmonary embolism” was the most actionable study type (66.0%), while “CT head cervical spine wo IV contrast trauma”, “US abdomen”, and “US lower extremity veins deep vein thrombosis unilateral left” were the least actionable (24.0%).

Compared to the adjudicated physician rating, GPT-4 detected clinical actionability with 90.4% accuracy (84.0–98.0%, by study type), 88.8% sensitivity (75.0–100.0%), 91.5% specificity (77.3–100.0%), 88.1% F1 score (78.8–97.1%), 87.5% PPV (72.2–100.0%), and 92.4% NPV (82.6–100.0%). Table 1 reports actionability and GPT-4 performance by study type.

Table 1

Performance of GPT-4 for classification of actionable imaging findings

Study (each n=50), Actionable findings (%), GPT-4 performance (Accuracy, Sensitivity, Specificity, F1, PPV, NPV), Physician agreement (kappa)
All studies 40.8 0.904 0.888 0.915 0.881 0.875 0.924 0.800
CT chest angiography w IV contrast pulmonary embolism 66.0 0.900 1.000 0.773 0.918 0.848 1.000 0.792
CT chest wo IV contrast 60.0 0.900 0.931 0.857 0.915 0.900 0.900 0.793
CT abdomen pelvis w IV contrast 58.0 0.900 0.929 0.864 0.912 0.897 0.905 0.796
CT abdomen pelvis wo IV contrast 56.0 0.940 0.963 0.913 0.945 0.929 0.955 0.879
XR pelvis 1 view 54.0 0.860 0.857 0.864 0.873 0.889 0.826 0.717
XR chest 50.0 0.900 0.917 0.885 0.898 0.880 0.920 0.800
CT chest angiography w IV contrast trauma 46.0 0.900 0.846 0.958 0.898 0.957 0.852 0.800
CT head neck angiography w and wo IV contrast 42.0 0.860 0.850 0.867 0.829 0.810 0.897 0.711
MR brain neck arteries w and wo IV contrast stroke 42.0 0.920 0.870 0.963 0.909 0.952 0.897 0.838
XR shoulder 38.0 0.940 1.000 0.912 0.914 0.842 1.000 0.869
XR ankle 36.0 0.840 0.750 0.900 0.789 0.833 0.844 0.661
CT head wo IV contrast 36.0 0.860 0.867 0.857 0.788 0.722 0.938 0.685
US lower extremity veins deep vein thrombosis bilateral 36.0 0.980 1.000 0.970 0.971 0.944 1.000 0.956
CT thoracic lumbar spine wo IV contrast by reconstruction 34.0 0.920 1.000 0.892 0.867 0.765 1.000 0.811
XR hand 3 views left 32.0 0.880 0.750 0.967 0.833 0.938 0.853 0.741
XR hand 3 views right 30.0 0.900 0.857 0.917 0.828 0.800 0.943 0.757
CT head w and wo IV contrast brain perfusion CT head neck angiography w IV contrast stroke 28.0 0.880 0.750 0.941 0.800 0.857 0.889 0.715
US abdomen 24.0 0.920 0.750 1.000 0.857 1.000 0.895 0.803
CT head cervical spine wo IV contrast trauma 24.0 0.940 0.909 0.949 0.870 0.833 0.974 0.831
US lower extremity veins deep vein thrombosis unilateral left 24.0 0.940 0.909 0.949 0.870 0.833 0.974 0.831

PPV, positive predictive value; NPV, negative predictive value; CT, computed tomography; w, with; IV, intravenous; wo, without; XR, X-ray; MR, magnetic resonance; US, ultrasound.


Discussion

Our study demonstrates that GPT-4 can accurately identify clinically actionable findings in ED imaging reports with performance comparable to emergency physicians. These results suggest that LLMs have the potential to assist in prioritizing ED imaging studies based on the likelihood of actionable findings.

GPT-4’s performance varied across imaging study types. For example, it achieved perfect sensitivity (100%) for “CT chest angiography w IV contrast pulmonary embolism” and “US lower extremity veins deep vein thrombosis bilateral”, but lower sensitivity (75%) for “XR ankle” and “US abdomen”. This variability may reflect differences in the complexity and clarity of imaging reports across study types, as well as inherent ambiguity in identifying actionable findings in certain modalities. GPT-4 sometimes produced false positives when an already known critical finding was repeated in the current impression; LLMs such as GPT-4 may struggle to differentiate acute from historical findings when both are mentioned in the same report, leading to misclassification of actionability. This challenge underscores the need for further research into methods for incorporating temporal information and clinical context into LLM-based tools, including more sophisticated prompting strategies that explicitly guide the model to consider the temporal context of the findings, and the integration of structured data, such as the date-time of the imaging study and the patient’s clinical history, to provide additional context for the model’s decision-making.
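As an illustration of such a prompting strategy, a hypothetical instruction of the following kind (not part of the study prompt) could be appended to the classification prompt to direct the model toward temporal context.

```python
# Hypothetical temporal-context addendum for the classification prompt; illustrative only.
TEMPORAL_CONTEXT_ADDENDUM = (
    "Before answering, explicitly distinguish new or acute findings from findings "
    "described as known, chronic, stable, or unchanged from prior imaging. Label the "
    "impression ACTIONABLE only if it contains a new or worsening finding likely to "
    "change ED management or disposition during this visit."
)
```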

To our knowledge, this is the first study to examine the use of LLMs in identifying clinically actionable findings in ED imaging reports. Recent studies have explored the potential of LLMs in other clinical applications within emergency medicine and radiology. For example, Glicksberg et al. demonstrated that GPT-4’s performance in predicting admissions from the ED improved significantly when supplemented with real-world examples and numerical probabilities from traditional machine learning models, achieving an area under the receiver operating characteristic curve (AUROC) of 0.87, area under the precision-recall curve (AUPRC) of 0.71, and accuracy of 83.1% (9). Meral et al. found that GPT-4 and Gemini could accurately triage critical and urgent patients, with GPT-4 achieving a correct triage rate of 70.60% and Gemini achieving 87.77% (10). Within radiology, Haver et al. showed that GPT-4 assigned the correct Breast Imaging Reporting and Data System (BI-RADS) category for 73.6% of breast imaging reports, while Gertz et al. found that GPT-4’s error detection rate (82.7%) in radiology reports was comparable to that of radiologists (11,12).

The use of LLMs on patient data raises ethical considerations. While our study used de-identified data, the training data used to develop LLMs may include sensitive patient information, raising concerns about privacy and consent (13,14). Furthermore, the potential for bias in LLMs and their lack of transparency in decision-making processes may have unintended consequences when applied to clinical decision support (15,16). It is crucial for researchers and developers to address these ethical challenges by ensuring secure and ethical data handling, regularly auditing models for bias, and developing interpretable and transparent AI systems. Future work should also engage patients, healthcare providers, and other stakeholders to develop best practices and guidelines for the responsible use of LLMs in healthcare.

Our choice of GPT-4 for this study was based on its state-of-the-art performance on natural language tasks and its demonstrated potential in clinical applications (17-20). However, the use of a proprietary model like GPT-4 raises questions about accessibility and reproducibility. Future work should explore the performance of open-source LLMs and the development of more accessible models for clinical applications. The prompt engineering techniques we employed, including few-shot learning, chain-of-thought prompting, and self-correction, were critical to optimizing GPT-4’s performance on the task of detecting clinical actionability. These techniques allowed us to provide GPT-4 with clear instructions, relevant examples, and context to guide its decision-making, while also encouraging transparent reasoning and self-refinement. Future work should continue to explore and refine prompt engineering strategies to improve the performance and interpretability of LLMs in healthcare applications.

The study has several limitations. First, the determination of clinical actionability in practice often depends on a confluence of data modalities of which imaging is only one. Our study focused solely on the clinical impression section of radiology reports, without considering other relevant clinical information such as patient history, physical exam findings, or laboratory results. This limitation highlights the need for future research to explore the integration of LLMs with other data sources to provide a more comprehensive assessment of clinical actionability. Second, we rated a limited number of each study type, and only the 20 most common studies. While this approach allowed us to evaluate GPT-4’s performance across a diverse range of imaging modalities, it may not fully capture the variability in actionability across all possible study types. Third, we evaluated the performance of only a single LLM, GPT-4, and at one hospital site. While GPT-4 is a widely used LLM, its performance may not be representative of other AI models. Moreover, the performance of GPT-4 may vary across different healthcare institutions due to differences in imaging protocols, reporting styles, and patient populations. Future research should compare the performance of multiple LLMs to assess the robustness and generalizability of our approach. Finally, our study relied on the judgment of three emergency physicians to establish the ground truth for clinical actionability. While we employed a rigorous process to resolve disagreements between the physician raters, clinical actionability in some cases may be subjective or ambiguous.

Our study demonstrates the potential of LLMs to accurately identify clinically actionable findings in ED imaging reports. Producing accurate, clinically meaningful labels for large, diverse, natural language corpora will enable novel analysis and modeling of clinical text. In particular, we anticipate that classifying diverse findings by overall clinical actionability will inspire the development and validation of tools to predict from patient and visit characteristics which studies are likely to produce such findings. This could enable the optimization of imaging queues on the basis of clinical actionability, thereby improving patient and operational outcomes.


Acknowledgments

Funding: None.


Footnote

Peer Review File: Available at https://jmai.amegroups.com/article/view/10.21037/jmai-24-100/prf

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://jmai.amegroups.com/article/view/10.21037/jmai-24-100/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.


References

  1. Juliusson G, Thorvaldsdottir B, Kristjansson JM, et al. Diagnostic imaging trends in the emergency department: an extensive single-center experience. Acta Radiol Open 2019;8:2058460119860404. [Crossref] [PubMed]
  2. Berlyand Y, Raja AS, Dorner SC, et al. How artificial intelligence could transform emergency department operations. Am J Emerg Med 2018;36:1515-7. [Crossref] [PubMed]
  3. Lawson McLean A. Artificial Intelligence in Surgical Documentation: A Critical Review of the Role of Large Language Models. Ann Biomed Eng 2023;51:2641-2. [Crossref] [PubMed]
  4. Liao Z, Wang J, Shi Z, et al. Revolutionary Potential of ChatGPT in Constructing Intelligent Clinical Decision Support Systems. Ann Biomed Eng 2024;52:125-9. [Crossref] [PubMed]
  5. Chambon PJ, Wu C, Steinkamp JM, et al. Automated deidentification of radiology reports combining transformer and "hide in plain sight" rule-based methods. J Am Med Inform Assoc 2023;30:318-28. [Crossref] [PubMed]
  6. Hasani AM, Singh S, Zahergivar A, et al. Evaluating the performance of Generative Pre-trained Transformer-4 (GPT-4) in standardizing radiology reports. Eur Radiol 2024;34:3566-74. [Crossref] [PubMed]
  7. Giray L. Prompt Engineering with ChatGPT: A Guide for Academic Writers. Ann Biomed Eng 2023;51:2629-33. [Crossref] [PubMed]
  8. Wang J, Shi E, Yu S, et al. Prompt engineering for healthcare: Methodologies and applications. arXiv 2023.
  9. Glicksberg BS, Timsina P, Patel D, et al. Evaluating the accuracy of a state-of-the-art large language model for prediction of admissions from the emergency room. J Am Med Inform Assoc 2024;ocae103. [Crossref] [PubMed]
  10. Meral G, Ateş S, Günay S, et al. Comparative analysis of ChatGPT, Gemini and emergency medicine specialist in ESI triage assessment. Am J Emerg Med 2024;81:146-50. [Crossref] [PubMed]
  11. Haver HL, Yi PH, Jeudy J, et al. Use of ChatGPT to Assign BI-RADS Assessment Categories to Breast Imaging Reports. AJR Am J Roentgenol 2024; Epub ahead of print. [Crossref] [PubMed]
  12. Gertz RJ, Dratsch T, Bunck AC, et al. Potential of GPT-4 for Detecting Errors in Radiology Reports: Implications for Reporting Accuracy. Radiology 2024;311:e232714. [Crossref] [PubMed]
  13. Wu X, Duan R, Ni J. Unveiling security, privacy, and ethical concerns of ChatGPT. Journal of Information and Intelligence 2024;2:102-15. [Crossref]
  14. Kanter GP, Packel EA. Health Care Privacy Risks of AI Chatbots. JAMA 2023;330:311-2. [Crossref] [PubMed]
  15. Akinci D'Antonoli T, Stanzione A, Bluethgen C, et al. Large language models in radiology: fundamentals, applications, ethical considerations, risks, and future directions. Diagn Interv Radiol 2024;30:80-90. [Crossref] [PubMed]
  16. Grote T, Berens P. A paradigm shift?-On the ethics of medical large language models. Bioethics 2024;38:383-90. [Crossref] [PubMed]
  17. Achiam J, Adler S, Agarwal S, et al. GPT-4 technical report. arXiv 2023.
  18. Rosoł M, Gąsior JS, Łaba J, et al. Evaluation of the performance of GPT-3.5 and GPT-4 on the Polish Medical Final Examination. Sci Rep 2023;13:20512. [Crossref] [PubMed]
  19. Goodman RS, Patrinely JR, Stone CA Jr, et al. Accuracy and Reliability of Chatbot Responses to Physician Questions. JAMA Netw Open 2023;6:e2336483. [Crossref] [PubMed]
  20. Fink MA, Bischoff A, Fink CA, et al. Potential of ChatGPT and GPT-4 for Data Mining of Free-Text CT Reports on Lung Cancer. Radiology 2023;308:e231362. [Crossref] [PubMed]
doi: 10.21037/jmai-24-100
Cite this article as: Tang K, Reisler J, Rose C, Suffoletto B, Kim D. GPT-4 accurately classifies the clinical actionability of emergency department imaging reports. J Med Artif Intell 2024;7:36.
