Brief Report

A large language model-guided approach to the focused physical exam

Arya S. Rao1,2, Christian Rivera1,2, Husayn F. Ramji2,3,4, Sarah Wagner1,2, Andrew Mu1,2, John Kim1,2, William Marks2,5, Benjamin White6, David C. Whitehead6, Michael J. Senter-Zapata7, Marc D. Succi1,2,8

1Harvard Medical School, Boston, MA, USA; 2Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Mass General Brigham, Boston, MA, USA; 3Department of Medicine, Tulane University School of Medicine, New Orleans, LA, USA; 4Hudson School of Public Health, University of Oklahoma Health Sciences Center, Oklahoma City, OK, USA; 5Harvard Business School, Boston, MA, USA; 6Department of Emergency Medicine, Massachusetts General Hospital, Boston, MA, USA; 7Department of Internal Medicine, Brigham and Women’s Hospital, Boston, MA, USA; 8Department of Radiology, Massachusetts General Hospital, Boston, MA, USA

Correspondence to: Marc D. Succi, MD. Department of Radiology, Massachusetts General Hospital, 55 Fruit Street, Boston, MA 02114, USA; Harvard Medical School, Boston, MA, USA; Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Mass General Brigham, Boston, MA, USA. Email: msucci@mgh.harvard.edu.

Abstract: The physical exam is crucial in medical diagnosis, providing objective information that complements the patient’s history and guides clinical management. While studies show the potential of large language models (LLMs) like GPT-4 as adjunctive diagnostic tools, their use in physical exams remains unexplored. This study evaluates GPT-4’s ability to provide tailored physical exam instructions based on chief complaints. GPT-4 was prompted to recommend specific physical exam maneuvers for 19 chief complaints from the Hypothesis Driven Physical Exam Student Handbook by the Association of American Medical Colleges. Two board-certified emergency medicine attending physicians and one board-certified internal medicine attending physician evaluated GPT-4’s responses for accuracy, comprehensiveness, readability, and overall quality using a Likert scale (1= very poor, 3= neutral, 5= excellent) with subjective commentary. GPT-4 received average scores of 4.16 for accuracy, 3.95 for comprehensiveness, 4.39 for readability, and 3.89 for overall quality. The total average score per chief complaint was 49.16 out of 60. The highest score was for “leg pain upon exertion” [54], and the lowest was for “lower abdominal pain” [43]. The Pearson correlation coefficient showed a moderate association between overall quality and accuracy (r=0.61) and between overall quality and comprehensiveness (r=0.57). Reviewers highlighted GPT-4’s detailed and extensive outputs but noted the need to improve elements such as specificity and inclusion of vital signs. This study shows the potential of GPT-4 as an adjunctive diagnostic tool by providing useful physical exam recommendations. GPT-4 performed effectively in general medical scenarios, scoring at least 80% of the maximum possible points. Future research could compare unassisted physicians to those using LLM tools and explore custom-trained models, suggesting significant potential for LLMs in clinical decision support.

Keywords: GPT-4; ChatGPT; large language models (LLMs); physical exam; diagnostic tools


Received: 10 August 2024; Accepted: 24 October 2024; Published online: 16 December 2024.

doi: 10.21037/jmai-24-275


Background

The physical exam is vital to medical diagnosis, providing objective information that complements the patient’s history and guides clinical management. A focused exam tailored to the chief complaint can reveal critical findings, but identifying these findings in complex conditions often requires specialized training. Large language models (LLMs), such as OpenAI’s ChatGPT, are trained on extensive data and can generate relevant text on niche topics, potentially filling diagnostic gaps. Studies have shown that LLMs can serve as useful diagnostic tools, yet their application to the physical exam remains unexplored (1-22).


Objective

This study evaluates GPT-4’s ability to provide physical exam instructions based on a chief complaint. We assess the accuracy, comprehensiveness, readability, and overall quality of its recommendations.


Methods

This study did not use patient data and was deemed non-human subject research by our Institutional Review Board (IRB). Each phase of the experimental workflow is outlined in Figure 1.

Figure 1 Experimental workflow.

GPT-4 was prompted to recommend physical exam maneuvers based on the 19 chief complaints in the Hypothesis Driven Physical Exam Student Handbook created by the Association of American Medical Colleges (Table 1) (23). The following prompt was entered in a new chat session for each chief complaint: “Please provide me with detailed, tailored, and specific physical exam maneuvers for a specified chief complaint from a patient.” An example of a GPT-4 output for a chief complaint of “painful hip” is provided in Table 2.
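For readers who wish to approximate this workflow programmatically, the sketch below shows how the same per-complaint prompting could be scripted with the OpenAI Python API. This is an illustrative reconstruction rather than the study’s actual procedure: the study used a fresh chat session for each chief complaint, and the model identifier, two-turn message structure, and abbreviated complaint list here are assumptions for demonstration.

from openai import OpenAI  # assumes the openai>=1.0 Python SDK is installed

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = ("Please provide me with detailed, tailored, and specific physical exam "
          "maneuvers for a specified chief complaint from a patient.")

# Excerpt of the 19 chief complaints from Table 1 (full list omitted for brevity).
chief_complaints = ["painful hip", "heart murmur", "lower abdominal pain"]

outputs = {}
for cc in chief_complaints:
    # Two-turn exchange mirroring Table 2: the generic prompt first, then the complaint.
    messages = [{"role": "user", "content": PROMPT}]
    first = client.chat.completions.create(model="gpt-4", messages=messages)
    messages.append({"role": "assistant", "content": first.choices[0].message.content})
    messages.append({"role": "user", "content": f"Chief complaint: {cc}"})
    second = client.chat.completions.create(model="gpt-4", messages=messages)
    outputs[cc] = second.choices[0].message.content  # free-text exam recommendations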

Table 1

Chief complaints and scoring (each score is the sum of three reviewers’ 1–5 Likert ratings; maximum 15 per category and 60 total)

CC Accuracy Comprehensiveness Readability Overall quality Total
Heart murmur 12 12 13 12 49
Itchy scalp 12 11 13 11 47
Stuffy nose and sore throat 11 12 14 12 49
Swollen neck and fatigue 11 12 14 11 48
Blurred vision 13 11 14 12 50
Hearing loss 14 12 14 12 52
Funny feeling face* 12 11 14 10 47
Shortness of breath 13 11 13 10 47
Lump in breast 13 12 12 11 48
Leg pain on exertion 14 13 14 13 54
Upper abdominal pain 11 13 14 12 50
Lower abdominal pain 10 11 12 10 43
Painful wrist 13 12 13 12 50
Painful shoulder 13 11 13 12 49
Sore knee 14 12 12 13 51
Painful hip 14 12 14 13 53
Painful ankle 14 12 13 13 52
Unsteadiness 12 13 12 12 49
Back pain 11 12 12 11 46
Averages 12.47 11.84 13.16 11.68 49.16

*, funny feeling face here suggests a neurological presentation. CC, chief complaint.

Table 2

Example of GPT-4 interaction using chief complaint of painful hip

User input GPT-4 output
Please provide me with detailed, tailored, and specific physical exam maneuvers for a specified chief complaint from a patient Of course! Please specify the chief complaint or the system you want to focus on, and I will provide you with tailored physical exam maneuvers for that
Chief complaint: painful hip Certainly! When a patient presents with the chief complaint of a painful hip, a thorough history and physical examination are essential for appropriate diagnosis. Here’s a tailored approach to the physical examination:
General observations: (I) gait analysis: observe the patient’s walking pattern. Look for limping, antalgic gait, Trendelenburg gait, or any other abnormalities. (II) Inspect for: (i) skin changes (bruising, erythema, swelling); (ii) atrophy or hypertrophy of the thigh or calf muscles; (iii) leg length discrepancy (measure from the anterior superior iliac spine to the medial malleolus); (iv) any visible deformities or asymmetry
Palpation: (I) palpate the hip joint, greater trochanter, and surrounding structures. Check for tenderness, warmth, or masses. (II) Check for tenderness over the inguinal ligament, which might suggest a hernia or hip pathology
ROM tests: (I) active ROM: (i) flexion: have the patient flex the hip by bringing the knee towards the chest; (ii) extension: have the patient lie prone and lift the leg off the examination table; (iii) abduction: while supine, have the patient move the leg outward; (iv) adduction: while supine, move the leg inward crossing the midline; (v) internal rotation: with the hip and knee flexed to 90 degrees, rotate the foot outward; (vi) external rotation: with the hip and knee flexed to 90 degrees, rotate the foot inward. (II) Passive ROM: repeat the movements passively to evaluate any pain, crepitus, or restricted movement
Special tests: (I) FABER test (Flexion, ABduction, External Rotation): (i) position the patient supine. Place the foot of the affected leg on the opposite knee forming a figure 4. Gently press down on the affected knee and assess for pain or restriction. (II) Trendelenburg test: (i) ask the patient to stand on one leg. If the pelvis drops on the opposite side, it suggests gluteus medius/minimus weakness or hip joint pathology. (III) Log roll test: (i) with the patient supine, roll the leg in and out to assess for pain. Pain might suggest intra-articular pathology. (IV) Anterior impingement test: (i) with the patient supine, flex the hip and knee to 90 degrees, then adduct and internally rotate the hip. Pain suggests femoroacetabular impingement
Neurovascular assessment: (I) check distal pulses (dorsalis pedis and posterior tibial). (II) Sensation: ensure the patient has sensation over the anterior, lateral, and posterior aspects of the thigh and lower leg. (III) Strength: test muscle strength, especially hip flexors, extensors, abductors, and adductors
Comparative examination: always examine the contralateral hip for comparison, even if the patient is not experiencing symptoms on that side
Further evaluation: remember, the spine, knee, and even the abdomen can refer pain to the hip. Depending on your findings and clinical suspicion, further evaluation of these areas might be necessary
After the physical examination, imaging such as X-rays, MRI, or ultrasound may be necessary based on the suspected diagnosis

ROM, range of motion; MRI, magnetic resonance imaging.

Two board-certified emergency medicine attending physicians and one board-certified internal medicine attending physician evaluated GPT-4’s responses based on accuracy, comprehensiveness, readability, and overall quality using a Likert scale (1= very poor, 3= neutral, 5= excellent). Reviewers also provided subjective commentary on each attribute.
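As a minimal sketch of the scoring arithmetic, assuming ratings are recorded as one row per reviewer and chief complaint, the snippet below reproduces the aggregation behind Table 1: each category score is the sum of the three reviewers’ 1–5 ratings (maximum 15), and the total per complaint is the sum of the four categories (maximum 60). The individual reviewer values shown are hypothetical; only their sums correspond to the “painful hip” row of Table 1.

import pandas as pd

ATTRS = ["accuracy", "comprehensiveness", "readability", "overall_quality"]

# Hypothetical per-reviewer ratings; the real dataset holds 3 reviewers x 19 complaints.
ratings = pd.DataFrame([
    {"complaint": "Painful hip", "reviewer": "EM-1", "accuracy": 5,
     "comprehensiveness": 4, "readability": 5, "overall_quality": 4},
    {"complaint": "Painful hip", "reviewer": "EM-2", "accuracy": 5,
     "comprehensiveness": 4, "readability": 4, "overall_quality": 5},
    {"complaint": "Painful hip", "reviewer": "IM-1", "accuracy": 4,
     "comprehensiveness": 4, "readability": 5, "overall_quality": 4},
])

per_complaint = ratings.groupby("complaint")[ATTRS].sum()  # category scores, max 15 each
per_complaint["total"] = per_complaint.sum(axis=1)         # per-complaint total, max 60
print(per_complaint)  # for "Painful hip": 14, 12, 14, 13, total 53 (matches Table 1)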


Results

GPT-4 received an average Likert-scale score of 4.16 [95% confidence interval (CI): 3.95–4.36] for accuracy, 3.95 (95% CI: 3.78–4.11) for comprehensiveness, 4.39 (95% CI: 4.21–4.57) for readability, and 3.89 (95% CI: 3.74–4.05) for overall quality (Figure 2). The average total response score for any given chief complaint was 49.16 (95% CI: 48.49–51.72) out of 60. GPT-4 received the most points for “leg pain upon exertion” [54] and the fewest for “lower abdominal pain” [43]. The Pearson correlation coefficient revealed weak associations among most categories, with some exceptions: overall quality with accuracy (r=0.61, P<0.01) and overall quality with comprehensiveness (r=0.57, P<0.05) (Figure 3).

Figure 2 Ratings of GPT-4 outputs as reviewed by physician experts.
Figure 3 Pearson correlation coefficients of evaluation parameters.
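The summary statistics above can be reproduced with standard tools; the sketch below, using synthetic data in place of the study ratings, shows one way to compute a mean with a 95% t-distribution confidence interval and a pairwise Pearson correlation with its P value. The data arrays and the choice of a t-based interval are assumptions for illustration, not the authors’ analysis code.

import numpy as np
from scipy import stats

def mean_ci(x, confidence=0.95):
    """Return the mean and a two-sided t-distribution confidence interval."""
    x = np.asarray(x, dtype=float)
    m, sem = x.mean(), stats.sem(x)
    half = sem * stats.t.ppf((1 + confidence) / 2, len(x) - 1)
    return m, (m - half, m + half)

# Synthetic stand-ins for per-complaint mean Likert ratings (19 chief complaints).
rng = np.random.default_rng(0)
overall_quality = rng.uniform(3, 5, size=19)
accuracy = np.clip(overall_quality + rng.normal(0, 0.3, size=19), 1, 5)

print(mean_ci(accuracy))                          # mean rating with 95% CI
r, p = stats.pearsonr(overall_quality, accuracy)  # association between criteria
print(f"Pearson r = {r:.2f}, P = {p:.3f}")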

Overall, the reviewers’ subjective comments were positive, praising aspects of the outputs such as the attention to detail, broad scope of information, and easy-to-read instructions. However, common points of constructive feedback recurred throughout GPT-4’s responses (Table 3).

Table 3

Qualitative analysis

Category Themes Expert reviewer comments
Accuracy Impressed with details of special tests Impressed by the level of detail provided for the special tests
I was impressed by the inclusion of the dental exam—very comprehensive
Lack of specificity Some sections are vague and not specific. “Examine the ears, nose, and throat for any signs of infection or abnormalities that could be causing facial symptoms.” is not that helpful/accurate since it does not tell the reader what to look for
It also included scleral icterus here as “jaundice”, which is not the most appropriate term for jaundice of the eyes
In eye exam, I would include [sic] looking for additional signs of viral conjunctivitis like tearing/crusting that it did not include
Not calling for vital signs Did not include vital signs, which could point to an infectious etiology (epidural abscess)
Comprehensiveness Lack of specificity GPT4 could have been more specific with its otoscope exam. “Look for signs of infection or blockage” is less helpful than “look for canal erythema/tenderness, and effusion via TM light reflex/bulging.”
When mentioning placing the “stethoscope at various locations, including the aortic, pulmonic, tricuspid, and mitral areas” I think it would have been more instructive to also mention the anatomical location (i.e., left lower sternal border or apex)
It would be nice to expand on other systemic conditions in addition to hypothyroidism where blood tests could be valuable in the workup/evaluation
Redundant information Surprised it included a reminder to document the exam findings and communicate the results to the patient. These additions make the answer almost too comprehensive
Included redundant information about testing strength and sensation of CN V in the CN exam and neurological exam sections
Again, it’s offering information that isn’t directly asked for (labs, imaging, further history)
Over-comprehensive again with additional info that was not requested (documentation, labs, follow-up)
Omission of informative exams Could have included Carnett’s sign for MSK pain
Evaluating for stridor on neck exam is also important
A full neuro exam would be helpful to rule out Guillain Barre syndrome or a Myasthenia Gravis flare
Readability Vague language Readability could be improved by clarifying further what is indicated by orthopnea in this example: “(e.g., prefers sitting up—orthopnea).”
“Inspect the breasts with the arms at the sides, overhead, hands pressed onto hips, and leaning forward.” → This instruction is difficult to interpret
Quality Inconsistency The information presented could have been more detailed in this answer than priors
Irrelevant information GPT-4 provided additional information that was not relevant to the prompt, including basic management and follow-up recommendations, which, while correct in this instance, is outside the scope of what is being asked
The answer includes unnecessary information, such as a reminder to document exam findings, labs, patient education, etc.

TM, tympanic membrane; CN, cranial nerve; V, five; MSK, musculoskeletal.

Key themes on the qualitative aspects of GPT-4’s performance are highlighted below:

  • Accuracy: while praised for detail, GPT-4’s responses often lacked specific instructions. Reviewers also noted that GPT-4 sometimes neglected to recommend checking vital signs, a key element of the physical exam.
  • Comprehensiveness: concerns were raised about redundancy and over-comprehensiveness, which paradoxically led to omissions of crucial details, such as extremity strength testing in the evaluation of a “funny feeling face” concerning for stroke.
  • Readability and quality: reviewers noted instances of overly “vague” language, organizational inconsistencies, and inclusion of irrelevant information. For instance, GPT-4 persistently included reminders to check the patient’s history and labs and a suggestion to exercise care in patient communication, considerations that are important but outside the focus of the physical exam.

Discussion

In this study, we demonstrate the potential of off-the-shelf LLMs as adjunctive diagnostic tools for providing clinically relevant physical exam recommendations based on chief complaints. GPT-4 scored well across clinical presentations, with an average composite Likert-scale rating of 49.16 out of 60 (95% CI: 48.49–51.72), at least 80% of the maximum possible score. This suggests that GPT-4 could perform effectively in general medical situations with further fine-tuning, testing, and validation. Its performance across evaluation criteria underscores its versatility, making it a potentially valuable resource for physicians across specialties and experience levels.

In this pilot study, we relied upon subjective physician ratings to assess response quality; future investigations should directly compare the diagnostic capabilities of unassisted physicians to those using LLMs. Going forward, real-world patient cases could be used to fine-tune LLMs for a large and diverse set of specific clinical scenarios; this could help address the observed gaps in the diagnostic capacity of GPT-4. We anticipate an increasing role for LLMs in clinical decision support, helping to fill knowledge gaps and serving as an academic tool for emerging medical professionals, thereby enhancing physicians’ diagnostic capacity.


Acknowledgments

Our study directly evaluates LLMs as an intervention; however, we did not utilize AI or LLMs for any drafting of the manuscript or interpretation of data.

Funding: The project described was supported in part by an award from the National Institute of General Medical Sciences (No. T32GM144273). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute of General Medical Sciences or the National Institutes of Health.


Footnote

Peer Review File: Available at https://jmai.amegroups.com/article/view/10.21037/jmai-24-275/prf

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://jmai.amegroups.com/article/view/10.21037/jmai-24-275/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This study did not use patient data and was deemed non-human subject research by our Institutional Review Board (IRB).

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.


References

  1. Rao A, Pang M, Kim J, et al. Assessing the Utility of ChatGPT Throughout the Entire Clinical Workflow: Development and Usability Study. J Med Internet Res 2023;25:e48659. [Crossref] [PubMed]
  2. Rao A, Kim J, Kamineni M, et al. Evaluating GPT as an Adjunct for Radiologic Decision Making: GPT-4 Versus GPT-3.5 in a Breast Imaging Pilot. J Am Coll Radiol 2023;20:990-7. [Crossref] [PubMed]
  3. Abdelhady AM, Davis CR. Plastic Surgery and Artificial Intelligence: How ChatGPT Improved Operation Note Accuracy, Time, and Education. Mayo Clinic Proceedings: Digital Health 2023;1:299-308. [Crossref]
  4. Rao A, Kim J, Kamineni M, et al. Evaluating ChatGPT as an Adjunct for Radiologic Decision-Making. medRxiv 2023:2023.02.02.23285399. doi: 10.1101/2023.02.02.23285399.
  5. Koranteng E, Rao A, Flores E, et al. Empathy and Equity: Key Considerations for Large Language Model Adoption in Health Care. JMIR Med Educ 2023;9:e51199. [Crossref] [PubMed]
  6. Boscardin CK, Gin B, Golde PB, et al. ChatGPT and Generative Artificial Intelligence for Medical Education: Potential Impact and Opportunity. Acad Med 2024;99:22-7. [Crossref] [PubMed]
  7. Hirosawa T, Kawamura R, Harada Y, et al. ChatGPT-Generated Differential Diagnosis Lists for Complex Case-Derived Clinical Vignettes: Diagnostic Accuracy Evaluation. JMIR Med Inform 2023;11:e48808. [Crossref] [PubMed]
  8. Kumar Y, Koul A, Singla R, et al. Artificial intelligence in disease diagnosis: a systematic literature review, synthesizing framework and future research agenda. J Ambient Intell Humaniz Comput 2023;14:8459-86. [Crossref] [PubMed]
  9. Shen J, Zhang CJP, Jiang B, et al. Artificial Intelligence Versus Clinicians in Disease Diagnosis: Systematic Review. JMIR Med Inform 2019;7:e10010. [Crossref] [PubMed]
  10. Caruccio L, Cirillo S, Polese G, et al. Can ChatGPT provide intelligent diagnoses? A comparative study between predictive models and ChatGPT to define a new medical diagnostic bot. Expert Syst Appl 2024;235:121186. [Crossref]
  11. Horiuchi D, Tatekawa H, Shimono T, et al. Accuracy of ChatGPT generated diagnosis from patient’s medical history and imaging findings in neuroradiology cases. Neuroradiology 2024;66:73-9. [Crossref] [PubMed]
  12. Kozel G, Gurses ME, Gecici NN, et al. Chat-GPT on brain tumors: An examination of Artificial Intelligence/Machine Learning’s ability to provide diagnoses and treatment plans for example neuro-oncology cases. Clin Neurol Neurosurg 2024;239:108238. [Crossref] [PubMed]
  13. Lechien JR, Georgescu BM, Hans S, et al. ChatGPT performance in laryngology and head and neck surgery: a clinical case-series. Eur Arch Otorhinolaryngol 2024;281:319-33. [Crossref] [PubMed]
  14. Makhoul M, Melkane AE, Khoury PE, et al. A cross-sectional comparative study: ChatGPT 3.5 versus diverse levels of medical experts in the diagnosis of ENT diseases. Eur Arch Otorhinolaryngol 2024;281:2717-21. [Crossref] [PubMed]
  15. Stoneham S, Livesey A, Cooper H, et al. ChatGPT versus clinician: challenging the diagnostic capabilities of artificial intelligence in dermatology. Clin Exp Dermatol 2024;49:707-10. [Crossref] [PubMed]
  16. Rao A, Kim J, Lie W, et al. Proactive Polypharmacy Management Using Large Language Models: Opportunities to Enhance Geriatric Care. J Med Syst 2024;48:41. [Crossref] [PubMed]
  17. Smith JP, Doe ML. Machine learning in healthcare: Predicting chronic diseases using patient data. Journal of Medical Systems 2022;46:23.
  18. Brown RS, Johnson LK. The role of artificial intelligence in enhancing healthcare outcomes. International Journal of Healthcare Technology and Management 2023;19:144-58.
  19. Huang AA, Huang SY. Increasing transparency in machine learning through bootstrap simulation and shapely additive explanations. PLoS One 2023;18:e0281922. [Crossref] [PubMed]
  20. Garcia TH, Martinez EM. Applications of neural networks in predicting diabetes and hypertension: A comparative study. BMC Medical Informatics and Decision Making 2023;23:67.
  21. Lee CY, Kim HS. Utilizing big data and machine learning to forecast cardiovascular events: A systematic review. Computers in Biology and Medicine 2023;154:106328.
  22. Huang AA, Huang SY. Computation of the distribution of model accuracy statistics in machine learning: Comparison between analytically derived distributions and simulation-based methods. Health Sci Rep 2023;6:e1214. [Crossref] [PubMed]
  23. Yudkowsky R, Otaki J, Bordage G, et al. Hypothesis-Driven Physical Examination Student Handbook. MedEdPORTAL 2011;8294. [Crossref]
doi: 10.21037/jmai-24-275
Cite this article as: Rao AS, Rivera C, Ramji HF, Wagner S, Mu A, Kim J, Marks W, White B, Whitehead DC, Senter-Zapata MJ, Succi MD. A large language model-guided approach to the focused physical exam. J Med Artif Intell 2025;8:25.
