A large language model-guided approach to the focused physical exam
Background
The physical exam is vital to medical diagnosis, providing objective information that complements the patient’s history and guides clinical management. A focused exam tailored to the chief complaint can reveal critical findings, but identifying those findings in complex conditions often requires specialized training. Large language models (LLMs), such as OpenAI’s ChatGPT, are trained on extensive data and can generate relevant text on niche topics, potentially helping to fill such diagnostic gaps. Studies have shown LLMs to be useful diagnostic tools, yet their application to the physical exam remains unexplored (1-22).
Objective
This study evaluates GPT-4’s ability to provide physical exam instructions based on a chief complaint. We assess the accuracy, comprehensiveness, readability, and overall quality of its recommendations.
Methods
This study did not use patient data and was deemed non-human subject research by our Institutional Review Board (IRB). Each phase of the experimental workflow is outlined in Figure 1.
GPT-4 was prompted to recommend physical exam maneuvers based on the 19 chief complaints in the Hypothesis-Driven Physical Examination Student Handbook created by the Association of American Medical Colleges (Table 1) (23). The following prompt was entered in a new chat session for each chief complaint: “Please provide me with detailed, tailored, and specific physical exam maneuvers for a specified chief complaint from a patient.” An example of a GPT-4 output for a chief complaint of “painful hip” is provided in Table 2.
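For readers interested in reproducing this prompting workflow programmatically, the minimal sketch below issues the same two-turn prompt for each chief complaint through the OpenAI Python client. The study itself entered prompts into new ChatGPT chat sessions, so the client usage, model identifier, and message structure shown here are illustrative assumptions rather than the study’s actual procedure.

```python
# Illustrative sketch only: the study used new ChatGPT chat sessions; this
# reproduces the same one-prompt-per-complaint workflow via the OpenAI API.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = ("Please provide me with detailed, tailored, and specific physical "
          "exam maneuvers for a specified chief complaint from a patient.")

chief_complaints = ["heart murmur", "painful hip", "back pain"]  # subset of the 19

outputs = {}
for cc in chief_complaints:
    # A fresh message list per complaint mirrors starting a new chat session.
    response = client.chat.completions.create(
        model="gpt-4",  # assumed identifier; the study used GPT-4 via ChatGPT
        messages=[
            {"role": "user", "content": PROMPT},
            {"role": "user", "content": f"Chief complaint: {cc}"},
        ],
    )
    outputs[cc] = response.choices[0].message.content
```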
Table 1 Reviewer scores for each chief complaint. Each cell is the sum of three reviewers’ ratings on a 1–5 Likert scale (maximum 15 per attribute; maximum total of 60)
CC | Accuracy | Comprehensiveness | Readability | Overall quality | Total |
---|---|---|---|---|---|
Heart murmur | 12 | 12 | 13 | 12 | 49 |
Itchy scalp | 12 | 11 | 13 | 11 | 47 |
Stuffy nose and sore throat | 11 | 12 | 14 | 12 | 49 |
Swollen neck and fatigue | 11 | 12 | 14 | 11 | 48 |
Blurred vision | 13 | 11 | 14 | 12 | 50 |
Hearing loss | 14 | 12 | 14 | 12 | 52 |
Funny feeling face* | 12 | 11 | 14 | 10 | 47 |
Shortness of breath | 13 | 11 | 13 | 10 | 47 |
Lump in breast | 13 | 12 | 12 | 11 | 48 |
Leg pain on exertion | 14 | 13 | 14 | 13 | 54 |
Upper abdominal pain | 11 | 13 | 14 | 12 | 50 |
Lower abdominal pain | 10 | 11 | 12 | 10 | 43 |
Painful wrist | 13 | 12 | 13 | 12 | 50 |
Painful shoulder | 13 | 11 | 13 | 12 | 49 |
Sore knee | 14 | 12 | 12 | 13 | 51 |
Painful hip | 14 | 12 | 14 | 13 | 53 |
Painful ankle | 14 | 12 | 13 | 13 | 52 |
Unsteadiness | 12 | 13 | 12 | 12 | 49 |
Back pain | 11 | 12 | 12 | 11 | 46 |
Averages | 12.47 | 11.84 | 13.16 | 11.68 | 49.16 |
*, funny feeling face here suggests a neurological presentation. CC, chief complaint.
Table 2 Example GPT-4 output for the chief complaint of “painful hip”
User input | GPT-4 output |
---|---|
Please provide me with detailed, tailored, and specific physical exam maneuvers for a specified chief complaint from a patient | Of course! Please specify the chief complaint or the system you want to focus on, and I will provide you with tailored physical exam maneuvers for that |
Chief complaint: painful hip | Certainly! When a patient presents with the chief complaint of a painful hip, a thorough history and physical examination are essential for appropriate diagnosis. Here’s a tailored approach to the physical examination: |
| | General observations: (I) gait analysis: observe the patient’s walking pattern. Look for limping, antalgic gait, Trendelenburg gait, or any other abnormalities. (II) Inspect for: (i) skin changes (bruising, erythema, swelling); (ii) atrophy or hypertrophy of the thigh or calf muscles; (iii) leg length discrepancy (measure from the anterior superior iliac spine to the medial malleolus); (iv) any visible deformities or asymmetry |
| | Palpation: (I) palpate the hip joint, greater trochanter, and surrounding structures. Check for tenderness, warmth, or masses. (II) Check for tenderness over the inguinal ligament, which might suggest a hernia or hip pathology |
| | ROM tests: (I) active ROM: (i) flexion: have the patient flex the hip by bringing the knee towards the chest; (ii) extension: have the patient lie prone and lift the leg off the examination table; (iii) abduction: while supine, have the patient move the leg outward; (iv) adduction: while supine, move the leg inward crossing the midline; (v) internal rotation: with the hip and knee flexed to 90 degrees, rotate the foot outward; (vi) external rotation: with the hip and knee flexed to 90 degrees, rotate the foot inward. (II) Passive ROM: repeat the movements passively to evaluate any pain, crepitus, or restricted movement |
| | Special tests: (I) FABER test (Flexion, ABduction, External Rotation): (i) position the patient supine. Place the foot of the affected leg on the opposite knee forming a figure 4. Gently press down on the affected knee and assess for pain or restriction. (II) Trendelenburg test: (i) ask the patient to stand on one leg. If the pelvis drops on the opposite side, it suggests gluteus medius/minimus weakness or hip joint pathology. (III) Log roll test: (i) with the patient supine, roll the leg in and out to assess for pain. Pain might suggest intra-articular pathology. (IV) Anterior impingement test: (i) with the patient supine, flex the hip and knee to 90 degrees, then adduct and internally rotate the hip. Pain suggests femoroacetabular impingement |
| | Neurovascular assessment: (I) check distal pulses (dorsalis pedis and posterior tibial). (II) Sensation: ensure the patient has sensation over the anterior, lateral, and posterior aspects of the thigh and lower leg. (III) Strength: test muscle strength, especially hip flexors, extensors, abductors, and adductors |
| | Comparative examination: always examine the contralateral hip for comparison, even if the patient is not experiencing symptoms on that side |
| | Further evaluation: remember, the spine, knee, and even the abdomen can refer pain to the hip. Depending on your findings and clinical suspicion, further evaluation of these areas might be necessary |
| | After the physical examination, imaging such as X-rays, MRI, or ultrasound may be necessary based on the suspected diagnosis |
ROM, range of motion; MRI, magnetic resonance imaging.
Two board-certified emergency medicine attending physicians and one board-certified internal medicine attending physician evaluated GPT-4’s responses on accuracy, comprehensiveness, readability, and overall quality using a 5-point Likert scale (1 = very poor, 3 = neutral, 5 = excellent). Reviewers also provided subjective commentary on each attribute.
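As a minimal illustration of the scoring arithmetic behind Table 1 (using hypothetical reviewer ratings, not the study’s actual data), each chief complaint’s score for an attribute is the sum of the three reviewers’ 1–5 ratings (maximum 15), and the total is the sum across the four attributes (maximum 60):

```python
# Hypothetical ratings for one chief complaint: {reviewer: {attribute: 1-5}}.
# Each attribute score is a sum over three reviewers (range 3-15); the total
# sums the four attribute scores (range 12-60), as reported in Table 1.
attributes = ["accuracy", "comprehensiveness", "readability", "overall_quality"]

ratings = {
    "reviewer_1": {"accuracy": 5, "comprehensiveness": 4, "readability": 5, "overall_quality": 4},
    "reviewer_2": {"accuracy": 4, "comprehensiveness": 4, "readability": 5, "overall_quality": 4},
    "reviewer_3": {"accuracy": 5, "comprehensiveness": 4, "readability": 4, "overall_quality": 5},
}

attribute_scores = {a: sum(r[a] for r in ratings.values()) for a in attributes}
total_score = sum(attribute_scores.values())
print(attribute_scores, total_score)  # e.g., 14, 12, 14, 13 -> total 53
```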
Results
GPT-4 received an average score of 4.16 [95% confidence interval (CI): 3.95–4.36] for accuracy, 3.95 (95% CI: 3.78–4.11) for comprehensiveness, 4.39 (95% CI: 4.21–4.57) for readability, and 3.89 (95% CI: 3.74–4.05) for overall quality on the 5-point Likert scale (Figure 2). The average total response score for a given chief complaint was 49.16 (95% CI: 48.49–51.72) out of 60. GPT-4 received the most points for “leg pain on exertion” [54] and the fewest for “lower abdominal pain” [43]. Pearson correlation coefficients revealed weak associations between most pairs of categories, with two exceptions: overall quality with accuracy (r=0.61, P<0.01) and overall quality with comprehensiveness (r=0.57, P<0.05) (Figure 3).
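The sketch below shows how summary statistics of this kind could be computed from the Table 1 scores (dividing by three recovers the per-reviewer Likert averages). The exact statistical procedures are not specified here, so the t-distribution confidence interval and SciPy Pearson correlation are assumptions about one plausible approach.

```python
import numpy as np
from scipy import stats

# Per-complaint accuracy and overall-quality scores from Table 1 (out of 15),
# divided by 3 to recover per-reviewer averages on the 1-5 Likert scale.
accuracy = np.array([12, 12, 11, 11, 13, 14, 12, 13, 13, 14,
                     11, 10, 13, 13, 14, 14, 14, 12, 11]) / 3
overall = np.array([12, 11, 12, 11, 12, 12, 10, 10, 11, 13,
                    12, 10, 12, 12, 13, 13, 13, 12, 11]) / 3

mean = accuracy.mean()
# 95% CI using the t distribution across the 19 chief complaints (assumed method).
ci_low, ci_high = stats.t.interval(0.95, df=len(accuracy) - 1,
                                   loc=mean, scale=stats.sem(accuracy))
r, p = stats.pearsonr(accuracy, overall)
print(f"accuracy: mean {mean:.2f}, 95% CI ({ci_low:.2f}, {ci_high:.2f}); "
      f"r(accuracy, overall quality) = {r:.2f} (P = {p:.3f})")
```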
Overall, the reviewers’ subjective comments were positive, praising various aspects of the outputs such as the attention to detail, broad scope of information, and easy-to-read instructions. However, there were common points of constructive feedback present throughout GPT-4 responses (Table 3).
Table 3 Representative expert reviewer comments on GPT-4 outputs, grouped by evaluation category and theme
Category | Themes | Expert reviewer comments |
---|---|---|
| Accuracy | Impressed with details of special tests | Impressed by the level of detail provided for the special tests |
| | | I was impressed by the inclusion of the dental exam—very comprehensive |
| | Lack of specificity | Some sections are vague and not specific. “Examine the ears, nose, and throat for any signs of infection or abnormalities that could be causing facial symptoms.” is not that helpful/accurate since it does not tell the reader what to look for |
| | | It also included scleral icterus here as “jaundice”, which is not the most appropriate term for jaundice of the eyes |
| | | In eye exam, I would include [sic] looking for additional signs of viral conjunctivitis like tearing/crusting that it did not include |
| | Not calling for vital signs | Did not include vital signs, which could point to an infectious etiology (epidural abscess) |
| Comprehensiveness | Lack of specificity | GPT4 could have been more specific with its otoscope exam. “Look for signs of infection or blockage” is less helpful than “look for canal erythema/tenderness, and effusion via TM light reflex/bulging.” |
| | | When mentioning placing the “stethoscope at various locations, including the aortic, pulmonic, tricuspid, and mitral areas” I think it would have been more instructive to also mention the anatomical location (i.e., left lower sternal border or apex) |
| | | It would be nice to expand on other systemic conditions in addition to hypothyroidism where blood tests could be valuable in the workup/evaluation |
| | Redundant information | Surprised it included a reminder to document the exam findings and communicate the results to the patient. These additions make the answer almost too comprehensive |
| | | Included redundant information about testing strength and sensation of CN V in the CN exam and neurological exam sections |
| | | Again, it’s offering information that isn’t directly asked for (labs, imaging, further history) |
| | | Over-comprehensive again with additional info that was not requested (documentation, labs, follow-up) |
| | Omission of informative exams | Could have included Carnett’s sign for MSK pain |
| | | Evaluating for stridor on neck exam is also important |
| | | A full neuro exam would be helpful to rule out Guillain Barre syndrome or a Myasthenia Gravis flare |
| Readability | Vague language | Readability could be improved by clarifying further what is indicated by orthopnea in this example: “(e.g., prefers sitting up—orthopnea).” |
| | | “Inspect the breasts with the arms at the sides, overhead, hands pressed onto hips, and leaning forward.” → This instruction is difficult to interpret |
| Quality | Inconsistency | The information presented could have been more detailed in this answer than priors |
| | Irrelevant information | GPT-4 provided additional information that was not relevant to the prompt, including basic management and follow-up recommendations, which, while correct in this instance, is outside the scope of what is being asked |
| | | The answer includes unnecessary information, such as a reminder to document exam findings, labs, patient education, etc. |
TM, tympanic membrane; CN, cranial nerve; CN V, fifth (trigeminal) cranial nerve; MSK, musculoskeletal.
Key themes on the qualitative aspects of GPT-4’s performance are highlighted below:
- Accuracy: while praised for detail, GPT-4’s responses were often judged to lack specific instructions. Reviewers also noted that GPT-4 sometimes omitted vital signs, a key element of the physical exam.
- Comprehensiveness: concerns were raised about redundancy and over-comprehensiveness, which paradoxically coexisted with omissions of crucial details, such as extremity strength testing in the evaluation of a “funny feeling face” concerning for stroke.
- Readability and quality: reviewers noted instances of overly vague language, organizational inconsistencies, and inclusion of irrelevant information. For instance, GPT-4 persistently included reminders to review the patient’s history and labs and to exercise care in patient communication, which are important considerations but fall outside the focus of the physical exam.
Discussion
In this study, we demonstrate the potential of off-the-shelf LLMs as adjunctive diagnostic tools for providing clinically relevant physical exam recommendations based on chief complaints. GPT-4 scored well across a range of clinical presentations, with an average composite rating of 49.16 (95% CI: 48.49–51.72) out of a maximum possible 60, or approximately 82%. This suggests that GPT-4 could perform effectively in general medical situations with further fine-tuning, testing, and validation. Its performance across evaluation criteria underscores its versatility, making it a potentially valuable resource for physicians across specialties and experience levels.
In this pilot study, we relied upon subjective physician ratings to assess response quality; future investigations should directly compare the diagnostic capabilities of unassisted physicians to those using LLMs. Going forward, real-world patient cases could be used to fine-tune LLMs for a large and diverse set of specific clinical scenarios; this could help address the observed gaps in the diagnostic capacity of GPT-4. We anticipate an increasing role for LLMs in clinical decision support, helping to fill knowledge gaps and serving as an academic tool for emerging medical professionals, thereby enhancing physicians’ diagnostic capacity.
Acknowledgments
Our study directly evaluates LLMs as an intervention; however, we did not utilize AI or LLMs for any drafting of the manuscript or interpretation of data.
Funding: The project described was supported in part by
Footnote
Peer Review File: Available at https://jmai.amegroups.com/article/view/10.21037/jmai-24-275/prf
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://jmai.amegroups.com/article/view/10.21037/jmai-24-275/coif). The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This study did not use patient data and was deemed non-human subject research by our Institutional Review Board (IRB).
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Rao A, Pang M, Kim J, et al. Assessing the Utility of ChatGPT Throughout the Entire Clinical Workflow: Development and Usability Study. J Med Internet Res 2023;25:e48659. [Crossref] [PubMed]
- Rao A, Kim J, Kamineni M, et al. Evaluating GPT as an Adjunct for Radiologic Decision Making: GPT-4 Versus GPT-3.5 in a Breast Imaging Pilot. J Am Coll Radiol 2023;20:990-7. [Crossref] [PubMed]
- Abdelhady AM, Davis CR. Plastic Surgery and Artificial Intelligence: How ChatGPT Improved Operation Note Accuracy, Time, and Education. Mayo Clinic Proceedings: Digital Health 2023;1:299-308. [Crossref]
- Koranteng E, Rao A, Flores E, et al. Empathy and Equity: Key Considerations for Large Language Model Adoption in Health Care. JMIR Med Educ 2023;9:e51199. [Crossref] [PubMed]
- Boscardin CK, Gin B, Golde PB, et al. ChatGPT and Generative Artificial Intelligence for Medical Education: Potential Impact and Opportunity. Acad Med 2024;99:22-7. [Crossref] [PubMed]
- Hirosawa T, Kawamura R, Harada Y, et al. ChatGPT-Generated Differential Diagnosis Lists for Complex Case-Derived Clinical Vignettes: Diagnostic Accuracy Evaluation. JMIR Med Inform 2023;11:e48808. [Crossref] [PubMed]
- Kumar Y, Koul A, Singla R, et al. Artificial intelligence in disease diagnosis: a systematic literature review, synthesizing framework and future research agenda. J Ambient Intell Humaniz Comput 2023;14:8459-86. [Crossref] [PubMed]
- Shen J, Zhang CJP, Jiang B, et al. Artificial Intelligence Versus Clinicians in Disease Diagnosis: Systematic Review. JMIR Med Inform 2019;7:e10010. [Crossref] [PubMed]
- Caruccio L, Cirillo S, Polese G, et al. Can ChatGPT provide intelligent diagnoses? A comparative study between predictive models and ChatGPT to define a new medical diagnostic bot. Expert Syst Appl 2024;235:121186. [Crossref]
- Horiuchi D, Tatekawa H, Shimono T, et al. Accuracy of ChatGPT generated diagnosis from patient’s medical history and imaging findings in neuroradiology cases. Neuroradiology 2024;66:73-9. [Crossref] [PubMed]
- Kozel G, Gurses ME, Gecici NN, et al. Chat-GPT on brain tumors: An examination of Artificial Intelligence/Machine Learning’s ability to provide diagnoses and treatment plans for example neuro-oncology cases. Clin Neurol Neurosurg 2024;239:108238. [Crossref] [PubMed]
- Lechien JR, Georgescu BM, Hans S, et al. ChatGPT performance in laryngology and head and neck surgery: a clinical case-series. Eur Arch Otorhinolaryngol 2024;281:319-33. [Crossref] [PubMed]
- Makhoul M, Melkane AE, Khoury PE, et al. A cross-sectional comparative study: ChatGPT 3.5 versus diverse levels of medical experts in the diagnosis of ENT diseases. Eur Arch Otorhinolaryngol 2024;281:2717-21. [Crossref] [PubMed]
- Stoneham S, Livesey A, Cooper H, et al. ChatGPT versus clinician: challenging the diagnostic capabilities of artificial intelligence in dermatology. Clin Exp Dermatol 2024;49:707-10. [Crossref] [PubMed]
- Rao A, Kim J, Lie W, et al. Proactive Polypharmacy Management Using Large Language Models: Opportunities to Enhance Geriatric Care. J Med Syst 2024;48:41. [Crossref] [PubMed]
- Smith JP, Doe ML. Machine learning in healthcare: Predicting chronic diseases using patient data. Journal of Medical Systems 2022;46:23.
- Brown RS, Johnson LK. The role of artificial intelligence in enhancing healthcare outcomes. International Journal of Healthcare Technology and Management 2023;19:144-58.
- Huang AA, Huang SY. Increasing transparency in machine learning through bootstrap simulation and shapely additive explanations. PLoS One 2023;18:e0281922. [Crossref] [PubMed]
- Garcia TH, Martinez EM. Applications of neural networks in predicting diabetes and hypertension: A comparative study. BMC Medical Informatics and Decision Making 2023;23:67.
- Lee CY, Kim HS. Utilizing big data and machine learning to forecast cardiovascular events: A systematic review. Computers in Biology and Medicine 2023;154:106328.
- Huang AA, Huang SY. Computation of the distribution of model accuracy statistics in machine learning: Comparison between analytically derived distributions and simulation-based methods. Health Sci Rep 2023;6:e1214. [Crossref] [PubMed]
- Yudkowsky R, Otaki J, Bordage G, et al. Hypothesis-Driven Physical Examination Student Handbook. MedEdPORTAL 2011;8294. [Crossref]
Cite this article as: Rao AS, Rivera C, Ramji HF, Wagner S, Mu A, Kim J, Marks W, White B, Whitehead DC, Senter-Zapata MJ, Succi MD. A large language model-guided approach to the focused physical exam. J Med Artif Intell 2025;8:25.