Original Article

Fine-tuning for accuracy: evaluation of Generative Pretrained Transformer (GPT) for automatic assignment of International Classification of Disease (ICD) codes to clinical documentation

Khalid Nawab1, Madalyn Fernbach2, Sayuj Atreya3, Samina Asfandiyar4, Gulalai Khan5, Riya Arora6, Iqbal Hussain7, Shadi Hijjawi8, Richard Schreiber1

1Penn State Health Holy Spirit Medical Center, Camp Hill, PA, USA; 2Penn State University, State College, PA, USA; 3University of Pittsburgh, Pittsburgh, PA, USA; 4Cizik School of Nursing, The University of Texas Health Science Center at Houston, Houston, TX, USA; 5Swat Medical College, Swat, KPK, Pakistan; 6Department of Biology, University of California Berkeley, Berkeley, CA, USA; 7Department of Biology, Lady Reading Hospital, Peshawar, KPK, Pakistan; 8Penn State Health Milton S. Hershey Medical Center, Hershey, PA, USA

Contributions: (I) Conception and design: K Nawab; (II) Administrative support: S Hijjawi, R Schreiber; (III) Provision of study materials or patients: K Nawab, S Hijjawi, R Schreiber; (IV) Collection and assembly of data: K Nawab, G Khan, I Hussain, R Arora; (V) Data analysis and interpretation: K Nawab, G Khan, R Arora, S Atreya; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Khalid Nawab, MD. Penn State Health Holy Spirit Medical Center, 503 N 21st Street, Camp Hill, PA 17011, USA. Email: knawab@pennstatehealth.psu.edu.

Background: Assignment of International Classification of Disease (ICD) codes to clinical documentation is a tedious but important task that is mostly done manually. This study evaluated OpenAI’s widely popular Generative Pretrained Transformer (GPT)-3.5 Turbo for automating the assignment of ICD codes to clinical notes.

Methods: We identified the ten most prevalent ICD-10 codes in the Medical Information Mart for Intensive Care (MIMIC-IV) dataset. For each code, we randomly selected 200 notes and split them equally into training and testing groups of 100 notes each. We then passed each note to GPT-3.5 Turbo via OpenAI’s Application Programming Interface, prompting the model to assign ICD-10 codes to the note, and evaluated the response for the presence of the target ICD-10 code. After fine-tuning the GPT model on the training data, we repeated the process with the test data, comparing the fine-tuned model’s performance against the default model.

Results: The default GPT-3.5 Turbo model included the target ICD-10 code among its assigned codes in 29.7% of cases. After fine-tuning with 100 notes for each top code, accuracy improved to 62.6%.

Conclusions: Historically, GPT’s performance on healthcare-related tasks has been suboptimal. Fine-tuning, as in this study, offers great potential for improved performance, highlighting a path forward for the integration of artificial intelligence in healthcare to improve the efficiency and accuracy of this administrative task. Future research should focus on expanding the training datasets with specialized data and exploring the integration of these models into existing healthcare systems to maximize their utility and reliability.

Keywords: International Classification of Disease, 10th revision codes (ICD-10 codes); artificial intelligence (AI); Generative Pretrained Transformer-3.5 Turbo (GPT-3.5 Turbo); clinical documentation; automation


Received: 02 March 2024; Accepted: 15 May 2024; Published online: 21 June 2024.

doi: 10.21037/jmai-24-60


Highlight box

Key findings

• The default Generative Pretrained Transformer (GPT)-3.5 Turbo model assigned the correct International Classification of Disease, 10th revision (ICD-10) code in 29.7% of cases.

• Fine-tuning improved the accuracy to 62.6%.

What is known and what is new?

• Manual assignment of ICD codes is a time-consuming task prone to errors.

• Models like GPT offer a potential solution.

• GPT is easy to implement, but it is not trained on carefully curated medical data; therefore, its performance on healthcare-related tasks is suboptimal.

What is the implication, and what should change now?

• Artificial intelligence models like GPT-3.5 Turbo offer a good solution for improving efficiency of administrative tasks in healthcare.

• However, models trained on specialized data are needed, as current models perform well on generic tasks but not on specialized ones.

• Fine-tuning offers an effective solution, eliminating the need for a new model and improving the performance of an available model.


Introduction

The coding of diseases in the United States using the International Classification of Diseases (ICD) is required for billing and is the underlying technical structure for problem lists. Although the ICD is an international common language, code assignment in electronic health records (EHR) is a largely manual process, which can be tedious, produces a backlog of work, and delays clinical processes.

The World Health Organization (WHO) regulates the ICD, utilizing a coding system that designates short codes for specific diseases (1). Each code comprises alpha-numeric characters corresponding to various health categories, etiologies, manifestations, and severity (2). The ICD is currently available in 43 languages and all WHO member states use it to share death and disease statistics. The system’s global use also makes it a common language for healthcare providers to record, report, and communicate disease information efficiently between hospitals in various regions and countries (1). Additionally, designating ICD codes for diseases based on health information listed in clinical notes includes a subjective component that may decrease the accuracy of code assignment. Artificial intelligence (AI)-based software that automates the ICD code assignment process can increase clinical efficiency by reducing the need for manual classification by healthcare providers (3).

Background

OpenAI is a non-profit AI research company founded in 2015 (4). One of OpenAI’s most notable projects, ChatGPT, is a web application powered by a state-of-the-art Large Language Model (LLM) called Generative Pretrained Transformer (GPT) (5). ChatGPT functions as an intelligent chatbot with language capabilities such as multilingual machine translation, code debugging, story writing, mistake correction, and identification and rejection of inappropriate requests. These capabilities allow users to input specific prompts and receive detailed responses (6). GPT’s impressive performance in various tasks is possible due to its pre-training process, in which the model is trained on large amounts of structured and unstructured data from books, articles, reviews, and online conversations.

This extensive pre-training process ultimately separates GPT from previous LLMs that failed to interpret the context of a given input and produce relevant output (5). GPT’s ability to derive context from data without needing domain knowledge from medical experts can be utilized to extract relevant information from clinical notes. EHR are clinical patient records that contain medical information such as vitals, lab results, medical history, and clinical notes from providers. The efficient transmission and analysis of EHR data between various providers improves clinical care quality considerably (7). The two primary methods to automate ICD code assignment to free-text clinical notes are rule-based systems and learning-based systems. The former depends on the manual intervention of medical professionals, thus limiting the scale at which the process can be optimized. The latter does not require manual manipulation and relies on learning algorithms to extract meaningful underlying distributions in datasets (8). OpenAI’s GPT model is an example of a learning-based system that can be trained to assign ICD codes to clinical notes using datasets prelabelled with ICD codes. Fine-tuning is the process of providing the pre-trained model with a smaller, specialized dataset for further training. Because the model already possesses contextual knowledge from the larger corpus on which it was originally trained, it can derive important insights from the smaller but more specific dataset and potentially improve its performance on a specialized task significantly (9). By using clinical training data and fine-tuning the GPT model on OpenAI’s platform, one can assess the effectiveness of using GPT for ICD code assignment based on clinical notes.


Literature review

AI in healthcare

AI in healthcare can potentially lower healthcare costs and improve outcomes. One estimate projects savings of $150 billion in the United States healthcare industry by 2026 (10). AI has found its place in healthcare in robotic-assisted surgical systems, virtual nurse assistants, medication management, medical diagnostics, and more (10). This paper discusses the potential of a specific AI model, GPT-3.5 Turbo, for ICD code assignment.

Challenges of ICD codes implementation

There are several costs associated with the use of ICD codes, especially during times of transition (from ICD-9 to ICD-10 in 2015, or currently from ICD-10 to ICD-11). One survey of 6,000 medical centers found that the average time spent on staff education was 61.2 hours for small practices and 139 hours for medium-sized practices, and on physician education 35.6 and 75.1 hours, respectively (11). The average cost of ICD-10, Clinical Modification (ICD-10-CM) implementation in the United States was between $6,748 and $9,564 for a small medical practice and between $14,577 and $23,062 for a medium-sized medical practice. These costs included software updates, staff education, and EHR quality assurance projects (11). Furthermore, the transition from ICD-10-CM to the new ICD-11 system will take time. One 2021 study found that only approximately 23.5% of ICD-10-CM codes could be fully represented by a single ICD-11 stem code without combining multiple codes or, if necessary, introducing new stem codes (12). Most studies focus on the financial and time burden, but less research is available on the emotional stress surrounding ICD coding within the healthcare system. In clinical practice, many clinicians are frustrated by the emphasis on medical coding. One 2015 survey found that over 85% of surveyed clinicians said ICD-10 diverts focus from patient-centered care toward insurance and billing (13).

Coding errors

Insurance companies and Medicare use these codes in the diagnosis-related group payment system to determine payments to hospitals (14). Correct coding of patient encounters is exceedingly important, and failure to code correctly can have financial and even legal repercussions for a medical practice. Some of the most common errors include upcoding (reporting that a provider spent more time with a patient than was actually the case), selecting the wrong procedure code, and using outdated coding terms instead of current ones (15). In the U.S., the quality of the coding process has been questioned by many studies showing significant room for improvement. One study by the National Academy of Medicine on the reliability of hospital discharge coding showed only 65% agreement with independent re-coding (16). Hsia et al. revealed a coding error rate of 20% (17). Other similar studies showed typical error rates of 25–30%, with low agreement between coders (18). A report analyzing the previous ICD-9-CM codes estimated that the cost of correcting wrong codes in the U.S. was upwards of $25 billion per year. Manual coding for diverse disease etiologies, pathologies, clinical manifestations, and treatment plans is not only prone to errors but also time-consuming and inefficient (19).

AI seems to offer a viable solution to the challenges associated with ICD code assignment and implementation.

AI for assigning ICD codes

A review of 1,611 publications on automated coding from 1974–2020 found a significant increase in AI-based coding publications after 2009, with Natural Language Processing (NLP) and machine learning (ML) as the most-used methodologies for automated coding (20). An example is the successful collaboration between a Clinical Documentation Integrity Specialist and an embedded Computer Assisted Coding (CAC) system (21). The ICD provides a taxonomy of classes representing the various conditions addressed during an episode of care, as presented in clinical documentation. Because clinical documentation consists of unstructured textual data and a single note can have multiple ICD codes assigned to it, code assignment can be treated as a multi-label classification problem (22). Deep learning-based methods have outperformed other conventional models in ICD code assignment (3). A systematic review of studies from 2010 to 2021 provided an overview of automatic ICD coding systems that utilized NLP, ML, and deep learning techniques, and concluded that deep learning models were better than traditional ML models for automating clinical coding (23).

Utilizing NLP techniques such as word embedding (a representation of words and phrases by vectors in a low-dimensional space that retains semantic and syntactic information) and a convolutional neural network (a deep learning algorithm that captures hierarchical patterns in textual data using convolutional layers), another study processed 21,953 clinical records from five departments, significantly enhancing the accuracy of automated ICD-10 code predictions and potentially easing the manual coding process for physicians (24). A similar study analyzed the use of an NLP-bidirectional recurrent neural network (NLP-BIRNN) algorithm to optimize medical records and identify areas of error by medical coders. NLP-BIRNN is a deep learning algorithm that processes sequences of text in both forward and reverse directions, thus retaining contextual information from both past and future states. NLP-BIRNN reduced errors in the assignment of principal diagnoses and ICD coding (25).

The introduction of transformers (deep learning models that rely on self-attention mechanisms and process entire sentences simultaneously rather than word-by-word, making them far more efficient at retaining context and thus exceptional at linguistic tasks) and LLMs (AI systems based on the transformer architecture, trained on diverse language datasets to understand, generate, and interact with human language at scale) opened new doors. Publicly available systems like ChatGPT make these models accessible to the general public, and it was only a matter of time before professionals started experimenting with them for healthcare applications. One study found that ChatGPT was able to generate at least one correct ICD code for an encounter 70% of the time (26). Another compared the off-the-shelf LLMs GPT-4 and Llama-2 with a model specifically trained for ICD code assignment, known as the pretrained language model for ICD coding (PLM-ICD), and reported a consistent accuracy of 22% for PLM-ICD and 22.5% for GPT-4, as represented by F1-score (27). The objective of the current project is to evaluate the accuracy of GPT in assigning ICD-10 codes and to determine whether fine-tuning can improve its performance.


Methods

The dataset

We used free-text discharge summaries from the Medical Information Mart for Intensive Care IV Note (MIMIC-IV-Note) dataset, which contains de-identified clinical data of over 40,000 patients admitted to the Beth Israel Deaconess Medical Center (28,29). Each discharge summary contains free-text data describing the initial presentation and course of treatment during the specific hospital encounter, and also includes diagnostic data. Access to this dataset is restricted and requires the user to sign a Data Use Agreement with PhysioNet (30). Each admission encounter has a unique Hospital Admission ID, denoted ‘hadm_id’ in the dataset. We downloaded the de-identified free-text clinical notes contained in the “discharges.csv” file (29) as well as the ICD diagnoses contained in the “diagnosis_icd.csv” file (28). The diagnosis file has both ICD-9 and ICD-10 codes for each admission encounter, as represented by “hadm_id”, and by default contains multiple entries per note, one for each assigned code. To make a single aggregated table, we joined the two tables on the data field “hadm_id”, such that each “hadm_id” was a unique entry with a single discharge summary note and the list of ICD-10 codes assigned to that note as a single entry in the ICD-codes column. Some basic statistics about the dataset are shown in Table 1.
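As an illustration, this aggregation can be sketched with pandas as follows. This is a minimal sketch, not the study’s code: the column names “icd_version”, “icd_code”, and “text” are assumptions about the MIMIC-IV schema.

    import pandas as pd

    # Load the de-identified discharge summaries and the ICD diagnosis table
    notes = pd.read_csv("discharges.csv")         # one row per note: hadm_id, text, ...
    diagnoses = pd.read_csv("diagnosis_icd.csv")  # multiple rows per hadm_id, one per code

    # Keep the ICD-10 rows only, then collapse the per-code rows into one list per admission
    icd10 = diagnoses[diagnoses["icd_version"] == 10]
    codes_per_admission = (
        icd10.groupby("hadm_id")["icd_code"]
        .apply(list)
        .reset_index(name="icd10_codes")
    )

    # Join on hadm_id so each row holds one unique note and its full list of ICD-10 codes
    dataset = notes.merge(codes_per_admission, on="hadm_id", how="inner")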

Table 1

Summary statistics of the final dataset

Statistic Count
Total number of discharge summaries 122,300
Average number of ICD-10 codes per discharge summary 14.4
Maximum number of ICD-10 codes assigned to a note 39
Minimum number of ICD-10 codes assigned to a note 1

ICD, International Classification of Disease.

To evaluate the performance of the model, we adopted an approach similar to Huang et al. (3). In this approach, each ICD code is treated independently of the others, and only discharge summaries related to the top 10 most prevalent ICD codes are considered. For each specific ICD code under review, we verify whether it appears in the model’s list for each corresponding note, and accuracy is calculated on this criterion. For instance, if the code is correctly predicted to be applicable in 80 of 100 notes, the model achieves an accuracy of 80% for that code. This approach seems logical because the assignment of ICD codes to clinical documentation is subjective to some extent. It is possible that in the MIMIC-IV dataset, where codes were assigned manually, not all applicable codes were captured for each note. Comparing the entire list of ICD-10 codes assigned by the model to the manual list would therefore be challenging, since exact matches would be rare. We thus calculated the top 10 most prevalent ICD-10 codes in the dataset, as shown in Table 2.

Table 2

Top 10 most prevalent ICD-10 codes in the dataset

ICD-10 code Diagnosis Count
E785 Hyperlipidemia, unspecified 44,044
I10 Essential (primary) hypertension 43,574
Z87891 Personal history of nicotine use 36,299
K219 Gastroesophageal reflux disease without esophagitis 30,803
F329 Major depressive disorder, single episode, unspecified 23,231
I2510 Atherosclerotic heart disease of native coronary artery without angina pectoris 22,609
N179 Acute kidney failure, unspecified 19,706
F419 Anxiety disorder, unspecified 19,155
Z7901 Long-term (current) use of anticoagulants 15,323
Z794 Long-term (current) use of insulin 15,277

ICD, International Classification of Disease.

With Python code, for each of the top 10 most prevalent codes, we randomly selected 200 notes and divided them into two groups of 100 notes each: a training group and a testing group. Thus, a total of 2,000 notes was selected, 1,000 for fine-tuning and 1,000 for testing. Figure 1 displays a summary of the dataset preparation; a sketch of this selection step follows the figure.

Figure 1 Preparation of the dataset. MIMIC-IV, Medical Information Mart for Intensive Care; ICD, International Classification of Disease.
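Continuing the pandas sketch above, the selection and splitting step might look like the following. The “text” column name and the fixed random seed are illustrative assumptions, not details from the study.

    from collections import Counter

    # Count prevalence across all notes to find the ten most common ICD-10 codes
    code_counts = Counter(code for codes in dataset["icd10_codes"] for code in codes)
    top_10 = [code for code, _ in code_counts.most_common(10)]

    # For each top code, randomly sample 200 matching notes and split them 100/100
    train_sets, test_sets = {}, {}
    for code in top_10:
        matching = dataset[dataset["icd10_codes"].apply(lambda codes: code in codes)]
        sample = matching.sample(n=200, random_state=42)  # arbitrary seed for reproducibility
        train_sets[code] = sample["text"].iloc[:100].tolist()
        test_sets[code] = sample["text"].iloc[100:].tolist()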

The model

OpenAI has several GPT models available to the public via their Application Programming Interface (API). The latest is GPT-4; however, as of this writing, fine-tuning GPT-4 is available only on an experimental basis to a limited number of users. We therefore used the GPT-3.5 Turbo model, the latest OpenAI model that can be fine-tuned.

Evaluating the base model

For each of the top 10 ICD-10 codes, we passed the notes from the testing dataset one by one to the GPT-3.5 Turbo model via the OpenAI API and prompted it to assign a list of ICD-10 codes based on the information in each note. We then checked whether the target ICD-10 code was present in the returned response.
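A minimal sketch of this evaluation loop is shown below, assuming the openai Python client’s chat interface. The prompt wording is illustrative; the study’s exact prompt and parameters may differ.

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def assign_codes(note_text, model="gpt-3.5-turbo"):
        """Ask the model for a list of ICD-10 codes for one discharge summary."""
        response = client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": f"Assign ICD-10 codes to the following discharge summary: {note_text}",
            }],
        )
        return response.choices[0].message.content

    def accuracy_for_code(target_code, test_notes, model="gpt-3.5-turbo"):
        """Fraction of test notes whose returned response contains the target code."""
        hits = sum(target_code in assign_codes(note, model) for note in test_notes)
        return hits / len(test_notes)

    # Per-code accuracy of the base model over the test split built earlier
    base_accuracy = {code: accuracy_for_code(code, test_sets[code]) for code in test_sets}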

Model fine-tuning and evaluation

Once the base model evaluation was completed, we used the remaining 1,000 notes in the training dataset (100 notes for each of the top 10 most common ICD-10 codes) to fine-tune the model, using the web-based fine-tuning methodology offered by OpenAI. The training data were prepared in the format specified by OpenAI and uploaded to their server. Each data point in the fine-tuning dataset is labelled with a “prompt”, representing the input (in this case, the instructions for the model together with a discharge summary note), and an “output”, representing the expected response (in this case, the list of ICD-10 codes assigned to the provided note). Table 3 shows an example of a data point in the fine-tuning dataset; a programmatic sketch of the file preparation follows the table.

Table 3

A sample data point in the fine-tuning dataset

Prompt Response
“Assign ICD-10 codes to the following discharge summary: {Discharge Summary}*” “[‘M1612’, ‘E871’, ‘I10’, ‘Z96641’, ‘I482’, ‘Z7901’, ‘E785’, ‘K219’, ‘F17290’, ‘R339’, ‘R42’]”

*, a placeholder representing a discharge summary note. The full note is not displayed to save space. ICD, International Classification of Disease.
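Although we used OpenAI’s web-based workflow, the file preparation can be sketched programmatically as follows. The chat-style “messages” schema below follows OpenAI’s documented fine-tuning format for GPT-3.5 Turbo and should be treated as an assumption for this sketch, since its labels differ from the prompt/output terminology above.

    import json

    def build_finetune_file(pairs, path="finetune_train.jsonl"):
        """Write (note_text, codes) training pairs as one JSON record per line (JSONL)."""
        with open(path, "w") as f:
            for note_text, codes in pairs:
                record = {
                    "messages": [
                        {"role": "user",
                         "content": f"Assign ICD-10 codes to the following discharge summary: {note_text}"},
                        {"role": "assistant", "content": str(codes)},
                    ]
                }
                f.write(json.dumps(record) + "\n")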

Once fine-tuning was completed, the custom fine-tuned model was evaluated with the testing dataset, and data were recorded using the same methodology as for the base model. Figure 2 shows a summary of the model evaluation flow.

Figure 2 Evaluation of the base model, fine-tuning, and evaluation of the fine-tuned model. GPT, Generative Pretrained Transformer; ICD, International Classification of Disease.
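Continuing the earlier sketches, evaluating the fine-tuned model amounts to re-running the same loop with the new model identifier. The identifier below is a hypothetical placeholder; OpenAI returns the real one when the fine-tuning job completes.

    # Hypothetical placeholder; the real identifier has the form "ft:gpt-3.5-turbo-...".
    FINE_TUNED_MODEL = "ft:gpt-3.5-turbo-0125:example-org::placeholder"

    fine_tuned_accuracy = {
        code: accuracy_for_code(code, test_sets[code], model=FINE_TUNED_MODEL)
        for code in test_sets
    }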

This study did not enroll individual participants nor use identifiable private information. The Penn State Health Institutional Review Board granted an exemption for informed consent.


Results

Without fine-tuning, the target ICD-10 code was present in the model’s assigned list of ICD-10 codes 29.7% of the time. The fine-tuned model performed more than twice as well, with the target ICD-10 code present in its assigned list 62.6% of the time. Figure 3 summarizes the results.

Figure 3 Model performance, base vs. fine-tuned.

Performance improved for all codes, with absolute gains ranging from 5% to 52%; fine-tuned accuracy for individual codes ranged from 33% to 82%. Table 4 shows the base model’s as well as the fine-tuned model’s accuracy for each code.

Table 4

A comparison of the models’ accuracy for each code as well as summary statistics

Index Code Base model accuracy (%) Fine-tuned model accuracy (%) Absolute improvement (%)
0 E785 39 69 30
1 I10 59 82 23
2 Z87891 10 47 37
3 K219 33 74 41
4 F329 35 75 40
5 I2510 56 61 5
6 N179 12 58 46
7 F419 15 67 52
8 Z7901 23 33 10
9 Z794 15 60 45
Max 59 82
Min 10 33
Mean 29.7 62.6

Discussion

Training a specialized model for such tasks is resource-intensive, involving the collection and preparation of extensive datasets. Fine-tuning presents a viable alternative. This approach takes a pre-trained model, which has already learned features from a large, diverse dataset, and adjusts it to perform a specific task by providing it with a task-specific dataset, including the expected outputs. Fine-tuning can thus significantly improve task-specific performance, as demonstrated by our work, which doubled the accuracy of the model.

Fine-tuning offers several other benefits:

  • It requires far less data than training a model from scratch. This is particularly advantageous in healthcare, where acquiring large amounts of data can be challenging due to privacy concerns and the rarity of certain medical conditions.
  • Since the model has already learned the basic patterns and features, fine-tuning for a specific task requires much less computational time and resources.
  • Fine-tuning allows the model to maintain general capabilities while gaining proficiency in a specific task.

Cloud-based, publicly available AI models offer an option that is accessible and cost-effective compared with developing, training, and deploying a custom model from scratch. This has led to wide adoption of these models in various sectors, including healthcare. We are seeing a wave of AI-based applications in healthcare, mostly powered by third-party cloud-based models offered via APIs. These models offer significant advantages, such as scalability, reduced infrastructure cost, and ease of integration.

The performance of these models on specialized healthcare-related tasks like assigning ICD codes to clinical notes remains subpar. Soroush et al. (31) demonstrated that prompting GPT-3.5 and GPT-4 via the ChatGPT interface with descriptions of ICD-10 codes predicted the correct ICD-10 codes only 10% (GPT-3.5) and 13% (GPT-4) of the time. Boyle et al. (27) observed similar results. Healthcare tasks require an understanding of medical terminology and context that generic AI models might not possess. The GPT models have been trained on a large dataset obtained from the internet. One would assume that these data contain medical content as well, drawn from openly accessible journals, user posts on open forums, and websites like Wikipedia and Medscape. However, there is no room for error when working with real patient data: the data used to train these models have not been vetted for medical accuracy, and the model output may not always be accurate. Our study, using a combination of specific EHR data, AI, and fine-tuning, suggests that we can improve the accuracy of ICD-10 coding, even with the less mature GPT-3.5.

Furthermore, healthcare data are inherently private, and data security is a major concern. Using a model hosted by a third party risks exposure of Protected Health Information (PHI). Before implementing such technologies, one must therefore ensure that the security of PHI remains uncompromised. This requires deploying models locally within the healthcare institution to maintain control over data security, strict business associate and data usage agreements with any external parties, absolute restrictions on selling or otherwise sharing the data with other entities, and insistence that technology partners verify their compliance with data security standards.

We acknowledge the limitations of our study. First, we used a specific dataset, the MIMIC-IV. Our observations may not generalize to other datasets.

Another limitation is the extent of the model’s fine-tuning, which was done with a predetermined number of notes for each of the ten most prevalent codes. The number was chosen arbitrarily; a larger number of notes could enhance performance.

The task of ICD code assignment to clinical notes comes with inherent challenges. The linguistic similarities between different codes can lead to complexities in accurate assignment, and the subjective nature of code assignment by different experts can result in varying sets of codes for the same clinical note. Furthermore, the content of the model’s response may vary even when the exact same prompt and summary are passed to it. A direct comparison of the model-assigned codes with the manually assigned codes can thus be very challenging. Keeping this in mind, adopting the methodology of Huang et al. (3) for model evaluation offers a structured and logical way to navigate this challenge. This also means that the model will still likely miss some codes and assign wrong ones, but with fine-tuning, accuracy improves significantly, as shown by our study.

The AI landscape is rapidly evolving, and more LLMs are being released. Companies like Google have announced LLMs specifically trained on healthcare data, but their availability is currently limited to a select group of users. It would be interesting to evaluate the performance of such models on various healthcare-related tasks, including the assignment of ICD codes.


Conclusions

This study presents an evaluation of the OpenAI GPT-3.5 Turbo model for the assignment of ICD-10 codes to clinical notes. We demonstrated that the model’s baseline performance on this healthcare-related task is inadequate; however, fine-tuning can markedly enhance performance.

Our study further illuminates the broader implications of adopting publicly available and affordable AI models, emphasizing the importance of fine-tuning these pre-trained models to meet the unique demands of healthcare tasks like medical coding.

Future work evaluating the amount of data required to fine-tune such models for optimal performance would be revealing, as would evaluating models trained on healthcare data for healthcare-associated tasks.


Acknowledgments

Funding: None.


Footnote

Peer Review File: Available at https://jmai.amegroups.com/article/view/10.21037/jmai-24-60/prf

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://jmai.amegroups.com/article/view/10.21037/jmai-24-60/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This study did not enroll individual participants nor use identifiable private information. The Penn State Health Institutional Review Board granted an exemption for informed consent.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.


References

  1. Hirsch JA, Nicola G, McGinty G, et al. ICD-10: History and Context. AJNR Am J Neuroradiol 2016;37:596-9. [Crossref] [PubMed]
  2. Cartwright DJ. ICD-9-CM to ICD-10-CM Codes: What? Why? How? Adv Wound Care (New Rochelle) 2013;2:588-92. [Crossref] [PubMed]
  3. Huang J, Osorio C, Sy LW. An empirical evaluation of deep learning for ICD-9 code assignment using MIMIC-III clinical notes. Comput Methods Programs Biomed 2019;177:141-53. [Crossref] [PubMed]
  4. Introducing OpenAI. [cited 2024 Jan 18]. Available online: https://openai.com/blog/introducing-openai
  5. Roumeliotis KI, Tselikas ND. ChatGPT and Open-AI Models: A Preliminary Review. Future Internet 2023;15:192. [Crossref]
  6. Wu T, He S, Liu J, et al. A Brief Overview of ChatGPT: The History, Status Quo and Potential Future Development. IEEE/CAA J Autom Sinica 2023;10:1122-36.
  7. Campanella P, Lovato E, Marone C, et al. The impact of electronic health records on healthcare quality: a systematic review and meta-analysis. Eur J Public Health 2016;26:60-4. [Crossref] [PubMed]
  8. Pakhomov SV, Buntrock JD, Chute CG. Automating the assignment of diagnosis codes to patient encounters using example-based and machine learning techniques. J Am Med Inform Assoc 2006;13:516-25. [Crossref] [PubMed]
  9. Fine-tuning - OpenAI API. [cited 2024 Apr 11]. Available online: https://platform.openai.com/docs/guides/fine-tuning
  10. Väänänen A, Haataja K, Vehviläinen-Julkunen K, et al. AI in healthcare: A narrative review. F1000Research 2021;10:6. [Crossref]
  11. Desai P, Eljazzar R. Post-Implementation Cost-Analysis of the ICD-10-CM Transition on Small and Medium-Sized Medical Practices. Journal of Health & Medical Economics 2018;4:4. [Crossref]
  12. Fung KW, Xu J, McConnell-Lamptey S, et al. Feasibility of replacing the ICD-10-CM with the ICD-11 for morbidity coding: A content analysis. J Am Med Inform Assoc 2021;28:2404-11. [Crossref] [PubMed]
  13. 86% of Physicians Say ICD-10 Diverts Focus from Patient Care. [cited 2024 Jan 22]. Available online: https://revcycleintelligence.com/news/86-of-physicians-say-icd-10-diverts-focus-from-patient-care
  14. Mihailovic N, Kocic S, Jakovljevic M. Review of Diagnosis-Related Group-Based Financing of Hospital Care. Health Serv Res Manag Epidemiol 2016;3:2333392816647892. [Crossref] [PubMed]
  15. ICD-10-CM and CPT® Coding Mistakes Can Cost You – And not Just Financially – MedLearn Publishing. [cited 2024 Jan 22]. Available online: https://medlearn.com/icd-10-cm-and-cpt-coding-mistakes-can-cost-you-and-not-just-financially/
  16. Institute of Medicine. Reliability of Medicare Hospital Discharge Records: Report of a Study. Reliability of Medicare Hospital Discharge Records. 1977 Jan 1. Available online: https://nap.nationalacademies.org/read/9930/chapter/1
  17. Hsia DC, Krushat WM, Fagan AB, et al. Accuracy of diagnostic coding for Medicare patients under the prospective-payment system. N Engl J Med 1988;318:352-5. [Crossref] [PubMed]
  18. Fung KW, Xu J, Rosenbloom ST, et al. Using SNOMED CT-encoded problems to improve ICD-10-CM coding-A randomized controlled experiment. Int J Med Inform 2019;126:19-25. [Crossref] [PubMed]
  19. Lang D. Consultant report: natural language processing in the health care industry. Cincinnati Children’s Hospital Medical Center; 2007. [cited 2024 Jan 22]. Available online: https://scholar.google.com/scholar_lookup?journal=Cincinnati+Children%E2%80%99s+Hospital+Medical+Center&title=Consultant+report-natural+language+processing+in+the+health+care+industry&author=D+Lang&volume=Winter&issue=6&publication_year=2007&
  20. Ramalho A, Souza J, Freitas A. The use of artificial intelligence for clinical coding automation: A bibliometric analysis. In: Dong Y, Herrera-Viedma E, Matsui K, et al. editors. Distributed Computing and Artificial Intelligence, 17th International Conference. DCAI 2020. Advances in Intelligent Systems and Computing, vol 1237. Springer, Cham; 2021:274-83.
  21. Bossen C, Pine KH. Batman and Robin in Healthcare Knowledge Work: Human-AI Collaboration by Clinical Documentation Integrity Specialists. ACM Transactions on Computer-Human Interaction 2023;30:1-29. [Crossref]
  22. Kaur R, Ginige JA. Analysing Effectiveness of Multi-Label Classification in Clinical Coding. ACM International Conference Proceeding Series. 2019 Jan 29 [cited 2024 Jan 18]; Available online: https://dl.acm.org/doi/10.1145/3290688.3290728
  23. Kaur R, Ginige JA, Obst O. AI-based ICD coding and classification approaches using discharge summaries: A systematic literature review. Expert Systems with Applications 2023;213:118997. [Crossref]
  24. Masud JHB, Kuo CC, Yeh CY, et al. Applying Deep Learning Model to Predict Diagnosis Code of Medical Records. Diagnostics (Basel) 2023;13:2297. [Crossref] [PubMed]
  25. Wang C, Yao C, Chen P, et al. Artificial Intelligence Algorithm with ICD Coding Technology Guided by Embedded Electronic Medical Record System in Medical Record Information Management. Microprocessors and Microsystems 2023;104962. [Crossref]
  26. Ong J, Kedia N, Harihar S, et al. Applying large language model artificial intelligence for retina International Classification of Diseases (ICD) coding. J Med Artif Intell 2023;6:21. [Crossref]
  27. Boyle JS, Kascenas A, Lok P, et al. Automated clinical coding using off-the-shelf large language models. arXiv:2310.06552 [Preprint]. 2023 [cited 2024 Jan 19]. Available online: https://arxiv.org/abs/2310.06552v3
  28. MIMIC-IV v2.2. [cited 2024 Jan 30]. Available online: https://physionet.org/content/mimiciv/2.2/#files-panel
  29. MIMIC-IV-Note: Deidentified free-text clinical notes v2.2. [cited 2024 Jan 30]. Available online: https://physionet.org/content/mimic-iv-note/2.2/note/#files-panel
  30. License Content. [cited 2024 Apr 12]. Available online: https://physionet.org/about/licenses/physionet-credentialed-health-data-license-150/
  31. Soroush A, Glicksberg BS, Zimlichman E, et al. Assessing GPT-3.5 and GPT-4 in Generating International Classification of Diseases Billing Codes. 2023 Jul 9 [cited 2024 Jan 30]; Available online: https://europepmc.org/article/ppr/ppr688592
doi: 10.21037/jmai-24-60
Cite this article as: Nawab K, Fernbach M, Atreya S, Asfandiyar S, Khan G, Arora R, Hussain I, Hijjawi S, Schreiber R. Fine-tuning for accuracy: evaluation of Generative Pretrained Transformer (GPT) for automatic assignment of International Classification of Disease (ICD) codes to clinical documentation. J Med Artif Intell 2024;7:8.
