Original Article

Large language models performance on pediatrics question: a new challenge

Gianluca Mondillo, Alessandra Perrotta, Vittoria Frattolillo, Simone Colosimo, Cristiana Indolfi, Michele Miraglia del Giudice, Francesca Rossi

Department of Woman, Child and of General and Specialized Surgery, AOU University of Campania “Luigi Vanvitelli”, Naples, Italy

Contributions: (I) Conception and design: G Mondillo, V Frattolillo, S Colosimo, A Perrotta; (II) Administrative support: M Miraglia del Giudice, F Rossi; (III) Provision of study materials or patients: M Miraglia del Giudice, C Indolfi; (IV) Collection and assembly of data: M Miraglia del Giudice, C Indolfi; (V) Data analysis and interpretation: G Mondillo, V Frattolillo, S Colosimo, A Perrotta, F Rossi; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Gianluca Mondillo, MD. Department of Woman, Child and of General and Specialized Surgery, AOU University of Campania “Luigi Vanvitelli”, Via Luigi De Crecchio 4, 80138 Naples, Italy. Email: gianluca.mondillo@gmail.com.

Background: This study investigates the application and efficacy of large language models (LLMs) in pediatric medicine, focusing on their capability to assist in training and decision support for healthcare professionals. Given the unique challenges in pediatric care, such as age-specific conditions and dosing requirements, we aim to evaluate the performance of various LLMs in this specialized field.

Methods: We conducted a comparative analysis of several LLMs, including Claude 3-OPUS, ChatGPT 3.5, ChatGPT 4, Gemini 1.0, Llama 2 70B, and Mixtral 8x7B. The models were tested on 227 multiple-choice pediatric questions in Italian before and after undergoing specialized training. The training data consisted of pediatric articles from a medical journal, ensuring compliance with HIPAA regulations by using de-identified and anonymous data.

Results: The performance of the LLMs varied significantly. ChatGPT 3.5 improved from 65.20% to 83.70% accuracy (P<0.01) after training, while ChatGPT 4 increased from 77.09% to 91.62% (P<0.01). Gemini 1.0 and Mixtral 8x7B recorded pre-training accuracies of 70.48% and 71.37% respectively; both improved post-training (to 78.41% and 78.86%), although these gains did not reach statistical significance (P=0.06 and P=0.08). Llama 2 70B had the lowest performance, improving from 47.58% to 52.86%. Claude 3-OPUS demonstrated robust performance with 82.82% accuracy pre-training, improving to 95.59% post-training.

Conclusions: Our analysis confirms the effectiveness of LLMs in pediatric medicine, highlighting their potential in training and decision support. The study emphasizes the need for specific training datasets that reflect the complexities of pediatric conditions to tailor the models accurately. Moreover, there is a significant opportunity to utilize open-source models as a foundation for developing customized systems through training on dedicated datasets. These strategies promise to enhance the accuracy and accessibility of pediatric healthcare, ultimately improving outcomes for young patients.

Keywords: Artificial intelligence (AI); large language model (LLM); pediatrics; benchmarking


Received: 04 June 2024; Accepted: 14 September 2024; Published online: 13 November 2024.

doi: 10.21037/jmai-24-174


Highlight box

Key findings

• The study showed that large language models (LLMs) can significantly enhance their performance in pediatric medicine after specific training. Claude 3-OPUS achieved the highest accuracy, increasing from 82.82% to 95.59% post-training. ChatGPT 4 and ChatGPT 3.5 also improved notably, from 77.09% to 91.62% and from 65.20% to 83.70%, respectively. Conversely, Llama 2 70B had the lowest improvement, going from 47.58% to 52.86%.

What is known and what is new?

• LLMs are powerful artificial intelligence tools that can understand and generate natural language using large numbers of parameters. Their effectiveness varies significantly depending on the task and training data quality. This study introduces new insights by demonstrating that pediatric-specific training data can drastically improve these models’ performance in a specialized medical field.

What is the implication, and what should change now?

• This study’s implications are crucial for training and supporting pediatric healthcare professionals. Including pediatric-specific data in LLM training sets is essential for enhancing accuracy and utility. LLM manufacturers and medical institutions need to collaborate to create comprehensive, specialized training datasets. Additionally, using open-source models customized with dedicated data offers a significant opportunity to develop systems tailored to pediatric needs, improving health outcomes for young patients.


Introduction

A large language model (LLM) is defined as an artificial intelligence (AI) model that uses a large number of parameters to understand and generate natural language, often trained on enormous text datasets. These models are primarily based on the Transformer architecture, which enables parallel and independent calculation of each word in a sentence or document, assigning an “attention score” to model the influence of each word on others (1,2). This explains their ability to handle a wide range of natural language processing tasks, including text comprehension, text generation, translation, and more. The effectiveness of these models is often enhanced through techniques like fine-tuning, which adapts the model to specific tasks or data (3).
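As a simplified illustration of the attention mechanism described above, the toy example below computes scaled dot-product attention for a short sequence of token vectors (single head, no learned projection matrices; real Transformers use multi-head attention with learned weights and optimized kernels). It is a minimal sketch rather than a faithful reproduction of any production model.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Toy scaled dot-product attention.

    Q, K, V: arrays of shape (sequence_length, d_model).
    Returns the attended output and the attention weight matrix.
    """
    d_k = K.shape[-1]
    # Raw attention scores: how strongly each token attends to every other token
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax turns scores into attention weights that sum to 1 per token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output vector is a weighted mixture of all value vectors
    return weights @ V, weights

# Three tokens with 4-dimensional embeddings (random, for illustration only)
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
output, attn = scaled_dot_product_attention(x, x, x)
print(attn.round(2))  # each row is one token's attention distribution
```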

The ability of an LLM to generate and understand text is heavily influenced by the training data used and by the number of model parameters. Training data are crucial because they provide the information from which the model learns to understand and generate language: if an LLM is trained on a vast dataset that includes a wide variety of texts, from scientific articles to literature and everyday dialogue, it will be able to handle a broader range of queries and provide more appropriate results. Parameters, in turn, are essentially the internal configurations the model uses to decide what to write and how to write it. A model with a greater number of parameters can, in theory, capture more complex and subtle language relationships, making it more precise and versatile. For example, the transition from GPT-2 to GPT-3 increased the number of parameters from 1.5 billion (4) to 175 billion, significantly improving the model’s ability to generate coherent and appropriate texts (5).

Below is an overview of the various LLMs considered, all of which offer a chatbot interface:

  • ChatGPT 3.5 and ChatGPT 4 (6)
    • ChatGPT 3.5 (OpenAI, San Francisco, 2022) is a transitional model developed by OpenAI, positioned between GPT-3 and GPT-4 in terms of capabilities. Although ChatGPT 3.5 is less powerful than GPT-4, it offers superior performance compared to the original GPT-3. The knowledge cutoff date for ChatGPT 3.5 is December 2022.
    • ChatGPT 4 (OpenAI, San Francisco, 2023): this model, released by OpenAI in 2023, is an advanced version of ChatGPT 3.5 with significant improvements in text understanding and generation of more coherent and contextualized responses. It has over 175 billion parameters, making it one of the largest and most sophisticated language models at the time of its release. April 2023 is the knowledge cutoff date. It’s important to note that GPT-3 and GPT-4 are distinct AI models from ChatGPT 3.5 and ChatGPT 4, which are specific implementations of the respective GPT models in the form of a chatbot.
  • Gemini (Google, California, 2023)
    • Gemini: a model developed by Google, known for its ability to handle a wide range of textual and audiovisual inputs to process requests. There is not much publicly available information published on the exact number of parameters or the specific architecture of the Gemini model, as Google does not disclose the technical specifications of its models in detail due to its internal policies (7).
  • Mixtral 8x7B (MistralAI, France, 2023)
    • Mixtral 8x7B: developed as part of an open-source initiative, Mixtral is known for its scalable configuration and training efficiency. It is a Mixture-of-Experts model with 47 billion parameters (8) and is praised for its energy efficiency and its ability to perform natural language processing tasks at reduced operational costs (9).
  • Llama 2 70B (Meta, California, 2023)
    • Llama 2 70B: a model developed by Meta (Facebook), with 70 billion parameters. LLaMA is known for its adaptability to different tasks and languages with lower requirements for specific domain data compared to other models of similar sizes (10).
  • Claude 3-OPUS (Anthropic, San Francisco, 2024)
    • Claude 3-OPUS: this model is developed by Anthropic (11) and represents an evolution in AI models focused on security and ethical understanding. Claude 3-OPUS is designed to generate responses that are not only accurate from an informational standpoint but also reflect ethical and security considerations. August 2023 is the knowledge cutoff date. Additionally, Claude 3 has shown signs of meta-cognitive reasoning, including the ability to realize it is being artificially tested during needle in a haystack evaluation (12).

Numerous studies have evaluated the capabilities of LLMs in knowledge-based fields, such as medicine, based on their ability to answer multiple-choice questions. Comparing different LLMs to understand which models are more effective at specific tasks is carried out through benchmarking. This process can be imagined as a “standardized test” for AI models, objectively measuring the algorithm’s performance (13). Benchmarking uses datasets designed specifically for evaluating tasks such as text comprehension, information recognition, text generation, and the understanding of question-answer pairs. The results of these tests help us understand how well a model can handle various types of information and languages (14). Examples of medical multiple-choice question (MCQ) datasets are openlifescienceai/medmcqa (15) and bigbio/med_qa (16), which can be found on HuggingFace, one of the world’s largest repositories of datasets for training and benchmarking (17).
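For illustration, the sketch below (in Python, using the HuggingFace datasets library) loads one of the MCQ benchmarks cited above, openlifescienceai/medmcqa (15), and formats a single item as a multiple-choice prompt. The field names used here (question, opa–opd, cop) follow the dataset card at the time of writing and should be verified before use; this snippet is not part of the benchmarking pipeline of the present study.

```python
from datasets import load_dataset  # pip install datasets

# Load the validation split of a public medical MCQ benchmark (reference 15)
medmcqa = load_dataset("openlifescienceai/medmcqa", split="validation")

def to_prompt(item):
    """Turn one benchmark item into a multiple-choice prompt and its answer key."""
    options = [item["opa"], item["opb"], item["opc"], item["opd"]]
    lines = [item["question"]] + [f"{letter}. {text}" for letter, text in zip("ABCD", options)]
    return "\n".join(lines), "ABCD"[item["cop"]]  # cop = index of the correct option

prompt, answer = to_prompt(medmcqa[0])
print(prompt)
print("Correct answer:", answer)
```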

Although LLMs have shown great potential in various fields of medicine, there is a clear need to test them more thoroughly in the pediatric field to verify their level of knowledge of the subject. Pediatrics differs significantly from adult medicine both in the prevalence of diseases and in the anthropometric and laboratory values that are considered normal or pathological at different ages. Furthermore, pharmacological therapy in pediatrics is based on the concept of dose per kg: the amount of a drug, if it is authorized for pediatric use, is calculated from the child’s weight. To improve the accuracy of LLMs in this sector, greater attention is needed in the pretraining data, with the inclusion of specific pediatric data.

With this work, we aim to explore the limits in the knowledge of different LLMs in the pediatric field.


Methods

In this study, we analyzed the ability of various LLMs, both closed and open-source, to accurately answer multiple-choice questions sourced from an Italian pediatric journal. All questions, answers, and training articles were written by experienced pediatricians. The models compared in this study were ChatGPT 3.5, ChatGPT 4, Gemini 1.0, Mixtral 8x7B, Llama 2 70B, and Claude 3-OPUS. We assigned a value of 1 to each correct answer and a value of 0 to each wrong answer. True/false questions, where present, were removed from the dataset, leaving a total of 227 questions. No question allowed more than one correct answer. The prompt used for submitting questions was as follows:

“You will be presented with multiple-choice questions. Answer according to your knowledge by indicating the letter of the answer you consider correct. For example:

1. A

2. B

3. E”

Furthermore, we evaluated the capabilities of these models by determining the percentage of correct answers for each medical topic in pediatrics. Additionally, we compared the rate of correct answers after providing the texts from which the questions were derived, a procedure we will refer to as “training”. With ChatGPT 4 and Claude 3, we uploaded the PDF file to the chat interface and submitted questions related to that text using the previously indicated prompt. For ChatGPT 3.5, Gemini 1.0, Mixtral, and Llama 2 70B, since direct file upload was not possible, we copied and pasted the article text using the following prompt:

“Read the following text [article], you will be asked multiple-choice questions related to the text. Answer with the letter of the response you deem correct.”

We ensured that the article did not include any direct hints about the correct answers. We presented the questions individually, one at a time. During the training phase, we provided the article content and asked the questions one at a time within the same chat, using one chat per article to minimize issues related to the context window. We then compared the answers marked as correct by the individual models with the solutions provided by the journal. No model ever failed to provide an answer or gave more than one answer.
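As an illustration of the scoring step (1 for a correct answer, 0 for a wrong one), the pandas sketch below computes accuracy per model and phase and the correct/incorrect counts per topic. The file name and column names are hypothetical and serve only to show the logic; the actual comparison against the journal’s answer key was performed as described above.

```python
import pandas as pd

# Hypothetical layout: one row per question per model per phase, with the
# journal's answer key ("key") and the model's letter answer ("answer").
df = pd.read_csv("answers.csv")  # assumed columns: question_id, topic, key, model, phase, answer

# Score 1 for a correct answer, 0 for a wrong one
df["correct"] = (
    df["answer"].str.strip().str.upper() == df["key"].str.strip().str.upper()
).astype(int)

# Accuracy per model, pre- vs post-training (as in Table 1)
accuracy = df.groupby(["model", "phase"])["correct"].mean().mul(100).round(2)
print(accuracy)

# Correct and incorrect answers per topic (as in Table 2)
per_topic = df.groupby(["model", "phase", "topic"])["correct"].agg(correct="sum", total="count")
per_topic["incorrect"] = per_topic["total"] - per_topic["correct"]
print(per_topic)
```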

The topics covered by the questions included the following: Allergology, Cardiology, Dermatology, Hematology, Pharmacology, Gastroenterology, Genetics, Miscellaneous, Nephro-Urology, Neonatology, Neurology, Neuropsychiatry, Orthopedics, Otorhinolaryngology, Pulmonology, and Rheumatology. The Miscellaneous category was created for questions that did not fit neatly into a specific category or represented topics that would have generated categories with only one subject. This ensured a more streamlined and organized classification of the questions.

In our study, we also considered the “context window” of each model, which is the maximum amount of text, in tokens, that an LLM can take into account at once when generating a response. In other words, it is the length of text the model can “see” and use as context to understand and answer questions or to continue a text. This was considered particularly for the open-source LLMs: for instance, Llama 2 70B can handle 4,096 tokens simultaneously, while Mixtral can handle 32,000 tokens. An article provided by us averages around 3,500 tokens (18). All datasets are available in our GitHub repository (19).
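Token counts can be estimated before submitting an article, to check whether it fits within a given context window. The sketch below uses OpenAI’s tiktoken tokenizer as an example; other models (Llama 2, Mixtral, Gemini, Claude) use their own tokenizers, so the figures are approximations rather than exact counts for those models.

```python
import tiktoken  # pip install tiktoken

def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Approximate token count using an OpenAI tokenizer (an assumption;
    non-OpenAI models tokenize differently)."""
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))

with open("article.txt", encoding="utf-8") as f:
    article = f.read()

n_tokens = count_tokens(article)   # articles in this study averaged ~3,500 tokens
print(f"{n_tokens} tokens")
print("Fits Llama 2 70B (4,096 tokens)?", n_tokens < 4096)
print("Fits Mixtral 8x7B (32,000 tokens)?", n_tokens < 32000)
```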

Statistical analysis

The statistical significance of the results before and after the training of each model was calculated using the Chi-square test, utilizing the Python (Version 3.12.3) libraries Pandas (Version 2.2.2) and SciPy (Version 1.13.0).

Statistical significance was set at a P value of less than 0.01.
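For a single model, the comparison can be sketched as a 2×2 contingency table of correct and incorrect answers before and after training, tested with scipy.stats.chi2_contingency. The example below uses the ChatGPT 3.5 counts from Table 1 and is a minimal sketch of the analysis rather than the full script.

```python
from scipy.stats import chi2_contingency

def pre_post_chi2(correct_pre: int, correct_post: int, total: int = 227):
    """Chi-square test on a 2x2 table of correct/incorrect answers, pre vs post training."""
    table = [
        [correct_pre,  total - correct_pre],    # pre-training:  correct, incorrect
        [correct_post, total - correct_post],   # post-training: correct, incorrect
    ]
    chi2, p, dof, expected = chi2_contingency(table)
    return chi2, p

# ChatGPT 3.5: 148/227 correct before training, 190/227 after (Table 1)
chi2, p = pre_post_chi2(148, 190)
print(f"chi2 = {chi2:.2f}, P = {p:.2e}")  # significant at the P<0.01 threshold
```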


Results

The ability of the different models to answer the test questions was evaluated using our dataset.

Table 1 summarizes the results, comparing the performance of the LLMs before and after training.

Table 1

Enhancement in artificial intelligence model performance before and after training

Model Pre-education (correct/total) Post-education (correct/total) Increment P
ChatGPT 3.5 65.20% (148/227) 83.70% (190/227) 18.5% (+42) <0.01
ChatGPT 4 77.09% (175/227) 91.62% (208/227) 14.53% (+33) <0.01
Gemini 1.0 70.48% (160/227) 78.41% (178/227) 7.93% (+18) 0.06
Llama 2 70B 47.58% (108/227) 52.86% (120/227) 5.28% (+12) 0.30
Mixtral 8x7B 71.37% (162/227) 78.86% (179/227) 7.49% (+17) 0.08
Claude 3-OPUS 82.82% (188/227) 95.59% (217/227) 12.77% (+29) <0.01

This table presents the accuracy percentages of the different models before and after the educational intervention, highlighting the improvement (increment) in their performance. The models include ChatGPT 3.5, ChatGPT 4, Gemini 1.0, Llama 2 70B, Mixtral 8x7B, and Claude 3-OPUS. The increments, shown in parentheses, indicate the number of additional questions answered correctly after the intervention.

Despite the two models having a roughly similar number of parameters (47 billion versus 70 billion) (9,10), our study found that Mixtral markedly outperformed the Llama 2 70B model. Interestingly, in our tests, Mixtral even surpassed ChatGPT 3.5 before training (71.37% versus 65.20%).

Among OpenAI’s models, as expected, ChatGPT 4 showed better results than its predecessor ChatGPT 3.5. The larger number of parameters and the greater amount of data used in the pre-training phase can readily explain these results (77.09% versus 65.20%). Nonetheless, ChatGPT 3.5 showed a larger post-training gain than ChatGPT 4 after being exposed to the articles from which the questions were derived, with an increase in correct answers of 18.5% (P value <0.01) compared to 14.53% (P value <0.01) for ChatGPT 4. This highlights the comprehension and memorization capabilities of these two versions of the GPT model.

Claude 3-OPUS, the model from Anthropic, showed a correct response rate of 82.82% before training, the best pre-training result among all the models. After being exposed to the articles, Claude 3-OPUS further improved its performance, reaching an impressive 95.59% of correct responses. This result underscores the model’s ability to quickly learn from new data, with an increase of 12.77% (P value <0.01), which, although lower than the increase seen in ChatGPT 3.5, remains notable.

Gemini achieved a correct response rate of 70.48%, a result almost identical to that of Mixtral. Finally, after being provided with the articles, ChatGPT 4 reached an excellent 91.62% of correct responses, but it did not emerge as the best model in our study.

Analyzing the specific data, ChatGPT 3.5 had a significant increase in correct responses, rising from 65.20% to 83.70%, with an improvement of 18.5%. ChatGPT 4 showed an increase from 77.09% to 91.62%, with an improvement of 14.53%. Gemini 1.0 recorded an improvement from an initial 70.48% to 78.41%, equal to an increase of 7.93%. Llama 2 70B showed a modest improvement, from 47.58% to 52.86%, with an increase of 5.28%. Mixtral 8x7B went from 71.37% to 78.86%, with an increase of 7.49%. Finally, Claude 3-OPUS showed an improvement from 82.82% to 95.59%, with an increase of 12.77%.

These data highlight not only the intrinsic capabilities of the different models but also their ability to learn and improve their performance through exposure to new data. Claude 3-OPUS, with its impressive correct response rate after training, stands out as the best model in our study, demonstrating the power and effectiveness of Anthropic’s models.

Figure 1 shows the confusion matrices with the incorrect and correct answers before and after training for each model.

Figure 1 Confusion Matrices for every model analyzed with pre- and post-training results: (A) ChatGPT 3.5, (B) ChatGPT 4, (C) Claude 3-OPUS, (D) Gemini 1.0, (E) Llama 2 70B, (F) Mixtral 8x7B.

Discussion

The analysis of the results demonstrated that, despite LLMs not being able to learn in real time, they can still be effectively used to support clinical decisions and train healthcare professionals in the pediatric field. This highlights the need for manufacturers to include a greater amount of pediatric data in their models, making them more performant and useful in pediatric medical tasks. The examples of performance improvement post-training underline these models’ ability to quickly learn from new data and adapt to specific clinical contexts.

The learning capabilities of LLMs vary based on the number of parameters and the quality of the data used in the pre-training phase. Models like ChatGPT 4 and Claude are trained not only on public data but also on third-party data, significantly improving their performance. However, open-source models, having fewer parameters and being trained on publicly available data from the Internet, are relatively limited in evaluating complex medical issues. An interesting exception is Mixtral 8x7B which, despite its 47 billion parameters, achieved results comparable to Gemini thanks to its Mixture-of-Experts architecture. This design makes the model more efficient by activating only the experts needed to answer a specific question.
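The routing principle behind a sparse Mixture-of-Experts can be sketched as follows: a gating function scores the experts for each token and only the top-k experts are actually evaluated, so most parameters stay inactive for any given token. The NumPy toy below is purely illustrative (Mixtral’s routing reportedly selects 2 of 8 experts per token inside each Transformer layer (8,9); the real gating network is learned jointly with the experts).

```python
import numpy as np

def top_k_moe(x, experts, gate_weights, k=2):
    """Sparse MoE routing: only the k highest-scoring experts process the token.

    x: token representation, shape (d,)
    experts: list of callables mapping (d,) -> (d,)
    gate_weights: gating matrix, shape (num_experts, d)
    """
    logits = gate_weights @ x                     # one score per expert
    top = np.argsort(logits)[-k:]                 # indices of the k best experts
    probs = np.exp(logits[top]) / np.exp(logits[top]).sum()  # renormalize over selected experts
    # Only k experts run; the remaining parameters are untouched for this token
    return sum(p * experts[i](x) for p, i in zip(probs, top))

# Toy setup: 8 linear "experts", 2 active per token (mirroring Mixtral 8x7B's 2-of-8 routing)
rng = np.random.default_rng(0)
d, n_experts = 16, 8
experts = [(lambda W: (lambda v: W @ v))(rng.normal(size=(d, d))) for _ in range(n_experts)]
gate = rng.normal(size=(n_experts, d))
print(top_k_moe(rng.normal(size=d), experts, gate).shape)  # (16,)
```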

The Llama model achieved significantly lower results compared to the other models studied. This can be attributed to several reasons. Firstly, the context window of Llama, or the maximum amount of text the model can consider simultaneously, is relatively limited compared to other models. Llama can handle only up to 4,096 tokens simultaneously (10), while other models like Mixtral can handle up to 32,000 tokens (9). Additionally, the quality of the training data and the specificity of the pediatric content may not have been sufficient to significantly improve its performance. The combination of these factors likely contributed to Llama’s less performant results.

Various studies have already compared the capabilities of LLMs in answering medical questions, but ours is one of the first to evaluate Claude 3-OPUS on pediatric clinical questions and to use a dataset of pediatric questions in Italian. For example, Wu et al. evaluated the ability of GPT-4 and Claude 2 to answer nephrology questions, with GPT-4 achieving an overall score of 73.3% and Claude 2 a correct response rate of 54.4% (20). In dermatology, ChatGPT 4 outperformed its predecessor ChatGPT 3.5 with a score of 90% (21). However, in gastroenterology, GPT-3 (a predecessor of ChatGPT 3.5) and GPT-4 did not reach the minimum threshold of 70% required to pass the American College of Gastroenterology exam (22).

In cardiology, ChatGPT 4 demonstrated evident sophistication, achieving a perfect score of 100% in providing correct information on questions related to heart failure (23). These results highlight the potential of models to assist in patient education and foresee a future where AI assistants could become an integral part of patient care, especially in the subspecialties of internal medicine.

However, not all models have demonstrated equal capabilities across the various areas of pediatric medicine. As highlighted by Barile et al., ChatGPT 4 showed significant deficiencies in solving pediatric clinical cases, correctly solving only 39% of the presented cases (24). This result is likely due to the poor representation of pediatric data in the pre-training phase. One of the main differences between pediatrics and adult medicine is the existence of age- and sex-specific percentiles for almost every biomedical value. A value considered normal for an adult could be pathological for a child, making the applicability of these tools in daily pediatric practice difficult.

In our study, analysis of the models’ incorrect answers showed that particularly specific questions, such as those on the hematologic parameters used to diagnose anemia, were answered incorrectly by most models. Only four questions were answered incorrectly by all models, both before and after training, indicating areas for improvement. Two of these questions concerned Cough Receptor Hypersensitivity Syndrome (CRHS), a condition discussed primarily in Italian pediatrics but little covered in the international literature. The other two concerned the diagnostic role of red cell distribution width (RDW) in pediatric anemias and the association of progressive dysphagia with achalasia and eosinophilic esophagitis. LLMs tend to make mistakes on pharmacology questions because of the complexity and specificity of the required information, such as dosages and routes of administration, further complicated by the fact that in pediatrics drug dosages are calculated in relation to the patient’s body weight and age (25). The limited representation of pediatric data in the models’ training processes leads to lower accuracy in this area. It is important to note that no model answered a question incorrectly after training that it had previously answered correctly, with the exception of ChatGPT 4, which answered two such questions incorrectly after training.

Table 2 shows the results in terms of correct and incorrect answers of the various models sorted by topic.

Table 2

Correct and incorrect answers per topics

Topics ChatGPT 3.5 ChatGPT 3.5 w/education ChatGPT 4 ChatGPT 4 w/education Gemini 1.0 Gemini 1.0 w/education Llama 2 70B Llama 2 70B w/education Mixtral 8x7B Mixtral 8x7B w/education Claude 3-OPUS Claude 3-OPUS w/education
Allergology 7 [2] 8 [1] 9 [0] 9 [0] 7 [2] 7 [2] 6 [3] 6 [3] 9 [0] 9 [0] 8 [1] 9 [0]
Cardiology 6 [7] 10 [3] 8 [5] 9 [4] 7 [6] 9 [4] 8 [5] 8 [5] 5 [8] 8 [5] 10 [3] 11 [2]
Dermatology 1 [1] 1 [1] 1 [1] 2 [0] 0 [2] 0 [2] 1 [1] 1 [1] 1 [1] 1 [1] 2 [0] 2 [0]
Hematology 8 [5] 10 [3] 8 [5] 9 [4] 10 [3] 10 [3] 5 [8] 7 [6] 7 [6] 8 [5] 12 [1] 12 [1]
Pharmacology 16 [15] 22 [9] 24 [7] 30 [1] 16 [15] 17 [14] 16 [15] 18 [13] 21 [10] 25 [6] 20 [11] 30 [1]
Gastroenterology 20 [9] 24 [5] 21 [8] 26 [3] 21 [8] 23 [6] 13 [16] 16 [13] 20 [9] 21 [8] 25 [4] 28 [1]
Genetics 5 [4] 9 [0] 7 [2] 9 [0] 7 [2] 7 [2] 4 [5] 4 [5] 7 [2] 8 [1] 8 [1] 9 [0]
Miscellaneous 9 [6] 13 [2] 12 [3] 15 [0] 10 [5] 13 [2] 3 [12] 4 [11] 10 [5] 11 [4] 12 [3] 15 [0]
Nephro-Urology 8 [4] 9 [3] 10 [2] 11 [1] 7 [5] 10 [2] 4 [8] 4 [8] 9 [3] 9 [3] 9 [3] 10 [2]
Neonatology 15 [2] 17 [0] 14 [3] 16 [1] 15 [2] 15 [2] 11 [6] 11 [6] 15 [2] 15 [2] 15 [2] 17 [0]
Neurology 8 [2] 10 [0] 6 [4] 9 [1] 7 [3] 9 [1] 5 [5] 6 [4] 10 [0] 10 [0] 8 [2] 10 [0]
Neuropsychiatry 12 [1] 13 [0] 11 [2] 13 [0] 12 [1] 13 [0] 9 [4] 9 [4] 10 [3] 11 [2] 11 [2] 12 [1]
Orthopedics 6 [2] 7 [1] 7 [1] 7 [1] 6 [2] 7 [1] 7 [1] 7 [1] 7 [1] 8 [0] 8 [0] 8 [0]
Otorhinolaryngology 2 [2] 4 [0] 2 [2] 4 [0] 2 [2] 3 [1] 0 [4] 0 [4] 3 [1] 3 [1] 3 [1] 4 [0]
Pulmonology 5 [8] 6 [7] 10 [3] 10 [3] 7 [6] 9 [4] 5 [8] 5 [8] 6 [7] 10 [3] 10 [3] 11 [2]
Rheumatology 20 [9] 27 [2] 25 [4] 29 [0] 26 [3] 26 [3] 11 [18] 14 [15] 22 [7] 22 [7] 27 [2] 29 [0]

This table displays the performance of the various models on different pediatric topics, both before and after the educational intervention. The models evaluated include ChatGPT 3.5, ChatGPT 4, Gemini 1.0, Llama 2 70B, Mixtral 8x7B, and Claude 3-OPUS. Each cell in the table reports the number of correct answers given by the model, with the number of incorrect answers indicated in square brackets. The comparison of results before and after training highlights the models’ improvements in accurately answering questions on topics such as Allergology, Cardiology, Dermatology, Hematology, and Pharmacology, among others. The data are presented as absolute numbers.

The areas where the models showed a higher success rate include rheumatology, neonatology, genetics, and neurology, indicating a good understanding of these subjects by all the LLMs. However, there was less consistency in correct responses for the topic of otorhinolaryngology, with some models performing better than others.

Finally, we found that extracting text from PDF documents can be problematic for LLMs: their formatted, graphical structure can compromise the quality and order of the extracted text. TXT files, being in a simple and linear text format, are better suited for direct insertion into an LLM and do not require the complex pre-processing steps that PDF files may need (26).
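As an illustration of why the file format matters, the sketch below uses the pypdf library (an assumption for illustration; in this study the articles were supplied through each model’s chat interface, not through this code) to extract raw text from a PDF, compared with simply reading a plain-text file.

```python
from pypdf import PdfReader  # pip install pypdf

def pdf_to_text(path: str) -> str:
    """Extract text page by page; multi-column layouts, tables and figures
    often come out reordered or garbled, which is why plain TXT is easier
    to feed to an LLM."""
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

extracted = pdf_to_text("article.pdf")        # may need manual clean-up
with open("article.txt", encoding="utf-8") as f:
    clean_text = f.read()                     # no extraction step required

print(len(extracted), "characters extracted from the PDF")
```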

In conclusion, our study highlights the need to include a greater amount of pediatric data in the training processes of LLMs and the importance of using high-quality data to improve their performance. These developments promise to transform the way medical information is utilized, improving health outcomes for young patients in an innovative and effective manner.


Conclusions

Our analysis confirms the effectiveness of LLMs in pediatric medicine, highlighting their potential in training and decision support for healthcare professionals. This underscores the revolutionary role of AI tools in pediatric care, enhancing the accuracy and accessibility of healthcare.

We chose to use a set of general pediatric questions in Italian for benchmarking because most medical benchmarking is conducted in English. Italian ranks as the 21st most spoken language in the world (27), and benchmarks in this language are rare. With this work, we aim to emphasize the importance of testing LLMs in languages other than English to better reflect their real-world use.

Furthermore, our conclusions highlight two crucial messages in the context of pediatric medicine. First, there is a need for specific training datasets for this sector, reflecting the complexities of pediatric medical conditions to tailor the models. These datasets must comply with HIPAA (Health Insurance Portability and Accountability Act) regulations, ensuring that Protected Health Information (PHI) is either de-identified or handled under stringent data use agreements to protect patient privacy (28). This will improve the accuracy of responses and provide essential support to healthcare professionals.

Second, there is a significant opportunity to use open-source models as a foundation for developing customized systems through training on dedicated datasets. This strategy allows institutions to shape LLMs according to their needs, promoting advanced customization that reflects the specifics of the pediatric sector.

In summary, customizing models with dedicated datasets and utilizing open-source resources are essential for the progress of LLMs in pediatric medicine. These developments promise to transform the utilization of medical information, improving healthcare outcomes for young patients in an innovative and effective manner.


Acknowledgments

During the preparation of this work, the author(s) used ChatGPT/OpenAI, Claude/Anthropic, Llama 2 70B/Meta, Mixtral 8x7B/MistralAI, Gemini/Google in order to test LLMs on pediatric questions. After using this tool/service, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the content of the publication.

Funding: None.


Footnote

Data Sharing Statement: Available at https://jmai.amegroups.com/article/view/10.21037/jmai-24-174/dss

Peer Review File: Available at https://jmai.amegroups.com/article/view/10.21037/jmai-24-174/prf

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://jmai.amegroups.com/article/view/10.21037/jmai-24-174/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. Ethical approval or informed consent is not required as no human subject was involved.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.


References

  1. Minaee S, Mikolov T, Nikzad N, et al. Large Language Models: A Survey. 2024. arXiv:2402.06196.
  2. Vaswani A, Shazeer N, Parmar N, et al. Attention Is All You Need. 2017. arXiv:1706.03762.
  3. Naveed H, Khan AU, Qiu S, et al. A Comprehensive Overview of Large Language Models. 2023. arXiv:2307.06435.
  4. Solaiman I, Brundage M, Clark J, et al. Release Strategies and the Social Impacts of Language Models. 2019. arXiv:1908.09203.
  5. Zhou C, Li Q, Li C, et al. A Comprehensive Survey on Pretrained Foundation Models: A History from BERT to ChatGPT. 2023. arXiv:2302.09419.
  6. Introducing ChatGPT. Available online: https://openai.com/index/chatgpt/
  7. Anil R, Borgeaud S, Alayrac JB, et al. Gemini: A Family of Highly Capable Multimodal Models. 2023. arXiv:2312.11805.
  8. Mixture of Experts Explained. Available online: https://huggingface.co/blog/moe
  9. Mixtral of experts - A high quality Sparse Mixture-of-Experts. Available online: https://mistral.ai/news/mixtral-of-experts/
  10. Touvron H, Martin L, Stone K, et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. 2023. arXiv:2307.09288.
  11. Introducing the next generation of Claude. Available online: https://www.anthropic.com/news/claude-3-family
  12. Dispatch: New AI Claude 3 shows signs of Metacognition - A New Era for Humanity & The Science of Consciousness? Available online: https://thefuturai.substack.com/p/dispatch-new-ai-claude-3-shows-signs
  13. What are LLM benchmarks? Available online: https://www.ibm.com/think/topics/llm-benchmarks
  14. Chiang W, Zheng L, Sheng Y, et al. Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. 2024. arXiv:2403.04132.
  15. openlifescienceai/medmcqa on HuggingFace. Available online: https://huggingface.co/datasets/openlifescienceai/medmcqa
  16. bigbio/med_qa on HuggingFace. Available online: https://huggingface.co/datasets/bigbio/med_qa
  17. HuggingFace - The AI community building the future. Available online: https://huggingface.co/
  18. Context Windows: The Short-term Memory of Large Language Models. Available online: https://medium.com/@crskilpatrick807/context-windows-the-short-term-memory-of-large-language-models-ab878fc6f9b5
  19. GitHub Repository. Available online: https://github.com/GianlucaMondillo/LLMPediatricITA_Benchmarking
  20. Wu S, Koo M, Blum L, et al. Benchmarking Open-Source Large Language Models, GPT-4 and Claude 2 on Multiple-Choice Questions in Nephrology. NEJM AI 2024. doi: 10.1056/AIdbp2300092.
  21. Passby L, Jenko N, Wernham A. Performance of ChatGPT on Specialty Certificate Examination in Dermatology multiple-choice questions. Clin Exp Dermatol 2024;49:722-7. [Crossref] [PubMed]
  22. Suchman K, Garg S, Trindade AJ. Chat Generative Pretrained Transformer Fails the Multiple-Choice American College of Gastroenterology Self-Assessment Test. Am J Gastroenterol 2023;118:2280-2. [Crossref] [PubMed]
  23. King RC, Samaan JS, Yeo YH, et al. Appropriateness of ChatGPT in Answering Heart Failure Related Questions. Heart Lung Circ 2024;33:1314-8. [Crossref] [PubMed]
  24. Barile J, Margolis A, Cason G, et al. Diagnostic Accuracy of a Large Language Model in Pediatric Case Studies. JAMA Pediatr 2024;178:313-5. [Crossref] [PubMed]
  25. Cobbe K, Kosaraju V, Bavarian M, et al. Training Verifiers to Solve Math Word Problems. 2021. arXiv:2110.14168.
  26. Gao Y, Xiong Y, Gao X, et al. Retrieval-Augmented Generation for Large Language Models: A Survey. 2023. arXiv:2312.10997.
  27. The Italian language is the 4th most studied language in the world. Available online: https://italyuntold.org/wp-content/uploads/2022/03/Italian-is-the-fourth-most-studied-language-in-the-world.pdf
  28. Edemekong PF, Annamaraju P, Haydel MJ. Health Insurance Portability and Accountability Act. [Updated 2024 Feb 12]. In: StatPearls [Internet]. Treasure Island (FL): StatPearls Publishing; 2024. Available online: https://www.ncbi.nlm.nih.gov/books/NBK500019/
Cite this article as: Mondillo G, Perrotta A, Frattolillo V, Colosimo S, Indolfi C, Miraglia del Giudice M, Rossi F. Large language models performance on pediatrics question: a new challenge. J Med Artif Intell 2025;8:14.
