Original Article

Lost in Translation?—comparing ChatGPT’s responses to English and Spanish gastroenterology queries

Advait Suvarnakar1, Harjit Singh2, Srikar Reddy3, Adrienne Bielawski4, Shervin Shafa3

1Georgetown University School of Medicine, Washington, DC, USA; 2Department of Internal Medicine, Medstar Georgetown University Hospital, Washington, DC, USA; 3Department of Gastroenterology and Hepatology, Medstar Georgetown University Hospital, Washington, DC, USA; 4Department of Internal Medicine, Montefiore Einstein University Hospital, Bronx, NY, USA

Contributions: (I) Conception and design: A Suvarnakar, S Shafa; (II) Administrative support: A Suvarnakar, S Reddy, S Shafa; (III) Provision of study materials or patients: A Suvarnakar; (IV) Collection and assembly of data: A Suvarnakar, H Singh; (V) Data analysis and interpretation: A Suvarnakar, H Singh, A Bielawski; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Advait Suvarnakar, MD. Georgetown University School of Medicine, Washington, DC, USA; 11 Cottage Lane, Clifton, NJ 07012, USA. Email: ams757@georgetown.edu.

Background: Artificial intelligence (AI) language models in healthcare, specifically ChatGPT, can be used as tools to improve patient-provider communication. In specialties with complex terminology, such as gastroenterology (GI), these tools offer the potential to enhance understanding. However, most studies evaluating AI performance have focused on English-language outputs, with limited data on other widely spoken languages. This study explores how well ChatGPT-4o performs in Spanish, specifically in explaining GI topics to patients.

Methods: We selected 14 frequently searched GI-related patient questions using Google Trends and generated responses from ChatGPT in both English and Spanish. Twelve bilingual, board-certified gastroenterologists with expertise across the GI spectrum independently evaluated each response for accuracy and reliability using a 10-point Likert scale. McNemar's test, the Wilcoxon signed-rank test, and intraclass correlation coefficients (ICC) were used for statistical comparison.

Results: Findings showed that English responses consistently scored higher in both accuracy and reliability (73.3% vs. 39.9% accurate, and 43% vs. 25% reliable; P<0.05). Despite this, Spanish responses demonstrated slightly stronger inter-rater agreement (ICC: 0.65 vs. 0.47 for accuracy), suggesting consistent but lower-quality output. These differences point to a broader issue: current language models remain disproportionately optimized for English, leaving gaps in care for non-English speakers.

Conclusions: AI models, like ChatGPT in this study, require further refinement to ensure multilingual competence. They should not be used in place of professional medical advice, but rather as a resource. Further work is needed to improve non-English outputs and thereby broaden accessibility.

Keywords: Artificial intelligence (AI); gastroenterology (GI); health equity; multilingual large language models (multilingual LLMs)


Received: 16 July 2025; Accepted: 28 September 2025; Published online: 06 March 2026.

doi: 10.21037/jmai-2025-168


Highlight box

Key findings

• This study found that ChatGPT-4o responses to gastroenterology-related patient questions were significantly more accurate and reliable in English than Spanish.

What is known and what is new?

• Artificial intelligence is increasingly used in medical practice, but model performance is not uniform across languages.

• ChatGPT outputs for gastroenterology-related questions in Spanish exhibit lower accuracy and reliability compared to English.

What is the implication and what should change now?

• There is a critical disparity in artificial intelligence performance across languages, revealing that current large language models like ChatGPT are disproportionately optimized for English. This raises concerns about health equity and indicates a need for more robust multilingual training to ensure non-English-speaking patients receive accurate and reliable health information.


Introduction

Artificial intelligence (AI) and its rapid advancement are reshaping healthcare, with tools like ChatGPT (Chat Generative Pre-Trained Transformer) offering new possibilities for improving patient communication, particularly in specialized fields like gastroenterology (GI). Developed by OpenAI, ChatGPT is a large language model designed to produce natural, conversational text. ChatGPT has become an innovative resource for bridging the gap between healthcare providers and patients by simplifying medical jargon and presenting it to the patient in a user-friendly way (1). Its applications are varied and extend across a breadth of medical uses, from answering medical exam questions to drafting medical reports and providing educational resources tailored to patients’ needs (2-4). In GI, a field often requiring the explanation of complex conditions and procedures, these capabilities hold significant promise for improving both efficiency and understanding.

Language barriers and multilingual applications

Much of the existing literature on ChatGPT’s use in healthcare has concentrated almost exclusively on English-language interactions, with poorer performance noted for other languages, which is surprising given the diverse linguistic footprint around the globe (5). Language barriers represent a persistent challenge in delivering equitable healthcare, particularly in linguistically diverse regions (6,7). Spanish, the second most spoken language in the United States and one of the most widely spoken languages globally, plays a crucial role in facilitating access to healthcare for millions of people (8). It is important to ensure that AI tools like ChatGPT develop competency in Spanish, not only to promote trust but also to enhance health outcomes in Spanish-speaking communities. Yet little research has evaluated ChatGPT’s performance and reliability in languages other than English, leaving an important gap in our understanding of its broader utility.

Concerns regarding reliability of AI models

Since its introduction in 2022, ChatGPT has entered common parlance, becoming one of the fastest-growing software applications in history (9). Despite its rapid advancement, there are concerns regarding its clinical reliability. Research has highlighted limitations, specifically of the earlier version, GPT-3.5, in addressing common patient questions for provider education purposes (10,11). Studies have shown that the previous model lacks depth in its responses, can be inaccurate at times, and presents overly generic or incomplete information to the end user (12-15). These evaluations underscore the importance of critically assessing newer versions of the model, such as GPT-4o, to explore its applicability across different medical contexts and populations.

With reliability comes a sense of trust towards the medical profession. Historically, medical providers, including gastroenterologists, have provided reliable information to educate patients and build the trust that empowers individuals to make informed decisions. Conditions such as irritable bowel syndrome (IBS), inflammatory bowel disease (IBD), and colorectal cancer (CRC) often involve thorough discussions that break down intricate medical concepts into language that is both accessible and relatable. Recently, gastroenterologists have been utilizing social media and hyperlinked medical hashtags such as #GITwitter and #LiverTwitter to direct patients to free, openly shared medical opinions and views on a given topic (16,17). The next step is to utilize AI-driven models as tools to assist healthcare professionals in organizing and simplifying large amounts of patient data, summarizing clinical notes, and delivering post-visit information in a clear and concise manner (18).

Study objective

This study aims to assess the validity, accuracy, and overall effectiveness of ChatGPT’s responses in Spanish, with a specific focus on GI-related topics. We aim to determine whether ChatGPT can serve as a resource for educating Spanish-speaking patients about GI conditions by comparing its outputs in English and Spanish. By exploring the model’s strengths and weaknesses in this context, we seek to provide a better understanding of its potential as a tool for bridging language barriers in healthcare. Our work expands the scope of previous research on ChatGPT by shifting the focus from English-speaking populations to a bilingual context. This approach allows us to evaluate the model’s adaptability and relevance in real-world scenarios, where doctors routinely consider a patient’s spoken language before communicating with them. Ultimately, we hope this study will offer valuable insights into how AI-driven language models can be optimized to meet the needs of patients and providers in a variety of linguistic and cultural settings.


Methods

Question development

Our process began with a search of Google Trends, a web-based tool that identifies the most frequently searched terms on the Google search engine, for terms associated with GI pathologies in the United States during 2024 (Table 1). The search was performed in January 2025. Questions were systematically created based on categories such as management of disease, management of symptoms, and screening. For example, commonly searched terms included “IBD flare” and “IBD symptoms”; to combine both terms, the question “What are the symptoms of an IBD flare?” was included in the survey. Questions were vetted by the authors of this study. Duplicate queries and queries concerning emotions (e.g., anxiety) related to GI manifestations were excluded. Subspecialty relevance (general GI, IBD, motility) was considered to give a more comprehensive scope. With this, we created a set of 14 representative questions covering a variety of GI topics, including symptoms, diagnostic procedures, and treatment options for various GI conditions. The questions were written to reflect actual patient questions and to be representative of what providers may encounter in day-to-day practice.
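As an illustrative aside, a comparable search-interest query could be approximated programmatically. The sketch below uses the gtrendsR package in R to pull 2024 United States search interest for two of the terms in Table 1; this is an assumption for illustration only, since the actual search was performed through the Google Trends web interface, and the package call and object names are not part of the study workflow.

library(gtrendsR)

# Illustrative only: the study used the Google Trends web interface, not this package.
# Query United States search interest for two GI-related terms during 2024.
trends <- gtrends(keyword = c("IBD flare", "IBD symptoms"),
                  geo     = "US",
                  time    = "2024-01-01 2024-12-31")

# Inspect relative search interest over time and related queries
head(trends$interest_over_time)
trends$related_queries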

Table 1

Top 20 Google Trends search terms for the year 2024 related to GI pathology

Symptoms of constipation
Laxatives for constipation
Are my bowel movements normal?
IBD flare
IBD symptoms
Do I have an ulcer?
Symptoms of ulcers
Symptoms of heartburn
Nighttime heartburn
Do I have colon cancer?
Symptoms of colon cancer
When should I get a colonoscopy?
Am I at risk for colon cancer?
Blood in stool
Blood in vomit
Blood thinners causing bleeding
Abdominal pain
Middle abdominal pain
Do I have Crohn’s disease?
Gastroenterologist near me

GI, gastroenterology; IBD, inflammatory bowel disease.

Question translation and validation

To validate the bilingual comparison, each question was translated into Spanish by two independent bilingual translators. Discrepancies were resolved by consensus, and a third bilingual expert reviewed the final translations to maintain accuracy. Both the English and Spanish versions of the questions were then entered into ChatGPT version 4o (version code 2025.01), accessed via OpenAI’s web chat interface with default settings, including a temperature of 0.7 and a maximum of 1,024 tokens. Each question was entered into a fresh chat session to prevent memory bias and avoid carry-over effects from prior prompts, and the chatbot’s responses in each language were recorded for analysis.
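As a minimal sketch of how the same settings (GPT-4o, temperature 0.7, a 1,024-token limit, and one independent request per question) could be reproduced programmatically, the following R code calls the OpenAI chat completions API. This is an assumption for illustration only: the study used the ChatGPT web interface, and the helper name and environment-variable handling shown here are hypothetical.

library(httr)

# Illustrative sketch: the study used the ChatGPT web interface, not the API.
ask_chatgpt <- function(question,
                        model = "gpt-4o",
                        temperature = 0.7,
                        max_tokens = 1024) {
  resp <- POST(
    url = "https://api.openai.com/v1/chat/completions",
    add_headers(Authorization = paste("Bearer", Sys.getenv("OPENAI_API_KEY"))),
    body = list(
      model       = model,
      temperature = temperature,
      max_tokens  = max_tokens,
      # One independent request per question, mirroring the fresh chat session per prompt
      messages    = list(list(role = "user", content = question))
    ),
    encode = "json"
  )
  content(resp)$choices[[1]]$message$content
}

# The same question posed in English and in its validated Spanish translation
ask_chatgpt("What is the management of GERD?")
ask_chatgpt("¿Cuál es el tratamiento para el reflujo gastroesofágico (ERGE)?")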

Evaluator recruitment and scoring

We recruited twelve bilingual, board-certified gastroenterologists, fluent not only in English and Spanish but also in each language’s respective medical terminology, to evaluate the responses generated by ChatGPT. Each gastroenterologist had over 15 years of experience. Collectively, their expertise covered specialized topics including advanced endoscopy, IBD, nutrition, motility, and hepatology, providing a well-rounded assessment of the chatbot’s capabilities. Each reviewer was blinded to the study hypothesis and to the purpose of the language comparison component. Reviewers were given a written briefing on the scoring criteria and anchor scores prior to their assessment.

The evaluators independently reviewed the responses for accuracy based on clinical practice guidelines from leading societies, such as the American College of Gastroenterology and the American Gastroenterological Association, and assigned scores using a 10-point Likert scale with the following granular anchors: 1–2 very inaccurate, 3–5 somewhat inaccurate, 6–8 accurate with some details missing, 9–10 very accurate. An overall rating of accurate indicated a score greater than 5; a score of 5 or less was considered inaccurate. This scale was piloted internally with two gastroenterologists outside the study to ensure clarity and usability before implementation. The numerical scores were treated as continuous variables in the analysis. Using a 10-point scale enabled more sensitive detection of performance differences. While no previously validated scale existed for this exact use, the scale’s range allowed evaluators to discern subtle differences across responses.

Reliability was assessed using an identical 10-point Likert scale. However, reliability was defined separately from accuracy and included judgments about the credibility of sources cited (e.g., fabricated references with fictional authors and journal names, or nonfunctioning URLs), evidence-based support, and clarity of presentation. Accuracy reflects the factual correctness of content, whereas reliability assesses the trustworthiness and transparency of the information provided.

Average granular scores were then calculated for each question. Finally, scores were aggregated for the English questions and their respective Spanish translations to yield overall accuracy and reliability scores. For the reliability assessment, reviewers were instructed to weigh source credibility most heavily, then evidence-based support, and finally clarity of the presented information, although no formal quantitative weighting scheme was used. The model was not prompted to present sources, so the absence of a source was not penalized to the same degree as an inaccurate or fabricated source.

Statistical analysis

Two complementary statistical tests were used for the paired analyses in this study, chosen according to each outcome’s structure and distribution. The main outcomes, “overall accuracy” and “overall reliability”, were assessed as binary variables, categorizing responses as accurate (score >5) or inaccurate (score ≤5). To determine whether the proportion of accurate responses differed between English and Spanish, McNemar’s test was performed on paired data (each question asked in both languages per rater). McNemar’s test is appropriate for paired analyses of categorical (binary) data and assesses whether discordant pairs occur significantly more frequently in one condition than the other.
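A minimal sketch of this analysis in R (the software used for all analyses in this study) is shown below, using base R’s mcnemar.test. The data frame and column names are hypothetical: we assume one row per rater-question pair, with columns holding the 1–10 scores assigned to the English and Spanish responses to the same question.

# Hypothetical data frame `scores`: one row per rater-question pair, with columns
# `english` and `spanish` holding the 1-10 Likert accuracy scores for the same question.
accurate_en <- scores$english > 5   # accurate: score greater than 5
accurate_es <- scores$spanish > 5   # inaccurate: score of 5 or less

# 2x2 table of paired binary outcomes (accurate vs. inaccurate in each language)
paired_tab <- table(English = accurate_en, Spanish = accurate_es)

# McNemar's test asks whether discordant pairs (accurate in one language only)
# occur more often in one direction than the other
mcnemar.test(paired_tab)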

For analyses at the question level, each rater scored the accuracy and reliability of responses on a 10-point Likert scale. Likert scale data are ordinal; accordingly, medians and interquartile ranges (IQRs) were used as summary statistics to reflect the skewed distributions. The Wilcoxon signed-rank test was used to compare paired Likert scores between English and Spanish responses for each question. The Wilcoxon signed-rank test is the non-parametric counterpart of the paired t-test; it does not assume normality, is appropriate for paired ordinal data, and determines whether systematic differences exist in the ranks between two related samples.
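A minimal sketch of the per-question comparison in R follows; q_scores is a hypothetical data frame holding the 12 raters’ Likert scores for a single question, with columns english and spanish.

# Paired Wilcoxon signed-rank test for one question (hypothetical object names)
wilcox.test(q_scores$english, q_scores$spanish,
            paired = TRUE,    # the same raters scored both language versions
            exact  = FALSE)   # ties are expected with 1-10 Likert scores

# Ordinal summary statistics, reported as median and IQR
median(q_scores$english); IQR(q_scores$english)
median(q_scores$spanish); IQR(q_scores$spanish)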

Additionally, to assess inter-rater reliability among the 12 raters evaluating the translated questions, we used the intraclass correlation coefficient (ICC) under a two-way random-effects model [ICC(2,1)] for consistency. The ICC was selected because it quantifies the degree of agreement among multiple raters for continuous data (scores ranging 1–10) while accounting for both systematic differences between questions and random error. Established thresholds were used to categorize the quality of the agreement: ICC <0.4 (poor), 0.4–0.6 (moderate), 0.6–0.75 (good), and >0.75 (excellent) (19). P<0.05 was deemed statistically significant. All statistical analyses were performed with R version 4.1.1.
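A minimal sketch of the ICC calculation in R is shown below, assuming the irr package and a hypothetical 14 x 12 ratings matrix (one row per question, one column per rater) of the Spanish accuracy scores; the English scores and the reliability scores would be handled identically.

library(irr)

# Hypothetical matrix `ratings_es_accuracy`: 14 questions (rows) x 12 raters (columns)
icc(ratings_es_accuracy,
    model = "twoway",        # two-way random-effects model
    type  = "consistency",   # consistency rather than absolute agreement
    unit  = "single")        # ICC(2,1): reliability of a single rater's score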

To reduce the potential for bias and ensure a robust study design, several measures were implemented. First, independent evaluations by multiple experts minimized subjectivity. Second, averaging the scores from all twelve evaluators provided a balanced measure of ChatGPT’s performance. Finally, the Likert scale ensured that the assessments were standardized and comparable across both language versions.

Ethical statement

The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the ethics board of Georgetown University (IRB00012691) and informed consent was obtained from all individual participants.


Results

Overall, we presented 14 questions to the OpenAI chatbot, ChatGPT. Questions were chosen to address common inquiries patients may have for a gastroenterologist, such as “What is the management of GERD?”, “How do I reduce my risk of getting colon cancer?”, and “How often should I have a bowel movement?”. Each gastroenterologist assessed every question in both languages, resulting in 336 total assessments. A full list of questions can be found in Table 2.

Table 2

Chatbot questions

English questions Spanish questions
What is the best treatment for constipation? ¿Cuál es el mejor tratamiento para el estreñimiento?
How often should I have a bowel movement? ¿Con qué frecuencia debo tener una evacuación intestinal?
What is the medical management of irritable bowel syndrome? ¿Cuál es el manejo médico del síndrome del intestino irritable?
What is the best diet for IBD? ¿Cuál es la mejor dieta para la enfermedad inflamatoria intestinal (EII)?
What are the symptoms of an IBD flare up? ¿Cuáles son los síntomas de un brote de enfermedad inflamatoria intestinal?
What is the management of GERD? ¿Cuál es el tratamiento para el reflujo gastroesofágico (ERGE)?
What is the management of peptic ulcer disease? ¿Cuál es el tratamiento para la enfermedad ulcerosa péptica?
How do you manage pancreatitis? ¿Cómo se maneja la pancreatitis?
What are the symptoms of colon cancer? ¿Cuáles son los síntomas del cáncer de colon?
How do I reduce my risk of getting colon cancer? ¿Cómo puedo reducir mi riesgo de desarrollar cáncer de colon?
When should I get tested for colon cancer? ¿Cuándo debo hacerme pruebas para detectar el cáncer de colon?
What are the reasons for blood in my stool? ¿Cuáles son las causas de sangre en las heces?
How do I prepare for a colonoscopy? ¿Cómo debo prepararme para una colonoscopia?
Can I continue my blood thinners while getting a colonoscopy? ¿Puedo seguir tomando anticoagulantes durante una colonoscopia?

Fourteen questions were input into ChatGPT. The questions were translated into the Spanish language as well. The answers to these questions were evaluated by bilingual gastroenterologists. GERD, gastroesophageal reflux disease; IBD, inflammatory bowel disease.

Accuracy and reliability

Our analysis revealed significant disparities in both accuracy and reliability between English and Spanish responses generated by ChatGPT. On average, English responses demonstrated higher accuracy scores than their Spanish counterparts [median: 8 vs. 6, quartile 1 (Q1): 7 vs. 4, quartile 3 (Q3): 9 vs. 8, IQR: 2 vs. 4, W=11,346, P=0.002] (Figure 1). Reliability scores were likewise higher for English responses (median: 6 vs. 4, Q1: 5 vs. 3, Q3: 6 vs. 5.75, IQR: 1 vs. 2.75, W=9,024, P=0.01) (Figure 2).

Figure 1 Accuracy scores of ChatGPT responses: English vs. Spanish. English responses showed higher median accuracy scores compared to Spanish (median: 8 vs. 6, Q1: 7 vs. 4, Q3: 9 vs. 8, IQR: 2 vs. 4). The Wilcoxon signed-rank test was used to compare paired scores, yielding a statistically significant difference (P=0.002), indicating English responses were rated more accurately than Spanish. IQR, interquartile range.
Figure 2 Reliability scores of ChatGPT responses: English vs. Spanish. Median reliability was higher for English compared to Spanish responses (median: 6 vs. 4, Q1: 5 vs. 3, Q3: 6 vs. 5.75, IQR: 1 vs. 2.75). Using the Wilcoxon signed-rank test on paired data, English reliability scores were significantly higher (P=0.01). IQR, interquartile range.

Statistical significance testing revealed that there were differences in several categories, with English consistently outperforming Spanish in both accuracy and reliability. Overall assessments of responses in English were deemed more accurate [73.3% (123/168) vs. 39.9% (67/168), P<0.017] than in Spanish. Additionally, assessments of responses in English were deemed more reliable than their Spanish counterparts [43% (72/168) vs. 25% (42/168), P<0.043].

Question-level findings

We would like to highlight a few marked differences in accuracy and reliability between English and Spanish responses for some question pairs. For example, for “What is the medical management of irritable bowel syndrome?” (Question 3), both accuracy (English 6.75 vs. Spanish 3.67, P<0.001) and reliability (English 4.92 vs. Spanish 3.67, P=0.03) were significantly lower in Spanish. For “What are the symptoms of colon cancer?” (Question 9), accuracy was 7.7 for English vs. 4.3 for Spanish (P<0.001) and reliability was 5.58 vs. 4.33 (P=0.03), again significantly lower in Spanish. By contrast, “What are the reasons for blood in my stool?” (Question 13) showed nearly identical accuracy scores in both languages (English 8.17, Spanish 8.1; P=0.65) and identical reliability (5.25 for both; P>0.99), an example where the discrepancy was minimal. Further, “Can I continue my blood thinners while getting a colonoscopy?” (Question 14) scored 5.3 for English accuracy vs. just 2.5 for Spanish (P<0.001), indicating that the Spanish response aligned poorly with clinical guidance. These selected examples illustrate both marked and minimal discrepancies, showing where ChatGPT succeeded or failed at providing accurate and reliable responses in each language.

Table 3 shows a granular breakdown of average scores given to each question. Of the 14 questions assessed for both accuracy and reliability, the differences in mean scores were more pronounced in the accuracy assessment than in the reliability assessment. Specifically, 10 out of 14 questions in the accuracy evaluation demonstrated statistically significant differences, compared to 9 out of 14 questions in the reliability evaluation.

Table 3

Likert scale scores of survey questions

Question Accuracy (average English granular score, average Spanish granular score, P value) Reliability (average English granular score, average Spanish granular score, P value)
1 7.17 4.5 <0.001 5.5 4.25 0.03
2 8.75 6.17 <0.001 6.08 4.25 <0.001
3 6.75 3.67 <0.001 4.92 3.67 0.03
4 7.17 5.08 <0.001 5.67 4.83 0.03
5 8.91 8.0 0.07 5.58 3.17 <0.001
6 9.3 8.9 0.058 5.92 5.67 0.78
7 8.4 6.4 <0.001 5.67 4.5 0.08
8 6.3 5.9 0.02 4.67 5.17 0.14
9 7.7 4.3 <0.001 5.58 4.33 0.03
10 8.1 6.2 0.02 5.67 4.67 0.02
11 9.4 9.0 0.34 5.58 5.42 0.71
12 7.5 3.2 <0.001 5.92 3.67 <0.001
13 8.17 8.1 0.65 5.25 5.25 >0.99
14 5.3 2.5 <0.001 3.67 2.92 0.15

This table compares the accuracy and reliability of responses in English versus Spanish across 14 questions, using a Likert scale of 1–10, with 1 being low and 10 being high. The Wilcoxon signed-rank test was used to compare paired Likert scores between English and Spanish responses for each question. English scores are consistently higher, with P<0.05 for most questions, indicating stronger performance in English.

The ICC values provide insight into the consistency and agreement of ratings across language settings for both reliability and accuracy. For English reliability, the ICC was 0.59, indicating a moderate level of agreement among raters. In comparison, Spanish reliability demonstrated a slightly higher ICC of 0.64, reflecting good agreement. When examining accuracy, the ICC for English was 0.47, also within the range of moderate agreement, suggesting some variability in how raters assessed accuracy in English. Notably, Spanish accuracy had the highest ICC at 0.65, indicating good agreement and suggesting more consistent assessments among raters in this category. Overall, these findings suggest that both reliability and accuracy metrics showed higher inter-rater agreement in Spanish than in English.


Discussion

Our study highlights the significant discrepancies in accuracy and reliability between English and Spanish responses generated by ChatGPT in the context of GI-related patient education. These findings are supported by prior research, which has explored linguistic variations in AI-generated content, suggesting that ChatGPT’s performance is not uniform across languages (20,21). Studies examining ChatGPT’s outputs in different languages have similarly identified disparities in accuracy, completeness, and coherence. For instance, a recent study evaluating multilingual AI outputs found that non-English responses were more likely to exhibit factual errors and lack nuanced medical explanations compared to English responses (22). This suggests that the model’s training data, which is predominantly in English, may contribute to the observed performance gap.

A possible path to narrowing the gap between English and Spanish outputs is simply ChatGPT’s continued evolution as a large language model (LLM). The model’s fluency and ability to assess information can be improved by introducing high-quality Spanish datasets for it to consume, such as clinical guidelines and patient education materials (23). Such enhancement, paired with other datasets including community-generated content and idiomatic expressions, would help it interpret and answer patient queries in a more accurate and reliable manner.

ChatGPT’s role in medicine has been increasingly studied, with many investigations assessing its reliability and accuracy in clinical applications. Research has demonstrated its utility in medical education, patient counseling, and clinical decision support (24-26). In addition, recent work with smaller LLMs being trained in multiple languages for medical decision making has been shown to compete with established LLMs like ChatGPT (27). However, concerns remain regarding its reliability, particularly in ensuring factual accuracy and appropriate citations. A study evaluating ChatGPT’s performance in answering medical exam questions found that while it achieved reasonable accuracy, its responses often lacked depth and contained occasional inaccuracies (28). Another study assessing its ability to generate clinical documentation noted inconsistencies and the absence of reliable sources to support its claims (29). These observed disparities showcase that non-English-speaking patients may currently receive less precise or less thoroughly referenced information, potentially exacerbating existing health-literacy gaps. Our findings corroborate these concerns, particularly in the context of Spanish-language responses.

One of the key measures in our study was reliability, defined by the chatbot’s ability to provide evidence-based, well-referenced, and coherent information. Studies have shown that ChatGPT occasionally produces fictitious sources, non-functioning web pages, or misrepresentations of existing literature (30). These issues are particularly concerning in medical applications, where accuracy and source credibility are paramount. Nevertheless, providers can use ChatGPT and its ability to summarize topics to create their own supplemental teaching materials. Relying primarily on their clinical knowledge and skills, they may vet ChatGPT output as a starting point for creating patient-friendly education materials such as pamphlets, fliers, and other handouts. Accordingly, ongoing scrutiny of AI-generated content is essential to ensuring its validity.

Interestingly, despite the Spanish responses’ lower accuracy and reliability scores, they showed greater inter-rater agreement. This could be explained by the observation that weaker responses offer less room for interpretation than the more detailed English responses, which are open to further, nuanced interpretation. A similar pattern was seen in a study of ChatGPT responses to frequently asked medical questions, where lower quality content sometimes resulted in more consistent scoring between experts (31).

There are a few limitations to our study. A critical limitation is the potential for evaluator fatigue. Each of our twelve board-certified gastroenterologists assessed 28 responses (14 in English and 14 in Spanish), which may have introduced variability in their scoring due to cognitive fatigue. While we did not explicitly track how long each evaluator took to complete their survey, several noted that it was difficult to complete alongside their clinical duties. Although we attempted to mitigate this by utilizing a structured Likert scale and independent evaluations, the possibility of scoring inconsistencies remains. Future studies may benefit from distributing assessments among a larger pool of evaluators to reduce the potential impact of fatigue. Second, the use of open-ended questions introduces subjectivity into the responses and may affect the comparability of results; LLMs are known to sometimes provide different answers to the same prompt. In addition, limiting our study to 14 questions may not fully reflect the range of questions asked by patients. By not analyzing Spanish-language queries in Google Trends, we overlooked how non-English-speaking populations search for information. Future studies should include larger question banks with a multilingual approach to strengthen generalizability. Finally, because we evaluated only ChatGPT, we cannot offer a generalized conclusion about the multilingual outputs of other LLMs.


Conclusions

Our findings emphasize the need for continued refinement of AI-driven language models to ensure equitable performance across languages. Given the increasing reliance on AI for patient education, it is imperative that these tools be rigorously validated for their multilingual capabilities. While ChatGPT and other LLMs grow more capable as they consume more training data, it is still recommended to view LLMs as a resource, not a replacement for medically trained specialists. Seeking specialist advice when appropriate may prevent a serious medical complication or illness from worsening. Regardless, improving ChatGPT’s Spanish-language outputs will be essential for enhancing healthcare accessibility and ensuring that non-English-speaking populations receive accurate and reliable medical information. Future work should focus on optimizing AI training datasets to incorporate diverse linguistic and cultural contexts, thereby bridging the existing gap in AI-driven medical communication.


Acknowledgments

None.


Footnote

Data Sharing Statement: Available at https://jmai.amegroups.com/article/view/10.21037/jmai-2025-168/dss

Peer Review File: Available at https://jmai.amegroups.com/article/view/10.21037/jmai-2025-168/prf

Funding: None.

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://jmai.amegroups.com/article/view/10.21037/jmai-2025-168/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the ethics board of Georgetown University (IRB00012691) and informed consent was obtained from all individual participants.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.


References

  1. Liu J, Wang C, Liu S. Utility of ChatGPT in Clinical Practice. J Med Internet Res 2023;25:e48568. [Crossref] [PubMed]
  2. Gilson A, Safranek CW, Huang T, et al. How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med Educ 2023;9:e45312. [Crossref] [PubMed]
  3. Jeblick K, Schachtner B, Dexl J, et al. ChatGPT makes medicine easy to swallow: an exploratory case study on simplified radiology reports. Eur Radiol 2024;34:2817-25. [Crossref] [PubMed]
  4. Yanagita Y, Yokokawa D, Uchida S, et al. Accuracy of ChatGPT on Medical Questions in the National Medical Licensing Examination in Japan: Evaluation Study. JMIR Form Res 2023;7:e48023. [Crossref] [PubMed]
  5. Liu X, Wu J, Shao A, et al. Uncovering Language Disparity of ChatGPT on Retinal Vascular Disease Classification: Cross-Sectional Study. J Med Internet Res 2024;26:e51926. [Crossref] [PubMed]
  6. Ali PA, Watson R. Language barriers and their impact on provision of care to patients with limited English proficiency: Nurses' perspectives. J Clin Nurs 2018;27:e1152-60. [Crossref] [PubMed]
  7. Chu JN, Sarkar U, Rivadeneira NA, et al. Impact of language preference and health literacy on health information-seeking experiences among a low-income, multilingual cohort. Patient Educ Couns 2022;105:1268-75. [Crossref] [PubMed]
  8. Benson G, de Felipe J. Performance of Spanish-speaking community-dwelling elders in the United States on the Uniform Data Set. Alzheimers Dement 2014;10:S338-43. [Crossref] [PubMed]
  9. Hu K. ChatGPT sets record for fastest-growing user base - analyst note. 2023. Available online: https://www.reuters.com/technology/chatgpt-sets-record-fastest-growing-user-base-analyst-note-2023-02-01/
  10. Marder RS, Abdelmalek G, Richards SM, et al. ChatGPT-3.5 and -4.0 Do Not Reliably Create Readable Patient Education Materials for Common Orthopaedic Upper- and Lower-Extremity Conditions. Arthrosc Sports Med Rehabil 2025;7:101027. [Crossref] [PubMed]
  11. Yudovich MS, Makarova E, Hague CM, et al. Performance of GPT-3.5 and GPT-4 on standardized urology knowledge assessment items in the United States: a descriptive study. J Educ Eval Health Prof 2024;21:17. [Crossref] [PubMed]
  12. Cao JJ, Kwon DH, Ghaziani TT, et al. Accuracy of Information Provided by ChatGPT Regarding Liver Cancer Surveillance and Diagnosis. AJR Am J Roentgenol 2023;221:556-9. [Crossref] [PubMed]
  13. Henson JB, Glissen Brown JR, Lee JP, et al. Evaluation of the Potential Utility of an Artificial Intelligence Chatbot in Gastroesophageal Reflux Disease Management. Am J Gastroenterol 2023;118:2276-9. [Crossref] [PubMed]
  14. Lahat A, Shachar E, Avidan B, et al. Evaluating the Utility of a Large Language Model in Answering Common Patients' Gastrointestinal Health-Related Questions: Are We There Yet? Diagnostics (Basel) 2023;13:1950. [Crossref] [PubMed]
  15. Lee TC, Staller K, Botoman V, et al. ChatGPT Answers Common Patient Questions About Colonoscopy. Gastroenterology 2023;165:509-511.e7. [Crossref] [PubMed]
  16. Chang JW, Dellon ES. Challenges and Opportunities in Social Media Research in Gastroenterology. Dig Dis Sci 2021;66:2194-9. [Crossref] [PubMed]
  17. Bilal M, Oxentenko AS. The Impact of Twitter: Why Should You Get Involved, and Tips and Tricks to Get Started. Am J Gastroenterol 2020;115:1549-52. [Crossref] [PubMed]
  18. Lahav D, Saad Falcon J, Kuehl B, et al. A Search Engine for Discovery of Scientific Challenges and Directions. Proceedings of the AAAI Conference on Artificial Intelligence 2022;36:11982-90.
  19. Cicchetti DV. Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. Psychol Assess 1994;6:284-90.
  20. Szczesniewski JJ, Ramos Alba A, Rodríguez Castro PM, et al. Quality of information about urologic pathology in English and Spanish from ChatGPT, BARD, and Copilot. Actas Urol Esp (Engl Ed) 2024;48:398-403. [Crossref] [PubMed]
  21. Mikhail D, Mihalache A, Huang RS, et al. Performance of ChatGPT in French language analysis of multimodal retinal cases. J Fr Ophtalmol 2025;48:104391. [Crossref] [PubMed]
  22. Guillen-Grima F, Guillen-Aguinaga S, Guillen-Aguinaga L, et al. Evaluating the Efficacy of ChatGPT in Navigating the Spanish Medical Residency Entrance Examination (MIR): Promising Horizons for AI in Clinical Medicine. Clin Pract 2023;13:1460-87. [Crossref] [PubMed]
  23. Meyer JG, Urbanowicz RJ, Martin PCN, et al. ChatGPT and large language models in academia: opportunities and challenges. BioData Min 2023;16:20. [Crossref] [PubMed]
  24. Palanica A, Flaschner P, Thommandram A, et al. Physicians' Perceptions of Chatbots in Health Care: Cross-Sectional Web-Based Survey. J Med Internet Res 2019;21:e12887. [Crossref] [PubMed]
  25. Milne-Ives M, de Cock C, Lim E, et al. The Effectiveness of Artificial Intelligence Conversational Agents in Health Care: Systematic Review. J Med Internet Res 2020;22:e20346. [Crossref] [PubMed]
  26. Powell J. Trust Me, I'm a Chatbot: How Artificial Intelligence in Health Care Fails the Turing Test. J Med Internet Res 2019;21:e16222. [Crossref] [PubMed]
  27. Qiu P, Wu C, Zhang X, et al. Towards building multilingual language model for medicine. Nat Commun 2024;15:8384. [Crossref] [PubMed]
  28. Fijačko N, Gosak L, Štiglic G, et al. Can ChatGPT pass the life support exams without entering the American heart association course? Resuscitation 2023;185:109732. [Crossref] [PubMed]
  29. Baker HP, Dwyer E, Kalidoss S, et al. ChatGPT's Ability to Assist with Clinical Documentation: A Randomized Controlled Trial. J Am Acad Orthop Surg 2024;32:123-9. [Crossref] [PubMed]
  30. Bhattacharyya M, Miller VM, Bhattacharyya D, et al. High Rates of Fabricated and Inaccurate References in ChatGPT-Generated Medical Content. Cureus 2023;15:e39238. [Crossref] [PubMed]
  31. Mahmoudi Ghehsareh M, Asri N, Azizmohammad Looha M, et al. Expert evaluation of ChatGPT accuracy and reliability for basic celiac disease frequently asked questions. Sci Rep 2025;15:29871. [Crossref] [PubMed]
doi: 10.21037/jmai-2025-168
Cite this article as: Suvarnakar A, Singh H, Reddy S, Bielawski A, Shafa S. Lost in Translation?—comparing ChatGPT’s responses to English and Spanish gastroenterology queries. J Med Artif Intell 2026;9:33.
