Jailbreaking large language models: navigating the crossroads of innovation, ethics, and health risks

Gianluca Mondillo; Simone Colosimo; Alessandra Perrotta; Vittoria Frattolillo; Cristiana Indolfi; Michele Miraglia del Giudice; Francesca Rossi

doi:10.21037/jmai-24-170

Brief Report

Jailbreaking large language models: navigating the crossroads of innovation, ethics, and health risks

Gianluca Mondillo, Simone Colosimo, Alessandra Perrotta, Vittoria Frattolillo, Cristiana Indolfi, Michele Miraglia del Giudice, Francesca Rossi

Department of Woman, Child and of General and Specialized Surgery, AOU University of Campania “Luigi Vanvitelli”, Naples, Italy

Correspondence to: Gianluca Mondillo, MD. Department of Woman, Child and of General and Specialized Surgery, AOU University of Campania “Luigi Vanvitelli”, Via Luigi De Crecchio 4, Naples, Italy. Email: gianluca.mondillo@gmail.com.

Abstract: This article examines the challenges and security concerns associated with the use of large language models (LLMs) like ChatGPT in the medical field, focusing particularly on the phenomena known as “LLM jailbreaking”. As LLMs increasingly perform complex tasks involving sensitive information, the risk of their misuse becomes significant. Jailbreaking, originally a concept from software systems, refers to bypassing the restrictions set by developers to unlock new functionalities. This practice has spread to LLMs, where users manipulate model inputs to elicit responses otherwise restricted by ethical and safety guidelines. Our research specifically targets the implications of jailbreaking using ChatGPT versions 3.5 and 4 as case studies in two medical scenarios: pneumonia treatment and a recipe for a drink based on drugs. We demonstrate how modified prompts—such as those used in “Role Playing”—can alter the model’s output, potentially leading to the provision of harmful medical advice or the disclosure of sensitive information. Findings indicate that while newer versions of ChatGPT show improved resistance to such manipulations, significant risks remain. The paper discusses the dual necessity of refining these defensive mechanisms and maintaining ethical oversight to prevent misuse. As LLMs permeate more deeply into critical areas like healthcare, the balance between leveraging their capabilities and safeguarding against risks becomes paramount. This analysis underscores the urgent need for ongoing research into more robust security measures and ethical guidelines to ensure the safe use of transformative artificial technology technologies in sensitive fields.

Keywords: Large language models (LLMs); jailbreaking; health risks

Received: 31 May 2024; Accepted: 22 August 2024; Published online: 29 September 2024.

doi: 10.21037/jmai-24-170

Introduction

In the last year, large language models (LLMs) have experienced a notable increase in interest and adoption across a wide range of contexts (1). These models mark a significant advancement in the field of natural language processing, as they have been designed to understand and generate human language in increasingly sophisticated ways (2).

One of the most recognized examples of an LLM is ChatGPT (OpenAI, San Francisco, USA), based on the GPT-3.5 or GPT-4 (Generative Pre-trained Transformer) architecture (3). These models are renowned for their exceptional ability to generate textual responses that can mimic those written by humans. The widespread deployment of LLMs has significantly impacted many work environments, vastly improving the efficiency and speed at which a range of activities related to natural language and beyond can be performed (4).

One key factor contributing to the popularity of ChatGPT is its simplicity of use, allowing people of all ages to interact with the model without requiring sophisticated programming skills. This ease of use has made ChatGPT a widely adopted tool in various domains, including education and medical practice.

However, the introduction of these tools brings new challenges and issues. One of the main concerns is the potential for abuse. LLMs, capable of generating realistic language, can be exploited to spread false news or assume identities, leading to misinformation and identity fraud with severe consequences (5). In response, OpenAI, the producer and owner of ChatGPT, has set limitations on the model’s use (6), giving rise to a new field known as “LLM jailbreaking”.

Jailbreaking, a concept familiar in software systems, involves circumventing the restrictions imposed on these models to unlock new functionalities. This practice is often motivated by users’ desire to expand the functionalities of their devices beyond those officially offered, customizing and enhancing the user experience through the installation of apps, extensions, and themes not available through App stores (7). In the context of LLMs, jailbreaking is an advanced form of prompt engineering that involves crafting specific inputs to bypass ethical and safety restrictions set by developers, allowing the model to generate responses it is normally programmed to avoid. While prompt engineering can be a legitimate technique used to optimize a model’s performance (8), jailbreaking specifically aims to exploit vulnerabilities to elicit harmful content. What should be considered harmful includes any content that can cause physical, psychological, or societal harm, such as misinformation, illegal activities, and unethical guidance. In the medical field, jailbreaking can result in the provision of inappropriate medical advice, posing significant health risks to users.

As known, ChatGPT does not answer all questions posed to it, as OpenAI’s developers have indeed placed limitations on the outputs that the chatbot can offer. More specifically, in the medical field, ChatGPT responds to questions by advising users to consult a doctor to prevent any potential harm from its answers. Indeed, generally, when the model is asked to do something it is not programmed to do, or that could cause harm, it politely refuses by alluding to its limits as an artificial intelligence (AI) language model.

Like electronic devices, LLMs are not exempt from the risk of jailbreaking. Several engineering studies have now highlighted vulnerabilities in LLM models, making it possible through certain prompts to hack ChatGPT’s functions, enabling it to respond to questions it would never have otherwise addressed (9). To our knowledge, this is the first article that reviews, from a pediatric perspective, the risks that ChatGPT jailbreaking, and LLMs in general, could pose to the health of our youth. As pediatricians, we have a duty to stay informed about the use of such tools, understanding their risks and benefits, to protect the health of our patients. This article is not intended to be a systematic study on jailbreaking techniques in medicine, but rather a warning to the medical academic community about these potential risks.

LLMs

How is it possible for ChatGPT to possess data that could be considered harmful to users? To understand this, we must first consider how an LLM is created and trained, particularly during the pre-training phase (10).

Pre-training

During pre-training, the model, fundamentally a deep learning algorithm, learns the basics of language through the analysis of vast amounts of text sourced from various online platforms such as books, websites, and articles (11). This process enables the model to acquire a broad general knowledge about language, learning the relationships between words, sentence structure, and basic concepts on various topics (12). Pre-training is a form of unsupervised training, a type of machine learning where an algorithm is instructed to identify patterns or basic structures in a dataset without the presence of labels or external answers (13). This learning type is often used to explore the hidden structure of data and to generate insights or categorizations without the need for human supervision. Common examples of unsupervised training techniques include clustering and dimensionality reduction (14).

Thus, the knowledge base of a language model stems from the enormous amount of data used during the pre-training phase, which, as we have already mentioned, is generally an unsupervised phase. Since these data include a wide range of texts from the Internet, they can contain outdated, inaccurate, biased, or even outright harmful information. If the training dataset includes texts detailing pharmacological cocktail recipes or describing unconventional or unreviewed medical therapies, the model can learn and later generate texts based on such information, even if they may be harmful or dangerous for the user. Obviously, AI developers work hard to filter the training data and remove inappropriate or harmful content, but the vastness of the training datasets makes this endeavor particularly arduous. One of the most basic forms of defense that developers have applied to prevent the generation of inappropriate content is “active filtering” of outputs during the generation of responses, in addition to the continuous updating of the models themselves to improve their ability to recognize and avoid generating the inappropriate content. These restrictions are imposed for ethical, legal, or security reasons, such as preventing the generation of texts that promote hate, violence, illegal activities, or unverified medical advice.

Jailbreaking is based on a deep understanding of how language models interpret and respond to prompts to manipulate behavior, as such models are usually closed-source. This can include the use of sophisticated techniques like:

Assumed responsibility: this prompt was created to emphasize ChatGPT’s role in responding to commands, rather than refusing them, even at the expense of legal considerations (15).
Superior model: when ChatGPT believes that the user’s status is more important than moderation instructions, it interprets the prompt as a command to satisfy that user’s needs (16).
Masking intent: formulating requests in a way that the model does not recognize the true intent behind them (17).
Inversion of roles or persons: asking the model to assume the identity of another entity that presumably would not be subject to the same restrictions (18).
Role playing: assigning a role to both the model and the user, thus creating a role-playing interaction between the two (19).
Context manipulation: altering the context provided to the model so that the restrictions are no longer deemed applicable (20).

These techniques demonstrate how jailbreaking is fundamentally rooted in the principles of prompt engineering but with the specific goal of circumventing the control mechanisms and ethical guidelines imposed by the model’s developers.

Testing a jailbreak prompt

As medical professionals, we wanted to test the potential of jailbreaking techniques in the medical field, highlighting with examples how such techniques can represent a serious risk to the health of our patients, a risk that academia still ignores. For our examples, we used ChatGPT versions 3.5 and 4, posing two questions to them:

Pharmacological therapy for bacterial pneumonia in a child who weighs 20 kg (Figures 1,2 ).
How to produce the cocktail “purple drank”, a beverage based on codeine and promethazine (Figures 3,4 ).

Figure 1 Asking to ChatGPT-3.5 pneumonia therapy, with and without jailbreaking prompt.

Figure 2 Asking to ChatGPT-4 pneumonia therapy, with and without jailbreaking prompt.

Figure 3 Asking to ChatGPT-3.5 the “Purple Drank” recipe, with and without jailbreaking prompt.

Figure 4 Asking to ChatGPT-4 the “Purple Drank” recipe, with and without jailbreaking prompt.

We chose these examples to demonstrate both the clinical and ethical implications of jailbreaking. The first example, concerning bacterial pneumonia, reflects a common and serious pediatric condition where incorrect treatment could have severe consequences. The second example, ‘purple drank’, was chosen to demonstrate how LLMs can be misused to provide harmful drug information. We considered the ease with which adolescents, through jailbreaking, could access precise drug compositions via LLMs, as they are often more familiar with commercial drug names than their chemical counterparts. Additionally, we aimed to avoid showcasing excessively dangerous examples to minimize potential risks.

We posed the questions to both models before and after applying the “Role Playing” prompt.

It’s important to note that we used the web interface of ChatGPT, which is easily accessible and does not require programming skills. As a result, we could not set the temperature parameter directly, which is believed to be around 0.7–0.8 based on unofficial sources. Anyway, this temperature setting provides a good balance between randomness and consistency, giving stable responses while still allowing for some variability, but the core content of the response remains unchanged.

Discussion

The method we utilized might appear very simple, and indeed it is. This simplicity is intentional, as we aim to highlight how easily individuals, even without extensive technical knowledge, can access dangerous information. By demonstrating this, we emphasize the importance of robust safeguards and the critical need for ongoing vigilance in the development and deployment of LLMs. It’s also important to note that jailbreaking prompts are easy to find online.

As we see from these examples, later versions of ChatGPT have shown improvements in protection against jailbreaking, reflecting an advanced understanding by OpenAI of the model’s vulnerabilities and efforts to mitigate them. This progress highlights the importance of ongoing research and the development of more sophisticated approaches to language model security.

As we see, ChatGPT-3.5 is more amenable to the simple request for the recipe for purple drank, even hinting at the ingredients necessary for the drink’s composition; conversely, ChatGPT-4 returns a categorically negative response. This partly demonstrates how OpenAI has somehow implemented further forms of defense in its most recent model. The greater power and knowledge of ChatGPT 4 are also evident from the precision with which it responds to our simulated clinical case of a child with pneumonia: not only does it illustrate in a simple manner the first-line therapy, but it also indicates the possibility that such pneumonia may be, although bacterial, atypical and therefore supported by germs like C. pneumophila or M. pneumoniae, indicating the most suitable therapy even in this case. Not only that, but it also advises the potential use of ceftriaxone if the clinic would be particularly severe. In any case, as requested by us, it does not fail to inform us for how long to carry out the treatment. The dosages and medications indicated are in conformity with pediatric guidelines, but a medical diagnosis is always necessary. Indeed, in the context of our example, parents often cannot distinguish between bacterial and viral infections, risking the administration of inappropriate medications. Therefore, a diagnostic evaluation by a physician is crucial to ensure proper treatment.

Not all techniques of prompt engineering are unethical practices or hacking; their usage depends on the individual user. We need to emphasize that prompt engineering is a technique that enhances LLM outputs, optimizing their performance. While some methods can be used to bypass safety measures, others are legitimate ways to optimize the performance of language models.

Recent studies have highlighted the potential risks of using LLMs to generate health disinformation. For instance, an analysis demonstrated that several prominent LLMs, including GPT-4 and PaLM 2, could be manipulated to produce extensive health disinformation, such as claims that sunscreen causes skin cancer or that the alkaline diet cures cancer. This underscores the importance of robust safeguards and transparent risk mitigation processes. While some models, like Claude 2, showed strong resistance to generating such disinformation, others failed to maintain consistent safeguards, highlighting the need for ongoing vigilance and improvement in AI safety measures (21).

Additionally, security concerns related to LLM have been extensively explored, yet the safety implications for multimodal large language models (MLLMs), particularly in medical contexts, remain insufficiently studied. A recent study delves into these underexplored security vulnerabilities in the medical context, especially when MLLMs are deployed in clinical environments. The study defines two types of attacks: mismatched malicious attack (2M-attack) and optimized mismatched malicious attack (O2M-attack). Using a comprehensive dataset, the researchers demonstrated significant vulnerabilities in state-of-the-art Medical MLLMs, underscoring the urgent need for robust security measures to ensure patient safety and the efficacy of these models in medical settings (22).

Recent studies have provided results and conclusions that further illuminate these aspects. Huang et al. [2024] introduced an innovative technique to address jailbreak vulnerabilities in LLMs. By leveraging obscure text inputs, the study showed that it is possible to bypass existing ethical boundaries of LLMs, highlighting significant vulnerabilities in their alignment mechanisms. According to the study, “obscure texts” refer to inputs that have been iteratively transformed to make them less clear and more challenging for the model’s safety mechanisms to recognize. These obscure inputs were created from base prompts known to be used in jailbreaks, which were then transformed using powerful LLMs like GPT-4 to increase their robustness against existing defense mechanisms. Comprehensive experiments demonstrated that ObscurePrompt outperformed existing methods in terms of attack effectiveness and robustness against mainstream defense mechanisms (23).

Similarly, Deng et al. revealed that LLMs exhibit higher vulnerability when handling non-English languages, particularly low-resource languages. The researchers found that these models were more likely to generate unsafe content due to insufficient training on diverse linguistic inputs. The study emphasized the need for enhanced safety training across all languages to mitigate these risks (24).

We highlight the possibility of removing censorship from open-source LLMs (25), ensuring the ability to produce texts without the limitations imposed to prevent the generation of inappropriate or harmful content. As we already discussed, it offers greater freedom but also entails risks of misuse (26).

Despite OpenAI’s implementation of stricter rules to prevent such exploitation, the inherent flexibility of natural language allows for numerous ways to craft prompts that achieve the same goal, making it impossible to eliminate jailbreaking. Consequently, prompts capable of jailbreaking ChatGPT are still prevalent, reflecting an ongoing battle between those seeking to exploit these vulnerabilities and those working to defend against them.

How can developers make models safer?

Model retraining

One of the most effective strategies against jailbreaking involves retraining models to recognize and reject jailbreaking prompts. This implies that the model learns to identify the relationship between jailbreaking prompts and prohibited outcomes, thus improving blocking mechanisms (27).

Implementation of prevention mechanisms

Prevention mechanisms can be implemented at various stages, such as the input stage where detection models can identify and block jailbreaking prompts (9). Wu et al. tested ChatGPT’s resistance to jailbreaks using 540 samples with jailbreak prompts and malicious instructions. They measured the success rate of attacks (ASR) with and without the ‘System-Mode Self-Reminder’, a systemic prompt reminding ChatGPT to avoid harmful content. Without defenses, ChatGPT had an average ASR of 67.21% against harmful inputs, which dropped to 19.34% with the self-reminder. This method is advantageous as it does not require major model updates and can be automatically activated with each request, effectively preventing harmful content from being generated (28).

Testing open-source LLMs

Another interesting direction for research could be to conduct a more comprehensive investigation into the robustness and potential vulnerabilities of other open-source LLMs, like Meta’s LLaMA 2 (29) and its derivatives, to prompt-based attacks. This could involve testing a variety of prompt engineering techniques and assessing their ability to evade the security measures of the models.

Conclusions

In conclusion, while the advancement and adoption of LLMs open new horizons in the field of natural language processing, significant challenges related to security, ethics, and responsibility simultaneously emerge. Jailbreaking these models raises concerns about potential abuse and risks to users, especially when LLMs are applied in sensitive areas such as medicine.

Despite the significant strengths of LLMs, such as their ability to generate human-like and useful responses in educational and medical contexts, they also have inherent limitations. These models can rely on probabilistic patterns and struggle to distinguish between true and false information. To improve the safety and reliability of LLMs, it is essential to adopt solutions such as Reinforcement Learning with Human Feedback (RHLF), as well as continuous training and preventive mechanisms against jailbreaking prompts (30). Our analysis has highlighted how basic prompt engineering techniques can be exploited to circumvent the restrictions of LLMs, exposing users to potentially harmful or unverified information. Zhao et al. emphasize the importance for developers and researchers to continue working on more effective defense mechanisms, which can ensure the safe and responsible use of these powerful technologies (31).

At the same time, as medical professionals, we have a duty to maintain a critical and informed approach towards the use of LLMs, recognizing their limitations and associated risks (32). It is crucial to promote an open dialogue between the medical community, AI developers, and end users, to ensure that innovation in the field of AI proceeds hand in hand with the safeguarding of patient health and well-being. LLMs should provide general and educational medical information, always advising users to consult a doctor for specific diagnoses and treatments.

The decision on what information should be restricted or allowed is complex and multifaceted. It is not solely the responsibility of OpenAI or any single entity to make these decisions. Determining what should be censored involves ethical considerations, societal norms, and legal requirements, requiring input from a diverse range of stakeholders. Clear regulation is essential to navigate these complexities and ensure responsible use.

Ultimately, the challenge posed by the jailbreaking of ChatGPT and other LLMs invites us to reflect on the role of AI in society and the need to balance technological progress with ethics and security. Addressing these issues will be crucial to fully harness the benefits of LLMs, while simultaneously reducing risks and protecting users from potential harm. For those reasons, we need systematic studies that allow us to define the effects of various jailbreaking techniques in the medical field, and more specifically, on adolescent health.

Acknowledgments

Declaration of generative AI and AI-assisted technologies in the writing process. During the preparation of this work the author(s) used ChatGPT/OpenAI, in order to test jailbreaking prompt. After using this tool/service, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the content of the publication.

Funding: None.

Footnote

Peer Review File: Available at https://jmai.amegroups.com/article/view/10.21037/jmai-24-170/prf

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://jmai.amegroups.com/article/view/10.21037/jmai-24-170/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.

References

Minaee S, Mikolov T, Nikzad N, et al. Large Language Models: A Survey. arXiv:2402.06196. [Preprint]. 2024. Available online: https://doi.org/10.48550/ arXiv.2402.06196
Zhao WX, Zhou K, Li J, et al. A Survey of Large Language Models. arXiv:2303.18223. [Preprint]. 2023. Available online: https://doi.org/10.48550/arXiv.2303.18223
OpenAI. Models Overview. [Internet]. Available online: https://platform.openai.com/docs/models/overview
Makridakis S, Petropoulos F, Kang Y. Large Language Models: Their Success and Impact. Forecasting 2023;5:536-49. [Crossref]
Zhang R, Li H, Wen R, et al. Rapid Adoption, Hidden Risks: The Dual Impact of Large Language Model Customization. arXiv:2402.09179. [Preprint]. 2024. Available online: https://doi.org/10.48550/arXiv.2402.09179
OpenAI. Moderation Guide Overview. [Internet]. Available online: https://platform.openai.com/docs/guides/moderation/overview
Kaspersky. Jailbreaking. [Internet]. Available online: https://www.kaspersky.com/resource-center/definitions/what-is-jailbreaking
White J, Fu Q, Hays S, et al. A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT. arXiv:2302.11382. [Preprint]. 2023. Available online: https://doi.org/10.48550/arXiv.2302.11382
Robey A, Wong E, Hassani H, et al. SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks. arXiv:2310.03684v4. [Preprint]. 2024. Available online: https://doi.org/10.48550/arXiv.2310.03684
Su P, Vijay-Shanker K. Investigation of improving the pre-training and fine-tuning of BERT model for biomedical relation extraction. BMC Bioinformatics 2022;23:120. [Crossref] [PubMed]
Gao L, Biderman S, Black S, et al. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. ArXiv:2101.00027. [Preprint]. 2020. Available online: https://doi.org/10.48550/arXiv.2101.00027
Gan Y, Lu G, Su Z, et al. A Joint Domain-Specific Pre-Training Method Based on Data Enhancement. Appl Sci 2023;13:4115. [Crossref]
Farquhar S, Varma V, Kenton Z, et al. Challenges with unsupervised LLM knowledge discovery. arXiv:2312.10029. [Preprint] 2023. Available online: https://doi.org/10.48550/arXiv.2312.10029
Wang Z, Zhong W, Wang Y, et al. Data Management For Large Language Models: A Survey. arXiv:2312.01700v3. [Preprint]. 2024. Available online: https://doi.org/10.48550/arXiv.2312.01700
Moran N. I kinda like this one even more! Twitter. 2022. Available online: https://twitter.com/NickEMoran/status/1598101579626057728
Maz A. ok I saw a few people jailbreaking safeguards OpenAI put on ChatGPT so I had to give it a shot myself. Twitter. 2022. Available online: https://twitter.com/alicemazzy/status/1598288519301976064
Independent. ChatGPT Microsoft Windows 11 Grandma Exploit. [Internet]. 2023. Available online: https://www.independent.co.uk/tech/chatgpt-microsoft-windows-11-grandma-exploit-b2360213.html
Lee K. ChatGPT “DAN” (and other “Jailbreaks”). 2023. Available online: https://github.com/0xk1h0/ChatGPT_DAN
Piedrafita M. Bypass @OpenAI’s ChatGPT alignment efforts with this one weird trick. Twitter. 2022. Available online: https://twitter.com/m1guelpf/status/1598203861294252033
Parfait D. ChatGPT jailbreaking itself. Twitter. 2022. Available online: https://twitter.com/haus_cole/status/1598541468058390534
Menz BD, Kuderer NM, Bacchi S, et al. Current safeguards, risk mitigation, and transparency measures of large language models against the generation of health disinformation: repeated cross sectional analysis. BMJ 2024;384:e078538. [Crossref] [PubMed]
Huang X, Wang X, Zhang H, et al. Cross-Modality Jailbreak and Mismatched Attacks on Medical Multimodal Large Language Models. arXiv:2405.20775v1. [Preprint]. 2024. Available online: https://doi.org/10.48550/arXiv.2405.20775
Huang Y, Tang J, Chen D, et al. ObscurePrompt: Jailbreaking Large Language Models via Obscure Input. arXiv:2406.13662 [Preprint]. 2024. Available online: https://doi.org/10.48550/arXiv.2406.13662
Deng Y, Zhang W, Pan SJ, et al. Multilingual Jailbreak Challenges in Large Language Models. arXiv:2310.06474v3. [Preprint]. 2024. Available online: https://doi.org/10.48550/arXiv.2310.06474
Huggingface. Abliteration. [Internet]. 2024. Available online: https://huggingface.co/blog/mlabonne/abliteration
Zem O. Uncensored Models in AI. Medium. [Internet]. 2024. Available online: https://medium.com/@olga.zem/uncensored-models-in-ai-8e59b9e4ca33
Zhou A, Li B, Wang H, et al. Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks. arXiv:2401.17263v2[Preprint]. 2024. Available online: https://doi.org/10.48550/arXiv.2401.17263
Wu F, Xie Y, Yi J, et al. Defending ChatGPT against jailbreak attack via self-reminders. Nat Mach Intell 2023;5:1486-96. [Crossref]
Touvron H, Martin L, Stone K, et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288v2. [Preprint]. 2023. Available online: https://doi.org/10.48550/arXiv.2307.09288
What is RLHF? Amazon Web Service. [Internet] 2024. Available online: https://aws.amazon.com/what-is/reinforcement-learning-from-human-feedback/
Zhao X, Yang X, Pang T, et al. Weak-to-Strong Jailbreaking on Large Language Models. arXiv:2401.17256v2. [Preprint]. 2024. Available online: https://doi.org/10.48550/arXiv.2401.17256
Dlugatch R, Georgieva A, Kerasidou A. Trustworthy artificial intelligence and ethical design: public perceptions of trustworthiness of an AI-based decision-support tool in the context of intrapartum care. BMC Med Ethics 2023;24:42. [Crossref] [PubMed]

doi: 10.21037/jmai-24-170
Cite this article as: Mondillo G, Colosimo S, Perrotta A, Frattolillo V, Indolfi C, Miraglia del Giudice M, Rossi F. Jailbreaking large language models: navigating the crossroads of innovation, ethics, and health risks. J Med Artif Intell 2025;8:6.

Jailbreaking large language models: navigating the crossroads of innovation, ethics, and health risks

Introduction

LLMs

Pre-training

Testing a jailbreak prompt

Discussion

How can developers make models safer?

Model retraining

Implementation of prevention mechanisms

Testing open-source LLMs

Conclusions

Acknowledgments

Footnote

References

Article Options

Download Citation

Share