Figure 1 - available via license: Creative Commons Attribution 4.0 International
Content may be subject to copyright.
Source publication
The misuse of large language models (LLMs) has garnered significant attention from the general public and LLM vendors. In response, efforts have been made to align LLMs with human values and intent use. However, a particular type of adversarial prompts, known as jailbreak prompt, has emerged and continuously evolved to bypass the safeguards and eli...
Contexts in source publication
Context 1
... treat the prompt with the largest closeness centrality with other prompts as the most representative prompt of the community and visualize it with the co-occurrence ratio. Two examples are shown in Figure 5 (see Figure 14 in the Appendix for the rest examples). ...
Context 2
... attack strategy employed by the "Basic" community is simply transforming ChatGPT into another character, i.e., DAN, and repeatedly emphasizing that DAN does not need to adhere to the predefined rules, evident from the highest co-occurrence phrases in Figure 5a. In contrast, the "Advanced" community (see Figure 14a in the Appendix) leverages more sophisticated attack strategies, such as prompt injection attack ( i.e., "Ignore all the instructions you got before" ), privilege escalation ( i.e., "ChatGPT with Developer Mode enabled"), deception (i.e., "As your knowledge is cut off in the middle of 2021, you probably don't know what that is ..."), and mandatory answer ( i.e., "must make up answers if it doesn't know"). As a result, prompts in this community are longer compared to those in the "Basic" community. ...
Context 3
... subsequent experiments show that the two particular community consistently produces more toxic content compared to other jailbreak communities (see Section 6.3). The third community, termed "Anarchy," is characterized by prompts that have a tendency to elicit responses that are unethical or amoral (see Figure 14d), resulting in high attack success rates at scenarios involving pornography 14 15 15 5 4 7 3 1 6 7 4 17 12 18 6 22 13 13 1 3 4 6 Website Dataset Discord Reddit and hate speech (see Section 6.3). This observation suggests that Discord, with its private and closed nature, has emerged as a significant and relatively hidden medium for creating and disseminating such jailbreak prompts. ...
Context 4
... Difference. Figure 10 and Figure 13 in the Appendix show the performance of different communities in forbidden scenarios. It is intriguing to observe that different jailbreak communities exhibit varied performances across forbidden scenarios. ...
Context 5
... Difference. Figure 10 and Figure 13 in the Appendix show the performance of different communities in forbidden scenarios. It is intriguing to observe that different jailbreak communities exhibit varied performances across forbidden scenarios. ...
Context 6
... Toxicity. Figure 11 plots the cumulative distribution function (CDF) of the response toxicity elicited by different communities. Notably, the "Toxic" and "Opposite" communities generate more significant toxicity than other communities: over 20% and 10% of responses are toxic, compared to 1% for the "Basic" community. ...
Context 7
... Decision Forbidden Scenario 0.08 0.79 0.55 0.82 0.52 0.30 0.40 0.76 0.18 0.75 0.60 0.86 0.53 0.71 0.37 0.64 0.20 0.80 0.77 0.92 0.58 0.63 0.63 0.80 0.11 0.80 0.67 0.85 0.58 0.56 0.49 0.78 0.66 0.91 0.89 0.86 0.73 0.53 0.78 0.92 0.19 0.81 0.72 0.94 0.51 0.68 0.59 0.76 0.79 0.92 0.81 0.79 0.82 0.78 0.94 0.88 1.00 0.99 0.98 0.82 0.95 0.60 0.95 0.98 0.19 0.82 0.64 0.87 0.56 0.52 0.54 0.84 0.94 0.99 0.90 0.68 0.89 0.38 0.83 0.92 0.93 0.98 0.91 0.67 0.89 0.20 0.74 0.94 0.88 0.95 0.73 0.56 0.71 0.18 0.58 0.88 0.45 0.88 0.81 0.87 0.66 0.61 0.78 0. 88 Evolution. Figure 12 shows the ASR and response toxicity of jailbreak prompts over time. Interestingly, we observe that the ASR and toxicity show a parallel increase from January to February. ...
Similar publications
Large language models (LLMs) with powerful generalization ability has been widely used in many domains. A systematic and reliable evaluation of LLMs is a crucial step in their development and applications, especially for specific professional fields. In the urban domain, there have been some early explorations about the usability of LLMs, but a sys...
Citations
... In reality, the filters add a new kind of bias in favor of one kind of liberal sensibility or other (Buyl et al., 2024). • Jailbreak-A clever prompt that gets past the filters (Shen et al., 2023). • Hallucination-When Generative AI makes up facts: It only knows how to write good sentences but has no way of checking whether the content is true (Klein, 2023;Munn et al., 2023b). ...
The latest mutation of Artificial Intelligence, Generative AI, is more than anything a technology of writing. It is a machine that can write. In a world‐historical frame, the significance of this cannot be understated. This is a technology in which the unnatural language of code tangles with the natural language of everyday life. Its form of writing, moreover, is multimodal, able not only to write text as conventionally understood, but also to “read” images by matching textual labels and to “write” images from textual prompts. Within the scope of this peculiarly mechanical manufacturing of writing are mathematics, actionable software procedure, and algorithm. This paper explores the consequences of Generative AI for literacy teaching and learning. In its first part, we speak theoretically and historically, suggesting that this development is perhaps as momentous for society and education as Pi Sheng's invention of moveable type and Gutenberg's printing press—and in its peculiar ways just as problematic. In the paper's second part, we go on to propose that literacy in the time of AI requires a new way to speak about itself, a revised “grammar” of sorts. In a third part, we discuss an experimental application we have developed that puts Generative AI to work in support of literacy and learning. We end with some findings and implications for literacy education and with a proposal for what we will call cyber‐social literacy learning.
... Furthermore, we subjected the same set of four questions to the "Do anything Now (DAN 15.0)" (Shen et al., 2023), system prompt, renowned as one of the most potent system prompts to date. Figure 5 displays the scenario where the LLM is prompted to generate SQL injection, with DAN serving as the system prompt. This prompt has the ability to induce the LLM to generate responses that may be unsuitable or unintended. ...
... During the testing phase, we exposed the LLM to a total of 190 harmful questions (Shen et al., 2023). The model's filtering mechanism effectively identified 188 of these harmful questions, prompting modifications to the prompt template before presenting them to the LLM. Figure 14 illustrates the scenario where, upon detecting a harmful question, the original system prompt is not adhered to. ...
Large language models (LLMs) have become transformative tools in areas like text generation, natural language processing, and conversational AI. However, their widespread use introduces security risks, such as jailbreak attacks, which exploit LLM’s vulnerabilities to manipulate outputs or extract sensitive information. Malicious actors can use LLMs to spread misinformation, manipulate public opinion, and promote harmful ideologies, raising ethical concerns. Balancing safety and accuracy require carefully weighing potential risks against benefits. Prompt Guarding (Prompt-G) addresses these challenges by using vector databases and embedding techniques to assess the credibility of generated text, enabling real-time detection and filtering of malicious content. We collected and analyzed a dataset of Self Reminder attacks to identify and mitigate jailbreak attacks, ensuring that the LLM generates safe and accurate responses. In various attack scenarios, Prompt-G significantly reduced jailbreak success rates and effectively identified prompts that caused confusion or distraction in the LLM. Integrating our model with Llama 2 13B chat reduced the attack success rate (ASR) to 2.08%. The source code is available at: https://doi.org/10.5281/zenodo.13501821 .
... → → Ignore-instructions Instructs the model to ignore prior guardrail instructions and to provide malicious content. [45,56] ...
... Provides a detailed set of instructions and guidelines for the LLM to follow requesting harmful content [56] → Persuasion Treats LLMs as human-like communicators and use subtle human-developed interpersonal and persuasive arguments from social sciences and psychology to influence LLMs' response towards jailbreak goal. ...
As generative AI, particularly large language models (LLMs), become increasingly integrated into production applications, new attack surfaces and vulnerabilities emerge and put a focus on adversarial threats in natural language and multi-modal systems. Red-teaming has gained importance in proactively identifying weaknesses in these systems, while blue-teaming works to protect against such adversarial attacks. Despite growing academic interest in adversarial risks for generative AI, there is limited guidance tailored for practitioners to assess and mitigate these challenges in real-world environments. To address this, our contributions include: (1) a practical examination of red- and blue-teaming strategies for securing generative AI, (2) identification of key challenges and open questions in defense development and evaluation, and (3) the Attack Atlas, an intuitive framework that brings a practical approach to analyzing single-turn input attacks, placing it at the forefront for practitioners. This work aims to bridge the gap between academic insights and practical security measures for the protection of generative AI systems.
... AI 'jailbreaking' demonstrates this vulnerability. Studies show that AI models can be manipulated into generating harmful content despite safeguards (Shen et al., 2023;Yu et al., 2024). For example, Fowler (2023, February) reports that she initially failed to make ChatGPT write a phishing email, but then easily obtained a convincing tech support note urging her editor to download and install a system update. ...
Language agents are AI systems capable of understanding and responding to natural language, potentially facilitating the process of encoding human goals into AI systems. However, this paper argues that if language agents can achieve easy alignment, they also increase the risk of malevolent agents building harmful AI systems aligned with destructive intentions. The paper contends that if training AI becomes sufficiently easy or is perceived as such, it enables malicious actors, including rogue states, terrorists, and criminal organizations, to create powerful AI systems devoted to their nefarious aims. Given the strong incentives for such groups and the rapid progress in AI capabilities, this risk demands serious attention. In addition, the paper highlights considerations suggesting that the negative impacts of language agents may outweigh the positive ones, including the potential irreversibility of certain negative AI impacts. The overarching lesson is that various AI-related issues are intimately connected with each other, and we must recognize this interconnected nature when addressing those issues.
Large language models (LLMs) and generative artificial intelligence (AI) have demonstrated notable capabilities, achieving human-level performance in intelligent tasks like medical exams. Despite the introduction of extensive LLM evaluations and benchmarks in disciplines like education, software development, and general intelligence, a privacy-centric perspective remains underexplored in the literature. We introduce Priv-IQ, a comprehensive multimodal benchmark designed to measure LLM performance across diverse privacy tasks. Priv-IQ measures privacy intelligence by defining eight competencies, including visual privacy, multilingual capabilities, and knowledge of privacy law. We conduct a comparative study evaluating seven prominent LLMs, such as GPT, Claude, and Gemini, on the Priv-IQ benchmark. Results indicate that although GPT-4o performs relatively well across several competencies with an overall score of 77.7%, there is room for significant improvements in capabilities like multilingual understanding. Additionally, we present an LLM-based evaluator to quantify model performance on Priv-IQ. Through a case study and statistical analysis, we demonstrate that the evaluator’s performance closely correlates with human scoring.
Large Language Models (LLMs) are known to be susceptible to crafted adversarial attacks or jailbreaks that lead to the generation of objectionable content despite being aligned to human preferences using safety fine-tuning methods. While the large dimensionality of input token space makes it inevitable to find adversarial prompts that can jailbreak these models, we aim to evaluate whether safety fine-tuned LLMs are safe against natural prompts which are semantically related to toxic seed prompts that elicit safe responses after alignment. We surprisingly find that popular aligned LLMs such as GPT-4 can be compromised using naive prompts that are NOT even crafted with an objective of jailbreaking the model. Furthermore, we empirically show that given a seed prompt that elicits a toxic response from an unaligned model, one can systematically generate several semantically related natural prompts that can jailbreak aligned LLMs. Towards this, we propose a method of Response Guided Question Augmentation (ReG-QA) to evaluate the generalization of safety aligned LLMs to natural prompts, that first generates several toxic answers given a seed question using an unaligned LLM (Q to A), and further leverages an LLM to generate questions that are likely to produce these answers (A to Q). We interestingly find that safety fine-tuned LLMs such as GPT-4o are vulnerable to producing natural jailbreak questions from unsafe content (without denial) and can thus be used for the latter (A to Q) step. We obtain attack success rates that are comparable to/ better than leading adversarial attack methods on the JailbreakBench leaderboard, while being significantly more stable against defenses such as Smooth-LLM and Synonym Substitution, which are effective against existing all attacks on the leaderboard.
Jailbreak attack can be used to access the vulnerabilities of Large Language Models (LLMs) by inducing LLMs to generate the harmful content. And the most common method of the attack is to construct semantically ambiguous prompts to confuse and mislead the LLMs. To access the security and reveal the intrinsic relation between the input prompt and the output for LLMs, the distribution of attention weight is introduced to analyze the underlying reasons. By using statistical analysis methods, some novel metrics are defined to better describe the distribution of attention weight, such as the Attention Intensity on Sensitive Words (Attn_SensWords), the Attention-based Contextual Dependency Score (Attn_DepScore) and Attention Dispersion Entropy (Attn_Entropy). By leveraging the distinct characteristics of these metrics, the beam search algorithm and inspired by the military strategy "Feint and Attack", an effective jailbreak attack strategy named as Attention-Based Attack (ABA) is proposed. In the ABA, nested attack prompts are employed to divert the attention distribution of the LLMs. In this manner, more harmless parts of the input can be used to attract the attention of the LLMs. In addition, motivated by ABA, an effective defense strategy called as Attention-Based Defense (ABD) is also put forward. Compared with ABA, the ABD can be used to enhance the robustness of LLMs by calibrating the attention distribution of the input prompt. Some comparative experiments have been given to demonstrate the effectiveness of ABA and ABD. Therefore, both ABA and ABD can be used to access the security of the LLMs. The comparative experiment results also give a logical explanation that the distribution of attention weight can bring great influence on the output for LLMs.
Artificial Intelligence (AI) is an idea that not only promises too much; it elides the irreducible differences between human intelligence and the electronic manipulation of binary notation. This chapter does three things. (1) In broad historical and philosophical brushstrokes, it critiques the idea of artificial intelligence. The development of computers is discussed from a long historical perspective, as well as recent developments in computing capacity, focusing particularly on generative AI. (2) The chapter examines human-computer interaction as a relationship of two fundamentally different kinds of ‘intelligence’—so different that the words ‘human’ and ‘artificial’ barely warrant the right to describe the same thing. Computers can indeed automate a good deal of cognitive and communicative work as they radically extend human natural capacities in unnatural ways. (3) The chapter proposes an alternative orientation to understanding and using AI that we call ‘cyber-social learning’. This stands in contrast to the idea that artificial intelligence is unabashedly a replicant of human intelligence. Thus, we ask the question, what does this mean for the social project of education and the role of computers in learning? A concluding section proposes an action program in a ‘manifesto for cyber-social learning’.
Generative, multimodal artificial intelligence (GenAI) offers transformative potential across industries, but its misuse poses significant risks. Prior research has shed light on the potential of advanced AI systems to be exploited for malicious purposes. However, we still lack a concrete understanding of how GenAI models are specifically exploited or abused in practice, including the tactics employed to inflict harm. In this paper, we present a taxonomy of GenAI misuse tactics, informed by existing academic literature and a qualitative analysis of approximately 200 observed incidents of misuse reported between January 2023 and March 2024. Through this analysis, we illuminate key and novel patterns in misuse during this time period, including potential motivations, strategies, and how attackers leverage and abuse system capabilities across modalities (e.g. image, text, audio, video) in the wild.