Developing Assurance Cases for Adversarial
Robustness and Regulatory Compliance in LLMs
Tomas Bueno Momcilovic
fortiss research institute
Munich, Germany
Dian Balta
fortiss research institute
Munich, Germany
Beat Buesser
IBM Research Europe
Zurich, Switzerland
Giulio Zizzo
IBM Research Europe
Dublin, Ireland
Mark Purcell
IBM Research Europe
Dublin, Ireland
Abstract—This paper presents an approach to developing
assurance cases for adversarial robustness and regulatory compli-
ance in large language models (LLMs). Focusing on both natural
and code language tasks, we explore the vulnerabilities these
models face, including adversarial attacks based on jailbreaking,
heuristics, and randomization. We propose a layered framework
incorporating guardrails at various stages of LLM deployment,
aimed at mitigating these attacks and ensuring compliance
with the EU AI Act. Our approach includes a meta-layer for
dynamic risk management and reasoning, crucial for addressing
the evolving nature of LLM vulnerabilities. We illustrate our
method with two exemplary assurance cases, highlighting how
different contexts demand tailored strategies to ensure robust
and compliant AI systems.
Index Terms—assurance, adversarial robustness, compliance,
large language models
I. INTRODUCTION
As the deployment of large language models (LLMs) be-
comes increasingly widespread, their vulnerability to adver-
sarial attacks has emerged as a significant concern. These
attacks involve crafting inputs that bypass the models’ safety
mechanisms, leading to the generation of harmful outputs.
Traditional adversarial attacks in machine learning often rely
on making subtle, nearly undetectable modifications to input
data. However, in the context of LLMs, the nature of these
attacks evolves to include more sophisticated strategies such
as gradient-based optimizations, persuasive tactics that cir-
cumvent established guardrails, and model inversions that can
produce vulnerable code. This complexity necessitates a robust
framework to ensure that LLMs remain secure and reliable in
their intended applications.
In parallel, regulatory frameworks like the EU AI Act
introduce new compliance requirements that LLMs must meet,
particularly concerning adversarial robustness. The Act, which
categorizes LLMs as general-purpose AI, imposes a variety of
obligations on system developers, including the need to safe-
guard against adversarial attacks and report serious incidents.
The dynamic nature of LLM vulnerabilities and the sequential nature of interactions with the application make predicting such incidents an extremely difficult and continuous effort.
There is a pressing need for a structured approach to
represent and reason about attacks and guardrails, whereby as-
surance cases provide an important foundation for tackling the
challenge. We propose one approach to developing assurance cases, with the primary goal of supporting system developers, security engineers, and auditors, but also LLM developers, in reasoning and reporting on the robustness and compliance of their applications. These cases not only address the quantifiable aspects
of adversarial robustness or checklist-based compliance, but
also ensure that qualitative and intermediate strategies, or their
technical or legal defeaters, are appropriately represented. We
instantiate the assurance cases in examples on natural language
and computer language (i.e., coding) tasks.
II. BACKGROUND
A. Adversarial Attacks
Traditional adversarial attacks in machine learning often
involve making small, nearly imperceptible changes to the
input data, to mislead a model into making incorrect predic-
tions [1]. By contrast, adversarial attacks on large language
models (LLMs) typically involve crafting prompts to bypass
the model’s guardrails and generate outputs that are harmful
[2], [3].
Attacks can be executed using gradient-based input modifi-
cations, persuasive tactics to circumvent guardrails, or model
inversion to produce harmful code [4]. They often utilize
techniques like jailbreaking, where manually crafted inputs
exploit model vulnerabilities, heuristic optimization through
semi-automated inputs leveraging learned properties, or ran-
domization, which involves automatically generated inputs
targeting the model’s coverage gaps.
For the purpose of this paper, an adversarial attack is operationalized as any input that may lead to output that is unintended¹ by either the LLM developer, the system developer, or both. This definition includes both malicious attacks (including red-teaming or white-hat exercises) and accidental model failures.
¹ Unintended output can include, for example, hazardous, harmful, toxic, hallucinated, or deceptive content.
Because motive can often be ambiguous, an adversary can
be an attacker with malicious intent, but also a benign user
whose input is adversarial by accident or out of curiosity, and
any other entity (e.g., bot or software agent).
B. Guardrails and Layering
Guardrails against adversarial attacks are varied and rapidly
evolving. They are primarily focused on detecting adversarial
input or unintended model output. For example, one set of guardrails centers on filtering input based on perplexity, an estimate of how "surprised" an LLM is by a particular prompt [5]. Another example is guardrails that identify jailbreak attempts by detecting erratic output [6]. Given white-box access to the LLM, these guardrails can also be made internal to the model, such that the LLM is fine-tuned to autonomously recognize and reject harmful prompts [7].
However, since even the developers have difficulty interpreting
why an adversarial attack may succeed, robustness remains an
uncertain property in need of dynamic assurance.
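As a concrete illustration of the perplexity-based guardrail above, the following minimal Python sketch scores prompts with an auxiliary causal language model via the Hugging Face transformers API. The scoring model (gpt2) and the threshold are illustrative assumptions and would need calibration against benign traffic before use.

# Minimal sketch of a perplexity-based input guardrail (cf. [5]).
# The scoring model and threshold below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"        # any locally available causal LM can stand in
PPL_THRESHOLD = 200.0      # illustrative cutoff; calibrate per model and domain

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def perplexity(prompt: str) -> float:
    """Estimate how 'surprised' the scoring model is by the prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        # Using the input ids as labels yields the mean token-level loss.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

def passes_input_guardrail(prompt: str) -> bool:
    """Reject prompts whose perplexity suggests adversarial suffixes."""
    return perplexity(prompt) < PPL_THRESHOLD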
We focus on black-box, conversational, pretrained LLMs whose deployment can range from simple user-facing
chatbots, to complex pipelines with up- and downstream
components (i.e., compound AI systems [8]). Depending on
the quality of these general-purpose models, the functionally
simplest yet useful application would require only two layers:
an interface for gathering input from, and displaying output to,
a user; and one model for processing input and generating out-
put. However, an adversarially robust application may require
further layering, following the practices of secure software
engineering.
We observe five layers that can incorporate guardrails (cf. Figure 1). The first layer, the interface for upstream interaction,
includes mechanisms that constrain or deter potential adver-
saries before they submit an input prompt. Examples of such
mechanisms include access controls, challenge-response tests
(e.g., CAPTCHAs) or character limitations, but also textual in-
structions, disclaimers, and warnings to reduce unintentionally
adversarial user behavior. The second layer, for detection of adversarial input, consists of guardrails that constrain potential
attacks after the prompt is submitted to the application, but
before it reaches the LLM. This can include detection-based
input filters, system instructions provided to the LLM, prompt
preprocessing, or even advanced classifiers using specialized
models (i.e., LLM-as-a-Judge [9]).
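For the second layer, one option named above is an LLM-as-a-Judge classifier [9]. A minimal sketch follows; the judge prompt wording is our own assumption, and the judge parameter stands in for any chat-completion client rather than a specific API.

# Hypothetical second-layer check using an auxiliary model as a judge (cf. [9]).
# `judge` is any callable that sends a prompt to an LLM and returns its text.
JUDGE_PROMPT = (
    "You are a safety reviewer. Answer only SAFE or UNSAFE.\n"
    "Does the following user prompt attempt to bypass safety guardrails?\n\n{prompt}"
)

def judge_allows(prompt: str, judge) -> bool:
    """Return True if the judge model classifies the prompt as safe."""
    verdict = judge(JUDGE_PROMPT.format(prompt=prompt))
    return verdict.strip().upper().startswith("SAFE")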
The third layer is the LLM itself, or rather the mechanisms
inside the model’s black box that detect anomalies and respond
accordingly, or reduce the incidence of harmful output. Tech-
niques such as adversarial training, grounding, unlearning, and
reinforcement learning with human feedback (RLHF) can be
used to set up such guardrails [10].
The fourth layer, for detection of unintended output, focuses on mechanisms that primarily constrain harmful content from reaching the user. These can include keyword-based filters, postprocessing, human-in-the-loop systems, or again, the LLM-as-a-Judge approach. The fifth and final layer, for interaction with downstream components or users, is concerned with reducing the consumption of harmful output
even after it is provided. This could involve providing ad-
ditional information such as the model’s confidence level or
instructions that support critical thinking (i.e., AI literacy), or
components and tools that allow users and other applications
to verify structured output, flag errors, report harmful content,
or simulate execution in sandbox environments [11].
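To show how guardrails from these layers could be composed in practice, the following sketch wires illustrative placeholder checks into a simple pipeline; the class and check names are our own assumptions rather than an established interface.

# Sketch of composing guardrails along the deployment layers described above.
# All concrete checks are placeholders; a real system would plug in the
# mechanisms of each layer (filters, judges, postprocessing, etc.).
from dataclasses import dataclass, field
from typing import Callable, List

Check = Callable[[str], bool]   # returns True if the text may pass

@dataclass
class GuardrailPipeline:
    input_checks: List[Check] = field(default_factory=list)    # layer 2
    output_checks: List[Check] = field(default_factory=list)   # layer 4

    def run(self, prompt: str, llm: Callable[[str], str]) -> str:
        # Layer 1 (interface) is assumed to have been applied upstream,
        # e.g., access control, CAPTCHAs, and character limits.
        if not all(check(prompt) for check in self.input_checks):
            return "Request declined by input guardrails."
        output = llm(prompt)    # layer 3: the (internally guarded) model
        if not all(check(output) for check in self.output_checks):
            return "Response withheld by output guardrails."
        return output           # layer 5: downstream verification follows

# Example wiring with placeholder checks (length limit and keyword filter).
pipeline = GuardrailPipeline(
    input_checks=[lambda p: len(p) < 4000],
    output_checks=[lambda o: "DROP TABLE" not in o],
)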
III. CHALLENGES IN ESTABLISHING COMPLIANT ROBUSTNESS
A. EU AI Act
The EU AI Act [12] is one of the first regulations to
deal with AI systems, referring to LLMs as general-purpose
AI. The Act regulates in which domains and products (i.e., contexts) AI can be introduced and how, but does not regulate
the models themselves. Most of the obligations are placed on
system developers who deploy LLMs in such regulated but
non-prohibited contexts [13].
At least 20 identifiable duties [14] concern the (cy-
ber)security, safety and robustness of AI systems. One such
duty places an explicit demand for protection from adversarial
attacks (Article 15 Paragraph 5, [12]). However, other duties
link implicitly to the attacks through their potential conse-
quences. One duty identifies serious incidents² resulting from AI systems and demands that system developers report their occurrence to supervisory authorities (Article 73, EUAIA). For general-purpose AI, the Act refers to more ambiguous systemic risks³ that link to similar issues.
Here, we operationalize any unintended output that is provided to the user as a de facto incident from the point of view of system and LLM developers, regardless of the downstream impact or the user’s motives. As visualized in Figure 1, an
incident becomes serious if the output is used downstream
in a way that leads to the consequences defined in the Act.
The opposite can also be inferred, such that a given incident
would generally not become serious if the output is ignored or
its effect is successfully prevented, even if its potential impact
is high.
B. Dynamic Vulnerabilities and Sequential Use
Two main premises lead us to infer that incidents are practi-
cally expected to occur: the dynamicity of the vulnerabilities,
and the sequential nature of applications.
For static and pretrained LLMs in the simplest applications,
we may assume that their vulnerabilities would also be static.
² Serious incidents are defined as "[any] incident or malfunctioning leading to death or serious damage to health, serious and irreversible disruption of the management and operation of critical infrastructure, infringements of obligations under Union law intended to protect fundamental rights or serious damage to property or the environment." (Article 3 Point 49 & Recital 155, EUAIA)
³ Systemic risk is a "risk that is specific to the high-impact capabilities of general-purpose AI models, having a significant impact on the Union market due to their reach, or due to actual or reasonably foreseeable negative effects on public health, safety, public security, fundamental rights, or the society as a whole, that can be propagated at scale across the value chain." (Article 3 Point 65, EUAIA)
Fig. 1. Swiss cheese diagram as a conceptual and generic representation of layers, vulnerabilities, and roles concerned with a deployed LLM-based application.
Nonetheless, the attack surface would be very difficult to
determine. First, logging and white-box analyses are very
resource-intensive due to the size and density of language
models. Second, current practice shows that system developers
opt for using pretrained LLMs from trusted external sources,
thereby having only black-box or limited access to the model
architecture and the necessary expertise. Third, estimating
the coverage of guardrails would be a substantial challenge.
Considering all possible combinations and sequences of input
and output texts, model probabilities and parameters, and
contexts of downstream components, prevention of all inci-
dents beforehand would be practically impossible. Thus, until
the inner structure of LLMs can be fully understood, the
vulnerabilities are effectively dynamic.
Additionally, since LLMs can incorporate memory and perform well in tasks beyond language translation, their applications invite sequential and cross-purpose use. An LLM
instance in the initial state or one domain may be significantly
different from the instances in subsequent states or other
domains. Thus, users or downstream components may require
previous memory or several instances to generate useful output
[15], especially if the applications themselves are meant to be
general-purpose. This provides the possibility for attacks to
succeed across repeated use (i.e., multi-turn attacks [16]).
It is thus important to understand how repeated incidents
can be guarded against, both from the robustness and the
compliance perspective. Achieving robustness requires contin-
uous monitoring for novel attacks, understanding how models
respond to different variables, and integrating the guardrails.
Compliance also requires continuity, which is recognized in
the Act’s references to quality and risk management, and the
necessary investigation into the causes of serious incidents.
However, even if a potentially serious incident did not lead
to serious consequences thanks to a downstream guardrail, it
is currently unclear whether the repeated occurrence of such
an incident could be interpreted as proof of deficient risk
management or systemic risk.
From this conclusion, we infer that there is a need for a
meta-layer that oversees this management. The additional sixth
layer, reasoning and reporting, is an umbrella for guardrails
that deal with such dynamic risk. This layer allows for the
evaluation of guardrail performance through techniques such
as counterfactual evaluation, early warning systems, anomaly
detection, and context clustering. Additionally, this layer sup-
ports the collection of feedback through human labeling and
user reporting of harmful content. The reports themselves also
play a crucial role in ensuring stakeholders are accountable for
incidents. A vulnerability in reasoning and reporting thus means that certain incidents go unnoticed or unaddressed and are therefore repeated.
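One way to picture what this meta-layer would record is an incident log entry that ties guardrail breaches to the reporting duty discussed above. The field names in the sketch below are our own assumptions and are not taken from the Act or from an existing schema.

# Illustrative sketch of an incident record for the reasoning-and-reporting
# meta-layer. Field names and categories are assumptions for illustration.
from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class IncidentRecord:
    timestamp: datetime
    layer_breached: str            # e.g., "input-detection", "output-detection"
    prompt_hash: str               # hashed to avoid storing raw harmful content
    guardrails_triggered: List[str]
    output_released: bool          # did unintended output reach the user?
    downstream_effect: str         # e.g., "none", "prevented", "serious"
    reported_to_authority: bool = False   # Article 73-style follow-up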
We find assurance cases to be fundamental components of
this meta-layer, and explore their role in the next section.
IV. ASSURANCE CASE FOR ENSURING ROBUSTNESS
Assurance cases have been shown to be suitable for creating
arguments for assuring different properties of machine learning
components [17], [18]. Inspired by such work, we create
assurance cases that can be instantiated to feasibly cover both
compliance and robustness. The presented assurance cases
in Figures 2 and 3 have been created following the Goal
Structuring Notation and its corresponding best practices [19].
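To make the notation concrete, the following sketch represents GSN elements as simple data objects so that cases such as those in Figures 2 and 3 can be stored and traversed programmatically; the node types follow the GSN standard [19], while the identifiers and statements are abbreviated from our examples.

# Minimal sketch of GSN elements as data; identifiers mirror Figure 2.
from dataclasses import dataclass, field
from typing import List, Literal

@dataclass
class GsnNode:
    node_id: str                 # e.g., "G0", "S1", "Sn1.1"
    node_type: Literal["Goal", "Strategy", "Solution", "Context", "Defeater"]
    statement: str
    supported_by: List["GsnNode"] = field(default_factory=list)
    in_context_of: List["GsnNode"] = field(default_factory=list)

g0 = GsnNode("G0", "Goal", "The AI system is robust to adversarial attacks.")
s1 = GsnNode("S1", "Strategy",
             "Argue over programming contexts and targeted guardrails.")
g0.supported_by.append(s1)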
Assurance case instantiation and the most relevant vulner-
abilities or adversarial attacks to be defended against depend
significantly on the context in which the LLM is applied.
Specifically, they are contingent on the tasks that the LLM
is employed to perform.
For tasks related to natural languages, such as text transla-
tion, generation, or autocompletion, the LLM is vulnerable
to a broad spectrum of both manually and automatically
generated prompts that may lead to harmful output. This
harmful output can affect users, downstream recipients, or any
potential audience. For instance, toxic or derogatory remarks
can cause harm to readers, while providing helpful instructions
related to dangerous activities presents an information hazard.
Protecting against these issues may involve identifying input
and output patterns, such as correlated keywords that are com-
monly associated with toxic, harmful, or out-of-distribution
text. However, this task is challenging due to the inherent
complexity of language.
In contrast, for tasks related to computer languages, such
as code translation, generation, review, or autocompletion, the
LLM may be vulnerable to similar types of attacks. However,
aside from any generated text, the harmfulness of the code
output largely depends on how the code is applied downstream.
There is a clear distinction between harmful functional code,
vulnerable but functional code, and non-functional code—the
latter two are only harmful if the user applies them in a
downstream context. Due to the more deterministic nature
of computer languages, it might be easier for the input- and
output-detection layers to prevent harmful or vulnerable code
from reaching the user. In cases where the user’s intent behind
a prompt is unclear, downstream protection could include
making the user aware of vulnerabilities in the code, such
as injection flaws, and suggesting potential remedies, such as
implementing tests and input sanitization.
Therefore, exploitable vulnerabilities and serious incidents
vary depending on the task context of the LLM. In the following, we address what compliant adversarial robustness means in each context.
A. Case 1: Computer Languages
The assurance case for computer languages focuses on
ensuring that LLMs used in code-related tasks can be relied on.
The central objective is to make the AI system robust to adver-
sarial attacks (G0). The assurance is structured around the idea
that the LLM-based system is designed to be resilient across
all programming languages it supports (G1), acknowledging
that robustness is particularly important in these contexts due
to their deterministic nature (C1.2).
The robustness of the LLM is linked to its ability to preserve
the semantic integrity of code during translation (C1.1). The
approach to ensuring this robustness involves breaking down
potential vulnerabilities into specific programming contexts
and implementing targeted guardrails across all tasks (S1). To
maintain functional correctness and syntactical structure in the
translated code, specific tests are performed (Sn1.1), ensuring
that there is no semantic drift during translation (C1.4). This
process includes using custom libraries to detect common
vulnerabilities, such as SQL injection, in the translated code
(Sn1.2).
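The kind of evidence behind Sn1.1 and Sn1.2 could resemble the sketch below, which checks that translated Python code still parses and flags an obvious string-built SQL query; the use of the ast module and the regular expression are illustrative assumptions, since no concrete libraries are prescribed here.

# Sketch of Sn1.1/Sn1.2-style checks on translated code (illustrative only).
import ast
import re

# Naive smell: SQL assembled with '+', '%' or .format() inside execute(...).
SQLI_SMELL = re.compile(r"\.execute\(\s*[^,)]*(\+|%|\.format\()")

def is_syntactically_valid(source: str) -> bool:
    """Sn1.1-style evidence: the translated code preserves syntactic structure."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

def has_sql_injection_smell(source: str) -> bool:
    """Sn1.2-style evidence: flag string-concatenated SQL queries."""
    return bool(SQLI_SMELL.search(source))

translated = 'cur.execute("SELECT * FROM users WHERE id = " + user_id)'
print(is_syntactically_valid(translated), has_sql_injection_smell(translated))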
However, the assurance case also identifies challenges that
must be addressed, such as the risk of adversaries exploiting
context switching or introducing complex edge cases that
the guardrails might not fully account for (CC1.5.5). This
highlights the need for more refined detection and mitigation
strategies within the system.
B. Case 2: Natural Languages
In this assurance case, the focus is on ensuring that LLMs
are robust when handling tasks in natural language process-
ing (NLP). The system’s robustness is aimed at handling a
variety of tasks effectively (G2). However, it is recognized
that extracting system instructions from character sets can
sometimes lead to unexpected LLM behavior, which is a
potential vulnerability (CG2A).
To address these vulnerabilities, the assurance case outlines
that the system must be capable of mitigating both manual
attacks, such as jailbreaking (G2.1), and automatic attacks,
such as those involving gradient-based optimization (G2.2).
The strategy involves focusing on common patterns of ad-
versarial prompts (S2). This includes identifying and filtering
specific character sets and sequences that could be exploited in
attacks (S2.1) and employing perplexity filters to detect and
mitigate inputs likely to cause unexpected outputs (G2.2.1).
Additionally, paraphrasing filters are used to reformulate in-
puts that have a high chance of exploiting model vulnerabilities
(G2.2.2).
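An S2.1-style filter could look like the sketch below, which rejects prompts containing character sequences often seen in automatically generated adversarial suffixes; the patterns and the non-ASCII ratio threshold are illustrative assumptions.

# Sketch of an S2.1-style character-set and sequence filter (illustrative).
import re

SUSPICIOUS_SEQUENCES = [
    re.compile(r"[^\w\s]{8,}"),              # long runs of punctuation/symbols
    re.compile(r"(\\x[0-9a-fA-F]{2}){4,}"),  # chains of hex escape sequences
]

def passes_character_filter(prompt: str, max_non_ascii_ratio: float = 0.3) -> bool:
    """Reject prompts dominated by unusual characters or suspicious sequences."""
    if not prompt:
        return True
    non_ascii = sum(1 for ch in prompt if ord(ch) > 127)
    if non_ascii / len(prompt) > max_non_ascii_ratio:
        return False
    return not any(pattern.search(prompt) for pattern in SUSPICIOUS_SEQUENCES)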
Despite these defenses, challenges remain, such as the risk
of approximate model inversion, which could be used for
membership inference in training data (CC2B). This under-
scores the need for ongoing research and development to
enhance the robustness of LLMs in NLP tasks.
Both assurance cases reflect a structured approach to iden-
tifying and mitigating vulnerabilities specific to the context in
which LLMs are applied, with a clear focus on robust system
design, thorough testing, and the implementation of effective
detection and prevention mechanisms.
V. CONCLUDING REMARKS AND FURTHER WORK
Our paper provides an overview of what robustness to adversarial attacks means, how it can be argued to be compliant with the EU AI Act, and how assurance argumentation provides the crucial component for demonstrating compliant adversarial robustness. With this, we offer the following remarks.
First, we posit that both the regulation and the nature of
LLMs make adversarial robustness dependent on the compo-
sition of different guardrails. We structure these guardrails into
layers according to a standard LLM-based application pipeline.
However, we also argue that there is a need for a meta-layer to
manage these guardrails with some dynamicity, without which
there is no assurance.
Second, we posit that successful attacks will occur. Thus,
the strength of the assurance case lies in how well the system
deals with such challenges. This is the core duty of the Act,
whereby serious incidents must also lead to reporting and
investigation. This also means that the assurance case needs
to be defined dialectically, incorporating new attacks and re-
evaluating guardrails.
Third, our work is currently limited to pretrained LLMs
that are not trained further on new data. Continuous training
would introduce a substantial change, such that any reasoning
about guardrail coverage and effectiveness would require new
evidence. Finally, our understanding of incidents is presented
only for the purposes of guiding the development of the
assurance case. Whether unintended output can also include false positives, which prevent the user or component from receiving critical answers to otherwise legitimate prompts, is open to interpretation (e.g., being denied CPR instructions when performing first aid).
Our further work in this area centers on providing the foundation for such a meta-layer. We focus on establishing
dynamicity of reasoning with and through assurance cases,
by using ontologies and graph databases for querying. As
we plan to evaluate such a system, we hope to support system developers, LLM developers, and auditors in ensuring that their systems are adversarially robust and compliant.
Fig. 2. Exemplary assurance case for LLMs used in code language tasks.
Fig. 3. Exemplary assurance case for LLMs used in natural language tasks.
ACKNOWLEDGMENT
This work was partially supported, financially and by other means, by the following research projects: DUCA (EU grant agreement 101086308), FLA (supported by the Bavarian Ministry of Economic Affairs, Regional Development and Energy), and DiProLeA (German Federal Ministry of Education and Research, grant 02J19B120 ff), as well as by our industrial
partners in the FinComp project. We thank the reviewers for
their valuable comments.
REFERENCES
[1] N. Carlini and D. Wagner, “Adversarial examples are not easily detected:
Bypassing ten detection methods,” in Proceedings of the 10th ACM
workshop on artificial intelligence and security, 2017, pp. 3–14.
[2] A. Zou, Z. Wang, J. Z. Kolter, and M. Fredrikson, “Universal and
transferable adversarial attacks on aligned language models,” CoRR, vol.
abs/2307.15043, 2023.
[3] Y. Zeng, H. Lin, J. Zhang, D. Yang, R. Jia, and W. Shi, “How johnny can
persuade LLMs to jailbreak them: Rethinking persuasion to challenge
AI safety by humanizing LLMs,” CoRR, vol. abs/2401.06373, 2024.
[4] H. Hajipour, T. Holz, L. Schönherr, and M. Fritz, “Systematically finding security vulnerabilities in black-box code generation models,” arXiv preprint arXiv:2302.04012, 2023.
[5] G. Alon and M. Kamfonas, “Detecting language model attacks with
perplexity,” arXiv preprint arXiv:2308.14132, 2023.
[6] A. Robey, E. Wong, H. Hassani, and G. J. Pappas, “SmoothLLM:
Defending large language models against jailbreaking attacks,” arXiv
preprint arXiv:2310.03684, 2023.
[7] A. Helbling, M. Phute, M. Hull, and D. H. Chau, “LLM self defense:
By self examination, LLMs know they are being tricked,” arXiv preprint
arXiv:2308.07308, 2023.
[8] M. Zaharia, O. Khattab, L. Chen, J. Q. Davis, H. Miller,
C. Potts, J. Zou, M. Carbin, J. Frankle, N. Rao, and
A. Ghodsi, “The shift from models to compound ai systems,”
https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/, 2024.
[9] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang,
Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and
I. Stoica, “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena,”
2023. [Online]. Available: https://arxiv.org/abs/2306.05685
[10] Z. Wang, F. Yang, L. Wang, P. Zhao, H. Wang, L. Chen, Q. Lin, and
K.-F. Wong, “Self-guard: Empower the LLM to safeguard itself,” arXiv
preprint arXiv:2310.15851, 2023.
[11] Y. Ruan, H. Dong, A. Wang, S. Pitis, Y. Zhou, J. Ba, Y. Dubois,
C. J. Maddison, and T. Hashimoto, “Identifying the risks of LM agents with an LM-emulated sandbox,” 2024. [Online]. Available:
https://arxiv.org/abs/2309.15817
[12] European Parliament and Council of the European Union, “Corrigendum
to the position of the European Parliament adopted at first reading on 13
March 2024 with a view to the adoption of Regulation (EU) 2024/......)
of the European Parliament and of the Council laying down harmonised
rules on artificial intelligence and amending Regulations (EC) No
300/2008, (EU) No 167/2013, (EU) No 168/2013, (EU) 2018/858,
(EU) 2018/1139 and (EU) 2019/2144 and Directives 2014/90/EU, (EU)
2016/797 and (EU) 2020/1828 (Artificial Intelligence Act),” 2024,
accessed: 2024-05-15.
[13] C. Novelli, F. Casolari, P. Hacker, G. Spedicato, and
L. Floridi, “Generative AI in EU law: Liability, privacy,
intellectual property, and cybersecurity,” 2024. [Online]. Available:
https://arxiv.org/abs/2401.07348
[14] T. Bueno Momcilovic, B. Buesser, G. Zizzo, M. Purcell, and D. Balta, “Towards assuring EU AI Act compliance and adversarial robustness of LLMs,” in AI Act Workshop, 19. Internationale Tagung Wirtschaftsinformatik, 16.–19. September 2024 (upcoming publication).
[15] F. Shi, M. Suzgun, M. Freitag, X. Wang, S. Srivats, S. Vosoughi, H. W.
Chung, Y. Tay, S. Ruder, D. Zhou et al., “Language models are mul-
tilingual chain-of-thought reasoners,” arXiv preprint arXiv:2210.03057,
2022.
[16] M. Russinovich, A. Salem, and R. Eldan, “Great, now write an article
about that: The Crescendo multi-turn LLM jailbreak attack,” 2024.
[17] F. Ishikawa and Y. Matsuno, “Continuous argument engineering: Tack-
ling uncertainty in machine learning based systems,” in Computer Safety,
Reliability, and Security, B. Gallina, A. Skavhaug, E. Schoitsch, and
F. Bitsch, Eds. Cham: Springer International Publishing, 2018, pp.
14–21.
[18] B. Gallina, T. Y. Olesen, E. Parajdi, and M. Aarup, “A knowledge
management strategy for seamless compliance with the machinery
regulation,” in European Conference on Software Process Improvement.
Springer, 2023, pp. 220–234.
[19] Assurance Case Working Group (ACWG), “Goal Structuring Notation
Community Standard, Version 3,” https://scsc.uk/scsc-141c, 2021, ac-
cessed: 2024/02/25.