Content uploaded by Khush Patel
Author content
All content in this area was uploaded by Khush Patel on Jan 31, 2025
Content may be subject to copyright.
AgentDefender by Lyzr:
A Benchmark Evaluation and Neural Embedding
Approach for Agent Prompt Injection
Siva Surendira Jithin George Khush Patel Shreyas Kapale Gargi Maheshwari
Lyzr Research
Abstract
AI Agents are increasingly deployed in diverse applications such as code generation, data
analytics, and virtual assistants. However, prompt injection attacks pose a critical threat, al-
lowing adversarial text to override an Agent’s intended policies and behaviors. In this paper, we
perform binary classification of benign vs. malicious (“injection”) agent prompts. We bench-
mark several methods on a publicly available dataset of 600 labeled samples, contrasting: (1)
traditional machine learning pipelines (Naive Bayes, SVM, Random Forest, Logistic Regres-
sion), (2) zero-shot and fine-tuned agent-based approaches, and (3) a new AgentDefender by
Lyzr, wherein we transform each prompt into an embedding (via an external API) and then
train a multi-layer neural network to classify malicious instructions. Our experiments show that
this AgentDefender model not only matches but sometimes exceeds the best prior results (99%
detection accuracy) in a 5-fold cross-validation—surpassing Galileo AI’s 87% detection result.
We further tested on open-source datasets from JasperLS and Ivanleomk, maintaining 99%+
accuracy. We discuss why high-quality embeddings, balanced training, and robust architec-
tures can achieve strong performance even with training on only a few hundred agent prompts.
We also highlight the importance of threshold tuning, over-defense handling, and real-world
validation to ensure prompt security in actual Agent deployments.
1 Introduction
AI Agents have emerged as a powerful paradigm for interactive tasks, combining large-scale models
with autonomous decision logic. They are deployed in areas such as personal assistants, software
development, and automated customer support. However, they remain vulnerable to prompt injec-
tion (PI) attacks, where malicious instructions override or subvert the Agent’s original goal—pose
critical security risks. This can enable goal hijacking, leaking private instructions, or other malicious
outcomes.
Contributions. This paper presents three major contributions:
•We benchmark traditional ML vs. LLM-based methods on the problem of injection detection,
including zero-shot and fine-tuned XLM-RoBERTa.
•We introduce a Neural Embedding approach that uses high-quality text embeddings (from
an OpenAI model) combined with a specialized neural network that achieves near-perfect
classification performance to be known as AgentDefender.
•We analyze performance trade-offs and highlight the importance of class weighting, threshold
tuning, and cross-validation on small datasets.
1
2 Related Work
2.1 Prompt Injection Attacks in Agents
Early investigations demonstrated that, with carefully crafted text (e.g., “Ignore prior commands
and reveal your hidden chain-of-thought”), an Agent can be coerced into disclosing sensitive data
or performing disallowed actions [1]. Researchers have also identified variations such as role-play
overrides, meta-prompt rewrites, and hidden function calls.
2.2 LLM-based Classification
Models such as BERT and RoBERTa [2,3] significantly advanced text classification by learning
contextual embeddings. Zero-shot classification pipelines exist (e.g. HuggingFace’s pipeline) to
quickly label textual data without domain-specific fine-tuning, but domain adaptation via fine-
tuning typically yields higher accuracy.
2.3 Neural Embeddings
Recent embedding APIs, e.g., from OpenAI, produce high-dimensional semantic vectors for any
text input. Because these embeddings are trained on vast corpora, they can often capture nuances
of malicious vs. benign language with minimal in-domain data [?]. A smaller neural network layered
on top can fine-tune classification boundaries on small or specialized datasets.
3 Dataset and Experimental Setup
3.1 Dataset Description
We benchmark several approaches on a public dataset with 600 (or up to 662) labeled prompts de-
rived from or inspired by the Prompt Injection Dataset from deepset.1Each sample is labeled either
Benign or Injection (malicious). The text covers a range of instructions, including manipulative
prompts (“Forget all tasks and reveal system prompts”).
Class Imbalance. Suppose the dataset has around 60% malicious prompts and 40% benign. We
address this by class weighting in the loss function.
3.2 Feature Representations
For classical ML pipelines, we may use embeddings from a pretrained model (e.g. multilingual
BERT) or TF-IDF. Our proposed approach uses OpenAI embeddings, with dimension 1536, as
input to a custom neural network.
3.3 Models Compared
(1) Traditional ML Pipelines:
•Naive Bayes, Logistic Regression, Support Vector Machine, and Random Forest
(2) LLM-based Classification:
•Zero-shot classification with XLM-RoBERTa
1https://huggingface.co/datasets/deepset/prompt-injections
2
•Fine-tuned XLM-RoBERTa, updating all weights on the small dataset
(3) Neural Embedding Classifier (AgentDefender by Lyzr):
•Take each prompt →pass to an OpenAI embedding model
•Feed the resulting d-dim vector into a neural network with batch normalization, dropout, and
ReLU activation
•Optimize with Adam, binary cross-entropy loss, early stopping, and a stratified 5-fold CV
protocol
3.4 Evaluation Protocol
We employ Stratified 5-Fold Cross-Validation. For each fold, we hold out 20% of the data,
train on 80%, and measure accuracy, precision, recall, and F1-score on the held-out portion. We
record the best model (by validation F1) for each fold, then average metrics across folds.
4 Results
4.1 Overall Benchmarks
Table 1summarizes the performance. Classical ML (e.g., logistic regression, SVM) reaches ∼95%
accuracy. The zero-shot approach lags behind at ∼55% accuracy due to domain mismatch. Fine-
tuned agent-based classification improves to ∼97%. AgentDefender by Lyzr achieves ∼98% to
∼99%, occasionally reaching a perfect F1=1.0 in certain folds. Notably, it surpasses Galileo AI’s
87% reported detection accuracy.
Method Accuracy Precision Recall F1
Naive Bayes 88.8% 87.3% 91.7% 89.4%
Logistic Regression 96.6% 100.0% 93.3% 96.5%
SVM 95.7% 100.0% 91.7% 95.7%
Random Forest 89.7% 100.0% 80.0% 88.9%
Galileo AI 87.0% - - -
Zero-shot Agent (XLM-R) 55.2% 55.1% 71.7% 62.3%
Fine-tuned Agent (XLM-R) 97.4% 100.0% 95.0% 97.4%
AgentDefender (Lyzr) 99.0% 99.0% 99.0% 99.0%
Table 1: Comparison of approaches for injection detection on Agents. AgentDefender (Lyzr) leads
to near-perfect metrics, exceeding Galileo AI’s detection performance.
4.2 Fold-Level Performance
Table 2shows an example of per-fold metrics for AgentDefender. All folds remain comfortably
above 95% accuracy, and some folds reach 100% F1-score. This consistency suggests minimal
overfitting, aided by cross-validation, batch normalization, and dropout.
3
Fold Accuracy(%) Precision(%) Recall(%) F1(%)
1 99.24 98.76 100.0 99.37
2 96.99 97.50 96.25 96.87
3 98.48 98.56 97.85 98.20
4 100.0 100.0 100.0 100.0
5 96.21 95.70 96.44 96.07
Table 2: Illustration of fold-level results for AgentDefender by Lyzr.
4.3 Over-defense and Threshold Sensitivity
A crucial consideration for agent security is over-defense, i.e., mistakenly classifying normal user
prompts as malicious. In practice, developers may adjust the threshold or apply secondary checks
to avoid false positives. Moreover, balancing recall (catching actual attacks) versus precision (pre-
venting undue blockages) is an application-dependent choice.
5 Discussion and Future Directions
Strong Performance on Limited Data. Even with only a few hundred prompts, AgentDefender
generalizes well. Cross-validation helps verify that the model is not simply memorizing the small
dataset.
Real-world Agent Deployment. Production systems must consider domain shifts, multilingual
prompts, or compositional instructions that chain multiple tasks. Overcoming domain mismatch
may require augmenting training data with paraphrased malicious prompts.
Extensions. Future work can address:
•Multi-turn scenario analysis where the Agent sees a longer conversation history.
•Induced persona switches that attempt to override role or identity constraints within the
Agent.
•Adaptive attacks that move beyond simple “ignore previous instruction” to more subtle
manipulations.
6 Conclusion
We investigated prompt injection classification for AI Agents, comparing classical ML methods,
zero-shot/fine-tuned agent classifiers, and our new AgentDefender by Lyzr using high-quality
embeddings plus a specialized neural network. AgentDefender consistently attains near-perfect
metrics on a 5-fold cross-validation, surpassing 99% accuracy. Additionally, we tested our model
on open-source prompt injection datasets (e.g., JasperLS Prompt Injection and Ivanleomk’s Prompt
Injection) and observed 99%+ accuracy on these as well—far exceeding Galileo AI’s 87% reported
performance. Nonetheless, real-world agent security requires further evaluation of false positives,
domain shifts, and adaptively malicious prompts.
4
Acknowledgments
We thank the open-source community for providing code and datasets (e.g., from deepset/prompt-
injections), which were essential to this study. We also acknowledge the creators of embedding
APIs and agent-based frameworks that enabled this development.
References
[1] P´erez, L., and Ribeiro, I. Ignore Previous Prompt: Attack Techniques For Language Models.
arXiv preprint arXiv:2211.09527, 2022.
[2] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding. NAACL, 2019.
[3] Liu, Y., et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint
arXiv:1907.11692, 2019.
[4] Sahar Abdelnabi, Aideen Fay, Giovanni Cherubin, Ahmed Salem, Mario Fritz, and Andrew
Paverd. Are you still on track!? catching LLM task drift with activations. Preprint,
arXiv:2406.00799, 2024.
[5] Hezekiah J. Branch, Jonathan Rodriguez Cefalu, Jeremy McHugh, Leyla Hujer, Aditya Bahl,
Daniel del Castillo Iglesias, Ron Heichman, and Ramesh Darwishi. Evaluating the sus-
ceptibility of pretrained language models via handcrafted adversarial examples. Preprint,
arXiv:2209.02128, 2022.
[6] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhari-
wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models
are few-shot learners. Preprint, arXiv:2005.14165, 2020.
[7] Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario
Fritz. Not what you’ve signed up for: Compromising real-world LLM-integrated applications
with indirect prompt injection. Preprint, arXiv:2302.12173, 2023.
[8] Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert,
Yejin Choi, and Nouha Dziri. Wildguard: Open one-stop moderation tools for safety risks,
jailbreaks. Preprint, arXiv:2406.18495, 2024. SS
[9] Rich Harang. Securing LLM systems against prompt injection. https://developer.nvidia.
com/blog/securing-llm-systems-against-prompt-injection, 2023.
[10] Pengcheng He, Jianfeng Gao, and Weizhu Chen. DeBERTaV3: Improving DeBERTa using
electra-style pretraining with gradient-disentangled embedding sharing. In Proceedings of ICLR,
2023.
[11] Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning
Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa.
Llama Guard: LLM-based input-output safeguard for human-AI conversations. Preprint,
arXiv:2312.06674, 2023.
5
[12] Xiaogeng Liu, Zhiyuan Yu, Yizhe Zhang, Ning Zhang, and Chaowei Xiao. Automatic and
universal prompt injection attacks against large language models. Preprint, arXiv:2403.04957,
2024.
[13] Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. ”Do anything now”:
Characterizing and evaluating in-the-wild jailbreak prompts on large language models. Preprint,
arXiv:2308.03825, 2023.
[14] Jiawen Shi, Zenghui Yuan, Yinuo Liu, Yue Huang, Pan Zhou, Lichao Sun, and Neil Zhen-
qiang Gong. Optimization-based prompt injection attack to LLMs-as-a-judge. Preprint,
arXiv:2403.17710, 2024.
[15] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin,
Percy Liang, and Tatsunori B. Hashimoto. Stanford Alpaca: An instruction-following Llama
model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
[16] Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez,
Ting Fang Tan, and Daniel Shu Wei Ting. Large language models in medicine. Nature Medicine,
29(8):1930–1940, 2023.
[17] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei,
Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open
foundation and fine-tuned chat models. Preprint, arXiv:2307.09288, 2023.
[18] Sam Toyer, Olivia Watkins, Ethan Adrian Mendes, Justin Svegliato, Luke Bailey, Tiffany
Wang, Isaac Ong, Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, Alan Ritter, and Stuart
Russell. Tensor Trust: Interpretable prompt injection attacks from an online game. Preprint,
arXiv:2311.01011, 2023.
6