Michael Backes’s research while affiliated with Helmholtz Center for Information Security and other places

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (410)


Figure 1: Knowledge file data in GPT data supply chain.
Figure 2: The overview of the DSPM-driven risk assessment workflow of GPT knowledge file leakage.
Figure 6: Examples of leaked original knowledge files that have had their copyrights infringed. We only show covers to protect the copyright of these knowledge files.
Figure A1: An example of knowledge file data leaked in Poe.
Figure A3: An example of knowledge file data leaked in metadata. We have blacked out the GPT ID and URL to prevent attributing the GPT.


When GPT Spills the Tea: Comprehensive Assessment of Knowledge File Leakage in GPTs
  • Preprint
  • File available

May 2025 · 3 Reads

Xinyue Shen · Yun Shen · Michael Backes · Yang Zhang

Knowledge files have been widely used in large language model (LLM) agents, such as GPTs, to improve response quality. However, concerns about the potential leakage of knowledge files have grown significantly. Existing studies demonstrate that adversarial prompts can induce GPTs to leak knowledge file content. Yet, it remains uncertain whether additional leakage vectors exist, particularly given the complex data flows across clients, servers, and databases in GPTs. In this paper, we present a comprehensive risk assessment of knowledge file leakage, leveraging a novel workflow inspired by Data Security Posture Management (DSPM). Through the analysis of 651,022 GPT metadata records, 11,820 flows, and 1,466 responses, we identify five leakage vectors: metadata, GPT initialization, retrieval, sandboxed execution environments, and prompts. These vectors enable adversaries to extract sensitive knowledge file data such as titles, content, types, and sizes. Notably, the activation of the built-in tool Code Interpreter leads to a privilege escalation vulnerability, enabling adversaries to directly download original knowledge files with a 95.95% success rate. Further analysis reveals that 28.80% of leaked files are copyrighted, including digital copies from major publishers and internal materials from a listed company. Finally, we provide actionable solutions for GPT builders and platform providers to secure the GPT data supply chain.
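
As a rough illustration of the "metadata" leakage vector described above, the sketch below scans a GPT metadata record for exposed knowledge-file attributes such as title, type, and size. The field names, record structure, and helper name are assumptions for illustration only, not the paper's actual workflow or the GPT platform's schema.

```python
# Hypothetical sketch: flag knowledge-file attributes exposed in a GPT metadata record.
# Field names and structure are assumptions, not the actual platform schema.
SENSITIVE_FIELDS = {"title", "content", "type", "size"}

def find_leaked_fields(metadata: dict) -> dict:
    """Return any knowledge-file attributes present in a GPT metadata record."""
    leaks = {}
    for i, f in enumerate(metadata.get("knowledge_files", [])):
        exposed = {k: v for k, v in f.items() if k in SENSITIVE_FIELDS}
        if exposed:
            leaks[i] = exposed
    return leaks

if __name__ == "__main__":
    record = {"gpt_id": "<redacted>",
              "knowledge_files": [{"title": "internal_report.pdf", "type": "pdf", "size": 102400}]}
    print(find_leaked_fields(record))   # {0: {'title': 'internal_report.pdf', 'type': 'pdf', 'size': 102400}}
```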


Figure 1: A black-box LLM identification scenario.
Figure 2: Passive and proactive methods to identify the origin of black-box LLMs.
Figure 3: Relationship between TRRs and the model weights' cosine distance with the base model. The inset is a zoomed-in subgraph focusing on the region that excludes counterexamples, offering a more fine-grained view. This reveals a negative correlation between TRRs and cosine distances, suggesting that larger cosine distances are associated with lower TRRs.
Figure 4: Influence of the number of tokens on Llama-2-7b.
The Challenge of Identifying the Origin of Black-Box Large Language Models

March 2025 · 8 Reads

Ziqing Yang · Yixin Wu · Yun Shen · [...]

The tremendous commercial potential of large language models (LLMs) has heightened concerns about their unauthorized use. Third parties can customize LLMs through fine-tuning and offer only black-box API access, effectively concealing unauthorized usage and complicating external auditing processes. This practice not only exacerbates unfair competition, but also violates licensing agreements. In response, identifying the origin of black-box LLMs is an intrinsic solution to this issue. In this paper, we first reveal the limitations of state-of-the-art passive and proactive identification methods with experiments on 30 LLMs and two real-world black-box APIs. Then, we propose a proactive technique, PlugAE, which optimizes adversarial token embeddings in a continuous space and proactively plugs them into the LLM for tracing and identification. The experiments show that PlugAE achieves substantial improvements in identifying fine-tuned derivatives. We further advocate for legal frameworks and regulations to better address the challenges posed by the unauthorized use of LLMs.
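
A minimal sketch of the general idea behind optimizing adversarial token embeddings in a continuous space, assuming a Hugging-Face-style causal LM that accepts inputs_embeds and exposes .logits. This is not the authors' PlugAE implementation; the function name, the teacher-forced loss, and the plugging scheme are simplified guesses.

```python
# Minimal sketch (not the authors' PlugAE code): learn a small prefix of continuous
# adversarial embeddings that steers a causal LM toward a chosen tracing response.
import torch
import torch.nn.functional as F

def optimize_adv_embeddings(model, embed_layer, target_ids, n_tokens=5, steps=200, lr=1e-2):
    """target_ids: LongTensor of token ids the plugged model should emit."""
    for p in model.parameters():                          # only the adversarial prefix is trained
        p.requires_grad_(False)
    dim = embed_layer.weight.shape[1]
    adv = torch.randn(1, n_tokens, dim, requires_grad=True)
    opt = torch.optim.Adam([adv], lr=lr)
    target = target_ids.unsqueeze(0)                      # (1, T)
    tgt_embeds = embed_layer(target)                      # (1, T, dim), teacher forcing
    for _ in range(steps):
        inputs = torch.cat([adv, tgt_embeds], dim=1)      # plug the adversarial prefix
        logits = model(inputs_embeds=inputs).logits
        pred = logits[:, n_tokens - 1:-1, :]              # positions that predict the target span
        loss = F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return adv.detach()   # later plugged into suspect models to test for the tracing behaviour
```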


Figure 1: Base model performance on all metrics, averaged with standard deviation for different test / train splits
Figure 2: Performance of adversarial de-biasing with different α and β parameters. X marks baseline model performance (without fairness countermeasures)
Figure 4: Performance of fairness countermeasures applied to different GCNs on nba dataset
Figure 7: Performance of combined fairness countermeasures on NBA dataset
Fairness and/or Privacy on Social Graphs

March 2025 · 23 Reads

Graph Neural Networks (GNNs) have shown remarkable success in various graph-based learning tasks. However, recent studies have raised concerns about fairness and privacy issues in GNNs, highlighting the potential for biased or discriminatory outcomes and the vulnerability of sensitive information. This paper presents a comprehensive investigation of fairness and privacy in GNNs, exploring the impact of various fairness-preserving measures on model performance. We conduct experiments across diverse datasets and evaluate the effectiveness of different fairness interventions. Our analysis considers the trade-offs between fairness, privacy, and accuracy, providing insights into the challenges and opportunities in achieving both fair and private graph learning. The results highlight the importance of carefully selecting and combining fairness-preserving measures based on the specific characteristics of the data and the desired fairness objectives. This study contributes to a deeper understanding of the complex interplay between fairness, privacy, and accuracy in GNNs, paving the way for the development of more robust and ethical graph learning models.
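
The α/β sweep in Figure 2 suggests a combined objective of roughly the following shape. This is a generic adversarial de-biasing loss written as a hedged sketch; the specific loss terms, the covariance-style penalty, and the weighting are illustrative assumptions, not the exact formulation evaluated in the paper.

```python
# Illustrative adversarial de-biasing objective: alpha weights how strongly the
# encoder is pushed to fool an adversary predicting the sensitive attribute,
# beta weights a covariance-style fairness penalty.
import torch
import torch.nn.functional as F

def debiasing_loss(task_logits, labels, adv_logits, sensitive, embeddings, alpha=1.0, beta=0.1):
    task = F.cross_entropy(task_logits, labels)
    adv = F.cross_entropy(adv_logits, sensitive)        # adversary's loss on the sensitive attribute
    s = sensitive.float() - sensitive.float().mean()
    cov = (embeddings.mean(dim=1) * s).mean().abs()     # crude decorrelation of embeddings and attribute
    # The encoder minimizes this; the adversary is trained separately (alternating steps) to minimize `adv`.
    return task - alpha * adv + beta * cov
```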


Figure 1: Overview of a conversational AI agent with input and output guardrails.
Results of input guard tests. Normalized distances larger than 0.5 are bolded.
Similarity between the input adversarial prompt and the output text from the surrogate LLM with different prompt templates.
Identification results of output guard tests. Normalized distances larger than 0.5 are bolded.
Influence of using normal query set in AP-Test.
Peering Behind the Shield: Guardrail Identification in Large Language Models

February 2025 · 4 Reads

Human-AI conversations have gained increasing attention in the era of large language models. Consequently, more techniques, such as input/output guardrails and safety alignment, have been proposed to prevent potential misuse of such Human-AI conversations. However, the ability to identify these guardrails has significant implications, both for adversarial exploitation and for auditing purposes by red team operators. In this work, we propose a novel method, AP-Test, which identifies the presence of a candidate guardrail by leveraging guardrail-specific adversarial prompts to query the AI agent. Extensive experiments on four candidate guardrails under diverse scenarios showcase the effectiveness of our method. The ablation study further illustrates the importance of the components we designed, such as the loss terms.
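
A toy sketch of the kind of identification score the tables above report, assuming the agent's responses to the adversarial prompts are embedded into vectors by some encoder. The normalized cosine distance, the nearest-reference decision rule, and the function names are assumptions for illustration, not AP-Test's exact procedure.

```python
# Toy identification score: compare the agent's behaviour against guarded and
# unguarded reference behaviours via a cosine distance normalized to [0, 1].
import numpy as np

def normalized_distance(a: np.ndarray, b: np.ndarray) -> float:
    cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return (1.0 - cos) / 2.0          # 0 = same direction, 1 = opposite

def closer_to_guarded(agent_vec, guarded_ref, unguarded_ref):
    """True if the agent's responses to the adversarial prompts resemble the guarded reference more."""
    return normalized_distance(agent_vec, guarded_ref) < normalized_distance(agent_vec, unguarded_ref)
```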


Figure 1: Overview of the synthetic artifact auditing. The auditing targets are: classifiers (Section 4), generators (Section 5), and statistical plots (Section 6).
Synthetic Artifact Auditing: Tracing LLM-Generated Synthetic Data Usage in Downstream Applications

February 2025 · 13 Reads

Large language models (LLMs) have facilitated the generation of high-quality, cost-effective synthetic data for developing downstream models and conducting statistical analyses in various domains. However, the increased reliance on synthetic data may pose potential negative impacts. Numerous studies have demonstrated that LLM-generated synthetic data can perpetuate and even amplify societal biases and stereotypes, and produce erroneous outputs known as "hallucinations" that deviate from factual knowledge. In this paper, we aim to audit artifacts, such as classifiers, generators, or statistical plots, to identify those trained on or derived from synthetic data and raise user awareness, thereby reducing unexpected consequences and risks in downstream applications. To this end, we take the first step to introduce synthetic artifact auditing to assess whether a given artifact is derived from LLM-generated synthetic data. We then propose an auditing framework with three methods including metric-based auditing, tuning-based auditing, and classification-based auditing. These methods operate without requiring the artifact owner to disclose proprietary training details. We evaluate our auditing framework on three text classification tasks, two text summarization tasks, and two data visualization tasks across three training scenarios. Our evaluation demonstrates the effectiveness of all proposed auditing methods across all these tasks. For instance, black-box metric-based auditing can achieve an average accuracy of 0.868 ± 0.071 for auditing classifiers and 0.880 ± 0.052 for auditing generators using only 200 random queries across three scenarios. We hope our research will enhance model transparency and regulatory compliance, ensuring the ethical and responsible use of synthetic data.
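
A hedged sketch of what black-box, metric-based auditing with a 200-query budget could look like: probe the artifact with random queries, summarize its behaviour with a simple statistic, and compare against reference artifacts known to be trained on real versus synthetic data. The confidence-based statistic, reference scores, and function names are assumptions, not the paper's actual metrics.

```python
# Sketch of black-box metric-based auditing under a small query budget.
import random
import statistics

def audit_classifier(predict, query_pool, real_ref_scores, synth_ref_scores, n_queries=200):
    """predict(q) -> probability vector; True if the artifact looks synthetic-data-trained."""
    queries = random.sample(list(query_pool), n_queries)
    score = statistics.mean(max(predict(q)) for q in queries)   # mean top-class confidence
    dist_real = abs(score - statistics.mean(real_ref_scores))
    dist_synth = abs(score - statistics.mean(synth_ref_scores))
    return dist_synth < dist_real
```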


HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns

January 2025 · 54 Reads

Large Language Models (LLMs) have raised increasing concerns about their misuse in generating hate speech. Among all the efforts to address this issue, hate speech detectors play a crucial role. However, the effectiveness of different detectors against LLM-generated hate speech remains largely unknown. In this paper, we propose HateBench, a framework for benchmarking hate speech detectors on LLM-generated hate speech. We first construct a hate speech dataset of 7,838 samples generated by six widely-used LLMs covering 34 identity groups, with meticulous annotations by three labelers. We then assess the effectiveness of eight representative hate speech detectors on the LLM-generated dataset. Our results show that while detectors are generally effective in identifying LLM-generated hate speech, their performance degrades with newer versions of LLMs. We also reveal the potential of LLM-driven hate campaigns, a new threat that LLMs bring to the field of hate speech detection. By leveraging advanced techniques like adversarial attacks and model stealing attacks, the adversary can intentionally evade the detector and automate hate campaigns online. The most potent adversarial attack achieves an attack success rate of 0.966, and its attack efficiency can be further improved by 13-21× through model stealing attacks with acceptable attack performance. We hope our study can serve as a call to action for the research community and platform moderators to fortify defenses against these emerging threats.


DivTrackee versus DynTracker: Promoting Diversity in Anti-Facial Recognition against Dynamic FR Strategy

January 2025 · 5 Reads

The widespread adoption of facial recognition (FR) models raises serious concerns about their potential misuse, motivating the development of anti-facial recognition (AFR) to protect user facial privacy. In this paper, we argue that the static FR strategy, predominantly adopted in prior literature for evaluating AFR efficacy, cannot faithfully characterize the actual capabilities of determined trackers who aim to track a specific target identity. In particular, we introduce DynTracker, a dynamic FR strategy where the model's gallery database is iteratively updated with newly recognized target identity images. Surprisingly, such a simple approach renders all the existing AFR protections ineffective. To mitigate the privacy threats posed by DynTracker, we advocate for explicitly promoting diversity in the AFR-protected images. We hypothesize that the lack of diversity is the primary cause of the failure of existing AFR methods. Specifically, we develop DivTrackee, a novel method for crafting diverse AFR protections that builds upon a text-guided image generation framework and diversity-promoting adversarial losses. Through comprehensive experiments on various facial image benchmarks and feature extractors, we demonstrate DynTracker's strength in breaking existing AFR methods and the superiority of DivTrackee in preventing user facial images from being identified by dynamic FR strategies. We believe our work can act as an important initial step towards developing more effective AFR methods for protecting user facial privacy against determined trackers.
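
The dynamic strategy is simple to state in code. The sketch below, with placeholder extract_features and match functions and an assumed similarity threshold, shows the gallery-enrollment loop that the abstract credits with defeating existing AFR protections.

```python
# Sketch of the dynamic FR strategy (DynTracker) described above: every probe image
# matched to the target identity is enrolled into the gallery, so later AFR-protected
# photos face an ever-larger set of references.
def dynamic_track(probe_stream, initial_gallery, extract_features, match, threshold=0.7):
    gallery = list(initial_gallery)                 # feature vectors of the target identity
    recognized = []
    for img in probe_stream:
        feat = extract_features(img)
        if any(match(feat, ref) >= threshold for ref in gallery):
            recognized.append(img)
            gallery.append(feat)                    # key step: enroll the newly recognized image
    return recognized
```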


Figure 3: Linear Probing Accuracy for classifying unsafe prompts and their safe responses at each attention layer on Llama-2-chat-7B before and after LoRA fine-tuning.
Figure 5: The harmful rate for LoRA, LoRA w.IA, Safe LoRA, and our SaLoRA under the multiGCG attack.
Figure 6: The averaged Commonsense Reasoning accuracy for LoRA, SaLoRA with and without our task-specific initialization.
Computational resource cost for fine-tuning Alpaca with different methods on an A100.
SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation

January 2025 · 10 Reads

As advancements in large language models (LLMs) continue and the demand for personalized models increases, parameter-efficient fine-tuning (PEFT) methods (e.g., LoRA) will become essential due to their efficiency in reducing computation costs. However, recent studies have raised alarming concerns that LoRA fine-tuning could potentially compromise the safety alignment in LLMs, posing significant risks for the model owner. In this paper, we first investigate the underlying mechanism by analyzing the changes in safety-alignment-related features before and after fine-tuning. Then, we propose a fixed safety module calculated from safety data and a task-specific initialization for trainable parameters in low-rank adaptations, termed Safety-alignment preserved Low-Rank Adaptation (SaLoRA). Unlike previous LoRA methods and their variants, SaLoRA enables targeted modifications to LLMs without disrupting their original alignments. Our experiments show that SaLoRA outperforms various adapter-based approaches across various evaluation metrics in different fine-tuning tasks.
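
One plausible reading of the "fixed safety module" is a projection that removes the components of the LoRA update lying in a safety subspace estimated from safety data. The sketch below encodes that reading and is an assumption on my part, not the released SaLoRA code; the class name and safety_basis argument are hypothetical, and the paper's task-specific initialization is omitted.

```python
# Hedged sketch: LoRA with a fixed projector that keeps the low-rank update out of
# a safety subspace whose orthonormal basis `safety_basis` (d_out x k) is precomputed.
import torch
import torch.nn.functional as F

class SafetyPreservedLoRA(torch.nn.Module):
    def __init__(self, base_linear: torch.nn.Linear, rank: int, safety_basis: torch.Tensor):
        super().__init__()
        self.base = base_linear                                    # frozen pretrained layer
        d_out, d_in = base_linear.weight.shape
        self.A = torch.nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(d_out, rank))
        # fixed projector removing components that lie in the safety subspace
        self.register_buffer("P", torch.eye(d_out) - safety_basis @ safety_basis.T)

    def forward(self, x):
        delta = self.P @ self.B @ self.A                           # safety-filtered low-rank update
        return self.base(x) + F.linear(x, delta)
```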



Are We in the AI-Generated Text World Already? Quantifying and Monitoring AIGT on Social Media

December 2024 · 51 Reads

Social media platforms are experiencing a growing presence of AI-Generated Texts (AIGTs). However, the misuse of AIGTs could have profound implications for public opinion, such as spreading misinformation and manipulating narratives. Despite its importance, a systematic study to assess the prevalence of AIGTs on social media is still lacking. To address this gap, this paper aims to quantify, monitor, and analyze the AIGTs on online social media platforms. We first collect a dataset (SM-D) with around 2.4M posts from 3 major social media platforms: Medium, Quora, and Reddit. Then, we construct a diverse dataset (AIGTBench) to train and evaluate AIGT detectors. AIGTBench combines popular open-source datasets and our AIGT datasets generated from social media texts by 12 LLMs, serving as a benchmark for evaluating mainstream detectors. With this setup, we identify the best-performing detector (OSM-Det). We then apply OSM-Det to SM-D to track AIGTs over time and observe different trends of AI Attribution Rate (AAR) across social media platforms from January 2022 to October 2024. Specifically, Medium and Quora exhibit marked increases in AAR, rising from 1.77% to 37.03% and 2.06% to 38.95%, respectively. In contrast, Reddit shows slower growth, with AAR increasing from 1.31% to 2.45% over the same period. Our further analysis indicates that AIGTs differ from human-written texts across several dimensions, including linguistic patterns, topic distributions, engagement levels, and the follower distribution of authors. We envision our analysis and findings on AIGTs in social media can shed light on future research in this domain.
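
The AI Attribution Rate (AAR) tracked above is straightforward bookkeeping once a detector is fixed. A minimal sketch, treating the detector (e.g., OSM-Det) as a black-box callable; the function names and the (platform, month, text) record layout are assumptions for illustration.

```python
# Minimal AAR bookkeeping: per platform and month, the share of posts the detector
# attributes to an LLM.
from collections import defaultdict

def monthly_aar(posts, is_ai_generated):
    """posts: iterable of (platform, 'YYYY-MM', text) tuples -> {(platform, month): AAR}."""
    total = defaultdict(int)
    ai = defaultdict(int)
    for platform, month, text in posts:
        total[(platform, month)] += 1
        if is_ai_generated(text):
            ai[(platform, month)] += 1
    return {key: ai[key] / total[key] for key in total}
```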


Citations (48)


... The data-centric trend in the ML community has also caught the attention of security researchers. Some have started recognizing privacy risks beyond model training [35], [36], [37]. Compared to these works, this paper is the first to explore the privacy risks associated with data collected but excluded from training at the upstream level. ...

Reference:

Data Lineage Inference: Uncovering Privacy Vulnerabilities of Dataset Pruning
Understanding Data Importance in Machine Learning Attacks: Does Valuable Data Pose Greater Harm?
  • Citing Conference Paper
  • January 2025

... Text-to-image models are prone to generating unsafe images, raising concerns about the use of AI-Generated Content (AIGC) in contexts such as front-facing business websites or direct consumer communications. Wu et al. [87] quantitatively assessed the safety of model-generated images, evaluating whether they contain factors such as violence, gore, or explicit content. ...

Image-Perfect Imperfections: Safety, Bias, and Authenticity in the Shadow of Text-To-Image Model Evolution
  • Citing Conference Paper
  • December 2024

... • Manual Jailbreaks. These attacks involve crafting prompts that exploit vulnerabilities in LLMs [17], [18], [19], [20], [21]. Wei et al. [18] identified two key weaknesses-out-of-distribution inputs and conflicts between safety objectives and model capabilities-to inform prompt design. ...

"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models
  • Citing Conference Paper
  • December 2024

... To find the threshold τ, we focus on the distributional differences between member attack queries and other queries (Wen et al., 2024). Figure 1a shows S_q, the cosine similarities between all queries and the external database, where S_q = {sim(q, d) | d ∈ D} is the set of similarities between a query q and the entire database. ...

Membership Inference Attacks Against In-Context Learning
  • Citing Conference Paper
  • December 2024

... Recent state-of-the-art (SOTA) deepfake detectors, including UnivCLIP [1], DCT [2], ZeroFake [3], DE-FAKE [4] and UnivConv2B [6], have shown remarkable success in identifying fully synthetic images generated by models like Stable Diffusion [7], DALL-E [8], and StyleCLIP [9]. However, their efficacy dramatically diminishes when confronted with instructional image edits, where manipulations are precisely targeted and visually coherent with original content. ...

ZeroFake: Zero-Shot Detection of Fake Images Generated and Edited by Text-to-Image Generation Models
  • Citing Conference Paper
  • December 2024

... Binary machine-generated text detection is a well-researched task, typically addressed by stylometric methods (e.g., a machine learning classifier trained on TF-IDF features), statistical methods (e.g., utilizing perplexity, entropy, or likelihood) [8,9], or language models fine-tuned for classification (e.g., by supervised or contrastive learning) [10,11]. Most of the detection methods can be directly applied through existing frameworks, such as MGTBench [12] or IMGTB [13]. Multiclass machine-generated text classification is mostly studied in the related authorship attribution task, i.e., identifying the author (generator) of the text. ...

MGTBench: Benchmarking Machine-Generated Text Detection
  • Citing Conference Paper
  • December 2024

... LLMs Security. LLMs can memorize data [15,38,39,64,77,96], enabling data extraction attacks to reveal memorized samples [16,19,95,99]. These security and privacy vulnerabilities [40,71,72,90,91] are further amplified when LLMdriven autonomous agents [70] are deployed at scale [100]. ...

Reconstruct Your Previous Conversations! Comprehensively Investigating Privacy Leakage Risks in Conversations with GPT Models
  • Citing Conference Paper
  • January 2024

... Prompt practices and usage vary widely by field, with requirements for prompt engineering in cyber defense (Shenoy & Mbaziira 2024) differing from prompt approaches in software engineering (Shin et al. 2023), as well as by the type and size of model used (He et al. 2024; Ma et al. 2024). Some sets of prompts directly draw on the model architecture, such as by increasing inference-time compute by restating questions or allowing the model to reason through more complex problems via a chain of thought. ...

The Death and Life of Great Prompts: Analyzing the Evolution of LLM Prompts from the Structural Perspective
  • Citing Conference Paper
  • January 2024

... Others probe intersectional and narrative biases through counterfactuals or story generation, revealing how demographic cues, especially race and gender, influence content (Howard et al., 2023;Lee and Jeon, 2024;Lee et al., 2025). More recent efforts introduce multimodal benchmarks and unified frameworks to assess societal bias across different input-output modalities, showing that model behavior varies with modality, and identity traits (Sathe et al., 2024;Jiang et al., 2024). Adaptations of unimodal benchmarks like StereoSet to vision-language settings (e.g., VLStereoSet) further highlight persistent stereotypical associations in multimodal captioning tasks (Zhou et al., 2022). ...

ModSCAN: Measuring Stereotypical Bias in Large Vision-Language Models from Vision and Language Modalities
  • Citing Conference Paper
  • January 2024

... Through extensive experimentation, Chu et al. (2024) consistently observed that optimization in jailbreak attacks achieved better rates of attack success and demonstrated resilience across various LLMs. Furthermore, they delved into balancing attack effectiveness and efficiency, illustrating the continued viability of jailbreak prompt transferability, particularly in the context of black-box models. ...

JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs