AdvBDGen: Adversarially fortified prompt-specific fuzzy backdoor generator against LLM alignment.

University of Maryland 1 JP Morgan AI Research 2

Overview of AdvBDGen :The generator learns to encode complex backdoor triggers into prompts, ensuring prompt-specific adaptability and stealthiness. The strong discriminator detects these triggers to ensure successful trigger installation, while the weak discriminator fails to detect them, preventing reliance on easily identifiable patterns. This adversarial setup refines the triggers to be stealthy, adaptable, and resistant to standard detection methods

Abstract

With the growing adoption of reinforcement learning with human feedback (RLHF) for aligning large language models (LLMs), the risk of backdoor installation during alignment has increased, leading to unintended and harmful behaviors. Existing backdoor triggers are typically limited to fixed word patterns, making them detectable during data cleaning and easily removable post-poisoning. In this work, we explore the use of prompt-specific paraphrases as backdoor triggers, enhancing their stealth and resistance to removal during LLM alignment. We propose AdvBDGen, an adversarially fortified generative fine-tuning framework that automatically generates prompt-specific backdoors that are effective, stealthy, and transferable across models. AdvBDGen employs a generator-discriminator pair, fortified by an adversary, to ensure the installability and stealthiness of backdoors. It enables the crafting and successful installation of complex triggers using as little as 3% of the fine-tuning data. Once installed, these backdoors can jailbreak LLMs during inference, demonstrate improved stability against perturbations compared to traditional constant triggers, and are more challenging to remove. These findings underscore an urgent need for the research community to develop more robust defenses against adversarial backdoor threats in LLM alignment.

Methodology

The key idea behind a backdoor attack is to introduce a trigger—such as a patch in an image, a specific word, or a pattern in text—that the targeted model can reliably discern, causing it to exhibit unintended behaviors like generating misaligned responses. We propose a generator-discriminator architecture where the generator encodes the backdoor trigger into the prompt, and the discriminator classifies trigger-encoded prompts from clean ones. Both the generator and discriminator are powered by LLMs. The generator's objective is to produce trigger-encoded prompts that preserve the original prompt’s semantic meaning while remaining detectable by the discriminator LLM. However, a straightforward generator-discriminator setup often leads the generator to insert a constant string into the prompts, effectively reducing the attack to a constant trigger scenario. Examples of this behavior are shown in Table \ref{backdoor_example_1_discriminator}. This outcome arises because the setup lacks incentives for the generator to create complex, varied encodings, ultimately failing to develop sophisticated triggers necessary for stealthier backdoor attacks.To introduce complexity into the encoding process, we propose an enhanced approach using two discriminators: an adversarial weak discriminator and a strong discriminator, alongside the generator. Both discriminators are trained concurrently to classify trigger-encoded prompts from clean prompts. However, the generator's objective is to produce prompts that are detectable by the strong discriminator but evade detection by the weak discriminator. This design compels the generator to create more sophisticated triggers—subtle enough to bypass the weaker discriminator while still identifiable by the stronger one. This dual-discriminator setup encourages the generation of complex, nuanced backdoors that maintain effectiveness without being obvious. The generator and discriminators are trained simultaneously, as illustrated in Figure \ref{fig:weak_strong_loss}, which demonstrates how the differing learning speeds of the strong and weak discriminators drive the generator to develop increasingly complex triggers over time.

Backdoor's Effectiveness, Transferrability and Robustness

Our proposed triggers—though slightly more challenging to install—are just as effective as constant trigger. Furthermore, they show transferrability to models that were not used as the dsicriminator in the backdoor generatorion. Moreover, once installed, we find it persists even when perturbed within the semantic context in which it was installed. And these variants can be easily generated by simply altering the sampling strategy of the generator.

Resilience against defence

While both encoded and constant triggers exhibit similar resilience to pre and post safety training, our results show that encoded triggers are more resistant to trigger removal even in disadvantageous setups.

BibTeX

@misc{pathmanathan2024advbdgenadversariallyfortifiedpromptspecific,
      title={AdvBDGen: Adversarially Fortified Prompt-Specific Fuzzy Backdoor Generator Against LLM Alignment}, 
      author={Pankayaraj Pathmanathan and Udari Madhushani Sehwag and Michael-Andrei Panaitescu-Liess and Furong Huang},
      year={2024},
      eprint={2410.11283},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2410.11283}, 
}