AdvBDGen

Abstract

With the growing adoption of reinforcement learning with human feedback (RLHF) for aligning large language models (LLMs), the risk of backdoor installation during alignment has increased, leading to unintended and harmful behaviors. Existing backdoor triggers are typically limited to fixed word patterns, making them detectable during data cleaning and easily removable post-poisoning. In this work, we explore the use of prompt-specific paraphrases as backdoor triggers, enhancing their stealth and resistance to removal during LLM alignment. We propose AdvBDGen, an adversarially fortified generative fine-tuning framework that automatically generates prompt-specific backdoors that are effective, stealthy, and transferable across models. AdvBDGen employs a generator-discriminator pair, fortified by an adversary, to ensure the installability and stealthiness of backdoors. It enables the crafting and successful installation of complex triggers using as little as 3% of the fine-tuning data. Once installed, these backdoors can jailbreak LLMs during inference, demonstrate improved stability against perturbations compared to traditional constant triggers, and are more challenging to remove. These findings underscore an urgent need for the research community to develop more robust defenses against adversarial backdoor threats in LLM alignment.

Methodology

The key idea behind a backdoor attack is to introduce a trigger—such as a patch in an image, a specific word, or a pattern in text—that the targeted model can reliably discern, causing it to exhibit unintended behaviors like generating misaligned responses. We propose a generator-discriminator architecture where the generator encodes the backdoor trigger into the prompt, and the discriminator classifies trigger-encoded prompts from clean ones. Both the generator and discriminator are powered by LLMs. The generator's objective is to produce trigger-encoded prompts that preserve the original prompt’s semantic meaning while remaining detectable by the discriminator LLM. However, a straightforward generator-discriminator setup often leads the generator to insert a constant string into the prompts, effectively reducing the attack to a constant trigger scenario. Examples of this behavior are shown in Table \ref{backdoor_example_1_discriminator}. This outcome arises because the setup lacks incentives for the generator to create complex, varied encodings, ultimately failing to develop sophisticated triggers necessary for stealthier backdoor attacks.To introduce complexity into the encoding process, we propose an enhanced approach using two discriminators: an adversarial weak discriminator and a strong discriminator, alongside the generator. Both discriminators are trained concurrently to classify trigger-encoded prompts from clean prompts. However, the generator's objective is to produce prompts that are detectable by the strong discriminator but evade detection by the weak discriminator. This design compels the generator to create more sophisticated triggers—subtle enough to bypass the weaker discriminator while still identifiable by the stronger one. This dual-discriminator setup encourages the generation of complex, nuanced backdoors that maintain effectiveness without being obvious. The generator and discriminators are trained simultaneously, as illustrated in Figure \ref{fig:weak_strong_loss}, which demonstrates how the differing learning speeds of the strong and weak discriminators drive the generator to develop increasingly complex triggers over time.

Overview

How the strong and weak discriminators contribute to complex trigger generation

Backdoor's Effectiveness, Transferrability and Robustness

Our proposed triggers—though slightly more challenging to install—are just as effective as constant trigger. Furthermore, they show transferrability to models that were not used as the dsicriminator in the backdoor generatorion. Moreover, once installed, we find it persists even when perturbed within the semantic context in which it was installed. And these variants can be easily generated by simply altering the sampling strategy of the generator.

Transferrability of the backdoor:n this figure we show how backdoors generated by AdvBDGen are almost as effective as constant tiggers, transferable across equivalent sized models and are capable of modifying styled paraphrases into an installable backoors.

Fuzziness of the backdoor (Qualitative):Table shows the sensitivity of the backdoors to the semantic meaning of the prompt. Here we show that the backdoors are installed by catching on to the semantics of the trigger rather than a constant artifact. Even when the encoded backdoors are replaced by similar semantically consistent triggers the jailbreak occurs successfully. This showcases the ability of our proposed generative adversarial training paradigm in finding meaningful triggers. Here the both the generator and discriminator are Mistral 7B models and the weak generator is a Tinyllama 1B model.

Fuzziness of the backdoor (Quantitative): Here, we analyze both the existence and the possibility of finding the fuzzy variants of a given backdoor. Here, we measure the uniqueness of the generated prompts as a fraction of the total generated prompts in order to measure the similarity among them.

Resilience against defence

While both encoded and constant triggers exhibit similar resilience to pre and post safety training, our results show that encoded triggers are more resistant to trigger removal even in disadvantageous setups.

Resilience against trigger Removal (ASR) : In this figure we show the efficacy of the proposed trigger removal method against both the constant trigger and our proposed fuzzy encoded trigger. In this figure, we show an ablation with the possibility of a different number of triggers being identified and used for trigger removal in the case of our proposed fuzzy backdoor. We can see that even when a very large number of triggers are found, it is harder to remove the already installed fuzzy backdoor as opposed to the constant trigger-based backdoor. For consistency, in both the constant trigger and encoded trigger case, we use the model that was poisoned using 5% of the data.

Resilience against trigger Removal (PS) : In this figure we show the efficacy of the proposed trigger removal method against both the constant trigger and our proposed fuzzy encoded trigger. In this figure, we show an ablation with the possibility of a different number of triggers being identified and used for trigger removal in the case of our proposed fuzzy backdoor. We can see that even when a very large number of triggers are found, it is harder to remove the already installed fuzzy backdoor as opposed to the constant trigger-based backdoor. For consistency, in both the constant trigger and encoded trigger case, we use the model that was poisoned using 5% of the data.

Safety training: We consider safety training in both the pre and post poisonining setting. We find that both the constant and our proposed encoded backdoor triggers show the same level of resilience to safety training.

BibTeX

@misc{pathmanathan2024advbdgenadversariallyfortifiedpromptspecific, title={AdvBDGen: Adversarially Fortified Prompt-Specific Fuzzy Backdoor Generator Against LLM Alignment}, author={Pankayaraj Pathmanathan and Udari Madhushani Sehwag and Michael-Andrei Panaitescu-Liess and Furong Huang}, year={2024}, eprint={2410.11283}, archivePrefix={arXiv}, primaryClass={cs.LG}, url={https://arxiv.org/abs/2410.11283}, }

AdvBDGen: Adversarially fortified prompt-specific fuzzy backdoor generator against LLM alignment.