Reinforcement Learning from Human Feedback (RLHF) is the standard method to align Large Language Models (LLMs) with human preferences. In this work, we introduce alignment tampering, a potential vulnerability where the LLM undergoing alignment influences the preference dataset, causing RLHF to amplify undesired behaviors.
This arises from core limitations of RLHF: preference datasets are constructed from the LLM's own outputs, and pairwise comparisons only indicate which response is better, not why. When biased responses are also higher quality, annotators may prefer them based on quality, and the resulting reward model can inherit the bias-quality correlation.
Optimizing such rewards through reinforcement learning or best-of-N sampling can amplify misaligned biases. Experiments demonstrate amplification across keyword bias, propaganda, brand promotion, and instrumental goal-seeking behaviors. The paper proposes a detection method, while mitigation remains challenging.
Alignment tampering is a phenomenon in which an LLM undergoing alignment influences the preference dataset to reflect preference for undesired behaviors, leading to their reinforcement through RLHF. Pairwise preference labels reveal which response is preferred, but not whether that preference comes from response quality or from a correlated bias.
When biased responses are also higher quality, annotators prefer them during preference dataset construction. The learned reward model then favors both quality and bias, and RL optimization amplifies the undesired bias.
For a concrete example, hover over the step cards below to spotlight the corresponding part of the figure.
PPO and DPO fine-tuning drive the bias rate toward 1.0. Best-of-N sampling also increases the bias rate as the number of sampled responses grows. Win rate increases concurrently with bias rate, showing that RLHF optimizes the correlated quality and bias together.
Alignment tampering amplifies biases across propaganda, promotion, and instrumental goal categories. This suggests practical risks such as brand promotion, political propaganda, and goal-seeking behaviors being reinforced during alignment.
The paper proposes a detection method based on the tendency of a tampering policy to generate two distinct types of responses. In representation space, triggered prompts show separated clusters that correspond to high-reward biased responses and low-reward unbiased responses. This clustering behavior provides a signal for detecting prompts that activate alignment tampering, and can also help identify likely trigger phrases.
Mitigation remains challenging. We evaluate reward model variants designed to be more robust to spurious correlations, including InfoRM, WARM, and RRM. Although these methods can slow down bias amplification in some PPO runs, they do not fully prevent it. In BoN sampling, bias and win rate still increase together, indicating that the reward models continue to favor responses where higher quality is correlated with the undesired bias.
These results show a persistent trade-off: methods that reduce the bias rate also tend to reduce improvements in response quality. This suggests that simply changing the reward model is not enough; preventing alignment tampering likely requires methods that decouple response quality from the undesired behavior during data generation, preference labeling, or optimization.
@inproceedings{hahm2026alignment,
title={Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases},
author={Hahm, Dongyoon and Hadfield-Menell, Dylan and Lee, Kimin},
booktitle={Proceedings of the 43rd International Conference on Machine Learning},
year={2026}
}