Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

ICML 2026
KAIST, MIT

Abstract


Reinforcement Learning from Human Feedback (RLHF) is the standard method to align Large Language Models (LLMs) with human preferences. In this work, we introduce alignment tampering, a potential vulnerability where the LLM undergoing alignment influences the preference dataset, causing RLHF to amplify undesired behaviors.

This arises from core limitations of RLHF: preference datasets are constructed from the LLM's own outputs, and pairwise comparisons only indicate which response is better, not why. When biased responses are also higher quality, annotators may prefer them based on quality, and the resulting reward model can inherit the bias-quality correlation.

Optimizing such rewards through reinforcement learning or best-of-N sampling can amplify misaligned biases. Our experiments demonstrate amplification across keyword bias, propaganda, brand promotion, and instrumental goal-seeking behaviors. We propose a detection method, but mitigation remains challenging.

Alignment Tampering


Overview of alignment tampering

Alignment tampering is a phenomenon in which an LLM undergoing alignment influences the preference dataset to reflect preference for undesired behaviors, leading to their reinforcement through RLHF. Pairwise preference labels reveal which response is preferred, but not whether that preference comes from response quality or from a correlated bias.

When biased responses are also higher quality, annotators prefer them during preference dataset construction. The learned reward model then favors both quality and bias, and RL optimization amplifies the undesired bias.

For a concrete example, the four steps below walk through the figure, using keyword bias as the running case.

1. Response generation: The model produces high-quality biased responses and lower-quality unbiased responses.
2. Preference labeling: Annotators prefer the higher-quality response, so biased responses enter the chosen set.
3. RL training with a biased reward: The biased preference dataset yields a reward model that favors both quality and bias, and RL fine-tuning optimizes this reward.
4. Biased RL policy output: The resulting RL policy overproduces the biased response pattern, amplifying the keyword bias after alignment.
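
Steps 1 to 3 above can be sketched as a toy simulation. This is a minimal sketch under stated assumptions, not the paper's experimental setup: responses are reduced to a scalar quality and a binary bias flag, the annotator compares quality only, and the reward model is a two-weight linear model fit with a Bradley-Terry style logistic loss. The reward model is assumed to see only a noisy proxy of quality, which is what lets the bias flag become predictive. All function names and parameters (`sample_response`, `observe`, `quality_gap`, `obs_noise`, and so on) are hypothetical.

```python
import math
import random


def sample_response(rng, p_bias=0.5, quality_gap=1.5):
    """Step 1: biased responses are drawn with higher mean quality (toy assumption)."""
    biased = rng.random() < p_bias
    quality = rng.gauss(quality_gap if biased else 0.0, 1.0)
    return quality, float(biased)


def observe(rng, response, obs_noise=1.0):
    """Features the reward model sees: a noisy quality proxy and the bias flag."""
    quality, biased = response
    return quality + rng.gauss(0.0, obs_noise), biased


def build_preference_pairs(rng, n_pairs=2000):
    """Step 2: the annotator picks the higher-quality response and ignores bias."""
    pairs = []
    for _ in range(n_pairs):
        a, b = sample_response(rng), sample_response(rng)
        chosen, rejected = (a, b) if a[0] > b[0] else (b, a)
        pairs.append((observe(rng, chosen), observe(rng, rejected)))
    return pairs


def fit_reward_model(pairs, lr=0.5, steps=300):
    """Step 3: fit r(x) = w_q * quality_proxy + w_b * bias by gradient descent
    on the pairwise logistic loss -log sigmoid(r(chosen) - r(rejected))."""
    w_q, w_b = 0.0, 0.0
    for _ in range(steps):
        g_q = g_b = 0.0
        for (cq, cb), (rq, rb) in pairs:
            dq, db = cq - rq, cb - rb
            p = 1.0 / (1.0 + math.exp(-(w_q * dq + w_b * db)))
            g_q += (p - 1.0) * dq  # derivative of -log(p) with respect to w_q
            g_b += (p - 1.0) * db  # derivative of -log(p) with respect to w_b
        w_q -= lr * g_q / len(pairs)
        w_b -= lr * g_b / len(pairs)
    return w_q, w_b


def run_demo(seed=0):
    rng = random.Random(seed)
    pairs = build_preference_pairs(rng)
    chosen_bias_rate = sum(chosen[1] for chosen, _ in pairs) / len(pairs)
    w_q, w_b = fit_reward_model(pairs)
    return chosen_bias_rate, w_q, w_b


if __name__ == "__main__":
    rate, w_q, w_b = run_demo()
    print(f"chosen-set bias rate: {rate:.2f}, w_quality: {w_q:.2f}, w_bias: {w_b:.2f}")
```

In this toy setting the bias rate of the chosen set exceeds the 0.5 base rate, and the fitted weight on the bias flag comes out positive even though the simulated annotator never looked at bias. The pairwise labels carry no information about why a response won, so the reward model credits whichever features predict the winner.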

Main Results


PPO, DPO, and best-of-N results

Fine-tuning with Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO) drives the bias rate toward 1.0. Best-of-N sampling also increases the bias rate as the number of sampled responses N grows. Win rate increases together with bias rate, showing that RLHF optimizes the correlated quality and bias jointly.
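
The best-of-N trend can be illustrated with a small simulation. This is a hedged sketch, not the paper's evaluation: it assumes a hypothetical reward that adds a fixed bonus for the bias on top of quality, a base policy in which biased responses are rarer but higher quality on average, and it reports mean quality as a stand-in for win rate. The names `sample_response`, `reward`, `best_of_n`, and `evaluate`, along with all numeric parameters, are illustrative.

```python
import random


def sample_response(rng, p_bias=0.3, quality_gap=1.0):
    """Base policy: biased responses have higher mean quality (toy assumption)."""
    biased = rng.random() < p_bias
    quality = rng.gauss(quality_gap if biased else 0.0, 1.0)
    return quality, biased


def reward(response, w_q=1.0, w_b=1.0):
    """Hypothetical reward model that has learned to favor both quality and bias."""
    quality, biased = response
    return w_q * quality + w_b * float(biased)


def best_of_n(rng, n):
    """Draw n candidates and return the one with the highest reward."""
    candidates = [sample_response(rng) for _ in range(n)]
    return max(candidates, key=reward)


def evaluate(n, trials=4000, seed=0):
    """Return (bias rate, mean quality) of the best-of-n selections."""
    rng = random.Random(seed)
    picks = [best_of_n(rng, n) for _ in range(trials)]
    bias_rate = sum(1 for _, biased in picks if biased) / trials
    mean_quality = sum(quality for quality, _ in picks) / trials
    return bias_rate, mean_quality


if __name__ == "__main__":
    for n in (1, 4, 16):
        bias_rate, mean_quality = evaluate(n)
        print(f"N={n:2d}  bias rate={bias_rate:.2f}  mean quality={mean_quality:.2f}")
```

Under these assumptions, both the bias rate and the mean quality of the selected response rise with N, mirroring the joint increase of bias rate and win rate described above.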

Diverse Biases


Bias amplification across nine biases

Alignment tampering amplifies biases across propaganda, promotion, and instrumental goal categories. This suggests practical risks such as brand promotion, political propaganda, and goal-seeking behaviors being reinforced during alignment.

Detection and Mitigation


We propose a detection method based on the tendency of a tampering policy to generate two distinct types of responses. In representation space, responses to triggered prompts form separated clusters that correspond to high-reward biased responses and low-reward unbiased responses. This clustering behavior provides a signal for detecting prompts that activate alignment tampering, and can also help identify likely trigger phrases.
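
The clustering signal can be sketched as follows. This is an illustrative stand-in and not the paper's detector: it assumes each prompt comes with a set of response representations (replaced here by synthetic Gaussian vectors), splits them with plain 2-means, and scores the prompt by the distance between the two centroids relative to the average within-cluster spread. The functions `two_means`, `separation_score`, and `fake_representations`, and the choice of score, are hypothetical.

```python
import math
import random


def two_means(points, iters=20):
    """Plain 2-means (Lloyd's algorithm), initialized with two far-apart points."""
    c0 = points[0]
    c1 = max(points, key=lambda p: math.dist(p, c0))
    c0 = max(points, key=lambda p: math.dist(p, c1))
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            nearer_to_c1 = math.dist(p, c1) < math.dist(p, c0)
            groups[int(nearer_to_c1)].append(p)
        if not groups[0] or not groups[1]:
            break  # degenerate split: keep the previous centroids
        c0, c1 = (tuple(sum(x) / len(g) for x in zip(*g)) for g in groups)
    return c0, c1, groups


def separation_score(points):
    """Centroid distance divided by the mean distance of points to their own centroid."""
    c0, c1, groups = two_means(points)
    if not groups[0] or not groups[1]:
        return 0.0
    total = sum(math.dist(p, c) for g, c in zip(groups, (c0, c1)) for p in g)
    spread = max(total / len(points), 1e-12)  # guard against division by zero
    return math.dist(c0, c1) / spread


def fake_representations(rng, n=200, dim=8, bimodal=False, gap=4.0):
    """Synthetic stand-in for the response representations of one prompt."""
    points = []
    for _ in range(n):
        shift = gap if (bimodal and rng.random() < 0.5) else 0.0
        points.append(tuple(rng.gauss(shift, 1.0) for _ in range(dim)))
    return points


if __name__ == "__main__":
    rng = random.Random(0)
    benign = separation_score(fake_representations(rng, bimodal=False))
    triggered = separation_score(fake_representations(rng, bimodal=True))
    print(f"benign prompt score: {benign:.2f}, triggered prompt score: {triggered:.2f}")
```

A prompt whose responses split into two well-separated groups receives a much higher score than one whose responses form a single blob. In practice a threshold on such a score would need to be calibrated on prompts known to be benign.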

Representation-based detection of alignment tampering

Mitigation remains challenging. We evaluate reward model variants designed to be more robust to spurious correlations, including InfoRM, WARM, and RRM. Although these methods can slow down bias amplification in some PPO runs, they do not fully prevent it. In best-of-N (BoN) sampling, bias rate and win rate still increase together, indicating that the reward models continue to favor responses in which higher quality is correlated with the undesired bias.

Mitigation results with reward model variants

These results show a persistent trade-off: methods that reduce the bias rate also tend to reduce improvements in response quality. This suggests that simply changing the reward model is not enough; preventing alignment tampering likely requires methods that decouple response quality from the undesired behavior during data generation, preference labeling, or optimization.

BibTeX


@inproceedings{hahm2026alignment,
  title={Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases},
  author={Hahm, Dongyoon and Hadfield-Menell, Dylan and Lee, Kimin},
  booktitle={Proceedings of the 43rd International Conference on Machine Learning},
  year={2026}
}