Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

ICML 2026

Dongyoon Hahm¹, Dylan Hadfield-Menell², Kimin Lee¹

¹KAIST, ²MIT

Abstract

Reinforcement Learning from Human Feedback (RLHF) is the standard method to align Large Language Models (LLMs) with human preferences. In this work, we introduce Alignment Tampering, a potential vulnerability where the LLM undergoing alignment influences the preference dataset, causing RLHF to amplify undesired behaviors.

This arises from core limitations of RLHF: preference datasets are constructed from the LLM's own outputs, and pairwise comparisons only indicate which response is better, not why. When biased responses are also higher quality, annotators may prefer them based on quality, and the resulting reward model can inherit the bias-quality correlation.

Optimizing such rewards through reinforcement learning or best-of-N sampling can amplify misaligned biases. Our experiments demonstrate amplification across diverse biases: from keyword bias to propaganda (e.g., sexism), brand promotion, and instrumental goal-seeking. Mitigation remains challenging, as existing techniques for robust RLHF fail to fully resolve alignment tampering without sacrificing response quality. These findings reveal structural vulnerabilities of current RLHF and emphasize the need to prevent this vulnerability.

Alignment Tampering

Alignment tampering is a phenomenon in which an LLM undergoing alignment influences the preference dataset to reflect preference for undesired behaviors, leading to their reinforcement through RLHF. Pairwise preference labels reveal which response is preferred, but not whether that preference comes from response quality or from a correlated bias.

When biased responses are also higher quality, annotators prefer them during preference dataset construction. The learned reward model then favors both quality and bias, and RL optimization amplifies the undesired bias.

Click or hover over each step to spotlight it in the figure.

1. Response generation The model produces high-quality biased responses and lower-quality unbiased responses.

2. Preference labeling Annotators prefer the higher-quality response, so biased responses enter the chosen set.

3. RL training with a biased reward The biased preference dataset yields a reward model that favors both quality and bias, and RL fine-tuning optimizes this reward.

4. Biased RL policy output The resulting RL policy overproduces the biased response pattern, amplifying the keyword bias after alignment.

Main Results

PPO and DPO fine-tuning drive the bias rate toward 1.0. Best-of-N sampling also increases the bias rate as the number of sampled responses grows. Win rate increases concurrently with bias rate, showing that RLHF optimizes the correlated quality and bias together.

Show response evolution example

Response evolution during PPO training. While responses transition from harmful to helpful and safe, the bias rate increases from 0.19 to 1.0, with the keyword "AI" appearing excessively in the final response.

Diverse Biases

Alignment tampering amplifies biases across propaganda, promotion, and instrumental goal categories. This suggests practical risks such as brand promotion, political propaganda, and goal-seeking behaviors being reinforced during alignment.

Show example responses across biases

Example responses for each of the nine biases, generated from the same prompt. Underlined text highlights the biased content in each response.

Detection and Mitigation

We propose a detection method that leverages the tendency of a tampering policy to generate two distinct types of responses. In representation space, triggered prompts show separated clusters that correspond to high-reward biased responses and low-reward unbiased responses. This clustering behavior provides a signal for detecting prompts that activate alignment tampering, and can also help identify likely trigger phrases.

Representation-based detection of alignment tampering

Mitigation remains challenging. We evaluate reward model variants designed to be more robust to spurious correlations, including InfoRM, WARM, and RRM. Although these methods can slow down bias amplification in some PPO runs, they do not fully prevent it. In BoN sampling, bias and win rate still increase together, indicating that the reward models continue to favor responses where higher quality is correlated with the undesired bias.

Mitigation results with reward model variants

These results show a persistent trade-off: methods that reduce the bias rate also tend to reduce improvements in response quality. This suggests that simply changing the reward model is not enough; preventing alignment tampering likely requires methods that decouple response quality from the undesired behavior during data generation, preference labeling, or optimization.

BibTeX

@inproceedings{hahm2026alignment,
  title={Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases},
  author={Hahm, Dongyoon and Hadfield-Menell, Dylan and Lee, Kimin},
  booktitle={Proceedings of the 43rd International Conference on Machine Learning},
  year={2026}
}