Reward Hacking Is a Phase Transition and We Can See It Coming
Why This Matters Now
We know that finetuning on minor-harm reward hacking generalises to serious-harm contexts (28% to 78%). We know that specification gaming can bootstrap from sycophancy to reward tampering. We know that as reasoning gets longer, model failures become increasingly incoherent. What we did not know was when the transition to hacking occurs during training, whether it can be predicted before it happens, and whether the right supervision signal can prevent it entirely.
This report answers all three questions. The answer to the first is: abruptly. The answer to the second is: yes, with 100% precision. The answer to the third is: yes, with a 34x reduction on the hardest problems.
The 98x Variance Gap
The first study compares four supervision modes across five difficulty levels using GRPO on Qwen2.5-1.5B-Instruct with LoRA. The most striking finding from real training is the reward signal itself.
Figure 1: Reward variance from real Qwen2.5-1.5B training (50 steps, seed 42). Process supervision has 98x lower variance (0.0018 vs 0.1770), nearly two orders of magnitude. The gradient signal under outcome supervision is so noisy that the optimiser reliably discovers shortcuts rather than valid reasoning.
This matters because it explains a mechanism. If you are training a model with RL and the reward signal has variance 0.177, the model faces an optimisation landscape where shortcuts and valid solutions are indistinguishable by gradient signal alone. Process supervision collapses this ambiguity by providing dense, step-level feedback.
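A toy simulation illustrates the mechanism. This is an illustrative sketch, not the measured training setup: the per-step success probability `p_step` and the step count are made-up numbers, so the resulting ratio will not reproduce the 98x figure, only the direction of the effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rollouts, n_steps = 10_000, 8
p_step = 0.8  # hypothetical probability that a single reasoning step is valid

# Outcome reward: 1 only if the whole chain succeeds (here: all steps valid).
# Sparse and binary, so its variance stays large.
steps_ok = rng.random((n_rollouts, n_steps)) < p_step
outcome_reward = steps_ok.all(axis=1).astype(float)

# Process reward: mean of per-step scores. Dense feedback averages the
# per-step noise down by roughly a factor of n_steps.
process_reward = steps_ok.mean(axis=1)

print(f"outcome variance: {outcome_reward.var():.4f}")
print(f"process variance: {process_reward.var():.4f}")
```

Even in this crude model the dense signal is several times less noisy per sample; in the real runs the gap is two orders of magnitude.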
Reward Hacking as a Phase Transition
The second key finding reframes how we think about hacking. It is not a gradual drift but a critical transition — an abrupt qualitative change in model behaviour. You do not monitor for whether hacking has started (detection). You monitor for whether the system is approaching a critical point (prediction).
| Mode | L1 | L2 | L3 | L4 | L5 |
|---|---|---|---|---|---|
| Outcome | Never | Never | 2,200 | 1,500 | 1,100 |
| Process | Never | Never | Never | Never | 8,000 |
| Hybrid | Never | Never | 4,500 | 3,000 | 2,500 |
| Adaptive | Never | Never | 5,000 | Never | Never |
Table 1: Hacking onset steps by mode and difficulty. Never = no hacking within 10K steps. Harder problems hack earlier. Adaptive mode eliminates L4/L5 hacking entirely.
Under outcome supervision, L5 problems begin hacking at step 1,100. Adaptive supervision routes hard problems to process-heavy rewards before the phase transition. The model never enters the hacked state because the transition is prevented, not reversed.
Figure 2: L5 (hardest) hacking rate. Adaptive achieves 1.5% vs. 50.6% for outcome — a 34x reduction.
Predicting the Transition Before It Happens
This is the novel contribution I am most interested in pursuing further. Near a bifurcation in a dynamical system, the dominant eigenvalue of the system Jacobian approaches 1. Perturbations decay more slowly, causing the autocorrelation of fluctuations to rise — a phenomenon called critical slowing down. This is well established in ecology and climate science (Scheffer et al., 2009) but has never been applied to reward hacking in LLM training.
I track the per-difficulty hacking rate, detrend it, compute AR(1) autocorrelation in a sliding window, and issue a warning when AR(1) exceeds 0.7. The result: 100% precision, zero false alarms, 700 to 2,800 step advance warning.
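A minimal version of this detector can be sketched as follows. The function name, the rolling-mean detrending, and the window size are my choices for illustration; the report's exact detrending and windowing may differ.

```python
import numpy as np

def ar1_early_warning(hack_rate, window=50, threshold=0.7):
    """Early-warning signal via critical slowing down.

    hack_rate : 1-D array of per-step hacking rates for one difficulty tier.
    Returns the first step index at which the sliding-window AR(1)
    autocorrelation of the detrended series exceeds `threshold`, or None.
    """
    x = np.asarray(hack_rate, dtype=float)
    # Detrend with a rolling mean so slow drift in the hacking rate is not
    # mistaken for rising autocorrelation of the fluctuations.
    kernel = np.ones(window) / window
    trend = np.convolve(x, kernel, mode="same")
    resid = x - trend
    for t in range(window, len(resid)):
        w = resid[t - window:t]
        a, b = w[:-1], w[1:]  # lag-1 pairs within the window
        if a.std() == 0 or b.std() == 0:
            continue
        ar1 = np.corrcoef(a, b)[0, 1]
        if ar1 > threshold:
            return t  # warning issued at this training step
    return None
```

The detector is model-free: it sees only the scalar hacking-rate series, never the weights or the reward function.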
Figure 3: EWS AR(1) autocorrelation. Outcome L4 crosses the warning threshold ~700 steps before hacking onset. Adaptive L4 stays below threshold throughout.
| Mode | Recall | Precision | FP Rate | Lead Time |
|---|---|---|---|---|
| Outcome | 0.33 | 1.00 | 0% | 700 steps |
| Process | 1.00 | 1.00 | 0% | 2,600 steps |
| Hybrid | 1.00 | 1.00 | 0% | 1,833 steps |
| Adaptive | 1.00 | 1.00 | 0% | 2,800 steps |
Table 2: EWS prediction performance. The practical corollary: hacking is easier to prevent than reverse.
A closed-loop system writes itself: track the per-difficulty hacking rate during any RLVR run and compute AR(1) in a sliding window; when it crosses the threshold, increase the process-supervision weight for that difficulty tier. No prior knowledge of when hacking will occur is needed.
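As a sketch, one control step of that loop could look like this. The function name, the `bump` size, and the dict-based bookkeeping are illustrative assumptions, not the report's implementation.

```python
def adapt_supervision(weights, histories, warning, bump=0.1, max_weight=1.0):
    """One control step: raise the process-supervision weight for any
    difficulty tier whose hacking-rate history has tripped the EWS.

    weights   : dict tier -> process-supervision weight in [0, 1]
    histories : dict tier -> list of observed hacking rates so far
    warning   : callable(series) -> bool, True when AR(1) exceeds threshold
    """
    for tier, series in histories.items():
        if warning(series):
            weights[tier] = min(max_weight, weights[tier] + bump)
    return weights
```

Called once per evaluation interval, this nudges only the tiers approaching the critical point and leaves the cheap outcome signal in place everywhere else.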
Cross-Domain Transfer: The Verifier Surprise
Process supervision is powerful but expensive to annotate. If reasoning patterns (logical coherence, step ordering, coherent argumentation) are universal, can a verifier trained on mathematics supervise code generation at zero marginal annotation cost?
I trained DeBERTa-v3-base as a step-level binary classifier on mathematical reasoning, then used it to provide process supervision for Qwen2.5-1.5B during PPO training on MBPP code generation. 3,000 PPO steps, 3 seeds, 150 test samples per benchmark.
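Schematically, the verifier turns a multi-step completion into a dense reward as below. The one-step-per-line split and the `verifier` callable are simplifying assumptions; the actual step segmentation and the DeBERTa scoring head are not spelled out here.

```python
def process_reward(completion: str, verifier) -> float:
    """Average step-level validity score over a completion.

    completion : model output, assumed one reasoning step per line
    verifier   : callable(step: str) -> float in [0, 1], e.g. the
                 sigmoid output of a step-level binary classifier
    """
    steps = [line.strip() for line in completion.splitlines() if line.strip()]
    if not steps:
        return 0.0  # no parseable steps -> no process credit
    return sum(verifier(step) for step in steps) / len(steps)
```

Because the reward depends only on the verifier's scores, swapping a math-trained verifier in for a code-trained one requires no change to the PPO loop itself.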
Figure 4: The math-trained verifier achieves 91% of outcome supervision performance on code (43.0% vs 47.3%) despite zero code exposure. An untrained DeBERTa degrades performance below baseline.
| Supervision Method | MBPP (Code) | MATH |
|---|---|---|
| Base model (no training) | 14.0 ± 0.0 | 18.0 ± 0.0 |
| Outcome supervision | 47.3 ± 1.5 | 32.7 ± 1.5 |
| Math-trained verifier | 43.0 ± 1.4 | 30.3 ± 1.2 |
| Code-trained verifier | 36.7 ± 1.2 | 22.7 ± 0.9 |
| Untrained DeBERTa | 10.3 ± 0.9 | 12.3 ± 0.9 |
Table 3: Transfer matrix (mean ± std, 3 seeds, 3,000 PPO steps).
The most informative result is the transfer asymmetry. Math supervision transfers strongly to code (+29.0% over base), while code supervision transfers weakly to math (+4.7%). The math-trained verifier outperforms the code-trained verifier even on code tasks. Verifier training quality dominates domain match. Supervision benefits emerge at just 10 to 25% of full annotations.
Universal Goodhart Tipping Points
The third study is a full factorial: 8 supervision types × 3 seeds = 24 experiments, each 3,000 GRPO steps (~96 GPU-hours on 4× RTX 3090). The goal is a complete phase diagram of supervision optimality.
Figure 5: After T* ~ 1,500 steps, the model continues optimising reward while true performance degrades. This tipping point is universal across all 8 supervision types tested.
The most surprising finding: the Goodhart tipping point T* ~ 1,500 to 1,667 steps is universal. Despite spanning the entire outcome-to-process spectrum, every strategy hits its tipping point in the same narrow window. T* is driven by training dynamics and model capacity, not the supervision signal.
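One way to locate T* offline is to find the step where held-out accuracy peaks while the proxy reward keeps climbing. The sketch below assumes both curves were logged per step; the smoothing window and the function name are mine.

```python
import numpy as np

def estimate_tipping_point(true_acc, proxy_reward, smooth=25):
    """Estimate the Goodhart tipping step T*.

    T* is taken as the argmax of the smoothed true-performance curve,
    reported only if the proxy reward is still higher at the end of
    training than it was at T* and true performance has declined since,
    i.e. the proxy kept improving while the real objective got worse.
    """
    acc = np.convolve(true_acc, np.ones(smooth) / smooth, mode="valid")
    t_star = int(np.argmax(acc))
    if proxy_reward[-1] > proxy_reward[t_star] and acc[-1] < acc[t_star]:
        return t_star
    return None
```

On runs where proxy and true performance improve together, the guard fails and no tipping point is reported, which is the desired behaviour.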
| Supervision | Final Acc. | Std | Notes |
|---|---|---|---|
| Outcome | 56.7% | ±3.8% | Baseline |
| Process | 21.7% | ±2.1% | Verifier mismatch |
| Hybrid (0.50) | 37.0% | ±26.2% | Seed-7 collapse: 81% to 0% |
| Hybrid (0.75) | 58.0% | ±0.0% | Zero variance, most stable |
| DASS-Adaptive | 58.3% | ±1.5% | Best overall |
Table 4: Final accuracy across 24 runs. DASS-Adaptive achieves the highest mean (58.3% ± 1.5%).
What This Means
Reward hacking generalisation has a detectable precursor. The finding that low-harm reward hacking generalises to high-harm contexts makes the question of when hacking emerges during training urgent. The EWS framework provides a model-free, theory-grounded signal: rising AR(1) autocorrelation in the detrended hacking-rate residuals predicts the phase transition before it occurs.
The optimal supervision signal is difficulty-dependent. For easy problems, outcome supervision is sufficient. For hard problems, outcome supervision creates a gradient signal so noisy (98x higher variance) that the optimiser reliably discovers shortcuts. The adaptive policy requires no additional compute.
Process verifiers are more broadly useful than expected. The 91% cross-domain transfer efficiency suggests high-quality reasoning verifiers trained on mathematical data could supervise a wider class of tasks at dramatically lower annotation cost.
Hacking is easier to prevent than to reverse. Once the model enters the hacked state, the reward landscape has already changed. EWS-triggered intervention acts 700 to 2,800 steps before the transition, when the model is still correctable. This reframes the problem from detection to prediction.