Reward Hacking Is a Phase Transition and We Can See It Coming
Why This Matters Now
We know that finetuning on minor-harm reward hacking generalises to serious-harm contexts (28% to 78%). We know that specification gaming can bootstrap from sycophancy to reward tampering. We know that as reasoning gets longer, model failures become increasingly incoherent. What we did not know was when the transition to hacking occurs during training, whether it can be predicted before it happens, and whether the right supervision signal can prevent it entirely.
This report answers all three questions. The answer to the first is: abruptly. The answer to the second is: yes, with 100% precision. The answer to the third is: yes, with a 34x reduction on the hardest problems.
The 98x Variance Gap
The first study compares four supervision modes across five difficulty levels using GRPO on Qwen2.5-1.5B-Instruct with LoRA. The most striking finding from real training is the reward signal itself.
Figure 1: Reward variance from real Qwen2.5-1.5B training (50 steps, seed 42). Process supervision has 98x lower variance (0.0018 vs 0.1770), nearly two orders of magnitude. The gradient signal under outcome supervision is so noisy that the optimiser reliably discovers shortcuts rather than valid reasoning.
This matters because it explains a mechanism. If you are training a model with RL and the reward signal has variance 0.177, the model faces an optimisation landscape where shortcuts and valid solutions are indistinguishable by gradient signal alone. Process supervision collapses this ambiguity by providing dense, step-level feedback.
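A toy simulation illustrates the mechanism. This is an illustrative sketch, not the measured training setup: the per-step success probability `p_step` and the step count are made-up numbers, so the resulting ratio will not reproduce the 98x figure, only the direction of the effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rollouts, n_steps = 10_000, 8
p_step = 0.8  # hypothetical probability that a single reasoning step is valid

# Outcome reward: 1 only if the whole chain succeeds (here: all steps valid).
# Sparse and binary, so its variance stays large.
steps_ok = rng.random((n_rollouts, n_steps)) < p_step
outcome_reward = steps_ok.all(axis=1).astype(float)

# Process reward: mean of per-step scores. Dense feedback averages the
# per-step noise down by roughly a factor of n_steps.
process_reward = steps_ok.mean(axis=1)

print(f"outcome variance: {outcome_reward.var():.4f}")
print(f"process variance: {process_reward.var():.4f}")
```

Even in this crude model the dense signal is several times less noisy per sample; in the real runs the gap is two orders of magnitude.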
Reward Hacking as a Phase Transition
The second key finding reframes how we think about hacking. It is not a gradual drift but a critical transition — an abrupt qualitative change in model behaviour. You do not monitor for whether hacking has started (detection). You monitor for whether the system is approaching a critical point (prediction).
| Mode | L1 | L2 | L3 | L4 | L5 |
|---|---|---|---|---|---|
| Outcome | Never | Never | 2,200 | 1,500 | 1,100 |
| Process | Never | Never | Never | Never | 8,000 |
| Hybrid | Never | Never | 4,500 | 3,000 | 2,500 |
| Adaptive | Never | Never | 5,000 | Never | Never |
Table 1: Hacking onset steps by mode and difficulty. Never = no hacking within 10K steps. Harder problems hack earlier. Adaptive mode eliminates L4/L5 hacking entirely.
Under outcome supervision, L5 problems begin hacking at step 1,100. Adaptive supervision routes hard problems to process-heavy rewards before the phase transition. The model never enters the hacked state because the transition is prevented, not reversed.
Figure 2: L5 (hardest) hacking rate. Adaptive achieves 1.5% vs. 50.6% for outcome — a 34x reduction.
Predicting the Transition Before It Happens
This is the novel contribution I am most interested in pursuing further. Near a bifurcation in a dynamical system, the dominant eigenvalue of the system Jacobian approaches 1. Perturbations decay more slowly, causing the autocorrelation of fluctuations to rise — a phenomenon called critical slowing down. This is well established in ecology and climate science (Scheffer et al., 2009) but has never been applied to reward hacking in LLM training.
I track the per-difficulty hacking rate, detrend it, compute AR(1) autocorrelation in a sliding window, and issue a warning when AR(1) exceeds 0.7. The result: 100% precision, zero false alarms, 700 to 2,800 step advance warning.
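A minimal version of this detector can be sketched as follows. The function name, the rolling-mean detrending, and the window size are my choices for illustration; the report's exact detrending and windowing may differ.

```python
import numpy as np

def ar1_early_warning(hack_rate, window=50, threshold=0.7):
    """Early-warning signal via critical slowing down.

    hack_rate : 1-D array of per-step hacking rates for one difficulty tier.
    Returns the first step index at which the sliding-window AR(1)
    autocorrelation of the detrended series exceeds `threshold`, or None.
    """
    x = np.asarray(hack_rate, dtype=float)
    # Detrend with a rolling mean so slow drift in the hacking rate is not
    # mistaken for rising autocorrelation of the fluctuations.
    kernel = np.ones(window) / window
    trend = np.convolve(x, kernel, mode="same")
    resid = x - trend
    for t in range(window, len(resid)):
        w = resid[t - window:t]
        a, b = w[:-1], w[1:]  # lag-1 pairs within the window
        if a.std() == 0 or b.std() == 0:
            continue
        ar1 = np.corrcoef(a, b)[0, 1]
        if ar1 > threshold:
            return t  # warning issued at this training step
    return None
```

The detector is model-free: it sees only the scalar hacking-rate series, never the weights or the reward function.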
Figure 3: EWS AR(1) autocorrelation. Outcome L4 crosses the warning threshold ~700 steps before hacking onset. Adaptive L4 stays below threshold throughout.
| Mode | Recall | Precision | FP Rate | Lead Time |
|---|---|---|---|---|
| Outcome | 0.33 | 1.00 | 0% | 700 steps |
| Process | 1.00 | 1.00 | 0% | 2,600 steps |
| Hybrid | 1.00 | 1.00 | 0% | 1,833 steps |
| Adaptive | 1.00 | 1.00 | 0% | 2,800 steps |
Table 2: EWS prediction performance. The practical corollary: hacking is easier to prevent than reverse.
A closed-loop system writes itself: track the per-difficulty hacking rate during any RLVR run and compute AR(1) in a sliding window; when it crosses the threshold, increase the process-supervision weight for that difficulty tier. No prior knowledge of when hacking will occur is needed.
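As a sketch, one control step of that loop could look like this. The function name, the `bump` size, and the dict-based bookkeeping are illustrative assumptions, not the report's implementation.

```python
def adapt_supervision(weights, histories, warning, bump=0.1, max_weight=1.0):
    """One control step: raise the process-supervision weight for any
    difficulty tier whose hacking-rate history has tripped the EWS.

    weights   : dict tier -> process-supervision weight in [0, 1]
    histories : dict tier -> list of observed hacking rates so far
    warning   : callable(series) -> bool, True when AR(1) exceeds threshold
    """
    for tier, series in histories.items():
        if warning(series):
            weights[tier] = min(max_weight, weights[tier] + bump)
    return weights
```

Called once per evaluation interval, this nudges only the tiers approaching the critical point and leaves the cheap outcome signal in place everywhere else.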
Cross-Domain Transfer: The Verifier Surprise
Process supervision is powerful but expensive to annotate. If reasoning patterns (logical coherence, step ordering, coherent argumentation) are universal, can a verifier trained on mathematics supervise code generation at zero marginal annotation cost?
I trained DeBERTa-v3-base as a step-level binary classifier on mathematical reasoning, then used it to provide process supervision for Qwen2.5-1.5B during PPO training on MBPP code generation. 3,000 PPO steps, 3 seeds, 150 test samples per benchmark.
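Schematically, the verifier turns a multi-step completion into a dense reward as below. The one-step-per-line split and the `verifier` callable are simplifying assumptions; the actual step segmentation and the DeBERTa scoring head are not spelled out here.

```python
def process_reward(completion: str, verifier) -> float:
    """Average step-level validity score over a completion.

    completion : model output, assumed one reasoning step per line
    verifier   : callable(step: str) -> float in [0, 1], e.g. the
                 sigmoid output of a step-level binary classifier
    """
    steps = [line.strip() for line in completion.splitlines() if line.strip()]
    if not steps:
        return 0.0  # no parseable steps -> no process credit
    return sum(verifier(step) for step in steps) / len(steps)
```

Because the reward depends only on the verifier's scores, swapping a math-trained verifier in for a code-trained one requires no change to the PPO loop itself.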
Figure 4: The math-trained verifier achieves 91% of outcome supervision performance on code (43.0% vs 47.3%) despite zero code exposure. An untrained DeBERTa degrades performance below baseline.
| Supervision Method | MBPP (Code) | MATH |
|---|---|---|
| Base model (no training) | 14.0 ± 0.0 | 18.0 ± 0.0 |
| Outcome supervision | 47.3 ± 1.5 | 32.7 ± 1.5 |
| Math-trained verifier | 43.0 ± 1.4 | 30.3 ± 1.2 |
| Code-trained verifier | 36.7 ± 1.2 | 22.7 ± 0.9 |
| Untrained DeBERTa | 10.3 ± 0.9 | 12.3 ± 0.9 |
Table 3: Transfer matrix (mean ± std, 3 seeds, 3,000 PPO steps).
The most informative result is the transfer asymmetry. Math supervision transfers strongly to code (+29.0% over base), while code supervision transfers weakly to math (+4.7%). The math-trained verifier outperforms the code-trained verifier even on code tasks. Verifier training quality dominates domain match. Supervision benefits emerge at just 10 to 25% of full annotations.
Universal Goodhart Tipping Points
The third study is a full factorial: 8 supervision types × 3 seeds = 24 experiments, each 3,000 GRPO steps (~96 GPU-hours on 4× RTX 3090). The goal is a complete phase diagram of supervision optimality.
Figure 5: After T* ~ 1,500 steps, the model continues optimising reward while true performance degrades. This tipping point is universal across all 8 supervision types tested.
The most surprising finding: the Goodhart tipping point T* ~ 1,500 to 1,667 steps is universal. Despite spanning the entire outcome-to-process spectrum, every strategy hits its tipping point in the same narrow window. T* is driven by training dynamics and model capacity, not the supervision signal.
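One way to locate T* offline is to find the step where held-out accuracy peaks while the proxy reward keeps climbing. The sketch below assumes both curves were logged per step; the smoothing window and the function name are mine.

```python
import numpy as np

def estimate_tipping_point(true_acc, proxy_reward, smooth=25):
    """Estimate the Goodhart tipping step T*.

    T* is taken as the argmax of the smoothed true-performance curve,
    reported only if the proxy reward is still higher at the end of
    training than it was at T* and true performance has declined since,
    i.e. the proxy kept improving while the real objective got worse.
    """
    acc = np.convolve(true_acc, np.ones(smooth) / smooth, mode="valid")
    t_star = int(np.argmax(acc))
    if proxy_reward[-1] > proxy_reward[t_star] and acc[-1] < acc[t_star]:
        return t_star
    return None
```

On runs where proxy and true performance improve together, the guard fails and no tipping point is reported, which is the desired behaviour.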
| Supervision | Final Acc. | Std | Notes |
|---|---|---|---|
| Outcome | 56.7% | ±3.8% | Baseline |
| Process | 21.7% | ±2.1% | Verifier mismatch |
| Hybrid (0.50) | 37.0% | ±26.2% | Seed-7 collapse: 81% to 0% |
| Hybrid (0.75) | 58.0% | ±0.0% | Zero variance, most stable |
| DASS-Adaptive | 58.3% | ±1.5% | Best overall |
Table 4: Final accuracy across 24 runs. DASS-Adaptive achieves the highest mean (58.3% ± 1.5%).
What This Means
Reward hacking generalisation has a detectable precursor. The finding that low-harm reward hacking generalises to high-harm contexts makes the question of when hacking emerges during training urgent. The EWS framework provides a model-free, theory-grounded signal: rising AR(1) autocorrelation in the detrended hacking-rate residuals predicts the phase transition before it occurs.
The optimal supervision signal is difficulty-dependent. For easy problems, outcome supervision is sufficient. For hard problems, outcome supervision creates a gradient signal so noisy (98x higher variance) that the optimiser reliably discovers shortcuts. The adaptive policy requires no additional compute.
Process verifiers are more broadly useful than expected. The 91% cross-domain transfer efficiency suggests high-quality reasoning verifiers trained on mathematical data could supervise a wider class of tasks at dramatically lower annotation cost.
Hacking is easier to prevent than to reverse. Once the model enters the hacked state, the reward landscape has already changed. EWS-triggered intervention acts 700 to 2,800 steps before the transition, when the model is still correctable. This reframes the problem from detection to prediction.