
Compiling the Subjective: RL Beyond Verifiable Rewards

Kargi Chauhan  ·  2026  ·  Research Report
TL;DR: RLVR has driven remarkable progress on math and code, but its reliance on deterministic correctness checks breaks down for creative writing, medical reasoning, and long-form research. Four paradigms have emerged to bridge this verifiability gap: rubric-based rewards, programmatic judge codes, latent-variable optimization, and synthetic data transformation. Judge Code-guided RL (JC-RL) compiles subjective rubrics into executable Python scripts, achieves competitive scores at over 2× the training speed of generative reward models, and comes with a theoretical proof that evaluating just 1 of 9 criteria per sample yields an unbiased gradient estimator.

The Verifiability Gap

The story of post-training in 2025 has been, in many ways, the story of RLVR. DeepSeek-R1 showed that Group Relative Policy Optimization (GRPO) with simple rule-based rewards — did the model get the right number? did the code pass the test? — could unlock deep chain-of-thought reasoning without expensive human preference data. The recipe is elegant: replace the fragile neural reward model with a deterministic verifier, and let the model explore freely.

But here is the catch. The same property that makes RLVR so stable — binary, automatically-checkable correctness — limits it to a narrow slice of what we actually want LLMs to do. Writing a compassionate medical explanation, crafting a compelling argument, or synthesizing a well-cited research survey are tasks where there is no single right answer and no unit test to run. This is the verifiability gap, and closing it is arguably the central open problem in LLM post-training today.

I selected this topic because it sits at the intersection of two themes I care deeply about: reward hacking and supervision quality in RL-based training. My own work on process versus outcome supervision in RLVR has shown how even in verifiable domains, the choice of what to supervise matters enormously. In non-verifiable domains, this question becomes existential — if you cannot even check the final answer, what do you reward?

Four Paradigms

The past year has seen four families of solutions emerge to bridge the verifiability gap. They differ dramatically in how they construct the reward signal and what trade-offs they accept.

| Paradigm | Key Idea | Best For | Core Trade-off |
| --- | --- | --- | --- |
| Rubric-based rewards | Decompose evaluation into checklist criteria (RaR, Rubicon, OpenRubrics) | Medical, instruction-following | Rubric design requires care; seesaw effect between skills |
| Programmatic judge codes | Compile rubrics into executable Python scripts; remove the LLM judge from the reward loop (JC-RL) | Creative writing, conversational | Bounded by the coding LLM's ability to translate subjective criteria |
| Latent-variable methods | Treat chain-of-thought as latent; optimize a marginal log-likelihood lower bound (JEPO) | Mathematical proofs, semi-verifiable tasks | Computationally expensive; requires ground truth for the bound |
| Synthetic data transformation | Convert unverifiable text into fill-in-the-middle RLVR tasks (Golden Goose) | Cybersecurity, knowledge-intensive domains | Distractor quality; may introduce spurious signals |

Table 1: The four paradigms bridging the verifiability gap. Each trades off reward quality, compute, and domain coverage differently.

Rubric-based rewards decompose evaluation into a checklist of k criteria. Rubrics as Rewards (RaR) frames standard RLVR as the special case of a one-criterion rubric (k = 1) whose single criterion is binary correctness. RaR achieves up to 31% relative improvement on HealthBench over LLM-as-judge baselines. Rubicon extends this with a multi-stage curriculum to avoid the seesaw effect, where instruction-following and creativity degrade each other during joint training. OpenRubrics scales rubric generation via Contrastive Rubric Generation, which prompts an LLM to analyze good and bad response pairs and extract discriminative rules.
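The checklist idea is simple enough to sketch directly. Below is a minimal, hypothetical rubric reward: each criterion is a boolean check on the response, and the reward is the weighted fraction satisfied. The criteria themselves are illustrative inventions, not taken from RaR; standard RLVR falls out as the k = 1 case with a single binary-correctness check.

```python
from typing import Callable, List, Optional

Criterion = Callable[[str], bool]

def rubric_reward(response: str, criteria: List[Criterion],
                  weights: Optional[List[float]] = None) -> float:
    """Weighted fraction of checklist criteria the response satisfies."""
    weights = weights or [1.0] * len(criteria)
    total = sum(weights)
    return sum(w * float(c(response)) for c, w in zip(criteria, weights)) / total

# Hypothetical criteria for a medical-explanation task.
criteria = [
    lambda r: "diagnosis" in r.lower(),   # names a diagnosis
    lambda r: len(r.split()) < 300,       # stays concise
    lambda r: "consult" in r.lower(),     # recommends professional follow-up
]
```

A fully compliant response scores 1.0; a response satisfying one of three criteria scores 1/3, giving a denser signal than a single pass/fail bit.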

Self-evolving rubrics address the staleness of static criteria. RLER grounds rubrics in real-time internet search for Deep Research agents, maintaining evolving buffers of positive rubrics (rewarding newly discovered good habits) and negative rubrics (penalizing discovered reward hacking behaviors).
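The buffer mechanics can be sketched as two bounded queues; the structure below is an assumption for illustration, not RLER's actual implementation (which grounds rubric updates in live search results).

```python
from collections import deque

class RubricBuffer:
    """Sketch of evolving rubric buffers in the spirit of RLER:
    positive rubrics reward newly discovered good habits, negative
    rubrics penalize discovered hacks. Structure is hypothetical."""

    def __init__(self, maxlen: int = 64):
        self.positive = deque(maxlen=maxlen)  # old rubrics age out
        self.negative = deque(maxlen=maxlen)

    def add(self, rubric: str, polarity: str) -> None:
        (self.positive if polarity == "pos" else self.negative).append(rubric)

    def score(self, satisfied: set) -> int:
        # Net score: satisfied positive rubrics minus triggered negative ones.
        pos = sum(r in satisfied for r in self.positive)
        neg = sum(r in satisfied for r in self.negative)
        return pos - neg
```

Because the buffers are bounded, stale criteria age out as the policy's behavior (and its exploits) evolve.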

Synthetic data transformation sidesteps reward design entirely. Golden Goose converts raw internet text into standard RLVR tasks by masking key reasoning steps and generating plausible distractors. This produced 700K synthetic tasks that enabled a 4B model to surpass a 7B domain-specialist in cybersecurity.
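The transformation can be sketched as masking one step of a passage and pairing the gold step with distractors, which reduces verification to exact match. This is a simplified stand-in: the real Golden Goose pipeline uses an LLM to pick the step to mask and to generate plausible distractors.

```python
import random

def make_fitm_task(passage_steps, mask_idx, distractors, rng=None):
    """Turn raw text into a verifiable fill-in-the-middle task
    (sketch of a Golden Goose-style transformation)."""
    rng = rng or random.Random(0)
    answer = passage_steps[mask_idx]
    masked = passage_steps[:mask_idx] + ["[MASKED STEP]"] + passage_steps[mask_idx + 1:]
    options = list(distractors) + [answer]
    rng.shuffle(options)
    return {"context": masked, "options": options, "answer": answer}

def verify(task, chosen: str) -> bool:
    # Deterministic, RLVR-style check: exact match against the gold step.
    return chosen == task["answer"]

steps = ["Attacker scans open ports",
         "Exploit delivered via phishing",
         "Payload escalates privileges"]
task = make_fitm_task(steps, 1, ["Firewall is reset", "Logs are deleted"])
```

The payoff is that a subjective-seeming corpus becomes a bank of tasks with the same binary, cheap reward that makes RLVR stable.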

Paradigm Coverage: Speed, Quality, Scalability, Domain Breadth

Figure 1: Radar comparison of the four paradigms across key axes. JC-RL maximizes speed and scalability; JEPO excels on quality for semi-verifiable tasks but is computationally heavy; rubric-based methods offer broad domain coverage. No single approach dominates.

Deep Dive: Judge Code-Guided RL

I focus on JC-RL because it represents the most conceptually striking response to the verifiability problem: rather than softening verification standards, it literally compiles subjective intent into deterministic code.

A Judge Code Generator (JCG) — a coding LLM like DeepSeek-V3 — translates evaluation rubrics into sample-specific Python scoring functions executed in a sandbox. The LLM judge is removed from the reward loop entirely, recovering the core RLVR advantage: deterministic, fast, non-hackable execution — even for subjective tasks.
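To make the "compiled rubric" idea concrete, here is a sketch of what a generated judge script might look like and how the trainer could run it. The criteria and thresholds are invented for illustration, and `exec()` in a bare namespace is only a stand-in for the paper's actual sandbox (which would isolate the script and enforce timeouts).

```python
# Hypothetical output of the Judge Code Generator for one writing sample.
JUDGE_CODE = r'''
def score(response: str) -> float:
    s = 0.0
    if 150 <= len(response.split()) <= 400:   # length within the brief
        s += 0.4
    if response.count("\n\n") >= 2:           # multi-paragraph structure
        s += 0.3
    if any(w in response.lower() for w in ("because", "so that")):
        s += 0.3                              # causal connective present
    return s
'''

def run_judge(code: str, response: str) -> float:
    ns: dict = {}
    exec(code, ns)  # in practice: a real sandbox with resource limits
    return ns["score"](response)
```

Because the reward is now ordinary Python, it executes in milliseconds and is deterministic across reruns, which is exactly the RLVR property the design recovers.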

Three contributions stand out.

First, the authors prove that evaluating as few as 1 of 9 criteria per sample yields an unbiased gradient estimator. Empirically, partial-reward training curves converge to the same performance ceiling as full-reward training on Qwen2.5-7B. This is a powerful result: imperfect, stochastic rubrics are theoretically sufficient for correct optimization, dramatically reducing evaluation cost.
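The intuition behind the unbiasedness claim is easy to check numerically: sampling one criterion uniformly at random and using its score as the reward has the same expectation as averaging all criteria. A minimal Monte Carlo demonstration (with made-up criterion outcomes):

```python
import random

def full_reward(scores):
    return sum(scores) / len(scores)

def partial_reward(scores, rng):
    # Evaluate a single uniformly sampled criterion. Unbiased because
    # E[scores[i]] over uniform i equals the mean over all criteria.
    return scores[rng.randrange(len(scores))]

scores = [1, 0, 1, 1, 0, 1, 0, 1, 1]  # 9 hypothetical criterion outcomes
rng = random.Random(0)
est = sum(partial_reward(scores, rng) for _ in range(100_000)) / 100_000
```

Note the demo only shows the expectation matches; the variance of the gradient estimate is higher with one criterion, which is the price paid for the 9× cheaper evaluation.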

Second, the offline mode (Off-JCG) achieves over 2× wall-time speedup over generative reward models with competitive scores. The compiled datasets are model-agnostic — rubrics generated by a 7B model successfully train 14B and 32B models across different architectures. This is essentially programmatic knowledge distillation: the broad understanding of a frontier model crystallized into portable, executable rubrics.

Third, "code scaling" adds auxiliary penalty scripts for verbosity and hallucination detection, introducing adversarial pressure against reward hacking within the programmatic framework itself.
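A sketch of how auxiliary penalty scripts might stack onto the main judge score; the penalty form and weights here are assumptions, not the paper's.

```python
def verbosity_penalty(response: str, budget: int = 400) -> float:
    # Hypothetical penalty: fraction of a word budget exceeded, capped at 1.
    over = max(0, len(response.split()) - budget)
    return min(1.0, over / budget)

def composite_reward(base: float, response: str,
                     penalties=(verbosity_penalty,), weight: float = 0.5) -> float:
    # "Code scaling" in spirit: auxiliary scripts subtract from the base
    # judge score, adding adversarial pressure against degenerate outputs.
    return base - weight * sum(p(response) for p in penalties)
```

New failure modes can be countered by appending another script to `penalties`, without retraining any reward model.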

JC-RL Off-JCG vs. GenRM: Quality vs. Training Speed

Figure 2: Off-JCG achieves 70.05 on creative writing benchmarks versus 72.72 for GenRM — a 3.6% gap — at over 2× the training speed. The speedup comes from removing the LLM judge from the reward loop entirely; compiled Python scripts execute in milliseconds.

Limitations. The JCG's dependency on a powerful coding LLM means reward quality is bounded by the model's ability to translate criteria like "emotional resonance" or "empathy" into Python code — dimensions the paper does not evaluate for compilability. Not all aspects of subjective quality can be reduced to string matching, keyword counting, or structural checks.

The partial reward proof assumes uniform, independent criterion sampling, but real criteria are often semantically correlated ("factual accuracy" and "source attribution" overlap substantially). The gated mechanism assigning floor scores for format violations creates hard discontinuities in the reward landscape that may produce the same seesaw effect Rubicon addresses via curriculum staging.
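The discontinuity concern is easy to see in miniature. In a gated scheme like the following (constants invented for illustration), an otherwise excellent response collapses to the floor the moment a format check fails, creating a cliff rather than a gradient:

```python
def gated_score(format_ok: bool, criterion_mean: float, floor: float = 0.1) -> float:
    # Gate: any format violation overrides the rubric score entirely.
    # A 0.9-quality response with a format slip scores the same as junk.
    return criterion_mean if format_ok else floor
```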

Additionally, JC-RL evaluates primarily on creative writing and conversational benchmarks. Extension to high-stakes domains like medical advice or legal reasoning, where rubric errors could have real consequences, remains untested.

Partial Reward Convergence: 1/9 Criteria ≈ Full Reward

Figure 3: Training curves for full-reward vs. partial-reward (1 of 9 criteria sampled per step) on Qwen2.5-7B. Both converge to the same performance ceiling, validating the unbiased gradient estimator theorem. Stochastic, partial rubrics are sufficient for correct optimization.

Open Problems

The papers surveyed here converge on a powerful insight: you do not need perfect, holistic rewards to train good models. Decomposed, partial, stochastic, and even programmatic feedback can substitute for the expensive and brittle neural reward models that dominated the RLHF era. Three open problems remain.

Process supervision for non-verifiable domains. Current methods evaluate final outputs, but process supervision provides denser step-level signals. My own work on DASS shows that process-level supervision outperforms outcome-level supervision in verifiable RLVR. Does this hold when process signals themselves come from rubrics — where "correct reasoning" is itself subjective? The interaction between rubric granularity and supervision level is entirely unexplored.

Rubric robustness under adversarial optimization. Models may learn to satisfy each criterion individually while producing holistically poor responses — a decomposed form of reward hacking. Rubicon's ad-hoc anti-hacking rubrics need formalization. Can we characterize which rubric structures are exploit-resistant, analogous to how deterministic verifiers resist hacking in standard RLVR?

Unifying the reward spectrum. RaR, JC-RL, JEPO, and Golden Goose attack the same problem from different angles. A unified framework that selects the optimal reward mechanism per-criterion — deterministic verification when available, programmatic rubrics for structured subjective tasks, latent-variable optimization for long-form generation — could yield a system more robust than any single approach. The verifiability gap is closing. What remains is ensuring it closes safely.
