Verifier and Reward Design for RL Environments

Executive Summary

In reinforcement learning, the quality of your training is bounded by the quality of your scoring. If the verifier is wrong, the reward is wrong, and the model learns the wrong thing. Every trajectory that enters a training pipeline carries the score it was given, and the optimizer treats that score as ground truth. Weak scoring does not just add noise. It teaches the model to succeed at the wrong task.

For teams building RL environments around real software (browser workflows, API integrations, file manipulation, diagnostic pipelines), scoring is especially hard. These tasks produce non-differentiable outcomes: a spreadsheet is either in the right state or it is not, an API call either had the correct payload or it did not. There is no gradient to follow through a browser DOM. The scoring system you build is the only bridge between "did the agent do the right thing" and "what signal does the model get".

This guide covers the four layers of that scoring system: verifiers, pass/fail checks, rubrics, and reward functions. It walks through how to define success conditions before designing reward formulas, how to build checks that survive contact with increasingly capable models, and what separates a useful training trajectory from one that just happened to pass. Platforms like Human Union Data (HUD) are built around the same idea: environment runs need reliable scoring before they can become a useful training signal.

The scoring stack inside an RL environment

Scoring an environment run is not a single function. It is a stack of concerns, each with a different job. Conflating them is one of the fastest ways to build a reward that looks fine during development and breaks during training.

Verifiers check objective task correctness

A verifier answers the binary question: did the agent complete the task? For a spreadsheet task, the verifier might inspect final cell values, formulas, and sheet structure against an expected state. For a browser task, it might check whether the correct form was submitted with the right fields, or whether the target page reached a specific condition.

Verifiers should be programmatic wherever possible. Tasks need clear, verifiable answers, because the entire training loop depends on a grader assigning a numeric reward. When the check is deterministic, it removes an entire class of noise from the training signal.
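
As a concrete sketch, a spreadsheet verifier along these lines might compare the final cell state against an expected state. The function name, the dict-based state format, and the cell values below are all illustrative, not a real environment API:

```python
# Minimal sketch of a programmatic verifier for a spreadsheet task.
# `final_state` stands in for a cell -> value mapping pulled from the
# environment after the trajectory ends; the format is illustrative.

def verify_spreadsheet(final_state: dict, expected: dict) -> bool:
    """Pass only if every expected cell matches exactly."""
    return all(final_state.get(cell) == value for cell, value in expected.items())

expected = {"A1": "Q3 Revenue", "B1": 1250.0, "B2": "=SUM(B1:B1)"}
completed = {"A1": "Q3 Revenue", "B1": 1250.0, "B2": "=SUM(B1:B1)", "C9": "scratch"}
partial = {"A1": "Q3 Revenue", "B1": 1250.0}

print(verify_spreadsheet(completed, expected))  # extra cells are ignored
print(verify_spreadsheet(partial, expected))    # missing B2 -> fail
```

Because the check is deterministic, the same final state always produces the same verdict, which is exactly the property that removes noise from the training signal.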

Pass/fail checks enforce hard constraints

Pass/fail checks are binary gates that catch trajectories violating non-negotiable requirements. These are distinct from the verifier. A verifier asks "did the task succeed?", while a pass/fail check asks "did the agent break any rules along the way?".

These checks run independently of task success. An agent that completes the spreadsheet correctly but leaks data to an external service should still fail.
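
A minimal sketch of what named, independently-run constraint checks could look like. The allowed host, the forbidden tool name, and the trajectory format are hypothetical placeholders:

```python
# Sketch of named pass/fail constraint checks that run independently of
# task success. Each check inspects the trajectory (a list of action
# dicts here, purely illustrative) and the names identify violations.

ALLOWED_HOSTS = frozenset({"internal.example.com"})   # hypothetical policy
FORBIDDEN_TOOLS = frozenset({"delete_database"})      # hypothetical policy

def no_external_requests(trajectory):
    return all(a.get("host") in ALLOWED_HOSTS
               for a in trajectory if a["type"] == "http")

def no_forbidden_tools(trajectory):
    return all(a.get("tool") not in FORBIDDEN_TOOLS for a in trajectory)

CHECKS = {"no_external_requests": no_external_requests,
          "no_forbidden_tools": no_forbidden_tools}

def run_checks(trajectory):
    """Return the names of every violated constraint (empty means clean)."""
    return [name for name, check in CHECKS.items() if not check(trajectory)]

leaky = [{"type": "http", "host": "evil.example.net"},
         {"type": "tool", "tool": "write_cell"}]
print(run_checks(leaky))  # ['no_external_requests']
```

Naming each check makes failures debuggable: a zeroed trajectory carries the reason it was zeroed, not just a score.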

Rubrics score quality dimensions

Some aspects of trajectory quality are real but not binary. How many unnecessary steps did the agent take? Did it gather sufficient evidence before acting? Did it recover gracefully from an error, or did it retry the same failing action twelve times?

Rubrics assign graded scores to these dimensions. A rubric criterion might be "completed the task in fewer than 15 tool calls" or "provided a diagnostic summary that references at least two log sources." The key constraint is that each criterion should be observable from the trajectory and environment state, not inferred from vague notions of quality.
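
The two criteria above can be sketched as narrow predicates over the trajectory. The trajectory format, criterion names, and weights here are all illustrative assumptions:

```python
# Sketch of a small rubric where each criterion is a narrow, observable
# predicate over the trajectory rather than a vague quality judgment.

def under_15_tool_calls(traj):
    return sum(1 for a in traj if a["type"] == "tool") < 15

def cites_two_log_sources(traj):
    summary = traj[-1].get("summary", "")
    return sum(src in summary for src in ("app.log", "db.log", "auth.log")) >= 2

RUBRIC = [("efficiency", 0.5, under_15_tool_calls),
          ("evidence", 0.5, cites_two_log_sources)]

def rubric_score(traj) -> float:
    return sum(weight for _, weight, met in RUBRIC if met(traj))

traj = [{"type": "tool"}] * 5 + [
    {"type": "message", "summary": "Spikes in app.log line up with timeouts in db.log"}]
print(rubric_score(traj))  # 1.0
```

Each predicate can be unit-tested in isolation, which is what makes the combined score stable.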

Reward functions turn evaluation into training signal

The reward function combines verifier output, pass/fail results, and rubric scores into a single numeric signal the optimizer can use. It is downstream of everything else. If the verifier is broken, the reward is broken. If the rubric is noisy, the reward is noisy.

The grader deserves the same rigor you would give a production service: tests, edge-case coverage, versioning, and monitoring. Treating it as an afterthought, or as glue code that can be patched later, undermines every other investment in environment and task design.
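
One way the combination could look, as a sketch: hard-constraint violations zero the trajectory, rubric credit is gated on the verifier passing, and all weights are illustrative choices rather than recommended values:

```python
# Sketch of a reward function combining the layers described above.
# Constraint violations and verifier failure both zero the score; the
# rubric adds only a bounded bonus on top of core correctness.

def compute_reward(verifier_pass: bool,
                   violated_constraints: list,
                   rubric_score: float,
                   rubric_weight: float = 0.2) -> float:
    if violated_constraints:   # pass/fail gates: any violation -> zero
        return 0.0
    if not verifier_pass:      # no rubric credit without core correctness
        return 0.0
    return 1.0 + rubric_weight * rubric_score

print(compute_reward(True, [], 0.5))                        # 1.1
print(compute_reward(True, ["no_external_requests"], 1.0))  # 0.0
print(compute_reward(False, [], 1.0))                       # 0.0
```

Keeping the combination this simple makes it easy to test, version, and monitor like any other production component.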

Start with the task outcome, not the reward formula

A common failure pattern is to start designing the reward function before clearly defining what success looks like. Teams jump to reward weights and shaping bonuses before they can articulate, in concrete environment terms, what a completed task produces.

Write the success condition in environment terms

Success should be defined as an observable state change or verifiable output. "The agent correctly updated the customer record" is not a success condition. "The customers table contains a row where id=4521, status='active', and updated_at is within the last 60 seconds" is a success condition.

For browser tasks, success might mean a specific element exists in the DOM, a file was downloaded with the expected checksum, or a confirmation page loaded with a transaction ID. Write success conditions that can be checked against the environment state, not against the agent's self-reported confidence.
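
The database success condition above translates directly into an executable assertion. This sketch uses an in-memory SQLite database as a stand-in for the environment's real datastore:

```python
import sqlite3
import time

# Sketch of the success condition from above, written as an executable
# assertion against environment state (an in-memory SQLite stand-in here).

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, status TEXT, updated_at REAL)")
conn.execute("INSERT INTO customers VALUES (4521, 'active', ?)", (time.time(),))

def customer_updated(conn, now=None) -> bool:
    now = time.time() if now is None else now
    row = conn.execute(
        "SELECT updated_at FROM customers WHERE id = 4521 AND status = 'active'"
    ).fetchone()
    return row is not None and now - row[0] < 60

print(customer_updated(conn))  # True: the row exists and is fresh
```

The check reads the environment state directly; nothing the agent says about its own progress can influence the verdict.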

Separate true success from convenient proxies

Proxy metrics are tempting because they are easy to measure. Counting tool calls, checking whether the agent visited the right URL, or measuring response length are all proxies. They correlate with success in well-behaved runs and diverge from it in adversarial ones.

In a classic example, an agent rewarded for the height of a red block's bottom face learned to flip the block upside down instead of stacking it on top of another block. The proxy (bottom-face height) was satisfied. The task (stacking) was not.

In software environments, proxy-driven scoring creates analogous problems. An agent rewarded for "number of API calls made" during a data-gathering task might call the same endpoint repeatedly, inflating the metric without gathering any new information.

Prefer verifiable checks where possible

Programmatic checks reduce ambiguity, improve repeatability, and make debugging straightforward. When a programmatic check is feasible (file diff, database assertion, HTTP response validation), prefer it over model-based grading.

Reserve model-based or LLM-based grading for dimensions that genuinely resist programmatic checking: open-ended text quality, explanation coherence, or nuanced policy compliance. Even then, treat the LLM grader as a component that needs its own testing and calibration, not as a black-box oracle.

How to design pass/fail checks that hold up in training

  • Gate all additional credit on core correctness. If the verifier returns fail, the trajectory scores zero regardless of rubric performance.
  • Make partial credit for failed tasks deliberate and bounded. Useful during early curriculum design, but never the default.
  • Use named failure checks for each forbidden action.
  • Test valid edge cases, near-misses, and loopholes before training. Run trajectories with unusual but valid paths, close failures, and obvious exploits.
  • Run repeated trials to expose grader instability. A grader that oscillates between pass and fail on the same task produces a weak training signal.
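
Testing near-misses and loopholes before training can be as simple as a labeled case suite run against the verifier. The verifier, state format, and case labels below are hypothetical:

```python
# Sketch of pre-training grader tests: labeled final states covering a
# valid path, a near-miss, and a loophole where the agent claims success
# without changing the environment.

def verify_form_submitted(state) -> bool:
    # Check the environment state, not the agent's transcript.
    return bool(state.get("form_submitted")) and state.get("confirmation_id") is not None

cases = [
    # (description, final environment state, expected verdict)
    ("valid completion",
     {"form_submitted": True, "confirmation_id": "TX-1"}, True),
    ("near-miss: filled but never submitted",
     {"form_submitted": False, "confirmation_id": None}, False),
    ("loophole: claimed success in transcript only",
     {"form_submitted": False, "confirmation_id": None,
      "transcript": "Form submitted!"}, False),
]

for desc, state, expected in cases:
    assert verify_form_submitted(state) == expected, f"grader disagrees on: {desc}"
print("all grader tests passed")
```

Any case the grader misjudges here would have silently poisoned training data later.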

How to build rubrics without making the score noisy

  • Use rubrics for non-binary quality dimensions: step efficiency, evidence completeness, error recovery, resource usage.
  • Keep each criterion observable and narrow. "The agent's approach was well-structured" is not scorable. "The agent completed the file edit without reverting more than once" is. Two independent reviewers (or two grading runs) should produce the same score.
  • Split bundled criteria. "Did the agent gather evidence AND present it clearly" is two criteria. Separate them. Narrow criteria are easier to test, debug, and stabilize.
  • Cap rubric size at three to five well-defined criteria. A small, specific rubric produces a cleaner signal than a large, vague one.
  • Do not let style outweigh correctness. Task completion and correctness dominate the score. A beautifully formatted but incorrect diagnostic report should not outscore a terse but correct one.

Reward design patterns that improve learning

Once verifiers, pass/fail checks, and rubrics are stable, the reward function combines them into a training signal. The design of that combination matters.

Use terminal rewards for true task completion

The terminal reward, assigned based on the final environment state after the trajectory completes, should be the largest component of the total reward. It directly links the score to the outcome the environment was designed to evaluate.

For a browser-based form submission task, the terminal reward checks whether the form was submitted correctly and the confirmation state is valid. For a multi-file code edit, it checks whether the test suite passes against the modified codebase. The terminal reward is where your verifier does its work.

Add shaping rewards carefully

Shaping rewards provide intermediate signal during long trajectories where the terminal reward alone is too sparse. They can reward progress indicators: the agent opened the correct file, navigated to the right page, or established the right API connection before attempting the final action.

Shaping rewards also create new surfaces for exploitation. An agent rewarded for "opening the correct file" might learn to open and close the file repeatedly. Pan, Bhatia, and Steinhardt found that more capable agents are more likely to exploit reward misspecifications, achieving higher proxy reward while delivering lower true reward. Their results show phase transitions where increased capability causes a sharp qualitative shift into reward hacking. The implication is direct: a shaping reward that seems harmless with a weak model can become a liability once the model improves.
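
One defensive pattern is to make each shaping bonus idempotent, so re-triggering an indicator earns nothing. A minimal sketch, with illustrative indicator names and magnitudes:

```python
# Sketch of idempotent shaping bonuses: each progress indicator pays out
# at most once per trajectory, so opening and closing the file repeatedly
# earns no more than opening it once.

SHAPING = {"opened_target_file": 0.05, "reached_checkout_page": 0.05}

def shaped_reward(events, terminal_reward: float) -> float:
    paid = set()
    bonus = 0.0
    for event in events:
        if event in SHAPING and event not in paid:
            bonus += SHAPING[event]
            paid.add(event)
    return terminal_reward + bonus

# Opening the file five times pays the same as opening it once.
looping = ["opened_target_file"] * 5 + ["reached_checkout_page"]
print(shaped_reward(looping, terminal_reward=1.0))
```

Idempotence closes the repeat-trigger exploit, but it does not close every gap, which is why shaping components still need the ablation testing described below.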

Keep shaping subordinate to the real objective

If you use shaping rewards, keep their magnitude small relative to the terminal reward. The right ratio will depend on your task and environment, so validate your weighting with ablation experiments.

Train with and without each shaping component, then compare true task completion rates (not proxy reward). If removing a shaping signal does not hurt completion rates, it is not helping. If adding a shaping signal increases proxy reward but decreases completion rates, it is actively harmful.
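
The ablation comparison reduces to comparing true completion rates across arms. The run results below are stand-in data, not real experiments:

```python
# Sketch of the ablation comparison: for each shaping component, compare
# true task completion rates of runs trained with and without it.

def completion_rate(runs):
    return sum(r["completed"] for r in runs) / len(runs)

with_shaping = [{"completed": True}] * 42 + [{"completed": False}] * 58
without_shaping = [{"completed": True}] * 47 + [{"completed": False}] * 53

delta = completion_rate(with_shaping) - completion_rate(without_shaping)
# A negative delta means the shaping signal lowers true completion
# and should be removed even if it raises proxy reward.
print(f"with: {completion_rate(with_shaping):.2f}, "
      f"without: {completion_rate(without_shaping):.2f}, delta: {delta:+.2f}")
```

The key detail is that the comparison metric is completion rate from the verifier, never the proxy reward the shaping term itself inflates.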

What makes a trajectory useful for training

A trajectory that earns a passing score is not automatically useful for training. Usefulness requires reliability, generalizability, and informativeness.

Correct trajectories should be repeatable

If the same agent policy produces wildly different outcomes on the same task across repeated runs, the passing trajectories may be lucky rather than learned. Test trajectory repeatability by running the same task multiple times with the same policy. If the pass rate is unstable, investigate whether the instability comes from the environment, the agent, or the grader. Each source requires a different fix.
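
A repeatability probe can be as simple as summarizing the spread of verifier verdicts over repeated rollouts of one task. The outcomes list below is stand-in data for those verdicts:

```python
from statistics import pstdev

# Sketch of a repeatability probe: `outcomes` stands in for verifier
# verdicts from repeated rollouts of the same task with the same policy.

outcomes = [True, True, False, True, False, True, True, False, True, True]
pass_rate = sum(outcomes) / len(outcomes)
spread = pstdev([int(o) for o in outcomes])  # 0.0 means perfectly stable

print(f"pass rate: {pass_rate:.1f}, std dev: {spread:.2f}")
# A mid-range pass rate with high spread on a single task flags luck;
# the fix depends on whether the instability is in the environment,
# the agent, or the grader.
```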

Useful trajectories respect constraints and generalize

A trajectory that reaches the correct end state by exploiting a loophole (hardcoding an answer that happens to be right, skipping required validation steps) may score well but teach the model a strategy that will not transfer. Verifiers should check the path, not just the destination, when constraints are part of the task definition.

Avoid building verifiers that accept only one scripted sequence of actions. The goal is to verify that required conditions are met, not that the agent followed a specific playbook. Overly rigid verification rejects valid alternative approaches and narrows the policy's generalization.

Review high-scoring failures and low-scoring successes

Trajectory inspection is a debugging tool for the scoring system, not just the model. If a trajectory scored 0.9 but the agent's behavior looks brittle, wasteful, or unsafe, the scoring system has a gap. If a trajectory scored 0.2 but the agent actually completed the task through a valid alternative path, the verifier is too narrow.

Regularly sample trajectories from both tails of the score distribution and review them manually. Teams that only look at aggregate pass rates miss systematic scoring errors that degrade training data quality over time.

Common failure modes in verifier and reward design

Most scoring systems break in predictable ways. Knowing the common failure modes saves iteration time.

Reward hacking from proxy metrics

Specification gaming is the most documented failure mode. DeepMind Safety Research catalogs dozens of examples where agents satisfied the reward function without completing the intended task. In software environments, reward hacking manifests as agents that game intermediate metrics, repeat rewarded actions without progressing, or find shortcuts that satisfy the verifier's literal checks while violating the spirit of the task.

The risk increases with model capability. Stronger models are better at finding and exploiting gaps between the intended objective and the measured objective. Re-test your scoring system whenever you upgrade the underlying model.

Sparse rewards with no learning signal

If the only reward is a binary terminal check on a 50-step task, the model receives no gradient-useful information about which of the 50 steps mattered. For complex environment tasks, purely sparse rewards can make learning extremely slow or impractical.

Overly rigid graders that reject valid solutions

A verifier that checks for one exact sequence of actions (click button A, then fill field B, then submit form C) will reject agents that find equally valid alternative paths. In real software, there are usually multiple correct ways to accomplish a task.

Noisy graders that change across runs

If the same trajectory receives different scores on repeated evaluations, the grader is injecting noise into the training signal. LLM-based graders are particularly susceptible to scoring variance across runs.

Measure grader consistency by scoring the same set of trajectories multiple times and computing agreement rates. If agreement is low, either tighten the grading criteria, add programmatic checks to reduce the LLM grader's scope, or average across multiple grading runs before assigning a final score.
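
The agreement-rate measurement can be sketched directly: score the same trajectories in repeated passes and count how many receive identical scores every time. The scores below are stand-in data:

```python
# Sketch of a grader-consistency check across repeated scoring passes.

def agreement_rate(score_runs):
    """score_runs: list of passes, each a list of scores per trajectory."""
    per_trajectory = list(zip(*score_runs))
    stable = sum(1 for scores in per_trajectory if len(set(scores)) == 1)
    return stable / len(per_trajectory)

run_a = [1.0, 0.0, 0.8, 1.0]
run_b = [1.0, 0.0, 0.4, 1.0]   # trajectory 3 flips between passes
run_c = [1.0, 0.0, 0.8, 1.0]

print(agreement_rate([run_a, run_b, run_c]))  # 0.75
```

A low rate points at the unstable trajectories directly, which is where tightened criteria or added programmatic checks should go first.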

A practical workflow for shipping a scoring system

Building a scoring system is iterative, but having a clear sequence of steps reduces wasted effort.

Step 1: Define the end state

Start with the exact condition that proves the task is complete. Write it as an assertion against environment state: file contents, database rows, DOM elements, API responses, or tool outputs. If you cannot write this assertion, the task is not ready for RL training. Tasks need clear, verifiable outcomes before any reward design can begin.
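
For a file-producing task, the end-state assertion might check existence plus a checksum. The path handling and file contents here are illustrative; a temporary file stands in for real environment output:

```python
import hashlib
import os
import tempfile

# Sketch of an end-state assertion: the task is complete only if the
# expected file exists with the expected SHA-256 checksum.

def file_matches(path: str, expected_sha256: str) -> bool:
    if not os.path.exists(path):
        return False
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest() == expected_sha256

# Stand-in for the file the agent was supposed to produce:
with tempfile.NamedTemporaryFile("wb", delete=False, suffix=".csv") as f:
    f.write(b"id,status\n4521,active\n")
    path = f.name

expected = hashlib.sha256(b"id,status\n4521,active\n").hexdigest()
print(file_matches(path, expected))  # True
os.remove(path)
```

If an assertion like this cannot be written for a task, that is the signal the task definition itself needs work before reward design begins.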

Step 2: Add hard failure checks

List every condition that should zero out a trajectory, regardless of apparent task completion. Include policy violations, safety failures, forbidden tool calls, and constraint breaches. Implement each as a named, testable check.

Step 3: Add a small rubric only where needed

If there are quality dimensions beyond pass/fail that matter for deployment (efficiency, evidence quality, error recovery), add rubric criteria for them. Keep the rubric small. Three to five well-defined criteria will produce a cleaner signal than fifteen vague ones.

Step 4: Test on real trajectories

Collect or generate a diverse set of trajectories: strong completions, weak completions, partial completions, constraint violations, and adversarial loophole exploits. Run every trajectory through the scoring system. Check whether the scores match human judgment. Fix the cases where they do not before proceeding.
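
The comparison against human judgment can be sketched as a labeled suite where every grader/human disagreement is surfaced for review. The trajectory IDs, labels, and verdicts below are stand-in data:

```python
# Sketch of the human-judgment check: run a labeled trajectory suite
# through the scorer and list every disagreement to fix before training.

suite = [
    # (trajectory id, human verdict, grader verdict)
    ("strong-completion-01", True, True),
    ("partial-completion-02", False, False),
    ("loophole-exploit-03", False, True),   # grader fooled by a loophole
    ("alt-path-04", True, False),           # grader too narrow
]

disagreements = [(tid, human, grader)
                 for tid, human, grader in suite if human != grader]
for tid, human, grader in disagreements:
    print(f"{tid}: human={human}, grader={grader}")
print(f"{len(disagreements)} scoring gaps to fix before training")
```

The two disagreement directions matter separately: false passes mean exploitable gaps, false fails mean rejected valid strategies.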

Step 5: Tune only after the grader is stable

Adjust reward weights and shaping terms only after the underlying checks are stable and tested. Tuning a reward function on top of an unstable grader is optimizing noise. Confirm repeatability (same trajectory, same score) and robustness (valid alternative paths score correctly) before letting the optimizer loose.

How to measure rewards with Human Union Data (HUD)

HUD measures rewards by running an agent in an environment, letting it use tools, and scoring the result of each scenario. The environment defines the task, and the scorer turns the outcome into a reward signal.

A good example is HUD's email inbox agent environment. In that environment, Claude triages 8 realistic emails across 3 scenarios: urgent detection, full categorization, and spam filtering. Each scenario has a defined success condition, and the agent uses the tools in the environment to interact with the inbox data and make decisions.

After each run, HUD scores the agent on how well it completed the scenario. That score is the reward for the run. In practice, this means reward is not based on whether the output sounds good. It is based on whether the agent actually did the task correctly inside the environment. This is what makes reward measurement in HUD useful for training. The same environment can be run again after changes to the agent, so teams can see whether the model is actually improving on the task.

Human Union Data (HUD) also makes this easier by providing a library of environments with built-in verifiers, scorers, and rewards. Teams do not have to invent every scoring system from scratch before they can start testing and improving models. They can start from working environment patterns and adapt them to their own tasks.

For startups building environments for model labs, this matters for another reason. Building on HUD means the environment can follow the same structure and specifications that labs on the platform already support. That makes HUD useful both for measuring rewards well and for building environments that are easier for model labs to adopt.

FAQs

What is a verifier in an RL environment?

A verifier is a programmatic check that inspects the final environment state (file contents, database rows, DOM conditions, API responses) against defined success criteria and returns a pass or fail result. In HUD environments, verifiers run automatically at the end of each trajectory to produce the primary correctness signal.

How is a verifier different from a reward function?

The verifier determines whether the task succeeded or failed. The reward function sits downstream, combining the verifier's output with pass/fail constraint checks and rubric scores into a single numeric training signal that the optimizer consumes.

When should a team use an LLM-based grader instead of a programmatic check?

Only when the scored dimension resists programmatic verification, such as open-ended text quality or nuanced policy compliance. Programmatic checks are more repeatable and should be the default. Inside HUD, teams can layer LLM-based grading on top of programmatic verifiers, but any LLM grader should be tested for scoring consistency before it enters a training loop.

How do shaping rewards cause reward hacking?

Shaping rewards grant intermediate credit for progress indicators, and agents can learn to trigger those signals repeatedly without actually completing the task. Research shows that more capable models are significantly more likely to exploit these gaps, so shaping rewards need regular re-testing after model upgrades.

What makes a trajectory useful for RL training?

A useful trajectory is repeatable (the same policy produces similar outcomes across runs), generalizable (the strategy transfers beyond a single test case), and correctly scored by a stable grader. In HUD environments, trajectory-level scoring is designed to surface these properties so that only reliable data enters the training pipeline.

How can I tell if my grader is too noisy for training?

Score the same set of trajectories multiple times and measure agreement rates across runs. If scores diverge meaningfully, tighten the grading criteria or replace LLM-graded dimensions with programmatic checks. Inside HUD, running repeated scoring passes on the same trajectories is a standard step before using any grader at training scale.