How to Train AI Agents with Reinforcement Learning

Introduction

Reinforcement learning has become the most effective path to building task-specific agents. Frontier base models are increasingly capable on general benchmarks, but they are trained on general-purpose data. An agent that has to browse a customer dashboard, query a production database, or investigate a stack trace in Sentry needs more than a strong base model. It needs a training loop that turns its own task attempts into measurable improvement.

This guide walks through the full agent training pipeline: environment definition, scenario writing, reward design, evaluation, trajectory collection, RL training, checkpointing, and deployment. A single example runs through every stage — a Sentry debugging agent whose job is to investigate production errors automatically.

Human Union Data (HUD) provides this pipeline as managed infrastructure. Every eval run simultaneously measures the agent's performance and generates the trajectory data needed for reinforcement learning. The setup time from template to running cloud evaluation is roughly thirty minutes.

Why RL beats prompting and SFT for agents

The core problem is simple: you want an agent to complete a real task in a real environment, but you cannot enumerate every possible pathway. Real software tasks have too many variables, decision points, and intermediate states. They lack a singular correct trajectory. The alternatives to RL each run into a specific wall.

Prompt engineering plateaus quickly. The base model was not trained on your task, and no amount of instruction tuning inside the prompt closes the gap once the task gets specialized enough. You can squeeze meaningful gains out of prompting for the first few weeks, then the curve flattens. A Sentry agent that needs to recognize specific error patterns, correlate them across services, and decide which logs to pull cannot get there from prompts alone.

Supervised fine-tuning (SFT) requires labeled demonstrations.Someone has to produce the correct trajectory step by step. For multi-step agent tasks, that assumption breaks. There are many viable paths to the same answer, and labeling "the" correct one is expensive and arbitrary. You also cannot demonstrate behaviors that outperform the demonstrator. SFT copies patterns; it does not discover them.

Reinforcement learning sidesteps both problems. It lets the agent learn through trial and error. The training pipeline defines what success looks like, the agent attempts the task over and over, and correct actions are positively reinforced while incorrect actions are discouraged. The agent does not need a demonstration. It needs a reward signal.

The Sentry debugging agent makes this concrete. Sentry is an error-tracking platform. When something breaks in production, an engineer opens Sentry to figure out why. The agent's job is to do that investigation automatically. It might see a spike in 500 errors and need to find the root cause: a bad deployment, a database going down, a connection pool running out. Every incident is different and there is no single correct path. But there is always a correct answer. You can score whether the agent found it. That scoreable, multi-path structure is exactly what makes the task a fit for RL.

For the conceptual foundation behind the observe-act-reward loop and the components of an RL environment, see RL environments: what they are and how to build one.

The seven stages of an agent training pipeline

A complete agent training pipeline has seven stages. Each one feeds the next, and the entire loop is built to repeat as the model and the task definition evolve.

Environment definition — what the agent can interact with: tools, MCP connections, isolated state.
Scenario writing — task prompt plus reward function in a single two-yield generator.
Reward function design — binary versus graded, outcome versus process, defending against reward hacking.
Evaluation — running the agent against a taskset and capturing what happened.
RL training — GRPO updates the model policy from the trajectories the eval produced.
Checkpointing and model selection — versioned checkpoints, comparison, promotion to HEAD.
Deployment — the trained checkpoint serves through the same inference endpoint.

You can assemble these stages yourself: verl or SkyRL for training, custom sandboxing for environments, OpenAI RFT or Tinker directly for fine-tuning. Many teams do. But you are then responsible for three separate infrastructure problems at once: running isolated environments at scale, capturing trajectories, and serving the trained model in production. HUD provides all seven stages as a single managed pipeline across those same backends, which is the version this guide uses for examples.

Install and authenticate

Every stage below assumes the HUD CLI is installed and authenticated. Install with uv and grab an API key from hud.ai/project/api-keys.

# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install the HUD CLI with the Anthropic SDK so the gateway can route Claude
uv tool install hud-python --python 3.12 --with anthropic

# Authenticate with a single HUD key (the inference gateway proxies
# Claude, GPT, and Gemini through this one credential)
hud set HUD_API_KEY=***

Stage 1: Environment definition

The environment defines what your agent can interact with. It is the sandbox plus the action space. In HUD, the @env.tool() decorator wraps any Python function into an agent-callable tool, turning real APIs, databases, and infrastructure into the agent's action space.

For the Sentry case, the environment exposes tools like get_sentry_issues() and get_stack_trace() that connect to real Sentry, Supabase, Railway, and Kubernetes instances through MCP. HUD spins up a fresh, isolated environment instance per evaluation run, so there is no shared state contamination between runs.

Scaffold a new environment with hud init, then iterate locally with hud dev which spawns a local MCP server with hot-reload:

hud init sentry-agent --preset blank
cd sentry-agent
hud dev env:env -w env.py

The scaffolded env.py is where the environment, its tools, and its MCP connections live. The Sentry agent connects to Sentry through MCP and exposes the resulting tool surface to the agent:

from hud import Environment
import os

env = Environment("sentry-agent")

token = os.getenv("SENTRY_AUTH_TOKEN")
if token:
    env.connect_mcp_config({
        "sentry": {
            "command": "...",
            "env": {"SENTRY_AUTH_TOKEN": token},
        }
    })

For the full walkthrough of the environment primitive — including what counts as an action space, how isolation works, and the underlying MCP model — see RL environments: what they are and how to build one. For specialized environment shapes (browser, OS-level computer-use, multi-app workflows), see how to test a computer-use agent.

Stage 2: Scenario writing

Scenarios define both the task and its reward function using a two-yield generator pattern. The first yield sends the prompt to the agent. The agent runs, making tool calls and collecting observations. The second yield evaluates the result and returns a numerical reward between 0.0 and 1.0.

This pattern is what makes every evaluation simultaneously a measurement and a training data generator. Each completed scenario produces a trajectory — the full sequence of the agent's actions, the environment's responses, and the reasoning between them — paired with a reward signal that is immediately ready for RL ingestion.

For the Sentry diagnostics agent, a scenario might task the agent with investigating a spike in 500 errors, then score whether the agent correctly identified the root cause (for example, database connection pool exhaustion). Collections of scenario instances form tasksets that double as repeatable benchmarks.

@env.scenario("investigate")
async def investigate_issue(query: str, expected_finding: str | None = None):
    """Investigate errors in Sentry."""
    prompt = f"You are a Sentry specialist. Investigate: {query}"
    answer = yield prompt
    yield 1.0 if expected_finding and expected_finding in answer else 0.0

This single function is simultaneously the episode definition, the reward function, and the task specification. The agent runs between the two yields, calling tools and reasoning, and the grader only sees the final answer string. For the deeper walkthrough of scenarios, tasksets, and the two-yield pattern, see RL environments: what they are and how to build one.

Stage 3: Reward function design

The reward function is the entire training signal, so the choices here determine what the model actually learns. Three axes are worth thinking through before any training run.

Binary versus graded rewards

Binary rewards (1.0 or 0.0) work when correctness is unambiguous, but they give almost no signal early in training when nearly every attempt fails. Graded rewards help. For the Sentry agent, partial credit for identifying the correct service, the correct error class, and the root cause (plus a bonus for fewer steps) turns failed attempts into useful signal. The cost is more surface area for reward hacking.

Outcome versus process rewards

Outcome rewards only score the final answer. Process rewards also score intermediate steps: pulling the right stack trace, querying the right time range, opening the right Sentry issue. Process rewards speed up learning but risk teaching the agent to optimize for the proxy instead of the real goal. A common pattern is to bootstrap with process rewards, then switch to pure outcome rewards once the agent reliably reaches the end state.

Reward hacking

Agents are very good at finding unintended ways to maximize a poorly specified reward. A substring check on the root cause? The agent learns to dump every plausible cause into a long response. Partial credit for tool use? It calls tools whether they help or not. The defense is adversarial testing: spot-check high-scoring trajectories and confirm they actually solved the task. Re-test whenever you upgrade the underlying model, since stronger models exploit reward misspecifications more aggressively.

Reward design is its own discipline. The full breakdown of verifiers, pass/fail checks, rubrics, shaping rewards, and how to keep them stable as the agent improves lives in verifier and reward design for RL environments. That guide goes deep on the failure modes that this section only points at.

Stage 4: Evaluation with hud eval

hud eval is the operational core of the pipeline. It runs your agent against a taskset and captures everything needed for training in one command. Key flags control scope and execution mode.

# Run the first task only (useful for debugging)
hud eval tasks.py claude --gateway

# Run every task with a specific model
hud eval tasks.py claude --gateway --full

# Submit to HUD infrastructure for parallel execution
hud eval tasks.py claude --gateway --full --remote

Without --full, only the first task runs, which is useful for debugging a single scenario. With --full, the entire dataset runs. Adding --remote submits to the HUD platform for fire-and-forget execution, and --group-size controls how many times each task runs for variance estimation and statistical confidence. Supported agent types include claude, openai, operator, gemini, gemini_cua, and openai_compatible.

HUD supports parallel execution across thousands of concurrent environments. That is how the Sentry case study collected 3,000+ traces in a practical timeframe. Trajectory collection does not require a separate command. It happens automatically as a byproduct of evaluation, because evaluation and training data generation are the same operation.

Every run captures the complete trajectory: every tool call, every tool response, the agent's reasoning traces, screenshots (for computer-use environments), environment state at each step, and the final reward score. Engineers can inspect every step of every run before committing data to training, spot failure modes, identify reward function bugs, and verify that successful trajectories actually represent correct behavior.

Stage 5: RL training

Once you have a stable environment, a working scenario, and a taskset with collected trajectories, the next stage is updating the model policy. RL training on HUD is driven through the platform GUI, not a CLI command. There is no hud train or hud rft command. Training lives at hud.ai/models.

Launching a training run on HUD:

Go to hud.ai/models.
Fork a base model. This gives you a trainable model with a fresh model ID.
Click Train Model and select the taskset you produced in stage 4.
Configure GRPO settings (reasoning effort, group size, number of steps) and start the run.

Under the hood, HUD runs GRPO (Group Relative Policy Optimization). GRPO compares groups of trajectories for the same task, identifies which action sequences led to higher rewards, and updates the model policy to increase the probability of those successful sequences. HUD supports two training backends: OpenAI RFT for o4-mini, and Tinker for open models including Qwen3 235B, Deepseek V3.1, and Kimi K2. For the algorithm-level walkthrough of GRPO — how it differs from PPO, why it works for agent tasks, what the hyperparameters actually do — see GRPO training: what it is and how to run it.

The reward function from stage 3 is what GRPO consumes. A graded, well-calibrated reward function produces dense gradients that GRPO can act on. A binary, sparse reward function produces flat gradients that GRPO struggles with. This is why reward design is upstream of training quality, not parallel to it.

Stage 6: Checkpointing and model selection

Each training run produces a versioned checkpoint with its own model ID. These checkpoints are directly comparable: you can run the same eval suite against any checkpoint and track performance across iterations.

If a new training run degrades performance on a subset of tasks, roll back to a previous checkpoint. The versioning system makes it straightforward to maintain a history of model improvement (or regression) over time. In practice, evaluating a checkpoint means running the same taskset that produced the training data, with --group-size set high enough for statistical confidence (typically 3 to 5 runs per task). Compare each new checkpoint against both the baseline model and the previous best.

Aggregate reward going up does not mean improvement on every task type. A checkpoint can gain on common cases while regressing on edge cases that matter more in production. Set an explicit regression threshold on your eval suite before promoting a checkpoint to HEAD. If a checkpoint clears the overall bar but breaks a critical subset, hold the promotion and retrain.

To re-evaluate against a specific checkpoint, pass the new model ID to create_agent and run the eval:

import hud
from hud.agents import create_agent
from env import investigate_issue

task = investigate_issue.task(
    query="Investigate the spike in 500 errors from /api/checkout",
    expected_finding="connection pool exhaustion",
)

# After training, swap to your model ID
agent = create_agent("your-model-id")

async with hud.eval(task) as ctx:
    result = await agent.run(ctx)

print(f"Reward: {result.reward}")

The reward score from this run is directly comparable to the baseline score from the same task on the un-trained model. That delta — across the full taskset, ideally with --group-size of 3 or more — is the evidence that justifies promoting the checkpoint to HEAD inside the HUD dashboard.

Stage 7: Deployment

Setting a checkpoint as HEAD in the HUD dashboard routes all inference traffic to the improved model through the same API key and same endpoint your application already uses. This removes the deployment friction that typically separates "we trained a better model" from "our users are running the better model". Evaluate the checkpoint, confirm it outperforms the baseline, set it as HEAD, and the swap is complete.

For the environment itself, push it to the platform once it is stable and sync the taskset:

# Push the environment to the HUD platform
hud deploy

# Sync the taskset so it is available for remote eval and training
hud sync tasks my-taskset

Once deployed, the same hud eval ... --remote command runs against the hosted environment instead of a local instance, which is what lets the platform scale to thousands of concurrent runs during training. Production code that consumes the trained model uses the same create_agent("your-model-id") pattern: swap the model ID, keep everything else the same.

Have an AI coding agent build the pipeline for you

If you use Cursor, Claude Code, or another MCP-compatible coding agent, you can attach the HUD documentation as an MCP server and let the agent walk you through the entire training pipeline end-to-end. In Claude Code:

claude mcp add --transport http docs-hud https://docs.hud.ai/mcp

In Cursor, add this to your MCP settings:

"docs-hud": {
  "url": "https://docs.hud.ai/mcp"
}

Then paste the following prompt into your agent. It walks through every stage of the pipeline on a real task of your choice — a customer support agent, a code review agent, a data analysis agent — from hud init through forking a base model, training, and swapping in the trained checkpoint:

Help me build an end-to-end HUD agent training pipeline for a real task — pick one with me (e.g. a customer support agent, code review agent, data analysis agent). Use the docs-hud MCP server I just installed for the latest commands and patterns. Walk me through:

1. Check the prerequisites and help me set HUD_API_KEY + any model API keys (guide me to export them; don't ask me to paste them in chat).
2. Scaffold a new environment, then help me define one tool, one scenario with a verifiable reward, and a few tasks.
3. Run an eval to validate, deploy with hud deploy, then walk me through forking a base model and training it on the platform.

Explain each concept as we go.

Conclusion

An agent training pipeline is the full sequence of infrastructure required to turn agent task attempts into model improvement via reinforcement learning. It is not just scoring outputs. It is generating trajectories, computing rewards, and feeding them back into training to produce a better model. Without the pipeline, evaluations produce scores that never make the model better.

The Sentry diagnostics agent that runs through this guide is an open-source example of all seven stages in production. The environment connects to Sentry, Supabase, Railway, and Kubernetes — real infrastructure, not a simulation. Starting from a base model with a 6.3% success rate on hard diagnostic tasks, the trained agent reached 13% after thirteen hours of training on 3,000+ traces. A specialized fine-tuned model beating frontier models on a narrow domain is the entire point: general capability loses to domain-specific training when the task is specific enough.

HUD packages this pipeline as managed infrastructure. Every eval run simultaneously measures the agent and generates training data through the two-yield scenario pattern. Parallel execution across thousands of environments makes trace collection practical, and trained checkpoints deploy through the same inference gateway with zero code changes. Roughly thirty minutes from template to running cloud evaluation. For deeper dives on individual stages, see the related guides: RL environments, verifier and reward design, GRPO training, testing computer-use agents, top RL environment tools, and best platforms for publishing RL environments to model labs.

FAQs

When should I use RL instead of prompt engineering or supervised fine-tuning?

Use prompt engineering when the base model already has the capability and you just need to direct it. Use supervised fine-tuning when you have labeled examples of correct outputs and the task has a single right answer. Use reinforcement learning when the task is multi-step, there are many valid paths to the goal, and you can score whether the agent completed the task. RL is the only viable approach when success is non-differentiable: you cannot backpropagate through a browser, a database, or a Kubernetes cluster, but you can score whether the agent solved the bug.

What does the agent training pipeline look like end-to-end?

Seven stages: define the environment (tools, MCP connections, isolated state), write scenarios (task prompt plus reward in a two-yield generator), design the reward function (binary versus graded, outcome versus process), run evaluations with hud eval to collect trajectories, train via GRPO on the HUD platform, checkpoint and select the best model, and deploy through the HUD inference gateway. Each stage feeds the next, and the entire loop is designed to repeat.

How do I design a reward function for an agent?

Start by writing the success condition as a programmatic check on environment state, not a description of agent behavior. Decide between binary rewards (0.0 or 1.0, low noise but sparse) and graded rewards (partial credit, denser signal but more reward-hacking surface). Prefer outcome rewards over process rewards once the agent reliably reaches the end state. Stress-test the reward function by sampling high-scoring trajectories and confirming they actually solved the task. The verifier and reward design guide covers this in depth.

What does HUD provide for agent training?

HUD provides the full pipeline as managed infrastructure. Environment scaffolding through hud init, local iteration through hud dev, parallel evaluation through hud eval, automatic trajectory capture, GRPO training through the platform GUI at hud.ai/models, versioned checkpoints, and an inference gateway that serves trained models through the same API your application already uses. Every eval run simultaneously measures the agent and generates training data.

How do I use a trained model in production?

Once training completes, the checkpoint gets a model ID. Set the checkpoint as HEAD in the HUD dashboard to route inference traffic to the improved model. Production code does not change: the same API key and inference endpoint now resolve to the trained checkpoint instead of the base model. To re-evaluate against the same taskset, pass the new model ID to create_agent and run the eval. Aggregate reward going up is necessary but not sufficient. Confirm the checkpoint also improves on the task subsets that matter most in production before promoting.