GRPO Training: What It Is and How to Run It

Introduction

GRPO (Group Relative Policy Optimization) is a reinforcement learning method that trains models by comparing groups of outputs. The model generates several responses to the same prompt, a reward function scores each one, and training reinforces the responses that beat the group average. DeepSeek introduced the algorithm in 2024 and used it to train R1 a year later. GRPO is now the default RL algorithm for reasoning models and verifiable-reward training.

GRPO removes the trained critic model that Proximal Policy Optimization (PPO) requires. The group mean reward serves as the baseline instead, which cuts memory and compute roughly in half. The algorithm also extends past math and code reasoning into agent training, where each "response" is a multi-step trajectory in real software.

This article covers what GRPO is, how the algorithm works, how to run GRPO training on a model, and what infrastructure GRPO needs at scale. Platforms like Human Union Data (HUD) package the full GRPO pipeline so teams can train against real software without rebuilding the rollout and scoring infrastructure from scratch.

How GRPO works

GRPO training updates a model's weights using comparative scoring. The algorithm scores a group of outputs at once, and the group's average score becomes the reference point. The update reinforces above-average outputs and penalizes below-average ones.

For each training prompt, the model produces a group of responses, usually 8 to 16 completions. A reward function scores every response. The algorithm computes the group mean and uses that mean as the baseline.

GRPO falls under a paradigm called Reinforcement Learning with Verifiable Rewards (RLVR). In RLVR, rewards come from rule-based scoring rather than human preference models. Examples include math verifiers that check the answer, code test suites that run the program, and environment signals that report task completion.

The DeepSeekMath paper introduced GRPO in early 2024. The algorithm became widely known a year later when DeepSeek used it to train R1. R1 matched OpenAI o1 on AIME 2024 (79.8%) and MATH-500 (97.3%). The training used a fraction of the compute most teams had assumed reasoning RL required.

The grading-on-a-curve analogy

Picture a teacher grading on a curve. The class average becomes the cutoff. Students above it earn credit, and students below get flagged. GRPO works the same way with model outputs.

Here is the loop, step by step:

1. Sample G outputs from the current policy for one prompt. Typical group size is 8 to 16.
2. Score each output with a reward function. This can be a verifier, a rubric, or an environment signal.
3. Compute group statistics: mean reward and standard deviation across the group.
4. Normalize each reward into an advantage value: advantage_i = (reward_i - mean) / std.
5. Reinforce above-average outputs and penalize below-average ones through the policy gradient.
6. Apply a KL penalty against the reference model so the trained policy does not drift too far. Then update weights.

The GRPO loop in pseudocode

The algorithm itself is short. Once the reward function and the rollout infrastructure are in place, the optimizer is a handful of lines:

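for prompt in training_prompts:
    # Sample a group of G completions from the current policy
    completions = [policy.generate(prompt) for _ in range(G)]
    # Score every completion with the same reward function
    rewards = [reward_fn(prompt, c) for c in completions]
    # Group-relative advantage: center on the group mean, scale by the group std
    mean_r, std_r = mean(rewards), std(rewards)
    advantages = [(r - mean_r) / (std_r + 1e-8) for r in rewards]
    # Policy-gradient loss plus a KL penalty against the frozen reference policy
    loss = -sum(adv * log_prob(c) for adv, c in zip(advantages, completions))
    loss += beta * kl_divergence(policy, reference_policy)
    # Backpropagate and update the policy weights
    optimizer.step(loss)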

The group comparison is what makes GRPO distinctive. The group mean reward replaces the trained critic model that PPO depends on. That swap removes one full-size model from the training stack.
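The advantage step is easy to check by hand. The sketch below is a minimal standard-library version of the group normalization. The function name `group_advantages`, the epsilon value, and the choice of population standard deviation are illustrative assumptions, not taken from any particular framework.

```python
from statistics import fmean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Normalize one group's rewards into advantages: (r - mean) / (std + eps)."""
    mean_r = fmean(rewards)
    std_r = pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mean_r) / (std_r + eps) for r in rewards]

# Mixed group: 2 of 8 rollouts succeed, so successes get positive advantages
mixed = group_advantages([1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0])

# All-fail group: every reward equals the mean, so every advantage is zero
flat = group_advantages([0.0] * 8)

print([round(a, 3) for a in mixed])
print(flat)
```

The all-fail group shows why a group with identical rewards contributes no learning signal: every advantage is zero, so the policy-gradient term vanishes for that prompt.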

Reward functions for GRPO

The reward function is the part that makes or breaks training. It must return a consistent score for any (prompt, response) pair. If the same response scores differently on different runs, the group baseline gets noisy and training breaks down. Consistent scoring is the most important property of a good reward function.

Two canonical examples — a math verifier and a code test runner — show what a rule-based reward function looks like in practice:

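def math_reward(prompt, response):
    # Assumes the task record carries its reference answer, like prompt.tests below
    correct = prompt.answer
    predicted = extract_answer(response)  # parse the model's final answer
    return 1.0 if predicted == correct else 0.0

def code_reward(prompt, response):
    code = extract_code(response)  # pull the code block out of the response
    results = run_tests(code, test_suite=prompt.tests)
    return results.pass_rate  # fraction of tests passed, between 0.0 and 1.0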

Rewards can be binary (correct or not) or graded (pass rate, edit distance, partial credit). Binary rewards are easier to reason about but produce sparse signal when every response fails. Graded rewards spread the signal across the group but require careful design to avoid reward hacking. For a deeper walkthrough of reward design tradeoffs — verifiers, pass/fail checks, rubrics, and shaping rewards — see verifier and reward design for RL environments. Reward design is the single biggest determinant of whether a GRPO run produces a better model or a model that learned to game the score.
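To make the binary case concrete, here is a small self-contained math verifier. It is a sketch under stated assumptions: the helper `extract_final_number`, the "last number in the response wins" rule, and the two-argument signature are all hypothetical choices for illustration, and a production verifier would need stricter answer formatting.

```python
import re

def extract_final_number(text):
    """Return the last number in the text as a string, or None if there is none."""
    # Drop thousands separators so "1,000" parses as 1000 (a simplifying assumption)
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None

def math_reward(reference_answer, response):
    """Binary reward: 1.0 if the response's final number equals the reference."""
    predicted = extract_final_number(response)
    if predicted is None:
        return 0.0  # no parseable answer counts as a failure
    return 1.0 if float(predicted) == float(reference_answer) else 0.0

print(math_reward("42", "So the total is 6 * 7 = 42."))
print(math_reward("42", "I am not sure."))
```

Because the function is pure string processing, the same (reference, response) pair always gets the same score, which is the consistency property described above.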

When GRPO works best

GRPO works best when success is easy to check by rule. Math verification, code test suites, and task-completion signals from a real environment all fit this pattern. Tasks with subjective quality fit RLHF better, since RLHF uses human preference data instead of automatic rules.

Two more constraints are worth flagging. First, GRPO works best when the base model can already do the task occasionally. The algorithm reinforces existing behavior; it cannot teach brand-new skills from scratch. If every rollout in a group fails, there is no above-average response to reinforce. Second, the base model should already handle the action interface reliably. For agent training, that means tool calls need to work most of the time before GRPO can improve on them.

For reasoning training, most teams start from a strong instruct-tuned base such as Qwen3, Llama 3, or DeepSeek V3.1. For agent training, the base model should already handle tool calls reliably and be capable of multi-step planning. The reward function and the base model together determine whether GRPO can find signal in the group.
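A quick calculation shows how the base model's success rate interacts with group size. The sketch below assumes a binary reward and independent rollouts with a fixed per-rollout success rate; both are simplifications, and the function name `p_no_signal` is made up for this example.

```python
def p_no_signal(success_rate, group_size):
    """Probability that every rollout in a group gets the same binary reward.

    Such a group is all-fail or all-pass, so it has zero variance and
    contributes no advantage signal.
    """
    p = success_rate
    return p ** group_size + (1 - p) ** group_size

for rate in (0.05, 0.5):
    for g in (8, 16):
        print(f"success_rate={rate}, G={g}: {p_no_signal(rate, g):.3f}")
```

Under these assumptions, a base model that succeeds 5% of the time leaves roughly two thirds of groups of size 8 with no signal, which is one reason a low but non-zero success rate can still make training slow.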

Frameworks for self-managed GRPO training

Several frameworks support GRPO. The right one depends on your setup. Self-managed reasoning training has different needs from agent training in real environments.

For self-managed training:

TRL (Hugging Face) — adds GRPO to its existing PPO and DPO trainer family. Good for single-node reasoning RL setups.
verl — Volcano Engine's RL framework, designed for large-scale distributed training runs.
SkyRL — an open-source RL framework targeted at agent training with rollout parallelism baked in.

Most teams running reasoning RL on their own hardware start with TRL on a single 8xH100 node. verl and SkyRL handle multi-node setups better. Agent training has a different bottleneck: running thousands of parallel environments in real software. That is where managed platforms like HUD take the rollout work off your plate so you can focus on the reward function and task design.
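For orientation, a TRL setup might look roughly like the sketch below. The reward function follows TRL's convention of receiving completions plus dataset columns as keyword arguments and returning one float per completion. Everything else is an assumption for illustration: the model name, the `answer` column, the output directory, and the hyperparameter values. Argument names can change between TRL releases, so check the documentation for the installed version before relying on this.

```python
import re

def exact_answer_reward(completions, answer, **kwargs):
    """Score each completion 1.0 if its last integer matches the reference answer."""
    rewards = []
    for completion, ref in zip(completions, answer):
        numbers = re.findall(r"-?\d+", completion)
        rewards.append(1.0 if numbers and numbers[-1] == str(ref) else 0.0)
    return rewards

def build_trainer(train_dataset, model_name="Qwen/Qwen2.5-0.5B-Instruct"):
    """Wire the reward function into TRL's GRPOTrainer (requires trl to be installed)."""
    from trl import GRPOConfig, GRPOTrainer  # imported lazily; needs GPU hardware to train

    config = GRPOConfig(
        output_dir="grpo-sketch",        # hypothetical output path
        num_generations=8,               # group size G
        beta=0.04,                       # KL coefficient
        per_device_train_batch_size=8,   # kept divisible by num_generations
    )
    return GRPOTrainer(
        model=model_name,
        reward_funcs=exact_answer_reward,
        args=config,
        train_dataset=train_dataset,     # assumed columns: "prompt" and "answer"
    )

# The reward function can be tested on its own, without loading a model
print(exact_answer_reward(completions=["The answer is 12", "maybe 7"], answer=[12, 12]))
```

Calling `build_trainer(dataset).train()` would start a run. Keeping the reward function separate from the trainer wiring makes it possible to unit-test the scoring logic before spending any GPU time.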

Running GRPO training on HUD

HUD's training pipeline runs GRPO end to end. The platform handles rollouts, reward grading, advantage computation, and policy updates. Your job is to build the environment, define the reward, and pick the base model. The flow is: install the CLI, build and deploy an environment with a verifiable reward, sync a taskset, then start training from the platform UI.

Install and authenticate

Install the HUD CLI with uv, then set your HUD API key. The commands below use a single HUD credential, since the inference gateway proxies the model providers. The CLI is what you use to scaffold environments and deploy them; the actual training run is started from the dashboard.

# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install the HUD CLI with the Anthropic SDK so the gateway can route Claude
uv tool install hud-python --python 3.12 --with anthropic

# Authenticate with a single HUD key (the inference gateway proxies
# Claude, GPT, and Gemini through this one credential)
hud set HUD_API_KEY=***

Build and deploy an environment with verifiable rewards

GRPO needs a reward signal for every rollout. In HUD, that signal comes from the second yield in a scenario. The first yield sends the task prompt, the agent runs in the environment, and the second yield returns the reward. Scaffold a new environment with hud init, write a scenario with a rule-based reward, then deploy and sync your taskset:

hud init my-rlvr-env --preset blank
cd my-rlvr-env
# Edit env.py to define a scenario with a verifiable reward
hud deploy
hud sync tasks my-taskset

The scenario itself is where the reward gets defined. The pattern below shows the two-yield structure HUD uses for environment scoring — the same shape GRPO consumes per rollout:

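@env.scenario("email-triage")
async def email_triage():
    # First yield: send the task prompt; the agent then acts in the environment
    yield "Sort the inbox. Mark anything urgent."
    # After the agent finishes, grade the final state with a rule-based check
    urgent_caught = env.urgent_emails_marked()
    score = urgent_caught / env.total_urgent_emails
    # Second yield: return the reward, here the fraction of urgent emails caught
    yield score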

The score returned by the second yield is the signal GRPO uses to rank trajectories within each group. For a step-by-step walkthrough of building environments with scenarios, tools, and tasksets, see RL environments: what they are and how to build one.

Validate the environment with an eval run

Before pushing into a training loop, run an eval against the taskset. The eval confirms that the reward function fires correctly, that the environment is reachable, and that the base model has a non-zero success rate. GRPO cannot learn from an all-zero group, so a clean eval is the prerequisite for a useful training run.

hud eval my-taskset claude --gateway --full

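The eval returns a reward distribution across the taskset. If a frontier model gets some non-zero fraction of tasks right, the tasks are solvable and the reward function is firing. The base model you plan to train needs a non-zero success rate too, because its rollouts are the ones that form each group. If the success rate is zero, the tasks are too hard relative to the model: ease the tasks, upgrade the base model, or run supervised fine-tuning (SFT) first before running GRPO.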
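One way to act on an eval is to bucket tasks by mean reward and train only on the ones that produce mixed outcomes. The sketch below assumes you have already collected rewards into a plain dictionary mapping each task ID to a list of rewards from repeated runs. That input shape and the function name `split_by_pass_rate` are hypothetical; this is not a HUD export format.

```python
from statistics import fmean

def split_by_pass_rate(rewards_by_task, low=0.0, high=1.0):
    """Bucket task IDs by mean reward: too hard (<= low), too easy (>= high), or usable."""
    buckets = {"too_hard": [], "usable": [], "too_easy": []}
    for task_id, rewards in rewards_by_task.items():
        rate = fmean(rewards)
        if rate <= low:
            buckets["too_hard"].append(task_id)
        elif rate >= high:
            buckets["too_easy"].append(task_id)
        else:
            buckets["usable"].append(task_id)
    return buckets

# Hypothetical eval results: four runs per task
eval_rewards = {
    "task-a": [0.0, 0.0, 0.0, 0.0],   # never solved
    "task-b": [1.0, 0.0, 1.0, 0.0],   # mixed outcomes
    "task-c": [1.0, 1.0, 1.0, 1.0],   # always solved
}
buckets = split_by_pass_rate(eval_rewards)
print(buckets)
```

Tasks in the `usable` bucket are the ones where a group is likely to contain both above-average and below-average rollouts.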

Start the training run from the platform UI

GRPO training on HUD is dashboard-driven. The training run itself happens in the platform UI — there is no hud train CLI command.

Once your environment is deployed and your taskset is synced, the rest of the flow runs through hud.ai/models:

Open Explore and pick a trainable base model.
Click Fork to create your own copy of the model.
Click Train Model on the forked model and select your taskset.
The platform handles rollout sampling, reward grading, advantage computation, and policy updates.

Every training run produces a versioned checkpoint. You can compare runs against each other, roll back regressions, and measure whether reward function changes actually moved performance.

Use the trained model

Once training finishes, the forked model has a stable ID. Run it against any task in your environment by passing the ID to create_agent:

import hud
from hud.agents import create_agent
from env import my_scenario

# After training in the platform GUI, your forked model gets an ID
task = my_scenario.task(...)
agent = create_agent("your-model-id")

async with hud.eval(task) as ctx:
    result = await agent.run(ctx)

print(f"Reward: {result.reward}")

The same hud eval CLI you used to validate the environment now works on your trained model. Run it against the original taskset and the reward distribution should be visibly higher than the pre-training baseline. If it is not, the diagnosis is usually upstream — the reward function was too noisy, the group was always all-zero or all-one, or the base model was a bad fit for the task.

Infrastructure considerations at scale

Group size of 8 to 16 is standard. Smaller groups produce noisier baselines but use less compute per step. The KL coefficient (often called beta) controls how far the policy can drift from the reference model. Common starting values are group size 8 and beta 0.04.
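To show what beta scales, here is a sketch of one common per-token KL estimator, the form given in the DeepSeekMath formulation: ratio - log(ratio) - 1, where ratio is the reference probability divided by the policy probability. The function names and the averaging over tokens are illustrative assumptions; real trainers compute this on tensors of token log-probabilities.

```python
import math

def kl_estimate(logp_policy, logp_ref):
    """Per-token KL estimate: ratio - log(ratio) - 1, with ratio = pi_ref / pi_policy."""
    log_ratio = logp_ref - logp_policy
    return math.exp(log_ratio) - log_ratio - 1.0

def kl_penalty(policy_logps, ref_logps, beta=0.04):
    """Beta-weighted mean of the per-token estimates over one completion."""
    terms = [kl_estimate(p, r) for p, r in zip(policy_logps, ref_logps)]
    return beta * sum(terms) / len(terms)

# Identical log-probs: the policy has not drifted, so the penalty is zero
print(kl_penalty([-1.0, -2.0], [-1.0, -2.0]))

# Diverging log-probs: the penalty is positive and grows with the gap
print(kl_penalty([-0.5, -0.5], [-1.5, -2.5]))
```

The estimator is never negative, so a larger beta always pulls the policy back toward the reference, which is the trade-off the coefficient controls.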

Each step generates G completions per prompt, scores them, computes advantages, and updates weights. Memory peaks during generation, so batch size and group size both pull from the same VRAM budget. Plan for 40 to 80 GB per GPU at typical reasoning-model scale (7B to 14B parameters).

For agent training, the bottleneck shifts away from raw GPU memory and onto rollout throughput. GRPO needs multiple completions per prompt to compute a group baseline. With 1,000 training prompts and 16 rollouts each, a single epoch is 16,000 trajectories. Running rollouts in parallel is the difference between hours and weeks of training time. HUD's environments are built to support thousands of concurrent runs with sub-second spin-up, which is what makes GRPO on real software practical instead of theoretical.
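The rollout arithmetic is worth running for your own numbers. The sketch below is a back-of-envelope estimate that assumes every rollout takes the same time and that rollouts run in full waves of a fixed concurrency. The two-minute rollout duration is an invented figure for illustration.

```python
import math

def rollout_budget(num_prompts, group_size, seconds_per_rollout, concurrency):
    """Return (total trajectories, estimated wall-clock hours) for one epoch."""
    total = num_prompts * group_size
    waves = math.ceil(total / concurrency)  # rollouts run in batches of `concurrency`
    hours = waves * seconds_per_rollout / 3600
    return total, hours

# 1,000 prompts x 16 rollouts, assuming 120 seconds per agent trajectory
total, serial_hours = rollout_budget(1000, 16, 120, concurrency=1)
_, parallel_hours = rollout_budget(1000, 16, 120, concurrency=1000)

print(total)                      # 16000 trajectories per epoch
print(round(serial_hours, 1))     # rollout time with one environment at a time
print(round(parallel_hours, 2))   # rollout time with 1,000 concurrent environments
```

Under these assumptions, a single environment needs over 500 hours of rollout time per epoch, while 1,000 concurrent environments finish in about half an hour. The estimate ignores scoring and weight-update time.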

The loop itself is straightforward once the reward function is solid. The interesting research questions sit upstream. They include what tasks to train on, how to design the reward, and how to push GRPO beyond narrow reasoning into agent training on real software. For a survey of where teams currently publish these environments, see the best platforms for publishing RL environments.

Where GRPO breaks down

GRPO is not a universal solvent. A few failure modes show up repeatedly:

The base model cannot do the task at all. GRPO reinforces above-average outputs in the group. If every rollout fails, there is no above-average response and no signal to learn from. Fix: warm up the model with SFT first, ease the tasks, or pick a stronger base model.
The reward function is noisy. If the same response scores differently on different runs, the group baseline gets unstable and the optimizer chases noise. Fix: replace LLM-judged dimensions with programmatic checks, or test the grader for scoring consistency before training.
The reward is hackable. Models will exploit shaping rewards or partial-credit signals if the verifier can be triggered without actually completing the task. More capable models are significantly more likely to find these exploits. Fix: validate every reward function against a range of model capabilities before deploying.
The KL penalty is wrong. Too high and the policy cannot move; too low and it collapses into a single high-reward mode. Beta of 0.04 is a reasonable starting point but should be tuned per task.
The tasks are too easy. If every rollout succeeds, the group has zero variance and every advantage collapses to zero. Fix: make the taskset harder or move to a smaller, less capable base model.
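The noisy-reward failure mode above can be caught before training with a simple repeat-scoring check. The sketch below is one possible harness, not a standard tool: `check_reward_consistency`, the sample pairs, and both example graders are hypothetical. The flaky grader stands in for any scorer whose output changes between calls, such as an LLM judge sampled at non-zero temperature.

```python
import itertools

def check_reward_consistency(reward_fn, samples, trials=5):
    """Return the (prompt, response, scores) entries whose score changes across calls."""
    unstable = []
    for prompt, response in samples:
        scores = {reward_fn(prompt, response) for _ in range(trials)}
        if len(scores) > 1:
            unstable.append((prompt, response, sorted(scores)))
    return unstable

def exact_match(prompt, response):
    """Deterministic grader: the same input always gets the same score."""
    return 1.0 if response.strip() == "4" else 0.0

_calls = itertools.count()

def flaky(prompt, response):
    """Stand-in for a noisy grader: the score alternates on every call."""
    return float(next(_calls) % 2)

samples = [("What is 2 + 2?", "4"), ("What is 2 + 2?", "5")]
print(check_reward_consistency(exact_match, samples))  # []
print(len(check_reward_consistency(flaky, samples)))   # 2
```

An empty result does not prove the grader is correct, only that it is repeatable on the sampled pairs.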

GRPO is also not a substitute for a good base model. The algorithm reinforces existing behavior; it cannot teach brand-new skills from scratch. For a higher-level view of where GRPO sits inside the broader training pipeline — SFT, RLHF, RLVR, and the role of evals — see how to train AI agents with reinforcement learning.

Conclusion

The hard part of GRPO training has shifted away from the algorithm. Open-source frameworks handle the math. The harder problem now is the environment and the reward function — what to train on, how to score it, and how to keep the signal clean as models get more capable. For reasoning training, the environment is straightforward: math problems and code tests. For agent training on real software, the environment is the work.

HUD provides sandboxed environments, rule-based rewards, and a managed GRPO pipeline. Build and deploy your environment with the CLI, sync a taskset, then start training from hud.ai/models. Every evaluation collects trajectories that feed straight back into the next training run. For more on what makes a high-quality environment in this category, see the top reinforcement learning environments of 2026.

FAQs

What is GRPO?

GRPO (Group Relative Policy Optimization) is a reinforcement learning algorithm that trains a model by sampling a group of responses to the same prompt, scoring each one, and reinforcing the responses that beat the group average. The group mean reward acts as the baseline, which removes the trained critic model that PPO requires. DeepSeek introduced GRPO in the DeepSeekMath paper in 2024 and used it to train R1 a year later. GRPO is now the default RL algorithm for reasoning models and verifiable-reward training.

How is GRPO different from PPO?

PPO trains a separate critic model to predict the value of each state, and uses that prediction as the baseline for policy updates. GRPO removes the critic entirely and uses the mean reward across a sampled group as the baseline instead. The result is roughly half the memory and compute of PPO, with one fewer model to train, tune, and debug. PPO still works well for RLHF pipelines where rewards come from a learned preference model; GRPO is the better fit for RLVR-style training where rewards come from rule-based verifiers.

When should I use GRPO instead of RLHF?

Use GRPO when success is easy to check by rule: the math answer is correct, the code passes its tests, the agent completed the task in the environment. Use RLHF when the scored dimension is subjective (tone, creativity, nuance) and human preference is the only reliable judge. The two are not mutually exclusive. GRPO is an optimizer, so it can run on top of a learned reward model trained on human preferences when the task calls for it.

What is the right group size for GRPO?

Group sizes of 8 to 16 completions per prompt are standard. Smaller groups produce noisier baselines but use less compute per step. Larger groups give more stable advantage estimates at the cost of more rollouts. Common starting values are group size 8 and a KL coefficient (beta) of 0.04. For agent training where each rollout is a full environment trajectory, group size also depends on how many parallel rollouts the infrastructure can support.

How do I train a GRPO model on HUD?

GRPO training on HUD runs through the platform dashboard. You build and deploy an RL environment with a verifiable reward, sync your taskset, then open hud.ai/models, fork a trainable base model from Explore, and click Train Model with your taskset selected. The platform handles rollout sampling, reward grading, advantage computation, and policy updates. There is no separate hud train CLI command — the training run itself happens in the UI.