RL Environments: What They Are and How to Build One

Introduction

An RL environment is a simulated world where an AI agent learns by trying things, getting feedback, and improving over time. The agent observes its situation, takes an action, and receives a reward score telling it how well it did. That loop — observe, act, get rewarded — is the foundation of modern AI training.

OpenAI reportedly used the same base model, GPT-4o, for its next set of flagship models, including o1, o3, and the GPT-5 series. For roughly 18 months, gains in model performance came from post-training and scaling up RL compute rather than from new pre-training runs. The environment, not the model, became the bottleneck.

This guide walks through what RL environments are, how their core components work, where they are used in the real world (from game AI to Excel agents), and how to build one in Python. It also explains how the same loop that trains game-playing agents is what turned GPT-3 into ChatGPT. Platforms like Human Union Data (HUD) package this loop so teams can stand up real-software RL environments without rebuilding the infrastructure from scratch.

What is reinforcement learning?

Reinforcement learning is how an agent figures out what to do by trying things and learning from the results. Think of training a dog: you give it a command, it does something (sits, rolls over, or ignores you), and you respond with a treat or nothing. The dog learns which actions lead to rewards and gradually gets better at following commands.

In RL, an agent is the learner making decisions. It could be a robot, a game-playing AI, or a language model. The environment is everything the agent interacts with: the game world, the physical space, or the conversation it is having. The agent observes the current situation (called the state), picks an action from its available options (the action space), and receives feedback as a numerical reward signal.

The agent learns through repetition. It starts by knowing nothing about what works inside the environment. It tries random actions, gets rewards or penalties, and slowly builds up a policy: a strategy for which action to take in each situation. Over thousands or millions of attempts, the agent discovers which behaviors lead to high rewards and learns to repeat them.
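The trial-and-error process described above can be sketched in the simplest possible setting, a multi-armed bandit, where there is only one state and the policy reduces to "which action to pick". This is a minimal illustration rather than a production algorithm: the epsilon-greedy rule, the Gaussian reward noise, and the names run_bandit, true_means, and epsilon are all assumptions made for this example.

```python
import random


def run_bandit(true_means, steps=3000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a multi-armed bandit with noisy rewards."""
    rng = random.Random(seed)
    n_actions = len(true_means)
    estimates = [0.0] * n_actions  # the agent's running reward estimate per action
    counts = [0] * n_actions       # how often each action has been tried

    for _ in range(steps):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: estimates[a])

        # The environment answers with a noisy reward for the chosen action.
        reward = rng.gauss(true_means[action], 1.0)

        # Incremental mean: nudge the estimate toward the observed reward.
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]

    return estimates, counts


# Hypothetical environment: action 2 pays best on average.
estimates, counts = run_bandit(true_means=[0.0, 0.5, 2.0])
best_action = max(range(len(estimates)), key=lambda a: estimates[a])
print("estimated values:", [round(v, 2) for v in estimates])
print("learned policy: pick action", best_action)
```

The agent starts with no knowledge (all estimates are zero), and the occasional random action is what lets it discover that a better option exists.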

This trial-and-error mechanism drives everything from AlphaGo beating world champions to ChatGPT learning to give helpful responses. The beauty of RL is how generalizable the improvement loop is. The RL environment might be a web browser, a spreadsheet, or a human conversation, but the process is always the same and maps directly onto how humans learn.

The core components of an RL environment

RL environments follow a repeating pattern: an agent observes its current situation, chooses an action from available options, and receives feedback that guides future decisions.

The agent loop: observe, act, get feedback

Training a dog to sit demonstrates the pattern. The dog observes the scene (you giving a command), picks an action (sit, run, or bark), and gets immediate feedback (a treat for sitting, nothing for barking). Each interaction teaches the dog which behaviors lead to rewards. RL environments follow the same pattern: they create digital worlds where software agents learn through trial and error.

The state is what the agent can perceive right now. In a video game, this might be the positions of all characters on screen. In a chatbot, it is the conversation history so far.

The action space defines every possible move the agent can make. A chess program chooses from all legal moves for the current board position. A trading bot might buy, sell, or hold a stock.

The reward is the environment's judgment on how well the agent performed. Games give points for winning. Trading simulations measure profit and loss. The reward signal drives all learning: agents adapt their behavior to maximize future rewards.

This loop is formally modeled as a Markov Decision Process (MDP), a mathematical framework in which the next state depends only on the current state and the action taken, not on the full history. The MDP framing is what makes RL tractable: it gives algorithms a well-defined objective (maximize expected cumulative reward) to improve a policy against, rather than relying on heuristics. Most RL environments, from a chess board to a web browser, can be described as an MDP or, when the agent sees only part of the state, as a partially observable MDP.

Rules, boundaries, and what gets learned

Three structural elements define how the environment behaves. Transition dynamics are the rules of the world — they govern how actions change the state. Think of them like the laws of physics in a video game: jump off a ledge and gravity pulls you down, move a chess piece and the board updates. In a robot simulation, motor commands change joint positions and velocities.

Episodes mark the boundaries of individual learning attempts. Each game of chess is one episode. Each conversation with a chatbot is one episode. When an episode ends (checkmate, successful task completion, or hitting a time limit), the agent starts fresh with a new state.

The policy is what the agent actually learns: its strategy for picking actions given any state. A strong chess policy knows opening moves, tactical patterns, and endgame principles. A good chatbot policy generates helpful, harmless responses across conversation topics.
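To make transition dynamics, episodes, and policies concrete, here is a minimal sketch of an environment written from scratch. The CorridorEnv class, its reset/step interface, and the run_episode helper are hypothetical names chosen for this example; they mirror a common convention but are not tied to any specific library.

```python
class CorridorEnv:
    """A 1-D corridor: start at cell 0, reach the last cell to succeed."""

    ACTIONS = ("left", "right")  # the action space

    def __init__(self, length=5, max_steps=20):
        self.length = length
        self.max_steps = max_steps
        self.position = 0
        self.steps = 0

    def reset(self):
        """Start a new episode and return the initial state."""
        self.position = 0
        self.steps = 0
        return self.position

    def step(self, action):
        """Apply the transition dynamics; return (state, reward, done)."""
        if action not in self.ACTIONS:
            raise ValueError(f"unknown action: {action!r}")
        move = 1 if action == "right" else -1
        # Walls clamp the position to the corridor.
        self.position = min(max(self.position + move, 0), self.length - 1)
        self.steps += 1

        reached_goal = self.position == self.length - 1
        timed_out = self.steps >= self.max_steps
        reward = 1.0 if reached_goal else 0.0
        # Termination condition: success or timeout.
        return self.position, reward, reached_goal or timed_out


def run_episode(env, policy):
    """Roll out one episode; return the trajectory of (state, action, reward)."""
    state, done, trajectory = env.reset(), False, []
    while not done:
        action = policy(state)
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory


def always_right(state):
    """A fixed, hand-written policy; a learned policy would replace this."""
    return "right"


trajectory = run_episode(CorridorEnv(), always_right)
print(trajectory)
```

Swapping always_right for a function that is updated from the collected trajectories is what turns this rollout loop into learning.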

Quick reference: RL environment glossary

State: what the agent can observe about its current situation.
Action space: all possible moves the agent can choose from.
Reward function: how the environment scores the agent's performance.
Transition dynamics: the rules for how actions change the world state.
Episode: one complete attempt at the task from start to finish.
Termination condition: when an episode ends (success, failure, or timeout).
Policy: the agent's learned strategy for choosing actions.
Trajectory: a full recording of states, actions, and rewards from one episode.

Real-world RL environment examples

RL environments span a wide spectrum — from classic game simulations to the knowledge work software that runs the modern economy. Each domain uses the same observe-act-reward cycle with different inputs and outputs.

Game AI is where RL research started. Atari games like Breakout give agents pixel observations and joystick actions, with the game score as the reward: score points, do not miss the ball. Chess engines like AlphaZero learn strategy through millions of self-play games, with a reward that arrives only at the end of each game (win, draw, or loss). These environments work well for research because success is clearly measurable.

Robotics moves RL into the physical world. A robot arm learning to grasp objects observes joint angles and camera feeds, outputs motor commands, and gets rewarded for successful picks. OpenAI's robotic hand solving a Rubik's cube logged thousands of hours of simulated practice before touching real hardware. The environment is a physics simulator; the reward is task completion.

Knowledge work is the newest and most commercially significant frontier for RL. Major AI labs are building RL environments targeting office software at scale. According to SemiAnalysis, environment companies are selling "UI gyms" (dockerized clones of production software like DoorDash or Uber Eats) to labs at roughly $20,000 per website. OpenAI has purchased hundreds of these environments for ChatGPT Agent training, improving its agent by having it operate real software, complete tasks, and get scored on the output. For a comparison of the major platforms in this category, see the top RL environment tools of 2026.

This is the process of hill climbing. Researchers run the agent through thousands of task scenarios, collect the trajectories where it succeeded, and use those as training signal to improve the next version. Each eval run is simultaneously a measurement and a training data generator. That is why RL and evals are inseparable. You cannot improve what you do not measure, and every measurement is a potential training example.
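The dual role of an eval run, as both measurement and data generator, can be sketched in a few lines. The Rollout record, its field names, and the sample data are hypothetical and exist only to illustrate the idea; real pipelines carry far richer trajectories.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """One scored attempt at a task (field names are hypothetical)."""
    task_id: str
    transcript: str
    reward: float


def success_rate(rollouts, threshold=1.0):
    """Eval metric: fraction of rollouts that met the reward threshold."""
    if not rollouts:
        return 0.0
    return sum(r.reward >= threshold for r in rollouts) / len(rollouts)


def select_for_training(rollouts, threshold=1.0):
    """Training signal: keep only the rollouts that met the threshold."""
    return [r for r in rollouts if r.reward >= threshold]


# Made-up results from one eval run over two tasks.
rollouts = [
    Rollout("book-flight", "searched, picked fare, paid", 1.0),
    Rollout("book-flight", "searched, gave up at checkout", 0.0),
    Rollout("fill-form", "filled every field, submitted", 1.0),
    Rollout("fill-form", "submitted with a missing field", 0.0),
]

print("success rate:", success_rate(rollouts))
print("kept for training:", [r.task_id for r in select_for_training(rollouts)])
```

The same list of scored rollouts feeds both functions, which is the sense in which measurement and training data come from one run.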

Spreadsheet agents navigate cells and formulas, earning rewards for correct calculations. Email triage agents read inboxes, draft replies, and get scored on whether a human would have sent the same response. Browser agents complete multi-step web tasks (booking, research, form submission) with task completion as the reward signal. The environment is real software. The reward is whether the job got done.

The pattern is consistent across all of these: define the software environment, define what "done" looks like, run the agent, score the result. That score is both an eval metric and a training signal.

How RL powers LLM improvement

Reinforcement learning transforms raw language models into conversational assistants through the same trial-and-error loop that teaches game agents how to win.

RLHF: teaching models what humans want

Reinforcement learning from human feedback (RLHF) uses two ingredients: human preference rankings and ideal completions. Human reviewers compare model outputs and mark which response is better. Those comparisons are typically used to train a reward model, and the language model then learns to maximize the probability of generating responses that humans prefer over alternatives.

This is the same environment-reward-trajectory loop from classic RL. The environment is the conversation context, the action is generating a response, and the reward comes from human feedback. GPT-3's raw completions became ChatGPT's helpful, harmless responses through thousands of these training cycles.
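One common way to turn "response A is better than response B" into a training signal is a pairwise (Bradley-Terry style) loss on a reward model's scores. The sketch below shows only that loss on plain floats; the function name preference_loss and the example scores are assumptions for illustration, and a real setup would compute the scores with a neural network and backpropagate through them.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)), computed stably."""
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    # For negative margins, rewrite to avoid overflow in exp().
    return -margin + math.log1p(math.exp(margin))


# The loss is small when the preferred response scores higher,
# and large when the reward model ranks the pair the wrong way round.
print("agrees with human:   ", round(preference_loss(2.0, 0.0), 4))
print("undecided:           ", round(preference_loss(0.0, 0.0), 4))
print("disagrees with human:", round(preference_loss(0.0, 2.0), 4))
```

Minimizing this loss pushes the score of the human-preferred response above the score of the rejected one, which is what lets the reward model stand in for the reviewers during RL.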

RLAIF and beyond: scaling the signal

Getting human reviewers to grade every model response does not scale — it is slow and expensive. Two methods address this.

Reinforcement Learning from AI Feedback (RLAIF), introduced by Anthropic, replaces human reviewers with another AI model that judges response quality. Instead of paying people to compare outputs, you use a second LLM as the judge.

Group Relative Policy Optimization (GRPO), introduced by DeepSeek, takes a different approach: rather than grading each response individually, it generates a group of responses to the same prompt and scores them relative to each other. The best responses in the group get reinforced; the worst get penalized. GRPO was central to training DeepSeek-R1. For a deeper walkthrough, see GRPO training: what it is and how to run it.
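The "score them relative to each other" step can be sketched as a normalization of rewards within one group. This is a simplified illustration of the group-relative advantage only, not a full GRPO implementation (it omits the policy-gradient update, clipping, and any KL penalty); the function name, the eps constant, and the choice of population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Center and scale rewards within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every response scored the same.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four responses sampled for the same prompt, scored 1.0 (correct) or 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])

# If every response gets the same reward, there is nothing to prefer.
print(group_relative_advantages([1.0, 1.0, 1.0]))
```

Responses with a positive advantage are reinforced and those with a negative advantage are pushed down. A group in which every response scores the same yields all-zero advantages and therefore no learning signal, which is one reason task difficulty matters.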

The tradeoff is signal quality. An AI judge is cheaper and faster than a human expert, but less reliable at the frontier where tasks are genuinely hard. GRPO removes the separate value model that PPO-style training needs, and it is often paired with automated, rule-based rewards so that no judge model is required, but it still depends on careful reward design: if the scoring criteria are too easy to game, the model learns to exploit them rather than actually improve. The verifier and reward design guide covers how to build scoring systems that hold up under that pressure.
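A small example shows how a scoring rule can be gamed. Both graders below are hypothetical sketches for a counting task whose correct answer is an integer: the lenient one checks whether the right digits appear anywhere in the answer, and the stricter one requires the answer to contain exactly one distinct integer.

```python
import re


def lenient_grade(answer: str, correct: int) -> float:
    """Substring check: passes if the correct digits appear anywhere."""
    return 1.0 if str(correct) in answer else 0.0


def strict_grade(answer: str, correct: int) -> float:
    """Passes only if the answer mentions exactly one distinct integer."""
    numbers = set(re.findall(r"\d+", answer))
    return 1.0 if numbers == {str(correct)} else 0.0


honest = "There are 3 r's in the word."
hedging = "It is one of 1, 2, 3, 4 or 5."   # lists every plausible answer
wrong = "13"                                 # wrong, but contains the digit 3

for answer in (honest, hedging, wrong):
    print(f"{answer!r}: lenient={lenient_grade(answer, 3)} strict={strict_grade(answer, 3)}")
```

The stricter grader is not free of tradeoffs: it rejects a correct answer that spells the number out or mentions other numbers in passing. Reward design is largely a matter of choosing which of these failure modes is acceptable for a given task.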

The expert data industry

Mercor, Handshake, and Surge supply domain-expert human feedback to frontier AI labs. That means physicians grading medical reasoning, mathematicians scoring proofs, and financial analysts evaluating models. OpenAI's move to build an in-house human data team signals how strategically critical this pipeline has become. The domain expansion follows a clear path: math problems (easy to grade) to coding (automated testing) to healthcare and finance (requires experts).

This expert data feeds RL environments that span the economy. OpenAI's GDPval benchmark, a dataset designed to measure model performance on economically valuable work, contains 1,000+ tasks across 44 occupations drawn from sectors that each contribute over 5% of U.S. GDP. The same environment infrastructure that trains agents to play Atari now teaches models to excel at Excel, write better code, and reason through complex financial models.

Building an RL environment with HUD

The four primitives in HUD's SDK map directly to the RL concepts covered above: an environment defines the sandbox, tools define the action space, a scenario defines the task and grades the result, and a taskset is the corpus of training data. The example below uses letter counting — a task LLMs continually struggle with — to walk through each primitive end-to-end.

Install and authenticate

Install the HUD CLI with uv (the recommended Python package manager), then set your API key, which you can create at hud.ai/project/api-keys.

# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install the HUD CLI with the Anthropic SDK so the gateway can route Claude
uv tool install hud-python --python 3.12 --with anthropic

# Authenticate with a single HUD key (the inference gateway proxies
# Claude, GPT, and Gemini through this one credential)
hud set HUD_API_KEY=your-key-here

Scaffold a new environment

hud init scaffolds an env.py, a Dockerfile.hud, and a pyproject.toml for a fresh environment:

hud init letter-counter --preset blank
cd letter-counter

Environment: the isolated sandbox

The Environment class spins up a fresh, isolated instance for every evaluation run. No shared state between runs means every trajectory is reproducible. Under the hood, each environment is an MCP server.

from hud import Environment

env = Environment("letter-counter")

Tools: the action space

Any Python function decorated with @env.tool() becomes a callable action for the agent. The docstring becomes the description the agent sees when deciding whether to call the tool. Type hints become the parameter schema. This is the agent's action space — the complete set of moves available to it.

@env.tool()
def count_letter(text: str, letter: str) -> int:
    """Count occurrences of a letter in text."""
    return text.lower().count(letter.lower())

Scenario: task definition, episode boundary, and reward in one

The @env.scenario decorator wraps an async generator with two yield statements. The first yield sends the task prompt to the agent and receives its response. The second yield grades that response and returns a reward between 0.0 and 1.0. The agent runs between the two yields, calling tools and reasoning.

@env.scenario("count")
async def count(word: str, letter: str):
    # First yield: send the task, receive the agent's response
    answer = yield f"How many '{letter}' in '{word}'?"

    # Second yield: grade the response, return the reward
    correct = str(word.lower().count(letter.lower()))
    yield 1.0 if answer and correct in answer else 0.0

This single function is simultaneously the episode definition (start to end), the reward function (the grading logic), and the task specification (the prompt).
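To see why two yields are enough, it helps to drive the same generator by hand using only the standard library. The sketch below is not HUD's runner: run_once and scripted_agent are hypothetical stand-ins, and the scenario is written without the decorator so the snippet runs on its own.

```python
import asyncio


async def count(word: str, letter: str):
    # First yield: emit the prompt, then pause until an answer is sent in.
    answer = yield f"How many '{letter}' in '{word}'?"
    # Second yield: emit the reward for that answer.
    correct = str(word.lower().count(letter.lower()))
    yield 1.0 if answer and correct in answer else 0.0


async def scripted_agent(prompt: str) -> str:
    """Hypothetical stand-in for a model call."""
    return "There are 3 r's in 'strawberry'."


async def run_once(scenario, agent):
    """Drive a two-yield scenario through one episode."""
    prompt = await scenario.asend(None)    # run up to the first yield
    answer = await agent(prompt)           # the agent acts between the yields
    reward = await scenario.asend(answer)  # resume, run up to the second yield
    await scenario.aclose()                # close the generator: episode over
    return prompt, answer, reward


prompt, answer, reward = asyncio.run(run_once(count("strawberry", "r"), scripted_agent))
print(prompt)
print(answer)
print("reward:", reward)
```

The value passed to asend() becomes the result of the first yield expression inside the generator, which is how the agent's answer reaches the grading code.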

Taskset: the training data

A task is a scenario instantiated with specific arguments, tracked by a stable kebab-case slug. Collect tasks into a tasks.py file and you have a taskset — a batch the agent can be evaluated against, and which becomes training data once you start collecting trajectories.

from env import count

strawberry_r = count.task(word="strawberry", letter="r")
strawberry_r.slug = "strawberry-r"

mississippi_s = count.task(word="mississippi", letter="s")
mississippi_s.slug = "mississippi-s"

abracadabra_a = count.task(word="abracadabra", letter="a")
abracadabra_a.slug = "abracadabra-a"

Running the agent and reading rewards

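The fastest way to run a single task is task.run(). It instantiates the task, runs the agent, scores the result, and returns the reward. The snippet uses a top-level await, so run it in a notebook or an asyncio REPL (python -m asyncio), or wrap it in an async function and call asyncio.run().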

from env import count

result = await count.task(word="strawberry", letter="r").run("claude-sonnet-4-5")
print(f"Reward: {result.reward}")  # 1.0 if the agent answers "3"

To run the full taskset from the command line, use the hud eval CLI. It imports your task file, runs each task in-process, and prints a reward distribution at the end. The --gateway flag routes the model through HUD's inference endpoint so a single HUD key is enough:

hud eval tasks.py claude --gateway --full

See the hud eval CLI reference for flags like --task-ids, --group-size, and --remote.

Each run produces a full trajectory: the prompt sent to the agent, every tool call it made, the final answer, and the reward. For the strawberry-r task above, a successful trajectory looks like:

Prompt: "How many 'r' in 'strawberry'?"
Tool call: count_letter(text="strawberry", letter="r") → 3
Agent answer: "There are 3 r's in 'strawberry'."
Reward: 1.0

This trajectory — prompt, action, result, reward — is exactly the training data format RL algorithms consume to update a model's weights. From here you can iterate with hud dev env:env for hot-reload during development, then hud deploy to push the environment to the platform for hosted evaluation at scale.
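As an illustration of what "trajectory as training data" can look like on disk, the sketch below stores the strawberry example as one JSON line. The ToolCall and Trajectory classes and their field names are hypothetical; they are not HUD's actual trajectory schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ToolCall:
    """One action taken by the agent (hypothetical schema)."""
    name: str
    arguments: dict
    result: object


@dataclass
class Trajectory:
    """Everything recorded for one episode (hypothetical schema)."""
    prompt: str
    tool_calls: list = field(default_factory=list)
    answer: str = ""
    reward: float = 0.0


trajectory = Trajectory(
    prompt="How many 'r' in 'strawberry'?",
    tool_calls=[ToolCall("count_letter", {"text": "strawberry", "letter": "r"}, 3)],
    answer="There are 3 r's in 'strawberry'.",
    reward=1.0,
)

# One JSON object per line (JSONL) is a common format for training corpora.
line = json.dumps(asdict(trajectory))
print(line)
```

Keeping the reward next to the full sequence of actions is what lets a training job decide which behaviors to reinforce.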

Have an AI coding agent build the environment for you

If you use Cursor, Claude Code, or another MCP-compatible coding agent, you can attach the HUD documentation as an MCP server and let the agent walk you through building your first environment. In Claude Code:

claude mcp add --transport http docs-hud https://docs.hud.ai/mcp

In Cursor, add this to your MCP settings:

"docs-hud": {
  "url": "https://docs.hud.ai/mcp"
}

Then paste the following prompt into your agent to scaffold, test, and deploy your first environment end-to-end:

Help me build my first HUD RL environment from scratch. Use the docs-hud MCP server I just installed for the latest commands and patterns. Walk me through:

1. Check the prerequisites are installed and help me set HUD_API_KEY (guide me to export it; don't ask me to paste it in chat).
2. Scaffold a new environment with hud init, then explain env.py.
3. Help me define one tool, one scenario, and one task, then run an eval.

Explain each concept as we go.

Conclusion

The mechanism that teaches a pole-balancing algorithm to stay upright is the same one that transformed GPT-3 into ChatGPT and powers OpenAI's latest reasoning models. Whether you are training an agent to navigate a maze or fine-tuning an LLM to write better code, the core pattern remains: observe, act, get feedback, improve.

If you are ready to build RL environments beyond toy examples, HUD gives you the SDK to define tasks, run agents against real software, and turn the results into training data. The same two-yield scenario pattern that defines tasks also generates the trajectory data you train on. Every evaluation is a step toward a better model. For the broader training loop, see how to train AI agents with reinforcement learning.

FAQs

When should you use an RL environment instead of supervised fine-tuning?

Use supervised fine-tuning (SFT) when you have labeled examples and the task has a single correct output, like spam detection, sentiment classification, or entity extraction. Use an RL environment when the task involves multi-step decisions, tool use, or open-ended outputs where there is no single right answer. If you can define an automated reward signal (unit tests pass, SQL returns correct results, task gets completed), an RL environment can outperform static training data because the agent learns from its own attempts rather than only imitating examples.

What is the difference between RLHF and RL with verifiable rewards?

RLHF relies on human reviewers to rank model outputs and train a reward model from those preferences. It works but is slow, expensive, and bottlenecked by reviewer quality. RL with verifiable rewards replaces human judgment with automated checks. Verifiable rewards are faster, cheaper, and more consistent. The tradeoff is that not every task has a clean automated signal. Tasks requiring subjective judgment (tone, creativity, nuance) still benefit from human or AI feedback.
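As a concrete instance of an automated check, the sketch below scores a SQL query by running it against a throwaway in-memory database and comparing the rows with a known-good result. The sql_reward function, the orders table, and the sample queries are all hypothetical and chosen only for illustration.

```python
import sqlite3


def sql_reward(candidate_sql: str, expected_rows: list, setup_sql: str) -> float:
    """Return 1.0 if the query yields the expected rows (in any order), else 0.0."""
    conn = sqlite3.connect(":memory:")  # fresh database for every check
    try:
        conn.executescript(setup_sql)
        try:
            rows = conn.execute(candidate_sql).fetchall()
        except (sqlite3.Error, sqlite3.Warning):
            return 0.0  # invalid SQL earns no reward
        return 1.0 if sorted(rows) == sorted(expected_rows) else 0.0
    finally:
        conn.close()


SETUP = """
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL);
INSERT INTO orders (customer, total) VALUES ('ada', 30.0), ('ada', 12.5), ('bob', 8.0);
"""
EXPECTED = [("ada", 42.5), ("bob", 8.0)]  # total spend per customer

good = sql_reward("SELECT customer, SUM(total) FROM orders GROUP BY customer", EXPECTED, SETUP)
bad = sql_reward("SELECT customer, MAX(total) FROM orders GROUP BY customer", EXPECTED, SETUP)
broken = sql_reward("SELEC nonsense", EXPECTED, SETUP)
print(good, bad, broken)
```

No human or judge model is involved: the reward follows entirely from executing the agent's output, which is what makes it cheap to run at scale and consistent from one run to the next.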

What is the difference between an RL environment and a training dataset?

A training dataset is static. The model sees examples, learns patterns, and never interacts with the data. An RL environment is interactive. The agent takes an action, the environment responds with a new state and a reward, and the agent adapts. This feedback loop lets the agent discover strategies that no static dataset contains. The model learns from consequences, not just examples.

How does HUD help companies build reinforcement learning environments?

HUD provides an SDK for defining tasks, running agents against real software environments, and scoring the results. Teams write scenarios as Python generators that yield a prompt and then a reward. HUD runs those scenarios in parallel, returns structured trajectories, and makes the output available as training data. The environments run against live applications rather than simulated ones, which means agent performance reflects real capability on the underlying software.

What are the core components of an RL environment?

Every RL environment has the same primitives: a state (what the agent can observe), an action space (the moves available to it), a reward function (the score returned after each action), transition dynamics (how actions change the world), episodes (the boundaries of one attempt), and a policy (the strategy the agent learns). HUD's SDK maps these directly onto environments, tools, and scenarios so the conceptual model and the code stay in sync.