How to Test a Computer Use Agent

Introduction

Most AI evaluations grade the text that a model produces. You ask the model a question, it returns an answer, and a grader checks whether the answer is right. Computer use agents work differently. They click, type, scroll, and navigate through real software. Their output is not a paragraph. It is a sequence of 30 actions across a live spreadsheet, where one bad click at step 12 can poison the rest of the run.

Testing these agents requires different infrastructure than testing chatbots. The eval has to run real software, capture every action, score the final state of that software, and reset cleanly between runs. Most public benchmarks miss at least one of these requirements, which is why scores from one benchmark often fail to predict performance on another.

This guide covers what a computer use agent is, why these agents took off in 2026, and what high-quality evaluation actually requires. At Human Union Data (HUD), we use SheetBench-50 — our public benchmark for financial analyst spreadsheet workflows — as the running example throughout. The same six principles apply to any computer use agent eval, whether the agent is operating Salesforce, Sheets, or a custom internal tool. SheetBench-50 was developed in partnership with Sepal AI and validated by finance professionals from PwC, Cisco, Charles Schwab, and Fannie Mae.

What makes computer use agents different

A computer use agent operates software through the same visual interfaces a human uses. It takes screenshots, reads what is on screen, and decides what to click, type, or scroll next. A chatbot reads text and writes text. A computer use agent reads pixels and takes actions in real applications.

That difference changes how evaluation works. A chatbot's output is one block of text you can score against a reference answer. A computer use agent's output is a sequence of 15 to 50 actions, each one depending on the screen state at that moment. Success is not a string match. Success is whether the spreadsheet, ticket, or dashboard ended up in the correct state.

Common computer use agent tasks include filling out an expense report in Concur, diagnosing a production error in Sentry, running a financial model in Google Sheets, triaging an inbox, and deploying a service through a cloud console.

The action space is also much larger than a chatbot's. Every clickable element, every text field, and every menu option is a possible decision. The agent can take dozens of paths to the same goal, and only some of them work. Evaluation has to look at the full path, not just the final cell value or text response. The conceptual mapping between this and the standard RL loop — state, action space, reward — is covered in detail in the introduction to RL environments.
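To make the loop concrete, here is a minimal sketch of an observe-act episode. Everything in it is hypothetical: ToyScreen stands in for a real application, the observation is a dict rather than a screenshot, and scripted_policy stands in for a model. It is not HUD's API, only an illustration of how a trajectory and a state-based reward come out of the same loop.

```python
from dataclasses import dataclass


@dataclass
class ToyScreen:
    """Hypothetical stand-in for an app: one text field and a Save button."""
    field_text: str = ""
    saved: bool = False

    def observe(self) -> dict:
        # A real agent would receive a screenshot; this toy returns a dict.
        return {"field_text": self.field_text, "saved": self.saved}

    def step(self, action: dict) -> None:
        # Apply one UI action to the application state.
        if action["type"] == "type":
            self.field_text += action["text"]
        elif action["type"] == "click" and action["target"] == "save":
            self.saved = True


def scripted_policy(observation: dict) -> dict:
    """Hypothetical policy: type the value first, then click Save."""
    if observation["field_text"] != "42":
        return {"type": "type", "text": "42"}
    return {"type": "click", "target": "save"}


def run_episode(env: ToyScreen, policy, max_steps: int = 10):
    """Run observe -> act until the form is saved or the step budget runs out."""
    trajectory = []
    for _ in range(max_steps):
        observation = env.observe()
        if observation["saved"]:
            break
        action = policy(observation)
        env.step(action)
        trajectory.append(action)
    # The reward depends on the final state, not on any text the agent produced.
    reward = 1.0 if env.saved and env.field_text == "42" else 0.0
    return trajectory, reward
```

Calling run_episode(ToyScreen(), scripted_policy) returns both the list of actions taken and the reward, which is the pairing a CUA eval needs to keep.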

Why computer use took off in 2026

Two changes pushed computer use to the front of the agent space. Vision-capable models got reliably good at reading interfaces. The frontier labs shipped working products: Anthropic released Claude Computer Use, OpenAI released Operator, and Google released Gemini Computer Use. Each of these models has to be evaluated against real software before any team deploys it.

The structural reason these products exist has not changed. Most enterprise software was built for human users. Internal tools, legacy ERPs, vendor portals, and approval workflows often have no programmatic interface at all. Even systems that do have an API rarely expose the full workflow. Salesforce has an API, but Salesforce reporting often involves five clicks across three views and a custom dashboard the API does not reach. The same pattern holds for Workday, Jira, and SAP. UI automation closes the gap between what an API allows and what enterprise workflows require.

HUD hosts Autonomy-10, the benchmark used to evaluate Operator at launch, along with OSWorld-Verified, the most widely used desktop benchmark in the field. Anyone building a computer use agent or fine-tuning one for their own software needs the same evaluation infrastructure the labs use.

The six principles of a high-quality CUA eval

At HUD, we partnered with Sepal AI to build SheetBench-50. The goal was a public benchmark that measured whether agents could complete the work of a financial analyst. The execution took longer than we expected. Anyone can write 50 spreadsheet tasks. Writing 50 tasks that produce trustworthy, reproducible scores from real agents is a different problem.

Six principles apply to any computer use agent eval. Here they are with a real task from SheetBench-50 to make them concrete:

For the ticker with the greatest correlation between volume and next-day price change, find the day with the greatest volume and the next day's price change. Put the ticker in ANSWER A1, the volume in ANSWER B1, and the next-day price change in ANSWER C1. Use CORREL to determine correlation for each ticker group. Sort by ticker Z-A and then date ascending before calculating next-day price change.

That task involves time-series analysis, grouped statistics, and precise sort ordering. A finance team runs work like this every week.
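For readers who think in code, the logic of that task can be sketched outside a spreadsheet. The rows below are invented for illustration and are not SheetBench-50 data. The sketch also makes two assumptions the task text leaves open: "price change" is the absolute difference in closing price, and the last day of each ticker has no next-day change and is dropped. The pearson helper plays the role of CORREL.

```python
from collections import defaultdict
from math import sqrt

# Invented sample rows: (ticker, ISO date, close, volume). Not benchmark data.
ROWS = [
    ("AAA", "2024-01-02", 11.0, 200),
    ("ZZZ", "2024-01-01", 50.0, 500),
    ("AAA", "2024-01-01", 10.0, 100),
    ("ZZZ", "2024-01-03", 51.0, 900),
    ("AAA", "2024-01-04", 16.0, 150),
    ("ZZZ", "2024-01-02", 49.0, 400),
    ("AAA", "2024-01-03", 13.0, 300),
    ("ZZZ", "2024-01-04", 50.0, 100),
]


def pearson(xs, ys):
    """Pearson correlation, the statistic CORREL computes."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)  # undefined if either series is constant


def solve(rows):
    by_ticker = defaultdict(list)
    for ticker, date, close, volume in rows:
        by_ticker[ticker].append((date, close, volume))

    best = None
    for ticker in sorted(by_ticker, reverse=True):  # ticker Z-A
        days = sorted(by_ticker[ticker])  # date ascending (ISO strings sort correctly)
        # Pair each day's volume with the change in close on the following day.
        pairs = [
            (today[2], tomorrow[1] - today[1])
            for today, tomorrow in zip(days, days[1:])
        ]
        corr = pearson([v for v, _ in pairs], [c for _, c in pairs])
        if best is None or corr > best[0]:
            best = (corr, ticker, pairs)

    _, ticker, pairs = best
    volume, change = max(pairs, key=lambda pair: pair[0])
    return ticker, volume, change  # values for ANSWER A1, B1, C1
```

On the sample rows, solve(ROWS) picks AAA, whose volume and next-day change move together perfectly, and reports its highest-volume day.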

1. Realism

Tasks have to mirror work humans actually do. Toy CSV puzzles do not transfer to enterprise workflows because they leave out the structural complexity real spreadsheets have. SheetBench-50 tasks come from finance professionals describing the work their teams run weekly. The correlation example above is a real analytical pattern from equities research, with sort order requirements that affect the math.

2. Instruction-following

Real workflows have constraints buried in the prompt. Currency vs. decimal. Thousands separators. A Z-A ticker sort applied before an ascending date sort. An agent that gets the math right but ignores formatting fails the actual job. A good eval has to catch instruction-following failures, which means tasks need constraints worth getting wrong.
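A small sketch shows why value and format have to be checked separately. It assumes a hypothetical task that asks for a non-negative US dollar amount with thousands separators and two decimals. The function names are illustrative and do not come from SheetBench-50's grader.

```python
def format_currency(value: float) -> str:
    """Render a non-negative amount as $1,234,567.50 (the assumed required format)."""
    return f"${value:,.2f}"


def check_cell(displayed: str, expected_value: float) -> dict:
    """Report whether the number is right and whether it is formatted as instructed."""
    digits = displayed.replace("$", "").replace(",", "")
    try:
        value_ok = abs(float(digits) - expected_value) < 0.005
    except ValueError:
        value_ok = False
    format_ok = displayed == format_currency(expected_value)
    return {"value_ok": value_ok, "format_ok": format_ok}
```

An agent that writes 1234567.5 into the cell passes the value check and fails the format check, which is exactly the kind of failure a task with real constraints should surface.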

3. Provable verification

The eval has to know whether the agent succeeded with zero ambiguity. SheetBench-50 uses cell-level checks: ANSWER A1 either contains the right ticker or it does not. For a Sentry debugging agent, the verifier checks whether the agent surfaced the exact tool call ID causing the validation error. Binary verification beats LLM-as-judge whenever you can get it, because the reward signal has no noise. If the verifier is fuzzy, the agent learns to game the verifier instead of solving the task. The verifier and reward design guide covers the deeper theory of how to build scoring systems that hold up under that pressure.
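As an illustration of binary cell-level verification, here is a minimal sketch. The cell keys, the tolerance, and the verify_cells name are assumptions made for the example. This is not the actual SheetBench-50 verifier.

```python
import math


def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def verify_cells(actual: dict, expected: dict, rel_tol: float = 1e-6) -> float:
    """Return 1.0 only if every expected cell matches; otherwise 0.0."""
    for cell, want in expected.items():
        got = actual.get(cell)
        if _is_number(want):
            # Numeric cells match within a small relative tolerance.
            if not _is_number(got) or not math.isclose(got, want, rel_tol=rel_tol):
                return 0.0
        elif got != want:
            # Text cells must match exactly.
            return 0.0
    return 1.0
```

There is no partial credit in this sketch. A right answer in the wrong cell scores 0.0, which keeps the reward signal unambiguous.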

4. Tool dexterity

Good evals force the agent to operate the software like a person would. For a spreadsheet, that means pivots, range selection, formula entry, conditional formatting, and dropdowns. For a CRM like Salesforce, it means navigating between report views, applying filters, and editing dashboard widgets. Tasks that can be solved through pure code generation skip the actual interface, which is the whole reason computer use agents exist. The eval has to test whether the agent can actually navigate the UI.

5. Transparency

Every run on HUD captures every action, every reasoning trace, and every screenshot. Scores alone do not tell teams why an agent failed. Replay does. When Claude Sonnet 4.5 misses a SheetBench-50 task, the logged trajectory shows whether it picked the wrong column, used SUMIF instead of CORREL, or put the right answer in the wrong cell.

6. Blind validation

Three financial experts solved each SheetBench-50 task independently, without seeing each other's work or any reference answer. At HUD, we kept rewriting tasks until all three experts arrived at the same solution. That process is slow. It also produces tasks with exactly one correct answer, which is what cell-level verification needs to work. Not every workflow has a single correct answer. For tasks like drafting client emails or summarizing meetings, teams need rubric-based scoring instead of binary verification.
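The agreement check behind blind validation is simple to sketch. The function below is a hypothetical illustration, not HUD's tooling. It takes each expert's answer cells and returns the cells where they disagree, so task authors know what to rewrite.

```python
def disagreeing_cells(expert_answers: list[dict]) -> list[str]:
    """Return the cells on which independently working experts gave different values."""
    cells = sorted({cell for answer in expert_answers for cell in answer})
    return [
        cell
        for cell in cells
        if len({answer.get(cell) for answer in expert_answers}) > 1
    ]


def is_validated(expert_answers: list[dict]) -> bool:
    """A task passes blind validation only when no cell is in dispute."""
    return not disagreeing_cells(expert_answers)
```

A task stays in the rewrite queue until is_validated returns True for all three experts.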

Why public benchmarks miss the mark

SpreadsheetBench is a recent academic benchmark for spreadsheet manipulation. It pulls 912 tasks from Excel forums and scores agents by checking whether their generated code produces the right cell values. The work advanced the field by showing how much LLMs still struggle with formulas. It also has structural problems most public CUA benchmarks share, and walking through them shows why we built SheetBench-50 differently.

Code generation skips the UI

SpreadsheetBench scores agents on whether their Python or VBA produces the right output. The agent never opens a spreadsheet. Real analysts do not write VBA to fill in a pivot table; they click through the interface. An eval that bypasses the UI tells you about formula generation, not about whether an agent can do the actual job.

Forum tasks miss enterprise complexity

ExcelForum.com questions skew toward "how do I VLOOKUP this" and away from "build me a five-year amortization schedule with FX adjustments." Public forums select for narrow, isolated problems. Enterprise work is multi-step, cross-sheet, and built around domain logic forum users rarely ask about. SheetBench-50 covers amortization schedules, FX conversions, z-score analyses, churn analysis, and breakeven projections because that is what finance teams actually run.

Cleaned-up data hides real complexity

Forum benchmarks filter out posts that are messy, ambiguous, or hard to formalize. The remaining tasks are cleaner than real spreadsheets. Production workbooks have nested headers, missing column labels, free-form notes in the margin, and three tables stacked on one sheet. Agents tested on clean data fall apart when they meet the actual mess.

No trajectory capture

Online-judge-style evals report a pass/fail and nothing else. When an agent fails, there is no way to see what went wrong. That is fine for grading homework. It is not enough for shipping production agents, where the failure modes are what teams need to fix.

We built SheetBench-50 to be the opposite of all four. Tasks come from finance professionals. The agent operates Google Sheets in a real Chrome browser. Spreadsheets include the kind of structural messiness real workbooks have. Every run captures the full action sequence for replay.

Live systems vs. isolated environments

A common mistake teams make when starting CUA evaluation is pointing the agent at a live system: live Salesforce, live Jira, or a live production database. The thinking is that real software gives the most realistic signal. In practice, live environments give the worst signal.

Live environments have shared state. If your agent runs a test that creates a Salesforce record, the next run sees that record. Two evaluations running in parallel will collide on the same database row. The score becomes a function of run order, not agent quality.

These systems also change underneath you. A vendor pushes a UI update overnight, the button moves, and yesterday's successful trajectory becomes today's failure even though the agent did nothing different. Trajectory comparisons across weeks stop being valid.

The bigger risk is damage. An untrained agent will click random buttons, submit forms with garbage data, and trigger irreversible actions. Sending a test email to a real customer list is the kind of mistake teams only make once.

HUD environments solve all three problems by spinning up fresh, isolated instances for every evaluation. Each run gets a clean copy of Google Sheets with the input spreadsheet loaded. No shared state. No collisions. No risk of breaking production. HUD supports thousands of concurrent environments with sub-second spin-up, so a 50-task benchmark finishes in minutes instead of hours.
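The isolation idea can be shown at small scale with nothing but the standard library. This sketch is not how HUD provisions environments. It assumes a hypothetical JSON file standing in for the input spreadsheet, and gives every run its own throwaway copy.

```python
import json
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def fresh_workspace(template: Path):
    """Yield a private copy of the template file and delete it afterwards."""
    run_dir = Path(tempfile.mkdtemp(prefix="eval-run-"))
    try:
        working_copy = run_dir / template.name
        shutil.copy(template, working_copy)
        yield working_copy
    finally:
        shutil.rmtree(run_dir, ignore_errors=True)


def run_once(template: Path) -> int:
    """Simulate an agent creating one record; return how many records it sees."""
    with fresh_workspace(template) as sheet:
        data = json.loads(sheet.read_text())
        data["records"].append("created by agent")
        sheet.write_text(json.dumps(data))
        return len(data["records"])
```

Because each call works on its own copy, every run starts from the same state and the template is never modified, so run order cannot affect the score.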

Run your first eval

Before pointing an agent at a heavy CUA benchmark, get the workflow running end-to-end on a tiny scaffolded environment. The same loop — install, authenticate, scaffold, eval — then scales up to SheetBench-50 or your own CUA workloads.

First, install uv (the recommended Python package manager) and the HUD CLI. The --with anthropic flag injects the anthropic SDK into the HUD tool environment so the inference gateway can route Claude calls:

curl -LsSf https://astral.sh/uv/install.sh | sh
uv tool install hud-python --python 3.12 --with anthropic

Authenticate with one API key. Grab it from hud.ai/project/api-keys. The HUD inference gateway then proxies Claude, GPT, and Gemini through this single key — no separate provider credentials required for the first run.

hud set HUD_API_KEY=***

Scaffold a starter project. The blank preset ships with a working tasks.py you can run immediately, then swap out for CUA-specific tasks later:

hud init demo --preset blank
cd demo

Run the eval through the HUD gateway:

hud eval tasks.py claude --gateway

You should see a trace URL on hud.ai/trace/... and a reward between 0.0 and 1.0 within seconds. That trace captures every tool call, every observation, and the final reward — the same data shape the platform uses for SheetBench-50 runs at scale.

To run a real CUA benchmark like SheetBench-50, open hud.ai/tasksets in the dashboard and click Run Taskset on SheetBench-50, OSWorld-Verified, or Autonomy-10. HUD handles the parallel browser instances, trajectory capture, and cell-level grading. A 50-task run completes in minutes and produces a scorecard with pass/fail per task and a replayable trajectory for each one. To customize scoring or add your own tasks, start from hud-evals/cua-template and deploy with hud deploy.

See the hud eval CLI reference for flags like --group-size (variance estimation) and --remote (submit to HUD infrastructure).
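Repeated runs matter because a single pass or fail says little about a stochastic agent. The sketch below shows one way to summarize a group of rewards for the same task. It is a generic statistics example, and it does not describe how the HUD CLI itself aggregates grouped runs.

```python
from math import sqrt
from statistics import mean, stdev


def summarize(rewards: list[float]) -> dict:
    """Mean reward and standard error of the mean for repeated runs of one task."""
    n = len(rewards)
    average = mean(rewards)
    # The sample standard deviation needs at least two runs.
    sem = stdev(rewards) / sqrt(n) if n > 1 else float("nan")
    return {"n": n, "mean": average, "sem": sem}
```

For example, summarize([1.0, 0.0, 1.0, 1.0]) reports a mean of 0.75 with a standard error of 0.25, which is a reminder that four runs leave a wide margin.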

Each run produces a full trajectory: the prompt sent to the agent, every tool call it made, every screenshot it observed, the final spreadsheet state, and the reward. That trajectory is the same labeled data shape RL algorithms consume — testing and training run on the same loop.
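A trajectory with those fields could be modeled roughly as follows. The class and field names are assumptions chosen to mirror the list above. They are not the schema HUD stores.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Step:
    tool_call: dict        # e.g. {"type": "click", "x": 120, "y": 48}
    screenshot_path: str   # where the observation captured for this step is stored


@dataclass
class Trajectory:
    prompt: str
    steps: list[Step] = field(default_factory=list)
    final_state: dict = field(default_factory=dict)
    reward: float = 0.0

    def to_json(self) -> str:
        """Serialize the whole run so it can be replayed or used as training data."""
        return json.dumps(asdict(self))
```

Keeping the prompt, steps, final state, and reward in one record is what lets the same file serve both a replay viewer and a training pipeline.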

Beyond SheetBench-50: Autonomy-10 and OSWorld-Verified

SheetBench-50 is the right starting point if your agent operates spreadsheets. For other slices of computer use, HUD hosts two additional public benchmarks alongside 100+ pre-built environments in the HUD hub.

Autonomy-10 is the benchmark OpenAI used to evaluate Operator at launch. It measures general autonomy — whether an agent can take a high-level goal and execute a multi-step plan across the open web without hand-holding. The ten tasks are diverse enough that a model that performs well across the set tends to generalize.

OSWorld-Verified is the most widely used desktop benchmark in the field. It covers 369 tasks across Chrome, LibreOffice, GIMP, and VS Code. HUD hosts the verified version, which fixes 300+ reproducibility issues from the original OSWorld release — the kind of cleanup work that turns a noisy academic benchmark into a stable production signal.

The same hud eval workflow runs on all three benchmarks. Through HUD's integration with Cua, teams can also benchmark any GUI-capable agent on these datasets with one line of code, regardless of which model provider it uses. For a broader comparison of environment platforms, see the top 5 reinforcement learning environments.

Building your own CUA environment with HUD

If your team needs to evaluate against your own software instead of a public benchmark, the fastest path is the hud-evals/cua-template repo. It scaffolds a working browser-based CUA environment with tasks, scenarios, scorers, and a Dockerfile already wired up. You clone it, swap in your tasks, and deploy.

To scaffold a fresh environment from scratch:

hud init my-cua-env --preset blank
cd my-cua-env
hud deploy

The same six principles from earlier still apply: realistic tasks from people who do the work, instruction-following constraints, provable verification, tool dexterity in the real UI, full trajectory capture, and blind-validated answers. Run them in isolated environments so results stay reproducible. For the conceptual groundwork on tools, scenarios, and reward functions, see RL environments: what they are and how to build one.

For teams that want a co-designed environment, HUD works directly with finance, ops, security, and legal teams to build custom benchmarks. The output is an eval that tells you whether an agent can actually do the work you need it to do.

Have an AI coding agent build the eval for you

If you use Cursor, Claude Code, or another MCP-compatible coding agent, you can attach the HUD documentation as an MCP server and let the agent walk you through running your first CUA eval. In Claude Code:

claude mcp add --transport http docs-hud https://docs.hud.ai/mcp

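In Cursor, add this server entry to your MCP settings (for example, .cursor/mcp.json). If the file already has an mcpServers object, add only the docs-hud entry inside it:

{
  "mcpServers": {
    "docs-hud": { "url": "https://docs.hud.ai/mcp" }
  }
}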

Then paste the following prompt into your agent to install the CLI, run SheetBench-50, and walk through the trace output:

Help me run my first HUD computer use agent evaluation. Use the docs-hud MCP server I just installed for the latest commands and patterns. Walk me through:

1. Check the prerequisites and help me set HUD_API_KEY + ANTHROPIC_API_KEY (guide me to export them; don't ask me to paste them in chat).
2. Run an eval against hud-evals/SheetBench-50 with Claude and explain the trace output.
3. Show me how to start from hud-evals/cua-template if I want to add my own task.

Explain each concept as we go.

Conclusion

Computer use agents do not produce text. They produce trajectories — sequences of clicks, keystrokes, and scrolls across live software. Testing them well requires real applications running in isolated environments, full action capture, provable verification, and tasks built by people who do the underlying work. Public benchmarks that skip any of these requirements produce scores that fail to predict real-world performance.

The fastest way to test your computer use agent is to run an eval on HUD yourself. With the open-source SDK and free cloud credits, you can score any model across SheetBench-50's 50 financial analyst tasks in minutes and replay every action it took. Building, evaluating, and improving a computer use agent all run on the same HUD platform — because testing and training are the same loop. For the training side, see how to train AI agents with reinforcement learning and GRPO training: what it is and how to run it.

FAQs

What is a computer use agent?

A computer use agent is an AI system that operates software through visual interfaces. It takes screenshots, decides what to click or type, and performs actions in real applications like browsers, spreadsheets, and desktop tools. The output is a sequence of UI actions, not a text response.

How do I run my first computer use agent evaluation?

Install the HUD CLI with uv tool install hud-python --python 3.12 --with anthropic, authenticate with hud set HUD_API_KEY=***, scaffold a starter environment with hud init demo --preset blank, then run hud eval tasks.py claude --gateway from inside the project. The --gateway flag routes Claude through HUD's inference endpoint, so a single HUD key is enough. To run SheetBench-50 itself, open hud.ai/tasksets in the dashboard and click Run Taskset.

What is OSWorld-Verified?

OSWorld-Verified is a benchmark of 369 desktop tasks across Chrome, LibreOffice, GIMP, and VS Code. HUD hosts the verified version, which fixes 300+ reproducibility issues from the original OSWorld release. It is the most widely used benchmark for general-purpose computer use agents.

SheetBench-50 vs SpreadsheetBench: what is the difference?

SheetBench-50 evaluates agents on operating Google Sheets through a real browser, using tasks validated by finance professionals from PwC, Cisco, Charles Schwab, and Fannie Mae. SpreadsheetBench scores agents on code that manipulates spreadsheet files, using tasks from public Excel forums. SheetBench-50 measures workflow completion in a real UI. SpreadsheetBench measures formula generation in a code editor.

What makes a good CUA eval different from chatbot evaluation?

Chatbot evaluations grade a single block of text against a reference answer. CUA evaluations have to score a sequence of 15 to 50 UI actions across live software, where success depends on the final state of a spreadsheet, ticket, or dashboard rather than a string match. A good CUA eval runs real software in an isolated environment, captures every action and screenshot, uses provable cell-level or state-level verification, and resets cleanly between runs.