Evaluating Agents on Financial Analyst Workflows (SheetBench)
A case study on developing evaluations for agent performance on finance analyst jobs.
Despite the accelerating deployment of AI agents into enterprise workflows, there remains a comparative lack of tools and benchmarks for evaluating their performance on real-world work automation. Even as interactive environments for multi-modal, long-horizon tasks mature, there are few resources for determining how proficient agents are at economically relevant work. Why? A few common patterns recur:
- Economic signal. Most public and private evaluations measure breadth, or how well an agent performs across many basic domains. In practice, however, enterprises mitigate risk through narrow automation. Specialist knowledge of processes is thus crucial but less tested.
- Robustness. Many existing public benchmarks sit in community repositories or ad-hoc projects. Whether maintained by groups or evolving over time, they often suffer from execution issues, reproducibility gaps, and brittle code. That is to say, they are not always production-ready.
- Adoption complexity. Common benchmarks for AI agents, especially Computer Use Agents, have bespoke harnesses and deployment patterns, requiring enterprises to spend engineering resources setting up environments and running their agents against them. For common evals such as WebArena and OSWorld, the implementation cost is on the order of weeks; for private benchmarks, it can easily stretch to months.
These patterns hurt both the AI labs developing models and the enterprises looking to adopt them: the former benefit from understanding a model's trajectory, the latter from understanding its reliability. Alternatives, moreover, are in short supply: in-house evaluation infrastructure is expensive to build and maintain, and qualitative trials raise contract, privacy, and timeline concerns.
To address this gap, we at HUD develop private environments and evaluations that let enterprises benchmark and improve agents on aspects of their workflows. We also support major public benchmarks on our platform, such as OSWorld, and we're now adding SheetBench-50 to that list for anyone to run. We developed SheetBench-50 in collaboration with Sepal AI and are excited to release the first public financial-analyst-grade benchmark on spreadsheets.
SheetBench-50
Spreadsheets compress a surprising amount of intelligence: data cleaning, joins, time arithmetic, numerics, formatting, UI manipulation (pivots, filters, dropdowns), and mathematical modeling. This domain requires extreme precision and attention to detail, which is why we partnered with experienced finance professionals from firms such as PwC, Cisco, Charles Schwab, and Fannie Mae (including CFO-level reviewers) to develop these evaluations. At the core of developing production-grade financial analyst evaluations, we considered the following factors:
- Realism – Tasks reflect workflows finance teams actually run (not toy CSV puzzles).
- Instruction-following – Agents need to follow precise constraints (% vs. decimal, thousands separators, currency symbols), both in the prompt and in the input data.
- Provable verification – Cell-level answer checks with unambiguous instructions (a minimal sketch of what such a check can look like follows this list).
- Tool dexterity – Agents must use spreadsheet tools like a user: create pivots, set formats, and carefully select ranges and formulas.
- Transparency – Every run on our platform captures telemetry and trajectories (actions, reasoning, and screenshots) for auditing how and why agents fail.
- Blind validation – During task creation, three experts independently solved each task without seeing the answer. Working with Sepal AI, we repeated this process and constrained each task until it admitted a single verifiable, reproducible solution with consensus among multiple human experts.
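To make the verification criterion concrete, here is a minimal sketch of what a cell-level answer check can look like. It is an illustration, not HUD's actual grader: the cell references, tolerance, and values are assumptions, and the real checks run against the live Google Sheets document on our platform.

```python
from math import isclose

def check_answers(produced: dict[str, object], expected: dict[str, object],
                  rel_tol: float = 1e-6) -> dict[str, bool]:
    """Compare the agent's answer cells against the ground truth.

    `produced` and `expected` map cell references (e.g. "ANSWER!A1") to
    values. Numbers are compared with a small relative tolerance; everything
    else (tickers, formatted strings) must match exactly.
    """
    results = {}
    for cell, want in expected.items():
        got = produced.get(cell)
        if isinstance(want, (int, float)) and isinstance(got, (int, float)):
            results[cell] = isclose(got, want, rel_tol=rel_tol)
        else:
            results[cell] = got == want
    return results

# Example with made-up values: a ticker in A1, a volume in B1, and a
# next-day price change % in C1, as in the example task later in this post.
expected = {"ANSWER!A1": "ACME", "ANSWER!B1": 1_250_000, "ANSWER!C1": 0.0342}
produced = {"ANSWER!A1": "ACME", "ANSWER!B1": 1_250_000, "ANSWER!C1": 0.0341}
print(check_answers(produced, expected))
# {'ANSWER!A1': True, 'ANSWER!B1': True, 'ANSWER!C1': False}
```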
Our public benchmark, SheetBench-50, is composed of 50 spreadsheet-based tasks that together capture the most common workflows analysts encounter in their daily practice, spanning three categories:
- Data Preparation and Hygiene (29): de-duping, type normalization, time parsing, joins/FX conversions, pivots.
- Derivations & Extraction (16): correlations, z-scores, grouping logic, compliance filters (e.g., 1099).
- Modeling & Forecasts (5): revenue/breakeven projections, amortization, depreciation, scenario tables.
This distribution ensures that the evaluation covers both routine operations and higher-order analytical tasks, while requiring intelligent problem-solving along the way: multi-step intra-sheet logic, cross-sheet reasoning, and multi-cell outputs. This focus on financial reasoning and sequential calculation is vital for assessing enterprise value, and it is relatively novel in a benchmark space more familiar with formula translation and basic cell entry. While the required approaches are complex, the solutions themselves are unambiguous: each task is paired with a single reproducible ground truth, blindly derived by a financial analyst and then verified by consensus of a team from HUD and Sepal AI.
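To give a flavor of the sequential calculation the Modeling & Forecasts tasks involve, here is a hypothetical loan-amortization schedule in Python. It is not taken from the benchmark; the loan terms and output format are illustrative only.

```python
def amortization_schedule(principal: float, annual_rate: float, months: int):
    """Yield (month, payment, interest, principal_paid, balance) rows."""
    r = annual_rate / 12  # monthly interest rate
    # Fixed monthly payment for a fully amortizing loan.
    payment = principal / months if r == 0 else principal * r / (1 - (1 + r) ** -months)
    balance = principal
    for month in range(1, months + 1):
        interest = balance * r               # interest accrued this month
        principal_paid = payment - interest  # remainder reduces the balance
        balance -= principal_paid
        yield month, payment, interest, principal_paid, max(balance, 0.0)

# e.g. a $10,000 loan at 6% annual interest repaid over 12 months
for row in amortization_schedule(10_000, 0.06, 12):
    print("{:>2}  {:8.2f}  {:7.2f}  {:8.2f}  {:9.2f}".format(*row))
```

Each row depends on the previous balance, which is exactly the kind of chained, row-by-row logic a spreadsheet agent has to reproduce with formulas rather than a single lookup.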
Example Task
For the ticker that has the greatest correlation between volume and next day price change % find the day with the greatest volume and the next days price change %
- put the ticker in ANSWER A1
- put the volume in ANSWER B1
- put the next day price change in ANSWER C1
NOTE:
- use CORREL to determine correlation for each ticker group
- be sure to first sort the date by ticker z to a and then date ascending before calculating nextdaypricechange %
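As a rough sketch of the logic this task asks for, here is one way to compute the answer in Python with pandas. The column names (ticker, date, volume, close) and the CSV layout are assumptions for illustration; in the benchmark, the agent must carry out the same steps inside Google Sheets using CORREL and cell formulas.

```python
import pandas as pd

# Hypothetical long-format input: one row per ticker per trading day.
df = pd.read_csv("prices.csv", parse_dates=["date"])

# Sort by ticker Z to A, then by date ascending, as the task instructs.
df = df.sort_values(["ticker", "date"], ascending=[False, True])

# Next-day price change % for each day, computed within each ticker group.
df["next_day_change_pct"] = (
    df.groupby("ticker")["close"].shift(-1) / df["close"] - 1
) * 100

# Correlation between volume and next-day change per ticker (CORREL equivalent).
corr = (
    df.dropna(subset=["next_day_change_pct"])
      .groupby("ticker")
      .apply(lambda g: g["volume"].corr(g["next_day_change_pct"]))
)

# Ticker with the greatest correlation, then its highest-volume day.
best_ticker = corr.idxmax()
row = df[df["ticker"] == best_ticker].nlargest(1, "volume").iloc[0]
print(best_ticker, row["volume"], row["next_day_change_pct"])  # -> A1, B1, C1
```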
To replicate a task from the benchmark, run these commands:
uv tool install hud-python@latest
hud set HUD_API_KEY=... ANTHROPIC_API_KEY=...
hud eval hud-evals/SheetBench-50 claude
You can view an example scorecard for Claude's performance on SheetBench-50, or check out the full leaderboard to see how different models compare.
Or to see the full dataset:
hud get hud-evals/SheetBench-50
Comparing SpreadsheetBench with SheetBench‑50
SpreadsheetBench is a recent dataset that evaluates large‑language‑model (LLM) agents on spreadsheet manipulation. It uses 912 instructions sourced from Excel forums and creates three test spreadsheets per instruction to ensure that generated code generalizes across different data. The benchmark claims to reflect real user demands by pulling problems from sites like excelforum.com. Its evaluation is "Online Judge"‑style: for each instruction, an agent must produce code that transforms the sheet into the correct output across all test cases.
We bring up this example to show the difference between developing such a dataset for public use and developing an enterprise-focused suite of evaluations. Several community reviews and academic papers have surfaced notable shortcomings in SpreadsheetBench's design and scope. Understanding these issues is key to appreciating how HUD takes a different approach to evaluation creation:
- Simplified tasks and ignored spreadsheet complexity. The authors themselves note that many existing benchmarks (including those they build on) synthesize queries or use crowd‑workers, resulting in tasks that are much simpler than real forum questions. Even in SpreadsheetBench, each instruction is tested against only a few spreadsheets, and the evaluation is restricted to code‑generated answers, leaving out interactive operations like pivot tables, charts, or formatting. Other researchers have pointed out that SpreadsheetBench primarily tests formula generation and does not assess auditing, symbolic reasoning or domain‑specific modelling, so it cannot tell you whether an agent can actually reason about a model's correctness or adjust a forecast.
- Oversimplified data structures. Many spreadsheets in prior benchmarks contain only a single relational table; they ignore nested headers, missing column labels and cells containing free‑form text. SpreadsheetBench attempts to incorporate more flexible layouts, yet it still filters out posts that lack acknowledged responses or are difficult to formalize, meaning that some realistic but messy problems are excluded.
- Not focused on enterprise workflows. SpreadsheetBench is broad; its tasks are drawn from general Excel help forums, not professional analysts. As a result, it lacks domain‑specific modelling tasks and financial realism. A recent EuSpRIG (European Spreadsheet Risk Interest Group) report criticizes SpreadsheetBench for focusing almost entirely on formula generation and ignoring auditing or sector‑specific logic.
- Lack of motivation and realism. We observe, in general, that SpreadsheetBench and similar datasets do not paint a realistic picture of spreadsheet creation: SheetCopilotBench deals with charts and pivots in overly simple workbooks, whereas SpreadsheetAgent's evaluation focuses on basic data entry.
HUD's evaluations are built expressly to overcome these limitations and to measure whether agents can complete end-to-end workflows in real environments. SheetBench-50, for example, was created by financial analysts for real-world workflow automation. In contrast to SpreadsheetBench, we hand-curated 50 tasks from our work with finance professionals to reflect realistic workflows: amortization schedules, foreign-exchange conversions, z-score analyses, correlation studies, pivot-table reporting, and scenario modelling. The tasks themselves have been blindly replicated by independent annotators to ensure there is only one correct result, and they span domain-specific modelling and forecasting, including loans, depreciation, revenue forecasting, churn analysis, and breakeven projections. These require financial reasoning and sequential calculations, not just formula translation.
Furthermore, every SheetBench-50 run records the agent's sequence of actions, reasoning traces, and screenshots on our platform. Scores are not just plain numbers; you can replay the trajectory to see exactly what the agent did. Our evaluation harness runs in a real Chrome environment using Google Sheets and is easy to extend to any other enterprise environment that runs in a browser. In short, SpreadsheetBench advanced the field by showing that LLMs still struggle with formulas, but it remains a narrow, code-centric benchmark. SheetBench-50 takes the next step: measuring whether agents can complete real analyst tasks in real spreadsheets.
Getting Started
To quickly run SheetBench-50, validate our results, or evaluate your own agent, you can get started by checking out our open-source evaluations and environments SDK:
- Get Keys:
- HUD API key → https://www.hud.ai/project/api-keys
- Anthropic API key → https://console.anthropic.com/settings/keys
- Run:
uv tool install hud-python@latest
hud set HUD_API_KEY=... ANTHROPIC_API_KEY=...
hud eval hud-evals/SheetBench-50 claude
If you want more examples beyond financial analyst work, ping us; we have hundreds more high-quality case studies on how we develop environments to benchmark and train agents to automate enterprise work.
Work with Us
HUD builds the infrastructure for reproducible, high-signal agent evaluations: hosted environments, high-quality tasks, and complete telemetry on all agent actions. We collaborate directly with top labs and enterprises to co-design evals and RL environments that move the state of the art.
If you're a lab or enterprise, we can easily work our platform into your agentic stack. Feel free to reach out to founders@hud.ai.