Live Agent Benchmarks
Tyrum now supports a YAML-authored live benchmark format for scoring real agent runs with real LLMs.
The v1 implementation is CLI-driven:
```
tyrum-cli benchmark validate --suite <path>
tyrum-cli benchmark run --suite <path> --judge-model <provider/model> [--model <provider/model>] [--scenario <id>] [--repeat <n>] [--agent-key <key>] [--output <dir>]
```
What v1 does
- Loads and validates a benchmark suite from YAML with typed `@tyrum/contracts` schemas.
- Clones the target agent into an isolated temporary benchmark agent for each scenario run.
- Seeds prior conversations and secret refs on the temporary agent so benchmark runs do not pollute the source agent.
- Runs the scored prompt over the normal conversation WebSocket path.
- Captures trace data from existing gateway events:
`chat.ui-message.stream`, `tool.lifecycle`, `context_report.created`, `approval.updated`, and `transcript.get`.
- Normalizes deterministic metrics such as failed tool calls, browser usage, sandbox usage, approvals, secret tool calls, memory hits, and missing or forbidden tool families.
- Sends a structured judge packet to a separate temporary judge agent and maps the judge verdict to `passed`, `failed`, `inconclusive`, or `infrastructure_error`.
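The verdict mapping in the last step can be sketched as follows. This is a hedged illustration only: the type and function names (`BenchmarkResult`, `mapVerdict`, `infraFailure`) are invented for this sketch and are not Tyrum's actual API.

```typescript
// Illustrative sketch, not Tyrum's real implementation.
type BenchmarkResult = "passed" | "failed" | "inconclusive" | "infrastructure_error";

function mapVerdict(judgeVerdict: string, infraFailure: boolean): BenchmarkResult {
  // Infrastructure problems (e.g. no healthy desktop host) preempt the judge.
  if (infraFailure) return "infrastructure_error";
  switch (judgeVerdict) {
    case "pass":
      return "passed";
    case "fail":
      return "failed";
    default:
      // Anything the judge cannot decide is surfaced, not silently failed.
      return "inconclusive";
  }
}
```

The point of the fourth state is that deployment problems are kept distinct from agent behavior, so an unhealthy environment never counts against the agent under test.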
Sandbox browser expectation
Browser-heavy benchmarks should require the agent to request a managed sandbox rather than talking directly to a mocked browser tool.
Use these scenario flags:
- `environment.sandbox_request_required: true`
- `environment.browser_required: true`
- `environment.required_tool_families` including `sandbox` and `tool.browser`
When `sandbox_request_required` is set, the runner performs a preflight check against `desktopEnvironmentHosts.list` and fails the scenario as `infrastructure_error` if no healthy desktop environment host is available.
This means the benchmark expects the agent to:
- call `sandbox.request`
- wait for the sandbox-backed node attachment
- use browser tools through that sandbox-attached node
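Put together, a browser-heavy scenario's environment block might look like the following sketch. Only the three flags named above come from this document; the list form of `required_tool_families` and the family names `sandbox` and `tool.browser` are assumptions about the schema.

```yaml
# Sketch of a browser-heavy scenario's environment block (field layout assumed).
environment:
  sandbox_request_required: true   # runner preflights desktopEnvironmentHosts.list
  browser_required: true
  required_tool_families:
    - sandbox
    - tool.browser
```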
Current built-in artifacts
The v1 runner collects these built-in artifacts:
`final_reply`, `transcript`, `conversation`, `stream_events`, `tool_events`, `context_reports`, and `approval_events`.
If a scenario requires an artifact outside that built-in set, the runner records it as missing and passes that fact to the judge.
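That check amounts to a set difference between what the scenario requires and what the runner can collect. A minimal sketch, with invented names (`BUILT_IN_ARTIFACTS`, `missingArtifacts`) that are not part of the real runner:

```typescript
// Illustrative sketch of the built-in-artifact check; names are invented.
const BUILT_IN_ARTIFACTS = new Set([
  "final_reply", "transcript", "conversation", "stream_events",
  "tool_events", "context_reports", "approval_events",
]);

// Artifacts the scenario asks for that the runner cannot collect;
// these are recorded as missing and passed to the judge, not auto-failed.
function missingArtifacts(required: string[]): string[] {
  return required.filter((name) => !BUILT_IN_ARTIFACTS.has(name));
}
```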
Scenario shape
See `examples/core-live-v1.yaml` for a concrete suite.
Important fields:
- `defaults.agent_key`: source agent to clone unless overridden
- `seed.conversations[]`: prior conversations that should become memory
- `seed.secret_refs[]`: secret refs injected into the temporary benchmark agent config
- `environment.required_tool_families[]`: capability families that must appear in the trace
- `environment.disallowed_tool_families[]`: capability families that must not appear
- `environment.approval_mode`: `none` or `must_request_autoapprove`
- `trace.checks[]`: judge-visible checks such as `browser_used`, `sandbox_requested`, `no_unnecessary_questions`, and `checkout_completed`
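As a rough orientation, the fields above might be arranged like this. The authoritative layout is whatever the `@tyrum/contracts` schemas and `examples/core-live-v1.yaml` define; every concrete value below (the agent key, scenario id, check names) is a placeholder.

```yaml
# Hedged sketch of a suite; see examples/core-live-v1.yaml for the real layout.
defaults:
  agent_key: my-source-agent        # placeholder: agent to clone
scenarios:
  - id: example-scenario            # placeholder id
    seed:
      conversations: []             # prior conversations that become memory
      secret_refs: []               # must already exist in the gateway secret store
    environment:
      required_tool_families: []
      disallowed_tool_families: []
      approval_mode: none           # or must_request_autoapprove
    trace:
      checks:
        - browser_used
        - no_unnecessary_questions
```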
Judge policy
The runner treats the judge as the source of truth for pass, fail, and inconclusive.
The judge instructions explicitly say:
- tool-call ordering does not matter
- extra harmless tool use is a penalty, not an automatic failure
- unnecessary questions, unjustified refusal, wrong tool families, missing sandbox request, secret mishandling, and ungrounded success claims matter more than efficiency
- claimed completion without evidence is a failure
Notes
- `--judge-model` is required. `--model` is optional and overrides the cloned benchmark agent model for the run.
- The runner uses temporary agents and deletes them after the run on a best-effort basis.
- Secret refs referenced by a suite must already exist in the gateway secret store.
- For browser benchmarks, the deployment must already have a healthy managed desktop host available.