Live Agent Benchmarks

Tyrum now supports a YAML-authored live benchmark format for scoring real agent runs with real LLMs.

The v1 implementation is CLI-driven:

tyrum-cli benchmark validate --suite <path>
tyrum-cli benchmark run --suite <path> --judge-model <provider/model> [--model <provider/model>] [--scenario <id>] [--repeat <n>] [--agent-key <key>] [--output <dir>]

What v1 does

Loads and validates a benchmark suite from YAML with typed @tyrum/contracts schemas.
Clones the target agent into an isolated temporary benchmark agent for each scenario run.
Seeds prior conversations and secret refs on the temporary agent so benchmark runs do not pollute the source agent.
Runs the scored prompt over the normal conversation WebSocket path.
Captures trace data from existing gateway events:
- chat.ui-message.stream
- tool.lifecycle
- context_report.created
- approval.updated
- transcript.get
Normalizes deterministic metrics such as failed tool calls, browser usage, sandbox usage, approvals, secret tool calls, memory hits, and missing or forbidden tool families.
Sends a structured judge packet to a separate temporary judge agent and maps the judge verdict to passed, failed, inconclusive, or infrastructure_error.

Sandbox browser expectation

Browser-heavy benchmarks should require the agent to request a managed sandbox rather than talking directly to a mocked browser tool.

Use these scenario flags:

environment.sandbox_request_required: true
environment.browser_required: true
environment.required_tool_families including:
- sandbox.
- tool.browser.

When sandbox_request_required is set, the runner performs a preflight check against desktopEnvironmentHosts.list and fails the scenario as infrastructure_error if no healthy desktop environment host is available.

This means the benchmark expects the agent to:

call sandbox.request
wait for the sandbox-backed node attachment
use browser tools through that sandbox-attached node

Current built-in artifacts

The v1 runner collects these built-in artifacts:

final_reply
transcript
conversation
stream_events
tool_events
context_reports
approval_events

If a scenario requires an artifact outside that built-in set, the runner records it as missing and passes that fact to the judge.

Scenario shape

See examples/core-live-v1.yaml for a concrete suite.

Important fields:

defaults.agent_key: source agent to clone unless overridden
seed.conversations[]: prior conversations that should become memory
seed.secret_refs[]: secret refs injected into the temporary benchmark agent config
environment.required_tool_families[]: capability families that must appear in the trace
environment.disallowed_tool_families[]: capability families that must not appear
environment.approval_mode: none or must_request_autoapprove
trace.checks[]: judge-visible checks such as browser_used, sandbox_requested, no_unnecessary_questions, checkout_completed

Judge policy

The runner treats the judge as the source of truth for pass, fail, and inconclusive.

The judge instructions explicitly say:

tool-call ordering does not matter
extra harmless tool use is a penalty, not an automatic failure
unnecessary questions, unjustified refusal, wrong tool families, missing sandbox request, secret mishandling, and ungrounded success claims matter more than efficiency
claimed completion without evidence is a failure

Notes

--judge-model is required.
--model is optional and overrides the cloned benchmark agent model for the run.
The runner uses temporary agents and deletes them after the run on a best-effort basis.
Secret refs referenced by a suite must already exist in the gateway secret store.
For browser benchmarks, the deployment must already have a healthy managed desktop host available.

What v1 does​

Sandbox browser expectation​

Current built-in artifacts​

Scenario shape​

Judge policy​