Scaling and High Availability
Read this if you need the deployment mental model from laptop install to multi-instance cluster. Skip this if you need table-level mechanics first; use the drill-down docs.
Tyrum keeps one logical architecture across deployment sizes. The difference between single host and cluster is replica count and infrastructure shape, not control-plane semantics.
What This Page Covers
- Deployment roles and how they map to single-host vs cluster.
- The durability and coordination primitives that must remain true at every size.
- High-level failure posture for restart, failover, and reconnect scenarios.
Deployment Shapes
Single-host
- Typical form: one machine with co-located gateway edge, turn processor, worker, and scheduler roles.
- Durable state: SQLite StateStore on local persistent disk.
- Backplane/lease behavior still exists conceptually, but usually in uncontested form.
Clustered
- Typical form: replicated gateway edges, replicated workers, lease-coordinated schedulers, optional desktop-runtime role.
- Durable state: HA Postgres StateStore.
- Backplane/outbox handles cross-edge event delivery and directed routing.
- Workers and schedulers coordinate with durable leases/claims instead of in-memory ownership.
Runtime Roles
gateway-edge: long-lived WebSocket/HTTP edge, routing, auth, control-plane coordination.worker: claims and executes work attempts through tools, nodes, and MCP.scheduler: emits cron/watch/heartbeat triggers under DB lease coordination.desktop-runtime(optional): reconciles gateway-managed desktop environments.
Short Invariants (Must Hold Everywhere)
- Durable state is the source of truth; restarts cannot erase execution-critical state.
- Requests/events are retry-safe under at-least-once delivery.
- Execution that must be serialized stays serialized per conversation.
- Secrets remain handles until trusted execution boundaries resolve them.
- Workspace durability is explicit; ToolRunner is the writer boundary for workspace-backed steps.
Coordination Primitives
- StateStore durability: conversation, approval, turn, artifact, audit, and routing state survives process churn.
- Outbox/backplane delivery: cross-instance event routing stays at-least-once with dedupe safety.
- Claim/lease execution: workers claim work atomically and take over safely on expiry.
- Scheduler DB leases: trigger ownership is durable and failover-safe.
- Idempotency + retries: side-effecting work does not duplicate outcomes during retries.
Failure and Recovery Posture
- Edge failure: connections reconnect to healthy edges; state is re-derived from durable records.
- Worker failure: lease expiry enables takeover without replaying completed side effects.
- Scheduler failure: lease transfer prevents double-firing under multi-instance operation.
- Database/transient network issues: retries + durable contracts restore service without changing policy semantics.
Not In Scope Here
- Protocol wire contracts and payload catalogs.
- Component-level execution/approval internals.
- Table-level retention, schema, and index tuning mechanics.