WorkSignals Scheduler/Watchers Design
Goal: Implement durable, dedupe-safe evaluation + firing of WorkSignals (prospective memory) in the gateway, focusing on event-based triggers first. A firing must be durable across restarts, must not double-fire on replay/reconnect, and must emit work.signal.fired with stable identifiers.
Scope (v1)
- Event-based triggers (first pass): WorkItem status transitions
- Example: WorkItem enters
blocked|done|failed|cancelled
- Example: WorkItem enters
- On fire:
- Create a durable firing record (dedupe-safe).
- Mark the signal fired (durable, prevents repeat).
- Emit
work.signal.firedwith stable identifiers. - Enqueue explicit follow-up work by creating a
work_item_tasksrow for the owning WorkItem (when present).
Non-goals (v1)
- A full trigger DSL for all examples in #631 (approvals, artifacts, task completion). The design keeps the storage format open (
trigger_spec_json) but ships one trigger kind first. - Executing the follow-up work. v1 only creates a durable task record; later work can lease + run these tasks.
- Time-zone/local-time scheduling for time-based signals.
Proposed approaches
-
DB-backed polling scheduler (recommended)
- Periodic tick queries durable state (work item events) and derives firings.
- Pros: simplest path to “survives restarts”, easy to test deterministically (
tick()), no need to thread event subscriptions through many modules. - Cons: latency bounded by tick interval; queries must be bounded.
-
Event subscription + in-memory evaluation
- Subscribe to internal events (work item transitions, approvals, artifacts) and fire immediately.
- Pros: low latency.
- Cons: more invasive wiring; restart safety still needs durable dedupe/firing state.
-
Hybrid
- Subscribe for fast-path + periodic reconciliation poller.
- Pros: best reliability + latency.
- Cons: more moving parts (YAGNI for v1).
Architecture (v1)
Trigger spec (event-based, v1)
For v1, support a single event trigger spec:
{
"kind": "work_item.status.transition",
"to": ["blocked", "done", "failed", "cancelled"]
}
- The WorkItem id is taken from
work_signals.work_item_id(required for this trigger kind). - The spec is validated at runtime; invalid specs are ignored (best-effort).
Durable firing record
Add a new table work_signal_firings (SQLite + Postgres) to track firings with:
- DB-lease fields (
lease_owner,lease_expires_at_ms) for cluster safety. attempt+next_attempt_at_msfor bounded retries with backoff.dedupe_keyto guarantee at-most-once per signal + causal event.
For WorkItem status transition triggers, dedupe_key = work_item_events.event_id.
Scheduler / processor loop
Add WorkSignalScheduler (patterned after WatcherScheduler):
tick():- Query active WorkSignals (
status = 'active',trigger_kind = 'event'). - For each signal, query bounded
work_item_eventssincesignal.created_atand find the first matching status transition. - Create a durable firing row using
(signal_id, dedupe_key)uniqueness. - Claim and process queued firings using DB leases and retry/backoff.
- Query active WorkSignals (
On firing (processing step)
Processing a claimed firing:
- Re-read the firing row and the signal row to confirm:
- firing is still
processing - lease_owner matches this scheduler instance
- signal is still
active
- firing is still
- Mark the signal fired (
status='fired',last_fired_at=now). - If the signal has a
work_item_id, create awork_item_tasksrow to represent explicit follow-up work (execution wiring can come later). - Mark the firing row
enqueuedwith any metadata (error cleared). - Emit
work.signal.firedto WS operator clients with:signal_idfiring_id(stable; derived from signal + dedupe key)enqueued_job_id(omitted in v1 unless we later wire a real job)
Idempotency and correctness
- At-most-once: enforced by
UNIQUE(signal_id, dedupe_key)+ durable signal status flip tofired. - No double-fire on replay: the same causal event produces the same dedupe key, so a second evaluation creates no new firing.
- Cluster safety: DB leases prevent two schedulers from processing the same firing concurrently.
- Retries/backoff: failures re-queue with exponential backoff until
maxAttempts, then markfailed.
Rollback
- Disable scheduler startup (feature flag/env) while leaving persistence intact.
- If needed, revert the migration that adds
work_signal_firingsand remove scheduler wiring; WorkSignal CRUD remains unaffected.
Test plan
- Unit tests:
- A WorkSignal with
work_item.status.transitionfires when the WorkItem transitions toblockedand emits exactly onework.signal.fired. - Restart safety: creating a second scheduler instance does not re-fire.
- A WorkSignal with
- Contract/conformance:
- Add a WS contract test that drives
work.create+work.signal.create+work.transitionand asserts the emittedwork.signal.firedframe conforms to@tyrum/schemas.
- Add a WS contract test that drives