WorkSignals Scheduler/Watchers Design

Goal: Implement durable, dedupe-safe evaluation + firing of WorkSignals (prospective memory) in the gateway, focusing on event-based triggers first. A firing must be durable across restarts, must not double-fire on replay/reconnect, and must emit work.signal.fired with stable identifiers.

Scope (v1)

Event-based triggers (first pass): WorkItem status transitions
- Example: WorkItem enters blocked|done|failed|cancelled
On fire:
- Create a durable firing record (dedupe-safe).
- Mark the signal fired (durable, prevents repeat).
- Emit work.signal.fired with stable identifiers.
- Enqueue explicit follow-up work by creating a work_item_tasks row for the owning WorkItem (when present).

Non-goals (v1)

A full trigger DSL for all examples in #631 (approvals, artifacts, task completion). The design keeps the storage format open (trigger_spec_json) but ships one trigger kind first.
Executing the follow-up work. v1 only creates a durable task record; later work can lease + run these tasks.
Time-zone/local-time scheduling for time-based signals.

Proposed approaches

DB-backed polling scheduler (recommended)
- Periodic tick queries durable state (work item events) and derives firings.
- Pros: simplest path to “survives restarts”, easy to test deterministically (tick()), no need to thread event subscriptions through many modules.
- Cons: latency bounded by tick interval; queries must be bounded.
Event subscription + in-memory evaluation
- Subscribe to internal events (work item transitions, approvals, artifacts) and fire immediately.
- Pros: low latency.
- Cons: more invasive wiring; restart safety still needs durable dedupe/firing state.
Hybrid
- Subscribe for fast-path + periodic reconciliation poller.
- Pros: best reliability + latency.
- Cons: more moving parts (YAGNI for v1).

Architecture (v1)

Trigger spec (event-based, v1)

For v1, support a single event trigger spec:

{
  "kind": "work_item.status.transition",
  "to": ["blocked", "done", "failed", "cancelled"]
}

The WorkItem id is taken from work_signals.work_item_id (required for this trigger kind).
The spec is validated at runtime; a signal with an invalid spec is skipped (best-effort).
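
A minimal sketch of what runtime validation of trigger_spec_json could look like. The names WorkItemStatusTransitionSpec and parseTriggerSpec are hypothetical, and the set of allowed status strings is not defined in this document, so the sketch only checks that "to" is a non-empty array of non-empty strings.

```typescript
// Hypothetical type for the single v1 trigger kind described above.
interface WorkItemStatusTransitionSpec {
  kind: "work_item.status.transition";
  to: string[];
}

// Returns the parsed spec, or null when the JSON is malformed or does not
// match the v1 shape. Returning null lets the caller skip the signal
// (best-effort) instead of throwing.
function parseTriggerSpec(json: string): WorkItemStatusTransitionSpec | null {
  let raw: unknown;
  try {
    raw = JSON.parse(json);
  } catch {
    return null;
  }
  if (typeof raw !== "object" || raw === null) return null;

  const candidate = raw as { kind?: unknown; to?: unknown };
  if (candidate.kind !== "work_item.status.transition") return null;
  if (!Array.isArray(candidate.to) || candidate.to.length === 0) return null;

  const to: string[] = [];
  for (const status of candidate.to) {
    // Assumption: any non-empty string is accepted as a status name.
    if (typeof status !== "string" || status.length === 0) return null;
    to.push(status);
  }
  return { kind: "work_item.status.transition", to };
}
```

A caller in tick() would treat a null result as "skip this signal" and move on to the next one.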

Durable firing record

Add a new table work_signal_firings (SQLite + Postgres) to track firings with:

DB-lease fields (lease_owner, lease_expires_at_ms) for cluster safety.
attempt + next_attempt_at_ms for bounded retries with backoff.
dedupe_key to guarantee at most one firing per (signal, causal event) pair.

For WorkItem status transition triggers, dedupe_key = work_item_events.event_id.
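
To illustrate how the uniqueness rule behaves, here is a small in-memory model of the firing table. It is only a sketch: the real table lives in SQLite/Postgres, the status column and the class name InMemoryFiringStore are assumptions, and the Map key stands in for a UNIQUE(signal_id, dedupe_key) constraint.

```typescript
// Status values mirror the states mentioned in this document; the column
// name "status" itself is an assumption.
type FiringStatus = "queued" | "processing" | "enqueued" | "failed";

interface WorkSignalFiringRow {
  signal_id: string;
  dedupe_key: string;
  status: FiringStatus;
  attempt: number;
  next_attempt_at_ms: number;
  lease_owner: string | null;
  lease_expires_at_ms: number | null;
}

// Hypothetical in-memory stand-in for the work_signal_firings table.
class InMemoryFiringStore {
  private readonly rows = new Map<string, WorkSignalFiringRow>();

  // Behaves like an insert that does nothing on a uniqueness conflict:
  // the second insert for the same (signal_id, dedupe_key) is a no-op.
  insertIfAbsent(
    signalId: string,
    dedupeKey: string,
    nowMs: number,
  ): { row: WorkSignalFiringRow; created: boolean } {
    // JSON-encoding the pair keeps ("a", "b:c") distinct from ("a:b", "c").
    const key = JSON.stringify([signalId, dedupeKey]);
    const existing = this.rows.get(key);
    if (existing !== undefined) return { row: existing, created: false };

    const row: WorkSignalFiringRow = {
      signal_id: signalId,
      dedupe_key: dedupeKey,
      status: "queued",
      attempt: 0,
      next_attempt_at_ms: nowMs, // eligible for processing immediately
      lease_owner: null,
      lease_expires_at_ms: null,
    };
    this.rows.set(key, row);
    return { row, created: true };
  }

  get size(): number {
    return this.rows.size;
  }
}
```

Evaluating the same work_item_events.event_id twice for the same signal therefore yields one row, which is the property the replay/reconnect requirement depends on.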

Scheduler / processor loop

Add WorkSignalScheduler (patterned after WatcherScheduler):

tick():
1. Query active WorkSignals (status = 'active', trigger_kind = 'event').
2. For each signal, query a bounded window of work_item_events created since signal.created_at and find the first matching status transition.
3. Create a durable firing row using (signal_id, dedupe_key) uniqueness.
4. Claim and process queued firings using DB leases and retry/backoff.
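
Step 2 can be sketched as a pure function over already-loaded events. The field names to_status and created_at_ms, the inclusive comparison against the signal's creation time, and the function name findFirstMatchingTransition are all assumptions; in the gateway this would more likely be a SQL query with a LIMIT.

```typescript
// Hypothetical shape of a work_item_events row; only event_id is named in
// this document, the other fields are assumptions.
interface WorkItemEvent {
  event_id: string;
  work_item_id: string;
  to_status: string;
  created_at_ms: number;
}

interface SignalForEvaluation {
  work_item_id: string;
  created_at_ms: number;
}

// Returns the earliest event for the signal's WorkItem, at or after the
// signal's creation time, whose target status is listed in `to`. Only the
// first `limit` events in the window are inspected, so work per tick is bounded.
function findFirstMatchingTransition(
  events: WorkItemEvent[],
  signal: SignalForEvaluation,
  to: string[],
  limit: number,
): WorkItemEvent | null {
  const window = events
    .filter(
      (e) =>
        e.work_item_id === signal.work_item_id &&
        e.created_at_ms >= signal.created_at_ms,
    )
    // Order by time, then event_id, so repeated evaluations pick the same event.
    .sort(
      (a, b) =>
        a.created_at_ms - b.created_at_ms ||
        (a.event_id < b.event_id ? -1 : a.event_id > b.event_id ? 1 : 0),
    )
    .slice(0, limit);

  for (const event of window) {
    if (to.indexOf(event.to_status) !== -1) return event;
  }
  return null;
}
```

Because the window is capped, a match that lies beyond the first `limit` events is not found in this sketch; a real implementation would need a cursor or a status filter in the query to avoid missing it.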

On firing (processing step)

Processing a claimed firing:

Re-read the firing row and the signal row to confirm:
- firing is still processing
- lease_owner matches this scheduler instance
- signal is still active
Mark the signal fired (status='fired', last_fired_at=now).
If the signal has a work_item_id, create a work_item_tasks row to represent explicit follow-up work (execution wiring can come later).
Mark the firing row enqueued, record any metadata, and clear any previous error.
Emit work.signal.fired to WS operator clients with:
- signal_id
- firing_id (stable; derived from signal + dedupe key)
- enqueued_job_id (omitted in v1 unless we later wire a real job)

Idempotency and correctness

At-most-once: enforced by UNIQUE(signal_id, dedupe_key) together with the durable flip of the signal status to fired.
No double-fire on replay: the same causal event produces the same dedupe key, so a second evaluation creates no new firing.
Cluster safety: DB leases prevent two schedulers from processing the same firing concurrently.
Retries/backoff: failures re-queue with exponential backoff until maxAttempts, then mark failed.
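
The lease and retry rules above can be sketched as two small pure functions. The policy field names (baseDelayMs, maxDelayMs), the doubling schedule, the cap, and the function names canClaim and onFailure are assumptions; only maxAttempts, attempt, next_attempt_at_ms, lease_owner and lease_expires_at_ms come from this document.

```typescript
interface LeaseFields {
  lease_owner: string | null;
  lease_expires_at_ms: number | null;
}

// A firing is claimable when nobody holds it or the holder's lease has
// expired. In a real database this check must be part of one atomic
// conditional UPDATE, otherwise two schedulers could both pass it.
function canClaim(row: LeaseFields, nowMs: number): boolean {
  return (
    row.lease_owner === null ||
    row.lease_expires_at_ms === null ||
    row.lease_expires_at_ms <= nowMs
  );
}

interface RetryPolicy {
  baseDelayMs: number;
  maxDelayMs: number;
  maxAttempts: number;
}

type RetryDecision =
  | { status: "queued"; attempt: number; next_attempt_at_ms: number }
  | { status: "failed"; attempt: number };

// `attempt` is the number of attempts made before the one that just failed.
// The delay doubles per prior attempt and is capped at maxDelayMs.
function onFailure(
  attempt: number,
  nowMs: number,
  policy: RetryPolicy,
): RetryDecision {
  const attemptsMade = attempt + 1;
  if (attemptsMade >= policy.maxAttempts) {
    return { status: "failed", attempt: attemptsMade };
  }
  const delayMs = Math.min(
    policy.maxDelayMs,
    policy.baseDelayMs * Math.pow(2, attempt),
  );
  return {
    status: "queued",
    attempt: attemptsMade,
    next_attempt_at_ms: nowMs + delayMs,
  };
}
```

With baseDelayMs = 1000 and maxAttempts = 3, the first two failures re-queue after 1 s and 2 s, and the third marks the firing failed.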

Rollback

Disable scheduler startup (feature flag/env) while leaving persistence intact.
If needed, revert the migration that adds work_signal_firings and remove scheduler wiring; WorkSignal CRUD remains unaffected.

Test plan

Unit tests:
- A WorkSignal with work_item.status.transition fires when the WorkItem transitions to blocked and emits exactly one work.signal.fired.
- Restart safety: creating a second scheduler instance does not re-fire.
Contract/conformance:
- Add a WS contract test that drives work.create + work.signal.create + work.transition and asserts the emitted work.signal.fired frame conforms to @tyrum/schemas.
