Checkpoint & resume
When is the storage overhead worth it?
Checkpointing writes one JSON snapshot per step. It is worth it when re-running an early step costs more than the storage round-trip, and not worth it for short, idempotent pipelines.
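As a rough way to apply that rule, the sketch below compares the total snapshot overhead against the work a resume would skip. The function names and the example timings are illustrative assumptions, not part of the library; measure your own pipeline before deciding.

```python
from typing import Sequence


def checkpoint_overhead_s(n_steps: int, write_s: float) -> float:
    """Total time spent writing snapshots: one write per step."""
    return n_steps * write_s


def work_saved_on_resume_s(step_costs_s: Sequence[float], failed_index: int) -> float:
    """Time spent on the steps before the failed one, which a resume does not repeat."""
    return sum(step_costs_s[:failed_index])


def checkpoint_worth_it(
    step_costs_s: Sequence[float], write_s: float, failed_index: int
) -> bool:
    """True when skipping the completed steps saves more than the snapshots cost."""
    saved = work_saved_on_resume_s(step_costs_s, failed_index)
    return saved > checkpoint_overhead_s(len(step_costs_s), write_s)


# Hypothetical timings in seconds.
long_pipeline = [120.0, 45.0, 3.0]
short_pipeline = [0.002, 0.001]

print(checkpoint_worth_it(long_pipeline, write_s=0.01, failed_index=2))   # True
print(checkpoint_worth_it(short_pipeline, write_s=0.01, failed_index=1))  # False
```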
Decision tree
Short-running pipeline, idempotent if rerun from scratch?
→ No checkpoint. Plan(*steps) — without store= / checkpoint_key=.
Long or expensive pipeline, partial-run survival matters?
→ Plan(*steps,
       store=Store(db="run.sqlite"),
       checkpoint_key="run-2026-04-30",
       resume=True)
  # Failed step retries on resume; "done" pipeline
  # short-circuits to the cached writes-bucket.
Pipeline waits on external events (webhook, human approval, retry queue)?
→ Same pattern; you split the run across processes and re-enter
with resume=True after the event is delivered.
Dev loop iterating on a specific step?
→ Pin upstream steps via the same checkpoint_key so you don't
re-pay for them on every iteration.
Need a user-visible history of every step's Envelope?
→ Checkpoint is minimal — only writes-bucket + next_step + status.
For full history use Session(exporters=[JsonFileExporter("…")])
and query session.events.query(...).
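The resume behaviour described in the tree (a failed step retries, a "done" pipeline short-circuits) can be pictured with a toy runner built only on the standard library. This is a sketch of the idea, not the library's implementation: `run_with_checkpoint`, the `checkpoints` table, and the `writes` field name are hypothetical, and only `next_step` and `status` are taken from the description above.

```python
import json
import sqlite3


def _save(conn, key, snap):
    """Persist the snapshot as one JSON blob under the checkpoint key."""
    conn.execute(
        "INSERT OR REPLACE INTO checkpoints (key, snapshot) VALUES (?, ?)",
        (key, json.dumps(snap)),
    )
    conn.commit()


def run_with_checkpoint(conn, key, steps):
    """Run (name, fn) steps in order, snapshotting after each one.

    Each fn takes the writes dict and returns an updated dict.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints (key TEXT PRIMARY KEY, snapshot TEXT)"
    )
    row = conn.execute(
        "SELECT snapshot FROM checkpoints WHERE key = ?", (key,)
    ).fetchone()
    names = [name for name, _ in steps]
    snap = json.loads(row[0]) if row else {
        "writes": {}, "next_step": names[0], "status": "running",
    }
    if snap["status"] == "done":
        return snap["writes"]  # finished run: return cached writes, run nothing

    for i in range(names.index(snap["next_step"]), len(steps)):
        _, fn = steps[i]
        try:
            snap["writes"] = fn(dict(snap["writes"]))
        except Exception:
            snap["status"] = "failed"  # next_step still names the failed step
            _save(conn, key, snap)
            raise
        last = i + 1 == len(steps)
        snap["next_step"] = None if last else names[i + 1]
        snap["status"] = "done" if last else "running"
        _save(conn, key, snap)
    return snap["writes"]


calls = {"fetch": 0, "score": 0}


def fetch(writes):
    calls["fetch"] += 1
    writes["rows"] = [1, 2, 3]
    return writes


def score(writes):
    calls["score"] += 1
    if calls["score"] == 1:
        raise RuntimeError("transient failure")  # fails on the first attempt only
    writes["total"] = sum(writes["rows"])
    return writes


conn = sqlite3.connect(":memory:")
steps = [("fetch", fetch), ("score", score)]
try:
    run_with_checkpoint(conn, "demo-run", steps)
except RuntimeError:
    pass  # first attempt stops at "score"
result = run_with_checkpoint(conn, "demo-run", steps)  # resumes at "score"
```

In this toy, the second call re-runs only `score`, so the upstream `fetch` step is paid for once. Re-using the same key while iterating on a later step gives the same effect as the dev-loop branch of the tree.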
Quick reference
| Situation | Use checkpoint? |
|---|---|
| Short, idempotent pipeline | No |
| Expensive, crash-prone, long-running | Yes — store= + checkpoint_key= + resume=True |
| Async / event-driven (re-enters across processes) | Yes — same pattern |
| Dev iteration loop on a specific step | Yes — pin upstream via checkpoint |
| Want a full run trace for audit | No — use Session + JsonFileExporter |
| Concurrent fan-out runs sharing one Plan shape | on_concurrent="fork" (no resume) |
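The difference between the two `on_concurrent` modes in the last row can be sketched with an in-memory model. Everything here is an assumption made for illustration: `claim_key`, the `ConcurrentRunError` stand-in, the `key/run_uid` format for forked keys, and the use of a plain set in place of real cross-process coordination.

```python
class ConcurrentRunError(RuntimeError):
    """Stand-in for the library's ConcurrentPlanRunError."""


def claim_key(active_keys, checkpoint_key, run_uid, on_concurrent="fail"):
    """Return the key a run should checkpoint under, or raise on collision."""
    if on_concurrent == "fork":
        # Hypothetical format: each run gets its own keyspace under the shared key.
        key = f"{checkpoint_key}/{run_uid}"
    elif on_concurrent == "fail":
        if checkpoint_key in active_keys:
            raise ConcurrentRunError(f"{checkpoint_key!r} is already claimed")
        key = checkpoint_key
    else:
        raise ValueError(f"unknown on_concurrent mode: {on_concurrent!r}")
    active_keys.add(key)
    return key


active = set()
first = claim_key(active, "nightly", run_uid="a1")
forked = claim_key(active, "nightly", run_uid="b2", on_concurrent="fork")
```

In this model a forked run writes under a key that changes with every run UID, so a later process has no stable key to look up, which is consistent with the "(no resume)" note in the table.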
Notes
- Checkpoint is minimal. One JSON write per step: the writes-bucket payload, the next step name, the status (claimed/running/failed/done), the run UID, and (v2 only) the serialised step-result history. Not a full audit trail; for that, pair Plan with Session.
- Concurrent runs sharing a key are serialised. The default on_concurrent="fail" raises ConcurrentPlanRunError on collision. Use on_concurrent="fork" to give each run its own keyspace (incompatible with resume=True).
- Checkpoint writes happen before durable Store writes. Eliminates double-writes on resume; the inverse trade-off is that a crash in the gap loses the durable Store value. The value still lives in the checkpoint's kv, so the Plan continues correctly, but sidecar consumers reading the Store directly should reconcile against the checkpoint snapshot.
- A failed parallel band points the checkpoint at the band's first step. The whole band re-runs cleanly so all sibling writes are produced consistently. Branches with non-idempotent side effects need idempotency keys.
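For the sidecar case in the third note, one possible reconciliation is to overlay the checkpoint's kv on whatever the Store returned, since the checkpoint is written first and is therefore at least as fresh. The sketch below assumes the snapshot is a JSON object with a top-level `kv` mapping; that layout and the `reconcile` helper are assumptions, so check the full reference for the persisted shape.

```python
import json


def reconcile(store_values: dict, checkpoint_snapshot_json: str) -> dict:
    """Overlay the checkpoint's kv on values read directly from the Store."""
    snapshot = json.loads(checkpoint_snapshot_json)
    merged = dict(store_values)            # leave the caller's dict untouched
    merged.update(snapshot.get("kv", {}))  # checkpoint values take precedence
    return merged


# Simulated crash in the gap: "total" reached the checkpoint but not the Store.
snapshot_json = json.dumps({"status": "running", "kv": {"rows": 3, "total": 6}})
store_values = {"rows": 3}

view = reconcile(store_values, snapshot_json)
print(view)  # {'rows': 3, 'total': 6}
```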
See also
- Checkpoint & resume — full reference: persisted shape, state transitions, sidecar consumer rules.
- Store — the durable layer behind checkpoints; SQLite WAL mode for thread-safe concurrent access.
- Parallel plan steps — band atomicity rules that drive the "next_step points to the band's first step" behaviour on failure.