Case study

Durable AI Workflows That Should Always Work (Temporal, Celery, Batch)

Software engineer focused on backend systems, product workflows, and integrations.

Problem / Why it mattered

The goal was to turn messy meeting-style inputs into structured artifacts that could safely update real systems like CRMs and ATS tools. These workflows were long-running, dependent on external APIs, and failure-prone. In this context, "usually works" was not acceptable: the system needed reliable execution, replayability, and clear auditability.

Constraints

  • Workloads had variable runtimes and were bound by the latency of external models and APIs.
  • Backfills had to run alongside live traffic without degrading user-facing throughput.
  • Each step had to be idempotent so retries or replays would not create duplicate side effects.
  • LLM output quality needed guardrails around accuracy, relevance, and timeliness.
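The idempotency constraint can be sketched as a key-based dedupe in front of every side effect. This is a minimal sketch, not the production implementation: `applied_keys` stands in for a durable store (in practice a database table with a unique constraint), and the step names and payload shape are hypothetical.

```python
import hashlib
import json

# Hypothetical durable store of already-applied writes; in production this
# would be a database table with a unique constraint, not an in-memory set.
applied_keys: set[str] = set()

def idempotency_key(step: str, payload: dict) -> str:
    """Derive a stable key from the step name and its canonicalized input."""
    blob = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(f"{step}:{blob}".encode()).hexdigest()

def write_once(step: str, payload: dict, side_effect) -> bool:
    """Apply side_effect at most once per (step, payload); return True if applied."""
    key = idempotency_key(step, payload)
    if key in applied_keys:
        return False  # retry or replay path: skip the duplicate write
    side_effect(payload)
    applied_keys.add(key)
    return True
```

On a retry or replay, a second call with the same step and payload becomes a no-op, so the downstream CRM or ATS sees exactly one write.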

What I changed (design + architecture)

  • Used Temporal for durable orchestration on critical long-running pipelines and Celery for best-effort background tasks.
  • Defined explicit workflow state transitions, per-activity timeouts, and retry policies.
  • Classified failures into retryable and non-retryable paths to prevent infinite retry loops.
  • Kept AI as a bounded step with structured outputs, validation, and fallback handling.
  • Implemented idempotent batch/backfill execution with resumable checkpoints and replay-safe recovery.
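The retry classification in the bullets above can be sketched as a bounded loop that fails fast on permanent errors. The error taxonomy, attempt budget, and backoff numbers here are illustrative assumptions; in the Temporal-backed pipelines the equivalent is expressed declaratively via per-activity timeouts and retry policies rather than a hand-rolled loop.

```python
import time

class RetryableError(Exception):
    """Transient failures: timeouts, rate limits, upstream 5xx."""

class NonRetryableError(Exception):
    """Permanent failures: validation errors, malformed requests."""

def run_with_policy(activity, *, max_attempts: int = 5, base_delay: float = 0.5):
    """Retry transient failures with exponential backoff; fail fast otherwise."""
    for attempt in range(1, max_attempts + 1):
        try:
            return activity()
        except NonRetryableError:
            raise  # terminal failure class: surface immediately, never loop
        except RetryableError:
            if attempt == max_attempts:
                raise  # retry budget exhausted -> terminal failure class
            time.sleep(base_delay * 2 ** (attempt - 1))
```

The classification is what prevents infinite retry loops: only errors explicitly marked transient consume the retry budget, and everything else terminates the run with a clear failure class.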

What I measured

  • Workflow completion rate, retry distribution, and terminal failure classes.
  • Backfill throughput, recovery time, and replay-safety outcomes.
  • Operator intervention frequency and mean time to recovery.
  • Structured output validity and downstream write success rate.
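As a sketch of how the first of these metrics can be derived from run records; the record shape (a status string plus an attempt count per execution) is a hypothetical assumption, not the actual telemetry schema.

```python
from collections import Counter

# Hypothetical run records: (terminal status, attempts) per workflow execution.
runs = [
    ("completed", 1), ("completed", 3), ("completed", 1),
    ("failed_non_retryable", 1), ("failed_retry_exhausted", 5),
]

completion_rate = sum(1 for status, _ in runs if status == "completed") / len(runs)
retry_distribution = Counter(attempts for _, attempts in runs)
terminal_failures = Counter(status for status, _ in runs if status != "completed")
```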

Result

The pipeline moved from fragile background jobs to durable execution with a clean recovery model. Operators gained traceable run IDs and replay workflows, while users received more consistent artifacts that could be written into downstream systems with less manual repair.

What I’d do next

Expand workflow-level SLO dashboards, deepen regression evals for prompt/model changes, and automate anomaly detection for long-tail failures.