
TL;DR
Durable execution lands on Vercel. What it means for agents, long-running flows, and indie dev stacks - with code, gotchas, and where it fits the agent stack.
| Resource | Link |
|---|---|
| Vercel durable execution announcement | A New Programming Model for Durable Execution |
| Vercel AI SDK | Vercel AI SDK docs |
| Temporal workflow engine | Temporal docs |
| Inngest durable functions | Inngest docs |
| AWS Step Functions | AWS Step Functions developer guide |
| Cloudflare Workflows | Cloudflare Workflows docs |
For three years now, every indie dev shipping an AI agent on Vercel has hit the same wall. Your function spins up, the agent calls a tool, the tool calls another tool, the loop ticks for ninety seconds, and then the runtime kills it. You retry. The model regenerates the same plan. You burn tokens. Your user waits.
Long-running agents have been a deploy nightmare on serverless. The mental model is wrong. A serverless function is a request handler. An agent is a process. The two have never fit together cleanly, and the workarounds - external queues, polling endpoints, status APIs cobbled together with KV stores - have all been variations of "rebuild a worker server, badly, in a serverless shape."
Vercel's new programming model for durable execution is the unlock. It is the first time the platform has shipped a primitive that maps cleanly onto the actual shape of an agent run. If you ship agents, this is the post you want to read this week.
Strip the marketing and the announcement is three things.
First, Vercel now supports durable functions as a first-class deployment target. A durable function looks like a regular async TypeScript function, but the runtime checkpoints its progress each time an awaited step completes. If the function gets killed - by timeout, by deploy, by infrastructure failure - it resumes from the last checkpoint instead of starting over. The checkpointed results, from which the function's local variables are rebuilt on resume, live in Vercel-managed storage.
Second, the programming model is just JavaScript. There is no DSL, no graph builder, no YAML workflow definition. You write a function. You await steps. The runtime handles the rest. This is a meaningful difference from Inngest, Temporal, or AWS Step Functions, all of which require either a separate workflow definition or careful adherence to a step API.
Third, durable functions are integrated with Vercel's agent primitives, so streaming, tool calls, and the AI SDK plug in without ceremony. You can yield partial results to the client, persist progress to a durable store, and resume from a kill cleanly. The same function handles the user-facing stream and the long tail of tool calls that finish minutes later.
Here is a minimal durable agent function. It plans a multi-step task, executes each step, and survives crashes between steps.
```typescript
import { durable, step } from "@vercel/functions/durable";
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";
// Hypothetical local module: the snippet uses webSearchTool without defining it.
import { webSearchTool } from "./tools";

export const POST = durable(async (req: Request) => {
  const { query } = await req.json();

  // Checkpointed under the name "plan": one model call that returns the sub-tasks.
  const plan = await step("plan", async () => {
    const { text } = await generateText({
      model: openai("gpt-5"),
      prompt: `Break this into 3 to 5 sub-tasks: ${query}`,
    });
    // One sub-task per non-empty line of the model output.
    return text.split("\n").filter(Boolean);
  });

  const results: string[] = [];
  for (const [i, task] of plan.entries()) {
    // One checkpointed step per sub-task, named by its index.
    const result = await step(`task-${i}`, async () => {
      const { text } = await generateText({
        model: openai("gpt-5"),
        tools: { search: webSearchTool },
        prompt: task,
      });
      return text;
    });
    results.push(result);
  }

  return Response.json({ plan, results });
});
```
Three things to notice.
The step() wrapper is the checkpoint boundary. Once a step has completed, it does not run again within the same logical execution. If the function dies during step("task-2") and gets resumed, task-0 and task-1 come back from the durable log without re-running the model. You stop paying for the same tokens twice.
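To make the replay behavior concrete, here is a toy in-memory model of a step log. It is a sketch for building intuition, not Vercel's implementation: `StepLog`, `createRun`, `agent`, and `demo` are hypothetical names, and the "model call" is simulated by pushing to an array.

```typescript
// Toy model of step replay. NOT Vercel's runtime; all names are hypothetical.
type StepLog = Map<string, unknown>;

// Returns a step() function bound to one logical execution's log.
function createRun(log: StepLog) {
  return async function step<T>(name: string, fn: () => Promise<T>): Promise<T> {
    if (log.has(name)) {
      // Completed on an earlier attempt: return the recorded result, skip fn.
      return log.get(name) as T;
    }
    const result = await fn();
    // Record only after fn succeeds, so a crash mid-step means the step re-runs.
    log.set(name, result);
    return result;
  };
}

// Simulated agent body with three steps and an optional injected crash.
async function agent(log: StepLog, calls: string[], crashAt?: string): Promise<string[]> {
  const step = createRun(log);
  const results: string[] = [];
  for (const name of ["task-0", "task-1", "task-2"]) {
    const out = await step(name, async () => {
      if (name === crashAt) throw new Error(`killed during ${name}`);
      calls.push(name); // stands in for a paid model call
      return `${name} done`;
    });
    results.push(out);
  }
  return results;
}

async function demo(): Promise<{ calls: string[]; results: string[] }> {
  const log: StepLog = new Map(); // survives across attempts
  const calls: string[] = [];
  try {
    await agent(log, calls, "task-2"); // first attempt dies inside task-2
  } catch {
    // a real runtime would schedule a resume here
  }
  const results = await agent(log, calls); // resume: task-0 and task-1 replay from the log
  return { calls, results };
}
```

In this model each simulated model call happens exactly once across both attempts, because the resume reads task-0 and task-1 from the log and only executes task-2.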
The function body reads top to bottom. There is no graph, no state machine, no callback chain. If you can read async TypeScript, you can read this. That is the dev ergonomics win.
The return value still streams. The runtime is smart enough to flush partial results to the client while the durable execution continues server-side. From the user's perspective, this is a normal API response. From your perspective, it is a process that can run for an hour without losing state.
Step boundaries are sticky. Once you ship a step name, you cannot rename or remove it without breaking in-flight executions. Treat step names like database column names. Versioning the function helps, but design the step graph as if every name is permanent.
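One way to catch an accidental rename before it ships is to compare the step names already recorded by in-flight executions against the names the new code will request. The helper below is a hypothetical sketch: `diffStepNames` is not part of any SDK, and how you would obtain the recorded names from the platform is left as an assumption.

```typescript
// Hypothetical pre-deploy check; not part of Vercel's SDK.
// recorded: step names found in the logs of in-flight executions.
// next:     step names the new version of the function will request.
function diffStepNames(recorded: string[], next: string[]) {
  const nextSet = new Set(next);
  const recordedSet = new Set(recorded);
  return {
    // Recorded results the new code will never read again.
    orphaned: recorded.filter((name) => !nextSet.has(name)),
    // Names with no recorded result: these steps will execute on resume.
    unrecorded: next.filter((name) => !recordedSet.has(name)),
  };
}

const report = diffStepNames(
  ["plan", "task-0", "task-1"],
  ["outline", "task-0", "task-1", "task-2"],
);
// report.orphaned   -> ["plan"]
// report.unrecorded -> ["outline", "task-2"]
```

A rename shows up as a pair: one orphaned name ("plan") and one unrecorded name ("outline"). A name that is only unrecorded, like "task-2", is usually just a step that has not run yet.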
Non-determinism outside steps is a footgun. Anything outside step() runs again every time the function resumes. If you call Date.now() in the function body, you get a different value on every resume, so the replayed run can diverge from what the durable log recorded. Push every side effect, every clock read, every random number into a step. The rule is: if it would change between runs, it goes in a step.
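The difference is easy to see in a small simulation. This is a toy replay model under stated assumptions, not Vercel's runtime: `runOnce` and `demoClock` are hypothetical, the step function is synchronous for brevity, and the clock is injected so the example is deterministic.

```typescript
// Toy replay model; NOT Vercel's runtime. `clock` is an injected time source.
function runOnce(log: Map<string, unknown>, clock: () => number) {
  function step<T>(name: string, fn: () => T): T {
    if (log.has(name)) return log.get(name) as T; // replay recorded value
    const value = fn();
    log.set(name, value);
    return value;
  }

  const outside = clock();                          // re-evaluated on every resume
  const inside = step("started-at", () => clock()); // recorded once, replayed afterwards
  return { outside, inside };
}

function demoClock() {
  const log = new Map<string, unknown>(); // survives across attempts
  let now = 1000;
  const clock = () => now;

  const first = runOnce(log, clock);
  now = 5000; // time passes before the resume
  const resumed = runOnce(log, clock);

  return { first, resumed };
}
// demoClock().first   -> { outside: 1000, inside: 1000 }
// demoClock().resumed -> { outside: 5000, inside: 1000 }
```

The value read inside the step is stable across the resume, while the value read in the function body changes, which is exactly the kind of drift the rule is meant to prevent.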
Step output is serialized to JSON. You cannot return a class instance, a stream, or a function from a step. Plan your data model around plain objects. This catches teams accustomed to passing rich types around.
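A quick way to see what this costs you is to push a rich object through a JSON round trip, which is roughly what a JSON-backed step log would hand back on replay. The sketch below uses only standard `JSON` behavior; `SearchHit`, `SearchHitRecord`, `toRecord`, and `fromRecord` are hypothetical names for illustration.

```typescript
// A "rich" type with a method and a Date field.
class SearchHit {
  url: string;
  fetchedAt: Date;
  constructor(url: string, fetchedAt: Date) {
    this.url = url;
    this.fetchedAt = fetchedAt;
  }
  label(): string {
    return `${this.url} @ ${this.fetchedAt.toISOString()}`;
  }
}

// Approximates what comes back from a JSON-serialized step result.
function roundTrip(value: unknown): unknown {
  return JSON.parse(JSON.stringify(value));
}

// Plain-data shape that survives the round trip unchanged.
interface SearchHitRecord {
  url: string;
  fetchedAt: string; // ISO 8601 timestamp
}

// Convert to plain data before returning from a step...
function toRecord(hit: SearchHit): SearchHitRecord {
  return { url: hit.url, fetchedAt: hit.fetchedAt.toISOString() };
}

// ...and rebuild the rich object after the step returns.
function fromRecord(rec: SearchHitRecord): SearchHit {
  return new SearchHit(rec.url, new Date(rec.fetchedAt));
}
```

Round-tripping a `SearchHit` directly yields a plain object: the `label` method is gone and `fetchedAt` is a string. Returning `toRecord(hit)` from the step and calling `fromRecord` outside it keeps the rich type at the edges and plain data in the log.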
Local dev is the same model but the durability is in-memory. Crashes during dev do not resume the way production does. Test the resume path with the deploy preview, not just vercel dev.
The honest take. Durable execution on Vercel is not a replacement for a real workflow engine if your workload is heavy event sourcing, complex retries with backoff trees, or human-in-the-loop with weeks-long pauses. Temporal still wins those.
But for the 80 percent case - an agent that plans, calls four to twelve tools, occasionally takes ten minutes to finish, and needs to survive a deploy - this is exactly the right primitive. You do not stand up a separate worker, you do not pay for a queue, you do not write a status polling endpoint. You write the function and ship.
Compared to Inngest, Vercel's model is more locked-in but better integrated. Inngest gives you a richer event model and a separate dashboard, at the cost of running their SDK and their cloud. Vercel's durable functions are tied to Vercel's runtime, which is fine if you already deploy there.
Compared to AWS Step Functions, this is a different universe. Step Functions is a configuration language. Vercel durable execution is just code. If your team is allergic to Amazon States Language, this is your migration path.
The deepest critique is portability. The moment your step() calls and resume semantics depend on Vercel's runtime, you are tied to Vercel. For most indie devs this is fine. For an enterprise team that may need to leave the platform, factor that into the architecture decision.
We have been running durable execution against Overnight, our long-running task runner that lets you queue agent jobs and check back in the morning. The shape of the workload is exactly the shape Vercel optimized for: a queue of independent agent runs, each one ten to forty minutes of model calls and tool use, each one needing to survive infrastructure churn without losing token spend or partial outputs.
Before durable functions, the architecture was a separate worker process on Hetzner with a Postgres-backed queue. Cheap to run, fine to maintain, but it was a second deploy target with its own monitoring story. Moving the worker logic into durable functions collapsed two deploy units into one. The Postgres queue stayed - we still want a real queue for prioritization and rate limiting - but the worker that pulls a job and runs it is now a durable function on Vercel. Same cost ceiling, half the operational surface.
For a more orchestration-heavy workload, look at Orchestrator, our toolkit for chaining agent runs with shared context. Orchestrator is built around the idea that one agent run feeds the next, and the whole chain needs to be resumable. Durable execution as a primitive makes that pattern significantly easier to ship without rolling your own checkpoint storage.
The integration pattern that has worked best is: keep the queue and the human-facing API as regular handlers, push the agent loop itself into a durable function, and stream partial results back through a Convex-backed reactive channel so the client UI updates without polling. That gives you the best of both worlds - durable backend, reactive frontend, no homemade infrastructure.
A few open questions are worth tracking.
Pricing at scale. Durable execution costs more per second than a regular function because of the checkpoint overhead. For short tasks the math is fine. For workloads that idle for hours waiting on external systems, the per-second cost can add up. Vercel has not published the pricing curve for very long-running durable functions yet, and the answer matters a lot for production economics.
Observability. The dashboard ships with a step-level timeline view, which is the right primitive. What is missing right now is a clean way to export step traces to OpenTelemetry or your own observability stack. If you already run Datadog or Honeycomb, plan on writing a forwarder for now.
Cold start on resume. There is real latency between "function got killed" and "function resumes from checkpoint." For interactive use cases this is fine - you tell the user the agent is working. For latency-sensitive flows, measure it and make sure it fits.
The bigger story to watch is whether other platforms follow. Cloudflare Workflows already exists in a more event-driven shape. AWS will eventually ship something competitive. The shape of durable execution as a deploy primitive is settling into a pattern, and once two or three platforms ship roughly compatible APIs, expect a portability layer to emerge.
We are doing a deeper hands-on walkthrough on the Developers Digest YouTube channel - building a durable agent from scratch, breaking it on purpose to show the resume path, and benchmarking it against a vanilla serverless implementation. If you ship agents on Vercel, the new programming model is worth an afternoon to internalize. The mental shift is small. The operational payoff is real.
Durable execution is a programming model where your function's progress is automatically checkpointed each time an awaited step completes. If the function gets killed by a timeout, deploy, or infrastructure failure, it resumes from the last checkpoint instead of starting over. On Vercel, this means you can write long-running agent workflows as regular TypeScript functions without worrying about losing progress.
Vercel's model is pure TypeScript with no DSL or workflow definition language. You write a normal async function and wrap side effects in step() calls. Temporal is also code-based, but it splits your code into workflows and activities and runs on dedicated infrastructure. Inngest uses a step-based API with their own cloud and dashboard. Vercel's approach is more locked-in to their platform but better integrated if you already deploy there.
Durable execution costs more per second than regular serverless functions due to checkpoint overhead. For agent workloads that run 5 to 40 minutes, the math usually works out fine - you are paying for execution time you would have burned anyway, plus minimal storage for checkpoints. For workloads that idle for hours waiting on external systems, the per-second cost can add up. Vercel has not published detailed pricing curves for very long-running durable functions yet.
Three main gotchas: step names are permanent and cannot be renamed without breaking in-flight executions; non-determinism outside steps corrupts state (push Date.now(), random numbers, and all side effects into steps); and step output must be JSON-serializable (no class instances, streams, or functions). Test resume behavior in deploy previews, not just local dev.
Durable functions can still stream. The runtime flushes partial results to the client while durable execution continues server-side. From the user's perspective it looks like a normal streaming API response. From your perspective, it is a process that can run for an hour without losing state. Combine with Convex or a similar reactive backend for real-time UI updates without polling.
Use durable execution when your agent runs longer than your serverless timeout (typically 10 to 60 seconds on Vercel), calls multiple tools in sequence, or needs to survive deploys and infrastructure churn. Use regular functions for simple request-response handlers, quick LLM calls, or anything that completes in under a minute. The checkpoint overhead is not worth it for fast operations.
Steps that fail can be retried according to your logic - wrap the step body in try-catch and decide whether to retry, skip, or abort. The durable log tracks which steps completed successfully, so a retry of the overall function skips completed steps automatically. For complex retry logic with backoff trees or dead-letter queues, consider Temporal or Inngest instead.
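For the simple case, the try-catch logic can live in a small helper that you call from inside a step body. The sketch below is an assumption about how you might structure it, not an SDK feature: `withRetry` and its options are hypothetical, and `sleep` is injectable so the backoff can be tested without real delays.

```typescript
// Hypothetical retry helper with exponential backoff; not part of any SDK.
async function withRetry<T>(
  fn: (attempt: number) => Promise<T>,
  opts: { retries: number; baseMs: number; sleep?: (ms: number) => Promise<void> },
): Promise<T> {
  // Default sleep uses a timer; tests can inject a fake.
  const sleep =
    opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => { setTimeout(resolve, ms); }));
  let lastError: unknown;

  // One initial attempt plus up to `retries` further attempts.
  for (let attempt = 0; attempt <= opts.retries; attempt++) {
    try {
      return await fn(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < opts.retries) {
        // Wait baseMs, 2*baseMs, 4*baseMs, ... between attempts.
        await sleep(opts.baseMs * Math.pow(2, attempt));
      }
    }
  }
  // Every attempt failed: surface the last error so the caller can skip or abort.
  throw lastError;
}
```

Used inside a step, for example `step("task-0", () => withRetry(() => callModel(task), { retries: 3, baseMs: 500 }))` with a hypothetical `callModel`, the retries happen within a single step execution, so only the final successful result would be written to the durable log.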
Durable functions are not directly portable to other platforms. The step() API and resume semantics are specific to Vercel's runtime. If you need to migrate, you would rewrite the durable function as a Temporal workflow, Inngest step function, or AWS Step Function. For most indie devs, Vercel lock-in is acceptable. For enterprise teams that may need to leave the platform, factor portability into your architecture decision before committing to durable execution.