
Agent Infrastructure
TL;DR
Durable execution lands on Vercel. What it means for agents, long-running flows, and indie dev stacks - with code, gotchas, and where it fits the agent stack.
For three years now, every indie dev shipping an AI agent on Vercel has hit the same wall. Your function spins up, the agent calls a tool, the tool calls another tool, the loop ticks for ninety seconds, and then the runtime kills it. You retry. The model regenerates the same plan. You burn tokens. Your user waits.
Long-running agents have been a deploy nightmare on serverless. The mental model is wrong. A serverless function is a request handler. An agent is a process. The two have never fit together cleanly, and the workarounds - external queues, polling endpoints, status APIs cobbled together with KV stores - have all been variations of "rebuild a worker server, badly, in a serverless shape."
Vercel's new programming model for durable execution is the unlock. It is the first time the platform has shipped a primitive that maps cleanly onto the actual shape of an agent run. If you ship agents, this is the post you want to read this week.
Strip the marketing and the announcement is three things.
First, Vercel now supports durable functions as a first-class deployment target. A durable function looks like a regular async TypeScript function, but the runtime checkpoints its state at every await boundary. If the function gets killed - by timeout, by deploy, by infrastructure failure - it resumes from the last checkpoint instead of starting over. The state, including local variables, lives in Vercel-managed storage.
Second, the programming model is just JavaScript. There is no DSL, no graph builder, no YAML workflow definition. You write a function. You await steps. The runtime handles the rest. This is a meaningful difference from Inngest, Temporal, or AWS Step Functions, all of which require either a separate workflow definition or careful adherence to a step API.
Third, durable functions are integrated with Vercel's agent primitives, so streaming, tool calls, and the AI SDK plug in without ceremony. You can yield partial results to the client, persist progress to a durable store, and resume from a kill cleanly. The same function handles the user-facing stream and the long tail of tool calls that finish minutes later.
Here is a minimal durable agent function. It plans a multi-step task, executes each step, and survives crashes between steps.
```ts
import { durable, step } from "@vercel/functions/durable";
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";
import { webSearchTool } from "./tools"; // your own tool definition, defined elsewhere

export const POST = durable(async (req: Request) => {
  const { query } = await req.json();

  // Checkpoint: the plan is generated once and logged.
  const plan = await step("plan", async () => {
    const { text } = await generateText({
      model: openai("gpt-5"),
      prompt: `Break this into 3 to 5 sub-tasks: ${query}`,
    });
    return text.split("\n").filter(Boolean);
  });

  const results: string[] = [];
  for (const [i, task] of plan.entries()) {
    // One checkpoint per task: a crash here resumes at the failed step.
    const result = await step(`task-${i}`, async () => {
      const { text } = await generateText({
        model: openai("gpt-5"),
        tools: { search: webSearchTool },
        prompt: task,
      });
      return text;
    });
    results.push(result);
  }

  return Response.json({ plan, results });
});
```
Three things to notice.
The step() wrapper is the checkpoint boundary. Anything inside step() runs at most once per logical execution. If the function dies after step("task-2") and gets resumed, task-0 and task-1 come back from the durable log without re-running the model. You stop paying for the same tokens twice.
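To make the replay semantics concrete, here is a toy model of a durable log. This is an illustration only, not Vercel's implementation: a map from step names to serialized results stands in for the managed storage, and a second invocation of the same function stands in for a resume after a crash.

```typescript
// Toy model of step replay. A durable log maps step names to serialized
// results; on resume, completed steps return their logged value instead
// of re-running their body.
type DurableLog = Map<string, string>;

function makeStep(log: DurableLog) {
  return async <T>(name: string, fn: () => Promise<T>): Promise<T> => {
    if (log.has(name)) {
      // Resume path: replay the recorded result, skip the side effect.
      return JSON.parse(log.get(name)!) as T;
    }
    const result = await fn();
    log.set(name, JSON.stringify(result)); // checkpoint before continuing
    return result;
  };
}

// Simulate a crash between runs: the log survives, the process does not.
async function demo() {
  const log: DurableLog = new Map();
  let modelCalls = 0;
  const run = async () => {
    const step = makeStep(log);
    return step("plan", async () => {
      modelCalls++; // stands in for a paid model call
      return ["research", "draft", "review"];
    });
  };
  await run();              // first execution: the "model" is called
  const plan = await run(); // "resume": replayed from the log, no new call
  return { plan, modelCalls };
}
```

Running `demo()` shows the point of the whole design: the second execution replays the plan from the log, so `modelCalls` stays at one.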
The function body reads top to bottom. There is no graph, no state machine, no callback chain. If you can read async TypeScript, you can read this. That is the dev ergonomics win.
The return value still streams. The runtime is smart enough to flush partial results to the client while the durable execution continues server-side. From the user's perspective, this is a normal API response. From your perspective, it is a process that can run for an hour without losing state.
Step boundaries are sticky. Once you ship a step name, you cannot rename or remove it without breaking in-flight executions. Treat step names like database column names. Versioning the function helps, but design the step graph as if every name is permanent.
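One convention that makes this manageable (our own pattern, not an official Vercel API): prefix every step name with a workflow version, so in-flight executions started under the old version never see renamed steps.

```typescript
// Versioned step names: bump WORKFLOW_VERSION when the step graph changes,
// and let executions started under the previous version drain on old names.
const WORKFLOW_VERSION = "v2";

function stepName(name: string): string {
  return `${WORKFLOW_VERSION}:${name}`;
}

// Usage: await step(stepName("plan"), async () => { ... });
```

With this convention, a rename in v2 cannot collide with a v1 execution still replaying its log.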
Non-determinism outside steps is a footgun. Anything outside step() runs every time the function resumes. If you call Date.now() in the function body, you get a different value every resume, which corrupts the state machine. Push every side effect, every clock read, every random number into a step. The rule is: if it would change between runs, it goes in a step.
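Here is the clock-read case in code. A local cache stands in for the durable log (the real `step()` comes from Vercel's runtime; this sketch only illustrates the rule).

```typescript
// Local stand-in for the durable log, for illustration.
const cache = new Map<string, unknown>();
async function step<T>(name: string, fn: () => Promise<T>): Promise<T> {
  if (!cache.has(name)) cache.set(name, await fn());
  return cache.get(name) as T;
}

// BAD: `const startedAt = Date.now()` in the function body yields a
// different value on every resume. GOOD: checkpoint the clock read so
// every replay sees the same timestamp.
async function handler(): Promise<number> {
  const startedAt = await step("started-at", async () => Date.now());
  return startedAt;
}
```

Call `handler()` twice and you get the same timestamp both times, which is exactly what a resumed execution needs.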
Step output is serialized to JSON. You cannot return a class instance, a stream, or a function from a step. Plan your data model around plain objects. This catches teams accustomed to passing rich types around.
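A cheap way to catch serialization surprises during development is to round-trip step outputs through JSON yourself (plain TypeScript, no Vercel API involved):

```typescript
// Sanity-check that a step's return value survives JSON round-tripping.
// Dates, Maps, Sets, class instances, and functions will not.
function jsonRoundTrip<T>(value: T): T {
  return JSON.parse(JSON.stringify(value)) as T;
}

const ok = jsonRoundTrip({ id: 1, tags: ["a", "b"] }); // plain data: fine
const lossy = JSON.parse(JSON.stringify({ when: new Date(0) }));
// lossy.when is now an ISO string, not a Date instance
```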
Local dev is the same model but the durability is in-memory. Crashes during dev do not resume the way production does. Test the resume path with the deploy preview, not just vercel dev.
The honest take. Durable execution on Vercel is not a replacement for a real workflow engine if your workload is heavy event sourcing, complex retries with backoff trees, or human-in-the-loop with weeks-long pauses. Temporal still wins those.
But for the 80 percent case - an agent that plans, calls four to twelve tools, occasionally takes ten minutes to finish, and needs to survive a deploy - this is exactly the right primitive. You do not stand up a separate worker, you do not pay for a queue, you do not write a status polling endpoint. You write the function and ship.
Compared to Inngest, Vercel's model is more locked-in but better integrated. Inngest gives you a richer event model and a separate dashboard, at the cost of running their SDK and their cloud. Vercel's durable functions are tied to Vercel's runtime, which is fine if you already deploy there.
Compared to AWS Step Functions, this is a different universe. Step Functions is a configuration language. Vercel durable execution is just code. If your team is allergic to Amazon's State Language, this is your migration path.
The deepest critique is portability. The moment your step() calls and resume semantics depend on Vercel's runtime, you are tied to Vercel. For most indie devs this is fine. For an enterprise team that may need to leave the platform, factor that into the architecture decision.
We have been running durable execution against Overnight, our long-running task runner that lets you queue agent jobs and check back in the morning. The shape of the workload is exactly the shape Vercel optimized for: a queue of independent agent runs, each one ten to forty minutes of model calls and tool use, each one needing to survive infrastructure churn without losing token spend or partial outputs.
Before durable functions, the architecture was a separate worker process on Hetzner with a Postgres-backed queue. Cheap to run, fine to maintain, but it was a second deploy target with its own monitoring story. Moving the worker logic into durable functions collapsed two deploy units into one. The Postgres queue stayed - we still want a real queue for prioritization and rate limiting - but the worker that pulls a job and runs it is now a durable function on Vercel. Same cost ceiling, half the operational surface.
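The split looks roughly like this. The names are ours and an in-memory array stands in for the Postgres queue; in production the claim would be a `SELECT ... FOR UPDATE SKIP LOCKED`, and the claimed job would be handed to the durable function.

```typescript
// Sketch of the queue half of the pattern: a regular handler claims the
// highest-priority job, then hands it to the durable agent function.
type Job = { id: string; query: string; priority: number };

const queue: Job[] = []; // stand-in for the Postgres-backed queue

function claimNextJob(): Job | undefined {
  // Postgres version: UPDATE ... SET claimed = true WHERE id = (
  //   SELECT id FROM jobs ORDER BY priority DESC
  //   FOR UPDATE SKIP LOCKED LIMIT 1) RETURNING *;
  queue.sort((a, b) => b.priority - a.priority);
  return queue.shift();
}
```

Keeping prioritization and rate limiting in the queue, and only the agent loop in the durable function, is what let us drop the second deploy target without giving up queue semantics.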
For a more orchestration-heavy workload, look at Orchestrator, our toolkit for chaining agent runs with shared context. Orchestrator is built around the idea that one agent run feeds the next, and the whole chain needs to be resumable. Durable execution as a primitive makes that pattern significantly easier to ship without rolling your own checkpoint storage.
The integration pattern that has worked best is: keep the queue and the human-facing API as regular handlers, push the agent loop itself into a durable function, and stream partial results back through a Convex-backed reactive channel so the client UI updates without polling. That gives you the best of both worlds - durable backend, reactive frontend, no homemade infrastructure.
A few open questions are worth tracking.
Pricing at scale. Durable execution costs more per second than a regular function because of the checkpoint overhead. For short tasks the math is fine. For workloads that idle for hours waiting on external systems, the per-second cost can add up. Vercel has not published the pricing curve for very long-running durable functions yet, and the answer matters a lot for production economics.
Observability. The dashboard ships with a step-level timeline view, which is the right primitive. What is missing right now is a clean way to export step traces to OpenTelemetry or your own observability stack. If you already run Datadog or Honeycomb, plan on writing a forwarder for now.
Cold start on resume. There is real latency between "function got killed" and "function resumes from checkpoint." For interactive use cases this is fine - you tell the user the agent is working. For latency-sensitive flows, measure it and make sure it fits.
The bigger story to watch is whether other platforms follow. Cloudflare Workflows already exists in a more event-driven shape. AWS will eventually ship something competitive. The shape of durable execution as a deploy primitive is settling into a pattern, and once two or three platforms ship roughly compatible APIs, expect a portability layer to emerge.
We are doing a deeper hands-on walkthrough on the Developers Digest YouTube channel - building a durable agent from scratch, breaking it on purpose to show the resume path, and benchmarking it against a vanilla serverless implementation. If you ship agents on Vercel, the new programming model is worth an afternoon to internalize. The mental shift is small. The operational payoff is real.