Assistants to Responses API: A Migration Field Guide

Developers Digest • April 29, 2026 • 13 min read


TL;DR

OpenAI is sunsetting the Assistants API in 2026. Here is a tested migration plan to the Responses API — code, state, threads, tools, every cliff I hit, in order.

The Deprecation Timeline

OpenAI confirmed the Assistants API sunset in the developer changelog: the API is feature-frozen now, with full shutdown in 2026. Threads, runs, run steps, and the assistant resource itself all go away. Files and vector stores survive as standalone APIs that the Responses API uses directly. Function calling survives, but the function-tool schema is flatter. Code Interpreter and File Search survive as built-in tools on Responses.

If you are running production code against client.beta.threads.* today, you have homework. I had a 14-month-old Assistants codebase running newsletter automation, customer support triage, and a chunk of internal ops. I have now migrated all of it. This is the field guide: every cliff I hit, in order, with the code diffs that worked.

For the visual walkthrough including the eval harness I used to gate the cutover, see the DevDigest YouTube channel.

Conceptual Diff: Threads vs. State

The Assistants API was server-stateful. You created a thread, posted messages, kicked off runs, polled for completion, and OpenAI held the conversation history. Your code did not own the state.

The Responses API makes continuity explicit. Each call returns a response.id; pass it as previous_response_id on the next call and the server picks up the stored chain. Stored responses (store defaults to true) are kept for 30 days. After that, or if you set store: false, you reconstruct the context from your own DB and pass the message array explicitly.

This is the right design — server-only state was a footgun for compliance, debugging, and multi-region — but it changes how you think about every conversation:

Assistants	Responses
`threads.create()`	nothing; just call `responses.create()`
`threads.messages.create()`	include the message in the `input` array
`runs.create()` + poll	`responses.create()` returns the finished response, or streams it
`run.required_action`	`function_call` items in `response.output`; send results back as `function_call_output` input items
`assistants.create()`	`instructions` + `tools` + model config, passed on every call

The big mental shift: there is no assistant object anymore. The "assistant" is your prompt template + tool list + model config, which you supply per call. This is why I version mine in Promptlock — the prompt is now a first-class artifact in your repo, not a row in OpenAI's database.
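With no assistant object, what used to be its server-side config becomes a value in your repo, rebuilt into request parameters on every turn. A minimal sketch, assuming a hypothetical AssistantConfig type and buildTurnParams helper (neither comes from the OpenAI SDK); the field names on the returned object match the responses.create parameters used in this guide. Instructions are resent each turn because the Responses API does not carry them over from a previous response:

```typescript
// Hypothetical function-tool type in the flat Responses shape.
interface FunctionTool {
  type: "function";
  name: string;
  description?: string;
  parameters: Record<string, unknown>;
}

// Hypothetical repo-side replacement for the old assistant object.
interface AssistantConfig {
  model: string;
  instructions: string;
  tools: FunctionTool[];
  temperature?: number;
}

// Parameters for one responses.create call, named as in the Responses API.
interface TurnParams {
  model: string;
  instructions: string;
  input: string;
  tools: FunctionTool[];
  temperature?: number;
  previous_response_id?: string;
  store: boolean;
}

// Hypothetical helper: rebuilds the full per-call config for every turn.
function buildTurnParams(
  config: AssistantConfig,
  userMessage: string,
  previousResponseId?: string,
): TurnParams {
  const params: TurnParams = {
    model: config.model,
    instructions: config.instructions, // resent every turn
    input: userMessage,
    tools: config.tools,
    store: true,
  };
  if (config.temperature !== undefined) params.temperature = config.temperature;
  // Only chain when there is a prior turn to chain to.
  if (previousResponseId) params.previous_response_id = previousResponseId;
  return params;
}

// Placeholder config and response id, for illustration only.
const SUPPORT_BOT: AssistantConfig = {
  model: "gpt-5.5",
  instructions: "You triage customer support tickets.",
  tools: [],
};

const firstTurn = buildTurnParams(SUPPORT_BOT, "My invoice is wrong");
const secondTurn = buildTurnParams(SUPPORT_BOT, "It was last month", "resp_123");
```

Versioning AssistantConfig alongside the prompt file keeps model, instructions, and tools reviewable in the same diff.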

Code-Level Migration

Here is the minimal-diff before/after for a single conversation turn. The "before" is the standard Assistants pattern most of us wrote in 2024:

// BEFORE — Assistants API
const thread = await client.beta.threads.create();
await client.beta.threads.messages.create(thread.id, {
  role: "user",
  content: userMessage,
});
const run = await client.beta.threads.runs.createAndPoll(thread.id, {
  assistant_id: ASSISTANT_ID,
});
const messages = await client.beta.threads.messages.list(thread.id);
const reply = messages.data[0].content[0].text.value;

// AFTER — Responses API
const response = await client.responses.create({
  model: "gpt-5.5",
  instructions: SYSTEM_PROMPT,
  input: userMessage,
  tools: TOOLS,
  previous_response_id: priorResponseId, // null on first turn
  store: true, // 30-day server retention
});
const reply = response.output_text;
const newResponseId = response.id; // persist for next turn

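The "after" version is shorter, needs no polling on the happy path, and the two things you need to continue the conversation live in places you control: the response.id in your DB row and SYSTEM_PROMPT in your prompt repo. Pass instructions and tools on every turn; instructions from a previous response are not carried over when you chain with previous_response_id.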


State and History Handling

This is where I lost the most time. Three patterns I now use:

Pattern 1: Short-lived chains (default). Persist previous_response_id against your conversation row. On each turn, pass it. Trust OpenAI's 30-day retention. This is what most apps want.

await db.conversation.update({
  where: { id: convId },
  data: { lastResponseId: response.id },
});

Pattern 2: Long-lived or compliance-bound chains. Do not rely on server retention. Store every message in your DB and pass them explicitly:

const response = await client.responses.create({
  model: "gpt-5.5",
  instructions: SYSTEM_PROMPT,
  input: messages.map((m) => ({ role: m.role, content: m.content })),
  store: false, // do not retain server-side
});
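Pattern 2 only works if the stored rows are mapped into input messages in chronological order, with the new user message appended last. A minimal sketch, assuming a hypothetical StoredMessage row with a createdAt timestamp; the role values follow the Responses API message roles:

```typescript
// Hypothetical DB row for one stored message.
interface StoredMessage {
  role: "user" | "assistant" | "system" | "developer";
  content: string;
  createdAt: number; // epoch milliseconds
}

// One message item in the shape passed in the `input` array.
interface InputMessage {
  role: StoredMessage["role"];
  content: string;
}

// Builds the explicit input for a store: false call: oldest first,
// new user message last. Sorting a copy leaves the caller's array untouched.
function buildExplicitInput(history: StoredMessage[], userMessage: string): InputMessage[] {
  const ordered = [...history].sort((a, b) => a.createdAt - b.createdAt);
  const input: InputMessage[] = ordered.map((m) => ({ role: m.role, content: m.content }));
  input.push({ role: "user", content: userMessage });
  return input;
}

// Rows as they might come back from a query with no ORDER BY.
const history: StoredMessage[] = [
  { role: "assistant", content: "Which invoice?", createdAt: 2000 },
  { role: "user", content: "My invoice is wrong", createdAt: 1000 },
];
const input = buildExplicitInput(history, "The March one");
```

After the call returns, write both the new user message and response.output_text back as rows so the next turn sees them.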

Pattern 3: Hybrid. Short-lived state via previous_response_id, but you also write every input/output to your DB for replay and eval purposes. This is what I run in production. It is the only pattern that gives you both ergonomic continuity and full-control debugging.

The cliff I hit: I assumed previous_response_id would still work after 31 days. It does not — the server returns a 404. Wrap every call in a fallback that reconstructs from your DB if the chain is missing.
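Here is one way to combine Pattern 3 with that fallback. A minimal sketch, assuming a hypothetical ResponsesClient interface (a narrow stand-in for client.responses.create, so the logic runs without network access) and a hypothetical in-memory ConversationStore in place of your DB. The 404 check assumes the SDK's API errors expose a numeric status field:

```typescript
// Narrow stand-in for the slice of client.responses.create used here.
interface CreateParams {
  model: string;
  instructions: string;
  input: string | { role: "user" | "assistant"; content: string }[];
  previous_response_id?: string;
  store: boolean;
}
interface ResponsesClient {
  create(params: CreateParams): Promise<{ id: string; output_text: string }>;
}

// Hypothetical per-conversation record: the chain pointer plus a full log.
interface ConversationStore {
  lastResponseId?: string;
  log: { role: "user" | "assistant"; content: string }[];
}

function isNotFound(err: unknown): boolean {
  return typeof err === "object" && err !== null && (err as { status?: unknown }).status === 404;
}

async function hybridTurn(
  client: ResponsesClient,
  conv: ConversationStore,
  base: { model: string; instructions: string },
  userMessage: string,
): Promise<string> {
  let response: { id: string; output_text: string };
  try {
    // Happy path: chain on the stored id (undefined on the first turn).
    response = await client.create({
      ...base,
      input: userMessage,
      previous_response_id: conv.lastResponseId,
      store: true,
    });
  } catch (err) {
    if (!isNotFound(err)) throw err;
    // Chain expired or missing: replay the full history from our own log.
    response = await client.create({
      ...base,
      input: [...conv.log, { role: "user", content: userMessage }],
      store: true,
    });
  }
  // Pattern 3: always log both sides and move the chain pointer forward.
  conv.log.push({ role: "user", content: userMessage });
  conv.log.push({ role: "assistant", content: response.output_text });
  conv.lastResponseId = response.id;
  return response.output_text;
}
```

In production, client would be an object whose create delegates to client.responses.create, and conv would be loaded from and saved back to your DB.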

Tool-Use Parity

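Function calling works, but the function-tool schema is flatter: name, description, and parameters move to the top level of each tool object instead of sitting under a nested function key. The other big differences: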

Built-in tools. code_interpreter and file_search are now first-class tools you enable per call. No more attaching them to an assistant.
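Parallel tool calls. parallel_tool_calls is on by default, so a single response can carry several function calls. If your old code assumed serial tool execution and the new handler fans calls out, audit your handlers: they will now run in parallel.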
Streaming tool calls. You can stream tool-call deltas, which means you can render "agent is calling tool X..." in real time. Assistants forced you to wait for requires_action.
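The function-tool schema change is mechanical enough to automate during migration. A minimal sketch, assuming your existing definitions use the nested Assistants format; toResponsesTool is a hypothetical helper that handles function tools only, since built-in tools like file_search need their own per-call settings on Responses:

```typescript
// Nested function-tool shape used by the Assistants API.
interface NestedFunctionTool {
  type: "function";
  function: {
    name: string;
    description?: string;
    parameters?: Record<string, unknown>;
    strict?: boolean;
  };
}

// Flat function-tool shape used by the Responses API.
interface FlatFunctionTool {
  type: "function";
  name: string;
  description?: string;
  parameters: Record<string, unknown>;
  strict?: boolean;
}

// Hypothetical migration helper: hoists the nested fields to the top level.
function toResponsesTool(tool: NestedFunctionTool): FlatFunctionTool {
  const { name, description, parameters, strict } = tool.function;
  const flat: FlatFunctionTool = {
    type: "function",
    name,
    // A definition with no parameters becomes an empty object schema.
    parameters: parameters ?? { type: "object", properties: {} },
  };
  if (description !== undefined) flat.description = description;
  if (strict !== undefined) flat.strict = strict;
  return flat;
}

const legacy: NestedFunctionTool = {
  type: "function",
  function: {
    name: "update_ticket",
    description: "Update a support ticket's status",
    parameters: {
      type: "object",
      properties: { ticketId: { type: "string" }, status: { type: "string" } },
      required: ["ticketId", "status"],
    },
  },
};
const migrated = toResponsesTool(legacy);
```

Running it over the TOOLS array once, and committing the output, keeps the converted schemas reviewable.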

Here is the parallel-tool gotcha. In Assistants, this code was safe:

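// Assistants: implicit serial
const toolOutputs = [];
for (const call of run.required_action.submit_tool_outputs.tool_calls) {
  const output = await runTool(call); // safe, one at a time
  toolOutputs.push({ tool_call_id: call.id, output: JSON.stringify(output) });
}
// toolOutputs then goes back to the run via the submit-tool-outputs endpoint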

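In Responses, tool calls arrive as function_call items in response.output, and a single response can contain several. If you run them concurrently and runTool is not idempotent or hits a rate-limited downstream, cap the concurrency, then send the results back as function_call_output items in the input of the next responses.create call, chained with previous_response_id: response.id:

import pLimit from "p-limit";
const limit = pLimit(3); // at most 3 tool calls in flight at once
// Each requested tool call is a "function_call" item in response.output
const calls = response.output.filter((item) => item.type === "function_call");
const outputs = await Promise.all(
  calls.map((call) =>
    limit(async () => ({
      type: "function_call_output",
      call_id: call.call_id, // pairs the result with its request
      output: JSON.stringify(await runTool(call)),
    }))
  )
);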
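If you would rather not add a dependency, the cap is small enough to write yourself. A minimal sketch of a concurrency-limited map standing in for p-limit (mapWithLimit is a hypothetical name, not a library API); it runs at most limit tasks at once and returns results in input order, like Promise.all:

```typescript
// Runs `fn` over `items` with at most `limit` calls in flight at once.
// Results come back in input order, like Promise.all.
async function mapWithLimit<T, R>(
  items: readonly T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results = new Array<R>(items.length);
  let next = 0; // index of the next item to start

  // Each worker pulls the next unstarted index until none remain.
  // JS runs this on one thread, so `next++` cannot race between workers.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }

  const workers = Array.from({ length: Math.min(limit, items.length) }, () => worker());
  await Promise.all(workers);
  return results;
}
```

It drops in where the snippet above wraps runTool in limit(...), with the same cap of 3. Like p-limit, it only bounds calls from this process; several server instances still add up against a downstream rate limit.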

I missed this on my first migration. The customer-support agent fired four parallel ticket-update calls to a legacy CRM and got rate-limited into oblivion within an hour.

Eval-Driven Cutover

The migration is mechanical but the behavior is not always identical. Different default temperatures, different tool-call patterns, different message-formatting quirks. I would not cut over without a regression eval.

My harness: a flag-gated rollout where 10% of traffic goes to Responses, 90% to Assistants, both runs are logged with the same input, and a nightly job scores the diffs. I open-sourced the bones of this as Agent Eval Bench — input replay, output diff, automated grading via a stronger model.
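One detail matters for a flag-gated split of stateful traffic: a conversation should stay on one backend for its whole life. Otherwise a Responses turn has no previous_response_id to chain to, and an Assistants turn has no thread. A minimal sketch of sticky percentage bucketing by conversation id, using 32-bit FNV-1a as the hash (an illustrative choice; the article does not say how the flag assigns traffic) and a hypothetical routeConversation helper:

```typescript
type Backend = "responses" | "assistants";

// 32-bit FNV-1a hash over the string's UTF-16 code units.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by the FNV prime, mod 2^32
  }
  return hash >>> 0;
}

// Maps a conversation id to a stable bucket in [0, 100) and sends the
// lowest `responsesPercent` buckets to the Responses path.
function routeConversation(conversationId: string, responsesPercent: number): Backend {
  const bucket = fnv1a(conversationId) % 100;
  return bucket < responsesPercent ? "responses" : "assistants";
}
```

Because each conversation's bucket is fixed and the threshold only grows (0, 10, 50, 100), a conversation that has moved to Responses never moves back to Assistants mid-chain.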

The cutover schedule that worked for me:

Week 1 — Build the Responses path behind a feature flag. 0% traffic. Run shadow evals on logged inputs.
Week 2 — 10% live traffic. Watch error rates, latency, customer-reported issues.
Week 3 — 50% if metrics hold. Bug-fix anything weird.
Week 4 — 100%. Keep the Assistants code path in the repo with @deprecated comments for one more month, then delete.

Burn-down looked roughly like this in my logs:

Day 1: 47 endpoints calling Assistants
Day 7: 47 (Responses path built, no traffic yet)
Day 9: 47 (10% rollout, both paths alive)
Day 14: 47 → 12 (safe endpoints cut over; stateful chains stayed on Assistants)
Day 21: 12 → 3 (long-lived chain edge cases left)
Day 28: 3 → 0

The last three were the long-lived stateful chains where I needed pattern 2 above (explicit history). They took longer because I had to backfill DB writes for conversations that had been server-stateful for months.
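The backfill is mostly a format conversion. Assistants messages.list returns newest first by default, and each message's content is an array of typed blocks, so text has to be pulled out of the text blocks and the order flipped before the rows can serve as Pattern 2 history. A minimal sketch over a trimmed-down message shape (only the fields read here; real objects carry more), with a hypothetical backfillRows helper that skips non-text blocks such as images:

```typescript
// Trimmed-down Assistants message shape: only the fields this sketch reads.
interface AssistantsTextBlock {
  type: "text";
  text: { value: string };
}
interface AssistantsOtherBlock {
  type: "image_file" | "image_url";
}
interface AssistantsMessage {
  role: "user" | "assistant";
  created_at: number; // Unix seconds
  content: (AssistantsTextBlock | AssistantsOtherBlock)[];
}

// Hypothetical DB row for the explicit-history table.
interface HistoryRow {
  role: "user" | "assistant";
  content: string;
  createdAt: number; // epoch milliseconds
}

// Converts one thread's messages into chronological rows, joining multiple
// text blocks and dropping messages that contain no text at all.
function backfillRows(
  messages: AssistantsMessage[],
  order: "asc" | "desc" = "desc", // messages.list defaults to newest first
): HistoryRow[] {
  const chronological = order === "desc" ? [...messages].reverse() : [...messages];
  return chronological
    .map((m) => ({
      role: m.role,
      content: m.content
        .filter((block): block is AssistantsTextBlock => block.type === "text")
        .map((block) => block.text.value)
        .join("\n"),
      createdAt: m.created_at * 1000,
    }))
    .filter((row) => row.content.length > 0);
}
```

Flipping the list rather than sorting on created_at keeps messages from the same second in their original relative order, since created_at only has one-second resolution.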

What I Would Do Differently

Three things in priority order:

Start with eval, not code. Get the harness running before you write a line of migration code. Without a regression signal you are migrating blind.
Migrate stateless flows first. RAG queries, one-shot tool calls, summarization. These are mechanical search-and-replace. Build confidence before tackling stateful chains.
Audit parallel tool calls explicitly. Do not assume. Grep your runTool implementations for shared mutable state. The parallel-by-default behavior will find every race condition you have.
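The race that parallel handlers expose is usually a read-modify-write split by an await. A minimal sketch with a hypothetical in-memory ticket counter standing in for shared state inside runTool: the unsafe version loses an update when two calls interleave, and a small promise-chain lock (a hypothetical helper, not a library API) serializes the critical section:

```typescript
// Shared mutable state that two tool handlers both touch.
const ticketState = { openCount: 0 };

// Stand-in for an awaited downstream call (DB read, CRM request, ...).
const io = () => Promise.resolve();

// Unsafe: reads, awaits, then writes back a value computed from a stale read.
async function unsafeOpenTicket(): Promise<void> {
  const current = ticketState.openCount;
  await io();
  ticketState.openCount = current + 1;
}

// Minimal async mutex: each caller waits for the previous holder to finish.
function createLock() {
  let tail: Promise<void> = Promise.resolve();
  return async function withLock<T>(fn: () => Promise<T>): Promise<T> {
    const run = tail.then(fn);
    // Keep the chain usable even if fn rejects.
    tail = run.then(() => undefined, () => undefined);
    return run;
  };
}

const withTicketLock = createLock();

// Safe: the whole read-await-write runs while holding the lock.
async function safeOpenTicket(): Promise<void> {
  await withTicketLock(unsafeOpenTicket);
}
```

This only protects state inside one Node process. When the shared state lives in a database or a CRM, the guard has to live there too, or the concurrency cap has to do the work.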

OpenAI gave us through 2026, which sounds generous until you remember every other library you depend on is also moving. Do not be the one team migrating in October.

The Responses API is the better primitive. It is simpler, more honest about state, and the streaming model finally feels native. The migration is a weekend of work for a small codebase and two weeks for a complex one. Worth it.


Developers Digest

Technical content at the intersection of AI and development. Building with AI agents, Claude Code, and modern dev tools - then showing you exactly how it works.

300+ videos • 30K+ GitHub stars • 50+ articles

