
TL;DR
OpenAI is sunsetting the Assistants API in 2026. Here is a tested migration plan to the Responses API - code, state, threads, tools, every cliff I hit, in order.
| Topic | Primary source |
|---|---|
| Assistants deprecation | OpenAI deprecations and Assistants API (v2) FAQ |
| Migration guide | Assistants to Responses API migration |
| Responses API reference | Responses API object reference |
| Data retention and controls | Your data and retention |
Last updated: May 31, 2026. Verify details against the official docs before you cut over production traffic.
OpenAI confirmed the Assistants API deprecation on their platform deprecations page and published a migration guide to the Responses API. If you are running production code against client.beta.threads.*, treat this as a scheduled migration, not a someday rewrite.
If you are running production code against client.beta.threads.* today, you have homework. I had a 14-month-old Assistants codebase running newsletter automation, customer support triage, and a chunk of internal ops. Last weekend I migrated all of it. This is the field guide - every cliff I hit, in order, with the code diffs that worked.
For the visual walkthrough including the eval harness I used to gate the cutover, see the DevDigest YouTube channel.
The Assistants API was server-stateful. You created a thread, posted messages, kicked off runs, polled for completion, and OpenAI held the conversation history. Your code did not own the state.
The Responses API is client-stateful by default, server-stateful by opt-in. Each call returns a response.id. You pass previous_response_id on the next call to get continuity. The server stores the chain for 30 days. After that, you reconstruct from your own DB or pass the message array explicitly.
This is the right design - server-only state was a footgun for compliance, debugging, and multi-region - but it changes how you think about every conversation:
| Assistants | Responses |
|---|---|
| threads.create() | nothing - just call responses.create |
| threads.messages.create() | include in input array |
| runs.create() + poll | responses.create() returns synchronously or streams |
| run.required_action | function_call items in response.output (flatter) |
| assistants.create() | prompts + system messages + tools per call |
The big mental shift: there is no assistant object anymore. The "assistant" is your prompt template + tool list + model config, which you supply per call. This is why I version mine in Promptlock - the prompt is now a first-class artifact in your repo, not a row in OpenAI's database.
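Since the "assistant" now lives in your repo, one way to keep it versioned is a plain config object plus a small request builder. This is a minimal sketch, not the Promptlock format: `AssistantConfig`, `SUPPORT_TRIAGE_V3`, `buildRequest`, and the `update_ticket` tool are hypothetical names, and tagging each call through `metadata` is an assumption about how you might trace a response back to a prompt version.

```typescript
// Hypothetical shape: everything the old assistant object used to hold.
interface AssistantConfig {
  version: string;
  model: string;
  instructions: string;
  tools: Array<Record<string, unknown>>;
}

// The subset of responses.create parameters this sketch fills in.
interface ResponseRequest {
  model: string;
  instructions: string;
  tools: Array<Record<string, unknown>>;
  input: string;
  previous_response_id?: string;
  metadata: Record<string, string>;
}

// Illustrative config; the tool definition is made up for this example.
const SUPPORT_TRIAGE_V3: AssistantConfig = {
  version: "support-triage@3",
  model: "gpt-5.5",
  instructions: "You triage inbound support tickets.",
  tools: [
    {
      type: "function",
      name: "update_ticket",
      description: "Update a ticket in the CRM.",
      parameters: { type: "object", properties: {}, additionalProperties: false },
    },
  ],
};

// Build the per-call payload from the versioned config.
function buildRequest(
  config: AssistantConfig,
  input: string,
  previousResponseId?: string
): ResponseRequest {
  const request: ResponseRequest = {
    model: config.model,
    instructions: config.instructions,
    tools: config.tools,
    input,
    // Record which prompt version produced this response.
    metadata: { prompt_version: config.version },
  };
  // Only chain when there is a prior turn to chain from.
  if (previousResponseId) request.previous_response_id = previousResponseId;
  return request;
}
```

In use, the result would be passed straight to `client.responses.create(buildRequest(SUPPORT_TRIAGE_V3, userMessage, priorResponseId))`. Bumping `version` in the same commit as a prompt change keeps the diff and the deployed behavior in one place.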
Here is the minimal-diff before/after for a single conversation turn. The "before" is the standard Assistants pattern most of us wrote in 2024:
// BEFORE - Assistants API
const thread = await client.beta.threads.create();
await client.beta.threads.messages.create(thread.id, {
  role: "user",
  content: userMessage,
});
const run = await client.beta.threads.runs.createAndPoll(thread.id, {
  assistant_id: ASSISTANT_ID,
});
const messages = await client.beta.threads.messages.list(thread.id);
const reply = messages.data[0].content[0].text.value;

// AFTER - Responses API
const response = await client.responses.create({
  model: "gpt-5.5",
  instructions: SYSTEM_PROMPT,
  input: userMessage,
  tools: TOOLS,
  previous_response_id: priorResponseId, // null on first turn
  store: true, // 30-day server retention
});
const reply = response.output_text;
const newResponseId = response.id; // persist for next turn
The "after" version is shorter, synchronous on the happy path, and the conversation chain lives in two places you control: your DB row (the response.id) and your prompt repo (SYSTEM_PROMPT).
State management is where I lost the most time. Three patterns I now use:
Pattern 1: Short-lived chains (default). Persist previous_response_id against your conversation row. On each turn, pass it. Trust OpenAI's 30-day retention. This is what most apps want.
await db.conversation.update({
  where: { id: convId },
  data: { lastResponseId: response.id },
});
Pattern 2: Long-lived or compliance-bound chains. Do not rely on server retention. Store every message in your DB and pass them explicitly:
const response = await client.responses.create({
  model: "gpt-5.5",
  instructions: SYSTEM_PROMPT,
  input: messages.map((m) => ({ role: m.role, content: m.content })),
  store: false, // do not retain server-side
});
Pattern 3: Hybrid. Short-lived state via previous_response_id, but you also write every input/output to your DB for replay and eval purposes. This is what I run in production. It is the only pattern that gives you both ergonomic continuity and full-control debugging.
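A rough sketch of what the hybrid pattern can look like: chain through `previous_response_id`, then write a receipt for every turn. The `create` and `saveReceipt` functions are injected stand-ins for `client.responses.create` and your DB layer, and `runHybridTurn` and the `TurnReceipt` fields are hypothetical names, not anything the SDK provides.

```typescript
interface TurnResponse {
  id: string;
  output_text: string;
}

// Hypothetical receipt row: enough to replay or score the turn later.
interface TurnReceipt {
  conversationId: string;
  input: string;
  output: string;
  responseId: string;
  previousResponseId: string | null;
  latencyMs: number;
}

type CreateResponse = (params: {
  input: string;
  previous_response_id?: string;
}) => Promise<TurnResponse>;

type SaveReceipt = (receipt: TurnReceipt) => Promise<void>;

async function runHybridTurn(
  create: CreateResponse,
  saveReceipt: SaveReceipt,
  conversationId: string,
  input: string,
  previousResponseId: string | null
): Promise<TurnResponse> {
  const startedAt = Date.now();

  // Ergonomic continuity: let the server-side chain carry the context.
  const params: { input: string; previous_response_id?: string } = { input };
  if (previousResponseId) params.previous_response_id = previousResponseId;
  const response = await create(params);

  // Full-control debugging: keep your own copy of what went in and out.
  await saveReceipt({
    conversationId,
    input,
    output: response.output_text,
    responseId: response.id,
    previousResponseId,
    latencyMs: Date.now() - startedAt,
  });
  return response;
}
```

Because the receipt stores both IDs, the chain can be rebuilt from your own rows after the server copy is gone.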
The cliff I hit: I assumed previous_response_id would still work after 31 days. It does not - the server returns a 404. Wrap every call in a fallback that reconstructs from your DB if the chain is missing.
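The fallback can be sketched roughly like this. It assumes the expired chain surfaces as an error object carrying `status: 404`, as described above; `create` and `loadHistory` are injected stand-ins for `client.responses.create` and your DB query, and `createWithFallback` is a hypothetical helper name. A real call would also carry `model`, `instructions`, and `tools`.

```typescript
interface StoredMessage {
  role: "user" | "assistant";
  content: string;
}

interface TurnResult {
  id: string;
  output_text: string;
}

interface CreateParams {
  input: string | StoredMessage[];
  previous_response_id?: string;
}

type CreateFn = (params: CreateParams) => Promise<TurnResult>;

// Assumption: a missing or expired chain shows up as an error with status 404.
function isMissingChain(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { status?: unknown }).status === 404
  );
}

async function createWithFallback(
  create: CreateFn,
  loadHistory: () => Promise<StoredMessage[]>,
  userMessage: string,
  priorResponseId: string | null
): Promise<TurnResult> {
  if (priorResponseId) {
    try {
      return await create({
        input: userMessage,
        previous_response_id: priorResponseId,
      });
    } catch (err) {
      // Anything other than a missing chain is a real failure: rethrow it.
      if (!isMissingChain(err)) throw err;
    }
  }
  // First turn or expired chain: replay stored history plus the new message.
  const history = await loadHistory();
  const next: StoredMessage = { role: "user", content: userMessage };
  return create({ input: [...history, next] });
}
```

The returned `id` starts a fresh chain, so persist it the same way as any other turn.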
Function calling works, with a flatter schema. The tools array is the same shape. The big differences:
code_interpreter and file_search are now first-class tools you enable per call. No more attaching them to an assistant.
[...] requires_action.
Here is the parallel-tool gotcha. In Assistants, this code was safe:
// Assistants - implicit serial
for (const call of run.required_action.submit_tool_outputs.tool_calls) {
  const output = await runTool(call); // safe, one at a time
}
In Responses, the model now expects you to handle multiple tool calls concurrently. If runTool is not idempotent or hits a rate-limited downstream, batch your calls or Promise.all them with a concurrency cap:
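import pLimit from "p-limit";

// Cap concurrency so a rate-limited downstream sees at most 3 calls in flight.
const limit = pLimit(3);

// In the Responses API, tool calls arrive as function_call items in response.output.
const toolCalls = response.output.filter((item) => item.type === "function_call");
const outputs = await Promise.all(
  toolCalls.map((call) => limit(() => runTool(call)))
);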
I missed this on my first migration. The customer-support agent fired four parallel ticket-update calls to a legacy CRM and got rate-limited into oblivion within an hour.
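Once the tools have run, their results have to go back to the model. As I understand the Responses API, each tool call is a `function_call` item with a `call_id`, and each result is returned as a `function_call_output` item in the next request's `input`. The sketch below only builds those items; `toToolOutputs` is a hypothetical helper, and the item shapes are worth verifying against the current API reference.

```typescript
interface FunctionCallItem {
  type: "function_call";
  call_id: string;
  name: string;
  arguments: string; // JSON-encoded arguments from the model
}

interface FunctionCallOutputItem {
  type: "function_call_output";
  call_id: string;
  output: string;
}

// Pair each tool call with its result, in order, matching on call_id.
function toToolOutputs(
  calls: FunctionCallItem[],
  results: unknown[]
): FunctionCallOutputItem[] {
  if (calls.length !== results.length) {
    throw new Error("every tool call needs exactly one result");
  }
  return calls.map((call, index): FunctionCallOutputItem => {
    const result = results[index];
    return {
      type: "function_call_output",
      call_id: call.call_id,
      // Strings pass through; anything else is serialized as JSON.
      output: typeof result === "string" ? result : JSON.stringify(result ?? null),
    };
  });
}
```

The returned array would be sent as `input` on a follow-up `responses.create` call, with `previous_response_id` set to the response that requested the tools.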
The migration is mechanical but the behavior is not always identical. Different default temperatures, different tool-call patterns, different message-formatting quirks. I would not cut over without a regression eval.
My harness: a flag-gated rollout where 10% of traffic goes to Responses, 90% to Assistants, both runs are logged with the same input, and a nightly job scores the diffs. I open-sourced the bones of this as Agent Eval Bench - input replay, output diff, automated grading via a stronger model.
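For the flag itself, a deterministic bucket per conversation is one simple option. This is a sketch under my own assumptions rather than the Agent Eval Bench code: `bucketFor` and `chooseBackend` are hypothetical names, and the hash is an FNV-1a-style function chosen only because it needs no dependencies.

```typescript
type Backend = "assistants" | "responses";

// FNV-1a-style 32-bit hash: small, dependency-free, stable across processes.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// Map a conversation to a stable bucket in the range 0..99.
function bucketFor(conversationId: string): number {
  return fnv1a(conversationId) % 100;
}

// Conversations whose bucket falls below the rollout percentage use Responses.
function chooseBackend(conversationId: string, rolloutPercent: number): Backend {
  return bucketFor(conversationId) < rolloutPercent ? "responses" : "assistants";
}
```

Bucketing on the conversation ID rather than on each request keeps a stateful chain on one backend for its whole life, so a conversation never hops between a thread and a response chain mid-rollout.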
The cutover schedule that worked for me:
[...] @deprecated comments for one more month, then delete.
Burn-down looked roughly like this in my logs:
Day 1: 47 endpoints calling Assistants
Day 7: 47 (built path, no traffic yet)
Day 9: 47 → 47 (10% rollout, both alive)
Day 14: 47 → 12 (cut the safe ones, kept stateful chains on assistants)
Day 21: 12 → 3 (long-lived chain edge cases)
Day 28: 0
The last three were the long-lived stateful chains where I needed pattern 2 above (explicit history). They took longer because I had to backfill DB writes for conversations that had been server-stateful for months.
Three things in priority order:
[...] runTool implementations for shared mutable state. The parallel-by-default behavior will find every race condition you have.
OpenAI publishes the current timeline on the platform deprecations page and updates it as the plan firms up. Treat the deprecations page as the source of truth, not social threads.
Not long-term. The Responses API does not use the same thread and run primitives. The migration guide shows the mapping, but the conceptual shift is that state is either passed explicitly or chained via previous_response_id.
Does previous_response_id replace a database?
No. It is a continuity mechanism, not a storage strategy. For anything compliance-bound, long-lived, or audit-heavy, store your inputs and outputs in your own database and treat server retention as a convenience.
Run an eval harness before and after. Migrate one workload at a time, keep the old system behind a flag, and record enough receipts (inputs, outputs, tool calls, latency) to debug regressions quickly.
Probably. The Responses API expects parallel tool calls by default. If your tool handlers assume serial execution, add a concurrency cap and make idempotency explicit before you cut over.
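One way to make idempotency explicit is to key each tool execution on its `call_id`, so a duplicate or retried call reuses the first result instead of repeating the side effect. This is a minimal in-memory sketch: `makeIdempotent` is a hypothetical wrapper, the `ToolCall` shape is an assumption, and a multi-instance deployment would need a shared store instead of a `Map`.

```typescript
interface ToolCall {
  call_id: string;
  name: string;
  arguments: string;
}

type ToolRunner = (call: ToolCall) => Promise<string>;

// Wrap a tool runner so each call_id executes at most once per process.
function makeIdempotent(runTool: ToolRunner): ToolRunner {
  const seen = new Map<string, Promise<string>>();
  return (call) => {
    const existing = seen.get(call.call_id);
    if (existing) return existing; // duplicate: reuse the first execution

    const result = runTool(call);
    seen.set(call.call_id, result);
    // Forget failed executions so a later retry can run again.
    result.catch(() => {
      seen.delete(call.call_id);
    });
    return result;
  };
}
```

Wrapping the handler once, for example `const safeRunTool = makeIdempotent(runTool)`, and passing the wrapped version to the concurrency-capped `Promise.all` keeps the cap and the idempotency guard independent of each other.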
OpenAI gave us through 2026, which sounds generous until you remember every other library you depend on is also moving. Do not be the one team migrating in October.
The Responses API is the better primitive. It is simpler, more honest about state, and the streaming model finally feels native. The migration is a weekend of work for a small codebase and two weeks for a complex one. Worth it.