
TL;DR
Configurable memory, sandbox-aware orchestration, Codex-like filesystem tools. Here is how the new Agents SDK actually behaves in prod.
I rebuilt my customer support agent on the new OpenAI Agents SDK and discovered three undocumented footguns in the first week. The agent now runs noticeably faster, holds context across longer sessions, and uses filesystem tools the same way Codex agents do. It also nearly nuked our staging knowledge base on day two because of an interaction between sandbox lifetime and memory writes that I will detail below.
This is the writeup of what shipped, what is genuinely better, and what to watch for if you are migrating an existing agent.
The previous Agents SDK was already capable. You could orchestrate tool calls, hand off between agents, and run multi-agent workflows with reasonable observability. What it could not do natively was hold meaningful state between sessions or operate over a real filesystem the way a Codex agent does.
For the larger agent workflow map, read OpenAI Codex: Cloud AI Coding With GPT-5.3 and OpenAI vs Anthropic in 2026 - Models, Tools, and Developer Experience; they give the architecture and implementation context this piece assumes.
The new SDK ships three primitives that close those gaps.
The first is configurable memory. Agents now have first-class short-term and long-term memory tiers, with explicit scoping (per session, per user, per organization) and explicit retention policies. You no longer roll your own vector store integration for "remember what this customer told us last week".
The second is sandbox-aware orchestration. Tool execution can target specific sandbox environments with their own lifetime, file state, and resource limits. Multi-agent workflows can pass a sandbox between agents instead of just passing messages, which is the closest thing to a workspace handoff that any major SDK currently supports.
The third is filesystem tools borrowed directly from the Codex stack. read_file, write_file, apply_patch, list_dir, and run_in_sandbox are now native primitives. Agents can manipulate files the same way Codex does, which makes building "agents that produce artifacts" much cheaper than it used to be.
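For a sense of what apply_patch consumes, here is the patch envelope format the Codex models emit. I am assuming the SDK primitive uses the same grammar, so treat this as orientation rather than a spec:

*** Begin Patch
*** Update File: app/main.py
@@ def read_root():
-    return {"status": "stale"}
+    return {"status": "ok"}
*** End Patch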
These three together are what makes the new SDK a real upgrade rather than an incremental refresh. Memory plus sandbox plus filesystem is the ingredient list for agents that do real work over time.
The memory API has two tiers and the distinction matters.
Short-term memory is per-session, ephemeral, and bounded. It is automatically managed: the SDK summarizes older turns when the context window fills, and you can configure the summarization aggressiveness. Most teams should leave this on default.
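If you do need to tune it, the surface looks roughly like this. The parameter names below are my assumptions for illustration, not the confirmed API:

from openai.agents import Memory

# Hypothetical knobs for the managed short-term tier. Both parameter names
# are assumptions; check the SDK reference for the real spelling.
short_term = Memory(
    tier="short_term",
    summarization="aggressive",  # assumed: how eagerly older turns get compacted
    max_context_tokens=32_000,   # assumed: budget before summarization kicks in
)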
Long-term memory is persistent, explicitly scoped, and explicitly written. You decide what to store, when to store it, and who owns the scope. The SDK exposes it as a tool the agent can call, which means the agent itself can choose to remember something. This is more powerful and more dangerous than it sounds.
A minimal example with the Python SDK:
from openai import OpenAI
from openai.agents import Agent, Memory

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Long-term memory: persistent, explicitly scoped, with server-side retention.
memory = Memory(
    scope={"user_id": "u_8423", "org_id": "org_acme"},  # the privacy boundary
    retention="90d",  # enforced server side; no expiry cron needed
    tier="long_term",
)

support_agent = Agent(
    name="support",
    model="gpt-5.3-codex",
    instructions=(
        "You are a customer support agent for Acme. "
        "Use long-term memory to remember user preferences and past issues. "
        "Never write secrets or PII into memory."
    ),
    memory=memory,
    tools=["filesystem", "code_interpreter"],
)

result = support_agent.run(
    input=(
        "Customer u_8423 says: my dashboard is showing yesterday's data again. "
        "Check the cache config and remember my preference for east coast timezone."
    )
)
print(result.output_text)
Two things worth noting. First, the scope is what gives you privacy boundaries. If you scope memory per user, you cannot accidentally read another user's history. Get this wrong and you have a bug class that is genuinely hard to detect in testing because it surfaces only at scale. Second, the retention policy is enforced server side. You do not need to write a cron job to expire old memories.
For agents that produce artifacts and need to track them over time, I run a versioned filesystem in front of the memory tier with agentfs. Memory tracks intent and preferences. agentfs tracks the actual files the agent has written, with audit trails. The combination is what makes incident investigation possible after the fact.
Classic tool-use treats every tool call as stateless. You call a function, it returns a value, the agent moves on. This breaks down the moment you want an agent to actually do work over time, like running a build, modifying files, then running tests against the modified files.
Sandbox-aware orchestration solves this by making the sandbox itself a first-class entity that lives across tool calls and can be passed between agents.
from openai.agents import Agent, Sandbox, Workflow

# The sandbox is a first-class entity: it outlives individual tool calls
# and is handed between agents as a shared workspace.
sandbox = Sandbox.create(
    image="python-3.12-slim",
    timeout_seconds=900,
    memory_mb=2048,
)

planner = Agent(name="planner", model="gpt-5.3", instructions="Plan tasks.")

implementer = Agent(
    name="implementer",
    model="gpt-5.3-codex",
    instructions="Execute the plan in the provided sandbox.",
    tools=["filesystem", "shell"],
)

verifier = Agent(
    name="verifier",
    model="gpt-5.3-codex",
    instructions="Run tests and report results.",
    tools=["shell"],
)

workflow = Workflow(
    agents=[planner, implementer, verifier],
    sandbox=sandbox,  # shared workspace across all three agents
    handoff="sequential",  # planner -> implementer -> verifier
)

result = workflow.run(input="Add a /healthz endpoint to the FastAPI app and verify it.")
The sandbox is shared across all three agents. The planner produces a plan, the implementer writes files into the sandbox, the verifier runs tests against those files. No serialization of state through messages, no rebuilding the workspace at each handoff. This is the orchestration model that finally feels right.
There is a real architectural question about whether you should be running orchestration through the SDK at all or running it externally. Multi-agent workflows that span more than a few steps, especially ones that touch external systems, are often easier to reason about as explicit DAGs you control. The framing I use in DD Orchestrator is that SDK orchestration is right when the work is contained inside one sandbox or one model call graph, and external orchestration is right the moment you have to coordinate with services outside the agent boundary. For my support agent the SDK is exactly the right tool. For a billing reconciliation pipeline that touches three databases and a Stripe webhook queue, it is not.
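For contrast, here is a minimal sketch of the external shape, with the agent as one bounded node in a DAG you own. The step function and its inputs are stand-ins for your own integration code:

from openai.agents import Agent

reconciler = Agent(
    name="reconciler",
    model="gpt-5.3",
    instructions="Reconcile the provided invoices against the ledger summary.",
)

# One bounded agent call per DAG node. Retries, idempotency, and coordination
# with databases and webhook queues live in your orchestrator (Airflow,
# Temporal, a cron script), not inside the SDK.
def reconcile_step(invoices: list[dict], ledger_summary: str) -> str:
    result = reconciler.run(
        input=f"Ledger: {ledger_summary}\nInvoices: {invoices}"
    )
    return result.output_text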
The filesystem tools are the dark horse of the release. They are powerful, they are dangerous, and they are the right primitive for the kind of agents people actually want to build.
When this is a superpower: any agent that needs to produce a structured artifact. Code, configuration files, generated reports, image manifests. Giving the agent direct filesystem access lets it iterate, verify its own output by reading it back, and apply patches without you having to wrap every operation in custom tool code.
When this is a footgun: any agent that has filesystem access and does not have hard boundaries on what it can read or write. The Codex tools are deliberately powerful. apply_patch will apply a patch to anything in the sandbox. If the sandbox has access to your knowledge base, the agent can rewrite your knowledge base. Ask me how I know.
The countermeasures are simple but non-negotiable. Mount only the directories the agent needs. Use a read-only mount when the agent does not need to write. Set explicit byte limits on writes. Log every write and replay them in your tracing system before deploying changes that touch production data.
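A sketch of those countermeasures in sandbox configuration. The mounts parameter and the per-mount options are assumed names for illustration, so check the SDK reference before relying on this:

from openai.agents import Sandbox

# Hardened sandbox: the agent can write only to its working directory, with a
# byte cap, and the knowledge base is mounted read-only. Mount options here
# are assumptions, not the confirmed API.
sandbox = Sandbox.create(
    image="python-3.12-slim",
    timeout_seconds=900,
    memory_mb=2048,
    mounts=[
        {"path": "/workspace/app", "mode": "rw", "max_write_bytes": 10_000_000},
        {"path": "/workspace/kb", "mode": "ro"},  # knowledge base stays read-only
    ],
)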
For the architecture diagram showing the old SDK message-passing model versus the new sandbox-aware model, the DevDigest YouTube walkthrough is the clearest visual reference I have seen. It is also where I show the memory-tier debugging session that I will not be able to do justice to in text.
These are the things that did not make it into the changelog and that you should know before you ship.
The first is a race condition between memory writes and sandbox shutdown. If you write to long-term memory inside a tool call and the sandbox terminates before the write is acknowledged, the write may or may not land. The SDK does not currently expose a sync primitive for this. My fix was to write to memory only at the end of the agent run, after the sandbox has closed cleanly. If you must write mid-flight, await the memory acknowledgment explicitly.
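A sketch of the mid-flight pattern, assuming the memory write hands back an awaitable receipt. The method and field names are illustrative, since the SDK does not document a sync primitive yet:

# Block on the write before the tool call returns and the sandbox can close.
async def remember_mid_flight(memory, note: str) -> None:
    receipt = await memory.write(note)  # assumed: write returns an awaitable receipt
    if not receipt.acknowledged:  # assumed field
        raise RuntimeError("memory write not acknowledged; retry before sandbox close")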
The second is token bloat from memory recall. By default, the SDK will inject up to roughly 8k tokens of recalled memory into the context per turn. For agents that run for many turns, this compounds quickly. I found my support agent spending 30% of its context budget on recalled memory by turn ten, with most of the recall being irrelevant to the current question. The fix is to constrain recall with explicit query hints in the agent instructions and to lower the recall token budget in the memory configuration. The default is too generous for production use.
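A sketch of the constrained configuration. recall_token_budget and recall_query_hint are my assumed names for the two knobs described above:

from openai.agents import Memory

memory = Memory(
    scope={"user_id": "u_8423"},
    tier="long_term",
    recall_token_budget=2_000,  # assumed knob; down from the ~8k default
    recall_query_hint="only preferences and open tickets relevant to the current question",  # assumed knob
)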
The third is eval drift across SDK versions. The SDK is moving fast and the same agent code can behave subtly differently across patch releases. Tool selection, planning style, and recall behavior have all shifted in releases that did not change the API surface. Pin your SDK version. Run your eval suite before bumping. Tag the eval baseline in your version control. This is not unique to the OpenAI SDK but the rate of change here makes it more pressing than usual.
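A minimal guard worth running in the eval harness. The baseline version string is illustrative; pin whatever you actually evaluated:

import openai

# Fail fast if the deployed SDK drifts from the version the eval baseline
# was tagged against.
EVAL_BASELINE_SDK = "1.99.0"  # illustrative; use your pinned version
if openai.__version__ != EVAL_BASELINE_SDK:
    raise RuntimeError(
        f"SDK {openai.__version__} != eval baseline {EVAL_BASELINE_SDK}; "
        "rerun the eval suite before bumping"
    )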
If you have an agent on the previous SDK, here is the rollout I would recommend.
Start with a flag-gated branch. The new SDK can run alongside the old one, so route 5% of traffic to the new agent and watch your metrics; a minimal routing sketch follows these steps. Latency, cost, and error rate are the obvious ones. Memory hit rate and tool error distribution are the less obvious ones that will tell you whether the new primitives are actually helping.
Add observability before you add features. The new SDK has more moving parts than the old one. You want to see which memories are being recalled, which tools are being chosen, and how long each step is taking. Without that, you will not be able to tell why the agent is acting differently when it does.
Roll forward gradually. Move from 5% to 25% to 50% over a week, not over an hour. The interesting failure modes (memory drift, sandbox timeouts, recall token bloat) only show up under sustained traffic.
Keep the rollback path warm. The old SDK works. If something goes wrong, you want to be able to flip back in one config change, not a code revert. Treat the migration as a feature flag, not a refactor.
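Here is the routing sketch mentioned in the first step, assuming a single entry point you control. The two agent objects are stand-ins for your old and new implementations:

import zlib

ROLLOUT_PCT = 5  # 5 -> 25 -> 50 over a week, per the schedule above

def route_ticket(user_id: str, ticket_text: str, new_agent, legacy_agent):
    # Stable per-user bucketing: the same user always hits the same path,
    # which keeps memory state coherent during the rollout. Flipping
    # ROLLOUT_PCT to 0 is the one-config rollback.
    bucket = zlib.crc32(user_id.encode()) % 100
    agent = new_agent if bucket < ROLLOUT_PCT else legacy_agent
    return agent.run(input=ticket_text)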
The new Agents SDK is a real upgrade, and the primitives (memory, sandbox-aware orchestration, filesystem tools) are the right ones for the agents people actually want to ship in 2026. The footguns are real but manageable. If you have an agent in production today, the question is not whether to migrate but when, and the answer for most teams is "this quarter, behind a flag, with your eval suite watching".