
TL;DR
Cloudflare Flagship is feature flags built for AI: model swaps, agent gates, and prompt rollouts as first-class primitives. Here is how to use it without rebuilding your control plane.
Every team running AI in production eventually rebuilds the same three things on top of LaunchDarkly or Statsig: a model selector keyed by user cohort, a kill switch for runaway agents, and a prompt template registry that does not require a deploy. Generic flag systems treat these as the same primitive — a boolean or a string — and let you figure out the rest.
For broader context, pair this with Every AI Coding Tool Compared: The 2026 Matrix and What Is an AI Coding Agent? The Complete 2026 Guide; those companion pieces show where this fits in the wider AI developer workflow.
Cloudflare Flagship, announced this week, ships with these three patterns as first-class concepts. It is not just a flag store. It is a control plane for AI behavior: model selection with cost-aware routing, prompt versioning with rollback, and gates with real-time circuit breaking. All of it evaluated at the edge with single-digit millisecond latency.
This is the primitive every AI app already has, written badly, in your own codebase. The question is whether replacing it with a hosted service is worth the migration. For most teams, yes. Here is why and how.
The announcement lists four headline capabilities, but only three of them matter for AI apps.
Model flags are typed flag values that resolve to a model identifier plus optional config (temperature, max tokens, system prompt overrides). You define them once and reference them everywhere. Cohort targeting, gradual rollouts, and instant rollback all work the same way they do for a normal flag.
Prompt registry stores versioned prompt templates and resolves them at request time. Each version is addressable, every evaluation is logged, and rollback is a click. This solves the "we changed the system prompt and broke production at 2am" problem by making the change a first-class event with a diff.
Circuit breakers are flags that auto-flip based on rules you define against your own metrics. Error rate above 5% on this model, flip to the fallback. Cost per user above the threshold, flip to a smaller model. This is the piece that you cannot easily build on top of a generic flag service because it requires the flag system to consume your telemetry and write back to itself.
The fourth capability — A/B testing dashboards — is competent but not differentiated. Use it if you are not already wired into a stats platform. Skip it if you are.
The integration is intentionally thin. A single call per flag, evaluated at the edge or in your worker.
import { Flagship } from "@cloudflare/flagship";

export default {
  async fetch(req: Request, env: Env): Promise<Response> {
    // Workers expose secrets on the env binding, not process.env
    const flags = new Flagship({ token: env.FLAGSHIP_TOKEN });
    const userId = req.headers.get("x-user-id") ?? "anon";

    // Model flag with full config payload
    const modelConfig = await flags.model("primary_chat_model", {
      user: { id: userId },
      defaults: {
        provider: "anthropic",
        model: "claude-sonnet-4.7",
        temperature: 0.7,
      },
    });

    // Prompt flag: resolves to a versioned template
    const systemPrompt = await flags.prompt("chat_system_prompt", {
      user: { id: userId },
      variables: { product_name: "Acme" },
    });

    // Circuit-broken tool gate
    if (!(await flags.gate("agent_tools_enabled", { user: { id: userId } }))) {
      return Response.json({ error: "Agent temporarily unavailable" }, { status: 503 });
    }

    // callModel is your app's own provider wrapper (not shown)
    const body = await req.json();
    const reply = await callModel({
      ...modelConfig,
      system: systemPrompt,
      messages: body.messages,
    });

    // Report the outcome: circuit breakers consume this
    await flags.report("primary_chat_model", {
      latency_ms: reply.latency,
      error: reply.error ? 1 : 0,
      cost_usd: reply.cost,
    });

    return Response.json(reply);
  },
};
The shape worth noting: flags.report closes the loop. The same flag that resolved the model also consumes the outcome. If error rate spikes, the circuit breaker rule you defined in the dashboard flips the flag to the fallback model automatically. No human in the loop, no PagerDuty page at 3am.
Edge eval means edge eval. Flagship runs on Cloudflare's edge, which is a feature if your app is also on the edge and a tax if you are running on AWS in us-east-1 with no edge tier. The flag fetch adds 30-60 ms in that case. For most AI apps that is a rounding error against the model latency, but if you are doing high-frequency flag evaluations inside a tight loop you will feel it.
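The mitigation for that last case is to resolve once per request and reuse the value. A minimal sketch, reusing flags and userId from the earlier snippet; workItems and callModel stand in for your own loop and provider wrapper:

// Resolve the flag once, outside the hot loop
const modelConfig = await flags.model("primary_chat_model", {
  user: { id: userId },
});

for (const item of workItems) {
  // Each iteration reuses the resolved config; no repeated 30-60 ms flag fetch
  await callModel({ ...modelConfig, messages: item.messages });
}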
Prompt templates are not Jinja. The variable substitution syntax is intentionally simple — {{var}} and basic conditionals. If you need real templating logic, render the prompt in your code and pass the rendered string as a variable. Trying to fit complex prompt construction into the registry is the path to pain.
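In practice the split looks like this. A sketch reusing the client from above; renderToolList and enabledTools are hypothetical stand-ins for your own rendering logic:

// Complex construction lives in code, not in the registry template
function renderToolList(tools: { name: string; description: string }[]): string {
  return tools.map((t) => `- ${t.name}: ${t.description}`).join("\n");
}

// The registry template only needs {{product_name}} and {{tool_list}}
const systemPrompt = await flags.prompt("chat_system_prompt", {
  user: { id: userId },
  variables: {
    product_name: "Acme",
    tool_list: renderToolList(enabledTools),
  },
});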
Circuit breaker rules can deadlock. If your fallback model is also a flag with its own circuit breaker, and both fail, you can end up flipped to a third option you did not intend. Always define a hard-coded last-resort default in your code that does not go through the flag system. This is not a Flagship-specific problem but it is more visible here because the system encourages cascading flags.
Flag values are eventually consistent. A flip in the dashboard takes 5-15 seconds to propagate globally. For incident response that is fine. For experimentation it is fine. For "I just deployed this and want to test it" it is annoying enough that you should keep a local override mechanism.
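The override can be as simple as an environment variable that short-circuits the lookup. A sketch, assuming a hypothetical MODEL_OVERRIDE var you only set in dev:

// Hypothetical dev-only override; skips the 5-15 second propagation delay
async function resolveModelConfig(flags: Flagship, env: Env, userId: string) {
  if (env.MODEL_OVERRIDE) {
    return { provider: "anthropic", model: env.MODEL_OVERRIDE, temperature: 0.7 };
  }
  return flags.model("primary_chat_model", { user: { id: userId } });
}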
Flag systems sit at an awkward layer in the agent stack. Above your inference layer because they decide which model to call. Below your orchestration layer because the orchestrator does not care about the flag, only the resolved value. Adjacent to observability because the flag system consumes metrics to trigger circuit breakers.
The clean way to slot Flagship in: it owns the "what model, what prompt, what tools" decision for every agent invocation. Your orchestrator owns the workflow. Your observability owns the truth. Flagship reads from observability and writes to the orchestrator's input.
This matters when you are running multi-step agent workflows because each step is an independent decision. Step 1 plans, step 2 executes, step 3 verifies. Each step might want a different model — opus for planning, sonnet for execution, haiku for verification — and each model selection can be a separate flag. We use this pattern inside Orchestrator, our DD product for declarative multi-agent workflows. Each node in the graph reads its model and prompt from Flagship at execution time, which means we can tune the entire graph from a dashboard without redeploying any worker.
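A sketch of the per-step pattern with hypothetical flag names; planningMessages and friends stand in for your orchestrator's message construction:

// One flag per step, so each can be tuned independently from the dashboard
const planModel = await flags.model("agent_plan_model", { user: { id: userId } });
const plan = await callModel({ ...planModel, messages: planningMessages });

const execModel = await flags.model("agent_execute_model", { user: { id: userId } });
const result = await callModel({ ...execModel, messages: executionMessages(plan) });

const verifyModel = await flags.model("agent_verify_model", { user: { id: userId } });
const verdict = await callModel({ ...verifyModel, messages: verificationMessages(result) });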
The other place flags compose well is observability replay. When you see a bad agent run, you want to know not just the prompt and response but the flag values that were active when the run executed. Recording flag context with every trace is the difference between "we cannot reproduce this" and "the user was in cohort B with the experimental prompt." Traces records every flag value alongside the agent transcript, which is the integration that closes the debugging loop.
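Recording that context is a one-object change wherever you already emit traces. A sketch; emitTrace is hypothetical, and whether the resolved prompt exposes a version field is an assumption about the client's return shape:

// Attach the active flag values to every agent trace
await emitTrace({
  run_id: crypto.randomUUID(),
  flag_context: {
    model: modelConfig.model,
    prompt_flag: "chat_system_prompt",
    prompt_version: systemPrompt.version, // assumption: resolved prompts carry their version
  },
  transcript: reply.messages, // assumption: your reply object keeps the transcript
});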
I walked through the full setup including circuit breaker rules and rollout strategies on the Developers Digest YouTube channel.
A few patterns that have worked across the agent products we run.
Start with one flag: model selection. That is the highest-value, lowest-risk place to begin. Wrap your existing model call in a Flagship lookup, set the default to your current model, deploy. You now have an instant kill switch and a path to A/B testing. Everything else is incremental.
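Concretely, that first step is a few-line diff. A sketch, assuming your current hard-coded model is claude-sonnet-4.7:

// Before: const model = "claude-sonnet-4.7";
// After: identical behavior until you touch the dashboard, plus a kill switch
const { provider, model, temperature } = await flags.model("primary_chat_model", {
  user: { id: userId },
  defaults: { provider: "anthropic", model: "claude-sonnet-4.7", temperature: 0.7 },
});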
Version every prompt change as a new prompt flag value, not an edit to the existing one. The audit trail is worth the minor overhead, and rollback becomes trivial. We rotate to a new version of every prompt every time we change it, with the old version archived but reachable.
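If the client accepts an explicit version pin (an assumption about the API, not something the announcement confirms), reaching an archived version for a repro looks like:

// Hypothetical explicit pin: fetch the archived version for reproduction,
// while production traffic keeps resolving the live one
const archived = await flags.prompt("chat_system_prompt", {
  user: { id: userId },
  version: "v12", // assumption: the evaluate call accepts a version pin
  variables: { product_name: "Acme" },
});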
Define circuit breakers conservatively at first. A 10% error rate threshold with a 5-minute window is a safe starting point for most apps. Tighten it as you learn what normal looks like. Aggressive thresholds on day one will flap.
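Rules are defined in the dashboard, but as an illustration of that starting point, the shape is roughly this; every field name here is hypothetical:

// Hypothetical rule shape: flip to the fallback when the reported error
// metric exceeds 10% over a 5-minute window
const conservativeBreaker = {
  flag: "primary_chat_model",
  metric: "error",          // the field you send via flags.report
  threshold: 0.10,
  window_minutes: 5,
  action: { flip_to_variant: "fallback" },
};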
Keep a hard-coded last-resort default in code for every flag. If Flagship is unreachable, your app should still work, just on the safe defaults. Treat the flag service as a control plane, not a single point of failure.
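The wrapper is small. A sketch in which SAFE_DEFAULT never touches the flag service:

// Hard-coded last resort: lives in code, ships with every deploy
const SAFE_DEFAULT = { provider: "anthropic", model: "claude-sonnet-4.7", temperature: 0.7 };

async function modelConfigOrDefault(flags: Flagship, userId: string) {
  try {
    return await flags.model("primary_chat_model", {
      user: { id: userId },
      defaults: SAFE_DEFAULT,
    });
  } catch {
    // Flagship unreachable: degrade to the safe default instead of failing the request
    return SAFE_DEFAULT;
  }
}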
Two things to keep an eye on. First, whether Cloudflare adds first-class evals as a sibling to flags. The natural integration is "tie a flag rollout to an eval pass rate" — only promote to 100% when the eval suite holds at green. Right now you have to glue that together with your own pipeline. Building it natively would eat a real product category.
Second, whether the prompt registry gains structured types. Right now prompts are strings with variables. The interesting future is prompts as typed schemas with input validation and output parsing baked in. That would make Flagship a competitor to LangSmith's prompt hub, which is the obvious adjacent target.
For now the move is simple. Take the worst part of your AI app — the place where you are deploying just to change a model name, or hand-editing a prompt in production, or hoping nothing breaks because you have no kill switch — and replace it with a flag. The control plane you have been pretending you do not need is the one Cloudflare just shipped.