
TL;DR
DeepSeek V4 Pro lands a 63.5 on SWE-bench Verified at $1.74/$3.48 per million tokens, and Flash runs agent inner loops for cents. Here is the worked cost math, the Flash-vs-Pro split, and a clear guide on when to route to DeepSeek instead of a frontier model.
For most of the open-weights era, the cost-quality tradeoff was easy to call. The cheap models were cheap because they were worse, and on real agentic coding work - the long, tool-heavy loops where a model has to plan, edit, run, and recover from its own mistakes - the gap was wide enough that price barely entered the conversation. You paid frontier rates because the alternative did not finish the task.
DeepSeek V4 changed the shape of that argument. With V4 Pro scoring 63.5 on SWE-bench Verified at a standing API price of $1.74 per million input tokens and $3.48 per million output, the question is no longer "is the cheap model good enough." It is "for which tasks is the quality delta worth four to fourteen times the bill." That is a routing decision, not a vendor loyalty decision, and it is the one this post is built to help you make.
This is part of our ongoing AI-model-economics beat. If you want the full setup-and-API walkthrough, our DeepSeek V4 developer guide covers the SDK wiring, the thinking parameter, and the legacy alias cutover. If you want the head-to-head against the premium tier, Fable 5 vs DeepSeek V4 measures the cost-quality gap on real tasks. This piece is about the economics of routing agentic coding work to V4 specifically.
Verify every figure below against primary sources before making a production decision. Prices and benchmarks move.
| Resource | URL | What You Get |
|---|---|---|
| DeepSeek Pricing | api-docs.deepseek.com/quick_start/pricing | Current per-million-token API pricing |
| DeepSeek V4 Pro Model Card | huggingface.co/deepseek-ai/DeepSeek-V4-Pro | Architecture, parameter counts, context limits |
| OpenRouter: V4 Pro | openrouter.ai/deepseek/deepseek-v4-pro | Third-party pricing and benchmark aggregation |
| OpenRouter: V4 Flash | openrouter.ai/deepseek/deepseek-v4-flash | Flash pricing and provider list |
| CloudZero DeepSeek Pricing 2026 | cloudzero.com/blog/deepseek-pricing | Independent pricing breakdown and cache notes |
| Verdent V4 Pricing & Migration | verdent.ai/guides/deepseek-v4-pricing-api-migration-2026 | Pricing history and migration context |
DeepSeek V4 ships as a family, not a single checkpoint. The split is the entire economic story, so it is worth being precise about which model does what.
V4 Flash is the small, fast tier. The Hugging Face model card lists it at 158B total parameters with a smaller active footprint per token. It defaults to non-thinking mode, which keeps latency in the same range as the older deepseek-chat, and it is built for high-throughput work: classifiers, structured extraction, retrieval synthesis, and the inner loops of an agent where the model is making bounded, low-stakes decisions hundreds of times.
V4 Pro is the flagship. The instruction-tuned release weighs 862B total parameters against a 1.6T base checkpoint, and it is the model you reach for on hard reasoning: codebase-scale refactors, multi-step planning, and agent workloads that have to hold state across many tool calls. It is slower and several times more expensive than Flash, but it is still a fraction of closed-model pricing.
Both tiers share a 1M token context window and a 384K maximum output, the latter being a number nobody else is matching right now. For long-context agentic coding - dropping a whole repo into context and asking for a coordinated change - that 1M window is the feature that makes V4 a real frontier-model substitute rather than a budget fallback.
These are per-million-token prices as listed on the DeepSeek pricing docs and corroborated by OpenRouter and CloudZero as of June 17, 2026. The launch discount on Pro expired May 31, so the full-price column is now the standing rate you should model against.
| Model | Cache Hit (input) | Input | Output |
|---|---|---|---|
| DeepSeek V4 Flash | $0.0028 | $0.14 | $0.28 |
| DeepSeek V4 Pro | $0.0145 | $1.74 | $3.48 |
| Claude Sonnet (reference) | - | ~$3.00 | ~$15.00 |
| Frontier premium tier (reference) | - | ~$10.00 | ~$50.00 |
Two things drop out of this table immediately. First, the cache hit price is one tenth of the input price on both tiers, which means cache-friendly prompt structure - stable system context at the top, variable task at the bottom - quietly removes most of your input bill on any repeated workload. Second, V4 Pro output at $3.48 is roughly a quarter of Claude Sonnet's output rate and about a fourteenth of the premium-tier output rate. Output tokens dominate agentic coding bills because the model writes diffs, reasoning traces, and tool calls all day, so that output-side gap is where the real savings live.
Get the weekly deep dive
Tutorials on Claude Code, AI agents, and dev tools - delivered free every week.
From the archive
Jun 17, 2026 • 7 min read
Jun 17, 2026 • 8 min read
Jun 17, 2026 • 7 min read
Jun 17, 2026 • 7 min read
Abstract per-token numbers do not build intuition. Let us cost a single realistic agentic coding task end to end.
The task: a medium feature implementation across an existing TypeScript codebase. The agent reads relevant files, plans, writes the change across four files, runs the test suite, reads the failures, and patches twice before it goes green. This is a normal afternoon of agent work, not a toy.
Token profile for one run:
V4 Pro cost for this run:
The same run on a frontier premium tier (at the reference $10 input / $50 output, no cache assumed for the comparison floor):
That is roughly a 20x difference on a single task, and it compounds. An agent fleet doing 200 such tasks a day is the difference between about $98 and about $2,000 per day, or about $30K versus $600K annualized on that one workload. Even against a mid-tier model like Claude Sonnet, the same run lands near $1.95, so V4 Pro is still about 4x cheaper.
The honest caveat: this math assumes V4 Pro finishes the task. If a harder problem causes Pro to fail where a frontier model succeeds, you pay the cheap bill twice and then pay the expensive bill anyway, plus the human time to notice. That failure-cost multiplier is exactly what the decision guide below is designed to price in.
The cost case only matters if the quality is real, so here is the benchmark picture. DeepSeek's own published numbers, which line up with early community reports on Hugging Face, put V4 Pro at 63.5 on SWE-bench Verified and 88.7 on MMLU-Pro. V4 Flash lands at 54.7 on SWE-bench Verified and 81.4 on MMLU-Pro, which is roughly R1-class reasoning at a fraction of R1's latency.
A 63.5 on SWE-bench Verified puts V4 Pro in striking distance of mid-tier frontier coding models and clearly ahead of every other open-weights checkpoint you can self-host. It does not top the leaderboard. The current frontier premium models still hold an edge on the hardest real-codebase tasks, and that edge is precisely what you are paying 14x output rates to buy when the task warrants it.
Treat vendor benchmarks as the optimistic case and validate on your own task distribution before you route production traffic. A model that wins on SWE-bench can still underperform on your specific stack, your conventions, and your tool surface.
This is the decision the whole post is built around. Route by task characteristics, not by habit.
Route to V4 Flash when:
Route to V4 Pro when:
Stay on a frontier premium model when:
The cleanest way to capture this in practice is a routing layer rather than a single default model. Run a smart orchestrator that triages tasks and dispatches the bounded ones to Flash, the hard-but-recoverable ones to Pro, and escalates only the genuinely frontier-grade work upward. We laid out the orchestrator-and-workers pattern in depth in the Omnigent meta-harness piece, which is the natural home for this kind of cost-aware dispatch. For a parallel cost-math treatment on a different cheap-but-capable model, the GLM-5.2 developer guide runs the same exercise for Z.ai's 1M-context coder.
One more economic axis that frontier models cannot match: V4's weights are MIT licensed. Flash fits on a single high-memory workstation at 4-bit quantization, and Pro runs on a small cluster or a rented high-memory GPU box. For a steady, high-volume internal workload, self-hosting converts a per-token API bill into a fixed hardware cost, and above a certain throughput the fixed cost wins decisively. It also removes the data-egress and privacy questions that block some teams from sending code to any third-party API at all. That optionality has real value even if you never exercise it, because it caps your downside if API pricing moves against you.
DeepSeek V4 did not win the benchmark crown, and it does not need to. What it did was push the cost-quality frontier far enough that "just use the frontier model for everything" stopped being the obviously correct default for cost-sensitive agentic coding. Flash makes the bounded, repetitive parts of an agent loop nearly free. Pro delivers near-mid-frontier coding quality at a quarter to a fourteenth of the output price, with a 1M context window and a self-host escape hatch.
Route by task. Send the cheap, recoverable, high-volume work to V4 and reserve frontier spend for the high-failure-cost, top-of-curve tasks that actually justify it. Do the worked math on your own token profile, validate the quality on your own task distribution, and let the routing layer enforce the discipline. The economics in mid-2026 reward teams that treat model choice as a per-task decision rather than a standing subscription.
On a representative medium feature task (about 600K input tokens with cache-friendly structure and 80K output tokens), V4 Pro costs roughly $0.49 per run versus about $10.00 on a frontier premium tier at reference rates, a roughly 20x difference. Against a mid-tier model like Claude Sonnet, V4 Pro is about 4x cheaper on the same run. Output tokens dominate the bill, and V4 Pro's $3.48 output rate is where most of the savings come from.
V4 Flash costs $0.14 per million input tokens and $0.28 output, with cache hits at $0.0028. V4 Pro costs $1.74 input and $3.48 output, with cache hits at $0.0145. Flash is built for high-volume bounded work like classification and agent inner loops; Pro is for hard reasoning and codebase-scale tasks. Both share a 1M context window and 384K max output.
It scores 63.5 on SWE-bench Verified, which puts it in striking distance of mid-tier frontier coding models and ahead of every other open-weights model. It does not top the leaderboard, so the frontier still wins on the hardest real-codebase tasks. Route V4 the hard-but-recoverable work where you can verify output cheaply, and reserve frontier spend for high-failure-cost tasks.
Route to DeepSeek when the work is bounded and high-volume (Flash) or hard but recoverable with cheap verification and cost pressure at scale (Pro). Stay on a frontier model when the failure cost is high and hard to detect, the task is at the top of the difficulty curve, or you depend on provider-specific tooling. A routing layer that dispatches by task characteristics captures most of the savings.
Yes. The weights are MIT licensed. Flash fits on a single high-memory workstation at 4-bit quantization, and Pro runs on a small cluster or rented high-memory GPU box. For steady high-volume workloads, self-hosting converts a per-token API bill into a fixed hardware cost that wins above a certain throughput, and it removes data-egress and privacy constraints.
Read next
DeepSeek V4 splits into Flash and Pro, ships a 1M context window, and undercuts every closed model on price. Here's how to wire it up with the OpenAI SDK, when to pick it over Claude or GPT, and what changed since V3 and R1.
10 min readDeepSeek V4-Flash costs $0.28 per million output tokens. Fable 5 costs $50. That 178x gap is real - but so is the quality difference. Here is where it matters and where it does not.
7 min readZ.ai shipped GLM-5.2 on June 13 with a usable 1M-token context window, two thinking-effort levels, and MIT open weights coming soon. Here is the setup guide for Claude Code, pricing breakdown, and what to test before the benchmarks arrive.
8 min readTechnical content at the intersection of AI and development. Building with AI agents, Claude Code, and modern dev tools - then showing you exactly how it works.
DeepSeek's open-weights frontier family, previewed April 24, 2026. V4-Pro is 1.6T total / 49B active params; V4-Flash is...
View ToolOpen-source terminal agent runtime with approval modes, rollback snapshots, MCP servers, LSP diagnostics, and a headless...
View ToolOpen-source reasoning models from China. DeepSeek-R1 rivals o1 on math and code benchmarks. V3 for general use. Fully op...
View ToolAnthropic's agentic coding CLI. Runs in your terminal, edits files autonomously, spawns sub-agents, and maintains memory...
View ToolInstall Ollama and LM Studio, pull your first model, and run AI locally for coding, chat, and automation - with zero cloud dependency.
Getting StartedClickable PR link in the footer with review state color coding.
Claude Code
DeepSeek V4 splits into Flash and Pro, ships a 1M context window, and undercuts every closed model on price. Here's how...
DeepSeek V4-Flash costs $0.28 per million output tokens. Fable 5 costs $50. That 178x gap is real - but so is the qualit...

Z.ai shipped GLM-5.2 on June 13 with a usable 1M-token context window, two thinking-effort levels, and MIT open weights...
Databricks open-sourced Omnigent, a meta-harness that sits above individual agent CLIs so your sessions, policies, and s...

Open weights are free to download, but inference is not free to run. Here is the honest break-even math on when self-hos...
A first-hand visit to DeepSeek HQ reveals something more interesting than benchmark scores: a 300-person company that tr...

New tutorials, open-source projects, and deep dives on coding agents - delivered weekly.