
TL;DR
DeepSeek V4 splits into Flash and Pro, ships a 1M context window, and undercuts every closed model on price. Here's how to wire it up with the OpenAI SDK, when to pick it over Claude or GPT, and what changed since V3 and R1.
DeepSeek dropped V4 over the weekend, and the rollout is bigger than V3's was. Instead of a single flagship plus a reasoning sibling, V4 ships as a family: DeepSeek V4 Flash for everyday work and DeepSeek V4 Pro for the heavy lifting. Both models fold reasoning into the same checkpoint, both stretch the context window to one million tokens, and both undercut every closed frontier model on price by roughly an order of magnitude.
If you were already running DeepSeek R1 or V3 in production, V4 is a drop-in upgrade with one config change. If you were on Claude or GPT for cost-sensitive workloads, V4 is the model that finally makes the switch worth running the numbers on. We covered the launch on the channel in DeepSeek v4 in 4 Minutes, but the four-minute version skips the parts that matter when you actually wire it into an app. This is the longer take.
DeepSeek collapsed the old deepseek-chat and deepseek-reasoner endpoints into a single API surface that splits on model tier instead of on whether reasoning is on or off. Reasoning is now a runtime parameter, not a separate model.
Flash is the small, fast tier. The model card on Hugging Face lists it at 158B total parameters with a smaller active footprint per token. It is built for high-throughput, latency-sensitive work: chat UIs, autocomplete, classifiers, RAG retrieval rerankers, agent inner loops. The full chain-of-thought trace is available if you ask for it, but Flash defaults to non-thinking mode, which keeps response times in the same ballpark as the older deepseek-chat.
Flash also gets the legacy aliases. If your code is still pointed at deepseek-chat or deepseek-reasoner, those names will keep resolving until 24 July 2026, both backed by V4 Flash with thinking off and on respectively. Migrate when you have a quiet afternoon.
Pro is the new flagship. The base checkpoint weighs 1.6T parameters, with the released instruction-tuned model at 862B total. This is the model you pull out for hard reasoning: long-horizon coding tasks, multi-step planning, dense math, agent workloads where the model has to keep its own state across many tool calls. It is slower than Flash and several times more expensive, but still cheaper than Claude Sonnet 4 or GPT-5 for the same task.
Both tiers share a 1M token context length and a maximum output of 384K tokens. The 384K output number is the one nobody else is matching right now. If you are doing long-form generation, codebase rewrites, or full-document translations, that headroom is the difference between one call and a stitched chain.
Here is the current API pricing per million tokens, taken from the live docs as of this morning.
| Model | Cache hit input | Cache miss input | Output |
|---|---|---|---|
| DeepSeek V4 Flash | $0.0028 | $0.14 | $0.28 |
| DeepSeek V4 Pro (launch discount) | $0.003625 | $0.435 | $0.87 |
| DeepSeek V4 Pro (full price) | $0.0145 | $1.74 | $3.48 |
| Claude Sonnet 4 (reference) | $0.30 | $3.00 | $15.00 |
| GPT-5 (reference) | ~$0.40 | ~$2.50 | ~$10.00 |
A few things worth flagging.
The cache hit price on Flash is $0.0028 per million input tokens. That is not a typo. DeepSeek dropped the cache hit price to one tenth of the launch number on 26 April, and Flash is now the cheapest serious model to call repeatedly with stable system prompts. Build with cache-friendly prompt structure and your input bill effectively disappears.
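What cache-friendly looks like in practice, as a rough sketch: keep the long, stable content (system prompt, few-shot examples, tool schemas) at the front of every request and put the per-request content at the end. This assumes V4 keeps the automatic prefix-based context caching DeepSeek has offered since the V3-era API; the support-triage prompt and helper below are made up for illustration.

```python
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["DEEPSEEK_API_KEY"], base_url="https://api.deepseek.com")

# Identical on every call, so after the first request it should bill at the
# cache-hit rate (assuming V4 keeps DeepSeek's automatic prefix caching).
STATIC_SYSTEM_PROMPT = """You are a support triage assistant for Acme Corp.
... several thousand tokens of policy, routing rules, and worked examples ...
"""

def triage(ticket_text: str) -> str:
    # Stable prefix first, variable content last: prefix caching only matches
    # from the start of the prompt, so anything per-request goes at the end.
    response = client.chat.completions.create(
        model="deepseek-v4-flash",
        messages=[
            {"role": "system", "content": STATIC_SYSTEM_PROMPT},
            {"role": "user", "content": ticket_text},
        ],
    )
    return response.choices[0].message.content
```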
V4 Pro is running at a 75 percent launch discount through 31 May 2026. After that the price quadruples. If you are evaluating Pro, evaluate it now. The full-price column is the long-term number you should be modelling against.
The DeepSeek API speaks the OpenAI Chat Completions dialect, plus an Anthropic-compatible endpoint if you prefer that SDK. The cleanest path is the OpenAI Python SDK with a custom base_url.
```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com",
)

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[
        {"role": "system", "content": "You are a senior backend engineer."},
        {"role": "user", "content": "Write a FastAPI endpoint that streams SSE events from a Postgres LISTEN/NOTIFY channel."},
    ],
    stream=True,
)

for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```
That is the whole integration. Swap deepseek-v4-flash for deepseek-v4-pro when you need the bigger model. The TypeScript SDK is identical with the obvious syntax changes.
Flash defaults to non-thinking. To get the reasoning trace, pass an extra body parameter. The current docs use thinking on the request payload.
```python
response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Find the bug: <code snippet>"}],
    extra_body={"thinking": {"enabled": True}},
)

reasoning = response.choices[0].message.reasoning_content
answer = response.choices[0].message.content
```
The thinking trace comes back on reasoning_content, the final answer on content. Same shape as the OpenAI o-series response, which means most agent frameworks already know how to read it.
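If you stream with thinking enabled, the trace should show up on the streamed deltas under the same field. That is how the old deepseek-reasoner behaved, so this sketch assumes V4 Flash does the same and reads the field defensively with getattr.

```python
stream = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Find the bug: <code snippet>"}],
    extra_body={"thinking": {"enabled": True}},
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta
    # Assumption: V4 streams the trace on delta.reasoning_content the way
    # deepseek-reasoner did; getattr avoids an AttributeError if it does not.
    thought = getattr(delta, "reasoning_content", None)
    if thought:
        print(thought, end="", flush=True)
    if delta.content:
        print(delta.content, end="", flush=True)
```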
Tool calling works the same way it does on OpenAI. Pass a tools array with JSON schema, the model returns tool_calls, you execute and feed results back in. There is one wrinkle: V4 Pro with thinking enabled produces a noticeably better tool plan than V4 Flash on multi-step agent tasks. If your agent is making the wrong tool choice, that is the first thing to flip.
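A minimal round trip looks like this, using the standard Chat Completions tool-calling shape. The get_open_issues tool and its hard-coded result are hypothetical stand-ins for whatever your agent actually executes.

```python
import json

tools = [{
    "type": "function",
    "function": {
        "name": "get_open_issues",
        "description": "Count open issues for a GitHub repository.",
        "parameters": {
            "type": "object",
            "properties": {"repo": {"type": "string", "description": "owner/name"}},
            "required": ["repo"],
        },
    },
}]

messages = [{"role": "user", "content": "How many open issues does vercel/next.js have?"}]

# First pass: the model decides whether to call the tool.
first = client.chat.completions.create(model="deepseek-v4-pro", messages=messages, tools=tools)
call = first.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)

# Execute the tool yourself; this result is a stand-in for a real API call.
result = {"repo": args["repo"], "open_issues": 123}

# Second pass: feed the tool result back in and let the model answer.
messages.append(first.choices[0].message)
messages.append({"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)})
second = client.chat.completions.create(model="deepseek-v4-pro", messages=messages, tools=tools)
print(second.choices[0].message.content)
```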
DeepSeek published their own benchmark numbers with the launch. I have not yet seen independent third-party verification, so treat these as the vendor's view. They line up with the early community reports on Hugging Face.
| Benchmark | V3 (Mar 2025) | R1 (Jan 2025) | V4 Flash | V4 Pro |
|---|---|---|---|---|
| MMLU-Pro | 75.9 | 84.0 | 81.4 | 88.7 |
| MATH-500 | 90.2 | 97.3 | 95.8 | 98.4 |
| AIME 2024 | 39.6 | 79.8 | 74.2 | 86.1 |
| GPQA Diamond | 59.1 | 71.5 | 70.8 | 78.3 |
| LiveCodeBench | 40.5 | 65.9 | 67.4 | 74.9 |
| SWE-bench Verified | 42.0 | 49.2 | 54.7 | 63.5 |
The pattern is what you would expect. V4 Flash matches or slightly beats R1 on reasoning while running closer to V3's latency. V4 Pro is a clear step up on every axis, with the SWE-bench number being the headline. 63.5 on SWE-bench Verified puts Pro in striking distance of Claude Sonnet 4 and ahead of every other open model. For an open-weights checkpoint you can host yourself, that is a genuinely new thing.
This is the question I get most. There is no universal answer, but the heuristics are clearer with V4 than they were with R1.
Reach for V4 Flash when: you are running a high-volume workload, the prompts repeat enough to benefit from caching, latency matters more than the last few percent of quality, and your task is bounded enough that a cheap model will not embarrass you. Examples: classification, structured extraction, RAG synthesis over retrieved chunks, first-pass code review, customer support drafting.
Reach for V4 Pro when: the task is hard, the failure cost is high, and you want frontier reasoning at a fraction of the closed-model price. Examples: codebase-scale refactors, multi-step agent loops, technical writing where the model has to integrate many sources, math and scientific work, anything that benefits from the 1M context window.
Stay on Claude when: you are doing long agentic coding sessions, the work involves Anthropic-specific tooling like Computer Use or the Claude Code SDK, or you need the absolute best result on SWE-bench-style real codebase work. Claude Sonnet 4 still has the edge there, and Opus opens a wider gap.
Stay on GPT when: you are deep into the OpenAI ecosystem, using Assistants, the Realtime API, or function calling features that have not been mirrored elsewhere yet, or running on Azure where DeepSeek is not first-class.
Run V4 locally when: you have the hardware and the privacy constraint. Flash will fit on a single high-memory workstation at 4-bit quantization. Pro needs a small cluster or a rented H200 box, but the weights are MIT licensed and the inference stack is the same one that already runs V3.
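Local serving keeps the client code identical; only the base_url changes. The sketch below points the same OpenAI SDK at a local OpenAI-compatible server (Ollama and vLLM both expose one); the port and model tag are placeholders for whatever your own setup registers the V4 Flash weights under.

```python
from openai import OpenAI

# Ollama serves an OpenAI-compatible endpoint on :11434; vLLM defaults to :8000.
# Both the port and the model tag below are placeholders for your own setup.
local = OpenAI(api_key="unused", base_url="http://localhost:11434/v1")

response = local.chat.completions.create(
    model="deepseek-v4-flash",  # placeholder tag for the locally pulled weights
    messages=[{"role": "user", "content": "Summarize this incident report: ..."}],
)
print(response.choices[0].message.content)
```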
We have been covering DeepSeek on the channel since R1 dropped in January 2025. Every new release has been an excuse to redo the cost math for DD Empire's internal tooling, and V4 is the first time the answer has been "move everything." The pricing on Flash makes it the new default for any internal automation that was on gpt-4o-mini or claude-haiku. The 1M context on Pro means the long-context jobs that previously required Gemini are now back on the table for a single provider.
The honest thing to say is that DeepSeek keeps shipping faster than the closed labs. V3 closed the gap, R1 forced the o1 rewrite, and V4 has reset the price floor for the third time in eighteen months. If your stack is closed-only, your next quarter is going to involve a serious build-vs-buy conversation whether you wanted one or not.
For the four-minute video version of this launch, see DeepSeek v4 in 4 Minutes on the Developers Digest YouTube channel. For the older context, the DeepSeek R1 and V3 Developer Guide walks through the architecture and the local deployment story that V4 inherits. For the broader question of where this fits in the 2026 tooling landscape, the AI Coding Tools Pricing 2026 and Best AI Coding Tools 2026 posts have the comparison tables.
Three concrete things worth a Saturday.
DeepSeek shipped a release that is genuinely worth a workflow change. The next one will probably be along in three months. Build for that.