
TL;DR
DeepSeek V4 splits into Flash and Pro, ships a 1M context window, and undercuts every closed model on price. Here's how to wire it up with the OpenAI SDK, when to pick it over Claude or GPT, and what changed since V3 and R1.
DeepSeek dropped V4 over the weekend, and the rollout is bigger than V3's was. Instead of a single flagship plus a reasoning sibling, V4 ships as a family: DeepSeek V4 Flash for everyday work and DeepSeek V4 Pro for the heavy lifting. Both models fold reasoning into the same checkpoint, both stretch the context window to one million tokens, and both undercut every closed frontier model on price by roughly an order of magnitude.
If you were already running DeepSeek R1 or V3 in production, V4 is a drop-in upgrade with one config change. If you were on Claude or GPT for cost-sensitive workloads, V4 is the model that finally makes the switch worth running the numbers on. We covered the launch on the channel in DeepSeek v4 in 4 Minutes, but the four-minute version skips the parts that matter when you actually wire it into an app. This is the longer take.
DeepSeek collapsed the old deepseek-chat and deepseek-reasoner endpoints into a single API surface that splits on model tier instead of on whether reasoning is on or off. Reasoning is now a runtime parameter, not a separate model.
Flash is the small, fast tier. The model card on Hugging Face lists it at 158B total parameters with a smaller active footprint per token. It is built for high-throughput, latency-sensitive work: chat UIs, autocomplete, classifiers, RAG retrieval rerankers, agent inner loops. The full chain-of-thought trace is available if you ask for it, but Flash defaults to non-thinking mode, which keeps response times in the same ballpark as the older deepseek-chat.
Flash also gets the legacy aliases. If your code is still pointed at deepseek-chat or deepseek-reasoner, those names will keep resolving until 24 July 2026, both backed by V4 Flash with thinking off and on respectively. Migrate when you have a quiet afternoon.
Pro is the new flagship. The base checkpoint weighs 1.6T parameters, with the released instruction-tuned model at 862B total. This is the model you pull out for hard reasoning: long-horizon coding tasks, multi-step planning, dense math, agent workloads where the model has to keep its own state across many tool calls. It is slower than Flash and several times more expensive, but still cheaper than Claude Sonnet 4 or GPT-5 for the same task.
Both tiers share a 1M token context length and a maximum output of 384K tokens. The 384K output number is the one nobody else is matching right now. If you are doing long-form generation, codebase rewrites, or full-document translations, that headroom is the difference between one call and a stitched chain.
Here is the current API pricing per million tokens, taken from the live docs as of this morning.
| Model | Cache hit input | Cache miss input | Output |
|---|---|---|---|
| DeepSeek V4 Flash | $0.0028 | $0.14 | $0.28 |
| DeepSeek V4 Pro (launch discount) | $0.003625 | $0.435 | $0.87 |
| DeepSeek V4 Pro (full price) | $0.0145 | $1.74 | $3.48 |
| Claude Sonnet 4 (reference) | $0.30 | $3.00 | $15.00 |
| GPT-5 (reference) | ~$0.40 | ~$2.50 | ~$10.00 |
A few things worth flagging.
The cache hit price on Flash is $0.0028 per million input tokens. That is not a typo. DeepSeek dropped the cache hit price to one tenth of the launch number on 26 April, and Flash is now the cheapest serious model to call repeatedly with stable system prompts. Build with cache-friendly prompt structure and your input bill effectively disappears.
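What cache-friendly looks like in practice, as a rough sketch: keep the long, stable content (system prompt, few-shot examples, tool schemas) at the front of every request and put the per-request content at the end. This assumes V4 keeps the automatic prefix-based context caching DeepSeek has offered since the V3-era API; the support-triage prompt and helper below are made up for illustration.

```python
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["DEEPSEEK_API_KEY"], base_url="https://api.deepseek.com")

# Identical on every call, so after the first request it should bill at the
# cache-hit rate (assuming V4 keeps DeepSeek's automatic prefix caching).
STATIC_SYSTEM_PROMPT = """You are a support triage assistant for Acme Corp.
... several thousand tokens of policy, routing rules, and worked examples ...
"""

def triage(ticket_text: str) -> str:
    # Stable prefix first, variable content last: prefix caching only matches
    # from the start of the prompt, so anything per-request goes at the end.
    response = client.chat.completions.create(
        model="deepseek-v4-flash",
        messages=[
            {"role": "system", "content": STATIC_SYSTEM_PROMPT},
            {"role": "user", "content": ticket_text},
        ],
    )
    return response.choices[0].message.content
```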
V4 Pro is running at a 75 percent launch discount through 31 May 2026. After that the price quadruples. If you are evaluating Pro, evaluate it now. The full-price column is the long-term number you should be modelling against.
The DeepSeek API speaks the OpenAI Chat Completions dialect, plus an Anthropic-compatible endpoint if you prefer that SDK. The cleanest path is the OpenAI Python SDK with a custom base_url.
```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com",
)

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[
        {"role": "system", "content": "You are a senior backend engineer."},
        {"role": "user", "content": "Write a FastAPI endpoint that streams SSE events from a Postgres LISTEN/NOTIFY channel."},
    ],
    stream=True,
)

for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```
That is the whole integration. Swap deepseek-v4-flash for deepseek-v4-pro when you need the bigger model. The TypeScript SDK is identical with the obvious syntax changes.
Flash defaults to non-thinking. To get the reasoning trace, pass an extra body parameter. The current docs use thinking on the request payload.
```python
response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Find the bug: <code snippet>"}],
    extra_body={"thinking": {"enabled": True}},
)

reasoning = response.choices[0].message.reasoning_content
answer = response.choices[0].message.content
```
The thinking trace comes back on reasoning_content, the final answer on content. Same shape as the OpenAI o-series response, which means most agent frameworks already know how to read it.
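If you stream with thinking enabled, the trace should show up on the streamed deltas under the same field. That is how the old deepseek-reasoner behaved, so this sketch assumes V4 Flash does the same and reads the field defensively with getattr.

```python
stream = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Find the bug: <code snippet>"}],
    extra_body={"thinking": {"enabled": True}},
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta
    # Assumption: V4 streams the trace on delta.reasoning_content the way
    # deepseek-reasoner did; getattr avoids an AttributeError if it does not.
    thought = getattr(delta, "reasoning_content", None)
    if thought:
        print(thought, end="", flush=True)
    if delta.content:
        print(delta.content, end="", flush=True)
```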
Tool calling works the same way it does on OpenAI. Pass a tools array with JSON schema, the model returns tool_calls, you execute and feed results back in. There is one wrinkle: V4 Pro with thinking enabled produces a noticeably better tool plan than V4 Flash on multi-step agent tasks. If your agent is making the wrong tool choice, that is the first thing to flip.
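A minimal round trip looks like this, using the standard Chat Completions tool-calling shape. The get_open_issues tool and its hard-coded result are hypothetical stand-ins for whatever your agent actually executes.

```python
import json

tools = [{
    "type": "function",
    "function": {
        "name": "get_open_issues",
        "description": "Count open issues for a GitHub repository.",
        "parameters": {
            "type": "object",
            "properties": {"repo": {"type": "string", "description": "owner/name"}},
            "required": ["repo"],
        },
    },
}]

messages = [{"role": "user", "content": "How many open issues does vercel/next.js have?"}]

# First pass: the model decides whether to call the tool.
first = client.chat.completions.create(model="deepseek-v4-pro", messages=messages, tools=tools)
call = first.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)

# Execute the tool yourself; this result is a stand-in for a real API call.
result = {"repo": args["repo"], "open_issues": 123}

# Second pass: feed the tool result back in and let the model answer.
messages.append(first.choices[0].message)
messages.append({"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)})
second = client.chat.completions.create(model="deepseek-v4-pro", messages=messages, tools=tools)
print(second.choices[0].message.content)
```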
DeepSeek published their own benchmark numbers with the launch. I have not yet seen independent third-party verification, so treat these as the vendor's view. They line up with the early community reports on Hugging Face.
| Benchmark | V3 (Mar 2025) | R1 (Jan 2025) | V4 Flash | V4 Pro |
|---|---|---|---|---|
| MMLU-Pro | 75.9 | 84.0 | 81.4 | 88.7 |
| MATH-500 | 90.2 | 97.3 | 95.8 | 98.4 |
| AIME 2024 | 39.6 | 79.8 | 74.2 | 86.1 |
| GPQA Diamond | 59.1 | 71.5 | 70.8 | 78.3 |
| LiveCodeBench | 40.5 | 65.9 | 67.4 | 74.9 |
| SWE-bench Verified | 42.0 | 49.2 | 54.7 | 63.5 |
The pattern is what you would expect. V4 Flash matches or slightly beats R1 on reasoning while running closer to V3's latency. V4 Pro is a clear step up on every axis, with the SWE-bench number being the headline. 63.5 on SWE-bench Verified puts Pro in striking distance of Claude Sonnet 4 and ahead of every other open model. For an open-weights checkpoint you can host yourself, that is a genuinely new thing.
This is the question I get most. There is no universal answer, but the heuristics are clearer with V4 than they were with R1.
Reach for V4 Flash when: you are running a high-volume workload, the prompts repeat enough to benefit from caching, latency matters more than the last few percent of quality, and your task is bounded enough that a cheap model will not embarrass you. Examples: classification, structured extraction, RAG synthesis over retrieved chunks, first-pass code review, customer support drafting.
Reach for V4 Pro when: the task is hard, the failure cost is high, and you want frontier reasoning at a fraction of the closed-model price. Examples: codebase-scale refactors, multi-step agent loops, technical writing where the model has to integrate many sources, math and scientific work, anything that benefits from the 1M context window.
Stay on Claude when: you are doing long agentic coding sessions, the work involves Anthropic-specific tooling like Computer Use or the Claude Code SDK, or you need the absolute best result on SWE-bench-style real codebase work. Claude Sonnet 4 still has the edge there, and Opus opens a wider gap.
Stay on GPT when: you are deep into the OpenAI ecosystem, using Assistants, the Realtime API, or function calling features that have not been mirrored elsewhere yet, or running on Azure where DeepSeek is not first-class.
Run V4 locally when: you have the hardware and the privacy constraint. Flash will fit on a single high-memory workstation at 4-bit quantization. Pro needs a small cluster or a rented H200 box, but the weights are MIT licensed and the inference stack is the same one that already runs V3.
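Local serving keeps the client code identical; only the base_url changes. The sketch below points the same OpenAI SDK at a local OpenAI-compatible server (Ollama and vLLM both expose one); the port and model tag are placeholders for whatever your own setup registers the V4 Flash weights under.

```python
from openai import OpenAI

# Ollama serves an OpenAI-compatible endpoint on :11434; vLLM defaults to :8000.
# Both the port and the model tag below are placeholders for your own setup.
local = OpenAI(api_key="unused", base_url="http://localhost:11434/v1")

response = local.chat.completions.create(
    model="deepseek-v4-flash",  # placeholder tag for the locally pulled weights
    messages=[{"role": "user", "content": "Summarize this incident report: ..."}],
)
print(response.choices[0].message.content)
```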
We have been covering DeepSeek on the channel since R1 dropped in January 2025. Every new release has been an excuse to redo the cost math for DD Empire's internal tooling, and V4 is the first time the answer has been "move everything." The pricing on Flash makes it the new default for any internal automation that was on gpt-4o-mini or claude-haiku. The 1M context on Pro means the long-context jobs that previously required Gemini are now back on the table for a single provider.
The honest thing to say is that DeepSeek keeps shipping faster than the closed labs. V3 closed the gap, R1 forced the o1 rewrite, and V4 has reset the price floor for the third time in eighteen months. If your stack is closed-only, your next quarter is going to involve a serious build-vs-buy conversation whether you wanted one or not.
For the four-minute video version of this launch, see DeepSeek v4 in 4 Minutes on the Developers Digest YouTube channel. For the older context, the DeepSeek R1 and V3 Developer Guide walks through the architecture and the local deployment story that V4 inherits. For the broader question of where this fits in the 2026 tooling landscape, the AI Coding Tools Pricing 2026 and Best AI Coding Tools 2026 posts have the comparison tables.
Three concrete things worth a Saturday.
DeepSeek shipped a release that is genuinely worth a workflow change. The next one will probably be along in three months. Build for that.