
TL;DR
Open weights are free to download, but inference is not free to run. Here is the honest break-even math on when self-hosting GLM-5.2, DeepSeek V4, or Llama beats paying per-token API prices - GPU rental and ownership costs, real throughput, utilization, the crossover in tokens per month, and the hidden ops bill nobody budgets for.
| Source | What it covers |
|---|---|
| CloudZero: H100 GPU cost in 2026 - buy, rent, cloud | H100 purchase and rental pricing |
| IntuitionLabs: H100 rental prices across 15+ providers | Per-hour rental comparison |
| Spheron: GPU cloud pricing comparison 2026 | Cross-provider hourly rates |
| DeepSeek V3/R1 671B throughput benchmarks on 8xH100 | vLLM aggregate and single-stream tokens/sec |
| DeepSeek API pricing | V4 Pro and Flash per-token rates |
| PricePerToken: Llama 4 Maverick (Fireworks/Together) | Hosted open-weights API pricing |
The pitch for self-hosting open-weights models is seductive and a little misleading. The weights are free. You download GLM-5.2, DeepSeek V4, or Llama, point a server at your own GPUs, and stop paying anyone per token forever. No vendor lock-in, no rate limits, no surprise invoice.
The weights are free. The inference is not. The honest question is never "is self-hosting cheaper than the API" - it is "at what volume, at what utilization, with whose ops time, does running your own GPU beat paying per token." That crossover exists, it is computable, and for most teams it sits much higher than the marketing implies.
This post does the math both ways. It is not an argument for self-hosting. It is an argument for knowing your break-even before you buy a GPU.
Last verified: June 17, 2026.
Per-token API. You pay a published rate per million input and output tokens. Cost scales linearly with usage, starts at zero, and includes every hidden thing - the GPUs, the ops team, the idle capacity, the redundancy - baked into the price. Predictable per unit, unbounded in total.
Self-hosting. You pay for compute by the hour (rented) or up front (owned), whether or not a single token flows through it. Cost is dominated by a fixed block of capacity. The marginal cost per token approaches zero, but only if you keep that capacity busy. Cheap per unit at high utilization, brutally expensive per unit when idle.
The entire decision turns on one number that the per-token model hides from you and self-hosting exposes mercilessly: utilization. An idle GPU is the most expensive way there is to run a model.
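As a minimal sketch of the two cost models above, the functions below compute a monthly bill under each and the volume at which they meet. The function names and the example figures (a $12,000 per month capacity block, a $4.00 per 1M token rate) are hypothetical, chosen only to make the shape of the comparison visible.

```python
def api_monthly_cost(tokens: float, rate_per_million: float) -> float:
    """Per-token API: cost scales linearly with usage and starts at zero."""
    return tokens / 1_000_000 * rate_per_million


def self_host_monthly_cost(tokens: float, fixed_monthly: float) -> float:
    """Self-hosting: the fixed block is paid whether or not tokens flow.

    Simplification: marginal cost per token is treated as zero and the
    node's capacity limit is ignored, so `tokens` does not change the bill.
    """
    return fixed_monthly


def crossover_tokens(fixed_monthly: float, rate_per_million: float) -> float:
    """Monthly token volume at which both bills are equal."""
    return fixed_monthly / rate_per_million * 1_000_000


# Hypothetical inputs, for illustration only.
FIXED_MONTHLY = 12_000.0  # USD per month for the capacity block
API_RATE = 4.0            # USD per 1M tokens

for tokens in (0, 500_000_000, 3_000_000_000, 6_000_000_000):
    api = api_monthly_cost(tokens, API_RATE)
    own = self_host_monthly_cost(tokens, FIXED_MONTHLY)
    print(f"{tokens:>13,} tokens  API ${api:>9,.0f}  self-host ${own:>9,.0f}")

print(f"crossover: {crossover_tokens(FIXED_MONTHLY, API_RATE):,.0f} tokens/month")
```

Below the crossover volume the API bill is smaller; above it the fixed block is. Whether a node can actually serve the crossover volume is a separate question that depends on throughput and utilization.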
Renting (on-demand, per GPU-hour), mid-2026:
| GPU | Representative on-demand rate | Notes |
|---|---|---|
| H100 80GB | ~$2.00 to $3.00/hr | Median across neoclouds ~$2.29 to $3.12; hyperscalers run $2 to $8+, Vast.ai marketplace ~$1.87 (IntuitionLabs, Spheron) |
| H200 141GB | ~$4.39/hr | RunPod on-demand; more memory headroom for large MoE weights (Spheron) |
| RTX 4090 / 5090 | ~$0.35 to $1.00/hr | Consumer cards on marketplaces; spot/interruptible can drop lower with risk (Spheron) |
A full 8xH100 node, the unit you need to serve a frontier-class MoE model with real concurrency, therefore lands around $16 to $24 per hour on-demand from a neocloud, which is roughly $11,500 to $17,500 per month if you leave it running 24/7. Reserved and committed contracts cut that meaningfully, but they also lock you into the fixed cost whether you use it or not.
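The node arithmetic above can be sketched in a few lines. The per-GPU rates come from the H100 row of the table; the 30-day month and the function name are assumptions, which is why the results land near, not exactly on, the rounded range in the text.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: a 30-day month running 24/7


def node_monthly_cost(gpu_hourly_rate: float, gpus: int = 8,
                      hours: float = HOURS_PER_MONTH) -> float:
    """Monthly rental bill for a node that is never switched off."""
    return gpu_hourly_rate * gpus * hours


# Per-GPU on-demand rates taken from the H100 row of the table above.
for rate in (2.00, 3.00):
    hourly = rate * 8
    monthly = node_monthly_cost(rate)
    print(f"${rate:.2f}/GPU-hr -> ${hourly:.0f}/hr node -> ${monthly:,.0f}/month")
```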
Owning (street price, mid-2026):
| Hardware | Approximate price | Power draw |
|---|---|---|
| H100 80GB | ~$25,000 to $30,000 per card (CloudZero) | ~700W |
| RTX 5090 | ~$2,000 | ~575W (Yahoo Tech) |
| RTX 4090 | ~$1,600 | ~450W |
An 8xH100 HGX node is a $200,000 to $250,000 capital purchase before you add the chassis, networking, cooling, and a rack to put it in. At ~700W per card plus overhead, eight cards pull on the order of 6 to 8 kW under load - call it $700 to $1,200 a month in electricity alone at typical commercial rates, before you account for cooling and the power-usage overhead of the facility. Ownership only makes sense at high, sustained utilization over a multi-year horizon, and it converts a usage problem into a depreciation-and-datacenter problem.
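A rough sketch of the ownership math, under loudly stated assumptions: the electricity prices ($0.16 and $0.20 per kWh) and the three-year straight-line amortization are hypothetical inputs, not figures from the sources above, and the estimate leaves out the chassis, networking, cooling, and rack.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: 30-day month, node under load 24/7


def electricity_monthly(load_kw: float, usd_per_kwh: float) -> float:
    """Energy bill for a constant load, before cooling and facility overhead."""
    return load_kw * HOURS_PER_MONTH * usd_per_kwh


def owned_monthly_cost(capex_usd: float, amortization_years: float,
                       load_kw: float, usd_per_kwh: float) -> float:
    """Straight-line depreciation plus electricity, per month."""
    depreciation = capex_usd / (amortization_years * 12)
    return depreciation + electricity_monthly(load_kw, usd_per_kwh)


# Low and high cases: (capex, load in kW, assumed USD per kWh).
for capex, load_kw, usd_per_kwh in ((200_000, 6.0, 0.16), (250_000, 8.0, 0.20)):
    power = electricity_monthly(load_kw, usd_per_kwh)
    total = owned_monthly_cost(capex, 3, load_kw, usd_per_kwh)
    print(f"capex ${capex:,}: power ${power:,.0f}/mo, total ${total:,.0f}/mo")
```

The amortization period drives the result far more than the power bill does, which is why ownership only works over a multi-year horizon at sustained utilization.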
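Throughput is the number everyone skips, and it is the one that breaks most self-hosting business cases.
It is not a single figure. It splits into two: single-stream throughput, what one request sees on its own, and aggregate throughput, the total across all requests in flight. For a 671B-class MoE model on 8xH100 with vLLM, published benchmarks show roughly 33 output tokens/sec for a single request and about 620 output tokens/sec aggregate at about 100 concurrent requests.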
The gap between 33 and 620 output tokens/sec is the whole game. You only hit the high number if you keep ~100 requests in flight at once. Serve one user at a time and your expensive node delivers single-stream throughput while costing you the full hourly rate. The per-token economics of self-hosting are entirely a function of batch fullness.
So the realistic capacity of an 8xH100 node at healthy batching is on the order of 620 output tokens/sec sustained, or about 1.6 billion output tokens per month if you run it flat-out 24/7 at full batch. Real workloads never sustain full batch around the clock, which is exactly where utilization assumptions enter.
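The conversion from tokens per second to tokens per month is simple enough to check directly. This sketch assumes a 30-day month; the function name is illustrative.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 2,592,000 seconds in a 30-day month


def monthly_tokens(tokens_per_sec: float, utilization: float = 1.0) -> float:
    """Tokens produced in a month at a given fraction of peak throughput."""
    return tokens_per_sec * SECONDS_PER_MONTH * utilization


aggregate = monthly_tokens(620)      # full batch, around the clock
single_stream = monthly_tokens(33)   # one request at a time, around the clock

print(f"aggregate:     {aggregate / 1e9:.2f}B output tokens/month")
print(f"single-stream: {single_stream / 1e6:.1f}M output tokens/month")
```

The same node, at the same hourly price, produces roughly nineteen times fewer tokens when it serves one request at a time.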
The prices you are trying to beat, per 1M tokens, mid-2026:
| Model (hosted API) | Input | Output | Source |
|---|---|---|---|
| GLM-5.2 (Z.ai) | ~$1.40 | ~$4.40 | GLM-5.2 cost math |
| DeepSeek V4 Pro | ~$1.74 | ~$3.48 | DeepSeek pricing |
| DeepSeek V4 Flash | ~$0.14 | ~$0.28 | DeepSeek pricing |
| Llama 4 Maverick (Fireworks) | ~$0.22 | ~$0.88 | PricePerToken |
Note the spread. The same open weights that you would self-host are also sold by competing providers who already solved batching at scale, bought their GPUs at volume, and amortize ops across thousands of tenants. That is why a model like Llama 4 Maverick or DeepSeek V4 Flash can be served for cents - the hosted API for an open-weights model is often the cheapest way to run that exact model, because someone else is carrying your utilization risk.
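To see the spread in dollars, the sketch below prices one workload against each row of the table. The prices are copied from the table; the workload (300M input and 100M output tokens per month) and the function name are hypothetical.

```python
# Prices in USD per 1M tokens (input, output), copied from the table above.
API_PRICES = {
    "GLM-5.2 (Z.ai)": (1.40, 4.40),
    "DeepSeek V4 Pro": (1.74, 3.48),
    "DeepSeek V4 Flash": (0.14, 0.28),
    "Llama 4 Maverick (Fireworks)": (0.22, 0.88),
}


def api_monthly_bill(model: str, input_tokens: float, output_tokens: float) -> float:
    """Monthly API bill in USD for a given input/output token volume."""
    input_price, output_price = API_PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: 300M input tokens and 100M output tokens per month.
INPUT_TOKENS = 300_000_000
OUTPUT_TOKENS = 100_000_000

for model in API_PRICES:
    bill = api_monthly_bill(model, INPUT_TOKENS, OUTPUT_TOKENS)
    print(f"{model:<30} ${bill:>8,.2f}/month")
```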
Let us make it concrete. Suppose your workload is dominated by output tokens (agentic coding, long generations) and you are choosing between self-hosting a 671B-class MoE model on a rented 8xH100 node and paying DeepSeek V4 Pro's API at ~$3.48 per 1M output tokens.
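Self-hosting cost (rented): an 8xH100 node at roughly $16 to $24 per hour on-demand, or about $11,500 to $17,500 per month running 24/7. The per-token figures below correspond to a bill of roughly $16,000 per month, toward the upper end of that range.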
Capacity at different utilization:
| Avg batch utilization | Effective output tokens/sec | Output tokens/month | Self-host cost per 1M output tokens |
|---|---|---|---|
| 100% (full batch, 24/7) | 620 | ~1.6B | ~$10 |
| 50% | 310 | ~800M | ~$20 |
| 20% | 124 | ~320M | ~$50 |
| 5% (one or two users) | 31 | ~80M | ~$200 |
Set that against the API at $3.48 per 1M output tokens, and the result is uncomfortable: even at 100% batch utilization 24/7, this self-hosted node costs ~$10 per 1M output tokens - nearly 3x the API price.
The reason is not that your math is wrong. It is that the API provider runs the same hardware at scale you cannot match, buys GPUs cheaper, and packs the batch fuller across many customers. To beat $3.48/1M on rented hardware you would need to either drive utilization past what a single tenant can sustain, negotiate reserved pricing far below on-demand, or be serving a model where the API markup is much fatter than DeepSeek's.
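The utilization table can be reproduced with a short script. One input is an assumption: the $16,000 monthly node cost is inferred from the table's own figures (about $10 per 1M at about 1.6B tokens), not quoted from a provider. The 620 tokens/sec peak and the $3.48 API price come from the text.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600   # assumption: 30-day month

NODE_MONTHLY_USD = 16_000.0          # assumption inferred from the table above
PEAK_OUTPUT_TOK_PER_SEC = 620        # aggregate throughput at full batch
API_USD_PER_MILLION = 3.48           # DeepSeek V4 Pro output price


def self_host_cost_per_million(utilization: float,
                               node_monthly: float = NODE_MONTHLY_USD,
                               peak_tok_per_sec: float = PEAK_OUTPUT_TOK_PER_SEC) -> float:
    """USD per 1M output tokens when the node runs at a fraction of peak."""
    tokens = peak_tok_per_sec * SECONDS_PER_MONTH * utilization
    return node_monthly / (tokens / 1_000_000)


for utilization in (1.00, 0.50, 0.20, 0.05):
    cost = self_host_cost_per_million(utilization)
    ratio = cost / API_USD_PER_MILLION
    print(f"{utilization:>4.0%}  ${cost:7.2f} per 1M  ({ratio:.1f}x the API price)")
```

Because the node cost is fixed, the per-token cost is inversely proportional to utilization: halving the batch fullness doubles the price of every token.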
Where self-hosting actually wins: flip the comparison against an expensive frontier API. If the alternative is a closed model at, say, $15 to $30 per 1M output tokens, then a self-hosted open-weights node at $10 to $20/1M (high utilization) crosses into the black. The break-even is not "self-hosting vs the API" in the abstract - it is "self-hosting an open-weights model vs paying premium frontier rates for comparable quality, at volume high enough to keep the node busy." That is a real and growing scenario, which is exactly why the orchestration and routing layer has become the place the margin moves to.
The rough rule of thumb: self-hosting starts to pencil out only when (a) your sustained volume reliably fills the batch, (b) the API you are replacing is a premium-priced model rather than a cheap open-weights host, and (c) you can amortize the ops cost across that volume. Miss any one and the API wins.
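The rule of thumb can be turned around into a single question: what utilization would the node need to match a given API price? The sketch below answers it under the same assumptions as the worked example (a roughly $16,000 per month node, 620 output tokens/sec at full batch, a 30-day month). The `ops_monthly` parameter and its $5,000 value are hypothetical placeholders for whatever your ops bill turns out to be.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: 30-day month


def breakeven_utilization(node_monthly: float, api_usd_per_million: float,
                          peak_tok_per_sec: float, ops_monthly: float = 0.0) -> float:
    """Fraction of peak throughput needed to match the API price.

    A result above 1.0 means the node cannot break even at any utilization.
    """
    capacity_millions = peak_tok_per_sec * SECONDS_PER_MONTH / 1_000_000
    return (node_monthly + ops_monthly) / (api_usd_per_million * capacity_millions)


NODE_MONTHLY = 16_000.0  # assumption, matches the worked example
PEAK = 620               # output tokens/sec at full batch

for api_price in (3.48, 15.0, 30.0):
    bare = breakeven_utilization(NODE_MONTHLY, api_price, PEAK)
    with_ops = breakeven_utilization(NODE_MONTHLY, api_price, PEAK, ops_monthly=5_000.0)
    print(f"API ${api_price:>5.2f}/1M: break-even at {bare:.0%} utilization, "
          f"{with_ops:.0%} with a hypothetical $5,000/month ops cost")
```

Against a cheap open-weights host the required utilization exceeds 100%, so no amount of traffic closes the gap; against a premium-priced API it drops into a range a busy single tenant might sustain.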
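The clean comparison above already understates the cost of self-hosting, because the fixed cost is never just the GPU rental. The line items that do not appear on the GPU invoice include the ops time, the idle capacity, and the redundancy that an API price already bakes in.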
None of these are hypothetical. They are the difference between the spreadsheet break-even and the real one, and they all push the crossover point higher.
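Self-hosting is not always the wrong call. The honest cases are narrow: high, steady, batch-filling volume that replaces a premium-priced frontier model, or compliance and latency requirements that override cost.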
For everyone else - which is most teams, most of the time - the right move is the boring one: use the hosted open-weights API, route cheap traffic to cheap models, and reserve self-hosting for the narrow band where the math truly closes.
And whichever side you land on, put spend guardrails in place. Self-hosting caps your token cost but uncaps your ops and idle cost. The API uncaps your token cost but caps everything else. Both can run away from you without controls.
Self-hosting is not automatically cheaper than an API. For most teams it is more expensive once you account for utilization and ops. A hosted API for an open-weights model is often the cheapest option because the provider keeps the batch full at a scale a single tenant cannot match. Self-hosting wins mainly when you have high, steady, batch-filling volume replacing a premium-priced frontier model, or when compliance and latency requirements override cost.
There is no universal break-even volume - it depends on your batch utilization and which API you are replacing. As a rule of thumb, self-hosting only pencils out when you can keep roughly 50 to 100 concurrent requests in flight most of the time and the API you are replacing is a premium model priced well above cheap open-weights hosts like DeepSeek V4 Flash or Llama 4 Maverick.
For a 671B-class MoE model on 8xH100 with vLLM, published benchmarks show roughly 33 output tokens/sec for a single request, rising to about 620 output tokens/sec aggregate (around 3,000 total tokens/sec including input) at about 100 concurrent requests using 4-bit quantization. You only get the high number at high concurrency.
In mid-2026, an H100 rents for roughly $2 to $3 per GPU-hour on-demand from neoclouds, and costs roughly $25,000 to $30,000 to buy. An 8xH100 node is around $11,500 to $17,500/month rented 24/7, or a $200,000+ capital purchase plus power (each card draws ~700W) and datacenter costs to own.
Hosted open-weights APIs are cheap because the same open weights are served by multiple competing providers who bought GPUs at volume, solved high-utilization batching at scale, and amortize ops across thousands of tenants. They carry the utilization risk for you, which is why models like DeepSeek V4 Flash ($0.14/$0.28 per 1M tokens) or Llama 4 Maverick ($0.22/$0.88) sell for cents per million tokens.