Self-Hosting Open-Weights Models: The Real Break-Even Math

Official Sources#

Source	What it covers
CloudZero: H100 GPU cost in 2026 - buy, rent, cloud	H100 purchase and rental pricing
IntuitionLabs: H100 rental prices across 15+ providers	Per-hour rental comparison
Spheron: GPU cloud pricing comparison 2026	Cross-provider hourly rates
DeepSeek V3/R1 671B throughput benchmarks on 8xH100	vLLM aggregate and single-stream tokens/sec
DeepSeek API pricing	V4 Pro and Flash per-token rates
PricePerToken: Llama 4 Maverick (Fireworks/Together)	Hosted open-weights API pricing

The pitch for self-hosting open-weights models is seductive and a little misleading. The weights are free. You download GLM-5.2, DeepSeek V4, or Llama, point a server at your own GPUs, and stop paying anyone per token forever. No vendor lock-in, no rate limits, no surprise invoice.

The weights are free. The inference is not. The honest question is never "is self-hosting cheaper than the API" - it is "at what volume, at what utilization, with whose ops time, does running your own GPU beat paying per token." That crossover exists, it is computable, and for most teams it sits much higher than the marketing implies.

This post does the math both ways. It is not an argument for self-hosting. It is an argument for knowing your break-even before you buy a GPU.

Last verified: June 17, 2026.

The Two Cost Models You Are Comparing#

Per-token API. You pay a published rate per million input and output tokens. Cost scales linearly with usage, starts at zero, and includes every hidden thing - the GPUs, the ops team, the idle capacity, the redundancy - baked into the price. Predictable per unit, unbounded in total.

Self-hosting. You pay for compute by the hour (rented) or up front (owned), whether or not a single token flows through it. Cost is dominated by a fixed block of capacity. The marginal cost per token approaches zero, but only if you keep that capacity busy. Cheap per unit at high utilization, brutally expensive per unit when idle.

The entire decision turns on one number that the per-token model hides from you and self-hosting exposes mercilessly: utilization. An idle GPU is the most expensive way there is to run a model.

What the GPUs Actually Cost#

Renting (on-demand, per GPU-hour), mid-2026:

GPU	Representative on-demand rate	Notes
H100 80GB	~$2.00 to $3.00/hr	Median across neoclouds ~$2.29 to $3.12; hyperscalers run $2 to $8+, Vast.ai marketplace ~$1.87 (IntuitionLabs, Spheron)
H200 141GB	~$4.39/hr	RunPod on-demand; more memory headroom for large MoE weights (Spheron)
RTX 4090 / 5090	~$0.35 to $1.00/hr	Consumer cards on marketplaces; spot/interruptible can drop lower with risk (Spheron)

A full 8xH100 node, the unit you need to serve a frontier-class MoE model with real concurrency, therefore lands around $16 to $24 per hour on-demand from a neocloud, which is roughly $11,500 to $17,500 per month if you leave it running 24/7. Reserved and committed contracts cut that meaningfully, but they also lock you into the fixed cost whether you use it or not.
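The monthly figure is just the hourly rate carried through a month of uptime. A minimal sketch of that arithmetic, assuming a 30-day month and using the $2 to $3 per GPU-hour range from the table above (the function name is illustrative, not from any provider's tooling):

```python
HOURS_PER_MONTH = 24 * 30  # assumption: 30-day month, node never shut down


def node_monthly_cost(rate_per_gpu_hour: float, gpus: int = 8) -> float:
    """Monthly rental for a node billed per GPU-hour and left running 24/7."""
    return rate_per_gpu_hour * gpus * HOURS_PER_MONTH


low = node_monthly_cost(2.00)   # 8 GPUs at $2/hr each = $16/hr for the node
high = node_monthly_cost(3.00)  # 8 GPUs at $3/hr each = $24/hr for the node
print(f"8xH100 on-demand, 24/7: ${low:,.0f} to ${high:,.0f} per month")
```

Under these assumptions the range comes out at $11,520 to $17,280, which is where the rounded $11,500 to $17,500 figure comes from. The bill is the same whether the node served a billion tokens or none.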

Owning (street price, mid-2026):

Hardware	Approximate price	Power draw
H100 80GB	~$25,000 to $30,000 per card (CloudZero)	~700W
RTX 5090	~$2,000	~575W (Yahoo Tech)
RTX 4090	~$1,600	~450W

An 8xH100 HGX node is a $200,000 to $250,000 capital purchase before you add the chassis, networking, cooling, and a rack to put it in. At ~700W per card plus overhead, eight cards pull on the order of 6 to 8 kW under load - call it $700 to $1,200 a month in electricity alone at typical commercial rates, before you account for cooling and the power-usage overhead of the facility. Ownership only makes sense at high, sustained utilization over a multi-year horizon, and it converts a usage problem into a depreciation-and-datacenter problem.
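The electricity estimate can be sanity-checked by working backwards to the rate it implies. This is a rough sketch, assuming a 30-day month and constant load; it uses only the 6 to 8 kW and $700 to $1,200 figures quoted above, and the helper name is hypothetical:

```python
HOURS_PER_MONTH = 24 * 30  # assumption: 30-day month at constant load


def implied_rate_per_kwh(monthly_bill_usd: float, load_kw: float) -> float:
    """Electricity rate ($/kWh) implied by a monthly bill at a steady load."""
    kwh_per_month = load_kw * HOURS_PER_MONTH
    return monthly_bill_usd / kwh_per_month


low = implied_rate_per_kwh(700, 6)     # low end: 6 kW, $700/month
high = implied_rate_per_kwh(1200, 8)   # high end: 8 kW, $1,200/month
print(f"Implied rate: ${low:.3f} to ${high:.3f} per kWh")
```

If your facility's rate sits well outside that implied band, scale the power line item accordingly before comparing against rental.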

What the GPUs Actually Produce#

This is the number everyone skips, and it is the one that breaks most self-hosting business cases.

Throughput is not a single figure. It splits into two:

Single-stream throughput - how fast one request generates tokens. For a 671B-class MoE model on an 8xH100 node, published vLLM benchmarks put this around 33 output tokens/sec (DeepSeek 671B benchmark).
Aggregate batched throughput - total tokens/sec across all concurrent requests. The same benchmark peaks around 3,000 total tokens/sec (roughly 620 output tokens/sec) at about 100 concurrent requests, using 4-bit quantization on 8xH100.

The gap between 33 and 620 output tokens/sec is the whole game. You only hit the high number if you keep ~100 requests in flight at once. Serve one user at a time and your expensive node delivers single-stream throughput while costing you the full hourly rate. The per-token economics of self-hosting are entirely a function of batch fullness.

So the realistic capacity of an 8xH100 node at healthy batching is on the order of 620 output tokens/sec sustained, or about 1.6 billion output tokens per month if you run it flat-out 24/7 at full batch. Real workloads never sustain full batch around the clock, which is exactly where utilization assumptions enter.
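Converting tokens per second into tokens per month is the step that makes the capacity ceiling concrete. A small sketch, assuming a 30-day month and the benchmark figures quoted above (33 and 620 output tokens/sec); the function is illustrative only:

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: 30-day month


def monthly_output_tokens(tokens_per_sec: float, utilization: float = 1.0) -> float:
    """Output tokens produced in a month at a given average utilization."""
    return tokens_per_sec * utilization * SECONDS_PER_MONTH


full_batch = monthly_output_tokens(620)    # ~100 requests in flight, 24/7
single_stream = monthly_output_tokens(33)  # one request at a time, 24/7

print(f"Full batch:    {full_batch / 1e9:.2f}B output tokens/month")
print(f"Single stream: {single_stream / 1e6:.0f}M output tokens/month")
```

The same node, at the same hourly price, produces roughly 19x more tokens when the batch is full than when it serves one request at a time.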

The API Side of the Ledger#

The prices you are trying to beat, per 1M tokens, mid-2026:

Model (hosted API)	Input	Output	Source
GLM-5.2 (Z.ai)	~$1.40	~$4.40	GLM-5.2 cost math
DeepSeek V4 Pro	~$0.435	~$0.87	DeepSeek pricing
DeepSeek V4 Flash	~$0.14	~$0.28	DeepSeek pricing
Llama 4 Maverick (Fireworks)	~$0.22	~$0.88	PricePerToken

Note the spread. The same open weights that you would self-host are also sold by competing providers who already solved batching at scale, bought their GPUs at volume, and amortize ops across thousands of tenants. That is why a model like Llama 4 Maverick or DeepSeek V4 Flash can be served for cents - the hosted API for an open-weights model is often the cheapest way to run that exact model, because someone else is carrying your utilization risk.
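To see what the spread means for a real bill, price one workload against every row of the table. The sketch below uses the per-1M rates listed above; the 500M input / 100M output monthly workload is a made-up example, not a measurement from this post:

```python
# Per-1M-token prices (input, output) copied from the table above.
PRICES = {
    "GLM-5.2 (Z.ai)": (1.40, 4.40),
    "DeepSeek V4 Pro": (0.435, 0.87),
    "DeepSeek V4 Flash": (0.14, 0.28),
    "Llama 4 Maverick (Fireworks)": (0.22, 0.88),
}


def monthly_bill(input_millions: float, output_millions: float) -> dict[str, float]:
    """Monthly API cost per model for a workload measured in millions of tokens."""
    return {
        model: input_millions * price_in + output_millions * price_out
        for model, (price_in, price_out) in PRICES.items()
    }


# Hypothetical workload: 500M input tokens and 100M output tokens per month.
bills = monthly_bill(input_millions=500, output_millions=100)
for model, usd in sorted(bills.items(), key=lambda item: item[1]):
    print(f"{model:<30} ${usd:,.2f}/month")
```

For this example workload the totals range from about $98 to about $1,140 a month, all well below the cost of keeping a single 8xH100 node running.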

The Worked Break-Even#

Let us make it concrete. Suppose your workload is dominated by output tokens (agentic coding, long generations) and you are choosing between self-hosting a 671B-class MoE model on a rented 8xH100 node versus paying DeepSeek V4 Pro's API at ~$0.87 per 1M output tokens.

Self-hosting cost (rented):

8xH100 on-demand: ~$20/hr midpoint, running 24/7 = ~$14,400/month
Add ops, monitoring, and a slice of an engineer (more on this below): call the all-in fixed cost ~$16,000/month for the clean comparison. That budgets only ~$1,600/month for ops, which is deliberately generous to self-hosting.

Capacity at different utilization:

Avg batch utilization	Effective output tokens/sec	Output tokens/month	Self-host cost per 1M output tokens
100% (full batch, 24/7)	620	~1.6B	~$10
50%	310	~800M	~$20
20%	124	~320M	~$50
5% (one or two users)	31	~80M	~$200
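The table can be reproduced in a few lines, which also makes it easy to plug in your own numbers. This is a sketch under the same assumptions as the rows above: a ~$16,000/month fixed cost, 620 output tokens/sec at full batch, and a 30-day month.

```python
FIXED_MONTHLY_USD = 16_000          # all-in fixed cost from the comparison above
FULL_BATCH_TOKENS_PER_SEC = 620     # aggregate output throughput at full batch
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: 30-day month


def cost_per_million_output(utilization: float) -> float:
    """Self-host cost per 1M output tokens at a given average batch utilization."""
    tokens = FULL_BATCH_TOKENS_PER_SEC * utilization * SECONDS_PER_MONTH
    return FIXED_MONTHLY_USD / (tokens / 1e6)


for utilization in (1.0, 0.5, 0.2, 0.05):
    tokens_per_sec = FULL_BATCH_TOKENS_PER_SEC * utilization
    usd = cost_per_million_output(utilization)
    print(f"{utilization:>4.0%}  {tokens_per_sec:>5.0f} tok/s  ${usd:>7.2f} per 1M output")
```

The unrounded results land slightly under the table's figures (about $9.96 at 100% rather than $10) because the table rounds monthly capacity down to 1.6B tokens. The shape is what matters: cost per token is inversely proportional to utilization, so halving utilization doubles the price.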

Set that against the API at $0.87 per 1M output tokens, and the result is brutal: even at 100% batch utilization 24/7, this self-hosted node costs ~$10 per 1M output tokens - more than 11x the API price.

The reason is not that your math is wrong. It is that the API provider runs the same hardware at scale you cannot match, buys GPUs cheaper, and packs the batch fuller across many customers. To beat $0.87/1M on rented hardware you would need to either drive utilization past what a single tenant can sustain, negotiate reserved pricing far below on-demand, or be serving a model where the API markup is much fatter than DeepSeek's - and DeepSeek's is famously thin.

Where self-hosting actually wins: flip the comparison against an expensive frontier API. If the alternative is a closed model at, say, $15 to $30 per 1M output tokens, then a self-hosted open-weights node at $10 to $20/1M (the 50% to 100% utilization rows above) crosses into the black. The break-even is not "self-hosting vs the API" in the abstract - it is "self-hosting an open-weights model vs paying premium frontier rates for comparable quality, at volume high enough to keep the node busy." That is a real and growing scenario, which is exactly why the margin is moving to the orchestration and routing layer.

The rough rule of thumb: self-hosting starts to pencil out only when (a) your sustained volume reliably fills the batch, (b) the API you are replacing is a premium-priced model rather than a cheap open-weights host, and (c) you can amortize the ops cost across that volume. Miss any one and the API wins.
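Another way to read the same numbers is to ask what utilization the node must sustain to match a given API price. A hedged sketch, reusing the ~$16,000/month fixed cost and 620 output tokens/sec from above and considering output tokens only:

```python
FIXED_MONTHLY_USD = 16_000
FULL_BATCH_TOKENS_PER_SEC = 620
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: 30-day month


def breakeven_utilization(api_price_per_million: float) -> float:
    """Average utilization at which self-host cost per token equals the API price.

    A result above 1.0 means the node cannot break even at any load.
    """
    full_capacity_millions = FULL_BATCH_TOKENS_PER_SEC * SECONDS_PER_MONTH / 1e6
    return FIXED_MONTHLY_USD / (api_price_per_million * full_capacity_millions)


for price in (0.87, 15.0, 30.0):
    u = breakeven_utilization(price)
    verdict = "impossible on one node" if u > 1 else f"{u:.0%} sustained"
    print(f"API at ${price:>5.2f}/1M output -> break-even utilization: {verdict}")
```

Against the $0.87 rate the required utilization is more than eleven times what the node can deliver, so no amount of traffic closes the gap. Against $15 to $30 per 1M output tokens, break-even falls to roughly two-thirds and one-third utilization respectively, before the extra ops costs discussed next.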

The Hidden Ops Bill#

The clean comparison above already understates self-hosting, because the fixed cost is never just the GPU rental. The line items that do not appear on the GPU invoice:

Ops time. Someone serves the model, patches the inference stack (vLLM and SGLang move fast), tunes batching and quantization, handles OOMs and node failures, and gets paged at 3am. A fractional senior engineer is easily $5,000 to $15,000/month of loaded cost, and it does not scale down when traffic is light.
Idle GPU. The single most expensive failure mode. A node provisioned for peak that sits at 10% average utilization is paying full price for a tenth of the output. The API charges you nothing for the troughs.
Redundancy and scaling. One node is a single point of failure. Real production wants headroom for spikes and a fallback, which means provisioning above average demand - structurally guaranteeing idle capacity.
Cold starts and model swaps. Loading 600B+ weights takes minutes and hundreds of gigabytes of transfer, even at 4-bit. If you serve multiple models or scale to zero, you eat that latency and that bandwidth.
Quantization quality risk. The throughput numbers that make self-hosting look good usually assume 4-bit weights. That is a quality tradeoff that you, not the provider, are now responsible for measuring.

None of these are hypothetical. They are the difference between the spreadsheet break-even and the real one, and they all push the crossover point higher.
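To see how far these items move the number, fold them into the fixed cost and recompute. The sketch below is illustrative: it takes the ~$14,400/month node rental and the $5,000 to $15,000/month engineer range from this post, and treats a second node as pure redundancy that serves no extra traffic (a simplifying assumption).

```python
NODE_MONTHLY_USD = 14_400           # one 8xH100 node at ~$20/hr, 24/7
FULL_BATCH_TOKENS_PER_SEC = 620
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: 30-day month


def all_in_cost_per_million(engineer_monthly_usd: float,
                            nodes: int = 1,
                            utilization: float = 0.5) -> float:
    """Cost per 1M output tokens including engineer time and standby nodes.

    `utilization` is the traffic as a fraction of ONE node's full-batch
    capacity; extra nodes add cost but, in this sketch, no extra traffic.
    """
    fixed = NODE_MONTHLY_USD * nodes + engineer_monthly_usd
    served_millions = FULL_BATCH_TOKENS_PER_SEC * utilization * SECONDS_PER_MONTH / 1e6
    return fixed / served_millions


lean = all_in_cost_per_million(engineer_monthly_usd=5_000, nodes=1)
safe = all_in_cost_per_million(engineer_monthly_usd=15_000, nodes=2)
print(f"One node, light ops:        ${lean:.2f} per 1M output")
print(f"Two nodes, heavier ops:     ${safe:.2f} per 1M output")
```

At 50% utilization, these assumptions put the real cost somewhere around $24 to $55 per 1M output tokens, compared with ~$20 in the clean comparison.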

When Self-Hosting Actually Makes Sense#

It is not never. The honest cases:

High, steady, batch-filling volume of a model whose hosted API carries a fat markup or whose quality you need at frontier-replacing scale.
Data residency or compliance that forbids sending tokens to a third party at any price. Here the comparison is not cost, it is permission.
Latency or determinism requirements that a shared multi-tenant API cannot guarantee.
Research and experimentation where you need to modify the model, not just call it.
You already own the GPUs for another reason and the marginal cost of inference on spare capacity is close to free.

For everyone else - which is most teams, most of the time - the right move is the boring one: use the hosted open-weights API, route cheap traffic to cheap models, and reserve self-hosting for the narrow band where the math truly closes.

How to Actually Decide#

Measure your real token volume, split into input and output, over a representative month. Output tokens dominate cost for generative workloads.
Price it on three or four hosted APIs, including the cheap open-weights hosts, not just the frontier model you default to. Model routing recipes are usually the fastest win.
Estimate your honest batch utilization, not your peak. If you cannot keep ~50 to 100 requests in flight most of the time, self-hosting math will not close.
Add the ops bill - engineer time, redundancy, idle headroom - to the GPU cost before comparing.
Only then compute self-host cost per token at your real utilization and set it against the API. If it is not at least 2x cheaper, the API wins on a risk-adjusted basis alone.
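The steps above can be sketched as a single function. Treat it as a starting point rather than a finished calculator: it sizes nodes on output tokens only, assumes the ~$16,000/month all-in cost per node and 620 output tokens/sec from the worked example, and the "premium" prices in the usage example are hypothetical placeholders.

```python
import math
from dataclasses import dataclass

SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: 30-day month


@dataclass
class Workload:
    input_millions: float   # measured input tokens per month, in millions
    output_millions: float  # measured output tokens per month, in millions


def decide(workload: Workload,
           api_input_price: float,
           api_output_price: float,
           fixed_per_node_usd: float = 16_000.0,
           node_output_tps: float = 620.0,
           required_advantage: float = 2.0) -> str:
    """Return "self-host" only if the API would cost at least
    `required_advantage` times the self-hosted fixed cost."""
    api_cost = (workload.input_millions * api_input_price
                + workload.output_millions * api_output_price)
    node_capacity_millions = node_output_tps * SECONDS_PER_MONTH / 1e6
    # Provision enough nodes to carry the measured output volume.
    nodes = max(1, math.ceil(workload.output_millions / node_capacity_millions))
    self_host_cost = fixed_per_node_usd * nodes
    return "self-host" if api_cost >= required_advantage * self_host_cost else "api"


# Hypothetical workload: 3B input and 1B output tokens per month.
workload = Workload(input_millions=3_000, output_millions=1_000)
print(decide(workload, 0.435, 0.87))  # priced at the DeepSeek V4 Pro rates above
print(decide(workload, 5.0, 25.0))    # priced at a hypothetical premium API
```

With these inputs the first call returns "api" and the second returns "self-host". The 2x margin is the risk buffer from the last step: it leaves room for the idle time, redundancy, and ops surprises that the spreadsheet does not capture.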

And whichever side you land on, put spend guardrails in place. Self-hosting caps your token cost but uncaps your ops and idle cost. The API uncaps your token cost but caps everything else. Both can run away from you without controls.

FAQ#

Is self-hosting open-weights models always cheaper than the API?#

No. For most teams it is more expensive once you account for utilization and ops. A hosted API for an open-weights model is often the cheapest option because the provider keeps the batch full at a scale a single tenant cannot match. Self-hosting wins mainly when you have high, steady, batch-filling volume replacing a premium-priced frontier model, or when compliance and latency requirements override cost.

What is the break-even volume for self-hosting?#

There is no universal number - it depends on your batch utilization and which API you are replacing. As a rule of thumb, self-hosting only pencils out when you can keep roughly 50 to 100 concurrent requests in flight most of the time and the API you are replacing is a premium model priced well above cheap open-weights hosts like DeepSeek V4 Flash or Llama 4 Maverick.

How much throughput does an 8xH100 node deliver?#

For a 671B-class MoE model on 8xH100 with vLLM, published benchmarks show roughly 33 output tokens/sec for a single request, rising to about 620 output tokens/sec aggregate (around 3,000 total tokens/sec including input) at about 100 concurrent requests using 4-bit quantization. You only get the high number at high concurrency.

What does it cost to rent versus buy an H100?#

In mid-2026, an H100 rents for roughly $2 to $3 per GPU-hour on-demand from neoclouds, and costs roughly $25,000 to $30,000 to buy. An 8xH100 node is around $11,500 to $17,500/month rented 24/7, or a $200,000+ capital purchase plus power (each card draws ~700W) and datacenter costs to own.

Why is the hosted API for an open-weights model often cheapest?#

Because the same open weights are served by multiple competing providers who bought GPUs at volume, solved high-utilization batching at scale, and amortize ops across thousands of tenants. They carry the utilization risk for you, which is why models like DeepSeek V4 Flash (~$0.14/$0.28 per 1M tokens) or Llama 4 Maverick (~$0.22/$0.88) sell for cents per million tokens.
