TL;DR
Choosing a local coding LLM in 2026 means balancing benchmark performance, hardware cost, and the compliance pressure to keep code off third-party servers. Here is what to run and on what hardware.
Direct answer
Choosing a local coding LLM in 2026 means balancing benchmark performance, hardware cost, and the compliance pressure to keep code off third-party servers. Here is what to run and on what hardware.
Best for
Developers comparing real tool tradeoffs before choosing a stack.
Covers
Verdict, tradeoffs, pricing signals, workflow fit, and related alternatives.
Read next
Anthropic shipped two names for one architecture on June 9, 2026. Here is what separates Fable 5 from Mythos 5, who can actually get unrestricted access, and what developers should do right now.
7 min readFable 5 launched June 9 at 2x GPT-5.5's price with a 22-point SWE-Bench Pro gap. Here is the decision framework for choosing between them.
7 min readFable 5 lists at $10/$50 per million tokens - twice Opus 4.8. But list price is the wrong number. Here is the cost-per-outcome math that actually decides whether the upgrade pays.
8 min readThe calculus around local AI has shifted. A year ago, running a large language model on your own hardware was mostly a hobby - you accepted worse output to avoid API costs. In 2026, the reasons to self-host have multiplied: data residency requirements, compliance audits, latency budgets, and the rising cost of cloud AI at scale. The models have caught up too. The gap between frontier cloud models and the best locally runnable options has narrowed significantly on the benchmarks that actually predict production value.
This post is a practical guide. Every model claim below is sourced and verifiable. If a number could not be confirmed, it is not here.
Last updated: June 10, 2026
The regulatory pressure point many teams did not anticipate: Anthropic's data retention policy update for its most capable models, effective June 9, 2026. Under the new policy, prompts and outputs submitted to "Mythos-class" models are retained for 30 days for trust and safety purposes - even for organizations with zero data retention (ZDR) agreements, including those accessing the API through AWS Bedrock, Google Cloud Agent Platform, and Microsoft Foundry.
For teams that assumed ZDR meant zero retention across all tiers, this is a meaningful shift. Legal and security teams reviewing AI tooling now have a concrete policy change to respond to. The most direct response is moving workloads - especially code review, architecture drafting, and anything touching proprietary logic - to models that never leave your infrastructure.
That pressure, combined with hardware that has gotten substantially cheaper and models that have gotten substantially better, explains why local LLM adoption is accelerating among professional developers.
HumanEval, the old standard for coding benchmarks, has largely been retired among serious evaluators. SWE-bench has replaced it as the primary coding benchmark in 2026 because it measures real GitHub issue resolution rather than isolated function generation - a far better proxy for how a model performs in actual dev workflows.
The official SWE-bench leaderboard shows the frontier clearly: DeepSeek V3.2 leads at 70.0%, Gemini 3 Pro at 69.6%, and Claude 4.5 Haiku (high reasoning) at 66.6% as of this writing. These are cloud-only models. The question for local deployment is how close the self-hostable options get.
| Model | Parameters | SWE-bench / Key Score | VRAM (Q4_K_M) |
|---|---|---|---|
| Qwen 3.6 27B | 27B dense | 77.2% SWE-bench | ~22 GB |
| Kimi K2.6 | 1T total / 32B active (MoE) | 58.6 SWE-bench Pro | Varies (quantized) |
| Devstral Small 24B | 24B | Agentic-optimized | ~16 GB |
| DeepSeek R1 | 671B | 71.5 GPQA Diamond | 340 GB FP16 / quantized |
| Qwen3 8B | 8B | - | ~5 GB |
| Llama 3.3 70B | 70B | - | 38 GB INT4 |
Scores sourced from PromptQuorum and Onyx AI self-hosted leaderboard.
The model that has generated the most discussion among hardware-constrained developers is Google Gemma 4 12B. The headline benchmark claim from Google is that it delivers "performance nearing our larger 26B MoE model on standard benchmarks" while running on "consumer laptops with 16GB of RAM" - specifically 16GB of VRAM or unified memory.
What makes this notable is the architecture: Gemma 4 12B is a unified, encoder-free multimodal model. Vision and audio inputs route directly into the LLM backbone rather than through separate encoders. That architectural decision is what lets it punch above its parameter weight on tasks that would normally require a much larger model.
For coding work specifically, the Apache 2.0 license matters. You can deploy it commercially, fine-tune it on proprietary codebases, and redistribute modifications without restriction. For enterprise air-gap deployments, that licensing clarity is as important as the benchmark number.
The Gemma 4 family as a whole has crossed 150 million downloads, available via Hugging Face and Kaggle. The 12B variant represents the sweet spot for teams that want multimodal capability (including audio inputs - a first for a mid-sized model) without requiring a dedicated GPU server.
Get the weekly deep dive
Tutorials on Claude Code, AI agents, and dev tools - delivered free every week.
From the archive
Jun 10, 2026 • 10 min read
Jun 10, 2026 • 6 min read
Jun 10, 2026 • 8 min read
Jun 10, 2026 • 8 min read
The Onyx self-hosted leaderboard notes that VRAM estimates are based on model weight size at FP16 (2 bytes per parameter), with actual overhead typically running 10-20% higher due to KV cache and framework needs. Plan accordingly.
Tier 1: Laptop-viable (8-16 GB unified memory)
Tier 2: Workstation / single GPU server (24-48 GB VRAM)
Tier 3: Multi-GPU or enterprise server
For most individual developers, Tier 1 is the starting point. For small teams running a shared inference server, Tier 2 covers the majority of coding use cases at a cost that amortizes quickly.
The math on hardware amortization has improved as GPU prices normalize. A rough framework for a 12-month horizon:
Cloud API baseline - A developer using a mid-tier cloud coding model at roughly 2M tokens per day (typical for an agentic coding workflow with code context) runs approximately $200-400/month depending on the model and provider. Over 12 months: $2,400 - $4,800.
Tier 1 local (M3 MacBook Pro 24GB) - If you already own the hardware, marginal cost is electricity. If purchasing: amortized over 36 months, the incremental cost attributable to local LLM capability is minimal. For teams that do not own suitable hardware, an M4 Mac Mini with 32GB unified memory runs under $1,200 new - a cost recovered in 3-6 months against active cloud API spend.
Tier 2 local (single RTX 4090 workstation) - Hardware cost $2,500-4,000 new, amortized over 36 months at roughly $70-110/month. Breakeven against cloud API at typical developer usage: 6-12 months.
Tier 3 multi-GPU server - Capital-intensive, justified only for team-wide shared inference. At this scale, the economics almost always favor self-hosting when combined with compliance requirements that would otherwise require expensive ZDR contracts.
Running a local model is the foundation, but the full privacy-first dev stack requires more. The components that enterprise teams combine most often:
The orchestration layer is where teams often underestimate effort. Local inference is solved. Wiring it into the full development workflow - context retrieval, multi-file edits, test running, PR review - requires either an off-the-shelf tool that supports local endpoints or custom integration work.
Solo developer Start with Ollama and Gemma 4 12B or Qwen3 8B on whatever hardware you have. If you have 16GB unified memory (Apple Silicon or equivalent), Devstral Small 24B gives you agentic multi-file editing that covers most day-to-day needs. Cost: $0 if hardware already owned.
Small team (2-10 developers) A shared inference server with Qwen 3.6 27B or Llama 3.3 70B gives the team a capable, shared endpoint. A single machine with 2x RTX 4090 (80GB combined VRAM) handles Llama 3.3 70B at INT4 with headroom. Add a vector store and Continue.dev for IDE integration across the team. Monthly hardware amortization: roughly $100-150/month spread across the team.
Enterprise with air-gap requirement DeepSeek R1 or V3.2 at quantized precision on a dedicated multi-GPU server. Devstral-2-123B at INT4 is an alternative that fits a single high-VRAM server card. The MIT license on DeepSeek R1 and Kimi K2.6 and the Apache 2.0 license on Gemma 4 12B are all compatible with enterprise deployment without legal review friction. Build the inference layer on vLLM for throughput, add OpenAI-compatible API shim so existing tooling requires minimal reconfiguration.
For most developers, Qwen 3.6 27B leads on verified benchmark scores at 77.2% SWE-bench, but requires ~22 GB VRAM. If you are hardware-constrained to 16 GB, Gemma 4 12B from Google offers near-26B benchmark performance on a laptop. For agentic multi-file workflows, Devstral Small 24B is purpose-designed.
The gap has narrowed but not closed. Cloud-only models like DeepSeek V3.2 and Gemini 3 Pro score 70%+ on SWE-bench as of mid-2026. The best locally runnable dense models are competitive for everyday coding tasks; the gap shows mainly on complex multi-step reasoning and very large codebase contexts.
16 GB of VRAM or unified memory (Apple Silicon counts) runs Gemma 4 12B and Devstral Small 24B. 24 GB covers Qwen 3.6 27B with headroom. For the top-tier models like DeepSeek R1, you need a multi-GPU setup or heavily quantized versions.
Anthropic updated its data retention policy for its most capable models effective June 9, 2026, to retain prompts and outputs for 30 days even for enterprise customers with zero data retention agreements. Teams in regulated industries or with IP sensitivity are responding by routing sensitive workloads to self-hosted models that never transmit code off-premises.
Gemma 4 12B uses Apache 2.0 - permissive for commercial use and fine-tuning. DeepSeek R1 and Kimi K2.6 use MIT. Llama 3.3 has its own Meta community license that permits commercial use below certain usage thresholds. Devstral uses the Mistral AI Research License - check commercial terms before enterprise deployment.
Ollama is the simplest setup and suitable for individual and small-team use. For production team-wide inference at scale, vLLM offers significantly better throughput and serves an OpenAI-compatible API. llama.cpp remains the most portable option for air-gapped environments with minimal dependencies.
Technical content at the intersection of AI and development. Building with AI agents, Claude Code, and modern dev tools - then showing you exactly how it works.
Full-stack AI dev environment in the browser. Describe an app, get a deployed project with database, auth, and hosting....
View ToolThe easiest way to run LLMs locally. One command to pull and run any model. OpenAI-compatible API. 52M+ monthly download...
View ToolDesktop app for discovering, downloading, and running local LLMs. Clean chat UI, OpenAI-compatible API server, and autom...
View ToolOpen-source ChatGPT alternative that runs 100% offline. Desktop app with local models, cloud API connections, custom ass...
View ToolInstall Ollama and LM Studio, pull your first model, and run AI locally for coding, chat, and automation - with zero cloud dependency.
Getting StartedAsk quick side questions without derailing the main task.
Claude CodeTargeted edits to specific sections without rewriting entire files.
Claude CodeAnthropic shipped two names for one architecture on June 9, 2026. Here is what separates Fable 5 from Mythos 5, who can...
Fable 5 launched June 9 at 2x GPT-5.5's price with a 22-point SWE-Bench Pro gap. Here is the decision framework for choo...
Fable 5 lists at $10/$50 per million tokens - twice Opus 4.8. But list price is the wrong number. Here is the cost-per-o...
Fable 5 is mostly a drop-in replacement for Opus 4.8, but 'mostly' is doing real work in that sentence. Here's every bre...
Anthropic's Claude Fable 5 includes undisclosed interventions that silently degrade responses for certain ML development...
The Miasma worm has evolved from package registry poisoning to directly hijacking AI coding tools - if your team clones...

New tutorials, open-source projects, and deep dives on coding agents - delivered weekly.