
TL;DR
Gemma 4 ships fully open weights from Google DeepMind. Here's how developers deploy it locally, fine-tune it, and ship agents on top of it.
Google has shipped open-weights models before. Gemma 1 was a respectable showing. Gemma 2 closed the gap. Gemma 3 was genuinely competitive. Gemma 4 is the first time the company has released an open model that you can credibly drop into a production stack and not feel like you are choosing between "open" and "good."
For model-selection context, the useful question is not only benchmark quality, but where the model fits in a real developer workflow.
That matters because the open-weights conversation in 2026 is not theoretical anymore. Llama, Mistral, Qwen, DeepSeek, and now Gemma are all serious. The question developers are actually asking is which one to bet on, and the answer depends on three things that Gemma 4 happens to do well: deployment ergonomics, license clarity, and downstream tunability.
This is the deploy playbook. What it is, how to run it locally, how to fine-tune it, and where it fits in the agent stack.
Gemma 4 comes in three sizes: 2B, 9B, and 27B parameters. Multi-modal across text and images on the larger two. Context length is 128K tokens. The license is the same Gemma terms as before: commercial use with attribution, subject to a prohibited-use policy. Weights are on Hugging Face, Kaggle, and Google's own model hub.
The headline numbers are competitive at every size class. The 27B sits in the same neighborhood as Llama 3.3 70B on most reasoning benchmarks while running at less than half the inference cost. The 9B is the sweet spot for most application work, fitting comfortably on a single 24GB consumer GPU at 4-bit quantization. The 2B is the on-device tier, targeting laptops, phones, and edge inference.
The architectural changes from Gemma 3 are incremental but useful. Improved sliding-window attention for long contexts. Better RoPE scaling. A tokenizer that handles code and structured output noticeably better than the previous generation. The image encoder on the multi-modal variants is the same family that ships in Gemini's smaller tiers, which means quality is meaningfully ahead of bolt-on vision adapters.
The fastest path to having Gemma 4 on your laptop is Ollama. The Ollama team typically ships day-zero support for new Google models, and Gemma 4 was no exception.
ollama pull gemma4:9b
ollama run gemma4:9b "Explain GRPO in two sentences."
That is the entire setup on a Mac with at least 16GB of unified memory. The 27B variant needs 32GB or better. The 2B runs comfortably on anything with a GPU made in the last five years.
For programmatic access:
import ollama

response = ollama.chat(
    model="gemma4:9b",
    messages=[
        {"role": "user", "content": "Write a Python function to flatten a nested list."}
    ],
)
print(response["message"]["content"])
If you want streaming, OpenAI-compatible endpoints, or multi-model serving on the same box, Ollama exposes all of that on localhost:11434. The mental model is "drop-in OpenAI replacement that runs on your machine."
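Streaming, for instance, is one keyword argument away. A minimal sketch with the same Python client; stream=True turns the call into a generator of chunks:

import ollama

# stream=True yields partial responses as they are generated.
stream = ollama.chat(
    model="gemma4:9b",
    messages=[{"role": "user", "content": "Explain KV caching in one paragraph."}],
    stream=True,
)
for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)
print()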
For production inference, Ollama is not the right tool. vLLM is. The throughput difference at batch size greater than one is roughly an order of magnitude.
pip install vllm

vllm serve google/gemma-4-9b-it \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.9 \
  --tensor-parallel-size 1
That spins up an OpenAI-compatible server on port 8000. You hit it with the standard openai client by pointing base_url at your server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="google/gemma-4-9b-it",
    messages=[{"role": "user", "content": "Generate 5 startup ideas about open weights."}],
    max_tokens=512,
)
print(response.choices[0].message.content)
A few production considerations worth pinning down before you ship this in front of users.
Quantization. vLLM supports AWQ and GPTQ quantizations of Gemma 4. The 9B at 4-bit fits comfortably on a 24GB card with 32K context. The 27B at 4-bit needs 48GB. Quantization quality on Gemma 4 is unusually good. The published 4-bit AWQ checkpoints lose less than a point on most benchmarks compared to the bf16 baseline, which is meaningfully better than the same pattern for Llama-class models.
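Loading a quantized checkpoint is one argument away. A minimal sketch with vLLM's offline LLM class; the checkpoint id here is a placeholder, so substitute whichever published 4-bit AWQ checkpoint you trust:

from vllm import LLM, SamplingParams

# Placeholder checkpoint id -- swap in a real published AWQ build.
llm = LLM(
    model="your-org/gemma-4-9b-it-AWQ",
    quantization="awq",
    max_model_len=32768,
)
params = SamplingParams(max_tokens=256)
outputs = llm.generate(["Summarize the Gemma license in one paragraph."], params)
print(outputs[0].outputs[0].text)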
Batching. Throughput climbs steeply with batch size. If your workload is a steady stream of requests, vLLM's continuous batching pulls 5 to 10 times more tokens per second per GPU than a naive serving setup.
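You can see this from the client side by firing requests concurrently instead of serially. A minimal sketch with the async OpenAI client against the server started above; the prompt set is arbitrary:

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

async def ask(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="google/gemma-4-9b-it",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=128,
    )
    return resp.choices[0].message.content

async def main() -> None:
    prompts = [f"Summarize benefit #{i} of open weights." for i in range(32)]
    # Fire everything at once; vLLM's continuous batching packs the
    # in-flight requests onto the GPU instead of serving them one by one.
    results = await asyncio.gather(*(ask(p) for p in prompts))
    print(len(results), "completions")

asyncio.run(main())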
Long context. The 128K window is real. It is also expensive. KV cache memory is the bottleneck. Plan for context length carefully if you are serving many concurrent requests, and consider whether your application actually needs the full window or whether 32K suffices.
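A back-of-envelope calculation makes the tradeoff concrete. The architecture numbers below are illustrative placeholders, not published Gemma 4 specs; the point is how linearly the KV cache grows with context length and concurrency:

# Rough KV-cache sizing. Layer and head counts are assumptions
# for illustration only, not real Gemma 4 9B architecture values.
num_layers = 42
num_kv_heads = 8
head_dim = 256
bytes_per_value = 2  # bf16

def kv_cache_gb(seq_len: int, batch: int = 1) -> float:
    # 2x for the K and V tensors at every layer, per token.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return seq_len * batch * per_token / 1024**3

print(f"32K context:  {kv_cache_gb(32_768):.1f} GB per sequence")
print(f"128K context: {kv_cache_gb(131_072):.1f} GB per sequence")

With these placeholder numbers, one full-window sequence eats tens of gigabytes of cache, which is exactly why the 32K-vs-128K question matters at serving time.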
Most developers will not need to pretrain Gemma 4. Most developers will need to fine-tune it on their specific task. Two paths, both tractable.
For LoRA fine-tuning at scale, TRL is the canonical tool:
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
from peft import LoraConfig

dataset = load_dataset("your-org/your-task", split="train")

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
)

trainer = SFTTrainer(
    model="google/gemma-4-9b-it",
    train_dataset=dataset,
    peft_config=lora_config,
    args=SFTConfig(
        output_dir="gemma4-finetuned",
        num_train_epochs=3,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        bf16=True,
    ),
)
trainer.train()
For solo developers on a single GPU, Unsloth is the speed-and-memory winner. Same API surface as TRL, roughly twice as fast, half the memory, and Gemma 4 is one of the supported models from launch:
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-4-9b-it",
    load_in_4bit=True,
    max_seq_length=8192,
)
model = FastLanguageModel.get_peft_model(model, r=16)
# ... use it with your existing trainer
The main gotcha across both paths is the chat template. Gemma 4 uses a specific turn-marking format with <start_of_turn> and <end_of_turn> tokens. If you assemble training data with the wrong template, the model trains fine but generates with the wrong stop tokens at inference. Use tokenizer.apply_chat_template rather than rolling your own format strings. This bites someone every release cycle.
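A quick sanity check for your data pipeline is to render one example through the official template and eyeball the markers:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-9b-it")
messages = [{"role": "user", "content": "Summarize LoRA in one sentence."}]

# Render without tokenizing so the turn markers are visible as text.
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(text)  # expect <start_of_turn>user ... <end_of_turn> framing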
Honest framing: Gemma 4 is not the best model in the world. The frontier closed-weights tier still outperforms it on the hardest benchmarks. What it is, is the best open model in the small-to-mid size class, with a license that lets you ship it commercially without a lawyer present.
That makes it the right choice for several specific jobs.
On-device inference. The 2B is small enough to run on a phone with no API key, no network round trip, and no per-call cost. For features that need a small model close to the user, Gemma 4 2B is the default starting point. The latency profile is usable for interactive features, not just background batch jobs.
Self-hosted production with privacy constraints. If your data cannot leave your network, Gemma 4 9B on a single GPU at 4-bit is the path. It is not GPT-4. It is good enough for most production tasks and your data never crosses a wire. For regulated industries this is the entire ballgame.
Fine-tuned domain experts. A specialized 9B fine-tuned on your domain typically beats a generalist GPT-4 call for that domain, at a fraction of the inference cost. Gemma 4 is a particularly good base for this because the chat-tuned variant is well-aligned out of the box and does not require extensive retraining to make it usable.
Cost-sensitive agent loops. Agent loops burn tokens. An agent that makes ten tool calls per task is paying ten times the per-token cost. Self-hosted Gemma 4 on commodity hardware drops the marginal cost to near-zero, which changes the design space for what agent loops are economically viable.
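The shape of such a loop is simple. Here is a minimal sketch against the self-hosted vLLM endpoint from earlier; the tool registry and JSON protocol are illustrative, not any specific framework's API:

import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Illustrative tool registry -- in a real agent these dispatch to your code.
TOOLS = {
    "search_docs": lambda query: f"(top results for {query!r})",
    "read_file": lambda path: f"(contents of {path})",
}

INSTRUCTIONS = (
    "You are a tool-calling agent. Reply ONLY with JSON, either "
    '{"tool": "<name>", "args": {...}} or {"final": "<answer>"}. '
    f"Available tools: {list(TOOLS)}.\n\nTask: "
)

def run_agent(task: str, max_steps: int = 10) -> str:
    # Gemma chat templates have historically had no system role,
    # so the instructions ride along in the first user turn.
    messages = [{"role": "user", "content": INSTRUCTIONS + task}]
    for _ in range(max_steps):  # each iteration is one near-free local call
        reply = client.chat.completions.create(
            model="google/gemma-4-9b-it", messages=messages, max_tokens=256,
        ).choices[0].message.content
        step = json.loads(reply)
        if "final" in step:
            return step["final"]
        result = TOOLS[step["tool"]](**step["args"])
        messages += [
            {"role": "assistant", "content": reply},
            {"role": "user", "content": f"Tool result: {result}"},
        ]
    return "Step budget exhausted."

Ten iterations of this loop against a local GPU cost effectively nothing, which is the whole point.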
We run small-model agents on this profile inside AgentFS, where the orchestrator handles tool dispatch and the model only needs to be smart enough to pick the right tool and format the call. Gemma 4 9B fine-tuned on tool-call traces handles that with room to spare. For exposing the agent's tool surface as MCP servers, we pair it with MCPaaS, which makes the model's tool registry a managed service rather than a per-app concern.
The walkthrough video for the full deploy-and-fine-tune pipeline is on the DevDigest YouTube channel, including a side-by-side comparison of Gemma 4 9B against the closed-weights peers on the same agent task.
Three threads worth following over the next quarter.
The 70B variant. Google has not announced a Gemma 4 in the 70B class. The gap between 27B open and frontier closed is real, and a 70B Gemma would close most of it. If it ships, it changes the calculus for self-hosted production work meaningfully.
Gemma-specific reasoning fine-tunes. DeepSeek R1's recipe is being applied to every credible open base, and Gemma 4 will be no exception. Expect community-trained "Gemma 4 R1" variants within weeks. Some will be excellent. Watch the leaderboards rather than the announcements.
Multi-modal tool use. The image encoder on Gemma 4 is good. Tool-use benchmarks for vision-language models on agent tasks are still immature. There is a real opening for someone to build the first credible open multi-modal agent on top of Gemma 4 27B and publish numbers that the rest of the field has to chase.
The takeaway is simple. If you are picking an open-weights base today, Gemma 4 is on the short list. If you have not run it locally yet, the Ollama command above takes thirty seconds. Do that first, see how it feels, then decide where it fits.