
TL;DR
GRPO is suddenly the standard RL recipe for reasoning models. A no-prior-knowledge mental model of PPO, GRPO, and how DeepSeek R1's training works under the hood.
Six months ago, if you asked an ML engineer how reasoning models were trained, the answer involved PPO, a reward model, a value head, and a lot of careful KL constraints. Today, half the open-weights reasoning models on Hugging Face mention GRPO in their model card, and the other half are racing to switch. DeepSeek R1 was the inflection point. Its training recipe leaned on Group Relative Policy Optimization, the results spoke loudly, and the rest of the field followed.
For model-selection context, see Claude vs GPT for Coding: Which Model Writes Better TypeScript? and OpenAI vs Anthropic in 2026 - Models, Tools, and Developer Experience. The useful question is not only benchmark quality, but where the model fits in a real developer workflow.
If you are a developer who builds with LLMs but has never trained one, the alphabet soup is intimidating. PPO. GRPO. DPO. RLHF. RLAIF. Reward models, value heads, advantage estimates, clip ratios. Most explanations assume two semesters of RL background you do not have. This post is the version that does not.
The goal here is a working mental model. Not a derivation, not a proof, not a reproduction of the math. By the end you should be able to read the DeepSeek R1 paper, understand the Hugging Face GRPO writeup it leans on, and have an opinion about why this recipe is winning.
Start from a base model. It has been pretrained on the internet and then supervised-fine-tuned on instruction data. It can answer questions. It is not particularly good at reasoning, math, or following complex constraints. You want to make it better at those things.
You have a way to score answers. Maybe it is a learned reward model trained on human preference data. Maybe it is a programmatic checker, like running a math problem through SymPy or executing generated code against unit tests. Either way, given a prompt and a model output, you can produce a number that says "this answer was good" or "this answer was bad."
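To make the code-checker flavor concrete, here is a minimal sketch of a programmatic reward, assuming you can bundle the generated solution with its unit tests and run them in a subprocess. The function name and the ten-second timeout are illustrative, not any particular library's API.
import subprocess
import tempfile

def code_reward(generated_code: str, test_code: str) -> float:
    # Write the candidate solution plus its tests to a temp file and run it.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=10)
        return 1.0 if result.returncode == 0 else 0.0  # pass = good, fail = bad
    except subprocess.TimeoutExpired:
        return 0.0  # an answer that hangs counts as a wrong answer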
Now you want to nudge the model so that it produces more of the high-scoring answers and fewer of the low-scoring ones. That is the entire game. Every method on the menu, PPO, GRPO, DPO, REINFORCE, is a different answer to one question: how do you turn a reward signal into a gradient update without the model collapsing?
Proximal Policy Optimization, the original RLHF workhorse, looks like this. You generate an answer. You score it. You also run a separate value model that predicts, for any partial sequence, what the eventual reward will probably be. The difference between the actual reward and the predicted reward is the advantage, the surprise factor. You take a gradient step that pushes the model to produce more answers that beat the value model's prediction and fewer that fall short. You add a KL penalty against the original base model so the policy does not drift into nonsense, and you clip the gradient ratio so a single big update cannot destabilize training.
The conceptual cost of PPO is the value model. It is roughly the same size as the policy. You have to train it, store it, run it on every step, and tune it. RLHF infrastructure is dominated by the bookkeeping around this second model.
# Pseudocode, PPO step
prompts = sample_batch()
responses = policy.generate(prompts)
rewards = reward_model(prompts, responses)
values = value_model(prompts, responses) # the expensive part
advantages = rewards - values
loss = ppo_clip(policy.logprobs(responses), old_logprobs, advantages)
loss += beta * kl_divergence(policy, reference_policy)
loss.backward()
That value model is what GRPO removes.
Group Relative Policy Optimization makes a clever swap. Instead of training a value model to predict expected reward, it samples a group of responses for each prompt, scores them all, and uses the group's mean and standard deviation as the baseline. The advantage of any response is just how much better or worse it scored than its peers from the same prompt. No value model needed.
# Pseudocode, GRPO step
prompts = sample_batch()
groups = [policy.generate(p, n=8) for p in prompts] # 8 responses per prompt
rewards = [[reward_model(p, r) for r in group] for p, group in zip(prompts, groups)]
# advantage = (reward - group_mean) / group_std
advantages = [[(r - mean(group_r)) / std(group_r) for r in group_r] for group_r in rewards]
loss = ppo_clip(policy.logprobs(groups), old_logprobs, advantages)
loss += beta * kl_divergence(policy, reference_policy)
loss.backward()
The change is small, the consequences are large. You eliminated the value model. Your training infrastructure is half the size. Your reward signal is denser per prompt, because you are computing eight rewards where PPO computed one. And empirically, on math and code reasoning benchmarks, the recipe just works.
DeepSeek R1's training recipe is the highest-profile validation of GRPO to date. The team did not need a learned reward model. The reasoning tasks they cared about, math problems and code, have programmatic verifiers. A math answer is right or wrong. Code passes the tests or it does not. That gives you a deterministic, free, infinitely scalable reward signal.
Combine that with GRPO's no-value-model property, and the entire training stack collapses to: a policy model, a reference model for the KL term, and a verifier function. You can run that on commodity training infrastructure. You do not need to train and serve a separate reward network. You do not need preference annotation pipelines. The cost structure changes from "RLHF needs a research lab" to "RL post-training needs a verifier and a GPU cluster."
R1's other contribution was showing that you can do a lot of GRPO before the model breaks. The training run was long. The reward signal was simple. The model learned to produce long chain-of-thought reasoning traces because longer correct traces won relative to shorter wrong ones in the group comparison, and the group baseline kept that signal stable.
Strip the math out and the recipe becomes simple enough to remember.
Rule one: rewards drive direction. Whatever you reward more, you get more of. If your verifier rewards correct final answers regardless of reasoning, you get a model that gets answers right with terse reasoning. If you reward longer reasoning that ends in correct answers, you get a chain-of-thought model. The reward function is the product spec.
Rule two: a baseline keeps the gradient sane. The model needs to know what counts as "better than expected." PPO uses a value model to estimate that. GRPO uses peer responses. Either way, you are subtracting a baseline so that the gradient signal is the surprise, not the absolute reward. Without a baseline, training is noisy and unstable.
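A toy example makes the baseline tangible. Suppose one prompt gets eight sampled answers and a binary verifier scores them; the arithmetic below uses the population standard deviation just to keep the numbers clean.
import statistics

group_rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0]  # 4 right, 4 wrong
mean = statistics.mean(group_rewards)    # 0.5
std = statistics.pstdev(group_rewards)   # 0.5

# Each response's advantage is how it did relative to its peers.
advantages = [(r - mean) / std for r in group_rewards]
# -> [1.0, -1.0, -1.0, 1.0, 1.0, -1.0, 1.0, -1.0]
# Correct answers get pushed up, wrong ones down, no matter how hard
# the prompt is in absolute terms.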
Rule three: a leash keeps the model honest. The KL term against a reference model is what stops the policy from drifting into reward hacks. If you remove the leash, the model finds adversarial outputs that score high on the reward function but are nonsense. The leash is non-negotiable.
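The leash is also cheap to compute. Here is a minimal sketch of a per-token KL penalty against a frozen reference model, using the always-non-negative estimator the GRPO paper works with; the tensor names are illustrative.
import torch

def kl_penalty(policy_logprobs, ref_logprobs, beta=0.04):
    # Both tensors hold per-token log-probabilities of the sampled responses.
    log_ratio = ref_logprobs - policy_logprobs
    # exp(x) - x - 1 is >= 0, so the penalty never rewards drifting away.
    kl = torch.exp(log_ratio) - log_ratio - 1
    return beta * kl.mean()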
That is GRPO. Sample a group, compute advantages relative to the group, take a clipped gradient step, keep a leash. The rest is engineering.
Hugging Face's TRL library shipped GRPO support shortly after the DeepSeek R1 paper landed. A minimal training script looks like this:
from trl import GRPOConfig, GRPOTrainer
from datasets import load_dataset

# GSM8K ships "question" and "answer" columns; GRPOTrainer expects a "prompt" column.
dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.rename_column("question", "prompt")

def reward_correctness(prompts, completions, answer, **kwargs):
    # Extra dataset columns (here, "answer") are passed to reward functions as kwargs.
    # extract_answer is a small helper you write yourself: it pulls the final
    # number out of GSM8K's "#### 42" format and out of the model's completion.
    rewards = []
    for completion, gold_answer in zip(completions, answer):
        gold = extract_answer(gold_answer)
        predicted = extract_answer(completion)
        rewards.append(1.0 if predicted == gold else 0.0)
    return rewards

config = GRPOConfig(
    output_dir="grpo-r1-replication",
    num_generations=8,            # group size
    max_completion_length=2048,
    learning_rate=5e-6,
    beta=0.04,                    # KL coefficient
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
)

trainer = GRPOTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",
    reward_funcs=[reward_correctness],
    args=config,
    train_dataset=dataset,
)

trainer.train()
A few things worth flagging when you actually run this.
The group size matters. Eight is a reasonable default. Smaller groups give noisier baselines. Larger groups burn more compute per step and the marginal benefit drops. The DeepSeek paper used larger groups, but they had the budget.
The KL coefficient matters. Too low, the model wanders. Too high, the model cannot learn anything new. 0.04 is a common starting point for instruction-tuned bases. Tune it.
Reward functions can be lists. TRL accepts multiple reward functions and sums them. That is how you do "correctness plus formatting" or "correctness plus length penalty" without rewriting the whole pipeline.
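A hedged sketch of what that looks like, with two made-up auxiliary rewards sitting next to the correctness check from the script above:
import re

def reward_format(prompts, completions, **kwargs):
    # Small bonus for wrapping reasoning in <think>...</think> tags.
    return [0.2 if re.search(r"<think>.*</think>", c, re.DOTALL) else 0.0
            for c in completions]

def reward_length_penalty(prompts, completions, **kwargs):
    # Mild penalty for extremely long completions, to keep traces focused.
    return [-0.1 if len(c) > 6000 else 0.0 for c in completions]

trainer = GRPOTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",
    reward_funcs=[reward_correctness, reward_format, reward_length_penalty],
    args=config,
    train_dataset=dataset,
)
If an equal-weight sum is not what you want, GRPOConfig also exposes a reward_weights option.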
For the full reproduction walkthrough with a working dataset and evaluation harness, the DevDigest YouTube channel has the video version, and the DD Academy hosts the full course on RL post-training.
GRPO is most relevant for one specific use case: you have a domain where you can write a verifier, and you want a model that gets noticeably better at that domain than the base. Code generation. Math. Tool calling with strict schemas. Structured extraction. Anything that has a checker.
It is less relevant for open-ended chat where the only reward signal is human preference. There, DPO and its variants are still simpler to run than GRPO, because they sidestep the rollout step entirely.
For agent builders, the most interesting application is fine-tuning a small open-weights model on a verifiable agent task. Think: a 7B model that learns to call a specific tool API correctly, judged by whether the API call succeeds and returns the expected shape. GRPO on that setup is cheap, the verifier is free, and the resulting model is small enough to deploy on commodity inference hardware.
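As a sketch, suppose the task is to emit a JSON tool call with the right tool name and required arguments. The tool name and field names below are hypothetical, but the shape of the verifier is the point: parse, check, score.
import json

REQUIRED_ARGS = {"path", "mode"}  # hypothetical schema for a read_file tool

def tool_call_reward(prompts, completions, **kwargs):
    rewards = []
    for completion in completions:
        try:
            call = json.loads(completion)
            ok = (call.get("tool") == "read_file"
                  and REQUIRED_ARGS.issubset(call.get("arguments", {})))
            rewards.append(1.0 if ok else 0.0)
        except (json.JSONDecodeError, AttributeError):
            rewards.append(0.0)  # not even valid JSON
    return rewards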
We use this kind of pipeline internally on AgentFS for the agent-side filesystem operations. The verifier is whether the operation succeeded against a reference virtual filesystem. The training run is small. The deployed model is tiny. The behavior is rock solid because the verifier is deterministic.
Three threads worth following.
Process reward models. GRPO uses a final-answer reward. Process reward models score every reasoning step. The combination, GRPO with a process reward, is the next obvious move and several labs are working on it.
Verifier-free GRPO. The recipe assumes a verifier. The interesting research direction is whether learned reward models that judge reasoning quality can substitute for programmatic verifiers without the usual reward-hacking failure modes.
Smaller-model viability. Most GRPO results are at 7B and up. The question is how small you can go before the recipe stops working. There is a real prize at 1B to 3B for on-device reasoning models if the answer is "small enough."
If you take one thing away: GRPO is PPO minus the value model, with a group baseline filling the gap. That is the whole trick. Now go read the R1 paper without flinching.