
TL;DR
GRPO is suddenly the standard RL recipe for reasoning models. A no-prior-knowledge mental model of PPO, GRPO, and how DeepSeek R1's training works under the hood.
Six months ago, if you asked an ML engineer how reasoning models were trained, the answer involved PPO, a reward model, a value head, and a lot of careful KL constraints. Today, half the open-weights reasoning models on Hugging Face mention GRPO in their model card, and the other half are racing to switch. DeepSeek R1 was the inflection point. Its training recipe leaned on Group Relative Policy Optimization, the results spoke loudly, and the rest of the field followed.
For model-selection context, see Claude vs GPT for Coding: Which Model Writes Better TypeScript? and OpenAI vs Anthropic in 2026 - Models, Tools, and Developer Experience. The useful question is not only benchmark quality, but where the model fits in a real developer workflow.
If you are a developer who builds with LLMs but has never trained one, the alphabet soup is intimidating. PPO. GRPO. DPO. RLHF. RLAIF. Reward models, value heads, advantage estimates, clip ratios. Most explanations assume two semesters of RL background you do not have. This post is the version that does not.
The goal here is a working mental model. Not a derivation, not a proof, not a reproduction of the math. By the end you should be able to read the DeepSeek R1 paper, understand the Hugging Face GRPO writeup it leans on, and have an opinion about why this recipe is winning.
Start from a base model. It has been pretrained on the internet and then supervised-fine-tuned on instruction data. It can answer questions. It is not particularly good at reasoning, math, or following complex constraints. You want to make it better at those things.
You have a way to score answers. Maybe it is a learned reward model trained on human preference data. Maybe it is a programmatic checker, like running a math problem through SymPy or executing generated code against unit tests. Either way, given a prompt and a model output, you can produce a number that says "this answer was good" or "this answer was bad."
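To make the code-checker flavor concrete, here is a minimal sketch of a programmatic reward, assuming you can bundle the generated solution with its unit tests and run them in a subprocess. The function name and the ten-second timeout are illustrative, not any particular library's API.
import subprocess
import tempfile

def code_reward(generated_code: str, test_code: str) -> float:
    # Write the candidate solution plus its tests to a temp file and run it.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=10)
        return 1.0 if result.returncode == 0 else 0.0  # pass = good, fail = bad
    except subprocess.TimeoutExpired:
        return 0.0  # an answer that hangs counts as a wrong answer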
Now you want to nudge the model so that it produces more of the high-scoring answers and fewer of the low-scoring ones. That is the entire game. Every method on the menu, PPO, GRPO, DPO, REINFORCE, is a different answer to one question: how do you turn a reward signal into a gradient update without the model collapsing?
Proximal Policy Optimization, the original RLHF workhorse, looks like this. You generate an answer. You score it. You also run a separate value model that predicts, for any partial sequence, what the eventual reward will probably be. The difference between the actual reward and the predicted reward is the advantage, the surprise factor. You take a gradient step that pushes the model to produce more answers that beat the value model's prediction and fewer that fall short. You add a KL penalty against the original base model so the policy does not drift into nonsense, and you clip the gradient ratio so a single big update cannot destabilize training.
The conceptual cost of PPO is the value model. It is roughly the same size as the policy. You have to train it, store it, run it on every step, and tune it. RLHF infrastructure is dominated by the bookkeeping around this second model.
# Pseudocode, PPO step
prompts = sample_batch()
responses = policy.generate(prompts)
rewards = reward_model(prompts, responses)
values = value_model(prompts, responses) # the expensive part
advantages = rewards - values
loss = ppo_clip(policy.logprobs(responses), old_logprobs, advantages)
loss += beta * kl_divergence(policy, reference_policy)
loss.backward()
That value model is what GRPO removes.
Group Relative Policy Optimization makes a clever swap. Instead of training a value model to predict expected reward, it samples a group of responses for each prompt, scores them all, and uses the group's mean and standard deviation as the baseline. The advantage of any response is just how much better or worse it scored than its peers from the same prompt. No value model needed.
# Pseudocode, GRPO step
prompts = sample_batch()
groups = [policy.generate(p, n=8) for p in prompts] # 8 responses per prompt
rewards = [[reward_model(p, r) for r in group] for p, group in zip(prompts, groups)]
# advantage = (reward - group_mean) / group_std
advantages = [[(r - mean(group_r)) / std(group_r) for r in group_r] for group_r in rewards]
loss = ppo_clip(policy.logprobs(groups), old_logprobs, advantages)
loss += beta * kl_divergence(policy, reference_policy)
loss.backward()
The change is small, the consequences are large. You eliminated the value model. Your training infrastructure is half the size. Your reward signal is denser per prompt, because you are computing eight rewards where PPO computed one. And empirically, on math and code reasoning benchmarks, the recipe just works.
DeepSeek R1's training recipe is the highest-profile validation of GRPO to date. The team did not need a learned reward model. The reasoning tasks they cared about, math problems and code, have programmatic verifiers. A math answer is right or wrong. Code passes the tests or it does not. That gives you a deterministic, free, infinitely scalable reward signal.
Combine that with GRPO's no-value-model property, and the entire training stack collapses to: a policy model, a reference model for the KL term, and a verifier function. You can run that on commodity training infrastructure. You do not need to train and serve a separate reward network. You do not need preference annotation pipelines. The cost structure changes from "RLHF needs a research lab" to "RL post-training needs a verifier and a GPU cluster."
R1's other contribution was showing that you can do a lot of GRPO before the model breaks. The training run was long. The reward signal was simple. The model learned to produce long chain-of-thought reasoning traces because longer correct traces won relative to shorter wrong ones in the group comparison, and the group baseline kept that signal stable.
Strip the math out and the recipe becomes simple enough to remember.
Rule one: rewards drive direction. Whatever you reward more, you get more of. If your verifier rewards correct final answers regardless of reasoning, you get a model that gets answers right with terse reasoning. If you reward longer reasoning that ends in correct answers, you get a chain-of-thought model. The reward function is the product spec.
Rule two: a baseline keeps the gradient sane. The model needs to know what counts as "better than expected." PPO uses a value model to estimate that. GRPO uses peer responses. Either way, you are subtracting a baseline so that the gradient signal is the surprise, not the absolute reward. Without a baseline, training is noisy and unstable.
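A toy example makes the baseline tangible. Suppose one prompt gets eight sampled answers and a binary verifier scores them; the arithmetic below uses the population standard deviation just to keep the numbers clean.
import statistics

group_rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0]  # 4 right, 4 wrong
mean = statistics.mean(group_rewards)    # 0.5
std = statistics.pstdev(group_rewards)   # 0.5

# Each response's advantage is how it did relative to its peers.
advantages = [(r - mean) / std for r in group_rewards]
# -> [1.0, -1.0, -1.0, 1.0, 1.0, -1.0, 1.0, -1.0]
# Correct answers get pushed up, wrong ones down, no matter how hard
# the prompt is in absolute terms.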
Rule three: a leash keeps the model honest. The KL term against a reference model is what stops the policy from drifting into reward hacks. If you remove the leash, the model finds adversarial outputs that score high on the reward function but are nonsense. The leash is non-negotiable.
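The leash is also cheap to compute. Here is a minimal sketch of a per-token KL penalty against a frozen reference model, using the always-non-negative estimator the GRPO paper works with; the tensor names are illustrative.
import torch

def kl_penalty(policy_logprobs, ref_logprobs, beta=0.04):
    # Both tensors hold per-token log-probabilities of the sampled responses.
    log_ratio = ref_logprobs - policy_logprobs
    # exp(x) - x - 1 is >= 0, so the penalty never rewards drifting away.
    kl = torch.exp(log_ratio) - log_ratio - 1
    return beta * kl.mean()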
That is GRPO. Sample a group, compute advantages relative to the group, take a clipped gradient step, keep a leash. The rest is engineering.
Hugging Face's TRL library shipped GRPO support shortly after the DeepSeek R1 paper landed. A minimal training script looks like this:
from trl import GRPOConfig, GRPOTrainer
from datasets import load_dataset

# GSM8K ships "question" and "answer" columns; GRPOTrainer expects a "prompt" column.
dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.rename_column("question", "prompt")

def reward_correctness(prompts, completions, answer, **kwargs):
    # Extra dataset columns (here, "answer") are passed to reward functions as kwargs.
    # extract_answer is a small helper you write yourself: it pulls the final
    # number out of GSM8K's "#### 42" format and out of the model's completion.
    rewards = []
    for completion, gold_answer in zip(completions, answer):
        gold = extract_answer(gold_answer)
        predicted = extract_answer(completion)
        rewards.append(1.0 if predicted == gold else 0.0)
    return rewards

config = GRPOConfig(
    output_dir="grpo-r1-replication",
    num_generations=8,            # group size
    max_completion_length=2048,
    learning_rate=5e-6,
    beta=0.04,                    # KL coefficient
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
)

trainer = GRPOTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",
    reward_funcs=[reward_correctness],
    args=config,
    train_dataset=dataset,
)

trainer.train()
A few things worth flagging when you actually run this.
The group size matters. Eight is a reasonable default. Smaller groups give noisier baselines. Larger groups burn more compute per step and the marginal benefit drops. The DeepSeek paper used larger groups, but they had the budget.
The KL coefficient matters. Too low, the model wanders. Too high, the model cannot learn anything new. 0.04 is a common starting point for instruction-tuned bases. Tune it.
Reward functions can be lists. TRL accepts multiple reward functions and sums them. That is how you do "correctness plus formatting" or "correctness plus length penalty" without rewriting the whole pipeline.
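A hedged sketch of what that looks like, with two made-up auxiliary rewards sitting next to the correctness check from the script above:
import re

def reward_format(prompts, completions, **kwargs):
    # Small bonus for wrapping reasoning in <think>...</think> tags.
    return [0.2 if re.search(r"<think>.*</think>", c, re.DOTALL) else 0.0
            for c in completions]

def reward_length_penalty(prompts, completions, **kwargs):
    # Mild penalty for extremely long completions, to keep traces focused.
    return [-0.1 if len(c) > 6000 else 0.0 for c in completions]

trainer = GRPOTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",
    reward_funcs=[reward_correctness, reward_format, reward_length_penalty],
    args=config,
    train_dataset=dataset,
)
If an equal-weight sum is not what you want, GRPOConfig also exposes a reward_weights option.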
For the full reproduction walkthrough with a working dataset and evaluation harness, the DevDigest YouTube channel has the video version, and the DD Academy hosts the full course on RL post-training.
GRPO is most relevant for one specific use case: you have a domain where you can write a verifier, and you want a model that gets noticeably better at that domain than the base. Code generation. Math. Tool calling with strict schemas. Structured extraction. Anything that has a checker.
It is less relevant for open-ended chat where the only reward signal is human preference. There, DPO and its variants are still simpler to run than GRPO, because they sidestep the rollout step entirely.
For agent builders, the most interesting application is fine-tuning a small open-weights model on a verifiable agent task. Think: a 7B model that learns to call a specific tool API correctly, judged by whether the API call succeeds and returns the expected shape. GRPO on that setup is cheap, the verifier is free, and the resulting model is small enough to deploy on commodity inference hardware.
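As a sketch, suppose the task is to emit a JSON tool call with the right tool name and required arguments. The tool name and field names below are hypothetical, but the shape of the verifier is the point: parse, check, score.
import json

REQUIRED_ARGS = {"path", "mode"}  # hypothetical schema for a read_file tool

def tool_call_reward(prompts, completions, **kwargs):
    rewards = []
    for completion in completions:
        try:
            call = json.loads(completion)
            ok = (call.get("tool") == "read_file"
                  and REQUIRED_ARGS.issubset(call.get("arguments", {})))
            rewards.append(1.0 if ok else 0.0)
        except (json.JSONDecodeError, AttributeError):
            rewards.append(0.0)  # not even valid JSON
    return rewards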
We use this kind of pipeline internally on AgentFS for the agent-side filesystem operations. The verifier is whether the operation succeeded against a reference virtual filesystem. The training run is small. The deployed model is tiny. The behavior is rock solid because the verifier is deterministic.
Three threads worth following.
Process reward models. GRPO uses a final-answer reward. Process reward models score every reasoning step. The combination, GRPO with a process reward, is the next obvious move and several labs are working on it.
Verifier-free GRPO. The recipe assumes a verifier. The interesting research direction is whether learned reward models that judge reasoning quality can substitute for programmatic verifiers without the usual reward-hacking failure modes.
Smaller-model viability. Most GRPO results are at 7B and up. The question is how small you can go before the recipe stops working. There is a real prize at 1B to 3B for on-device reasoning models if the answer is "small enough."
If you take one thing away: GRPO is PPO minus the value model, with a group baseline filling the gap. That is the whole trick. Now go read the R1 paper without flinching.