Microsoft PHI-4: A 14B Parameter Model That Rivals Models 5x Its Size

Official Sources
PHI-4 on Hugging Face	Model weights and documentation
Microsoft Research Blog	Technical report and architecture details
Azure AI Model Catalog	Managed deployment options
PHI-4 on Ollama	Local installation and quick start
Microsoft Phi Cookbook	Sample code and model files
PHI-4 Technical Report (arXiv)	Full technical paper

Microsoft quietly released PHI-4 in December 2024, and it got buried under the noise of OpenAI's 12 Days of Shipmas and a wave of Gemini announcements. That is unfortunate, because PHI-4 is one of the most impressive small language models released to date. At just 14 billion parameters, it matches models that are five times its size on multiple benchmarks, runs comfortably on consumer hardware, and ships under an MIT license that allows unrestricted commercial use.

The model is available on Hugging Face right now. You can pull it down through Ollama and have it running locally in under five minutes. And the performance is good enough that for many tasks, you would not know you are using a model this small.

What Makes PHI-4 Different

PHI-4's approach to training is what sets it apart from other models in its size class. Instead of training on the largest possible dataset, Microsoft focused on data quality. The training data is a blend of synthetic datasets, filtered public domain websites, academic books, and QA datasets. The goal was to optimize for high-quality reasoning rather than broad coverage.

This data-centric approach produced a model that punches well above its weight class. On MMLU, PHI-4 ranks alongside Llama 3.3 70B and Qwen 2.5 72B. These are models with five times the parameters and substantially higher hardware requirements. The fact that a 14B model competes at this level says something meaningful about how far training methodology has progressed.

The model went through both supervised fine-tuning and direct preference optimization for alignment. The combination is intended to make the model follow instructions closely while keeping the safety guardrails that enterprise users need.

Technical Specifications

The architecture is a dense transformer with 14 billion parameters. Unlike mixture-of-experts models, which activate only a subset of parameters per token, PHI-4 uses all 14 billion parameters for every token it processes. This keeps the per-token compute cost predictable.

Key specifications:

Parameters: 14 billion (dense)
Context length: 16,000 tokens
Input: Text only (no vision or multimodal)
Training data: Approximately 10 trillion tokens
Training hardware: 1,920 H100 GPUs over 21 days
Knowledge cutoff: June 2024
License: MIT (fully permissive, commercial use allowed)
Format: Optimized for chat/instruction following
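
As a rough sanity check on the training figures above, the sketch below converts the listed hardware and duration into GPU-hours and applies the common "6 x parameters x tokens" rule of thumb for dense transformer training compute. That rule of thumb is an assumption brought in for illustration; it is not a number from Microsoft's report, and it assumes every GPU ran for the full 21 days.

```python
# Back-of-the-envelope training compute from the specs listed above.
# ASSUMPTION: the 6 * N * D FLOPs rule of thumb is a generic approximation
# for dense transformers, not a figure published for PHI-4.

PARAMS = 14e9    # 14 billion parameters
TOKENS = 10e12   # approximately 10 trillion training tokens
GPUS = 1920      # H100 GPUs
DAYS = 21


def gpu_hours(gpus: int, days: int) -> int:
    """Total GPU-hours if every GPU ran for the full duration."""
    return gpus * days * 24


def approx_training_flops(params: float, tokens: float) -> float:
    """Rule-of-thumb estimate: about 6 FLOPs per parameter per token."""
    return 6 * params * tokens


if __name__ == "__main__":
    print(f"GPU-hours: {gpu_hours(GPUS, DAYS):,}")
    print(f"Approx. training FLOPs: {approx_training_flops(PARAMS, TOKENS):.1e}")
```

Under these assumptions the run works out to a little under one million GPU-hours, which is small compared with the budgets usually associated with 70B-class models.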

The 16,000 token context length is adequate for most coding tasks and document analysis, though it falls short of the 128,000 tokens offered by larger models like Llama 3.3. For applications that require longer context, you will need to use chunking strategies or switch to a larger model.
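
One simple chunking strategy is a sliding window with a small overlap, so that text near a boundary appears in both neighboring chunks. The sketch below is a minimal version of that idea. It assumes roughly four characters per token, which is a crude heuristic and not PHI-4's actual tokenizer, and the function name and default budgets are hypothetical. The default of 12,000 tokens leaves headroom inside the 16,000 token window for instructions and the model's response.

```python
# Split long text into overlapping chunks that fit a token budget.
# ASSUMPTION: ~4 characters per token is a rough average, not PHI-4's
# real tokenizer. Use an actual tokenizer for precise budgeting.

CHARS_PER_TOKEN = 4  # hypothetical average


def chunk_text(text: str, max_tokens: int = 12_000, overlap_tokens: int = 200) -> list[str]:
    """Return consecutive chunks of at most max_tokens (estimated) each."""
    if overlap_tokens >= max_tokens:
        raise ValueError("overlap_tokens must be smaller than max_tokens")
    max_chars = max_tokens * CHARS_PER_TOKEN
    step = max_chars - overlap_tokens * CHARS_PER_TOKEN
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this chunk already reached the end of the text
        start += step
    return chunks
```

Each chunk can then be sent to the model separately and the partial answers combined, at the cost of the model never seeing the whole document at once.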

Benchmark Performance

The benchmark results are where PHI-4 gets interesting. Here are the highlights from both Microsoft's evaluations and the technical report:

MMLU: Competitive with Llama 3.3 70B and Qwen 2.5 72B. This is a general knowledge and reasoning benchmark, and scoring at this level with a 14B model is exceptional.

GPQA (Graduate-level science questions): PHI-4 outperforms GPT-4o by approximately 6 points. This is a demanding benchmark that tests deep reasoning on complex scientific topics.

Math benchmarks: Also outperforms GPT-4o by about 6 points. The synthetic data approach appears to have been particularly effective for mathematical reasoning.

HumanEval (code generation): Scores 82.6, compared to 78.9 for Llama 3.3 70B Instruct and 80.4 for Qwen 2.5 72B Instruct. That is still about 8 points below GPT-4o, but remarkably strong for a model this size.

The pattern across benchmarks is consistent: PHI-4 performs at or near the level of models that are 4-5x larger. The gap to the absolute frontier models (GPT-4o, Claude 3.5 Sonnet) exists but is narrower than you would expect given the size difference.
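
To make the size-versus-score comparison concrete, the short sketch below recomputes the ratios from the HumanEval numbers quoted in this article. It assumes the 80.4 Qwen score belongs to the 72B model, and the dictionary layout and function name are illustrative only.

```python
# HumanEval scores and parameter counts as quoted in this article.
# ASSUMPTION: the Qwen 2.5 score of 80.4 refers to the 72B model.
PHI4 = {"params_b": 14, "humaneval": 82.6}
LARGER_MODELS = {
    "Llama 3.3 70B Instruct": {"params_b": 70, "humaneval": 78.9},
    "Qwen 2.5 72B": {"params_b": 72, "humaneval": 80.4},
}


def compare(small: dict, large: dict) -> tuple[float, float]:
    """Return (size ratio of large to small, HumanEval lead of small over large)."""
    size_ratio = round(large["params_b"] / small["params_b"], 1)
    score_lead = round(small["humaneval"] - large["humaneval"], 1)
    return size_ratio, score_lead


if __name__ == "__main__":
    for name, model in LARGER_MODELS.items():
        ratio, lead = compare(PHI4, model)
        print(f"{name}: {ratio}x the parameters, PHI-4 leads by {lead} points")
```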

Running PHI-4 Locally

The most practical way to get started with PHI-4 is through Ollama. If you do not have Ollama installed, the setup is straightforward - download from ollama.com for Mac, Linux, or Windows, and you are ready to go.

Pull and run the model with a single command:

Terminal

ollama run phi4

The first time you run this, it downloads roughly 10GB of model data. After that, startup is nearly instant.
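
Once the model is downloaded, Ollama also serves it over a local HTTP API, which is convenient for scripting. The sketch below is a minimal client using only the Python standard library. It assumes Ollama is running on its default address (localhost, port 11434) and that the model was pulled under the name phi4; the helper names are illustrative.

```python
import json
import urllib.request

# ASSUMPTION: Ollama is listening on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "phi4") -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "phi4", timeout: float = 120.0) -> str:
    """Send the prompt and return the model's text response."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]
```

With Ollama running, calling generate("Explain what a dense transformer is in two sentences.") returns the completion as a string. Setting "stream" to False trades incremental output for a single JSON reply, which keeps the example short.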

In terms of hardware requirements, PHI-4 is one of the most accessible frontier-quality models available. Testing on an M3 MacBook Pro with 18GB of unified memory showed responsive inference times. This is not a machine optimized for running local models - there is no discrete GPU and the memory is modest by ML standards. Yet the model runs well enough for interactive use.

For developers with more capable hardware - machines with 32GB or more of memory, or NVIDIA GPUs with 16GB+ VRAM - the inference speed improves substantially. But the key point is that PHI-4 is usable even on standard developer hardware. You do not need a specialized ML workstation.
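
A quick way to reason about these memory figures is to estimate the size of the weights at different precisions. The arithmetic below is a sketch: the bit widths are illustrative assumptions, since this article does not state which quantization the Ollama build uses, and the estimate covers weights only, not the KV cache or runtime overhead.

```python
# Approximate memory needed for the weights of a 14B-parameter model.
# ASSUMPTION: the bit widths below are illustrative; the actual download
# may use a mixed or different quantization scheme.

PARAMS = 14e9


def weight_memory_gb(params: float, bits_per_weight: float) -> float:
    """Size of the weights alone, in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    for bits in (16, 8, 4):
        print(f"{bits}-bit weights: about {weight_memory_gb(PARAMS, bits):.0f} GB")
```

At 16 bits per weight the model would need about 28 GB for weights alone, so the roughly 10GB download suggests a quantized build, which is also why it fits on an 18GB laptop.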

Using PHI-4 in Your IDE

Ollama pairs well with Continue, an open-source VS Code extension that provides a chat interface and code assistance powered by local models. Install Continue from the VS Code marketplace, configure it to use your local Ollama instance, and you have an AI coding assistant running entirely on your machine.

The workflow is similar to Copilot or Cursor's chat: open the chat panel with Command+L (Ctrl+L on Windows and Linux), describe what you want, and the model generates code. You can insert generated code directly into your files or apply it as a diff. For straightforward generation tasks like scaffolding an Express server, writing utility functions, or generating test cases, PHI-4 through Continue is a capable and completely free alternative to paid AI coding tools.

Local execution also removes the network round trip, so response time depends only on your own hardware. Your prompts never leave your machine. For developers working with sensitive codebases, or in environments where sending code to external APIs is not allowed, this is a meaningful advantage.

When to Use PHI-4 vs. Larger Models

PHI-4 excels in situations where you need:

Fast local inference. The model runs well on consumer hardware and provides interactive response times without cloud dependencies.

Cost-free operation. No API keys, no subscription, no per-token charges. Once downloaded, the model runs indefinitely with no usage fees; the only ongoing cost is your own hardware and electricity.

Privacy. All inference happens locally. No data leaves your machine.

Math and reasoning tasks. The benchmark results show genuine strength in quantitative reasoning and scientific analysis.

PHI-4 is less suitable when you need:

Long context. The 16,000 token limit means you cannot feed entire codebases or long documents. Larger models with 128K+ context windows are better for these use cases.

Best-in-class code generation. While PHI-4 is strong for its size, GPT-4o and Claude 3.5 Sonnet still produce cleaner, more idiomatic code on complex generation tasks.

Multimodal input. PHI-4 is text-only. If you need image understanding or vision capabilities, look at models like Llama 3.2 Vision or GPT-4o.
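
The criteria above can be condensed into a small routing rule. The sketch below is one hypothetical way to encode them; the Task fields, the returned labels, and the idea of falling back to local chunking when data must stay on the machine are illustrative assumptions, not part of any tool mentioned here.

```python
from dataclasses import dataclass

PHI4_CONTEXT_TOKENS = 16_000  # context length quoted in this article


@dataclass
class Task:
    estimated_tokens: int          # prompt plus expected response
    needs_images: bool = False
    must_stay_local: bool = False


def choose_model(task: Task) -> str:
    """Pick a (hypothetical) model category using the criteria listed above."""
    if task.needs_images:
        return "multimodal-model"      # PHI-4 is text-only
    if task.estimated_tokens > PHI4_CONTEXT_TOKENS:
        # Too long for a single pass: chunk locally if privacy requires it,
        # otherwise prefer a model with a longer context window.
        return "phi4-local-chunked" if task.must_stay_local else "long-context-model"
    return "phi4-local"
```

For example, choose_model(Task(estimated_tokens=2_000)) selects the local model, while a 50,000 token task is routed to a long-context model unless it must stay on the machine.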

The Small Model Revolution

PHI-4 is part of a broader trend toward smaller, more efficient models that deliver surprising quality. The old assumption that bigger models are always better is breaking down. Training methodology, data quality, and alignment techniques increasingly matter more than raw parameter count.

For developers, this is excellent news. It means capable AI assistance is becoming accessible without expensive API subscriptions or cloud infrastructure. A model that rivals GPT-4o on math and science benchmarks, runs on a standard laptop, and costs nothing to use - that was not possible a year ago.

The trajectory suggests this will only accelerate. Each generation of small models closes more of the gap with frontier models while keeping the practical advantages of small models in cost, speed, and privacy. PHI-4 represents the current state of the art for this class, but the next generation is already in development.

Getting Started

Install Ollama from ollama.com
Pull and run the model: ollama run phi4
Optionally install Continue for VS Code integration
Test with your actual use cases to evaluate quality for your needs
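
After the first two steps, a quick script can confirm that the local server is up and the model is present. The sketch below queries Ollama's model listing endpoint. It assumes the default local address and that the model appears under the name phi4 (with or without a tag such as ":latest"); the helper names are illustrative.

```python
import json
import urllib.request

# ASSUMPTION: Ollama is listening on its default local address.
TAGS_URL = "http://localhost:11434/api/tags"


def has_model(tags: dict, name: str = "phi4") -> bool:
    """True if any listed model is `name`, ignoring a tag like ':latest'."""
    return any(
        m.get("name", "").split(":")[0] == name
        for m in tags.get("models", [])
    )


def fetch_tags(timeout: float = 5.0) -> dict:
    """Query the local Ollama server; raises URLError if it is not running."""
    with urllib.request.urlopen(TAGS_URL, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With Ollama running, has_model(fetch_tags()) returns True once the download has finished.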

The model download is about 10GB, and first-run setup takes a few minutes. After that, you have a frontier-competitive language model running locally with no ongoing costs. For anyone interested in local AI development, PHI-4 is one of the strongest starting points available.

Frequently Asked Questions

How does PHI-4 compare to GPT-4o?

PHI-4 outperforms GPT-4o by approximately 6 points on GPQA (graduate-level science questions) and on math benchmarks. On HumanEval code generation, PHI-4 scores 82.6 compared to GPT-4o's approximately 90, so it is still about 8 points behind. The remarkable aspect is that PHI-4 achieves this with 14 billion parameters, while GPT-4o's parameter count is undisclosed but widely assumed to be far larger, and that it runs locally on consumer hardware with zero API costs.

What hardware do I need to run PHI-4 locally?

PHI-4 runs on surprisingly modest hardware. Testing shows responsive inference on an M3 MacBook Pro with 18GB unified memory - not a machine optimized for ML workloads. For better performance, 32GB or more RAM helps, and NVIDIA GPUs with 16GB+ VRAM significantly improve inference speed. The model download is approximately 10GB, so you need that much free disk space plus room for Ollama.

Is PHI-4 free to use commercially?

Yes. PHI-4 is released under the MIT license, which is fully permissive and allows unrestricted commercial use. You can deploy it in production applications, modify it, and redistribute it without paying licensing fees or royalties. This makes it one of the most accessible frontier-quality models for commercial development.

How does PHI-4 compare to Llama 3.3 70B?

PHI-4 is competitive with Llama 3.3 70B on MMLU despite having only 14 billion parameters versus Llama's 70 billion. On HumanEval code generation, PHI-4 scores 82.6 compared to Llama 3.3 70B Instruct's 78.9, so PHI-4 actually outperforms the larger model on this benchmark. The key tradeoff is context length: PHI-4 supports 16K tokens while Llama 3.3 offers 128K tokens.

What are PHI-4's main limitations?

PHI-4 has three primary limitations. First, the 16,000 token context window is short compared to 128K token models - you cannot feed entire codebases. Second, it is text-only with no vision or multimodal capabilities. Third, while strong for its size, frontier models like GPT-4o and Claude 3.5 Sonnet still produce cleaner code on complex generation tasks. For tasks requiring long context, multimodal input, or best-in-class code quality, larger models remain superior.

How do I install and run PHI-4 with Ollama?

Install Ollama from ollama.com, then run a single command: ollama run phi4. The first run downloads approximately 10GB of model data, which takes a few minutes depending on your connection. After the initial download, subsequent startups are nearly instant. You can then interact with the model directly in your terminal, or configure VS Code extensions like Continue to use your local Ollama instance for IDE integration.

Why is PHI-4 so efficient compared to larger models?

Microsoft used a data-centric training approach focused on quality over quantity. Instead of training on the largest possible dataset, they curated a blend of synthetic datasets, filtered public domain websites, academic books, and QA datasets - approximately 10 trillion tokens optimized for high-quality reasoning. The model architecture is a dense transformer (all 14B parameters active per inference), trained on 1,920 H100 GPUs over 21 days with both supervised fine-tuning and direct preference optimization.

Can I use PHI-4 for coding in VS Code?

Yes. Install the Continue extension from the VS Code marketplace and configure it to use your local Ollama instance running PHI-4. Open the chat panel with Command+L (Ctrl+L on Windows and Linux), describe what you want, and the model generates code you can insert directly into your files or apply as a diff. For tasks like scaffolding servers, writing utility functions, or generating test cases, this setup provides a capable AI coding assistant running entirely on your machine, with no network round trip and prompts that never leave your computer.
