Run AI Models Locally with Ollama
Install Ollama, pull your first model, and run AI locally for coding, chat, and automation - with zero cloud dependency.
Running AI models on your own machine gives you something no cloud API can: complete control. No usage limits, no API keys, no data leaving your computer. This guide walks you through setting up Ollama, choosing the right models, and integrating local AI into your development workflow.
Why run models locally?
There are four compelling reasons to run models on your own hardware instead of relying on cloud APIs.
Privacy. Your code and prompts never leave your machine. This matters when you are working on proprietary codebases, handling sensitive data, or operating under compliance requirements. Local inference means zero data exposure.
Cost. Cloud API calls add up fast. GPT-4 class models cost $10-30 per million tokens. A local model running on your GPU costs nothing per request after the initial hardware investment. If you run hundreds of queries a day, the savings are significant.
Speed. No network round trip. A local model starts streaming tokens almost immediately for short prompts, especially on modern GPUs. You skip DNS lookups, TLS handshakes, queue times, and rate limits entirely.
Offline access. Airplanes, coffee shops with bad wifi, network outages - none of these stop a local model. Once downloaded, the model works with zero internet connectivity.
The tradeoff is clear: local models are smaller and less capable than the largest cloud models. But for many tasks - code completion, documentation, refactoring, Q&A - a well-chosen local model is more than sufficient.
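To make the cost argument concrete, here is a back-of-the-envelope break-even sketch. The query volume, token counts, per-token price, and GPU cost below are illustrative assumptions, not quotes:

```python
# Rough break-even sketch: cloud per-token pricing vs. a one-time GPU purchase.
# All numbers below are illustrative assumptions.

def monthly_cloud_cost(queries_per_day, tokens_per_query, price_per_million):
    """Estimated monthly API spend in dollars."""
    tokens_per_month = queries_per_day * tokens_per_query * 30
    return tokens_per_month / 1_000_000 * price_per_million

def breakeven_months(gpu_cost, monthly_cost):
    """How many months of usage until the GPU pays for itself."""
    return gpu_cost / monthly_cost

# Assumptions: 300 queries/day, ~2,000 tokens each, $15 per million tokens,
# and a $1,600 GPU (roughly RTX 4090 territory).
cloud = monthly_cloud_cost(300, 2000, 15)   # $270/month
months = breakeven_months(1600, cloud)      # ~5.9 months
print(f"Cloud: ${cloud:.0f}/month; GPU pays for itself in {months:.1f} months")
```

At lighter usage the math flips: at 20 queries a day, the same GPU takes years to pay off, which is exactly the "scale to zero" case where cloud APIs win.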
Install Ollama
Ollama is the easiest way to run local models. It handles model downloads, quantization, memory management, and provides both a CLI and an API server.
macOS
# Install via Homebrew
brew install ollama
Alternatively, download the macOS app directly from ollama.com/download (the curl install script targets Linux).
After installation, Ollama runs as a background service automatically. You can verify it is running:
ollama --version
Linux
curl -fsSL https://ollama.com/install.sh | sh
This installs Ollama and sets up a systemd service. The service starts automatically:
# Check status
systemctl status ollama
# Start manually if needed
systemctl start ollama
For NVIDIA GPU support, make sure you have up-to-date NVIDIA drivers installed (plus the NVIDIA Container Toolkit if you run Ollama inside Docker). Ollama detects your GPU automatically.
Windows
Download the installer from ollama.com/download. Run the .exe and follow the prompts. Ollama runs in the system tray.
For WSL2 users, install the Linux version inside your WSL2 distro instead. This gives you better GPU passthrough and a more consistent development experience.
Verify the installation
# Should print the version number
ollama --version
# List downloaded models (empty on fresh install)
ollama list
# The API server runs on port 11434 by default
curl http://localhost:11434/api/tags
Your first model: ollama run llama4
Let's pull and run a model. Llama 4 is Meta's latest open-weight model and a solid starting point.
# Pull and start an interactive chat session
ollama run llama4
The first run downloads the model (this takes a few minutes depending on your connection). Subsequent runs start instantly since the model is cached locally.
Once the model loads, you get an interactive prompt:
>>> What is the time complexity of quicksort?
Quicksort has an average-case time complexity of O(n log n) and a
worst-case time complexity of O(n^2). The worst case occurs when the
pivot selection consistently picks the smallest or largest element,
leading to unbalanced partitions...
Type /bye to exit the session.
Useful Ollama commands
# List all downloaded models
ollama list
# Pull a model without starting a chat
ollama pull qwen3.5-coder:32b
# Remove a model to free disk space
ollama rm llama4
# Show model details (parameters, quantization, size)
ollama show llama4
# Run with a system prompt
ollama run llama4 --system "You are a senior Python developer. Be concise."
# Pipe input from a file
cat bug-report.txt | ollama run llama4 "Summarize this bug report in 3 bullet points"
# Run the API server explicitly (usually auto-started)
ollama serve
Best models for coding
Not all models are created equal for programming tasks. Here are the top choices for code generation, completion, and refactoring as of March 2026.
Qwen 3.5 Coder
The current leader for local code generation. Available in multiple sizes to fit your hardware.
# 32B parameters - best quality, needs 20GB+ VRAM
ollama run qwen3.5-coder:32b
# 14B - great balance of quality and speed
ollama run qwen3.5-coder:14b
# 7B - fast, works on 8GB VRAM
ollama run qwen3.5-coder:7b
Qwen 3.5 Coder excels at:
- Multi-file code generation
- Understanding complex codebases
- TypeScript, Python, Rust, and Go
- Following coding conventions from context
DeepSeek Coder V3
Strong at code reasoning and multi-step problem solving. Particularly good at debugging.
# 33B - full quality
ollama run deepseek-coder-v3:33b
# 7B - lightweight option
ollama run deepseek-coder-v3:7b
Best for:
- Debugging and error analysis
- Algorithm implementation
- Code review and suggestions
- Mathematical and logical reasoning in code
CodeLlama
Meta's code-specialized Llama variant. Mature, well-tested, and widely supported by tools.
# 34B - best quality
ollama run codellama:34b
# 13B - good middle ground
ollama run codellama:13b
# 7B - lightweight
ollama run codellama:7b
Best for:
- Code infilling (fill-in-the-middle)
- Large context windows (up to 100K tokens)
- Broad language support
- Integration with older tooling that expects CodeLlama
Quick comparison for coding models
| Model | Disk Size | VRAM Needed | Speed | Code Quality |
|---|---|---|---|---|
| Qwen 3.5 Coder 32B | 18GB | 24GB | Medium | Excellent |
| Qwen 3.5 Coder 14B | 8GB | 12GB | Fast | Very Good |
| DeepSeek Coder V3 33B | 19GB | 24GB | Medium | Excellent |
| DeepSeek Coder V3 7B | 4GB | 8GB | Very Fast | Good |
| CodeLlama 34B | 19GB | 24GB | Medium | Very Good |
| CodeLlama 7B | 4GB | 8GB | Very Fast | Decent |
Best models for general use
For chat, writing, summarization, and general reasoning tasks, these models lead the pack.
Llama 4
Meta's flagship open model. Strong across the board for general tasks.
# Scout variant - lighter, faster
ollama run llama4
# Maverick variant - larger, more capable
ollama run llama4:maverick
Best for:
- General chat and Q&A
- Writing and editing
- Summarization
- Instruction following
Mistral
Mistral's models punch well above their weight class. Excellent efficiency-to-quality ratio.
# Mistral Large - top quality
ollama run mistral-large
# Mistral Small - fast and capable
ollama run mistral-small
# Mistral 7B - lightweight classic
ollama run mistral:7b
Best for:
- Fast responses with good quality
- Multilingual tasks (strong in European languages)
- Structured output generation
- Function calling and tool use
Phi-4
Microsoft's compact model series. Surprisingly capable for its size.
# Phi-4 14B - best in class for its size
ollama run phi4:14b
Best for:
- Machines with limited VRAM (runs well on 8GB)
- Reasoning tasks
- Math and science questions
- Fast iteration when you need quick answers
Quick comparison for general models
| Model | Disk Size | VRAM Needed | Speed | Quality |
|---|---|---|---|---|
| Llama 4 Scout | 15GB | 20GB | Medium | Excellent |
| Llama 4 Maverick | 25GB | 32GB | Slow | Outstanding |
| Mistral Large | 22GB | 28GB | Medium | Excellent |
| Mistral Small | 8GB | 12GB | Fast | Very Good |
| Phi-4 14B | 8GB | 10GB | Fast | Very Good |
Using local models with AI coding tools
The real power of local models comes from integrating them into your existing development workflow. Here is how to connect Ollama to popular AI coding tools.
Claude Code
Claude Code can use local models as a backend through the OpenAI-compatible API that Ollama provides.
# Set the environment variables to point at your local Ollama
export OPENAI_API_BASE=http://localhost:11434/v1
export OPENAI_API_KEY=ollama
You can also configure a model alias in your shell profile:
# Add to ~/.zshrc or ~/.bashrc
alias claude-local='OPENAI_API_BASE=http://localhost:11434/v1 claude'
Cursor
Cursor has built-in support for Ollama models.
- Open the command palette (Cmd+Shift+P on macOS, Ctrl+Shift+P on Linux/Windows) and search for Cursor Settings
- Navigate to Models > Model Provider
- Select Ollama as the provider
- Choose your model from the dropdown (Cursor auto-detects running models)
Alternatively, configure it in ~/.cursor/settings.json:
{
"ai.provider": "ollama",
"ai.model": "qwen3.5-coder:32b",
"ai.endpoint": "http://localhost:11434"
}
Continue.dev
Continue is an open-source AI coding assistant that runs in VS Code and JetBrains. It has excellent Ollama support.
Install the Continue extension, then edit ~/.continue/config.yaml:
models:
- title: "Qwen 3.5 Coder 32B"
provider: ollama
model: qwen3.5-coder:32b
apiBase: http://localhost:11434
- title: "Llama 4"
provider: ollama
model: llama4
apiBase: http://localhost:11434
tabAutocompleteModel:
title: "Qwen Coder 7B"
provider: ollama
model: qwen3.5-coder:7b
apiBase: http://localhost:11434
This gives you a full local AI coding setup: the 32B model for chat and generation, and the fast 7B model for tab autocomplete.
Using the Ollama API directly
Ollama exposes an OpenAI-compatible REST API. You can call it from any language or tool.
# Generate a completion
curl http://localhost:11434/api/generate -d '{
"model": "qwen3.5-coder:32b",
"prompt": "Write a Python function that finds all prime numbers up to n using the Sieve of Eratosthenes",
"stream": false
}'
# Chat completion (OpenAI-compatible endpoint)
curl http://localhost:11434/v1/chat/completions -d '{
"model": "qwen3.5-coder:32b",
"messages": [
{"role": "system", "content": "You are a helpful coding assistant."},
{"role": "user", "content": "Explain async/await in JavaScript"}
]
}'
Python example using the openai library:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama", # required but unused
)
response = client.chat.completions.create(
model="qwen3.5-coder:32b",
messages=[
{"role": "system", "content": "You are a senior developer."},
{"role": "user", "content": "Review this function for bugs"},
],
)
print(response.choices[0].message.content)
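For interactive use you usually want streamed output instead of waiting for the full reply. A sketch using the same openai client against Ollama's OpenAI-compatible endpoint; with stream=True the call yields chunks whose delta.content carries incremental text:

```python
def join_deltas(deltas):
    """Reassemble the full reply from streamed text deltas (None chunks skipped)."""
    return "".join(d for d in deltas if d)

if __name__ == "__main__":
    # Imported here so the helper above stays dependency-free.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
    stream = client.chat.completions.create(
        model="qwen3.5-coder:32b",
        messages=[{"role": "user", "content": "Explain Python generators briefly"}],
        stream=True,
    )
    deltas = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)  # tokens appear as they are generated
            deltas.append(delta)
    print()
    reply = join_deltas(deltas)
```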
Performance tips
Getting the best performance out of local models requires understanding a few key concepts.
Quantization
Models come in different quantization levels that trade quality for speed and memory usage. Ollama handles this automatically, but you can choose specific quantizations.
# Q4_K_M - default, good balance (recommended)
ollama run qwen3.5-coder:32b
# Q8_0 - higher quality, more memory
ollama run qwen3.5-coder:32b-q8_0
# Q2_K - smallest, fastest, lowest quality
ollama run qwen3.5-coder:32b-q2_k
| Quantization | Quality | Size (32B model) | Speed |
|---|---|---|---|
| Q2_K | Decent | ~12GB | Fastest |
| Q4_K_M | Very Good | ~18GB | Fast |
| Q5_K_M | Excellent | ~22GB | Medium |
| Q8_0 | Near-Original | ~34GB | Slow |
| FP16 | Original | ~64GB | Slowest |
For coding tasks, Q4_K_M is the sweet spot. Below Q4, you start seeing noticeable quality degradation in code generation. Q8_0 is worth it if you have the VRAM.
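The sizes in the table above follow from simple arithmetic: parameter count times bits per weight. A sketch, using assumed average bits-per-weight for each quantization (K-quants mix precisions internally, so these are approximations):

```python
# Back-of-the-envelope model size: parameters x bits-per-weight / 8.
# Bits-per-weight values are rough averages for common GGUF quantizations.
BITS_PER_WEIGHT = {
    "Q2_K": 3.0,
    "Q4_K_M": 4.5,
    "Q5_K_M": 5.5,
    "Q8_0": 8.5,
    "FP16": 16.0,
}

def model_size_gb(params_billions, quant):
    """Approximate on-disk size in GB for a given quantization level."""
    bytes_per_param = BITS_PER_WEIGHT[quant] / 8
    # 1 billion params at 1 byte each is roughly 1 GB
    return params_billions * bytes_per_param

for quant in BITS_PER_WEIGHT:
    print(f"32B at {quant}: ~{model_size_gb(32, quant):.0f} GB")
```

For a 32B model this reproduces the table: ~12 GB at Q2_K, ~18 GB at Q4_K_M, ~34 GB at Q8_0, ~64 GB at FP16.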
GPU vs CPU inference
GPU inference is dramatically faster than CPU inference. If you have a dedicated GPU, make sure Ollama is using it.
# Check if Ollama detects your GPU
ollama ps
# Force GPU layers (useful for partial offloading)
OLLAMA_NUM_GPU=999 ollama run llama4
Approximate speed comparison for a 14B model:
| Hardware | Tokens/second | Time for 500-token response |
|---|---|---|
| NVIDIA RTX 4090 | 80-100 t/s | ~5 seconds |
| NVIDIA RTX 4070 | 40-60 t/s | ~10 seconds |
| Apple M3 Max (GPU) | 30-50 t/s | ~12 seconds |
| Apple M2 Pro (GPU) | 20-35 t/s | ~18 seconds |
| CPU only (modern) | 5-10 t/s | ~60 seconds |
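You can measure your own hardware's throughput directly: Ollama's non-streaming /api/generate response includes eval_count (tokens generated) and eval_duration (in nanoseconds). A stdlib-only sketch; the model name is just an example, use any model you have pulled:

```python
def tokens_per_second(eval_count, eval_duration_ns):
    """Ollama reports generated tokens and generation time in nanoseconds."""
    return eval_count / (eval_duration_ns / 1e9)

if __name__ == "__main__":
    # Imported here so the helper above has no dependencies.
    import json
    from urllib.request import Request, urlopen

    payload = json.dumps({
        "model": "qwen3.5-coder:14b",  # any model you have pulled
        "prompt": "Write a haiku about compilers",
        "stream": False,
    }).encode()
    req = Request("http://localhost:11434/api/generate", data=payload,
                  headers={"Content-Type": "application/json"})
    resp = json.load(urlopen(req))
    print(f"{tokens_per_second(resp['eval_count'], resp['eval_duration']):.1f} tokens/s")
```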
Memory requirements
The golden rule: you need enough VRAM (or unified memory on Apple Silicon) to fit the entire model. If the model does not fit in VRAM, it spills to system RAM, which is 10-20x slower.
# Check current memory usage
ollama ps
# Set maximum VRAM usage
OLLAMA_MAX_VRAM=21474836480 ollama serve # 20 GiB limit (value is in bytes)
Apple Silicon users: You are in a good position. The unified memory architecture means your GPU can access all system RAM. A MacBook Pro with 36GB of unified memory can run 32B parameter models comfortably.
NVIDIA users: Your VRAM is the hard limit. A 24GB RTX 4090 fits most 32B quantized models. For 70B+ models, you need multi-GPU setups or significant CPU offloading.
Context length optimization
Longer context windows use more memory. If you are running tight on VRAM, reduce the context length.
# Default context length is 2048
# Increase for larger codebases
ollama run qwen3.5-coder:32b --num-ctx 8192
# Reduce to save memory
ollama run qwen3.5-coder:32b --num-ctx 1024
Running multiple models
Ollama can keep multiple models loaded in memory simultaneously. This is useful when you want a fast small model for autocomplete and a large model for complex tasks.
# Load two models at once
OLLAMA_MAX_LOADED_MODELS=2 ollama serve
Just be sure your system has enough total memory for both models.
Comparison: local vs cloud API
Neither local nor cloud is universally better. The right choice depends on your specific situation.
When local models win
- High-volume usage. If you send hundreds of requests per day, local inference is essentially free after hardware costs. Cloud APIs charge per token.
- Privacy requirements. Regulated industries, proprietary codebases, or personal preference for data sovereignty. Local means no third-party data processing.
- Offline workflows. Traveling, unreliable connections, or air-gapped environments.
- Latency-sensitive tasks. Tab autocomplete, inline suggestions, and real-time code generation benefit from zero network latency.
- Predictable costs. No surprise bills. The hardware cost is fixed regardless of usage.
When cloud APIs win
- Maximum capability. The largest cloud models (Claude, GPT-4.5, Gemini Ultra) are still significantly more capable than anything you can run locally. For complex multi-step reasoning, architectural decisions, or nuanced code review, cloud models have the edge.
- No hardware investment. You do not need an expensive GPU. A $20/month API subscription gives you access to frontier models.
- Always up to date. Cloud providers update models continuously. Local models require manual pulls and version management.
- Scale to zero. Pay only when you use it. If you have light, sporadic usage, cloud APIs are more cost-effective than dedicated hardware.
- Multi-modal capabilities. Cloud models increasingly support images, audio, and video inputs that local models cannot match.
The hybrid approach (recommended)
The best setup for most developers is a hybrid approach:
- Local model for autocomplete and quick tasks. Run a fast 7B model for tab completion, inline suggestions, and quick questions. This handles 80% of your daily AI interactions with zero latency and zero cost.
- Cloud API for complex tasks. Use Claude or GPT-4.5 for architectural decisions, complex refactoring, multi-file changes, and deep code review. These tasks benefit from the larger model's superior reasoning.
# Example hybrid setup
# Terminal 1: Ollama running locally for autocomplete
ollama serve
# Terminal 2: Use Claude Code for complex tasks (cloud)
claude
# Your editor: Continue.dev with Ollama for autocomplete,
# cloud model for chat
This gives you the best of both worlds: fast, free, private AI for routine tasks, and maximum capability when you need it.
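One way to wire the hybrid approach into your own scripts is a tiny per-prompt router. A sketch only: the cloud endpoint, model names, and keyword heuristic below are hypothetical placeholders, not a real routing policy.

```python
# Route quick tasks to local Ollama, heavier-sounding tasks to a cloud
# OpenAI-compatible endpoint. Endpoints and keywords are illustrative.
LOCAL = {"base_url": "http://localhost:11434/v1", "model": "qwen3.5-coder:7b"}
CLOUD = {"base_url": "https://api.example.com/v1", "model": "frontier-model"}  # hypothetical

HEAVY_HINTS = ("architecture", "refactor", "review", "design", "multi-file")

def pick_backend(prompt):
    """Send complex-sounding prompts to the cloud, everything else local."""
    lowered = prompt.lower()
    return CLOUD if any(hint in lowered for hint in HEAVY_HINTS) else LOCAL

if __name__ == "__main__":
    from openai import OpenAI  # the same client works against both endpoints

    prompt = "Rename this variable"
    backend = pick_backend(prompt)
    client = OpenAI(base_url=backend["base_url"], api_key="ollama")
    reply = client.chat.completions.create(
        model=backend["model"],
        messages=[{"role": "user", "content": prompt}],
    )
    print(reply.choices[0].message.content)
```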
Next steps
Now that you have Ollama running, here are some ways to go deeper:
- Explore the model library. Browse ollama.com/library for hundreds of available models.
- Create custom models. Write a Modelfile to create models with custom system prompts, parameters, and fine-tuning.
- Set up a team server. Run Ollama on a shared machine so your whole team can access local models over the network.
- Try different quantizations. Experiment with Q4 vs Q8 for your specific use case to find your quality-speed sweet spot.
# Example Modelfile for a custom coding assistant
cat > Modelfile << 'HEREDOC'
FROM qwen3.5-coder:32b
SYSTEM "You are a senior full-stack developer. You write clean, well-tested TypeScript and Python. Be concise. Show code, not explanations."
PARAMETER temperature 0.2
PARAMETER num_ctx 8192
HEREDOC
ollama create my-coder -f Modelfile
ollama run my-coder
Local AI is not a replacement for cloud models. It is a complement that fills a different niche: fast, private, free, and always available. Set it up once, and it becomes a natural part of your development workflow.