llama.cpp
C++ inference engine for LLMs. GGUF format, quantization, CPU and Metal/CUDA support. The foundation most local tools build on.
llama.cpp is the C++ inference library from Georgi Gerganov that most of the local-LLM ecosystem is built on. Ollama, LM Studio, Jan, and GPT4All all ship a wrapped or forked llama.cpp at their core. Using it directly gives you the lowest overhead and the most control: quantization down to 2-bit, Metal acceleration on Apple Silicon, CUDA on NVIDIA, Vulkan on AMD, and first-class support for the GGUF weight format that the entire local ecosystem has standardized on. For developers who want to embed local inference in a product rather than wrap an existing tool, llama.cpp is the layer to integrate against. The bundled llama-server binary exposes an OpenAI-compatible HTTP endpoint so any existing tool can swap to a local model with one config change.
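Because llama-server speaks the OpenAI wire format, pointing an existing client at a local model really is a one-line base-URL change. Here is a minimal sketch using only the Python standard library, assuming a llama-server instance already running on its default port 8080 (e.g. `llama-server -m model.gguf`); the `"model"` field value is a placeholder, since the server serves whichever GGUF file it was launched with:

```python
import json
import urllib.request

# Assumed base URL: llama-server listens on port 8080 by default.
BASE_URL = "http://localhost:8080"

def build_chat_request(prompt: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a chat-completion request in the OpenAI wire format."""
    payload = {
        # llama-server serves the model it was launched with, so this
        # field is largely informational for a single-model server.
        "model": "local",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("Explain GGUF in one sentence."))
```

Any tool that already targets the OpenAI API can be repointed the same way, by overriding its base URL with `http://localhost:8080/v1`.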
Similar Tools
MLX
Apple's array framework for machine learning on Apple Silicon. Native Metal support, unified memory, first-class LLM inference.
vLLM
High-throughput inference server for LLMs. PagedAttention memory management. The go-to for serious local or self-hosted serving.
Ollama
The easiest way to run LLMs locally. One command to pull and run any model. OpenAI-compatible API. 52M+ monthly downloads. Supports GGUF, Safetensors, and custom Modelfiles.
LM Studio
Desktop app for discovering, downloading, and running local LLMs. Clean chat UI, OpenAI-compatible API server, and automatic GPU detection. MLX engine optimized for Apple Silicon.
More Local AI Tools
Jan
Open-source ChatGPT alternative that runs 100% offline. Desktop app with local models, cloud API connections, custom assistants, and MCP integration. AGPLv3 licensed.
