llama.cpp
C++ inference engine for LLMs. GGUF format, quantization, CPU and Metal/CUDA support. The foundation most local tools build on.
llama.cpp is the C++ inference library from Georgi Gerganov that most of the local-LLM ecosystem is built on. Ollama, LM Studio, Jan, and GPT4All all ship a wrapped or forked llama.cpp at their core. Using it directly gives you the lowest overhead and the most control: quantization down to 2-bit, Metal acceleration on Apple Silicon, CUDA on NVIDIA, Vulkan on AMD, and first-class support for the GGUF weight format that the entire local ecosystem has standardized on. For developers who want to embed local inference in a product rather than wrap an existing tool, llama.cpp is the layer to integrate against. The bundled llama-server binary exposes an OpenAI-compatible HTTP endpoint so any existing tool can swap to a local model with one config change.
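Because llama-server speaks the OpenAI wire format, pointing an existing client at a local model really is a one-line base-URL change. Here is a minimal sketch using only the Python standard library, assuming a llama-server instance already running on its default port 8080 (e.g. `llama-server -m model.gguf`); the `"model"` field value is a placeholder, since the server serves whichever GGUF file it was launched with:

```python
import json
import urllib.request

# Assumed base URL: llama-server listens on port 8080 by default.
BASE_URL = "http://localhost:8080"

def build_chat_request(prompt: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a chat-completion request in the OpenAI wire format."""
    payload = {
        # llama-server serves the model it was launched with, so this
        # field is largely informational for a single-model server.
        "model": "local",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("Explain GGUF in one sentence."))
```

Any tool that already targets the OpenAI API can be repointed the same way, by overriding its base URL with `http://localhost:8080/v1`.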
Similar Tools
MLX
Apple's array framework for machine learning on Apple Silicon. Native Metal support, unified memory, first-class LLM inference.
vLLM
High-throughput inference server for LLMs. PagedAttention memory management. The go-to for serious local or self-hosted serving.
Ollama
The easiest way to run LLMs locally. One command to pull and run any model. OpenAI-compatible API. 52M+ monthly downloads. Supports GGUF, Safetensors, and custom Modelfiles.
LM Studio
Desktop app for discovering, downloading, and running local LLMs. Clean chat UI, OpenAI-compatible API server, and automatic GPU detection. MLX engine optimized for Apple Silicon.
More Local AI Tools
Jan
Open-source ChatGPT alternative that runs 100% offline. Desktop app with local models, cloud API connections, custom assistants, and MCP integration. AGPLv3 licensed.
