Large language models - architecture, capabilities, and how to pick the right one for the job.
17 resources - 10 posts, 7 tools

How KV caching speeds up LLM inference - the math, the code, the memory tradeoffs, and when it stops helping. Every dev running local models hits this wall.

A hands-on developer guide to Mercury 2 from Inception Labs. OpenAI-compatible API, reasoning levels, tool use, structured outputs, and when a diffusion LLM beats an autoregressive one in real apps.

Promptlock gives every prompt a 12-char content-addressable id and a diff-able artifact, turning silent prompt drift into a reviewable change.

Inception Labs shipped the first reasoning model built on diffusion instead of autoregressive generation. Over 1,000 tokens per second, competitive benchmarks, and a fundamentally different approach to how AI generates text.

A comprehensive look at Claude Skills - modular, persistent task modules that shatter AI's memory constraints and enable progressive, composable, code-capable workflows for developers and organizations.

GPT-5 introduces a fundamentally different approach to inference. Instead of forcing developers to manually configure reasoning parameters, the model operates as a unified system with real-time rou...

Inception Labs launched Mercury, the first commercial-grade diffusion large language model. It generates over 1,000 tokens per second on standard Nvidia hardware by replacing autoregressive generation with a coarse-to-fine diffusion process.

Unstract is an open-source, no-code platform for extracting structured data from PDFs, invoices, scanned documents, and more. Here is how it works, how to set it up, and why automated document processing is becoming essential for organizations drowning in unstructured data.

Microsoft's Phi-4 is an MIT-licensed, 14-billion-parameter model that matches Llama 3.3 70B and Qwen 2.5 72B on key benchmarks. Here is what makes it special, how to run it locally, and why small language models are increasingly practical for real development work.

Meta surprised the AI community with Llama 3.3, a 70-billion-parameter model that delivers 405B-class performance at a fraction of the cost. Here is what the benchmarks show, where to run it, and why this release matters for developers building with open-source models.

The easiest way to run LLMs locally. One command to pull and run any model. OpenAI-compatible API. 52M+ monthly downloads. Supports GGUF, Safetensors, and custom Modelfiles.
Local AI

Desktop app for discovering, downloading, and running local LLMs. Clean chat UI, OpenAI-compatible API server, and automatic GPU detection. MLX engine optimized for Apple Silicon.
Local AI

Open-source ChatGPT alternative that runs 100% offline. Desktop app with local models, cloud API connections, custom assistants, and MCP integration. AGPLv3 licensed.
Local AI

Private local AI chatbot by Nomic. 250K+ monthly users, 65K GitHub stars. LocalDocs feature lets you chat with your own files. Runs on Windows, macOS, and Linux.
Local AI

Open-source OpenAI API replacement. Runs LLMs, vision, voice, image, and video models on any hardware - no GPU required. 35+ backends. Distributed mode for scaling.
Local AI

High-throughput inference server for LLMs. PagedAttention memory management. The go-to for serious local or self-hosted serving.
Local AI

Self-healing browser automation harness that lets LLMs complete any browser task. 5,000+ stars in under a week.
Infrastructure