
11 min read
How KV caching speeds up LLM inference: the math, the code, the memory tradeoffs, and when it stops helping. Every dev running local models hits this wall.
2 articles
Google released DiffusionGemma today, a 26B MoE open model that generates entire 256-token blocks in parallel instead of one token at a time. Here is what that means for latency, local inference, and the post-autoregressive landscape.
