INFERENCE

10 items

3 posts, 7 tools

BlogJun 23, 2026

Cerebras Stock Is a Public Test of AI Inference Demand

Google Trends put CBRS stock on the board after Cerebras' first public-company earnings. The developer takeaway is not a trade. It is that AI inference demand is now being priced, questioned, and audited in public.

AI Infrastructure Cerebras Markets AI Chips Inference

BlogJun 10, 2026

DiffusionGemma: Google Bets Diffusion Can Make Text Generation 4x Faster

Google released DiffusionGemma today, a 26B MoE open model that generates entire 256-token blocks in parallel instead of one token at a time. Here is what that means for latency, local inference, and the post-autoregressive landscape.

ai open-source inference local-ai google diffusion-models

BlogApr 29, 2026

KV Caching: A Practical Guide to Optimizing Transformer Inference

How KV caching speeds up LLM inference - the math, the code, the memory tradeoffs, and when it stops helping. Every dev running local models hits this wall.

LLM Inference Optimization Hugging Face Local Models

ToolApr 23, 2026

llama.cpp

C++ inference engine for LLMs. GGUF format, quantization, CPU and Metal/CUDA support. The foundation most local tools build on.

local inference cpp gguf quantization metal cuda

ToolApr 23, 2026

vLLM

High-throughput inference server for LLMs. PagedAttention memory management. The go-to for serious local or self-hosted serving.

inference server throughput self-hosted gpu pagedattention

ToolApr 23, 2026

MLX

Apple's array framework for machine learning on Apple Silicon. Native Metal support, unified memory, first-class LLM inference.

apple-silicon metal local inference apple m-series

ToolApr 9, 2026

Replicate

Run 50,000+ ML models with a simple API. No infrastructure management. Pay-per-second billing. Deploy custom models with Cog. Popular for image generation and audio.

infrastructure api models gpu inference image-generation

ToolApr 9, 2026

Together AI

Fastest inference for open-source models. 200+ models via unified API. Ranks #1 on speed benchmarks for DeepSeek, Qwen, Kimi, and Llama. Serverless pay-per-token pricing.

infrastructure inference api open-source-models gpu fast

ToolApr 9, 2026

Groq

LPU-powered inference delivering 500-1,000+ tokens/sec. Purpose-built chip with on-chip SRAM instead of HBM. 5-10x faster than GPU providers. Free tier available.

infrastructure inference lpu fast hardware api

ToolApr 9, 2026

Cerebras

Wafer-scale AI inference at 3,000+ tokens/sec. The WSE-3 chip has 4 trillion transistors and 900K AI cores. 20x faster than GPU providers. OpenAI partnership for inference.

infrastructure inference wafer-scale hardware fast api

Browse All Tags

INFERENCE

Get Smarter About AI Dev

INFERENCE

Get Smarter About AI Dev