AI model benchmarks - what the scores actually mean and how they translate to real-world coding.
5 resources - 5 posts

GPT-5.5 and 5.5 Pro hit the API on April 24. Here is what changes for builders: pricing, agentic tasks, tool-use, and the real benchmarks I ran the day it dropped.

Same prompt, different models, live comparison. Here is what I learned testing Cursor Composer 2, Kimi, Droid, and MiniMax on 10 real web development tasks.

Anthropic's Sonnet 4.6 narrows the gap to Opus on agentic tasks, leads computer use benchmarks, and ships with a beta million-token context window. Here's what actually changed.

xAI has launched Grok 4, claiming the title of the world's most powerful AI model. With a $300/month Super Grok tier, saturated AMI benchmarks, and a coding model on the horizon, this is xAI's bigge...

xAI launched Grok 3 with 200,000 GPUs, outperforming GPT-4o, Sonnet 3.5, and DeepSeek R1 on reasoning benchmarks. Here is what the hardware, the benchmarks, and the new features actually mean for developers.

New tutorials, open-source projects, and deep dives on coding agents - delivered weekly.
Explore 351 topics
Browse All Topics