BENCHMARKS

11 articles

LATEST

Long-Horizon Terminal Bench Shows Why Coding Agents Still Stall

Long-Horizon-Terminal-Bench tests coding agents on 46 terminal tasks that can run for 90 minutes. The takeaway is not that agents are useless. It is that evals need to measure endurance, recovery, and partial progress.

July 14, 2026 • 9 min read

New • 5 min read

Apple SpeechAnalyzer vs Whisper: Independent Benchmark Shows Apple Winning on Accuracy

New benchmarks on 5,559 test utterances show Apple's iOS 26 SpeechAnalyzer API achieving 2.12% word error rate - beating all Whisper model sizes while running 3x faster.

AI Apple Speech Recognition Benchmarks News Hacker News

8 min read

Cursor Composer 2.5 Developer Guide 2026

Cursor shipped Composer 2.5 in May 2026 - a 1T-parameter agentic coding model that matches Opus 4.7 and GPT-5.5 on benchmarks at roughly one-tenth the cost. Here is everything you need to know to use it effectively.

cursor ai-coding-tools agentic-coding developer-guide benchmarks

6 min read

GPT-5.5 Has a 3x Higher Hallucination Rate Than MIT-Licensed GLM-5.2

New benchmark data shows GPT-5.5 hallucinates 86% of the time when it does not know the answer - versus 28% for the open-weights GLM-5.2. The numbers challenge the assumption that bigger models equal more reliable output.

News Hacker News LLMs GPT Benchmarks Open Weights

7 min read

Claude Fable 5 vs GPT-5.5: Benchmarks, Pricing, and When Each Wins

Fable 5 launched June 9 at 2x GPT-5.5's price with a 22-point SWE-Bench Pro gap. Here is the decision framework for choosing between them.

Claude GPT-5.5 AI Coding Benchmarks Comparison

8 min read

FrontierCode Benchmark Explained: Why AI Coding Quality Scores Are Wrong (And the Fix)

SWE-Bench has an 81% false-positive problem. FrontierCode replaces it with mergeability as the metric - and the scores are sobering for every AI coding tool on the market.

Benchmarks AI Coding Code Quality Claude GPT-5

11 min read

GPT-5.5 for Developers: A Production Field Guide

GPT-5.5 and 5.5 Pro hit the API on April 24. Here is what changes for builders: pricing, agentic tasks, tool use, and the real benchmarks I ran the day it dropped.

OpenAI GPT-5.5 Agents Production Benchmarks

8 min read

Web Dev Arena: How to Test AI Coding Models on Real Frontend Work

Benchmarks are useful, but frontend work fails in places leaderboards barely measure. Here is how Web Dev Arena turns AI model comparison into a practical UI evaluation workflow.

AI Coding Benchmarks Cursor Model Comparison

6 min read

Claude Sonnet 4.6: Approaching Opus at Half the Cost

Anthropic's Sonnet 4.6 narrows the gap to Opus on agentic tasks, leads computer-use benchmarks, and ships with a beta million-token context window. Here's what actually changed.

Claude Sonnet AI Anthropic Benchmarks

7 min read

Grok 4: xAI's Most Powerful AI Model

xAI has launched Grok 4, claiming the title of the world's most powerful AI model. With a $300/month Super Grok tier, saturated AMI benchmarks, and a coding model on the horizon, this is xAI's bigge...

Grok xAI AI Models Benchmarks

9 min read

xAI Grok 3 Launch: The Smartest AI on Earth?

xAI launched Grok 3 with 200,000 GPUs, outperforming GPT-4o, Sonnet 3.5, and DeepSeek R1 on reasoning benchmarks. Here is what the hardware, the benchmarks, and the new features actually mean for developers.

xAI Grok AI Models Benchmarks

Keep exploring Benchmarks

- Benchmarks Topic Hub - tools and guides for Benchmarks from the Developers Digest directory
- Glossary - dive deeper across the Developers Digest knowledge base
- Developers Digest on YouTube - video tutorials covering Benchmarks and more
