Briefing · Monday, June 29, 2026

GLM 5.2 Beats Claude on Security, MRI Second Opinions, and the EU Chat Control Backroom

Good morning. It's Monday, June 29, and we're covering an open-weight model beating Claude on security benchmarks, a developer's AI radiology experiment that drew actual doctors into the comments, and EU legislators negotiating Chat Control behind closed doors.

The GLM 5.2 thread hit 860 points by morning - open weights are having a moment.

In today's brief:

GLM 5.2 scores 39% F1 on IDOR detection versus Claude Code's 32% - at roughly one-sixth the cost per finding
A 266MB MRI upload to Claude Code produces a diagnosis that contradicts the human radiologist
EU lawmakers negotiate Chat Control through informal backroom channels
A still-open Codex issue highlights agent file access blind spots

THE BIG ONE

GLM 5.2 Outperforms Claude on Semgrep's IDOR Benchmarks

Semgrep's security research team published benchmark results that caught HN's attention: the Chinese open-weight model GLM 5.2 beat Claude Code on IDOR (Insecure Direct Object Reference) vulnerability detection - and did it at roughly one-sixth the cost per finding.

The numbers: GLM 5.2 scored 39% F1 versus Claude Code's 32%, with no scaffolding or multi-agent system. Just a prompt and a model. Semgrep's own engineered multimodal pipeline still wins at 61%, but the comparison shows what raw model capability looks like versus assembled systems.
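For readers who don't work with F1 daily: it is the harmonic mean of precision (how many flagged findings were real) and recall (how many real vulnerabilities were flagged). Here is a minimal sketch of the arithmetic. The function name and the counts in the example are illustrative assumptions, not Semgrep's data or scoring code.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return F1, the harmonic mean of precision and recall."""
    flagged = true_positives + false_positives      # everything the model reported
    actual = true_positives + false_negatives       # everything that was really there
    precision = true_positives / flagged if flagged else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts chosen only to illustrate the formula:
# 39 real findings flagged, 61 false alarms, 61 real vulnerabilities missed.
example = f1_score(true_positives=39, false_positives=61, false_negatives=61)
print(f"{example:.2f}")
```

Because F1 punishes both false alarms and misses, one score can come from very different mixes of the two, which is worth keeping in mind when comparing the 39%, 32%, and 61% figures.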

The HN thread (860 points, 399 comments) split predictably. Critics called it marketing for a narrow benchmark. Open-weight advocates countered that GLM 5.2 is available today, unrestricted, while Mythos-class models face regulatory uncertainty in the EU.

Why it matters: Security benchmarks are becoming the proving ground where open-weight models demonstrate parity - or superiority - on tasks that matter to enterprise buyers.

Our coverage: GLM 5.2 Outperforms Claude Code on Semgrep's IDOR Vulnerability Benchmarks

MEDICAL AI

Claude Code Gave a Second Opinion on an MRI - And Disagreed With the Doctor

A developer named Antoine published an experiment that landed at 436 points on HN: he fed 266MB of DICOM MRI data from his right shoulder into Claude Code Opus to get a second opinion on his orthopedist's diagnosis.

The human radiologist reported a Grade III partial-thickness tear - a significant rotator cuff injury. Claude Code reported an "intact tendon." When asked to arbitrate between the two readings, the AI concluded with "moderate-to-high confidence" that its own assessment was correct.

The HN discussion (575 comments) brought actual radiologists into the thread. They pushed back hard, noting that MRI is a 3D medium where slicing incorrectly can miss features entirely. Others pointed out that Claude specifically underperforms on image understanding compared to other frontier models.

Why it matters: The experiment surfaces the real question for AI medical imaging: not whether models can read scans, but whether patients and providers can calibrate trust appropriately when the AI disagrees with the specialist.

Our coverage: Using Claude Code for a Second Opinion on MRI Scans

POLICY

EU Chat Control Negotiations Move Behind Closed Doors

Patrick Breyer published details on EU lawmakers' latest Chat Control negotiations, which have shifted to informal backroom channels. The proposal would mandate client-side scanning of encrypted messages before they're sent - effectively breaking end-to-end encryption for compliance purposes.

The HN thread hit 668 points, 380 comments. The concern: decisions about private messaging are being made through processes that bypass normal legislative scrutiny.

Relatedly, a separate thread on the KIDS Act (494 points) and an essay on age verification as speech attribution (495 points) both examined how identity verification requirements create infrastructure for tracking what people say online.

Why it matters: Whether framed as child safety or encryption access, these policy moves shape what developers can build with private messaging - and what tradeoffs platforms will face.

TOOLS WORTH A LOOK

Herdr - Agent multiplexer that lives in your terminal. Routes prompts to multiple AI backends, manages context across sessions. OSS, 69 points on HN.
Librepods - Open firmware that liberates AirPods from Apple's ecosystem. Enables third-party app control and removes artificial pairing restrictions. OSS, 398 points on HN.
HackerRank's Open-Source ATS - Resume scoring released as open source, though one user's score varied from 74 to 90 across runs. OSS, 457 points on HN.

WHAT ELSE IS HAPPENING

Codex file exclusion issue still open: The most-commented security concern in the Codex repo - no way to exclude sensitive files from agent context - remains unresolved. 208 points, 132 comments.
TOP500 has a new number 1: ISC'26 crowned a new supercomputer at the top of the list. 111 points, 69 comments.
Tokenmaxxing is dead, long live tokenmaxxing: An essay on context window strategy now that frontier models have moved past simple length competition. 153 points, 206 comments.
Memory prices 1960-2026: Stanford's historical dataset, updated through the current quarter. 304 points, 111 comments.
AI fraud at Brown: A professor publicly denounced mass AI cheating on an exam. 411 points, 545 comments.

FROM THE SITE

What We Published Yesterday

We covered both lead HN stories: the GLM 5.2 benchmark breakdown with full scoring methodology and cost analysis, and the Claude Code MRI experiment with radiologist commentary from the thread.

Every link above goes to a primary source or our sourced coverage. Tomorrow's brief lands when the news does - subscribe to get it by email.
