AtomiMed: Hierarchical Atomic Fact-Checking for Universal Clinical-Aware Medical Report Evaluation

Authors: Yuan Wang, Wanxing Chang, Songtao Jiang, Shujian Gao, Xiaotian Zhang, Ruifeng Yuan, Weiwei Cao, Bowen Shi, Ling Zhang, Zuozhu Liu, Jianpeng Zhang

arXiv ID: 2606.31292

Problem: Traditional n-gram overlap metrics for medical report generation fail to capture clinical factual accuracy, often overlooking catastrophic diagnostic errors like a missed pneumothorax or inverted laterality.

Key Methodology:

Decomposes medical reports into a hierarchical structure of Atomic Clinical Facts (ACFs) at two levels: Disease-level (binary presence/absence of findings) and Attribute-level (location, morphology, severity, size, quantity, temporal change).
Implements an Agentic Cross-Verification loop that questions each report (ground-truth and predicted) as evidence for the other's ACFs, in both directions, and computes separate precision, recall, and F1 for diagnostic detection (disease level) and descriptive accuracy (attribute level); the procedure simulates a multi-radiologist peer review.
Curates OmniMRG-Bench, the first multi-modal medical report generation (MRG) benchmark, spanning X-ray, CT, MRI, and Ultrasound with over 178K expert-verified QA pairs, and releases MRGEvalKit, an open-source evaluation toolkit.
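
The two-level scoring described above can be illustrated with a minimal sketch. It assumes ACFs have already been extracted from both reports, and it replaces the agentic verifier with a plain exact-match check; the names AtomicFact, is_supported, and score_level are hypothetical and are not taken from MRGEvalKit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AtomicFact:
    """One Atomic Clinical Fact (hypothetical representation)."""
    level: str      # "disease" or "attribute"
    finding: str    # e.g. "pneumothorax"
    attribute: str  # e.g. "presence", "location", "severity"
    value: str      # e.g. "present", "left", "small"


def is_supported(fact, evidence):
    # Stand-in for the verification step: exact match against the other
    # report's facts. The paper instead questions the report itself.
    return fact in evidence


def score_level(predicted, reference, level):
    """Return (precision, recall, F1) for one ACF level."""
    pred = [f for f in predicted if f.level == level]
    ref = [f for f in reference if f.level == level]
    pred_all, ref_all = set(predicted), set(reference)
    # Precision: predicted facts that the reference report supports.
    precision = sum(is_supported(f, ref_all) for f in pred) / len(pred) if pred else 0.0
    # Recall: reference facts that the predicted report supports.
    recall = sum(is_supported(f, pred_all) for f in ref) / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


reference = [
    AtomicFact("disease", "pneumothorax", "presence", "present"),
    AtomicFact("disease", "pleural effusion", "presence", "absent"),
    AtomicFact("attribute", "pneumothorax", "location", "left"),
    AtomicFact("attribute", "pneumothorax", "severity", "small"),
]
predicted = [
    AtomicFact("disease", "pneumothorax", "presence", "present"),
    AtomicFact("disease", "pleural effusion", "presence", "absent"),
    AtomicFact("attribute", "pneumothorax", "location", "right"),  # inverted laterality
    AtomicFact("attribute", "pneumothorax", "severity", "small"),
]

disease_scores = score_level(predicted, reference, "disease")
attribute_scores = score_level(predicted, reference, "attribute")
print(disease_scores)    # (1.0, 1.0, 1.0)
print(attribute_scores)  # (0.5, 0.5, 0.5)
```

In this toy case the predicted report detects every finding, so disease-level scores are perfect, while the laterality error halves the attribute-level scores. Keeping the two levels separate is what lets such an error stay visible instead of being averaged away.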

Key Results:

Achieves Spearman's ρ = 0.806 on ReXVal, outperforming GREEN (0.798) and all prior metrics in correlation with radiologist error counts.
On the modality-agnostic pairwise preference evaluation, reaches 95.71% accuracy (τ = 0.9807) on X-ray with an MAE of 0.0214, roughly an order of magnitude below GREEN's MAE of 0.1857 (GREEN ACC: 63.57%).
GREEN's correlation collapses outside chest X-ray (τ falls from 0.6481 on X-ray to 0.1513 on MRI), whereas AtomiMed sustains meaningful correlation across all modalities (ACC of 84.33% on CT and 49.86% on Ultrasound).
Granular analysis reveals that models score markedly higher on Morphology (6.0–13.2) than on Severity, where scores sit near the floor (0.80–5.9); by organ system, the respiratory system dominates while the digestive and urinary systems are severely underserved.
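
For readers unfamiliar with how such metric-versus-human agreement figures are typically computed, the following is a generic sketch of Spearman's ρ (Pearson correlation on average ranks) and pairwise preference accuracy. The data are invented toy values, the function names are hypothetical, and nothing here reproduces the paper's evaluation protocol. Note that a higher-is-better score correlates negatively with error counts; the sign convention behind the reported figures is not stated in this summary.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def pairwise_accuracy(scores_a, scores_b, human_prefers_a):
    """Fraction of report pairs where the metric agrees with the human choice."""
    agree = sum((a > b) == pref
                for a, b, pref in zip(scores_a, scores_b, human_prefers_a))
    return agree / len(human_prefers_a)


# Toy data: metric scores (higher is better) vs. radiologist error counts.
metric_scores = [0.9, 0.7, 0.4, 0.2]
error_counts = [0, 1, 3, 5]
rho = spearman_rho(metric_scores, error_counts)

# Toy data: the metric picks A, B, B; the human prefers A, B, A.
accuracy = pairwise_accuracy([0.8, 0.3, 0.6], [0.5, 0.6, 0.7], [True, False, True])
```

With these toy inputs the ranking is perfectly inverted, so ρ is -1, and the metric agrees with the human preference on two of the three pairs.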

Applied Context: AtomiMed gives builders a modality-agnostic, interpretable evaluation framework that surfaces exactly which clinical findings and attributes a model gets wrong, enabling targeted debugging of medical VLMs and supporting safer deployment across imaging modalities beyond chest X-ray.