AgenticDataBench: A Comprehensive Benchmark for Data Agents

Authors: Zhaoyan Sun, Shan Zhong, Daizhou Wen, Jiaxing Han, Guoliang Li, Ying Yan, Peng Zhang, Yu Su, Xiang Qi, Baolin Sun, Chengyuan Yang, Tao Fang, Huaiyu Ruan

arXiv ID: 2607.01647

Problem: No comprehensive benchmark exists that rigorously evaluates LLM-based data agents across diverse, realistic data science scenarios at a fine-grained (per-skill) level.

Key Methodology:

Constructed 344 tasks across 97 datasets (27.3 GB) spanning 15 vertical domains, including 5 real-world B2B use cases from a fintech company
Extracted 433 reusable data science skills from large-scale Stack Overflow solutions using skill-aligned hierarchical clustering, then used a systematic LLM-based approach to generate synthetic tasks that cover the domains without redundancy
Evaluated SOTA data agents with fine-grained ground-truth labels, measuring both overall task success and per-skill performance
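
The two measurements described above (overall task success and per-skill performance) can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual scoring code: the record layout (`task_id`, `passed`, `skills`) and the function name `score` are assumptions made for the example, and the sample records are invented.

```python
from collections import defaultdict


def score(results):
    """Compute (task_success_rate, per_skill_pass_rate).

    `results` is a list of dicts in a hypothetical layout:
      {"task_id": str, "passed": bool, "skills": {skill_name: bool}}
    where each entry in "skills" is a fine-grained label saying whether
    the agent handled that skill correctly within the task.
    """
    task_success = sum(r["passed"] for r in results) / len(results)

    attempts = defaultdict(int)
    passes = defaultdict(int)
    for r in results:
        for skill, ok in r["skills"].items():
            attempts[skill] += 1
            passes[skill] += int(ok)

    per_skill = {s: passes[s] / attempts[s] for s in attempts}
    return task_success, per_skill


# Illustrative records only; these are NOT results from the benchmark.
results = [
    {"task_id": "t1", "passed": True,
     "skills": {"type_conversion": True, "groupby": True}},
    {"task_id": "t2", "passed": False,
     "skills": {"type_conversion": False, "groupby": True}},
    {"task_id": "t3", "passed": False,
     "skills": {"type_conversion": False, "join": True}},
    {"task_id": "t4", "passed": True,
     "skills": {"join": True}},
]

task_success, per_skill = score(results)
print(task_success)
print(per_skill)
```

Keeping the per-skill counts separate from the task-level flag is what lets a weak skill surface even when the aggregate task score looks acceptable.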

Key Results:

Even the best data agents (GPT-4o) achieve only ~60-80% task-level accuracy depending on the domain, and performance varies significantly by skill: some fundamental skills, such as data validation and type conversion, show pass rates below 40% even in top models
The number of skills a task requires correlates strongly with failure: task success drops sharply beyond 5-7 skills
Cost-performance Pareto analysis shows open-source agents (e.g. DeepSeek-Coder, Qwen2.5-Coder) can match GPT-4o on simpler tasks but fall off rapidly on complex multi-skill workflows
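
For readers unfamiliar with cost-performance Pareto analysis, the idea can be sketched as below. This is a generic Pareto-front computation under assumed inputs, not the authors' analysis: the agent names, costs, and accuracies are placeholders invented for the example, and `pareto_front` is a hypothetical helper.

```python
def pareto_front(points):
    """Return names of points not dominated by any other point.

    `points` is a list of (name, cost, accuracy). A point is dominated
    if another point is no more expensive AND no less accurate, and is
    strictly better on at least one of the two.
    """
    front = []
    for name, cost, acc in points:
        dominated = any(
            (c2 <= cost and a2 >= acc) and (c2 < cost or a2 > acc)
            for _, c2, a2 in points
        )
        if not dominated:
            front.append(name)
    return front


# Placeholder values for illustration; NOT figures from the paper.
agents = [
    ("agent_a", 1.0, 0.50),  # cheap, moderate accuracy
    ("agent_b", 5.0, 0.70),  # expensive, highest accuracy
    ("agent_c", 6.0, 0.65),  # costs more than agent_b and is less accurate
    ("agent_d", 0.5, 0.30),  # cheapest, lowest accuracy
]

front = pareto_front(agents)
print(front)
```

Running the same computation separately on simple and on multi-skill tasks is one way to see whether a cheaper agent stays on the front as task complexity grows.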

Applied Context: Benchmarks are your canary. Use AgenticDataBench to test your data pipeline agents before deploying them: a model that fumbles type conversion (<40% pass rate) can silently corrupt production ETL. Invest in testing your agent on tasks that combine 5-7 or more skills, where most breakage occurs.
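
One way to act on this advice is a pre-deployment gate over benchmark results, sketched below. Everything here is an assumption for illustration: the function `gate`, its input layouts, and the thresholds are hypothetical. The 0.4 skill threshold mirrors the <40% figure above, while the 0.6 threshold for complex tasks is an arbitrary placeholder you would tune yourself.

```python
def gate(per_skill, by_skill_count,
         min_skill_rate=0.4, min_complex_rate=0.6, complex_from=5):
    """Return a list of reasons to block deployment (empty list = pass).

    per_skill:       {skill_name: pass_rate in [0, 1]}
    by_skill_count:  {number_of_skills: (tasks_passed, tasks_total)}
    Thresholds are illustrative defaults, not values from the benchmark.
    """
    reasons = []

    # Flag any individual skill whose pass rate is below the floor.
    for skill, rate in sorted(per_skill.items()):
        if rate < min_skill_rate:
            reasons.append(
                f"skill '{skill}' pass rate {rate:.2f} < {min_skill_rate:.2f}"
            )

    # Pool all tasks needing `complex_from` or more skills and check them.
    complex_buckets = [
        (passed, total)
        for n_skills, (passed, total) in by_skill_count.items()
        if n_skills >= complex_from
    ]
    total = sum(t for _, t in complex_buckets)
    if total:
        rate = sum(p for p, _ in complex_buckets) / total
        if rate < min_complex_rate:
            reasons.append(
                f"tasks with {complex_from}+ skills pass rate "
                f"{rate:.2f} < {min_complex_rate:.2f}"
            )
    return reasons


# Invented numbers for illustration only.
per_skill = {"type_conversion": 0.35, "join": 0.90}
by_skill_count = {2: (18, 20), 5: (4, 10), 7: (1, 5)}

reasons = gate(per_skill, by_skill_count)
for r in reasons:
    print("BLOCK:", r)
```

Returning reasons rather than a bare boolean makes the gate easier to wire into CI, since the failing skill or complexity bucket appears directly in the log.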