LLM / Agent / Code
AgenticDataBench: A Comprehensive Benchmark for Data Agents
Authors: Zhaoyan Sun, Shan Zhong, Daizhou Wen, Jiaxing Han, Guoliang Li, Ying Yan, Peng Zhang, Yu Su, Xiang Qi, Baolin Sun, Chengyuan Yang, Tao Fang, Huaiyu Ruan
arXiv ID: 2607.01647
Problem: No comprehensive benchmark exists for rigorously evaluating LLM-based data agents across diverse, realistic data science scenarios at fine granularity.
Key Methodology:
- Constructed 344 tasks across 97 datasets (27.3 GB) spanning 15 vertical domains, including 5 real-world B2B use cases from a fintech company
- Extracted 433 reusable data science skills from large-scale Stack Overflow solutions using skill-aligned hierarchical clustering, then generated synthetic tasks via a systematic LLM-based approach to ensure domain coverage without redundancy
- Evaluated SOTA data agents with fine-grained ground-truth labels, measuring both overall task success and per-skill performance
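To make the two evaluation levels concrete, here is a minimal sketch of how task-level accuracy and per-skill pass rates could be computed from labelled results. The record layout (`success`, `skills`) and the toy values are assumptions for illustration, not the benchmark's actual schema or data.

```python
from collections import defaultdict

# Hypothetical result records (not the benchmark's real format):
# one per task, with an overall outcome and a pass/fail label per skill.
results = [
    {"task": "t1", "success": True,  "skills": {"type_conversion": True,  "join": True}},
    {"task": "t2", "success": False, "skills": {"type_conversion": False, "join": True}},
    {"task": "t3", "success": False, "skills": {"data_validation": False}},
]

def task_accuracy(results):
    """Fraction of tasks marked as solved end to end."""
    return sum(r["success"] for r in results) / len(results)

def per_skill_pass_rate(results):
    """Pass rate of each skill over all tasks that exercise it."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        for skill, ok in r["skills"].items():
            total[skill] += 1
            passed[skill] += int(ok)
    return {skill: passed[skill] / total[skill] for skill in total}

accuracy = task_accuracy(results)
skill_rates = per_skill_pass_rate(results)
```

Keeping the two metrics separate matters because an agent can fail a task while passing most of its skills, and the per-skill view shows which one broke.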
Key Results:
- Even the best data agents (GPT-4o) reach only ~60-80% task-level accuracy, depending on the domain, and skill-level performance varies widely: some fundamental skills, such as data validation and type conversion, show <40% pass rates even in top models
- A strong positive correlation exists between the number of skills a task requires and model failures, with task success dropping sharply beyond 5-7 skills
- Cost-performance Pareto analysis shows open-source agents (e.g. DeepSeek-Coder, Qwen2.5-Coder) can match GPT-4o on simpler tasks but fall off rapidly on complex multi-skill workflows
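The relationship between skill count and failure can be checked on your own evaluation logs with a simple grouping. The sketch below assumes each record is a `(number_of_skills, success)` pair; the sample values are made up to show the shape of the computation and are not results from the benchmark.

```python
from collections import defaultdict

def success_by_skill_count(records):
    """Map number of required skills -> task success rate."""
    buckets = defaultdict(list)
    for n_skills, success in records:
        buckets[n_skills].append(success)
    # Sort by skill count so any drop-off is easy to read.
    return {n: sum(v) / len(v) for n, v in sorted(buckets.items())}

# Invented toy data: (skills required by the task, did the agent succeed)
toy_records = [(2, True), (2, True), (6, True), (6, False), (8, False)]
rates = success_by_skill_count(toy_records)
```

Plotting or printing `rates` in order of skill count shows where success starts to fall for a given agent.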
Applied Context: Benchmarks are your canary. Use AgenticDataBench to evaluate your data pipeline agents before deploying them: a model that fumbles type conversion (<40% pass rate) can silently corrupt production ETL. Invest in testing your agent on tasks that combine 5-7 or more skills, since that is where task success drops sharply.
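One way to act on this is a pre-deployment gate that blocks an agent when any skill your pipeline depends on falls below a chosen pass rate. This is a sketch under stated assumptions: the function name, the 0.8 threshold, and the skill names are hypothetical, and treating an unmeasured skill as failing is a deliberately conservative choice.

```python
def deployment_blockers(skill_rates, required_skills, min_rate=0.8):
    """Return required skills whose pass rate is below min_rate.

    Skills with no measurement count as 0.0, so they block deployment too.
    The 0.8 default is an illustrative threshold, not a recommended value.
    """
    return sorted(
        skill for skill in required_skills
        if skill_rates.get(skill, 0.0) < min_rate
    )

# Hypothetical per-skill pass rates from an evaluation run.
measured = {"type_conversion": 0.35, "join": 0.9}
blockers = deployment_blockers(measured, ["type_conversion", "join", "dedup"])
```

An empty list means every required skill cleared the threshold. Anything else names the skills to fix or guard with explicit validation before the agent touches production data.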