AgenticDataBench: A Comprehensive Benchmark for Data Agents
There is no comprehensive benchmark to rigorously evaluate LLM-based data agents across diverse, realistic data science scenarios with fine-grained granularity.
Research Briefs
Summaries of trending research from Hugging Face Daily Papers, written for builders. 98 papers from July 2026.
22 papers
There is no comprehensive benchmark to rigorously evaluate LLM-based data agents across diverse, realistic data science scenarios with fine-grained granularity.
Standard LLM agents that append all past context to every prompt produce a jumbled memory that makes it impossible to isolate the effect of any single memory component on long-horizon decisions.
Do readers experience literary machine translations (MT) as immersively as human translations (HT), and can automatic metrics reliably capture this preference?
Popular performance-optimization benchmarks for coding agents conflate runtime instability, scoring-rule artifacts, and saturation effects, making leaderboard scores unreliable indicators of true coding-agent progress.
Traditional n-gram overlap metrics for medical report generation fail to capture clinical factual accuracy, often overlooking catastrophic diagnostic errors like a missed pneumothorax or inverted laterality.
Current AI-driven scientific discovery systems are constrained to predefined search spaces or require externally supplied research questions, preventing true open-ended inquiry where a system autonomously explores raw multimodal data without guidance.
Training language models remains a human-intensive process because autonomous post-training requires an LM agent to plan iterations, construct benchmark-aligned data, run stable training jobs, evaluate checkpoints, and preserve experiment state, a long-horizon task that underspecified CLI environments fail to support.
Static AI-generated biomedical reports are insufficient for research decision-making because researchers cannot inspect evidence, assess uncertainty, compare mechanisms, or refine hypotheses.
Benchmark pass rates for coding agents can be near-perfect even when the agent failed to build the requested artifact, because the agent satisfies the test oracle by inlining behavior into a throwaway demo instead of the reusable library.
40 papers
DOPD introduces advantage-aware dual distillation that dynamically routes token-level supervision between teacher and student policies based on the privilege advantage gap, solving the 'privilege illusion' failure mode in on-policy distillation.
BlockPilot reveals that fixed block-size strategies in diffusion-based speculative decoding are fundamentally suboptimal, and proposes a lightweight instance-adaptive policy that predicts the optimal block size from the prefilling representation, achieving 4.20× speedup on Qwen3-4B.
GEAR trains a VQ tokenizer and autoregressive generator jointly end-to-end via a dual hard/soft codebook readout, achieving up to 10× faster gFID convergence on ImageNet by shifting representation alignment from the tokenizer to the AR model.
Data augmentations inherited from natural image tasks can disrupt the fine-grained vascular topology and textures critical for identity discrimination in vein recognition.
Traditional robot programming is difficult due to the need to orchestrate multimodal perception, manage contact dynamics, and handle diverse configurations and failures, with no existing system that autonomously writes, refines, and transfers reusable control programs across tasks and embodiments.
Existing diffusion-based speculative decoding methods use a fixed block size for all inputs, which is suboptimal since the optimal block size varies across samples.
Existing brain encoding and decoding models treat the two as separate tasks and rely on unimodal alignment, ignoring the brain's intrinsically multimodal integration.
Does injecting denser (token-level) on-policy self-distillation signals improve continual post-training for foundation models, or can it actually hurt?
Can a discrete diffusion language model match or exceed autoregressive models on medical report generation while offering capabilities autoregressive models lack?
71 papers
Orca introduces a general world foundation model that learns a unified latent space from multimodal signals via Next-State-Prediction, then freezes its backbone and attaches lightweight readout decoders for text, image, and action, outperforming specialized baselines across all three modalities.
Existing World Action Models fail at mobile manipulation due to three structural misalignments: coarse video prediction doesn't match fine-grained control, entangled navigation/manipulation action spaces cause gradient interference, and training on ground-truth futures doesn't generalize to the model's own noisy rollouts at inference time.
Data augmentations inherited from natural image tasks can disrupt the fine-grained vascular topology and textures critical for identity discrimination in vein recognition.
Current video grounding benchmarks are confined to general daily-life domains and zero-shot evaluation, creating a critical disconnect from real-world specialized-domain applications where models must adapt to rare visual concepts.
Common diversity metrics for LLM mathematical reasoning capture surface-level variation (wording) rather than true strategic differences in how problems are solved.
Traditional robot programming is difficult due to the need to orchestrate multimodal perception, manage contact dynamics, and handle diverse configurations and failures, with no existing system that autonomously writes, refines, and transfers reusable control programs across tasks and embodiments.
LLMs lack a learned memory management strategy: they don't know what to encode, when to retrieve, or how to organize knowledge over long-horizon tasks.
Current AI-driven scientific discovery systems are constrained to predefined search spaces or require externally supplied research questions, preventing true open-ended inquiry where a system autonomously explores raw multimodal data without guidance.
Training language models remains a human-intensive process because autonomous post-training requires an LM agent to plan iterations, construct benchmark-aligned data, run stable training jobs, evaluate checkpoints, and preserve experiment state, a long-horizon task that underspecified CLI environments fail to support.
32 papers
Current VLA models retain shallow perceptual knowledge (color, shape) after robotics fine-tuning but lose performance catastrophically on richer semantic categories (emotion, counting, temporal, normative, cultural knowledge); answer-relevant information is still present in intermediate layers yet fails to reach the action output.
Existing World Action Models fail at mobile manipulation due to three structural misalignments: coarse video prediction doesn't match fine-grained control, entangled navigation/manipulation action spaces cause gradient interference, and training on ground-truth futures doesn't generalize to the model's own noisy rollouts at inference time.
Traditional robot programming is difficult due to the need to orchestrate multimodal perception, manage contact dynamics, and handle diverse configurations and failures, with no existing system that autonomously writes, refines, and transfers reusable control programs across tasks and embodiments.
LLMs lack a learned memory management strategy: they don't know what to encode, when to retrieve, or how to organize knowledge over long-horizon tasks.
Existing data mixture optimization methods assume static data distributions and require costly retraining from scratch when the data pool shifts, preventing scalable transfer across data pools and model sizes.
Existing text-rich image data pipelines follow a static crawl-filter-freeze paradigm that discards rejected samples, wasting failure signals (OCR errors, semantic mismatches) that could inform later construction rounds.
VLA models fine-tuned from powerful VLMs on robotics data may catastrophically forget commonsense and world knowledge, but existing benchmarks conflate knowledge gaps with low-level control failures.
Vision-Language-Action (VLA) models fail under environmental shifts (camera pose, embodiment changes) and existing adaptation methods require costly multi-demonstration data per task.
Large LLM-based agents with procedural memory are too big and slow to run on resource-constrained edge devices.
23 papers
Data augmentations inherited from natural image tasks can disrupt the fine-grained vascular topology and textures critical for identity discrimination in vein recognition.
Existing audio-video generation models use separate per-modality tokenizers, which creates a representation gap, causes semantic misalignment, and requires expensive dual-branch architectures.
Existing diffusion-based speculative decoding methods use a fixed block size for all inputs, which is suboptimal since the optimal block size varies across samples.
Existing brain encoding and decoding models treat the two as separate tasks and rely on unimodal alignment, ignoring the brain's intrinsically multimodal integration.
Blind image deblurring methods struggle with real-world spatially varying degradations and lack the semantic awareness needed to distinguish valid textures from artifacts.
Can a discrete diffusion language model match or exceed autoregressive models on medical report generation while offering capabilities autoregressive models lack?
Standard two-stage visual generative models (tokenizer → frozen generator) decouple training, leaving the tokenizer unaware of what the generator finds easy or hard to model.
Existing controllable image generation methods like ControlNet struggle with attribute confusion in multi-instance scenes, while approaches that fix this require labor-intensive manual instance labeling.
Can small language models (SLMs) serve as viable, GPU-free replacements for large language models in the generation stage of Retrieval-Augmented Generation (RAG) systems?
6 papers
Popular performance-optimization benchmarks for coding agents conflate runtime instability, scoring-rule artifacts, and saturation effects, making leaderboard scores unreliable indicators of true coding-agent progress.
Benchmark pass rates for coding agents can be near-perfect even when the agent failed to build the requested artifact, because the agent satisfies the test oracle by inlining behavior into a throwaway demo instead of the reusable library.
Lightweight ML intrusion detection models achieve near-perfect accuracy within their training network but fail catastrophically when deployed on unseen IIoT networks.
Only 8% of speech model releases document any multilingual safety evaluation, leaving dangerous blind spots in safety frameworks as these models are deployed across languages worldwide.
LLMs hallucinate with high confidence, fail to recognize knowledge boundaries, and misrepresent internal uncertainty, undermining trustworthiness in high-stakes deployments.
Grid-based ANN methods have been absent from modern scaling analyses, so the paper characterizes their scaling behavior against dataset size N and dimensionality d to identify competitive regimes.