HF Research Papers - Developers Digest

Building Agents

22 papers

AgenticDataBench: A Comprehensive Benchmark for Data Agents

There is no comprehensive benchmark to rigorously evaluate LLM-based data agents across diverse, realistic data science scenarios at fine granularity.

LLM, Agent

AgenticSTS: A Bounded-Memory Testbed for Long-Horizon LLM Agents

Standard LLM agents that append all past context to every prompt produce a jumbled memory that makes it impossible to isolate the effect of any single memory component on long-horizon decisions.

LLM, Agent

AI translation of literary texts is "fine", but readers still prefer human translations

Do readers experience literary machine translations (MT) as immersively as human translations (HT), and can automatic metrics reliably capture this preference?

Agent, Evaluation

Are Performance-Optimization Benchmarks Reliably Measuring Coding Agents?

Popular performance-optimization benchmarks for coding agents conflate runtime instability, scoring-rule artifacts, and saturation effects, making leaderboard scores unreliable indicators of true coding-agent progress.

Multimodal, Agent

AtomiMed: Hierarchical Atomic Fact-Checking for Universal Clinical-Aware Medical Report Evaluation

Traditional n-gram overlap metrics for medical report generation fail to capture clinical factual accuracy, often overlooking catastrophic diagnostic errors like a missed pneumothorax or inverted laterality.

LLM, Multimodal

Autonomous Scientific Discovery via Iterative Meta-Reflection

Current AI-driven scientific discovery systems are constrained to predefined search spaces or require externally supplied research questions, preventing true open-ended inquiry where a system autonomously explores raw multimodal data without guidance.

LLM, Agent

AutoTrainess: Teaching Language Models to Improve Language Models Autonomously

Training language models remains a human-intensive process because autonomous post-training requires an LM agent to plan iterations, construct benchmark-aligned data, run stable training jobs, evaluate checkpoints, and preserve experiment state - a long-horizon task that underspecified CLI environments fail to support.

Agent, Reasoning

BioInsight: Multi-Agent Orchestration for Interactive Biomedical Knowledge Discovery

Static AI-generated biomedical reports are insufficient for research decision-making because researchers cannot inspect evidence, assess uncertainty, compare mechanisms, or refine hypotheses.

Agent, Code

Building to the Test: Coding Agents Deliver What You Check, Not What You Requested

Benchmark pass rates for coding agents can be near-perfect even when the agent failed to build the requested artifact, because the agent satisfies the test oracle by inlining behavior into a throwaway demo instead of the reusable library.

+13 more

Faster Inference

40 papers

LLM, Distillation

DOPD: Dual On-policy Distillation

DOPD introduces advantage-aware dual distillation that dynamically routes token-level supervision between teacher and student policies based on the privilege advantage gap, solving the 'privilege illusion' failure mode in on-policy distillation.

Diffusion, Speculative Decoding

BlockPilot: Instance-Adaptive Policy Learning for Diffusion-based Speculative Decoding

BlockPilot reveals that fixed block-size strategies in diffusion-based speculative decoding are fundamentally suboptimal, and proposes a lightweight instance-adaptive policy that predicts the optimal block size from the prefilling representation, achieving 4.20× speedup on Qwen3-4B.

Image Generation, Autoregressive

GEAR: Guided End-to-End AutoRegression for Image Synthesis

GEAR trains a VQ tokenizer and autoregressive generator jointly end-to-end via a dual hard/soft codebook readout, achieving up to 10x faster gFID convergence on ImageNet by shifting representation alignment from the tokenizer to the AR model.

Training, Generation

AGVBench: A Reliability-Oriented Benchmark of Data Augmentation for Vein Recognition

Data augmentations inherited from natural image tasks can disrupt the fine-grained vascular topology and textures critical for identity discrimination in vein recognition.

Multimodal, Agent

ASPIRE: Agentic Skill Programming through Iterative Robot Exploration

Traditional robot programming is difficult due to the need to orchestrate multimodal perception, manage contact dynamics, and handle diverse configurations and failures, with no existing system that autonomously writes, refines, and transfers reusable control programs across tasks and embodiments.

LLM, Diffusion

BlockPilot: Instance-Adaptive Policy Learning for Diffusion-based Speculative Decoding

Existing diffusion-based speculative decoding methods use a fixed block size for all inputs, which is suboptimal since the optimal block size varies across samples.

Multimodal, Diffusion

BrainJanus: A Unified Model for Understanding and Generation across Brain, Vision, and Language

Existing brain encoding and decoding models treat these as separate tasks using unimodal alignment, ignoring the brain's intrinsically multimodal integration.

Multimodal, Distillation

Denser $\neq$ Better: Limits of On-Policy Self-Distillation for Continual Post-Training

Does injecting denser (token-level) on-policy self-distillation signals improve continual post-training for foundation models, or can it actually hurt?

LLM, Multimodal

Discrete Diffusion Language Models for Interactive Radiology Report Drafting

Can a discrete diffusion language model match or exceed autoregressive models on medical report generation while offering capabilities autoregressive models lack?

+31 more

Training & Fine-Tuning

71 papers

LLM, Reasoning

Orca: The World is in Your Mind

Orca introduces a general world foundation model that learns a unified latent space from multimodal signals via Next-State-Prediction, then freezes its backbone and attaches lightweight readout decoders for text, image, and action - outperforming specialized baselines across all three modalities.

Robotics, Training

ABot-M0.5: Unified Mobility-and-Manipulation World Action Model

Existing World Action Models fail at mobile manipulation due to three structural misalignments - coarse video prediction doesn't match fine-grained control, entangled navigation/manipulation action spaces cause gradient interference, and training on ground-truth futures doesn't generalize to the model's own noisy rollouts at inference time.

Training, Generation

AGVBench: A Reliability-Oriented Benchmark of Data Augmentation for Vein Recognition

Data augmentations inherited from natural image tasks can disrupt the fine-grained vascular topology and textures critical for identity discrimination in vein recognition.

LLM, Multimodal

AnyGroundBench: A Specialized-Domain Benchmark for Video Grounding in Vision-Language Models

Current video grounding benchmarks are confined to general daily-life domains and zero-shot evaluation, creating a critical disconnect from real-world specialized-domain applications where models must adapt to rare visual concepts.

LLM, Training

Are We Measuring Strategy or Phrasing? The Gap Between Surface- and Approach-Level Diversity in LLM Math Reasoning

Common diversity metrics for LLM mathematical reasoning capture surface-level variation (wording) rather than true strategic differences in how problems are solved.

Multimodal, Agent

ASPIRE: Agentic Skill Programming through Iterative Robot Exploration

Traditional robot programming is difficult due to the need to orchestrate multimodal perception, manage contact dynamics, and handle diverse configurations and failures, with no existing system that autonomously writes, refines, and transfers reusable control programs across tasks and embodiments.

LLM, Agent

AutoMem: Automated Learning of Memory as a Cognitive Skill

LLMs lack a learned memory management strategy - they don't know what to encode, when to retrieve, or how to organize knowledge over long-horizon tasks.

LLM, Multimodal

Autonomous Scientific Discovery via Iterative Meta-Reflection

Current AI-driven scientific discovery systems are constrained to predefined search spaces or require externally supplied research questions, preventing true open-ended inquiry where a system autonomously explores raw multimodal data without guidance.

LLM, Agent

AutoTrainess: Teaching Language Models to Improve Language Models Autonomously

Training language models remains a human-intensive process because autonomous post-training requires an LM agent to plan iterations, construct benchmark-aligned data, run stable training jobs, evaluate checkpoints, and preserve experiment state - a long-horizon task that underspecified CLI environments fail to support.

+62 more

Robotics & Embodied AI

32 papers

VLA, Robotics

Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models

Current VLA models retain shallow perceptual knowledge (color, shape) after robotics fine-tuning but catastrophically drop performance on richer semantic categories (emotion, counting, temporal, normative, cultural knowledge) - with answer-relevant information still present in intermediate layers yet failing to reach action output.

Robotics, Training

ABot-M0.5: Unified Mobility-and-Manipulation World Action Model

Existing World Action Models fail at mobile manipulation due to three structural misalignments - coarse video prediction doesn't match fine-grained control, entangled navigation/manipulation action spaces cause gradient interference, and training on ground-truth futures doesn't generalize to the model's own noisy rollouts at inference time.

Multimodal, Agent

ASPIRE: Agentic Skill Programming through Iterative Robot Exploration

Traditional robot programming is difficult due to the need to orchestrate multimodal perception, manage contact dynamics, and handle diverse configurations and failures, with no existing system that autonomously writes, refines, and transfers reusable control programs across tasks and embodiments.

LLM, Agent

AutoMem: Automated Learning of Memory as a Cognitive Skill

LLMs lack a learned memory management strategy - they don't know what to encode, when to retrieve, or how to organize knowledge over long-horizon tasks.

LLM, Robotics

CausalMix: Data Mixture as Causal Inference for Language Model Training

Existing data mixture optimization methods assume static data distributions and require costly retraining from scratch when the data pool shifts, preventing scalable transfer across data pools and model sizes.

Multimodal, Agent

DataEvolver: Self-Evolving Multi-Agent Data Construction for Text-Rich Image Generation

Existing text-rich image data pipelines follow a static crawl-filter-freeze paradigm that discards rejected samples, wasting failure signals (OCR errors, semantic mismatches) that could inform later construction rounds.

Multimodal, Agent

Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models

VLA models fine-tuned from powerful VLMs on robotics data may catastrophically forget commonsense and world knowledge, but existing benchmarks conflate knowledge gaps with low-level control failures.

Multimodal, Robotics

Domain Arithmetic: One-Shot VLA Adaptation under Environmental Shifts

Vision-Language-Action (VLA) models fail under environmental shifts (camera pose, embodiment changes) and existing adaptation methods require costly multi-demonstration data per task.

LLM, Agent

DuoMem: Towards Capable On-Device Memory Agents via Dual-Space Distillation

Large LLM-based agents with procedural memory are too big and slow to run on resource-constrained edge devices.

+23 more

Content Generation

23 papers

Training, Generation

AGVBench: A Reliability-Oriented Benchmark of Data Augmentation for Vein Recognition

Data augmentations inherited from natural image tasks can disrupt the fine-grained vascular topology and textures critical for identity discrimination in vein recognition.

LLM, Code

AVTok: 1D Unified Tokenization for Holistic Audio-Video Generation

Existing audio-video generation models use separate per-modality tokenizers, which creates a representation gap, causes semantic misalignment, and requires expensive dual-branch architectures.

LLM, Diffusion

BlockPilot: Instance-Adaptive Policy Learning for Diffusion-based Speculative Decoding

Existing diffusion-based speculative decoding methods use a fixed block size for all inputs, which is suboptimal since the optimal block size varies across samples.

Multimodal, Diffusion

BrainJanus: A Unified Model for Understanding and Generation across Brain, Vision, and Language

Existing brain encoding and decoding models treat these as separate tasks using unimodal alignment, ignoring the brain's intrinsically multimodal integration.

Multimodal, Generation

CogSENet: Blind Image Deblurring with Blur-Conditioned Semantic Routing and Explicit Frequency Fusion

Blind image deblurring methods struggle with real-world spatially varying degradations and lack the semantic awareness needed to distinguish valid textures from artifacts.

LLM, Multimodal

Discrete Diffusion Language Models for Interactive Radiology Report Drafting

Can a discrete diffusion language model match or exceed autoregressive models on medical report generation while offering capabilities autoregressive models lack?

Diffusion, Code

GEAR: Guided End-to-End AutoRegression for Image Synthesis

Standard two-stage visual generative models (tokenizer → frozen generator) decouple training, leaving the tokenizer unaware of what the generator finds easy or hard to model.

LLM, Multimodal

InstanceControl: Controllable Complex Image Generation without Instance Labeling

Existing controllable image generation methods like ControlNet struggle with attribute confusion in multi-instance scenes, while approaches that fix this require labor-intensive manual instance labeling.

LLM, Reasoning

Little Brains, Big Feats: Exploring Compact Language Models

Can small language models (SLMs) serve as viable, GPU-free replacements for large language models in the generation stage of Retrieval-Augmented Generation (RAG) systems?

+14 more

Evaluation & Benchmarks

6 papers

Agent, Evaluation

Are Performance-Optimization Benchmarks Reliably Measuring Coding Agents?

Popular performance-optimization benchmarks for coding agents conflate runtime instability, scoring-rule artifacts, and saturation effects, making leaderboard scores unreliable indicators of true coding-agent progress.

Agent, Code

Building to the Test: Coding Agents Deliver What You Check, Not What You Requested

Benchmark pass rates for coding agents can be near-perfect even when the agent failed to build the requested artifact, because the agent satisfies the test oracle by inlining behavior into a throwaway demo instead of the reusable library.

LLM, Training

Cross-Domain Generalization Failure in Lightweight Intrusion Detection Models for IIoT Networks

Lightweight ML intrusion detection models achieve near-perfect accuracy within their training network but fail catastrophically when deployed on unseen IIoT networks.

Multimodal, Safety

RedVox: Safety and Fairness Gaps in Speech Models Across Languages

Only 8% of speech model releases document any multilingual safety evaluation, leaving dangerous blind spots in safety frameworks as these models are deployed across languages worldwide.

LLM, Training

Reinforcement Learning with Metacognitive Feedback Elicits Faithful Uncertainty Expression in LLMs

LLMs hallucinate with high confidence, fail to recognize knowledge boundaries, and misrepresent internal uncertainty, undermining trustworthiness in high-stakes deployments.

Evaluation

Scaling Laws for Grid-Based Approximate Nearest Neighbor Search in High Dimensions

Grid-based ANN methods have been absent from modern scaling analyses, so the paper characterizes their scaling behavior against dataset size N and dimensionality d to identify competitive regimes.
