AutoTrainess: Teaching Language Models to Improve Language Models Autonomously

Authors: Zhaojian Yu, Penghao Yin, Shuzheng Gao, Shilin He, Kai Cai, Xiao-Ping Zhang

arXiv ID: 2606.31551

Problem: Training language models remains a human-intensive process because autonomous post-training requires an LM agent to plan iterations, construct benchmark-aligned data, run stable training jobs, evaluate checkpoints, and preserve experiment state - a long-horizon task that underspecified CLI environments fail to support.

Key Methodology:

Introduces AutoTrainess, an LM agent that exposes training operations as a structured repository of agent-computer interfaces (planning, data preparation, training, evaluation, logging) rather than leaving the agent in a raw CLI environment.
Externalizes prior human experience as explicit workflows, rules, and execution constraints that guide the agent toward reliable training behavior.
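
To make the idea of structured interfaces with execution constraints concrete, here is a minimal Python sketch. It is not the AutoTrainess implementation: the `Operation` and `Workspace` classes, the operation names, and the ordering rule are all hypothetical, chosen only to illustrate how a rule such as "do not train before data is prepared" can be enforced by the interface rather than left to the agent.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Operation:
    """One structured interface the agent may call (hypothetical design)."""
    name: str
    requires: List[str]            # operations that must already have run
    run: Callable[[Dict], Dict]    # reads experiment state, returns updates


@dataclass
class Workspace:
    """Registered operations plus persisted experiment state and a call log."""
    operations: Dict[str, Operation] = field(default_factory=dict)
    state: Dict = field(default_factory=dict)
    log: List[str] = field(default_factory=list)

    def register(self, op: Operation) -> None:
        self.operations[op.name] = op

    def call(self, name: str) -> None:
        op = self.operations[name]
        missing = [r for r in op.requires if r not in self.log]
        if missing:
            # Execution constraint: reject out-of-order calls instead of
            # letting the agent launch a step whose inputs do not exist.
            raise RuntimeError(f"{name} requires {missing} to run first")
        self.state.update(op.run(self.state))
        self.log.append(name)  # the log doubles as preserved experiment history


def build_workspace() -> Workspace:
    """Assemble a toy pipeline; every step is a stand-in, not a real job."""
    ws = Workspace()
    ws.register(Operation("plan", [], lambda s: {"target": "demo-benchmark"}))
    ws.register(Operation("prepare_data", ["plan"],
                          lambda s: {"dataset": f"data-for-{s['target']}"}))
    ws.register(Operation("train", ["prepare_data"],
                          lambda s: {"checkpoint": "ckpt-0"}))
    ws.register(Operation("evaluate", ["train"],
                          lambda s: {"evaluated": s["checkpoint"]}))
    return ws
```

In this sketch, calling `train` on a fresh workspace raises an error, while calling `plan`, `prepare_data`, `train`, and `evaluate` in order succeeds and leaves the full history in `log`. A raw CLI offers no such guardrail, which is the contrast the paper draws.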

Key Results:

On PostTrainBench, AutoTrainess achieves an average score of 26.94 with GPT-5.4 (Codex), versus 23.21 for CLI-only baselines.
Improves DeepSeek-V4-Flash (OpenCode) from 12.13 to 19.58, demonstrating generalization across models and harnesses.
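
For a sense of scale, the reported scores can be turned into absolute and relative gains. The short Python helper below is only arithmetic on the four numbers quoted above; the function name `gains` is an illustrative choice, and the relative figures are derived here rather than reported in the summary.

```python
from typing import Tuple


def gains(baseline: float, improved: float) -> Tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = improved - baseline
    relative = absolute / baseline * 100
    return round(absolute, 2), round(relative, 1)


print(gains(23.21, 26.94))  # GPT-5.4 (Codex): (3.73, 16.1)
print(gains(12.13, 19.58))  # DeepSeek-V4-Flash (OpenCode): (7.45, 61.4)
```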

Applied Context: For builders, AutoTrainess shows that wrapping LM training infrastructure in structured agent-computer interfaces, rather than raw CLIs, unlocks autonomous self-improvement loops. The next generation of coding agents may therefore increasingly train and fine-tune themselves without a human in the loop.