LLM / Agent / Memory
AgenticSTS: A Bounded-Memory Testbed for Long-Horizon LLM Agents
Authors: Xiangchen Cheng, Yunwei Jiang, Jianwen Sun, Zizhen Li, Chuanhao Li, Xiangcheng Cao, Yihao Liu, Fanrui Zhang, Li Jin, Kaipeng Zhang
arXiv ID: 2607.02255
Problem: Standard LLM agents that append all past context to every prompt accumulate an undifferentiated memory, which makes it impossible to isolate the effect of any single memory component on long-horizon decisions.
Key Methodology:
- Introduces a bounded-memory contract: each decision starts from a fresh user message assembled by typed retrieval, with no raw cross-decision transcript appended. This keeps the prompt bounded across runs of any length and enables per-layer ablation.
- Instantiates the contract in Slay the Spire 2, a closed-rule stochastic deck-building game requiring hundreds of tactical and strategic decisions per run.
- Releases 298 completed trajectories with condition tags, frozen memory/skill snapshots, prompt records, and analysis scripts as a reproducible testbed.
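The bounded-memory contract described above can be illustrated with a minimal sketch. This is not the authors' implementation: `MemoryStore`, `build_prompt`, the `per_layer_limit` cap, and the layer names are hypothetical, chosen only to show how a fresh message assembled from typed retrieval stays bounded and how a single layer can be switched off for ablation.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    """Hypothetical typed store: each layer name maps to a list of short text entries."""
    layers: dict[str, list[str]] = field(default_factory=dict)

    def retrieve(self, layer: str, limit: int) -> list[str]:
        # Return at most `limit` of the most recent entries for one layer.
        if limit <= 0:
            return []
        return self.layers.get(layer, [])[-limit:]


def build_prompt(state: str, store: MemoryStore, enabled: set[str],
                 per_layer_limit: int = 3) -> str:
    """Assemble a fresh user message for one decision.

    Only the current state and capped, typed retrievals are included;
    no transcript from earlier decisions is appended.
    """
    sections = [f"[state]\n{state}"]
    for layer in sorted(enabled):  # sorted for a deterministic section order
        entries = store.retrieve(layer, per_layer_limit)
        if entries:
            sections.append(f"[{layer}]\n" + "\n".join(entries))
    return "\n\n".join(sections)


# Assumed example data: 100 stored notes in one hypothetical layer.
store = MemoryStore({"strategic_skills": [f"note {i}" for i in range(100)]})

full = build_prompt("floor 3, 40 HP", store, enabled={"strategic_skills"})
ablated = build_prompt("floor 3, 40 HP", store, enabled=set())
```

Because each layer is capped independently, the prompt size depends on the number of enabled layers and the cap, not on how many decisions have already been made. Removing a layer from `enabled` is the per-layer ablation in this sketch.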
Key Results:
- Frontier LLMs across five configurations achieve zero wins at the lowest difficulty on the public online benchmark; the developer-reported human win rate at the same difficulty is 16%.
- In the authors' harness, a fixed-A0 ablation shows the no-store baseline winning 3/10 games vs. 6/10 with triggered strategic skills enabled (Fisher exact p≈0.37; directional, not statistically decisive).
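The reported p-value can be checked from the win counts alone. The sketch below assumes a two-sided Fisher exact test on the 2x2 table of wins and losses (3/7 for the no-store baseline, 6/4 with strategic skills); the function name is illustrative and the summary does not state which variant the authors' analysis scripts use.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(k: int) -> float:
        # Hypergeometric probability of k in the top-left cell, margins fixed.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = [prob(k) for k in range(lo, hi + 1)]
    # Sum every table that is at least as unlikely as the observed one
    # (small tolerance guards against floating-point ties).
    return sum(p for p in probs if p <= observed * (1 + 1e-9))


# Rows: no-store baseline, triggered strategic skills. Columns: wins, losses.
p_value = fisher_exact_two_sided(3, 7, 6, 4)
print(f"{p_value:.3f}")  # 0.370
```

With only ten runs per arm, a doubling of the win rate still leaves p well above conventional thresholds, which is why the result reads as directional rather than decisive.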
Applied Context: Builders evaluating long-horizon agent architectures should adopt typed-retrieval, bounded-memory designs to isolate which memory mechanisms actually drive performance, rather than relying on monolithic context-windowing that conflates all past signals.