AgenticSTS: A Bounded-Memory Testbed for Long-Horizon LLM Agents

Authors: Xiangchen Cheng, Yunwei Jiang, Jianwen Sun, Zizhen Li, Chuanhao Li, Xiangcheng Cao, Yihao Liu, Fanrui Zhang, Li Jin, Kaipeng Zhang

arXiv ID: 2607.02255

Problem: LLM agents that append all past context to every prompt entangle every memory source in a single growing transcript, so the effect of any one memory component on long-horizon decisions cannot be isolated.

Key Methodology:

Introduces a bounded-memory contract: each decision starts from a fresh user message assembled by typed retrieval, with no raw cross-decision transcript appended. This keeps the prompt bounded across runs of any length and enables per-layer ablation.
Instantiates the contract in Slay the Spire 2, a closed-rule stochastic deck-building game requiring hundreds of tactical and strategic decisions per run.
Releases 298 completed trajectories with condition tags, frozen memory/skill snapshots, prompt records, and analysis scripts as a reproducible testbed.
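
The bounded-memory contract described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names MemoryStore, build_prompt, the layer names, and the "k most recent entries" retrieval rule are all assumptions made for illustration. The sketch shows only the two properties the summary states: every prompt is rebuilt from the current state plus typed retrieval, and a layer can be ablated by setting its budget to zero.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    """Hypothetical typed store: each layer name maps to a list of short entries."""
    layers: dict[str, list[str]] = field(default_factory=dict)

    def write(self, layer: str, entry: str) -> None:
        self.layers.setdefault(layer, []).append(entry)

    def retrieve(self, layer: str, k: int) -> list[str]:
        # Placeholder retrieval rule (an assumption): the k most recent entries.
        entries = self.layers.get(layer, [])
        return entries[-k:] if k > 0 else []


def build_prompt(state: str, store: MemoryStore, budget: dict[str, int],
                 max_entry_chars: int = 200) -> str:
    """Assemble a fresh user message; no earlier transcript is carried over."""
    parts = [f"STATE:\n{state}"]
    for layer, k in budget.items():
        # Each layer contributes at most k entries of at most max_entry_chars each,
        # so prompt size is bounded regardless of how long the run has been.
        items = [e[:max_entry_chars] for e in store.retrieve(layer, k)]
        if items:
            parts.append(f"{layer.upper()}:\n" + "\n".join(f"- {e}" for e in items))
    return "\n\n".join(parts)


store = MemoryStore()
for i in range(500):  # simulate a long run
    store.write("episodic", f"fight {i}: took damage")
    store.write("skills", f"skill {i}")

full = build_prompt("floor 7, hp 40", store, {"episodic": 3, "skills": 2})
ablated = build_prompt("floor 7, hp 40", store, {"episodic": 3, "skills": 0})
```

Because each layer enters the prompt only through its own budget entry, switching one layer off leaves the rest of the prompt unchanged, which is what makes a per-layer comparison interpretable.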

Key Results:

Frontier LLMs across five configurations achieve zero wins at the lowest difficulty on the public online benchmark; the developer-reported human win rate at the same difficulty is 16%.
In the authors' harness, a fixed-A0 ablation shows the no-store baseline winning 3/10 games vs. 6/10 with triggered strategic skills enabled (Fisher's exact test, p≈0.37; directional, not statistically decisive).
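
The reported p-value can be checked by hand from the 2x2 table of wins and losses (3 wins and 7 losses vs. 6 wins and 4 losses). The sketch below assumes a two-sided Fisher exact test, which sums the hypergeometric probabilities of all tables no more likely than the observed one; the function name is illustrative, and only the Python standard library is used.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x successes in the first row,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over every table at most as probable as the observed one
    # (small relative tolerance guards against floating-point ties).
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# Rows: no-store baseline (3 wins, 7 losses), triggered skills (6 wins, 4 losses).
p = fisher_exact_two_sided(3, 7, 6, 4)
print(round(p, 2))  # 0.37
```

With ten games per arm, a difference of three wins is well within what chance alone produces, which is why the result is described as directional only.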

Applied Context: Builders evaluating long-horizon agent architectures should adopt typed-retrieval, bounded-memory designs to isolate which memory mechanisms actually drive performance, rather than relying on monolithic context-windowing that conflates all past signals.