Are We Measuring Strategy or Phrasing? The Gap Between Surface- and Approach-Level Diversity in LLM Math Reasoning
Authors: Sangmook Lee, Minbeom Kim, Jeonghye Kim, Dohyung Kim, Sojeong Rhee, Kyomin Jung
arXiv ID: 2606.29985
Problem: Common diversity metrics for LLM mathematical reasoning capture surface-level variation (wording) rather than true strategic differences in how problems are solved.
Key Methodology:
- Introduces approach-level diversity: variation in problem-solving strategies across correct solutions to the same math problem.
- Develops a human-calibrated LLM judge framework to reliably classify solution strategies.
- Evaluates the gap between surface-level metrics (n-gram overlap, embedding similarity) and approach-level diversity in both evaluation and diversity-aware RLVR training.
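To make the contrast between the two kinds of measurement concrete, here is a minimal sketch, not the paper's implementation. It assumes surface diversity is one minus the mean pairwise Jaccard overlap of word bigrams, and approach-level diversity is the fraction of distinct strategy labels. The function names, the example solutions, and the hand-written labels (a stand-in for the judge's output) are all hypothetical.

```python
from itertools import combinations


def ngrams(text, n=2):
    """Return the set of word n-grams in a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def surface_diversity(solutions, n=2):
    """1 - mean pairwise Jaccard overlap of n-gram sets (assumed metric)."""
    pairs = list(combinations(solutions, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        ga, gb = ngrams(a, n), ngrams(b, n)
        union = ga | gb
        total += len(ga & gb) / len(union) if union else 1.0
    return 1.0 - total / len(pairs)


def approach_diversity(labels):
    """Fraction of distinct strategy labels among the solutions (assumed metric)."""
    return len(set(labels)) / len(labels) if labels else 0.0


# Hypothetical example: three differently worded solutions, one shared strategy.
solutions = [
    "Factor the quadratic as (x - 2)(x - 3) so the roots are 2 and 3.",
    "We can write x^2 - 5x + 6 as a product of two linear terms, giving x = 2 or x = 3.",
    "Splitting the middle term yields two factors; setting each to zero gives 2 and 3.",
]
labels = ["factoring", "factoring", "factoring"]  # stand-in for judge output

surface = surface_diversity(solutions)
approach = approach_diversity(labels)
```

On this toy input the surface score comes out above 0.9 while the approach score is 1/3, since all three solutions use the same strategy in different words.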
Key Results:
- Prior diversity measures are unreliable proxies for approach-level diversity (e.g., high surface diversity can coincide with low strategic diversity).
- In diversity-aware RLVR, the targeted surface metrics are preserved while approach-level diversity declines over the course of training.
- Approach-diverse candidate sets improve test-time scaling, but directly optimizing an LLM judge diversity reward causes the policy to exploit judge-specific preferences rather than genuinely broadening its reasoning strategies.
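One way to picture an approach-diverse candidate set is the sketch below. It is an illustration under stated assumptions, not the paper's procedure: it assumes each candidate already carries a strategy label (for example from an LLM judge) and fills a budget of k by cycling over labels, so that every strategy is represented before any strategy repeats. The function name and the (text, label) input format are hypothetical.

```python
def select_approach_diverse(candidates, k):
    """Pick up to k candidates, cycling over strategy labels so that each
    strategy is represented before any strategy is repeated.

    candidates: list of (solution_text, strategy_label) pairs (hypothetical format).
    """
    # Group solution texts by label, keeping first-seen label order.
    groups = {}
    for text, label in candidates:
        groups.setdefault(label, []).append(text)

    queues = [(label, list(texts)) for label, texts in groups.items()]
    selected = []
    # Round-robin: take one candidate per label on each pass until the budget is met.
    while len(selected) < k and any(texts for _, texts in queues):
        for label, texts in queues:
            if texts and len(selected) < k:
                selected.append((texts.pop(0), label))
    return selected


# Hypothetical labeled candidates for one problem.
candidates = [
    ("solution A", "factoring"),
    ("solution B", "factoring"),
    ("solution C", "quadratic formula"),
    ("solution D", "completing the square"),
]
picked = select_approach_diverse(candidates, k=3)
```

With a budget of 3, the selection covers all three strategies and skips the second factoring solution; a selector that ignored labels could spend two of its three slots on the same approach.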
Applied Context: For builders of reasoning LLMs, optimizing surface-level diversity (common in RLVR pipelines) does not guarantee strategic diversity, which can decline even while the surface metric holds steady. Measure and reward the property you actually want. The paper's LLM judge framework for approach-level classification is a practical starting point, but directly optimizing approach diversity remains an open problem, because a judge-based reward can be exploited by the policy.