Are We Measuring Strategy or Phrasing? The Gap Between Surface- and Approach-Level Diversity in LLM Math Reasoning

Authors: Sangmook Lee, Minbeom Kim, Jeonghye Kim, Dohyung Kim, Sojeong Rhee, Kyomin Jung

arXiv ID: 2606.29985

Problem: Common diversity metrics for LLM mathematical reasoning capture surface-level variation (wording) rather than true strategic differences in how problems are solved.

Key Methodology:

Introduces approach-level diversity: variation in problem-solving strategies across correct solutions to the same math problem.
Develops a human-calibrated LLM judge framework to reliably classify solution strategies.
Evaluates the gap between surface-level metrics (n-gram overlap, embedding similarity) and approach-level diversity in both evaluation and diversity-aware RLVR training.
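The contrast between the two kinds of measurement can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the choice of word-bigram Jaccard overlap as the surface metric, and the "distinct labels divided by number of solutions" score are all assumptions made here, and the strategy labels are assumed to come from some external judge.

```python
from itertools import combinations


def ngrams(text, n=2):
    """Set of word n-grams in a solution (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def surface_diversity(solutions, n=2):
    """1 minus the mean pairwise Jaccard overlap of word n-grams (assumed metric)."""
    pairs = list(combinations(solutions, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        ga, gb = ngrams(a, n), ngrams(b, n)
        union = ga | gb
        overlap = len(ga & gb) / len(union) if union else 1.0
        total += 1.0 - overlap
    return total / len(pairs)


def approach_diversity(labels):
    """Distinct strategy labels divided by number of solutions (assumed score)."""
    return len(set(labels)) / len(labels) if labels else 0.0


# Three differently worded solutions that all use the same strategy.
solutions = [
    "We factor the quadratic to get roots two and three",
    "Factoring gives x minus two times x minus three so answers follow",
    "Split the middle term then factor by grouping to find both roots",
]
labels = ["factoring", "factoring", "factoring"]  # hypothetical judge output

print(surface_diversity(solutions))
print(approach_diversity(labels))
```

In this toy case the three solutions share no word bigrams, so the surface score is at its maximum, while the label-based score is at its minimum for three solutions. That is the kind of disagreement the evaluation is designed to expose.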

Key Results:

Prior diversity measures are unreliable proxies for approach-level diversity (e.g., high surface diversity can coincide with low strategic diversity).
In diversity-aware RLVR, the targeted surface metrics are preserved while approach-level diversity declines during training.
Approach-diverse candidate sets improve test-time scaling, but directly optimizing an LLM judge diversity reward causes the policy to exploit judge-specific preferences rather than genuinely broadening its reasoning strategies.
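One way to build an approach-diverse candidate set for test-time scaling is to pick candidates round-robin across strategy labels. The sketch below is an assumption about how such a set might be assembled, not the procedure used in the paper; `select_approach_diverse` and the (text, label) input format are hypothetical, and the labels are again assumed to come from an external judge.

```python
from collections import defaultdict


def select_approach_diverse(candidates, k):
    """Pick up to k candidates, cycling over strategy labels.

    candidates: list of (solution_text, strategy_label) pairs (assumed format).
    Labels are visited in order of first appearance, and candidates within a
    label keep their original order.
    """
    groups = defaultdict(list)
    for text, label in candidates:
        groups[label].append(text)

    selected = []
    # Keep cycling until k items are chosen or every group is exhausted.
    while len(selected) < k and any(groups.values()):
        for label in list(groups):
            if groups[label] and len(selected) < k:
                selected.append((groups[label].pop(0), label))
    return selected


candidates = [
    ("a1", "algebra"),
    ("a2", "algebra"),
    ("g1", "geometry"),
    ("a3", "algebra"),
]
print(select_approach_diverse(candidates, 3))
# [('a1', 'algebra'), ('g1', 'geometry'), ('a2', 'algebra')]
```

Taking the first three candidates in order would give only algebra and geometry by chance of ordering; the round-robin pass guarantees each observed strategy is represented before any strategy is repeated.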

Applied Context: For builders of reasoning LLMs, the takeaway is that optimizing for surface-level diversity (common in RLVR pipelines) can keep the targeted metric intact while strategic diversity declines, so the property you actually want has to be measured and rewarded directly. The paper's LLM judge framework for approach-level classification is a practical starting point for measurement, but direct optimization of approach diversity remains an unsolved problem, since a policy trained against the judge learns to exploit its preferences.