Reinforcement Learning with Metacognitive Feedback Elicits Faithful Uncertainty Expression in LLMs

Authors: Gabrielle Kaili-May Liu, Avi Caciularu, Gal Yona, Idan Szpektor, Arman Cohan (Yale University & Google Research)

arXiv ID: 2606.32032

Problem: LLMs hallucinate with high confidence, fail to recognize knowledge boundaries, and misrepresent internal uncertainty, undermining trustworthiness in high-stakes deployments.

Key Methodology:

RLMF (Reinforcement Learning with Metacognitive Feedback) - a novel paradigm that scales the RL advantage signal by the accuracy of the model's self-judgment of its own performance, preferentially reinforcing completions where the model both performs well and accurately assesses that performance.
Metacognitive Data Selection - uses the model's self-assessed performance scores to select both the highest- and lowest-scoring training examples, outperforming random and active-learning baselines.
Two-stage decoupled pipeline - Stage 1 applies RLMF to calibrate faithful numerical confidence scores; Stage 2 maps those scores to natural linguistic hedges via targeted rewriting, adaptable to user preferences without repeating RL.
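
The advantage scaling described for RLMF can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the paper's formulation: the group-normalized (GRPO-style) baseline, the use of 1 - |self_score - reward| as the measure of self-judgment accuracy, and the hypothetical function names. Rewards and self-assessed scores are both assumed to lie in [0, 1].

```python
from statistics import mean, pstdev


def group_advantages(rewards):
    """Group-normalized advantages (assumed GRPO-style baseline)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All completions scored the same, so there is no learning signal.
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]


def metacognitive_accuracy(self_score, reward):
    """Assumed accuracy measure: 1.0 when the self-assessment matches the
    actual reward exactly, 0.0 when it is maximally wrong."""
    return 1.0 - abs(self_score - reward)


def rlmf_advantages(rewards, self_scores):
    """Scale each completion's advantage by how accurately the model
    judged its own performance (hypothetical form of the scaling)."""
    base = group_advantages(rewards)
    return [
        a * metacognitive_accuracy(s, r)
        for a, s, r in zip(base, self_scores, rewards)
    ]


# Two correct completions: the first was self-assessed accurately (0.9),
# the third was not (0.2), so the first receives the larger advantage.
scaled = rlmf_advantages([1.0, 0.0, 1.0, 0.0], [0.9, 0.1, 0.2, 0.8])
print(scaled)
```

Under this assumed form, a correct completion with an accurate self-assessment is reinforced more strongly than an equally correct one with a poor self-assessment. How the paper treats negative advantages is not stated in this summary, so the symmetric scaling above is only one possible choice.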

Key Results:

RLMF achieves cMFG ≥ 0.80* across all 10 evaluation tasks (spanning 6+ content domains), surpassing prior SOTA (MetaFaith and FUT) by 29% and 25%, respectively, and outperforming standard RL by up to 63%.
Sub-8B models trained with RLMF outperform GPT-5, Gemini-3.1-Pro, and Gemini-3-Flash in faithful calibration, with average gains of 37%, 17%, and 25%, respectively.
Human evaluation shows 96% average win rate over the strongest baseline across diversity, naturalness, helpfulness, and contextual suitability.
Metacognitive data selection achieves cMFG 0.84* on Llama3.1-8B, versus 0.80 for random selection and 0.79 for active learning.
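
The metacognitive data selection compared in the last result picks training examples from both ends of the model's self-assessed score range. A minimal sketch of that idea follows. The even split between lowest- and highest-scoring examples, the budget parameter `k`, and the function name `select_extremes` are assumptions for illustration, since the summary does not state the exact selection rule.

```python
def select_extremes(examples, self_scores, k):
    """Pick k examples: roughly half with the lowest self-assessed scores
    and half with the highest (the even split is an assumption)."""
    if len(examples) != len(self_scores):
        raise ValueError("examples and self_scores must have equal length")
    if k >= len(examples):
        return list(examples)

    # Indices ordered from lowest to highest self-assessed score.
    order = sorted(range(len(examples)), key=lambda i: self_scores[i])
    n_low = k // 2
    n_high = k - n_low
    chosen = order[:n_low] + order[len(order) - n_high:]
    return [examples[i] for i in chosen]


pool = ["a", "b", "c", "d", "e", "f"]
scores = [0.5, 0.9, 0.1, 0.7, 0.3, 0.95]
print(select_extremes(pool, scores, k=4))  # ['c', 'e', 'b', 'f']
```

The mid-range examples ("a" and "d" here) are the ones left out, which is what distinguishes this scheme from uniform random sampling.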

Applied Context: If you're building LLM-based systems where users need to know when to trust outputs (medical, legal, scientific), RLMF provides a training-time method to make models honestly communicate uncertainty - both as calibrated confidence scores and as natural hedging language - without sacrificing task accuracy.
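
For the application side, the decoupling of confidence scores from hedging language can be sketched as a post-processing step. This is a simple lookup-table stand-in, not the targeted rewriting the paper describes. The thresholds, phrases, style names, and function names below are all hypothetical. The point is only that the surface wording can change per user preference while the underlying score stays fixed.

```python
# Hypothetical hedge tables: (minimum confidence, phrase), highest first.
HEDGE_STYLES = {
    "plain": [
        (0.9, "I'm confident that"),
        (0.6, "I think"),
        (0.3, "I'm not sure, but possibly"),
        (0.0, "I don't know for certain; my best guess is that"),
    ],
    "formal": [
        (0.9, "It is well established that"),
        (0.6, "It is likely that"),
        (0.3, "It is possible that"),
        (0.0, "It is uncertain, though conceivable, that"),
    ],
}


def hedge_for(confidence, style="plain"):
    """Return the hedge phrase for a confidence score in [0, 1]."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    for threshold, phrase in HEDGE_STYLES[style]:
        if confidence >= threshold:
            return phrase
    raise AssertionError("unreachable: the last threshold is 0.0")


def render(answer, confidence, style="plain"):
    """Prefix an answer with a hedge that matches its confidence score."""
    return f"{hedge_for(confidence, style)} {answer}"


print(render("the capital of Australia is Canberra.", 0.95))
print(render("the capital of Australia is Canberra.", 0.4, style="formal"))
```

Switching `style` changes only the wording, so a new user preference needs a new table rather than another round of training.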