When Can LLMs Learn to Reason with Weak Supervision?
Abstract
This work shows that whether a model generalizes on reasoning tasks under weak supervision depends on reward saturation dynamics and reasoning faithfulness, and that supervised fine-tuning on explicit reasoning traces is crucial for successful adaptation.
Large language models have achieved significant reasoning improvements through reinforcement learning with verifiable rewards (RLVR). Yet as model capabilities grow, constructing high-quality reward signals becomes increasingly difficult, making it essential to understand when RLVR can succeed under weaker forms of supervision. We conduct a systematic empirical study across diverse model families and reasoning domains under three weak supervision settings: scarce data, noisy rewards, and self-supervised proxy rewards. We find that generalization is governed by training reward saturation dynamics: models that generalize exhibit a prolonged pre-saturation phase during which training reward and downstream performance climb together, while models that saturate rapidly memorize rather than learn. We identify reasoning faithfulness, defined as the extent to which intermediate steps logically support the final answer, as the pre-RL property that predicts which regime a model falls into, while output diversity alone is uninformative. Motivated by these findings, we disentangle the contributions of continual pre-training and supervised fine-tuning (SFT), finding that SFT on explicit reasoning traces is necessary for generalization under weak supervision, while continual pre-training on domain data amplifies the effect. Applied together to Llama3.2-3B-Base, these interventions enable generalization across all three settings where the base model previously failed.
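To make the three weak-supervision settings concrete, here is a minimal sketch of how such reward functions might look. This is illustrative only, not the paper's implementation: the function names, the `extract_answer` helper, and the noise/consensus parameters are assumptions.

```python
import random
from collections import Counter

def extract_answer(completion: str) -> str:
    """Hypothetical parser: take the text after a final 'Answer:' marker."""
    return completion.rsplit("Answer:", 1)[-1].strip()

def verifiable_reward(completion: str, gold_answer: str) -> float:
    """Standard RLVR reward: 1 if the extracted answer matches the gold label."""
    return 1.0 if extract_answer(completion) == gold_answer else 0.0

def noisy_reward(completion: str, gold_answer: str, flip_prob: float = 0.2) -> float:
    """Noisy-reward setting: the verifiable signal is flipped with probability flip_prob."""
    r = verifiable_reward(completion, gold_answer)
    return 1.0 - r if random.random() < flip_prob else r

def self_supervised_proxy_reward(completions: list[str]) -> list[float]:
    """Self-supervised proxy: reward agreement with the majority-vote answer,
    so no gold labels are required."""
    answers = [extract_answer(c) for c in completions]
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]
```

The scarce-data setting would use `verifiable_reward` as-is but over a small prompt set; the other two replace or remove the gold label.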
Community
As LLMs get stronger, reliable reward signals get harder to build. We study when RLVR generalizes under weak supervision: scarce data, noisy rewards, and self-supervised proxy rewards. Some models generalize, others fail entirely. The difference is reasoning faithfulness, not output diversity. Failing models memorize answers with unfaithful chains of thought. The fix: continual pretraining plus SFT on explicit reasoning traces ("thinking SFT") before RL, which recovers generalization across all settings.
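As a rough illustration of the saturation diagnostic (the threshold, smoothing window, and synthetic curves below are assumptions, not the paper's metric): a model whose training reward climbs slowly has a long pre-saturation phase, while one that jumps to the ceiling almost immediately is likely memorizing.

```python
def saturation_step(reward_curve: list[float], threshold: float = 0.95, window: int = 10) -> int:
    """Return the first training step at which the windowed mean training reward
    stays above `threshold`; an early return suggests memorization rather than learning."""
    for t in range(len(reward_curve) - window):
        if sum(reward_curve[t:t + window]) / window >= threshold:
            return t
    return len(reward_curve)  # never saturated within the logged horizon

# Synthetic example curves: one climbs slowly, one is at the ceiling from step 0.
slow = [min(1.0, 0.01 * t) for t in range(200)]
fast = [1.0] * 200
print(saturation_step(slow), saturation_step(fast))  # -> 91 0
```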
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- Recycling Failures: Salvaging Exploration in RLVR via Fine-Grained Off-Policy Guidance (2026)
- DeReason: A Difficulty-Aware Curriculum Improves Decoupled SFT-then-RL Training for General Reasoning (2026)
- SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions (2026)
- What If Consensus Lies? Selective-Complementary Reinforcement Learning at Test Time (2026)
- Cog-DRIFT: Exploration on Adaptively Reformulated Instances Enables Learning from Hard Reasoning Problems (2026)
- Reinforcement-aware Knowledge Distillation for LLM Reasoning (2026)
- Mitigating Distribution Sharpening in Math RLVR via Distribution-Aligned Hint Synthesis and Backward Hint Annealing (2026)