Model Capability Dominates: Inference-Time Optimization Lessons from AIMO 3
Abstract
Majority voting improves mathematical reasoning but is limited by correlated errors; diverse reasoning strategies and model capability are more impactful than prompt engineering.
Majority voting over multiple LLM attempts improves mathematical reasoning, but correlated errors limit the effective sample size. A natural fix is to assign different reasoning strategies to different voters. This approach, Diverse Prompt Mixer, is tested in the AIMO 3 competition setting: 3 models, 23+ experiments, 50 IMO-level problems, one H100 80 GB, and a 5-hour limit. Every prompt-level intervention fails: high-temperature sampling already decorrelates errors, and weaker strategies reduce accuracy more than they reduce correlation. At equal N=8, an 8-point capability gap dwarfs every optimization tested; model capability dominates. The gap between the best majority-vote score (42/50) and pass@20 (~45.5) is selection loss, not prompt loss. A verifier-based selector could close it. Prompt engineering cannot.
Community
Diverse Prompt Mixer assigns different reasoning strategies to majority-voting members to decorrelate errors. Tested on 50 IMO-level problems (1×H100, 5-hour limit, 3 models, 23+ experiments). It does not work.
Why it fails:
High-temperature sampling already pushes pairwise error correlation to zero or below (mean ρ̂ = −0.348 across 19 computable points). There is no correlation headroom left. Diverse prompts reduce per-attempt accuracy more than they reduce correlation.
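The pairwise error-correlation statistic can be sketched as follows, assuming per-attempt correctness is stored as a problems × voters 0/1 matrix (function and variable names are hypothetical, not from the paper). Pairs where one voter's errors are constant have undefined correlation and are skipped, which is presumably what "computable points" refers to:

```python
import numpy as np

def mean_pairwise_error_correlation(correct):
    """Mean pairwise Pearson correlation of error indicators.

    correct: (problems, voters) array of 0/1 correctness.
    Errors are 1 - correct; pairs with a constant error column
    (zero variance) have undefined correlation and are skipped.
    """
    errors = 1.0 - np.asarray(correct, dtype=float)
    n_voters = errors.shape[1]
    rhos = []
    for i in range(n_voters):
        for j in range(i + 1, n_voters):
            a, b = errors[:, i], errors[:, j]
            if a.std() == 0 or b.std() == 0:
                continue  # correlation undefined for constant columns
            rhos.append(np.corrcoef(a, b)[0, 1])
    return float(np.mean(rhos)) if rhos else float("nan")
```

A negative mean (as reported, ρ̂ = −0.348) means voters tend to fail on different problems, so majority voting already gets most of the diversity benefit for free.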
What dominates:
At equal N=8, the 8-point model capability gap (gpt-oss-120b at 39.3 vs. gpt-oss-20b at 31.0) is 4× larger than any prompt optimization (±2 points). Scaling N past the compute budget backfires.
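The equal-N comparison above rests on plain majority voting over final answers, which can be sketched as below (function name hypothetical; ties break by the answer first encountered, a simplifying assumption):

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent final answer among N attempts.

    Counter.most_common is stable, so ties resolve in favor of the
    answer encountered first in the attempt order.
    """
    return Counter(answers).most_common(1)[0][0]
```

With N=8 attempts per problem, the winning answer needs only plurality, not a strict majority; a more capable base model raises the per-attempt accuracy that this aggregation amplifies.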
Where the real gap is:
The model's pass@20 ≈ 45.5, but majority voting peaks at 42: roughly 3.5 points of selection loss. A verifier-based selector could close it. Prompt engineering cannot.
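The pass@k ceiling quoted above is conventionally computed with the unbiased estimator of Chen et al. (2021): given n attempts of which c are correct, the probability that at least one of k samples drawn without replacement is correct. A minimal sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: total attempts per problem, c: correct attempts, k: samples.
    If fewer than k incorrect attempts exist, every k-subset
    contains a correct one, so the estimate is 1.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging this over problems gives pass@20; subtracting the majority-vote score measures the selection loss that a verifier would need to recover.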
