I Spent $0.48 to Find Out When MCTS Actually Works for LLM Reasoning
Does tree search help LLM reasoning? The literature can’t decide.
ReST-MCTS* says yes. AB-MCTS got a NeurIPS spotlight. “Limits of PRM-Guided Tree Search” says no: MCTS with a process reward model used 11x more tokens than best-of-N for zero accuracy gain. Snell et al. found beam search degrades performance on easy problems.
I built mcts-reasoning and ran controlled experiments to find where the boundary is. Total API cost: $0.48.
Setup
Four methods. The multi-sample methods share one budget, eight solution attempts per problem. Pass@1 is the single-attempt baseline:
- Pass@1: One shot.
- Best-of-N: 8 independent solutions, verifier scores each, pick the best.
- Self-consistency: 8 solutions, majority vote.
- MCTS: 5 initial solutions scored by verifier, then 3 more guided by UCB1, informed by what worked and what failed.
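The two sampling baselines differ only in how they pick a winner, and a small sketch makes that concrete. It assumes each candidate is a hashable tuple of variable values and that verifier scores are already computed. The function names and the toy data are illustrative, not taken from mcts-reasoning.

```python
from collections import Counter

def best_of_n(candidates, scores):
    """Return the candidate with the highest verifier score (first wins ties)."""
    best_index = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_index]

def self_consistency(candidates):
    """Return the most frequent candidate (majority vote, verifier not used)."""
    return Counter(candidates).most_common(1)[0][0]

# Made-up data: 8 candidate answers and their verifier scores.
candidates = [(7, 4), (7, 3), (7, 3), (7, 3), (7, 4), (6, 4), (7, 3), (7, 3)]
scores = [1.0, 0.8, 0.8, 0.8, 1.0, 0.5, 0.8, 0.8]

print(best_of_n(candidates, scores))   # highest-scoring candidate
print(self_consistency(candidates))    # most common candidate
```

In this toy data the majority answer is not the top-scoring one, so the two rules disagree. Neither rule can return an answer that was never sampled.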
Model: Claude Haiku 4.5. Problems: constraint satisfaction. Find integer values for variables satisfying simultaneous constraints. Example:
Find A, B, C, D, E, F satisfying ALL constraints:
1. A * D = 21
2. C = A + F
3. B * F = 20
4. E = A + 2
5. D + B = A
6. E - F = B
7. A mod 7 = 0
8. C mod 4 = 0
9. B * D = 12
10. C > E > A > F > B > D
11. A + B + C + D + E + F = 40
12. E * D = 27
The verifier is a Python function. Checks each constraint, returns the fraction satisfied. No LLM in the loop. Deterministic.
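A minimal sketch of what such a verifier could look like for the example problem above. The function names are hypothetical, and the satisfying assignment was worked out by hand for this sketch rather than quoted from the experiment logs.

```python
def check_constraints(v):
    """Evaluate the 12 example constraints; returns a list of booleans."""
    A, B, C, D, E, F = (v[k] for k in "ABCDEF")
    return [
        A * D == 21,                  # 1
        C == A + F,                   # 2
        B * F == 20,                  # 3
        E == A + 2,                   # 4
        D + B == A,                   # 5
        E - F == B,                   # 6
        A % 7 == 0,                   # 7
        C % 4 == 0,                   # 8
        B * D == 12,                  # 9
        C > E > A > F > B > D,        # 10
        A + B + C + D + E + F == 40,  # 11
        E * D == 27,                  # 12
    ]

def verify(v):
    """Fraction of constraints satisfied, between 0.0 and 1.0."""
    results = check_constraints(v)
    return sum(results) / len(results)

solution = {"A": 7, "B": 4, "C": 12, "D": 3, "E": 9, "F": 5}
near_miss = dict(solution, C=13)  # breaks constraints 2, 8 and 11

print(verify(solution))   # 1.0
print(verify(near_miss))  # 0.75
```

Returning the per-constraint booleans separately from the score means the same check can also report which constraints failed, not just how many.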
Calibration
Easy problems first (3-5 variables, 5-9 constraints). Haiku solved them in one pass. All methods tied at 100%.
5-variable problems with 9 constraints: Pass@1 dropped to 65%. Self-consistency failed one problem. But BestOfN still tied MCTS, because with 8 independent samples at least one is usually correct. BestOfN just picks it.
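A back-of-the-envelope check on why eight samples are usually enough here. It assumes samples are independent and that every problem has the same per-sample success rate, which is a simplification, since the 65% figure is an average across problems.

```python
def p_at_least_one(p_single, n):
    """Chance that at least one of n independent samples is correct."""
    return 1 - (1 - p_single) ** n

print(round(p_at_least_one(0.65, 8), 4))  # 0.9998
```

Under those assumptions a verifier-backed picker almost always has a correct sample to choose from, which leaves guided search little room to add anything.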
I needed problems where blind sampling hits a ceiling.
Results
Ten harder problems: 6-8 variables, 12-15 constraints. Products, modular arithmetic, ordering chains, cascading dependencies. Pass@1 dropped to 29%.
| Method | Solve Rate | Avg Score |
|---|---|---|
| Pass@1 | 29% | 0.29 |
| Pass@8 oracle | 90% | 0.90 |
| SC@8 | 90% | 0.90 |
| BestOfN@8 | 90% | 0.90 |
| MCTS(5+3) | 100% | 1.00 |
MCTS solved all 10. Every 8-sample baseline failed on the same single problem (v6_3).
v6_3 is a 6-variable, 12-constraint problem where none of 8 independent samples found the correct solution. Pass@8 oracle: 0/8. Self-consistency picks the most popular wrong answer. BestOfN picks the best wrong answer. Both fail.
MCTS sees that initial attempts satisfied 10/12 constraints but violated specific ones. UCB1 selects the most promising partial solution. The next attempt, informed by the failure pattern, satisfies all 12.
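The selection step can be sketched with the standard UCB1 formula, mean score plus an exploration bonus. The node layout, the exploration constant `c`, and the example scores are assumptions for illustration. The post does not specify them, and the sketch does not cover how failure feedback is passed into the next prompt.

```python
import math

def ucb1(mean_score, visits, total_visits, c=math.sqrt(2)):
    """Standard UCB1 value; unvisited nodes are always tried first."""
    if visits == 0:
        return math.inf
    return mean_score + c * math.sqrt(math.log(total_visits) / visits)

def select(nodes, c=math.sqrt(2)):
    """Pick the node with the highest UCB1 value (first wins ties)."""
    total = sum(n["visits"] for n in nodes)

    def value(n):
        mean = n["total_score"] / n["visits"] if n["visits"] else 0.0
        return ucb1(mean, n["visits"], total, c)

    return max(nodes, key=value)

# Hypothetical state after 5 initial attempts, each scored once by the verifier.
nodes = [
    {"name": "attempt_1", "visits": 1, "total_score": 8 / 12},
    {"name": "attempt_2", "visits": 1, "total_score": 10 / 12},
    {"name": "attempt_3", "visits": 1, "total_score": 6 / 12},
    {"name": "attempt_4", "visits": 1, "total_score": 9 / 12},
    {"name": "attempt_5", "visits": 1, "total_score": 10 / 12},
]

print(select(nodes)["name"])  # attempt_2
```

With equal visit counts the exploration bonus is identical for every node, so the highest verifier score wins. The bonus only starts to matter once some partial solutions have been expanded more than others.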
Total: $0.48. 180 API calls, about 190K tokens.
When MCTS Helps
The pattern across three rounds of experiments: