LLMs are surprisingly capable reasoners, but coaxing good reasoning out of them is still mostly an art. You tweak a prompt, run it, squint at the output, tweak again. It works, sort of, but it doesn’t scale and it doesn’t compose.
I wanted to try something different: treat prompting as a search problem.
The idea
MCTS (Monte Carlo Tree Search) is the algorithm that made AlphaGo work. It’s good at navigating large decision spaces where you can’t enumerate everything but you can sample and learn. Prompt engineering has exactly this structure. The space of possible prompts is enormous, you can evaluate any particular prompt by running it, and small changes can have outsized effects.
The key design choice is decomposing prompts into a 5-dimensional action space:
- Context: what background information to include
- Examples: which few-shot examples to provide
- Constraints: what guidelines or restrictions to specify
- Format: how to structure the output
- Reasoning: what thinking strategies to encourage
Each dimension has a discrete set of primitives. MCTS explores combinations of these primitives, learning which compositions produce better outputs.
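To make the structure concrete, here is a minimal sketch of what such an action space could look like. The primitive strings below are illustrative placeholders, not the system's actual primitives:

```python
# Hypothetical primitives per dimension -- the real sets are larger
# and defined in the paper, not here.
ACTION_SPACE = {
    "context": ["", "Background: this is a math word problem."],
    "examples": ["", "Example: Q: What is 2+2? A: 4."],
    "constraints": ["", "Show every intermediate step."],
    "format": ["", "Answer with a single number on the last line."],
    "reasoning": ["", "Think step by step before answering."],
}

def assemble_prompt(choice):
    """Join one chosen primitive per dimension into a prompt string."""
    return "\n".join(s for s in (choice[d] for d in ACTION_SPACE) if s)

# Even these tiny sets multiply: 2^5 = 32 combinations. Realistic
# primitive sets blow this up fast, which is why the system searches
# rather than enumerates.
n_combos = 1
for options in ACTION_SPACE.values():
    n_combos *= len(options)
```

A full prompt is then just one choice per dimension, which is what makes the space searchable: every node in the tree fixes a few dimensions and leaves the rest open.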
Why bother?
Hand-crafted prompts have a composability problem. You find a prompt that works for task A. You find another for task B. Combining them for task A+B often breaks both. There’s no principled way to compose prompt strategies.
A search-based approach sidesteps this. The system explores compositions directly and keeps what works. It doesn’t need your intuition about what “should” work together.
The search loop
```python
# Simplified conceptual flow
while not converged:
    # MCTS exploration: pick a promising prompt from the tree
    prompt = select_promising_prompt()
    # Try it with the LLM
    result = llm.generate(prompt)
    # Evaluate output quality
    score = evaluate(result)
    # Update the search tree along the selected path
    backpropagate(score)
```
Standard MCTS, adapted for discrete prompt composition instead of game moves. The tree nodes represent partial prompt specifications, and rollouts are actual LLM calls with evaluation.
Results
On reasoning benchmarks, the system finds prompts that outperform hand-crafted baselines by about 30%. More interesting than the numbers: it discovers strategies I wouldn’t have thought to try. Some of the effective compositions look strange to a human prompt engineer but work well empirically.
The system also composes across task types, finding prompt structures that generalize rather than overfitting to specific problems.
Where this is useful
Complex reasoning tasks (math, logic, planning) benefit most. If you’re in a domain where you don’t have deep prompt engineering experience, or you need prompts that adapt to changing task distributions, search beats intuition.
The paper
Full technical details, including the formal action space specification, MCTS adaptations, experimental methodology, ablation studies, and analysis of discovered strategies:
Discussion