MCTS-Reasoning: A Canonical Specification of
Monte Carlo Tree Search for LLM Reasoning
Abstract
This technical report provides a rigorous specification of Monte Carlo Tree Search (MCTS) applied to Large Language Model (LLM) reasoning. We present formal definitions of the search tree structure, the four phases of MCTS (Selection, Expansion, Simulation, Backpropagation), and their adaptation for step-by-step reasoning with language models. Key design choices—including tree-building rollouts and terminal-only evaluation—are explicitly documented and justified. The goal is to establish a canonical reference specification that authentically captures the MCTS algorithm while adapting it for the unique characteristics of LLM-based reasoning tasks.
1 Introduction
Monte Carlo Tree Search (MCTS) is a best-first search algorithm that uses random sampling to build a search tree and evaluate actions [1, 2]. Originally developed for game playing—notably achieving superhuman performance in Go [4]—MCTS has proven effective in domains where the state space is too large for exhaustive search but where states can be evaluated through simulation.
In this report, we apply MCTS to reasoning with Large Language Models. The key insight is that multi-step reasoning can be viewed as a sequential decision problem:
- States are partial reasoning traces (text strings)
- Actions are reasoning continuations (next steps)
- Terminal states are complete solutions with answers
- Rewards are quality assessments of final answers
This formulation allows MCTS to systematically explore different reasoning paths, allocating more search effort to promising directions while maintaining exploration of alternatives. Recent work has explored similar ideas under the names “Tree of Thoughts” [5] and “Reasoning via Planning” [6].
1.1 Goals of This Specification
1. Provide rigorous definitions of all components
2. Present the MCTS algorithm with precise pseudocode
3. Specify the adaptation to LLM reasoning with clear interfaces
4. Document design choices that distinguish this variant from classical MCTS
1.2 Scope and Assumptions
This specification covers single-threaded MCTS for text-based reasoning. We assume:
- The LLM produces valid text output for each query
- The Evaluator returns scores in $[0, 1]$
- State strings have bounded length (context window limits)
2 Preliminaries
2.1 Notation
We use the following notation throughout:
| Symbol | Meaning |
|---|---|
| $\mathcal{S}$ | State space (set of all possible reasoning traces) |
| $\mathcal{A}$ | Action space (set of all possible continuations) |
| $s$ | A state (partial or complete reasoning trace) |
| $a$ | An action (reasoning continuation) |
| $V$ | Set of nodes in the search tree |
| $v$ | A node in the search tree |
| $N(v)$ | Visit count of node $v$ |
| $W(v)$ | Cumulative value of node $v$ |
| $c$ | Exploration constant in UCB1 |
| $b$ | Branching factor bound (max children per node) |
| $d_{\max}$ | Maximum rollout depth |
3 Formal Definitions
Definition 3.1 (State).
A state $s \in \mathcal{S}$ is a string representing a partial or complete reasoning trace. A state consists of the concatenation of:
1. The original question
2. Zero or more reasoning steps
3. Optionally, a terminal marker with a final answer
Definition 3.2 (Terminal State).
A state $s$ is terminal if a terminal detector determines it contains a complete answer. Formally:
$$\text{Terminal}(s) \iff \text{IsTerminal}(s) = \text{true}$$
The detector also extracts the answer: $\text{ExtractAnswer}(s)$.
Definition 3.3 (Terminal Detector).
A terminal detector is a component that determines when reasoning is complete:
- $\text{IsTerminal}(s) \in \{\text{true}, \text{false}\}$: determines if $s$ is terminal
- $\text{ExtractAnswer}(s)$: extracts the answer from terminal $s$
- $\text{FormatInstruction}()$: provides prompt guidance for the LLM
Remark 3.1 (Terminal Detector Implementations).
The canonical implementation uses a marker-based detector that looks for the string “ANSWER:”. Alternative implementations include:
- BoxedDetector: Looks for LaTeX-style \boxed{...} (common in math benchmarks)
- MultiMarkerDetector: Accepts multiple completion signals
- LLM-as-Judge: Asks an LLM whether reasoning is complete (more expensive)
The terminal detector’s FormatInstruction is included in prompts so the LLM knows the expected answer format.
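As a concrete illustration, the canonical marker-based detector can be sketched as follows. This is a minimal sketch; the class and method names (`MarkerDetector`, `is_terminal`, `extract_answer`, `format_instruction`) are illustrative, not mandated by this specification.

```python
class MarkerDetector:
    """Terminal detector that looks for a completion marker in the state."""

    def __init__(self, marker: str = "ANSWER:"):
        self.marker = marker

    def is_terminal(self, state: str) -> bool:
        # A state is terminal once it contains the completion marker.
        return self.marker in state

    def extract_answer(self, state: str) -> str:
        # Take the text after the last occurrence of the marker.
        return state.rsplit(self.marker, 1)[1].strip()

    def format_instruction(self) -> str:
        # Guidance included in prompts so the LLM emits the marker.
        return (f'When you reach a final answer, write "{self.marker} " '
                f"followed by your answer.")
```

A `BoxedDetector` or `MultiMarkerDetector` would implement the same three-method interface with different matching logic.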
Definition 3.4 (Action).
An action $a$ is an operation that transforms a state into a new state. Formally:
$$a : \mathcal{S} \to \mathcal{S}$$
The canonical action is Continue, which prompts the LLM to generate the next reasoning step:
$$\text{Continue}(s) = s \oplus \text{LLM}(s)$$
where $\oplus$ denotes string concatenation with appropriate formatting.
Definition 3.5 (Action Space).
The action space $\mathcal{A}(s) \subseteq \mathcal{A}$ is the set of actions available at state $s$. In the canonical implementation, $\mathcal{A}(s) = \{\text{Continue}\}$ for every non-terminal $s$.
Remark 3.2 (State-Dependent Actions).
In standard MCTS for games, different positions have different legal moves. Similarly, different reasoning states may have different available actions. The canonical implementation uses only Continue, but extensions (Section 8) may include additional actions such as Compress for long traces.
Definition 3.6 (Node).
A node $v$ in the search tree consists of:
- $s(v)$: the reasoning trace at this node
- $N(v)$: the number of times this node has been visited
- $W(v)$: the cumulative value from simulations through this node
- $\text{children}(v)$: the set of child nodes
- $\text{parent}(v)$: the parent node (or $\bot$ for the root)
- $\text{terminal}(v)$: whether this is a terminal state
- $\text{answer}(v)$: the extracted answer (if terminal)
Definition 3.7 (Search Tree).
A search tree $T = (V, E)$ is a rooted tree where:
- $V$ is the set of nodes
- $E \subseteq V \times V$ is the edge set
- $v_0 \in V$ is the root node, with $s(v_0)$ equal to the original question
The tree is an arborescence: every node except the root has exactly one parent, and there is a unique path from $v_0$ to any node.
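Definitions 3.6 and 3.7 map directly onto a simple data structure. The following is a minimal sketch using a plain Python dataclass; the field names mirror the definition but are otherwise illustrative.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    """One node of the search tree (Definition 3.6)."""
    state: str                        # s(v): reasoning trace at this node
    parent: Optional["Node"] = None   # None plays the role of ⊥ for the root
    children: List["Node"] = field(default_factory=list)
    visits: int = 0                   # N(v)
    value: float = 0.0                # W(v): cumulative value
    terminal: bool = False
    answer: Optional[str] = None      # extracted answer, if terminal

    def average_value(self) -> float:
        # Definition 3.8: undefined for unvisited nodes.
        if self.visits == 0:
            raise ValueError("average value undefined for N(v) = 0")
        return self.value / self.visits
```

The arborescence property of Definition 3.7 is maintained implicitly: each node stores exactly one parent pointer, and children are only ever appended under a single parent.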
Definition 3.8 (Average Value).
The average value of a node $v$ with $N(v) > 0$ is:
$$\bar{V}(v) = \frac{W(v)}{N(v)}$$
For $N(v) = 0$, the average value is undefined.
Definition 3.9 (Branching Factor Bound).
The branching factor bound $b$ is the maximum number of children any node may have. A node $v$ is fully expanded if and only if:
$$|\text{children}(v)| = b$$
4 The MCTS Algorithm
MCTS builds a search tree incrementally through repeated simulations. Each simulation consists of four phases: Selection, Expansion, Rollout, and Backpropagation.
4.1 Phase 1: Selection
Selection traverses the tree from root to a leaf, choosing children according to the UCB1 policy.
Definition 4.1 (UCB1).
For a non-root node $v$ with parent $u$, the Upper Confidence Bound is:
$$\text{UCB1}(v) = \frac{W(v)}{N(v)} + c \sqrt{\frac{\ln N(u)}{N(v)}}$$
where $c > 0$ is the exploration constant. The first term is the exploitation component; the second term is the exploration component.
Remark 4.1 (Exploration Constant).
The theoretical value $c = \sqrt{2}$ derives from the UCB1 bandit algorithm [1]. In practice, $c$ may be tuned: larger values encourage exploration, smaller values favor exploitation.
Remark 4.2 (Tie-Breaking).
When multiple children have equal UCB1 values (including multiple unvisited children with $N(v) = 0$, whose UCB1 value is taken to be $+\infty$), we select uniformly at random among them.
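The selection rule of Definition 4.1 together with the tie-breaking of Remark 4.2 can be sketched as follows. Nodes are represented as plain dictionaries for brevity; the field names (`visits`, `value`) are illustrative.

```python
import math
import random

def ucb1(visits: int, value: float, parent_visits: int,
         c: float = math.sqrt(2)) -> float:
    """UCB1 score of a child (Definition 4.1)."""
    if visits == 0:
        return math.inf  # unvisited children are selected first
    return value / visits + c * math.sqrt(math.log(parent_visits) / visits)

def select_child(children, parent_visits, c=math.sqrt(2)):
    """Pick the child maximizing UCB1, breaking ties uniformly at random."""
    scores = [ucb1(ch["visits"], ch["value"], parent_visits, c)
              for ch in children]
    best = max(scores)
    tied = [ch for ch, s in zip(children, scores) if s == best]
    return random.choice(tied)  # Remark 4.2
```

Selection applies `select_child` repeatedly from the root until reaching a node that is not fully expanded or is terminal.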
4.2 Phase 2: Expansion
Expansion adds a new child node by generating a continuation from the Generator.
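A minimal sketch of the expansion phase: one Generator call produces a new child of the selected node. Nodes are plain dictionaries here, and `generate` / `is_terminal` are stand-ins for the Generator and terminal detector defined elsewhere in this report.

```python
def expand(node, generate, is_terminal):
    """Add one new child to `node` by generating a continuation."""
    step = generate(node["state"])
    child = {"state": node["state"] + "\n" + step,
             "parent": node, "children": [],
             "visits": 0, "value": 0.0}
    # Check completion immediately so the rollout can stop early.
    child["terminal"] = is_terminal(child["state"])
    node["children"].append(child)
    return child
```

In the full algorithm, expansion is only attempted on nodes that are not yet fully expanded in the sense of Definition 3.9.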
4.3 Phase 3: Rollout (Simulation)
The rollout phase continues from the expanded node until reaching a terminal state or the maximum depth.
Remark 4.3 (Design Choice: Tree-Building Rollout).
In classical MCTS for games, rollouts typically use a fast, random policy without adding nodes to the tree. However, for LLM reasoning, we make a deliberate design choice to add rollout nodes to the tree. This preserves the full reasoning trace and allows:
1. Reuse of reasoning steps in future simulations
2. Inspection of the complete reasoning path
3. Consistent state representation throughout the tree
This is sometimes called “tree policy all the way” or “persistent tree” MCTS.
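The tree-building rollout can be sketched as follows. As in the other sketches, nodes are plain dictionaries, and `generate` / `is_terminal` are stand-ins for the Generator and terminal detector; every generated step becomes a persistent tree node rather than a throwaway simulation state.

```python
def rollout(node, generate, is_terminal, max_depth: int):
    """Continue from `node` until a terminal state or max_depth steps."""
    depth = 0
    while not is_terminal(node["state"]) and depth < max_depth:
        step = generate(node["state"])
        child = {"state": node["state"] + "\n" + step,
                 "parent": node, "children": [],
                 "visits": 0, "value": 0.0}
        node["children"].append(child)  # persisted in the tree
        node = child
        depth += 1
    return node  # terminal node, or the node reached at max_depth
```

The returned node is the one handed to the Evaluator (if terminal) and to backpropagation.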
4.4 Phase 4: Backpropagation
Backpropagation updates visit counts and values along the path from the terminal node to the root.
Property 4.1 (Visit Count Invariant).
For any node $v$ in the tree:
$$N(v) = \sum_{v' \in \text{children}(v)} N(v') + \sum_{i} \mathbb{1}[\text{simulation } i \text{ terminated at } v]$$
where $\mathbb{1}[\cdot]$ is the indicator function. In particular, $N(v_0) = n$ after $n$ simulations.
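A minimal sketch of backpropagation over parent pointers; nodes are plain dictionaries with illustrative field names (`parent`, `visits`, `value`).

```python
def backpropagate(node, reward: float):
    """Walk from the simulation's final node to the root,
    incrementing N(v) and accumulating W(v) at every ancestor."""
    while node is not None:
        node["visits"] += 1
        node["value"] += reward
        node = node["parent"]
```

Because every simulation ends with exactly one such walk, the visit count invariant of Property 4.1 holds by construction.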
5 Application to LLM Reasoning
The MCTS algorithm requires two external components: a Generator to produce continuations and an Evaluator to score terminal states.
5.1 Generator
Definition 5.1 (Generator).
A Generator is a function that produces reasoning continuations:
$$G : \mathcal{S} \to \Sigma^*$$
where $\Sigma^*$ denotes the set of text strings. Given a state $s$, it returns a continuation $G(s)$. The continuation may result in a terminal state (if it contains the answer marker) or a non-terminal state (requiring further reasoning).
For LLM-based generation, we use the following procedure:
Definition 5.2 (Prompt Template).
The prompt template formats the current state for the LLM:
You are solving a problem step by step.
Current reasoning:
{state}
Continue with the next step. When you reach a final answer,
write "ANSWER: " followed by your answer.
Next step:
The sampling temperature controls diversity: higher temperatures yield more varied continuations.
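The template above can be wired into a Generator as follows; `llm` is a hypothetical callable from prompt text to completion text, standing in for an actual model API.

```python
# Prompt template of Definition 5.2, reproduced verbatim.
PROMPT_TEMPLATE = """You are solving a problem step by step.
Current reasoning:
{state}
Continue with the next step. When you reach a final answer,
write "ANSWER: " followed by your answer.
Next step:"""

def generate_step(state: str, llm) -> str:
    """Format the current state into the template and query the model."""
    prompt = PROMPT_TEMPLATE.format(state=state)
    return llm(prompt).strip()
```

In practice `llm` would also carry sampling parameters such as temperature; they are omitted here to keep the interface minimal.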
5.2 Evaluator
Definition 5.3 (Evaluator).
An Evaluator is a function that scores terminal states:
$$E : \{s \in \mathcal{S} : \text{Terminal}(s)\} \to [0, 1]$$
Given a terminal state and its extracted answer, it returns a quality score.
Remark 5.1 (Terminal-Only Evaluation).
The Evaluator is invoked only on terminal states. This design choice reduces computational cost: intermediate reasoning states are not evaluated, and LLM-as-judge calls occur only when a complete answer is produced.
Alternative Evaluators:
- Ground Truth: Compare answer to known correct answer with normalization (lowercase, strip punctuation). Supports partial credit when the answer contains the truth or vice versa.
- Numeric: For mathematical problems, compares numeric values with configurable tolerance (absolute and relative). Extracts numbers from text, handles fractions (e.g., “1/2”) and scientific notation, and provides partial credit based on relative error.
- Process: Evaluates reasoning quality using heuristics: presence of step-by-step structure, mathematical notation, verification statements, and logical connectives. Can be combined with an answer evaluator via weighted sum.
- Composite: Weighted combination of multiple evaluators. For example: $E(s) = \sum_i w_i E_i(s)$ with $\sum_i w_i = 1$.
- Self-Consistency: Score based on agreement with other sampled solutions (not implemented in canonical version).
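As one concrete example, the Ground Truth evaluator described above can be sketched as follows; the 0.5 partial-credit score is an illustrative choice, not fixed by this specification.

```python
import string

def normalize(text: str) -> str:
    """Lowercase, strip whitespace, and remove punctuation."""
    return text.lower().strip().translate(
        str.maketrans("", "", string.punctuation))

def ground_truth_score(answer: str, truth: str) -> float:
    """Score in [0, 1]: exact normalized match, partial containment, or 0."""
    a, t = normalize(answer), normalize(truth)
    if a == t:
        return 1.0
    if a and t and (t in a or a in t):
        return 0.5  # partial credit for containment in either direction
    return 0.0
```

A Composite evaluator would combine scores like this one with, say, a Process score via the weighted sum given above.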
6 Result Extraction
After search completes, we extract the best answer from the tree.
Remark 6.1.
Following most-visited children (rather than highest-value) is standard practice in MCTS, as visit counts are more robust to value noise than raw averages.
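The extraction procedure of Remark 6.1 can be sketched as follows, with nodes as plain dictionaries and illustrative field names: descend from the root by most-visited child until reaching a terminal node or a leaf.

```python
def best_path(root):
    """Follow most-visited children from the root (Remark 6.1)."""
    node = root
    path = [node]
    while node["children"] and not node.get("terminal", False):
        node = max(node["children"], key=lambda ch: ch["visits"])
        path.append(node)
    return path  # path[-1] holds the selected state and answer
```

The final node of the returned path supplies the reported answer; the full path is the reasoning trace presented to the user.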
7 Complexity Analysis
Property 7.1 (Time Complexity).
For $n$ simulations with maximum depth $d_{\max}$:
- Selection: $O(d_{\max})$ node traversals per simulation
- Expansion: $O(1)$ tree operations plus one Generator call
- Rollout: $O(d_{\max})$ Generator calls (worst case)
- Backpropagation: $O(d_{\max})$ node updates
Total tree operations: $O(n \, d_{\max})$. The dominant cost is typically LLM calls: up to $O(n \, d_{\max})$ Generator calls and $O(n)$ Evaluator calls.
Property 7.2 (Space Complexity).
The search tree contains at most $O(n \, d_{\max})$ nodes, each storing a state string. With bounded state length $L$, space complexity is $O(n \, d_{\max} \, L)$.
8 Extensions
The following extensions are under consideration for future versions:
8.1 Extended Action Spaces
The canonical implementation uses only the Continue action. Extended action spaces may include:
- Compress: When the reasoning trace exceeds a threshold length, the Compress action summarizes the trace into a more compact representation. Subsequent reasoning continues from the compressed state, reducing context length while preserving key insights.
- Verify: Ask the LLM to verify the current reasoning before continuing
- ToolCall: Invoke external tools (calculator, search, code execution)
- Backtrack: Explicitly mark a reasoning path as unpromising
8.2 Algorithm Variants
- Progressive Widening: Dynamically adjust the branching bound $b$ based on visit count
- RAVE: Rapid Action Value Estimation for faster convergence
- Parallel MCTS: Root parallelization or tree parallelization
- Value Networks: Learned evaluation functions for intermediate states
- Policy Networks: Learned action selection to guide expansion
- Self-Consistency Integration: Aggregate across multiple terminal states
- Beam Search Hybrid: Combine MCTS exploration with beam search
8.3 Beyond Trees: Graph-Based Reasoning
The tree structure of MCTS imposes fundamental constraints on reasoning patterns. Certain operations naturally suggest non-tree structures:
- Problem Decomposition: Splitting a problem into independent subproblems, solving each separately, and combining results creates a DAG structure where the “combine” node has multiple parents.
- Critique-Revision Cycles: Evaluating a solution, identifying issues, and refining creates potential cycles that trees cannot represent.
- Cross-Path Information Sharing: Using insights from one reasoning path to inform another requires edges between branches.
These patterns require graph-based or hypergraph-based search with:
- Multiple edge types (continuation, decomposition, aggregation, critique)
- Generalized propagation beyond simple backpropagation
- Synchronization semantics for subproblem completion
Such extensions are outside the scope of this tree-based MCTS specification but represent promising directions for future work on structured reasoning.
Remark 8.1 (Natural Decomposition within MCTS).
Note that the LLM within a Continue action can naturally perform soft versions of these operations: reasoning about subproblems sequentially within a single path, self-critiquing before continuing, or reconsidering earlier steps. The tree structure captures alternative reasoning paths, while complex reasoning within a path remains expressible through the continuation mechanism.
9 Conclusion
This report establishes a canonical specification for MCTS applied to LLM reasoning. The key contributions are:
1. Rigorous formal definitions of states, actions, nodes, and the search tree
2. Precise specification of the four MCTS phases with pseudocode
3. Clear interfaces for Generator and Evaluator components
4. Explicit documentation of design choices (tree-building rollouts, terminal-only evaluation, bounded branching)
The specification serves as a reference for implementation, ensuring the algorithm is authentically represented while being appropriately adapted for language model reasoning tasks.
References
- [1] Kocsis, L., & Szepesvári, C. (2006). Bandit based monte-carlo planning. European Conference on Machine Learning, 282–293.
- [2] Coulom, R. (2006). Efficient selectivity and backup operators in monte-carlo tree search. International Conference on Computers and Games, 72–83.
- [3] Browne, C. B., et al. (2012). A survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1), 1–43.
- [4] Silver, D., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484–489.
- [5] Yao, S., et al. (2023). Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601.
- [6] Hao, S., et al. (2023). Reasoning with language model is planning with world model. arXiv preprint arXiv:2305.14992.