MCTS-Reasoning: A Canonical Specification of
Monte Carlo Tree Search for LLM Reasoning

Technical Report v0.4
(December 3, 2025)
Abstract

This technical report provides a rigorous specification of Monte Carlo Tree Search (MCTS) applied to Large Language Model (LLM) reasoning. We present formal definitions of the search tree structure, the four phases of MCTS (Selection, Expansion, Simulation, Backpropagation), and their adaptation for step-by-step reasoning with language models. Key design choices—including tree-building rollouts and terminal-only evaluation—are explicitly documented and justified. The goal is to establish a canonical reference specification that authentically captures the MCTS algorithm while adapting it for the unique characteristics of LLM-based reasoning tasks.

1 Introduction

Monte Carlo Tree Search (MCTS) is a best-first search algorithm that uses random sampling to build a search tree and evaluate actions [1, 2]. Originally developed for game playing—notably achieving superhuman performance in Go [4]—MCTS has proven effective in domains where the state space is too large for exhaustive search but where states can be evaluated through simulation.

In this report, we apply MCTS to reasoning with Large Language Models. The key insight is that multi-step reasoning can be viewed as a sequential decision problem:

  • States are partial reasoning traces (text strings)

  • Actions are reasoning continuations (next steps)

  • Terminal states are complete solutions with answers

  • Rewards are quality assessments of final answers

This formulation allows MCTS to systematically explore different reasoning paths, allocating more search effort to promising directions while maintaining exploration of alternatives. Recent work has explored similar ideas under the names “Tree of Thoughts” [5] and “Reasoning via Planning” [6].

1.1 Goals of This Specification

  1. Provide rigorous definitions of all components

  2. Present the MCTS algorithm with precise pseudocode

  3. Specify the adaptation to LLM reasoning with clear interfaces

  4. Document design choices that distinguish this variant from classical MCTS

1.2 Scope and Assumptions

This specification covers single-threaded MCTS for text-based reasoning. We assume:

  • The LLM produces valid text output for each query

  • The Evaluator returns scores in [0,1]

  • State strings have bounded length (context window limits)

2 Preliminaries

2.1 Notation

We use the following notation throughout:

Symbol   Meaning
𝒮        State space (set of all possible reasoning traces)
𝒜        Action space (set of all possible continuations)
s ∈ 𝒮    A state (partial or complete reasoning trace)
a ∈ 𝒜    An action (reasoning continuation)
𝒩        Set of nodes in the search tree
n ∈ 𝒩    A node in the search tree
ν(n)     Visit count of node n
v(n)     Cumulative value of node n
c        Exploration constant in UCB1
B        Branching factor bound (max children per node)
D        Maximum rollout depth

3 Formal Definitions

Definition 3.1 (State).

A state s ∈ 𝒮 is a string representing a partial or complete reasoning trace. A state consists of the concatenation of:

  1. The original question

  2. Zero or more reasoning steps

  3. Optionally, a terminal marker with a final answer

Definition 3.2 (Terminal State).

A state s is terminal if a terminal detector determines it contains a complete answer. Formally:

IsTerminal(s)=TerminalDetector.Check(s)

The detector also extracts the answer: ExtractAnswer(s) ∈ Answer ∪ {⊥}.

Definition 3.3 (Terminal Detector).

A terminal detector is a component that determines when reasoning is complete:

  • Check(s) ∈ {true, false}: determines if s is terminal

  • ExtractAnswer(s) ∈ Answer: extracts the answer from terminal s

  • FormatInstruction() ∈ String: provides prompt guidance for the LLM

Remark 3.1 (Terminal Detector Implementations).

The canonical implementation uses a marker-based detector that looks for the string “ANSWER:”. Alternative implementations include:

  • BoxedDetector: Looks for LaTeX-style \boxed{...} (common in math benchmarks)

  • MultiMarkerDetector: Accepts multiple completion signals

  • LLM-as-Judge: Asks an LLM whether reasoning is complete (more expensive)

The terminal detector’s FormatInstruction is included in prompts so the LLM knows the expected answer format.
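As a concrete illustration, the canonical marker-based detector can be sketched in a few lines of Python. Class and method names here are our own; the specification prescribes only the three-operation interface of Definition 3.3.

```python
class MarkerTerminalDetector:
    """Marker-based terminal detector (Definition 3.3): a state is
    terminal once it contains the "ANSWER:" marker. An illustrative
    sketch, not a prescribed implementation."""

    MARKER = "ANSWER:"

    def check(self, state: str) -> bool:
        # Check(s): terminal iff the marker appears anywhere in the trace.
        return self.MARKER in state

    def extract_answer(self, state: str):
        # ExtractAnswer(s): text after the last marker, up to end of line.
        # Returns None (standing in for the ⊥ case) when non-terminal.
        if not self.check(state):
            return None
        tail = state.rsplit(self.MARKER, 1)[1].strip()
        return tail.splitlines()[0].strip() if tail else None

    def format_instruction(self) -> str:
        # FormatInstruction(): guidance included in Generator prompts.
        return ('When you reach a final answer, write "ANSWER: " '
                'followed by your answer.')
```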

Definition 3.4 (Action).

An action a is an operation that transforms a state into a new state. Formally:

a : 𝒮 → 𝒮

The canonical action is Continue, which prompts the LLM to generate the next reasoning step:

Continue(s) = s ⊕ 𝒢(s)

where ⊕ denotes string concatenation with appropriate formatting.

Definition 3.5 (Action Space).

The action space 𝒜(s) is the set of actions available at state s:

𝒜(s) = ∅            if IsTerminal(s)
𝒜(s) = {Continue}   otherwise (canonical case)
Remark 3.2 (State-Dependent Actions).

In standard MCTS for games, different positions have different legal moves. Similarly, different reasoning states may have different available actions. The canonical implementation uses only Continue, but extensions (Section 8) may include additional actions such as Compress for long traces.

Definition 3.6 (Node).

A node n in the search tree consists of:

  • state(n) ∈ 𝒮: the reasoning trace at this node

  • ν(n) ≥ 0: the number of times this node has been visited

  • v(n) ≥ 0: the cumulative value from simulations through this node

  • children(n) ⊆ 𝒩: the set of child nodes

  • parent(n) ∈ 𝒩 ∪ {⊥}: the parent node (or ⊥ for the root)

  • terminal(n) ∈ {true, false}: whether this is a terminal state

  • answer(n): the extracted answer (if terminal)

Definition 3.7 (Search Tree).

A search tree 𝒯 = (𝒩, E, r) is a rooted tree where:

  • 𝒩 is the set of nodes

  • E = {(parent(n), n) : n ∈ 𝒩, parent(n) ≠ ⊥} is the edge set

  • r ∈ 𝒩 is the root node with parent(r) = ⊥

The tree is an arborescence: every node except the root has exactly one parent, and there is a unique path from r to any node.

Definition 3.8 (Average Value).

The average value of a node n with ν(n) > 0 is:

v̄(n) = v(n) / ν(n)

For ν(n) = 0, the average value is undefined.

Definition 3.9 (Branching Factor Bound).

The branching factor bound B is the maximum number of children any node may have. A node n is fully expanded if and only if:

FullyExpanded(n) = (|children(n)| ≥ B) ∨ IsTerminal(state(n))
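Definitions 3.6–3.9 map directly onto a small data structure. The following Python sketch uses illustrative field and method names; only the quantities themselves (state, ν, v, children, parent, terminal flag, answer) come from the specification.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    """Search-tree node per Definition 3.6 (an illustrative sketch)."""
    state: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0        # ν(n)
    value: float = 0.0     # v(n), cumulative value
    terminal: bool = False
    answer: Optional[str] = None

    def average_value(self) -> float:
        # v̄(n) = v(n) / ν(n); undefined (here: an error) when ν(n) = 0.
        if self.visits == 0:
            raise ValueError("average value undefined for unvisited node")
        return self.value / self.visits

    def fully_expanded(self, branching_bound: int) -> bool:
        # FullyExpanded(n) = (|children(n)| >= B) or IsTerminal(state(n))
        return len(self.children) >= branching_bound or self.terminal
```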

4 The MCTS Algorithm

MCTS builds a search tree incrementally through repeated simulations. Each simulation consists of four phases: Selection, Expansion, Rollout, and Backpropagation.

Algorithm 1 MCTS Main Loop
1:  Input: question q, number of simulations K, branching factor B, max depth D
2:  Output: search tree with root r
3:  procedure Search(q, K, B, D)
4:      r ← CreateNode(q)                      ▷ Root state is the question
5:      for i = 1 to K do
6:          n_leaf ← Select(r)
7:          n_exp ← Expand(n_leaf, B)
8:          (n_term, v) ← Rollout(n_exp, D)
9:          Backpropagate(n_term, v)
10:     end for
11:     return r
12: end procedure

4.1 Phase 1: Selection

Selection traverses the tree from root to a leaf, choosing children according to the UCB1 policy.

Definition 4.1 (UCB1).

For a non-root node n with parent p=parent(n), the Upper Confidence Bound is:

UCB1(n) = +∞                                if ν(n) = 0
UCB1(n) = v̄(n) + c · √( ln ν(p) / ν(n) )    if ν(n) > 0

where c>0 is the exploration constant. The first term v¯(n) is the exploitation component; the second term is the exploration component.

Remark 4.1 (Exploration Constant).

The theoretical value c = √2 derives from the UCB1 bandit algorithm [1]. In practice, c may be tuned: larger values encourage exploration, smaller values favor exploitation.

Remark 4.2 (Tie-Breaking).

When multiple children have equal UCB1 values (including multiple unvisited children with UCB1 = +∞), we select uniformly at random among them.
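Definition 4.1 and the tie-breaking rule can be sketched as follows. The `(visits, cumulative value)` pair representation of children is an illustrative assumption, not prescribed by the specification.

```python
import math
import random

def ucb1(child_visits, child_value, parent_visits, c=math.sqrt(2)):
    """UCB1 per Definition 4.1: +inf for unvisited children, otherwise
    average value plus the exploration bonus."""
    if child_visits == 0:
        return math.inf
    return (child_value / child_visits
            + c * math.sqrt(math.log(parent_visits) / child_visits))

def select_child(children, parent_visits, c=math.sqrt(2), rng=random):
    """Index of an argmax child under UCB1, ties broken uniformly at
    random. `children` is a list of (visits, cumulative value) pairs."""
    scores = [ucb1(v, w, parent_visits, c) for v, w in children]
    best = max(scores)
    tied = [i for i, s in enumerate(scores) if s == best]
    return rng.choice(tied)
```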

Algorithm 2 Selection Phase
1: Input: node n (typically root)
2: Output: leaf node for expansion
3: procedure Select(n)
4:     while FullyExpanded(n) and children(n) ≠ ∅ do
5:         n ← argmax_{c ∈ children(n)} UCB1(c)    ▷ Break ties randomly
6:     end while
7:     return n
8: end procedure
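The selection descent combining Algorithm 2 with Definition 4.1 might look like the following; dict-based nodes are an illustrative choice.

```python
import math
import random

def select(node, branching_bound, c=math.sqrt(2), rng=random):
    """Selection per Algorithm 2: while the current node is fully
    expanded and has children, move to the UCB1-maximizing child,
    breaking ties at random. A sketch using dict nodes."""
    while node["children"] and (len(node["children"]) >= branching_bound
                                or node["terminal"]):
        scores = []
        for child in node["children"]:
            if child["visits"] == 0:
                scores.append(math.inf)  # unvisited: UCB1 = +inf
            else:
                scores.append(child["value"] / child["visits"]
                              + c * math.sqrt(math.log(node["visits"])
                                              / child["visits"]))
        best = max(scores)
        tied = [ch for ch, s in zip(node["children"], scores) if s == best]
        node = rng.choice(tied)
    return node
```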

4.2 Phase 2: Expansion

Expansion adds a new child node by generating a continuation from the Generator.

Algorithm 3 Expansion Phase
1:  Input: node n, branching factor bound B
2:  Output: expanded node (new child, or n if unexpandable)
3:  procedure Expand(n, B)
4:      if IsTerminal(state(n)) then
5:          return n                     ▷ Cannot expand terminal nodes
6:      end if
7:      if |children(n)| ≥ B then
8:          return n                     ▷ Fully expanded
9:      end if
10:     a ← 𝒢.Generate(state(n))         ▷ Get one continuation
11:     s′ ← state(n) ⊕ a
12:     n′ ← CreateNode(s′)
13:     parent(n′) ← n
14:     children(n) ← children(n) ∪ {n′}
15:     return n′
16: end procedure

4.3 Phase 3: Rollout (Simulation)

The rollout phase continues from the expanded node until reaching a terminal state or the maximum depth.

Remark 4.3 (Design Choice: Tree-Building Rollout).

In classical MCTS for games, rollouts typically use a fast, random policy without adding nodes to the tree. However, for LLM reasoning, we make a deliberate design choice to add rollout nodes to the tree. This preserves the full reasoning trace and allows:

  1. Reuse of reasoning steps in future simulations

  2. Inspection of the complete reasoning path

  3. Consistent state representation throughout the tree

This is sometimes called “tree policy all the way” or “persistent tree” MCTS.

Algorithm 4 Rollout Phase (Tree-Building)
1:  Input: node n, maximum depth D
2:  Output: terminal node n_term and its value v
3:  procedure Rollout(n, D)
4:      current ← n
5:      depth ← 0
6:      while ¬IsTerminal(state(current)) and depth < D do
7:          a ← 𝒢.Generate(state(current))
8:          s′ ← state(current) ⊕ a
9:          n′ ← CreateNode(s′)
10:         parent(n′) ← current
11:         children(current) ← children(current) ∪ {n′}
12:         current ← n′
13:         depth ← depth + 1
14:     end while
15:     if IsTerminal(state(current)) then
16:         v ← ℰ.Evaluate(state(current), answer(current))
17:     else
18:         v ← 0                  ▷ No terminal reached within depth limit
19:     end if
20:     return (current, v)
21: end procedure
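A direct transcription of Algorithm 4, with the Generator, terminal detector, and Evaluator supplied as caller-provided functions (all names and the dict node representation are illustrative):

```python
def rollout(node, generate, is_terminal, evaluate, max_depth):
    """Tree-building rollout per Algorithm 4: every generated step is
    added to the tree as a new node. A sketch; `generate`,
    `is_terminal`, and `evaluate` are caller-supplied stubs."""
    current, depth = node, 0
    while not is_terminal(current["state"]) and depth < max_depth:
        step = generate(current["state"])
        child = {"state": current["state"] + "\n" + step,
                 "parent": current, "children": [],
                 "visits": 0, "value": 0.0}
        current["children"].append(child)
        current, depth = child, depth + 1
    if is_terminal(current["state"]):
        return current, evaluate(current["state"])
    return current, 0.0  # depth limit reached without a terminal state
```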

4.4 Phase 4: Backpropagation

Backpropagation updates visit counts and values along the path from the terminal node to the root.

Algorithm 5 Backpropagation Phase
1: Input: terminal node n_term, value v
2: procedure Backpropagate(n_term, v)
3:     n ← n_term
4:     while n ≠ ⊥ do
5:         ν(n) ← ν(n) + 1
6:         v(n) ← v(n) + v
7:         n ← parent(n)
8:     end while
9: end procedure
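Algorithm 5 reduces to a short loop over parent pointers. A minimal sketch with dict-based nodes (an illustrative representation):

```python
def backpropagate(node, value):
    """Backpropagation per Algorithm 5: walk parent pointers to the
    root, incrementing the visit count and accumulating the value."""
    while node is not None:  # None stands in for the ⊥ parent of the root
        node["visits"] += 1
        node["value"] += value
        node = node["parent"]
```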
Property 4.1 (Visit Count Invariant).

For any node n in the tree:

ν(n) = Σ_{c ∈ children(n)} ν(c) + (number of simulations whose rollout ended at n)

In particular, ν(root) = K after K simulations.

5 Application to LLM Reasoning

The MCTS algorithm requires two external components: a Generator to produce continuations and an Evaluator to score terminal states.

5.1 Generator

Definition 5.1 (Generator).

A Generator 𝒢 is a function that produces reasoning continuations:

𝒢 : 𝒮 → 𝒜

Given a state s, it returns a continuation a. The continuation may result in a terminal state (if it contains the answer marker) or a non-terminal state (requiring further reasoning).

For LLM-based generation, we use the following procedure:

Algorithm 6 LLM Generator
1: Input: state s, temperature τ
2: Output: continuation a
3: procedure Generate(s)
4:     prompt ← FormatPrompt(s)
5:     response ← LLM(prompt, temperature = τ)
6:     a ← ParseContinuation(response)
7:     return a
8: end procedure
Definition 5.2 (Prompt Template).

The prompt template formats the current state for the LLM:

You are solving a problem step by step.
Current reasoning:
{state}

Continue with the next step. When you reach a final answer,
write "ANSWER: " followed by your answer.

Next step:

The temperature τ>0 controls diversity: higher temperatures yield more varied continuations.
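A minimal FormatPrompt instantiating the template of Definition 5.2; the template text is taken verbatim from above, while the function name and constant are illustrative.

```python
# Template text from Definition 5.2; {state} is the substitution slot.
PROMPT_TEMPLATE = """You are solving a problem step by step.
Current reasoning:
{state}

Continue with the next step. When you reach a final answer,
write "ANSWER: " followed by your answer.

Next step:"""

def format_prompt(state: str) -> str:
    """FormatPrompt: embed the current reasoning trace in the template."""
    return PROMPT_TEMPLATE.format(state=state)
```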

5.2 Evaluator

Definition 5.3 (Evaluator).

An Evaluator is a function that scores terminal states:

ℰ : 𝒮 × Answer → [0, 1]

Given a terminal state and its extracted answer, it returns a quality score.

Remark 5.1 (Terminal-Only Evaluation).

The Evaluator is invoked only on terminal states. This design choice reduces computational cost: intermediate reasoning states are not evaluated, and LLM-as-judge calls occur only when a complete answer is produced.

Algorithm 7 LLM Evaluator (LLM-as-Judge)
1: Input: terminal state s, extracted answer a
2: Output: score ∈ [0, 1]
3: procedure Evaluate(s, a)
4:     prompt ← FormatJudgePrompt(s, a)
5:     response ← LLM(prompt, temperature = 0)
6:     score ← ParseScore(response)
7:     return score
8: end procedure

Alternative Evaluators:

  • Ground Truth: Compare answer to known correct answer with normalization (lowercase, strip punctuation). Supports partial credit when answer contains the truth or vice versa.

  • Numeric: For mathematical problems, compares numeric values with configurable tolerance (absolute and relative). Extracts numbers from text, handles fractions (e.g., “1/2”), scientific notation, and provides partial credit based on relative error.

  • Process: Evaluates reasoning quality using heuristics—presence of step-by-step structure, mathematical notation, verification statements, and logical connectives. Can be combined with an answer evaluator via weighted sum.

  • Composite: Weighted combination of multiple evaluators. For example: 0.7×NumericScore+0.3×ProcessScore.

  • Self-Consistency: Score based on agreement with other sampled solutions (not implemented in canonical version).
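To make the Numeric evaluator concrete, here is a minimal sketch. The tolerance values and the linear partial-credit curve are illustrative choices, not prescribed by this specification.

```python
def numeric_evaluate(answer: str, truth: float,
                     abs_tol: float = 1e-6, rel_tol: float = 1e-3) -> float:
    """Numeric Evaluator sketch: parse a number from the answer string
    and score 1.0 within tolerance, with partial credit decaying
    linearly in relative error (one possible credit curve)."""
    try:
        # Handle simple fractions like "1/2" as well as plain floats.
        if "/" in answer:
            num, den = answer.split("/", 1)
            value = float(num) / float(den)
        else:
            value = float(answer)
    except (ValueError, ZeroDivisionError):
        return 0.0  # unparseable answer scores zero
    err = abs(value - truth)
    if err <= abs_tol or (truth != 0 and err / abs(truth) <= rel_tol):
        return 1.0
    rel_err = err / max(abs(truth), abs_tol)
    return max(0.0, 1.0 - rel_err)
```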

6 Result Extraction

After search completes, we extract the best answer from the tree.

Algorithm 8 Best Answer Extraction
1:  Input: root node r
2:  Output: best answer and confidence score
3:  procedure GetBestAnswer(r)
4:      n ← r
5:      while children(n) ≠ ∅ do
6:          n ← argmax_{c ∈ children(n)} ν(c)    ▷ Most-visited child
7:      end while
8:      if IsTerminal(state(n)) then
9:          return (answer(n), v̄(n))
10:     else
11:         return (null, 0)
12:     end if
13: end procedure
Remark 6.1.

Following most-visited children (rather than highest-value) is standard practice in MCTS, as visit counts are more robust to value noise than raw averages.
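Algorithm 8 with most-visited-child descent can be sketched as follows (dict-based nodes, an illustrative representation):

```python
def get_best_answer(root):
    """Best-answer extraction per Algorithm 8: follow the most-visited
    child at each level; report the answer and average value if the
    resulting leaf is terminal, else (None, 0.0)."""
    node = root
    while node["children"]:
        node = max(node["children"], key=lambda c: c["visits"])
    if node["terminal"]:
        return node["answer"], node["value"] / node["visits"]
    return None, 0.0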

7 Complexity Analysis

Property 7.1 (Time Complexity).

For K simulations with maximum depth D:

  • Selection: O(D) node traversals per simulation

  • Expansion: O(1) plus one Generator call

  • Rollout: O(D) Generator calls (worst case)

  • Backpropagation: O(D) node updates

Total tree operations: O(KD). The dominant cost is typically LLM calls: up to O(KD) Generator calls and O(K) Evaluator calls.

Property 7.2 (Space Complexity).

The search tree contains at most O(KD) nodes, each storing a state string. With bounded state length L, space complexity is O(KDL).

8 Extensions

The following extensions are under consideration for future versions:

8.1 Extended Action Spaces

The canonical implementation uses only the Continue action. Extended action spaces may include:

  • Compress: When the reasoning trace exceeds a threshold length, the Compress action summarizes the trace into a more compact representation. Subsequent reasoning continues from the compressed state, reducing context length while preserving key insights.

    𝒜(s) = {Continue, Compress}   if |s| > threshold
    𝒜(s) = {Continue}             otherwise
  • Verify: Ask the LLM to verify the current reasoning before continuing

  • ToolCall: Invoke external tools (calculator, search, code execution)

  • Backtrack: Explicitly mark a reasoning path as unpromising
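The state-dependent action space for the Compress extension might be sketched as follows; the length threshold and the list-of-strings representation are illustrative assumptions.

```python
def available_actions(state: str, is_terminal, threshold: int = 4000):
    """State-dependent action space 𝒜(s) sketch (Section 8.1): empty
    for terminal states; long non-terminal traces also offer Compress.
    The threshold value is an illustrative choice."""
    if is_terminal(state):
        return []
    if len(state) > threshold:
        return ["Continue", "Compress"]
    return ["Continue"]
```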

8.2 Algorithm Variants

  • Progressive Widening: Dynamically adjust B based on visit count

  • RAVE: Rapid Action Value Estimation for faster convergence

  • Parallel MCTS: Root parallelization or tree parallelization

  • Value Networks: Learned evaluation functions for intermediate states

  • Policy Networks: Learned action selection to guide expansion

  • Self-Consistency Integration: Aggregate across multiple terminal states

  • Beam Search Hybrid: Combine MCTS exploration with beam search

8.3 Beyond Trees: Graph-Based Reasoning

The tree structure of MCTS imposes fundamental constraints on reasoning patterns. Certain operations naturally suggest non-tree structures:

  • Problem Decomposition: Splitting a problem into independent subproblems, solving each separately, and combining results creates a DAG structure where the “combine” node has multiple parents.

  • Critique-Revision Cycles: Evaluating a solution, identifying issues, and refining creates potential cycles that trees cannot represent.

  • Cross-Path Information Sharing: Using insights from one reasoning path to inform another requires edges between branches.

These patterns require graph-based or hypergraph-based search with:

  • Multiple edge types (continuation, decomposition, aggregation, critique)

  • Generalized propagation beyond simple backpropagation

  • Synchronization semantics for subproblem completion

Such extensions are outside the scope of this tree-based MCTS specification but represent promising directions for future work on structured reasoning.

Remark 8.1 (Natural Decomposition within MCTS).

Note that the LLM within a Continue action can naturally perform soft versions of these operations: reasoning about subproblems sequentially within a single path, self-critiquing before continuing, or reconsidering earlier steps. The tree structure captures alternative reasoning paths, while complex reasoning within a path remains expressible through the continuation mechanism.

9 Conclusion

This report establishes a canonical specification for MCTS applied to LLM reasoning. The key contributions are:

  1. Rigorous formal definitions of states, actions, nodes, and the search tree

  2. Precise specification of the four MCTS phases with pseudocode

  3. Clear interfaces for Generator and Evaluator components

  4. Explicit documentation of design choices (tree-building rollouts, terminal-only evaluation, bounded branching)

The specification serves as a reference for implementation, ensuring the algorithm is authentically represented while being appropriately adapted for language model reasoning tasks.

References

  • [1] Kocsis, L., & Szepesvári, C. (2006). Bandit based monte-carlo planning. European Conference on Machine Learning, 282–293.
  • [2] Coulom, R. (2006). Efficient selectivity and backup operators in monte-carlo tree search. International Conference on Computers and Games, 72–83.
  • [3] Browne, C. B., et al. (2012). A survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1), 1–43.
  • [4] Silver, D., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484–489.
  • [5] Yao, S., et al. (2023). Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601.
  • [6] Hao, S., et al. (2023). Reasoning with language model is planning with world model. arXiv preprint arXiv:2305.14992.