MCTS-Reasoning: Tree Search for LLM Reasoning

I’ve been working on applying Monte Carlo Tree Search to LLM reasoning. The idea: multi-step reasoning is a sequential decision problem, and MCTS is good at those.

The Problem with Single-Shot Reasoning

When you ask an LLM a hard question, it generates one response. If that response goes down a wrong path early, there’s no recovery. The model commits to its initial approach and follows it to completion, even when better alternatives existed.

This is a waste. The model might have gotten it right if it had taken a different first step. MCTS addresses this by building a tree of reasoning paths and using the UCB1 bandit algorithm to balance exploration of new paths with exploitation of promising ones.

How It Works

The system models reasoning as a search problem:

  • States: Partial reasoning traces (what’s been written so far)
  • Actions: Reasoning continuations (the next step)
  • Terminal states: Complete solutions with final answers
  • Rewards: Quality assessments of final answers
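The mapping above can be sketched with two small data classes. This is an illustrative sketch, not the library's actual types; the field names (`trace`, `is_terminal`, `visits`, `total_reward`) are my own.

```python
from dataclasses import dataclass, field

@dataclass
class ReasoningState:
    trace: str                 # partial reasoning written so far
    is_terminal: bool = False  # True once a final answer is produced

@dataclass
class Node:
    state: ReasoningState
    parent: "Node | None" = None
    children: list = field(default_factory=list)
    visits: int = 0            # simulations passing through this node
    total_reward: float = 0.0  # sum of terminal rewards backed up here
```

A node's value estimate is then just `total_reward / visits`, which is what the selection policy compares across siblings.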

Each MCTS simulation runs through four phases:

  1. Selection: Traverse the tree using UCB1 to pick promising paths
  2. Expansion: Add a new reasoning step via LLM generation
  3. Rollout: Continue reasoning until reaching a terminal state
  4. Backpropagation: Update statistics back up the tree
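The four phases can be sketched as a single simulation function over plain dicts. `generate_step` and `evaluate` stand in for the LLM generator and the terminal evaluator; both are illustrative stubs, not the library's actual API.

```python
import math

def ucb1(child, parent_visits, c=1.414):
    # UCB1: mean reward plus an exploration bonus that shrinks with visits
    if child["visits"] == 0:
        return float("inf")  # always try unvisited children first
    exploit = child["reward"] / child["visits"]
    explore = c * math.sqrt(math.log(parent_visits) / child["visits"])
    return exploit + explore

def simulate(root, generate_step, evaluate, max_depth=5):
    # 1. Selection: descend via UCB1 while the node has children
    node, path = root, [root]
    while node["children"]:
        node = max(node["children"], key=lambda ch: ucb1(ch, node["visits"]))
        path.append(node)
    # 2. Expansion + 3. Rollout: keep generating steps until a terminal
    #    state or the depth cap (the rollout nodes stay in the tree)
    depth = 0
    while not node["terminal"] and depth < max_depth:
        child = generate_step(node)
        node["children"].append(child)
        path.append(child)
        node = child
        depth += 1
    # 4. Backpropagation: push the terminal reward up the selected path
    reward = evaluate(node) if node["terminal"] else 0.0
    for n in path:
        n["visits"] += 1
        n["reward"] += reward
    return reward
```

Running this in a loop, `simulations=50` just means fifty calls to `simulate` against the same root.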

Tree-Building Rollouts

One design choice worth noting: I use tree-building rollouts. Standard game-playing MCTS uses a fast random policy during rollouts and discards the visited states once a value comes back; they never become tree nodes. Here, every rollout step is added to the tree as a node. This preserves the full reasoning trace and lets future simulations reuse reasoning steps. It’s more expensive per simulation, but reasoning steps are expensive to generate anyway, so you want to keep them.
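The contrast can be made concrete with two toy rollout functions. These are illustrative sketches over plain dicts, not the library's code; `step_fn` stands in for an LLM call that extends a reasoning trace.

```python
def standard_rollout(trace, step_fn, max_depth):
    # Game-style rollout: states are generated, scored, and thrown away.
    for _ in range(max_depth):
        trace = step_fn(trace)
    return trace

def tree_building_rollout(node, step_fn, max_depth):
    # Tree-building rollout: every generated step becomes a child node,
    # so later simulations can select into and reuse these steps.
    for _ in range(max_depth):
        child = {"trace": step_fn(node["trace"]),
                 "children": [], "visits": 0, "reward": 0.0}
        node["children"].append(child)
        node = child
    return node
```

The first function returns only a value-bearing leaf; the second leaves a chain of nodes behind that the next simulation's selection phase can descend into.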

Terminal-Only Evaluation

The evaluator runs only on terminal states. Intermediate reasoning states aren’t evaluated, which reduces computational cost. LLM-as-judge calls happen only when a complete answer is produced. This keeps the search cheap where it can be cheap.
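The gating logic is simple enough to show directly. A minimal sketch, assuming dict-based nodes; `llm_judge` is a hypothetical stand-in for the LLM-as-judge call, not a documented function.

```python
def evaluate_terminal_only(node, llm_judge):
    # Intermediate reasoning states are never scored: no judge call at all.
    if not node.get("terminal", False):
        return None
    # Only a complete answer incurs an LLM-as-judge call.
    return llm_judge(node["trace"])
```

The cost structure follows: with K simulations, at most K judge calls are made, regardless of how many intermediate nodes the tree contains.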

The Technical Report

I wrote a formal specification that provides rigorous definitions for all components: states, actions, nodes, and the search tree. It includes precise pseudocode for all four MCTS phases, clear interfaces for the Generator and Evaluator components, and complexity analysis showing O(KD) tree operations for K simulations with max depth D.

The goal was to write a canonical reference that authentically captures MCTS while adapting it for LLM reasoning. Too many implementations of this kind of thing exist only as code, and the code doesn’t explain the why.

Usage

The library provides both a fluent API and an interactive shell:

from mcts_reasoning import ReasoningMCTS, get_llm

mcts = (
    ReasoningMCTS()
    .with_llm(get_llm("anthropic"))
    .with_question("What is the optimal sorting algorithm for 1M integers?")
    .with_exploration(1.414)
    .with_max_rollout_depth(5)
)

mcts.search("Let's think step by step...", simulations=50)
print(f"Solution: {mcts.solution}")
print(f"Confidence: {mcts.best_value:.2%}")

Or interactively:

mcts-shell
> ask What is the sum of primes less than 100?
> search 50
> solution
> sample 5  # Get diverse reasoning paths
> consistency 20  # Check solution consistency

Extensions

The tech report discusses several extensions I’m considering:

  • Extended action spaces: Compress (for long traces), Verify, ToolCall, Backtrack
  • Algorithm variants: Progressive widening, RAVE, parallel MCTS
  • Graph-based reasoning: DAG structures for problem decomposition and critique-revision cycles

The code is at github.com/queelius/mcts-reasoning and the technical report is on the papers page.
