So far in this series, we’ve discussed how to predict sequences. But where do sequences come from? What statistical properties do they have?
Understanding the data generating process (DGP) is crucial for:
- Designing appropriate predictors
- Analyzing theoretical guarantees
- Interpreting experimental results
This post explores Markov processes and tree sources—the model classes that CTW is provably optimal for.
What is a Data Generating Process?
A DGP is the probabilistic mechanism that produces a sequence. It specifies:
- What symbols can appear (the alphabet)
- The probability of each symbol given the history
Formally, a DGP is a distribution P(x₁, x₂, x₃, …) over infinite sequences.
Markov Processes
The most important class of DGPs is Markov processes.
Definition
A k-th order Markov process is one where the next symbol depends only on the previous k symbols:
P(xₙ | x₁, ..., xₙ₋₁) = P(xₙ | xₙ₋ₖ, ..., xₙ₋₁)
The k previous symbols form the context or state.
Example: First-Order Binary Markov
Consider a first-order (k=1) process over {0, 1}:
Transition matrix (rows = previous symbol, columns = next symbol):

|        | Next=0 | Next=1 |
|---|---|---|
| Prev=0 | 0.2    | 0.8    |
| Prev=1 | 0.7    | 0.3    |

After seeing a 0, P(next=1) = 0.8; after seeing a 1, P(next=1) = 0.3.
This process tends to alternate—0s are followed by 1s (with p=0.8) and 1s by 0s (with p=0.7).
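This is easy to check by simulation. Below is a minimal sketch (the `sample_chain` helper is illustrative, not part of any package) that samples from this chain and estimates P(1 | previous = 0) empirically:

```python
import random

def sample_chain(p1_given, n, seed=0):
    """Sample n bits from a first-order binary Markov chain.

    p1_given[s] is P(next = 1 | previous symbol = s).
    """
    rng = random.Random(seed)
    bits = [rng.randint(0, 1)]  # arbitrary starting symbol
    for _ in range(n):
        bits.append(1 if rng.random() < p1_given[bits[-1]] else 0)
    return bits

bits = sample_chain({0: 0.8, 1: 0.3}, n=100_000, seed=42)
# Empirical P(1 | previous = 0); should be close to 0.8.
after_zero = [b for prev, b in zip(bits, bits[1:]) if prev == 0]
print(sum(after_zero) / len(after_zero))
```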
Higher-Order Markov
For a k-th order binary process:
- 2^k possible contexts
- Each context has its own P(next = 1)
- Total: 2^k parameters
This exponential growth limits practical Markov orders. But CTW handles it gracefully through model averaging.
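The growth is easy to see by enumerating contexts directly:

```python
from itertools import product

# A k-th order binary Markov process needs one P(next = 1)
# parameter per context, and there are 2**k contexts.
for k in (1, 2, 4, 8):
    contexts = ["".join(bits) for bits in product("01", repeat=k)]
    print(k, len(contexts))  # prints: 1 2 / 2 4 / 4 16 / 8 256
```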
Tree Sources
A tree source generalizes Markov processes by allowing different contexts to have different “effective orders.”
```
          Root
         /    \
        0      1
       / \      \
      0   1    (leaf: contexts ending in "1")
   (leaf)  (leaf)
    "00"    "10"
```

In this tree source:
- Context “00” has its own distribution
- Context “10” has its own distribution
- Any context ending in “1” shares a single distribution

(Contexts are written oldest symbol first; the root branches on the most recent symbol, so the leaf set {“00”, “10”, “1”} is suffix-free and covers every possible history.)
Tree sources are more parsimonious than full Markov processes when some contexts don’t need the full history.
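One simple way to represent a tree source in code is a map from leaf contexts to P(next = 1), resolved by longest-suffix matching. The sketch below uses an illustrative leaf set {“00”, “10”, “1”} with made-up probabilities; the helper name is hypothetical:

```python
def tree_source_p1(leaves, history):
    """Return P(next = 1) under a tree source.

    `leaves` maps leaf contexts (bit strings, oldest symbol first) to
    P(next = 1); the leaf set must be complete and suffix-free.
    """
    # Try progressively shorter suffixes of the history until one
    # matches a leaf context.
    for length in range(len(history), -1, -1):
        suffix = history[len(history) - length:]
        if suffix in leaves:
            return leaves[suffix]
    raise ValueError("history matches no leaf context")

# Contexts "00" and "10" get their own distributions; any context
# ending in "1" shares one. Probabilities are made up.
leaves = {"00": 0.9, "10": 0.2, "1": 0.5}
print(tree_source_p1(leaves, "0110"))  # longest matching suffix is "10" -> 0.2
```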
Why Tree Sources Matter
Tree sources are the natural model class for CTW:
- CTW averages over all possible tree structures up to a maximum depth
- For any tree source, CTW converges to the true distribution at the optimal rate
- CTW is provably minimax optimal for tree sources
Example: Sparse Markov Processes
Our experimental framework includes sparse Markov processes where most contexts are uninformative:
| Context | P(next=1) | Description |
|---|---|---|
| 000 | 0.99 | High certainty → 1 |
| 100 | 0.99 | High certainty → 1 |
| 010 | 0.025 | High certainty → 0 |
| 101 | 0.025 | High certainty → 0 |
| 011 | 0.25 | Moderate certainty → 0 |
| 110 | 0.25 | Moderate certainty → 0 |
| 001 | 0.5 | Uniform (uninformative) |
| 111 | 0.5 | Uniform (uninformative) |
This 3rd-order process has:
- 4 highly predictable contexts (near-deterministic)
- 2 moderately predictable contexts
- 2 completely random contexts
The theoretical maximum accuracy for an oracle predictor is about 95%—achieved by predicting the more likely outcome for each context.
Stationarity
A process is stationary if its statistics don’t change over time: for every block length k and all times t and s,

P(xₜ₊₁, ..., xₜ₊ₖ) = P(xₛ₊₁, ..., xₛ₊ₖ)

A Markov process with fixed (time-homogeneous) transition probabilities is stationary once it has been started from, or has converged to, its stationary distribution.
Non-stationary processes (changing statistics over time) are harder to predict and require adaptive methods. Real-world sequences are often non-stationary:
- Language style varies within a document
- Financial markets have regime changes
- User behavior evolves
CTW adapts as its context statistics accumulate, but its estimators weight all past observations equally; explicit forgetting or change-point detection may be needed for strongly non-stationary sources.
Ergodicity
A Markov process is ergodic if:
- Irreducible: Can reach any state from any other state
- Aperiodic: Doesn’t cycle deterministically
Ergodic processes have a unique stationary distribution—the long-run fraction of time spent in each state is well-defined and independent of the starting state.
For ergodic processes, time averages equal ensemble averages. This justifies evaluating predictors on a single long sequence.
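For the first-order chain above, the stationary distribution can be computed by power iteration: start from any distribution and repeatedly apply the transition matrix until it stops changing. A dependency-free sketch:

```python
def stationary(P, iters=1000):
    """Approximate the stationary distribution by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        # One step: pi <- pi @ P
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

# The alternating chain from the first-order example:
# rows = previous symbol, columns = next symbol.
P = [[0.2, 0.8],
     [0.7, 0.3]]
pi = stationary(P)
print(pi)  # ≈ [7/15, 8/15] ≈ [0.467, 0.533]
```

Because the chain is ergodic, the same answer is obtained from any starting distribution.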
Why DGP Matters for Model Selection
The DGP determines which predictor is optimal:
| DGP Property | Optimal Approach |
|---|---|
| k-th order Markov | CTW with depth ≥ k |
| Tree source | CTW naturally handles |
| Long-range dependencies | Neural models (LSTM, Transformer) |
| Non-stationary | Adaptive methods, forgetting |
| Unknown structure | Universal methods (Solomonoff, CTW) |
Key finding from experiments: CTW achieves the theoretical maximum accuracy when its depth is set to at least the true order of the DGP.
Creating Custom DGPs
For experimentation, we can create various Markov processes:
```python
from experiments.ctw.generators import (
    create_paper_markov,
    create_sparse_markov,
    create_structured_markov,
    create_random_markov,
)

# Sparse process (few informative contexts)
sparse = create_sparse_markov(order=4, num_rules=6, seed=42)

# Structured: alternating, xor, majority, parity
alt = create_structured_markov(order=3, structure="alternating")
xor = create_structured_markov(order=3, structure="xor")

# Fully random transitions
rand = create_random_markov(order=3, seed=42)

# Compute theoretical maximum accuracy
max_acc = sparse.theoretical_max_accuracy()
print(f"Theoretical max: {max_acc:.1%}")
```
Structured Markov Processes
These have deterministic rules:
- Alternating: Predict opposite of last bit
- XOR: Predict XOR of context bits
- Majority: Predict majority bit in context
- Parity: Predict based on count of 1s mod 2
These are useful for testing edge cases but don’t represent typical real-world data.
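As a sketch, the four rules might look like this (illustrative implementations, not the ones in the experiments package; note that XOR of the bits and parity of the 1s-count are the same function):

```python
def alternating(context):
    """Next bit is the opposite of the most recent bit."""
    return 1 - context[-1]

def xor_rule(context):
    """Next bit is the XOR of all context bits."""
    bit = 0
    for b in context:
        bit ^= b
    return bit

def majority(context):
    """Next bit is the majority bit of the context."""
    return 1 if 2 * sum(context) > len(context) else 0

def parity(context):
    """Next bit is 1 iff the context holds an odd number of 1s."""
    return sum(context) % 2

print(alternating([1, 0, 1]), xor_rule([1, 0, 1]), majority([1, 0, 1]))
# prints: 0 0 1
```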
The Theoretical Maximum
For any Markov process, we can compute the Bayes-optimal accuracy—the best any predictor could achieve:
```python
def theoretical_max_accuracy(transition_probs, stationary_dist):
    """Oracle predicts the more likely outcome for each context."""
    accuracy = 0.0
    for context, p1 in transition_probs.items():
        p_context = stationary_dist[context]
        # The oracle is right whenever the more likely symbol occurs.
        accuracy += p_context * max(p1, 1 - p1)
    return accuracy
```
This represents the ceiling—no predictor can do better without additional information.
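As a worked example, take the first-order alternating chain from earlier. Its stationary distribution is π = (7/15, 8/15), from the balance condition 0.8·π₀ = 0.7·π₁:

```python
# First-order alternating chain: P(next = 1 | context).
transition_probs = {"0": 0.8, "1": 0.3}
# Its stationary distribution (balance: 0.8 * pi_0 = 0.7 * pi_1).
stationary_dist = {"0": 7 / 15, "1": 8 / 15}

# Oracle: after a 0 predict 1 (right w.p. 0.8); after a 1 predict 0
# (right w.p. 0.7).
accuracy = sum(stationary_dist[c] * max(p1, 1 - p1)
               for c, p1 in transition_probs.items())
print(f"{accuracy:.1%}")  # 74.7%
```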
Key Takeaways
- DGPs specify the probabilistic mechanism generating sequences
- Markov processes have limited memory (only k previous symbols matter)
- Tree sources generalize Markov with variable effective order
- CTW is provably optimal for tree sources
- The theoretical maximum provides an upper bound on prediction accuracy
- Real-world processes are often non-stationary—an active research area
What’s Next
The next post examines n-gram models—the classical approach to language modeling. We’ll see how they relate to Markov processes and why their limitations motivated both CTW and neural approaches.
Further Reading
- Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory. Chapter 4.
- Willems, F. M. J., et al. (1995). “The context-tree weighting method: basic properties.”