So far in this series, we’ve discussed how to predict sequences. But where do sequences come from? What statistical properties do they have?
Understanding the data generating process (DGP) is crucial for:
- Designing appropriate predictors
- Analyzing theoretical guarantees
- Interpreting experimental results
This post explores Markov processes and tree sources—the model classes that CTW is provably optimal for.
What is a Data Generating Process?
A DGP is the probabilistic mechanism that produces a sequence. It specifies:
- What symbols can appear (the alphabet)
- The probability of each symbol given the history
Formally, a DGP is a distribution P(x₁, x₂, x₃, …) over infinite sequences.
Markov Processes
The most important class of DGPs is Markov processes.
Definition
A k-th order Markov process is one where the next symbol depends only on the previous k symbols:
P(xₙ | x₁, ..., xₙ₋₁) = P(xₙ | xₙ₋ₖ, ..., xₙ₋₁)
The k previous symbols form the context or state.
Example: First-Order Binary Markov
Consider a first-order (k=1) process over {0, 1}:
Transition matrix (rows = previous symbol, columns = next symbol):

|        | Next=0 | Next=1 |
|---|---|---|
| Prev=0 | 0.2    | 0.8    |
| Prev=1 | 0.7    | 0.3    |

After seeing a 0, P(next=1) = 0.8; after seeing a 1, P(next=1) = 0.3.
This process tends to alternate—0s are followed by 1s (with p=0.8) and 1s by 0s (with p=0.7).
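This is easy to check by simulation. Below is a minimal sketch (the `sample_chain` helper is illustrative, not part of any package) that samples from this chain and estimates P(1 | previous = 0) empirically:

```python
import random

def sample_chain(p1_given, n, seed=0):
    """Sample n bits from a first-order binary Markov chain.

    p1_given[s] is P(next = 1 | previous symbol = s).
    """
    rng = random.Random(seed)
    bits = [rng.randint(0, 1)]  # arbitrary starting symbol
    for _ in range(n):
        bits.append(1 if rng.random() < p1_given[bits[-1]] else 0)
    return bits

bits = sample_chain({0: 0.8, 1: 0.3}, n=100_000, seed=42)
# Empirical P(1 | previous = 0); should be close to 0.8.
after_zero = [b for prev, b in zip(bits, bits[1:]) if prev == 0]
print(sum(after_zero) / len(after_zero))
```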
Higher-Order Markov
For a k-th order binary process:
- 2^k possible contexts
- Each context has its own P(next = 1)
- Total: 2^k parameters
This exponential growth limits practical Markov orders. But CTW handles it gracefully through model averaging.
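The growth is easy to see by enumerating contexts directly:

```python
from itertools import product

# A k-th order binary Markov process needs one P(next = 1)
# parameter per context, and there are 2**k contexts.
for k in (1, 2, 4, 8):
    contexts = ["".join(bits) for bits in product("01", repeat=k)]
    print(k, len(contexts))  # prints: 1 2 / 2 4 / 4 16 / 8 256
```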
Tree Sources
A tree source generalizes Markov processes by allowing different contexts to have different “effective orders.”
```
          Root
         /    \
        0      1
       / \      \
      0   1    (leaf: contexts ending in "1")
   (leaf)  (leaf)
    "00"    "10"
```

In this tree source:
- Context “00” has its own distribution
- Context “10” has its own distribution
- Any context ending in “1” shares a single distribution

(Contexts are written oldest symbol first; the root branches on the most recent symbol, so the leaf set {“00”, “10”, “1”} is suffix-free and covers every possible history.)
Tree sources are more parsimonious than full Markov processes when some contexts don’t need the full history.
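One simple way to represent a tree source in code is a map from leaf contexts to P(next = 1), resolved by longest-suffix matching. The sketch below uses an illustrative leaf set {“00”, “10”, “1”} with made-up probabilities; the helper name is hypothetical:

```python
def tree_source_p1(leaves, history):
    """Return P(next = 1) under a tree source.

    `leaves` maps leaf contexts (bit strings, oldest symbol first) to
    P(next = 1); the leaf set must be complete and suffix-free.
    """
    # Try progressively shorter suffixes of the history until one
    # matches a leaf context.
    for length in range(len(history), -1, -1):
        suffix = history[len(history) - length:]
        if suffix in leaves:
            return leaves[suffix]
    raise ValueError("history matches no leaf context")

# Contexts "00" and "10" get their own distributions; any context
# ending in "1" shares one. Probabilities are made up.
leaves = {"00": 0.9, "10": 0.2, "1": 0.5}
print(tree_source_p1(leaves, "0110"))  # longest matching suffix is "10" -> 0.2
```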
Why Tree Sources Matter
Tree sources are the natural model class for CTW:
- CTW averages over all possible tree structures up to a maximum depth
- For any tree source, CTW converges to the true distribution at the optimal rate
- CTW is provably minimax optimal for tree sources
Example: Sparse Markov Processes
Our experimental framework includes sparse Markov processes where most contexts are uninformative:
| Context | P(next=1) | Description |
|---|---|---|
| 000 | 0.99 | High certainty → 1 |
| 100 | 0.99 | High certainty → 1 |
| 010 | 0.025 | High certainty → 0 |
| 101 | 0.025 | High certainty → 0 |
| 011 | 0.25 | Moderate certainty → 0 |
| 110 | 0.25 | Moderate certainty → 0 |
| 001 | 0.5 | Uniform (uninformative) |
| 111 | 0.5 | Uniform (uninformative) |
This 3rd-order process has:
- 4 highly predictable contexts (near-deterministic)
- 2 moderately predictable contexts
- 2 completely random contexts
The theoretical maximum accuracy for an oracle predictor is about 95%—achieved by predicting the more likely outcome for each context.
Stationarity
A process is stationary if its statistics don’t change over time: for every block length k and all times t and s,

P(xₜ₊₁, ..., xₜ₊ₖ) = P(xₛ₊₁, ..., xₛ₊ₖ)

A Markov process with fixed (time-homogeneous) transition probabilities is stationary once it has been started from, or has converged to, its stationary distribution.
Non-stationary processes (changing statistics over time) are harder to predict and require adaptive methods. Real-world sequences are often non-stationary:
- Language style varies within a document
- Financial markets have regime changes
- User behavior evolves
CTW adapts as its context statistics accumulate, but its estimators weight all past observations equally; explicit forgetting or change-point detection may be needed for strongly non-stationary sources.
Ergodicity
A Markov process is ergodic if:
- Irreducible: Can reach any state from any other state
- Aperiodic: Doesn’t cycle deterministically
Ergodic processes have a unique stationary distribution—the long-run fraction of time spent in each state is well-defined and independent of the starting state.
For ergodic processes, time averages equal ensemble averages. This justifies evaluating predictors on a single long sequence.
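For the first-order chain above, the stationary distribution can be computed by power iteration: start from any distribution and repeatedly apply the transition matrix until it stops changing. A dependency-free sketch:

```python
def stationary(P, iters=1000):
    """Approximate the stationary distribution by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        # One step: pi <- pi @ P
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

# The alternating chain from the first-order example:
# rows = previous symbol, columns = next symbol.
P = [[0.2, 0.8],
     [0.7, 0.3]]
pi = stationary(P)
print(pi)  # ≈ [7/15, 8/15] ≈ [0.467, 0.533]
```

Because the chain is ergodic, the same answer is obtained from any starting distribution.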
Why DGP Matters for Model Selection
The DGP determines which predictor is optimal:
| DGP Property | Optimal Approach |
|---|---|
| k-th order Markov | CTW with depth ≥ k |
| Tree source | CTW naturally handles |
| Long-range dependencies | Neural models (LSTM, Transformer) |
| Non-stationary | Adaptive methods, forgetting |
| Unknown structure | Universal methods (Solomonoff, CTW) |
Key finding from experiments: CTW achieves the theoretical maximum accuracy when its depth is set to at least the true order of the DGP.
Creating Custom DGPs
For experimentation, we can create various Markov processes:
```python
from experiments.ctw.generators import (
    create_paper_markov,
    create_sparse_markov,
    create_structured_markov,
    create_random_markov,
)

# Sparse process (few informative contexts)
sparse = create_sparse_markov(order=4, num_rules=6, seed=42)

# Structured: alternating, xor, majority, parity
alt = create_structured_markov(order=3, structure="alternating")
xor = create_structured_markov(order=3, structure="xor")

# Fully random transitions
rand = create_random_markov(order=3, seed=42)

# Compute theoretical maximum accuracy
max_acc = sparse.theoretical_max_accuracy()
print(f"Theoretical max: {max_acc:.1%}")
```
Structured Markov Processes
These have deterministic rules:
- Alternating: Predict opposite of last bit
- XOR: Predict XOR of context bits
- Majority: Predict majority bit in context
- Parity: Predict based on count of 1s mod 2
These are useful for testing edge cases but don’t represent typical real-world data.
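As a sketch, the four rules might look like this (illustrative implementations, not the ones in the experiments package; note that XOR of the bits and parity of the 1s-count are the same function):

```python
def alternating(context):
    """Next bit is the opposite of the most recent bit."""
    return 1 - context[-1]

def xor_rule(context):
    """Next bit is the XOR of all context bits."""
    bit = 0
    for b in context:
        bit ^= b
    return bit

def majority(context):
    """Next bit is the majority bit of the context."""
    return 1 if 2 * sum(context) > len(context) else 0

def parity(context):
    """Next bit is 1 iff the context holds an odd number of 1s."""
    return sum(context) % 2

print(alternating([1, 0, 1]), xor_rule([1, 0, 1]), majority([1, 0, 1]))
# prints: 0 0 1
```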
The Theoretical Maximum
For any Markov process, we can compute the Bayes-optimal accuracy—the best any predictor could achieve:
```python
def theoretical_max_accuracy(transition_probs, stationary_dist):
    """Oracle predicts the more likely outcome for each context."""
    accuracy = 0.0
    for context, p1 in transition_probs.items():
        p_context = stationary_dist[context]
        # The oracle is right whenever the more likely symbol occurs.
        accuracy += p_context * max(p1, 1 - p1)
    return accuracy
```
This represents the ceiling—no predictor can do better without additional information.
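As a worked example, take the first-order alternating chain from earlier. Its stationary distribution is π = (7/15, 8/15), from the balance condition 0.8·π₀ = 0.7·π₁:

```python
# First-order alternating chain: P(next = 1 | context).
transition_probs = {"0": 0.8, "1": 0.3}
# Its stationary distribution (balance: 0.8 * pi_0 = 0.7 * pi_1).
stationary_dist = {"0": 7 / 15, "1": 8 / 15}

# Oracle: after a 0 predict 1 (right w.p. 0.8); after a 1 predict 0
# (right w.p. 0.7).
accuracy = sum(stationary_dist[c] * max(p1, 1 - p1)
               for c, p1 in transition_probs.items())
print(f"{accuracy:.1%}")  # 74.7%
```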
Key Takeaways
- DGPs specify the probabilistic mechanism generating sequences
- Markov processes have limited memory (only k previous symbols matter)
- Tree sources generalize Markov with variable effective order
- CTW is provably optimal for tree sources
- The theoretical maximum provides an upper bound on prediction accuracy
- Real-world processes are often non-stationary—an active research area
What’s Next
The next post examines n-gram models—the classical approach to language modeling. We’ll see how they relate to Markov processes and why their limitations motivated both CTW and neural approaches.
Further Reading
- Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory. Chapter 4.
- Willems, F. M. J., et al. (1995). “The context-tree weighting method: basic properties.”