Neural language models have revolutionized how we approach sequential prediction. GPT, Claude, LLaMA—these systems achieve remarkable performance on tasks that seemed impossible a decade ago. In this post, we trace the architectural evolution from simple feedforward networks to transformers, connecting modern LLMs back to the information-theoretic foundations we developed earlier in this series.
The Core Task
Every neural language model solves the same fundamental problem:
Given x₁, x₂, ..., xₙ₋₁, estimate P(xₙ | x₁, ..., xₙ₋₁)
This is exactly the sequential prediction problem from our first post. The difference is how neural models parameterize the conditional distribution.
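To make that interface concrete, here is a minimal NumPy sketch (names and shapes are ours, purely illustrative): a language model is anything that maps a context to a probability vector over the vocabulary.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def next_token_distribution(context_ids, score_fn, vocab_size):
    """score_fn is whatever model we choose below: it maps a context
    to one unnormalized score (logit) per vocabulary item."""
    logits = score_fn(context_ids)       # shape: (vocab_size,)
    return softmax(logits)               # P(x_n | x_1, ..., x_{n-1})

# e.g. a trivial "uniform" baseline model:
uniform = next_token_distribution([1, 2, 3], lambda ctx: np.zeros(50), vocab_size=50)
```

Every architecture in this post is a different choice of `score_fn`.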
Evolution of Architectures
Feedforward Neural LMs (Bengio et al., 2003)
The first neural language models used feedforward networks:
Input: one-hot encodings of last N words
↓
Embedding layer: learn dense representations
↓
Hidden layers: nonlinear transformations
↓
Output: softmax over vocabulary
This is essentially a neural n-gram—fixed context window, but with learned representations that allow similar contexts to share information.
Key innovation: Word embeddings. Instead of treating each word as an atomic symbol, learn a dense vector where similar words have similar representations.
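As a rough sketch of the forward pass (illustrative sizes and initialization, not Bengio et al.'s exact parameterization):

```python
import numpy as np

rng = np.random.default_rng(0)
V, N, d, H = 10_000, 4, 64, 128          # vocab size, context length, embedding dim, hidden dim
E   = rng.normal(0, 0.01, (V, d))        # embedding table (learned)
W_h = rng.normal(0, 0.01, (N * d, H))    # hidden layer
W_o = rng.normal(0, 0.01, (H, V))        # output projection

def feedforward_lm(context_ids):
    """Neural n-gram: embed the last N tokens, concatenate the embeddings,
    apply a nonlinearity, then a softmax over the whole vocabulary."""
    x = E[context_ids].reshape(-1)       # (N*d,) concatenated embeddings
    hidden = np.tanh(x @ W_h)            # (H,)
    logits = hidden @ W_o                # (V,)
    logits -= logits.max()               # numerical stability
    p = np.exp(logits)
    return p / p.sum()                   # P(next token | last N tokens)

probs = feedforward_lm(np.array([12, 7, 431, 9]))   # any N token ids
```

The fixed context length N is baked into the weight shapes, which is exactly the n-gram limitation the later architectures remove.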
Recurrent Neural Networks (RNNs)
RNNs process sequences with a hidden state that evolves over time:
hₜ = f(W_h × hₜ₋₁ + W_x × xₜ)
yₜ = softmax(W_y × hₜ)
The hidden state hₜ theoretically carries information from the entire history, not just the last N words.
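A single step of this recurrence, sketched in NumPy with illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, H = 10_000, 64, 128                    # vocab size, embedding dim, hidden dim
E   = rng.normal(0, 0.01, (V, d))            # token embeddings
W_h = rng.normal(0, 0.01, (H, H))            # hidden-to-hidden
W_x = rng.normal(0, 0.01, (d, H))            # input-to-hidden
W_y = rng.normal(0, 0.01, (H, V))            # hidden-to-output

def rnn_step(h_prev, token_id):
    """One recurrence: fold the current token into the hidden state,
    then read out a distribution over the next token."""
    h_t = np.tanh(h_prev @ W_h + E[token_id] @ W_x)   # h_t = f(W_h h_{t-1} + W_x x_t)
    logits = h_t @ W_y
    logits -= logits.max()
    p = np.exp(logits)
    return h_t, p / p.sum()                           # y_t = softmax(W_y h_t)

h = np.zeros(H)
for tok in [12, 7, 431, 9]:        # strictly sequential: step t needs step t-1
    h, y = rnn_step(h, tok)
```

Note the loop: this sequential dependency is the parallelization problem listed below.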
Problems:
- Vanishing gradients: Gradients shrink exponentially, limiting effective memory
- Sequential computation: Can’t parallelize—each step depends on the previous
- Practical context: Only ~10-20 tokens effectively remembered
Long Short-Term Memory (LSTM)
LSTMs add gating mechanisms to control information flow:
- Forget gate: What to discard from cell state
- Input gate: What new information to add
- Output gate: What to expose to the next layer
fₜ = σ(W_f · [hₜ₋₁, xₜ] + b_f) # forget gate
iₜ = σ(W_i · [hₜ₋₁, xₜ] + b_i) # input gate
c̃ₜ = tanh(W_c · [hₜ₋₁, xₜ] + b_c) # candidate cell state
cₜ = fₜ * cₜ₋₁ + iₜ * c̃ₜ # new cell state
oₜ = σ(W_o · [hₜ₋₁, xₜ] + b_o) # output gate
hₜ = oₜ * tanh(cₜ) # hidden state
LSTMs capture longer dependencies than vanilla RNNs but still struggle beyond ~100 tokens.
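The same update, written as a NumPy sketch that mirrors the equations above (weight shapes and initialization are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, c_prev, x_t, W, b):
    """One LSTM step. Each W[k] has shape (hidden + input, hidden) and
    acts on the concatenated [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    f_t = sigmoid(z @ W['f'] + b['f'])         # forget gate
    i_t = sigmoid(z @ W['i'] + b['i'])         # input gate
    c_cand = np.tanh(z @ W['c'] + b['c'])      # candidate cell state
    c_t = f_t * c_prev + i_t * c_cand          # new cell state
    o_t = sigmoid(z @ W['o'] + b['o'])         # output gate
    h_t = o_t * np.tanh(c_t)                   # hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
H, D = 128, 64                                 # hidden size, input size
W = {k: rng.normal(0, 0.01, (H + D, H)) for k in 'fico'}
b = {k: np.zeros(H) for k in 'fico'}
h, c = lstm_step(np.zeros(H), np.zeros(H), rng.normal(size=D), W, b)
```

The additive cell-state update (`f_t * c_prev + i_t * c_cand`) is what lets gradients survive longer than in the vanilla RNN.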
Transformers (Vaswani et al., 2017)
The transformer architecture abandoned recurrence entirely:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
Key innovations:
Self-attention: Every position attends to every other position directly. No sequential bottleneck.
Parallelization: All positions computed simultaneously during training. Massive speedups on GPUs.
Positional encoding: Since there’s no recurrence, position information must be explicitly injected.
The attention mechanism computes: “For each position, how relevant is every other position?” This allows direct information flow across arbitrary distances.
Complexity trade-off: O(n²) computation and memory for sequence length n. Long contexts are expensive.
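A minimal NumPy sketch of scaled dot-product attention; the causal mask is an addition used by decoder-style LMs and is not shown in the formula above. The (n, n) weight matrix it materializes is exactly where the O(n²) cost comes from.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=True):
    """Q, K, V: (n, d_k). Each output position is a weighted average of all
    value vectors; the (n, n) score matrix is the source of the O(n^2) cost."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)                   # (n, n): every position vs. every other
    if causal:                                        # decoder LMs may not attend to the future
        scores = np.where(np.tril(np.ones((n, n), dtype=bool)), scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                # softmax(QK^T / sqrt(d_k)) V

rng = np.random.default_rng(0)
n, d_k = 16, 64
X = rng.normal(size=(n, d_k))                         # token representations
out = scaled_dot_product_attention(X, X, X)           # self-attention: Q = K = V
```

All n outputs are computed in one pass of matrix multiplications, with no loop over time steps, which is what makes training parallelizable.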
Modern Large Language Models
Scale
The scaling has been remarkable:
| Model | Parameters | Training Data |
|---|---|---|
| GPT-2 (2019) | 1.5B | 40GB text |
| GPT-3 (2020) | 175B | 570GB text |
| GPT-4 (2023) | ~1T? | Unknown |
| LLaMA (2023) | 7B-65B | 1-1.4T tokens |
| Claude 3 (2024) | Unknown | Unknown |
Each generation scales parameters, data, and compute roughly together.
Training Paradigm
Modern LLM training has three phases:
Pretraining: Predict next token on massive text corpora (books, web, code). Learn general language understanding.
Supervised Fine-tuning (SFT): Train on curated examples of helpful assistant behavior.
RLHF: Reinforcement Learning from Human Feedback. Train a reward model from human preferences, then optimize the LLM against this reward.
The result: models that are not just good predictors, but helpful assistants.
Emergent Capabilities
Scale brings emergent abilities:
- In-context learning: Learn from examples in the prompt without weight updates (see the example after this list)
- Chain-of-thought reasoning: Produce intermediate steps that improve final answers
- Instruction following: Generalize from instructions, not just examples
- Code generation: Write and debug programs
These weren’t explicitly trained for—they emerged from scale.
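As a concrete (hypothetical) illustration of in-context learning, a few-shot prompt puts the "training examples" directly in the context and lets the model continue the pattern:

```python
# A hypothetical few-shot prompt: the "training examples" live in the context.
prompt = """Translate English to French.
sea otter -> loutre de mer
cheese -> fromage
plush giraffe ->"""
# A sufficiently large model typically continues with "girafe en peluche",
# purely via next-token prediction, with no gradient updates.
```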
Connection to Classical Methods
Despite their complexity, neural LMs are still doing sequential prediction:
The Compression Connection
Cross-entropy loss = -Σ log P(xₙ | x<n)
With base-2 logarithms, this is exactly the idealized code length of an arithmetic coder driven by the model's predictions. Neural LMs are implicitly learning to compress.
State-of-the-art LLMs achieve ~1.5 bits per character on English text—approaching the estimated entropy of English.
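A small sketch of that correspondence, with stand-in probabilities rather than a real model: the average negative base-2 log-probability the model assigns to the observed symbols is the idealized arithmetic-coding cost in bits per symbol.

```python
import numpy as np

def bits_per_symbol(p_observed):
    """p_observed[i] = probability the model assigned to the symbol that
    actually occurred at step i, i.e. P(x_i | x_<i). The sum of -log2 p is
    the idealized arithmetic-coding code length; the mean is bits/symbol."""
    return -np.log2(p_observed).sum() / len(p_observed)

# Stand-in numbers: a model assigning these probabilities to the observed
# characters would compress them to roughly 1 bit per character.
print(bits_per_symbol(np.array([0.5, 0.4, 0.6, 0.45, 0.5])))   # ~1.04
```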
What Neural Models Learn
Neural LMs implicitly learn:
- Syntax and grammar (structure)
- Semantic relationships (meaning)
- World knowledge (facts)
- Reasoning patterns (debated)
They can be seen as tackling the same prediction problem that CTW solves over a restricted model class, but with a vastly more flexible, higher-capacity family of models.
The Inductive Bias Trade-off
| Method | Inductive Bias | Data Needed |
|---|---|---|
| CTW | Tree-structured sources | Thousands of symbols |
| N-grams | Fixed-length context | Millions of tokens |
| Neural LMs | Minimal | Billions of tokens |
Neural models trade inductive bias for flexibility. With enough data, this wins. With limited data, CTW wins.
Limitations of Neural LMs
Modern LLMs are remarkable, but have clear limitations:
Hallucination
LLMs generate plausible but false content. They model the statistical form of text; nothing in the training objective grounds their output in truth.
No Uncertainty Quantification
Unlike Bayesian methods such as CTW, which average over a posterior, standard neural LMs are trained to a single point estimate of their parameters and give no measure of model uncertainty. They can be confidently wrong with no warning.
Black Box
We don’t understand how they work internally. Mechanistic interpretability is an active research area but we’re far from complete understanding.
Compute Cost
Training GPT-4 reportedly cost >$100M. Inference at scale requires significant GPU resources.
Alignment Challenges
Making LLMs reliably helpful, harmless, and honest is an unsolved problem. RLHF helps but doesn’t guarantee safety.
When to Use Neural LMs vs. CTW
Use Neural LMs When:
- Large vocabulary (natural language, code)
- Massive training data available
- Long-range dependencies matter
- State-of-the-art performance required
- Can afford compute costs
Use CTW When:
- Binary or small alphabet
- Limited training data
- Need theoretical guarantees
- Interpretability required
- Compute constrained
- The source is plausibly tree-structured
Hybrid Approaches
The future may combine strengths:
CTW for low-resource adaptation: Use CTW-style Bayesian methods when fine-tuning data is limited.
Compression-based model selection: Use description length (related to CTW) to choose among neural models.
Bayesian neural networks: Add uncertainty estimation to neural predictions.
Neural architecture search: Learn architectures with appropriate inductive biases.
Key Takeaways
- Neural LMs evolved from feedforward → RNN → LSTM → Transformer
- Self-attention enables direct long-range dependencies
- Scale (parameters, data, compute) drives emergent capabilities
- Neural LMs are still doing sequential prediction—just with massive flexibility
- Trade-off: Flexibility vs. sample efficiency
- CTW and neural LMs serve different regimes
What’s Next
The final post presents experimental results validating CTW’s theoretical properties. We’ll see that when depth matches the true Markov order, CTW achieves the Bayes-optimal accuracy—confirming theory with practice.
Further Reading
- Vaswani et al. (2017). “Attention Is All You Need.”
- Radford et al. (2019). “Language Models are Unsupervised Multitask Learners.”
- Brown et al. (2020). “Language Models are Few-Shot Learners.”
- Jurafsky & Martin. Speech and Language Processing. Chapters 7-10.