Neural language models have revolutionized how we approach sequential prediction. GPT, Claude, LLaMA—these systems achieve remarkable performance on tasks that seemed impossible a decade ago. In this post, we trace the architectural evolution from simple feedforward networks to transformers, connecting modern LLMs back to the information-theoretic foundations we developed earlier in this series.
The Core Task
Every neural language model solves the same fundamental problem:
Given x₁, x₂, ..., xₙ₋₁, estimate P(xₙ | x₁, ..., xₙ₋₁)
This is exactly the sequential prediction problem from our first post. The difference is how neural models parameterize the conditional distribution.
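To make that interface concrete, here is a minimal NumPy sketch (names and shapes are ours, purely illustrative): a language model is anything that maps a context to a probability vector over the vocabulary.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def next_token_distribution(context_ids, score_fn, vocab_size):
    """score_fn is whatever model we choose below: it maps a context
    to one unnormalized score (logit) per vocabulary item."""
    logits = score_fn(context_ids)       # shape: (vocab_size,)
    return softmax(logits)               # P(x_n | x_1, ..., x_{n-1})

# e.g. a trivial "uniform" baseline model:
uniform = next_token_distribution([1, 2, 3], lambda ctx: np.zeros(50), vocab_size=50)
```

Every architecture in this post is a different choice of `score_fn`.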
Evolution of Architectures
Feedforward Neural LMs (Bengio et al., 2003)
The first neural language models used feedforward networks:
Input: one-hot encodings of last N words
↓
Embedding layer: learn dense representations
↓
Hidden layers: nonlinear transformations
↓
Output: softmax over vocabulary
This is essentially a neural n-gram—fixed context window, but with learned representations that allow similar contexts to share information.
Key innovation: Word embeddings. Instead of treating each word as an atomic symbol, learn a dense vector where similar words have similar representations.
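As a rough sketch of the forward pass (illustrative sizes and initialization, not Bengio et al.'s exact parameterization):

```python
import numpy as np

rng = np.random.default_rng(0)
V, N, d, H = 10_000, 4, 64, 128          # vocab size, context length, embedding dim, hidden dim
E   = rng.normal(0, 0.01, (V, d))        # embedding table (learned)
W_h = rng.normal(0, 0.01, (N * d, H))    # hidden layer
W_o = rng.normal(0, 0.01, (H, V))        # output projection

def feedforward_lm(context_ids):
    """Neural n-gram: embed the last N tokens, concatenate the embeddings,
    apply a nonlinearity, then a softmax over the whole vocabulary."""
    x = E[context_ids].reshape(-1)       # (N*d,) concatenated embeddings
    hidden = np.tanh(x @ W_h)            # (H,)
    logits = hidden @ W_o                # (V,)
    logits -= logits.max()               # numerical stability
    p = np.exp(logits)
    return p / p.sum()                   # P(next token | last N tokens)

probs = feedforward_lm(np.array([12, 7, 431, 9]))   # any N token ids
```

The fixed context length N is baked into the weight shapes, which is exactly the n-gram limitation the later architectures remove.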
Recurrent Neural Networks (RNNs)
RNNs process sequences with a hidden state that evolves over time:
hₜ = f(W_h × hₜ₋₁ + W_x × xₜ)
yₜ = softmax(W_y × hₜ)
The hidden state hₜ theoretically carries information from the entire history, not just the last N words.
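A single step of this recurrence, sketched in NumPy with illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, H = 10_000, 64, 128                    # vocab size, embedding dim, hidden dim
E   = rng.normal(0, 0.01, (V, d))            # token embeddings
W_h = rng.normal(0, 0.01, (H, H))            # hidden-to-hidden
W_x = rng.normal(0, 0.01, (d, H))            # input-to-hidden
W_y = rng.normal(0, 0.01, (H, V))            # hidden-to-output

def rnn_step(h_prev, token_id):
    """One recurrence: fold the current token into the hidden state,
    then read out a distribution over the next token."""
    h_t = np.tanh(h_prev @ W_h + E[token_id] @ W_x)   # h_t = f(W_h h_{t-1} + W_x x_t)
    logits = h_t @ W_y
    logits -= logits.max()
    p = np.exp(logits)
    return h_t, p / p.sum()                           # y_t = softmax(W_y h_t)

h = np.zeros(H)
for tok in [12, 7, 431, 9]:        # strictly sequential: step t needs step t-1
    h, y = rnn_step(h, tok)
```

Note the loop: this sequential dependency is the parallelization problem listed below.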
Problems:
- Vanishing gradients: Gradients shrink exponentially, limiting effective memory
- Sequential computation: Can’t parallelize—each step depends on the previous
- Practical context: Only ~10-20 tokens effectively remembered
Long Short-Term Memory (LSTM)
LSTMs add gating mechanisms to control information flow:
- Forget gate: What to discard from cell state
- Input gate: What new information to add
- Output gate: What to expose to the next layer
fₜ = σ(W_f · [hₜ₋₁, xₜ] + b_f) # forget gate
iₜ = σ(W_i · [hₜ₋₁, xₜ] + b_i) # input gate
c̃ₜ = tanh(W_c · [hₜ₋₁, xₜ] + b_c) # candidate cell state
cₜ = fₜ * cₜ₋₁ + iₜ * c̃ₜ # new cell state
oₜ = σ(W_o · [hₜ₋₁, xₜ] + b_o) # output gate
hₜ = oₜ * tanh(cₜ) # hidden state
LSTMs capture longer dependencies than vanilla RNNs but still struggle beyond ~100 tokens.
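The same update, written as a NumPy sketch that mirrors the equations above (weight shapes and initialization are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, c_prev, x_t, W, b):
    """One LSTM step. Each W[k] has shape (hidden + input, hidden) and
    acts on the concatenated [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    f_t = sigmoid(z @ W['f'] + b['f'])         # forget gate
    i_t = sigmoid(z @ W['i'] + b['i'])         # input gate
    c_cand = np.tanh(z @ W['c'] + b['c'])      # candidate cell state
    c_t = f_t * c_prev + i_t * c_cand          # new cell state
    o_t = sigmoid(z @ W['o'] + b['o'])         # output gate
    h_t = o_t * np.tanh(c_t)                   # hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
H, D = 128, 64                                 # hidden size, input size
W = {k: rng.normal(0, 0.01, (H + D, H)) for k in 'fico'}
b = {k: np.zeros(H) for k in 'fico'}
h, c = lstm_step(np.zeros(H), np.zeros(H), rng.normal(size=D), W, b)
```

The additive cell-state update (`f_t * c_prev + i_t * c_cand`) is what lets gradients survive longer than in the vanilla RNN.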
Transformers (Vaswani et al., 2017)
The transformer architecture abandoned recurrence entirely:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
Key innovations:
Self-attention: Every position attends to every other position directly. No sequential bottleneck.
Parallelization: All positions computed simultaneously during training. Massive speedups on GPUs.
Positional encoding: Since there’s no recurrence, position information must be explicitly injected.
The attention mechanism computes: “For each position, how relevant is every other position?” This allows direct information flow across arbitrary distances.
Complexity trade-off: O(n²) computation and memory for sequence length n. Long contexts are expensive.
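A minimal NumPy sketch of scaled dot-product attention; the causal mask is an addition used by decoder-style LMs and is not shown in the formula above. The (n, n) weight matrix it materializes is exactly where the O(n²) cost comes from.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=True):
    """Q, K, V: (n, d_k). Each output position is a weighted average of all
    value vectors; the (n, n) score matrix is the source of the O(n^2) cost."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)                   # (n, n): every position vs. every other
    if causal:                                        # decoder LMs may not attend to the future
        scores = np.where(np.tril(np.ones((n, n), dtype=bool)), scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                # softmax(QK^T / sqrt(d_k)) V

rng = np.random.default_rng(0)
n, d_k = 16, 64
X = rng.normal(size=(n, d_k))                         # token representations
out = scaled_dot_product_attention(X, X, X)           # self-attention: Q = K = V
```

All n outputs are computed in one pass of matrix multiplications, with no loop over time steps, which is what makes training parallelizable.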
Modern Large Language Models
Scale
The scaling has been remarkable:
| Model | Parameters | Training Data |
|---|---|---|
| GPT-2 (2019) | 1.5B | 40GB text |
| GPT-3 (2020) | 175B | 570GB text |
| GPT-4 (2023) | ~1T? | Unknown |
| LLaMA (2023) | 7B-65B | 1-1.4T tokens |
| Claude 3 (2024) | Unknown | Unknown |
Each generation scales parameters, data, and compute roughly together.
Training Paradigm
Modern LLM training has three phases:
Pretraining: Predict next token on massive text corpora (books, web, code). Learn general language understanding.
Supervised Fine-tuning (SFT): Train on curated examples of helpful assistant behavior.
RLHF: Reinforcement Learning from Human Feedback. Train a reward model from human preferences, then optimize the LLM against this reward.
The result: models that are not just good predictors, but helpful assistants.
Emergent Capabilities
Scale brings emergent abilities:
- In-context learning: Learn from examples in the prompt without weight updates (see the example after this list)
- Chain-of-thought reasoning: Produce intermediate steps that improve final answers
- Instruction following: Generalize from instructions, not just examples
- Code generation: Write and debug programs
These weren’t explicitly trained for—they emerged from scale.
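As a concrete (hypothetical) illustration of in-context learning, a few-shot prompt puts the "training examples" directly in the context and lets the model continue the pattern:

```python
# A hypothetical few-shot prompt: the "training examples" live in the context.
prompt = """Translate English to French.
sea otter -> loutre de mer
cheese -> fromage
plush giraffe ->"""
# A sufficiently large model typically continues with "girafe en peluche",
# purely via next-token prediction, with no gradient updates.
```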
Connection to Classical Methods
Despite their complexity, neural LMs are still doing sequential prediction:
The Compression Connection
Cross-entropy loss = -Σ log P(xₙ | x<n)
With base-2 logarithms, this is exactly the idealized code length of an arithmetic coder driven by the model's predictions. Neural LMs are implicitly learning to compress.
State-of-the-art LLMs achieve ~1.5 bits per character on English text—approaching the estimated entropy of English.
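A small sketch of that correspondence, with stand-in probabilities rather than a real model: the average negative base-2 log-probability the model assigns to the observed symbols is the idealized arithmetic-coding cost in bits per symbol.

```python
import numpy as np

def bits_per_symbol(p_observed):
    """p_observed[i] = probability the model assigned to the symbol that
    actually occurred at step i, i.e. P(x_i | x_<i). The sum of -log2 p is
    the idealized arithmetic-coding code length; the mean is bits/symbol."""
    return -np.log2(p_observed).sum() / len(p_observed)

# Stand-in numbers: a model assigning these probabilities to the observed
# characters would compress them to roughly 1 bit per character.
print(bits_per_symbol(np.array([0.5, 0.4, 0.6, 0.45, 0.5])))   # ~1.04
```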
What Neural Models Learn
Neural LMs implicitly learn:
- Syntax and grammar (structure)
- Semantic relationships (meaning)
- World knowledge (facts)
- Reasoning patterns (debated)
They can be seen as tackling the same prediction problem that CTW solves over a restricted model class, but with a vastly more flexible, higher-capacity family of models.
The Inductive Bias Trade-off
| Method | Inductive Bias | Data Needed |
|---|---|---|
| CTW | Tree-structured sources | Thousands of symbols |
| N-grams | Fixed-length context | Millions of tokens |
| Neural LMs | Minimal | Billions of tokens |
Neural models trade inductive bias for flexibility. With enough data, this wins. With limited data, CTW wins.
Limitations of Neural LMs
Modern LLMs are remarkable, but have clear limitations:
Hallucination
LLMs generate plausible but false content. They model the statistical form of text; nothing in the training objective grounds their output in truth.
No Uncertainty Quantification
Unlike Bayesian methods such as CTW, which average over a posterior, standard neural LMs are trained to a single point estimate of their parameters and give no measure of model uncertainty. They can be confidently wrong with no warning.
Black Box
We don’t understand how they work internally. Mechanistic interpretability is an active research area but we’re far from complete understanding.
Compute Cost
Training GPT-4 reportedly cost >$100M. Inference at scale requires significant GPU resources.
Alignment Challenges
Making LLMs reliably helpful, harmless, and honest is an unsolved problem. RLHF helps but doesn’t guarantee safety.
When to Use Neural LMs vs. CTW
Use Neural LMs When:
- Large vocabulary (natural language, code)
- Massive training data available
- Long-range dependencies matter
- State-of-the-art performance required
- Can afford compute costs
Use CTW When:
- Binary or small alphabet
- Limited training data
- Need theoretical guarantees
- Interpretability required
- Compute constrained
- The source is plausibly tree-structured
Hybrid Approaches
The future may combine strengths:
CTW for low-resource adaptation: Use CTW-style Bayesian methods when fine-tuning data is limited.
Compression-based model selection: Use description length (related to CTW) to choose among neural models.
Bayesian neural networks: Add uncertainty estimation to neural predictions.
Neural architecture search: Learn architectures with appropriate inductive biases.
Key Takeaways
- Neural LMs evolved from feedforward → RNN → LSTM → Transformer
- Self-attention enables direct long-range dependencies
- Scale (parameters, data, compute) drives emergent capabilities
- Neural LMs are still doing sequential prediction—just with massive flexibility
- Trade-off: Flexibility vs. sample efficiency
- CTW and neural LMs serve different regimes
What’s Next
The final post presents experimental results validating CTW’s theoretical properties. We’ll see that when depth matches the true Markov order, CTW achieves the Bayes-optimal accuracy—confirming theory with practice.
Further Reading
- Vaswani et al. (2017). “Attention Is All You Need.”
- Radford et al. (2019). “Language Models are Unsupervised Multitask Learners.”
- Brown et al. (2020). “Language Models are Few-Shot Learners.”
- Jurafsky & Martin. Speech and Language Processing. Chapters 7-10.