
Neural Language Models: From RNNs to Transformers

Neural language models have revolutionized how we approach sequential prediction. GPT, Claude, LLaMA—these systems achieve remarkable performance on tasks that seemed impossible a decade ago. In this post, we trace the architectural evolution from simple feedforward networks to transformers, connecting modern LLMs back to the information-theoretic foundations we developed earlier in this series.

The Core Task

Every neural language model solves the same fundamental problem:

Given x₁, x₂, ..., xₙ₋₁, estimate P(xₙ | x₁, ..., xₙ₋₁)

This is exactly the sequential prediction problem from our first post. The difference is how neural models parameterize the conditional distribution.
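
As a toy illustration (the corpus and smoothing constant below are made up for this sketch, not taken from the series), a count-based character bigram model estimates exactly this kind of conditional distribution:

from collections import Counter, defaultdict

def train_bigram(text):
    """Count character bigrams to estimate P(next char | previous char)."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return counts

def predict(counts, prev, alphabet="abcdefghijklmnopqrstuvwxyz ", alpha=1.0):
    """Smoothed estimate of P(next = c | prev) for every c in the alphabet."""
    total = sum(counts[prev].values()) + alpha * len(alphabet)
    return {c: (counts[prev][c] + alpha) / total for c in alphabet}

counts = train_bigram("the theory of the thing")
print(predict(counts, "t")["h"])   # 'h' is likely after 't' in this tiny corpus

Neural models replace the count table with learned parameters, but the quantity being estimated is the same.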

Evolution of Architectures

Feedforward Neural LMs (Bengio et al., 2003)

The first neural language models used feedforward networks:

Input: one-hot encodings of last N words
Embedding layer: learn dense representations
Hidden layers: nonlinear transformations
Output: softmax over vocabulary

This is essentially a neural n-gram—fixed context window, but with learned representations that allow similar contexts to share information.

Key innovation: Word embeddings. Instead of treating each word as an atomic symbol, learn a dense vector where similar words have similar representations.
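
A minimal NumPy sketch of this architecture's forward pass (vocabulary size, context length, and layer sizes are illustrative assumptions, and the weights are random rather than trained):

import numpy as np

V, N, d_emb, d_hid = 10_000, 4, 64, 128      # vocab, context length, dims (illustrative)
rng = np.random.default_rng(0)

E   = rng.normal(0, 0.1, (V, d_emb))         # embedding matrix
W_h = rng.normal(0, 0.1, (N * d_emb, d_hid)) # hidden layer
W_o = rng.normal(0, 0.1, (d_hid, V))         # output projection

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def feedforward_lm(context_ids):
    """P(next word | last N words) for a Bengio-style feedforward LM."""
    x = E[context_ids].reshape(-1)           # look up and concatenate N embeddings
    h = np.tanh(x @ W_h)                     # nonlinear hidden layer
    return softmax(h @ W_o)                  # distribution over the vocabulary

probs = feedforward_lm([12, 7, 431, 2])      # four arbitrary word ids
print(probs.shape, probs.sum())              # (10000,) 1.0

Even in this sketch, words whose rows in E are similar produce similar hidden activations, which is exactly the sharing between contexts described above.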

Recurrent Neural Networks (RNNs)

RNNs process sequences with a hidden state that evolves over time:

hₜ = f(W_h × hₜ₋₁ + W_x × xₜ)
yₜ = softmax(W_y × hₜ)

In principle, the hidden state hₜ carries information from the entire history, not just the last N words.
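
A minimal NumPy sketch of one recurrence step under these equations (the vocabulary size, hidden size, and random weights are illustrative assumptions; xₜ is a one-hot input):

import numpy as np

V, d = 50, 32                                # vocab size and hidden size (illustrative)
rng = np.random.default_rng(0)
W_h = rng.normal(0, 0.1, (d, d))
W_x = rng.normal(0, 0.1, (d, V))
W_y = rng.normal(0, 0.1, (V, d))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(h_prev, x_t):
    """One step: update the hidden state, then predict the next token."""
    h_t = np.tanh(W_h @ h_prev + W_x @ x_t)
    y_t = softmax(W_y @ h_t)
    return h_t, y_t

h = np.zeros(d)
x = np.eye(V)[3]                             # one-hot encoding of token id 3
h, y = rnn_step(h, x)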

Problems:

  • Vanishing gradients: Gradients shrink exponentially, limiting effective memory
  • Sequential computation: Can’t parallelize—each step depends on the previous
  • Practical context: Only ~10-20 tokens effectively remembered

Long Short-Term Memory (LSTM)

LSTMs add gating mechanisms to control information flow:

  • Forget gate: What to discard from cell state
  • Input gate: What new information to add
  • Output gate: What to expose to the next layer

fₜ = σ(W_f · [hₜ₋₁, xₜ] + b_f)     # forget gate
iₜ = σ(W_i · [hₜ₋₁, xₜ] + b_i)     # input gate
c̃ₜ = tanh(W_c · [hₜ₋₁, xₜ] + b_c)  # candidate cell state
cₜ = fₜ * cₜ₋₁ + iₜ * c̃ₜ           # new cell state
oₜ = σ(W_o · [hₜ₋₁, xₜ] + b_o)     # output gate
hₜ = oₜ * tanh(cₜ)                  # hidden state
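
The same equations as a runnable NumPy step (the sizes and randomly initialized weights below are purely illustrative):

import numpy as np

d_in, d_hid = 16, 32                          # input and hidden sizes (illustrative)
rng = np.random.default_rng(0)
def make_W(): return rng.normal(0, 0.1, (d_hid, d_hid + d_in))
W_f, W_i, W_c, W_o = make_W(), make_W(), make_W(), make_W()
b_f = b_i = b_c = b_o = np.zeros(d_hid)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, c_prev, x_t):
    """One LSTM step following the gate equations above."""
    z = np.concatenate([h_prev, x_t])         # [h_{t-1}, x_t]
    f = sigmoid(W_f @ z + b_f)                # forget gate
    i = sigmoid(W_i @ z + b_i)                # input gate
    c_tilde = np.tanh(W_c @ z + b_c)          # candidate cell state
    c = f * c_prev + i * c_tilde              # new cell state
    o = sigmoid(W_o @ z + b_o)                # output gate
    h = o * np.tanh(c)                        # hidden state
    return h, c

h, c = np.zeros(d_hid), np.zeros(d_hid)
h, c = lstm_step(h, c, np.ones(d_in))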

LSTMs capture longer dependencies than vanilla RNNs but still struggle beyond ~100 tokens.

Transformers (Vaswani et al., 2017)

The transformer architecture abandoned recurrence entirely:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

Key innovations:

  1. Self-attention: Every position attends to every other position directly. No sequential bottleneck.

  2. Parallelization: All positions computed simultaneously during training. Massive speedups on GPUs.

  3. Positional encoding: Since there’s no recurrence, position information must be explicitly injected.

The attention mechanism computes: “For each position, how relevant is every other position?” This allows direct information flow across arbitrary distances.
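
A minimal NumPy implementation of the attention formula above (single head, no masking; the dimensions are illustrative):

import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (n, n): relevance of each position to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # weighted mix of value vectors

n, d_k, d_v = 5, 8, 8                                  # sequence length and dims (illustrative)
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(n, d)) for d in (d_k, d_k, d_v))
print(attention(Q, K, V).shape)                        # (5, 8)

The (n × n) score matrix in this sketch is where the quadratic cost discussed next comes from.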

Complexity trade-off: O(n²) computation and memory for sequence length n. Long contexts are expensive.

Modern Large Language Models

Scale

The scaling has been remarkable:

Model              Parameters   Training Data
GPT-2 (2019)       1.5B         40GB text
GPT-3 (2020)       175B         570GB text
GPT-4 (2023)       ~1T?         Unknown
LLaMA (2023)       7B-70B       1-1.4T tokens
Claude 3 (2024)    Unknown      Unknown

Each generation scales parameters, data, and compute roughly together.

Training Paradigm

Modern LLM training has three phases:

  1. Pretraining: Predict next token on massive text corpora (books, web, code). Learn general language understanding.

  2. Supervised Fine-tuning (SFT): Train on curated examples of helpful assistant behavior.

  3. RLHF: Reinforcement Learning from Human Feedback. Train a reward model from human preferences, then optimize the LLM against this reward.

The result: models that are not just good predictors, but helpful assistants.

Emergent Capabilities

Scale brings emergent abilities:

  • In-context learning: Learn from examples in the prompt without weight updates
  • Chain-of-thought reasoning: Produce intermediate steps that improve final answers
  • Instruction following: Generalize from instructions, not just examples
  • Code generation: Write and debug programs

These weren’t explicitly trained for—they emerged from scale.

Connection to Classical Methods

Despite their complexity, neural LMs are still doing sequential prediction:

The Compression Connection

Cross-entropy loss = -Σ log P(xₙ | x<n)

Up to the base of the logarithm, this is exactly the code length an arithmetic coder would assign using the same model. Neural LMs are implicitly learning to compress.

State-of-the-art LLMs achieve ~1.5 bits per character on English text—approaching the estimated entropy of English.
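
A sketch of that bookkeeping, converting a model's per-character probabilities into an average code length in bits (the probabilities below are made up for illustration; in practice they come from the model's softmax outputs):

import math

def bits_per_char(probs):
    """Average code length: -(1/n) * sum(log2 P(x_n | x_<n)).

    Each -log2 p is, to within a small constant overall, the number of bits an
    arithmetic coder would spend on that character under the same model.
    """
    return -sum(math.log2(p) for p in probs) / len(probs)

# Hypothetical per-character probabilities assigned by a language model:
model_probs = [0.31, 0.62, 0.05, 0.44, 0.29]
print(f"{bits_per_char(model_probs):.2f} bits/char")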

What Neural Models Learn

Neural LMs implicitly learn:

  • Syntax and grammar (structure)
  • Semantic relationships (meaning)
  • World knowledge (facts)
  • Reasoning patterns (debated)

They can be seen as learning a very flexible, high-capacity approximation to what CTW does with a restricted model class.

The Inductive Bias Trade-off

Method       Inductive Bias     Data Needed
CTW          Tree structures    Thousands
N-grams      Fixed context      Millions
Neural LMs   Minimal            Billions

Neural models trade inductive bias for flexibility. With enough data, this wins. With limited data, CTW wins.

Limitations of Neural LMs

Modern LLMs are remarkable, but have clear limitations:

Hallucination

LLMs generate plausible but false content. They model the form of fluent text extremely well, but nothing in the training objective grounds that text in truth.

No Uncertainty Quantification

Unlike Bayesian methods such as CTW, neural LMs produce point estimates of the next-token distribution. They can be confidently wrong with no warning.

Black Box

We don't understand how they work internally. Mechanistic interpretability is an active research area, but we're far from a complete understanding.

Compute Cost

Training GPT-4 reportedly cost >$100M. Inference at scale requires significant GPU resources.

Alignment Challenges

Making LLMs reliably helpful, harmless, and honest is an unsolved problem. RLHF helps but doesn’t guarantee safety.

When to Use Neural LMs vs. CTW

Use Neural LMs When:

  • Large vocabulary (natural language, code)
  • Massive training data available
  • Long-range dependencies matter
  • State-of-the-art performance required
  • Can afford compute costs

Use CTW When:

  • Binary or small alphabet
  • Limited training data
  • Need theoretical guarantees
  • Interpretability required
  • Compute constrained
  • The source is plausibly tree-structured

Hybrid Approaches

The future may combine strengths:

  1. CTW for low-resource adaptation: Use CTW-style Bayesian methods when fine-tuning data is limited.

  2. Compression-based model selection: Use description length (related to CTW) to choose among neural models.

  3. Bayesian neural networks: Add uncertainty estimation to neural predictions.

  4. Neural architecture search: Learn architectures with appropriate inductive biases.

Key Takeaways

  • Neural LMs evolved from feedforward → RNN → LSTM → Transformer
  • Self-attention enables direct long-range dependencies
  • Scale (parameters, data, compute) drives emergent capabilities
  • Neural LMs are still doing sequential prediction—just with massive flexibility
  • Trade-off: Flexibility vs. sample efficiency
  • CTW and neural LMs serve different regimes

What’s Next

The final post presents experimental results validating CTW's theoretical properties. We'll see that when the tree depth matches the true Markov order, CTW achieves Bayes-optimal accuracy, confirming theory with practice.

Further Reading

  • Vaswani et al. (2017). “Attention Is All You Need.”
  • Radford et al. (2019). “Language Models are Unsupervised Multitask Learners.”
  • Brown et al. (2020). “Language Models are Few-Shot Learners.”
  • Jurafsky & Martin. Speech and Language Processing. Chapters 7-10.
