How Automatic Differentiation Works
Source: vignettes/computational-graphs.Rmd
Why Automatic Differentiation?
In statistics, we constantly need derivatives:
- MLE: Find parameters that maximize the log-likelihood (gradient = 0)
- Standard errors: Derived from the Hessian (matrix of second derivatives)
- Newton’s method: Uses both gradient and Hessian for optimization
- Fisher information: The expected negative Hessian
There are three ways to compute derivatives:
| Method | Pros | Cons |
|---|---|---|
| Symbolic | Exact, closed-form | Explosion of terms, hard to automate |
| Numerical (finite differences) | Easy to implement | Approximation errors, slow for Hessians |
| Automatic differentiation | Exact, efficient, automated | Requires computational graph |
Femtograd uses automatic differentiation (AD), specifically reverse-mode AD (backpropagation) for gradients and forward-over-reverse AD for Hessians.
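As a quick illustration of the difference, the chunk below compares a central finite difference with an exact reverse-mode derivative for f(x) = x^2 at x = 3. This is only a sketch; it assumes the val(), backward(), and grad() functions introduced later in this vignette.
library(femtograd)

# Central finite difference for d/dx of x^2 at x = 3 (true value: 6)
f <- function(x) x^2
h <- 1e-6
(f(3 + h) - f(3 - h)) / (2 * h)  # roughly 6, limited by floating-point and truncation error

# Reverse-mode AD returns the exact derivative
x <- val(3)
z <- f(x)
backward(z)
grad(x)  # exactly 6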
The Computational Graph
Every computation in femtograd builds a directed acyclic graph (DAG).
Each node is a value object storing both a number and its
gradient.
Consider $z = x^2 + x y$, with intermediate values $v_3 = x^2$ and $v_4 = x y$:
            [z = v3 + v4]          <- output
             /         \
     [v3 = x^2]     [v4 = x*y]     <- intermediate
         |           /     \
        [x]        [x]     [y]     <- inputs (leaves)
In femtograd:
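The chunk below is a reconstruction of how this graph is built, one node at a time, assuming the val() constructor used in the Hessian example later in this vignette:
x <- val(3)    # leaf node
y <- val(4)    # leaf node

# Each operation creates an intermediate node and records its parents
v3 <- x^2      # node v3, prev = {x}
v4 <- x * y    # node v4, prev = {x, y}
z  <- v3 + v4  # output node z, prev = {v3, v4}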
The Forward Pass
When you write x^2 + x * y, femtograd executes each
operation and records the computation graph. This is the forward
pass: compute the function value while building the graph
structure.
Each operation creates a new value node that stores:
- data: The computed result (a matrix)
- grad: Gradient, initially zero
- prev: Pointers to parent nodes
- backward_fn: A closure that knows how to propagate gradients backward
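To make this concrete, here is a toy version of such a node in base R. This is an illustration only, not femtograd's actual internals: an environment gives the reference semantics needed so that gradients written during the backward pass are visible from the original variables.
make_value <- function(data, prev = list(), backward_fn = function() NULL) {
  node <- new.env()
  node$data        <- data         # the computed result
  node$grad        <- 0 * data     # gradient, initially zero, same shape as data
  node$prev        <- prev         # pointers to parent nodes
  node$backward_fn <- backward_fn  # closure that propagates gradients backward
  node
}

# An operation then wraps its result in a new node, e.g. multiplication:
mul_value <- function(a, b) {
  out <- make_value(a$data * b$data, prev = list(a, b))
  out$backward_fn <- function() {
    a$grad <- a$grad + b$data * out$grad  # d(ab)/da = b
    b$grad <- b$grad + a$data * out$grad  # d(ab)/db = a
  }
  out
}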
Reverse-Mode AD (Backpropagation)
To compute gradients, we traverse the graph in reverse topological order. This is the backward pass.
The key insight is the chain rule. For a composition $z = f(g(x))$:

$$\frac{dz}{dx} = \frac{dz}{dg} \cdot \frac{dg}{dx}$$
In reverse mode, we start from the output and propagate backward through each operation.
# After building the graph above, compute gradients:
backward(z)
# dz/dx = 2x + y = 2(3) + 4 = 10
grad(x)
#> [1] 10
# dz/dy = x = 3
grad(y)
#> [1] 3
Step by Step
Here is what backward(z) does internally:
- Initialize: Set grad(z) = 1, since $\partial z / \partial z = 1$
- Topological sort: Order the nodes so that each node comes after the nodes it depends on
- Reverse traverse: Visit the nodes in reverse order and call each node's backward_fn, which pushes its accumulated gradient to its parent nodes (see the sketch after this list)
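In code, the whole backward pass can be sketched as follows. This builds on the toy make_value() and mul_value() helpers above and is an illustration only, not femtograd's actual implementation:
backward_sketch <- function(output) {
  # 1. Topological sort: depth-first search over the prev pointers
  #    (the .visited flag is left on the nodes; good enough for a one-shot sketch)
  ordered <- list()
  visit <- function(node) {
    if (isTRUE(node$.visited)) return(invisible(NULL))
    node$.visited <- TRUE
    for (parent in node$prev) visit(parent)
    ordered[[length(ordered) + 1]] <<- node
  }
  visit(output)

  # 2. Initialize: the derivative of the output with respect to itself is 1
  output$grad <- 1

  # 3. Reverse traverse: each node pushes its gradient to its parents
  for (node in rev(ordered)) node$backward_fn()
}

# For example, with the toy nodes above:
x <- make_value(3); y <- make_value(4)
v4 <- mul_value(x, y)
backward_sketch(v4)
x$grad  # d(xy)/dx = y = 4
y$grad  # d(xy)/dy = x = 3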
For our example, $z = x^2 + x y$ with $x = 3$ and $y = 4$:
| Step | Node | Local derivative | Accumulated gradient |
|---|---|---|---|
| Init | z | — | grad(z) = 1 |
| 1 | z = v3 + v4 | $\partial z / \partial v_3 = 1$, $\partial z / \partial v_4 = 1$ | grad(v3) += 1, grad(v4) += 1 |
| 2 | v4 = x * y | $\partial v_4 / \partial x = y = 4$, $\partial v_4 / \partial y = x = 3$ | grad(x) += 1 * 4 = 4, grad(y) += 1 * 3 = 3 |
| 3 | v3 = x^2 | $\partial v_3 / \partial x = 2x = 6$ | grad(x) += 1 * 6 = 6 |
| Result | | | grad(x) = 10, grad(y) = 3 |
Why Reverse Mode?
For a function $f: \mathbb{R}^n \to \mathbb{R}$ (scalar output, many inputs), reverse mode computes all $n$ partial derivatives in a single backward pass.
This is exactly the setting for MLE: the log-likelihood is a scalar function of parameters. One backward pass gives us the entire gradient vector.
Forward mode, by contrast, computes one directional derivative per pass; it would need $n$ passes for the full gradient.
| Mode | Cost | Best for |
|---|---|---|
| Forward | $n$ passes for $n$ inputs | Few inputs, many outputs |
| Reverse | 1 pass for any $n$ | Many inputs, scalar output (MLE!) |
Second Derivatives: Forward-over-Reverse
For Hessians (second-derivative matrices), femtograd uses forward-over-reverse AD. This nests forward-mode AD inside reverse-mode:
- Forward mode (using dual numbers): Propagate tangent vectors through the computation
- Reverse mode: Differentiate the tangent computation to get second derivatives
# Compute Hessian of f(x,y) = x^2 + x*y
f <- function(params) {
x <- params[[1]]
y <- params[[2]]
x^2 + x * y
}
H <- hessian(f, list(val(3), val(4)))
H
#>      [,1] [,2]
#> [1,]    2    1
#> [2,]    1    0
The Hessian is:

$$H = \begin{pmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \, \partial y} \\ \frac{\partial^2 f}{\partial y \, \partial x} & \frac{\partial^2 f}{\partial y^2} \end{pmatrix} = \begin{pmatrix} 2 & 1 \\ 1 & 0 \end{pmatrix}$$
Dual Numbers
Forward-mode AD uses dual numbers: pairs $\langle a, \dot{a} \rangle$ where $a$ is the primal value and $\dot{a}$ is the tangent (directional derivative).
Arithmetic on dual numbers automatically propagates derivatives:

$$\langle a, \dot{a} \rangle + \langle b, \dot{b} \rangle = \langle a + b,\ \dot{a} + \dot{b} \rangle, \qquad \langle a, \dot{a} \rangle \cdot \langle b, \dot{b} \rangle = \langle a b,\ a \dot{b} + \dot{a} b \rangle$$
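A minimal numeric sketch of that multiplication rule (standalone for illustration; femtograd's own dual type wraps value objects, as described next):
dual <- function(primal, tangent) list(primal = primal, tangent = tangent)

dual_mul <- function(a, b) {
  dual(
    primal  = a$primal * b$primal,                          # value of the product
    tangent = a$primal * b$tangent + a$tangent * b$primal   # product rule
  )
}

x <- dual(3, 1)  # tangent 1: differentiate with respect to x
y <- dual(4, 0)
dual_mul(x, y)$tangent  # d(xy)/dx = y = 4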
In femtograd, the primal is a value object, so the
tangent computation builds its own computational graph. Running
backward() on the tangent then gives the second
derivative.
Connection to Statistics
Fisher Information and Standard Errors
At the MLE $\hat{\theta}$, the observed Fisher information matrix is:

$$\mathcal{I}(\hat{\theta}) = -H(\hat{\theta})$$

where $H(\hat{\theta})$ is the Hessian of the log-likelihood. The variance-covariance matrix of the MLE is approximately:

$$\widehat{\mathrm{Var}}(\hat{\theta}) \approx \mathcal{I}(\hat{\theta})^{-1}$$
Standard errors are the square roots of the diagonal of this matrix.
# Generate data
set.seed(42)
x <- rnorm(200, mean = 5, sd = 2)
# Fit model - femtograd computes the Hessian automatically
result <- fit(
function(mu, log_sigma) {
sigma <- exp(log_sigma)
loglik_normal(mu, sigma, x)
},
params = c(mu = 0, log_sigma = 0)
)
# Standard errors from inverse Fisher information
se(result)
#> mu log_sigma
#> 0.1374822 0.0500000
# For comparison: theoretical SE for mu is sigma/sqrt(n)
2 / sqrt(200)
#> [1] 0.1414214
Newton-Raphson Optimization
Newton’s method uses the Hessian to take optimal steps:

$$\theta_{t+1} = \theta_t - H(\theta_t)^{-1} \, \nabla \ell(\theta_t)$$
This converges quadratically near the optimum, while gradient-only
methods converge linearly. Femtograd’s method = "newton"
option in fit() uses this approach.
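For example, the normal model fitted above could be re-fitted with Newton steps, re-using the simulated x and the same parameterisation; only the method argument changes:
result_newton <- fit(
  function(mu, log_sigma) {
    sigma <- exp(log_sigma)
    loglik_normal(mu, sigma, x)
  },
  params = c(mu = 0, log_sigma = 0),
  method = "newton"
)
se(result_newton)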
Summary
| Concept | In femtograd | Purpose |
|---|---|---|
| Forward pass | z <- x^2 + x*y | Build graph, compute value |
| Backward pass | backward(z) | Compute all gradients |
| Gradient access | grad(x) | Get $\partial z / \partial x$ |
| Hessian | hessian(f, params) | Second derivatives for SEs |
| Dual numbers | Internal to hessian() | Forward-over-reverse AD |
| Fisher information | observed_info(result) | $-H(\hat{\theta})$ at MLE |
The computational graph approach gives us exact derivatives (no numerical approximation) efficiently (one backward pass for all parameters), which is precisely what statistical inference requires.