
Why Automatic Differentiation?

In statistics, we constantly need derivatives:

  • MLE: Find parameters that maximize the log-likelihood (gradient = 0)
  • Standard errors: Derived from the Hessian (matrix of second derivatives)
  • Newton’s method: Uses both gradient and Hessian for optimization
  • Fisher information: The expected negative Hessian of the log-likelihood

There are three ways to compute derivatives:

Method | Pros | Cons
Symbolic | Exact, closed-form | Explosion of terms, hard to automate
Numerical (finite differences) | Easy to implement | Approximation errors, slow for Hessians
Automatic differentiation | Exact, efficient, automated | Requires computational graph
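
To make the "approximation errors" trade-off concrete, here is a quick base-R sketch (not femtograd code) of a forward finite difference for f(x) = x^2 at x = 3, where the exact derivative is 6:

# Forward finite difference: (f(x + h) - f(x)) / h
f <- function(x) x^2
h <- 1e-5
(f(3 + h) - f(3)) / h   # roughly 6.00001: the exact answer 6 plus O(h) error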

Femtograd uses automatic differentiation (AD), specifically reverse-mode AD (backpropagation) for gradients and forward-over-reverse AD for Hessians.

The Computational Graph

Every computation in femtograd builds a directed acyclic graph (DAG). Each node is a value object storing both a number and its gradient.

Consider $f(x, y) = x^2 + x \cdot y$:

      [z = v3 + v4]        <- output
       /          \
  [v3 = x^2]   [v4 = x*y]  <- intermediate
      |          /    \
     [x]       [x]   [y]    <- inputs (leaves)

In femtograd:

x <- val(3, label = "x")
y <- val(4, label = "y")

v3 <- x^2         # v3 = 9
v4 <- x * y       # v4 = 12
z  <- v3 + v4     # z  = 21

get_data(z)
#> [1] 21

The Forward Pass

When you write x^2 + x * y, femtograd executes each operation and records the computation graph. This is the forward pass: compute the function value while building the graph structure.

Each operation creates a new value node that stores:

  • data: The computed result (a matrix)
  • grad: Gradient, initially zero
  • prev: Pointers to parent nodes
  • backward_fn: A closure that knows how to propagate gradients backward
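
As a rough illustration of this structure, here is a minimal sketch of a value node built from a plain R environment. The field names mirror the list above, but make_val() and add_nodes() are hypothetical helpers, not femtograd's actual internals:

# A value node as a mutable R environment
make_val <- function(data, prev = list(), backward_fn = function(node) NULL) {
  node <- new.env()
  node$data <- as.matrix(data)                               # computed result
  node$grad <- matrix(0, nrow(node$data), ncol(node$data))   # gradient, initially zero
  node$prev <- prev                                          # pointers to parent nodes
  node$backward_fn <- backward_fn                            # propagates gradients backward
  node
}

# Example operation: addition. Its backward_fn passes the incoming gradient
# unchanged to both parents, since d(a + b)/da = d(a + b)/db = 1.
add_nodes <- function(a, b) {
  out <- make_val(a$data + b$data, prev = list(a, b))
  out$backward_fn <- function(node) {
    a$grad <- a$grad + node$grad
    b$grad <- b$grad + node$grad
  }
  out
}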

Reverse-Mode AD (Backpropagation)

To compute gradients, we traverse the graph in reverse topological order. This is the backward pass.

The key insight is the chain rule. For a composition $f(g(x))$:

$$\frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}$$

In reverse mode, we start from the output and propagate $\frac{\partial \text{output}}{\partial \text{node}}$ backward through each operation.

# After building the graph above, compute gradients:
backward(z)

# dz/dx = 2x + y = 2(3) + 4 = 10
grad(x)
#> [1] 10

# dz/dy = x = 3
grad(y)
#> [1] 3

Step by Step

Here is what backward(z) does internally:

  1. Initialize: Set $\frac{\partial z}{\partial z} = 1$
  2. Topological sort: Order the nodes so that each node comes after the nodes it depends on
  3. Reverse traverse: Visit the nodes from the output back to the inputs, calling each node's backward_fn
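
A minimal sketch of that procedure, reusing the hypothetical make_val() nodes from the earlier sketch (illustrative only, not femtograd's actual backward()):

backward_sketch <- function(output) {
  topo <- list()

  # Topological sort: dependencies are appended before the nodes that use them
  build <- function(node) {
    if (is.null(node$.visited)) {        # visit each node only once
      node$.visited <- TRUE
      for (parent in node$prev) build(parent)
      topo[[length(topo) + 1]] <<- node
    }
  }
  build(output)

  output$grad[] <- 1                     # seed: d(output)/d(output) = 1
  for (node in rev(topo)) {
    node$backward_fn(node)               # push gradients to parent nodes
  }
}

A real implementation would also reset gradients and visited flags between calls, but the shape is the same: order the graph once, seed the output gradient with 1, and let each node's backward_fn pass gradients to its parents.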

For our example $z = x^2 + x \cdot y$:

Step | Node | Local derivative | Accumulated gradient
Init | z | $\partial z / \partial z = 1$ | grad(z) = 1
1 | z = v3 + v4 | $\partial z / \partial v_3 = 1$, $\partial z / \partial v_4 = 1$ | grad(v3) += 1, grad(v4) += 1
2 | v4 = x * y | $\partial v_4 / \partial x = y$, $\partial v_4 / \partial y = x$ | grad(x) += 1 * 4 = 4, grad(y) += 1 * 3 = 3
3 | v3 = x^2 | $\partial v_3 / \partial x = 2x$ | grad(x) += 1 * 6 = 6
Result | | | grad(x) = 10, grad(y) = 3

Gradient Accumulation

Notice that grad(x) receives contributions from both paths through the graph (via $v_3$ and $v_4$). This is gradient accumulation: it naturally handles the multivariate chain rule when a variable appears in multiple sub-expressions.
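
You can see the accumulation directly when the same variable feeds two paths. Using the val()/backward()/grad() API from above (the expected result is shown as a comment):

x <- val(3, label = "x")
u <- x * x        # x appears twice in the graph
backward(u)
grad(x)           # du/dx = 2x = 6: each use of x contributes 3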

Why Reverse Mode?

For a function $f : \mathbb{R}^n \to \mathbb{R}$ (scalar output, many inputs), reverse mode computes all $n$ partial derivatives in a single backward pass.

This is exactly the setting for MLE: the log-likelihood is a scalar function of $p$ parameters. One backward pass gives us the entire gradient vector.

Forward mode, by contrast, computes one directional derivative per pass, so it would need $p$ passes for the full gradient.

Mode | Cost | Best for
Forward | $O(n)$ passes for $n$ inputs | Few inputs, many outputs
Reverse | $O(1)$ passes for any $n$ | Many inputs, scalar output (MLE!)

Second Derivatives: Forward-over-Reverse

For Hessians (second-derivative matrices), femtograd uses forward-over-reverse AD. This nests forward-mode AD inside reverse-mode:

  1. Forward mode (using dual numbers): Propagate tangent vectors through the computation
  2. Reverse mode: Differentiate the tangent computation to get second derivatives

# Compute Hessian of f(x,y) = x^2 + x*y
f <- function(params) {
  x <- params[[1]]
  y <- params[[2]]
  x^2 + x * y
}

H <- hessian(f, list(val(3), val(4)))
H
#>      [,1] [,2]
#> [1,]    2    1
#> [2,]    1    0

The Hessian is:

$$H = \begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} \end{bmatrix} = \begin{bmatrix} 2 & 1 \\ 1 & 0 \end{bmatrix}$$

Dual Numbers

Forward-mode AD uses dual numbers: pairs $(v, \dot{v})$ where $v$ is the primal value and $\dot{v}$ is the tangent (directional derivative).

Arithmetic on dual numbers automatically propagates derivatives:

  • $(a, \dot{a}) + (b, \dot{b}) = (a + b, \dot{a} + \dot{b})$
  • $(a, \dot{a}) \times (b, \dot{b}) = (a \cdot b, a \cdot \dot{b} + \dot{a} \cdot b)$
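
For intuition, here is a standalone sketch of these two rules on plain numbers. The dual()/dual_add()/dual_mul() helpers are hypothetical, not femtograd's representation:

# A dual number is a (value, tangent) pair; arithmetic follows the rules above.
dual     <- function(v, dv = 0) list(v = v, dv = dv)
dual_add <- function(a, b) dual(a$v + b$v, a$dv + b$dv)
dual_mul <- function(a, b) dual(a$v * b$v, a$v * b$dv + a$dv * b$v)

# Directional derivative of f(x, y) = x^2 + x*y at (3, 4) in the x direction:
x <- dual(3, dv = 1)    # tangent 1: differentiate with respect to x
y <- dual(4, dv = 0)
res <- dual_add(dual_mul(x, x), dual_mul(x, y))
res$v     # 21
res$dv    # df/dx = 2x + y = 10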

In femtograd, the primal is a value object, so the tangent computation builds its own computational graph. Running backward() on the tangent then gives the second derivative.

Connection to Statistics

Fisher Information and Standard Errors

At the MLE $\hat{\theta}$, the observed Fisher information matrix is:

$$\mathcal{I}(\hat{\theta}) = -H(\hat{\theta})$$

where $H$ is the Hessian of the log-likelihood. The variance-covariance matrix of the MLE is approximately:

$$\text{Var}(\hat{\theta}) \approx \mathcal{I}(\hat{\theta})^{-1} = (-H)^{-1}$$

Standard errors are the square roots of the diagonal of this matrix.

# Generate data
set.seed(42)
x <- rnorm(200, mean = 5, sd = 2)

# Fit model - femtograd computes the Hessian automatically
result <- fit(
  function(mu, log_sigma) {
    sigma <- exp(log_sigma)
    loglik_normal(mu, sigma, x)
  },
  params = c(mu = 0, log_sigma = 0)
)

# Standard errors from inverse Fisher information
se(result)
#>        mu log_sigma 
#> 0.1374822 0.0500000

# For comparison: theoretical SE for mu is sigma/sqrt(n)
2 / sqrt(200)
#> [1] 0.1414214
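
The same arithmetic by hand, as a sketch: negate the Hessian of the log-likelihood at the MLE, invert it, and take square roots of the diagonal. The matrix below is a hypothetical stand-in (the expected information for this model is $n/\sigma^2 = 50$ for mu and $2n = 400$ for log_sigma), not output from femtograd:

H_ll <- diag(c(-50, -400))    # hypothetical Hessian of the log-likelihood at the MLE
vcov_hat <- solve(-H_ll)      # Var(theta_hat) is approximately (-H)^(-1)
sqrt(diag(vcov_hat))          # standard errors: about 0.1414 and 0.05, as above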

Newton-Raphson Optimization

Newton’s method uses the Hessian to take optimal steps:

$$\theta_{t+1} = \theta_t - H^{-1} \nabla \ell(\theta_t)$$

This converges quadratically near the optimum, while gradient-only methods converge linearly. The method = "newton" option in femtograd's fit() uses this approach.
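
As a sketch of a single update, with grad_fn() and hess_fn() as placeholder names for functions returning the gradient vector and Hessian matrix (not femtograd's API):

# One Newton-Raphson step for maximizing the log-likelihood
newton_step <- function(theta, grad_fn, hess_fn) {
  g <- grad_fn(theta)     # gradient at the current parameters
  H <- hess_fn(theta)     # Hessian at the current parameters
  theta - solve(H, g)     # solve(H, g) computes H^(-1) %*% g
}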

Summary

Concept | In femtograd | Purpose
Forward pass | z <- x^2 + x*y | Build graph, compute value
Backward pass | backward(z) | Compute all gradients
Gradient access | grad(x) | Get $\partial z / \partial x$
Hessian | hessian(f, params) | Second derivatives for SEs
Dual numbers | Internal to hessian() | Forward-over-reverse AD
Fisher information | observed_info(result) | $-H$ at MLE

The computational graph approach gives us exact derivatives (no numerical approximation) efficiently (one backward pass for all parameters), which is precisely what statistical inference requires.