
Why Automatic Differentiation?

In statistics, we constantly need derivatives:

  • MLE: Find parameters that maximize the log-likelihood (gradient = 0)
  • Standard errors: Derived from the Hessian (matrix of second derivatives)
  • Newton’s method: Uses both gradient and Hessian for optimization
  • Fisher information: The expected negative Hessian of the log-likelihood

There are three ways to compute derivatives:

Method | Pros | Cons
Symbolic | Exact, closed-form | Explosion of terms, hard to automate
Numerical (finite differences) | Easy to implement | Approximation errors, slow for Hessians
Automatic differentiation | Exact, efficient, automated | Requires computational graph
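
To make the "approximation errors" trade-off concrete, here is a quick base-R sketch (not femtograd code) of a forward finite difference for f(x) = x^2 at x = 3, where the exact derivative is 6:

# Forward finite difference: (f(x + h) - f(x)) / h
f <- function(x) x^2
h <- 1e-5
(f(3 + h) - f(3)) / h   # roughly 6.00001: the exact answer 6 plus O(h) error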

Femtograd uses automatic differentiation (AD), specifically reverse-mode AD (backpropagation) for gradients and forward-over-reverse AD for Hessians.

The Computational Graph

Every computation in femtograd builds a directed acyclic graph (DAG). Each node is a value object storing both a number and its gradient.

Consider $f(x, y) = x^2 + x \cdot y$:

      [z = v3 + v4]        <- output
       /          \
  [v3 = x^2]   [v4 = x*y]  <- intermediate
      |          /    \
     [x]       [x]   [y]    <- inputs (leaves)

In femtograd:

x <- val(3, label = "x")
y <- val(4, label = "y")

v3 <- x^2         # v3 = 9
v4 <- x * y       # v4 = 12
z  <- v3 + v4     # z  = 21

get_data(z)
#> [1] 21

The Forward Pass

When you write x^2 + x * y, femtograd executes each operation and records the computation graph. This is the forward pass: compute the function value while building the graph structure.

Each operation creates a new value node that stores:

  • data: The computed result (a matrix)
  • grad: Gradient, initially zero
  • prev: Pointers to parent nodes
  • backward_fn: A closure that knows how to propagate gradients backward
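
As a rough illustration of this structure, here is a minimal sketch of a value node built from a plain R environment. The field names mirror the list above, but make_val() and add_nodes() are hypothetical helpers, not femtograd's actual internals:

# A value node as a mutable R environment
make_val <- function(data, prev = list(), backward_fn = function(node) NULL) {
  node <- new.env()
  node$data <- as.matrix(data)                               # computed result
  node$grad <- matrix(0, nrow(node$data), ncol(node$data))   # gradient, initially zero
  node$prev <- prev                                          # pointers to parent nodes
  node$backward_fn <- backward_fn                            # propagates gradients backward
  node
}

# Example operation: addition. Its backward_fn passes the incoming gradient
# unchanged to both parents, since d(a + b)/da = d(a + b)/db = 1.
add_nodes <- function(a, b) {
  out <- make_val(a$data + b$data, prev = list(a, b))
  out$backward_fn <- function(node) {
    a$grad <- a$grad + node$grad
    b$grad <- b$grad + node$grad
  }
  out
}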

Reverse-Mode AD (Backpropagation)

To compute gradients, we traverse the graph in reverse topological order. This is the backward pass.

The key insight is the chain rule. For a composition $f(g(x))$:

$$\frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}$$

In reverse mode, we start from the output and propagate $\frac{\partial \text{output}}{\partial \text{node}}$ backward through each operation.

# After building the graph above, compute gradients:
backward(z)

# dz/dx = 2x + y = 2(3) + 4 = 10
grad(x)
#> [1] 10

# dz/dy = x = 3
grad(y)
#> [1] 3

Step by Step

Here is what backward(z) does internally:

  1. Initialize: Set $\frac{\partial z}{\partial z} = 1$
  2. Topological sort: Order the nodes so that each node comes after the nodes it depends on
  3. Reverse traverse: Visit the nodes from the output back to the inputs, calling each node's backward_fn
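
A minimal sketch of that procedure, reusing the hypothetical make_val() nodes from the earlier sketch (illustrative only, not femtograd's actual backward()):

backward_sketch <- function(output) {
  topo <- list()

  # Topological sort: dependencies are appended before the nodes that use them
  build <- function(node) {
    if (is.null(node$.visited)) {        # visit each node only once
      node$.visited <- TRUE
      for (parent in node$prev) build(parent)
      topo[[length(topo) + 1]] <<- node
    }
  }
  build(output)

  output$grad[] <- 1                     # seed: d(output)/d(output) = 1
  for (node in rev(topo)) {
    node$backward_fn(node)               # push gradients to parent nodes
  }
}

A real implementation would also reset gradients and visited flags between calls, but the shape is the same: order the graph once, seed the output gradient with 1, and let each node's backward_fn pass gradients to its parents.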

For our example $z = x^2 + x \cdot y$:

Step | Node | Local derivative | Accumulated gradient
Init | z | $\partial z / \partial z = 1$ | grad(z) = 1
1 | z = v3 + v4 | $\partial z / \partial v_3 = 1$, $\partial z / \partial v_4 = 1$ | grad(v3) += 1, grad(v4) += 1
2 | v4 = x * y | $\partial v_4 / \partial x = y$, $\partial v_4 / \partial y = x$ | grad(x) += 1 * 4 = 4, grad(y) += 1 * 3 = 3
3 | v3 = x^2 | $\partial v_3 / \partial x = 2x$ | grad(x) += 1 * 6 = 6
Result | | | grad(x) = 10, grad(y) = 3

Gradient Accumulation

Notice that grad(x) receives contributions from both paths through the graph (via $v_3$ and $v_4$). This is gradient accumulation: it naturally handles the multivariate chain rule when a variable appears in multiple sub-expressions.
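
You can see the accumulation directly when the same variable feeds two paths. Using the val()/backward()/grad() API from above (the expected result is shown as a comment):

x <- val(3, label = "x")
u <- x * x        # x appears twice in the graph
backward(u)
grad(x)           # du/dx = 2x = 6: each use of x contributes 3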

Why Reverse Mode?

For a function $f : \mathbb{R}^n \to \mathbb{R}$ (scalar output, many inputs), reverse mode computes all $n$ partial derivatives in a single backward pass.

This is exactly the setting for MLE: the log-likelihood is a scalar function of $p$ parameters. One backward pass gives us the entire gradient vector.

Forward mode, by contrast, computes one directional derivative per pass, so it would need $p$ passes for the full gradient.

Mode | Cost | Best for
Forward | $O(n)$ passes for $n$ inputs | Few inputs, many outputs
Reverse | $O(1)$ passes for any $n$ | Many inputs, scalar output (MLE!)

Second Derivatives: Forward-over-Reverse

For Hessians (second-derivative matrices), femtograd uses forward-over-reverse AD. This nests forward-mode AD inside reverse-mode:

  1. Forward mode (using dual numbers): Propagate tangent vectors through the computation
  2. Reverse mode: Differentiate the tangent computation to get second derivatives

# Compute Hessian of f(x,y) = x^2 + x*y
f <- function(params) {
  x <- params[[1]]
  y <- params[[2]]
  x^2 + x * y
}

H <- hessian(f, list(val(3), val(4)))
H
#>      [,1] [,2]
#> [1,]    2    1
#> [2,]    1    0

The Hessian is:

$$H = \begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} \end{bmatrix} = \begin{bmatrix} 2 & 1 \\ 1 & 0 \end{bmatrix}$$

Dual Numbers

Forward-mode AD uses dual numbers: pairs $(v, \dot{v})$ where $v$ is the primal value and $\dot{v}$ is the tangent (directional derivative).

Arithmetic on dual numbers automatically propagates derivatives:

  • $(a, \dot{a}) + (b, \dot{b}) = (a + b, \dot{a} + \dot{b})$
  • $(a, \dot{a}) \times (b, \dot{b}) = (a \cdot b, a \cdot \dot{b} + \dot{a} \cdot b)$
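
For intuition, here is a standalone sketch of these two rules on plain numbers. The dual()/dual_add()/dual_mul() helpers are hypothetical, not femtograd's representation:

# A dual number is a (value, tangent) pair; arithmetic follows the rules above.
dual     <- function(v, dv = 0) list(v = v, dv = dv)
dual_add <- function(a, b) dual(a$v + b$v, a$dv + b$dv)
dual_mul <- function(a, b) dual(a$v * b$v, a$v * b$dv + a$dv * b$v)

# Directional derivative of f(x, y) = x^2 + x*y at (3, 4) in the x direction:
x <- dual(3, dv = 1)    # tangent 1: differentiate with respect to x
y <- dual(4, dv = 0)
res <- dual_add(dual_mul(x, x), dual_mul(x, y))
res$v     # 21
res$dv    # df/dx = 2x + y = 10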

In femtograd, the primal is a value object, so the tangent computation builds its own computational graph. Running backward() on the tangent then gives the second derivative.

Connection to Statistics

Fisher Information and Standard Errors

At the MLE $\hat{\theta}$, the observed Fisher information matrix is:

$$\mathcal{I}(\hat{\theta}) = -H(\hat{\theta})$$

where $H$ is the Hessian of the log-likelihood. The variance-covariance matrix of the MLE is approximately:

$$\text{Var}(\hat{\theta}) \approx \mathcal{I}(\hat{\theta})^{-1} = (-H)^{-1}$$

Standard errors are the square roots of the diagonal of this matrix.

# Generate data
set.seed(42)
x <- rnorm(200, mean = 5, sd = 2)

# Fit model - femtograd computes the Hessian automatically
result <- fit(
  function(mu, log_sigma) {
    sigma <- exp(log_sigma)
    loglik_normal(mu, sigma, x)
  },
  params = c(mu = 0, log_sigma = 0)
)

# Standard errors from inverse Fisher information
se(result)
#>        mu log_sigma 
#> 0.1374822 0.0500000

# For comparison: theoretical SE for mu is sigma/sqrt(n)
2 / sqrt(200)
#> [1] 0.1414214
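
The same arithmetic by hand, as a sketch: negate the Hessian of the log-likelihood at the MLE, invert it, and take square roots of the diagonal. The matrix below is a hypothetical stand-in (the expected information for this model is $n/\sigma^2 = 50$ for mu and $2n = 400$ for log_sigma), not output from femtograd:

H_ll <- diag(c(-50, -400))    # hypothetical Hessian of the log-likelihood at the MLE
vcov_hat <- solve(-H_ll)      # Var(theta_hat) is approximately (-H)^(-1)
sqrt(diag(vcov_hat))          # standard errors: about 0.1414 and 0.05, as above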

Newton-Raphson Optimization

Newton’s method uses the Hessian to take optimal steps:

$$\theta_{t+1} = \theta_t - H^{-1} \nabla \ell(\theta_t)$$

This converges quadratically near the optimum, while gradient-only methods converge linearly. The method = "newton" option in femtograd's fit() uses this approach.
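
As a sketch of a single update, with grad_fn() and hess_fn() as placeholder names for functions returning the gradient vector and Hessian matrix (not femtograd's API):

# One Newton-Raphson step for maximizing the log-likelihood
newton_step <- function(theta, grad_fn, hess_fn) {
  g <- grad_fn(theta)     # gradient at the current parameters
  H <- hess_fn(theta)     # Hessian at the current parameters
  theta - solve(H, g)     # solve(H, g) computes H^(-1) %*% g
}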

Summary

Concept | In femtograd | Purpose
Forward pass | z <- x^2 + x*y | Build graph, compute value
Backward pass | backward(z) | Compute all gradients
Gradient access | grad(x) | Get $\partial z / \partial x$
Hessian | hessian(f, params) | Second derivatives for SEs
Dual numbers | Internal to hessian() | Forward-over-reverse AD
Fisher information | observed_info(result) | $-H$ at MLE

The computational graph approach gives us exact derivatives (no numerical approximation) efficiently (one backward pass for all parameters), which is precisely what statistical inference requires.