A synthesis of three earlier posts, comparing forward-mode AD, reverse-mode AD, and numerical differentiation.
Computing derivatives is fundamental to optimization, machine learning, physics simulation, and numerical analysis. This series has explored three distinct approaches:
- Forward-mode AD via dual numbers
- Reverse-mode AD via computational graphs
- Numerical differentiation via finite differences
Each has different strengths. Understanding when to use each is essential for practical numerical computing.
The Landscape
| Method | Accuracy | Cost for \(f: \mathbb{R}^n \to \mathbb{R}\) | Cost for \(f: \mathbb{R} \to \mathbb{R}^m\) | Memory |
|---|---|---|---|---|
| Forward AD | Exact | \(O(n)\) passes | \(O(1)\) pass | \(O(1)\) |
| Reverse AD | Exact | \(O(1)\) pass | \(O(m)\) passes | \(O(\text{ops})\) |
| Finite Diff | \(O(h^p)\) | \(O(n)\) evaluations | \(O(1)\) evaluations | \(O(1)\) |
The key insight: problem structure determines the best method.
Forward-Mode AD: Dual Numbers
Forward-mode AD extends the reals with an infinitesimal \(\varepsilon\) satisfying \(\varepsilon^2 = 0\). Evaluating \(f(x + \varepsilon)\) with ordinary arithmetic yields \(f(x) + f'(x)\varepsilon\), so the derivative falls out as the coefficient of \(\varepsilon\):
// f(x) = x^3 - 3x + 1
// f'(x) = 3x^2 - 3
auto x = dual<double>::variable(2.0); // x = 2, dx = 1
auto f = x*x*x - 3.0*x + 1.0;
std::cout << f.value() << "\n"; // 3.0
std::cout << f.derivative() << "\n"; // 9.0
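The `dual<double>` type here comes from the earlier forward-mode post. As a reminder of the idea, a minimal sketch of such a type might look like the following (the accessor names `value()`/`derivative()` are chosen to match the snippet above; the real implementation has more overloads and math functions):

```cpp
// Minimal dual-number sketch: a value plus a derivative, with the product rule
// baked into operator* because epsilon^2 = 0.
template <typename T>
struct dual {
    T val;  // f(x)
    T der;  // f'(x), the coefficient of epsilon

    static dual variable(T v) { return {v, T(1)}; }  // seed derivative = 1
    static dual constant(T v) { return {v, T(0)}; }  // seed derivative = 0

    T value() const { return val; }
    T derivative() const { return der; }
};

template <typename T>
dual<T> operator+(dual<T> a, dual<T> b) { return {a.val + b.val, a.der + b.der}; }

template <typename T>
dual<T> operator-(dual<T> a, dual<T> b) { return {a.val - b.val, a.der - b.der}; }

template <typename T>
dual<T> operator*(dual<T> a, dual<T> b) {
    // (a + a'e)(b + b'e) = ab + (a'b + ab')e, since e^2 = 0
    return {a.val * b.val, a.der * b.val + a.val * b.der};
}

// Mixed scalar/dual overloads so expressions like 3.0*x and f + 1.0 compile.
template <typename T>
dual<T> operator*(T s, dual<T> a) { return {s * a.val, s * a.der}; }
template <typename T>
dual<T> operator+(dual<T> a, T s) { return {a.val + s, a.der}; }
template <typename T>
dual<T> operator-(dual<T> a, T s) { return {a.val - s, a.der}; }
```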
Strengths:
- Simple implementation (operator overloading)
- Minimal memory overhead (one extra number per value; no tape to store)
- Naturally composable for higher derivatives
- Works with any function of overloaded operators
When to use:
- Single input variable (or few inputs)
- Computing Jacobian-vector products
- Higher-order derivatives via nesting
- Sensitivity analysis along one direction
Complexity: One forward pass per input variable. For \(f: \mathbb{R}^n \to \mathbb{R}^m\), computing the full Jacobian requires \(n\) passes.
Reverse-Mode AD: Computational Graphs
Reverse-mode AD builds a computational graph during the forward pass, then propagates gradients backward via the chain rule:
auto f = [](const auto& x) {
return sum(pow(x, 2.0)); // f(x) = sum(x^2)
};
auto df = grad(f); // Returns gradient function
auto gradient = df(x); // One backward pass for all partials
Strengths:
- \(O(1)\) backward passes regardless of input dimension
- Powers modern deep learning (backpropagation)
- Efficient for loss functions \(f: \mathbb{R}^n \to \mathbb{R}\)
When to use:
- Many inputs, scalar output (neural networks)
- Computing vector-Jacobian products
- Optimization where you need full gradient
Complexity: One forward pass to build the graph, then one backward pass to compute all partial derivatives. Memory scales with the number of operations, since intermediate values must be stored for the backward pass.
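To make the graph/tape idea concrete, here is a heavily simplified scalar tape (a sketch only; the computational-graph implementation from the reverse-mode post is more general, and the `grad`/`sum`/`pow` API above is assumed from that post):

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Heavily simplified scalar tape for reverse-mode AD (illustrative sketch).
struct Tape {
    struct Node {
        std::size_t parents[2];  // indices of the (up to two) parent nodes
        double weights[2];       // local partial derivatives w.r.t. each parent
    };
    std::vector<Node> nodes;

    std::size_t add_leaf() {
        nodes.push_back({{0, 0}, {0.0, 0.0}});      // leaves have no parents
        return nodes.size() - 1;
    }
    std::size_t add_node(std::size_t p0, double w0, std::size_t p1, double w1) {
        nodes.push_back({{p0, p1}, {w0, w1}});
        return nodes.size() - 1;
    }
};

struct Var {
    Tape* tape;
    std::size_t index;
    double value;
};

Var operator*(Var a, Var b) {
    // d(ab)/da = b, d(ab)/db = a
    return {a.tape, a.tape->add_node(a.index, b.value, b.index, a.value), a.value * b.value};
}
Var operator+(Var a, Var b) {
    return {a.tape, a.tape->add_node(a.index, 1.0, b.index, 1.0), a.value + b.value};
}

// One backward sweep: seed the output adjoint with 1 and push it to the parents.
std::vector<double> backward(const Tape& tape, std::size_t output) {
    std::vector<double> adjoint(tape.nodes.size(), 0.0);
    adjoint[output] = 1.0;
    for (std::size_t i = tape.nodes.size(); i-- > 0; ) {
        const Tape::Node& n = tape.nodes[i];
        adjoint[n.parents[0]] += n.weights[0] * adjoint[i];
        adjoint[n.parents[1]] += n.weights[1] * adjoint[i];
    }
    return adjoint;
}

int main() {
    Tape tape;
    Var x{&tape, tape.add_leaf(), 1.0};
    Var y{&tape, tape.add_leaf(), 2.0};
    Var z{&tape, tape.add_leaf(), 3.0};
    Var f = x*x + y*y + z*z;   // f(1,2,3) = 14
    std::vector<double> adj = backward(tape, f.index);
    std::cout << adj[x.index] << " " << adj[y.index] << " " << adj[z.index] << "\n";  // 2 4 6
}
```

The memory cost in the table above is visible here directly: the tape grows by one node per elementary operation.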
Numerical Differentiation: Finite Differences
Approximate the derivative using the limit definition:
// Central difference: f'(x) ~ (f(x+h) - f(x-h)) / 2h
double df = central_difference(f, x);
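The `central_difference` helper comes from the numerical-differentiation post; a sketch of what such a routine might look like, with the step size tied to the cube root of machine epsilon (the usual balance between truncation and round-off error for the central scheme):

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// Sketch of a central-difference derivative. The real helper may choose h
// differently; h ~ cbrt(eps) * scale roughly balances the O(h^2) truncation
// error against the O(eps/h) round-off error.
template <typename F>
double central_difference(F f, double x) {
    const double eps = std::numeric_limits<double>::epsilon();
    const double h = std::cbrt(eps) * std::max(1.0, std::abs(x));
    return (f(x + h) - f(x - h)) / (2.0 * h);
}
```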
Strengths:
- Works with black-box functions
- No special types required
- Simple to implement and understand
Limitations:
- Approximate (truncation error + round-off error)
- Requires careful step size selection
- \(O(n)\) function evaluations for the gradient
When to use:
- Black-box functions (no source code)
- Quick approximations
- Validating AD implementations
Decision Tree
Need derivatives? Work down the tree:
- Is source code available?
  - Yes: use AD.
    - \(f: \mathbb{R} \to \mathbb{R}^m\) (few inputs, many outputs)? Forward-mode AD.
    - \(f: \mathbb{R}^n \to \mathbb{R}\) (many inputs, scalar output)? Reverse-mode AD.
    - \(f: \mathbb{R}^n \to \mathbb{R}^m\) (both)? Forward mode for \(n < m\), reverse mode for \(m < n\).
  - No: black-box function, so use finite differences.
- Validating correctness? Finite differences (ground truth).
Side-by-Side Comparison
Consider \(f(x, y, z) = x^2 + y^2 + z^2\), evaluated at \((1, 2, 3)\).
Forward-Mode (3 passes for full gradient)
// Pass 1: differentiate w.r.t. x
auto dx = dual<double>::variable(1.0);
auto y1 = dual<double>::constant(2.0);
auto z1 = dual<double>::constant(3.0);
auto f1 = dx*dx + y1*y1 + z1*z1;
// f1.derivative() = 2 (df/dx)
// Pass 2: differentiate w.r.t. y
// Pass 3: differentiate w.r.t. z
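Passes 2 and 3 repeat the same pattern with the seed moved to \(y\) and then \(z\). A small loop over seed positions (a sketch, reusing the hypothetical `dual` interface from above) makes the three-pass structure explicit:

```cpp
#include <array>

// Sketch: gradient of f(x,y,z) = x^2 + y^2 + z^2 by seeding one input per pass.
std::array<double, 3> forward_gradient(const std::array<double, 3>& p) {
    std::array<double, 3> g{};
    for (int i = 0; i < 3; ++i) {
        std::array<dual<double>, 3> v{};
        for (int j = 0; j < 3; ++j)
            v[j] = (i == j) ? dual<double>::variable(p[j])
                            : dual<double>::constant(p[j]);
        auto f = v[0]*v[0] + v[1]*v[1] + v[2]*v[2];
        g[i] = f.derivative();  // one partial per forward pass
    }
    return g;  // {2, 4, 6} at (1, 2, 3)
}
```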
Reverse-Mode (1 pass for full gradient)
auto f = [](const auto& v) {
return v(0)*v(0) + v(1)*v(1) + v(2)*v(2);
};
auto gradient = grad(f)(vector{1.0, 2.0, 3.0});
// gradient = {2, 4, 6} in one backward pass
Finite Differences (6 evaluations for full gradient)
// df/dx ~ (f(1+h,2,3) - f(1-h,2,3)) / 2h
// df/dy ~ (f(1,2+h,3) - f(1,2-h,3)) / 2h
// df/dz ~ (f(1,2,3+h) - f(1,2,3-h)) / 2h
auto gradient = finite_diff::gradient(f, {1.0, 2.0, 3.0});
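The `finite_diff::gradient` helper is assumed from the earlier post; the name `fd_gradient` and the `std::vector` interface below are illustrative. The idea is to perturb one component at a time with a central difference, so \(2n = 6\) evaluations here:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Sketch of a finite-difference gradient: one central difference per component.
template <typename F>
std::vector<double> fd_gradient(F f, std::vector<double> x) {
    const double eps = std::numeric_limits<double>::epsilon();
    std::vector<double> g(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double h  = std::cbrt(eps) * std::max(1.0, std::abs(x[i]));
        const double xi = x[i];
        x[i] = xi + h; const double fp = f(x);
        x[i] = xi - h; const double fm = f(x);
        x[i] = xi;                             // restore the component
        g[i] = (fp - fm) / (2.0 * h);
    }
    return g;
}
```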
Combining Methods: Validation
The gold standard for testing AD: compare against finite differences.
auto ad_grad = grad(f)(x);
auto fd_grad = finite_diff::gradient(f, x);
// Should match within tolerance
REQUIRE(approx_equal(ad_grad, fd_grad, 1e-4, 1e-6));
Finite differences are slow and approximate, but they don’t lie. If your AD implementation disagrees with finite differences, the AD is wrong.
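The `approx_equal` helper above is assumed; a minimal elementwise check with relative and absolute tolerances might look like this:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of an elementwise tolerance check: a pair passes if it is within
// rel_tol relatively or abs_tol absolutely.
bool approx_equal(const std::vector<double>& a, const std::vector<double>& b,
                  double rel_tol, double abs_tol) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double diff  = std::abs(a[i] - b[i]);
        const double scale = std::max(std::abs(a[i]), std::abs(b[i]));
        if (diff > abs_tol && diff > rel_tol * scale) return false;
    }
    return true;
}
```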
Complexity Analysis
For \(f: \mathbb{R}^n \to \mathbb{R}^m\) with \(p\) elementary operations:
| Method | Time | Space | Function Evals |
|---|---|---|---|
| Forward AD | \(O(np)\) | \(O(1)\) | \(n\) |
| Reverse AD | \(O(mp)\) | \(O(p)\) | \(m\) |
| Finite Diff | \(O(np)\) | \(O(1)\) | \(2n\) (central) |
The crossover point:
- \(n < m\): Forward-mode wins
- \(n > m\): Reverse-mode wins
- \(n \approx m\): Either works; consider memory constraints
For neural networks with millions of parameters and scalar loss, reverse mode is the only practical choice.
Integration with Numerical Quadrature
A beautiful application: differentiation under the integral sign.
// Compute d/da of integral from 0 to 1 of a*sin(x) dx
using D = dual<double>;
D a = D::variable(2.0);
auto integrand = [&a](D x) { return a * sin(x); };
D result = integrate(integrand, D::constant(0.0), D::constant(1.0));
// result.value() = 2 * (1 - cos(1)) ~ 0.919
// result.derivative() = (1 - cos(1)) ~ 0.459
Forward-mode AD composes naturally with generic numerical algorithms because dual numbers form a commutative ring: any routine written against ordinary arithmetic works unchanged on dual inputs.
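As an illustration of that composability, a generic composite Simpson rule needs nothing beyond the arithmetic dual numbers already provide. This is a sketch; the `integrate` used above comes from the quadrature post and assumes that post's full `dual` type, including `sin`:

```cpp
// Generic composite Simpson's rule: written only in terms of +, -, * and
// multiplication by ordinary doubles, so it works unchanged when T is a
// dual-number type. (Sketch; n must be even.)
template <typename F, typename T>
T simpson(F f, T a, T b, int n = 100) {
    T h = (1.0 / n) * (b - a);
    T sum = f(a) + f(b);
    for (int i = 1; i < n; ++i) {
        T xi = a + static_cast<double>(i) * h;
        sum = sum + (i % 2 == 1 ? 4.0 : 2.0) * f(xi);
    }
    return (1.0 / 3.0) * (h * sum);
}
```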
Conclusion
Three methods, three use cases:
- Forward-mode AD: Simple, memory-efficient, ideal for few inputs
- Reverse-mode AD: Powers deep learning, ideal for many inputs
- Finite differences: Universal fallback, validation tool
The Stepanov perspective: each method exploits different structure. Forward AD uses the ring structure of dual numbers. Reverse AD uses the DAG structure of computation. Finite differences use only function evaluation.
Understanding this structure helps you choose wisely—and implement correctly.
Further Reading
- Forward-Mode Automatic Differentiation - Dual numbers in detail
- Reverse-Mode Automatic Differentiation - Building a computational graph
- Numerical Differentiation - Finite difference schemes
- Numerical Integration with Generic Concepts - Leibniz rule via dual numbers