
Reverse-Mode Automatic Differentiation

Demystifying automatic differentiation: building a reverse-mode AD system in C++20.

Automatic differentiation powers modern machine learning. PyTorch's autograd, JAX's grad, TensorFlow's GradientTape: all rely on the same fundamental insight that we can compute exact derivatives by mechanically applying the chain rule.

The Functional API

The core interface is a single function: grad. It takes a function and returns its gradient function:

auto f = [](const auto& x) {
    return sum(pow(x, 2.0));  // f(x) = sum(x^2)
};

auto df = grad(f);  // df(x) = 2x

matrix<double> x{{1.0}, {2.0}, {3.0}};
auto gradient = df(x);  // Returns {{2.0}, {4.0}, {6.0}}

Because grad returns a callable, you can compose it:

auto d2f = grad(grad(f));  // Second derivative (for scalar x, where grad(f) is itself scalar-valued)

No classes to instantiate, no global state to manage. Just functions transforming functions.

How Reverse-Mode AD Works

The grad function reveals the algorithm:

template <typename F>
auto grad(F&& f) {
    return [f = std::forward<F>(f)]<typename T>(const T& input) {
        graph g;                          // 1. Fresh graph per call
        auto x = g.make_var(input);       // 2. Create input variable
        auto y = f(x);                    // 3. Forward pass (builds graph)
        g.backward<std::remove_cvref_t<decltype(y)>>(y);  // 4. Backward pass
        return g.gradient(x);             // 5. Extract gradient
    };
}

Forward pass: When f(x) executes, each operation creates a node in the graph. The node stores:

  • The computed value
  • References to parent nodes
  • A backward function implementing that operation’s Jacobian
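Concretely, a node can be as small as a value, its parents, and a closure. Here is a minimal sketch with hypothetical field names; the std::any-typed gradient matches the accumulate_gradient code shown later in this post, while the rest is an assumption:

#include <any>
#include <cstddef>
#include <functional>
#include <vector>

struct node_id { std::size_t index; };

struct node {
    std::any value;                 // result computed during the forward pass
    std::vector<node_id> parents;   // the operands this node was built from
    // Maps the upstream gradient dL/d(this node) to one gradient per parent.
    std::function<std::vector<std::any>(const std::any&)> backward;
    std::any gradient;              // accumulated dL/d(this node), filled in backward
};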

Backward pass: Starting from the output, we traverse the graph in reverse topological order. Each node’s backward function receives the upstream gradient and propagates it to parents via the chain rule.
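In code, the traversal is short. Below is a sketch simplified to scalar gradients (plain double instead of the type-erased storage above). It assumes nodes are appended in creation order during the forward pass, so walking the vector backwards is a valid reverse topological order:

#include <cstddef>
#include <functional>
#include <optional>
#include <vector>

struct snode {
    std::vector<std::size_t> parents;                     // operand indices
    std::function<std::vector<double>(double)> backward;  // upstream -> per-parent grads
    std::optional<double> gradient;                       // accumulated dL/d(this node)
};

void backward_pass(std::vector<snode>& nodes, std::size_t output) {
    nodes[output].gradient = 1.0;                     // seed: dL/dL = 1
    for (std::size_t i = nodes.size(); i-- > 0; ) {   // reverse creation order
        snode& n = nodes[i];
        if (!n.gradient || !n.backward) continue;     // off-path node or input leaf
        auto parent_grads = n.backward(*n.gradient);  // apply this op's local Jacobian
        for (std::size_t p = 0; p < n.parents.size(); ++p) {
            auto& pg = nodes[n.parents[p]].gradient;  // chain rule: sum into parents
            pg = pg.value_or(0.0) + parent_grads[p];
        }
    }
}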

Implementing Jacobians

Each differentiable operation provides a backward function. Let’s examine three matrix calculus results.

Matrix Multiplication: \(C = AB\)

If \(C\) feeds into a scalar loss \(L\), we need \(\partial L/\partial A\) and \(\partial L/\partial B\) given \(\partial L/\partial C\).

The chain rule gives us:

  • \(\partial L/\partial A = (\partial L/\partial C) B^T\)
  • \(\partial L/\partial B = A^T (\partial L/\partial C)\)
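In code, that is two matrix products. Here is a sketch of the backward function, assuming the library provides a transpose helper alongside matmul (transpose does not appear elsewhere in this post, so treat it as an assumption):

// Given dL/dC for C = AB, produce (dL/dA, dL/dB).
std::pair<matrix<double>, matrix<double>>
matmul_backward(const matrix<double>& A, const matrix<double>& B,
                const matrix<double>& dC) {
    auto dA = matmul(dC, transpose(B));  // dL/dA = (dL/dC) B^T
    auto dB = matmul(transpose(A), dC);  // dL/dB = A^T (dL/dC)
    return {dA, dB};
}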

Determinant: \(d = \det(A)\)

This is a classic matrix calculus result:

$$\frac{\partial \det(A)}{\partial A} = \det(A) \cdot A^{-T}$$

where \(A^{-T}\) denotes the transpose of the inverse.
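As a backward function, this amounts to scaling \(A^{-T}\) by the upstream scalar gradient. A sketch, with det, inverse, and transpose as assumed helpers:

// Given the scalar upstream gradient dL/dd for d = det(A), produce dL/dA.
matrix<double> det_backward(const matrix<double>& A, double upstream) {
    // dL/dA = (dL/dd) * det(A) * A^{-T}
    return transpose(inverse(A)) * (upstream * det(A));
}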

Matrix Inverse: \(B = A^{-1}\)

Starting from \(A \cdot A^{-1} = I\) and differentiating implicitly:

$$dA \cdot A^{-1} + A \cdot dA^{-1} = 0$$

$$dA^{-1} = -A^{-1} \cdot dA \cdot A^{-1}$$

Applying the chain rule to a scalar loss \(L\) through \(B = A^{-1}\):

$$\frac{\partial L}{\partial A} = -A^{-T} \left(\frac{\partial L}{\partial B}\right) A^{-T}$$
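As a backward function, a sketch follows; note that the forward pass already computed \(B = A^{-1}\), so we can reuse it rather than inverting again (transpose is again an assumed helper):

// Given dL/dB with B = A^{-1} (saved from the forward pass), produce dL/dA.
matrix<double> inverse_backward(const matrix<double>& B,
                                const matrix<double>& dB) {
    auto Bt = transpose(B);                    // B^T = A^{-T}
    return matmul(Bt, matmul(dB, Bt)) * -1.0;  // -A^{-T} (dL/dB) A^{-T}
}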

Gradient Accumulation

When a variable appears multiple times in a computation, gradients accumulate:

template <typename T>
void accumulate_gradient(node_id id, const T& grad) {
    auto& n = nodes_.at(id.index);
    if (!n.gradient.has_value()) {
        n.gradient = grad;  // first contribution: store directly
    } else {
        // subsequent contributions: sum into the existing gradient
        auto& existing = std::any_cast<T&>(n.gradient);
        existing = existing + grad;
    }
}

For example, if y = x + x, then dy/dx = 2, not 1. Each use of x contributes its gradient.
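With the functional API from earlier, this is easy to check directly (a sketch; the expected output follows from the accumulation rule):

auto f = [](const auto& x) {
    return sum(x + x);  // y = sum(x + x), so dy/dx_i = 2
};

matrix<double> x{{1.0}, {2.0}};
auto g = grad(f)(x);  // {{2.0}, {2.0}}: each use of x contributes 1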

Testing AD with Finite Differences

How do we know our Jacobians are correct? We verify against finite differences—the computational equivalent of the limit definition of a derivative:

$$f'(x) \approx \frac{f(x + h) - f(x - h)}{2h}$$

This central difference formula has \(O(h^2)\) error, better than the \(O(h)\) error of a one-sided difference.
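Here is a sketch of finite_diff_gradient, the helper the test below relies on. It perturbs one entry at a time; rows(), cols(), and operator()(i, j) element access on matrix<double> are assumptions:

template <typename F>
matrix<double> finite_diff_gradient(F&& f, matrix<double> x, double h = 1e-5) {
    matrix<double> g = x;  // gradient has the same shape as the input
    for (std::size_t i = 0; i < x.rows(); ++i) {
        for (std::size_t j = 0; j < x.cols(); ++j) {
            const double orig = x(i, j);
            x(i, j) = orig + h;
            const double fp = f(x);           // f(x + h * e_ij)
            x(i, j) = orig - h;
            const double fm = f(x);           // f(x - h * e_ij)
            x(i, j) = orig;                   // restore the entry
            g(i, j) = (fp - fm) / (2.0 * h);  // central difference
        }
    }
    return g;
}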

Tests compare AD gradients against finite differences:

TEST_CASE("matmul gradient") {
    auto f = [](const auto& A) {
        return sum(matmul(A, A));  // f(A) = sum(A^2)
    };

    matrix<double> A{{1, 2}, {3, 4}};
    auto ad_grad = grad(f)(A);
    auto fd_grad = finite_diff_gradient([&](const auto& x) {
        return sum(elementa::matmul(x, x));
    }, A);

    REQUIRE(approx_equal(ad_grad, fd_grad, 1e-4, 1e-6));
}

Finite differences are slow (\(O(n)\) function evaluations for n parameters) and approximate. AD is exact and efficient: a single backward pass yields all n partial derivatives. But finite differences don't lie; they're the ground truth for testing.

Practical Example: Gradient Descent

Here’s logistic regression training:

// Binary cross-entropy loss
auto bce_loss = [&X, &y](const auto& beta) {
    auto logits = matmul(X, beta);
    auto p = sigmoid(logits);
    return -mean(y * log(p) + (ones_like(y) - y) * log(ones_like(p) - p));
};

auto grad_bce = grad(bce_loss);

// Training loop
matrix<double> beta(features, 1, 0.0);  // features x 1 weights, zero-initialized
const double learning_rate = 0.1;       // step size (hypothetical value)
for (int iter = 0; iter < 1000; ++iter) {
    auto gradient = grad_bce(beta);
    beta = beta - gradient * learning_rate;
}

The gradient computation is automatic. Change the loss function, and grad adapts.
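For instance, adding L2 regularization is a one-line change to the loss, and the training loop is untouched (lambda here is a hypothetical regularization strength):

const double lambda = 0.01;  // hypothetical regularization strength
auto ridge_bce = [&X, &y, lambda](const auto& beta) {
    auto logits = matmul(X, beta);
    auto p = sigmoid(logits);
    auto bce = -mean(y * log(p) + (ones_like(y) - y) * log(ones_like(p) - p));
    return bce + sum(pow(beta, 2.0)) * lambda;  // + lambda * ||beta||^2
};
auto grad_ridge = grad(ridge_bce);  // adapts automatically, no hand-derived gradient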

Conclusion

Automatic differentiation is not magic. It’s systematic application of the chain rule, implemented through:

  1. Graph construction: Operations build a DAG during the forward pass
  2. Backward traversal: Reverse topological order ensures correct dependency ordering
  3. Gradient accumulation: Variables used multiple times sum their gradients

Understanding this makes PyTorch and JAX less mysterious. When .backward() runs, it’s doing exactly what we’ve described: traversing the graph, calling backward functions, accumulating gradients.
