Architecture of an AD Engine
From femtograd to micrograd
Introduction
femtograd is deliberately simple. Every operation creates an R6 object, every backward pass walks a DAG of R closures, and every optimization step rebuilds the entire computational graph from scratch. This makes the code readable and debuggable—you can inspect any node, print any gradient, and trace any computation in plain R.
But simplicity has costs. For large datasets or many parameters, femtograd will be slower than production tools like TMB or torch. This is by design: femtograd exists to teach, not to compete.
This vignette explains why femtograd is structured the way it is, and what architectural changes would be needed to make it faster. We present three tiers that any AD system progresses through, from pure-interpreter implementations (where femtograd lives) to compiled backends (where production tools live). Understanding these tiers makes production AD tools less opaque.
Tier 0: Graph-per-Call (femtograd today)
femtograd’s current architecture rebuilds the computational graph on every function evaluation. Consider what happens during one step of gradient ascent:
```r
# Generate some data
set.seed(42)
x <- rexp(50, rate = 2)

# One step of the "inner loop":

# 1. Create fresh value objects from numeric parameters
params <- list(val(1.5))

# 2. Run the forward pass (builds the graph)
obj <- loglik_exponential(params[[1]], x)

# 3. Extract the scalar result
get_data(obj)
#> [1] -22.29947

# 4. Run the backward pass (walks the graph)
backward(obj)

# 5. Extract the gradient
grad(params[[1]])
#> [1] 4.951519
```

Steps 1–5 are the fundamental unit of work for any gradient-based optimizer. femtograd's gradient_ascent(), bfgs(), and lbfgs() all perform this sequence on every iteration.
What happens under the hood
Each arithmetic operation (+, *, log, etc.) creates a new R6 value object on the R heap. The object stores:
- data: The numeric result (a matrix)
- grad: The gradient accumulator (initially zero)
- prev: Pointers to parent nodes (forming the DAG)
- backward_fn: An R closure encoding the local derivative
For a log-likelihood summing over n observations, this creates on the order of n nodes per evaluation. Each node involves R6 object allocation, S3 method dispatch, and closure creation, all of them interpreted R operations.
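To make the structure concrete, here is a minimal sketch in the same spirit as femtograd, but not its actual source: the Node class and mul() helper below are hypothetical names used only for illustration.

```r
library(R6)

# Hypothetical minimal node; femtograd's real class differs in detail.
Node <- R6Class("Node", public = list(
  data = NULL,         # the numeric result
  grad = 0,            # gradient accumulator, starts at zero
  prev = list(),       # parent nodes: the edges of the DAG
  backward_fn = NULL,  # closure encoding the local derivative
  initialize = function(data, prev = list()) {
    self$data <- data
    self$prev <- prev
    self$backward_fn <- function() NULL  # leaf nodes have nothing to do
  }
))

# One operation: multiplication. The closure captures a, b, and out, so the
# backward pass can push out's gradient onto the parents via the product rule.
mul <- function(a, b) {
  out <- Node$new(a$data * b$data, prev = list(a, b))
  out$backward_fn <- function() {
    a$grad <- a$grad + b$data * out$grad   # d(ab)/da = b
    b$grad <- b$grad + a$data * out$grad   # d(ab)/db = a
  }
  out
}
```

A full backward() would then seed the output node's grad with 1 and call each node's backward_fn in reverse topological order over the DAG.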
The cost profile
| Operation | Mechanism | Cost |
|---|---|---|
| Node creation | R6$new() per operation | Heap allocation + GC pressure |
| Forward pass | S3 dispatch per operation | Interpreter overhead |
| Backward pass | Walk DAG, call closures | Interpreted loop + closure calls |
| Graph structure | Rebuilt every iteration | Redundant work |
For a likelihood with 100 observations and 2 parameters, each optimizer step creates hundreds of R6 objects, dispatches hundreds of S3 methods, and discards the entire graph before the next step.
Why this is the right choice for pedagogy
Despite the overhead, Tier 0 has critical advantages for learning:
- Inspectability: You can print() any node and see its value, gradient, and parents
- Debuggability: R's debugger works on every operation; set breakpoints inside backward_fn closures
- Extensibility: Adding a new operation means writing one R function with a closure
- No build step: Pure R, no compilation, installs anywhere R runs
The code is the documentation. Reading R/ops.R teaches you how multiplication's backward pass works. Reading R/value.R teaches you how graphs are built. This transparency is the whole point.
Tier 1: Tape-Based Tracing
The first major optimization is to separate graph construction from graph evaluation. Instead of rebuilding the graph on every call, trace it once and replay it.
The idea
A tape is a linear record of operations. Instead of creating R6 objects during the forward pass, you record each operation as a tuple:
{operation, input_indices, output_index}
For example, the log-likelihood might produce a tape like:
```
[1] LOG input=0 output=1   # log(lambda)
[2] MUL input=1,2 output=3 # n * log(lambda)
[3] MUL input=0,4 output=5 # lambda * sum_x
[4] SUB input=3,5 output=6 # n*log(lambda) - lambda*sum_x
```
The numeric values live in a flat vector (the value buffer), not in individual R6 objects. To evaluate at new parameter values, you update slot 0 in the buffer and replay the tape—no object creation, no S3 dispatch, no garbage collection.
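As a sketch only (hypothetical helper names, not part of femtograd), the tape above can be represented and replayed in a few lines of base R. Slots are shifted to 1-based indices here, since R vectors start at 1: slot 1 holds lambda, slot 3 holds n, and slot 5 holds sum(x).

```r
# Hypothetical tape for the exponential log-likelihood; indices are 1-based.
tape <- list(
  list(op = "LOG", inputs = 1,       output = 2),  # log(lambda)
  list(op = "MUL", inputs = c(2, 3), output = 4),  # n * log(lambda)
  list(op = "MUL", inputs = c(1, 5), output = 6),  # lambda * sum_x
  list(op = "SUB", inputs = c(4, 6), output = 7)   # n*log(lambda) - lambda*sum_x
)

# Replay the tape over a flat value buffer: no R6 objects, no S3 dispatch.
replay <- function(tape, buffer) {
  for (step in tape) {
    args <- buffer[step$inputs]
    buffer[step$output] <- switch(step$op,
      LOG = log(args[1]),
      MUL = args[1] * args[2],
      SUB = args[1] - args[2]
    )
  }
  buffer[tape[[length(tape)]]$output]  # value in the final output slot
}
```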
What this buys you
| Aspect | Tier 0 | Tier 1 |
|---|---|---|
| Graph construction | Every call | Once |
| Memory pattern | R6 objects on heap | Flat numeric vectors |
| Dispatch overhead | S3 per operation | Switch/table per operation |
| Backward pass | Walk DAG of closures | Reverse-iterate the tape |
The key insight is that the structure of the computation doesn’t change between optimizer iterations—only the values do. Taping exploits this by caching the structure.
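Continuing the sketch above, replaying the same tape at several parameter values touches only slot 1 of the buffer; the structure is never rebuilt.

```r
set.seed(42)
x <- rexp(50, rate = 2)

buffer <- numeric(7)
buffer[3] <- length(x)   # n
buffer[5] <- sum(x)      # sum_x

# Same tape, three different parameter values: only slot 1 changes.
sapply(c(0.5, 1.5, 2.5), function(lambda) {
  buffer[1] <- lambda
  replay(tape, buffer)
})
```

At lambda = 1.5 this should reproduce the log-likelihood value from the Tier 0 example (-22.29947), since the tape encodes the same formula on the same data.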
The trade-off
Taping requires that the computation's control flow is static: no if/else branches that depend on parameter values. If the graph structure can change (e.g., different code paths for positive vs. negative parameters), the tape must be re-recorded.
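For instance, a made-up objective like the one below cannot be taped once and for all, because which branch runs depends on the parameter's current value.

```r
# Hypothetical objective with parameter-dependent control flow: a tape recorded
# while theta > 0 encodes only the first branch and would need re-recording
# (or would silently be wrong) once theta drops to zero or below.
objective <- function(theta) {
  if (theta > 0) {
    log(theta) - theta
  } else {
    -theta^2
  }
}
```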
This is exactly the distinction between TensorFlow 1.x's static graphs (taped) and PyTorch's eager mode (graph-per-call, like femtograd). JAX's jit is a hybrid: it traces a function once and re-traces only when the input shapes, dtypes, or static arguments change.
For MLE problems, the log-likelihood almost always has static structure, so taping works well.
Tier 2: Compiled Backends
The second optimization is to move the tape interpreter out of R entirely.
C++ via Rcpp
Even with taping, replaying the tape in R’s interpreter has per-operation overhead: function call setup, type checking, vector recycling rules. Moving the tape interpreter to C++ via Rcpp eliminates this: node data becomes contiguous C++ arrays, and the backward pass becomes a tight loop with no R dispatch.
This is the approach taken by TMB (Template Model Builder), one of the most widely used AD tools in statistics. TMB uses CppAD (a C++ automatic differentiation library) behind an R interface. Users write their log-likelihood in C++ templates, and TMB compiles it into an efficient tape that can be evaluated and differentiated without returning to R.
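For orientation, the R side of a TMB fit typically looks something like the sketch below. The file name expmodel.cpp and its contents (a negative log-likelihood written in TMB's C++ template, not shown here) are assumed for illustration, as is the data vector x from the earlier example.

```r
library(TMB)

compile("expmodel.cpp")                 # build the C++ objective once
dyn.load(dynlib("expmodel"))

obj <- MakeADFun(
  data       = list(x = x),
  parameters = list(log_lambda = 0),
  DLL        = "expmodel",
  silent     = TRUE
)

# obj$fn and obj$gr evaluate the compiled tape; nlminb minimizes the
# negative log-likelihood, so no graph is ever rebuilt in R.
opt <- nlminb(obj$par, obj$fn, obj$gr)
sdreport(obj)                           # standard errors from the AD Hessian
```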
The performance difference is substantial: TMB can handle models with thousands of parameters and millions of observations, while femtograd is practical for modest-sized problems.
GPU acceleration
For problems where the bottleneck is arithmetic throughput rather than per-node overhead—large matrix operations, batch likelihoods over massive datasets—the tape can be evaluated on a GPU.
This is what torch (the R interface to PyTorch’s libtorch) provides. The R API stays the same; only the backend changes. For femtograd’s typical use case (small parameter spaces, moderate data), GPU acceleration is rarely necessary. The bottleneck is per-node interpreter overhead, not floating-point throughput, and GPUs don’t help with that until the tape is already compiled.
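As a point of comparison, here is a sketch of the same exponential log-likelihood gradient written with the torch R package, assuming (as in the Tier 0 example) that the objective is n*log(lambda) - lambda*sum(x). The API shape is the same: build the expression, call backward, read the gradient; only the engine underneath differs.

```r
library(torch)

set.seed(42)
x <- rexp(50, rate = 2)

lambda <- torch_tensor(1.5, requires_grad = TRUE)
loglik <- length(x) * torch_log(lambda) - lambda * sum(x)

loglik$backward()
lambda$grad   # should match grad(params[[1]]) from the Tier 0 example
```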
The abstraction boundary
The key architectural insight is that Tier 2 doesn't change the API; it changes the backend. In principle, you could keep femtograd's val(), backward(), and grad() interface and swap the implementation from R6 objects to a C++ tape engine. The user-facing code would be identical; only the performance characteristics would change.
This is why understanding Tier 0 is valuable even if you’ll eventually use a Tier 2 tool: the concepts (computational graphs, chain rule, reverse-mode traversal) are the same at every tier. The tiers differ only in how much interpreter overhead sits between the math and the metal.
Where femtograd Chooses to Live
femtograd stays at Tier 0 because its goal is not speed—it’s understanding. Moving to Tier 1 would hide the graph structure behind a tape abstraction. Moving to Tier 2 would hide the operations behind a C++ wall. Each tier trades transparency for performance.
For real-world statistical modeling:
- Small problems (< 10 parameters, < 1000 observations): femtograd is perfectly adequate
- Medium problems: Consider TMB for compiled AD with an R interface
- Large problems or deep learning: Use torch for GPU-accelerated tensor operations
But for learning how automatic differentiation works—for tracing a gradient from output back to input, for understanding why the Hessian gives you standard errors, for seeing the chain rule operate on real code—Tier 0 is exactly right.
```r
# The whole pipeline, readable end to end:
set.seed(42)
data <- rnorm(100, mean = 5, sd = 2)

result <- fit(
  function(mu, log_sigma) {
    sigma <- exp(log_sigma)
    loglik_normal(mu, sigma, data)
  },
  params = c(mu = 0, log_sigma = 0)
)

summary(result)
#> femtofit: Maximum Likelihood Estimation
#>
#> Call:
#> fit(loglik = function(mu, log_sigma) {
#>     sigma <- exp(log_sigma)
#>     loglik_normal(mu, sigma, data)
#> }, params = c(mu = 0, log_sigma = 0))
#>
#> Coefficients:
#>           Estimate Std. Error z value  Pr(>|z|)
#> mu        5.065030   0.207227  24.442 < 2.2e-16 ***
#> log_sigma 0.728647   0.070711  10.305 < 2.2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> ---
#> Log-likelihood: -214.8 AIC: 433.5
#> Converged in 22 iterations
```

The code above builds a graph, differentiates it, inverts the Hessian, and reports standard errors: all in pure R, all inspectable, all debuggable. That's the point.
Summary
| Tier | Strategy | Example tools | Transparency | Speed |
|---|---|---|---|---|
| 0 | Graph-per-call (R6 + closures) | femtograd | Fully inspectable | Modest |
| 1 | Tape-based tracing | TensorFlow 1.x, JAX | Graph cached, less visible | Faster |
| 2 | Compiled backend (C++, GPU) | TMB, torch | Opaque backend | Production-grade |
Every tier implements the same mathematical ideas: computational graphs, the chain rule, and reverse-mode traversal. The difference is how much machinery sits between you and those ideas. femtograd minimizes that machinery so you can see the math clearly.