Architecture of an AD Engine
From femtograd to micrograd
Introduction
femtograd is deliberately simple. Every operation creates an R6 object, every backward pass walks a DAG of R closures, and every optimization step rebuilds the entire computational graph from scratch. This makes the code readable and debuggable—you can inspect any node, print any gradient, and trace any computation in plain R.
But simplicity has costs. For large datasets or many parameters, femtograd will be slower than production tools like TMB or torch. This is by design: femtograd exists to teach, not to compete.
This vignette explains why femtograd is structured the way it is, and what architectural changes would be needed to make it faster. We present three tiers that any AD system progresses through, from pure-interpreter implementations (where femtograd lives) to compiled backends (where production tools live). Understanding these tiers makes production AD tools less opaque.
Tier 0: Graph-per-Call (femtograd today)
femtograd’s current architecture rebuilds the computational graph on every function evaluation. Consider what happens during one step of gradient ascent:
```r
# Generate some data
set.seed(42)
x <- rexp(50, rate = 2)

# One step of the "inner loop":

# 1. Create fresh value objects from numeric parameters
params <- list(val(1.5))

# 2. Run the forward pass (builds the graph)
obj <- loglik_exponential(params[[1]], x)

# 3. Extract the scalar result
get_data(obj)
#> [1] -22.29947

# 4. Run the backward pass (walks the graph)
backward(obj)

# 5. Extract the gradient
grad(params[[1]])
#> [1] 4.951519
```

Steps 1–5 are the fundamental unit of work for any gradient-based optimizer. femtograd's gradient_ascent(), bfgs(), and lbfgs() all perform this sequence on every iteration.
What happens under the hood
Each arithmetic operation (+, *, log, etc.) creates a new R6 value object on the R heap. The object stores:
- data: The numeric result (a matrix)
- grad: The gradient accumulator (initially zero)
- prev: Pointers to parent nodes (forming the DAG)
- backward_fn: An R closure encoding the local derivative
For a log-likelihood summing over n observations, this creates on the order of n nodes per evaluation. Each node involves R6 object allocation, S3 method dispatch, and closure creation, all of them interpreted R operations.
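To make the structure concrete, here is a minimal sketch in the same spirit as femtograd, but not its actual source: the Node class and mul() helper below are hypothetical names used only for illustration.

```r
library(R6)

# Hypothetical minimal node; femtograd's real class differs in detail.
Node <- R6Class("Node", public = list(
  data = NULL,         # the numeric result
  grad = 0,            # gradient accumulator, starts at zero
  prev = list(),       # parent nodes: the edges of the DAG
  backward_fn = NULL,  # closure encoding the local derivative
  initialize = function(data, prev = list()) {
    self$data <- data
    self$prev <- prev
    self$backward_fn <- function() NULL  # leaf nodes have nothing to do
  }
))

# One operation: multiplication. The closure captures a, b, and out, so the
# backward pass can push out's gradient onto the parents via the product rule.
mul <- function(a, b) {
  out <- Node$new(a$data * b$data, prev = list(a, b))
  out$backward_fn <- function() {
    a$grad <- a$grad + b$data * out$grad   # d(ab)/da = b
    b$grad <- b$grad + a$data * out$grad   # d(ab)/db = a
  }
  out
}
```

A full backward() would then seed the output node's grad with 1 and call each node's backward_fn in reverse topological order over the DAG.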
The cost profile
| Operation | Mechanism | Cost |
|---|---|---|
| Node creation | R6$new() per operation | Heap allocation + GC pressure |
| Forward pass | S3 dispatch per operation | Interpreter overhead |
| Backward pass | Walk DAG, call closures | Interpreted loop + closure calls |
| Graph structure | Rebuilt every iteration | Redundant work |
For a likelihood with 100 observations and 2 parameters, each optimizer step creates hundreds of R6 objects, dispatches hundreds of S3 methods, and discards the entire graph before the next step.
Why this is the right choice for pedagogy
Despite the overhead, Tier 0 has critical advantages for learning:
- Inspectability: You can print() any node and see its value, gradient, and parents
- Debuggability: R's debugger works on every operation; set breakpoints inside backward_fn closures
- Extensibility: Adding a new operation means writing one R function with a closure
- No build step: Pure R, no compilation, installs anywhere R runs
The code is the documentation. Reading R/ops.R teaches you how multiplication's backward pass works. Reading R/value.R teaches you how graphs are built. This transparency is the whole point.
Tier 1: Tape-Based Tracing
The first major optimization is to separate graph construction from graph evaluation. Instead of rebuilding the graph on every call, trace it once and replay it.
The idea
A tape is a linear record of operations. Instead of creating R6 objects during the forward pass, you record each operation as a tuple:
{operation, input_indices, output_index}
For example, the log-likelihood might produce a tape like:
```
[1] LOG input=0 output=1   # log(lambda)
[2] MUL input=1,2 output=3 # n * log(lambda)
[3] MUL input=0,4 output=5 # lambda * sum_x
[4] SUB input=3,5 output=6 # n*log(lambda) - lambda*sum_x
```
The numeric values live in a flat vector (the value buffer), not in individual R6 objects. To evaluate at new parameter values, you update slot 0 in the buffer and replay the tape—no object creation, no S3 dispatch, no garbage collection.
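As a sketch only (hypothetical helper names, not part of femtograd), the tape above can be represented and replayed in a few lines of base R. Slots are shifted to 1-based indices here, since R vectors start at 1: slot 1 holds lambda, slot 3 holds n, and slot 5 holds sum(x).

```r
# Hypothetical tape for the exponential log-likelihood; indices are 1-based.
tape <- list(
  list(op = "LOG", inputs = 1,       output = 2),  # log(lambda)
  list(op = "MUL", inputs = c(2, 3), output = 4),  # n * log(lambda)
  list(op = "MUL", inputs = c(1, 5), output = 6),  # lambda * sum_x
  list(op = "SUB", inputs = c(4, 6), output = 7)   # n*log(lambda) - lambda*sum_x
)

# Replay the tape over a flat value buffer: no R6 objects, no S3 dispatch.
replay <- function(tape, buffer) {
  for (step in tape) {
    args <- buffer[step$inputs]
    buffer[step$output] <- switch(step$op,
      LOG = log(args[1]),
      MUL = args[1] * args[2],
      SUB = args[1] - args[2]
    )
  }
  buffer[tape[[length(tape)]]$output]  # value in the final output slot
}
```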
What this buys you
| Aspect | Tier 0 | Tier 1 |
|---|---|---|
| Graph construction | Every call | Once |
| Memory pattern | R6 objects on heap | Flat numeric vectors |
| Dispatch overhead | S3 per operation | Switch/table per operation |
| Backward pass | Walk DAG of closures | Reverse-iterate the tape |
The key insight is that the structure of the computation doesn’t change between optimizer iterations—only the values do. Taping exploits this by caching the structure.
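Continuing the sketch above, replaying the same tape at several parameter values touches only slot 1 of the buffer; the structure is never rebuilt.

```r
set.seed(42)
x <- rexp(50, rate = 2)

buffer <- numeric(7)
buffer[3] <- length(x)   # n
buffer[5] <- sum(x)      # sum_x

# Same tape, three different parameter values: only slot 1 changes.
sapply(c(0.5, 1.5, 2.5), function(lambda) {
  buffer[1] <- lambda
  replay(tape, buffer)
})
```

At lambda = 1.5 this should reproduce the log-likelihood value from the Tier 0 example (-22.29947), since the tape encodes the same formula on the same data.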
The trade-off
Taping requires that the computation's control flow is static: no if/else branches that depend on parameter values. If the graph structure can change (e.g., different code paths for positive vs. negative parameters), the tape must be re-recorded.
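For instance, a made-up objective like the one below cannot be taped once and for all, because which branch runs depends on the parameter's current value.

```r
# Hypothetical objective with parameter-dependent control flow: a tape recorded
# while theta > 0 encodes only the first branch and would need re-recording
# (or would silently be wrong) once theta drops to zero or below.
objective <- function(theta) {
  if (theta > 0) {
    log(theta) - theta
  } else {
    -theta^2
  }
}
```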
This is exactly the distinction between TensorFlow 1.x's static graphs (taped) and PyTorch's eager mode (graph-per-call, like femtograd). JAX's jit is a hybrid: it traces a function once and re-traces only when the input shapes, dtypes, or static arguments change.
For MLE problems, the log-likelihood almost always has static structure, so taping works well.
Tier 2: Compiled Backends
The second optimization is to move the tape interpreter out of R entirely.
C++ via Rcpp
Even with taping, replaying the tape in R’s interpreter has per-operation overhead: function call setup, type checking, vector recycling rules. Moving the tape interpreter to C++ via Rcpp eliminates this: node data becomes contiguous C++ arrays, and the backward pass becomes a tight loop with no R dispatch.
This is the approach taken by TMB (Template Model Builder), one of the most widely used AD tools in statistics. TMB uses CppAD (a C++ automatic differentiation library) behind an R interface. Users write their log-likelihood in C++ templates, and TMB compiles it into an efficient tape that can be evaluated and differentiated without returning to R.
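For orientation, the R side of a TMB fit typically looks something like the sketch below. The file name expmodel.cpp and its contents (a negative log-likelihood written in TMB's C++ template, not shown here) are assumed for illustration, as is the data vector x from the earlier example.

```r
library(TMB)

compile("expmodel.cpp")                 # build the C++ objective once
dyn.load(dynlib("expmodel"))

obj <- MakeADFun(
  data       = list(x = x),
  parameters = list(log_lambda = 0),
  DLL        = "expmodel",
  silent     = TRUE
)

# obj$fn and obj$gr evaluate the compiled tape; nlminb minimizes the
# negative log-likelihood, so no graph is ever rebuilt in R.
opt <- nlminb(obj$par, obj$fn, obj$gr)
sdreport(obj)                           # standard errors from the AD Hessian
```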
The performance difference is substantial: TMB can handle models with thousands of parameters and millions of observations, while femtograd is practical for modest-sized problems.
GPU acceleration
For problems where the bottleneck is arithmetic throughput rather than per-node overhead—large matrix operations, batch likelihoods over massive datasets—the tape can be evaluated on a GPU.
This is what torch (the R interface to PyTorch’s libtorch) provides. The R API stays the same; only the backend changes. For femtograd’s typical use case (small parameter spaces, moderate data), GPU acceleration is rarely necessary. The bottleneck is per-node interpreter overhead, not floating-point throughput, and GPUs don’t help with that until the tape is already compiled.
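As a point of comparison, here is a sketch of the same exponential log-likelihood gradient written with the torch R package, assuming (as in the Tier 0 example) that the objective is n*log(lambda) - lambda*sum(x). The API shape is the same: build the expression, call backward, read the gradient; only the engine underneath differs.

```r
library(torch)

set.seed(42)
x <- rexp(50, rate = 2)

lambda <- torch_tensor(1.5, requires_grad = TRUE)
loglik <- length(x) * torch_log(lambda) - lambda * sum(x)

loglik$backward()
lambda$grad   # should match grad(params[[1]]) from the Tier 0 example
```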
The abstraction boundary
The key architectural insight is that Tier 2 doesn't change the API; it changes the backend. In principle, you could keep femtograd's val(), backward(), and grad() interface and swap the implementation from R6 objects to a C++ tape engine. The user-facing code would be identical; only the performance characteristics would change.
This is why understanding Tier 0 is valuable even if you’ll eventually use a Tier 2 tool: the concepts (computational graphs, chain rule, reverse-mode traversal) are the same at every tier. The tiers differ only in how much interpreter overhead sits between the math and the metal.
Where femtograd Chooses to Live
femtograd stays at Tier 0 because its goal is not speed—it’s understanding. Moving to Tier 1 would hide the graph structure behind a tape abstraction. Moving to Tier 2 would hide the operations behind a C++ wall. Each tier trades transparency for performance.
For real-world statistical modeling:
- Small problems (< 10 parameters, < 1000 observations): femtograd is perfectly adequate
- Medium problems: Consider TMB for compiled AD with an R interface
- Large problems or deep learning: Use torch for GPU-accelerated tensor operations
But for learning how automatic differentiation works—for tracing a gradient from output back to input, for understanding why the Hessian gives you standard errors, for seeing the chain rule operate on real code—Tier 0 is exactly right.
```r
# The whole pipeline, readable end to end:
set.seed(42)
data <- rnorm(100, mean = 5, sd = 2)

result <- fit(
  function(mu, log_sigma) {
    sigma <- exp(log_sigma)
    loglik_normal(mu, sigma, data)
  },
  params = c(mu = 0, log_sigma = 0)
)

summary(result)
#> femtofit: Maximum Likelihood Estimation
#>
#> Call:
#> fit(loglik = function(mu, log_sigma) {
#>     sigma <- exp(log_sigma)
#>     loglik_normal(mu, sigma, data)
#> }, params = c(mu = 0, log_sigma = 0))
#>
#> Coefficients:
#>           Estimate Std. Error z value  Pr(>|z|)
#> mu        5.065030   0.207227  24.442 < 2.2e-16 ***
#> log_sigma 0.728647   0.070711  10.305 < 2.2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> ---
#> Log-likelihood: -214.8 AIC: 433.5
#> Converged in 22 iterations
```

The code above builds a graph, differentiates it, inverts the Hessian, and reports standard errors: all in pure R, all inspectable, all debuggable. That's the point.
Summary
| Tier | Strategy | Example tools | Transparency | Speed |
|---|---|---|---|---|
| 0 | Graph-per-call (R6 + closures) | femtograd | Fully inspectable | Modest |
| 1 | Tape-based tracing | TensorFlow 1.x, JAX | Graph cached, less visible | Faster |
| 2 | Compiled backend (C++, GPU) | TMB, torch | Opaque backend | Production-grade |
Every tier implements the same mathematical ideas: computational graphs, the chain rule, and reverse-mode traversal. The difference is how much machinery sits between you and those ideas. femtograd minimizes that machinery so you can see the math clearly.