FemtoGrad is a minimalist automatic differentiation library I built for learning. The goal was to strip autodiff down to its core and see what’s left.
What is Automatic Differentiation?
Automatic differentiation (autodiff) computes derivatives of functions specified by computer programs. It’s distinct from numerical differentiation (approximate, and numerically unstable as the step size shrinks) and symbolic differentiation (derivative expressions can grow exponentially, a problem known as expression swell). Autodiff gives you exact derivatives, up to floating-point precision, at a computational cost proportional to evaluating the function itself.
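To make the contrast with numerical differentiation concrete, here is a small sketch (plain Python, not FemtoGrad code; the function and step sizes are chosen for illustration). Shrinking the step first improves the central-difference estimate, then makes it worse as floating-point round-off dominates:

```python
import math

# Central-difference approximation of f'(x). Illustrative sketch only.
def central_diff(f, x, h):
    return (f(x + h) - f(x - h)) / (2 * h)

exact = math.cos(1.0)  # true derivative of sin at x = 1
for h in (1e-1, 1e-5, 1e-13):
    approx = central_diff(math.sin, 1.0, h)
    # Truncation error shrinks with h, but round-off error grows as ~eps/h,
    # so a very small h is *less* accurate: the instability mentioned above.
    print(f"h={h:g}  error={abs(approx - exact):.2e}")
```

Autodiff avoids this trade-off entirely because it never approximates: it composes exact local derivatives.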
Reverse Mode AD
FemtoGrad implements reverse-mode AD, the technique known in machine learning as backpropagation. It works in two phases:
- Forward pass: compute the function value, recording operations as you go.
- Backward pass: accumulate gradients by applying the chain rule in reverse.
- The cost per output is a constant multiple of the forward pass, regardless of input dimensionality: one backward pass yields the gradient with respect to every input at once.
This is why backprop scales to millions of parameters: the cost of computing the full gradient is a small constant factor times the cost of computing the function.
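A hand-unrolled example makes the two phases concrete. This is plain Python with illustrative variable names, not the FemtoGrad API: each backward line multiplies the incoming gradient by the op’s local derivative, in reverse order of the forward pass.

```python
x, w, b = 3.0, 2.0, 1.0

# Forward pass: compute and record intermediate values.
u = x * w   # u = 6.0
y = u + b   # y = 7.0

# Backward pass: apply the chain rule in reverse order of the forward pass.
dy_dy = 1.0                # seed: dy/dy = 1
dy_du = dy_dy * 1.0        # local rule for u + b: d/du = 1
dy_db = dy_dy * 1.0        # local rule for u + b: d/db = 1
dy_dx = dy_du * w          # local rule for x * w: d/dx = w
dy_dw = dy_du * x          # local rule for x * w: d/dw = x

print(dy_dx, dy_dw, dy_db)  # 2.0 3.0 1.0
```

Notice that the backward pass touches each recorded operation exactly once, which is where the constant-factor cost comes from.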
Core Abstractions
class Tensor:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0
        self._backward = lambda: None  # set by the op that created this tensor
        self._prev = set(_children)
        self._op = _op
Each tensor tracks its value, its gradient, how it was computed (the parent nodes and the operation), and how to backpropagate through that operation. That’s it. The whole thing is a DAG with local gradient rules at each node.
What It Demonstrates
- Computational graphs: how operations form a DAG.
- Gradient flow: the chain rule in action.
- Dynamic construction: graphs are built during the forward pass, not declared ahead of time.
- Simplicity: core autodiff in about 100 lines.
Supported Operations
Arithmetic (add, multiply, divide, power), activation functions (ReLU, sigmoid, tanh), and reductions (sum, mean). This is enough to build and train neural networks.
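Each activation’s backward rule multiplies the incoming gradient by a closed-form local derivative. The identities below are standard calculus (sigmoid’ = s(1−s), tanh’ = 1 − t², relu’ = 1 for x > 0 else 0), and checking them against a finite difference is a useful sanity test when adding a new op. This is a plain-Python sketch, not FemtoGrad code.

```python
import math

# Closed-form local derivatives an op's _backward would multiply by.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2

def d_relu(x):
    return 1.0 if x > 0 else 0.0

# Sanity-check each rule against a central difference at a sample point.
h = 1e-6
for f, df, x in [(sigmoid, d_sigmoid, 0.5),
                 (math.tanh, d_tanh, 0.5),
                 (lambda v: max(v, 0.0), d_relu, 0.5)]:
    numeric = (f(x + h) - f(x - h)) / (2 * h)
    print(f"analytic {df(x):.6f} vs numeric {numeric:.6f}")
```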
Example
# Create tensors
a = Tensor(2.0)
b = Tensor(3.0)

# Build the computation: c = a*b + b^2
c = a * b + b**2

# Backward pass populates .grad on every node
c.backward()

print(a.grad)  # dc/da = b = 3.0
print(b.grad)  # dc/db = a + 2b = 8.0
Beyond FemtoGrad
Understanding FemtoGrad gives you insight into PyTorch’s autograd, TensorFlow’s GradientTape, and JAX’s grad function. They all implement the same core ideas with additional optimizations and features. But the basic mechanism is exactly this.
Discussion