Adam’s second moment estimate is a diagonal approximation to the empirical Fisher Information Matrix. K-FAC uses a Kronecker-factored approximation. EWC uses the Fisher as a regularization penalty. Natural gradient uses the full thing. Same underlying structure, different computational budgets.
None of this is new. Amari formalized natural gradient in 1998. Martens (2020) wrote a thorough 76-page survey connecting it to second-order optimization. Khan and Rue (2023) unified these algorithms under a single Bayesian learning rule. Kunstner, Hennig, and Balles (2019) showed where the empirical Fisher approximation breaks. I spent a while working through the math, and the pattern is clean once you see it, so I want to write it down.
Fisher Information
Parametric model $p(x|\theta)$. Here $x$ is observed data and $\theta \in \mathbb{R}^d$ is the parameter vector. The score function is the gradient of the log-likelihood with respect to the parameters:
$$s(\theta; x) = \nabla_\theta \log p(x|\theta)$$

Under regularity conditions, the score has zero mean: $\mathbb{E}_{x \sim p(x|\theta)}[s(\theta; x)] = 0$. The Fisher Information Matrix $\mathcal{I}(\theta) \in \mathbb{R}^{d \times d}$ is its covariance:
$$\mathcal{I}(\theta) = \text{Cov}[s(\theta; x)] = \mathbb{E}_{x \sim p(x|\theta)}\!\left[s(\theta; x)\, s(\theta; x)^\top\right]$$

Under the same regularity conditions (differentiability, integrability, interchange of differentiation and integration), this equals $-\mathbb{E}[\nabla^2_\theta \log p(x|\theta)]$, the expected negative Hessian. This is Bartlett’s second identity.
What does the Fisher measure? Not how sharply the likelihood peaks around the data. It measures how well the data distinguishes nearby parameter values. If $\mathcal{I}(\theta)$ is large in some direction, then a small change in $\theta$ along that direction produces a large change in the distribution $p(x|\theta)$, which means the data is informative about that component. If $\mathcal{I}(\theta)$ is small, nearby parameter values produce nearly indistinguishable distributions, and the data can’t tell them apart. Formally, the Fisher is the Hessian of the KL divergence $D_{\text{KL}}(p_\theta \,\|\, p_{\theta + \delta\theta})$ at $\delta\theta = 0$.
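The two definitions above can be checked numerically. A minimal sketch for a Gaussian model $p(x|\theta) = \mathcal{N}(\theta, 1)$, where the score is $x - \theta$ and the Hessian of the log-likelihood is the constant $-1$:

```python
import numpy as np

# Monte Carlo check of the score and Fisher identities for p(x|theta) = N(theta, 1).
# log p = -(x - theta)^2 / 2 + const, so the score is s(theta; x) = x - theta
# and the Hessian is -1 everywhere.
rng = np.random.default_rng(0)
theta = 2.0
x = rng.normal(loc=theta, scale=1.0, size=1_000_000)  # samples from p(x|theta)

score = x - theta             # s(theta; x)
print(score.mean())           # ~0: the score has zero mean under the model
print((score ** 2).mean())    # ~1: Fisher = Cov[score] = 1 for unit variance
# Bartlett's second identity: -E[Hessian] = -(-1) = 1, matching the score covariance.
```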
Fisher information from independent observations adds:
$$\mathcal{I}_{1:n}(\theta) = \sum_{i=1}^n \mathcal{I}_{x_i}(\theta)$$

where $\mathcal{I}_{x_i}(\theta) = s(\theta; x_i)\, s(\theta; x_i)^\top$ is the per-observation Fisher contribution. (For i.i.d. observations, $\mathcal{I}_{1:n}(\theta) = n \cdot \mathcal{I}(\theta)$, but the additivity holds for non-identical observations too, which is what makes sequential and distributed estimation clean.)
This gives you a natural update rule. Maintain a running estimate $\hat\theta_t \in \mathbb{R}^d$ and accumulated Fisher information $\mathcal{I}_t \in \mathbb{R}^{d \times d}$. When new data arrives with score $s_t$ and Fisher contribution $\mathcal{I}_{\text{new}}$:
$$\mathcal{I}_t = \mathcal{I}_{t-1} + \mathcal{I}_{\text{new}}$$

$$\hat\theta_t = \hat\theta_{t-1} + \eta\, \mathcal{I}_t^{-1} s_t$$

where $\eta > 0$ is a learning rate. Since the score $s_t$ points in the direction of increasing log-likelihood, this is gradient ascent on the likelihood, preconditioned by the inverse Fisher. The preconditioner $\mathcal{I}_t^{-1}$ makes this a natural gradient step: it rescales the update to respect the Fisher-Rao metric on the statistical manifold, so that equal-sized steps correspond to equal changes in the distribution, not equal changes in the parameter vector. If you’re minimizing a loss $L = -\log p$, the gradient is $-s_t$ and the update has the usual descent sign.
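A minimal sketch of this update for a toy model (assumptions: a 2-D Gaussian with known covariance $\Sigma$, so the score at observation $x$ is $\Sigma^{-1}(x - \hat\theta)$ and each per-observation Fisher contribution is $\Sigma^{-1}$):

```python
import numpy as np

# Sequential natural-gradient estimation of the mean of a 2-D Gaussian with
# known covariance. Score: Sigma^{-1} (x - theta); per-obs Fisher: Sigma^{-1}.
rng = np.random.default_rng(1)
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)
theta_true = np.array([1.0, -1.0])

theta_hat = np.zeros(2)           # running estimate
I_acc = 1e-6 * np.eye(2)          # accumulated Fisher (tiny jitter for invertibility)
eta = 1.0

for _ in range(500):
    x = rng.multivariate_normal(theta_true, Sigma)
    s = Sigma_inv @ (x - theta_hat)                # score of the new observation
    I_acc += Sigma_inv                             # additivity: Fisher accumulates
    theta_hat += eta * np.linalg.solve(I_acc, s)   # inverse-Fisher-preconditioned step

print(theta_hat)   # approaches theta_true
```

With $\eta = 1$, this particular model reduces to the running-mean recursion $\hat\theta_t = \hat\theta_{t-1} + (x_t - \hat\theta_{t-1})/t$: the accumulated Fisher automatically supplies the $1/t$ step-size decay.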
The Approximation Hierarchy
Every common optimizer can be viewed as using some approximation to the Fisher as its preconditioner. The framing is most natural for natural gradient, less natural for SGD (where it’s more of an interpretation than a design choice):
| Method | Preconditioner structure | Per-step cost | What’s discarded |
|---|---|---|---|
| SGD | $\eta I_d$ (learning rate $\times$ identity) | $O(d)$ | All curvature information |
| Adam | $\text{diag}(\hat{v}_t)^{-1/2}$, $\hat{v}_t \in \mathbb{R}^d$ | $O(d)$ | Off-diagonal structure, uses square root |
| K-FAC | $(A \otimes B)^{-1}$, $A \in \mathbb{R}^{n \times n}$, $B \in \mathbb{R}^{m \times m}$ | $O(m^3 + n^3)$* | Cross-layer coupling |
| Natural gradient | $\mathcal{I}(\theta)^{-1} \in \mathbb{R}^{d \times d}$ | $O(d^3)$ | Nothing |
$d$ is the total parameter count. $I_d$ is the $d \times d$ identity. $\otimes$ is the Kronecker product. For K-FAC, $A = \mathbb{E}[a_{l-1} a_{l-1}^\top]$ is the input activation covariance and $B = \mathbb{E}[\delta_l \delta_l^\top]$ is the output gradient covariance for a layer with weight matrix $W_l \in \mathbb{R}^{m \times n}$, input activations $a_{l-1} \in \mathbb{R}^n$, and backpropagated gradients $\delta_l \in \mathbb{R}^m$ (Martens and Grosse, 2015).
*K-FAC’s $O(m^3 + n^3)$ cost is for the periodic inversion of the Kronecker factors. K-FAC does not invert every step; it amortizes the cost by inverting every $T_{\text{inv}}$ steps and using stale inverses in between. The per-step factor update (accumulating $a a^\top$ and $\delta \delta^\top$) costs $O(n^2 + m^2)$.
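Why the Kronecker factorization is cheap: for symmetric positive definite factors, the identity $(A \otimes B)^{-1}\,\text{vec}(G) = \text{vec}(B^{-1} G A^{-1})$ (column-stacking vec) applies the preconditioner to a layer gradient $G \in \mathbb{R}^{m \times n}$ with two small solves instead of one $(mn \times mn)$ solve. A sketch verifying the identity numerically:

```python
import numpy as np

# Check (A kron B)^{-1} vec(G) == vec(B^{-1} G A^{-1}) for SPD Kronecker factors.
# Shapes follow the text: A (n x n) activation covariance, B (m x m) gradient
# covariance, G (m x n) layer gradient.
rng = np.random.default_rng(2)
m, n = 3, 4
A = rng.standard_normal((n, n)); A = A @ A.T + n * np.eye(n)   # SPD factor
B = rng.standard_normal((m, m)); B = B @ B.T + m * np.eye(m)   # SPD factor
G = rng.standard_normal((m, n))                                # layer gradient

# Direct route: solve against the full Kronecker product, O((mn)^3).
full = np.linalg.solve(np.kron(A, B), G.flatten(order="F"))
# Factored route: two small solves, O(m^3 + n^3).
fast = np.linalg.solve(B, G) @ np.linalg.inv(A)                # B^{-1} G A^{-1}

print(np.allclose(full, fast.flatten(order="F")))              # True
```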
Calling SGD a “scalar Fisher” approximation is the Fisher perspective’s interpretation, not SGD’s self-understanding. SGD uses a fixed learning rate because it works, not because someone decided to approximate the Fisher with a scalar. But the interpretation is structurally correct: SGD treats all parameter directions as equally curved, which is what happens when you precondition with a multiple of the identity.
Adam and the Diagonal Fisher
Adam’s second moment update is $\hat{v}_t = \beta_2 \hat{v}_{t-1} + (1-\beta_2) g_t^2$. Here $g_t \in \mathbb{R}^d$ is the mini-batch gradient at step $t$, $\beta_2 \in [0,1)$ is the exponential decay rate, and the square is element-wise. This is an exponentially-weighted moving average of squared gradients, which tracks the diagonal of what you might call the “empirical Fisher.”
I say “might call” because there are two things people mean by “empirical Fisher,” and they’re not the same:
- Per-sample empirical Fisher: $\hat{\mathcal{I}} = \frac{1}{n} \sum_{i=1}^n s(\theta; x_i) s(\theta; x_i)^\top$, where each $s(\theta; x_i)$ is the per-sample gradient. The diagonal of this is $\frac{1}{n} \sum_i [s(\theta; x_i)]^2$, the mean of per-sample squared gradients.
- Mini-batch gradient outer product: $g_t g_t^\top$, where $g_t = \frac{1}{|B|}\sum_{i \in B} s(\theta; x_i)$ is the batch-averaged gradient. The diagonal is $g_t^2$.
Adam tracks the second one, not the first. These differ because $(\frac{1}{n}\sum_i s_i)^2 \neq \frac{1}{n}\sum_i s_i^2$ in general. The distinction is discussed by Kunstner et al. (2019), who show that neither version generally approximates the true Fisher well away from critical points.
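The gap between the two diagonals is exactly the per-sample gradient variance, since $\frac{1}{n}\sum_i s_i^2 - (\frac{1}{n}\sum_i s_i)^2 = \text{Var}[s_i]$ element-wise. A sketch with arbitrary stand-in per-sample gradients:

```python
import numpy as np

# The two "empirical Fisher" diagonals are different objects. Random vectors
# stand in for per-sample gradients s_i; any values illustrate the point.
rng = np.random.default_rng(3)
s = rng.standard_normal((32, 5))         # 32 per-sample gradients, d = 5

per_sample_diag = (s ** 2).mean(axis=0)  # mean of squared per-sample gradients
g = s.mean(axis=0)                       # batch-averaged gradient (what Adam sees)
batch_diag = g ** 2                      # diagonal of g g^T, Adam's target

# E[s^2] = (E[s])^2 + Var[s]: the difference is the per-sample variance.
print(per_sample_diag - batch_diag)      # nonzero in general
```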
The connection between Adam and natural gradient is real but requires four simplifications to make exact:
- Drop bias correction. Remove the $1/(1 - \beta_2^t)$ warmup term.
- Drop epsilon. Set the numerical stability constant to zero.
- Drop the square root. This is the big one. Adam divides by $\sqrt{\hat{v}_t} + \epsilon$. The natural gradient preconditioner divides by $\hat{v}_t$. Adam is using a square-root preconditioner: $\text{diag}(\hat{v}_t)^{-1/2}$ rather than $\text{diag}(\hat{v}_t)^{-1}$. These are different objects with different scaling properties.
- Accept the empirical Fisher. The true Fisher is $\mathbb{E}_{x \sim p_\theta}[s(\theta; x) s(\theta; x)^\top]$, an expectation under the model’s distribution. The empirical Fisher uses the training data distribution. They agree when the gradient is zero (i.e., at critical points of the expected log-likelihood, where the model’s distribution matches the data distribution in the relevant sense). In practice you are almost never at such a point.
With all four simplifications, simplified Adam and diagonal natural gradient with exponential forgetting $\beta_2$ produce identical parameter trajectories. Without them, they don’t. The square root is the most consequential difference: it changes the effective preconditioning from $1/v_i$ to $1/\sqrt{v_i}$, compressing the dynamic range of the per-parameter learning rates.
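The dynamic-range compression is easy to quantify. For illustrative second-moment estimates spanning eight orders of magnitude:

```python
import numpy as np

# Compare per-parameter step scales of the diagonal natural-gradient
# preconditioner (1/v) and Adam's square-root version (1/sqrt(v)).
v = np.array([1e-4, 1e-2, 1.0, 1e2, 1e4])   # second-moment estimates, 8 orders of magnitude
nat = 1.0 / v                                # diag(v)^{-1}: natural-gradient scaling
adam = 1.0 / np.sqrt(v)                      # diag(v)^{-1/2}: Adam's scaling

print(nat.max() / nat.min())    # 1e8: full dynamic range of per-parameter rates
print(adam.max() / adam.min())  # 1e4: the square root halves it on a log scale
```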
This tells you what Adam is actually doing: per-parameter learning rate adaptation based on recent gradient magnitudes. It is not principled curvature estimation in most of parameter space. The Fisher interpretation holds approximately near convergence, where the empirical and true Fisher converge and the square root is a monotonic distortion that preserves the ranking of directions. Away from convergence, the interpretation is loose.
EWC and the Fisher as Memory
Elastic Weight Consolidation (Kirkpatrick et al., 2017) is the simplest case because there is no approximation gap to analyze. EWC defines its loss as:
$$\mathcal{L}_{\text{EWC}}(\theta) = \mathcal{L}_{\text{new}}(\theta) + \frac{\lambda}{2} \sum_{i=1}^{d} F_i (\theta_i - \theta_i^*)^2$$

$\mathcal{L}_{\text{new}}(\theta)$ is the loss on the current task. $\lambda > 0$ controls regularization strength. $F_i$ is the $i$-th diagonal entry of the (empirical) Fisher Information Matrix computed at the end of training on the previous task. $\theta^* \in \mathbb{R}^d$ are the parameters saved after that training.
The Fisher acts as a memory of which parameters mattered. High $F_i$ means the likelihood was sensitive to parameter $\theta_i$ during the old task (equivalently, the old task’s data was informative about $\theta_i$). The quadratic penalty is large, so $\theta_i$ resists moving. Low $F_i$ means the old task didn’t depend on that parameter, so it’s free to change for the new task.
This is Fisher regularization. Kirkpatrick et al. designed it this way, citing the Fisher explicitly. There is no hidden approximation in the connection. (There is an approximation in the method: EWC uses a diagonal Fisher and a quadratic penalty around $\theta^*$, which is a Laplace approximation to the log-posterior. Whether that’s a good approximation depends on the geometry of the loss surface near $\theta^*$.)
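The penalty itself is a few lines. A sketch with illustrative values (the function name and toy numbers are mine, not from the paper):

```python
import numpy as np

def ewc_penalty(theta, theta_star, F_diag, lam):
    """EWC quadratic penalty: (lambda/2) * sum_i F_i * (theta_i - theta_i*)^2."""
    return 0.5 * lam * np.sum(F_diag * (theta - theta_star) ** 2)

theta_star = np.array([1.0, -2.0, 0.5])   # parameters saved after the old task
F_diag = np.array([10.0, 0.01, 1.0])      # diagonal Fisher: param 0 mattered, param 1 didn't
theta = theta_star + 0.1                  # move every parameter by the same amount

# Per-parameter contributions: the high-Fisher parameter dominates the penalty,
# so gradient descent on the total loss resists moving it.
print(0.5 * 1.0 * F_diag * 0.1 ** 2)
print(ewc_penalty(theta, theta_star, F_diag, lam=1.0))
```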
Two Regimes
All of the above assumes something that usually isn’t true in practice: convergence.
Classical regime. Train to convergence on a well-specified model. At the MLE $\hat\theta_n$, the observed Fisher (negative Hessian of the log-likelihood) converges to the expected Fisher. Asymptotic theory gives you consistency, normality, and Cramer-Rao efficiency: the MLE achieves the lowest possible variance among unbiased estimators, and that variance is $\mathcal{I}(\theta_0)^{-1}/n$ where $\theta_0$ is the true parameter. Uncertainty quantification via the inverse Fisher is principled.
Modern ML regime. Stop before convergence, which is what actually happens. Computational budgets, early stopping for generalization, training instability. The Fisher information accumulated along the optimization trajectory is not the Fisher at any stationary point. It reflects the geometry of the path through parameter space, not curvature at the destination.
The distinction matters because most results that make Fisher information elegant require the classical regime. Additivity is always true (it’s a property of independent data, not of convergence). But the Cramer-Rao bound, the interpretation of $\mathcal{I}^{-1}$ as a covariance, the equivalence of expected and observed Fisher: these require convergence to the true parameter (or at least to a neighborhood of it).
When you’re not converged, the accumulated Fisher still does useful work, but as a different kind of object. Barrett and Dherin (2021) showed that SGD with finite learning rate implicitly penalizes the trace of the empirical Fisher, connecting the optimizer’s discretization error to a Fisher-based regularizer. Jastrzebski et al. (2021) showed that the Fisher trace in the early phase of training has a lasting effect on generalization. The information-geometric view still applies in this regime. The classical statistical guarantees don’t.
References
- Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2), 251-276.
- Barrett, D.G.T. and Dherin, B. (2021). Implicit Gradient Regularization. ICLR.
- Jastrzebski, S. et al. (2021). Catastrophic Fisher Explosion: Early Phase Fisher Matrix Impacts Generalization. ICML.
- Khan, M.E. and Rue, H. (2023). The Bayesian Learning Rule. JMLR, 24(281), 1-76.
- Kirkpatrick, J. et al. (2017). Overcoming catastrophic forgetting in neural networks. PNAS, 114(13), 3521-3526.
- Kunstner, F., Hennig, P., and Balles, L. (2019). Limitations of the Empirical Fisher Approximation for Natural Gradient Descent. NeurIPS.
- Martens, J. (2020). New Insights and Perspectives on the Natural Gradient Method. JMLR, 21(146), 1-76.
- Martens, J. and Grosse, R. (2015). Optimizing Neural Networks with Kronecker-Factored Approximate Curvature. ICML.