
Fisher Flow: Optimization on the Statistical Manifold

Standard gradient descent treats parameter space as flat. It uses Euclidean distance, which means the same step size in parameter space can produce wildly different changes in the distribution depending on where you are. Fisher Flow fixes this by optimizing along the natural geometry of probability distributions.

The Geometry of Distributions

Probability distributions form a Riemannian manifold. The Fisher information matrix provides the natural metric on this manifold:

I(theta) = E_{x ~ p(x|theta)}[grad_theta log p(x|theta) grad_theta log p(x|theta)^T]

This captures how sensitive the distribution is to parameter changes. Under standard regularity conditions it equals the expected negative Hessian of the log-likelihood, i.e. the expected curvature of the likelihood surface. As a metric, it converts distances in parameter space into distances between distributions: locally, KL[p(x|theta) || p(x|theta + d)] is approximately (1/2) d^T I(theta) d.
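As a concrete check on the definition, the Fisher information can be estimated by Monte Carlo as the mean squared score. For a Bernoulli(p) distribution the score is d/dp log p(x|p) = x/p - (1-x)/(1-p) and the analytic Fisher information is 1/(p(1-p)); a sampled average of the squared score recovers it. A minimal sketch (the Bernoulli choice and function name are illustrative, not from the original):

```python
import numpy as np

def fisher_bernoulli_mc(p, n_samples=200_000, seed=0):
    """Monte Carlo estimate of I(p) = E[(d/dp log p(x|p))^2] for Bernoulli(p)."""
    rng = np.random.default_rng(seed)
    x = rng.random(n_samples) < p                     # Bernoulli(p) samples
    # Score: 1/p when x = 1, -1/(1-p) when x = 0.
    score = np.where(x, 1.0 / p, -1.0 / (1.0 - p))
    return np.mean(score ** 2)

p = 0.3
estimate = fisher_bernoulli_mc(p)
analytic = 1.0 / (p * (1.0 - p))
```

With enough samples the estimate matches the closed form, which is a useful sanity test for any Fisher estimator you plug into a natural-gradient method.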

Natural Gradient vs. Standard Gradient

Standard gradient descent updates parameters in Euclidean space:

theta_{t+1} = theta_t - alpha grad L(theta_t)

This is inefficient because the Euclidean metric ignores how the parameters jointly shape the distribution. A step of size epsilon in one direction might barely change the distribution, while the same step in another direction changes it drastically.

Natural gradient descent uses the Fisher metric:

theta_{t+1} = theta_t - alpha I(theta_t)^{-1} grad L(theta_t)

Pre-multiplying by the inverse Fisher information rescales the gradient to account for the local geometry. The key property: the natural gradient is invariant to smooth reparametrization (exactly for the continuous-time flow, and to first order in the step size for discrete updates). It does not matter how you choose to represent your distributions.
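The rescaling is easiest to see where the Fisher information has a closed form. For a Gaussian with known scale, the Fisher information for the mean is 1/sigma^2, so the natural gradient multiplies the Euclidean gradient by sigma^2; with step size 1, a single natural-gradient step lands exactly on the maximum-likelihood mean, whatever sigma is. A sketch under those assumptions (the function name is illustrative):

```python
import numpy as np

def natural_gradient_step_mu(mu, sigma, x, lr=1.0):
    """One natural-gradient step on the Gaussian NLL, updating only the mean."""
    # Euclidean gradient of the average negative log-likelihood w.r.t. mu.
    grad_mu = (mu - x.mean()) / sigma**2
    # Fisher information for mu is 1/sigma^2, so I^{-1} grad = sigma^2 * grad,
    # which simplifies to mu - mean(x): the geometry cancels the 1/sigma^2.
    natural_grad = sigma**2 * grad_mu
    return mu - lr * natural_grad

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=0.5, size=1000)
mu_new = natural_gradient_step_mu(mu=-3.0, sigma=0.5, x=x)
# With lr=1 the step jumps straight to the sample mean.
```

A plain gradient step, by contrast, would move by (x.mean() - mu) * lr / sigma^2, so its effective step size depends on the arbitrary scale sigma.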

Fisher Flow

Fisher Flow makes this continuous:

dtheta/dt = -I(theta)^{-1} grad L(theta)

This defines a flow on the parameter manifold. Loss decreases monotonically along the flow. The trajectories follow the natural geometry. Step size adapts automatically because the curvature scales the updates.
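The flow can be integrated numerically with Euler steps. For a 1-D Gaussian the natural gradients reduce to simple expressions (Fisher is diag(1/sigma^2, 2/sigma^2)), and the trajectory settles at the maximum-likelihood parameters. A sketch, assuming this toy model (names illustrative):

```python
import numpy as np

def fisher_flow_gaussian(x, mu0=0.0, sigma0=3.0, dt=0.05, steps=400):
    """Euler-integrate dtheta/dt = -I(theta)^{-1} grad NLL for a 1-D Gaussian."""
    mu, sigma = mu0, sigma0
    for _ in range(steps):
        m2 = np.mean((x - mu) ** 2)
        # Natural gradients: I(theta) = diag(1/sigma^2, 2/sigma^2), so
        # mu_dot = -sigma^2 * dNLL/dmu and sigma_dot = -(sigma^2/2) * dNLL/dsigma.
        mu_dot = -(mu - x.mean())
        sigma_dot = -(sigma**2 - m2) / (2 * sigma)
        mu += dt * mu_dot
        sigma += dt * sigma_dot
    return mu, sigma

rng = np.random.default_rng(1)
x = rng.normal(loc=1.5, scale=2.0, size=5000)
mu, sigma = fisher_flow_gaussian(x)
# The flow settles at the MLE: mu -> mean(x), sigma^2 -> mean((x - mean(x))^2).
```

Note how the mean evolves as mu_dot = -(mu - mean(x)) regardless of sigma: the metric has absorbed the curvature, which is the "step size adapts automatically" point above.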

The Practical Problem

Computing I(theta)^{-1} is expensive. For a neural network with n parameters, the full Fisher information matrix is n x n, and inverting it is O(n^3). That is not practical for modern networks.

Approximations help. Diagonal Fisher approximation is cheap but crude. Block-diagonal Fisher captures within-layer correlations. K-FAC (Kronecker-Factored Approximate Curvature) approximates each layer's Fisher block as a Kronecker product of two much smaller matrices, so only those small factors need to be inverted: the cost is cubic in the layer width rather than in the total parameter count. That makes it practical for real networks.

def fisher_flow_step(params, loss_fn, data, learning_rate=0.1, epsilon=1e-8):
    # Gradient of the loss at the current parameters
    # (compute_gradient is a placeholder for your autodiff backend).
    grad = compute_gradient(loss_fn, params, data)

    # Diagonal Fisher approximation, e.g. the mean squared per-example
    # score (estimate_fisher_diagonal is likewise a placeholder).
    fisher_diag = estimate_fisher_diagonal(params, data)

    # Elementwise division approximates multiplication by I(theta)^{-1};
    # epsilon guards against division by near-zero curvature.
    natural_grad = grad / (fisher_diag + epsilon)

    # Natural gradient update
    return params - learning_rate * natural_grad

Variational Inference

Fisher Flow gives a natural framework for variational inference. You have a variational distribution q(z|phi) and you want to minimize KL[q(z|phi) || p(z|x)]. Following the Fisher geometry of q means the optimization respects the structure of the distribution family you are searching over.
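A toy version makes the idea concrete: take q = N(m, s^2) and, so the KL has a closed form, assume the target posterior is itself a standard normal. Then KL[q || p] = 0.5*(m^2 + s^2 - 1) - log s, the Fisher information of q with respect to (m, s) is diag(1/s^2, 2/s^2), and natural-gradient descent drives q onto the posterior. A sketch under those assumptions (names illustrative):

```python
def natural_vi(m=2.0, s=0.2, lr=0.1, steps=500):
    """Natural-gradient descent on KL[q || p] with q = N(m, s^2), p = N(0, 1)."""
    for _ in range(steps):
        # Euclidean gradients of the closed-form KL.
        grad_m = m
        grad_s = s - 1.0 / s
        # Rescale by the inverse Fisher diag(1/s^2, 2/s^2) of q.
        m -= lr * (s**2) * grad_m
        s -= lr * (s**2 / 2.0) * grad_s
    return m, s

m, s = natural_vi()
# Converges to the posterior: m -> 0, s -> 1.
```

Notice that when s is small the metric automatically shrinks the steps: a narrow q is very sensitive to parameter changes, and the Fisher geometry accounts for that without any manual learning-rate schedule.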

What the Theory Says

Under standard smoothness and regularity assumptions, Fisher Flow converges to stationary points of the loss. The convergence rate depends on the conditioning of the Fisher information matrix along the trajectory. The flow is stable under small perturbations, and the natural gradient acts as a preconditioner that damps steps in directions where the distribution is highly sensitive, a form of implicit regularization.

The important theoretical property is geometric: the optimization respects the underlying probability manifold structure and is invariant to reparametrization. If you switch from mean/variance to natural parameters, the flow does not change. That is a strong property.
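The invariance can be checked numerically in one dimension. For a Gaussian scale parameter, the Fisher information in sigma is 2/sigma^2; under theta = log sigma it transforms by the squared Jacobian to a constant 2, and the natural-gradient step taken in theta, mapped back to sigma through the chain rule, induces the same change as the step taken directly in sigma (exactly, at first order in the step size). A sketch of that bookkeeping (names and values illustrative):

```python
def natural_steps_match(sigma=0.7, dL_dsigma=1.3, lr=0.01):
    """First-order check that natural-gradient steps in sigma and in
    theta = log sigma induce the same change in sigma."""
    # Parametrization 1: sigma directly. Fisher is 2/sigma^2.
    delta_sigma_direct = -lr * (sigma**2 / 2.0) * dL_dsigma
    # Parametrization 2: theta = log sigma. Chain rule: dL/dtheta = sigma * dL/dsigma.
    # The Fisher transforms with the Jacobian squared: I_theta = sigma^2 * (2/sigma^2) = 2.
    dL_dtheta = sigma * dL_dsigma
    delta_theta = -lr * (1.0 / 2.0) * dL_dtheta
    # Map the theta step back to sigma to first order: delta_sigma ~ sigma * delta_theta.
    delta_sigma_via_theta = sigma * delta_theta
    return delta_sigma_direct, delta_sigma_via_theta

a, b = natural_steps_match()
# Both routes give the same induced change in sigma.
```

A plain gradient step fails this check: its theta-space update would be -lr * sigma * dL/dsigma, inducing a change in sigma that picks up an extra factor of sigma^2 relative to the direct update.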

Open Questions

Higher-order geometry (third-order information) could improve convergence near saddle points. Dynamic FIM approximation quality, adjusting the fidelity of your Fisher approximation as training progresses, is worth exploring. And extending Fisher flow to infinite-dimensional (non-parametric) settings is an open theoretical problem.
