
Fisher Flow: Information Geometry in Optimization

Fisher Flow explores how Fisher information and information geometry can be used for optimization problems, particularly in machine learning.

Information Geometry Primer

Information geometry studies the geometry of probability distributions:

  • Manifold Structure: A parametric family of probability distributions forms a Riemannian manifold
  • Fisher Information Metric: Provides the natural Riemannian metric on this manifold
  • Geodesics: Shortest paths on the manifold, which correspond to efficient optimization trajectories

Fisher Information Matrix

For a parametric distribution p(x|θ), the Fisher Information Matrix is:

I(θ) = E[∇_θ log p(x|θ) ∇_θ log p(x|θ)ᵀ], with the expectation taken over x ∼ p(x|θ)

This captures:

  • Sensitivity: How much the distribution changes as the parameters vary
  • Curvature: Second-order information about the likelihood surface
  • Natural Gradient: The steepest-descent direction measured with respect to the distribution itself rather than the raw parameters
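As a quick sanity check of this definition, here is a minimal numpy sketch (the Bernoulli model, sample size, and seed are illustrative assumptions, not from the paper) that estimates the Fisher information by Monte Carlo and compares it with the closed form 1/(p(1−p)):

import numpy as np

# Minimal sketch (illustrative assumptions): Monte Carlo estimate of the
# Fisher information for a Bernoulli(p) model, compared with the closed
# form I(p) = 1 / (p (1 - p)).
rng = np.random.default_rng(0)
p = 0.3
x = rng.binomial(1, p, size=100_000)

# Score function: d/dp log p(x|p) = x/p - (1 - x)/(1 - p)
score = x / p - (1 - x) / (1 - p)

# Fisher information = expected outer product of the score
# (a scalar here, so just the mean squared score).
fisher_mc = np.mean(score ** 2)
fisher_exact = 1.0 / (p * (1 - p))
print(f"Monte Carlo: {fisher_mc:.3f}  exact: {fisher_exact:.3f}")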

Natural Gradient Descent

Standard gradient descent:

  • θₜ₊₁ = θₜ - α ∇L(θₜ)
  • Uses Euclidean geometry
  • Can be inefficient when parameters are poorly scaled or strongly correlated

Natural gradient descent:

  • θₜ₊₁ = θₜ - α I(θₜ)⁻¹ ∇L(θₜ)
  • Uses information geometry
  • Invariant to reparametrization (a side-by-side sketch of the two updates follows below)
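To make the two update rules concrete, here is a minimal sketch (the Gaussian model, synthetic data, and step size are illustrative assumptions) that applies one standard and one natural gradient step to maximum-likelihood fitting of N(μ, σ²), where the FIM is diag(1/σ², 1/(2σ⁴)) in (μ, σ²) coordinates:

import numpy as np

# Sketch (illustrative assumptions): one standard vs. one natural gradient
# step for fitting N(mu, sigma^2) by maximum likelihood. In (mu, sigma^2)
# coordinates the FIM is diag(1/sigma^2, 1/(2 sigma^4)).
rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=3.0, size=1_000)

mu, var = 0.0, 1.0   # current parameters theta = (mu, sigma^2)
alpha = 0.1          # step size

# Gradient of the average negative log-likelihood
grad = np.array([
    (mu - data.mean()) / var,
    0.5 / var - 0.5 * np.mean((data - mu) ** 2) / var ** 2,
])
fim = np.diag([1.0 / var, 1.0 / (2.0 * var ** 2)])

standard_step = -alpha * grad                       # theta_t - alpha * grad L
natural_step = -alpha * np.linalg.solve(fim, grad)  # theta_t - alpha * I(theta)^-1 grad L
print("standard:", standard_step)
print("natural: ", natural_step)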

Fisher Flow Dynamics

The paper proposes a continuous-time view of optimization:

dθ/dt = -I(θ)⁻¹ ∇L(θ)

This defines a flow on the parameter manifold with the following properties:

  • Monotonic Decrease: The loss is non-increasing along the flow (checked numerically in the sketch below)
  • Geodesic Paths: The flow follows the manifold's natural geometry rather than the Euclidean geometry of the raw parameters
  • Adaptive Step Size: Curvature information automatically scales the updates
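A minimal way to see these properties numerically is to discretize the flow with forward Euler. The sketch below reuses the Gaussian maximum-likelihood example; the step size dt and iteration count are assumptions:

import numpy as np

# Sketch (illustrative assumptions): forward-Euler discretization of the
# Fisher flow d(theta)/dt = -I(theta)^-1 grad L(theta) for fitting
# N(mu, sigma^2); the loss should be (numerically) non-increasing.
rng = np.random.default_rng(2)
data = rng.normal(loc=2.0, scale=3.0, size=1_000)

def nll(mu, var):
    return 0.5 * np.log(2 * np.pi * var) + np.mean((data - mu) ** 2) / (2 * var)

mu, var, dt = 0.0, 1.0, 0.05
losses = [nll(mu, var)]
for _ in range(200):
    grad = np.array([
        (mu - data.mean()) / var,
        0.5 / var - 0.5 * np.mean((data - mu) ** 2) / var ** 2,
    ])
    inv_fim = np.diag([var, 2.0 * var ** 2])   # closed-form I(theta)^-1
    mu, var = np.array([mu, var]) - dt * inv_fim @ grad
    losses.append(nll(mu, var))

print("initial loss:", losses[0], "final loss:", losses[-1])
print("max increase between steps:", max(np.diff(losses)))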

Applications to Deep Learning

Neural Network Training

Benefits:

  • Faster Convergence: Accounts for correlations between parameters
  • Better Generalization: The natural geometry can reduce overfitting
  • Stable Training: Adaptive scaling helps prevent instability

Challenges:

  • Computational Cost: Computing I(θ)⁻¹ is expensive for large networks
  • Approximations: Use diagonal or block-diagonal approximations of the FIM
  • Stochastic Setting: Estimate the FIM from mini-batches (see the sketch below)
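As a concrete illustration of the diagonal and mini-batch approximations, here is a minimal sketch for a logistic regression model (the model, batch size, and the choice of sampling labels from the model are illustrative assumptions): squaring and averaging per-example score gradients gives a diagonal FIM estimate.

import numpy as np

# Sketch (illustrative assumptions): diagonal Fisher estimate from one
# mini-batch for logistic regression, p(y=1|x, w) = sigmoid(w . x).
rng = np.random.default_rng(3)
X = rng.normal(size=(64, 5))     # mini-batch of 64 examples
w = rng.normal(size=5)
probs = 1.0 / (1.0 + np.exp(-X @ w))

# Sampling labels from the model gives the (true) Fisher; using the
# observed labels instead would give the "empirical Fisher" variant.
y = rng.binomial(1, probs)
scores = (y - probs)[:, None] * X   # per-example grad_w log p(y|x, w)

fisher_diag = np.mean(scores ** 2, axis=0)
print(fisher_diag)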

K-FAC Algorithm

Kronecker-Factored Approximate Curvature:

  • Approximates FIM with Kronecker products
  • Reduces the curvature computation from O(n³) for the exact natural gradient to roughly O(n)
  • Practical for modern deep networks; a single-layer sketch follows below
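The sketch below illustrates the Kronecker-factored idea for a single fully connected layer (the shapes, damping, and random stand-in activations and gradients are assumptions, not K-FAC's full algorithm): with A built from layer inputs and G from backpropagated output gradients, (A ⊗ G)⁻¹ vec(∇W) = vec(G⁻¹ ∇W A⁻¹) under the standard column-stacking vec convention, so only the two small factors need to be inverted.

import numpy as np

# Sketch (illustrative assumptions): Kronecker-factored curvature for one
# fully connected layer W (d_out x d_in). A = E[a a^T] from layer inputs,
# G = E[g g^T] from gradients w.r.t. the layer outputs; the approximate
# natural gradient is G^-1 (grad W) A^-1, avoiding the full FIM.
rng = np.random.default_rng(4)
batch, d_in, d_out = 128, 20, 10
a = rng.normal(size=(batch, d_in))    # layer inputs (stand-ins)
g = rng.normal(size=(batch, d_out))   # gradients w.r.t. layer outputs (stand-ins)
grad_W = g.T @ a / batch              # gradient of the loss w.r.t. W

damping = 1e-3
A = a.T @ a / batch + damping * np.eye(d_in)
G = g.T @ g / batch + damping * np.eye(d_out)

natural_grad_W = np.linalg.solve(G, grad_W) @ np.linalg.inv(A)
print(natural_grad_W.shape)   # (d_out, d_in)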

Variational Inference

Fisher flow provides a natural framework for variational inference:

  • Variational Distribution: q(z|φ)
  • Objective: Minimize KL[q(z|φ) || p(z|x)]
  • Flow: Follow the Fisher geometry of the variational family

The result is efficient inference with geometric guarantees, as sketched below.
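As one minimal illustration (not the paper's construction): for a Gaussian variational family q(z|φ) = N(μ, σ²) with φ = (μ, log σ), the Fisher metric is diag(1/σ², 2) in closed form, so the natural-gradient update simply rescales the two gradient components. The Gaussian target N(3, 2²), step size, and iteration count below are assumptions.

import numpy as np

# Sketch (illustrative assumptions): natural-gradient steps for a Gaussian
# variational family q(z|phi) = N(mu, sigma^2), phi = (mu, log sigma),
# minimizing KL[q || p] for a Gaussian target p = N(3, 2^2).
m, s = 3.0, 2.0            # target mean and standard deviation
mu, log_sigma = 0.0, 0.0   # variational parameters
alpha = 0.5                # step size

for _ in range(50):
    sigma = np.exp(log_sigma)
    # Closed-form gradients of KL[q || p] w.r.t. (mu, log sigma)
    grad_mu = (mu - m) / s ** 2
    grad_log_sigma = sigma ** 2 / s ** 2 - 1.0
    # The Fisher metric of this family is diag(1/sigma^2, 2), so the
    # natural gradient rescales by diag(sigma^2, 1/2)
    mu -= alpha * sigma ** 2 * grad_mu
    log_sigma -= alpha * 0.5 * grad_log_sigma

print(mu, np.exp(log_sigma))   # approaches the target (3, 2)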

Theoretical Properties

The paper proves:

Convergence:

  • Fisher flow converges to local optima
  • Convergence rate depends on conditioning of FIM

Stability:

  • Flow is stable under perturbations
  • Natural gradient provides implicit regularization

Geometry:

  • Respects underlying probability manifold structure
  • Invariant to reparametrization (a numeric check follows below)
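As a small numeric illustration of the reparametrization-invariance claim (the Bernoulli model, the loss, and the step size are illustrative assumptions): a natural gradient step taken in p-coordinates and one taken in logit coordinates move the distribution by the same amount, up to O(α²).

import numpy as np

# Sketch (illustrative assumptions): first-order check that the natural
# gradient step is invariant to reparametrization, for a Bernoulli model
# written either in p or in theta = logit(p) coordinates.
p, alpha = 0.3, 1e-3
dL_dp = 2.0 * (p - 0.9)   # gradient of an arbitrary loss L(p) = (p - 0.9)^2

# Step in p-coordinates: I(p) = 1 / (p (1 - p))
p_new_direct = p - alpha * p * (1 - p) * dL_dp

# Step in logit coordinates: I(theta) = p (1 - p) and
# dL/dtheta = dL/dp * dp/dtheta = dL/dp * p (1 - p)
theta = np.log(p / (1 - p))
theta_new = theta - alpha * (1.0 / (p * (1 - p))) * (dL_dp * p * (1 - p))
p_new_via_logit = 1.0 / (1.0 + np.exp(-theta_new))

print(p_new_direct, p_new_via_logit)   # agree up to O(alpha^2)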

Practical Implementation

def fisher_flow_step(params, loss_fn, data, learning_rate=0.1, epsilon=1e-8):
    # compute_gradient and estimate_fisher_diagonal are placeholder helpers;
    # one possible implementation is sketched below. The default step size
    # and damping are illustrative.

    # Compute the loss gradient for the current mini-batch
    grad = compute_gradient(loss_fn, params, data)

    # Estimate the Fisher information (diagonal approximation)
    fisher = estimate_fisher_diagonal(params, data)

    # Natural gradient step: rescale by the damped inverse Fisher diagonal
    natural_grad = grad / (fisher + epsilon)

    # Update parameters
    params = params - learning_rate * natural_grad

    return params
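The helpers above are placeholders; one possible way to fill them in, assuming a toy logistic regression model and an empirical (observed-label) Fisher estimate, is sketched below together with a short usage loop.

import numpy as np

# Possible implementations of the placeholder helpers, assuming a toy
# logistic regression model (illustrative; not the paper's implementation).
def compute_gradient(loss_fn, params, data):
    # loss_fn is unused in this toy sketch; the NLL gradient is hard-coded
    X, y = data
    probs = 1.0 / (1.0 + np.exp(-X @ params))
    return X.T @ (probs - y) / len(y)        # gradient of the mean NLL

def estimate_fisher_diagonal(params, data):
    X, y = data
    probs = 1.0 / (1.0 + np.exp(-X @ params))
    scores = (y - probs)[:, None] * X        # per-example score gradients
    return np.mean(scores ** 2, axis=0)      # empirical (observed-label) Fisher

# Usage sketch on synthetic data
rng = np.random.default_rng(5)
true_w = np.array([1.0, -2.0, 0.5, 0.0])
X = rng.normal(size=(256, 4))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_w)))

params = np.zeros(4)
for _ in range(100):
    params = fisher_flow_step(params, loss_fn=None, data=(X, y))
print(params)   # should be close to true_w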

Future Directions

  • Higher-Order Methods: Incorporating third-order geometry
  • Adaptive Approximations: Dynamic FIM approximation quality
  • Distributed Optimization: Fisher flow in federated learning
  • Non-Parametric: Extending to infinite-dimensional spaces

Discussion