Fisher Flow explores using Fisher information and information geometry for optimization problems, particularly in machine learning.
Information Geometry Primer
Information geometry studies the geometry of probability distributions:
- Manifold Structure: Probability distributions form a Riemannian manifold
- Fisher Information Metric: Provides the natural Riemannian metric on this manifold
- Geodesics: Shortest paths on the manifold correspond to efficient optimization trajectories
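Concretely, the Fisher metric arises from the second-order expansion of the KL divergence between nearby distributions:
KL[p(·|θ) ‖ p(·|θ+dθ)] ≈ ½ dθᵀ I(θ) dθ
so distances measured by I(θ) quantify how distinguishable two nearby parameter settings are.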
Fisher Information Matrix
For a parametric distribution p(x|θ), the Fisher Information Matrix (FIM) is:
I(θ) = E_{x∼p(x|θ)}[∇_θ log p(x|θ) ∇_θ log p(x|θ)ᵀ]
This captures:
- Sensitivity: How much distribution changes with parameters
- Curvature: Second-order information about likelihood surface
- Natural Gradient: Preconditioning by I(θ)⁻¹ yields the steepest-descent direction measured in distribution space rather than raw parameter space (a Monte Carlo estimate of I(θ) is sketched below)
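As a quick check of this definition, the sketch below (an illustrative example, not from the paper; function names are hypothetical) estimates I(θ) by Monte Carlo averaging of score outer products for a univariate Gaussian, where the analytic answer is known.
import numpy as np
def empirical_fisher(score_fn, samples, theta):
    # Monte Carlo estimate of I(theta) = E[s s^T], where s = grad_theta log p(x|theta)
    scores = np.stack([score_fn(x, theta) for x in samples])
    return scores.T @ scores / len(samples)
def gaussian_score(x, theta):
    # Score of N(mu, sigma^2) with respect to (mu, sigma)
    mu, sigma = theta
    return np.array([(x - mu) / sigma**2,
                     (x - mu)**2 / sigma**3 - 1.0 / sigma])
rng = np.random.default_rng(0)
theta = (0.0, 2.0)
samples = rng.normal(theta[0], theta[1], size=50_000)
print(empirical_fisher(gaussian_score, samples, theta))
# Close to the analytic FIM diag(1/sigma^2, 2/sigma^2) = diag(0.25, 0.5)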
Natural Gradient Descent
Standard gradient descent:
- θₜ₊₁ = θₜ - α ∇L(θₜ)
- Uses Euclidean geometry on the raw parameters
- Can be inefficient: steps of equal Euclidean length may change the underlying distribution by very different amounts
Natural gradient descent:
- θₜ₊₁ = θₜ - α I(θₜ)⁻¹ ∇L(θₜ)
- Uses the information geometry of the distribution
- Invariant to parametrization (exactly so in the continuous-time limit; see the sketch below)
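To make the invariance point concrete, here is a minimal sketch (a toy problem assumed for illustration, not an experiment from the paper): maximum-likelihood fitting of a Bernoulli parameter with natural gradient steps, run once in the probability parametrization and once in the logit parametrization. Mapped back to probabilities, the two trajectories nearly coincide, whereas plain gradient descent would behave very differently in the two coordinate systems.
import numpy as np
x_bar, alpha, steps = 0.9, 0.05, 40     # observed frequency, step size, iterations
sigmoid = lambda e: 1.0 / (1.0 + np.exp(-e))
# Negative log-likelihood gradient and Fisher information in the probability parametrization p
grad_p   = lambda p: (p - x_bar) / (p * (1 - p))
fisher_p = lambda p: 1.0 / (p * (1 - p))
# ... and in the logit parametrization eta = log(p / (1 - p))
grad_eta   = lambda e: sigmoid(e) - x_bar
fisher_eta = lambda e: sigmoid(e) * (1 - sigmoid(e))
p, eta = 0.5, 0.0
for _ in range(steps):
    p   -= alpha * grad_p(p) / fisher_p(p)          # natural step in p
    eta -= alpha * grad_eta(eta) / fisher_eta(eta)  # natural step in logit
print(p, sigmoid(eta))   # the two final values are nearly identical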
Fisher Flow Dynamics
The paper proposes continuous-time optimization:
dθ/dt = -I(θ)⁻¹ ∇L(θ)
This defines a flow on the parameter manifold with the following properties (a discretized toy example follows this list):
- Monotonic Decrease: Loss decreases along the flow
- Geodesic Paths: The flow follows the natural geometry of the probability manifold rather than Euclidean directions
- Adaptive Step Size: Curvature automatically scales updates
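As a minimal illustration (an assumed toy setup, not the paper's experiments), the flow can be discretized with explicit Euler steps for maximum-likelihood fitting of a univariate Gaussian, whose FIM is available in closed form:
import numpy as np
rng = np.random.default_rng(0)
data = rng.normal(3.0, 1.5, size=10_000)   # synthetic observations
def nll_grad(mu, sigma):
    # Gradient of the average negative log-likelihood of N(mu, sigma^2)
    d_mu = (mu - data.mean()) / sigma**2
    d_sigma = 1.0 / sigma - ((data - mu) ** 2).mean() / sigma**3
    return np.array([d_mu, d_sigma])
def fisher(mu, sigma):
    # Analytic Fisher information of N(mu, sigma^2) in (mu, sigma) coordinates
    return np.diag([1.0 / sigma**2, 2.0 / sigma**2])
theta, dt = np.array([0.0, 0.5]), 0.05     # initial (mu, sigma) and Euler step size
for _ in range(200):
    theta = theta - dt * np.linalg.solve(fisher(*theta), nll_grad(*theta))
print(theta)   # approaches the maximum-likelihood estimate, roughly (3.0, 1.5)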
Applications to Deep Learning
Neural Network Training
Benefits:
- Faster Convergence: Accounts for parameter correlations
- Better Generalization: The natural geometry acts as an implicit regularizer, which can reduce overfitting
- Stable Training: Adaptive scaling prevents instability
Challenges:
- Computational Cost: Computing I(θ)⁻¹ is expensive for large networks
- Approximations: Use diagonal or block-diagonal FIM
- Stochastic Setting: Estimate FIM from mini-batches
K-FAC Algorithm
Kronecker-Factored Approximate Curvature:
- Approximates each layer's FIM block as a Kronecker product of two small factor matrices
- Per-update cost is roughly linear in the number of parameters, versus O(n³) for inverting the exact FIM
- Practical for modern deep networks (a per-layer sketch follows below)
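The sketch below illustrates the per-layer Kronecker-factored preconditioning for a single fully connected layer (a simplified illustration of the idea; the function name is hypothetical, and the real algorithm's damping schedules, factor averaging, and inversion scheduling are omitted).
import numpy as np
def kfac_natural_gradient(grad_W, acts, grads_out, damping=1e-3):
    # grad_W: (d_out, d_in) weight gradient; acts: (batch, d_in) layer inputs;
    # grads_out: (batch, d_out) back-propagated pre-activation gradients.
    A = acts.T @ acts / acts.shape[0]                  # (d_in, d_in) Kronecker factor
    S = grads_out.T @ grads_out / grads_out.shape[0]   # (d_out, d_out) Kronecker factor
    # Invert the small damped factors instead of the full (d_out*d_in)-dimensional FIM block
    A_inv = np.linalg.inv(A + damping * np.eye(A.shape[0]))
    S_inv = np.linalg.inv(S + damping * np.eye(S.shape[0]))
    # Kronecker identity: (A kron S)^{-1} vec(G) = vec(S^{-1} G A^{-1})
    return S_inv @ grad_W @ A_inv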
Variational Inference
Fisher flow provides a natural framework:
- Variational Distribution: q(z|φ)
- Objective: Minimize KL[q(z|φ) || p(z|x)] over φ
- Flow: Follow the Fisher information geometry of the variational family
This yields efficient inference with geometric guarantees.
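A minimal sketch of natural-gradient variational inference under a strong simplifying assumption (for illustration only, not from the paper): the target posterior is itself a known Gaussian N(m, s²), so KL[q || p] and its gradients are closed-form for q(z|φ) = N(μ, σ²).
import numpy as np
m, s = 2.0, 0.7   # assumed known Gaussian posterior, for illustration
def kl_grad(mu, sigma):
    # Gradients of KL[N(mu, sigma^2) || N(m, s^2)] with respect to (mu, sigma)
    return np.array([(mu - m) / s**2,
                     sigma / s**2 - 1.0 / sigma])
def fisher_q(mu, sigma):
    # Fisher information of the variational Gaussian in (mu, sigma)
    return np.diag([1.0 / sigma**2, 2.0 / sigma**2])
phi, alpha = np.array([0.0, 1.5]), 0.1    # variational parameters (mu, sigma)
for _ in range(150):
    phi = phi - alpha * np.linalg.solve(fisher_q(*phi), kl_grad(*phi))
print(phi)   # approaches the true posterior parameters (2.0, 0.7)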
Theoretical Properties
The paper proves:
Convergence:
- Fisher flow converges to local optima of the loss
- The convergence rate depends on the conditioning of the FIM
Stability:
- The flow is stable under perturbations
- The natural gradient provides implicit regularization
Geometry:
- Respects the structure of the underlying probability manifold
- Invariant to reparametrization
Practical Implementation
def fisher_flow_step(params, loss_fn, data, learning_rate=1e-2, epsilon=1e-8):
    # Gradient of the loss at the current parameters
    grad = compute_gradient(loss_fn, params, data)
    # Diagonal approximation of the Fisher information, estimated on the mini-batch
    fisher = estimate_fisher_diagonal(params, data)
    # Natural gradient: precondition the gradient by the damped inverse Fisher
    natural_grad = grad / (fisher + epsilon)
    # Discrete Euler step along the Fisher flow
    return params - learning_rate * natural_grad
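The helpers above are left abstract; one hypothetical way to fill them in, sketched here for a logistic-regression model with data = (X, y), uses the closed-form gradient and the diagonal empirical Fisher (these names, signatures, and the usage below are illustrative assumptions, not part of the paper).
import numpy as np
def compute_gradient(loss_fn, params, data):
    # Closed-form gradient of the mean logistic loss (loss_fn is unused in this sketch)
    X, y = data
    p = 1.0 / (1.0 + np.exp(-X @ params))
    return X.T @ (p - y) / len(y)
def estimate_fisher_diagonal(params, data):
    # Diagonal empirical Fisher: mean of squared per-example score entries
    X, y = data
    p = 1.0 / (1.0 + np.exp(-X @ params))
    per_example = X * (p - y)[:, None]
    return (per_example ** 2).mean(axis=0)
# Example usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(512, 5))
w_true = rng.normal(size=5)
y = (rng.uniform(size=512) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)
w = np.zeros(5)
for _ in range(200):
    w = fisher_flow_step(w, None, (X, y), learning_rate=0.1)
print(w)   # moves toward the maximum-likelihood estimate, roughly aligned with w_true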
Future Directions
- Higher-Order Methods: Incorporating third-order geometric information
- Adaptive Approximations: Dynamic FIM approximation quality
- Distributed Optimization: Fisher flow in federated learning
- Non-Parametric: Extending to infinite-dimensional spaces
Discussion