Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer by Shazeer, Mirhoseini, Maziarz, Davis, Le, Hinton, Dean (2017) [paper] — Mixture of Experts with a learned, noisy top-k gating network. Conditional computation at scale: capacity grows with the number of experts while per-token compute stays roughly constant.
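A minimal NumPy sketch of the core routing idea, top-k gating over experts; the paper's tunable noise term and load-balancing loss are omitted, and all names and shapes here are illustrative, not the paper's implementation:

```python
import numpy as np

def top_k_gating(x, w_gate, k=2):
    """Softmax over only the k largest gate logits per token (rest -> 0)."""
    logits = x @ w_gate                              # (batch, n_experts)
    kth = np.sort(logits, axis=-1)[:, -k][:, None]   # k-th largest per row
    masked = np.where(logits >= kth, logits, -np.inf)
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)         # <= k nonzeros per row

def moe_layer(x, w_gate, experts, k=2):
    """Run each expert only on the tokens routed to it (conditional compute).
    Assumes each expert maps inputs back to the same dimensionality."""
    gates = top_k_gating(x, w_gate, k)               # sparse gate weights
    out = np.zeros_like(x)
    for i, expert in enumerate(experts):
        routed = gates[:, i] > 0
        if routed.any():                             # skip idle experts
            out[routed] += gates[routed, i:i + 1] * expert(x[routed])
    return out
```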
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness by Dao, Fu, Ermon, Rudra, Ré (2022) [paper] — IO-aware exact attention that is both faster and more memory-efficient than standard implementations, by tiling the computation to fit in GPU SRAM. Essential infrastructure.
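Below is a NumPy sketch of the online-softmax tiling that lets exact attention run block-by-block without ever materializing the full n×n score matrix; the real FlashAttention fuses these steps into an SRAM-resident GPU kernel, so this illustrates the math, not the kernel:

```python
import numpy as np

def flash_attention(Q, K, V, block=64):
    """Exact softmax attention computed one key/value tile at a time,
    carrying a running row-max m and normalizer l (online softmax)."""
    scale = 1.0 / np.sqrt(Q.shape[-1])
    O = np.zeros((Q.shape[0], V.shape[-1]))
    m = np.full(Q.shape[0], -np.inf)   # running max score per query
    l = np.zeros(Q.shape[0])           # running softmax denominator
    for j in range(0, K.shape[0], block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = (Q @ Kj.T) * scale                 # scores for this tile only
        m_new = np.maximum(m, S.max(axis=-1))
        P = np.exp(S - m_new[:, None])         # tile softmax numerator
        alpha = np.exp(m - m_new)              # rescale old accumulators
        l = l * alpha + P.sum(axis=-1)
        O = O * alpha[:, None] + P @ Vj
        m = m_new
    return O / l[:, None]                      # == softmax(QK^T/sqrt(d)) V
```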
Generating Long Sequences with Sparse Transformers by Child, Gray, Radford, Sutskever (2019) [paper] — Factorized sparse attention patterns for long-range dependencies, reducing attention cost from O(n²) to O(n√n).
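A sketch of the paper's strided pattern as a boolean mask, assuming a decoder-style causal setting; with stride ≈ √n each query attends to O(√n) keys, which is where the O(n√n) total comes from:

```python
import numpy as np

def strided_sparse_mask(n, stride):
    """True where query i may attend to key j: causal, and either within
    the last `stride` positions (local) or at a multiple-of-stride offset."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    causal = j <= i
    local = (i - j) < stride              # recent context
    strided = (i - j) % stride == 0       # periodic long-range links
    return causal & (local | strided)

# With stride ~ sqrt(n), each row has O(sqrt(n)) True entries,
# so masked attention costs O(n * sqrt(n)) instead of O(n^2).
mask = strided_sparse_mask(n=1024, stride=32)
```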