Attention Is All You Need by Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin (2017) [paper] — Introduced the Transformer architecture. The paper that started everything.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Devlin, Chang, Lee, Toutanova (2018) [paper] — Bidirectional pre-training via masked language modeling. Defined the pre-train/fine-tune paradigm.
Language Models are Unsupervised Multitask Learners (GPT-2) by Radford, Wu, Child, Luan, Amodei, Sutskever (2019) [paper] — Showed large LMs can perform tasks zero-shot. Introduced the scaling intuition.
Language Models are Few-Shot Learners (GPT-3) by Brown, Mann, Ryder, Subbiah, Kaplan, Dhariwal, Neelakantan, et al. (2020) [paper] — 175B parameters. In-context learning emerges at scale. Changed the field.
LLaMA: Open and Efficient Foundation Language Models by Touvron, Lavril, Izacard, Martinet, Lachaux, Lacroix, et al. (2023) [paper] — Open-weight models competitive with GPT-3. Catalyzed the open-source LLM ecosystem.
The Unreasonable Effectiveness of Recurrent Neural Networks by Andrej Karpathy (2015) [blog] — Seminal blog post demonstrating the surprising power of character-level RNNs, which learn to generate Shakespeare, LaTeX, and Linux kernel code.
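The core mechanism from the first entry, scaled dot-product attention, is compact enough to sketch directly. The snippet below is a minimal NumPy illustration of the paper's formula Attention(Q, K, V) = softmax(QKᵀ/√d_k)V; the function name and toy shapes are my own, not from the paper's code.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal sketch of the Transformer's scaled dot-product attention:
    softmax(Q K^T / sqrt(d_k)) V, computed over the last axis."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # query-key similarity, scaled by sqrt(d_k)
    # Row-wise softmax (max-subtraction for numerical stability)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V  # attention-weighted mixture of the values

# Toy example: 2 queries attending over 3 key/value pairs, d_k = 4
rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # prints (2, 4)
```

The √d_k scaling keeps the dot products from growing with key dimension, which would otherwise push the softmax into regions with vanishing gradients; multi-head attention in the paper runs several such maps in parallel over learned projections.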