Awesome AI Papers
Every landmark paper in AI history deserves a biography — its context, failures, legacy, and modern perspective.
Era 5 · Large Model Era (2023-present) 29 notes
- 2025 · Claude 3.5/3.7 Sonnet - Turning Frontier Models into Controllable Engineering Collaborators
- 2025 · DeepSeek-R1 — How Pure Reinforcement Learning Taught an Open LLM to Reason
- 2025 · Qwen2.5 / Qwen3 - How Alibaba Turned Open LLMs into a Full-Stack Model Family
- 2024 · DeepSeek-V2 / V3 - How MLA and MoE Pushed Open Models to the Frontier
- 2024 · Gemini 1.5 - Multimodal Understanding Across Million-Token Contexts
- 2024 · Genie: Generative Interactive Environments
- 2024 · Llama 3 Herd - An Engineering Blueprint for Open Frontier Models
- 2024 · Mamba-2 - When Transformers and SSMs Meet in the Same Algebra
- 2024 · OpenAI o1 - Reinforcement Learning for Deep LLM Reasoning
- 2024 · Sora Technical Report - Video Generation Models as World Simulators
- 2024 · Stable Diffusion 3 / Rectified Flow — Moving Text-to-Image from U-Net Diffusion to Scalable MMDiT
- 2023 · 3DGS — Bringing NeRF-Quality Radiance Fields into Real-Time Interaction
- 2023 · AudioLM - Turning Raw Audio into a Language Modeling Problem
- 2023 · ControlNet — Plugging Spatial Control into Frozen Diffusion via Zero-Convolutions
- 2023 · DINOv2 - Robust Visual Features without Supervision
- 2023 · DPO — Aligning LLMs Directly from Preferences without a Reward Model or PPO
- 2023 · GPT-4 Technical Report - Capability Leap and the Black-Box Technical Report
- 2023 · LLaMA — How Fewer Parameters + More Tokens Helped Open-Source LLMs Catch Up to GPT-3
- 2023 · LLaVA - Turning GPT-4-Generated Visual Instructions into an Open Multimodal Assistant
- 2023 · Llama 2: Open Foundation and Fine-Tuned Chat Models
- 2023 · Mamba — How Selective State Spaces Became the First Credible Transformer Challenger in a Decade
- 2023 · Mixtral 8x7B — Open-Weight LLMs Enter the Sparse Expert Era
- 2023 · QLoRA — Bringing 65B LLM Fine-Tuning onto a Single 48GB GPU
- 2023 · RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
- 2023 · RWKV - Reinventing RNNs for the Transformer Era
- 2023 · SAM — One Prompt + 11M Images + 1B Masks: Turning Segmentation into a Foundation Model Problem
- 2023 · Toolformer - Letting Language Models Teach Themselves When to Use Tools
- 2023 · Tree of Thoughts — Deliberate Search as a Reasoning Interface for LLMs
- 2023 · vLLM / PagedAttention — Rescuing LLM Serving from KV-Cache Fragmentation
Era 4 · Foundation Models (2020-2022) 34 notes
- 2022 · Chinchilla — Proving Every LLM to Date Was 'Undertrained' via Compute-Optimal Allocation
- 2022 · Classifier-Free Diffusion Guidance — One Line of Code Removes the Bolt-On Classifier and Unifies Modern Text-to-Image Generation
- 2022 · CoT — Unlocking LLM Reasoning with 'Let's Think Step by Step'
- 2022 · Constitutional AI — Replacing Tens of Thousands of Human Harm Labels With a Constitution and AI Feedback
- 2022 · ConvNeXt — Porting Every Swin Trick Back into a Pure ConvNet and Finding the CNN Decade Was Underrated
- 2022 · DALL-E 2 - Rewriting Text-to-Image as CLIP-Latent Imagination plus Diffusion Rendering
- 2022 · DreamBooth — Implanting Any Subject Into a Text-to-Image Model With 3-5 Photos
- 2022 · DreamFusion — Text-to-3D by Distilling a Frozen 2D Diffusion into a NeRF
- 2022 · Flamingo: a Visual Language Model for Few-Shot Learning
- 2022 · Imagen — Cascaded Text-to-Image Diffusion with Deep Language Understanding
- 2022 · InstructGPT — Turning GPT-3 from a Continuator into an Obedient Assistant via RLHF
- 2022 · MAE — Teaching ViT Self-Supervised Pretraining via 75% Masking
- 2022 · PaLM — Scaling Dense Language Models to 540B with Pathways
- 2022 · ReAct: Synergizing Reasoning and Acting in Language Models
- 2022 · Stable Diffusion — Moving Diffusion into Latent Space so Consumer GPUs Can Generate Images
- 2022 · Whisper - Turning 680k Hours of Weak Supervision into a General Speech Interface
- 2021 · AlphaFold2 — Driving Protein Structure Prediction to Atomic Accuracy via Attention + Evoformer
- 2021 · CLIP — Teaching Vision Models to Understand Language via 400M Image-Text Pairs
- 2021 · Codex — Evaluating Large Language Models Trained on Code
- 2021 · DALL-E — Recasting Text-to-Image Generation as Language Modeling
- 2021 · LoRA — Slashing Large-Model Fine-tuning Cost by 99% via Low-Rank Matrices
- 2021 · Swin Transformer - Turning ViT into a General-Purpose Vision Backbone with Shifted Windows
- 2020 · DDPM — Crowning Diffusion as the King of Image Generation via Thousand-Step Denoising
- 2020 · DETR — Recasting Object Detection as Transformer Set Prediction
- 2020 · GPT-3 — When 175B Parameters Made Prompting the New Programming Paradigm
- 2020 · MoCo: Queues, Momentum Encoders, and the Moment Vision Self-Supervision Became Transferable
- 2020 · MuZero — Planning in Unknown Worlds with a Learned Model
- 2020 · NeRF — Encoding a Scene into a Differentiable Radiance Field with One MLP
- 2020 · Reformer — LSH and Reversible Layers for Million-Token Transformers
- 2020 · Scaling Laws for Neural Language Models
- 2020 · Score SDE — Unifying Score-Based and Diffusion Models through Stochastic Differential Equations
- 2020 · SimCLR — A Plain Contrastive Loss That Crowned Self-Supervised Vision on ImageNet Linear Eval
- 2020 · ViT — Dethroning Convolution in Vision with a Pure Transformer
- 2020 · wav2vec 2.0 - Speech Recognition After 53k Hours of Listening and 10 Minutes of Labels
Era 3 · Attention Era (2017-2019) 24 notes
- 2019 · EfficientNet — Redefining CNN Efficiency via Compound Scaling
- 2019 · GPT-2 — Announcing the LLM Era with Scale and Zero-shot
- 2019 · RoBERTa — The Engineering Audit That Re-trained BERT Properly
- 2019 · Sentence-BERT — Turning BERT into a Sentence Embedding Engine
- 2019 · T5 — Unifying All NLP Tasks as Text-to-Text
- 2018 · BERT — Ushering NLP into the Pretraining Era via Masked Language Modeling
- 2018 · ELMo — Bringing Contextual Embeddings Mainstream via Bidirectional LSTM Language Models
- 2018 · GPT-1 — Igniting the Pre-training Revolution with a Decoder-only Transformer
- 2018 · Graph Attention Networks (GAT) — Attention as a Learnable Graph Edge
- 2018 · Group Normalization - Freeing Normalization from Batch Size
- 2018 · PGD Adversarial Training — Robustness as Min-Max Optimization
- 2018 · SE-Net — The Channel Attention That Won ILSVRC 2017
- 2018 · StyleGAN — Pushing GAN to Photorealistic Face Generation via Style Modulation
- 2018 · ULMFiT — Making Language Model Fine-tuning Work
- 2017 · AlphaZero — Erasing Human Go Knowledge from RL via Pure Self-Play
- 2017 · Capsule Networks — Routing Parts into Wholes
- 2017 · CycleGAN — Unlocking Unpaired Image Translation via Cycle Consistency Loss
- 2017 · GCN — Establishing Semi-supervised Node Classification as the Canonical Graph Neural Network Task
- 2017 · Mask R-CNN — Unifying Instance Segmentation by Adding One Branch to Faster R-CNN
- 2017 · MobileNet — Bringing Deep Learning to Mobile Devices via Depthwise Separable Convolutions
- 2017 · PPO — How Clipping Finally Made Policy Gradient Tunable and Usable
- 2017 · PointNet — Permutation-Invariant Deep Networks for Unordered Point Clouds
- 2017 · Transformer — Burying Recurrence with Attention
- 2017 · WGAN — Curing GAN Training Instability with Wasserstein Distance
Era 2 · Deep Renaissance (2012-2016) 31 notes
- 2016 · A3C - Asynchronous Actors as the Stabilizer for Deep Reinforcement Learning
- 2016 · AlphaGo — Defeating the Human Go World Champion with MCTS + Deep Networks
- 2016 · DenseNet - Feature Reuse as a Network Architecture
- 2016 · LayerNorm: Normalization Without a Batch
- 2016 · WaveNet - The Neural Starting Point for Raw-Waveform Generation
- 2016 · YOLO — Turning Object Detection into a Single Real-Time Regression
- 2015 · BatchNorm — Turning Training Stability into a Layer
- 2015 · FCN - Turning Classification Networks into Pixel-Level Segmenters
- 2015 · Faster R-CNN — Learning Region Proposals Inside the Detector
- 2015 · He Init - The Starting Point That Kept ReLU Networks Alive
- 2015 · Inception / GoogLeNet — Making CNNs Deeper by Making Them Wider
- 2015 · Knowledge Distillation — Pouring a Large Model's Dark Knowledge into a Small One
- 2015 · Nature DQN - The Atari Moment That Brought Deep Reinforcement Learning into the Spotlight
- 2015 · ResNet — How Deep Residual Learning Unlocked the 152-Layer Door
- 2015 · Spatial Transformer Networks — Letting CNNs Learn to Crop, Align, and Warp
- 2015 · U-Net — Turning Encoder-Decoders and Skip Connections into the Default Grammar of Medical Segmentation
- 2014 · Adam — Adaptive Moments for Stochastic Optimization
- 2014 · Adversarial Examples — Linearity, FGSM, and the Beginning of Modern Robustness
- 2014 · Bahdanau Attention — Teaching Neural MT Where to Look
- 2014 · GAN — Adversarial Games that Taught Neural Networks to Forge
- 2014 · GloVe - The Global Co-occurrence Bridge for Word Vectors
- 2014 · Network In Network — Putting a Tiny MLP Inside Every Convolution
- 2014 · R-CNN — The ImageNet Feature Hierarchy That Rebooted Detection
- 2014 · Seq2Seq - Compress Any Sequence into a Vector, Then Decode It Back
- 2014 · VGG — Pushing CNNs to 19 Layers with 3×3 Convolutions
- 2013 · DQN — The First Deep RL Agent to Learn Atari from Pixels
- 2013 · VAE — Turning Generative Modeling into a Tractable Variational Bound
- 2013 · Word2Vec - The Industrial Shortcut that Put Meaning into Vectors
- 2013 · ZFNet — Visualizing the Black Box That AlexNet Opened
- 2012 · AlexNet — Halving ImageNet Top-5 Error with GPU + ReLU + Dropout
- 2012 · Dropout — Randomly Turning Neurons Off to Stop Feature Co-adaptation
Era 1 · Foundations (1957-2011) 16 notes
- 2011 · ReLU — How max(0, x) Turned Deep Networks from "Lab Toy" to "Industrial Cornerstone"
- 2010 · Glorot Init — Making Deep Networks Pass Signals Before They Learn
- 2010 · RNN-LM — Moving Language Modeling from Fixed Windows to Recurrent State
- 2010 · Stacked Denoising Autoencoders — Turning Local Denoising into Deep Representation Pretraining
- 2009 · ImageNet — How 15M Images Turned a 'Dataset' into the Fuse of the Deep Learning Revolution
- 2008 · t-SNE — The Visual Language of High-Dimensional Data
- 2006 · Autoencoder — RBM Pretraining Wakes Neural Networks From Cold Storage
- 2006 · DBN — How Layer-wise Greedy Pretraining Made Deep Networks Trainable for the First Time
- 2003 · LDA — Promoting pLSA to a Generalizable Fully-Bayesian Topic Model with a Dirichlet Prior
- 2001 · Random Forests — Bagging + Feature Sampling that Put Decision Trees on the ML Throne
- 1998 · LeNet — Stitching Convolution, Pooling and Backprop into the First Industrial-Grade Deep Network
- 1997 · LSTM — How Gating Made Recurrent Networks Remember Long Dependencies for the First Time
- 1992 · SVM — How Max-Margin and the Kernel Trick Dominated Machine Learning for Two Decades
- 1989 · Universal Approximation — The Existence Theorem That Certified Neural Networks' Expressive Power
- 1986 · Backprop — Pulling Multi-layer Networks from 'Untrainable' into the Optimizable World via the Chain Rule
- 1958 · Perceptron — How the First Hardware Neuron That Learns from Data Sparked AI as a Discipline