Awesome AI Papers
Every landmark paper in AI history deserves a biography — its context, failures, legacy, and modern perspective.
Era 5 · Large Model Era (2023-present) 29 notes
- 2025 · Claude 3.5/3.7 Sonnet - Turning Frontier Models into Controllable Engineering Collaborators
- 2025 · DeepSeek-R1 — How Pure Reinforcement Learning Taught an Open LLM to Reason
- 2025 · Qwen2.5 / Qwen3 - How Alibaba Turned Open LLMs into a Full-Stack Model Family
- 2024 · DeepSeek-V2 / V3 - How MLA and MoE Pushed Open Models to the Frontier
- 2024 · Gemini 1.5 - Multimodal Understanding Across Million-Token Contexts
- 2024 · Genie: Generative Interactive Environments
- 2024 · Llama 3 Herd - An Engineering Blueprint for Open Frontier Models
- 2024 · Mamba-2 - When Transformers and SSMs Meet in the Same Algebra
- 2024 · OpenAI o1 - Reinforcement Learning for Deep LLM Reasoning
- 2024 · Sora Technical Report - Video Generation Models as World Simulators
- 2024 · Stable Diffusion 3 / Rectified Flow — Moving Text-to-Image from U-Net Diffusion to Scalable MMDiT
- 2023 · 3DGS — Bringing NeRF-Quality Radiance Fields into Real-Time Interaction
- 2023 · AudioLM - Turning Raw Audio into a Language Modeling Problem
- 2023 · ControlNet — Plugging Spatial Control into Frozen Diffusion via Zero-Convolutions
- 2023 · DINOv2 - Robust Visual Features without Supervision
- 2023 · DPO — Aligning LLMs Directly from Preferences without a Reward Model or PPO
- 2023 · GPT-4 Technical Report - Capability Leap and the Black-Box Technical Report
- 2023 · LLaMA — How Fewer Parameters + More Tokens Helped Open-Source LLMs Catch Up to GPT-3
- 2023 · LLaVA - Turning GPT-4-Generated Visual Instructions into an Open Multimodal Assistant
- 2023 · Llama 2: Open Foundation and Fine-Tuned Chat Models
- 2023 · Mamba — How Selective State Spaces Became the First Credible Transformer Challenger in a Decade
- 2023 · Mixtral 8x7B — Open-Weight LLMs Enter the Sparse Expert Era
- 2023 · QLoRA — Bringing 65B LLM Fine-Tuning onto a Single 48GB GPU
- 2023 · RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
- 2023 · RWKV - Reinventing RNNs for the Transformer Era
- 2023 · SAM — One Prompt + 11M Images + 1B Masks: Turning Segmentation into a Foundation Model Problem
- 2023 · Toolformer - Letting Language Models Teach Themselves When to Use Tools
- 2023 · Tree of Thoughts — Deliberate Search as a Reasoning Interface for LLMs
- 2023 · vLLM / PagedAttention — Rescuing LLM Serving from KV-Cache Fragmentation
Era 4 · Foundation Models (2020-2022) 34 notes
- 2022 · Chinchilla — Proving Every LLM to Date Was 'Undertrained' via Compute-Optimal Allocation
- 2022 · Classifier-Free Diffusion Guidance — One Line of Code Removes the Bolt-On Classifier and Unifies Modern Text-to-Image Generation
- 2022 · CoT — Unlocking LLM Reasoning with 'Let's Think Step by Step'
- 2022 · Constitutional AI — Replacing Tens of Thousands of Human Harm Labels With a Constitution and AI Feedback
- 2022 · ConvNeXt — Porting Every Swin Trick Back into a Pure ConvNet and Finding the CNN Decade Was Underrated
- 2022 · DALL-E 2 - Rewriting Text-to-Image as CLIP-Latent Imagination plus Diffusion Rendering
- 2022 · DreamBooth — Implanting Any Subject Into a Text-to-Image Model With 3-5 Photos
- 2022 · DreamFusion — Text-to-3D by Distilling a Frozen 2D Diffusion into a NeRF
- 2022 · Flamingo: a Visual Language Model for Few-Shot Learning
- 2022 · Imagen — Cascaded Text-to-Image Diffusion with Deep Language Understanding
- 2022 · InstructGPT — Turning GPT-3 from a Continuator into an Obedient Assistant via RLHF
- 2022 · MAE — Teaching ViT Self-Supervised Pretraining via 75% Masking
- 2022 · PaLM — Scaling Dense Language Models to 540B with Pathways
- 2022 · ReAct: Synergizing Reasoning and Acting in Language Models
- 2022 · Stable Diffusion — Moving Diffusion into Latent Space so Consumer GPUs Can Generate Images
- 2022 · Whisper - Turning 680k Hours of Weak Supervision into a General Speech Interface
- 2021 · AlphaFold2 — Driving Protein Structure Prediction to Atomic Accuracy via Attention + Evoformer
- 2021 · CLIP — Teaching Vision Models to Understand Language via 400M Image-Text Pairs
- 2021 · Codex — Evaluating Large Language Models Trained on Code
- 2021 · DALL-E — Recasting Text-to-Image Generation as Language Modeling
- 2021 · LoRA — Slashing Large-Model Fine-tuning Cost by 99% via Low-Rank Matrices
- 2021 · Swin Transformer - Turning ViT into a General-Purpose Vision Backbone with Shifted Windows
- 2020 · DDPM — Crowning Diffusion as the King of Image Generation via Thousand-Step Denoising
- 2020 · DETR — Recasting Object Detection as Transformer Set Prediction
- 2020 · GPT-3 — When 175B Parameters Made Prompting the New Programming Paradigm
- 2020 · MoCo: Queues, Momentum Encoders, and the Moment Vision Self-Supervision Became Transferable
- 2020 · MuZero — Planning in Unknown Worlds with a Learned Model
- 2020 · NeRF — Encoding a Scene into a Differentiable Radiance Field with One MLP
- 2020 · Reformer — LSH and Reversible Layers for Million-Token Transformers
- 2020 · Scaling Laws for Neural Language Models
- 2020 · Score SDE — Unifying Score-Based and Diffusion Models through Stochastic Differential Equations
- 2020 · SimCLR — A Plain Contrastive Loss That Crowned Self-Supervised Vision on ImageNet Linear Eval
- 2020 · ViT — Dethroning Convolution in Vision with a Pure Transformer
- 2020 · wav2vec 2.0 - Speech Recognition After 53k Hours of Listening and 10 Minutes of Labels
Era 3 · Attention Era (2017-2019) 24 notes
- 2019 · EfficientNet — Redefining CNN Efficiency via Compound Scaling
- 2019 · GPT-2 — Announcing the LLM Era with Scale and Zero-shot
- 2019 · RoBERTa — The Engineering Audit That Re-trained BERT Properly
- 2019 · Sentence-BERT — Turning BERT into a Sentence Embedding Engine
- 2019 · T5 — Unifying All NLP Tasks as Text-to-Text
- 2018 · BERT — Ushering NLP into the Pretraining Era via Masked Language Modeling
- 2018 · ELMo — Bringing Contextual Embeddings Mainstream via Bidirectional LSTM Language Models
- 2018 · GPT-1 — Igniting the Pre-training Revolution with a Decoder-only Transformer
- 2018 · Graph Attention Networks (GAT) — Attention as a Learnable Graph Edge
- 2018 · Group Normalization - Freeing Normalization from Batch Size
- 2018 · PGD Adversarial Training — Robustness as Min-Max Optimization
- 2018 · SE-Net — The Channel Attention That Won ILSVRC 2017
- 2018 · StyleGAN — Pushing GAN to Photorealistic Face Generation via Style Modulation
- 2018 · ULMFiT — Making Language Model Fine-tuning Work
- 2017 · AlphaZero — Erasing Human Go Knowledge from RL via Pure Self-Play
- 2017 · Capsule Networks — Routing Parts into Wholes
- 2017 · CycleGAN — Unlocking Unpaired Image Translation via Cycle Consistency Loss
- 2017 · GCN — Establishing Semi-supervised Node Classification as the Canonical Graph Neural Network Task
- 2017 · Mask R-CNN — Unifying Instance Segmentation by Adding One Branch to Faster R-CNN
- 2017 · MobileNet — Bringing Deep Learning to Mobile Devices via Depthwise Separable Convolutions
- 2017 · PPO — How Clipping Finally Made Policy Gradient Tunable and Usable
- 2017 · PointNet — Permutation-Invariant Deep Networks for Unordered Point Clouds
- 2017 · Transformer — Burying Recurrence with Attention
- 2017 · WGAN — Curing GAN Training Instability with Wasserstein Distance
Era 2 · Deep Renaissance (2012-2016) 31 notes
- 2016 · A3C - Asynchronous Actors as the Stabilizer for Deep Reinforcement Learning
- 2016 · AlphaGo — Defeating the Human Go World Champion with MCTS + Deep Networks
- 2016 · DenseNet - Feature Reuse as a Network Architecture
- 2016 · LayerNorm: Normalization Without a Batch
- 2016 · WaveNet - The Neural Starting Point for Raw-Waveform Generation
- 2016 · YOLO — Turning Object Detection into a Single Real-Time Regression
- 2015 · BatchNorm — Turning Training Stability into a Layer
- 2015 · FCN - Turning Classification Networks into Pixel-Level Segmenters
- 2015 · Faster R-CNN — Learning Region Proposals Inside the Detector
- 2015 · He Init - The Starting Point That Kept ReLU Networks Alive
- 2015 · Inception / GoogLeNet — Making CNNs Deeper by Making Them Wider
- 2015 · Knowledge Distillation — Pouring a Large Model's Dark Knowledge into a Small One
- 2015 · Nature DQN - The Atari Moment That Brought Deep Reinforcement Learning into the Spotlight
- 2015 · ResNet — How Deep Residual Learning Unlocked the 152-Layer Door
- 2015 · Spatial Transformer Networks — Letting CNNs Learn to Crop, Align, and Warp
- 2015 · U-Net — Turning Encoder-Decoders and Skip Connections into the Default Grammar of Medical Segmentation
- 2014 · Adam — Adaptive Moments for Stochastic Optimization
- 2014 · Adversarial Examples — Linearity, FGSM, and the Beginning of Modern Robustness
- 2014 · Bahdanau Attention — Teaching Neural MT Where to Look
- 2014 · GAN — Adversarial Games that Taught Neural Networks to Forge
- 2014 · GloVe - The Global Co-occurrence Bridge for Word Vectors
- 2014 · Network In Network — Putting a Tiny MLP Inside Every Convolution
- 2014 · R-CNN — The ImageNet Feature Hierarchy That Rebooted Detection
- 2014 · Seq2Seq - Compress Any Sequence into a Vector, Then Decode It Back
- 2014 · VGG — Pushing CNNs to 19 Layers with 3×3 Convolutions
- 2013 · DQN — The First Deep RL Agent to Learn Atari from Pixels
- 2013 · VAE — Turning Generative Modeling into a Tractable Variational Bound
- 2013 · Word2Vec - The Industrial Shortcut that Put Meaning into Vectors
- 2013 · ZFNet — Visualizing the Black Box That AlexNet Opened
- 2012 · AlexNet — Halving ImageNet Top-5 Error with GPU + ReLU + Dropout
- 2012 · Dropout — Randomly Turning Neurons Off to Stop Feature Co-adaptation
Era 1 · Foundations (1957-2011) 16 notes
- 2011 · ReLU — How max(0, x) Turned Deep Networks from "Lab Toy" to "Industrial Cornerstone"
- 2010 · Glorot Init — Making Deep Networks Pass Signals Before They Learn
- 2010 · RNN-LM — Moving Language Modeling from Fixed Windows to Recurrent State
- 2010 · Stacked Denoising Autoencoders — Turning Local Denoising into Deep Representation Pretraining
- 2009 · ImageNet — How 15M Images Turned a 'Dataset' into the Fuse of the Deep Learning Revolution
- 2008 · t-SNE — The Visual Language of High-Dimensional Data
- 2006 · Autoencoder — RBM Pretraining Wakes Neural Networks From Cold Storage
- 2006 · DBN — How Layer-wise Greedy Pretraining Made Deep Networks Trainable for the First Time
- 2003 · LDA — Promoting pLSA to a Generalizable Fully-Bayesian Topic Model with a Dirichlet Prior
- 2001 · Random Forests — Bagging + Feature Sampling that Put Decision Trees on the ML Throne
- 1998 · LeNet — Stitching Convolution, Pooling and Backprop into the First Industrial-Grade Deep Network
- 1997 · LSTM — How Gating Made Recurrent Networks Remember Long Dependencies for the First Time
- 1992 · SVM — How Max-Margin and the Kernel Trick Dominated Machine Learning for Two Decades
- 1989 · Universal Approximation — The Existence Theorem That Certified Neural Networks' Expressive Power
- 1986 · Backprop — Pulling Multi-layer Networks from 'Untrainable' into the Optimizable World via the Chain Rule
- 1958 · Perceptron — How the First Hardware Neuron That Learns from Data Sparked AI as a Discipline