GPT-1 — Igniting the Pre-training Revolution with Decoder-only Transformer¶
June 11, 2018. Radford and 3 co-authors release the GPT-1 tech report on the OpenAI blog under the plain title "Improving Language Understanding by Generative Pre-Training". A 12-page report widely underestimated at the time, it was the first to prove that "decoder-only Transformer + large-scale unsupervised LM pre-training + task-specific fine-tuning" works, taking SOTA on 9 of 12 NLU tasks and pushing Story Cloze commonsense reasoning from 77.6% to 86.5%. Four months later, BERT crushed it on GLUE (75.1 vs 79.6), but by early 2019 GPT-2 followed the same path and showed that the unidirectional LM is the true entry point to LLMs. Looking back today, GPT-1 is the real founder of the "pre-train + fine-tune" paradigm (4 months before BERT) and the foundational paper of the GPT-2 / GPT-3 / ChatGPT / GPT-4 LLM mainline.
TL;DR¶
GPT-1 uses a 12-layer decoder-only Transformer + BookCorpus (800M words) + standard LM loss for unsupervised pre-training, then all-parameter fine-tuning + task-specific input formatting for downstream adaptation, achieving SOTA on 9 of 12 NLU tasks. It was the first engineering proof that "one model per task" can be replaced by "one pretrained model + lightweight fine-tuning."
Historical Context¶
What was the NLP community stuck on in early 2018?¶
Early 2018 NLP was still dominated by Word2Vec / GloVe static word embeddings + task-specific LSTM/CNN two-stage architectures. Each new task required training a model from scratch, and tasks with little labeled data (RTE, CoLA) were nearly hopeless. The community had a few key signals:
(1) Transformer (2017.06) had just appeared, proving attention could fully replace RNN; (2) ULMFiT (2018.01) did LM pre-training + 3-stage fine-tuning on LSTM, dropping IMDb error from 5.9% to 4.6%; (3) ELMo (2018.02) used BiLSTM pre-training for contextualized embeddings, proving dynamic embeddings beat static.
But both routes had fatal limits: ULMFiT used LSTM (capacity-limited); ELMo only replaced the embedding layer (downstream model still trained from scratch). "Can we use the entire pre-trained model as the downstream backbone?" — this was GPT-1's core question.
The 3 immediate predecessors that pushed GPT-1 out¶
- Vaswani et al., 2017 (Transformer) [NeurIPS]: provided the architectural foundation. GPT-1 simply dropped the encoder half, keeping only the decoder
- Howard, Ruder, 2018 (ULMFiT) [ACL]: first engineered "LM pre-training + downstream fine-tuning," but with LSTM
- Peters et al., 2018 (ELMo) [NAACL]: proved contextual > static embedding, but only replaced embedding layer
What was the author team doing?¶
All 4 authors were at OpenAI. Alec Radford was the core first author (later led GPT-2/3/DALL-E/CLIP/Whisper); Tim Salimans was a GAN expert (improved-GAN first author); Ilya Sutskever was Chief Scientist. OpenAI was a ~70-person nonprofit at the time, betting on "unsupervised learning is the path to AGI." GPT-1 was the first engineering proof of that bet.
State of industry, compute, data¶
- GPUs: 8 P600s, 1 month of training (very cheap by today's standards)
- Data: BookCorpus (~7,000 unpublished books, ~800M tokens), chosen for long context and story coherence
- Frameworks: TensorFlow 1.x; BPE with 40k merges (raw text cleaned with ftfy, tokenized with spaCy)
- Academic mood: BERT had not yet been released (it came 4 months later), ELMo had just won the NAACL best paper award, and the field was full of anticipation around "pre-training"
Method Deep Dive¶
Overall framework¶
```
[Pre-training]
Input: BookCorpus tokens (BPE)
  ↓ Token Emb + Position Emb (learnable)
  ↓ 12 × Decoder Block (Multi-Head Self-Attn + FFN, Post-LN)
  ↓ Linear (tied with input emb) + softmax
  ↓ L_LM = -∑ log P(x_t | x_<t)

[Fine-tuning]
Same backbone + task input formatting + small task head
Joint loss: L_task + λ * L_LM (auxiliary LM loss)
```
| Config | Layers | \(d_{model}\) | Heads | \(d_{ff}\) | Params | Context |
|---|---|---|---|---|---|---|
| GPT-1 | 12 | 768 | 12 | 3072 | 117M | 512 |
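A quick back-of-the-envelope check of the 117M figure from this config (a sketch: the 40,478 vocabulary size follows the Hugging Face openai-gpt tokenizer, and biases / LayerNorm parameters are ignored):

```python
# Rough GPT-1 parameter count from the config table (biases / LayerNorm omitted).
vocab, ctx, d, d_ff, n_layers = 40_478, 512, 768, 3072, 12

tok_emb = vocab * d          # token embedding (output projection is tied, so counted once)
pos_emb = ctx * d            # learned absolute position embedding
attn = 4 * d * d             # Q, K, V and output projections per block
ffn = 2 * d * d_ff           # two FFN matrices per block
total = tok_emb + pos_emb + n_layers * (attn + ffn)

print(f"{total / 1e6:.1f}M")  # ≈ 116.4M, consistent with the reported ~117M
```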
Key designs¶
Design 1: Decoder-only LM Pretraining — autoregressive pre-training¶
Function: causal-masked self-attention does next-token prediction, letting the backbone learn the general capability of "modeling sequence probability distributions."
Forward formula:

$$\mathcal{L}_{LM} = -\sum_{t} \log P(x_t \mid x_{t-k}, \ldots, x_{t-1}; \Theta)$$

$$h_0 = X W_e + W_p, \qquad h_l = \mathrm{transformer\_block}(h_{l-1}),\ \ l = 1 \ldots 12, \qquad P(x) = \mathrm{softmax}(h_{12} W_e^{\top})$$

where \(k = 512\) is the context window, \(W_e\) is the (tied) token embedding matrix, and \(W_p\) the learned position embedding. The decoder block's self-attention uses a causal mask so \(x_t\) only sees \(x_{<t}\).
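A minimal PyTorch sketch of the two pieces described above, the causal mask and the shifted next-token cross-entropy (the decoder producing `logits` is assumed; names are illustrative, not the original implementation):

```python
import torch
import torch.nn.functional as F

def causal_mask(n: int) -> torch.Tensor:
    # (n, n) lower-triangular mask: position t may attend only to positions <= t
    return torch.tril(torch.ones(n, n, dtype=torch.bool))

def lm_loss(logits: torch.Tensor, ids: torch.Tensor) -> torch.Tensor:
    """Next-token prediction: logits (B, n, vocab) from a causally masked decoder, ids (B, n)."""
    shift_logits = logits[:, :-1, :]   # predictions for positions 1..n-1
    shift_targets = ids[:, 1:]         # the tokens they should predict
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_targets.reshape(-1),
    )
```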
Why decoder-only and not encoder-decoder?

- Autoregressive LM is a natural self-supervised task (no paired data needed)
- Decoder-only is simpler (half the params / half the compute)
- Generation ability comes for free
Comparison with same-era methods:
| Method | Architecture | Pre-train objective | Downstream usage | Generation |
|---|---|---|---|---|
| Word2Vec | Shallow | skip-gram | Replace embedding | None |
| ELMo | BiLSTM | LM | Replace embedding | None |
| ULMFiT | LSTM | LM | Full backbone fine-tune | Weak |
| GPT-1 | decoder-only Transformer | LM | Full backbone fine-tune | Strong |
| BERT (later) | encoder-only Transformer | MLM | Full backbone fine-tune | None |
Design 2: Discriminative Fine-tuning + Auxiliary LM Loss¶
Function: during downstream training, all backbone parameters update, plus an auxiliary LM loss prevents catastrophic forgetting.
Core idea:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{task}} + \lambda \cdot \mathcal{L}_{LM}, \qquad \lambda = 0.5$$

The auxiliary LM loss lets the backbone keep its language-modeling ability while adapting to the task, one of GPT-1's key fine-tuning tricks.
Comparison with ULMFiT's complex fine-tuning:
| Method | Fine-tune workflow | Complexity |
|---|---|---|
| ULMFiT | 3-stage: discriminative LR + gradual unfreeze + slanted triangular LR | Complex |
| GPT-1 | 1-stage: full params + auxiliary LM loss + small LR | Minimal |
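A sketch of that 1-stage recipe, assuming a `model` that returns both task logits and per-position LM logits (illustrative names, not the released code):

```python
import torch.nn.functional as F

def finetune_loss(model, ids, labels, lam=0.5):
    """Joint loss L_task + λ·L_LM on a formatted input (<s> ... <e>)."""
    task_logits, lm_logits = model(ids)               # (B, C) and (B, n, vocab), assumed
    task_loss = F.cross_entropy(task_logits, labels)  # supervised task objective
    aux_lm_loss = F.cross_entropy(                    # same LM objective as pre-training
        lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
        ids[:, 1:].reshape(-1),
    )
    return task_loss + lam * aux_lm_loss              # λ = 0.5 in the paper
```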
Design 3: Task-specific Input Transformation — input formatting for multi-task¶
Function: don't change the backbone, only change input format, so one pretrained model adapts to classification, entailment, similarity, QA, etc.
4 task input formats:
| Task | Input format |
|---|---|
| Classification (SST / CoLA) | <s> text <e> |
| Entailment (MNLI / SNLI) | <s> premise $ hypothesis <e> |
| Similarity (STS / MRPC) | <s> textA $ textB <e> and <s> textB $ textA <e>, the two representations summed element-wise |
| Multi-choice QA (RACE / Story Cloze) | per candidate: <s> context $ answer_i <e>, compare logits |
Take last token's hidden state, attach a Linear head:
```python
import torch.nn as nn

class GPT1ForClassification(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        # GPT1Decoder stands in for the pretrained 12-layer backbone
        self.backbone = GPT1Decoder(L=12, d=768, h=12)
        self.head = nn.Linear(768, num_classes)

    def forward(self, ids):
        h = self.backbone(ids)   # (B, n, 768) hidden states
        last = h[:, -1]          # hidden state of the last token (<e>)
        return self.head(last)   # (B, num_classes) task logits
```
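For the multiple-choice format (RACE / Story Cloze), each candidate is encoded separately and the per-candidate scores are compared; a sketch assuming the classification model above with `num_classes=1` and an illustrative `encode` helper:

```python
import torch

def predict_choice(model, encode, context, answers):
    """Format each candidate as <s> context $ answer_i <e>, score it, pick the argmax."""
    scores = []
    for ans in answers:
        ids = encode(f"<s> {context} $ {ans} <e>")  # (1, n) token ids; encode() is assumed
        scores.append(model(ids)[:, 0])             # scalar score for this candidate
    return int(torch.argmax(torch.cat(scores)))     # index of the predicted answer
```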
Design rationale: a unified input format + shared backbone avoids the engineering burden of "one architecture per task." This is the precursor of later prompt engineering.
Loss / training strategy¶
| Item | Config |
|---|---|
| Pretrain Loss | LM cross-entropy |
| Optimizer | Adam (\(\beta_1=0.9, \beta_2=0.999\)) |
| Pretrain LR | 2.5e-4, cosine decay + 2k warmup |
| Pretrain Batch | 64 sequences × 512 tokens |
| Pretrain Epochs | 100 epochs on BookCorpus (100 full passes over the corpus) |
| Fine-tune LR | 6.25e-5 |
| Fine-tune Epochs | 3 |
| Norm | Post-LN |
| Activation | GELU |
| Tokenizer | BPE, 40k vocab |
| Auxiliary LM weight | \(\lambda = 0.5\) |
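The pre-training schedule row above, spelled out (a sketch: linear warmup over the first 2,000 updates, then cosine decay to zero; `total_steps` depends on how the 100-epoch run is sliced into batches):

```python
import math

def pretrain_lr(step, total_steps, max_lr=2.5e-4, warmup=2000):
    """Linear warmup for `warmup` updates, then cosine decay of the LR to 0."""
    if step < warmup:
        return max_lr * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```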
Failed Baselines¶
Opponents that lost to GPT-1 at the time¶
- Story Cloze (prior SOTA 77.6%): GPT-1 got 86.5%, a +8.9 jump, one of the largest single improvements a commonsense reasoning benchmark had seen
- RACE (prior SOTA 53.3%): GPT-1 got 59.0%, +5.7
- MultiNLI matched (prior SOTA 80.6%): GPT-1 got 82.1%
- Most small datasets: GPT-1 significantly beat task-specific LSTM/CNN
Weaknesses (admitted in the paper or exposed soon after)¶
- GLUE avg 75.1: 4 months later BERT-base scored 79.6, beating it by 4.5 points (bidirectionality was the key)
- CoLA (grammatical acceptability) 45.4: BERT-base reached 52.1, evidence that an encoder + bidirectionality suits pure NLU better
- Dropping the auxiliary LM loss degrades fine-tuning; the paper's ablation suggests the larger datasets (NLI, QQP) benefit most from the auxiliary objective
"Anti-baseline" lesson¶
- "Pretraining is useless" (2017 community consensus): GPT-1 directly refuted — 100 epochs of pretraining gives 5–10 point jumps on small downstream tasks
- "LSTM is the natural sequence-modeling paradigm": GPT-1 + Transformer proved replaceable
- "Need task-specific architectures": GPT-1 proved input formatting + shared backbone is enough
Key Experimental Numbers¶
Main experiment (12 NLU tasks)¶
| Task | Prior SOTA | GPT-1 | Gain |
|---|---|---|---|
| MNLI-m | 80.6 | 82.1 | +1.5 |
| MNLI-mm | 80.1 | 81.4 | +1.3 |
| SNLI | 89.3 | 89.9 | +0.6 |
| SciTail | 83.3 | 88.3 | +5.0 |
| QNLI | 82.3 | 88.1 | +5.8 |
| RTE | 61.7 | 56.0 | -5.7 |
| Story Cloze | 77.6 | 86.5 | +8.9 |
| RACE-m | 58.7 | 62.9 | +4.2 |
| RACE-h | 49.4 | 57.4 | +8.0 |
| CoLA | 35.0 | 45.4 | +10.4 |
| SST-2 | 93.2 | 91.3 | -1.9 |
| QQP | 66.1 | 70.3 | +4.2 |
SOTA on 9 of the 12 tasks studied, with gains of up to +10.4 points (each task uses its standard metric: accuracy for most, Matthews correlation for CoLA, F1 for QQP).
Ablation¶
| Config | Avg | Notes |
|---|---|---|
| GPT-1 full | 75.1 | baseline |
| No pretraining (from scratch) | 56.5 | -18.6, proves pretraining is core |
| No Transformer (LSTM instead) | 70.5 | -4.6 |
| Fine-tune without aux LM | 71.2 | -3.9 (larger datasets such as NLI / QQP benefit most from the aux objective) |
| Only last layer, no backbone fine-tune | 65.0 | -10.1 |
Key findings¶
- Pretraining is the core: drops 18 points if removed
- Transformer > LSTM: 4–5 point advantage
- Aux LM loss helps: mainly on the larger datasets (NLI / QQP), per the paper's ablation
- Full-param fine-tuning >> frozen backbone: +10 points
- Zero-shot ability already nascent: the pretrained GPT-1, with no fine-tuning, reaches about 70% on Story Cloze (though GPT-2 is what truly explored this path)
Idea Lineage¶
```mermaid
graph LR
    Word2Vec[Word2Vec 2013<br/>static embedding pretrain] -.foundation.-> GPT1
    Seq2Seq[Seq2Seq 2014<br/>encoder-decoder] -.foundation.-> GPT1
    Transformer[Transformer 2017<br/>self-attention architecture] -.architectural base.-> GPT1
    ULMFiT[ULMFiT 2018.01<br/>LM pretrain + fine-tune] -.methodology.-> GPT1
    ELMo[ELMo 2018.02<br/>contextual embedding] -.contemporary rival.-> GPT1
    GPT1[GPT-1 2018.06<br/>decoder-only LM pretrain + fine-tune]
    GPT1 --> BERT[BERT 2018.10<br/>encoder + MLM bidirectional]
    GPT1 --> GPT2[GPT-2 2019<br/>1.5B + zero-shot]
    GPT2 --> GPT3[GPT-3 2020<br/>175B + in-context learning]
    GPT3 --> InstructGPT[InstructGPT 2022<br/>RLHF]
    InstructGPT --> ChatGPT[ChatGPT 2022.11]
    ChatGPT --> GPT4[GPT-4 2023]
    GPT1 --> CodeX[Codex 2021<br/>code GPT]
    GPT1 --> LLaMA[LLaMA 2023<br/>open-source LLM]
    GPT1 --> Megatron[Megatron-LM 2019<br/>model parallelism]
```
Predecessors¶
- Word2Vec / GloVe (2013-2014): founded "pre-train + reuse" idea
- Transformer (2017): only architectural foundation
- ULMFiT (2018.01): LSTM version of LM pretrain + fine-tune
- ELMo (2018.02): contextual embedding contemporary rival
Successors¶
- Direct rival: BERT (2018.10) — encoder + bidirectional + MLM, beat GPT-1 4 months later
- Direct heirs: GPT-2 (2019) → GPT-3 (2020) → ChatGPT (2022.11) → GPT-4 (2023), the entire LLM mainline
- Architecture family: Transformer-XL / XLNet / Reformer and all decoder-only successors
- Multi-modal extensions: DALL-E / CLIP / Whisper / Sora all from GPT-1 team alumnus (Radford)
Misreadings¶
- "GPT-1 was a failure (crushed by BERT)": wrong. GPT-1 is the true starting point of LLM; BERT is the NLU branch. Today's mainstream LLMs all inherit GPT-1 paradigm
- "GPT-1 has no zero-shot ability": actually GPT-1 zero-shot already reaches 70% on Stories Cloze, but GPT-2 systematically explored this path
- "Need bidirectional for NLU": GPT-3's in-context learning proved unidirectional LM can do almost all NLU tasks
Modern Perspective (Looking Back from 2026)¶
Assumptions that don't hold up¶
- "117M params is already large": today mainstream is 70B-1T; GPT-1 is a toy today
- "BookCorpus 800M tokens is enough": today LLaMA-3 uses 15T tokens, 18750× GPT-1
- "Post-LN is the correct norm position": from GPT-2 onward Pre-LN; today LLaMA / Qwen all Pre-LN + RMSNorm
- "Need fine-tune for downstream": GPT-3 in-context learning proved zero/few-shot can do almost anything
- "learnable absolute PE is the standard": today RoPE / ALiBi
- "512 context is enough": today 1M-2M context (Gemini 1.5 / Claude 3.5)
What time validated as essential vs redundant¶
- Essential: decoder-only architecture, generative pre-training, full-param fine-tuning, input formatting
- Redundant: Post-LN (replaced by Pre-LN), 512 context (replaced by 1M+), aux LM loss (no longer needed from GPT-2), char-level BPE (replaced by byte-level)
Side effects the authors didn't anticipate¶
- Opened the LLM mainline: GPT-1 → GPT-2 → GPT-3 → ChatGPT → GPT-4 → o1 all inherit GPT-1's architecture and paradigm
- Short-term overshadowed by BERT, long-term winner: BERT was the NLU industry standard from 2018-2022, but after ChatGPT, decoder-only LLMs won the user-facing applications
- Hugging Face ecosystem foundation: GPT-1 was an early supported model in Hugging Face transformers
If we rewrote GPT-1 today¶
- Scale up to 7B+, data 15T+ tokens
- Pre-LN + RMSNorm + RoPE + SwiGLU + GQA
- Add instruction tuning + RLHF
- Drop auxiliary LM loss
But the core paradigm "decoder-only Transformer + generative pre-training + input formatting for downstream" stays unchanged.
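The same contrast as a config sketch; the 2026 column is illustrative (roughly a LLaMA-class 7B setup as listed above), not a specific released model:

```python
from dataclasses import dataclass

@dataclass
class DecoderConfig:
    n_layers: int
    d_model: int
    n_heads: int
    context: int
    norm: str
    pos_encoding: str
    activation: str

gpt1_2018 = DecoderConfig(12, 768, 12, 512, "Post-LN LayerNorm", "learned absolute", "GELU")
rewrite_2026 = DecoderConfig(32, 4096, 32, 128_000, "Pre-LN RMSNorm", "RoPE", "SwiGLU")  # illustrative
```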
Limitations and Outlook¶
Authors admitted¶
- Still loses to task-specific bidirectional models on GLUE (soon confirmed by BERT)
- 117M params still small, scaling potential not fully released
- Pre-trained only on BookCorpus single domain, generalization limited
Found in retrospect¶
- A unidirectional LM is theoretically handicapped relative to bidirectional models on NLU
- 512 context limits long-document understanding
- BookCorpus domain-biased (novels), lacks encyclopedia / news diversity
Improvement directions (validated by follow-ups)¶
- Brute-force scaling (GPT-2 1.5B / GPT-3 175B)
- Larger and more general data (WebText / Common Crawl)
- Pre-LN (GPT-2 onward)
- in-context learning (GPT-3)
- Instruction tuning + RLHF (InstructGPT / ChatGPT)
Related Work and Inspiration¶
- vs ULMFiT (cross-architecture): ULMFiT used LSTM + complex 3-stage fine-tune; GPT-1 used Transformer + simple 1-stage. Lesson: with the right architecture, the method can be simpler
- vs ELMo (cross-paradigm): ELMo only replaced embedding; GPT-1 replaced the whole backbone. Lesson: transferring deeper layers >> transferring shallow layers
- vs BERT (cross-architecture): GPT-1 decoder + unidirectional + LM, BERT encoder + bidirectional + MLM. Lesson: architecture and objective combination is an independent design dimension
- vs Transformer (cross-task): Transformer solved MT, GPT-1 moved decoder to self-supervised LM. Lesson: general architectures can be reused across tasks
Related Resources¶
- 📄 GPT-1 Tech Report PDF · OpenAI Blog
- 💻 Authors' original TF implementation · HuggingFace transformers/openai-gpt
- 📚 Must-read follow-ups: GPT-2 (2019), BERT (2018), GPT-3 (2020), InstructGPT (2022)
- 🎬 Karpathy: Let's reproduce GPT (YouTube)
🌐 Chinese version · 📚 awesome-papers project · CC-BY-NC