T5 — Unifying All NLP Tasks as Text-to-Text¶
October 23, 2019. Raffel and 8 co-authors release T5 (arXiv 1910.10683), 53 pages (67 with appendices), one of the most systematic transfer-learning studies in NLP history. The paper introduces no "new" model components; instead it rigorously control-tests the design choices scattered across the post-BERT literature, including architecture (encoder-only / decoder-only / encoder-decoder), pre-training objective (LM / MLM / span corruption), data scale (10GB to 1TB), and model size (60M to 11B), and lands on a clear recommendation: "encoder-decoder + span corruption + C4 + large model" is the strongest combination. T5 also released C4 (Colossal Clean Crawled Corpus, 750GB), still a core data source for LLaMA / Falcon / RedPajama today. Its 11B model set new state of the art on most of the 24 benchmarks it evaluated, including GLUE, SuperGLUE, and SQuAD.
TL;DR¶
T5 unifies all NLP tasks (classification / translation / summarization / QA / text similarity) into a "text input → text output" format, pairs it with an encoder-decoder Transformer, span-corruption pre-training, and the 750GB C4 corpus, and shows that this unified framework reaches new state of the art on most of the 24 benchmarks evaluated, marking NLP's shift from the "one model per task" paradigm to the "one corpus-scale pre-trained model for all tasks" paradigm.
Historical Context¶
What was the NLP community stuck on in 2019?¶
2018-2019 was NLP's "pre-training explosion": BERT (2018.10), GPT-2 (2019.02), XLNet (2019.06), RoBERTa (2019.07), ALBERT (2019.09), five milestones in barely a year. But each made different design choices: BERT used encoder-only + MLM, GPT-2 used decoder-only + LM, XLNet used a permutation LM, ALBERT cut parameters via sharing. The community lacked a unified controlled experiment to answer "which design is actually best?" Concretely, five questions were open:
1. Architecture: which of encoder-only / decoder-only / encoder-decoder is strongest?
2. Pre-training objective: which of LM / MLM / span corruption / deshuffling is most effective?
3. Data scale: how much do you gain going from 10GB to 100GB to 1TB?
4. Model scale: 100M / 1B / 11B, which design choices keep paying off as you scale?
5. Task format: how do you unify classification / regression / sequence labeling / QA in one interface?
T5's goal was to do full-factor controlled experiments across these 5 dimensions with a unified framework, then combine the best of each dimension into the final model.
The 3 works that most directly pushed T5 out¶
- Devlin et al., 2018 (BERT): encoder-only + MLM paradigm
- Radford et al., 2019 (GPT-2): decoder-only + LM + zero-shot
- Lewis et al., 2019 (BART): encoder-decoder + denoising pre-training (concurrent work, also October 2019)
What was the author team doing?¶
9 authors, all at Google Research at the time. Colin Raffel (first author; later a professor at UNC and then the University of Toronto, also working with Hugging Face); Noam Shazeer, Transformer co-author and MoE pioneer (later founded Character.AI); Peter Liu, Google Brain's text-summarization expert. The Google Brain NLP team's goal was to "find NLP's unified paradigm"; T5 was the engineering output of that goal.
State of industry, compute, data¶
- TPU: T5-11B trained on 1024 TPU v3 for ~2 weeks, estimated cost ~$1.3M
- Data: self-crawled and cleaned C4 (750GB), heuristically filtered from Common Crawl
- Frameworks: TensorFlow + Mesh-TensorFlow (early model parallelism)
- Industry: the NLP industry was beginning to bet heavily on pre-trained models; T5 was one of Google's key engineering answers to OpenAI's GPT line
Method Deep Dive¶
Overall framework¶
[Pre-training: Span Corruption on C4 (750GB)]
Input: "Thank you <X> me to your party <Y> week"
Target: "<X> for inviting <Y> last <Z>"
↓ Encoder-Decoder Transformer
↓ Cross-entropy on target tokens
[Fine-tuning: Multi-task with task prefix]
All tasks formatted as text-to-text
Same encoder-decoder + small LR fine-tune
| Config | T5-Small | T5-Base | T5-Large | T5-3B | T5-11B |
|---|---|---|---|---|---|
| Params | 60M | 220M | 770M | 3B | 11B |
| Encoder layers | 6 | 12 | 24 | 24 | 24 |
| \(d_{model}\) | 512 | 768 | 1024 | 1024 | 1024 |
| \(d_{ff}\) | 2048 | 3072 | 4096 | 16384 | 65536 |
| Heads | 8 | 12 | 16 | 32 | 128 |
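As a sanity check on the table above, a rough parameter count can be reconstructed from these dimensions alone. The sketch below is an illustration, not the authors' code: it assumes equal-depth encoder and decoder stacks with a shared ~32k-token embedding and ignores layer norms and the relative-position-bias tables. It lands within a few percent of the published 60M / 220M / 770M figures; the 3B and 11B variants additionally widen the attention inner dimension (heads × d_kv ≠ d_model), so this simple formula does not cover them.

```python
def t5_param_count(d_model: int, d_ff: int, n_layers: int, vocab: int = 32_128) -> int:
    """Back-of-the-envelope T5 parameter count (ignores layer norms and
    relative-position-bias tables; assumes attention inner dim == d_model,
    which holds for Small/Base/Large but not for 3B/11B)."""
    attn = 4 * d_model * d_model          # Q, K, V and output projections
    ffn = 2 * d_model * d_ff              # up- and down-projection (ReLU variant)
    enc_layer = attn + ffn                # self-attention + FFN
    dec_layer = 2 * attn + ffn            # self-attention + cross-attention + FFN
    embed = vocab * d_model               # shared input/output embedding
    return embed + n_layers * (enc_layer + dec_layer)

print(t5_param_count(512, 2048, 6))       # ~60M  (T5-Small)
print(t5_param_count(768, 3072, 12))      # ~223M (T5-Base)
print(t5_param_count(1024, 4096, 24))     # ~738M (T5-Large; published count 770M)
```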
Key designs¶
Design 1: Text-to-Text Unified Framework — task prefix + text output¶
Function: turn classification / regression / sequence labeling / QA / translation / summarization all into "prefixed input text → output text."
Unified examples across task types:
| Task type | Input | Output |
|---|---|---|
| Classification (GLUE/SST-2) | sst2 sentence: it is great. | positive |
| Regression (GLUE/STS-B) | stsb sentence1: ... sentence2: ... | 3.4 (the score as text, rounded to the nearest 0.2; see the snippet below the table) |
| Translation (WMT EN-DE) | translate English to German: Hello | Hallo |
| Summarization (CNN/DM) | summarize: <article> | <summary> |
| QA (SQuAD) | question: ... context: ... | answer text |
| NLI (MNLI) | mnli premise: ... hypothesis: ... | entailment / neutral / contradiction |
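The regression row deserves a note: STS-B similarity scores are turned into a small closed set of strings by rounding to the nearest increment of 0.2, so even regression becomes text generation. A tiny sketch of that recipe (illustrative code, not the authors' preprocessing):

```python
def stsb_target(score: float) -> str:
    """Render an STS-B similarity score as T5's text target:
    the score rounded to the nearest increment of 0.2, as a literal string."""
    return f"{round(score / 0.2) * 0.2:.1f}"

print(stsb_target(3.37))  # -> "3.4"
print(stsb_target(1.04))  # -> "1.0"
```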
Comparison with BERT / GPT multi-task approaches:
| Model | Multi-task approach | Task interface |
|---|---|---|
| BERT | Add different head per task | Classification head / span head / token head |
| GPT-2 | Prompt + zero-shot | Auto-regressive generation |
| T5 | Unified text-to-text + task prefix | Encoder-decoder generation |
Design rationale: the text-to-text interface pushes all task differences into the input formatting, so the architecture, loss, and training pipeline stay completely unified; this is transfer learning reduced to its cleanest engineering form.
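A minimal sketch of the interface in practice, using the Hugging Face transformers library and the released t5-base checkpoint (which was pre-trained on the paper's multi-task mixture, so these prefixes produce sensible output without further fine-tuning). This illustrates the text-to-text idea; it is not the authors' training pipeline.

```python
# pip install transformers sentencepiece torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# The same model and weights handle every task; only the input prefix changes.
prompts = [
    "translate English to German: That is good.",
    "sst2 sentence: this movie was a complete waste of time.",
    "summarize: The T5 paper casts every NLP task as feeding the model text "
    "and training it to generate target text, then studies architectures, "
    "objectives, datasets and scale under this single framework.",
]
for prompt in prompts:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=40)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```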
Design 2: Span Corruption Pre-training Objective — continuous version of BERT MLM¶
Function: randomly select continuous spans to mask, let the model predict the masked spans.
Core mechanism:
Input "Thank you for inviting me to your party last week"
Randomly select 15% tokens to form spans (avg length 3):
- "for inviting" → replace with <X>
- "last" → replace with <Y>
Input: Thank you <X> me to your party <Y> week
Target: <X> for inviting <Y> last <Z> (each span marked with unique sentinel token, ending with <Z>)
Loss is the usual cross-entropy over the target tokens; since the target consists only of the sentinels and the masked spans, nothing else is predicted.
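A simplified, word-level sketch of the idea (the real implementation works on SentencePiece ids, samples span lengths with mean 3, and is fully batched); the sentinel names follow the released checkpoints' <extra_id_i> tokens, which play the role of <X>, <Y>, <Z> above.

```python
import random

def span_corrupt(tokens, corruption_rate=0.15, span_len=3, seed=0):
    """Mask ~corruption_rate of the tokens as contiguous spans and build the
    (input, target) pair used by span corruption. Fixed span length for
    clarity; the paper samples span lengths with mean 3."""
    rng = random.Random(seed)
    n = len(tokens)
    num_spans = max(1, round(n * corruption_rate / span_len))
    starts = sorted(rng.sample(range(n - span_len + 1), num_spans))
    # Drop overlapping picks so spans stay disjoint.
    chosen, last_end = [], -1
    for s in starts:
        if s > last_end:
            chosen.append(s)
            last_end = s + span_len - 1
    inputs, targets, prev = [], [], 0
    for i, s in enumerate(chosen):
        sentinel = f"<extra_id_{i}>"
        inputs += tokens[prev:s] + [sentinel]           # span replaced by one sentinel
        targets += [sentinel] + tokens[s:s + span_len]  # target: sentinel + original span
        prev = s + span_len
    inputs += tokens[prev:]
    targets += [f"<extra_id_{len(chosen)}>"]            # closing sentinel (the <Z> above)
    return inputs, targets

inp, tgt = span_corrupt("Thank you for inviting me to your party last week".split())
print(" ".join(inp))
print(" ".join(tgt))
```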
Comparison of 5 pre-training objectives (paper Table 4):
| Objective | Description | GLUE | SQuAD F1 | Translation BLEU |
|---|---|---|---|---|
| Standard LM | \(P(x_t \| x_{<t})\) (GPT-style) | 73.78 | 78.94 | 26.0 |
| BERT-style MLM | 15% single-token mask | 82.96 | 86.78 | 26.7 |
| Deshuffling | shuffle tokens, reorder | 73.17 | 73.93 | 25.4 |
| Span Corruption (T5) | 15% span mask (avg length 3) | 83.28 | 87.24 | 27.6 |
| Random Replace + Reconstruct | replace with random token | 79.37 | 80.94 | 26.5 |
Span corruption wins across the board, and its target sequence is much shorter than BERT-style MLM's (it outputs only the masked spans, which saves compute).
Design 3: Encoder-Decoder Architecture — controlled experiment shows it's optimal¶
Function: use standard Transformer encoder-decoder (nearly identical to original Transformer), not BERT's encoder-only or GPT's decoder-only.
3 architecture controlled experiments (paper Table 2):
| Architecture | Param sharing | GLUE | SQuAD F1 | Translation BLEU |
|---|---|---|---|---|
| Encoder-Decoder (standard Transformer) | No | 83.28 | 87.24 | 27.6 |
| Encoder-Decoder (shared params) | Yes | 82.81 | 86.34 | 27.4 |
| Decoder-only (GPT-style) | - | 78.94 | 84.59 | 26.5 |
| Encoder-only (BERT-style) | - | N/A for generation | 84.81 | - |
Encoder-decoder wins on the generation tasks (translation / summarization) and also edges ahead on NLU. This was T5's most contrarian finding: at a time when BERT (encoder-only) ruled NLU and GPT (decoder-only) ruled generation, T5 showed that under a unified framework the encoder-decoder is strongest at both.
T5's small departures from the original Transformer:
- Pre-LN (layer normalization before each sub-layer, as in GPT-2)
- Simplified LayerNorm: rescaling only, with no additive bias, and no bias terms in the dense layers
- Relative position bias instead of sinusoidal or learned absolute position embeddings: a learned scalar per (relative-position bucket, attention head), shared across layers and added to the attention logits
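A minimal sketch of the bidirectional relative-position bucketing, paraphrased from the released implementations (the names here are descriptive, not the exact API): each (query, key) offset maps to one of 32 buckets, exactly for small offsets and logarithmically out to max_distance, and a learned scalar per (bucket, head) is added to the attention logit.

```python
import math

def relative_position_bucket(relative_position: int,
                             num_buckets: int = 32,
                             max_distance: int = 128) -> int:
    """Map a signed offset (key_pos - query_pos) to a bucket id (bidirectional case)."""
    ret = 0
    n = -relative_position
    num_buckets //= 2                      # half the buckets for each direction
    if n < 0:
        ret += num_buckets
        n = -n
    max_exact = num_buckets // 2           # nearby offsets each get their own bucket
    if n < max_exact:
        return ret + n
    # Farther offsets share logarithmically sized buckets up to max_distance.
    val = max_exact + int(
        math.log(n / max_exact) / math.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    )
    return ret + min(val, num_buckets - 1)

# bias[bucket][head] is a learned scalar; at attention time:
#   logits[h, i, j] += bias[relative_position_bucket(j - i)][h]
print([relative_position_bucket(d) for d in (-64, -3, 0, 3, 64)])
```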
Design 4: C4 Dataset — 750GB cleaned Common Crawl¶
Function: build a massive, high-quality, public pre-training corpus, solving the problem that BookCorpus / Wikipedia are too small.
Cleaning rules (heuristic filtering):
- Keep only lines ending in terminal punctuation (".", "!", "?", or a closing quote), removing fragments
- Discard pages with fewer than 5 sentences (skeleton pages)
- Deduplicate: keep only one copy of any repeated three-sentence span
- Discard pages containing words from a bad-words list
- Discard pages containing "JavaScript must be enabled" (failed JS rendering)
- Discard "lorem ipsum" placeholder pages
- Discard pages containing curly braces { } (code / JSON)
Result: Common Crawl's April 2019 snapshot (~6TB of extracted text) is filtered down to 750GB, about 156B tokens.
Pseudocode:
def build_c4(common_crawl_dump):
    # Sketch of the C4 cleaning pipeline; split_sentences, dedup_by_sentence and
    # BAD_WORDS stand in for the paper's helpers and word list.
    docs = []
    for page in common_crawl_dump:
        if not page.text:
            continue
        # Keep only lines ending in terminal punctuation (drops menus, fragments).
        sentences = [s for s in split_sentences(page.text)
                     if s.endswith(('.', '!', '?', '"'))]
        # Drop skeleton pages with fewer than 5 retained sentences.
        if len(sentences) < 5:
            continue
        text_lower = page.text.lower()
        # Drop pages with bad words, failed-JS warnings, placeholders, or code.
        if any(bad_word in text_lower for bad_word in BAD_WORDS):
            continue
        if 'javascript must be enabled' in text_lower:
            continue
        if 'lorem ipsum' in text_lower:
            continue
        if '{' in page.text or '}' in page.text:
            continue
        docs.append(' '.join(sentences))
    # Deduplicate: keep one copy of any repeated three-sentence span.
    docs = dedup_by_sentence(docs)
    return docs  # ~750GB / 156B tokens after filtering
Comparison with same-era datasets:
| Dataset | Scale | Source | Public | Cleaning intensity |
|---|---|---|---|---|
| BookCorpus | 5GB | Novels | Yes | Low |
| WikiText-103 | 0.5GB | Wikipedia | Yes | High |
| WebText (GPT-2) | 40GB | Reddit high-karma | No | Medium |
| C4 (T5) | 750GB | Common Crawl | Yes | High (heuristic) |
| The Pile (2020) | 825GB | 22 domains | Yes | Medium |
C4 was the largest publicly cleaned pre-training corpus at the time; it remains (2026) a core data source for LLaMA / Falcon / RedPajama.
Loss / training strategy¶
| Item | Config |
|---|---|
| Pretrain Loss | Span corruption cross-entropy |
| Optimizer | AdaFactor (not Adam, saves memory) |
| Pretrain LR | Inverse-square-root schedule \(1/\sqrt{\max(n, 10^4)}\): constant 0.01 for the first 10k steps, then decaying |
| Pretrain Batch | 128 sequences × 512 tokens (baseline studies); 2048 × 512 for the final released models |
| Pretrain Steps | 524k (~34B tokens) for the baseline ablations; 1M steps (~1T tokens) for the final released models |
| Fine-tune LR | 1e-3 |
| Fine-tune Steps | 262k (per task) |
| Norm | Pre-LN, no bias |
| Position | Relative position bias |
| Activation | ReLU (all original T5 1.0 sizes); GeGLU arrived later with T5 1.1 |
| Tokenizer | SentencePiece, 32k vocab |
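The pre-training learning-rate row above can be written out explicitly; per the paper it is an inverse-square-root schedule with k = 10^4 warm-up steps, while fine-tuning uses a constant 1e-3. A small sketch:

```python
import math

def t5_pretrain_lr(step: int, warmup_steps: int = 10_000) -> float:
    """Inverse-square-root schedule from the paper: 1 / sqrt(max(step, warmup_steps)).
    Constant 0.01 during warm-up, then decays as 1/sqrt(step)."""
    return 1.0 / math.sqrt(max(step, warmup_steps))

FINETUNE_LR = 1e-3  # constant learning rate used for fine-tuning

print(t5_pretrain_lr(1), t5_pretrain_lr(10_000), t5_pretrain_lr(524_288))
# 0.01, 0.01, ~0.00138
```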
Failed Baselines¶
Opponents that lost to T5 at the time¶
- GLUE leaderboard: T5-11B avg 90.3, beats RoBERTa-large 88.5 (+1.8) and XLNet-large 89.5
- SuperGLUE: T5-11B 89.3, beats prior SOTA RoBERTa 84.6 (+4.7), coming within a point of the human baseline (89.8)
- CNN/DM summarization: ROUGE-L 21.55 → 28.07 (+6.5)
- WMT EN-DE translation: 28.4 (Transformer) → 29.4 (T5-11B)
- SQuAD 1.1: F1 93.1 → 95.1, first time surpassing supervised BERT ensemble SOTA
Failures / limits admitted in the paper¶
- Decoder-only loses to encoder-decoder on generation tasks: contrary to the GPT line's bet, and reported candidly by the authors
- Span corruption's edge over BERT-style MLM is small: clear on generation tasks, only about +0.3 on NLU
- Multi-task training slightly trails single-task fine-tuning: T5 also tried mixed multi-task fine-tuning, but it came out slightly worse than per-task fine-tuning (paper Table 14)
- The 11B model is still not saturated on some tasks (e.g., ReCoRD): a hint that scaling could continue, foreshadowing GPT-3's 175B
- C4's cleaning heuristics are simple: today LM-based filtering and deduplication are far more refined
"Anti-baseline" lesson¶
- "Encoder-only is the king of NLU" (BERT route belief): T5 proved encoder-decoder under unified framework also gets NLU SOTA
- "Decoder-only is the king of generation" (GPT route belief): T5 reversed on generation tasks
- "BookCorpus + Wikipedia is enough" (community common belief): T5 with 750GB C4 proved data scale matters
- "Per-task fine-tune is best practice": T5 proved per-corpus pretrain + unified fine-tune is the ceiling
- "Need new architecture innovation to progress": T5 proved rigorous controlled experiments + select optimal combination + scale up beats new architecture
Key Experimental Numbers¶
Main experiment (the 24-benchmark evaluation)¶
| Benchmark | Prior SOTA | T5-11B | Gain |
|---|---|---|---|
| GLUE | 88.5 (RoBERTa) | 90.3 | +1.8 |
| SuperGLUE | 84.6 (RoBERTa) | 89.3 | +4.7 |
| SQuAD 1.1 F1 | 93.1 | 95.1 | +2.0 |
| SQuAD 2.0 F1 | 88.6 | 90.6 | +2.0 |
| CNN/DM ROUGE-L | 21.55 | 28.07 | +6.5 |
| WMT EN-DE BLEU | 28.4 (original Transformer) | 29.4 | +1.0 |
| WMT EN-FR BLEU | 41.0 (original Transformer) | 41.5 | +0.5 |
| ReCoRD acc | 84.0 | 90.6 | +6.6 |
| MultiRC F1a | 83.4 | 87.4 | +4.0 |
| BoolQ acc | 87.1 | 91.2 | +4.1 |
Note: in the WMT rows the comparison is against the original Transformer, not the back-translation SOTA of the time; the paper reports that T5 did not reach state of the art on the translation tasks, likely because its pre-training corpus is English-only.
Architecture + objective control (paper Table 2 + 4)¶
| Architecture | Objective | GLUE | SQuAD F1 |
|---|---|---|---|
| Encoder-Decoder | Span | 83.28 | 87.24 |
| Encoder-Decoder | LM | 80.88 | 84.45 |
| Decoder-only | LM | 78.94 | 84.59 |
| Decoder-only | Span | 79.46 | 84.85 |
| Encoder-only | MLM | - | 84.81 |
Scaling (paper Table 14)¶
| Model | Params | C4 train tokens | GLUE | SQuAD F1 | CNN/DM ROUGE-L |
|---|---|---|---|---|---|
| T5-Small | 60M | 137B | 77.4 | 79.10 | 19.24 |
| T5-Base | 220M | 137B | 82.7 | 85.44 | 20.34 |
| T5-Large | 770M | 137B | 86.4 | 89.40 | 21.10 |
| T5-3B | 3B | 1.0T | 88.5 | 91.26 | 22.54 |
| T5-11B | 11B | 1.0T | 89.7 | 91.44 | 23.06 |
All dimensions improve monotonically, no saturation.
Key findings¶
- Encoder-decoder + span corruption is the strongest combination
- C4's scale matters: pre-training on small corpora like BookCorpus (5GB) requires many repeats, and the paper's corpus ablations show that such repetition degrades downstream performance
- Scaling to 11B is still monotonic: in hindsight, a sign that GPT-3's bet on scale was reasonable
- Multi-task training slightly loses to single fine-tune: transfer learning still mainly per-task
- Task prefix format details matter: good prefixes give +1-2 points
Idea Lineage¶
graph LR
decaNLP[decaNLP 2018<br/>multi-task unified as QA] -.framing inspiration.-> T5
Transformer[Transformer 2017<br/>encoder-decoder] -.architectural base.-> T5
BERT[BERT 2018<br/>encoder + MLM] -.alternative route.-> T5
GPT2[GPT-2 2019<br/>decoder + LM] -.alternative route.-> T5
    BART[BART 2019.10<br/>encoder-decoder + denoising] -.concurrent work.-> T5
SpanBERT[SpanBERT 2019<br/>span-level mask] -.span idea.-> T5
T5[T5 2019<br/>text-to-text + span corruption + C4]
T5 --> mT5[mT5 2020<br/>multilingual T5]
T5 --> ByT5[ByT5 2021<br/>byte-level T5]
T5 --> FlanT5[Flan-T5 2022<br/>instruction-tuned T5]
T5 --> UL2[UL2 2022<br/>unified denoising objectives]
T5 --> PaLM[PaLM 2022<br/>540B decoder + scaling]
T5 --> RAG[RAG 2020<br/>retrieval-augmented]
T5 --> InstructGPT[InstructGPT 2022<br/>instruction tuning]
T5 -.idea inspiration.-> ChatGPT[ChatGPT 2022.11<br/>unified dialogue interface]
T5 -.data contribution.-> LLaMA[LLaMA 2023<br/>uses C4 as part of data]
Predecessors¶
- Transformer (2017): encoder-decoder architectural foundation
- BERT (2018): encoder + MLM control
- GPT-2 (2019): decoder + LM control
- decaNLP (2018): early framing of multi-task unified as QA
- SpanBERT (2019): span-level mask idea
- BART (2019.10): concurrent with T5 (published within days of it), encoder-decoder + denoising
Successors¶
- Multilingual / byte-level extensions: mT5 2020, ByT5 2021; PaLM 2022 continued Google's scaling push to 540B (though with a decoder-only architecture)
- Instruction tuning: Flan-T5 2022 (large-scale instruction tuning on the T5 backbone, over 1,800 tasks)
- Objective improvements: UL2 2022 (unified multiple denoising objectives), PEGASUS 2020 (gap-sentence summarization pre-training)
- Retrieval augmentation: RAG 2020 (adds retrieval on T5 backbone)
- Data contribution: C4 to date (2026) remains core data source for LLaMA / Falcon / RedPajama and other open LLMs
- Idea inherited by GPT-3 paradigm: text-to-text interface is essentially the embryo of in-context learning
Misreadings¶
- "T5 is BERT's upgrade": wrong. T5 is the third path beyond BERT/GPT (encoder-decoder + span)
- "Text-to-text must be optimal interface": on classification tasks, task-specific heads may still be more precise
- "11B is the NLP endpoint": GPT-3 175B 4 months later proved scaling can continue 16×
Modern Perspective (Looking Back from 2026)¶
Assumptions that don't hold up¶
- "Encoder-decoder is the best architecture": post-ChatGPT era decoder-only LLMs (GPT-4 / Claude / LLaMA) crush encoder-decoder in user-facing apps. But T5 remains gold standard in backend embedding / dense retrieval / summarization tasks
- "11B is large enough": today mainstream is 70B-1T, T5-11B is medium-sized vs GPT-4 / Claude 3.5
- "Per-task fine-tune is best practice": overturned by in-context learning + RLHF — GPT-3+ doesn't need task-specific fine-tune
- "C4 heuristic cleaning is enough": today LM-based filtering is more refined
- "Encoder-decoder is more training-efficient than decoder-only": refuted — FlashAttention + KV cache make decoder-only inference more efficient
What time validated as essential vs redundant¶
- Essential: text-to-text unified framework, span corruption objective, C4 dataset, rigorous controlled experiment methodology, encoder-decoder advantage on generation
- Redundant / misleading: sentinel token design (GPT-3 in-context learning doesn't need it), relative position bias (replaced by RoPE), AdaFactor (replaced by Adam + ZeRO)
Side effects the authors didn't anticipate¶
- C4 became NLP data infrastructure: today 90%+ open LLMs use C4 / mC4 as part of pre-training data
- Flan-T5 made the T5 backbone a workhorse of instruction tuning: in 2022, Flan-T5 was fine-tuned on 1,800+ tasks in instruction format and became one of the most widely used open instruction-tuned models
- Unified text-to-text interface fully inherited by ChatGPT: ChatGPT's dialogue format is the extreme of text-to-text
- Changed NLP benchmark design philosophy: pre-T5 benchmarks were independent, post-T5 community values "unified-framework comparable"
- Opened a line of systematic transfer-learning research: T5's controlled-experiment methodology was widely imitated (e.g., the scaling-laws studies behind GPT-3, Chinchilla)
If we rewrote T5 today¶
- Switch to decoder-only (per ChatGPT experience)
- Add instruction tuning + RLHF
- Use byte-level BPE (per GPT-3)
- Use RoPE / ALiBi instead of relative position bias
- Use SwiGLU instead of ReLU
- Use GQA / MQA to reduce KV cache
- Scale data to ~15T tokens (per LLaMA 3)
- Add LM-based filtering (per Falcon)
But the core ideas "text-to-text unified interface + large-scale high-quality data + rigorous controls" stay unchanged.
Limitations and Outlook¶
Authors admitted¶
- 11B training cost is extremely high (millions of dollars), academia hard to replicate
- Multi-task training slightly loses to single fine-tune
- C4 cleaning heuristics simple, no LM-based filtering
- Sequence length only 512, long-document limited
- No zero-shot prompting: each task prefix still needs corresponding fine-tuning data
Found in retrospect¶
- In interactive / multi-turn use, an encoder-decoder must re-encode the growing input every turn, whereas a decoder-only model simply extends its KV cache, a practical inference-efficiency disadvantage
- Relative position bias has weak extrapolation
- AdaFactor weaker than Adam on small models
Improvement directions (validated by follow-ups)¶
- mT5 (2020): multilingual extension
- Flan-T5 (2022): instruction tuning
- UL2 (2022): unified multiple denoising objectives
- LongT5 (2022): long documents (4k+ context)
- ByT5 (2021): byte-level tokenization
- Switch to decoder-only (GPT-3/4 route)
Related Work and Inspiration¶
- vs BERT (cross-architecture): BERT encoder-only + MLM, T5 encoder-decoder + span corruption. Lesson: encoder-decoder has architectural advantage on generation
- vs GPT-2 (cross-architecture): GPT decoder-only + LM + zero-shot, T5 encoder-decoder + span + multi-task fine-tune. Lesson: architecture choice must match task type
- vs BART (concurrent work): BART proposed encoder-decoder + denoising pre-training essentially simultaneously with T5 (both October 2019); T5 is the more systematic controlled study. Lesson: being first matters less than being thorough
- vs Transformer (cross-task): Transformer solved MT, T5 generalized encoder-decoder to all NLP tasks. Lesson: general architectures can be reused across tasks
- vs Flan-T5 (cross-generation): Flan-T5 added instruction tuning on T5, proving T5 framework perfectly compatible with instruction tuning. Lesson: good pre-training paradigm should be easy to extend to new training objectives
Related Resources¶
- 📄 arXiv 1910.10683 · JMLR 2020
- 💻 Authors' original TF implementation · HuggingFace transformers/t5
- 🔗 t5-base on HF Hub · t5-11b · Flan-T5
- 📦 Datasets: C4 on TF Datasets · mC4 (multilingual)
- 📚 Must-read follow-ups: mT5 (2020), Flan-T5 (2022), UL2 (2022), PaLM (2022)
- 🎬 Yannic Kilcher: T5 paper explained
🌐 Chinese version · 📚 awesome-papers project · CC-BY-NC