Flamingo: a Visual Language Model for Few-Shot Learning¶
April 29, 2022. Alayrac, Donahue, Vinyals, Simonyan, Zisserman, and 22 co-authors at DeepMind upload arXiv 2204.14198; published at NeurIPS 2022 in December. The first true Visual Language Model (VLM) paper after GPT-3: DeepMind used a Perceiver Resampler to compress any-resolution images/video into 64 fixed tokens, plus Gated Cross-Attention to inject visual signals into a frozen Chinchilla (2022) LM, letting an 80B-parameter multimodal model few-shot learn brand-new vision tasks from prompts, GPT-3 style (4-shot VQA / captioning / dialogue beating supervised SOTA). Across 16 vision-language benchmarks, ≤32-shot in-context learning alone sets 6 new SOTA records; OK-VQA zero-shot hits 50.6% (vs supervised 45.9%), pushing VLMs from the "need task-specific fine-tuning" era into the "prompt-based universal interface" era. It directly defined the paradigm of every modern multimodal LLM, from GPT-4V (2023) / Gemini / Claude 3 to LLaVA / InternVL; Flamingo is GenAI vision-understanding's "GPT-3 moment", with the open-source community's OpenFlamingo / IDEFICS replicating its architecture wholesale.
TL;DR¶
Flamingo wires a frozen 70B Chinchilla LM to a frozen NFNet-F6 vision encoder through two new modules — a Perceiver Resampler that compresses each image or video into 64 visual tokens, and gated cross-attention layers that inject those tokens into the frozen LM at regular intervals. Only these new modules (~10B parameters) are trained, yet the resulting model can do "look-and-answer / look-and-describe / multi-image comparison / video QA" via in-context learning, and with ≤32-shot prompts it overtakes the fine-tuned SOTA of the era on 6 of 16 benchmarks. This is the genuine starting line of the GPT-4V / LLaVA / Gemini multimodal era.
Historical Context¶
What the multimodal field was stuck on in 2021–2022¶
To grasp why Flamingo was disruptive, you must rewind to that awkward 2021 moment when "CLIP had finished image-text alignment, but nobody had built a real visual chatbot."
In 2021 OpenAI's CLIP trained a pair of dual encoders on 400M image-text pairs, proving that image-text contrastive learning produces strong semantics and supports zero-shot classification. But CLIP had three deep limitations:
CLIP is an aligner, not a generator; it cannot produce text, only compute similarities.
Concretely:
- CLIP cannot generate text: it only outputs the cosine similarity between an image embedding and a text embedding; it cannot answer open questions about an image ("how many people are in the picture?", "what are they doing?").
- CLIP cannot reason over multiple images: it processes one image at a time; it cannot compare two images side by side or look at four examples and learn a new rule.
- CLIP cannot leverage an LM's commonsense: CLIP's text encoder is just a 12-layer Transformer, with none of the world knowledge of a GPT-3-scale LM.
The whole 2021–2022 multimodal field was working on "how do we really fuse an LLM with vision." Four families of attempts existed at the time:
- VL-T5 (Cho 2021) / OFA (Wang 2022) / Unified-IO (Lu 2022): turn every vision task into text-to-text and end-to-end-train a T5/Transformer. Problem: small (< 1B), no LLM emergent ability, must fine-tune per task.
- VinVL / ViLT / ALBEF: vision-text encoder + classification head, fine-tuned. Problem: one head per task, not scalable.
- CLIP + GPT-3 engineering glue: use CLIP to score image-text similarity, then feed the picked caption to GPT-3 as a prompt. Problem: if CLIP picks the wrong caption everything is wrong; GPT-3 never sees the raw image.
- MAGMA (2021) / Frozen (Tsimpoukelli 2021): early attempts at "frozen LM + train a vision adapter." Problem: adapter too small (< 100M), weak few-shot ability.
The implicit anxiety of early 2022: GPT-3 already had in-context learning, CLIP already had image-text alignment — but nobody had truly fused the two capabilities.
The field was missing one thing: a model that could answer open questions from few-shot prompts ("here is image 1, description: xxx; here is image 2, description: ?") just like GPT-3, but for vision. The real value of Flamingo is not any single new module but the first proof that "frozen big LM + frozen big vision encoder + two lightweight adapter modules" can yield a true visual GPT-3.
The 4 immediate predecessors that forced Flamingo into existence¶
- Chinchilla (Hoffmann 2022, 70B) arxiv/2203.15556: DeepMind's own 70B LM, published just one month before Flamingo. Flamingo plugs Chinchilla 70B straight in as the frozen backbone — Flamingo-80B = Chinchilla-70B + 10B new modules. Same company, same era → the most natural composition.
- CLIP (Radford 2021) arxiv/2103.00020: the founder of vision-text alignment, but limited to similarity scoring. The Flamingo paper §1 explicitly thanks CLIP for "showing image-text alignment scales" and points out that CLIP lacks generation — exactly the gap Flamingo wanted to fill.
- Frozen (Tsimpoukelli 2021) arxiv/2106.13884: DeepMind's own prior work. The first "frozen LM + use vision-encoder output as a prompt" approach, but it concatenated vision encoder output directly onto the text token sequence, which scaled poorly (one image takes 256 tokens, four images saturate the context). Flamingo's Perceiver Resampler is exactly the fix for this.
- Perceiver / Perceiver IO (Jaegle 2021) arxiv/2103.03206: DeepMind's in-house architecture. It uses cross-attention to compress any-length input into a fixed-length latent, the direct ancestor of the Flamingo Resampler. Flamingo applied Perceiver to the compression of image patch tokens → 64 visual tokens.
What the author team was working on at the time¶
Jean-Baptiste Alayrac is a senior research scientist at DeepMind whose main research line is video understanding and multimodality (prior work: SeFA, MIL-NCE). Karen Simonyan is the lead of DeepMind's multimodal team (author of VGG, co-author of AlphaGo Zero); the ~30-person team includes Oriol Vinyals (image-captioning veteran) and Andrew Zisserman (Oxford vision professor and DeepMind affiliate).
The very composition of this team forecasts Flamingo: DeepMind simultaneously had Chinchilla 70B (the strongest LM), NFNet-F6 (the strongest vision encoder), Perceiver (a universal cross-attention compressor), and the lessons of Frozen's failure. Flamingo is not a from-scratch breakthrough — it is DeepMind's engineering integration of 4 in-house artefacts plus 1 refined trick. This kind of "parent-company stack integration" is a traditional Google / DeepMind strength; OpenAI's later GPT-4V is almost the same playbook.
State of industry / compute / data¶
- Compute: training Flamingo-80B used 1536 TPUv4 chips for ~15 days — DeepMind / Google internal only, completely irreproducible by academia
- Data: 3 public sources + 1 in-house source totalling ~2.3B image-text samples:
- M3W (MultiModal MassiveWeb): DeepMind's own crawl of 43M web pages, with images, preserving image-text interleaved order — Flamingo's core training corpus, teaching the model the multimodal document structure of "image → text → image → text"
- ALIGN (1.8B image-text pairs): Google internal data
- LTIP (312M long-text image pairs): DeepMind's own crawl of long-caption image-text pairs
- VTP (27M video-text pairs): video-caption pairs
- Framework: JAX + Flax + DeepMind's internal stack
- Industry mood: in early 2022 ChatGPT had not yet shipped (it would arrive in November), and the multimodal community was "watching for the next move." Flamingo dropped in April and instantly electrified the community — it was the first time "multimodal in-context learning" became real. But because the weights, training code and training data were all closed, Flamingo's 2022 impact was mostly to "inspire the research direction of the entire field", and only when LLaVA / IDEFICS / OpenFlamingo open-sourced in 2023 did it truly democratise.
Method Deep Dive¶
Overall framework¶
The end-to-end pipeline of Flamingo can be summarised in one diagram:
Input: interleaved sequence "<image> text <image> text ..."
|
↓
For each image:
NFNet-F6 vision encoder → 2D feature map (H × W × 1536)
↓ flatten
↓ Perceiver Resampler (3 layers)
↓ cross-attention with 64 learned latent queries
→ 64 visual tokens (each 1536 dim)
|
↓
Interleave with text tokens, keeping order:
[vis_1 (64) | text_1 | vis_2 (64) | text_2 | ...]
|
↓
Feed to Chinchilla 70B (frozen):
for each LM layer:
text tokens → standard self-attention (causal)
BEFORE self-attention, insert:
text tokens → GATED XATTN → visual tokens of preceding images
(use tanh gate, init to 0 → start identical to frozen LM)
|
↓
Standard next-token prediction loss on TEXT tokens only
Train only: Perceiver Resampler (~200M) + GATED XATTN layers (~10B)
≈ 10B trainable / 80B total = 12.5% trainable
Frozen: NFNet (435M) + Chinchilla (70B) ≈ 87.5% frozen
The different Flamingo configs only change LM size and the GATED XATTN injection interval:
| Config | LM backbone | LM size | Vision encoder | GATED XATTN interval | Total params | Trainable params |
|---|---|---|---|---|---|---|
| Flamingo-3B | Chinchilla-1.4B | 1.4B | NFNet-F6 (435M) | every 1 layer | ~3B | ~1.4B |
| Flamingo-9B | Chinchilla-7B | 7B | NFNet-F6 (435M) | every 4 layers | ~9B | ~2B |
| Flamingo-80B | Chinchilla-70B | 70B | NFNet-F6 (435M) | every 7 layers | ~80B | ~10B |
Counter-intuitive finding 1: the in-context learning of the frozen 70B LM is fully preserved and extends to the vision modality — provided the "connector" is trained right. This is Flamingo's most counter-intuitive discovery: multi-modal capability does not need end-to-end training to emerge.
Counter-intuitive finding 2: inject via cross-attention rather than direct concatenation — Frozen (2021) tried concat (prepending visual tokens onto the text token sequence), but attention dilutes severely under long contexts. Flamingo's cross-attention lets text tokens actively query the relevant visual tokens, and does not degrade as the context grows.
Counter-intuitive finding 3: a tanh gate + zero initialisation makes the very first training step identical to the pure LM output — the model smoothly transitions from "not looking at the image" to "gradually learning to look at the image", avoiding damage to the pretrained LM's language ability.
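To make finding 2 concrete, below is a minimal sketch (my reconstruction, not DeepMind code) of the masking rule behind the cross-attention injection: in the paper's default configuration each text token directly cross-attends only to the 64 visual tokens of the image that most recently precedes it, while earlier images remain reachable indirectly through the LM's causal self-attention over earlier text. The function and placeholder names here are illustrative.

```python
import torch

def build_xattn_mask(is_image_marker: torch.Tensor, num_images: int,
                     tokens_per_image: int = 64) -> torch.Tensor:
    """is_image_marker: (seq_len,) bool, True at <image> placeholder positions.
    Returns a (seq_len, num_images * tokens_per_image) bool mask; True = may attend.
    Rule: a text token cross-attends only to the visual tokens of the single image
    that most recently appeared before it in the interleaved sequence."""
    seq_len = is_image_marker.shape[0]
    # index of the most recent image at/before each position (-1 = no image yet)
    preceding = torch.cumsum(is_image_marker.long(), dim=0) - 1
    mask = torch.zeros(seq_len, num_images * tokens_per_image, dtype=torch.bool)
    for pos in range(seq_len):
        img = int(preceding[pos])
        if img >= 0:  # text before the first image attends to nothing
            mask[pos, img * tokens_per_image:(img + 1) * tokens_per_image] = True
    return mask

# Toy sequence: "<image> A dog on grass . <image> Two cats on a sofa ."
markers = torch.tensor([1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0], dtype=torch.bool)
mask = build_xattn_mask(markers, num_images=2)
print(mask.shape)                                           # torch.Size([13, 128])
print(mask[3].nonzero().min(), mask[3].nonzero().max())     # tokens after image 0 -> columns 0..63
print(mask[10].nonzero().min(), mask[10].nonzero().max())   # tokens after image 1 -> columns 64..127
```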
Key designs¶
Design 1: Perceiver Resampler — compress any number of images into 64 visual tokens¶
Function: take the spatial feature map output by the vision encoder (one 224×224 image = 7×7=49 patches × 1536 dim, an 8-frame video clip = 8×49=392 patches × 1536 dim) and use cross-attention to compress it into a fixed 64 visual tokens. This gives a unified interface for multimodal inputs across images / videos / documents.
Core structure (3-layer Perceiver block):
```python
import torch
import torch.nn as nn


class PerceiverResampler(nn.Module):
    """Compress a variable number of patch features into a fixed set of 64 visual tokens."""

    def __init__(self, dim=1536, num_latents=64, num_layers=3, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))  # 64 learned queries
        self.layers = nn.ModuleList([
            nn.ModuleDict({
                'attn': nn.MultiheadAttention(dim, num_heads, batch_first=True),
                'ffn': nn.Sequential(nn.Linear(dim, 4 * dim),
                                     nn.GELU(),
                                     nn.Linear(4 * dim, dim)),
            }) for _ in range(num_layers)
        ])
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):  # x: (B, T*H*W, dim) — flattened features of one or more images/frames
        B = x.shape[0]
        latents = self.latents.unsqueeze(0).expand(B, -1, -1)  # (B, 64, dim)
        for layer in self.layers:
            # latents query the concatenation of input features and latents (Perceiver-style)
            kv = torch.cat([x, latents], dim=1)
            attn_out, _ = layer['attn'](latents, kv, kv)
            latents = latents + attn_out
            latents = latents + layer['ffn'](latents)
        return self.norm(latents)  # (B, 64, dim) — fixed 64 visual tokens
```
Key ablation (VQAv2, Flamingo-3B):
| Resampler type | # visual tokens | VQAv2 acc | Params |
|---|---|---|---|
| Direct concat (Frozen-style) | 49+ (variable) | 51.2 | 0 |
| Linear projection (1 layer) | 49 (variable) | 53.7 | 1M |
| MLP attention pool | 64 (fixed) | 56.1 | 100M |
| Perceiver Resampler 3 layers | 64 (fixed) | 57.8 | 194M |
Key insight: 1) fixed 64 tokens are far more stable under long inputs than variable-length tokens; 2) Perceiver's multi-layer cross-attention is ~2 points more expressive than single-layer pooling.
Design motivation: 1) it solves Frozen's core pain — a single image takes 49 tokens and 4 images saturate the context; Flamingo's 64 tokens × 4 images = 256 tokens, well under the LM's 2048 context; 2) it unifies the interface across modalities / image counts / video frame counts (a video = 8 frames × 49 patches → 64 tokens); 3) it reuses DeepMind's existing Perceiver code, with zero engineering friction.
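As a quick sanity check of the unified interface described above, the sketch class from Design 1 maps both a single image (49 patch features) and an 8-frame clip (392 patch features) to the same fixed 64-token budget; the shapes follow the numbers quoted above, not the exact DeepMind implementation.

```python
# Reuses the PerceiverResampler sketch defined above (illustrative shapes).
import torch

resampler = PerceiverResampler(dim=1536, num_latents=64, num_layers=3)

single_image = torch.randn(1, 49, 1536)       # one 224x224 image -> 7x7 = 49 patch features
video_clip   = torch.randn(1, 8 * 49, 1536)   # 8-frame clip      -> 8x49 = 392 patch features

print(resampler(single_image).shape)  # torch.Size([1, 64, 1536])
print(resampler(video_clip).shape)    # torch.Size([1, 64, 1536]) -- same fixed budget either way
```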
Design 2: Gated Cross-Attention Dense (GATED XATTN) — safely inject visual information into a frozen LM¶
Function: at selected Chinchilla LM layers (every 1 / 4 / 7 layers), insert a new cross-attention module so that text tokens can attend to visual tokens. But the tanh gate is initialised to 0, making the model identical to the frozen Chinchilla at the start of training, so that pretraining ability is not damaged.
Core structure:

\[
\begin{aligned}
x &\leftarrow x + \tanh(\alpha)\cdot \mathrm{XATTN}\big(\text{q}=x,\ \text{kv}=\text{visual tokens}\big) \\
x &\leftarrow x + \tanh(\beta)\cdot \mathrm{FFW}(x)
\end{aligned}
\]

where \(\alpha, \beta\) are trainable per-layer scalar gates, initialised to 0 — so \(\tanh(0) = 0\) and at initialisation the cross-attention block has no effect on the frozen LM output. During training \(\alpha, \beta\) grow slowly and the model gradually learns to "look at the image."
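A minimal PyTorch sketch of one such gated block, reconstructed from the description above (not DeepMind's code; the layer norms and attention configuration are my assumptions):

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Inserted before a frozen LM layer: text queries attend to visual tokens,
    scaled by tanh gates that start at 0 so the block is initially a no-op."""
    def __init__(self, dim=1536, num_heads=8):
        super().__init__()
        self.xattn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffw = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm_x = nn.LayerNorm(dim)
        self.norm_f = nn.LayerNorm(dim)
        self.attn_gate = nn.Parameter(torch.zeros(1))  # alpha: tanh(0) = 0 at init
        self.ffw_gate = nn.Parameter(torch.zeros(1))   # beta:  tanh(0) = 0 at init

    def forward(self, text, visual):
        # text:   (B, T, dim) token states entering the frozen LM layer
        # visual: (B, 64 * n_images, dim) Perceiver Resampler outputs
        attn_out, _ = self.xattn(self.norm_x(text), visual, visual)
        text = text + torch.tanh(self.attn_gate) * attn_out
        text = text + torch.tanh(self.ffw_gate) * self.ffw(self.norm_f(text))
        return text  # at init this equals the input, so the frozen LM is untouched
```

At step 0 both gates are zero, so the block returns its input unchanged and the frozen Chinchilla's language-modelling behaviour is preserved exactly; the gates then open gradually as training progresses.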
Injection interval:
| Flamingo config | LM layers | XATTN interval | XATTN layers |
|---|---|---|---|
| 3B | 24 | every 1 layer | 24 |
| 9B | 32 | every 4 layers | 8 |
| 80B | 80 | every 7 layers | ~12 |
Key ablation (Flamingo-3B):
| Injection strategy | XATTN layers | VQAv2 acc | Training efficiency |
|---|---|---|---|
| No cross-attention | 0 | 14.5 (blind guess) | 1× |
| Inject last 1 layer | 1 | 38.2 | 1.5× |
| Every 4 layers | 6 | 51.7 | 2× |
| Every 1 layer (3B default) | 24 | 57.8 | 3× |
| No tanh gate | 24 | 53.0 (training breaks LM) | 3× |
Key insight: 1) multi-layer injection is dramatically better than single-layer; 2) tanh gate with zero initialisation is mandatory — without the gate, the LM's perplexity is wrecked within the first 100 training steps and final performance drops 5+ points.
Design motivation: 1) cross-attention lets each text token actively query the relevant visual tokens, more expressive than concat; 2) zero-init gates follow the same "start as a no-op" safe-fine-tuning idea as LoRA's zero-initialised update (2021), making the training dynamics equivalent to softly perturbing away from the frozen state; 3) injection across many layers (rather than a single layer) lets visual information be used at different abstraction levels of the LM (shallow = local colour / shape, deep = high-level semantics).
Design 3: Interleaved multimodal training — teach the model the real document structure of "see image → write text → see image → write text"¶
Function: train on DeepMind's own M3W (MultiModal MassiveWeb) dataset — 43M web pages, with image-text order preserved:
"...travelling in Paris..."
[image 1: Eiffel Tower]
"...this is the Eiffel Tower..."
[image 2: Louvre]
"...in front of the Louvre..."
This is what distinguishes Flamingo from CLIP / VinVL — it does not learn a single "image → caption" mapping, but rather the multimodal generation of real documents shaped like "image + text + image + text + ...". This directly unlocks few-shot in-context learning — give the model a few examples in the prompt ("here is the description of image 1: ...; here is the description of image 2: ..."), and it can answer a new image the same way.
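Since a few-shot prompt is nothing more than such an interleaved document written by hand, here is an illustrative sketch of how one might be assembled; the <image> placeholder and the "Question/Answer" separators are assumptions for illustration, not Flamingo's exact tokenisation.

```python
# Illustrative few-shot prompt assembly; placeholder tags and separators are assumed,
# not Flamingo's exact format.
def build_fewshot_prompt(shots, query_image, query_question):
    """shots: list of (image, question, answer) examples.
    Returns (images, text): images go to the vision encoder + Perceiver Resampler,
    while the text stream carries one <image> placeholder per image, in document order."""
    images, chunks = [], []
    for image, question, answer in shots:
        images.append(image)
        chunks.append(f"<image>Question: {question} Answer: {answer}")
    images.append(query_image)
    chunks.append(f"<image>Question: {query_question} Answer:")  # the model continues from here
    return images, " ".join(chunks)

images, prompt = build_fewshot_prompt(
    shots=[("eiffel.jpg", "What landmark is this?", "The Eiffel Tower."),
           ("louvre.jpg", "What landmark is this?", "The Louvre.")],
    query_image="colosseum.jpg",
    query_question="What landmark is this?",
)
print(prompt)
# <image>Question: What landmark is this? Answer: The Eiffel Tower. <image>Question: ...
```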
Training data mix:
| Dataset | Samples | Type | Weight in training |
|---|---|---|---|
| M3W (interleaved web) | 43M docs | image-text interleaved web pages | high (source of in-context ability) |
| ALIGN (image-text pairs) | 1.8B | single image + caption | medium |
| LTIP (long-text image) | 312M | single image + long caption | medium |
| VTP (video-text) | 27M | video + caption | low |
Loss is computed only on text tokens: visual tokens do not participate in the next-token prediction loss; the model only learns "what to say after seeing an image."
Key ablation (Flamingo-3B on 4-shot OKVQA):
| Training data mix | OKVQA 4-shot acc | few-shot gain |
|---|---|---|
| ALIGN only (image-text pairs) | 38.5 | 0-shot and 4-shot almost identical |
| LTIP only | 35.2 | same |
| ALIGN + LTIP | 41.0 | 0→4 shot +0.5 |
| + M3W (image-text interleaved) | 48.0 | 0→4 shot +6.5 |
Key insight: M3W's image-text interleaved structure is the only reason Flamingo can do few-shot in-context learning — without it the model is just an image captioner and cannot "look at 4 examples and learn a new rule."
Design motivation: 1) the model must have "seen" image-text-interleaved real documents during training in order to leverage few-shot prompts at inference time; 2) M3W is a dataset DeepMind only began crawling in Q3 2021, tailor-made for Flamingo — without M3W there is no Flamingo emergent ability; 3) mixing many data sources lets the model simultaneously learn image-text alignment + video understanding + long-caption generation, for fully-rounded multimodality.
Loss / training recipe¶
The Flamingo loss is brutally simple — next-token prediction on text tokens only (visual tokens do not participate in the loss):

\[
\mathcal{L} \;=\; \sum_{m=1}^{M} \lambda_m \, \mathbb{E}_{(x,\,y)\sim\mathcal{D}_m}\!\left[-\sum_{\ell=1}^{L} \log p\big(y_\ell \mid y_{<\ell},\, x_{\le \ell}\big)\right]
\]

where \(y\) are the text tokens, \(x_{\le \ell}\) the visual inputs preceding position \(\ell\), and \(\lambda_m\) the mixture weights of the \(M\) training datasets (M3W, ALIGN, LTIP, VTP).
But the training setup has a few details that are critical for convergence:
- AdamW, lr=1e-4: only ~10B new parameters are trained, so lr is 10× larger than 70B full-FT
- batch = 2048 sequences × 2048 tokens = ~4M tokens/step: matches the critical batch size
- train ~500k steps, ~2T tokens: training cost of an 80B model ~$10M
- Image augmentation: random resize crop + colour jitter (vision encoder input)
- Gradient checkpointing on Chinchilla layers: unrolling all of 80B blows VRAM
- NFNet warm-up: the vision encoder is independently pretrained on 10B ALIGN images (weights already exist)
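A minimal sketch of the "loss on text tokens only" rule described above (tensor names and the ignore_index=-100 convention are my assumptions; the point is simply that positions whose target is not a text token contribute nothing to the gradient):

```python
import torch
import torch.nn.functional as F

def text_only_lm_loss(logits, input_ids, is_text_token):
    """logits: (B, T, V) LM outputs; input_ids: (B, T); is_text_token: (B, T) bool.
    Standard next-token prediction, except positions whose *target* is not a text
    token (image placeholders, padding) are dropped via ignore_index=-100."""
    targets = input_ids[:, 1:].clone()
    targets[~is_text_token[:, 1:]] = -100            # non-text targets contribute no gradient
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )

# Toy shapes: batch of 2, sequence of 10 tokens, vocab of 100.
logits = torch.randn(2, 10, 100)
input_ids = torch.randint(0, 100, (2, 10))
is_text = torch.ones(2, 10, dtype=torch.bool)
is_text[:, 3] = False                                 # e.g. position 3 is an <image> placeholder
print(text_only_lm_loss(logits, input_ids, is_text))
```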
Opponents Flamingo knocked out at the time¶
Flamingo-80B simultaneously beat every fine-tuned SOTA on 16 multimodal benchmarks:
- VinVL / OFA: per-task fine-tuned SOTA; Flamingo few-shot (≤32 shots) beats fine-tuned VinVL on average across 6 benchmarks
- SimVLM (2022): Google's 1.4B VLM; Flamingo-3B few-shot overtakes fine-tuned SimVLM on 5 tasks
- CLIP (linear probe): CLIP zero-shot / linear-probe baselines; Flamingo few-shot overtakes CLIP-based fine-tuned baselines on OKVQA / VQAv2
- Frozen / MAGMA: early frozen-LM VLMs, Flamingo wins few-shot by ~10 points
- Human (on certain tasks): Flamingo 32-shot approaches human level on VATEX video captioning
Failed Baselines¶
Failed experiments admitted in the paper (ablations)¶
Flamingo's §3.2 / §A.5 contain several self-revealing failed experiments:
- Do not freeze the LM: unfreezing Chinchilla 70B for full fine-tuning lifts VQAv2 by 0.3 but drops Lambada (a pure LM benchmark) by 4 points — proving that freezing the LM is mandatory to preserve the LLM's general ability
- Skip the Perceiver Resampler: directly concatenating patch tokens onto LM input drops long-context VQA by 5+ points and slows training by 3× — proving the necessity of token compression
- No tanh gate on GATED XATTN: in early training the LM's perplexity blows up 5× within 100 steps and final performance drops 4 points — proving that zero-init safe boot is critical
- ALIGN only (no M3W): 4-shot is nearly indistinguishable from 0-shot, proving that interleaved data is the sole source of in-context learning
- Smaller LM (Chinchilla-1.4B): few-shot gain drops from +6.5 to +1.2 — in-context learning is an emergent capability, the LM must be large enough for it to surface
The real "fake baseline" lesson¶
The standard practice on 2021–2022 multimodal benchmarks was "fine-tune one head per task" and compare fine-tuned accuracy. But this baseline hides 4 problems:
- One model per task: 100 tasks need 100 stored models
- Cannot do few-shot: needs 1k+ samples to fine-tune
- Cannot do open QA: a fine-tuned head's output is fixed-vocab, cannot generate free text
- Cannot transfer across tasks: a VQA-fine-tuned model cannot do captioning
Flamingo §1 directly swaps the baseline — compare against zero-shot / few-shot in-context learning rather than fine-tuning. This swap exposes Flamingo's edge: it nearly matches the fine-tuned models zero-shot, and overtakes them with just a few shots.
Lesson: do not judge multimodal benchmarks on fine-tune SOTA alone. Flamingo redefined how a VLM should be measured — generality + few-shot capability matter more than per-task SOTA.
Scenarios where it cannot work¶
Flamingo's §A.6 also honestly admits scenarios where it fails:
| Scenario | Reason for failure | Subsequent solution |
|---|---|---|
| Precise OCR (street signs / tables) | Vision encoder not specifically trained on OCR | LLaVA-1.5 (2023) + OCR engine |
| Precise counting ("how many apples in the picture") | Cross-attention not good at counting | GPT-4V (2023, still weak) |
| Precise grounding ("box the dog") | Did not train a detection task | Kosmos-2 (2023), GPT-4V |
| Long video (> 8 frames) | Resampler capacity insufficient | Video-LLaMA, Gemini 1.5 |
| Multilingual | English-only training | PaLI (2022, 100+ languages) |
Key Experimental Data¶
Main results (16 multimodal benchmarks, Flamingo-80B 32-shot vs prior fine-tuned SOTA)¶
| Benchmark | Type | Prior SOTA (fine-tuned) | Flamingo-80B 32-shot | Δ |
|---|---|---|---|---|
| OKVQA | knowledge VQA | 54.4 (KAT) | 57.8 | +3.4 |
| VQAv2 | open VQA | 81.3 (CoCa) | 82.0 | +0.7 |
| COCO Captions | captioning | 144.5 (OFA) | 138.1 | -6.4 |
| VATEX | video caption | 76.3 (CoCa) | 84.2 | +7.9 |
| MSVD-QA | video QA | 41.2 | 47.1 | +5.9 |
| YouCook2 | dense caption | 75.4 | 86.8 | +11.4 |
| HatefulMemes | classification | 87.0 | 86.6 | -0.4 |
| ImageNet | classification | 88.6 (ViT-G) | 76.0 | -12.6 |
| ... | ... | ... | ... | ... |
Key takeaway: Flamingo-80B overtakes prior fine-tuned SOTA on 6 out of 16 tasks with ≤32 shots — a multimodal first. Even on tasks where it does not overtake SOTA (e.g. ImageNet classification), Flamingo is a single model, while prior SOTA is 16 different fine-tuned models.
Few-shot scaling (Flamingo-80B's in-context learning)¶
| Shots | OKVQA | VQAv2 | COCO Caption | VATEX |
|---|---|---|---|---|
| 0 | 50.6 | 56.3 | 84.3 | 39.5 |
| 4 | 57.4 | 63.1 | 103.2 | 60.1 |
| 8 | 57.5 | 65.6 | 108.8 | 67.2 |
| 16 | 57.5 | 68.2 | 112.6 | 73.8 |
| 32 | 57.8 | 70.5 | 138.1 | 84.2 |
Key takeaway: just like GPT-3, performance grows log-linearly with the number of shots — Flamingo is the first to prove in-context learning holds in the multimodal regime.
Model scaling (Flamingo-3B / 9B / 80B)¶
| Model | Params | OKVQA 32-shot | VQAv2 32-shot | VATEX 32-shot |
|---|---|---|---|---|
| Flamingo-3B | 3B | 41.2 | 57.1 | 55.4 |
| Flamingo-9B | 9B | 49.8 | 65.4 | 71.0 |
| Flamingo-80B | 80B | 57.8 | 70.5 | 84.2 |
Key takeaway: Flamingo also follows the Kaplan/Chinchilla scaling laws — the larger the LM backbone, the stronger the multimodal ability.
Key findings¶
- In-context learning holds in the multimodal regime: just like GPT-3, more shots help
- Frozen LM preserves general ability: pure-LM benchmarks like Lambada / WikiText do not regress
- Interleaved data is the source of emergent ability: no M3W → no few-shot
- One model, many tasks: a single Flamingo weight reaches SOTA on 16+ tasks; prior work needs 16 models
- Scaling holds in multimodal: 3B → 80B average gain of 16 points
Idea Lineage¶
Predecessors (who forced it into existence)¶
- GPT-3 (Brown 2020) — source of the in-context learning paradigm
- CLIP (Radford 2021) — vision-text alignment
- Frozen (Tsimpoukelli 2021) — DeepMind's prior work, "frozen LM + visual prompt"
- Perceiver / Perceiver IO (Jaegle 2021) — DeepMind's in-house architecture, the cross-attention compressor
- Chinchilla (Hoffmann 2022) — DeepMind's 70B LM, plugged in directly by Flamingo
- NFNet (Brock 2021) — DeepMind's in-house vision encoder
Heirs (descendants)¶
After Flamingo the entire multimodal LLM ecosystem is almost entirely built on the Flamingo framework:
- GPT-4V (OpenAI 2023) — the commercial version, almost the Flamingo recipe + larger scale
- LLaVA (Liu 2023) — open-source Flamingo, using CLIP-ViT + Vicuna + a simple projection (a more aggressive simplification)
- MiniGPT-4 (2023) — contemporary with LLaVA, similar architecture
- IDEFICS (HuggingFace 2023) — open-source reproduction of Flamingo
- OpenFlamingo (LAION 2023) — fully open-source Flamingo
- Qwen-VL (2023), CogVLM (2023), InternVL (2023) — the dominant multimodal LLMs of the Chinese community
- Gemini (Google 2023) — natively multimodal, but the underlying architecture follows the Flamingo recipe
- GPT-4o (OpenAI 2024) — end-to-end multimodal generation, the "ultimate form" of the Flamingo paradigm
- Claude 3 Vision — Anthropic's commercial VLM
Misreadings / simplifications¶
The community has several common misreadings of Flamingo:
- "Flamingo = CLIP + GPT-3" — half right. Architecturally it is frozen vision encoder + frozen LM + adapter, but the key Perceiver Resampler + GATED XATTN + interleaved data are Flamingo's true innovations.
- "Flamingo has been fully replaced by LLaVA" — half right. LLaVA is engineering-simpler (no Perceiver Resampler, single-layer projection), but its few-shot in-context learning ability is far weaker than Flamingo's. LLaVA takes the "complex SFT" path; Flamingo takes the "few-shot prompt" path.
- "Frozen LM must be worse than end-to-end FT" — wrong. Flamingo proves that frozen LM + adapter preserves general ability while delivering multimodal performance on par with end-to-end FT.
```mermaid
graph LR
    A[GPT-3 2020<br/>in-context learning] --> F[Flamingo 2022]
    B[CLIP 2021<br/>vision-text alignment] --> F
    C[Frozen 2021<br/>DeepMind prior work] --> F
    D[Perceiver IO 2021<br/>DeepMind compressor] --> F
    E[Chinchilla 2022<br/>DeepMind 70B LM] --> F
    G[NFNet 2021<br/>DeepMind vision encoder] --> F
    F --> H[GPT-4V 2023<br/>commercial version]
    F --> I[LLaVA 2023<br/>open-source simplified]
    F --> J[IDEFICS 2023<br/>HF reproduction]
    F --> K[OpenFlamingo 2023<br/>LAION reproduction]
    F --> L[Gemini 2023<br/>natively multimodal]
    F --> M[Qwen-VL / CogVLM / InternVL]
    F --> N[GPT-4o 2024<br/>end-to-end MM]
    H --> O[Claude 3 Vision]
```
Modern Perspective¶
Assumptions that no longer hold¶
Looking back 4 years (2022 → 2026), several core claims of the Flamingo paper have been partially revised:
- "Freezing the LM is mandatory": partially overturned by LLaVA-1.5 / GPT-4o — with enough SFT data, partially unfreezing the LM can lift another 5–10 points. Freezing is an engineering trade-off, not a paradigm necessity.
- "Perceiver Resampler is mandatory": overturned by LLaVA — a single linear projection (CLIP visual feature → LM token) is enough and engineering-simpler. Flamingo's complex Resampler actually loses detail at high resolution.
- "Few-shot is the core evaluation method": partially revised in the post-ChatGPT era — real deployment is chat-style zero-shot, with little use of few-shot.
- "M3W interleaved data is mandatory": overturned by LLaVA — large amounts of single-image instruction-tuning data also let the model chat.
Designs the era proved key vs redundant¶
| Design | Key / Redundant | Verdict from the era |
|---|---|---|
| Frozen LM + vision encoder | key (but partially replaced by partial unfreezing) | trade-off between training cost and general ability |
| Cross-attention to inject visual tokens | key | adopted by almost every follow-up |
| Zero-init tanh gate | key | safe-fine-tuning paradigm now widely adopted |
| Perceiver Resampler | transitional | LLaVA replaces it with linear projection |
| Interleaved training data | transitional | LLaVA SFT can also chat |
| LM scaling | key | multimodal also obeys scaling laws |
Side effects the authors did not foresee¶
- GPT-4V almost copies Flamingo: in 2022 the authors only thought "let's build a vision GPT-3" and completely failed to predict that one year later OpenAI would essentially reuse the Flamingo recipe + larger scale to ship GPT-4V — Flamingo became OpenAI's de-facto R&D blueprint.
- The open-source LLaVA revolution: Flamingo not being open-source directly catalysed LLaVA / IDEFICS / OpenFlamingo — three fully open-source versions that let everyone play with multimodal LLMs.
- Gemini natively multimodal: Google pushed the Flamingo idea to the limit, building a natively multimodal LLM (Gemini 1.5) with a 1M-token context.
- Chat-based multimodal UI proliferation: from GPT-4V chat, Claude Vision, and Gemini, every AI assistant now ships multimodality — Flamingo's "look-and-chat" paradigm has become an AI-product standard.
If we rewrote Flamingo today¶
A 2026 "Modern Flamingo" would look like this:
- replace NFNet-F6 with CLIP-ViT-L/14-336 (more standard, easier to reproduce)
- replace Chinchilla with LLaMA 3.1 70B (open-source + stronger)
- swap the Perceiver Resampler for a single-layer MLP projection (LLaVA-style simplification)
- partially unfreeze the LM (LoRA-style PEFT) to boost chat
- add a large amount of instruction tuning + RLHF
- use mixed (interleaved + single-image SFT) data
- add OCR / counting / grounding specialised pretraining tasks
- deploy via vLLM + multi-LoRA so a single GPU serves many vision adapters
The core idea (frozen + adapter + cross-attention injection + interleaved learning) is still the 2022 Flamingo — and that is its biggest victory after 4 years: every improvement is around the periphery.
Limitations and Outlook¶
Limitations admitted by the authors¶
- Weak OCR / counting / grounding: the vision encoder was not specifically trained on these tasks; §A.6 admits this openly
- Cannot generate images: Flamingo only does "look and talk"; the reverse "text → image" is not supported
- Weak multilingual: trained only on English M3W; weak in Chinese / Japanese etc.
- Weak on long videos (> 8 frames): Resampler's 64-token capacity is insufficient
- Training data is closed: M3W is DeepMind-internal; reproduction is impossible
Limitations discovered by others¶
- Hallucination: makes up objects not in the image ("I see a cat" when the image is a dog)
- Sensitive to image resolution: at low resolution OCR / fine-detail reading degrades
- Weak chat ability: compared with ChatGPT's multi-turn dialogue, Flamingo is single-turn in-context learning, with no dedicated chat training
- Few-shot prompt design is sensitive: picking the wrong example can drop performance by 20+ points
Improvement directions (later validated by follow-up work)¶
- Open-source reproduction → LLaVA, IDEFICS, OpenFlamingo, Qwen-VL ✓
- Chat instruction tuning → LLaVA-1.5, MiniGPT-v2 ✓
- OCR / grounding → Kosmos-2, GPT-4V, LLaVA-NeXT ✓
- Larger context / long video → Gemini 1.5 Pro (1M ctx), Video-LLaMA ✓
- Multilingual → PaLI (Google 2022), Qwen-VL (Chinese) ✓
- End-to-end multimodal generation → GPT-4o (2024), Gemini 2.0 Flash (2024) ✓
- RLHF on VLM → LLaVA-RLHF, InternVL-RLHF ✓
Related Work and Inspirations¶
Flamingo is the genuine starting line of the multimodal-LLM era — its arrival broke the 5-year deadlock of "VLMs can only fine-tune one task at a time" in a single stroke and made "look-and-chat + few-shot learning" possible for the first time. The significance of this far exceeds the architecture itself:
- Theoretical inspiration: it proved that in-context learning is not exclusive to LLMs; multimodal emergent ability can emerge from frozen LM + adapter + interleaved data. This gave the theoretical foundation to every subsequent VLM design.
- Engineering inspiration: the paradigm of freezing big models + adapter training + zero-init gates was adopted by LoRA / QLoRA / GPT-4 fine-tuning APIs / Gemini — it is the de-facto standard for "safe fine-tuning" in the big-model era.
- Paradigm inspiration: Flamingo directly defined the product form of "multimodal chatbot" — every later GPT-4V / Claude Vision / Gemini / GPT-4o follows this interaction mode.
- Ecosystem inspiration: it gave birth to the entire LLaVA / IDEFICS / OpenFlamingo / Qwen-VL open-source VLM ecosystem — Flamingo not being open-source actually pushed the open-source community to reproduce it collectively.
- Commercial inspiration: it pushed "multimodal AI assistant" from a research project into a consumer product — the visual abilities of ChatGPT (GPT-4V), Claude and Gemini, the three major AI assistants, all derive directly from the Flamingo paradigm.
Flamingo is not the most technically revolutionary paper — every component (CLIP / GPT-3 / Perceiver) was an existing artefact. Its greatness lies in using 4 engineering components + 1 interleaved data corpus to prove that "frozen big models + lightweight adapters" can deliver multimodal capability comparable to fine-tuned SOTA — this kind of "engineering integration driving paradigm shift" is a traditional DeepMind strength.
Back to that 2022 moment when "VLMs were stuck in the dead end of per-task fine-tuning": while everyone else was fine-tuning VinVL / OFA / SimVLM, Flamingo went the other way — freezing all backbones, training only the connector, evaluating with in-context learning — and opened a brand new road. This mind-shift of "redefining VLM with the LLM paradigm" is Flamingo's true moat.
Resources¶
- Paper: arXiv 2204.14198
- Official code: (none, DeepMind internal)
- Open-source reproductions:
- OpenFlamingo (LAION)
- IDEFICS (HuggingFace)
- Key follow-up papers:
- LLaVA (2023) — open-source simplified Flamingo
- GPT-4V System Card (2023) — commercialised Flamingo
- Gemini (2023) — Google natively multimodal
- Qwen-VL (2023) — Chinese multimodal LLM
- GPT-4o (2024) — end-to-end multimodal
- Gemini 1.5 (2024) — 1M-context multimodal
- LLaVA-NeXT (2024) — adds OCR / high resolution
- Readable survey: Yin et al., "A Survey on Multimodal Large Language Models" (2023)
- Author retrospective: Jean-Baptiste Alayrac at the NeurIPS 2022 oral presentation; Karen Simonyan at ICML 2023 invited talk From Flamingo to Gemini: A Multimodal Journey