PaLM — Scaling Dense Language Models to 540B with Pathways¶
On April 4, 2022, Google Research announced PaLM in an official blog post; on April 5, Chowdhery, Narang, Devlin, and 64 coauthors posted arXiv:2204.02311. The drama of the paper was not merely the headline number, 540B dense parameters. It was Google stress-testing the Pathways promise on a real frontier language model: 6,144 TPU v4 chips across two Pods, 780B training tokens, 46.2% model FLOPs utilization, and 57.8% hardware FLOPs utilization. PaLM pushed BIG-bench, GSM8K reasoning, multilingual transfer, and code generation to the 2022 frontier, yet it also arrived in the same month as Chinchilla, which made the uncomfortable point that PaLM was spectacular and still token-starved. That tension is why PaLM is historically useful: it is both the high-water mark of the parameter arms race and one of the last grand demonstrations before the field relearned how much data allocation mattered.
TL;DR¶
Chowdhery, Narang, Devlin, and 64 coauthors introduced PaLM in 2022 as Google's 540B-parameter dense decoder-only Transformer trained with the Pathways system. Under the dense-Transformer compute approximation \(C \approx 6ND\), PaLM pushed \(N\) to 540.35B, trained on 780B tokens, and used a very specific engineering recipe: SwiGLU, a parallel Transformer block, multi-query attention, RoPE, a 256k lossless SentencePiece vocabulary, Adafactor, and a stabilizing z-loss. The systems result was as important as the model result: 6,144 TPU v4 chips across two Pods, 238.3K tokens/s throughput, 46.2% model FLOPs utilization, and 57.8% hardware FLOPs utilization. Conceptually, PaLM replaced the post-GPT-3 feeling that 175B might already be close to the ceiling with a blunt industrial answer: no, dense scaling was still buying capability. PaLM 540B set single-checkpoint large-LM few-shot SOTA on 28 of 29 English NLP tasks, reached 69.3 on 5-shot MMLU, beat prior SOTA on 44 of 58 common BIG-bench tasks, reached 58% on GSM8K with 8-shot chain-of-thought plus a calculator, and achieved 76.2 pass@100 on HumanEval before code-only finetuning.
The counterintuitive lesson is that PaLM proved two things at once: further scale still worked, and parameter-first scaling was already reaching a bad allocation regime. The paper itself asks what would happen if a 62B model were trained for 7T tokens; the same-month Chinchilla paper answered that question by showing that 540B/780B is only about 1.44 tokens per parameter, far from the later 20 tokens/parameter rule of thumb. PaLM's legacy therefore splits in two. One branch leads to Flan-PaLM, Med-PaLM, PaLM-E, PaLM 2, and Gemini. The other branch turns PaLM into a reference point for the token-rich, parameter-modest philosophy later visible in LLaMA and most efficient open LLMs.
Historical Context¶
What was the LLM field stuck on in early 2022?¶
PaLM appeared in an unusually revealing in-between moment for large language models. In May 2020, GPT-3 had shown that a 175B-parameter autoregressive model could do meaningful few-shot learning, but it left two questions unresolved: would dense parameter scaling keep buying new capabilities, and could such models be trained reproducibly and efficiently by an industrial system rather than by heroic one-off engineering? By late 2021 the answer was still incomplete. DeepMind's Gopher 280B, Microsoft/NVIDIA's MT-NLG 530B, Google's GLaM, and LaMDA each illuminated one part of the frontier: analysis, GPU pipeline scale, sparse MoE scaling, and dialogue safety. No Google dense LLM paper had yet put bigger models, stronger systems, broad evaluation, and risk analysis on the same table.
PaLM entered exactly that window. It was not the first large language model and not the first model in the 500B-parameter range. Its real contribution was to answer whether Google's next-generation Pathways system could push a conventional but enormous decoder-only Transformer to the frontier. The title is honest in putting Pathways next to language modeling: half of PaLM's technical story is about the model, and half is about the training system. Without 6,144 TPU v4 chips, pod-level data parallelism across two Pods, and the JAX/T5X/XLA/Flaxformer stack, 540B parameters would have remained a proposal number rather than a trained model.
The 5 immediate predecessors that pushed PaLM out¶
GPT-3 (Brown et al., 2020) set PaLM's external reference point: decoder-only architecture, left-to-right language modeling, few-shot evaluation, natural-language task descriptions, and a handful of exemplars. PaLM keeps that evaluation philosophy but moves from 175B to 540B parameters, from 300B to 780B training tokens, and broadens the evaluation surface to BIG-bench, GSM8K, code, multilingual benchmarks, and risk analysis.
Kaplan scaling laws (2020) supplied the early resource-allocation mindset: larger parameter counts predictably reduce loss and improve few-shot performance. PaLM inherits that optimism about continued scaling, but it also pushes the recipe to the point where cracks become visible. Section 13 directly asks what would happen if a 62B model were trained for 7T tokens, or a 120B model for 3.6T tokens. That question is almost a pre-announcement of Chinchilla.
GLaM and LaMDA (Google, 2021-2022) provided Google's internal experience with data, evaluation, and serving. PaLM's 780B-token corpus follows the LaMDA/GLaM data line: multilingual conversations, filtered web pages, books, Wikipedia, news, and GitHub code. PaLM does not take GLaM's sparse MoE route; it uses a dense model to show that Google's training stack can still move the frontier without conditional computation.
Pathways (Barham et al., 2022) is the systems predecessor. Google had introduced Pathways as a vision for a model that generalizes across tasks, modalities, and hardware with high efficiency. PaLM is its first large language-model landing point. Training is neither a traditional single-cluster run nor a pipeline split across stages; one Python client dispatches work to two TPU v4 Pods, and both sides exchange gradients every step while maintaining bitwise-identical parameters.
Chain-of-thought prompting (Wei et al., 2022) supplied the magnifying glass for PaLM's capability jump. On GSM8K, SVAMP, MAWPS, StrategyQA, and related tasks, size alone is not enough; the model must be prompted to write intermediate reasoning steps before the answer. The PaLM 540B + CoT combination made the 2022 reader see, very concretely, that scale was not only improving text continuation but also stabilizing multi-step reasoning formats.
What was the author team doing?¶
PaLM is a classic Google Brain-scale engineering paper. The author list has 67 names, and the appendix decomposes contributions by project phase. Aakanksha Chowdhery, Sharan Narang, and Jacob Devlin are co-first authors; Narang, Chowdhery, and Noah Fiedel led the overall project; Noam Shazeer, Yi Tay, Rewon Child, and others worked on architecture and optimizer selection; Paul Barham, Sanjay Ghemawat, Michael Isard, Hyeontaek Lim, and others connected Pathways to the training loop; Jason Wei, Xuezhi Wang, and Denny Zhou handled reasoning evaluation; Katherine Lee, Daphne Ippolito, and Jacob Devlin handled memorization; Marie Pellat, Kevin Robinson, Sunipa Dev, Parker Barnes, and others handled Responsible AI and risk analysis.
That division of labor reveals what PaLM really is. It is not a paper where one module proposal carries the contribution. It is Google compressing language modeling, distributed systems, data pipelines, evaluation infrastructure, safety analysis, and product-grade serving experience into a 100-plus-page report. The first-author group includes both scaling/modeling and infrastructure ownership; the final advisory group includes Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. PaLM is therefore a bridge: one end is the Transformer/LLM research line, the other is Google's data-center and TPU product line.
State of industry, compute, and data¶
The industrial background of April 2022 was the peak of the parameter race. Gopher 280B had appeared only months earlier, MT-NLG 530B had pushed dense parameter counts past the 500B headline, and DeepMind's Chinchilla would arrive in the same month to argue that token/parameter allocation was wrong. PaLM stands between those two stories: it displays Google's system scale with 540B parameters, but uses only 780B training tokens, about 1.44 tokens per parameter. In hindsight it is both astonishingly strong and visibly undertrained.
The hardware behind PaLM 540B was two TPU v4 Pods: 6,144 chips and 1,536 hosts. The paper reports 238.3K tokens/s average throughput, 46.2% model FLOPs utilization, 57.8% hardware FLOPs utilization, and cross-Pod gradient transfer bursts around 81 Tbps. Those numbers matter because PaLM's contribution is not just “spend more money on a bigger model.” It proves that pipeline-free training across Pods can reach nearly 2x weak scaling. Compared with MT-NLG's pipeline parallelism, PaLM chooses pod-level data parallelism plus within-Pod model/data sharding, reducing pipeline bubbles while moving pressure to the data-center network.
The data side has the same 2022 industrial character. The 780B tokens are not a public corpus; they are a Google-internal mixture: 50% multilingual social-media conversations, 27% multilingual filtered web pages, 13% English books, 5% GitHub code, 4% multilingual Wikipedia, and 1% English news. The paper trains for one epoch to avoid repeating subcorpora, while also acknowledging that some components begin to repeat after 780B tokens. That boundary, where the data is very large but not infinite, is precisely the starting point for Chinchilla and LLaMA's later rethinking of training allocation.
Background and Motivation¶
The real question PaLM wanted to answer¶
PaLM's central question is not simply “can we reach 540B parameters?” It asks three more specific things. First, does a dense decoder-only Transformer still benefit from scale after GPT-3? Second, can Google's Pathways system efficiently train one model across multiple TPU Pods? Third, do the resulting gains appear only on traditional NLP tasks, or also on harder surfaces such as reasoning, code, multilingual transfer, and BIG-bench?
The experimental design follows those questions. Three model scales, 8B, 62B, and 540B, are trained with the same data and vocabulary to observe log-linear and discontinuous improvements. Evaluation covers 29 English NLP tasks, MMLU, BIG-bench, reasoning, code, translation, multilingual generation, and multilingual QA. Memorization, dataset contamination, bias/toxicity, a model card, and a datasheet respond to the obvious criticism that scaling capability also scales risk.
Why Pathways is not decorative in the title¶
If you look only at the architecture, PaLM is intentionally plain: dense decoder-only Transformer, next-token prediction, 2048-token context, and standard few-shot evaluation. What makes the paper possible is that Pathways pushes this plain model to a scale that was previously hard to execute. In PaLM, Pathways does the following: one client dispatches a training batch to two TPU v4 Pods; inside each Pod, 12-way model parallelism and 256-way fully sharded data parallelism run the model; the two Pods compute gradients on their halves of the batch, exchange gradients through the data-center network, and synchronously apply identical updates.
The significance is that “training a large model” becomes more than a modeling script. It becomes a system problem involving data-center networking, compiler choices, schedulers, checkpointing, deterministic input pipelines, and evaluation infrastructure. PaLM's later influence is not only in its benchmark numbers but in this systems view: if LLMs are to become infrastructure, model, data, training system, and evaluation must be designed together. PaLM is Google's first complete public answer to that proposition.
Method Deep Dive¶
Overall framework¶
PaLM's method can be compressed into one sentence: use Google's Pathways system to train an engineered dense decoder-only Transformer at three scales, 8B, 62B, and 540B, on the same 780B-token corpus, then use a very broad evaluation surface to observe what scaling changes. There is no retrieval, no MoE routing, no new pretraining objective, and no RLHF. The training objective remains standard autoregressive language modeling: predict the next token from its prefix.
That simplicity is the point. PaLM wants to test whether dense Transformers still generate new capabilities when the architecture is stable, the training system is powerful, and the data is large. The paper trains three model sizes as controlled comparisons: 8B with 32 layers, 62B with 64 layers, and 540B with 118 layers. The three use the same training data, vocabulary, and most of the same recipe; scale and batch schedule are the main differences. That makes BIG-bench discontinuities, GSM8K chain-of-thought gains, and code sample efficiency easier to attribute to scale and systems rather than to an extra module.
| Model | Layers | Heads | \(d_{model}\) | Parameters | Batch schedule (sequences) |
|---|---|---|---|---|---|
| PaLM 8B | 32 | 16 | 4096 | 8.63B | 256 -> 512 |
| PaLM 62B | 64 | 32 | 8192 | 62.50B | 512 -> 1024 |
| PaLM 540B | 118 | 48 | 18432 | 540.35B | 512 -> 1024 -> 2048 |
Each attention head has dimension 256, and the feed-forward dimension is always \(4d_{model}\). PaLM 540B therefore has \(d_{ff}=73728\), making every layer's MLP and attention dominated by enormous matrix multiplications. The paper's practical optimization target is not a new block type; it is making those matrix multiplications run stably across thousands of TPU chips.
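The headline parameter count can be roughly reproduced from the table's dimensions. The sketch below is a back-of-envelope estimate under assumptions drawn from the recipe described later in this section (multi-query attention with a single shared key/value head, a three-matrix SwiGLU MLP at \(d_{ff}=4d_{model}\), tied input/output embeddings, no bias terms; tiny norm scales are ignored):

```python
def approx_palm_params(layers, heads, d_model, head_dim=256, vocab=262_144):
    """Back-of-envelope dense-parameter estimate for a PaLM-style block."""
    attn_dim = heads * head_dim
    attn = (d_model * attn_dim            # W_Q (one projection per query head)
            + 2 * d_model * head_dim      # W_K, W_V shared across heads (MQA)
            + attn_dim * d_model)         # W_O
    d_ff = 4 * d_model
    mlp = 3 * d_model * d_ff              # SwiGLU needs three matrices: W, V, W_2
    embed = vocab * d_model               # tied input/output embedding
    return layers * (attn + mlp) + embed

print(approx_palm_params(118, 48, 18432) / 1e9)  # ~540.5, vs the reported 540.35B
```

The estimate lands within 0.1% of the reported 540.35B, which is a useful sanity check that the table's dimensions and the architecture description are mutually consistent.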
Key Design 1: Dense decoder-only scaling on Pathways¶
Function: keep the model as a dense decoder-only Transformer so every token activates the same parameters, then use Pathways to split training across two TPU v4 Pods. This is deliberately conservative. It keeps PaLM's evaluation claim clean: if capabilities improve, the likely drivers are scale, data, and the training system, not MoE routing, a retrieval cache, or task-specific heads.
For dense Transformers, the training FLOPs approximation is:

\[ C \approx 6ND \]
where \(N\) is parameters and \(D\) is training tokens. PaLM 540B uses \(N=540.35\text{B}\) and \(D=780\text{B}\); the paper reports about \(2.5272\times10^{24}\) training FLOPs, or 2527.2 ZFLOPs. That is roughly 4.3x the compute of Chinchilla 70B/1.4T at 588 ZFLOPs, while PaLM's token/parameter ratio is only about 1.44. This is the root of its later reputation as powerful but not compute-optimal.
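The arithmetic behind those numbers is easy to reproduce under the \(C \approx 6ND\) approximation (ZFLOPs here means \(10^{21}\) FLOPs; the paper's reported 2527.2 ZFLOPs comes from a slightly more exact count):

```python
def train_flops(params, tokens):
    """Dense-Transformer training compute under the C ≈ 6ND approximation."""
    return 6 * params * tokens

palm = train_flops(540.35e9, 780e9)     # ≈ 2.53e24 FLOPs ≈ 2529 ZFLOPs
chinchilla = train_flops(70e9, 1.4e12)  # ≈ 5.88e23 FLOPs ≈ 588 ZFLOPs

print(palm / chinchilla)      # ≈ 4.3x Chinchilla's training budget
print(780e9 / 540.35e9)       # ≈ 1.44 tokens per parameter
```

Both headline ratios in the text fall out directly: PaLM spends roughly 4.3x Chinchilla's compute while training on only about 1.44 tokens per parameter.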
A simplified view of Pathways training is:
def palm_pathways_step(batch):
    # Two-way data parallelism at Pod granularity: each TPU v4 Pod computes
    # gradients for half the batch over its fully sharded model copy.
    pod_a_batch, pod_b_batch = split(batch, parts=2)
    grad_a = tpu_pod_forward_backward(pod_a_batch, shard="pod_a")
    grad_b = tpu_pod_forward_backward(pod_b_batch, shard="pod_b")
    # Bursty cross-Pod transfer over the data-center network:
    # grad_a_remote is what pod_a receives, i.e. grad_b, and vice versa.
    grad_a_remote, grad_b_remote = cross_pod_exchange(grad_a, grad_b)
    # Both Pods apply the identical summed gradient, keeping parameters in lockstep.
    apply_update(grad_a + grad_a_remote, shard="pod_a")
    apply_update(grad_b + grad_b_remote, shard="pod_b")
    assert bitwise_identical_parameters("pod_a", "pod_b")
The design motivation is straightforward. Pipeline parallelism creates bubbles, and micro-batches increase weight reload pressure. Pathways lets each Pod keep the full model-sharding layout internally, while the Pods participate in two-way data parallelism. The cost is bursty network traffic: the paper reports roughly 1.3 GB of gradients exchanged per host per step across 1,536 hosts, an aggregate burst near 81 Tbps. PaLM's systems contribution is exactly there: converting model scale into a data-center scheduling, compilation, and reproducibility problem.
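The reported numbers imply the scale of that burst. A rough sketch, assuming the full 1.3 GB per-host payload crosses the network every step (and taking GB as \(10^9\) bytes):

```python
hosts = 1536
per_host_bytes = 1.3e9   # reported cross-Pod gradient payload per host per step
burst_bps = 81e12        # reported aggregate burst rate, 81 Tbps

total_bits = hosts * per_host_bytes * 8   # ≈ 1.6e13 bits ≈ 16 Tb moved per step
burst_seconds = total_bits / burst_bps    # ≈ 0.2 s of near-peak load per step
print(total_bits / 1e12, burst_seconds)
```

Under these assumptions, each training step moves on the order of 16 terabits across Pods in a burst lasting roughly a fifth of a second, which is why the data-center network, not the TPU interconnect alone, becomes a first-class design constraint.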
| Training system | Hardware scale | Main parallel axis | Reported efficiency | Key risk |
|---|---|---|---|---|
| GPT-3 | V100 cluster | model/data parallel | 21.3% MFU | low GPU-cluster utilization |
| Gopher | 4096 TPU v3 | multi-Pod + pipeline | 32.5% MFU | pipeline bubbles |
| MT-NLG | 2240 A100 | tensor + pipeline | 30.2% MFU | cross-node pipeline complexity |
| PaLM 540B | 6144 TPU v4 | Pathways pod-level data parallel | 46.2% MFU / 57.8% HFU | 81 Tbps gradient burst |
Key Design 2: Parallel Transformer block and SwiGLU¶
PaLM's block is still a pre-norm decoder block, but it changes the standard serial formulation into a parallel one. A standard Transformer block can be written as:

\[ y = x + \mathrm{MLP}\big(\mathrm{LayerNorm}\big(x + \mathrm{Attention}(\mathrm{LayerNorm}(x))\big)\big) \]

PaLM uses the parallel formulation:

\[ y = x + \mathrm{MLP}(\mathrm{LayerNorm}(x)) + \mathrm{Attention}(\mathrm{LayerNorm}(x)) \]
The goal is not expressivity but training speed. The input matrix multiplications for attention and MLP can be fused more effectively, and the paper reports about 15% faster training at large scale. An 8B ablation shows slight quality degradation; a 62B ablation shows no degradation, so the authors extrapolate that the effect should be quality-neutral at 540B. That decision has an industrial flavor: not every small-scale ablation is perfect, but the team must decide before a 540B run that cannot be repeated casually.
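The difference between the serial and parallel formulations can be made concrete with scalar stand-ins for the sublayers (toy functions, not real attention or MLP):

```python
def serial_block(x, attn, mlp, norm):
    # y = x + MLP(LN(x + Attention(LN(x)))): the MLP sees the attention output.
    h = x + attn(norm(x))
    return h + mlp(norm(h))

def parallel_block(x, attn, mlp, norm):
    # y = x + Attention(LN(x)) + MLP(LN(x)): both branches read the same
    # normalized input, so their input matmuls can be fused into one.
    n = norm(x)
    return x + attn(n) + mlp(n)

# Toy scalar sublayers, just to show the two outputs genuinely differ.
attn = lambda v: 0.5 * v
mlp = lambda v: v * v
norm = lambda v: v

print(serial_block(2.0, attn, mlp, norm))    # 12.0: h = 3.0, then 3.0 + 9.0
print(parallel_block(2.0, attn, mlp, norm))  # 7.0:  2.0 + 1.0 + 4.0
```

The parallel form is a genuine change of function, not a rearrangement, which is why the paper needed the 8B/62B ablations before committing to it at 540B.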
SwiGLU is another quality/efficiency trade-off. PaLM's MLP uses Shazeer 2020's gated activation instead of ReLU or GELU:

\[ \mathrm{FFN}_{\mathrm{SwiGLU}}(x) = \big(\mathrm{Swish}(xW) \otimes xV\big)\,W_2 \]
It requires three matrix multiplications rather than two, but compute-equivalent experiments showed better quality. PaLM is one of the first flagship dense LLMs at the 500B scale to publicly adopt SwiGLU; later LLaMA, Mistral, Qwen, Gemma, and related open models made gated FFNs the default. The lesson is that PaLM's architectural “innovations” look small, but it stress-tested a set of small parts that later became standard LLM components.
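A minimal pure-Python sketch of the gated FFN (illustrative only; matrices here are plain nested lists rather than sharded TPU tensors):

```python
import math

def swish(x):
    """Swish-1 / SiLU: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def swiglu_ffn(x, W, V, W2):
    """FFN_SwiGLU(x) = (Swish(Wx) ⊙ Vx) W2 — three matmuls instead of two."""
    gate = [swish(g) for g in matvec(W, x)]         # gating branch
    linear = matvec(V, x)                           # linear branch
    hidden = [g * l for g, l in zip(gate, linear)]  # elementwise product
    return matvec(W2, hidden)

identity = [[1.0, 0.0], [0.0, 1.0]]
print(swiglu_ffn([1.0, 2.0], identity, identity, identity))
# [swish(1)*1, swish(2)*2] ≈ [0.7311, 3.5232]
```

The third matrix is where the extra cost comes from; the paper's claim is that at matched compute the gated form still wins on quality.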
| Design piece | Role in PaLM | Direct benefit | Later influence |
|---|---|---|---|
| SwiGLU | replaces ReLU/GELU MLP | better quality/compute trade-off | default FFN path in LLaMA/Mistral/Qwen/Gemma |
| Parallel block | parallel attention and MLP branches | about 15% training speedup at scale | GPT-J/PaLM-style block reused later |
| No bias | removes dense-kernel / LayerNorm bias | improved stability in large models | common simplification in LLM implementations |
| Shared embedding | ties input/output embeddings | saves parameters and stabilizes logits | common decoder-only LM practice |
Key Design 3: Multi-query attention, RoPE, and lossless vocabulary¶
Multi-query attention is PaLM's advance investment in inference cost. Standard multi-head attention stores separate key/value tensors for every head. Multi-query attention keeps multiple query heads but shares one key/value projection across heads. Training quality and speed are mostly neutral, but autoregressive decoding uses much less KV cache. For a 540B model, inference is not the paper's main benchmark target, yet it is already a systems bottleneck; PaLM ties “trainable” and “servable” together.
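The KV-cache saving is easy to quantify. A rough sketch with PaLM 540B's dimensions, under illustrative assumptions (2-byte bf16 cache entries and full 2048-token decoding; these are not reported serving numbers):

```python
def kv_cache_bytes(seq_len, layers, kv_heads, head_dim, bytes_per_value=2):
    """Per-sequence decode cache: keys + values for every layer and position."""
    return 2 * seq_len * layers * kv_heads * head_dim * bytes_per_value

mha = kv_cache_bytes(2048, 118, kv_heads=48, head_dim=256)  # one K/V per head
mqa = kv_cache_bytes(2048, 118, kv_heads=1, head_dim=256)   # shared K/V (MQA)
print(mha / 2**30, mqa / 2**30)  # ≈ 11.06 GiB vs ≈ 0.23 GiB per sequence
```

Sharing the key/value projection cuts the cache by the head count, 48x here, which is the difference between a cache that dwarfs the activations and one that fits comfortably alongside them.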
RoPE supplies positional information. PaLM uses rotary positional embeddings rather than absolute or relative embeddings, arguing that RoPE performs better on long sequence lengths. PaLM's training context is still 2048 tokens, not the 32K/128K regime of later models, but RoPE had already become the safer choice compared with learned absolute positions. LLaMA later inherited RoPE and turned it from a Google-internal choice into an open-LLM default.
The vocabulary design serves multilingual text and code. PaLM uses a 256k lossless SentencePiece vocabulary that preserves whitespace, splits out-of-vocabulary Unicode characters into UTF-8 bytes, and splits numbers into individual digit tokens. The last point is easy to miss, but the paper explicitly suspects digit-level tokenization may help GSM8K-like arithmetic tasks. Preserving whitespace also matters for code; a large multilingual vocabulary reduces tokenization damage outside high-resource English.
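The digit rule is the easiest to illustrate. The toy split below is a hypothetical illustration of the digit-by-digit behavior, not the actual SentencePiece model, and it shows why arithmetic prompts keep their place-value structure while whitespace survives for code:

```python
import re

def toy_digit_split(text):
    """Toy PaLM-style split: each digit is its own token; whitespace is
    preserved as tokens rather than discarded (which matters for code)."""
    return re.findall(r"\d|[^\d\s]+|\s+", text)

print(toy_digit_split("x = 123 + 45"))
# ['x', ' ', '=', ' ', '1', '2', '3', ' ', '+', ' ', '4', '5']
```

A tokenizer that instead merged "123" into one opaque token would force the model to memorize arithmetic over token identities rather than compose it over digits.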
| Sub-design | Concrete choice | Why it matters for PaLM | Typical consequence |
|---|---|---|---|
| Multi-query attention | many query heads share key/value | lowers 540B inference KV-cache cost | later serving stacks prioritize MQA/GQA |
| RoPE | rotary positional embedding | more robust than absolute/relative for longer sequences | inherited by LLaMA-style open models |
| 256k lossless vocab | preserves whitespace, UTF-8 bytes, digit-by-digit numbers | supports multilingual text, code, and arithmetic prompts | code/digit tasks suffer less tokenization damage |
| One-epoch data | 780B tokens with minimal repetition | reduces overfitting and repeated-corpus effects | exposes the data-supply boundary |
Key Design 4: Training stability, Adafactor, and z-loss¶
One of the most valuable parts of the PaLM recipe is its honesty about large-model instability. The 540B run encountered roughly 20 loss spikes at irregular intervals. The team did not find a principled fix; instead, it restarted from a checkpoint about 100 steps before the spike and skipped 200-500 batches. The paper also says those batches were not simply “bad data”: if the same surrounding batches were replayed from an earlier checkpoint, the spike did not necessarily recur. That implies loss spikes arise from interactions between specific data batches and specific parameter states, not from a simple filtering failure.
The optimizer is unfactorized Adafactor, effectively Adam with parameter scaling. The learning rate is \(10^{-2}\) for the first 10k steps, then decays as \(1/\sqrt{k}\), where \(k\) is the step number; global gradient clipping is 1.0; there is no pretraining dropout; sequence length is 2048; examples are concatenated into fixed-length chunks separated by an [eod] token. To stabilize the softmax, PaLM adds z-loss:

\[ L_{z} = 10^{-4} \cdot \log^{2} Z \]
where \(Z\) is the softmax normalizer. This small term discourages uncontrolled logit-scale drift and makes the huge-vocabulary softmax more stable. It is not the most memorable design in the paper, but it is one of the fuses that lets a 540B training run finish.
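In code the auxiliary term is one line. A minimal sketch (the log-sum-exp trick is omitted for clarity, so this toy version can overflow on large logits):

```python
import math

def z_loss(logits, coeff=1e-4):
    """Auxiliary loss coeff * log^2(Z), where Z = sum(exp(logit))."""
    z = sum(math.exp(l) for l in logits)
    return coeff * math.log(z) ** 2

# Logits whose softmax normalizer is already 1 contribute nothing; drifted
# logit scales are penalized quadratically in log-space.
print(z_loss([math.log(0.5), math.log(0.5)]))  # 0.0 up to float error (Z = 1)
print(z_loss([5.0, 5.0]))                      # penalizes Z = 2*e^5 >> 1
```

Because the penalty is on \(\log Z\) rather than on individual logits, it leaves the softmax distribution itself unchanged and only discourages the overall logit scale from drifting.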
| Stability component | PaLM setting | Problem addressed | Remaining issue |
|---|---|---|---|
| Adafactor parameter scaling | unfactorized Adafactor, \(10^{-2}\) then \(1/\sqrt{k}\) decay | steadier LR scale across matrices | does not eliminate loss spikes |
| Global grad clipping | norm 1.0 | limits abnormal gradients | about 20 spikes still occur |
| z-loss | \(10^{-4}\log^2 Z\) | stabilizes the softmax normalizer | empirical coefficient required |
| Deterministic pipeline | checkpoint replay is bitwise deterministic | makes spike localization and recovery possible | does not explain spike cause |
| Batch schedule | 1M -> 2M -> 4M tokens | balances early sample efficiency and late TPU efficiency | larger-batch generalization remains unclear |
Failed Baselines¶
The baselines PaLM beat at the time¶
PaLM's “failed baselines” should not be read as weak systems defeated by a clever trick. It beat the strongest large-language-model paradigms of 2020-2022: GPT-3's few-shot dense scaling, GLaM's sparse MoE, Gopher's DeepMind dense analysis, MT-NLG's GPU-pipeline giant model, LaMDA's dialogue model, and Codex's code-specialized model. PaLM's advantage does not come from a new objective. It comes from the sum of scale, data, systems, and a set of stable engineering choices.
| Baseline | What it represented | How PaLM won | Key caveat |
|---|---|---|---|
| GPT-3 175B | few-shot dense LM reference point | much higher average across 29 English NLP tasks; PaLM 62B also beats GPT-3 averages | GPT-3 is earlier, with different data/eval details |
| GLaM 64B/64E | Google's sparse MoE route | PaLM 540B refreshes GLaM results on most English NLP few-shot tasks | GLaM uses lower inference FLOPs and optimizes a different trade-off |
| Gopher 280B | DeepMind dense scaling analysis | PaLM is stronger in BIG-bench, several QA comparisons, and MMLU context | Chinchilla later shows Gopher was undertrained |
| MT-NLG 530B | one of the largest dense parameter counts | PaLM is better on most benchmarks at similar nominal parameter scale | data, system, and token count all differ |
| Codex 12B | code-specialized model | PaLM 540B reaches comparable HumanEval few-shot with 50x less Python code | Codex is code-specialized and API/contamination details are not fully knowable |
The shared lesson across these baselines is that the 2022 LLM frontier was no longer decided by one clever module. It was decided by who could jointly move data, hardware, model recipe, and evaluation infrastructure. PaLM's 540B dense model is not elegant, and from a Chinchilla perspective it is not economical, but it converts systems intensity directly into benchmark performance.
Failures the paper itself acknowledges¶
One of the most admirable parts of the PaLM paper is that it does not hide failures. The largest model encounters roughly 20 loss spikes during training. The team does not find a principled fix; it rewinds to a checkpoint before the spike and skips a few hundred batches. For a 540B model, that means stability is still empirical engineering: the model can be trained to completion, but the team does not fully understand why a particular step explodes.
Section 13 also raises the Chinchilla-style question directly: what if the same training budget were spent on a smaller model trained for more tokens? The authors mention possible alternatives such as a 62B model trained for 7T tokens, a 120B model for 3.6T tokens, or a 240B model for 1.8T tokens. PaLM does not run those experiments because full-scale ablations are expensive, data repetition appears, and batch size / TPU efficiency become difficult. That omission later becomes the center of Chinchilla and LLaMA.
| Exposed issue | How the paper describes it | Why it matters | Who continued the fix |
|---|---|---|---|
| loss spikes | about 20 irregular spikes, mitigated by rewind + skipped batches | large-model stability still relies on empirical recovery | later optimizer, norm, data-ordering, checkpoint strategies |
| token/parameter ratio | openly asks whether smaller models trained longer would be better | anticipates PaLM's non-compute-optimal allocation | Chinchilla, LLaMA, Mistral |
| repeated data | some subcorpora start repeating after 780B tokens | data is not an infinite resource | Chinchilla data allocation, LLaMA token-rich corpus |
| prompt variance | WebQuestions and other results vary across checkpoint/prompt choices | few-shot scores are not absolute one-point truths | HELM, lm-eval, multiple-seed evaluation |
Problems scale did not truly solve¶
PaLM is strong, but the paper's failure signals are clear. On BIG-bench, PaLM 540B exceeds average human performance in aggregate, yet average human performance is still higher on 35% of individual tasks. In BIG-bench Lite, tasks such as navigate, symbol interpretation, logic grid puzzle, tracking shuffled objects, and Chinese remainder theorem remain far from best-human performance. Scale unlocks some capabilities; it does not solve systematic reasoning, strict symbolic manipulation, or long-horizon state tracking in one shot.
Risk also does not disappear with scale. The memorization experiment shows that PaLM 540B exactly reproduces the 50-token continuation of random training spans 2.4% of the time, higher than 8B's 1.6%. Spans seen more than 500 times have over 40% memorization rate for 540B. Bias/toxicity analysis also shows religious, racial, and gendered stereotype associations in data and outputs, including Islam co-occurring with terms such as terrorist, violent, and radical. The paper stresses that these analyses are mostly English-centric and do not cover every language or cultural context.
Code is bounded too. PaLM-Coder reaches a high DeepFix compile rate, but the paper explicitly notes that compiling is not the same as being safe, robust, or correct. The student-C-program setting permits assumptions that would be undesirable in production. A model-generated patch may pass a small test suite and still contain subtle bugs or security issues. This point becomes even more important in the code-assistant era: pass@k is a capability metric, not a deployment guarantee.
The real anti-baseline lesson¶
Compressed into one sentence, PaLM's failure case is this: scale is a necessary but insufficient infrastructure variable. It can push models across capability thresholds, but it does not automatically produce optimal training allocation, explained stability, open reproducibility, safe behavior, or reliable reasoning. PaLM's historical value comes from showing both the power of scale and the blind spots of scale at the same time.
In hindsight, PaLM was not simply “beaten” by one later model. It was decomposed by multiple directions. Chinchilla decomposed it into a training-economics problem: too many parameters, too few tokens. InstructGPT and Flan-PaLM decomposed it into an interaction problem: a strong base model still needs instruction tuning and alignment. LLaMA decomposed it into an ecosystem problem: a closed giant model can be approached by smaller, better-trained open models. PaLM's failure is not uselessness; it is the clarity with which it exposed the next set of questions.
Key Experimental Data¶
Main experiment 1: English NLP and MMLU¶
PaLM's first hard result is the 29-task English NLP suite. The paper compares only pretrained single-checkpoint few-shot / one-shot results, excluding instruction tuning and multitask adaptation. PaLM 540B wins 24 of 29 tasks in the 1-shot setting and 28 of 29 in few-shot. On average NLG/NLU, PaLM 62B already surpasses GPT-3 175B, and PaLM 540B pulls further ahead.
| Model | Params | Training tokens | Avg NLG 1-shot | Avg NLU 1-shot | MMLU 5-shot |
|---|---|---|---|---|---|
| GPT-3 | 175B | 300B | 52.9 | 65.4 | 43.9 |
| GLaM | 64B/64E | 1.6T | 58.4 | 68.7 | - |
| PaLM 8B | 8.63B | 780B | 41.5 | 59.2 | - |
| PaLM 62B | 62.50B | 795B | 57.7 | 67.3 | 53.7 |
| PaLM 540B | 540.35B | 780B | 63.9 | 74.7 | 69.3 |
The MMLU number, 69.3, is the one people often remember because it places PaLM on the GPT-3 / Gopher / Chinchilla comparison line. But PaLM 62B is just as revealing: with roughly one-third of GPT-3's parameters, it approaches or beats GPT-3's average NLG/NLU results. PaLM's gain is therefore not “parameters only.” Better data, longer training, and stronger recipe matter too.
Main experiment 2: BIG-bench and discontinuous behavior¶
BIG-bench is the part of the PaLM paper that most strongly creates the feeling of capability jumps. PaLM 540B wins 44 of 58 common tasks against prior SOTA in 5-shot and exceeds average human aggregate performance. Across 150 textual tasks, roughly 25% have more than +10% discontinuity from 62B to 540B, and roughly 15% exceed +20%. These results were repeatedly cited by later work on emergent abilities and also triggered debate about metric scaling and nonlinear presentation.
| BIG-bench phenomenon | PaLM 8B | PaLM 62B | PaLM 540B | Interpretation |
|---|---|---|---|---|
| logical_sequence normalized score | 13 | 25 | 87 | 62B -> 540B jump far exceeds log-linear projection |
| english_proverbs | about 25 | - | 87 | abstract metaphor understanding jumps sharply |
| 58 common tasks | - | - | wins 44/58 | against GPT-3/Gopher/Chinchilla prior SOTA |
| aggregate human comparison | - | - | above average human | still loses to average humans on 35% of individual tasks |
This section is both PaLM's highlight and the source of later controversy. It convinced many researchers that sufficiently large models can suddenly do things smaller models cannot, but it also led evaluation researchers to ask how much of the discontinuity comes from real capability thresholds and how much comes from metric normalization, multiple-choice chance correction, or display choices. PaLM gave the phenomenon; later work had to unpack it.
Main experiment 3: reasoning and CoT¶
PaLM's reasoning results are inseparable from chain-of-thought prompting. On GSM8K, PaLM 540B without CoT reaches only 17%. With 8-shot CoT it reaches 54%. With CoT plus an external calculator it reaches 58%, surpassing the previous GPT-3 finetuning + CoT + calculator + verifier result of 55%. The numbers show two facts at once: scale makes intermediate reasoning chains more stable, but arithmetic is still not fully reliable, so a calculator remains useful.
| Setting | GSM8K accuracy | Reading |
|---|---|---|
| PaLM 540B without CoT | 17% | direct answering remains weak |
| PaLM 62B + CoT | 33% | CoT helps, but size is insufficient |
| PaLM 540B + CoT | 54% | scale and CoT combine into a jump |
| PaLM 540B + CoT + calculator | 58% | beats prior SOTA of 55% |
| GPT-3 finetune + CoT + calculator | 34% | far behind without a verifier |
| GPT-3 finetune + CoT + calculator + verifier | 55% | task-specialized system is matched by few-shot PaLM |
Across seven reasoning benchmarks, PaLM 540B + CoT reaches SOTA on GSM8K, MAWPS, SVAMP, and StrategyQA, and comes close to SOTA on ASDiv, AQuA, and CommonsenseQA. The paper also shows qualitative joke-explanation and logical-inference examples, which became part of the public debate about whether the model "understands." Strictly speaking, these are not quantitative evidence; historically, they provided the first vivid PaLM-era image of a model explaining its own answer.
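The "CoT plus calculator" setting can be pictured as a simple post-processing step: the model writes its intermediate arithmetic, and an external evaluator recomputes each step instead of trusting the model's digits. A toy sketch under our own assumptions; the regex and replacement scheme are illustrative, not PaLM's actual pipeline:

```python
import re

def recompute_arithmetic(cot_text: str) -> str:
    """Replace each 'a op b = c' step with an externally computed result,
    so arithmetic slips in a chain of thought do not propagate."""
    pattern = re.compile(
        r"(\d+(?:\.\d+)?)\s*([+\-*/])\s*(\d+(?:\.\d+)?)\s*=\s*\d+(?:\.\d+)?"
    )

    def fix(m):
        a, op, b = float(m.group(1)), m.group(2), float(m.group(3))
        result = {"+": a + b, "-": a - b, "*": a * b, "/": a / b}[op]
        out = int(result) if result == int(result) else result
        return f"{m.group(1)} {op} {m.group(3)} = {out}"

    return pattern.sub(fix, cot_text)

# A chain of thought with one arithmetic slip (16 * 3 is 48, not 46)
cot = "She bakes 16 * 3 = 46 muffins, then eats 4, leaving 48 - 4 = 44."
print(recompute_arithmetic(cot))
# -> She bakes 16 * 3 = 48 muffins, then eats 4, leaving 48 - 4 = 44.
```

The correct step is left untouched and only the slip is repaired, which matches the paper's observation that scale stabilizes the reasoning chain while raw arithmetic remains the weak link.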
Main experiment 4: code, multilingual, and risk data¶
PaLM's code results are often underrated. Only 5% of its pretraining mixture is GitHub code, amounting to 39B code tokens, of which about 2.7B are Python. Compared with Codex 12B's reported 100B Python tokens, PaLM reaches comparable HumanEval few-shot performance with far less Python-specific data. PaLM-Coder then finetunes on Python-heavy code and reaches 88.4 HumanEval pass@100 and an 82.1% DeepFix compile rate.
| Result category | PaLM 540B | PaLM-Coder 540B | Comparison / meaning |
|---|---|---|---|
| HumanEval pass@100 | 76.2 | 88.4 | Codex 12B is 72.3 |
| HumanEval pass@1 | 26.2 | 36.0 | pretraining-only approaches a code-specialized model |
| MBPP pass@1 | 36.8 | 47.0 | code finetuning remains valuable |
| DeepFix compile rate | 73.7 | 82.1 | prior work is 71.7 |
| Memorization exact continuation | 2.4% | - | 540B exceeds 8B's 1.6% |
| Toxic continuation risk | strongly tied to prompt toxicity | - | 62B and 540B toxicity curves are similar |
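The pass@1 and pass@100 figures above are conventionally computed with the unbiased estimator from the Codex paper: draw n >= k samples per problem, count the c correct ones, and estimate pass@k = 1 - C(n-c, k) / C(n, k). A minimal sketch with toy numbers of our own:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k given n samples with c correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Toy numbers (ours, not PaLM's): 200 samples per problem, 40 correct
print(round(pass_at_k(200, 40, 1), 3))  # 0.2 -- pass@1 reduces to c/n
print(pass_at_k(200, 40, 100))          # very close to 1.0
```

This is why pass@100 can sit so far above pass@1: even a modest per-sample success rate makes it overwhelmingly likely that at least one of 100 draws passes the tests.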
The multilingual results reveal another side. Although only about 22% of non-code tokens are non-English, PaLM already approaches or exceeds some specialized systems in translation, TyDiQA, and GEM generation. But the paper also acknowledges that non-English generation, non-Western fairness contexts, dialects, and code-switching are not adequately covered. PaLM is a strong generalist, but its generality is still constrained by training data, evaluation languages, and deployment context.
Idea Lineage¶
Mermaid Citation Graph¶
```mermaid
graph TD
    Transformer[Transformer 2017<br/>decoder-only LM substrate] --> GPT3[GPT-3 2020<br/>few-shot scaling]
    ScalingLaws[Kaplan Scaling Laws 2020<br/>parameter-first scaling] --> GPT3
    GLaM[GLaM 2021<br/>Google MoE and corpus line] --> PaLM[PaLM 2022<br/>540B dense LLM on Pathways]
    GPT3 --> PaLM
    LaMDA[LaMDA 2022<br/>dialogue data and safety line] --> PaLM
    Pathways[Pathways 2022<br/>cross-Pod ML system] --> PaLM
    CoT[Chain-of-Thought 2022<br/>reasoning prompt interface] --> PaLM
    PaLM --> Emergent[Emergent Abilities 2022<br/>scale jump debate]
    PaLM --> FlanPaLM[Flan-PaLM 2022<br/>instruction-tuned PaLM]
    PaLM --> MedPaLM[Med-PaLM 2022<br/>medical QA adaptation]
    PaLM --> PaLME[PaLM-E 2023<br/>embodied multimodal LM]
    PaLM --> PaLM2[PaLM 2 2023<br/>data and compute revision]
    Chinchilla[Chinchilla 2022<br/>compute-optimal allocation] -. critiques .-> PaLM
    Chinchilla --> LLaMA[LLaMA 2023<br/>token-rich open LLMs]
    PaLM2 --> Gemini[Gemini 2023<br/>Google multimodal successor]
    PaLME --> Gemini
```
Ancestors: from GPT-3 to Google's own scale stack¶
PaLM's ancestry is not a single line; it is a convergence of three lines. The first is GPT-3's decoder-only few-shot scaling: use natural-language prompts and a small number of examples to adapt a model temporarily, without finetuning every task. PaLM accepts that paradigm and scales it to a larger model, longer training, and a wider benchmark surface.
The second is Google's internal model and data line. T5 had shown the power of large-scale text-to-text pretraining and unified task formatting; GLaM had shown that Google's filtered corpus and MoE systems could support a trillion-parameter headline; LaMDA had brought dialogue, safety, and data documentation into Google's LLM workflow. PaLM does not appear from nowhere; it compresses those experiences back into a dense decoder-only model.
The third is the systems line: Mesh TensorFlow, GSPMD, T5X, Pathways, TPU v4 Pods, the XLA compiler, and data-center networking. Without that line, PaLM would be a larger GPT-3 replica. With it, PaLM becomes the public stress test of Google's training infrastructure. That is why the paper spends so much space on systems, data, risks, and appendices.
| Predecessor | What it passes to PaLM | PaLM's transformation | Later misreading |
|---|---|---|---|
| Transformer 2017 | decoder/self-attention substrate | 118-layer dense decoder-only LM | assuming the architecture itself is highly novel |
| GPT-3 2020 | few-shot prompting paradigm | larger dense model + broader evaluation | assuming PaLM is just GPT-3 scaled up |
| Kaplan 2020 | scaling-law optimism | pushes parameters to 540B | ignoring token-allocation problems |
| GLaM/LaMDA 2021-22 | Google corpus, evaluation, safety experience | returns from sparse/dialogue to dense generalist | ignoring internal system accumulation |
| Pathways 2022 | cross-Pod dataflow system | 6,144 TPU v4 pipeline-free training | treating Pathways as marketing copy |
Descendants: PaLM splits into three successor lines¶
PaLM's first afterlife is instruction-tuned PaLM. Flan-PaLM connects PaLM to FLAN-style instruction tuning and dramatically improves zero-shot/few-shot usability; Med-PaLM then brings Flan-PaLM, CoT, and self-consistency into medical question answering, becoming an early landmark for medical LLM evaluation. In this line, PaLM is the strong base model: not directly the user-facing system, but the substrate shaped by instruction tuning, domain prompting, and safety filtering.
The second afterlife is multimodal and embodied PaLM. PaLM-E injects visual and state inputs into a language model and uses PaLM as an embodied-reasoning backbone. PaLM-SayCan combines language-model planning with robot affordance models. RT-2 later connects PaLM-E / PaLI-X-style VLM backbones to robot action tokens. In this lineage, PaLM is not merely a chatbot; it is Google's attempt to turn language models into a general semantic and decision layer.
The third afterlife is Google's own successor line: PaLM 2 and Gemini. PaLM 2's technical report clearly absorbs the post-Chinchilla lesson, emphasizing data quality, multilinguality, reasoning, and compute-efficient training rather than another parameter-count headline. Gemini then pushes the PaLM / PaLM-E / Pathways systems line into native multimodality. PaLM is the last Google flagship centered primarily on dense text-LM scale before that transition.
Misreadings: PaLM is not proof that bigger parameters always win¶
The most common misreading of PaLM is to treat it as evidence that "540B parameters win." The paper is more complicated. It does show that dense LM scaling after GPT-3 still works, but it also raises the open question itself: would smaller models trained on more tokens do better? Chinchilla answered in the same month: under fixed training budgets, PaLM, Gopher, MT-NLG, and GPT-3 were all biased toward over-parameterization and under-training. PaLM is not the endpoint of parameter worship; it is the scene where the problem becomes visible.
The second misreading is “emergent abilities = mysterious emergence.” PaLM's BIG-bench discontinuities are striking, but later evaluation work reminds us that metric normalization, chance baselines, task selection, and plotting scale all affect the visual impression of a jump. The safer reading is that some tasks produce little measurable behavior at 8B/62B but become measurable at 540B. That need not imply a literal discrete switch inside the model.
The third misreading is that Pathways is just a branding term. The paper's details say otherwise: 6,144 TPU v4 chips, two Pods, 81 Tbps gradient bursts, 1.95x weak-scaling throughput, and a bitwise-deterministic pipeline. These are not decorative appendix details. They are the difference between PaLM and a generic "larger Transformer."
Historical position¶
Historically, PaLM is a hinge. It pushes post-GPT-3 dense few-shot scaling to Google's public limit, showing the field that further scale can unlock BIG-bench behavior, CoT reasoning, code, and multilingual capability. At the same time, it puts token scarcity, loss spikes, closed data, incomplete risk evaluation, and benchmark controversy on paper. It is both a victory for the old scaling paradigm and a prelude to the next one.
After 2023, the most successful LLMs no longer ask only “how many parameters?” The questions become: are there enough training tokens, how good is the data, is the model instruction-tuned, is there RLHF/RLAIF, what is inference cost, how long is the context, is it open, can it use tools, and is it multimodal? PaLM does not answer all of those questions, but it makes them unavoidable. The most useful thing to remember is not 540B itself; it is the transition from “can we scale?” to “what kind of scaling is worth doing?”
Modern Perspective¶
What still holds in 2026?¶
First, PaLM was right that dense decoder-only Transformers still had scaling room. After 2022, LLaMA, Mistral, Qwen, Gemma, DeepSeek, GPT-4-class systems, and Gemini-class systems all continued to show that, with sufficiently good data, training, and post-training recipes, decoder-only Transformers remain the most reliable foundation-model backbone. PaLM was not the last dense LLM; it was the industrial rehearsal before the dense/open LLM explosion.
Second, PaLM was right about the importance of systems engineering. In modern frontier training, the hard part is rarely writing a Transformer class. The hard parts are deduplication, tokenization, parallelism, checkpointing, fault recovery, optimizer stability, evaluation pipelines, serving cost, and safety monitoring. PaLM put those topics in the paper, making clear that an LLM is not a model file but an infrastructure stack.
Third, PaLM was right that CoT and scale interact. Later self-consistency, tool use, program-of-thought, verifiers, process reward models, and reasoning models all extend the same intuition in different ways: a large model's generation ability can organize intermediate computation, not just emit final answers. PaLM is not the endpoint of reasoning, but it helped make “write the reasoning process” a central interface for demonstrating LLM capability.
| 2022 PaLM claim | 2026 status | Why it still matters |
|---|---|---|
| dense Transformers still scale | true, but not by parameters alone | architecture remains stable; data/training/alignment are main variables |
| system efficiency decides trainability | even more true | training and inference are systems competitions |
| CoT amplifies reasoning ability | true, but needs verifiers/tools | intermediate generation is an interface, not a reliability guarantee |
| multilingual/code skills can emerge from a generalist | partly true | better data allocation and specialized evaluation are still needed |
Which assumptions no longer hold?¶
The most obvious broken assumption is parameter-first scaling. PaLM 540B has about 1.44 training tokens per parameter, which looks extremely low today. Chinchilla, LLaMA, Llama 2/3, Qwen, DeepSeek, and related models all show that training tokens, data quality, and inference economics matter more than headline parameter count. PaLM is not weak; its training economics still belong to the GPT-3/Kaplan era.
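The 1.44 tokens-per-parameter figure, and how far it sits from the post-Chinchilla regime, both fall out of the dense-Transformer approximation \(C \approx 6ND\). A back-of-the-envelope sketch; the ~20 tokens-per-parameter figure is the commonly quoted Chinchilla rule of thumb, not an exact law:

```python
# PaLM's reported scale
N = 540.35e9   # parameters
D = 780e9      # training tokens

tokens_per_param = D / N
compute_flops = 6 * N * D  # dense-Transformer approximation C ~= 6ND

print(f"tokens per parameter: {tokens_per_param:.2f}")  # ~1.44
print(f"training compute: {compute_flops:.2e} FLOPs")   # ~2.53e+24

# Under a ~20 tokens/param heuristic, the same compute would instead
# buy a much smaller model trained much longer:
# 6 * N_opt * (20 * N_opt) = C  =>  N_opt = sqrt(C / 120)
N_opt = (compute_flops / 120) ** 0.5
print(f"heuristic-optimal size at 20 tok/param: {N_opt / 1e9:.0f}B")
# ~145B parameters on ~2.9T tokens for the same budget
```

The point of the arithmetic is not the exact 145B figure but the order-of-magnitude gap: PaLM sits more than a factor of ten below the token-per-parameter regime that later models adopted.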
The second broken assumption is that a strong few-shot base model is enough as a product interface. The PaLM paper shows impressive few-shot ability, but after ChatGPT and InstructGPT the industry quickly learned that base models need instruction tuning, RLHF/RLAIF, safety policies, tool interfaces, and product-grade evaluation to become stable assistants. Flan-PaLM and Med-PaLM are direct corrections of that limitation.
The third broken assumption is that closed giant-model reports are sufficient for reproducible science. PaLM's weights, training data, and full training code were never released. External researchers can reproduce the idea but not the experiment. After 2023, LLaMA's release/leak, Mistral/Qwen/DeepSeek open models, and data efforts such as OpenWebText, FineWeb, and RedPajama shifted community attention from reading giant-model reports to actually training, finetuning, and evaluating models. PaLM remains scientifically valuable, but its reproducibility value is limited.
Lessons for training LLMs today¶
PaLM's first lesson for today is: systems bottlenecks belong in the paper. If a design requires thousand-chip training, reporting loss and benchmarks is not enough. A serious report should include throughput, MFU/HFU, fault recovery, parallelism, batch schedule, data repetition, energy, and serving constraints. PaLM is more transparent on these axes than many later model reports.
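Reported MFU can be sanity-checked from numbers already in this article: model FLOPs per token of roughly 6N, observed throughput, and aggregate peak hardware FLOPs. A rough sketch; the 275 TFLOP/s bf16 peak per TPU v4 chip is our assumption, and the paper's exact accounting also counts attention FLOPs, so this simple estimate lands slightly below the reported 46.2%:

```python
N = 540.35e9            # parameters
tokens_per_sec = 238.3e3
chips = 6144
peak_per_chip = 275e12  # assumed bf16 peak FLOP/s per TPU v4 chip

achieved = 6 * N * tokens_per_sec  # model FLOP/s actually delivered
available = chips * peak_per_chip  # aggregate peak FLOP/s
mfu = achieved / available
print(f"estimated MFU: {mfu:.1%}")  # ~45.7%, near the reported 46.2%
```

Being able to reproduce a headline efficiency number from three public quantities is exactly the kind of transparency the lesson above asks of training reports.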
The second lesson is: base-model benchmarks must be read alongside risk evaluation. PaLM reports BIG-bench, GSM8K, HumanEval, memorization, contamination, bias/toxicity, and a model card in one place. That structure became a common shape for frontier-model reports. The more general the model, the less acceptable it is to tell the story only through capability tables.
The third lesson is: a successful training run is not the same as an optimal training run. PaLM is successful but not compute-optimal; Chinchilla has a better allocation rule but is not a product assistant; LLaMA is more open and cheaper to infer but not a complete replacement for every closed frontier model. Modern LLM training must optimize training budget, inference budget, openness, data legality, safety, and maintainability at the same time. PaLM's value is that it put all of those dimensions on the table.
Limitations and Future Directions¶
Limitations of the paper itself¶
PaLM's first limitation is reproducibility. The model weights, training data, and full systems stack are not open, so external teams cannot strictly reproduce the experiments. Even with rich systems detail, 6,144 TPU v4 chips plus Google's data-center network are not an experimental condition available to academia or most companies.
The second limitation is training allocation. The paper already realizes that 540B/780B might not be the best budget allocation, but it does not run full-scale alternatives. In hindsight, this is the central gap: if Google had trained 70B/1.4T, 120B/3.6T, or 240B/1.8T models with the same data in 2022, LLM training history might have moved toward token-rich scaling sooner.
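Those hypothetical allocations are easy to place on the compute axis with \(C \approx 6ND\): two of them roughly match PaLM's actual budget, while 70B/1.4T is Chinchilla's much smaller budget. A quick check (the (N, D) pairs are this article's hypotheticals, not runs from the paper):

```python
def budget(n_params: float, n_tokens: float) -> float:
    """Dense-Transformer training-compute approximation C ~= 6ND."""
    return 6 * n_params * n_tokens

palm = budget(540.35e9, 780e9)  # ~2.53e24 FLOPs
alternatives = {
    "70B / 1.4T": budget(70e9, 1.4e12),    # Chinchilla's much smaller budget
    "120B / 3.6T": budget(120e9, 3.6e12),  # close to PaLM's budget
    "240B / 1.8T": budget(240e9, 1.8e12),  # close to PaLM's budget
}
for name, c in alternatives.items():
    print(f"{name}: {c:.2e} FLOPs ({c / palm:.2f}x PaLM)")
```

The comparison sharpens the counterfactual: the 120B and 240B variants would have been near-iso-FLOP reruns of PaLM with different token allocations, which is precisely the sweep the paper never got to run at full scale.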
The third limitation is evaluation and risk coverage. PaLM's risk analysis was unusually broad for its time, but it is still mostly English-centric and limited in identity axes. Multilingual bias, dialects, low-resource languages, real interactive safety, tool-use risk, and stronger privacy attacks are not adequately covered. BIG-bench emergent behavior also required later, stricter evaluation methodology to interpret.
| Limitation | Acknowledged in paper? | 2026 reading | Possible improvement |
|---|---|---|---|
| closed weights/data | partly | severely limits reproduction | release small models, data cards, logs, or reproducible runs |
| non-optimal token/parameter ratio | explicitly raised as open question | central training-economics gap | multi-ratio scaling sweeps |
| unexplained loss spikes | explicitly documented | still a large-model training problem | optimizer/norm/data-ordering research |
| English-centric risk analysis | explicitly acknowledged | far from enough coverage | multilingual and multicultural benchmarks/red-teaming |
| benchmark contamination | n-gram/manual analysis performed | still insufficient | canaries, provenance, dynamic evaluation |
Cautions when using this paper today¶
Do not treat every PaLM number as a perfectly comparable leaderboard point. GPT-3, Gopher, Chinchilla, GLaM, and MT-NLG use different data, prompts, shot counts, checkpoints, and cleaning strategies. PaLM makes serious comparison efforts, but the comparisons cannot be fully apples-to-apples. The robust conclusion is directional: PaLM refreshes many tasks under its evaluation framework, not that every single number permanently beats later systems.
Do not equate BIG-bench discontinuity with human-like understanding. PaLM's examples are compelling, but the model still fails badly on navigate, mathematical induction, symbol tracking, simple editing, and related tasks. The more accurate claim is that scale makes some behaviors measurable and some prompting formats stable; it does not guarantee robust reasoning.
Finally, do not mechanically copy PaLM's engineering recipe today. SwiGLU, RoPE, and MQA remain valuable, but 2026 LLMs also consider GQA, FlashAttention, long-context RoPE scaling, MoE, post-training, speculative decoding, KV-cache compression, tool calling, RLAIF, and data governance. PaLM is a baseline, not a complete modern recipe.
Related Work and Insights¶
Recommended reading path¶
The best way to read PaLM is to start with GPT-3, to understand the origin of few-shot dense LMs; then read Kaplan scaling laws, to understand why 2020-2022 teams favored parameter growth; then read PaLM, to see how Google pushed that route to 540B with Pathways; then read Chinchilla, to understand why PaLM-style parameter-first training is not compute-optimal; finally read Flan-PaLM, LLaMA, and PaLM 2/Gemini, to see how instruction tuning, openness, and Google's successor line revised the recipe.
| Reading order | Paper / system | What to watch for |
|---|---|---|
| 1 | GPT-3 (2020) | why few-shot prompting works at all |
| 2 | Scaling Laws (2020) | where the parameter-first prescription came from |
| 3 | PaLM (2022) | systems evidence for Pathways + 540B dense scaling |
| 4 | Chinchilla (2022) | why PaLM-like models are undertrained |
| 5 | Flan-PaLM / LLaMA / PaLM 2 | how successors revise data, instruction, openness, and training economics |
Relationship to neighboring awesome-papers¶
PaLM directly inherits from 2020_gpt3: it keeps GPT-3's few-shot evaluation language but expands the system and evaluation surface. PaLM and 2022_chinchilla form a same-month tension: PaLM says “continued dense scaling still works,” while Chinchilla says “but the allocation is not economical.” They should be read together; either one alone gives a distorted picture of 2022.
PaLM is also tightly linked to 2022_cot. CoT is the interface amplifier for PaLM's reasoning results: without CoT, GSM8K is 17%; with CoT, 54%; with a calculator, 58%. PaLM's relationship to 2023_llama is ecological inversion: LLaMA uses fewer parameters, more tokens, and open weights to bring part of the pressure created by closed giants back into the community.
Resources¶
Primary sources¶
| Resource | Link | Use |
|---|---|---|
| PaLM arXiv | https://arxiv.org/abs/2204.02311 | paper text, metadata, version history |
| Google Research blog | https://research.google/blog/pathways-language-model-palm-scaling-to-540-billion-parameters-for-breakthrough-performance/ | official launch framing, systems and result summary |
| ar5iv HTML | https://ar5iv.labs.arxiv.org/html/2204.02311 | searchable paper tables, sections, appendices |
| Pathways paper | https://arxiv.org/abs/2203.12533 | training-system background |
| T5X / SeqIO paper | https://arxiv.org/abs/2203.17189 | Google training framework context |
| Chinchilla paper | https://arxiv.org/abs/2203.15556 | training-economics counterpoint to PaLM |
🌐 Chinese version · 📚 awesome-papers project · CC-BY-NC