Capsule Networks — Routing Parts into Wholes

On October 26, 2017, Sara Sabour, Nicholas Frosst, and Geoffrey Hinton uploaded arXiv 1710.09829. In the same year that Transformer made attention the center of sequence modeling, this paper asked a very different visual question: what if max-pooling was not a harmless convenience but a representational tax? Capsules replace scalar feature detectors with vectors, use vector length as existence probability, use vector direction as pose-like instantiation parameters, and route lower-level parts to higher-level wholes by agreement. The result was thrilling and awkward at once: 0.25% MNIST error and strong performance on heavily overlapping digits, but poor scaling to natural images. It became one of deep learning's most memorable alternate futures.

TL;DR

Sabour, Frosst, and Hinton's NeurIPS 2017 paper replaces CNN scalar feature detectors and max-pooling with vector capsules plus dynamic routing by agreement: each lower capsule predicts a parent pose with \(\hat{\mathbf{u}}_{j|i}=\mathbf{W}_{ij}\mathbf{u}_i\), routing weights are normalized as \(c_{ij}=\mathrm{softmax}(b_{ij})\), and agreement updates the logits by \(b_{ij}\leftarrow b_{ij}+\hat{\mathbf{u}}_{j|i}\cdot\mathbf{v}_j\). The pitch was not simply better MNIST. The pitch was a representational critique: pooling throws away pose, while capsules try to keep pose equivariant and use it for part-whole assignment. Empirically, CapsNet reached \(0.25\% \pm 0.005\) MNIST error with 8.2M parameters versus a 35.4M CNN baseline at 0.39%, and cut MultiMNIST error from 8.1% to 5.2% when digit bounding boxes overlapped by about 80%. Historically, however, it became a brilliant side branch: its routing behaved like an iterative visual cousin of Transformer-style attention, but it did not inherit Transformer's scaling and hardware advantages.


Historical Context

What did vision quietly assume in 2017?

By 2017, mainstream computer vision had become deeply CNN-shaped. AlexNet, VGG, Inception, and ResNet had made ImageNet the proving ground for deep convolutional networks; detection and segmentation were moving through Faster R-CNN and Mask R-CNN. The strengths of convolution were clear: local connectivity, weight sharing, translation equivariance, and pooling for local translation invariance. The engineering stack loved it too: GPU kernels were mature, benchmarks rewarded it, and the recipe scaled.

Hinton had long been dissatisfied with pooling. In his view, pooling was not a benign abstraction; it discarded pose. A strong eye detector, nose detector, and mouth detector do not by themselves prove that the parts form a valid face. CNNs can learn such relationships with depth and data, but capsules tried to put the relationship directly into the representation: a lower-level part should not only say "I exist" but also "I exist in this pose, so I predict the parent whole should have that pose."

This Hinton line did not appear out of nowhere

The roots of Capsule Networks predate 2017 by decades. In 1981, Hinton wrote about parallel shape representation and the binding problem. In 2000, Hinton, Ghahramani, and Teh framed image understanding as carving an input-dependent parse tree out of a fixed network. In 2011, Hinton, Krizhevsky, and Wang's Transforming Auto-Encoders already proposed capsules as groups of neurons carrying instantiation parameters, but that system still lacked a fully learned, end-to-end parent assignment mechanism.

The 2017 paper supplied that missing mechanism: routing-by-agreement. It turns the question "which whole should this part belong to?" into a small iterative optimization process, instead of letting local max-pooling perform a winner-take-all reduction. CapsNet is therefore not just a different nonlinearity; it inserts a soft parse-tree construction process into a CNN-like visual pipeline.

Why did it trigger such a strong reaction?

The deep learning community was being pulled in two directions: vision was driven by ResNet-style depth, while NLP had just seen the Transformer rewrite sequence modeling around attention. Capsule routing looked attention-like too: lower capsules allocate weights across multiple upper capsules, and those weights are refined from the current input. But it was more geometric than ordinary attention: agreement is a dot product between a predicted pose vector and the parent capsule's output, and the goal is for several part-level pose votes to converge on the same whole.

That gave the paper two kinds of force. It had the Hinton-style cognitive ambition: vision is not merely averaging local features; it is parsing parts, wholes, and pose. It also had hard benchmark hooks: 0.25% MNIST error, 5.2% MultiMNIST error, and 79% affNIST transfer accuracy. Those numbers made the community believe that there might be a real route outside the standard CNN trajectory.

Compute and data context

  • Hardware: GPU convolution was highly optimized; capsule routing required many small matrix multiplications plus iterative refinement, a much less friendly workload.
  • Data: MNIST, MultiMNIST, affNIST, and smallNORB highlighted pose and overlapping-object issues; CIFAR-10 exposed the difficulty of natural backgrounds and texture variation.
  • Frameworks: the paper used TensorFlow and default Adam; capsule tooling was nowhere near as mature as CNN tooling.
  • Competing paradigms: ResNet showed that "deeper and stable" was a scalable path; Spatial Transformer Networks showed that "normalize first, recognize later" could handle geometry; CapsNet bet on "do not erase pose, preserve it."

That is the historical position of Capsule Networks. It did not become the Transformer of vision, but it fixed an important question in place: invariance is not free. Whatever information a model throws away may later have to be repaid in compositional generalization, segmentation, or viewpoint change.


Method Deep Dive

Overall framework

CapsNet is intentionally shallow: two convolution-style feature stages followed by a fully connected DigitCaps layer. The novelty is not depth; it is what each unit outputs and how lower-level units send information upward.

28x28 image
  ↓ Conv1: 256 filters, 9x9, stride 1, ReLU
  ↓ PrimaryCaps: 32 channels × 6 × 6 capsules, each capsule is 8D
  ↓ Dynamic Routing: lower capsule votes for every digit capsule
  ↓ DigitCaps: 10 capsules, each capsule is 16D
  ↓ Margin loss on vector length + reconstruction regularizer
| Layer | Output | Capsule dimension | Role |
|---|---|---|---|
| Conv1 | 20×20×256 scalar features | 1D scalar | Low-level edge and stroke features |
| PrimaryCaps | 32×6×6 capsule outputs | 8D | Package local features into pose-like vectors |
| DigitCaps | 10 digit capsules | 16D | One capsule per class; length means class existence |
| Decoder | 3 fully connected layers | Reconstructs from 16D | Forces DigitCaps to preserve instantiation details |

The point is not that a 3-layer network wins MNIST. The point is that information normally blurred by pooling becomes a first-class object: position, direction, width, and local stroke shape live in vector orientation; class existence lives in vector length.
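The paper's reported budget of 8.2M parameters (6.8M without the reconstruction decoder) can be sanity-checked with simple arithmetic, assuming the usual shapes: PrimaryCaps as a 9×9 stride-2 convolution from 256 to 32×8 channels, one 16×8 matrix per (child, parent) pair, and decoder widths 512 and 1024. Biases are included; routing logits carry no learned parameters.

```python
# Hypothetical sanity check of the paper's parameter counts
# (layer shapes as assumed in the lead-in above).
conv1 = 9 * 9 * 1 * 256 + 256                     # 9x9 conv, 1 -> 256 channels
primary = 9 * 9 * 256 * 256 + 256                 # 9x9 stride-2 conv, 256 -> 32*8 channels
routing_W = (32 * 6 * 6) * 10 * 16 * 8            # one 16x8 matrix per (child, parent) pair
decoder = 160 * 512 + 512 + 512 * 1024 + 1024 + 1024 * 784 + 784

total = conv1 + primary + routing_W + decoder
print(f"with reconstruction:    {total / 1e6:.1f}M")               # ~8.2M
print(f"without reconstruction: {(total - decoder) / 1e6:.1f}M")   # ~6.8M
```

Both figures land on the paper's reported numbers, which suggests these assumed shapes are at least consistent with the published parameter counts.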

Key designs

Design 1: Vector capsules + squash nonlinearity — length for existence, direction for pose

A conventional CNN neuron emits one scalar: high means detected, low means absent. A capsule emits a vector: length near 1 means the entity exists, while orientation carries instantiation parameters such as pose, width, skew, and local shape.

The paper uses a squash function to constrain length while preserving direction:

\[ \mathbf{v}_j = \frac{\|\mathbf{s}_j\|^2}{1+\|\mathbf{s}_j\|^2}\frac{\mathbf{s}_j}{\|\mathbf{s}_j\|} \]

Short vectors shrink toward zero; long vectors shrink to just below one. This is more capsule-friendly than a sigmoid because it treats the whole vector as one entity state rather than compressing each coordinate independently.
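A minimal numpy sketch of the squash nonlinearity (the function name and the `eps` guard against division by zero are ours; the paper's implementation was TensorFlow):

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Shrink ||s|| into [0, 1) while preserving the direction of s."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    scale = sq_norm / (1.0 + sq_norm)           # length factor in [0, 1)
    return scale * s / np.sqrt(sq_norm + eps)   # rescaled unit vector

v = squash(np.array([3.0, 4.0]))                # input length 5
print(np.linalg.norm(v))                        # 25/26 ~ 0.9615, direction unchanged
```

Note how the whole vector is rescaled by one factor: unlike an elementwise sigmoid, squash cannot change the direction, only the length.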

| Representation | Output | Preserves pose? | Classification meaning | Trade-off |
|---|---|---|---|---|
| CNN scalar neuron | Single activation | Usually weakened by pooling | Larger activation means stronger feature | Efficient and hardware-friendly |
| Max-pooling | Local maximum | Loses precise local position | Local invariance | Blurry binding relations |
| Spatial Transformer | Transformed feature map | Normalizes first | Makes downstream recognition easier | Harder with multiple objects |
| Capsule vector | Vector length + orientation | Explicitly preserved | Length is existence probability | Expensive routing, complex implementation |

The design motivation is clear: if an object rotates, the ideal representation should not become completely invariant; it should be equivariant, changing in a predictable internal way. Capsules are counter-intuitive precisely here: they do not rush to erase pose variation; they keep it so upper layers can reason about part-whole geometry through transformation matrices.

Design 2: Prediction vectors — lower parts vote before upper wholes explain

Each lower capsule \(i\) does not directly pass its output to an upper capsule \(j\). It first uses a learnable matrix to predict what the parent should look like if the child belongs to it:

\[ \hat{\mathbf{u}}_{j|i}=\mathbf{W}_{ij}\mathbf{u}_i, \qquad \mathbf{s}_j = \sum_i c_{ij}\hat{\mathbf{u}}_{j|i} \]

\(\mathbf{W}_{ij}\) learns the part-to-whole geometry. For example, if an upper-left stroke capsule belongs to digit 7, its predicted pose for DigitCaps(7) should agree with other stroke predictions; if it is forced into DigitCaps(3), the predictions should conflict.
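A shape-level sketch of the vote computation, assuming the PrimaryCaps→DigitCaps sizes (1152 children of 8D, 10 parents of 16D); random weights stand in for the learned \(\mathbf{W}_{ij}\):

```python
import numpy as np

rng = np.random.default_rng(0)
num_children, num_parents = 32 * 6 * 6, 10      # PrimaryCaps -> DigitCaps
in_dim, out_dim = 8, 16

u = rng.normal(size=(num_children, in_dim))                         # child outputs u_i
W = rng.normal(size=(num_children, num_parents, out_dim, in_dim))   # learned W_ij

# u_hat[i, j] = W_ij @ u_i : every child predicts every parent's pose
u_hat = np.einsum('ijoh,ih->ijo', W, u)
print(u_hat.shape)  # (1152, 10, 16)
```

The 1152×10 grid of 16×8 matrices is exactly the "many small matrix multiplications" that makes capsule routing an awkward workload compared to one large convolution.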

This changes classification from "which feature is largest?" into "which parts jointly support the same whole explanation?" It is the most cognitively flavored part of the paper: recognition is not just voting for a label, but constructing an explanation graph.

Design 3: Dynamic routing — higher agreement means stronger coupling

Dynamic routing takes all prediction vectors as input and returns the upper capsule outputs. Initially, each lower capsule treats all possible parents equally. It then iteratively computes parent capsules, measures agreement, and updates routing logits.

\[ c_{ij}=\frac{\exp(b_{ij})}{\sum_k\exp(b_{ik})},\qquad \mathbf{s}_j=\sum_i c_{ij}\hat{\mathbf{u}}_{j|i},\qquad b_{ij}\leftarrow b_{ij}+\hat{\mathbf{u}}_{j|i}\cdot\mathbf{v}_j \]
| Variable | Meaning | Input-dependent? | Intuition |
|---|---|---|---|
| \(b_{ij}\) | Routing logit from capsule \(i\) to parent \(j\) | Yes, after iteration | "Who do I most likely belong to?" |
| \(c_{ij}\) | Coupling coefficient after softmax | Yes | Weight assigned to each parent |
| \(\hat{\mathbf{u}}_{j\vert i}\) | Child's prediction vector for parent | Yes (fixed during routing) | "If I belong to you, this is your pose" |
| \(\hat{\mathbf{u}}_{j\vert i}\cdot\mathbf{v}_j\) | Agreement between prediction and parent output | Yes | "Do the part votes meet on the same whole?" |

Routing pseudocode:

def dynamic_routing(votes, num_iterations=3):
    # votes[i, j]: child i's prediction vector u_hat_{j|i} for parent j
    logits = zeros_like_parent_scores(votes)        # b_ij, all zero: uniform coupling
    for _ in range(num_iterations):
        coupling = softmax(logits, axis="parents")                    # c_ij
        total_input = weighted_sum(coupling, votes, axis="children")  # s_j
        parent_output = squash(total_input)                           # v_j
        agreement = dot(votes, parent_output)       # u_hat_{j|i} . v_j, per (i, j)
        logits = logits + agreement                 # agreeing children couple harder
    return parent_output

Why use a dot product for agreement? Because capsule orientation is intended to be pose-like. The more aligned a prediction vector and a parent output are, the more consistent that part-to-whole explanation is. Routing turns that consistency into stronger coupling, so the child contributes more to the same parent on the next iteration.
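Assuming the same vote tensor layout, the routing loop can be written as runnable numpy (helper names and shapes are ours; the paper's implementation was TensorFlow):

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    sq = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def dynamic_routing(u_hat, num_iterations=3):
    """u_hat: (children, parents, dim) prediction vectors -> (parents, dim) outputs."""
    num_children, num_parents, _ = u_hat.shape
    b = np.zeros((num_children, num_parents))                 # logits: start uniform
    for _ in range(num_iterations):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # softmax over parents
        s = np.einsum('ij,ijd->jd', c, u_hat)                 # weighted sum over children
        v = squash(s)                                         # parent outputs
        b = b + np.einsum('ijd,jd->ij', u_hat, v)             # agreement update
    return v

v = dynamic_routing(np.random.default_rng(1).normal(size=(1152, 10, 16)))
print(v.shape)  # (10, 16); every output length stays below 1
```

Note that routing is pure inference: no gradients update \(b_{ij}\); the logits are recomputed from scratch for every input, which is exactly what makes the coupling input-dependent.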

Design 4: Margin loss + reconstruction regularization — classify by length, explain the image

Because class probability is represented by DigitCaps vector length, the paper does not use ordinary softmax classification. It applies a separate margin loss to every class capsule:

\[ L_k = T_k\max(0,m^+ - \|\mathbf{v}_k\|)^2 + \lambda(1-T_k)\max(0,\|\mathbf{v}_k\|-m^-)^2 \]

with \(m^+=0.9\), \(m^-=0.1\), and \(\lambda=0.5\). The target class capsule should have length close to one; absent classes should stay below 0.1.
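A minimal sketch of the per-class margin loss with the paper's constants (the function name and the batch/one-hot array layout are our assumptions):

```python
import numpy as np

def margin_loss(lengths, targets, m_pos=0.9, m_neg=0.1, lam=0.5):
    """lengths: (batch, classes) DigitCaps norms; targets: one-hot of same shape."""
    present = targets * np.maximum(0.0, m_pos - lengths) ** 2        # push true class above 0.9
    absent = lam * (1.0 - targets) * np.maximum(0.0, lengths - m_neg) ** 2  # push others below 0.1
    return (present + absent).sum(axis=1).mean()

lengths = np.array([[0.95, 0.05, 0.20]])   # one sample, three classes
targets = np.array([[1.0, 0.0, 0.0]])
print(margin_loss(lengths, targets))        # ~0.005: only the 0.20 capsule is penalized
```

Because each class capsule gets its own hinge term, the loss behaves like independent per-class detectors rather than a softmax that forces probabilities to compete; this is what lets two digit capsules be long at once on MultiMNIST.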

The reconstruction regularizer is another key detail. During training, the model keeps only the correct class capsule's 16D vector, masks the others, and reconstructs the input image through a 3-layer fully connected decoder. The reconstruction loss is multiplied by 0.0005 so it does not dominate the classification loss. This is not about making pretty images; it prevents DigitCaps from becoming a mere class switch and forces it to encode stroke width, skew, local shape, and other instantiation parameters.
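The masking step can be sketched as follows (helper name and shapes are our assumptions; the 3-layer decoder itself is omitted):

```python
import numpy as np

def mask_for_decoder(digit_caps, labels):
    """digit_caps: (batch, 10, 16); labels: (batch,) class indices.
    Zero every capsule except the true class, then flatten for the decoder."""
    batch = digit_caps.shape[0]
    mask = np.zeros(digit_caps.shape[:2])
    mask[np.arange(batch), labels] = 1.0
    return (digit_caps * mask[:, :, None]).reshape(batch, -1)   # (batch, 160)

x = mask_for_decoder(np.ones((2, 10, 16)), np.array([3, 7]))
print(x.shape, x.sum())   # (2, 160) 32.0 : only 16 entries survive per sample
```

Since the decoder only ever sees one capsule's 16 numbers, every recoverable detail of the input digit must be squeezed into that single vector; that is the representation pressure the regularizer applies.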

Loss / training strategy

| Item | Setting | Role |
|---|---|---|
| Main loss | Sum of margin losses over 10 DigitCaps | Use vector length for multi-label-style classification |
| Reconstruction loss | Squared image error × 0.0005 | Preserve pose and detail in the 16D capsule |
| Routing iterations | 3 in the main experiments | Balance capacity against overfitting |
| Optimizer | TensorFlow default Adam with decaying learning rate | Keep engineering simple |
| Data augmentation | MNIST shifts of up to 2 pixels only | Avoid winning via heavy augmentation |
| Parameters | CapsNet 8.2M; 6.8M without reconstruction; CNN baseline 35.4M | Fewer parameters, but a slower computation pattern |

The beauty of the method is that it rewrites recognition as explanation. Its fragility comes from the same place: once scenes contain backgrounds, textures, clutter, and scale variation, asking the model to explain everything becomes a burden.


Failed Baselines

Baselines that lost to CapsNet

CapsNet's wins did not come on ImageNet-scale natural images. They came on small tasks designed to expose part-whole binding. The losing baselines fall into three groups: convolutional baselines, weakened capsule variants, and earlier sequential-attention systems.

| Baseline | What it represents | Where it lost | Key number |
|---|---|---|---|
| 35.4M-parameter CNN | Strong conventional conv classifier | Pooling does not explicitly preserve pose and binding | MNIST 0.39% vs CapsNet 0.25% |
| MultiMNIST CNN | Wide conv + pooling + sigmoid multilabel classifier | Hard to assign pixel evidence between two overlapping digits | 8.1% vs CapsNet 5.2% |
| 1 routing iteration, no reconstruction | Routing nearly reduced to one soft assignment | Capsule vector not forced to preserve pose | MNIST 0.34%; MultiMNIST not reported |
| Spatial Transformer style | Normalize input first, then recognize | Hard to normalize multiple objects with different poses at once | Paper argues capsules handle multiple transformations simultaneously |
| Ba et al. 2014 sequential attention | Stepwise glimpse-based multi-object recognition | Evaluated with far less overlap | 5.0% at <4% bbox overlap; CapsNet approached this at 80% overlap |

The most important failed case is MultiMNIST. Two digits have about 80% average bounding-box overlap, so CNN pooling tends to mix local evidence. CapsNet reconstructs the two digits from the two most active DigitCaps one at a time, showing a coarse but real form of explaining away.

Key experimental data

The results support the capsule intuition while also forecasting its limits. MNIST and MultiMNIST are strong cases; CIFAR-10 is the warning sign.

| Task / setting | Model | Routing | Reconstruction | Result |
|---|---|---|---|---|
| MNIST | CNN baseline | – | – | 0.39% error |
| MNIST | CapsNet | 1 | no | 0.34% ± 0.032 error |
| MNIST | CapsNet | 1 | yes | 0.29% ± 0.011 error |
| MNIST | CapsNet | 3 | no | 0.35% ± 0.036 error |
| MNIST | CapsNet | 3 | yes | 0.25% ± 0.005 error |
| MultiMNIST | CNN baseline | – | – | 8.1% error |
| MultiMNIST | CapsNet | 3 | yes | 5.2% error |
| affNIST transfer | CapsNet vs CNN | 3 | yes | 79% vs 66% accuracy |

| Other dataset | Setting | Result |
|---|---|---|
| CIFAR-10 | 7-model ensemble, 24×24 patches, 3 routing iterations | 10.6% error |
| smallNORB | MNIST-like architecture, 48×48 resize, 32×32 crop | 2.7% error |
| SVHN | Smaller network, small training set of 73,257 images | 4.3% error |

What the ablations tell us

The details of Table 1 are more informative than the 0.25% headline. One routing iteration plus reconstruction improves MNIST from 0.34% to 0.29%. Three routing iterations without reconstruction gives 0.35%, not better than one iteration. Three routing iterations plus reconstruction reaches 0.25%. Dynamic routing is not acting alone; it needs the reconstruction regularizer to push capsule vectors toward instantiation parameters.

| Observation | Direct explanation | Deeper lesson |
|---|---|---|
| Reconstruction helps MNIST | The 16D DigitCaps must encode stroke details | Capsules need representation pressure, not just a routing algorithm |
| MultiMNIST drops to 5.2% | Routing can assign evidence between overlapping objects | Explaining away fits binding better than pooling |
| affNIST 79% vs 66% | Capsules are steadier under unseen affine transforms | Equivariant representation gives real transfer gains |
| CIFAR-10 10.6% | Background clutter makes "explain everything" hard | The generative bias becomes a burden on natural images |
| 3 routing iterations can overfit | More iterations add capacity | Iterative inference is not free compute |

Why those baselines were not fully displaced

CapsNet exposed a real CNN weakness, but it did not provide a scalable replacement path. First, routing is an awkward computation for accelerators: many small matrix multiplications, softmaxes, and iterative updates are far less efficient than large convolutions or large matrix multiplies. Second, the capsule prior is strong: it assumes at most one instance of an entity type at each local position and asks the model to explain all pixels; this is reasonable for clean digits but heavy for natural backgrounds and textures. Third, later large-scale vision models found a cruder but more scalable route: absorb geometric variation with more data, larger models, and flexible attention/convolution hybrids.

So CapsNet's failure is not that the idea was wrong. It is that the system did not scale. It proved that pose-aware representation and routing-by-agreement matter, but it did not prove that this mechanism can compound scale the way ResNet or Transformer did.


Idea Lineage

flowchart TD
    H1981[Hinton 1981\nBinding problem and shape representation] --> H2000[Learning to Parse Images 2000\nCarving a parse tree from a fixed network]
    H2000 --> TAE2011[Transforming Auto-Encoders 2011\nPose-bearing capsules]
    STN2015[Spatial Transformer Networks 2015\nNormalize first, recognize later] --> CAPS2017[Dynamic Routing Between Capsules 2017\nRouting by agreement]
    TAE2011 --> CAPS2017
    CNN2012[AlexNet / VGG / ResNet\nConvolution + pooling mainstream] --> CAPS2017
    CAPS2017 --> EM2018[Matrix Capsules with EM Routing 2018]
    CAPS2017 --> GNN2018[Graph / Text Capsule variants\nRouting as aggregation]
    CAPS2017 --> EQ2020[Equivariant networks\nPreserve predictable transformations]
    CAPS2017 --> ATTENTION[Routing-as-attention readings\nIterative soft assignment]
    CAPS2017 -. misread .-> HYPE["Capsules will replace CNNs:\nscaling promise overestimated"]

Before: from binding problem to Transforming Auto-Encoders

Capsules descend less from CNNs than from Hinton's long-standing obsession with the binding problem. Scalar neurons are good at saying "this pattern is here," but they are poor at naturally expressing "this eye, this nose, and this mouth belong to the same face and share one pose explanation." Pooling gives a model invariance, but it also discards the geometric evidence needed to build a whole from parts.

The 2000 paper Learning to parse images supplied an important metaphor: a parse tree need not allocate new memory dynamically; it can be carved out of a fixed network. The 2011 Transforming Auto-Encoders paper made capsules more concrete as pose-bearing representations: an object has a class and transformation parameters. The 2017 paper pushes that chain into a complete discriminative network, making routing responsible for assigning lower capsules to upper capsules.

After: the branches it left behind

CapsNet did not become the default vision architecture, but it left several clear branches.

| Follow-up direction | Representative work / phenomenon | What it inherited | What it changed |
|---|---|---|---|
| Matrix Capsules | EM Routing 2018 | Pose representation and part-whole agreement | Replaced vector dot products with pose matrices and EM-style updates |
| Graph/text capsules | Graph Capsule, TextCaps, and variants | Routing as dynamic aggregation | No longer insisted on visual geometric explanation |
| Equivariant networks | Group equivariant CNN, SE(3) Transformer | Representations should transform predictably | Used group/geometric constraints instead of routing |
| Attention readings | Routing-as-attention | Soft assignment and input-dependent weights | Dropped the iterative part-whole parse-tree commitment |
| Vision foundation models | ViT, ConvNeXt, SAM, and others | Composition matters | Used scale, data, and attention to absorb geometric complexity |

The durable legacy is not the exact routing algorithm, but two more abstract ideas. First, equivariance is more informative than invariance. Second, aggregation weights should depend on agreement in the current input. The first idea lives on in geometric deep learning; the second keeps reappearing as attention, routing, and MoE gating.

Misread: treating it as the CNN killer

The common misread of Capsule Networks was to treat it as the next CNN replacement. That was understandable in 2017: Hinton's name, 0.25% MNIST error, and attractive MultiMNIST reconstructions made a paradigm shift feel plausible. But the paper itself was more cautious. It compared capsule research to recurrent networks for speech recognition at the beginning of the century: representationally promising, but still needing many small insights before beating a mature engineered technology.

The better historical placement is this: CapsNet asked a very sharp question, built a beautiful first system, and failed to solve scaling. It did not abolish pooling or remove CNNs. It forced later work to ask a harder question: when a model buys invariance, what does it sacrifice? If pose, part ownership, and object explanation matter, should we use explicit routing, geometric equivariance, or large-scale attention and let the model discover the structure?


Modern Perspective

Assumptions that did not hold up

From 2026, the most valuable part of Capsule Networks is the question it asked. The weakest part is the system-level set of assumptions.

| 2017 assumption | Why it made sense | How it looks today | Consequence |
|---|---|---|---|
| Pooling is a fundamental CNN flaw | Pooling discards pose and binding information | The flaw is real, but attention, augmentation, scale, and equivariant design can partially compensate | Not enough to overthrow the CNN/ViT mainstream |
| Dynamic routing can scale | Routing is differentiable, end-to-end, and interpretable soft assignment | Iterative small-matrix computation is hardware-unfriendly and hard to stack deeply | It did not benefit from the large-model era |
| Explaining every pixel helps recognition | MNIST/MultiMNIST have clean backgrounds, so explanation pressure works | Natural images contain cluttered backgrounds; explaining everything can hurt classification | CIFAR-10 was not convincing enough |
| Explicit part-whole structure will naturally win | Human vision really is compositional | Benchmarks reward scalable training and data absorption | Beautiful idea, insufficient engineering leverage |

The key correction is: invariance has a cost, but explicit routing is not the only way to repay it. Modern vision models can preserve or recover some structure through self-attention, patch tokens, strong augmentation, contrastive learning, 3D/SE(3) equivariant constraints, or diffusion-style reconstruction objectives. The pain point identified by capsules remains; the prescription need not be the 2017 CapsNet.

If rewritten today

If Dynamic Routing Between Capsules were rewritten today, I would not simply replicate the 2017 architecture. I would keep three principles: vector or matrix representations, agreement-based assignment, and interpretable part-whole routing. The system design would change substantially.

| 2017 version | Possible 2026 version | Reason |
|---|---|---|
| Lower capsules vote for all parents via many small matrices | Implement routing with batched tensor contractions or attention kernels | Expose large matrices to hardware instead of fragmented multiplications |
| Fixed 3 routing iterations | Learned early-exit or one-step amortized routing | Avoid fixed iterative cost at every layer |
| MNIST/MultiMNIST as main evidence | ShapeNet, CLEVR, multi-object video, robotics manipulation | Better tests of 3D pose, occlusion, and compositional generalization |
| Image reconstruction from a 16D capsule | Combine with masked modeling or diffusion decoders | Stronger representation pressure and more stable training |

The concept worth preserving is local agreement. Today's MoE routers, slot attention, object-centric learning, and test-time refinement all revisit a version of the same question: which explanatory unit should receive which evidence from the input? CapsNet called the units parts and wholes; modern systems may call them token slots, experts, object queries, or latent variables.

CapsNet's limitations come in three layers. The engineering limitation is speed: routing is slow, tensor shapes are complex, and the computation does not map as naturally to accelerators as convolution or attention. The statistical limitation is domain fit: it works well on small, clean geometric tasks but struggles with backgrounds, textures, occlusion, and long-tail variation in natural images. The research-process limitation is reproducibility: capsules are conceptually attractive, but the community never converged on one stable, scalable training recipe.

| Category | Recommended reading | Why it matters |
|---|---|---|
| Original paper | Dynamic Routing Between Capsules | Source of vector capsules, routing, and margin loss |
| Predecessor idea | Transforming Auto-Encoders (2011) | Earlier pose-bearing capsule formulation |
| Direct successor | Matrix Capsules with EM Routing (2018) | Attempts to repair v1 with pose matrices and EM routing |
| Contrastive route | Spatial Transformer Networks (2015) | Another way to handle geometry: normalize instead of preserve |
| Modern neighbor | Slot Attention / object-centric learning | Modern route for dynamically assigning evidence to object slots |

Final historical judgment

Capsule Networks is a paper that did not win the mainline, but should not be forgotten. Its benchmark life was short, and its descendants never formed a scaling flywheel like ResNet or Transformer. But it caught a root problem in deep vision: recognition systems cannot only ask "is there a feature?" They also need to ask "how do these features compose into an object?"

In that sense, it is less a failed architecture than an intellectual wedge. It exposed the tension among pooling, invariance, pose, binding, and explanation, making it harder for later researchers to pretend those problems were absent. CapsNet was not the future of vision models, but it remains a key to understanding why that future is difficult.


🌐 Chinese version · 📚 awesome-papers project · CC-BY-NC