EfficientNet — Redefining CNN Efficiency via Compound Scaling¶
May 28, 2019. Google Brain's Tan & Le released EfficientNet (arXiv 1905.11946), accepted at ICML 2019. The most important paper in the history of CNN scaling: the first to systematically study how to scale up a model, it proposed the compound scaling principle that depth, width, and input resolution must be scaled simultaneously in fixed proportions, instead of tweaking one dimension at a time as before. NAS was used to search the baseline EfficientNet-B0 (5.3M params), and B1-B7 were generated via compound scaling. EfficientNet-B7 (66M params) achieves 84.4% top-1 on ImageNet, beating the prior SOTA GPipe (557M params) by 0.1% with 8.4× fewer params and 6.1× faster inference. EfficientNet dominated the 2019-2021 ImageNet leaderboards and spawned EfficientNetV2 / MobileNetV3 / ResNeSt / RegNet: the apex of the CNN line before the ViT era.
TL;DR¶
EfficientNet uses compound scaling (\(d=\alpha^\phi, w=\beta^\phi, r=\gamma^\phi\), constraint \(\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2\)) to scale depth / width / resolution at fixed proportions simultaneously, paired with the NAS-searched EfficientNet-B0 baseline (MBConv + SE blocks). From B0 (5.3M / 0.39G FLOPs / 77.3%) to B7 (66M / 37G FLOPs / 84.4%), it covers 8 SOTA models along the Pareto frontier.
Historical Context¶
What was the CNN scaling community stuck on in 2019?¶
2012-2018 ImageNet SOTA evolved: AlexNet (60M, 57.2%) → VGG (138M, 71.5%) → Inception v3 (24M, 78%) → ResNet-50 (25M, 76%) → ResNet-152 (60M, 78.6%) → ResNeXt-101 (84M, 80.5%) → SENet-154 (146M, 82.7%) → GPipe (557M, 84.3%). But all scaling was ad-hoc:
1. Deeper (ResNet 50→101→152): returns diminish quickly; gains drop +2 → +0.6 → +0.4
2. Wider (WideResNet widening ResNet): saturates quickly
3. Higher resolution (raising input image size): memory explodes and returns saturate
4. The three dimensions are interdependent, but no one had systematically studied the dependency
The community's open question: if all three dimensions grow simultaneously, in what proportion? GPipe's brute-force 557M-param scaling marked the limit of that approach, and it was extremely inefficient.
The 3 immediate predecessors that pushed EfficientNet out¶
- Sandler et al., 2018 (MobileNet v2) [CVPR]: proposed inverted residual + linear bottleneck (MBConv), EfficientNet's building block
- Tan et al., 2018 (MnasNet) [CVPR 2019]: authors' previous paper, NAS for mobile CNN; search space and method directly reused for EfficientNet
- Hu et al., 2017 (SE-Net) [CVPR 2018]: channel attention module embedded in EfficientNet's MBConv
What was the author team doing?¶
Both authors were at Google Brain. Mingxing Tan was the driving force of its NAS / efficient-CNN line (MnasNet, EfficientNet, EfficientNetV2, NoisyStudent, etc.); Quoc V. Le is a senior Google researcher (co-author of NAS, Seq2Seq, AutoAugment). Google Brain was betting on an "automated model design + systematic scaling" strategy at the time, and EfficientNet was its representative work.
State of industry, compute, data¶
- TPU: training B7 on 256 TPU v3 took ~5 days
- Data: ImageNet 1.28M training + 50k validation
- Frameworks: TensorFlow + in-house NAS framework (based on MnasNet)
- Industry: CV community fervent about "accuracy leaderboards"; Google / FAIR / NVIDIA competing for ImageNet SOTA
Method Deep Dive¶
Overall framework¶
Step 1: NAS Search
Goal: find best architecture under mobile FLOPs constraint
Search space: MBConv (mobile inverted bottleneck) + SE block
→ EfficientNet-B0 (5.3M params, 0.39B FLOPs, 77.3% top-1)
Step 2: Compound Scaling
Scaling rule: depth = α^φ, width = β^φ, resolution = γ^φ
s.t. α·β²·γ² ≈ 2 (so FLOPs ≈ 2^φ)
Grid search at small scale (φ=1) → α=1.2, β=1.1, γ=1.15
Generate B1, B2, ..., B7 by setting φ=1, 2, 3, ..., 7
| Model | φ | depth | width | resolution | params | FLOPs | top-1 |
|---|---|---|---|---|---|---|---|
| B0 | 0 | 1.0 | 1.0 | 224 | 5.3M | 0.39G | 77.3% |
| B1 | 1 | 1.2 | 1.1 | 240 | 7.8M | 0.70G | 79.2% |
| B2 | 2 | 1.4 | 1.2 | 260 | 9.2M | 1.0G | 80.3% |
| B3 | 3 | 1.7 | 1.3 | 300 | 12M | 1.8G | 81.6% |
| B4 | 4 | 2.0 | 1.4 | 380 | 19M | 4.2G | 82.9% |
| B5 | 5 | 2.4 | 1.6 | 456 | 30M | 9.9G | 83.6% |
| B6 | 6 | 2.8 | 1.8 | 528 | 43M | 19G | 84.0% |
| B7 | 7 | 3.2 | 2.0 | 600 | 66M | 37G | 84.4% |
Key designs¶
Design 1: Compound Scaling Formula — systematic CNN scaling¶
Function: turn "how to scale up a CNN" from ad-hoc empirical practice into a formula-guided procedure.
Core formula:
Define the baseline network's depth \(d_0\), width \(w_0\), resolution \(r_0\). The scaled model uses \(d = \alpha^\phi \cdot d_0\), \(w = \beta^\phi \cdot w_0\), \(r = \gamma^\phi \cdot r_0\).
Subject to: \(\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2\), with \(\alpha \ge 1,\ \beta \ge 1,\ \gamma \ge 1\).
Intuition: FLOPs proportional to depth \(d\), width squared \(w^2\), resolution squared \(r^2\), so FLOPs \(\propto d \cdot w^2 \cdot r^2 = (\alpha \beta^2 \gamma^2)^\phi \approx 2^\phi\). φ +1 → FLOPs doubled.
Grid search (paper Section 3.3): fix \(\phi = 1\) and grid-search a small space for the best \((\alpha, \beta, \gamma)\), finding \(\alpha = 1.2\), \(\beta = 1.1\), \(\gamma = 1.15\).
Verify: \(1.2 \cdot 1.1^2 \cdot 1.15^2 = 1.2 \cdot 1.21 \cdot 1.3225 = 1.92 \approx 2\) ✓
Then fix this ratio set, increase φ to generate B1, B2, ..., B7.
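A minimal sketch of the scaling rule in code (function and variable names are illustrative, not from the official repo). It reproduces the table's multipliers up to rounding; the released models additionally hand-tune the rounded values (e.g., B7 uses 600 px rather than 224·1.15⁷ ≈ 596):

```python
# Compound scaling sketch: base coefficients from the paper's grid search.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth / width / resolution bases

def compound_scale(phi: int, base_res: int = 224):
    """Return (depth_mult, width_mult, resolution, flops_mult) for a given phi."""
    d = ALPHA ** phi                          # layer-count multiplier
    w = BETA ** phi                           # channel-count multiplier
    r = int(round(base_res * GAMMA ** phi))   # input resolution in pixels
    flops = (ALPHA * BETA**2 * GAMMA**2) ** phi  # ≈ 2**phi, since 1.92 ≈ 2
    return d, w, r, flops

for phi in range(8):  # B0 .. B7
    d, w, r, f = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, res {r}, FLOPs x{f:.1f}")
```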
Comparison with single-dimension scaling:
| Scaling method | Setting | top-1 gain over B0 (77.3%) | FLOPs |
|---|---|---|---|
| width only | w = 2.0 | +2.1 (79.4) | 0.93G |
| depth only | d = 2.0 | +2.4 (79.7) | 0.84G |
| resolution only | r = 2.0 | +2.1 (79.4) | 1.30G |
| compound (B3, φ=3) | d = 1.7, w = 1.3, r = 1.5 | +4.3 (81.6) | 1.8G |
Compound scaling significantly wins under similar FLOPs.
Design 2: NAS-searched EfficientNet-B0 baseline — starting point determines ceiling¶
Function: use NAS to find the best architecture under a mobile-scale FLOPs constraint as the baseline, so that scaling does not start from a weak baseline.
Search method: based on MnasNet's NAS framework, with search objective
\(\text{ACC}(m) \times [\text{FLOPS}(m) / T]^{w}\)
where \(T = 400\text{M}\) is the target FLOPs and \(w = -0.07\) is the accuracy-FLOPs tradeoff factor.
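A minimal sketch of this reward with the stated constants (the function name is hypothetical):

```python
def nas_reward(accuracy: float, flops: float,
               target_flops: float = 400e6, w: float = -0.07) -> float:
    """MnasNet-style multi-objective reward: ACC(m) * (FLOPS(m) / T)^w.

    With w < 0, candidates above the FLOPs target are penalized smoothly;
    at flops == target_flops the reward equals the raw accuracy.
    """
    return accuracy * (flops / target_flops) ** w

# A 77%-accurate candidate at twice the FLOPs budget:
print(nas_reward(0.77, 800e6))  # ≈ 0.77 * 2**(-0.07) ≈ 0.733
```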
Search space (architectural hyperparameters):
- Convolutional ops: regular conv, depthwise conv, MBConv (mobile inverted bottleneck)
- Kernel size: 3 or 5
- Squeeze-and-Excitation ratio: 0 or 0.25
- Skip ops: pooling, identity, or none
- Channel size and number of layers per block
Searched EfficientNet-B0 architecture:
| Stage | Operator | Resolution | Channels | Layers |
|---|---|---|---|---|
| 1 | Conv 3×3 | 224×224 | 32 | 1 |
| 2 | MBConv1, k3×3 | 112×112 | 16 | 1 |
| 3 | MBConv6, k3×3 | 112×112 | 24 | 2 |
| 4 | MBConv6, k5×5 | 56×56 | 40 | 2 |
| 5 | MBConv6, k3×3 | 28×28 | 80 | 3 |
| 6 | MBConv6, k5×5 | 14×14 | 112 | 3 |
| 7 | MBConv6, k5×5 | 14×14 | 192 | 4 |
| 8 | MBConv6, k3×3 | 7×7 | 320 | 1 |
| 9 | Conv 1×1 + Pool + FC | 7×7 | 1280 | 1 |
Design 3: MBConv + SE Block — efficient building block¶
Function: each MBConv block (mobile inverted bottleneck convolution) consists of 4 steps, expand → depthwise → SE → project, plus a residual connection.
MBConv structure:
```python
import torch
import torch.nn as nn

class MBConv(nn.Module):
    """Mobile inverted bottleneck block with Squeeze-and-Excitation."""
    def __init__(self, in_ch, out_ch, kernel=3, stride=1,
                 expand_ratio=6, se_ratio=0.25):
        super().__init__()
        hidden_ch = in_ch * expand_ratio
        # Step 1: Expand (1x1 conv); skipped when expand_ratio == 1 (MBConv1)
        self.expand = nn.Sequential(
            nn.Conv2d(in_ch, hidden_ch, 1, bias=False),
            nn.BatchNorm2d(hidden_ch),
            nn.SiLU(),  # Swish/SiLU activation
        ) if expand_ratio != 1 else nn.Identity()
        # Step 2: Depthwise conv (groups == channels)
        self.depthwise = nn.Sequential(
            nn.Conv2d(hidden_ch, hidden_ch, kernel, stride=stride,
                      padding=kernel // 2, groups=hidden_ch, bias=False),
            nn.BatchNorm2d(hidden_ch),
            nn.SiLU(),
        )
        # Step 3: SE block (squeeze width based on in_ch, as in the paper)
        squeeze_ch = max(1, int(in_ch * se_ratio))
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(hidden_ch, squeeze_ch, 1),
            nn.SiLU(),
            nn.Conv2d(squeeze_ch, hidden_ch, 1),
            nn.Sigmoid(),
        )
        # Step 4: Project (1x1 conv, linear bottleneck: no activation)
        self.project = nn.Sequential(
            nn.Conv2d(hidden_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # Residual only when shapes match
        self.use_residual = (stride == 1 and in_ch == out_ch)

    def forward(self, x):
        out = self.expand(x)
        out = self.depthwise(out)
        out = out * self.se(out)  # channel attention, broadcast over HxW
        out = self.project(out)
        if self.use_residual:
            out = out + x  # inverted residual: skip in the narrow space
        return out
```
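Using the block above, the B0 backbone from the stage table can be assembled roughly as follows (a sketch: the per-stage strides are inferred from the resolution column, and the classifier head is omitted):

```python
# (expand_ratio, kernel, out_ch, layers, first_stride) per stage 2-8
B0_STAGES = [
    (1, 3, 16, 1, 1),
    (6, 3, 24, 2, 2),
    (6, 5, 40, 2, 2),
    (6, 3, 80, 3, 2),
    (6, 5, 112, 3, 1),
    (6, 5, 192, 4, 2),
    (6, 3, 320, 1, 1),
]

def build_b0_backbone() -> nn.Sequential:
    # Stage 1: stem conv, 224x224 -> 112x112
    layers = [nn.Sequential(
        nn.Conv2d(3, 32, 3, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(32), nn.SiLU())]
    in_ch = 32
    for expand, kernel, out_ch, n_layers, stride in B0_STAGES:
        for i in range(n_layers):
            # only the first block of a stage downsamples / changes width
            layers.append(MBConv(in_ch, out_ch, kernel,
                                 stride if i == 0 else 1, expand))
            in_ch = out_ch
    return nn.Sequential(*layers)

x = torch.randn(1, 3, 224, 224)
print(build_b0_backbone()(x).shape)  # torch.Size([1, 320, 7, 7])
```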
4 key building-block choices:
- MBConv: expand → depthwise → project, parameter-efficient
- SE block: channel attention, roughly +1% accuracy at negligible cost
- SiLU/Swish activation: \(\text{SiLU}(x) = x \cdot \sigma(x)\), smoother than ReLU
- Inverted residual: the shortcut connects the narrow bottleneck ends, while computation happens in the expanded space
Design 4: Training Augmentation — AutoAugment + Stochastic Depth + Dropout¶
Function: use various regularizations to prevent overfitting in large models.
Key regularization tricks:
| Trick | Effect |
|---|---|
| AutoAugment | NAS-searched data augmentation policy (rotation/shear/color jitter etc.) |
| Dropout | B0 0.2 → B7 0.5 (more dropout for larger models) |
| Stochastic Depth | randomly drop layers during training, more stable for deep nets |
| Label Smoothing | 0.1 |
| EMA weights | maintain an exponential moving average of the weights and use it at inference |
| Larger LR + warmup | adapts to TPU large batch |
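A minimal sketch of the stochastic-depth idea on a residual branch (the official TF code uses a similar per-sample mask under the name drop_connect; this standalone version is an illustrative assumption):

```python
def drop_connect(residual: torch.Tensor, survival_prob: float,
                 training: bool) -> torch.Tensor:
    """Drop the residual branch per sample with prob 1 - survival_prob.

    Rescaling by survival_prob keeps the expectation unchanged, so
    inference needs no correction and is a no-op here.
    """
    if not training or survival_prob == 1.0:
        return residual
    # one Bernoulli draw per sample, broadcast over C, H, W
    mask = torch.rand(residual.shape[0], 1, 1, 1,
                      device=residual.device) < survival_prob
    return residual * mask / survival_prob

# Inside MBConv.forward the residual add would become, e.g.:
#   out = x + drop_connect(out, survival_prob=0.8, training=self.training)
```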
Loss / training strategy¶
| Item | Config |
|---|---|
| Loss | Cross-entropy + label smoothing 0.1 |
| Optimizer | RMSprop with momentum 0.9, decay 0.9 |
| LR | 0.256, decay 0.97 every 2.4 epochs |
| Batch | 4096 (256 TPU v3) |
| Weight decay | 1e-5 |
| Dropout | 0.2 (B0) → 0.5 (B7) |
| Activation | SiLU/Swish |
| Training epochs | 350 |
| Augmentation | AutoAugment (later follow-ups switched to RandAugment) |
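For concreteness, the table's exponential decay works out to the following schedule (a sketch; the official code may apply the decay at step rather than epoch granularity):

```python
def lr_at_epoch(epoch: float, base_lr: float = 0.256,
                decay: float = 0.97, decay_epochs: float = 2.4) -> float:
    """LR 0.256, multiplied by 0.97 every 2.4 epochs (staircase decay)."""
    return base_lr * decay ** (epoch // decay_epochs)

print(lr_at_epoch(0), lr_at_epoch(350))  # 0.256 -> ~0.003 at end of training
```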
Failed Baselines¶
Opponents that lost to EfficientNet at the time¶
- GPipe (Google 2018, 557M params): ImageNet 84.3% → EfficientNet-B7 84.4% (66M params, 8.4× fewer)
- AmoebaNet-B (NAS-search) (135M, 83.5%) → B6 (43M, 84.0%, 3.1× fewer)
- ResNeXt-101 + SE (146M, 82.7%) → B5 (30M, 83.6%, 4.9× fewer)
- NASNet-A (89M, 82.7%) → B4 (19M, 82.9%, 4.7× fewer)
- ResNet-50 (26M, 76.0%) → B0 (5.3M, 77.3%, 5× fewer + 1.3% higher)
Failures / limits admitted in the paper¶
- NAS search expensive: B0 baseline NAS search cost thousands of TPU-hours
- Depthwise conv slow on GPU: in wall-clock terms EfficientNet is not necessarily faster than ResNet at inference (depthwise convs are poorly optimized on GPUs)
- Slow training: B7 trains 5 days on 256 TPU
- Large model OOM: B6/B7 inference on single V100 needs large memory
- Compound scaling coefficients fixed: α=1.2/β=1.1/γ=1.15 not necessarily optimal on different baselines
"Anti-baseline" lesson¶
- "Single-dimension scaling is enough" (ResNet-152, WideResNet): EfficientNet proved three-dimension synergistic scaling far better
- "Scale up is ad-hoc engineering": EfficientNet proposed systematic principle
- "Accuracy first" (GPipe route): EfficientNet brought Pareto frontier perspective to CV
- "Hand-designed architecture > NAS" (some SOTA belief): EfficientNet with NAS baseline + scaling fully wins
Key Experimental Numbers¶
ImageNet main experiment¶
| Model | Params | FLOPs | top-1 | top-5 |
|---|---|---|---|---|
| ResNet-50 | 26M | 4.1G | 76.0 | 93.0 |
| ResNet-152 | 60M | 11G | 78.6 | 94.3 |
| Inception-ResNet-v2 | 56M | 13G | 80.1 | 95.1 |
| ResNeXt-101 | 84M | 32G | 80.9 | 95.6 |
| AmoebaNet-A | 87M | 23G | 82.8 | 96.1 |
| AmoebaNet-C | 155M | 41G | 83.5 | 96.5 |
| GPipe | 557M | - | 84.3 | 97.0 |
| EfficientNet-B7 | 66M | 37G | 84.4 | 97.1 |
Single/dual/triple dimension scaling comparison (Section 3.3)¶
| Scaling | top-1 gain over B0 | FLOPs |
|---|---|---|
| width only (w = 2.0) | +2.1 (79.4) | 0.93G |
| depth only (d = 2.0) | +2.4 (79.7) | 0.84G |
| resolution only (r = 2.0) | +2.1 (79.4) | 1.30G |
| compound (B3) | +4.3 (81.6) | 1.8G |
Transfer learning (paper Table 5)¶
| Dataset | EfficientNet | Prior SOTA | Improvement |
|---|---|---|---|
| CIFAR-10 | 98.9 | 98.4 | +0.5 |
| CIFAR-100 | 91.7 | 89.3 | +2.4 |
| Birdsnap | 81.8 | 81.2 | +0.6 |
| Stanford Cars | 93.6 | 94.7 | -1.1 (slight loss) |
| Flowers | 98.8 | 97.7 | +1.1 |
| FGVC Aircraft | 92.9 | 92.9 | tie |
| Oxford Pets | 95.4 | 95.9 | -0.5 |
Key findings¶
- Compound scaling works at all 8 scales
- B7 beats GPipe with 8.4× fewer params: unprecedented parameter efficiency
- Transfer learning is also SOTA: beats prior art on 5 of 8 transfer datasets
- The NAS baseline matters more than hand design: a ResNet baseline + compound scaling trails the EfficientNet family
Idea Lineage¶
```mermaid
graph LR
    ResNet[ResNet 2015<br/>residual + depth] -.architectural base.-> EN
    MobileNetV1[MobileNet v1 2017<br/>depthwise sep] -.predecessor.-> EN
    MobileNetV2[MobileNet v2 2018<br/>inverted residual MBConv] -.direct predecessor.-> EN
    MnasNet[MnasNet 2018<br/>NAS for mobile] -.authors' prior work.-> EN
    SE[SE-Net 2017<br/>channel attention] -.building block.-> EN
    AutoAug[AutoAugment 2018<br/>Cubuk Le data augmentation] -.training trick.-> EN
    EN[EfficientNet 2019<br/>compound scaling + NAS]
    EN --> EfficientNetV2[EfficientNetV2 2021<br/>progressive learning + Fused-MBConv]
    EN --> MobileNetV3[MobileNet v3 2019<br/>NAS + h-swish]
    EN --> RegNet[RegNet 2020<br/>FAIR scaling laws]
    EN --> NoisyStudent[NoisyStudent 2019<br/>self-training + EfficientNet]
    EN --> ConvNeXt[ConvNeXt 2022<br/>modern CNN + Transformer borrowing]
    EN -.alternative.-> ViT[ViT 2020<br/>Transformer replaces CNN]
    EN -.industry.-> AutoML[Google AutoML]
```
Predecessors¶
- ResNet (2015): architectural foundation
- MobileNet v1/v2 (2017-2018): MBConv block source
- MnasNet (2018): authors' previous NAS for mobile
- SE-Net (2017): channel attention building block
- AutoAugment (2018): training augmentation
Successors¶
- EfficientNetV2 (2021): authors' own improvement, progressive learning + Fused-MBConv
- MobileNet v3 (2019): NAS searches smaller mobile models
- RegNet (Facebook 2020): another branch of scaling laws CNN
- NoisyStudent (2019): authors' use of EfficientNet + self-training to push ImageNet to 88.4%
- ConvNeXt (2022): modern CNN, borrowing from Transformer
- ViT era (2020+): Vision Transformer beats EfficientNet on large datasets
Misreadings¶
- "EfficientNet is the ultimate ImageNet solution": 2020+ ViT/Swin/MAE fully surpass
- "Compound scaling coefficients are universal": α/β/γ on different baselines need re-search
- "Pareto optimal = actual optimal": on GPU EfficientNet slower than ResNet (depthwise conv slow)
Modern Perspective (Looking Back from 2026)¶
Assumptions that don't hold up¶
- "CNN is the ultimate ImageNet architecture": ViT/Swin (2020+) fully surpass
- "Compound scaling is universal law": on ViT scaling proportions completely different
- "66M params is large model": today ViT-G 1.8B / SAM 1B
- "ImageNet 1.28M is reasonable benchmark size": today LAION-5B 5B images
- "Depthwise sep is mobile optimal": today MobileViT / EfficientFormer with attention + conv hybrid
What time validated as essential vs redundant¶
- Essential: compound scaling idea (borrowed by ViT/Swin), NAS baseline + scaling framework, SiLU/Swish activation, SE integration
- Redundant / misleading: α=1.2/β=1.1/γ=1.15 specific values (baseline-dependent), AutoAugment complex policy (replaced by RandAugment), EfficientNet B0 NAS search (v2 simplified to hand-designed)
Side effects the authors didn't anticipate¶
- Pareto frontier perspective entered mainstream: subsequent ImageNet papers must report params/FLOPs/accuracy 3-axis comparison
- NoisyStudent + EfficientNet pushed ImageNet to 88.4%: with self-training + 300M unlabeled data
- AutoML industry rise: Google AutoML / Vertex AI all based on NAS + EfficientNet ideas
- Changed how CV papers are written: where only top-1 used to be reported, the efficiency frontier became mandatory
- Ended by ViT era but ideas remain: Swin/SwinV2/CoAtNet all use compound scaling ideas
If we rewrote EfficientNet today¶
- Use EfficientNetV2's Fused-MBConv instead of MBConv (GPU-friendly)
- Use progressive learning (start small res, gradually scale up)
- Add ViT elements (CoAtNet-style conv + attention hybrid)
- Use ConvNeXt block instead of MBConv
- Default RandAugment instead of AutoAugment
- Scale data to ImageNet-21k pre-training
But the core principle "compound scaling synergizing three dimensions" remains the foundation of today's ViT/Swin/CoAtNet scaling.
Limitations and Outlook¶
Authors admitted¶
- NAS search expensive (thousands of TPU-hours)
- B7 training slow (5 days on 256 TPU)
- Depthwise conv slow on GPU
- α/β/γ coefficients baseline-dependent
- Memory limited (B7 inference needs large memory)
Found in retrospect¶
- Some transfer datasets slightly lose (Stanford Cars / Pets)
- GPU inference speed worse than theoretical FLOPs prediction
- Training hyperparameter sensitive
Improvement directions (validated by follow-ups)¶
- EfficientNetV2 (2021): Fused-MBConv + progressive learning
- NoisyStudent (2019): self-training pushed to 88.4%
- ConvNeXt (2022): modern CNN
- Transition to ViT/Swin (2020+)
Related Work and Inspiration¶
- vs single-dimension scaling (cross-paradigm): ResNet-50→152 scaled depth alone; EfficientNet scales three dimensions synergistically. Lesson: scaling is multi-dimensional, not single-knob tuning
- vs MnasNet (cross-scale): MnasNet only searches mobile models; EfficientNet searches a baseline and then scales it. Lesson: NAS on a small baseline + law-guided amplification is the efficient route
- vs GPipe (cross-parameter-efficiency): GPipe brute-forces 557M params; EfficientNet gets there with 66M. Lesson: the 8.4× param efficiency shows the leverage of architecture + scaling design
- vs ViT (cross-architecture): ViT beats EfficientNet on large data. Lesson: the CNN era ended, but scaling laws still guide ViT
- vs MobileNet v3 (cross-contemporary): MobileNet v3 applies NAS to mobile; EfficientNet applies NAS + scaling to the general setting. Lesson: NAS + scaling generalizes better than NAS alone
Related Resources¶
- 📄 arXiv 1905.11946 · ICML 2019
- 💻 Authors' TF implementation · PyTorch Image Models (timm)
- 🔗 HuggingFace EfficientNet
- 📚 Must-read follow-ups: EfficientNetV2 (2021), MobileNet v3 (2019), NoisyStudent (2019), ConvNeXt (2022)
- 🎬 Yannic Kilcher: EfficientNet paper review