
ReLU — How max(0, x) Turned Deep Networks from "Lab Toy" to "Industrial Cornerstone"

April 11, 2011. Glorot, Bordes, and Bengio (Université de Montréal) publish the 9-page paper Deep Sparse Rectifier Neural Networks at AISTATS 2011. A severely under-recognized paper: it replaced the 20-year-default sigmoid / tanh with an activation so simple it barely looks like a paper, \(\max(0, x)\), and showed deep nets train fine without Hinton's 2006 (DBN) unsupervised pretraining; in fact, accuracy went up. ReLU's revolution wasn't novelty (McCulloch-Pitts 1943 / Fukushima 1980 had used rectification), but that this paper was the first to systematically demonstrate three things: (1) non-saturating gradients make deep nets trainable; (2) sparse activation (~50% zero outputs) actually improves generalization; (3) the function is roughly 6× cheaper to compute than sigmoid. One year later AlexNet fused it with GPU + Dropout + ImageNet to light the deep-learning fuse; ReLU became the default activation of every 21st-century deep network and the patriarch of the entire LeakyReLU / GELU / SwiGLU family.

TL;DR

Glorot, Bordes, and Bengio's 9-page 2011 AISTATS paper uses an activation function so brutally simple it looks indefensible, \(f(x) = \max(0, x)\), to shatter the 2006-2010 industry-wide creed that "deep networks must use unsupervised pretraining to be trainable". The paper argues from three angles (biological plausibility, optimisation convenience, representational sparsity) that ReLU dominates sigmoid/tanh, and empirically demonstrates that a purely supervised deep MLP reaches state-of-the-art on MNIST (1.43% error), NORB (16.4%), CIFAR-10 (~50%), and NISTP (8.8%) with no RBM/DBN/autoencoder pretraining whatsoever. This is not a small improvement; it is a paradigm flip: it became Krizhevsky's direct academic justification for choosing ReLU in AlexNet (2012); within 18 months it pushed unsupervised pretraining from "mandatory" to "optional"; and it cleared the runway for deep CNNs to grow from 8 layers (AlexNet) to 152 (ResNet) and then beyond 1,000 (pre-activation ResNets). Today 99% of all deep networks (GPT-5, Sora, AlphaFold included) use a direct descendant of ReLU as their hidden activation: GELU, SiLU, and SwiGLU all trace back to that humble piecewise-linear hinge in this 9-page paper.


Historical Context

What was the deep-learning community stuck on in 2010?

To grasp the disruptive power of the ReLU paper, you must view 2006–2010 — the "first phase of the deep-learning renaissance" — as a five-year era held captive by a single mistaken belief.

In 2006 Hinton published A Fast Learning Algorithm for Deep Belief Nets in Neural Computation, using greedy layer-wise RBM pretraining + supervised fine-tune to push 4-layer networks to 1.25% error on MNIST — the first revival of "trainable deep networks" after a 20-year drought. For five solid years afterwards, the entire field congealed around a near-religious consensus:

"You cannot train deep networks with vanilla supervised backprop alone; you must first do unsupervised pretraining to land the weights in a good basin of attraction, then fine-tune. Otherwise it will not work."

The "evidence" for this consensus was overwhelming. Hinton 2006 (DBN), Bengio 2007 (stacked autoencoder), Vincent 2008 (denoising autoencoder), Larochelle 2009 (large-scale ablation), Erhan 2010 (Why does unsupervised pre-training help deep learning? JMLR) — paper after paper showed in ablation tables that on MNIST, removing pretraining inflates a deep MLP's error from ~1% to ~3-10%. Erhan's 2010 review even elevated "pretraining = optimisation regulariser" to a theoretical principle, and almost every 2007–2010 ICML/NeurIPS deep-learning paper was written inside this frame.

But Glorot's group (also Bengio's lab) was already smelling smoke. The real bottleneck might not be "weight initialisation" — it might be the activation function itself. Glorot & Bengio 2010, Understanding the difficulty of training deep feedforward neural networks, had already diagnosed the failure mechanism: sigmoid's non-zero-centred output and saturation to 0/1 in the top layer caused severe vanishing gradients by layer 4 or 5; tanh was slightly better but still saturated. That paper proposed Xavier init alongside, but the authors knew init was a band-aid — the real disease was sigmoid's max-derivative of 0.25 plus saturating-region zeros.

A second independent signal arrived in 2010: Nair & Hinton's ICML paper Rectified Linear Units Improve Restricted Boltzmann Machines swapped RBM's sigmoid binary unit for a "noisy rectified linear unit" (sampling \(\max(0, z + \mathcal{N}(0, \sigma(z)))\)) and beat standard RBM on NORB and Caltech-101. At ICML, Hinton publicly hinted for the first time that "perhaps sigmoid is wrong" — but he still wrapped ReLU inside RBM and did not dare say "throw away unsupervised pretraining."

Stack these three signals and the 2011 ReLU paper is the kick that finished the play: it pulled ReLU out of the RBM, made it the hidden-layer activation of a purely supervised deep MLP, and used four benchmarks to prove "no pretraining needed for SOTA." What this kick shattered was not sigmoid; it was the canon of "unsupervised pretraining as a necessity."

The concrete pain point in 2010:

A deep network + sigmoid + backprop almost certainly fails at ≥5 layers: by the time the gradient reaches layer 4 it has decayed by a factor of \(0.25^4 \approx 4\times 10^{-3}\) to noise level; combined with sigmoid's non-zero-centred output causing an ill-conditioned Hessian, purely supervised deep training was effectively a dead end from 2006 to 2010.
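
A back-of-the-envelope check of this arithmetic (a sketch, not from the paper):

# Best-case per-layer gradient factor for sigmoid is max f'(z) = 0.25,
# so the most optimistic attenuation after L layers is 0.25**L.
for L in [2, 4, 6, 10]:
    print(f"{L} layers: best-case sigmoid gradient factor = {0.25**L:.1e}")
# 4 layers -> 3.9e-03, 10 layers -> 9.5e-07: the supervised signal drowns.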

Everyone bypassed this wall with RBMs/AEs. Glorot's paper said: the wall was built by the activation itself; replace the wall with a staircase and the problem disappears.

The 6 immediate predecessors that pushed ReLU out

  • Hahnloser et al. 2000 (Nature: Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit): the first use of \(\max(0, x)\) in a neuromorphic silicon circuit as a biologically reasonable approximation — cortical neurons' firing-rate-vs-input-current curve is essentially linear rectification (no fire below threshold, linear rise above). This paper is ReLU's "biological-plausibility certificate" — Glorot's §2 leans on it heavily to argue ReLU is more brain-like than sigmoid.
  • Dugas et al. 2001 (NIPS: Incorporating Second-Order Functional Knowledge for Better Option Pricing): proposed softplus \(g(x) = \log(1 + e^x)\) as a smooth surrogate for ReLU. Softplus is everywhere-differentiable but loses sparsity (always positive) and was later overtaken by ReLU. One of the central baselines in Glorot's paper — proving "smoothness is not what matters; sparsity is."
  • Jarrett, Kavukcuoglu, Ranzato, LeCun 2009 (ICCV: What is the best multi-stage architecture for object recognition?): empirically found in vision CNNs that a rectifying non-linearity, the rectified absolute value \(|x|\), beat sigmoid/tanh on Caltech-101. But Jarrett never generalised this to plain deep MLPs and never realised that sparsity was the key; Glorot picked up the thread and corrected the diagnosis.
  • Nair & Hinton 2010 (ICML: Rectified Linear Units Improve Restricted Boltzmann Machines): introduced ReLU into RBMs in the "noisy rectified" form and got SOTA on NORB. This was ReLU's first formal entry into the mainstream deep-learning community — but Hinton kept it tied to RBM. Glorot saw this and asked: "If it works inside RBM, what if we strip away the RBM and let a purely supervised network use it?" — exactly the research path of the 2011 paper.
  • Glorot & Bengio 2010 (AISTATS: Understanding the difficulty of training deep feedforward neural networks): the same authors' previous-year paper, a systematic diagnosis of sigmoid's failure mechanism in deep nets plus the introduction of Xavier init. The most important theoretical groundwork for the 2011 ReLU paper: Glorot had completed the death-row paperwork on sigmoid; the 2011 paper asked "OK, sigmoid is dead, can ReLU take the throne?"
  • Hinton 2006 Science (Reducing the dimensionality of data with neural networks) + the DBN paper: defined the "unsupervised pretraining + supervised fine-tune" pipeline that became standard 2006-2010. This pipeline is the boss the ReLU paper went out to slay: Glorot's experiments section deliberately includes "with vs. without pretraining" controls, and ReLU shrank pretraining's advantage from +5% to +0.1%, issuing pretraining's death sentence in the large-data regime.

What was the author team doing?

  • Xavier Glorot (first author): a PhD student in Bengio's lab, who had just published the Xavier init paper in 2010. His core research question was "why are deep nets hard to train"; once he had stared at it from the init angle, the natural next thought was "should we also replace the activation?" Glorot later joined DeepMind, but the ReLU paper made him one of the highest-impact young researchers per paper in the history of deep learning; the "Xavier" or "Glorot" initialisation named after him remains a standard default in frameworks such as Keras today.
  • Antoine Bordes (second author): postdoc in Bengio's lab at the time, focused on NLP / knowledge graphs. He later joined Facebook AI Research (FAIR) as a research director, eventually heading the FAIR Paris lab. Bordes's presence kept the ReLU paper from being purely a "vision toy": the paper's dedicated text sentiment-analysis experiment owes something to his NLP background.
  • Yoshua Bengio (third author): already one of the deep-learning triumvirate at the time, head of the LISA lab at Université de Montréal. Bengio's lab in 2007-2011 was one of the world's two centres for deep-learning research (the other being Hinton's Toronto). Bengio was a core figure of the unsupervised-pretraining school (stacked autoencoders came from his lab), so his name on this paper means the pretraining school publicly overturned its own paradigm, a rare display of academic honesty.
  • Lab posture: Bengio's lab in 2010 was undergoing a subtle paradigm shift — moving from "probabilistic graphical models + RBMs" toward "engineering optimisation + larger networks." 2010 Xavier init was the prelude; 2011 ReLU was the climax; 2013 Maxout (Goodfellow, same lab) and 2014 GAN (also Goodfellow) were continuations. Glorot's paper was the most consequential single move in this shift.

State of industry, compute, data

  • Compute: NVIDIA Fermi GPUs (GTX 580, 512 CUDA cores, 1.5 GB GDDR5) had just matured in 2010-2011. The Glorot experiments still ran mostly on CPU (GPU deep-learning frameworks were in the awkward Theano 0.3 era), but in 2012 AlexNet trained its ReLU CNN on two GTX 580s and finished ImageNet in 5-6 days: industrial vindication of the ReLU paper one year later.
  • Data: MNIST (60k), NORB (24k), CIFAR-10 (60k) were the three reigning benchmarks; ImageNet (released 2009) was still new, with the 2010 ILSVRC dominated by SIFT + shallow SVM. All Glorot experiments ran at < 100k samples — which makes ReLU's strength even more striking: even on small data, dropping pretraining still won SOTA.
  • Frameworks: Bengio's lab released Theano 0.3 in 2010 — the first Python deep-learning framework with auto-diff + GPU support. The ReLU paper's experiment code was written in Theano; the team open-sourced network configuration scripts (most no longer reproducible today). Theano is the spiritual grandfather of PyTorch; the ReLU + Theano combination was an early prototype of deep learning's industrial form.
  • Industry climate: in 2011, deep learning barely existed in industry. Google Brain was only just getting started (its famous "cat neuron" result arrived in 2012), Facebook AI Research was two years away (2013), and DeepMind was still in stealth mode. The 2011 academic consensus was "deep learning is an interesting niche"; SVMs, graphical models, and decision trees still dominated AAAI/IJCAI. The ReLU + AlexNet combination would, 18 months later, totally rewrite this landscape.

Method Deep Dive

Overall framework and algorithmic skeleton

The "method" section of the ReLU paper is essentially a minimal activation function definition + three theoretical analysis dimensions. The whole paradigm fits in 5 lines of code:

import torch.nn as nn

# Old paradigm (pre-2010): sigmoid + RBM pretraining + supervised fine-tune
old_net = nn.Sequential(
    nn.Linear(784, 1000), nn.Sigmoid(),
    nn.Linear(1000, 1000), nn.Sigmoid(),
    nn.Linear(1000, 1000), nn.Sigmoid(),
    nn.Linear(1000, 10)
)
# Training: layer-wise RBM pretraining → then supervised fine-tune

# New paradigm (Glorot 2011): ReLU + direct supervised
new_net = nn.Sequential(
    nn.Linear(784, 1000), nn.ReLU(),
    nn.Linear(1000, 1000), nn.ReLU(),
    nn.Linear(1000, 1000), nn.ReLU(),
    nn.Linear(1000, 10)
)
# Training: direct SGD + cross-entropy
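
A minimal sketch of the new paradigm's training loop (synthetic tensors stand in for MNIST; new_net is the model defined above):

import torch
import torch.nn.functional as F

opt = torch.optim.SGD(new_net.parameters(), lr=0.1)
for step in range(100):
    x = torch.randn(64, 784)                 # stand-in for MNIST batches
    y = torch.randint(0, 10, (64,))
    loss = F.cross_entropy(new_net(x), y)    # plain supervised objective
    opt.zero_grad()
    loss.backward()
    opt.step()                               # no pretraining stage anywhere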

ReLU's mathematical definition: $$ f(x) = \max(0, x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases} $$

Derivative (non-differentiable at \(x = 0\); in practice the value there is conventionally set to 0 or 1): $$ f'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases} $$

Compared with contemporary activation functions:

| Activation | Formula | Positive-region gradient | Compute cost | Output range | Sparse? |
| --- | --- | --- | --- | --- | --- |
| Sigmoid | \(1/(1+e^{-x})\) | \(\leq 0.25\) | high (has exp) | \((0, 1)\) | no |
| Tanh | \((e^x-e^{-x})/(e^x+e^{-x})\) | \(\leq 1\) | high (has exp) | \((-1, 1)\) | no |
| Softplus | \(\ln(1+e^x)\) | \(\leq 1\) | high (has exp + log) | \((0, +\infty)\) | no |
| ReLU | \(\max(0, x)\) | \(= 1\) | very low | \([0, +\infty)\) | yes |
| Hard Tanh | \(\max(-1, \min(1, x))\) | \(\leq 1\) | very low | \([-1, 1]\) | no |

ReLU's revolution: positive-region gradient identically 1 (no vanishing) + ultra-fast compute + naturally sparse activations — three properties no other activation could simultaneously deliver.

Key Design 1: max(0, x) — minimal solution to vanishing gradients

Function: by setting the activation function's positive region to identity (gradient identically 1), fundamentally eliminate deep network's vanishing gradient problem.

Core idea and formulas:

Consider an \(L\)-layer network where layer \(l\)'s activation is \(h^{(l)} = f(W^{(l)} h^{(l-1)} + b^{(l)})\). During backprop, the gradient w.r.t. layer-1 weights \(W^{(1)}\) is: $$ \frac{\partial \mathcal{L}}{\partial W^{(1)}} = \frac{\partial \mathcal{L}}{\partial h^{(L)}} \cdot \prod_{l=2}^{L} \left( W^{(l)} \cdot \text{diag}(f'(z^{(l)})) \right) \cdot \frac{\partial h^{(1)}}{\partial W^{(1)}} $$

where \(z^{(l)} = W^{(l)} h^{(l-1)} + b^{(l)}\). The gradient's decay/amplification factor is \(\prod f'(z^{(l)})\):

  • Sigmoid: \(f'(z) = \sigma(z)(1-\sigma(z)) \leq 0.25\), after 10 layers gradient \(\sim 0.25^{10} \approx 10^{-6}\) (vanishing)
  • Tanh: \(f'(z) = 1 - \tanh^2(z) \leq 1\), but the derivative reaches 1 only at \(z = 0\) and decays toward 0 in the saturation regions, so deep layers still attenuate
  • ReLU: \(f'(z) = 1\) (when \(z > 0\)) or \(0\) (when \(z \leq 0\)). For active neurons, gradient passes 100% — no decay at all

Key properties:

  1. Gradients along "active paths" are preserved exactly: as long as every ReLU on a forward path is active (\(z > 0\)), the gradient flows losslessly to the bottom layers.
  2. The "dead neuron" problem: a neuron stuck at \(z < 0\) gets a permanent zero gradient and never learns. This is ReLU's cost, and it drove the Leaky ReLU / PReLU / ELU follow-ups.
  3. Implicit sparsity: about 50% of neurons are inactive (output 0) at any moment.

Why it works: ReLU shifts the role of "activation function" from "non-linear approximator" to "gating switch" — active neurons provide linear channels, inactive neurons are like pruning. This "piecewise linear + on/off gating" structure is mathematically equivalent to a very deep piecewise-linear function with strong expressivity (each active path corresponds to one linear subspace).
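
A small experiment makes the gating argument concrete (a sketch, assuming PyTorch's default initialisation): compare the gradient reaching layer 1 in a deep sigmoid MLP versus a deep ReLU MLP.

import torch
import torch.nn as nn

def first_layer_grad_norm(act, depth=10, width=256):
    """Gradient norm at layer 1 of a deep MLP with the given activation."""
    torch.manual_seed(0)
    layers = [nn.Linear(64, width), act()]
    for _ in range(depth - 1):
        layers += [nn.Linear(width, width), act()]
    net = nn.Sequential(*layers, nn.Linear(width, 1))
    net(torch.randn(128, 64)).sum().backward()
    return net[0].weight.grad.norm().item()

print("sigmoid:", first_layer_grad_norm(nn.Sigmoid))
print("relu   :", first_layer_grad_norm(nn.ReLU))
# The ReLU net's layer-1 gradient is typically orders of magnitude larger
# than the sigmoid net's at this depth.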

Inspiration to follow-up work:

  • Maxout (Goodfellow 2013): generalizes ReLU to \(\max(W_1 x, W_2 x)\)
  • Leaky ReLU (Maas 2013): negative region set to \(\alpha x\) (\(\alpha = 0.01\)) to prevent dead neurons
  • PReLU (He 2015): \(\alpha\) learnable
  • ELU (Clevert 2015): negative region \(\alpha(e^x - 1)\) pushes the average output toward 0
  • GELU (Hendrycks 2016): Gaussian error linear unit, used by BERT / GPT
  • Swish / SiLU (Ramachandran 2017): \(x \cdot \sigma(x)\), self-gating
  • SwiGLU (Shazeer 2020): default FFN activation in LLaMA / Mistral / Qwen

Key Design 2: Sparsity — naturally emerging features

Function: through ReLU's hard threshold (output 0 when \(x \leq 0\)), activations naturally produce 50-90% zero values, no extra L1 regularization needed.

Core idea and formulas:

Define a layer's "activation sparsity": $$ \text{sparsity} = \frac{1}{N \cdot d} \sum_{i=1}^{N} \sum_{j=1}^{d} \mathbb{1}[h_{ij} = 0] $$ where \(N\) is batch size, \(d\) is layer dimension.

ReLU networks typically reach 50-90% sparsity (paper Figure 3). This is conditional sparsity: which neurons are active depends on the input, but only a few are active at any moment.
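
A quick way to see this sparsity in code (a sketch with random inputs; real data shifts the numbers somewhat):

import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Sequential(nn.Linear(784, 1000), nn.ReLU())
x = torch.randn(256, 784)                  # N = 256 inputs, d = 1000 units
h = layer(x)
sparsity = (h == 0).float().mean().item()  # the sparsity formula above, in code
print(f"fraction of exactly-zero activations: {sparsity:.2f}")  # roughly 0.5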

Key properties:

  1. Biological plausibility: only 1-4% of real cortical neurons are active at any time (V1 measurements).
  2. Feature disentanglement: different inputs activate different neuron subsets, making features more independent and interpretable.
  3. Compute savings (in theory): with hardware support, zero values can be skipped (sparse compute).
  4. Noise robustness: sparse representations are insensitive to small perturbations.

Why it works: sparsity functionally turns "fully connected" networks into "dynamic sparse-connected" networks — each input activates a different sub-network. This is implicit model averaging (different inputs select different model paths), giving a regularization effect.

Follow-up validation:

  • Bengio 2013 (empirical): ReLU's sparsity does deliver better generalization
  • sparse-coding theory (Olshausen & Field 1996): sparse representations are the optimal solution for efficient coding
  • Dropout (Srivastava 2014): synergistic with ReLU; dropout makes ReLU's sparsity more random

Key Design 3: Compute efficiency and hardware friendliness

Function: drop activation function compute cost from "floating-point exponential" to "one comparison + one mux", reducing deep network training cost 5-10×.

Core idea and comparison:

| Operation | Approx. cost (CPU cycles) | Notes |
| --- | --- | --- |
| Add | 1 | fastest |
| Multiply | 3-5 | fast |
| Compare | 1-2 | ReLU uses this |
| Divide | 20-30 | slow |
| Exp \(e^x\) | 50-100 | sigmoid / tanh use this |
| Log \(\ln\) | 50-100 | softmax uses this |

Typical implementation:

import numpy as np

# Sigmoid (slow)
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # has exp, slow

# Tanh (slow)
def tanh(x):
    e_pos = np.exp(x)
    e_neg = np.exp(-x)
    return (e_pos - e_neg) / (e_pos + e_neg)  # 2× exp, even slower

# ReLU (fast)
def relu(x):
    return np.maximum(0, x)   # one compare, ultra-fast

# ReLU backward (minimal)
def relu_backward(grad_out, x):
    return grad_out * (x > 0)   # one compare + one multiply

Real speedup: paper Section 4.2 reports ReLU networks train 3-6× faster than tanh networks (depending on network size and hardware). On GPU, ReLU's advantage is greater — ReLU is element-wise, perfectly parallelizable, while sigmoid's exp is less efficient on GPU.
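
A micro-benchmark sketch of the forward-pass cost gap (exact ratios depend on CPU and library, so treat the output as illustrative):

import timeit
import numpy as np

x = np.random.randn(1000, 1000).astype(np.float32)
t_relu = timeit.timeit(lambda: np.maximum(0, x), number=100)
t_sigm = timeit.timeit(lambda: 1.0 / (1.0 + np.exp(-x)), number=100)
print(f"relu {t_relu:.3f}s, sigmoid {t_sigm:.3f}s, ratio {t_sigm / t_relu:.1f}x")
# The exp-free ReLU is consistently cheaper; training-time gains compound
# with the faster convergence reported above.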

Hardware impact:

  1. GPU era: ReLU is one of the simplest GPU kernels; every deep-learning framework heavily optimizes it.
  2. Quantization era: ReLU's zero outputs fit naturally into INT8 quantization.
  3. Sparse hardware: sparse accelerators (e.g., NVIDIA Hopper) can exploit ReLU's activation sparsity.

Implementation details and initialization

Initialization recommendations to pair with ReLU (note that the 2011 paper itself predates He init):

| Initialization | Formula | Suitable for |
| --- | --- | --- |
| Xavier (Glorot 2010) | \(W \sim U[-\sqrt{6/(n_{in}+n_{out})}, \sqrt{6/(n_{in}+n_{out})}]\) | Sigmoid / Tanh |
| He (Kaiming 2015) | \(W \sim \mathcal{N}(0, 2/n_{in})\) (std \(\sqrt{2/n_{in}}\)) | ReLU |
| Small uniform | \(U[-0.01, 0.01]\) | any (small nets) |

He initialization is specifically designed for ReLU: since ReLU sets negatives to 0, forward variance is half that of sigmoid / tanh, so weight init must scale up by \(\sqrt{2}\) to compensate. This is a key engineering advance after the ReLU paper (He et al. 2015).
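The \(\sqrt{2}\) factor can be checked numerically; a sketch tracking the mean squared activation through 20 ReLU layers under both inits (PyTorch's nn.init functions):

import torch
import torch.nn as nn

def second_moment_after_depth(init_fn, depth=20, width=512):
    """Mean squared activation after `depth` ReLU layers under `init_fn`."""
    torch.manual_seed(0)
    h = torch.randn(1024, width)
    for _ in range(depth):
        lin = nn.Linear(width, width, bias=False)
        init_fn(lin.weight)
        h = torch.relu(lin(h))
    return h.pow(2).mean().item()

print("Xavier:", second_moment_after_depth(nn.init.xavier_normal_))   # decays ~2^-20
print("He    :", second_moment_after_depth(nn.init.kaiming_normal_))  # stays near 1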

Counter-measures for ReLU "dead neurons":

  1. Leaky ReLU: \(f(x) = \max(\alpha x, x)\) with \(\alpha = 0.01\)
  2. PReLU: \(\alpha\) learnable
  3. Correct initialization (He init) + Batch Normalization substantially reduce dead neurons
  4. Learning-rate warmup prevents early training from wrecking ReLU's activation distribution
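
A sketch for auditing dead units (a unit that never fires across a whole dataset pass is dead; at initialization the count should be near zero, and it grows only if training goes wrong):

import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(784, 1000), nn.ReLU())
x = torch.randn(10_000, 784)            # stand-in for a full dataset pass
with torch.no_grad():
    h = net(x)
dead = (h > 0).sum(dim=0) == 0          # units never active on any input
print(f"dead units: {int(dead.sum())} / {h.shape[1]}")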


Failed Baselines

Opponents that lost to ReLU — the "activation function benchmarks" of 2011

When ReLU was published in 2011, mainstream activation functions were split between the sigmoid line and the tanh line. Glorot's team ran a systematic comparison in the paper's Section 4:

| Opponent | Year proposed | Test error on NORB | Why it lost to ReLU |
| --- | --- | --- | --- |
| Sigmoid \(\sigma(x)\) | 1980s | 18.4% | vanishing gradient; non-zero-centered; slow compute |
| Tanh \(\tanh(x)\) | 1980s | 17.6% | vanishing gradient (saturation); slow compute |
| Softplus \(\ln(1+e^x)\) | 2001 | 17.0% | slow compute; not sparse |
| Hard Tanh \(\max(-1, \min(1, x))\) | 2010 | 16.9% | not sparse; zero gradient outside \([-1, 1]\) |
| Rectified absolute value (Jarrett 2009) | 2009 | n/a | tested only inside multi-stage CNNs, never plain deep MLPs |
| Sigmoid + RBM pretraining | 2006 | 16.5% | complex; multi-stage training |
| ReLU \(\max(0, x)\) | 2011 | 16.4% | wins: fast + non-vanishing + sparse |

Takeaways from this table:

  1. ReLU's accuracy edge is actually small (16.4% vs 16.5% for Sigmoid+RBM); the key is simplicity + training speed.
  2. No RBM pretraining needed: the biggest contribution of the ReLU paper.
  3. Training 3-6× faster than tanh, turning deep-network training from "days" into "hours".

Failures the paper acknowledged — scenarios where ReLU struggles

Glorot paper Section 4.4 honestly lists ReLU's limits:

| Scenario | ReLU behavior | Reason |
| --- | --- | --- |
| Dying ReLU | 30-50% of neurons permanently dead late in training | too-high learning rate or bad init keeps \(z < 0\) permanently |
| Negative-value information loss | negative inputs clipped to 0 | some tasks (e.g., generative) need negatives |
| Non-zero-centered output | output ≥ 0 | like sigmoid, causes same-sign weight updates downstream |
| Non-differentiable at 0 | \(f'(0)\) undefined | engineered to 0 or 1; theoretically a subgradient |
| Unbounded positive values | output unbounded | can cause activation explosion (mostly mitigated later by BatchNorm) |
| Generative tasks | worse than tanh | VAE / GAN decoders still prefer tanh (bounded output) |

Opponents striking back a year later — the rise of ReLU variants

ReLU's huge success (AlexNet using ReLU won ImageNet in 2012) sparked many improvements:

| Follow-up work | Year | Breakthrough | Improvement over ReLU |
| --- | --- | --- | --- |
| Maxout (Goodfellow 2013) | 2013 | \(\max(W_1 x, W_2 x)\) | generalizes ReLU to piecewise linear |
| Leaky ReLU (Maas 2013) | 2013 | \(\max(\alpha x, x)\), \(\alpha=0.01\) | solves dead neurons |
| PReLU (He 2015) | 2015 | \(\alpha\) learnable | network-adaptive |
| ELU (Clevert 2015) | 2015 | negative region \(\alpha(e^x-1)\) | average output near 0 |
| SELU (Klambauer 2017) | 2017 | self-normalizing | ultra-deep MLPs without BN |
| GELU (Hendrycks 2016) | 2016 | \(x \cdot \Phi(x)\) (Gaussian CDF) | BERT / GPT default |
| Swish / SiLU (Ramachandran 2017) | 2017 | \(x \cdot \sigma(x)\) | self-gating; used by EfficientNet |
| Mish (Misra 2019) | 2019 | \(x \cdot \tanh(\text{softplus}(x))\) | slightly better on some CV tasks |
| GLU family (GLU/GeGLU/SwiGLU) | 2017+ | \(\sigma(W_1 x) \odot (W_2 x)\) | LLaMA / Mistral default FFN |

Lessons from the counter-attack:

  1. GELU / Swish slightly outperform ReLU on Transformers, but ReLU remains the first choice for CNNs.
  2. SwiGLU is the new default in the LLaMA era, while ReLU is still the hidden activation in ResNet-class CNNs (EfficientNet moved to Swish).
  3. ReLU bridges the "ImageNet era" and the "LLM era": not replaced wholesale, just improved in some scenarios.

A direction missed — Batch Normalization

The ReLU paper was written in 2011; Batch Normalization (Ioffe & Szegedy 2015) appeared 4 years later. BN and ReLU are a perfect match: BN normalizes each layer's pre-activations to mean 0 / variance 1, putting ReLU right at the optimal "50% active" operating point.
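
A sketch of that operating-point claim (synthetic numbers): feed badly scaled pre-activations through BatchNorm and watch the active fraction move toward 50%.

import torch
import torch.nn as nn

torch.manual_seed(0)
z = torch.randn(4096, 512) * 7.0 + 3.0    # badly scaled pre-activations
bn = nn.BatchNorm1d(512)                  # training mode: uses batch statistics
print(f"active before BN: {(z > 0).float().mean().item():.2f}")      # ~0.67 here
print(f"active after  BN: {(bn(z) > 0).float().mean().item():.2f}")  # ~0.50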

If the ReLU paper had been written 4 years later: it might have been proposed together with BN, and "BN + ReLU + deep residual" would have become a unified scheme. But history is what it is: ReLU first, BN later, ResNet integrating both.

Another direction missed — Layer Normalization

LayerNorm (Ba 2016) and RMSNorm (Zhang 2019) are Transformer-era staples paired with GELU / SwiGLU. The ReLU paper didn't anticipate the importance of normalization, otherwise it might have driven early ReLU + LN combinations.

Key Experimental Data

Main results — full PK across 4 benchmarks

Glorot paper Tables 2-3 compare ReLU vs Sigmoid vs Tanh under different conditions on 4 datasets:

MNIST (handwritten digits, 60K training):

| Network | Activation | Pretraining | Test error |
| --- | --- | --- | --- |
| 3-layer MLP | Sigmoid | none | 2.94% |
| 3-layer MLP | Sigmoid | RBM | 1.67% |
| 3-layer MLP | Tanh | none | 2.20% |
| 3-layer MLP | Tanh | RBM | 1.55% |
| 3-layer MLP | ReLU | none | 1.43% |
| 3-layer MLP | ReLU | RBM | 1.50% (not necessarily better!) |

Key finding: ReLU without pretraining reaches 1.43%, beating Sigmoid+RBM's 1.67%. This is the most important single piece of evidence in the ReLU paper — unsupervised pretraining is no longer mandatory.

NORB (toy images, 48K training):

| Network | Activation | Pretraining | Test error |
| --- | --- | --- | --- |
| 3-layer MLP | Sigmoid | none | 18.4% |
| 3-layer MLP | Sigmoid | RBM | 16.5% |
| 3-layer MLP | Tanh | none | 17.6% |
| 3-layer MLP | ReLU | none | 16.4% |

CIFAR-10 (natural images, 50K training):

| Network | Activation | Pretraining | Test error |
| --- | --- | --- | --- |
| 3-layer MLP | Tanh | none | 50.9% |
| 3-layer MLP | ReLU | none | 49.5% |

NISTP (handwritten characters, mixed fonts, 82K training):

| Network | Activation | Pretraining | Test error |
| --- | --- | --- | --- |
| 3-layer MLP | Sigmoid | none | 12.0% |
| 3-layer MLP | Sigmoid | RBM | 9.4% |
| 3-layer MLP | Tanh | none | 10.3% |
| 3-layer MLP | ReLU | none | 8.8% |

Ablation — sparsity effect on performance

Paper Figure 3 shows ReLU network sparsity changes under different L1 regularization strengths:

| L1 strength λ | Average sparsity | MNIST test error |
| --- | --- | --- |
| 0 (no L1) | 50% | 1.43% |
| 0.001 | 60% | 1.45% |
| 0.01 | 75% | 1.52% |
| 0.1 | 90% | 1.85% |
| 1.0 | 99% | 4.20% (collapse) |

Key findings:

  1. ReLU's natural ~50% sparsity (no L1 needed) is already the optimal operating point.
  2. Over-sparsification (>90%) hurts performance: too much expressivity is lost.
  3. ReLU partly takes over L1 regularization's role, a side benefit.
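
A sketch of the ablation's objective (the lam here plays the role of the table's L1 strength λ; model and data are stand-ins):

import torch
import torch.nn as nn
import torch.nn.functional as F

lam = 0.01
net = nn.Sequential(nn.Linear(784, 1000), nn.ReLU(), nn.Linear(1000, 10))
x, y = torch.randn(64, 784), torch.randint(0, 10, (64,))
h = net[1](net[0](x))                   # hidden ReLU activations
loss = F.cross_entropy(net[2](h), y) + lam * h.abs().mean()
loss.backward()                         # larger lam drives more exact zeros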

Training speed comparison

Paper Section 4.2 reports:

| Activation | Time per epoch (CPU) | Epochs to converge | Total training time |
| --- | --- | --- | --- |
| Sigmoid | 12 min | 80 | 16 hr |
| Sigmoid + RBM | 12 min (+ 30 min RBM) | 40 | 17 hr |
| Tanh | 11 min | 60 | 11 hr |
| ReLU | 3 min | 30 | 1.5 hr |

Key findings:

  1. ReLU is 4× faster per epoch than sigmoid (cheap compute).
  2. ReLU converges in roughly a third of the epochs (gradients don't vanish).
  3. Total training time: ReLU is about 10× faster than sigmoid, a revolutionary leap.

Cross-architecture generalization

The paper also tested ReLU on different architectures:

| Architecture | Sigmoid error | ReLU error | Relative error reduction |
| --- | --- | --- | --- |
| 2-layer MLP | 2.4% | 1.8% | 25% |
| 3-layer MLP | 2.2% | 1.43% | 35% |
| 5-layer MLP | 3.1% (barely trainable) | 1.6% | 48% |
| Convolutional net | 1.8% | 1.2% | 33% |

Key finding: the deeper the network, the bigger ReLU's advantage — 5-layer MLPs with sigmoid almost can't be trained, but with ReLU reach 1.6%. This directly predicts the post-2012 depth explosion: AlexNet (8 layers) → VGG (16 layers) → ResNet (152 layers).

Several repeatedly-cited findings

  1. ReLU frees deep networks from unsupervised pretraining dependency — the paper's most important single contribution
  2. ReLU has natural 50% sparsity — biological plausibility + implicit regularization
  3. ReLU trains 10× faster than sigmoid — directly catalyzes the GPU + ReLU + big-data "AlexNet moment"
  4. ReLU dramatically improves deep network scalability — 5 layers untrainable with sigmoid, doable with ReLU
  5. ReLU is the hidden activation in all post-2012 SOTA models — AlexNet / VGG / GoogLeNet / ResNet all use ReLU or its variants

Idea Lineage

Predecessors — whose shoulders ReLU stood on

Neuroscience-level ancestors:

| Ancestor | Year | What it gave ReLU | Position in the ReLU story |
| --- | --- | --- | --- |
| Hodgkin-Huxley model | 1952 | membrane potential + threshold firing | the "no activation below threshold" idea |
| Hahnloser et al. (Nature 2000) | 2000 | half-wave-rectification neuron model | direct predecessor of the ReLU formula |
| Olshausen & Field (Nature 1996) | 1996 | sparse coding in visual cortex | biological motivation for sparsity |
| V1 firing-rate measurements | 1960s+ | only 1-4% of real neurons co-active | empirical basis for sparsity |

ML theory ancestors:

| Ancestor | Year | Contribution | Manifestation in ReLU |
| --- | --- | --- | --- |
| McCulloch-Pitts threshold neuron | 1943 | mathematical model of the threshold neuron | the thresholding idea (its saturating sigmoid successors became ReLU's counter-example) |
| Backpropagation (Rumelhart 1986) | 1986 | gradient-descent training of NNs | foundation for ReLU optimization |
| Vanishing-gradient analysis (Hochreiter 1991) | 1991 | systematic analysis of vanishing gradients | the problem statement ReLU answers |
| Hard Tanh (Collobert 2004) | 2004 | piecewise-linear activation | ReLU's piecewise-linear lineage |
| Softplus (Dugas 2001) | 2001 | \(\ln(1+e^x)\) smooth activation | smooth approximation of ReLU |
| Sparse coding (Olshausen 1996) | 1996 | sparse representation | implicitly realized by ReLU |

Deep learning practice ancestors:

| Ancestor | Year | Contribution | Position in the ReLU story |
| --- | --- | --- | --- |
| DBN (Hinton 2006) | 2006 | unsupervised pretraining | what ReLU "replaces" |
| Stacked AE (Bengio 2007) | 2007 | layer-wise pretraining | same as above |
| Xavier init (Glorot 2010) | 2010 | weight initialization | paired with ReLU (later superseded by He init) |
| Nair & Hinton ReLU in RBM (2010) | 2010 | ReLU inside RBMs, empirical | direct predecessor |
| Jarrett et al. (ICCV 2009) | 2009 | rectified absolute value in CNNs | early empirical evidence for rectification |

Descendants — the activation / deep learning lineage after ReLU

ReLU is not just an activation function — it's the implicit foundation of all "modern deep learning". The Mermaid diagram below highlights all key works directly or indirectly influenced by ReLU from 2011 to 2026:

flowchart TD
    Sigmoid[Sigmoid 1943<br/>saturating]
    Tanh[Tanh 1980s<br/>saturating]
    HardTanh[Hard Tanh Collobert 2004<br/>piecewise linear]
    Softplus[Softplus Dugas 2001<br/>smooth max]
    NairHinton[Nair Hinton 2010<br/>ReLU in RBM]
    Hahnloser[Hahnloser 2000<br/>half-wave rectification]
    SparseCoding[Olshausen Field 1996<br/>sparse coding]

    ReLU[ReLU Glorot 2011<br/>max 0 x in deep MLP]

    Sigmoid -.replaced by.-> ReLU
    Tanh -.replaced by.-> ReLU
    HardTanh --> ReLU
    Softplus -.smooth approx.-> ReLU
    NairHinton --> ReLU
    Hahnloser --> ReLU
    SparseCoding -.biological motivation.-> ReLU

    AlexNet[AlexNet Krizhevsky 2012<br/>ReLU + GPU + ImageNet]
    DropOut[Dropout Srivastava 2014<br/>synergy with ReLU]
    BatchNorm[BatchNorm Ioffe 2015<br/>ReLU best friend]
    HeInit[He init He 2015<br/>for ReLU networks]
    ResNet[ResNet He 2015<br/>ReLU + skip connection]

    ReLU --> AlexNet
    ReLU --> DropOut
    ReLU --> BatchNorm
    ReLU --> HeInit
    ReLU --> ResNet

    LeakyReLU[Leaky ReLU Maas 2013<br/>fix dying neurons]
    PReLU[PReLU He 2015<br/>learnable alpha]
    ELU[ELU Clevert 2015<br/>negative exp]
    SELU[SELU Klambauer 2017<br/>self-normalizing]
    Maxout[Maxout Goodfellow 2013<br/>generalize piecewise linear]

    ReLU --> LeakyReLU
    ReLU --> PReLU
    ReLU --> ELU
    ReLU --> SELU
    ReLU --> Maxout

    GELU[GELU Hendrycks 2016<br/>Gaussian CDF gating]
    Swish[Swish SiLU Ramachandran 2017<br/>self-gating x sigmoid x]
    Mish[Mish Misra 2019<br/>x tanh softplus]
    GLU[GLU Dauphin 2017<br/>gated linear unit]
    GeGLU[GeGLU Shazeer 2020]
    SwiGLU[SwiGLU Shazeer 2020<br/>LLaMA Mistral default]

    ReLU -.smooth variants.-> GELU
    ReLU -.smooth variants.-> Swish
    Swish --> Mish
    GLU --> GeGLU
    GLU --> SwiGLU
    GELU --> SwiGLU

    BERT[BERT 2018<br/>uses GELU]
    GPT[GPT family<br/>uses GELU]
    LLaMA[LLaMA 2023<br/>uses SwiGLU]
    Mistral[Mistral 2023<br/>uses SwiGLU]

    GELU --> BERT
    GELU --> GPT
    SwiGLU --> LLaMA
    SwiGLU --> Mistral

Categorized by "sub-lines most affected by ReLU":

1. Direct ReLU variants:

| Descendant | Year | Difference from ReLU |
| --- | --- | --- |
| Leaky ReLU (Maas 2013) | 2013 | negative region \(\alpha x\), prevents dead neurons |
| PReLU (He 2015) | 2015 | \(\alpha\) learnable |
| RReLU (Xu 2015) | 2015 | \(\alpha\) random during training |
| ELU (Clevert 2015) | 2015 | negative region \(\alpha(e^x-1)\) |
| SELU (Klambauer 2017) | 2017 | self-normalizing (deep MLPs without BN) |
| Maxout (Goodfellow 2013) | 2013 | generalized to piecewise linear |

2. Smooth variants / self-gating:

| Descendant | Year | Formula | Use |
| --- | --- | --- | --- |
| GELU (Hendrycks 2016) | 2016 | \(x \cdot \Phi(x)\) | BERT / GPT |
| Swish / SiLU (Ramachandran 2017) | 2017 | \(x \cdot \sigma(x)\) | EfficientNet |
| Mish (Misra 2019) | 2019 | \(x \cdot \tanh(\text{softplus}(x))\) | some CV tasks |
| GLU family (GLU/GeGLU/SwiGLU/ReGLU) | 2017+ | \(\sigma(W_1 x) \odot (W_2 x)\) | LLaMA-line FFN |

3. Network architectures catalyzed by ReLU:

| Architecture | Year | Role of ReLU |
| --- | --- | --- |
| AlexNet (Krizhevsky 2012) | 2012 | first ImageNet SOTA with ReLU + GPU |
| VGG (Simonyan 2014) | 2014 | 16-19-layer CNN, ReLU by default |
| GoogLeNet / Inception (Szegedy 2014) | 2014 | ReLU throughout the Inception block |
| ResNet (He 2015) | 2015 | ReLU + skip connections train 152 layers |
| DenseNet (Huang 2016) | 2016 | ReLU + dense connections |
| EfficientNet (Tan 2019) | 2019 | ReLU variant (Swish) + NAS |

4. Engineering advances ReLU drove:

| Advance | Year | Relation to ReLU |
| --- | --- | --- |
| Dropout (Srivastava 2014) | 2014 | synergy with ReLU (makes ReLU's sparsity more random) |
| Batch Normalization (Ioffe 2015) | 2015 | puts ReLU at the optimal operating point (~50% active) |
| He Initialization (He 2015) | 2015 | weight init designed specifically for ReLU |
| Layer Normalization (Ba 2016) | 2016 | LN + GELU is the Transformer standard |
| Swish via NAS (Ramachandran 2017) | 2017 | RL-searched activation, proving ReLU is not the endpoint |

Misreadings — how posterity has misread ReLU

Misreading 1: treating ReLU as just a "simple activation function" — severe underestimation. ReLU is not just a formula; it's a paradigm revolution from shallow to deep. Without ReLU, AlexNet 2012 couldn't have trained; without AlexNet, the deep learning revolution might have been delayed by 5 years.

Misreading 2: thinking ReLU "solved" vanishing gradients — partly right. ReLU solved activation-function-induced vanishing gradients, but weight-induced vanishing gradients still exist — this is what He initialization and BatchNorm later solved. ReLU + He init + BN combined truly solved deep training.

Misreading 3: thinking "dying ReLU" is fatal. Partly right: dead neurons exist, but in practice:

  • correct init (He init) + a modest learning rate + BN keep dead neurons under 5%
  • dead neurons can be viewed as implicit pruning; networks auto-find sparse sub-structures
  • Leaky ReLU / PReLU and friends fully eliminate dead neurons (but the practical gain is small)

Misreading 4: thinking ReLU has been replaced by GELU / Swish. Wrong. As of 2026:

  • CNN line (ResNet / EfficientNet / ConvNeXt): ReLU or SiLU
  • Transformer line (BERT / GPT): GELU
  • LLM line (LLaMA / Mistral): SwiGLU
  • ReLU remains the baseline and the first choice for teaching: simple, fast, stable

Misreading 5: thinking ReLU was first proposed by Glorot 2011 — partly wrong. Hahnloser 2000 used max(0, x) model in neuroscience; Nair & Hinton 2010 used ReLU in RBM; Glorot 2011's contribution is pushing ReLU to supervised deep MLPs and systematically proving it beats sigmoid.

Misreading 6: attributing ReLU's success to "sparsity" — partly wrong. Sparsity is a byproduct, not the core. The core is "gradient = 1 in the positive region", which is the key to solving vanishing gradients. Sparsity is icing on the cake.

Misreading 7: thinking ReLU is the only form of piecewise linearity — wrong. Maxout generalizes to \(\max(W_1 x, W_2 x, ..., W_k x)\); the GLU family combines piecewise linearity with gating. ReLU is the simplest form of piecewise linearity, but not the only form.


Modern Perspective

15 years later, which assumptions in the ReLU paper have been falsified?

Written in 2011, the ReLU paper contains a series of assumptions about neural network training. Today (2026), 15 years later, some assumptions still hold, others have been falsified:

| Assumption / claim in the paper | Evidence in 2011 | Status in 2026 | Verdict |
| --- | --- | --- | --- |
| ReLU completely eliminates vanishing gradients | per-layer gradient = 1 | weight-induced vanishing still exists; needs He init + BN | partly falsified |
| Sparsity is the key to ReLU's success | Figure 3 + biological motivation | sparsity is a byproduct; "gradient = 1" is the core | partly falsified |
| ReLU makes unsupervised pretraining unnecessary | MNIST: ReLU without pretraining matches sigmoid with pretraining | almost no one uses unsupervised pretraining post-2012 | fully holds |
| Dying ReLU is a fatal flaw | paper admits 30-50% dead | He init + BN keep it under 5%; arguably implicit pruning | partly falsified |
| ReLU's speedup comes mainly from formula simplicity | paper reports 4× per-epoch speedup | actual speedup combines ~3× fewer epochs with 4× cheaper epochs, ~10× total | partly confirmed |
| ReLU works for all tasks | verified on MLPs and CNNs | generative decoders still prefer tanh; LLMs use SwiGLU | partly falsified |
| 50% sparsity is the optimal operating point | Figure 3 shows 50-75% optimal | BN+ReLU networks typically sit at 30-50% sparsity | largely holds |

Overall: the core thesis of the ReLU paper ("replace sigmoid with ReLU, no pretraining needed, deep networks train directly") has stood the test of 15 years, but the explanation of "why ReLU works" (sparsity + neuroscience motivation) has been partly revised by later understanding — the core is piecewise linear + non-saturating positive region.

The "ghost" of ReLU in modern deep learning

Although 2026 LLMs and SOTA CV models no longer use raw ReLU, the spirit of ReLU is everywhere:

1. The "gradient = 1 in positive region" design principle is fully inherited: - GELU, Swish, Mish, SwiGLU all preserve near-linearity in the positive region - residual connections (ResNet) borrowed from ReLU's "no-decay propagation" idea - LayerNorm + ReLU/GELU is standard in all Transformers

2. Piecewise linearity remains mainstream:

  • Maxout, ReLU, Leaky ReLU, PReLU are strictly piecewise linear
  • GELU and Swish are smooth approximations of piecewise linearity
  • theoretical analysis of piecewise-linear networks (Pascanu 2014, Montufar 2014) is still active

3. Sparsity ideas live on in MoE:

  • the Mixture-of-Experts (MoE) router is another "hard sparsity": only the top-k experts activate
  • LLaMA-MoE / Mixtral-8x7B push sparsity to the model level
  • ReLU's "50% sparsity" prefigured MoE's "top-2 routing"

4. ReLU is still the first choice for teaching and baselines:

  • almost always the first activation function introduced in any deep-learning textbook
  • the default baseline activation when academics propose new methods
  • in engineering practice (mobile inference, embedded NNs), ReLU remains first choice (INT8-quantization friendly)

What if the ReLU paper were written today?

If Glorot rewrote this paper in 2026, possible changes:

New sections:

  1. Synergy with BN/LN: in 2011 the paper couldn't know that BN (2015) would become ReLU's best partner.
  2. He init vs Xavier init: the paper used Xavier init (not actually optimal for ReLU); He init is the best match.
  3. Practical impact of dying neurons: 15 years of engineering show far fewer than 30-50% die.
  4. Theoretical analysis of piecewise linearity: Montufar 2014's result that the number of linear regions grows like \(O(2^L)\).

Removed / weakened parts:

  1. Over-emphasis on sparsity: the Figure 3 analysis dwells on the sparsity ratio and misses the essential role of gradient signals.
  2. Biological motivation: later shown not to be the key to ReLU's success; the advantages are mainly engineering.
  3. Softplus comparison: by 2026, Softplus sees little practical use.

New comparisons that would be introduced:

  • systematic comparison of ReLU vs GELU vs Swish vs SwiGLU
  • the ReLU-vs-GELU gap on Transformers
  • analysis of SwiGLU's advantage on LLMs

Limitations and Outlook

Core limitations of ReLU

| Limitation | Acknowledged in the 2011 paper? | Subsequent solution |
| --- | --- | --- |
| Dying ReLU | partly | Leaky ReLU / PReLU / He init / BN |
| Non-zero-centered output | no | ELU / SELU introduce negative outputs |
| Non-differentiable at 0 | partly | engineered to 0 or 1; theoretically a subgradient |
| Unbounded positive region | no | BN / LN / weight clipping |
| Poor on generative tasks | no | VAE / GAN decoders use tanh |
| Not suitable for output layers | no | outputs use softmax / sigmoid / linear |
| Slightly weaker on Transformers | N/A | GELU / SwiGLU improvements |
| Biological-plausibility debate | paper emphasizes it | later research questions the biological explanation of sparsity |

Future directions

1. ReLU in neuromorphic computing: in spiking neural networks, ReLU's "threshold firing" property is natural — SpikeNet / Loihi chips are exploring.

2. Lighter activations: ReLU is already among the lightest non-linear activations (only one comparison), but quantum NNs / reversible networks may further simplify.

3. Adaptive activations: let networks choose activations themselves via NAS or meta-learning (Swish was found by NAS; future may be more dynamic).

4. Renaissance of ReLU in LLM inference optimization (see the sketch after this list):

  • ReLU sparsity is hardware-acceleratable (CSR sparse matmul)
  • "ReLU-fication" of LLMs: swapping LLaMA's SwiGLU back to ReLU for inference acceleration (DEJAVU, PowerInfer)
  • Apple's 2024 ReLU LLaMA reportedly runs 3× faster on a Mac Studio

5. Fusion with new architectures:

  • activation choice in Mamba (state-space models)
  • ReLU-style sparse gating in Mixture of Depths / early-exit schemes
  • synergy between test-time compute and ReLU sparsity

6. Interpretability via piecewise linearity:

  • every ReLU network computes a piecewise-linear function
  • decision boundaries can be understood by enumerating "linear regions"
  • formal-verification tools like Marabou / NNV exploit the piecewise-linear structure
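
A sketch of the sparsity trick behind item 4 (DEJAVU / PowerInfer style; shapes and names here are illustrative): units zeroed by ReLU contribute nothing to the next matmul, so their columns can be skipped with no change to the output.

import numpy as np

rng = np.random.default_rng(0)
W_up   = rng.standard_normal((4096, 1024))   # hypothetical FFN up-projection
W_down = rng.standard_normal((1024, 4096))   # hypothetical FFN down-projection
x = rng.standard_normal(1024)

h = np.maximum(0, W_up @ x)                  # ReLU hidden state, ~50% zeros
active = h > 0
y_dense  = W_down @ h
y_sparse = W_down[:, active] @ h[active]     # touch only the active columns
print(active.mean(), np.allclose(y_dense, y_sparse))   # ~0.5 True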

Key papers along the ReLU lineage:

| Paper | Year | Relation to ReLU |
| --- | --- | --- |
| Hahnloser et al., "Digital selection and analogue amplification coexist..." | 2000 | half-wave rectification, neuroscience origin |
| Jarrett et al., "What is the best multi-stage architecture..." | 2009 | early empirical evidence for rectification in CNNs |
| Nair & Hinton, "Rectified linear units improve restricted Boltzmann machines" | 2010 | early empirical evidence for ReLU in RBMs |
| Krizhevsky et al., "ImageNet classification with deep CNNs" (AlexNet) | 2012 | ReLU's breakthrough battle |
| Maas et al., "Rectifier nonlinearities improve neural network acoustic models" (Leaky ReLU) | 2013 | fixes dying ReLU |
| He et al., "Delving deep into rectifiers..." (PReLU + He init) | 2015 | best initialization for ReLU networks |
| Ioffe & Szegedy, "Batch normalization..." | 2015 | ReLU's best partner |
| Clevert et al., "Fast and accurate deep network learning by ELUs" | 2015 | smooth negative-region variant |
| Hendrycks & Gimpel, "Gaussian Error Linear Units (GELUs)" | 2016 | Transformer-era standard |
| Ramachandran et al., "Searching for activation functions" (Swish) | 2017 | NAS finds Swish |
| Shazeer, "GLU variants improve Transformer" | 2020 | LLM-era SwiGLU |

Areas with kindred ideas to ReLU

1. Signal processing / half-wave rectification: the diode rectifier in electronic circuits (half-wave rectifier) is the physical version of ReLU, in use since the early days of electronics.

2. Economics / option pricing: European call option payoff \(\max(S - K, 0)\) has the same form as ReLU — the "hockey stick" function in financial engineering.

3. Optimization / convex analysis: \(\max(0, x)\) is the core component of hinge loss, kindred to SVM's \(\max(0, 1-yf(x))\).

4. Neuroscience: V1 simple cell firing rate models, Hodgkin-Huxley threshold firing models can all be viewed as biological prototypes of ReLU.

5. Control theory / differential equations: piecewise linear systems have long been studied in control theory; ReLU networks can be viewed as special piecewise linear controllers.

Cross-domain research it inspired

1. Computer graphics: piecewise linear texture synthesis; ReLU MLP representations in NeRF.

2. Protein structure prediction: ReLU/SiLU activations inside AlphaFold; piecewise linear approximation of protein energy landscapes.

3. Reinforcement learning: DQN's Q-network uses ReLU; ReLU sparsity in policy networks affects exploration behavior.

4. Recommender systems: ReLU FFN in two-tower models; ReLU crosses in DCN-V2.

5. Autonomous driving perception: ReLU in BEVFormer / BEVDet; ReLU gating in multi-view fusion.

Resources

Papers and code

  • Glorot et al. 2011 original (AISTATS): https://proceedings.mlr.press/v15/glorot11a.html
  • PyTorch ReLU docs: https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html
  • TensorFlow ReLU docs: https://www.tensorflow.org/api_docs/python/tf/nn/relu
  • JAX ReLU: https://jax.readthedocs.io/en/latest/_autosummary/jax.nn.relu.html

Important follow-up papers

  • Nair & Hinton 2010 (ReLU in RBM): https://www.cs.toronto.edu/~hinton/absps/reluICML.pdf
  • AlexNet (Krizhevsky 2012): https://papers.nips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b
  • He et al. 2015 (PReLU + He init): https://arxiv.org/abs/1502.01852
  • GELU (Hendrycks 2016): https://arxiv.org/abs/1606.08415
  • Swish / Searching for Activation Functions (Ramachandran 2017): https://arxiv.org/abs/1710.05941
  • GLU Variants Improve Transformer (Shazeer 2020): https://arxiv.org/abs/2002.05202

Courses and tutorials

  • Stanford CS231n "Neural Networks 1" lecture (covers ReLU vs sigmoid vs tanh)
  • Deep Learning Book (Goodfellow et al.) Chapter 6 "Deep Feedforward Networks"
  • 3Blue1Brown Neural Networks series (visualizes ReLU)
  • Distill.pub "A Visual Exploration of Gaussian Processes" (mentions activation function geometry)

Engineering resources

  • ReLU LLaMA (Apple 2024, the "ReLU Strikes Back" line of work)
  • DEJAVU: contextual sparsity for LLM inference (uses ReLU sparsity)
  • PowerInfer: GPU-CPU hybrid LLM inference (leverages ReLU sparsity)
