ReLU — How max(0, x) Turned Deep Networks from "Lab Toy" to "Industrial Cornerstone"¶
April 11, 2011. Glorot, Bordes, and Bengio (Université de Montréal) publish the 9-page paper Deep Sparse Rectifier Neural Networks at AISTATS 2011. A severely under-recognized paper — it replaced the 20-year-default sigmoid / tanh with an activation so simple it barely looks like a contribution, \(\max(0, x)\), and showed deep nets train fine without Hinton's 2006 (DBN) unsupervised pretraining — in fact, accuracy went up. ReLU's revolution wasn't novelty (threshold units go back to McCulloch-Pitts 1943, and Fukushima's 1980 Neocognitron used rectification); it was that this paper systematically demonstrated three things for the first time: (1) non-saturating gradients make deep nets trainable; (2) sparse activation (~50% zero output) actually improves generalization; (3) the activation is several times cheaper to compute than sigmoid. One year later AlexNet fused it with GPU + Dropout + ImageNet to ignite deep learning — ReLU became the default activation of every 21st-century deep network and the ancestor of the entire LeakyReLU / GELU / SwiGLU family.
TL;DR¶
Glorot, Bordes, and Bengio's 9-page 2011 AISTATS paper uses an activation function so brutally simple it looks indefensible — \(f(x) = \max(0, x)\) — to shatter the 2006–2010 industry-wide creed that "deep networks must use unsupervised pretraining to be trainable". The paper argues from three angles — biological plausibility, optimisation convenience, and representational sparsity — that ReLU dominates sigmoid/tanh, and empirically demonstrates that a purely supervised deep MLP reaches state-of-the-art on MNIST (1.43% error), NORB (16.4%), CIFAR-10 (~50%), and NISTP (8.8%) with no RBM/DBN/autoencoder pretraining whatsoever. This is not a small improvement — it is a paradigm flip: it became Krizhevsky's direct academic justification for choosing ReLU in AlexNet (2012); within 18 months it pushed unsupervised pretraining from "mandatory" to "optional"; and it cleared the runway for deep CNNs to grow from 8 layers (AlexNet) to 152 (ResNet) and past 1,000 (the ResNet-1001 variants). Today virtually every deep network (GPT-class LLMs, Sora, AlphaFold included) uses a direct descendant of ReLU as its hidden activation — GELU, SiLU, SwiGLU all trace back to that humble piecewise-linear hinge in this 9-page paper.
Historical Context¶
What was the deep-learning community stuck on in 2010?¶
To grasp the disruptive power of the ReLU paper, you must view 2006–2010 — the "first phase of the deep-learning renaissance" — as a five-year era held captive by a single mistaken belief.
In 2006 Hinton published A Fast Learning Algorithm for Deep Belief Nets in Neural Computation, using greedy layer-wise RBM pretraining + supervised fine-tune to push 4-layer networks to 1.25% error on MNIST — the first revival of "trainable deep networks" after a 20-year drought. For five solid years afterwards, the entire field congealed around a near-religious consensus:
"You cannot train deep networks with vanilla supervised backprop alone; you must first do unsupervised pretraining to land the weights in a good basin of attraction, then fine-tune. Otherwise it will not work."
The "evidence" for this consensus was overwhelming. Hinton 2006 (DBN), Bengio 2007 (stacked autoencoder), Vincent 2008 (denoising autoencoder), Larochelle 2009 (large-scale ablation), Erhan 2010 (Why does unsupervised pre-training help deep learning? JMLR) — paper after paper showed in ablation tables that on MNIST, removing pretraining inflates a deep MLP's error from ~1% to ~3-10%. Erhan's 2010 review even elevated "pretraining = optimisation regulariser" to a theoretical principle, and almost every 2007–2010 ICML/NeurIPS deep-learning paper was written inside this frame.
But Glorot's group (also Bengio's lab) was already smelling smoke. The real bottleneck might not be "weight initialisation" — it might be the activation function itself. Glorot & Bengio 2010, Understanding the difficulty of training deep feedforward neural networks, had already diagnosed the failure mechanism: sigmoid's non-zero-centred output and saturation to 0/1 in the top layer caused severe vanishing gradients by layer 4 or 5; tanh was slightly better but still saturated. That paper proposed Xavier init alongside, but the authors knew init was a band-aid — the real disease was sigmoid's max-derivative of 0.25 plus saturating-region zeros.
A second independent signal arrived in 2010: Nair & Hinton's ICML paper Rectified Linear Units Improve Restricted Boltzmann Machines swapped RBM's sigmoid binary unit for a "noisy rectified linear unit" (sampling \(\max(0, z + \mathcal{N}(0, \sigma(z)))\)) and beat standard RBM on NORB and Caltech-101. At ICML, Hinton publicly hinted for the first time that "perhaps sigmoid is wrong" — but he still wrapped ReLU inside RBM and did not dare say "throw away unsupervised pretraining."
Stack these two signals on top of the entrenched consensus and the 2011 ReLU paper is the kick that finished the play: it pulled ReLU out of the RBM, made it the hidden activation of a purely supervised deep MLP, and used four benchmarks to prove "no pretraining needed for SOTA." What this kick shattered was not sigmoid; it was the canon of "unsupervised pretraining as a necessity."
The concrete pain point in 2010:
deep network + sigmoid + backprop almost certainly fails at ≥5 layers — after backpropagating through four sigmoid layers, the gradient has been attenuated by a factor of \(0.25^4 \approx 4\times 10^{-3}\), down to noise level; combined with sigmoid's non-zero-centred output causing an ill-conditioned Hessian, purely supervised deep training was effectively a dead end from 2006 to 2010.
Everyone bypassed this wall with RBMs/AEs. Glorot's paper said: the wall was built by the activation itself; replace the wall with a staircase and the problem disappears.
The 6 immediate predecessors that pushed ReLU out¶
- Hahnloser et al. 2000 (Nature: Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit): the first use of \(\max(0, x)\) in a neuromorphic silicon circuit as a biologically reasonable approximation — cortical neurons' firing-rate-vs-input-current curve is essentially linear rectification (no fire below threshold, linear rise above). This paper is ReLU's "biological-plausibility certificate" — Glorot's §2 leans on it heavily to argue ReLU is more brain-like than sigmoid.
- Dugas et al. 2001 (NIPS: Incorporating Second-Order Functional Knowledge for Better Option Pricing): proposed softplus \(g(x) = \log(1 + e^x)\) as a smooth surrogate for ReLU. Softplus is everywhere-differentiable but loses sparsity (always positive) and was later overtaken by ReLU. One of the central baselines in Glorot's paper — proving "smoothness is not what matters; sparsity is."
- Jarrett, Kavukcuoglu, Ranzato, LeCun 2009 (ICCV: What is the best multi-stage architecture for object recognition?): empirically found in vision CNNs that \(|x|\) (absolute-value rectification) — a rectifying non-linearity — beat sigmoid/tanh on Caltech-101. But Jarrett never generalised this to plain deep MLPs and never realised that sparsity was the key — Glorot picked up the thread and corrected the diagnosis.
- Nair & Hinton 2010 (ICML: Rectified Linear Units Improve Restricted Boltzmann Machines): introduced ReLU into RBMs in the "noisy rectified" form and got SOTA on NORB. This was ReLU's first formal entry into the mainstream deep-learning community — but Hinton kept it tied to RBM. Glorot saw this and asked: "If it works inside RBM, what if we strip away the RBM and let a purely supervised network use it?" — exactly the research path of the 2011 paper.
- Glorot & Bengio 2010 (AISTATS: Understanding the difficulty of training deep feedforward neural networks): the same authors' previous-year paper — systematic diagnosis of sigmoid's failure mechanism in deep nets + introduction of Xavier init. The most important "theoretical bedding" for the 2011 ReLU paper — Glorot had completed the death-row paperwork on sigmoid; the 2011 paper was "OK, sigmoid is dead, can ReLU take the throne?"
- Hinton 2006 Science (Reducing the dimensionality of data with neural networks) + the DBN paper: defined the "unsupervised pretraining + supervised fine-tune" pipeline that became standard 2006-2010. This pipeline is the boss the ReLU paper went out to slay — Glorot's §6 deliberately includes "with vs. without pretraining" controls, and ReLU shrank pretraining's advantage from +5% to +0.1%, issuing pretraining's death sentence in the large-data regime.
What was the author team doing?¶
- Xavier Glorot (first author): a PhD student in Bengio's lab who had just published the Xavier-init paper in 2010. His core research question was "why are deep nets hard to train"; having attacked it from the initialization angle, the natural next thought was "should we also replace the activation?" Glorot later joined DeepMind, but the ReLU paper made him one of the highest-impact-per-paper young researchers in the history of deep learning — "Xavier init" / "Glorot init" ships as a standard initializer (torch.nn.init.xavier_uniform_) in every major framework today.
- Antoine Bordes (second author): postdoc in Bengio's lab at the time, focused on NLP / knowledge graphs. He later joined Facebook AI Research (FAIR) as a research director, eventually heading the FAIR Paris lab. Bordes's presence kept the ReLU paper from being purely a "vision toy" — the dedicated NISTP (handwritten character) experiment owes something to his NLP background.
- Yoshua Bengio (third author): already one of the deep-learning triumvirate at the time, head of the LISA lab at Université de Montréal. Bengio's lab in 2007–2011 was one of the world's two centres of deep-learning research (the other being Hinton's Toronto). Bengio was a core figure of the unsupervised-pretraining school (stacked autoencoders came from his lab), so his name on this paper amounts to the pretraining school publicly overturning its own doctrine — a rare display of academic honesty.
- Lab posture: Bengio's lab in 2010 was undergoing a subtle paradigm shift — moving from "probabilistic graphical models + RBMs" toward "engineering optimisation + larger networks." 2010 Xavier init was the prelude; 2011 ReLU was the climax; 2013 Maxout (Goodfellow, same lab) and 2014 GAN (also Goodfellow) were continuations. Glorot's paper was the most consequential single move in this shift.
State of industry, compute, data¶
- Compute: NVIDIA Fermi GPUs (GTX 580, 512 CUDA cores, 1.5 GB GDDR5) had just matured in 2010-2011. The Glorot experiments still ran mostly on CPU (deep-learning frameworks for GPUs were in the awkward Theano 0.3 era), but in 2012 AlexNet trained ReLU CNNs on the same GTX 580 and finished ImageNet in 5–6 days — industrial vindication of the ReLU paper one year later.
- Data: MNIST (60k), NORB (24k), CIFAR-10 (60k) were the three reigning benchmarks; ImageNet (released 2009) was still new, with the 2010 ILSVRC dominated by SIFT + shallow SVM. All Glorot experiments ran at < 100k samples — which makes ReLU's strength even more striking: even on small data, dropping pretraining still won SOTA.
- Frameworks: Bengio's lab released Theano 0.3 in 2010 — the first Python deep-learning framework with auto-diff + GPU support. The ReLU paper's experiment code was written in Theano; the team open-sourced network configuration scripts (most no longer reproducible today). Theano is the spiritual grandfather of PyTorch; the ReLU + Theano combination was an early prototype of deep learning's industrial form.
- Industry climate: in 2011, deep learning barely existed in industry — Google Brain would not be founded for another year (2012), Facebook AI Research two more (2013), DeepMind was still in stealth mode. The 2011 academic consensus was "deep learning is an interesting niche" — SVMs, graphical models, decision trees still dominated AAAI/IJCAI. The ReLU + AlexNet combination would, 18 months later, totally rewrite this landscape.
Method Deep Dive¶
Overall framework and algorithmic skeleton¶
The "method" section of the ReLU paper is essentially a minimal activation function definition + three theoretical analysis dimensions. The whole paradigm fits in 5 lines of code:
```python
import torch.nn as nn

# Old paradigm (pre-2010): sigmoid + RBM pretraining + supervised fine-tune
old_net = nn.Sequential(
    nn.Linear(784, 1000), nn.Sigmoid(),
    nn.Linear(1000, 1000), nn.Sigmoid(),
    nn.Linear(1000, 1000), nn.Sigmoid(),
    nn.Linear(1000, 10)
)
# Training: layer-wise RBM pretraining → then supervised fine-tune

# New paradigm (Glorot 2011): ReLU + direct supervised training
new_net = nn.Sequential(
    nn.Linear(784, 1000), nn.ReLU(),
    nn.Linear(1000, 1000), nn.ReLU(),
    nn.Linear(1000, 1000), nn.ReLU(),
    nn.Linear(1000, 10)
)
# Training: direct SGD + cross-entropy
```
ReLU's mathematical definition: $$ f(x) = \max(0, x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases} $$
Derivative (non-differentiable at \(x = 0\); the subgradient there is conventionally set to 0 or 1 in practice): $$ f'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases} $$
Compared with contemporary activation functions:
| Activation | Formula | Positive-region gradient | Compute cost | Output range | Sparsity |
|---|---|---|---|---|---|
| Sigmoid | \(1/(1+e^{-x})\) | \(\leq 0.25\) | high (has exp) | \((0, 1)\) | no |
| Tanh | \((e^x-e^{-x})/(e^x+e^{-x})\) | \(\leq 1\) | high (has exp) | \((-1, 1)\) | no |
| Softplus | \(\ln(1+e^x)\) | \(\leq 1\) | high (has exp+log) | \((0, +\infty)\) | no |
| ReLU | \(\max(0, x)\) | = 1 | very low | \([0, +\infty)\) | yes |
| Hard Tanh | \(\max(-1, \min(1, x))\) | 1 in \((-1, 1)\), 0 outside | very low | \([-1, 1]\) | no |
ReLU's revolution: positive-region gradient identically 1 (no vanishing) + ultra-fast compute + naturally sparse activations — three properties no other activation could simultaneously deliver.
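As a sanity check on the gradient column, the maximum derivative of each activation can be evaluated numerically (a quick sketch, not an experiment from the paper):

```python
import numpy as np

# Numerically confirm the "positive-region gradient" column above.
x = np.linspace(-10.0, 10.0, 100001)   # grid includes x = 0 exactly

s = 1.0 / (1.0 + np.exp(-x))
d_sigmoid  = s * (1.0 - s)             # sigma'(x) = sigma(x)(1 - sigma(x))
d_tanh     = 1.0 - np.tanh(x) ** 2     # tanh'(x) = 1 - tanh^2(x)
d_softplus = s                         # softplus'(x) = sigma(x)
d_relu     = (x > 0).astype(float)     # 1 on the positive region, 0 elsewhere

print(d_sigmoid.max())   # 0.25 (at x = 0) -- the vanishing-gradient culprit
print(d_tanh.max())      # 1.0 (only at x = 0; saturates for large |x|)
print(d_relu.max())      # 1.0 (everywhere on the positive region)
```

Sigmoid's derivative never exceeds 0.25 anywhere, which is exactly the per-layer attenuation factor discussed below the table.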
Key Design 1: max(0, x) — minimal solution to vanishing gradients¶
Function: by making the activation's positive region the identity map (gradient identically 1), fundamentally eliminate the vanishing-gradient problem in deep networks.
Core idea and formulas:
Consider an \(L\)-layer network where layer \(l\)'s activation is \(h^{(l)} = f(W^{(l)} h^{(l-1)} + b^{(l)})\). During backprop, the gradient w.r.t. layer-1 weights \(W^{(1)}\) is: $$ \frac{\partial \mathcal{L}}{\partial W^{(1)}} = \frac{\partial \mathcal{L}}{\partial h^{(L)}} \cdot \prod_{l=2}^{L} \left( W^{(l)} \cdot \text{diag}(f'(z^{(l)})) \right) \cdot \frac{\partial h^{(1)}}{\partial W^{(1)}} $$
where \(z^{(l)} = W^{(l)} h^{(l-1)} + b^{(l)}\). The gradient's decay/amplification factor is \(\prod f'(z^{(l)})\):
- Sigmoid: \(f'(z) = \sigma(z)(1-\sigma(z)) \leq 0.25\), after 10 layers gradient \(\sim 0.25^{10} \approx 10^{-6}\) (vanishing)
- Tanh: \(f'(z) = 1 - \tanh^2(z) \leq 1\), with equality only at \(z = 0\); it saturates for large \(|z|\), so deep layers still attenuate
- ReLU: \(f'(z) = 1\) (when \(z > 0\)) or \(0\) (when \(z \leq 0\)). For active neurons, gradient passes 100% — no decay at all
Key properties:
1. Gradients along "active paths" are preserved exactly: as long as every ReLU on a forward path is active (\(z > 0\)), the gradient flows losslessly to the bottom layers
2. "Dead neuron" problem: if a neuron's \(z\) stays below 0 for long, its gradient is permanently 0 and it never learns. This is ReLU's cost, and it drove the Leaky ReLU / PReLU / ELU follow-ups
3. Implicit sparsity: at any moment about 50% of neurons are inactive (output 0)
Why it works: ReLU shifts the role of the activation from "squashing non-linearity" to "gating switch" — active neurons provide linear channels, inactive neurons act like pruning. This "piecewise linear + on/off gating" structure makes the whole network a piecewise-linear function with enormous expressivity (each activation pattern carves out its own linear region of input space).
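The product-of-derivatives argument above can be made concrete with a 10-layer sketch: along a path whose units are all active, every ReLU factor is exactly 1, while every sigmoid factor is at most 0.25 (illustrative numbers, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pre-activations along one fully active path through a 10-layer net:
# all positive, so every ReLU on the path is switched on.
z = rng.uniform(0.1, 2.0, size=10)

sig = 1.0 / (1.0 + np.exp(-z))
sigmoid_factors = sig * (1.0 - sig)      # each factor <= 0.25
relu_factors = np.ones_like(z)           # f'(z) = 1 for every active unit

print(f"sigmoid path gain: {np.prod(sigmoid_factors):.1e}")  # vanishes toward 0
print(f"ReLU path gain:    {np.prod(relu_factors):.1f}")     # exactly 1.0
```

The sigmoid product is below \(0.25^{10} \approx 10^{-6}\) regardless of the draw; the ReLU product stays exactly 1 as long as the path stays active.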
Inspiration to follow-up work:
- Maxout (Goodfellow 2013): generalizes ReLU to \(\max(W_1 x, W_2 x)\)
- Leaky ReLU (Maas 2013): negative region set to \(\alpha x\) (\(\alpha = 0.01\)) to prevent dead neurons
- PReLU (He 2015): \(\alpha\) made learnable
- ELU (Clevert 2015): negative region \(\alpha(e^x - 1)\) to push the average output toward 0
- GELU (Hendrycks 2016): Gaussian error linear unit, used by BERT / GPT
- Swish / SiLU (Ramachandran 2017): \(x \cdot \sigma(x)\), self-gating
- SwiGLU (Shazeer 2020): default FFN activation in LLaMA / Mistral / Qwen
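For reference, the variants listed above each reduce to a few lines (a sketch using the papers' common default constants; the GELU line is the widely used tanh approximation):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):          # Maas 2013
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):                  # Clevert 2015
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def silu(x):                            # Swish with beta = 1 (Ramachandran 2017)
    return x / (1.0 + np.exp(-x))

def gelu(x):                            # tanh approximation (Hendrycks 2016)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (relu, leaky_relu, elu, silu, gelu):
    print(f"{f.__name__:>10}: {np.round(f(x), 4)}")
```

All of them agree with ReLU for large positive inputs; they differ only in how they treat the negative region.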
Key Design 2: Sparsity — naturally emerging features¶
Function: through ReLU's hard threshold (output 0 when \(x \leq 0\)), activations naturally produce 50-90% zero values, no extra L1 regularization needed.
Core idea and formulas:
Define a layer's "activation sparsity": $$ \text{sparsity} = \frac{1}{N \cdot d} \sum_{i=1}^{N} \sum_{j=1}^{d} \mathbb{1}[h_{ij} = 0] $$ where \(N\) is batch size, \(d\) is layer dimension.
ReLU networks typically reach 50-90% sparsity (paper Figure 3). This is conditional sparsity: which neurons fire depends on the input, and only a small subset fires at any given moment.
Key properties:
1. Biological plausibility: only 1-4% of real cortical neurons are active at any time (V1 visual-neuron measurements)
2. Feature disentanglement: different inputs activate different neuron subsets, making features more independent and interpretable
3. Compute savings (in theory): with hardware support, zero values can be skipped (sparse compute)
4. Noise robustness: sparse representations are insensitive to small perturbations
Why it works: sparsity functionally turns "fully connected" networks into "dynamic sparse-connected" networks — each input activates a different sub-network. This is implicit model averaging (different inputs select different model paths), giving a regularization effect.
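The sparsity definition above is trivial to measure: with zero-mean inputs and weights, pre-activations are symmetric around 0, so ReLU silences about half the units (a sketch with random data; the He-style weight scale is an assumption, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def activation_sparsity(h):
    """Fraction of exact zeros in a batch of activations (N x d)."""
    return float(np.mean(h == 0))

X = rng.normal(size=(256, 784))                        # batch of zero-mean inputs
W = rng.normal(size=(784, 1000)) * np.sqrt(2.0 / 784)  # He-style scale (assumption)
h = np.maximum(0.0, X @ W)

print(f"sparsity = {activation_sparsity(h):.2f}")      # ~0.50
```

Higher sparsity levels (the 50-90% range the paper reports) come from trained weights and biases, not from random initialization.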
Follow-up validation:
- Bengio 2013 (empirical): ReLU sparsity does deliver better generalization
- Sparse-coding theory (Olshausen & Field 1996): sparse representations are the optimum of efficient coding
- Dropout (Srivastava 2014): synergistic with ReLU — dropout makes ReLU's sparsity more random
Key Design 3: Compute efficiency and hardware friendliness¶
Function: drop activation function compute cost from "floating-point exponential" to "one comparison + one mux", reducing deep network training cost 5-10×.
Core idea and comparison:
| Operation | Unit time (CPU cycles, approx) | Notes |
|---|---|---|
| Add | 1 | fastest |
| Multiply | 3-5 | fast |
| Compare | 1-2 | ReLU uses this |
| Divide | 20-30 | slow |
| Exp \(e^x\) | 50-100 | sigmoid / tanh use this |
| Log \(\ln\) | 50-100 | softplus / cross-entropy use this |
Typical implementation:
```python
import numpy as np

# Sigmoid (slow)
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # one exp per element

# Tanh (slow)
def tanh(x):
    e_pos = np.exp(x)
    e_neg = np.exp(-x)
    return (e_pos - e_neg) / (e_pos + e_neg)   # two exps, even slower

# ReLU (fast)
def relu(x):
    return np.maximum(0, x)   # one compare per element

# ReLU backward (minimal)
def relu_backward(grad_out, x):
    return grad_out * (x > 0)   # one compare + one multiply
```
Real speedup: paper Section 4.2 reports ReLU networks train 3-6× faster than tanh networks (depending on network size and hardware). On GPU, ReLU's advantage is greater — ReLU is element-wise, perfectly parallelizable, while sigmoid's exp is less efficient on GPU.
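The element-wise cost gap is easy to observe even in NumPy (a rough sketch; absolute timings depend on hardware and the vectorized math library, so only the ordering is meaningful):

```python
import numpy as np
import timeit

x = np.random.default_rng(0).normal(size=1_000_000)

t_sigmoid = timeit.timeit(lambda: 1.0 / (1.0 + np.exp(-x)), number=50)
t_relu    = timeit.timeit(lambda: np.maximum(0.0, x), number=50)

print(f"sigmoid: {t_sigmoid:.3f}s   relu: {t_relu:.3f}s   "
      f"speedup: {t_sigmoid / t_relu:.1f}x")   # relu is several times faster
```

This measures only the activation itself; in full training the matrix multiplies dominate, which is why the paper's end-to-end speedups are smaller than the per-op gap.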
Hardware impact:
1. GPU era: ReLU is one of the simplest GPU kernels; every deep-learning framework heavily optimizes it
2. Quantization era: ReLU's exact zeros fit naturally into INT8 quantization
3. Sparse hardware: sparse accelerators (e.g., the 2:4 structured sparsity introduced with NVIDIA Ampere) can exploit ReLU's activation sparsity
Implementation details and initialization¶
The ReLU paper offers companion initialization recommendations:
| Initialization | Formula | Suitable for |
|---|---|---|
| Xavier (Glorot 2010) | \(W \sim U[-\sqrt{6/(n_{in}+n_{out})}, \sqrt{6/(n_{in}+n_{out})}]\) | Sigmoid / Tanh |
| He (Kaiming 2015) | \(W \sim \mathcal{N}(0,\ 2/n_{in})\) (std \(\sqrt{2/n_{in}}\)) | ReLU |
| Uniform [-0.01, 0.01] | uniform small | any |
He initialization is specifically designed for ReLU: because ReLU zeroes the negative half of its input, each layer passes on only half the variance a symmetric activation would, so the weight variance must be doubled (std scaled by \(\sqrt{2}\)) to keep the forward signal from shrinking. This was a key engineering advance after the ReLU paper (He et al. 2015).
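The \(\sqrt{2}\) compensation is easy to verify empirically: stacking ReLU layers with Xavier-scale weights (variance \(1/n\)) shrinks the signal geometrically, while He-scale weights (variance \(2/n\)) hold it steady (a sketch with square random layers; the width, depth, and batch size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def final_variance(var_times_n, depth=10, width=1000, batch=256):
    """Activation variance after `depth` ReLU layers whose weights are
    drawn from N(0, var_times_n / width)."""
    h = rng.normal(size=(batch, width))
    for _ in range(depth):
        W = rng.normal(size=(width, width)) * np.sqrt(var_times_n / width)
        h = np.maximum(0.0, h @ W)
    return float(h.var())

print(f"Xavier-scale (var 1/n): {final_variance(1.0):.5f}")  # collapses toward 0
print(f"He-scale     (var 2/n): {final_variance(2.0):.5f}")  # stays O(1)
```

With variance \(1/n\) each ReLU layer halves the signal's second moment, so ten layers attenuate it by roughly \(2^{10}\); the factor of 2 in He init cancels that halving exactly.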
Counter-measures for ReLU "dead neurons":
1. Leaky ReLU: \(f(x) = \max(\alpha x, x)\), \(\alpha = 0.01\)
2. PReLU: \(\alpha\) learnable
3. Correct initialization (He init) + Batch Normalization substantially reduce dead neurons
4. Learning-rate warmup prevents early training from wrecking ReLU's activation distribution
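Before reaching for counter-measures, it helps to measure the problem. A minimal monitor counts units that are inactive on an entire batch (a hypothetical helper, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def dead_fraction(W, b, X):
    """Fraction of units whose pre-activation is <= 0 for *every* input in X.
    Such units receive zero gradient on this batch ("dead" on this data)."""
    Z = X @ W + b
    return float(np.mean((Z <= 0).all(axis=0)))

X = rng.normal(size=(512, 100))
W = rng.normal(size=(100, 200)) * np.sqrt(2.0 / 100)   # He-style init

print(dead_fraction(W, np.zeros(200), X))          # 0.0: healthy initialization
print(dead_fraction(W, np.full(200, -20.0), X))    # 1.0: large negative biases kill every unit
```

In real training the same check is run on a held-out batch every few epochs; a rising dead fraction is the usual symptom of a too-high learning rate.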
Failed Baselines¶
Opponents that lost to ReLU — the "activation function benchmarks" of 2011¶
When ReLU was published in 2011, mainstream activation functions were split between the sigmoid and tanh camps. Glorot's team ran a systematic comparison in the paper's Section 4:
| Opponent | Year proposed | Test error on NORB | Why it lost to ReLU |
|---|---|---|---|
| Sigmoid (\(\sigma(x)\)) | 1980s | 18.4% | vanishing gradient; non-zero-centered; slow compute |
| Tanh (\(\tanh(x)\)) | 1980s | 17.6% | vanishing gradient (saturation); slow compute |
| Softplus (\(\ln(1+e^x)\)) | 2001 | 17.0% | slow compute; not sparse |
| Hard Tanh (\(\max(-1, \min(1, x))\)) | 2010 | 16.9% | not sparse; bounded gradient on negatives |
| Rectified tanh \(\lvert \tanh(x) \rvert\) (Jarrett 2009) | 2009 | — | rectification without sparsity; never tried in plain deep MLPs |
| Sigmoid + RBM pretraining | 2006 | 16.5% | complex; multi-stage training |
| ReLU (\(\max(0, x)\)) | 2011 | 16.4% | wins: fast + non-vanishing + sparse |
Takeaways from this table:
1. ReLU's raw accuracy edge is small (16.4% vs 16.5% for Sigmoid + RBM); the key wins are simplicity + training speed
2. No RBM pretraining needed — the single biggest contribution of the ReLU paper
3. Training 3-6× faster than tanh — turning deep-network training from days into hours
Failures the paper acknowledged — scenarios where ReLU struggles¶
Glorot's paper Section 4.4 honestly lists ReLU's limits; later practice added more:
| Scenario | ReLU performance | Reason |
|---|---|---|
| Dying ReLU | 30-50% neurons permanently dead late in training | learning rate too high or wrong init causes z<0 permanently |
| Negative-value information loss | negative inputs clipped to 0 | some tasks (e.g., generative) need negatives |
| Non-zero-centered output | output ≥ 0 | like sigmoid, causes same-sign weight updates downstream |
| Non-differentiable at 0 | \(f'(0)\) undefined | engineered to 0 or 1, but theoretically a subgradient |
| Unbounded large values | output unbounded | can cause activation explosion (mostly mitigated by BatchNorm) |
| Generative tasks | worse than tanh | VAE / GAN decoders still prefer tanh (bounded output) |
Opponents striking back a year later — the rise of ReLU variants¶
ReLU's huge success (AlexNet using ReLU won ImageNet in 2012) sparked many improvements:
| Follow-up work | Year | Breakthrough | Improvement over ReLU |
|---|---|---|---|
| Maxout (Goodfellow 2013) | 2013 | \(\max(W_1 x, W_2 x)\) | generalize ReLU to piecewise linear |
| Leaky ReLU (Maas 2013) | 2013 | \(\max(\alpha x, x)\), \(\alpha=0.01\) | solves dead neurons |
| PReLU (He 2015) | 2015 | \(\alpha\) learnable | network-adaptive |
| ELU (Clevert 2015) | 2015 | negative region \(\alpha(e^x-1)\) | average output near 0 |
| SELU (Klambauer 2017) | 2017 | self-normalizing | train ultra-deep MLPs without BN |
| GELU (Hendrycks 2016) | 2016 | \(x \cdot \Phi(x)\) (Gaussian CDF) | BERT / GPT default |
| Swish / SiLU (Ramachandran 2017) | 2017 | \(x \cdot \sigma(x)\) | self-gating; EfficientNet uses |
| Mish (Misra 2019) | 2019 | \(x \cdot \tanh(\text{softplus}(x))\) | slightly better on some CV tasks |
| GLU family (GLU/GeGLU/SwiGLU) | 2017+ | \(\sigma(W_1 x) \odot (W_2 x)\) | LLaMA / Mistral default FFN |
Lessons from the counter-attack:
1. GELU / Swish slightly outperform ReLU in Transformers — but ReLU remains a first choice in CNNs
2. SwiGLU is the new default in the LLaMA era — but plain ReLU is still the hidden activation in ResNet-family CNNs
3. ReLU is the bridge from the ImageNet era to the LLM era — not replaced, only refined in some scenarios
A direction missed — Batch Normalization¶
The ReLU paper was written in 2011; Batch Normalization (Ioffe & Szegedy 2015) appeared 4 years later. BN and ReLU are a perfect match — BN normalizes per-layer activations to mean=0, var=1, putting ReLU exactly at the optimal "50% active" operating point.
If the ReLU paper had been written 4 years later: it might have been proposed together with BN, and "BN + ReLU + deep residual" would have become a unified scheme. But history is what it is: ReLU first, BN later, ResNet integrating both.
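The "BN puts ReLU near the 50%-active operating point" claim checks out numerically (a sketch with a deliberately mis-centred batch; the shift and scale are arbitrary, and the learnable gamma/beta of real BN are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

# A badly centred batch of pre-activations (mean 3, std 5 -- arbitrary choice).
z = rng.normal(3.0, 5.0, size=(256, 512))

# Batch-norm sketch: per-feature standardization to mean 0, var 1.
z_bn = (z - z.mean(axis=0)) / (z.std(axis=0) + 1e-5)

print(f"ReLU-active before BN: {np.mean(z > 0):.2f}")    # ~0.73 for this shift/scale
print(f"ReLU-active after BN:  {np.mean(z_bn > 0):.2f}")  # ~0.50
```

Any positive mean shift pushes ReLU toward mostly-on (losing sparsity); standardization pulls it back to the half-on regime regardless of the original shift.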
Another direction missed — Layer Normalization¶
LayerNorm (Ba 2016) and RMSNorm (Zhang 2019) are Transformer-era staples paired with GELU / SwiGLU. The ReLU paper didn't anticipate the importance of normalization, otherwise it might have driven early ReLU + LN combinations.
Key Experimental Data¶
Main results — head-to-head across 4 benchmarks¶
Glorot paper Tables 2-3 compare ReLU vs Sigmoid vs Tanh under different conditions on 4 datasets:
MNIST (handwritten digits, 60K training):
| Network | Activation | Pretraining | Test error |
|---|---|---|---|
| MLP-3layer | Sigmoid | none | 2.94% |
| MLP-3layer | Sigmoid | RBM | 1.67% |
| MLP-3layer | Tanh | none | 2.20% |
| MLP-3layer | Tanh | RBM | 1.55% |
| MLP-3layer | ReLU | none | 1.43% |
| MLP-3layer | ReLU | RBM | 1.50% (not necessarily better!) |
Key finding: ReLU without pretraining reaches 1.43%, beating Sigmoid+RBM's 1.67%. This is the most important single piece of evidence in the ReLU paper — unsupervised pretraining is no longer mandatory.
NORB (toy images, 48K training):
| Network | Activation | Pretraining | Test error |
|---|---|---|---|
| MLP-3layer | Sigmoid | none | 18.4% |
| MLP-3layer | Sigmoid | RBM | 16.5% |
| MLP-3layer | Tanh | none | 17.6% |
| MLP-3layer | ReLU | none | 16.4% |
CIFAR-10 (natural images, 50K training):
| Network | Activation | Pretraining | Test error |
|---|---|---|---|
| MLP-3layer | Tanh | none | 50.9% |
| MLP-3layer | ReLU | none | 49.5% |
NISTP (handwritten characters, mixed fonts, 82K training):
| Network | Activation | Pretraining | Test error |
|---|---|---|---|
| MLP-3layer | Sigmoid | none | 12.0% |
| MLP-3layer | Sigmoid | RBM | 9.4% |
| MLP-3layer | Tanh | none | 10.3% |
| MLP-3layer | ReLU | none | 8.8% |
Ablation — sparsity effect on performance¶
Paper Figure 3 shows ReLU network sparsity changes under different L1 regularization strengths:
| L1 strength λ | Average sparsity | MNIST test error |
|---|---|---|
| 0 (no L1) | 50% | 1.43% |
| 0.001 | 60% | 1.45% |
| 0.01 | 75% | 1.52% |
| 0.1 | 90% | 1.85% |
| 1.0 | 99% | 4.20% (collapse) |
Key findings:
1. ReLU's natural 50% sparsity (no L1 needed) is already the optimal operating point
2. Over-sparsification (>90%) hurts performance — too much expressivity is lost
3. ReLU partly subsumes the role of L1 regularization — a side benefit
Training speed comparison¶
Paper Section 4.2 reports:
| Activation | 1 epoch training time (CPU) | Convergence epochs | Total training time |
|---|---|---|---|
| Sigmoid | 12 min | 80 | 16 hr |
| Sigmoid + RBM | 12 min + RBM 30 min | 40 | 17 hr |
| Tanh | 11 min | 60 | 11 hr |
| ReLU | 3 min | 30 | 1.5 hr |
Key findings:
1. ReLU's per-epoch cost is 4× lower than sigmoid's (cheap compute)
2. ReLU converges in about a third of the epochs (gradients don't vanish)
3. End to end, ReLU trains roughly 10× faster than sigmoid — a revolutionary leap
Cross-architecture generalization¶
The paper also tested ReLU on different architectures:
| Architecture | Sigmoid error | ReLU error | Improvement |
|---|---|---|---|
| MLP-2layer | 2.4% | 1.8% | +25% |
| MLP-3layer | 2.2% | 1.43% | +35% |
| MLP-5layer | 3.1% (hard to train) | 1.6% | +48% |
| Convolutional Net | 1.8% | 1.2% | +33% |
Key finding: the deeper the network, the bigger ReLU's advantage — 5-layer MLPs with sigmoid almost can't be trained, but with ReLU reach 1.6%. This directly predicts the post-2012 depth explosion: AlexNet (8 layers) → VGG (16 layers) → ResNet (152 layers).
Several repeatedly-cited findings¶
- ReLU frees deep networks from unsupervised pretraining dependency — the paper's most important single contribution
- ReLU has natural 50% sparsity — biological plausibility + implicit regularization
- ReLU trains 10× faster than sigmoid — directly catalyzes the GPU + ReLU + big-data "AlexNet moment"
- ReLU dramatically improves deep network scalability — 5 layers untrainable with sigmoid, doable with ReLU
- ReLU is the hidden activation in all post-2012 SOTA models — AlexNet / VGG / GoogLeNet / ResNet all use ReLU or its variants
Idea Lineage¶
Predecessors — whose shoulders ReLU stood on¶
Neuroscience-level ancestors:
| Ancestor | Year | What it gave to ReLU | Position in ReLU |
|---|---|---|---|
| Hodgkin-Huxley model (1952) | 1952 | membrane potential + threshold firing | "no activation below threshold" idea |
| Hahnloser et al. (Nature 2000) | 2000 | half-wave rectification neuron model | direct predecessor of ReLU formula |
| Olshausen & Field (Nature 1996) | 1996 | sparse coding in visual cortex | biological motivation for sparsity |
| V1 firing rate measurements (1960s+) | 1960s | real neurons only 1-4% co-active | empirical basis for sparsity |
ML theory ancestors:
| Ancestor | Year | Contribution | Manifestation in ReLU |
|---|---|---|---|
| Threshold neuron (McCulloch & Pitts 1943) | 1943 | mathematical model of the all-or-none neuron | the binary gate that ReLU turns into a linear gate |
| Backpropagation (Rumelhart 1986) | 1986 | gradient descent training of NNs | foundation for ReLU optimization |
| Vanishing gradient analysis (Hochreiter 1991) | 1991 | systematic analysis of vanishing gradients | problem statement for ReLU |
| Hard Tanh (Collobert 2004) | 2004 | piecewise linear activation | piecewise-linear path of ReLU |
| Softplus (Dugas 2001) | 2001 | \(\ln(1+e^x)\) smooth activation | smooth approximation of ReLU |
| Sparse coding (Olshausen 1996) | 1996 | sparse representation | implicitly realized by ReLU |
Deep learning practice ancestors:
| Ancestor | Year | Contribution | Position in ReLU |
|---|---|---|---|
| DBN (Hinton 2006) | 2006 | unsupervised pretraining | what ReLU "replaces" |
| Stacked AE (Bengio 2007) | 2007 | layer-wise pretraining | same as above |
| Xavier init (Glorot 2010) | 2010 | weight initialization | paired with ReLU (later replaced by He init) |
| Nair & Hinton ReLU in RBM (2010) | 2010 | ReLU in RBM, empirical | direct predecessor |
| Jarrett et al. rectified \(\lvert \tanh(x) \rvert\) (ICCV 2009) | 2009 | rectifying non-linearity helps vision CNNs | direct empirical predecessor |
Descendants — the activation / deep learning lineage after ReLU¶
ReLU is not just an activation function — it is the implicit foundation of all modern deep learning. The Mermaid diagram below highlights key works directly or indirectly influenced by ReLU from 2011 to the present:
```mermaid
flowchart TD
    Sigmoid[Sigmoid 1943<br/>saturating]
    Tanh[Tanh 1980s<br/>saturating]
    HardTanh[Hard Tanh Collobert 2004<br/>piecewise linear]
    Softplus[Softplus Dugas 2001<br/>smooth max]
    NairHinton[Nair Hinton 2010<br/>ReLU in RBM]
    Hahnloser[Hahnloser 2000<br/>half-wave rectification]
    SparseCoding[Olshausen Field 1996<br/>sparse coding]
    ReLU[ReLU Glorot 2011<br/>max 0 x in deep MLP]
    Sigmoid -.replaced by.-> ReLU
    Tanh -.replaced by.-> ReLU
    HardTanh --> ReLU
    Softplus -.smooth approx.-> ReLU
    NairHinton --> ReLU
    Hahnloser --> ReLU
    SparseCoding -.biological motivation.-> ReLU
    AlexNet[AlexNet Krizhevsky 2012<br/>ReLU + GPU + ImageNet]
    DropOut[Dropout Srivastava 2014<br/>synergy with ReLU]
    BatchNorm[BatchNorm Ioffe 2015<br/>ReLU best friend]
    HeInit[He init He 2015<br/>for ReLU networks]
    ResNet[ResNet He 2015<br/>ReLU + skip connection]
    ReLU --> AlexNet
    ReLU --> DropOut
    ReLU --> BatchNorm
    ReLU --> HeInit
    ReLU --> ResNet
    LeakyReLU[Leaky ReLU Maas 2013<br/>fix dying neurons]
    PReLU[PReLU He 2015<br/>learnable alpha]
    ELU[ELU Clevert 2015<br/>negative exp]
    SELU[SELU Klambauer 2017<br/>self-normalizing]
    Maxout[Maxout Goodfellow 2013<br/>generalize piecewise linear]
    ReLU --> LeakyReLU
    ReLU --> PReLU
    ReLU --> ELU
    ReLU --> SELU
    ReLU --> Maxout
    GELU[GELU Hendrycks 2016<br/>Gaussian CDF gating]
    Swish[Swish SiLU Ramachandran 2017<br/>self-gating x sigmoid x]
    Mish[Mish Misra 2019<br/>x tanh softplus]
    GLU[GLU Dauphin 2017<br/>gated linear unit]
    GeGLU[GeGLU Shazeer 2020]
    SwiGLU[SwiGLU Shazeer 2020<br/>LLaMA Mistral default]
    ReLU -.smooth variants.-> GELU
    ReLU -.smooth variants.-> Swish
    Swish --> Mish
    GLU --> GeGLU
    GLU --> SwiGLU
    GELU --> SwiGLU
    BERT[BERT 2018<br/>uses GELU]
    GPT[GPT family<br/>uses GELU]
    LLaMA[LLaMA 2023<br/>uses SwiGLU]
    Mistral[Mistral 2023<br/>uses SwiGLU]
    GELU --> BERT
    GELU --> GPT
    SwiGLU --> LLaMA
    SwiGLU --> Mistral
```
Categorized by "sub-lines most affected by ReLU":
1. Direct ReLU variants:
| Descendant | Year | Difference from ReLU |
|---|---|---|
| Leaky ReLU (Maas 2013) | 2013 | negative region \(\alpha x\), prevents dead neurons |
| PReLU (He 2015) | 2015 | \(\alpha\) learnable |
| RReLU (Xu 2015) | 2015 | \(\alpha\) random during training |
| ELU (Clevert 2015) | 2015 | negative region \(\alpha(e^x-1)\) |
| SELU (Klambauer 2017) | 2017 | self-normalizing (deep MLPs without BN) |
| Maxout (Goodfellow 2013) | 2013 | generalized to piecewise linear |
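The formulas in this table can be sketched as plain NumPy functions — a minimal sketch, where the `alpha` defaults are common conventions rather than values prescribed by the original papers:

```python
import numpy as np

# Direct ReLU variants from the table above; alpha defaults are common
# conventions, not tuned values from the original papers.
def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Maas 2013: a small negative slope keeps gradients alive for x < 0
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # Clevert 2015: smooth exponential negative region, saturating at -alpha
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def selu(x, alpha=1.6732632423543772, scale=1.0507009873554805):
    # Klambauer 2017: fixed constants chosen so activations self-normalize
    return scale * np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))        # negatives clamp to 0
print(leaky_relu(x))  # negatives scaled by alpha
```

PReLU is `leaky_relu` with `alpha` promoted to a learned parameter; Maxout replaces the fixed hinge with a max over several learned affine maps.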
2. Smooth variants / self-gating:
| Descendant | Year | Formula | Use |
|---|---|---|---|
| GELU (Hendrycks 2016) | 2016 | \(x \cdot \Phi(x)\) | BERT / GPT |
| Swish / SiLU (Ramachandran 2017) | 2017 | \(x \cdot \sigma(x)\) | EfficientNet |
| Mish (Misra 2019) | 2019 | \(x \cdot \tanh(\text{softplus}(x))\) | some CV |
| GLU family (GLU/GeGLU/SwiGLU/ReGLU) | 2017+ | \(\sigma(W_1 x) \odot (W_2 x)\) | LLaMA-line FFN |
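A sketch of the smooth / gated formulas above. The tanh-based GELU approximation is the form popularized by BERT-style code; the SwiGLU FFN shapes below are illustrative, not taken from any specific model:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def silu(x):
    # Swish / SiLU: the input gates itself through a sigmoid
    return x * sigmoid(x)

def gelu(x):
    # tanh approximation of x * Phi(x)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def swiglu_ffn(x, W_gate, W_up, W_down):
    # LLaMA-style FFN: SiLU(x W_gate) elementwise-times (x W_up), projected back
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

rng = np.random.default_rng(0)
d_model, d_ff = 8, 16                      # illustrative sizes
x = rng.standard_normal((2, d_model))
W_gate = rng.standard_normal((d_model, d_ff))
W_up = rng.standard_normal((d_model, d_ff))
W_down = rng.standard_normal((d_ff, d_model))
print(swiglu_ffn(x, W_gate, W_up, W_down).shape)  # (2, 8)
```

Note that all three keep the positive region nearly linear — the property inherited from ReLU.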
3. Network architectures catalyzed by ReLU:
| Architecture | Year | Role of ReLU |
|---|---|---|
| AlexNet (Krizhevsky 2012) | 2012 | first ImageNet SOTA with ReLU + GPU |
| VGG (Simonyan 2014) | 2014 | 16-19 layer CNN, ReLU default |
| GoogLeNet / Inception (Szegedy 2014) | 2014 | all ReLU inside Inception block |
| ResNet (He 2015) | 2015 | ReLU + skip connection trains 152 layers |
| DenseNet (Huang 2016) | 2016 | ReLU + dense connections |
| EfficientNet (Tan 2019) | 2019 | ReLU variant (Swish) + NAS |
4. Engineering advances ReLU drove:
| Advance | Year | Relation to ReLU |
|---|---|---|
| Dropout (Srivastava 2014) | 2014 | synergy with ReLU (dropout makes ReLU sparsity more random) |
| Batch Normalization (Ioffe 2015) | 2015 | BN puts ReLU at optimal operating point (50% active) |
| He Initialization (He 2015) | 2015 | weight init designed specifically for ReLU |
| Layer Normalization (Ba 2016) | 2016 | LN + GELU is Transformer standard |
| Swish via NAS (Ramachandran 2017) | 2017 | RL-searched Swish, proves ReLU is not the endpoint |
Misreadings — how posterity has misread ReLU¶
Misreading 1: treating ReLU as just a "simple activation function" — severe underestimation. ReLU is not just a formula; it's a paradigm revolution from shallow to deep. Without ReLU, AlexNet 2012 couldn't have trained; without AlexNet, the deep learning revolution might have been delayed by 5 years.
Misreading 2: thinking ReLU "solved" vanishing gradients — partly right. ReLU solved activation-function-induced vanishing gradients, but weight-induced vanishing gradients still exist — this is what He initialization and BatchNorm later solved. ReLU + He init + BN combined truly solved deep training.
Misreading 3: thinking "dying ReLU" is fatal — partly right. Dead neurons exist, but in practice:
- correct initialization (He init) + a small learning rate + BN keep dead neurons under 5%
- dead neurons can be viewed as implicit pruning — the network auto-finds sparse sub-structures
- Leaky ReLU / PReLU and friends eliminate dead neurons entirely (though the practical gain is small)
Misreading 4: thinking ReLU has been replaced by GELU / Swish — wrong. As of 2026:
- CNN line (ResNet / EfficientNet / ConvNeXt): ReLU or SiLU
- Transformer line (BERT / GPT): GELU
- LLM line (LLaMA / Mistral): SwiGLU
- yet ReLU remains the baseline and the first choice for teaching — simple, fast, stable
Misreading 5: thinking ReLU was first proposed by Glorot 2011 — partly wrong. Hahnloser 2000 used max(0, x) model in neuroscience; Nair & Hinton 2010 used ReLU in RBM; Glorot 2011's contribution is pushing ReLU to supervised deep MLPs and systematically proving it beats sigmoid.
Misreading 6: attributing ReLU's success to "sparsity" — partly wrong. Sparsity is a byproduct, not the core. The core is "gradient = 1 in the positive region", which is the key to solving vanishing gradients. Sparsity is icing on the cake.
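Misreading 6's "gradient = 1" point can be seen numerically: stacking sigmoid layers multiplies the gradient by \(\sigma'(z) \le 0.25\) per layer even in the best case (\(z=0\)), while ReLU's active path multiplies by exactly 1. A toy depth-20 chain, ignoring the weight matrices (which contribute their own factors either way):

```python
# Activation-only part of the vanishing-gradient story, 20 layers deep.
depth = 20
best_sigmoid_factor = 0.25 ** depth  # sigma'(0) = 0.25 is sigmoid's maximum slope
relu_factor = 1.0 ** depth           # derivative is 1 wherever the unit is active
print(best_sigmoid_factor < 1e-12)   # True: the signal is effectively gone
print(relu_factor)                   # 1.0: unattenuated
```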
Misreading 7: thinking ReLU is the only form of piecewise linearity — wrong. Maxout generalizes to \(\max(W_1 x, W_2 x, ..., W_k x)\); the GLU family combines piecewise linearity with gating. ReLU is the simplest form of piecewise linearity, but not the only form.
Modern Perspective¶
15 years later, which assumptions in the ReLU paper have been falsified?¶
Written in 2011, the ReLU paper contains a series of assumptions about neural network training. Today (2026), 15 years later, some assumptions still hold, others have been falsified:
| Assumption / claim in paper | Evidence in 2011 | Status in 2026 | Verification |
|---|---|---|---|
| ReLU completely eliminates vanishing gradient | per-layer gradient = 1 | weight-induced vanishing still exists, needs He init + BN | partly falsified |
| Sparsity is the key to ReLU's success | Figure 3 + biological motivation | sparsity is a byproduct; "gradient=1" is the core | partly falsified |
| ReLU makes unsupervised pretraining unnecessary | MNIST: ReLU+no-pretrain = sigmoid+pretrain | fully holds — almost no one uses unsupervised pretraining post-2012 | fully holds |
| Dying ReLU is a fatal flaw | paper Section 4.4 admits 30-50% dead | He init + BN keep it under 5%; can even be viewed as implicit pruning | partly falsified |
| ReLU's speedup comes mainly from formula simplicity | paper reports 4× per-epoch speedup | actual speedup multiplies ~3× faster convergence by ~4× cheaper epochs — roughly an order of magnitude overall | partly confirmed |
| ReLU works for all tasks | both MLP and CNN verified | generative tasks (VAE decoders) still prefer tanh; LLMs use SwiGLU | partly falsified |
| 50% sparsity is the optimal operating point | paper Figure 3 shows 50-75% optimal | BN+ReLU networks typically have 30-50% sparsity | largely holds |
Overall: the core thesis of the ReLU paper ("replace sigmoid with ReLU, no pretraining needed, deep networks train directly") has stood the test of 15 years, but the explanation of "why ReLU works" (sparsity + neuroscience motivation) has been partly revised by later understanding — the core is piecewise linear + non-saturating positive region.
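The ~50% sparsity row is largely the symmetry of zero-mean pre-activations; a sketch with random Gaussians standing in for a trained layer's pre-activations:

```python
import numpy as np

# For roughly zero-mean pre-activations, ReLU zeroes about half the units.
rng = np.random.default_rng(1)
pre_act = rng.standard_normal((1000, 256))   # stand-in for a layer's pre-activations
h = np.maximum(0.0, pre_act)
sparsity = float(np.mean(h == 0.0))
print(round(sparsity, 3))  # near 0.5 by symmetry; BN shifts this in trained nets
```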
The "ghost" of ReLU in modern deep learning¶
Although 2026 LLMs and SOTA CV models no longer use raw ReLU, the spirit of ReLU is everywhere:
1. The "gradient = 1 in the positive region" design principle is fully inherited:
    - GELU, Swish, Mish, SwiGLU all preserve near-linearity in the positive region
    - residual connections (ResNet) borrow ReLU's "no-decay propagation" idea
    - LayerNorm + ReLU/GELU is standard in all Transformers
2. Piecewise linearity remains mainstream:
    - Maxout, ReLU, Leaky ReLU, PReLU are strictly piecewise linear
    - GELU and Swish are smooth approximations of piecewise linearity
    - theoretical analysis of piecewise-linear networks (Pascanu 2014, Montufar 2014) is still active
3. Sparsity ideas live on in MoE:
    - the Mixture-of-Experts (MoE) router is another form of hard sparsity — only the top-k experts are activated
    - LLaMA-MoE / Mixtral-8x7B push sparsity to the model level
    - ReLU's ~50% sparsity inspired MoE's top-2 routing
4. ReLU is still the first choice for teaching and baselines:
    - almost always the first activation function introduced in deep learning textbooks
    - the default baseline activation in new academic methods
    - in engineering practice (mobile inference, embedded NN), ReLU remains the first choice (int8-quantization friendly)
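The top-k routing mentioned under point 3 can be sketched in a few lines — the shapes and Mixtral-style k=2 are illustrative:

```python
import numpy as np

# Hard sparsity at the expert level: score experts per token, keep only the
# top-k, renormalize the survivors with a softmax. Echoes ReLU's unit-level
# sparsity one level up.
def top_k_routing(logits, k=2):
    idx = np.argsort(logits, axis=-1)[:, -k:]      # indices of the top-k experts
    mask = np.zeros_like(logits, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    masked = np.where(mask, logits, -np.inf)       # drop the rest entirely
    w = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)

logits = np.array([[1.0, 3.0, 2.0, 0.5]])          # one token, four experts
weights = top_k_routing(logits, k=2)
print(weights.round(3))  # experts 1 and 2 share all the weight, the rest get 0
```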
What if the ReLU paper were written today?¶
If Glorot rewrote this paper in 2026, possible changes:
New sections:
1. Synergy with BN/LN — in 2011 the paper could not know that BN (2015) would become ReLU's best partner
2. He initialization vs Xavier init — the paper used Xavier init (not actually optimal for ReLU); He init is the best match
3. Practical impact of dying neurons — 15 years of engineering show far fewer dead neurons than the reported 30-50%
4. Theoretical analysis of piecewise linearity — Montufar 2014's result that the number of linear regions grows as \(O(2^L)\) with depth

Removed / weakened parts:
1. Over-emphasis on sparsity — the Figure 3 analysis focuses on the sparsity ratio and misses the essential role of gradient signals
2. Biological motivation — later shown not to be the key to ReLU's success; the advantages are mainly engineering
3. Softplus comparison — by 2026, Softplus has little practical use

New comparisons that would be introduced:
- systematic comparison of ReLU vs GELU vs Swish vs SwiGLU
- the ReLU-vs-GELU gap on Transformers
- analysis of SwiGLU's advantage in LLMs
Limitations and Outlook¶
Core limitations of ReLU¶
| Limitation | Acknowledged in 2011 paper? | Subsequent solution |
|---|---|---|
| Dying ReLU | partly | Leaky ReLU / PReLU / He init / BN |
| Non-zero-centered output | not | ELU / SELU introduce negatives |
| Non-differentiable at 0 | partly | implementations pick 0 or 1 at \(x=0\); theoretically any subgradient in \([0,1]\) is valid |
| Unbounded positive region | not | BN / LN / weight clipping |
| Poor on generative tasks | not | VAE / GAN decoders use tanh |
| Not suitable for output layer | not | output uses softmax / sigmoid / linear |
| Slightly weaker on Transformers | N/A | GELU / SwiGLU improvements |
| Biological plausibility debate | paper emphasizes | later research questions sparsity's biological explanation |
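The "non-differentiable at 0" row in practice: frameworks simply pick a one-sided value at \(x=0\) (PyTorch, for example, uses 0), and any choice in \([0,1]\) is a valid subgradient. A sketch:

```python
import numpy as np

# ReLU's derivative is 0 for x < 0 and 1 for x > 0; at exactly 0 a framework
# just picks a convention (0 here, matching PyTorch's behaviour).
def relu_grad(x, at_zero=0.0):
    g = (x > 0).astype(float)
    g[x == 0.0] = at_zero
    return g

x = np.array([-1.0, 0.0, 2.0])
print(relu_grad(x))  # [0. 0. 1.]
```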
Future directions¶
1. ReLU in neuromorphic computing: in spiking neural networks, ReLU's "threshold firing" property is natural — SpikeNet / Loihi chips are exploring.
2. Lighter activations: ReLU is already among the lightest non-linear activations (only one comparison), but quantum NNs / reversible networks may further simplify.
3. Adaptive activations: let networks choose activations themselves via NAS or meta-learning (Swish was found by NAS; future may be more dynamic).
4. Renaissance of ReLU in LLM inference optimization:
    - ReLU sparsity is hardware-acceleratable (CSR sparse matmul)
    - "ReLU-fication" of LLMs — swapping LLaMA's SwiGLU back to ReLU for inference acceleration (DEJAVU, PowerInfer)
    - Apple's 2024 ReLU-fied LLaMA runs 3× faster on a Mac Studio
5. Fusion with new architectures:
    - activation choice in Mamba (state-space models)
    - ReLU-style sparse gating in Mixture-of-Depths / early-exit schemes
    - synergy between ReLU sparsity and test-time compute
6. Interpretability via piecewise linearity:
    - every ReLU network computes a piecewise-linear function
    - network decision boundaries can be understood by analyzing its linear regions
    - formal verification tools like Marabou / NNV exploit the piecewise-linear structure
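Point 6 made concrete: a ReLU net is piecewise linear, with one linear piece per realizable on/off pattern of its hidden units. For a tiny 1-input, 3-unit layer (illustrative weights) the pieces can be counted by sweeping the input:

```python
import numpy as np

# Each hidden ReLU contributes one breakpoint (here at x = 0, 1, 2); the
# distinct on/off patterns along the sweep are the network's linear regions.
W1 = np.array([[1.0], [1.0], [1.0]])   # 3 hidden units, 1-d input
b1 = np.array([0.0, -1.0, -2.0])       # breakpoints at x = 0, 1, 2
patterns = set()
for x in np.linspace(-1.0, 3.0, 4001):
    h = W1 @ np.array([x]) + b1
    patterns.add(tuple(h > 0))         # activation pattern of the 3 ReLUs
print(len(patterns))  # 4 patterns -> 4 linear pieces on this interval
```

Verification tools enumerate (or prune) exactly these patterns, which is why the piecewise-linear structure makes ReLU nets unusually analyzable.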
Related Work and Inspirations¶
Papers directly related to ReLU¶
| Paper | Year | Relation to ReLU |
|---|---|---|
| Hahnloser et al. "Digital selection and analogue amplification coexist..." | 2000 | half-wave rectification neuroscience origin |
| Jarrett et al. "What is the best multi-stage architecture..." | 2009 | early empirical evidence for ReLU in CNNs |
| Nair & Hinton "Rectified linear units improve restricted Boltzmann machines" | 2010 | early empirical evidence for ReLU in RBM |
| Krizhevsky et al. "ImageNet classification with deep CNNs" (AlexNet) | 2012 | ReLU's "breakthrough battle" |
| Maas et al. "Rectifier nonlinearities improve neural network acoustic models" (Leaky ReLU) | 2013 | fixes dying ReLU |
| He et al. "Delving deep into rectifiers..." (PReLU + He init) | 2015 | best initialization for ReLU networks |
| Ioffe & Szegedy "Batch normalization..." | 2015 | ReLU's best partner |
| Clevert et al. "Fast and accurate deep network learning by ELUs" | 2015 | smooth negative-region variant |
| Hendrycks & Gimpel "Gaussian Error Linear Units (GELUs)" | 2016 | Transformer-era standard |
| Ramachandran et al. "Searching for activation functions" (Swish) | 2017 | NAS finds Swish |
| Shazeer "GLU variants improve Transformer" | 2020 | LLM-era SwiGLU |
Areas with kindred ideas to ReLU¶
1. Signal processing / half-wave rectification: the diode rectifier in electronic circuits (half-wave rectifier) is the physical version of ReLU, used since the 1950s.
2. Economics / option pricing: European call option payoff \(\max(S - K, 0)\) has the same form as ReLU — the "hockey stick" function in financial engineering.
3. Optimization / convex analysis: \(\max(0, x)\) is the core component of hinge loss, kindred to SVM's \(\max(0, 1-yf(x))\).
4. Neuroscience: V1 simple cell firing rate models, Hodgkin-Huxley threshold firing models can all be viewed as biological prototypes of ReLU.
5. Control theory / differential equations: piecewise linear systems have long been studied in control theory; ReLU networks can be viewed as special piecewise linear controllers.
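Items 2 and 3 above fit in a line each — the call-option payoff and the hinge loss are both \(\max(0,\cdot)\) in disguise (strike and margin values illustrative):

```python
import numpy as np

relu = lambda v: np.maximum(0.0, v)

# Item 2: European call payoff max(S - K, 0) — the "hockey stick"
S, K = np.array([80.0, 100.0, 120.0]), 100.0
call_payoff = relu(S - K)
print(call_payoff)        # payoffs 0, 0, 20

# Item 3: SVM hinge loss max(0, 1 - y * f(x))
y, f = 1.0, np.array([-0.5, 0.5, 2.0])
hinge = relu(1.0 - y * f)
print(hinge)              # losses 1.5, 0.5, 0
```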
Cross-domain research it inspired¶
1. Computer graphics: piecewise linear texture synthesis; ReLU MLP representations in NeRF.
2. Protein structure prediction: ReLU/SiLU activations inside AlphaFold; piecewise linear approximation of protein energy landscapes.
3. Reinforcement learning: DQN's Q-network uses ReLU; ReLU sparsity in policy networks affects exploration behavior.
4. Recommender systems: ReLU FFN in two-tower models; ReLU crosses in DCN-V2.
5. Autonomous driving perception: ReLU in BEVFormer / BEVDet; ReLU gating in multi-view fusion.
Resources¶
Papers and code¶
- Glorot et al. 2011 original (AISTATS): https://proceedings.mlr.press/v15/glorot11a.html
- PyTorch ReLU docs: https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html
- TensorFlow ReLU docs: https://www.tensorflow.org/api_docs/python/tf/nn/relu
- JAX ReLU: https://jax.readthedocs.io/en/latest/_autosummary/jax.nn.relu.html
Important follow-up papers¶
- Nair & Hinton 2010 (ReLU in RBM): https://www.cs.toronto.edu/~hinton/absps/reluICML.pdf
- AlexNet (Krizhevsky 2012): https://papers.nips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b
- He et al. 2015 (PReLU + He init): https://arxiv.org/abs/1502.01852
- GELU (Hendrycks 2016): https://arxiv.org/abs/1606.08415
- Swish (Ramachandran 2017): https://arxiv.org/abs/1710.05941
- GLU Variants Improve Transformer (Shazeer 2020): https://arxiv.org/abs/2002.05202
Courses and tutorials¶
- Stanford CS231n "Neural Networks 1" lecture (covers ReLU vs sigmoid vs tanh)
- Deep Learning Book (Goodfellow et al.) Chapter 6 "Deep Feedforward Networks"
- 3Blue1Brown Neural Networks series (visualizes ReLU)
- Distill.pub "A Visual Exploration of Gaussian Processes" (mentions activation function geometry)
Engineering resources¶
- ReLU-fied LLaMA ("ReLU Strikes Back", Apple 2024)
- DEJAVU: contextual sparsity for LLM inference (uses ReLU sparsity)
- PowerInfer: GPU-CPU hybrid LLM inference (leverages ReLU sparsity)
🌐 Chinese version · 📚 awesome-papers project · CC-BY-NC