What is the central concept of Layer Normalization?

Layer normalization is the methodological pivot: covariate shift, solved differently. It normalizes the summed inputs across all neurons in a layer for a single training case, so there is no mini-batch dependency.
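The computation described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' code: the function name `layer_norm`, the default gain and bias, and the small `eps` added for numerical safety are all assumptions made for the example.

```python
import math

def layer_norm(a, gain=None, bias=None, eps=1e-5):
    """Normalize the summed inputs `a` of one layer for ONE training case.

    `eps` is an assumption added for numerical safety; it is not part of
    the statistics described in the text.
    """
    H = len(a)                                   # number of hidden units
    gain = gain if gain is not None else [1.0] * H
    bias = bias if bias is not None else [0.0] * H
    mu = sum(a) / H                              # mean over hidden units
    sigma = math.sqrt(sum((v - mu) ** 2 for v in a) / H)  # std over hidden units
    return [g * (v - mu) / (sigma + eps) + b for v, g, b in zip(a, gain, bias)]

# A single training case is enough: no other examples are consulted.
a = [1.0, 2.0, 3.0, 4.0]
normalized = layer_norm(a)
```

Because the statistics come from the hidden units of one case, the same call works unchanged at batch size 1 and at test time.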

What is batch normalization, as described in Layer Normalization?

Ioffe & Szegedy 2015. Normalize summed inputs across mini-batch per neuron.
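For contrast, here is a hedged sketch of the per-neuron, across-mini-batch statistics. The helper `batch_norm`, the omission of the gain and bias, and the `eps` term are assumptions for illustration; the sketch covers only the training-time statistics, not running averages.

```python
import math

def batch_norm(batch, eps=1e-5):
    """batch: list of training cases, each a list of summed inputs (one per neuron).

    Statistics are computed per neuron, across the cases in the mini-batch.
    """
    n, H = len(batch), len(batch[0])
    out = [[0.0] * H for _ in range(n)]
    for j in range(H):                            # one neuron at a time
        column = [case[j] for case in batch]      # that neuron across the batch
        mu = sum(column) / n
        sigma = math.sqrt(sum((v - mu) ** 2 for v in column) / n)
        for i in range(n):
            out[i][j] = (batch[i][j] - mu) / (sigma + eps)
    return out

batch = [[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]]
normed = batch_norm(batch)

# With a single case the batch mean equals the input and the variance is zero,
# so every normalized value collapses to 0.
single = batch_norm([[1.0, 10.0]])
```

The degenerate `single` result is one way to see why mini-batch statistics are a poor fit for batch size 1.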

Why does LN suit RNNs while BN does not, according to Layer Normalization?

BN-RNN needs different statistics for different time-steps (length-dependent). LN doesn't.

What is the invariance lens (§5) in Layer Normalization?

Whether the model output stays constant under transformations of weights or data. The analytical lens for comparing normalization methods.

What is the main argument of Layer Normalization?

LN suits RNNs; BN does not. BN-RNN needs different statistics for different time-steps (length-dependent). LN doesn't.

Layer Normalization · Knowledge Graph

Knowledge Graph: Layer Normalization (Ba, Kiros, Hinton, 2016)

Editorial spotlight: the methodological pivot — covariate shift, solved differently

Concepts

layer normalization (importance 5): Normalize summed inputs across all neurons in a layer for a single training case. No mini-batch dependency. Source: §3 Layer normalization. Quote: "we compute the layer normalization statistics over all the hidden units in the same layer".
batch normalization (importance 5): Ioffe & Szegedy 2015. Normalize summed inputs across mini-batch per neuron. Source: §2 Background + Ioffe & Szegedy 2015 citation. Quote: "batch normalization standardizes each summed input using its mean and its standard deviation across the training data".
LN invariance lens (§5) (importance 5): Whether the model output stays constant under transformations of weights or data. The analytical lens for comparing normalization methods. Source: §5 Analysis. Quote: "investigate the invariance properties of different normalization schemes".
weight normalization (importance 4): Salimans & Kingma 2016. Normalize by L2 norm of incoming weights. Source: §4 Related work + Salimans & Kingma 2016 citation.
covariate shift (importance 4): Distribution of inputs to each layer changes during training. The motivating problem for normalization methods. Source: §2 Background. Quote: "Batch normalization was proposed to reduce such undesirable 'covariate shift'".
Riemannian metric (§5.2.1) (importance 3): Geometry of parameter space under KL divergence. Used to analyze learning dynamics. Source: §5.2.1 Riemannian metric.
LN gain + bias parameters (importance 3): Learnable per-neuron parameters applied AFTER normalization, BEFORE non-linearity. Source: §3 + Eq 4. Quote: "each neuron its own adaptive bias and gain".
LN implicit learning rate reduction (importance 3): Weight norm growth → effective LR shrinks. The 'early stopping' effect of normalization. Source: §5.2.2 paragraph 'Implicit learning rate reduction'.
online learning (LN-compatible regime) (importance 3): Batch size = 1. BN can't run here; LN can. Source: §1 Introduction. Quote: "batch normalization cannot be applied to online learning tasks".
Fisher information matrix (importance 2): Captures curvature of parameter manifold. Block-diagonal approximation used here. Source: §5.2.2 + Eq 9.
generalized linear model (LN analysis primitive) (importance 2): Used as the analytical primitive for geometry analysis. Block per neuron. Source: §5.2.2 The geometry of normalized generalized linear models.
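The implicit learning rate reduction listed above can be illustrated numerically. The sketch below is not the paper's Fisher-information analysis; it only checks, with finite differences, that when a layer-normalized output is unchanged by scaling the weight matrix by δ, the gradient with respect to the weights shrinks by 1/δ. The squared-error loss, the helper names, and all numeric values are assumptions chosen for the example.

```python
import math

def layer_norm(a):
    H = len(a)
    mu = sum(a) / H
    sigma = math.sqrt(sum((v - mu) ** 2 for v in a) / H)
    return [(v - mu) / sigma for v in a]

def loss(W, x, target):
    # Summed inputs a = W x, then layer normalization, then squared error.
    a = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    h = layer_norm(a)
    return sum((hi - ti) ** 2 for hi, ti in zip(h, target))

def grad_norm(W, x, target, step=1e-5):
    """Euclidean norm of the gradient w.r.t. W, via central differences."""
    total = 0.0
    for i in range(len(W)):
        for j in range(len(W[0])):
            original = W[i][j]
            W[i][j] = original + step
            up = loss(W, x, target)
            W[i][j] = original - step
            down = loss(W, x, target)
            W[i][j] = original                    # restore the weight
            total += ((up - down) / (2 * step)) ** 2
    return math.sqrt(total)

W = [[0.5, -1.0, 0.3], [1.5, 0.2, -0.7], [-0.4, 0.9, 1.1]]
x = [1.0, 2.0, -1.0]
target = [1.0, 0.0, -1.0]
delta = 10.0
W_big = [[delta * w for w in row] for row in W]   # same direction, larger norm

g_small = grad_norm(W, x, target)
g_big = grad_norm(W_big, x, target)
ratio = g_small / g_big                           # close to delta
```

A fixed optimizer step therefore moves a large-norm weight matrix proportionally less, which is the sense in which the effective learning rate shrinks.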

Claims

LN suits RNNs; BN does not (importance 5): BN-RNN needs different statistics for different time-steps (length-dependent). LN doesn't. Source: §3.1 Layer normalized recurrent neural networks. Quote: "Layer normalization does not have such problem".
LN has no minibatch dependency (importance 5): Mean and variance computed over hidden units within a layer for one training case; minibatch size doesn't matter. Source: §3 + Abstract. Quote: "does not impose any constraint on the size of a mini-batch".
LN: same computation train and test (importance 4): BN needs running averages at test; LN doesn't. Source: Abstract. Quote: "layer normalization performs exactly the same computation at training and test times".
LN invariant: weight matrix re-scaling (Eq 6) (importance 4): Scaling the whole weight matrix by δ scales the summed inputs, μ, and σ by δ, so the factor cancels and the model output is unchanged. (Not invariant to re-scaling an individual weight vector; BN and WN are.) Source: §5.1 + Eq 6.
LN invariant: weight matrix re-centering (importance 4): Adding the same constant vector to every neuron's incoming weights shifts all summed inputs in the layer equally, which μ removes, so the model output is unchanged under LN. Source: §5.1 + Table 1.
LN invariant: single training-case re-scaling (Eq 7) (importance 4): Re-scaling one data point's features by δ scales that case's summed inputs, μ, and σ by δ, which cancels. BN does NOT have this property. Source: §5.1 + Eq 7.
LN stabilizes RNN hidden-state dynamics (importance 4): Invariance to re-scaling all summed inputs to a layer makes hidden-to-hidden dynamics much more stable, mitigating exploding or vanishing gradients across time-steps. Source: §3.1. Quote: "more stable hidden-to-hidden dynamics".
BN cannot do online learning (importance 3): Requires mini-batch statistics; doesn't work at batch size 1 or in distributed settings with small batches. Source: §1 Introduction.
Normalization → implicit early stopping (§5.2.2) (importance 3): Larger weight vector norm → harder to change orientation. Acts like implicit early stopping. Source: §5.2.2. Quote: "implicit 'early stopping' effect on the weight vectors".
LN is NOT a re-parameterization (§4) (importance 3): Unlike weight norm and BN-with-expected-statistics. So LN has different invariance properties. Source: §4 Related work. Quote: "Our proposed layer normalization method, however, is not a re-parameterization".
BN invariant: single weight-vector re-scaling (importance 2): Single weight vector scaled by δ; μ and σ scale too. Cancels. Source: §5.1 Table 1.
BN NOT invariant: weight re-centering (importance 2): Shifting all incoming weights by a vector γ adds γᵀx to each summed input. That term varies across training cases, so the batch mean does not absorb it. Source: §5.1 Table 1.
WN invariant: weight-vector re-scaling only (importance 2): Not invariant to data re-centering or single-case re-scaling. Source: §5.1 Table 1.
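Several of the invariance claims above can be checked numerically. The sketch below is an illustrative test, not the paper's derivation: it omits gain, bias, the non-linearity, and any epsilon so that the invariances hold up to floating-point error. All helper names and numeric values are assumptions.

```python
import math

def normalize(values):
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / n)
    return [(v - mu) / sigma for v in values]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def layer_norm_outputs(W, x):
    # Statistics over the hidden units of one training case.
    return normalize(matvec(W, x))

def batch_norm_outputs(W, batch):
    # Statistics per neuron, over the cases in the mini-batch.
    summed = [matvec(W, case) for case in batch]
    columns = [normalize([row[j] for row in summed]) for j in range(len(W))]
    return [[columns[j][i] for j in range(len(W))] for i in range(len(batch))]

def close(u, v, tol=1e-9):
    return all(abs(a - b) < tol for a, b in zip(u, v))

W = [[0.5, -1.0, 0.3], [1.5, 0.2, -0.7], [-0.4, 0.9, 1.1]]
x = [1.0, 2.0, -1.0]
delta, gamma = 3.0, [0.2, -0.5, 0.7]
base = layer_norm_outputs(W, x)

# Weight matrix re-scaling and re-centering: every row becomes delta * w + gamma.
W_shifted = [[delta * w + g for w, g in zip(row, gamma)] for row in W]
ln_weight_invariant = close(base, layer_norm_outputs(W_shifted, x))

# Single training-case re-scaling: x becomes delta * x.
ln_case_invariant = close(base, layer_norm_outputs(W, [delta * xi for xi in x]))

# Under BN, re-scaling one case changes the batch statistics, so that case's
# normalized output changes.
batch = [x, [0.5, -1.0, 2.0], [-1.5, 0.3, 0.8]]
rescaled = [[delta * xi for xi in batch[0]]] + batch[1:]
bn_case_invariant = close(batch_norm_outputs(W, batch)[0],
                          batch_norm_outputs(W, rescaled)[0])
```

With these values, `ln_weight_invariant` and `ln_case_invariant` evaluate to `True` and `bn_case_invariant` to `False`, matching the LN and BN entries above.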

Empirical results

MSCOCO R@1 48.5 (OE+LN) (importance 4): Order-Embedding + LN beats OE alone (46.6) on caption retrieval. Sym baseline: 45.4. Source: §6.1 + Table 2. Quote: "OE + LN 48.5".
LN reduces RNN training time (Abstract) (importance 4): Substantially reduced training time vs prior published techniques (paper's headline claim). Source: Abstract. Quote: "layer normalization can substantially reduce the training time".
LN-RNN: stable hidden dynamics (§3.1) (importance 4): Re-scaling invariance is reported to give much more stable hidden-to-hidden dynamics, which mitigates gradient instability. Source: §3.1. Quote: "more stable hidden-to-hidden dynamics".
MSCOCO R@5 80.6 (OE+LN) (importance 3): vs OE 79.3. Source: §6.1 + Table 2.
MSCOCO R@10 89.8 (OE+LN) (importance 3): vs OE 89.1. Source: §6.1 + Table 2.
MSCOCO mean rank 5.1 (OE+LN) (importance 3): Lower is better. vs OE 5.2, Sym 5.8. Source: §6.1 + Table 2.
Image Retrieval R@1 38.9 (OE+LN) (importance 3): vs OE 37.8, Sym 36.3. Source: §6.1 + Table 2.
Attentive Reader + LN gains (CNN/DM) (importance 3): Question answering on CNN/Daily Mail. LN beats BN-LSTM and BN-everywhere variants. Source: §6 + Figure 2.

Methods

LN formula (Eq 3) (importance 5): μ and σ over all H hidden units in a layer, for a single training case. Source: §3 Eq 3.
LN-RNN formula (Eq 4) (importance 5): Apply LN to recurrent layer with one shared gain+bias across time-steps. Source: §3.1 Eq 4.
BN formula (Eq 2) (importance 4): ā = g/σ (a − μ); μ and σ over mini-batch. Source: §2 Eq 2.
WN formula (μ=0, σ=‖w‖₂) (importance 3): μ = 0, σ = ‖w‖₂. L2 norm of incoming weights. Source: §5.1 + Salimans Kingma 2016 ref.
BN-RNN (Cooijmans 2016 protocol) (importance 3): Apply BN per time-step. Best with gain init 0.1. Doesn't generalize to sequences longer than training. Source: §4 + Cooijmans 2016 citation. Quote: "initializing the gain parameter in the recurrent batch normalization layer to 0.1".
GLM Fisher analysis (§5.2.2) (importance 2): Use generalized linear models with block-diagonal Fisher to analyze normalization geometry. Source: §5.2.2.
path-normalized SGD (Neyshabur 2015) (importance 2): Neyshabur 2015. Re-parameterization in ReLU networks. Source: §4 + Neyshabur 2015 citation.
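The LN-RNN entry above, with one gain and bias shared across time-steps, might look like the following in code. This is a rough sketch under stated assumptions: `tanh` as the non-linearity, a zero initial hidden state, and the names `ln_rnn`, `W_hh`, and `W_xh` are illustrative choices, and the weights are arbitrary.

```python
import math

def layer_norm(a, gain, bias):
    H = len(a)
    mu = sum(a) / H
    sigma = math.sqrt(sum((v - mu) ** 2 for v in a) / H)
    return [g * (v - mu) / sigma + b for v, g, b in zip(a, gain, bias)]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def ln_rnn(inputs, W_hh, W_xh, gain, bias):
    """Run a layer-normalized RNN over a sequence of any length."""
    h = [0.0] * len(W_hh)                 # assumed zero initial state
    states = []
    for x_t in inputs:
        # Summed inputs: recurrent term plus input term.
        a_t = [r + i for r, i in zip(matvec(W_hh, h), matvec(W_xh, x_t))]
        # Statistics come from the current step's hidden units only, and the
        # same gain and bias are reused at every step.
        h = [math.tanh(v) for v in layer_norm(a_t, gain, bias)]
        states.append(h)
    return states

W_hh = [[0.1, -0.2, 0.3], [0.4, 0.1, -0.5], [-0.3, 0.2, 0.2]]
W_xh = [[1.0, 0.5], [-0.5, 1.0], [0.3, -0.8]]
gain, bias = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]

long_inputs = [[1.0, 0.5], [0.2, -1.0], [-0.7, 0.3], [0.9, 0.9],
               [-0.1, 0.4], [0.6, -0.6], [0.0, 1.0]]
short_states = ln_rnn(long_inputs[:3], W_hh, W_xh, gain, bias)
long_states = ln_rnn(long_inputs, W_hh, W_xh, gain, bias)
```

Nothing in the loop is indexed by time-step, so a sequence longer than any seen before needs no extra statistics, unlike the per-time-step BN protocol listed above.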

Entities

Geoffrey Hinton (LN senior author) (importance 3): Senior author. Toronto + Google. Source: title page.
Ioffe & Szegedy 2015 (BN paper) (importance 3): Batch Normalization paper. The method LN is responding to. Source: §1 + §2 citation.
Jimmy Lei Ba (LN lead author) (importance 2): Lead author. University of Toronto. Source: title page.
Jamie Kiros (LN co-author) (importance 2): University of Toronto. Co-author. Source: title page.
Salimans & Kingma 2016 (WN paper) (importance 2): Weight Normalization paper. Source: §4 Related work citation.
Cooijmans 2016 (BN-RNN paper) (importance 2): Recurrent batch normalization with per-time-step statistics. Source: §4 Related work citation.
Sutskever 2014 (seq2seq) (importance 2): Sequence-to-sequence motivation for RNN focus. Source: §3.1 citation.
Vendrov 2016 (Order-Embedding) (importance 2): Base model that LN is layered onto in §6.1. Source: §6.1 citation.
MSCOCO dataset (Lin 2014) (importance 2): Image-caption dataset (Lin 2014). Used in §6.1. Source: §6.1 (Lin 2014).
University of Toronto (LN affiliation) (importance 1): Authors' institution. Source: title page (affiliation).
Laurent 2015 (prior BN-RNN) (importance 1): Prior BN-RNN work. Source: §4 Related work citation.
Amodei 2015 (Deep Speech BN-RNN) (importance 1): Deep Speech. Prior BN-RNN application. Source: §4 Related work citation.
Neyshabur 2015 (path-norm SGD) (importance 1): Path-normalized SGD. Source: §4 Related work citation.
Cho 2014 (GRU) (importance 1): Recurrent unit used in order-embedding experiments. Source: §6.1 citation.
Simonyan & Zisserman 2015 (VGG) (importance 1): Pre-trained image encoder. Source: §6.1 citation.
Amari 1998 (info geometry) (importance 1): Foundational Riemannian metric on parameter manifolds. Source: §5.2.1 citation.
Krizhevsky 2012 (AlexNet) (importance 1): Motivates SGD scale-up era. Source: §1 Introduction citation.
Hinton 2012 (DNN speech) (importance 1): Motivates SGD scale-up in speech processing. Source: §1 Introduction citation.
Dean 2012 (DistBelief) (importance 1): Distributed training. Where small minibatches become a problem. Source: §1 Introduction citation.
Theano framework (importance 1): Framework used for experiments. Source: §6.1 footnote.
CNN/Daily Mail QA dataset (importance 1): Question-answering dataset for attentive reader. Source: §6 attentive reader section.
MNIST dataset (importance 1): Classification baseline in §6. Source: §6 introduction.
arXiv:1607.06450 (LN paper id) (importance 1): July 2016. Workshop on Optimizing the Optimizers, NeurIPS 2016. Source: arxiv:1607.06450 + cover page.

Relations

covariate shift motivates batch normalization
covariate shift motivates layer normalization
batch normalization precedes layer normalization
batch normalization precedes weight normalization
batch normalization contradicts layer normalization
layer normalization exemplifies LN invariance lens (§5)
batch normalization exemplifies LN invariance lens (§5)
weight normalization exemplifies LN invariance lens (§5)
Riemannian metric (§5.2.1) enables Fisher information matrix
Fisher information matrix exemplifies generalized linear model (LN analysis primitive)
generalized linear model (LN analysis primitive) supports layer normalization
LN gain + bias parameters enables layer normalization
LN gain + bias parameters enables batch normalization
LN implicit learning rate reduction exemplifies layer normalization
online learning (LN-compatible regime) supports layer normalization
online learning (LN-compatible regime) contradicts batch normalization
LN suits RNNs; BN does not supports layer normalization
LN suits RNNs; BN does not contradicts batch normalization
LN has no minibatch dependency supports layer normalization
LN has no minibatch dependency supports BN cannot do online learning
LN: same computation train and test supports layer normalization
LN invariant: weight matrix re-scaling (Eq 6) exemplifies LN invariance lens (§5)
LN invariant: weight matrix re-centering exemplifies LN invariance lens (§5)
LN invariant: single training-case re-scaling (Eq 7) exemplifies LN invariance lens (§5)
LN invariant: single training-case re-scaling (Eq 7) contradicts batch normalization
BN cannot do online learning refutes batch normalization
LN stabilizes RNN hidden-state dynamics supports layer normalization
Normalization → implicit early stopping (§5.2.2) supports LN implicit learning rate reduction
LN is NOT a re-parameterization (§4) supports layer normalization
LN is NOT a re-parameterization (§4) contradicts weight normalization
BN invariant: single weight-vector re-scaling exemplifies batch normalization
BN NOT invariant: weight re-centering exemplifies batch normalization
WN invariant: weight-vector re-scaling only exemplifies weight normalization
LN invariant: weight matrix re-scaling (Eq 6) contradicts BN invariant: single weight-vector re-scaling
LN invariant: weight matrix re-centering contradicts BN NOT invariant: weight re-centering
MSCOCO R@1 48.5 (OE+LN) evidences LN suits RNNs; BN does not
MSCOCO R@5 80.6 (OE+LN) evidences LN suits RNNs; BN does not
MSCOCO R@10 89.8 (OE+LN) evidences LN suits RNNs; BN does not
MSCOCO mean rank 5.1 (OE+LN) evidences LN suits RNNs; BN does not
Image Retrieval R@1 38.9 (OE+LN) evidences LN suits RNNs; BN does not
LN reduces RNN training time (Abstract) evidences layer normalization
Attentive Reader + LN gains (CNN/DM) evidences LN suits RNNs; BN does not
LN-RNN: stable hidden dynamics (§3.1) evidences LN stabilizes RNN hidden-state dynamics
MSCOCO R@1 48.5 (OE+LN) exemplifies MSCOCO dataset (Lin 2014)
MSCOCO R@5 80.6 (OE+LN) exemplifies MSCOCO dataset (Lin 2014)
MSCOCO R@10 89.8 (OE+LN) exemplifies MSCOCO dataset (Lin 2014)
MSCOCO mean rank 5.1 (OE+LN) exemplifies MSCOCO dataset (Lin 2014)
Image Retrieval R@1 38.9 (OE+LN) exemplifies MSCOCO dataset (Lin 2014)
Attentive Reader + LN gains (CNN/DM) exemplifies CNN/Daily Mail QA dataset
BN formula (Eq 2) enables batch normalization
LN formula (Eq 3) enables layer normalization
LN-RNN formula (Eq 4) enables layer normalization
LN-RNN formula (Eq 4) evidences LN suits RNNs; BN does not
WN formula (μ=0, σ=‖w‖₂) enables weight normalization
BN-RNN (Cooijmans 2016 protocol) exemplifies batch normalization
GLM Fisher analysis (§5.2.2) evidences LN implicit learning rate reduction
path-normalized SGD (Neyshabur 2015) exemplifies weight normalization
LN formula (Eq 3) contradicts BN formula (Eq 2)
LN-RNN formula (Eq 4) contradicts BN-RNN (Cooijmans 2016 protocol)
Jimmy Lei Ba (LN lead author) cites arXiv:1607.06450 (LN paper id)
Jamie Kiros (LN co-author) cites arXiv:1607.06450 (LN paper id)
Geoffrey Hinton (LN senior author) cites arXiv:1607.06450 (LN paper id)
Jimmy Lei Ba (LN lead author) cites University of Toronto (LN affiliation)
Jamie Kiros (LN co-author) cites University of Toronto (LN affiliation)
Geoffrey Hinton (LN senior author) cites University of Toronto (LN affiliation)
Ioffe & Szegedy 2015 (BN paper) motivates batch normalization
Salimans & Kingma 2016 (WN paper) motivates weight normalization
Cooijmans 2016 (BN-RNN paper) motivates BN-RNN (Cooijmans 2016 protocol)
Laurent 2015 (prior BN-RNN) precedes BN-RNN (Cooijmans 2016 protocol)
Amodei 2015 (Deep Speech BN-RNN) precedes BN-RNN (Cooijmans 2016 protocol)
Neyshabur 2015 (path-norm SGD) motivates path-normalized SGD (Neyshabur 2015)
Sutskever 2014 (seq2seq) motivates online learning (LN-compatible regime)
Sutskever 2014 (seq2seq) motivates LN-RNN formula (Eq 4)
Vendrov 2016 (Order-Embedding) enables MSCOCO R@1 48.5 (OE+LN)
Vendrov 2016 (Order-Embedding) cites MSCOCO dataset (Lin 2014)
Cho 2014 (GRU) supports LN-RNN formula (Eq 4)
Simonyan & Zisserman 2015 (VGG) supports MSCOCO dataset (Lin 2014)
Amari 1998 (info geometry) motivates Riemannian metric (§5.2.1)
Krizhevsky 2012 (AlexNet) precedes covariate shift
Hinton 2012 (DNN speech) precedes covariate shift
Dean 2012 (DistBelief) motivates online learning (LN-compatible regime)
MSCOCO dataset (Lin 2014) exemplifies Vendrov 2016 (Order-Embedding)
Theano framework enables MSCOCO R@1 48.5 (OE+LN)
