How to use this graph
✗
not a citable source
Do not quote the graph as an authority. Edge labels and importance scores are interpretive judgments by the generating agent. Any claim worth citing must be traced back to the original paper.
Reliability note: Headline structure and importance-5 nodes are stable across runs. Mid-tier nodes (importance 2–3) and edge type distinctions are interpretive and may differ between runs. Click any node to see its source citation; nodes marked "training memory" or "inferred" were not directly verified against the source document.
Editorial spotlight: the methodological pivot — covariate shift, solved differently
Concepts
layer normalization (importance 5): Normalize summed inputs across all neurons in a layer for a single training case. No mini-batch dependency. Source: §3 Layer normalization. Quote: "we compute the layer normalization statistics over all the hidden units in the same layer".
batch normalization (importance 5): Ioffe & Szegedy 2015. Normalize summed inputs across mini-batch per neuron. Source: §2 Background + Ioffe & Szegedy 2015 citation. Quote: "batch normalization standardizes each summed input using its mean and its standard deviation across the training data".
LN invariance lens (§5) (importance 5): Whether the model output stays constant under transformations of weights or data. The analytical lens for comparing normalization methods. Source: §5 Analysis. Quote: "investigate the invariance properties of different normalization schemes".
weight normalization (importance 4): Salimans & Kingma 2016. Normalize by L2 norm of incoming weights. Source: §4 Related work + Salimans & Kingma 2016 citation.
covariate shift (importance 4): Distribution of inputs to each layer changes during training. The motivating problem for normalization methods. Source: §2 Background. Quote: "Batch normalization was proposed to reduce such undesirable 'covariate shift'".
Riemannian metric (§5.2.1) (importance 3): Geometry of parameter space under KL divergence. Used to analyze learning dynamics. Source: §5.2.1 Riemannian metric.
LN gain + bias parameters (importance 3): Learnable per-neuron parameters applied AFTER normalization, BEFORE non-linearity. Source: §3 + Eq 4. Quote: "each neuron its own adaptive bias and gain".
LN implicit learning rate reduction (importance 3): Weight norm growth → effective LR shrinks. The 'early stopping' effect of normalization. Source: §5.2.2 paragraph 'Implicit learning rate reduction'.
online learning (LN-compatible regime) (importance 3): Batch size = 1. BN can't run here; LN can. Source: §1 Introduction. Quote: "batch normalization cannot be applied to online learning tasks".
Fisher information matrix (importance 2): Captures curvature of parameter manifold. Block-diagonal approximation used here. Source: §5.2.2 + Eq 9.
generalized linear model (LN analysis primitive) (importance 2): Used as the analytical primitive for geometry analysis. Block per neuron. Source: §5.2.2 The geometry of normalized generalized linear models.
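The difference between the layer normalization and batch normalization nodes above comes down to which axis the mean and standard deviation are taken over. The following is a minimal sketch of that contrast in plain Python, not the paper's implementation: the function names, the list-of-lists data layout, and the small `eps` stabilizer are assumptions made for illustration. Gain and bias are applied after normalization, as the gain + bias node describes; the non-linearity is omitted.

```python
import math

def layer_norm(a, gain=None, bias=None, eps=1e-5):
    """Normalize the summed inputs `a` of ONE training case across the
    hidden units of a layer, then apply per-neuron gain and bias.
    `eps` is an illustrative stabilizer (an assumption, not from the source)."""
    h = len(a)
    mu = sum(a) / h                                       # mean over hidden units
    sigma = math.sqrt(sum((v - mu) ** 2 for v in a) / h)  # std over hidden units
    gain = gain if gain is not None else [1.0] * h
    bias = bias if bias is not None else [0.0] * h
    return [g * (v - mu) / (sigma + eps) + b for v, g, b in zip(a, gain, bias)]

def batch_norm(batch, eps=1e-5):
    """Normalize each hidden unit across the training cases of a mini-batch.
    `batch` is a list of cases, each a list of summed inputs (assumed layout).
    Training-time statistics only; no running averages are kept."""
    n, h = len(batch), len(batch[0])
    out = [[0.0] * h for _ in range(n)]
    for j in range(h):
        col = [case[j] for case in batch]                  # one unit, all cases
        mu = sum(col) / n
        sigma = math.sqrt(sum((v - mu) ** 2 for v in col) / n)
        for i in range(n):
            out[i][j] = (batch[i][j] - mu) / (sigma + eps)
    return out

case = [1.0, 2.0, 3.0, 4.0]
ln_out = layer_norm(case)          # well defined for a single case
bn_out = batch_norm([case])        # batch of one: every value collapses to 0.0
```

With a batch of one, each per-unit batch mean equals the value itself, so `batch_norm` returns all zeros and carries no information, while `layer_norm` is unaffected by batch size. This is one way to read the online-learning node above.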
Claims
LN suits RNNs; BN does not (importance 5): BN-RNN needs different statistics for different time-steps (length-dependent). LN doesn't. Source: §3.1 Layer normalized recurrent neural networks. Quote: "Layer normalization does not have such problem".
LN has no minibatch dependency (importance 5): Mean and variance computed over hidden units within a layer for one training case, so minibatch size doesn't matter. Source: §3 + Abstract. Quote: "does not impose any constraint on the size of a mini-batch".
LN: same computation train and test (importance 4): BN needs running averages at test; LN doesn't. Source: Abstract. Quote: "layer normalization performs exactly the same computation at training and test times".
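LN invariant: weight matrix re-scaling (Eq 6) (importance 4): Scaling the whole weight matrix by δ scales the summed inputs, μ and σ by δ as well, so the normalized output is unchanged; the gain g is untouched. (Not invariant to re-scaling an individual weight vector; that property belongs to BN/WN.) Source: §5.1 + Eq 6.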
LN invariant: weight matrix re-centering (importance 4): Adding a constant vector to all incoming weights leaves model output unchanged under LN. Source: §5.1 + Table 1.
LN invariant: single training-case re-scaling (Eq 7) (importance 4): Re-scaling one data point's features by δ scales μ and σ by δ too, so the factor cancels. BN does NOT have this property. Source: §5.1 + Eq 7.
BN cannot do online learning (importance 3): Requires mini-batch statistics; doesn't work at batch size 1 or in distributed settings with small batches. Source: §1 Introduction.
Normalization → implicit early stopping (§5.2.2) (importance 3): Larger weight vector norm → harder to change orientation. Acts like implicit early stopping. Source: §5.2.2. Quote: "implicit 'early stopping' effect on the weight vectors".
LN is NOT a re-parameterization (§4) (importance 3): Unlike weight norm and BN-with-expected-statistics. So LN has different invariance properties. Source: §4 Related work. Quote: "Our proposed layer normalization method, however, is not a re-parameterization".
BN invariant: single weight-vector re-scaling (importance 2): Single weight vector scaled by δ; μ and σ scale too. Cancels. Source: §5.1 Table 1.
BN NOT invariant: weight re-centering (importance 2): Shifting all incoming weights by the same vector adds an offset to the summed inputs that varies across training cases, so per-neuron batch statistics don't absorb it. Source: §5.1 Table 1.
WN invariant: weight-vector re-scaling only (importance 2): Not invariant to data re-centering or single-case re-scaling. Source: §5.1 Table 1.
LN reduces RNN training time (Abstract) (importance 4): Substantially reduced training time vs prior published techniques (paper's headline claim). Source: Abstract. Quote: "layer normalization can substantially reduce the training time".
BN-RNN (Cooijmans 2016 protocol) (importance 3): Apply BN per time-step. Best with gain init 0.1. Doesn't generalize to sequences longer than training. Source: §4 + Cooijmans 2016 citation. Quote: "initializing the gain parameter in the recurrent batch normalization layer to 0.1".
GLM Fisher analysis (§5.2.2) (importance 2): Use generalized linear models with block-diagonal Fisher to analyze normalization geometry. Source: §5.2.2.
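The LN invariance claims listed above can be checked numerically on a toy layer. The sketch below is an illustration under stated assumptions, not the paper's analysis: the weight matrix `W`, input `x`, scale `delta`, and shift `gamma` are made-up values, gain and bias are left out, and `delta` is taken to be positive.

```python
import math

def matvec(W, x):
    """Summed inputs a = W x for one training case (one row of W per hidden unit)."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def layer_norm(a):
    """Normalize across the hidden units of one case; gain and bias omitted."""
    mu = sum(a) / len(a)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in a) / len(a))
    return [(v - mu) / sigma for v in a]

def close(u, v, tol=1e-9):
    return all(abs(p - q) < tol for p, q in zip(u, v))

# Hypothetical toy layer: 4 hidden units, 3 inputs.
W = [[0.5, -1.0, 2.0],
     [1.5, 0.3, -0.7],
     [-0.2, 0.8, 0.1],
     [0.9, -0.4, 1.1]]
x = [1.0, 2.0, -1.0]
delta = 3.0                  # positive re-scaling factor (assumed)
gamma = [0.4, -0.6, 0.25]    # shift added to every unit's incoming weights

base = layer_norm(matvec(W, x))

# Weight matrix re-scaling plus re-centering: every row becomes delta*row + gamma.
W_matrix = [[delta * w + g for w, g in zip(row, gamma)] for row in W]
matrix_invariant = close(base, layer_norm(matvec(W_matrix, x)))

# Single training-case re-scaling: x becomes delta*x.
case_invariant = close(base, layer_norm(matvec(W, [delta * v for v in x])))

# Re-scaling ONE weight vector only (row 0): LN is not invariant to this.
W_row = [[delta * w for w in W[0]]] + [row[:] for row in W[1:]]
row_invariant = close(base, layer_norm(matvec(W_row, x)))

print(matrix_invariant, case_invariant, row_invariant)
```

The shift `gamma` adds the same offset to every unit's summed input for a given case, which the per-case mean removes; re-scaling a single row changes one unit relative to the others, so the layer statistics no longer cancel it.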