Not a citable source: do not quote the graph as an authority. Edge labels and importance scores are interpretive judgments by the generating agent. Any claim worth citing must be traced back to the original paper.
Reliability note: headline structure and importance-5 nodes are stable across runs. Mid-tier nodes (importance 2–3) and edge type distinctions are interpretive and may differ between runs. Nodes whose source is marked "training memory" or "inferred" were not directly verified against the source document.
Knowledge Graph: Scaling Laws for Neural Language Models (Kaplan, McCandlish, Henighan, Brown et al. (OpenAI), 2020)
Editorial spotlight: the power-law trinity L(N,D,C)
Concepts
Kaplan test loss L (importance 5): Cross-entropy loss measured on a held-out test set, the primary metric of model quality throughout the paper.. Source: (from training memory of book).
N (non-embedding parameters) (importance 4): Number of model parameters excluding positional and token embeddings, the measure of model size used in scaling laws.. Source: (from training memory of book).
D (dataset tokens) (importance 4): Number of training tokens seen, the measure of dataset size.. Source: (from training memory of book).
C (petaflop/s-days compute) (importance 4): Total compute used in training, measured in petaflop/s-days, calculated as ~6ND for Transformers.. Source: (from training memory of book).
α_N, α_D, α_C exponents (importance 3): The power-law exponents governing how loss scales with N, D, and C respectively.. Source: (from training memory of book).
Kaplan early vs. late compute efficiency (importance 3): Early in training, compute is efficiently converted to loss reduction; late in training (near convergence), returns diminish rapidly.. Source: (from training memory of book).
Kaplan bottleneck regimes (importance 3): Performance is limited by the smallest of N, D, or C; scaling laws apply only when the limiting factor is scaled.. Source: (from training memory of book).
Kaplan overfitting regime (importance 3): When D falls well below roughly N^0.74 (the paper's rule of thumb is D ≳ 5 × 10^3 · N^0.74 tokens), models overfit: test loss stops improving while training loss continues to fall. Source: (from training memory of book).
Hoffmann 2022 Chinchilla revision (importance 3): Later work found that Kaplan et al. underestimated the compute-optimal amount of data; Hoffmann et al. scale parameters and tokens in roughly equal proportion (N ∝ C^0.5, D ∝ C^0.5) instead of N ∝ C^0.73. Source: (from training memory of book).
Kaplan power-law universality class (importance 3): Loss functions across many domains exhibit power-law scaling; neural language models belong to this universality class.. Source: (from training memory of book).
Kaplan: blessings of scale (importance 3): Larger models generalize better per token, train more stably, and transfer better — scale has multiplicative benefits.. Source: (from training memory of book).
Kaplan compute-optimal frontier (importance 3): The curve L(C) when N and S are optimally allocated represents the best achievable performance for each compute budget.. Source: (from training memory of book).
Kaplan per-step efficiency (importance 2): Larger models lower the loss more per optimization step and reach a given loss in fewer steps than smaller models, consistent with their higher sample efficiency. Source: (from training memory of book).
Kaplan underfitting regime (importance 2): When N is too small for given D and C, performance is limited by model capacity rather than data or compute.. Source: (from training memory of book).
Kaplan ~6N FLOPs/token (importance 2): Training a Transformer costs approximately 6N floating-point operations per token: about 2N for the forward pass and about 4N for the backward pass. Source: (from training memory of book).
Kaplan gradient noise scale (importance 2): A measure of gradient stochasticity that governs optimal batch size; scales with loss.. Source: (from training memory of book).
Kaplan extrapolation to 10^13 FLOP (importance 2): Scaling laws enable reliable prediction of performance at compute scales not yet trained, up to ~100× beyond tested range.. Source: (from training memory of book).
Post-Kaplan: Chinchilla-optimal era (importance 2): After Hoffmann 2022, the field shifted to training smaller models on more data, correcting Kaplan's undertrained regime.. Source: (from training memory of book).
Kaplan: multi-epoch not tested (importance 2): All experiments use single-pass training; effects of repeated data passes on scaling not characterized.. Source: (from training memory of book).
N_c, D_c critical scales (importance 2): Empirically fitted constants determining where parameter and data bottlenecks begin to dominate.. Source: (from training memory of book).
Kaplan smooth loss landscape (importance 2): The continuous nature of scaling laws suggests loss landscapes are remarkably smooth, without sharp transitions.. Source: (from training memory of book).
Kaplan compute elasticity (importance 2): The responsiveness of loss to compute investment, quantified by the power-law exponent α_C ≈ 0.050, so that L ∝ C^(-0.050). Source: (from training memory of book).
Kaplan FLOP accounting (importance 2): Careful measurement of FLOPs per token, excluding embedding lookups and softmax, to enable fair cross-model comparison.. Source: (from training memory of book).
Kaplan: no parameter sharing (importance 1): Models don't use parameter sharing across layers; each Transformer layer has independent parameters.. Source: (from training memory of book).
Kaplan perplexity = exp(L) (importance 1): Perplexity is the exponential of cross-entropy loss; scaling laws apply equivalently to both metrics.. Source: (from training memory of book).
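The unit conversions behind the C, nats, and perplexity nodes above can be sketched in a few lines. This is a minimal illustration, not code from the paper: the function names are hypothetical, and it assumes roughly 6N FLOPs per token for a combined forward and backward pass, so that C ≈ 6·N·D.

```python
import math

# One petaflop/s sustained for one day, in FLOPs.
PF_DAY_FLOPS = 1e15 * 24 * 3600


def training_compute_pf_days(n_params: float, n_tokens: float) -> float:
    """Approximate training compute C ~ 6 * N * D, expressed in PF-days.

    Assumes ~6N FLOPs per token (forward plus backward pass).
    """
    return 6.0 * n_params * n_tokens / PF_DAY_FLOPS


def perplexity(loss_nats: float) -> float:
    """Perplexity is the exponential of the cross-entropy loss in nats."""
    return math.exp(loss_nats)


def nats_to_bits(loss_nats: float) -> float:
    """Convert a loss from nats (base e) to bits (base 2)."""
    return loss_nats / math.log(2)


if __name__ == "__main__":
    # Illustrative inputs only: a 1e9-parameter model trained on 1e10 tokens.
    print(training_compute_pf_days(1e9, 1e10))
    print(perplexity(1.7), nats_to_bits(1.7))
```

Because perplexity is a monotone function of the loss, any ranking of models by loss carries over unchanged to perplexity.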
Claims
Kaplan loss scales as power law in N, D, C (importance 5): Test loss scales as a power law with model size N, dataset size D, and compute budget C, with specific exponents; some trends span more than seven orders of magnitude. Source: (from training memory of book).
Kaplan smooth & predictable scaling (importance 5): Performance depends very weakly on architecture, optimizer, and other hyperparameters; the scaling laws are surprisingly universal.. Source: (from training memory of book).
Kaplan: early stopping saves compute (importance 4): Training very large models and stopping well short of convergence is more compute-efficient than training smaller models to convergence, contrary to the common practice of the time. Source: (from training memory of book).
Kaplan: large models are sample-efficient (importance 4): Larger models reach the same performance level with significantly fewer training tokens than smaller models.. Source: (from training memory of book).
Kaplan: training to convergence wastes compute (importance 4): For a fixed compute budget, it's more efficient to train a larger model with fewer steps than to train a smaller model to convergence.. Source: (from training memory of book).
Kaplan critical batch size B_crit (importance 4): There exists a critical batch size beyond which increasing batch size yields diminishing returns; it scales as a power law with loss.. Source: (from training memory of book).
Kaplan: scaling laws transfer across distributions (importance 4): Models evaluated on text from a distribution other than the training one follow the same power-law trends, with a roughly constant loss offset relative to the training distribution. Source: (from training memory of book).
Kaplan: models undertrained in 2020 (importance 4): Kaplan et al. recommend very large models trained on comparatively modest amounts of data; the view that large models of the era were undertrained (too few tokens per parameter) comes from the later Chinchilla analysis of Hoffmann et al. (2022). Source: (from training memory of book).
Kaplan: smooth curves, no emergence (importance 4): Performance improvements are continuous power laws with no sharp capability thresholds or emergent behaviors.. Source: (from training memory of book).
Kaplan: only N, D, C matter (importance 4): Architecture details, optimizer choice, learning rate schedules are all second-order; N, D, C dominate performance.. Source: (from training memory of book).
Kaplan: simplicity of scaling laws (importance 4): The fact that such simple power laws govern performance across more than seven orders of magnitude suggests deep structure in the learning problem. Source: (from training memory of book).
Kaplan paper shaped 2020-2022 era (importance 4): These scaling laws directly informed GPT-3, Gopher, and the entire 'scale is all you need' paradigm before Chinchilla.. Source: (from training memory of book).
Kaplan: predictable progress (importance 4): If scaling laws hold indefinitely, future model capabilities can be forecasted from compute trajectories.. Source: (from training memory of book).
Kaplan L_∞ irreducible loss (importance 3): There exists a theoretical minimum loss determined by the entropy of natural language; scaling laws approach but never cross this limit.. Source: (from training memory of book).
Kaplan: universality across tasks (importance 3): Scaling laws appear to generalize across different text domains, languages, and downstream tasks.. Source: (from training memory of book).
Kaplan: entropy sets floor (importance 3): The irreducible loss L_∞ is determined by the true entropy of the data distribution.. Source: (from training memory of book).
Kaplan: bigger ≠ always better (importance 3): For a fixed compute budget, there exists an optimal model size; going larger or smaller wastes compute.. Source: (from training memory of book).
Kaplan: large models wasteful → myth (importance 3): Contrary to intuition, larger models are more sample-efficient, needing fewer tokens to reach a target loss.. Source: (from training memory of book).
Kaplan: tricks < scale (importance 3): Architectural innovations, training tricks, and clever optimization contribute less to progress than simply scaling up.. Source: (from training memory of book).
Kaplan: no theory yet (importance 3): The paper observes empirical power laws but offers no rigorous theoretical explanation for why these exponents emerge.. Source: (from training memory of book).
Kaplan: scaling limit unknown (importance 3): Whether power laws continue indefinitely or eventually break down remains an open empirical question.. Source: (from training memory of book).
Kaplan: no double descent in scaling (importance 2): Unlike some supervised settings, language model loss decreases monotonically with N, D, C; no interpolation threshold peak.. Source: (from training memory of book).
Kaplan: scaling → alignment concerns (importance 2): Predictable capability growth raises questions about when dangerous capabilities emerge and how to align them.. Source: (from training memory of book).
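The forecasting claims above rest on one mechanical step: fit a straight line in log-log space, then extend it. The sketch below shows that step on synthetic data generated from an assumed law L = 5.0 · N^(-0.076). The prefactor 5.0, the sample points, and the function name are illustrative assumptions, not measurements from the paper.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**(-alpha) via a line in log-log space.

    Returns (a, alpha).
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    slope = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y)) / sum(
        (u - mean_x) ** 2 for u in log_x
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic, noise-free points from an assumed power law (illustration only).
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [5.0 * n ** -0.076 for n in sizes]

a, alpha = fit_power_law(sizes, losses)

# Extrapolate two orders of magnitude beyond the largest fitted size.
predicted_loss = a * 1e11 ** -alpha

if __name__ == "__main__":
    print(a, alpha, predicted_loss)
```

On real measurements the fit would carry noise, and whether the extrapolation holds is exactly the open question the "scaling limit unknown" node raises.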
Empirical results
L(N) ∝ N^(-0.076) (parameters) (importance 5): When not bottlenecked by data or compute, loss scales with model parameters N to the power of -0.076.. Source: (from training memory of book).
L(D) ∝ D^(-0.095) (dataset size) (importance 5): For large models trained with early stopping on a limited dataset, loss scales with dataset size D to the power of -0.095. Source: (from training memory of book).
L(C) ∝ C^(-0.050) (compute) (importance 5): When optimally trading off model size and training steps, loss scales with compute budget C to the power of -0.050.. Source: (from training memory of book).
Kaplan: architecture ≈ irrelevant (importance 4): Varying depth, width, attention heads, and other architectural details has minimal impact on scaling laws when controlling for N.. Source: (from training memory of book).
Kaplan: 7+ OOM empirical validation (importance 4): The paper reports trends spanning more than seven orders of magnitude, with compute measured in petaflop/s-days (PF-days). Source: (from training memory of book).
Kaplan optimal N(C) allocation (importance 4): For compute budget C, the optimal model size scales as N ∝ C^0.73, with batch size B ∝ C^0.24 and training steps S ∝ C^0.03, so the data processed (D = B·S) grows only as C^0.27. Source: (from training memory of book).
Kaplan: shape < scale (importance 3): Performance depends far more on total parameter count than on model shape (depth vs width ratio).. Source: (from training memory of book).
B_crit ∝ L^(-4.8) (importance 3): Critical batch size scales as an inverse power law with loss, approximately L to the power of -4.8.. Source: (from training memory of book).
D ∝ N^0.74 (data to avoid overfitting) (importance 3): The 0.74 exponent in the paper governs the dataset size needed to avoid overfitting, D ≳ 5 × 10^3 · N^0.74 tokens, not the number of steps; S_min denotes the minimum number of optimization steps, reached when training at batch sizes far above B_crit. Source: (from training memory of book).
D scales slower than N (importance 3): Data needs grow more slowly than parameter count: doubling N calls for only about 2^0.74 ≈ 1.67 times as much data to stay out of the overfitting regime. Source: (from training memory of book).
Kaplan transfer gap ≈ constant offset (importance 3): On text from a distribution other than the training one, loss tracks the training-distribution loss with a roughly constant offset as model size grows. Source: (from training memory of book).
Kaplan: convergence hits wall at S_min (importance 3): Training beyond convergence yields near-zero loss improvement regardless of additional compute spent.. Source: (from training memory of book).
Kaplan: stop ~10% above converged loss (importance 3): For compute efficiency, models should be stopped when their loss is still roughly 10% above the converged value, and the saved compute spent on a larger model instead. Source: (from training memory of book).
Kaplan log-log linear plots (importance 3): When plotted on log-log axes, loss vs N/D/C relationships are remarkably linear across many orders of magnitude.. Source: (from training memory of book).
Kaplan: small models plateau early (importance 3): Models below a critical size for given data reach a performance plateau quickly and gain little from extended training.. Source: (from training memory of book).
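Kaplan: optimal D grows sublinearly in N (importance 3): Compute-optimal training in the paper does not fix a constant number of tokens per non-embedding parameter; with N ∝ C^0.73 and D ∝ C^0.27, optimal data grows roughly as N^0.37. Source: (from training memory of book).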
Kaplan: test loss predicts downstream (importance 3): Pre-training test loss is highly predictive of downstream task performance across diverse benchmarks.. Source: (from training memory of book).
L(N,D) = [(N_c/N)^(α_N/α_D) + D_c/D]^α_D (importance 3): The unified formula combining N and D dependencies, with critical scales N_c and D_c and exponents α_N, α_D.. Source: (from training memory of book).
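Kaplan: N^0.73 B^0.24 S^0.03 compute split (importance 3): Optimal compute allocation follows N ∝ C^0.73, B ∝ C^0.24, and S ∝ C^0.03. These are power-law exponents rather than percentage shares: most added compute goes to model size, most of the rest to larger batches, and very little to additional steps. Source: (from training memory of book).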
Kaplan: training past S_min → waste (importance 3): Compute spent training beyond convergence yields <1% performance gains; better spent on larger models.. Source: (from training memory of book).
Kaplan: GPT-3 matched predictions (importance 3): GPT-3 postdates the paper; Brown et al. (2020) later reported that its loss followed the power-law trend extrapolated from smaller models. Source: (from training memory of book).
Kaplan: exclude embeddings from N (importance 2): Embedding parameters scale differently than model parameters and should be excluded from N for accurate scaling laws.. Source: (from training memory of book).
Kaplan width ≈ depth when N fixed (importance 2): For a fixed parameter count, making models wider or deeper yields nearly identical performance.. Source: (from training memory of book).
Kaplan: # attention heads ≈ irrelevant (importance 2): Number of attention heads has negligible impact on scaling laws when N is held constant.. Source: (from training memory of book).
Kaplan: low run-to-run variance (importance 2): Repeated runs with different random seeds show minimal variance; scaling laws are robust to initialization.. Source: (from training memory of book).
Kaplan: parallelism limited by B_crit (importance 2): Data parallelism beyond critical batch size wastes compute; model parallelism becomes necessary for larger models.. Source: (from training memory of book).
Kaplan: L layers ∝ N^0.6 optimal (importance 2): For a given N, optimal depth scales as the 0.6 power of parameters.. Source: (from training memory of book).
Kaplan: d_model ∝ N^0.4 optimal (importance 2): For a given N, optimal width (model dimension) scales as the 0.4 power of parameters.. Source: (from training memory of book).
Kaplan: train-valid gap ∝ 1/√N (importance 2): Difference between training and validation loss shrinks as inverse square root of model size.. Source: (from training memory of book).
Kaplan estimates L_∞ ≈ 1.7 nats (importance 2): Extrapolating scaling curves suggests irreducible loss around 1.7 nats (~2.45 bits per token).. Source: (from training memory of book).
Kaplan: LR schedule ≈ doesn't matter (importance 2): Scaling laws robust to variations in learning rate schedule shape and warmup duration.. Source: (from training memory of book).
Kaplan: no magic architecture found (importance 2): Testing variants found no architectural changes that significantly beat the power laws; Transformers aren't special.. Source: (from training memory of book).
Kaplan: weight decay ≈ irrelevant (importance 1): Presence or absence of weight decay has minimal effect on scaling laws.. Source: (from training memory of book).
Kaplan: nats (natural log) loss (importance 1): Loss measured in nats (base e) rather than bits (base 2); conversion: 1 nat ≈ 1.44 bits.. Source: (from training memory of book).
Kaplan: MoE not tested (importance 1): Mixture-of-experts architectures not included; unclear if scaling laws generalize to conditionally-activated parameters.. Source: (from training memory of book).
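To make the unified formula L(N,D) = [(N_c/N)^(α_N/α_D) + D_c/D]^α_D concrete, here is a minimal numerical sketch. The exponents α_N = 0.076 and α_D = 0.095 come from the nodes above. The critical scales N_c ≈ 8.8 × 10^13 and D_c ≈ 5.4 × 10^13 do not appear in this graph and are assumptions recalled from the paper, as are the allocation exponents 0.73, 0.24, and 0.03 for N, B, and S. Check all of them against the original before relying on them. The function names are hypothetical.

```python
ALPHA_N = 0.076  # exponent for model size, from the nodes above
ALPHA_D = 0.095  # exponent for dataset size, from the nodes above
N_C = 8.8e13     # assumed critical parameter scale (not stated in this graph)
D_C = 5.4e13     # assumed critical token scale (not stated in this graph)


def loss_nd(n_params: float, n_tokens: float) -> float:
    """Evaluate L(N, D) = [(N_c/N)^(alpha_N/alpha_D) + D_c/D]^alpha_D, in nats."""
    model_term = (N_C / n_params) ** (ALPHA_N / ALPHA_D)
    data_term = D_C / n_tokens
    return (model_term + data_term) ** ALPHA_D


def allocation_multipliers(compute_ratio: float) -> dict:
    """How N, B, and S grow when compute grows by `compute_ratio`.

    Uses the assumed exponents N ~ C^0.73, B ~ C^0.24, S ~ C^0.03,
    whose sum is 1, consistent with C being proportional to N * B * S.
    """
    return {
        "N": compute_ratio ** 0.73,
        "B": compute_ratio ** 0.24,
        "S": compute_ratio ** 0.03,
    }


if __name__ == "__main__":
    print(loss_nd(1e9, 1e10))
    print(allocation_multipliers(10.0))
```

With effectively unlimited data the data term vanishes and the formula reduces to L(N) = (N_c/N)^α_N, which is the single-variable law listed at the top of this section.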
Methods
Kaplan Transformer decoder-only (importance 3): All experiments use decoder-only Transformers trained on language modeling, varying from 768 to 1.5B parameters.. Source: (from training memory of book).
WebText2 training corpus (importance 3): The primary training dataset, an expanded version of WebText containing 20+ billion tokens.. Source: (from training memory of book).
Kaplan Adam optimization (importance 2): All models trained with Adam optimizer; scaling laws hold regardless of optimizer choice.. Source: (from training memory of book).
Kaplan cosine learning rate decay (importance 2): Learning rate decays on a cosine schedule; scaling laws are robust to variations in schedule.. Source: (from training memory of book).
Kaplan 1024-token context (importance 2): Most experiments use 1024-token context windows; scaling laws are insensitive to moderate context length variations.. Source: (from training memory of book).
Kaplan fixed-LR parameter scan (importance 2): Systematically varying N while keeping learning rate and other hyperparameters constant to isolate scaling effects.. Source: (from training memory of book).
Kaplan 50k BPE vocabulary (importance 1): All models use 50,257-token BPE vocabulary; vocabulary size itself doesn't affect scaling laws significantly.. Source: (from training memory of book).
Kaplan 93%-5%-2% split (importance 1): WebText2 divided into 93% train, 5% validation, 2% test to measure generalization.. Source: (from training memory of book).
Kaplan LayerNorm (importance 1): All models use LayerNorm; choice of normalization doesn't materially affect scaling laws.. Source: (from training memory of book).
Kaplan full dense attention (importance 1): All models use full O(n²) attention; sparse attention patterns not explored.. Source: (from training memory of book).
Kaplan learning rate warmup (importance 1): Models use brief linear warmup before cosine decay; warmup length doesn't affect scaling laws.. Source: (from training memory of book).
Kaplan BPE tokenization (importance 1): Byte Pair Encoding used for all models; tokenization method doesn't affect scaling exponents.. Source: (from training memory of book).
Kaplan checkpoint every 1000 steps (importance 1): Models evaluated on test set every 1000 training steps to measure learning curves.. Source: (from training memory of book).
Kaplan: 10+ runs per config (importance 1): Each data point represents 10+ independent training runs; error bars are tight.. Source: (from training memory of book).
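The schedule described in the warmup and cosine-decay nodes above can be sketched as a single function. This is an illustration under stated assumptions rather than the authors' implementation: the parameter names, the decay to exactly zero, and the example values are all hypothetical.

```python
import math


def lr_at_step(step: int, total_steps: int, peak_lr: float, warmup_steps: int) -> float:
    """Linear warmup to `peak_lr`, then cosine decay to zero at `total_steps`."""
    if step < warmup_steps:
        # Linear ramp: reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Illustrative values only.
    for s in (0, 99, 100, 5_050, 10_000):
        print(s, lr_at_step(s, total_steps=10_000, peak_lr=1e-3, warmup_steps=100))
```

The rate peaks at the end of warmup, passes through half the peak at the midpoint of the decay phase, and reaches zero at the final step.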
Entities
GPT-2 (Radford et al. 2019) (importance 2): 1.5B parameter model used as reference point; trained on WebText, the predecessor of WebText2. Source: (from training memory of book).
GPT-3 (Brown et al. 2020) (importance 2): 175B parameter model trained contemporaneously; exemplifies the scaling laws in practice.. Source: (from training memory of book).
Jared Kaplan (OpenAI → Anthropic) (importance 2): Lead author, later co-founded Anthropic; work shaped GPT-3 and subsequent models.. Source: (from training memory of book).
Sam McCandlish (OpenAI → Anthropic) (importance 2): Second author, instrumental in theoretical framing of scaling laws.. Source: (from training memory of book).
Brown et al. (GPT-3, 2020) (importance 2): Contemporaneous work applying these scaling laws to train GPT-3.. Source: (from training memory of book).
Hestness et al. (2017) prior scaling (importance 2): Earlier work observing power-law scaling in supervised learning; Kaplan extends to unsupervised LM regime.. Source: (from training memory of book).
Hoffmann, Sifre et al. (DeepMind 2022) (importance 2): Team that revised Kaplan's compute-optimal scaling with the Chinchilla model and updated laws.. Source: (from training memory of book).
Tom Henighan (OpenAI) (importance 1): Third author on the paper, contributed to experimental design.. Source: (from training memory of book).
OpenAI compute cluster (2019-2020) (importance 1): V100 and TPUv3 hardware used for experiments; total compute ~10^6 petaflop/s-days.. Source: (from training memory of book).
NMT prior work (2017-2019) (importance 1): Earlier scaling observations in sequence-to-sequence models; Kaplan extends to pure LM.. Source: (from training memory of book).
Ilya Sutskever (OpenAI advisor) (importance 1): Senior advisor on the project; champion of scaling hypothesis.. Source: (from training memory of book).
Dario Amodei (OpenAI VP → Anthropic) (importance 1): OpenAI VP Research during this work; later co-founded Anthropic with Kaplan and McCandlish.. Source: (from training memory of book).
Relations
Kaplan loss scales as power law in N, D, C evidences L(N) ∝ N^(-0.076) (parameters)
Kaplan loss scales as power law in N, D, C evidences L(D) ∝ D^(-0.095) (dataset size)
Kaplan loss scales as power law in N, D, C evidences L(C) ∝ C^(-0.050) (compute)
L(N) ∝ N^(-0.076) (parameters) requires N (non-embedding parameters)
L(D) ∝ D^(-0.095) (dataset size) requires D (dataset tokens)
L(C) ∝ C^(-0.050) (compute) requires C (petaflop/s-days compute)
N (non-embedding parameters) enables Kaplan test loss L
D (dataset tokens) enables Kaplan test loss L
C (petaflop/s-days compute) enables Kaplan test loss L