What is the central concept of Scaling Laws for Neural Language Models?

The power-law trinity L(N, D, C): test loss scales as a power law in model size N, dataset size D, and compute budget C, each with its own fitted exponent, provided the other two factors are not the bottleneck. The paper reports these trends over roughly eight orders of magnitude in compute, six in model size, and more than two in dataset size.

What is Kaplan test loss L in Scaling Laws for Neural Language Models?

Cross-entropy loss measured on a held-out test set, the primary metric of model quality throughout the paper.
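
As a small illustration of the metric, the sketch below computes a per-token cross-entropy in nats and converts it to bits and to perplexity. The function names and the four example probabilities are made up for illustration; they are not taken from the paper.

```python
import math

def cross_entropy_nats(probs_of_true_tokens):
    """Mean negative log-likelihood (natural log) of the correct tokens."""
    return -sum(math.log(p) for p in probs_of_true_tokens) / len(probs_of_true_tokens)

def nats_to_bits(loss_nats):
    """Convert a loss in nats to bits (1 nat = 1/ln 2 bits, about 1.44)."""
    return loss_nats / math.log(2)

def perplexity(loss_nats):
    """Perplexity is the exponential of the cross-entropy in nats."""
    return math.exp(loss_nats)

# Hypothetical probabilities a model assigned to the true next tokens.
example_probs = [0.5, 0.25, 0.125, 0.125]
loss = cross_entropy_nats(example_probs)
print(f"loss = {loss:.4f} nats = {nats_to_bits(loss):.4f} bits, "
      f"perplexity = {perplexity(loss):.4f}")
```

Because perplexity is a monotonic function of the loss, a trend stated for one metric carries over to the other, although a power law in L is not a power law in exp(L).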

What is Kaplan optimal N(C) allocation in Scaling Laws for Neural Language Models?

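For a compute budget C, the compute-optimal model size scales as N ∝ C^0.73. The remaining growth goes to data, D = B·S ∝ C^0.27, which the paper splits into batch size B ∝ C^0.24 and training steps S ∝ C^0.03, so the number of steps barely increases.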
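
To make the allocation concrete, here is a minimal sketch of how each quantity would grow when the compute budget is multiplied by some factor. It uses the exponents reported in the paper (0.73 for model size, 0.24 for batch size, 0.03 for steps); the function name and dictionary keys are hypothetical.

```python
# Exponents as reported in the paper for compute-optimal training.
EXP_N = 0.73  # model size
EXP_B = 0.24  # batch size
EXP_S = 0.03  # serial training steps

def scale_allocation(compute_multiplier):
    """Growth factors for N, B, S and tokens D = B*S when compute grows by k."""
    k = compute_multiplier
    return {
        "model_size": k ** EXP_N,
        "batch_size": k ** EXP_B,
        "steps": k ** EXP_S,
        "tokens": k ** (EXP_B + EXP_S),
    }

growth = scale_allocation(10.0)
for name, factor in growth.items():
    print(f"{name}: x{factor:.2f}")
```

The three exponents sum to 1, which is consistent with compute being proportional to N·B·S: a 10× budget goes mostly into a larger model and only slightly into more steps.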

What is Kaplan smooth & predictable scaling in Scaling Laws for Neural Language Models?

At a fixed scale, performance depends only weakly on architectural shape (depth, width, number of attention heads) and on other hyperparameters, so the fitted trends are smooth and predictable.

What is the main argument of Scaling Laws for Neural Language Models?

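Test loss scales as a power law in model size N, dataset size D, and compute budget C, each with its own fitted exponent, provided the other two factors are not the bottleneck. The paper reports these trends over roughly eight orders of magnitude in compute, six in model size, and more than two in dataset size.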
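
The three single-variable laws share the form L(X) = (X_c / X)^α. The sketch below evaluates that form with the exponents listed on this page and with critical scales recalled from the paper (N_c ≈ 8.8×10^13 parameters, D_c ≈ 5.4×10^13 tokens, C_c ≈ 3.1×10^8 PF-days); treat the constants as approximate assumptions, since they depend on the tokenizer and dataset.

```python
# Fitted constants as recalled from the paper; approximate and
# specific to its WebText2 / BPE setup (an assumption of this sketch).
ALPHA_N, N_C = 0.076, 8.8e13   # non-embedding parameters
ALPHA_D, D_C = 0.095, 5.4e13   # tokens
ALPHA_C, C_C = 0.050, 3.1e8    # PF-days, compute-optimal allocation

def power_law_loss(x, x_c, alpha):
    """L(x) = (x_c / x) ** alpha, valid when x is the only bottleneck."""
    return (x_c / x) ** alpha

# Predicted loss in nats/token for a 1.5e9-parameter model with ample data.
print(power_law_loss(1.5e9, N_C, ALPHA_N))
```

On log-log axes this form is a straight line with slope -α, which is why a fit on small models can be extrapolated to larger ones.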

Scaling Laws for Neural Language Models · Knowledge Graph

Knowledge Graph: Scaling Laws for Neural Language Models (Kaplan, McCandlish, Henighan, Brown et al. (OpenAI), 2020)

Editorial spotlight: the power-law trinity L(N, D, C)

Concepts

Kaplan test loss L (importance 5): Cross-entropy loss measured on a held-out test set, the primary metric of model quality throughout the paper.. Source: (from training memory of book).
N (non-embedding parameters) (importance 4): Number of model parameters excluding positional and token embeddings, the measure of model size used in scaling laws.. Source: (from training memory of book).
D (dataset tokens) (importance 4): Number of training tokens seen, the measure of dataset size.. Source: (from training memory of book).
C (petaflop/s-days compute) (importance 4): Total compute used in training, measured in petaflop/s-days, calculated as ~6ND for Transformers.. Source: (from training memory of book).
α_N, α_D, α_C exponents (importance 3): The power-law exponents governing how loss scales with N, D, and C respectively.. Source: (from training memory of book).
Kaplan early vs. late compute efficiency (importance 3): Early in training, compute is efficiently converted to loss reduction; late in training (near convergence), returns diminish rapidly.. Source: (from training memory of book).
Kaplan bottleneck regimes (importance 3): Performance is limited by the smallest of N, D, or C; scaling laws apply only when the limiting factor is scaled.. Source: (from training memory of book).
Kaplan overfitting regime (importance 3): When D falls well below roughly (5×10^3)·N^0.74 tokens, models overfit: test loss stops improving while training loss continues to fall. Source: (from training memory of book).
Hoffmann 2022 Chinchilla revision (importance 3): Later work found that Kaplan underestimated the optimal data scale; Chinchilla scales parameters and tokens in equal proportion (N ∝ C^0.5, D ∝ C^0.5) instead of N ∝ C^0.73. Source: (from training memory of book).
Kaplan power-law universality class (importance 3): Loss functions across many domains exhibit power-law scaling; neural language models belong to this universality class.. Source: (from training memory of book).
Kaplan: blessings of scale (importance 3): Larger models generalize better per token, train more stably, and transfer better — scale has multiplicative benefits.. Source: (from training memory of book).
Kaplan compute-optimal frontier (importance 3): The curve L(C) when N and S are optimally allocated represents the best achievable performance for each compute budget.. Source: (from training memory of book).
Kaplan per-step efficiency (importance 2): Loss reduction per training step decreases as models grow; larger models need more steps to converge.. Source: (from training memory of book).
Kaplan underfitting regime (importance 2): When N is too small for given D and C, performance is limited by model capacity rather than data or compute.. Source: (from training memory of book).
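Kaplan ~6N FLOPs/token (importance 2): Training a Transformer costs approximately 6N floating-point operations per token: about 2N for the forward pass and about twice that for the backward pass, ignoring the context-dependent attention term. Source: (from training memory of book).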
Kaplan gradient noise scale (importance 2): A measure of gradient stochasticity that governs optimal batch size; scales with loss.. Source: (from training memory of book).
Kaplan extrapolation to 10^13 FLOP (importance 2): Scaling laws enable reliable prediction of performance at compute scales not yet trained, up to ~100× beyond tested range.. Source: (from training memory of book).
Post-Kaplan: Chinchilla-optimal era (importance 2): After Hoffmann 2022, the field shifted to training smaller models on more data, correcting Kaplan's undertrained regime.. Source: (from training memory of book).
Kaplan: multi-epoch not tested (importance 2): All experiments use single-pass training; effects of repeated data passes on scaling not characterized.. Source: (from training memory of book).
N_c, D_c critical scales (importance 2): Empirically fitted constants determining where parameter and data bottlenecks begin to dominate.. Source: (from training memory of book).
Kaplan smooth loss landscape (importance 2): The continuous nature of scaling laws suggests loss landscapes are remarkably smooth, without sharp transitions.. Source: (from training memory of book).
Kaplan compute elasticity (importance 2): The responsiveness of loss to compute investment, quantified by the power-law exponent α_C ≈ 0.050, so that L ∝ C^(-0.050). Source: (from training memory of book).
Kaplan FLOP accounting (importance 2): Careful measurement of FLOPs per token, excluding embedding lookups and softmax, to enable fair cross-model comparison.. Source: (from training memory of book).
Kaplan: no parameter sharing (importance 1): Models don't use parameter sharing across layers; each Transformer layer has independent parameters.. Source: (from training memory of book).
Kaplan perplexity = exp(L) (importance 1): Perplexity is the exponential of cross-entropy loss; scaling laws apply equivalently to both metrics.. Source: (from training memory of book).
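
The compute definition above (C ≈ 6ND, expressed in petaflop/s-days) can be sketched as a unit conversion. One PF-day is 10^15 operations per second sustained for 24 hours, about 8.64×10^19 operations. The function names and the example N and D values are illustrative assumptions.

```python
SECONDS_PER_DAY = 24 * 3600
FLOPS_PER_PF_DAY = 1e15 * SECONDS_PER_DAY  # about 8.64e19 operations

def training_flops(n_params, n_tokens):
    """Approximate training cost: 6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

def to_pf_days(flops):
    """Convert a raw FLOP count to petaflop/s-days."""
    return flops / FLOPS_PER_PF_DAY

# Hypothetical run: 1.5e9 non-embedding parameters, 2.29e10 tokens.
cost = to_pf_days(training_flops(1.5e9, 2.29e10))
print(f"{cost:.2f} PF-days")
```

The 6N estimate ignores embedding parameters and the attention cost that grows with context length, so it is a lower-order approximation rather than an exact count.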

Claims

Kaplan loss scales as power law in N, D, C (importance 5): Test loss scales as a power law with model size N, dataset size D, and compute budget C, with specific exponents that hold across 8 orders of magnitude.. Source: (from training memory of book).
Kaplan smooth & predictable scaling (importance 5): At a fixed scale, performance depends only weakly on architectural shape (depth, width, number of attention heads) and on other hyperparameters, so the fitted trends are smooth and predictable. Source: (from training memory of book).
Kaplan: early stopping wastes compute (importance 4): Contrary to this label, the paper finds that for a fixed compute budget, training a very large model and stopping well short of convergence is more compute-efficient than training a smaller model to convergence. Source: (from training memory of book).
Kaplan: large models are sample-efficient (importance 4): Larger models reach the same performance level with significantly fewer training tokens than smaller models.. Source: (from training memory of book).
Kaplan: training to convergence wastes compute (importance 4): For a fixed compute budget, it's more efficient to train a larger model with fewer steps than to train a smaller model to convergence.. Source: (from training memory of book).
Kaplan critical batch size B_crit (importance 4): There exists a critical batch size beyond which increasing batch size yields diminishing returns; it scales as a power law with loss.. Source: (from training memory of book).
Kaplan: scaling laws transfer across distributions (importance 4): Models trained on one text distribution generalize to others following predictable power laws, with transfer gaps that narrow as models grow.. Source: (from training memory of book).
Kaplan: models undertrained in 2020 (importance 4): Most large models of the era are trained with far more parameters than steps, contrary to compute-optimal allocation.. Source: (from training memory of book).
Kaplan: smooth curves, no emergence (importance 4): Performance improvements are continuous power laws with no sharp capability thresholds or emergent behaviors.. Source: (from training memory of book).
Kaplan: only N, D, C matter (importance 4): Architecture details, optimizer choice, learning rate schedules are all second-order; N, D, C dominate performance.. Source: (from training memory of book).
Kaplan: simplicity of scaling laws (importance 4): The fact that such simple power laws govern performance across 8 OOM suggests deep structure in the learning problem.. Source: (from training memory of book).
Kaplan paper shaped 2020-2022 era (importance 4): These scaling laws directly informed GPT-3, Gopher, and the entire 'scale is all you need' paradigm before Chinchilla.. Source: (from training memory of book).
Kaplan: predictable progress (importance 4): If scaling laws hold indefinitely, future model capabilities can be forecasted from compute trajectories.. Source: (from training memory of book).
Kaplan L_∞ irreducible loss (importance 3): Natural language has nonzero entropy, so the loss cannot fall to zero; the paper's pure power-law fits contain no explicit floor term and must therefore level off at some scale. Source: (from training memory of book).
Kaplan: universality across tasks (importance 3): Scaling laws appear to generalize across different text domains, languages, and downstream tasks.. Source: (from training memory of book).
Kaplan: entropy sets floor (importance 3): The irreducible loss L_∞ is determined by the true entropy of the data distribution.. Source: (from training memory of book).
Kaplan: bigger ≠ always better (importance 3): For a fixed compute budget, there exists an optimal model size; going larger or smaller wastes compute.. Source: (from training memory of book).
Kaplan: large models wasteful → myth (importance 3): Contrary to intuition, larger models are more sample-efficient, needing fewer tokens to reach a target loss.. Source: (from training memory of book).
Kaplan: tricks < scale (importance 3): Architectural innovations, training tricks, and clever optimization contribute less to progress than simply scaling up.. Source: (from training memory of book).
Kaplan: no theory yet (importance 3): The paper observes empirical power laws but offers no rigorous theoretical explanation for why these exponents emerge.. Source: (from training memory of book).
Kaplan: scaling limit unknown (importance 3): Whether power laws continue indefinitely or eventually break down remains an open empirical question.. Source: (from training memory of book).
Kaplan: no double descent in scaling (importance 2): Unlike some supervised settings, language model loss decreases monotonically with N, D, C; no interpolation threshold peak.. Source: (from training memory of book).
Kaplan: scaling → alignment concerns (importance 2): Predictable capability growth raises questions about when dangerous capabilities emerge and how to align them.. Source: (from training memory of book).
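
Several of the claims above turn on how small the exponents are. As a rough sketch, the code below computes the factor by which loss shrinks for each 10× increase in a resource, and the scale-up needed to reach a target loss ratio, assuming a pure power law with the exponents quoted on this page. The function names are hypothetical.

```python
def loss_ratio(scale_factor, alpha):
    """Multiplicative change in loss when a resource grows by scale_factor."""
    return scale_factor ** (-alpha)

def factor_to_reach(target_ratio, alpha):
    """Resource scale-up needed to multiply the loss by target_ratio (< 1)."""
    return target_ratio ** (-1.0 / alpha)

exponents = {"N (parameters)": 0.076, "D (tokens)": 0.095, "C (compute)": 0.050}
for name, alpha in exponents.items():
    print(f"10x more {name}: loss x{loss_ratio(10, alpha):.3f}, "
          f"halving needs x{factor_to_reach(0.5, alpha):.3g}")
```

Each 10× step buys a reduction of roughly 10 to 20 percent in loss, so the curves are predictable but shallow: halving the loss along any single axis requires a scale-up of several orders of magnitude.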

Empirical results

L(N) ∝ N^(-0.076) (parameters) (importance 5): When not bottlenecked by data or compute, loss scales with model parameters N to the power of -0.076.. Source: (from training memory of book).
L(D) ∝ D^(-0.095) (dataset size) (importance 5): For large models trained with early stopping on a limited dataset, loss scales with dataset size D to the power of -0.095. Source: (from training memory of book).
L(C) ∝ C^(-0.050) (compute) (importance 5): When optimally trading off model size and training steps, loss scales with compute budget C to the power of -0.050.. Source: (from training memory of book).
Kaplan: architecture ≈ irrelevant (importance 4): Varying depth, width, attention heads, and other architectural details has minimal impact on scaling laws when controlling for N.. Source: (from training memory of book).
Kaplan: 8 OOM empirical validation (importance 4): The compute trend is fit over roughly eight orders of magnitude in compute; the model-size and dataset-size trends span about six and more than two orders of magnitude, respectively. Source: (from training memory of book).
Kaplan optimal N(C) allocation (importance 4): For compute budget C, the optimal model size scales as N ∝ C^0.73, while the data processed, D = B·S, scales as C^0.27 (batch size B ∝ C^0.24, steps S ∝ C^0.03). Source: (from training memory of book).
Kaplan: shape < scale (importance 3): Performance depends far more on total parameter count than on model shape (depth vs width ratio).. Source: (from training memory of book).
B_crit ∝ L^(-4.8) (importance 3): Critical batch size scales as an inverse power law with loss, approximately L to the power of -4.8.. Source: (from training memory of book).
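S_min ∝ N^0.74 (convergence steps) (importance 3): In the paper, S_min is the minimum number of optimizer steps needed to reach a given loss when training far above the critical batch size; the exponent 0.74 belongs to the data requirement D ≳ (5×10^3)·N^0.74 for avoiding overfitting, not to a step count. Source: (from training memory of book).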
D scales slower than N (importance 3): Data needs grow slower than parameter needs: doubling parameters requires less than doubling data for same loss.. Source: (from training memory of book).
Kaplan transfer gap ∝ N^(-0.1) (importance 3): The performance gap between training and test distributions shrinks as a power law in model size.. Source: (from training memory of book).
Kaplan: convergence hits wall at S_min (importance 3): Training beyond convergence yields near-zero loss improvement regardless of additional compute spent.. Source: (from training memory of book).
Kaplan: stop at ~10% of S_min (importance 3): For compute efficiency, training should stop while the loss is still roughly 10% above its converged value (the ratio α_N/α_S), with the saved compute spent on a larger model. Source: (from training memory of book).
Kaplan log-log linear plots (importance 3): When plotted on log-log axes, loss vs N/D/C relationships are remarkably linear across many orders of magnitude.. Source: (from training memory of book).
Kaplan: small models plateau early (importance 3): Models below a critical size for given data reach a performance plateau quickly and gain little from extended training.. Source: (from training memory of book).
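Kaplan: optimal D ≈ 5N tokens (importance 3): The paper's data requirement is sublinear rather than a fixed tokens-per-parameter ratio: roughly D ≳ (5×10^3)·N^0.74 tokens are needed to avoid overfitting, so an 8× larger model needs only about 5× more data. Source: (from training memory of book).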
Kaplan: test loss predicts downstream (importance 3): Pre-training test loss is highly predictive of downstream task performance across diverse benchmarks.. Source: (from training memory of book).
L(N,D) = [(N_c/N)^(α_N/α_D) + D_c/D]^α_D (importance 3): The unified formula combining N and D dependencies, with critical scales N_c and D_c and exponents α_N, α_D.. Source: (from training memory of book).
Kaplan: N^0.73 S^0.27 compute split (importance 3): The values 0.73 and 0.27 are exponents, not percentages: as compute grows, model size scales as C^0.73 and training tokens D = B·S as C^0.27 (batch size as C^0.24, steps as C^0.03). Source: (from training memory of book).
Kaplan: training past S_min → waste (importance 3): Compute spent training beyond convergence yields <1% performance gains; better spent on larger models.. Source: (from training memory of book).
Kaplan: GPT-3 matched predictions (importance 3): Scaling laws predicted from smaller models accurately forecasted GPT-3's performance, validating extrapolation.. Source: (from training memory of book).
Kaplan: exclude embeddings from N (importance 2): Embedding parameters scale differently than model parameters and should be excluded from N for accurate scaling laws.. Source: (from training memory of book).
Kaplan width ≈ depth when N fixed (importance 2): For a fixed parameter count, making models wider or deeper yields nearly identical performance.. Source: (from training memory of book).
Kaplan: # attention heads ≈ irrelevant (importance 2): Number of attention heads has negligible impact on scaling laws when N is held constant.. Source: (from training memory of book).
Kaplan: low run-to-run variance (importance 2): Repeated runs with different random seeds show minimal variance; scaling laws are robust to initialization.. Source: (from training memory of book).
Kaplan: parallelism limited by B_crit (importance 2): Data parallelism beyond critical batch size wastes compute; model parallelism becomes necessary for larger models.. Source: (from training memory of book).
Kaplan: L layers ∝ N^0.6 optimal (importance 2): For a given N, optimal depth scales as the 0.6 power of parameters.. Source: (from training memory of book).
Kaplan: d_model ∝ N^0.4 optimal (importance 2): For a given N, optimal width (model dimension) scales as the 0.4 power of parameters.. Source: (from training memory of book).
Kaplan: train-valid gap ∝ 1/√N (importance 2): Difference between training and validation loss shrinks as inverse square root of model size.. Source: (from training memory of book).
Kaplan estimates L_∞ ≈ 1.7 nats (importance 2): The extrapolated compute and data trends intersect near 1.7 nats per token (~2.45 bits), which the paper tentatively reads as a rough estimate of the entropy of natural language. Source: (from training memory of book).
Kaplan: LR schedule ≈ doesn't matter (importance 2): Scaling laws robust to variations in learning rate schedule shape and warmup duration.. Source: (from training memory of book).
Kaplan: no magic architecture found (importance 2): Testing variants found no architectural changes that significantly beat the power laws; Transformers aren't special.. Source: (from training memory of book).
Kaplan: weight decay ≈ irrelevant (importance 1): Presence or absence of weight decay has minimal effect on scaling laws.. Source: (from training memory of book).
Kaplan: nats (natural log) loss (importance 1): Loss measured in nats (base e) rather than bits (base 2); conversion: 1 nat ≈ 1.44 bits.. Source: (from training memory of book).
Kaplan: MoE not tested (importance 1): Mixture-of-experts architectures not included; unclear if scaling laws generalize to conditionally-activated parameters.. Source: (from training memory of book).
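
The combined formula L(N, D) listed above can be checked numerically. The sketch below implements it and confirms that it reduces to the single-variable laws when the other resource is unlimited. The default constants are recalled from the paper and should be treated as approximate assumptions; the function name is hypothetical.

```python
def loss_nd(n, d, n_c=8.8e13, d_c=5.4e13, alpha_n=0.076, alpha_d=0.095):
    """L(N, D) = [(N_c/N)^(alpha_N/alpha_D) + D_c/D]^alpha_D, in nats/token."""
    return ((n_c / n) ** (alpha_n / alpha_d) + d_c / d) ** alpha_d

INF = float("inf")
n = 1e9

# With unlimited data the formula collapses to (N_c / N) ** alpha_N.
print(loss_nd(n, INF), (8.8e13 / n) ** 0.076)

# Shrinking the dataset at fixed N raises the predicted loss (overfitting).
for d in (1e12, 1e10, 1e8):
    print(f"D = {d:.0e}: L = {loss_nd(n, d):.3f}")
```

The D_c/D term is what produces the overfitting penalty: it is negligible when D is large relative to N^0.74 and dominates when the dataset is small.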

Methods

Kaplan Transformer decoder-only (importance 3): All experiments use decoder-only Transformers trained on language modeling, ranging from 768 to 1.5 billion non-embedding parameters. Source: (from training memory of book).
WebText2 training corpus (importance 3): The primary training dataset, an expanded version of WebText containing roughly 2.3×10^10 tokens. Source: (from training memory of book).
Kaplan Adam optimization (importance 2): All models trained with Adam optimizer; scaling laws hold regardless of optimizer choice.. Source: (from training memory of book).
Kaplan cosine learning rate decay (importance 2): Learning rate decays on a cosine schedule; scaling laws are robust to variations in schedule.. Source: (from training memory of book).
Kaplan 1024-token context (importance 2): Most experiments use 1024-token context windows; scaling laws are insensitive to moderate context length variations.. Source: (from training memory of book).
Kaplan fixed-LR parameter scan (importance 2): Systematically varying N while keeping learning rate and other hyperparameters constant to isolate scaling effects.. Source: (from training memory of book).
Kaplan 50k BPE vocabulary (importance 1): All models use 50,257-token BPE vocabulary; vocabulary size itself doesn't affect scaling laws significantly.. Source: (from training memory of book).
Kaplan 93%-5%-2% split (importance 1): WebText2 divided into 93% train, 5% validation, 2% test to measure generalization.. Source: (from training memory of book).
Kaplan LayerNorm (importance 1): All models use LayerNorm; choice of normalization doesn't materially affect scaling laws.. Source: (from training memory of book).
Kaplan full dense attention (importance 1): All models use full O(n²) attention; sparse attention patterns not explored.. Source: (from training memory of book).
Kaplan learning rate warmup (importance 1): Models use brief linear warmup before cosine decay; warmup length doesn't affect scaling laws.. Source: (from training memory of book).
Kaplan BPE tokenization (importance 1): Byte Pair Encoding used for all models; tokenization method doesn't affect scaling exponents.. Source: (from training memory of book).
Kaplan checkpoint every 1000 steps (importance 1): Models evaluated on test set every 1000 training steps to measure learning curves.. Source: (from training memory of book).
Kaplan: 10+ runs per config (importance 1): Each data point represents 10+ independent training runs; error bars are tight.. Source: (from training memory of book).
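
For the model sizes quoted above, the non-embedding parameter count of a standard Transformer can be approximated as N ≈ 12 · n_layer · d_model^2. This is a sketch under the usual shape assumptions (attention dimension equal to d_model, feed-forward dimension equal to 4 · d_model, biases and LayerNorm ignored); the function name is hypothetical.

```python
def non_embedding_params(n_layer, d_model):
    """Approximate N, excluding token and positional embeddings.

    Per layer: 4 * d_model^2 for the attention projections (Q, K, V, output)
    plus 8 * d_model^2 for the two feed-forward matrices with d_ff = 4 * d_model.
    """
    return 12 * n_layer * d_model ** 2

# GPT-2's largest configuration: 48 layers, d_model = 1600.
print(non_embedding_params(48, 1600))  # 1474560000, i.e. about 1.5B
```

Counting this way, depth and width enter only through the product n_layer · d_model^2, which is one reason model shape can be varied at roughly fixed N.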

Entities

GPT-2 (Radford et al. 2019) (importance 2): 1.5B-parameter model used as a reference point; trained on the original WebText corpus. Source: (from training memory of book).
GPT-3 (Brown et al. 2020) (importance 2): 175B parameter model trained contemporaneously; exemplifies the scaling laws in practice.. Source: (from training memory of book).
Jared Kaplan (OpenAI → Anthropic) (importance 2): Lead author, later co-founded Anthropic; work shaped GPT-3 and subsequent models.. Source: (from training memory of book).
Sam McCandlish (OpenAI → Anthropic) (importance 2): Second author, instrumental in theoretical framing of scaling laws.. Source: (from training memory of book).
Brown et al. (GPT-3, 2020) (importance 2): Contemporaneous work applying these scaling laws to train GPT-3.. Source: (from training memory of book).
Hestness et al. (2017) prior scaling (importance 2): Earlier work observing power-law scaling in supervised learning; Kaplan extends to unsupervised LM regime.. Source: (from training memory of book).
Hoffmann, Sifre et al. (DeepMind 2022) (importance 2): Team that revised Kaplan's compute-optimal scaling with the Chinchilla model and updated laws.. Source: (from training memory of book).
Tom Henighan (OpenAI) (importance 1): Third author on the paper, contributed to experimental design.. Source: (from training memory of book).
OpenAI compute cluster (2019-2020) (importance 1): V100 and TPUv3 hardware used for experiments; total compute ~10^6 petaflop/s-days.. Source: (from training memory of book).
NMT prior work (2017-2019) (importance 1): Earlier scaling observations in sequence-to-sequence models; Kaplan extends to pure LM.. Source: (from training memory of book).
Ilya Sutskever (OpenAI advisor) (importance 1): Senior advisor on the project; champion of scaling hypothesis.. Source: (from training memory of book).
Dario Amodei (OpenAI VP → Anthropic) (importance 1): OpenAI VP Research during this work; later co-founded Anthropic with Kaplan and McCandlish.. Source: (from training memory of book).

Relations

Kaplan loss scales as power law in N, D, C evidences L(N) ∝ N^(-0.076) (parameters)
Kaplan loss scales as power law in N, D, C evidences L(D) ∝ D^(-0.095) (dataset size)
Kaplan loss scales as power law in N, D, C evidences L(C) ∝ C^(-0.050) (compute)
L(N) ∝ N^(-0.076) (parameters) requires N (non-embedding parameters)
L(D) ∝ D^(-0.095) (dataset size) requires D (dataset tokens)
L(C) ∝ C^(-0.050) (compute) requires C (petaflop/s-days compute)
N (non-embedding parameters) enables Kaplan test loss L
D (dataset tokens) enables Kaplan test loss L
C (petaflop/s-days compute) enables Kaplan test loss L
Kaplan smooth & predictable scaling evidences Kaplan: architecture ≈ irrelevant
Kaplan: architecture ≈ irrelevant supports Kaplan: shape < scale
Kaplan smooth & predictable scaling evidences Kaplan: low run-to-run variance
Kaplan: early stopping wastes compute evidences Kaplan optimal N(C) allocation
Kaplan optimal N(C) allocation supports Kaplan: stop at ~10% of S_min
Kaplan: large models are sample-efficient evidences D scales slower than N
Kaplan: training to convergence wastes compute evidences Kaplan: convergence hits wall at S_min
Kaplan: convergence hits wall at S_min supports S_min ∝ N^0.74 (convergence steps)
Kaplan critical batch size B_crit evidences B_crit ∝ L^(-4.8)
B_crit ∝ L^(-4.8) requires Kaplan gradient noise scale
Kaplan: scaling laws transfer across distributions evidences Kaplan transfer gap ∝ N^(-0.1)
Kaplan: 8 OOM empirical validation supports Kaplan loss scales as power law in N, D, C
Kaplan log-log linear plots supports Kaplan: 8 OOM empirical validation
Kaplan Transformer decoder-only exemplifies N (non-embedding parameters)
WebText2 training corpus exemplifies D (dataset tokens)
Kaplan Adam optimization supports Kaplan: architecture ≈ irrelevant
Kaplan bottleneck regimes requires Kaplan loss scales as power law in N, D, C
Kaplan overfitting regime contradicts D scales slower than N
Kaplan underfitting regime evidences Kaplan: small models plateau early
Kaplan L_∞ irreducible loss evidences Kaplan estimates L_∞ ≈ 1.7 nats
Kaplan early vs. late compute efficiency supports Kaplan: training to convergence wastes compute
Kaplan: exclude embeddings from N enables N (non-embedding parameters)
GPT-2 (Radford et al. 2019) exemplifies Kaplan Transformer decoder-only
GPT-3 (Brown et al. 2020) exemplifies Kaplan: 8 OOM empirical validation
Kaplan: models undertrained in 2020 contradicts Kaplan optimal N(C) allocation
Hoffmann 2022 Chinchilla revision refutes Kaplan optimal N(C) allocation
Kaplan width ≈ depth when N fixed supports Kaplan: shape < scale
Kaplan per-step efficiency supports S_min ∝ N^0.74 (convergence steps)
Kaplan: # attention heads ≈ irrelevant supports Kaplan: architecture ≈ irrelevant
Jared Kaplan (OpenAI → Anthropic) cites Kaplan loss scales as power law in N, D, C
Sam McCandlish (OpenAI → Anthropic) cites Kaplan loss scales as power law in N, D, C
Kaplan ~6N FLOPs/token enables C (petaflop/s-days compute)
Kaplan: smooth curves, no emergence evidences Kaplan log-log linear plots
Kaplan: optimal D ≈ 5N tokens supports Kaplan optimal N(C) allocation
Brown et al. (GPT-3, 2020) builds-on Kaplan: 8 OOM empirical validation
Kaplan extrapolation to 10^13 FLOP requires Kaplan smooth & predictable scaling
Kaplan: L layers ∝ N^0.6 optimal supports Kaplan width ≈ depth when N fixed
Kaplan: d_model ∝ N^0.4 optimal supports Kaplan width ≈ depth when N fixed
Kaplan: only N, D, C matter supports Kaplan smooth & predictable scaling
Kaplan: weight decay ≈ irrelevant evidences Kaplan: only N, D, C matter
Post-Kaplan: Chinchilla-optimal era builds-on Hoffmann 2022 Chinchilla revision
Kaplan: bigger ≠ always better evidences Kaplan optimal N(C) allocation
Kaplan power-law universality class supports Kaplan loss scales as power law in N, D, C
Kaplan: test loss predicts downstream supports Kaplan test loss L
Kaplan loss scales as power law in N, D, C builds-on Hestness et al. (2017) prior scaling
Kaplan: no double descent in scaling supports Kaplan smooth & predictable scaling
Kaplan: simplicity of scaling laws supports Kaplan loss scales as power law in N, D, C
Hoffmann, Sifre et al. (DeepMind 2022) cites Hoffmann 2022 Chinchilla revision
L(N,D) = [(N_c/N)^(α_N/α_D) + D_c/D]^α_D evidences Kaplan loss scales as power law in N, D, C
N_c, D_c critical scales requires L(N,D) = [(N_c/N)^(α_N/α_D) + D_c/D]^α_D
Kaplan: train-valid gap ∝ 1/√N supports Kaplan: large models are sample-efficient
Kaplan paper shaped 2020-2022 era motivates GPT-3 (Brown et al. 2020)
Kaplan paper shaped 2020-2022 era motivates Brown et al. (GPT-3, 2020)
Kaplan: blessings of scale supports Kaplan: large models are sample-efficient
Kaplan: blessings of scale supports Kaplan: scaling laws transfer across distributions
Kaplan: predictable progress requires Kaplan extrapolation to 10^13 FLOP
Kaplan smooth loss landscape supports Kaplan: smooth curves, no emergence
Kaplan: N^0.73 S^0.27 compute split evidences Kaplan optimal N(C) allocation
Kaplan: large models wasteful → myth evidences D scales slower than N
Kaplan compute-optimal frontier requires L(C) ∝ C^(-0.050) (compute)
Kaplan: training past S_min → waste evidences Kaplan: training to convergence wastes compute
Kaplan: tricks < scale supports Kaplan: only N, D, C matter
Kaplan: LR schedule ≈ doesn't matter evidences Kaplan: only N, D, C matter
Kaplan: no theory yet contradicts Kaplan loss scales as power law in N, D, C
Kaplan compute elasticity exemplifies L(C) ∝ C^(-0.050) (compute)
Kaplan: GPT-3 matched predictions evidences Kaplan extrapolation to 10^13 FLOP
Kaplan: GPT-3 matched predictions supports GPT-3 (Brown et al. 2020)
Dario Amodei (OpenAI VP → Anthropic) cites Jared Kaplan (OpenAI → Anthropic)
Kaplan: scaling limit unknown contradicts Kaplan loss scales as power law in N, D, C
Kaplan FLOP accounting enables C (petaflop/s-days compute)
Kaplan: no magic architecture found supports Kaplan: architecture ≈ irrelevant
Kaplan: scaling → alignment concerns builds-on Kaplan: predictable progress
Kaplan perplexity = exp(L) exemplifies Kaplan test loss L
Kaplan: parallelism limited by B_crit supports B_crit ∝ L^(-4.8)
Kaplan: entropy sets floor supports Kaplan L_∞ irreducible loss
Kaplan LayerNorm requires Kaplan Transformer decoder-only
Kaplan 1024-token context requires Kaplan Transformer decoder-only
Kaplan 50k BPE vocabulary requires WebText2 training corpus
Kaplan 93%-5%-2% split requires WebText2 training corpus
Kaplan fixed-LR parameter scan enables Kaplan: architecture ≈ irrelevant
Kaplan cosine learning rate decay requires Kaplan Adam optimization
Kaplan learning rate warmup requires Kaplan cosine learning rate decay
Kaplan BPE tokenization requires Kaplan 50k BPE vocabulary
Kaplan full dense attention requires Kaplan Transformer decoder-only
Kaplan checkpoint every 1000 steps enables Kaplan test loss L
Kaplan: 10+ runs per config enables Kaplan: low run-to-run variance
Tom Henighan (OpenAI) cites Jared Kaplan (OpenAI → Anthropic)
OpenAI compute cluster (2019-2020) exemplifies C (petaflop/s-days compute)
Ilya Sutskever (OpenAI advisor) motivates Kaplan loss scales as power law in N, D, C
NMT prior work (2017-2019) precedes Hestness et al. (2017) prior scaling
Kaplan: nats (natural log) loss exemplifies Kaplan test loss L
Kaplan: no parameter sharing contradicts N (non-embedding parameters)
Kaplan: MoE not tested contradicts N (non-embedding parameters)
Kaplan: multi-epoch not tested contradicts D (dataset tokens)
Kaplan: universality across tasks supports Kaplan: scaling laws transfer across distributions
Kaplan: universality across tasks evidences Kaplan: test loss predicts downstream
