What is the central concept of Why Machines Learn: The Elegant Math Behind Modern AI?

The geometry of learning: what gradient descent sees in high-dimensional space. Loss landscape geometry is the high-dimensional surface of the loss function. Neural networks navigate vast spaces with countless local minima, saddle points, and plateaus.

What is Ananthaswamy's representation learning in Why Machines Learn: The Elegant Math Behind Modern AI?

Neural networks don't just fit curves — they learn useful internal representations. Each layer transforms inputs into progressively more abstract features.

What is Ananthaswamy's curve-fitting paradigm in Why Machines Learn: The Elegant Math Behind Modern AI?

The central metaphor: all machine learning is finding the right curve through data points in high-dimensional space. From Legendre's method of least squares to deep neural networks.
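
As an illustration of the curve-fitting idea, here is a minimal sketch of least squares for a straight line, using the standard closed-form solution. The function name fit_line and the sample data are hypothetical and not taken from the book.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations in x, and of cross deviations between x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on y = 2x + 1.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)
```

The fitted line minimizes the sum of squared vertical distances from the points to the line, which is the quantity the least-squares description refers to.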

What is gradient descent (Cauchy 1847) in Why Machines Learn: The Elegant Math Behind Modern AI?

Move in the direction of steepest descent of the loss function. The optimization workhorse behind modern deep learning.
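
A minimal sketch of the update rule described above, assuming a simple bowl-shaped loss f(x, y) = (x - 3)^2 + (y + 1)^2 chosen only for illustration. The names gradient_descent and bowl_grad, the learning rate, and the step count are all hypothetical choices.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    point = list(start)
    for _ in range(steps):
        g = grad(point)
        # Steepest descent: move each coordinate opposite to its partial derivative.
        point = [p - learning_rate * gi for p, gi in zip(point, g)]
    return point


def bowl_grad(p):
    """Gradient of the illustrative loss f(x, y) = (x - 3)^2 + (y + 1)^2."""
    x, y = p
    return [2 * (x - 3), 2 * (y + 1)]


minimum = gradient_descent(bowl_grad, start=[0.0, 0.0])
print(minimum)  # approaches the minimizer (3, -1)
```

On this convex bowl every step shrinks the distance to the minimum by a constant factor. Real loss landscapes are not this well behaved, so the learning rate and number of steps usually need tuning.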

What is the main argument of Why Machines Learn: The Elegant Math Behind Modern AI?

Minsky-Papert XOR impossibility. Minsky and Papert demonstrated that single-layer perceptrons cannot learn the XOR function, a problem that is not linearly separable. The result contributed to the first AI winter.
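
The limitation can be seen in a small experiment. The sketch below trains a single-layer perceptron with the classic error-driven update on AND (linearly separable) and on XOR (not linearly separable). The helper names, zero initialization, and epoch count are illustrative assumptions.

```python
def train_perceptron(samples, epochs=50):
    """Perceptron learning rule with a step activation and integer weights."""
    w = [0, 0]
    b = 0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), target in samples:
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            delta = target - pred
            if delta != 0:
                # Nudge the weights toward the misclassified example.
                w[0] += delta * x1
                w[1] += delta * x2
                b += delta
                errors += 1
        if errors == 0:  # a full pass with no mistakes: converged
            break
    return w, b


def accuracy(samples, w, b):
    """Fraction of samples the trained perceptron classifies correctly."""
    hits = sum(
        (1 if w[0] * x1 + w[1] * x2 + b > 0 else 0) == target
        for (x1, x2), target in samples
    )
    return hits / len(samples)


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

print(accuracy(AND, *train_perceptron(AND)))  # reaches 1.0
print(accuracy(XOR, *train_perceptron(XOR)))  # stays below 1.0
```

No single line can separate the XOR labels, so the perceptron never classifies all four points correctly, however long it trains.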

Why Machines Learn: The Elegant Math Behind Modern AI · Knowledge Graph

Knowledge Graph: Why Machines Learn: The Elegant Math Behind Modern AI (Anil Ananthaswamy, 2024)

Editorial spotlight: the geometry of learning, or what gradient descent sees in high-dimensional space

Concepts

Ananthaswamy's curve-fitting paradigm (importance 5): The central metaphor: all machine learning is finding the right curve through data points in high-dimensional space. From Legendre's method of least squares to deep neural networks. Source: (from training memory of book).
Ananthaswamy's feature space transformation (importance 5): The key insight bridging linear and nonlinear learning: transform the input into a higher-dimensional space where the problem becomes linearly separable. Source: (from training memory of book).
loss landscape geometry (importance 5): The high-dimensional surface of the loss function. Neural networks navigate vast spaces with countless local minima, saddle points, and plateaus. Source: (from training memory of book).
Ananthaswamy's representation learning (importance 5): Neural networks don't just fit curves; they learn useful internal representations. Each layer transforms inputs into progressively more abstract features. Source: (from training memory of book).
bias-variance decomposition (importance 4): Total error splits into bias (underfitting), variance (overfitting), and irreducible noise. The fundamental tradeoff in statistical learning. Source: (from training memory of book).
word embeddings as geometry (importance 4): Words mapped to vectors where semantic similarity corresponds to geometric proximity. Linear relationships capture analogies (king - man + woman ≈ queen). Source: (from training memory of book).
high-dimensional space intuition (importance 4): In high dimensions, most of the volume is near the surface of a sphere. Random points are nearly orthogonal. Counter-intuitive geometry dominates ML. Source: (from training memory of book).
generative modeling paradigm (importance 4): Learn the probability distribution of data, then sample from it to generate new examples. Contrasts with discriminative models that learn decision boundaries. Source: (from training memory of book).
reinforcement learning framework (importance 4): Agent learns by trial and error, receiving rewards from environment. Goal is to maximize cumulative reward over time through learned policy. Source: (from training memory of book).
transfer learning paradigm (importance 4): Pre-train on large dataset, fine-tune on specific task. Lower layers learn general features reusable across tasks; upper layers specialize. Source: (from training memory of book).
Gaussian justification for least squares (importance 3): Gauss showed that if measurement errors follow a normal distribution, least squares gives the maximum likelihood estimate. Connected curve-fitting to probability theory. Source: (from training memory of book).
Vapnik-Chervonenkis dimension (importance 3): Measure of a model's capacity: the largest number of points it can shatter (correctly classify all possible labelings). Controls generalization bounds. Source: (from training memory of book).
chain rule as gradient highway (importance 3): Derivatives propagate backward through composed functions. Each layer's gradient is the product of local derivatives along the path. Source: (from training memory of book).
vanishing gradient problem (importance 3): In deep networks with sigmoid/tanh, gradients shrink exponentially through layers during backprop. Early layers learn extremely slowly. Source: (from training memory of book).
distributional hypothesis (Harris) (importance 3): Words that occur in similar contexts have similar meanings. Linguistic foundation for learned word representations. Source: (from training memory of book).
curse of dimensionality (importance 3): As dimensions increase, data becomes exponentially sparse. Distances lose meaning. Need exponentially more data to maintain density. Source: (from training memory of book).
GAN minimax game (importance 3): Generator minimizes log-probability discriminator is correct; discriminator maximizes it. Nash equilibrium when generator matches data distribution. Source: (from training memory of book).
Markov Decision Process (MDP) (importance 3): Formal framework: states, actions, transition probabilities, rewards. Markov property: future depends only on current state, not history. Source: (from training memory of book).
Bellman equation (importance 3): Value of state equals immediate reward plus discounted value of next state. Recursive relationship enabling dynamic programming solutions. Source: (from training memory of book).
exploration-exploitation tradeoff (importance 3): Balance trying new actions (exploration) versus choosing known good actions (exploitation). Fundamental dilemma in sequential decision making. Source: (from training memory of book).
Shannon entropy (importance 3): Measure of uncertainty/information content: H(X) = -Σ p(x) log p(x). Foundation for information-theoretic view of learning. Source: (from training memory of book).
Kullback-Leibler divergence (importance 3): Measure how one distribution differs from another: KL(P||Q) = Σ p log(p/q). Asymmetric; central to VAE and information bottleneck. Source: (from training memory of book).
information bottleneck principle (importance 3): Optimal representation compresses input X while preserving information about output Y. Trade off compression I(X;Z) against prediction I(Z;Y). Source: (from training memory of book).
Bayesian inference framework (importance 3): Update beliefs through Bayes' rule: P(θ|data) ∝ P(data|θ)P(θ). Posterior combines likelihood and prior; principled uncertainty quantification. Source: (from training memory of book).
emergent abilities in large models (importance 3): Capabilities not present in smaller models suddenly appear at scale (few-shot learning, chain-of-thought reasoning). Qualitative phase transitions. Source: (from training memory of book).
in-context learning (importance 3): Large language models adapt to new tasks from examples in the prompt without weight updates. Meta-learning emerges from scale. Source: (from training memory of book).
AI alignment problem (importance 3): Ensuring AI systems do what humans want them to do. Specification gaming, reward hacking, distributional shift: optimizers find unexpected solutions. Source: (from training memory of book).
neural network interpretability (importance 3): Understanding what models have learned and how they make decisions. Activation visualization, attention maps, probing classifiers: reverse-engineering black boxes. Source: (from training memory of book).
adversarial examples (importance 3): Imperceptible perturbations that cause misclassification. Reveal brittleness of learned representations; challenge to robustness and security. Source: (from training memory of book).
implicit regularization of SGD (importance 3): Stochastic gradient descent biases toward simpler solutions even without explicit regularization. Optimization algorithm itself has inductive bias. Source: (from training memory of book).
feature learning vs kernel methods (importance 3): Neural networks learn task-relevant features during training; kernel methods use fixed features. Enables transfer learning and sample efficiency. Source: (from training memory of book).
architectural inductive biases (importance 3): Architecture choices encode assumptions about problem structure. Convolutions → translation equivariance (approximate invariance once pooling is added). Attention → permutation equivariance + content-based access. Source: (from training memory of book).
self-supervised learning (importance 3): Learn from unlabeled data by predicting parts from other parts. Contrastive learning, masked prediction, denoising: labels come from data structure. Source: (from training memory of book).
distribution shift and robustness (importance 3): Models fail when test distribution differs from training. Covariate shift, label shift, concept drift. Fundamental challenge to deployment reliability. Source: (from training memory of book).
intelligence without understanding (importance 3): Models exhibit sophisticated behavior without human-like understanding. Stochastic parrots vs genuine comprehension: a philosophical question with practical implications. Source: (from training memory of book).
mutual information (importance 2): How much knowing X reduces uncertainty about Y: I(X;Y) = H(Y) - H(Y|X). Quantifies dependence between variables. Source: (from training memory of book).
Neural Tangent Kernel (NTK) (importance 2): In infinite-width limit, neural networks behave like kernel methods with fixed kernel. Training is equivalent to linear regression in feature space. Source: (from training memory of book).
double descent phenomenon (importance 2): Test error first decreases, then increases (classical overfitting), then decreases again with more parameters. Challenges bias-variance tradeoff intuition. Source: (from training memory of book).
grokking (delayed generalization) (importance 2): Models suddenly generalize long after achieving perfect training accuracy. Understanding generalizes later than memorization. Source: (from training memory of book).
Occam's razor (inductive bias) (importance 2): Prefer simpler explanations consistent with data. Learning algorithms embed inductive biases toward particular solution classes. Source: (from training memory of book).
PAC learning framework (Valiant) (importance 2): Probably Approximately Correct: formalize learnability via sample complexity bounds. How many examples are needed to learn with high confidence? Source: (from training memory of book).
computational complexity of learning (importance 2): Many learning problems are NP-hard in worst case. Practical success comes from structure in real data and approximate solutions. Source: (from training memory of book).
SGD noise as feature (importance 2): Mini-batch noise isn't just inefficiency; it helps escape sharp minima and find flat regions with better generalization. Source: (from training memory of book).
meta-learning (learning to learn) (importance 2): Train models to quickly adapt to new tasks with few examples. Learn initialization, optimizer, or architecture that generalizes across task distribution. Source: (from training memory of book).
few-shot learning (importance 2): Generalize from very few examples per class. Requires transfer from prior tasks or architectural inductive biases like attention over support set. Source: (from training memory of book).
continual learning and catastrophic forgetting (importance 2): Learning new tasks causes forgetting of old tasks unless explicitly prevented. Elastic weight consolidation, memory replay, progressive networks. Source: (from training memory of book).
multi-task learning (importance 2): Train single model on multiple related tasks. Shared representations enable positive transfer; task-specific heads specialize. Source: (from training memory of book).
fairness and bias in ML (importance 2): Models inherit biases from training data. Disparate impact across demographics; feedback loops amplify inequality. Technical and ethical challenge. Source: (from training memory of book).
causality versus correlation (importance 2): ML finds correlations; causal reasoning requires interventions and counterfactuals. Spurious correlations fail under distribution shift. Source: (from training memory of book).
uncertainty quantification (importance 2): Models should know what they don't know. Aleatoric (data noise) vs epistemic (model uncertainty). Calibration, conformal prediction, Bayesian methods. Source: (from training memory of book).
symbolic-neural hybrid approaches (importance 2): Combine neural networks' pattern recognition with symbolic reasoning's compositionality and interpretability. Neuro-symbolic AI, differentiable programs. Source: (from training memory of book).
energy-based models (importance 2): Assign energy to configurations; probability ∝ exp(-energy). Unifies many model classes; training via contrastive divergence or score matching. Source: (from training memory of book).
world models and model-based RL (importance 2): Learn predictive model of environment dynamics, use for planning or data augmentation. Sample efficiency gains; harder to train than model-free. Source: (from training memory of book).
neural collapse phenomenon (importance 1): Late in training, within-class features converge toward class means; class means form simplex equiangular tight frame. Last-layer geometry simplifies. Source: (from training memory of book).
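
The Shannon entropy and Kullback-Leibler divergence formulas listed among the concepts above can be checked numerically. The sketch below assumes discrete distributions given as lists of probabilities, uses base-2 logarithms (bits), and assumes q is nonzero wherever p is nonzero. The function names and example distributions are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy H(X) = -sum p(x) log2 p(x), in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def kl_divergence(p, q):
    """KL(P||Q) = sum p log2(p/q); assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


fair = [0.5, 0.5]      # a fair coin
biased = [0.9, 0.1]    # an illustrative biased coin

print(entropy(fair))                # 1.0 bit for a fair coin
print(kl_divergence(fair, biased))  # differs from the reverse direction
print(kl_divergence(biased, fair))
```

The two KL values differ, which is the asymmetry noted in the Kullback-Leibler divergence entry.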

Claims

Minsky-Papert XOR impossibility (importance 4): Demonstrated that single-layer perceptrons cannot learn the XOR function, a problem that is not linearly separable. Contributed to the first AI winter. Source: (from training memory of book).
universal approximation theorem (importance 4): A single hidden layer with enough neurons can approximate any continuous function on a compact domain to arbitrary accuracy. Theoretical justification for neural network expressiveness. Source: (from training memory of book).
depth over width efficiency (importance 4): Deep networks can represent certain functions exponentially more efficiently than shallow wide networks. Depth creates hierarchical feature abstraction. Source: (from training memory of book).
manifold hypothesis (importance 4): Real-world high-dimensional data (images, text) lies on or near low-dimensional manifolds embedded in high-dimensional space. Justifies dimensionality reduction. Source: (from training memory of book).
neural scaling laws (importance 4): Model performance follows predictable power laws in compute, data, and parameters. Loss decreases smoothly with scale; emergent abilities appear at thresholds. Source: (from training memory of book).
lottery ticket hypothesis (importance 2): Dense networks contain sparse subnetworks that can train to comparable accuracy in isolation. Weight initialization contains lucky subnetworks. Source: (from training memory of book).
No Free Lunch theorem (importance 2): Averaged over all possible problems, all optimization algorithms perform equally. Success requires matching algorithm to problem structure. Source: (from training memory of book).
loss landscape flatness correlates with generalization (importance 2): Solutions in flat basins (low curvature) generalize better than sharp minima. Robust to perturbations in weights. Source: (from training memory of book).
mode connectivity (importance 1): Different local minima can be connected by paths of low loss. Loss landscapes have more structure than randomly initialized networks suggest. Source: (from training memory of book).

Empirical results

AlexNet ImageNet breakthrough (2012) (importance 4): Deep convolutional network won ImageNet competition by massive margin, reigniting interest in neural networks after decades of dormancy. Source: (from training memory of book).

Methods

Rosenblatt's Perceptron (1958) (importance 5): First algorithm that learned from examples by adjusting weights. Could learn linear separators in feature space through iterative weight updates. Source: (from training memory of book).
gradient descent (Cauchy 1847) (importance 5): Move in the direction of steepest descent of the loss function. The optimization workhorse behind modern deep learning. Source: (from training memory of book).
backpropagation algorithm (importance 5): Efficient computation of gradients in multi-layer networks using the chain rule. Popularized by Rumelhart, Hinton, Williams (1986). Source: (from training memory of book).
attention mechanism (importance 5): Weighted focus on different parts of input based on learned relevance. Key innovation enabling Transformers to process sequences without recurrence. Source: (from training memory of book).
Transformer architecture (2017) (importance 5): Purely attention-based sequence model. Self-attention layers compute representations by weighted combinations of all positions in parallel. Source: (from training memory of book).
Generative Adversarial Network (GAN) (importance 5): Two networks in competition: generator creates fake samples, discriminator distinguishes real from fake. Generator improves to fool discriminator. Source: (from training memory of book).
Legendre's method of least squares (1805) (importance 4): The foundational curve-fitting technique: minimize the sum of squared vertical distances from points to a line. First systematic approach to learning from data. Source: (from training memory of book).
kernel trick (implicit high-dimensional mapping) (importance 4): Compute inner products in high-dimensional feature space without explicitly transforming the data. Makes SVMs computationally tractable. Source: (from training memory of book).
Vapnik's Support Vector Machine (importance 4): Find the maximum-margin hyperplane that separates classes. Only the support vectors (boundary points) matter for defining the decision boundary. Source: (from training memory of book).
convolutional layers (LeCun) (importance 4): Sliding filters that detect local patterns, sharing weights across spatial positions. Exploit translation invariance in images. Source: (from training memory of book).
residual connections (ResNet) (importance 4): Skip connections that add layer input to output: y = F(x) + x. Allow training networks hundreds of layers deep by providing gradient shortcuts. Source: (from training memory of book).
self-attention (query-key-value) (importance 4): Each position queries all others, producing attention weights via softmax of key-query dot products. Output is weighted sum of values. Source: (from training memory of book).
Variational Autoencoder (VAE) (importance 4): Encode inputs as probability distributions in latent space, not fixed points. Train via variational inference to maximize lower bound on data likelihood. Source: (from training memory of book).
diffusion probabilistic models (importance 4): Learn to reverse a gradual noising process. Train network to denoise at each step; at inference, start from noise and iteratively denoise to generate samples. Source: (from training memory of book).
Deep Q-Network (DQN) (importance 4): Use deep neural network to approximate Q-function. Experience replay and target network stabilize training. Mastered Atari games from pixels. Source: (from training memory of book).
AlphaGo's Monte Carlo Tree Search (importance 4): Combine deep neural networks with tree search. Networks guide exploration; search provides training signal. Defeated world champion Lee Sedol. Source: (from training memory of book).
BERT masked language modeling (importance 4): Pre-training by predicting masked words from context. Bidirectional Transformer learns rich contextual representations of language. Source: (from training memory of book).
GPT autoregressive generation (importance 4): Transformer trained to predict next token given previous tokens. Unidirectional attention; scales to massive models generating coherent long-form text. Source: (from training memory of book).
RLHF (Reinforcement Learning from Human Feedback) (importance 4): Train reward model on human preferences, then optimize language model policy against it via RL. Aligns model outputs with human values. Source: (from training memory of book).
ReLU activation function (importance 3): max(0, x), a simple nonlinearity that mitigates vanishing gradients and speeds training. Replaced sigmoid/tanh in most modern architectures. Source: (from training memory of book).
positional encoding (importance 3): Add sinusoidal position information to input embeddings. Necessary because self-attention has no built-in notion of token order (it is permutation-equivariant), so it needs an explicit position signal. Source: (from training memory of book).
Word2Vec (Mikolov 2013) (importance 3): Learn word vectors by predicting context words (skip-gram) or target from context (CBOW). Shallow networks producing rich semantic spaces. Source: (from training memory of book).
principal component analysis (Pearson 1901) (importance 3): Find orthogonal directions of maximum variance. Project data onto top k principal components for dimensionality reduction. Source: (from training memory of book).
autoencoder architecture (importance 3): Neural network trained to reconstruct its input through a bottleneck layer. Bottleneck learns compressed representation of data. Source: (from training memory of book).
reparameterization trick (importance 3): Make stochastic sampling differentiable by expressing random variable as deterministic function of input and external noise. Enables backprop through VAE sampling. Source: (from training memory of book).
score matching (Hyvärinen) (importance 3): Learn the gradient of log probability (the score function) instead of the probability itself. Avoids intractable normalization constants. Source: (from training memory of book).
Q-learning (Watkins 1989) (importance 3): Learn action-value function Q(s,a) through iterative updates based on observed rewards and max Q of next state. Model-free temporal-difference learning. Source: (from training memory of book).
policy gradient methods (importance 3): Directly optimize policy by gradient ascent on expected reward. Use likelihood ratio trick to compute gradient through stochastic actions. Source: (from training memory of book).
actor-critic architecture (importance 3): Actor network outputs actions (policy), critic evaluates them (value function). Combine policy gradient and value estimation for variance reduction. Source: (from training memory of book).
cross-entropy loss function (importance 3): Measure divergence between predicted and true probability distributions. Standard loss for classification: -Σ y log(ŷ). Source: (from training memory of book).
dropout regularization (importance 3): Randomly zero activations during training with probability p. Prevents co-adaptation; approximate Bayesian inference over ensemble of subnetworks. Source: (from training memory of book).
batch normalization (importance 3): Normalize layer inputs to zero mean, unit variance per mini-batch. Reduces internal covariate shift; enables higher learning rates. Source: (from training memory of book).
Adam optimizer (importance 3): Adaptive learning rates per parameter using first and second moment estimates. Combines momentum and RMSProp; de facto standard optimizer. Source: (from training memory of book).
contrastive learning (SimCLR, MoCo) (importance 3): Learn representations by pulling together augmented views of same image, pushing apart different images. Matches or exceeds supervised pre-training. Source: (from training memory of book).
Langevin dynamics sampling (importance 2): Iteratively move in direction of score (gradient of log-prob) plus noise. Provably converges to target distribution given perfect score function. Source: (from training memory of book).
Bayesian neural networks (importance 2): Place probability distributions over network weights instead of point estimates. Intractable exact inference; approximations via variational methods or MCMC. Source: (from training memory of book).
learning rate scheduling (importance 2): Gradually decrease learning rate during training (warmup, cosine decay, step decay). Helps convergence and fine-grained optimization late in training. Source: (from training memory of book).
prompt engineering (importance 2): Carefully design input text to elicit desired model behavior. Examples, instructions, role-play, chain-of-thought: an interface with frozen models. Source: (from training memory of book).
data augmentation (importance 2): Artificially expand training set via invariance-preserving transformations. Rotations, crops, color jitter for images; back-translation for text. Source: (from training memory of book).
Neural Architecture Search (NAS) (importance 2): Automate architecture design via reinforcement learning or evolutionary methods. Search over graph of operations; expensive but finds novel designs. Source: (from training memory of book).
quantization and low-precision training (importance 2): Reduce numerical precision (FP32 → INT8) for faster inference and smaller models. Surprisingly robust; 8-bit often sufficient. Source: (from training memory of book).
knowledge distillation (importance 2): Train small student model to match large teacher's outputs. Soft targets contain more information than hard labels; compress with little accuracy loss. Source: (from training memory of book).
mixup and cutmix regularization (importance 1): Train on interpolated examples (mixup) or pasted image regions (cutmix). Smooths decision boundaries; improves calibration and robustness. Source: (from training memory of book).
iterative magnitude pruning (importance 1): Remove smallest-weight connections, retrain, repeat. Discovers sparse subnetworks; used to find lottery tickets. Source: (from training memory of book).
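
The self-attention (query-key-value) entry above can be made concrete with a small sketch in plain Python. It assumes queries, keys, and values are already given as lists of vectors (no learned projection matrices) and applies the 1/sqrt(d_k) scaling used in the Transformer, which the entry itself does not mention. All names and numbers are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """For each query: softmax of scaled query-key dot products, then a weighted sum of values."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output coordinate at a time.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


# Identical keys give uniform attention weights, so the output is the mean of the values.
print(self_attention([[1.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]], [[0.0, 2.0], [4.0, 6.0]]))
```

Because every position attends to every other through the same dot-product rule, nothing in this computation depends on token order, which is why the positional encoding entry calls for an explicit position signal.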

Relations

Ananthaswamy's curve-fitting paradigm exemplifies Legendre's method of least squares (1805)
Legendre's method of least squares (1805) motivates Gaussian justification for least squares
Ananthaswamy's curve-fitting paradigm generalizes Ananthaswamy's feature space transformation
Rosenblatt's Perceptron (1958) exemplifies Ananthaswamy's curve-fitting paradigm
Minsky-Papert XOR impossibility refutes Rosenblatt's Perceptron (1958)
Ananthaswamy's feature space transformation supports Minsky-Papert XOR impossibility
kernel trick (implicit high-dimensional mapping) enables Ananthaswamy's feature space transformation
Vapnik's Support Vector Machine requires kernel trick (implicit high-dimensional mapping)
Vapnik-Chervonenkis dimension supports Vapnik's Support Vector Machine
bias-variance decomposition motivates Vapnik-Chervonenkis dimension
gradient descent (Cauchy 1847) enables Ananthaswamy's curve-fitting paradigm
backpropagation algorithm builds-on gradient descent (Cauchy 1847)
chain rule as gradient highway enables backpropagation algorithm
loss landscape geometry evidences gradient descent (Cauchy 1847)
universal approximation theorem generalizes Rosenblatt's Perceptron (1958)
depth over width efficiency supports universal approximation theorem
Ananthaswamy's representation learning evidences depth over width efficiency
convolutional layers (LeCun) exemplifies Ananthaswamy's representation learning
AlexNet ImageNet breakthrough (2012) evidences convolutional layers (LeCun)
ReLU activation function supports vanishing gradient problem
vanishing gradient problem contradicts backpropagation algorithm
residual connections (ResNet) supports vanishing gradient problem
attention mechanism exemplifies Ananthaswamy's representation learning
Transformer architecture (2017) requires attention mechanism
self-attention (query-key-value) exemplifies attention mechanism
positional encoding requires Transformer architecture (2017)
word embeddings as geometry exemplifies Ananthaswamy's feature space transformation
Word2Vec (Mikolov 2013) enables word embeddings as geometry
distributional hypothesis (Harris) motivates Word2Vec (Mikolov 2013)
high-dimensional space intuition supports word embeddings as geometry
curse of dimensionality evidences high-dimensional space intuition
manifold hypothesis supports curse of dimensionality
principal component analysis (Pearson 1901) exemplifies manifold hypothesis
autoencoder architecture exemplifies manifold hypothesis
generative modeling paradigm generalizes Ananthaswamy's curve-fitting paradigm
Variational Autoencoder (VAE) builds-on autoencoder architecture
reparameterization trick enables Variational Autoencoder (VAE)
Generative Adversarial Network (GAN) exemplifies generative modeling paradigm
GAN minimax game supports Generative Adversarial Network (GAN)
diffusion probabilistic models exemplifies generative modeling paradigm
score matching (Hyvärinen) enables diffusion probabilistic models
Langevin dynamics sampling requires score matching (Hyvärinen)
reinforcement learning framework generalizes Ananthaswamy's curve-fitting paradigm
Markov Decision Process (MDP) supports reinforcement learning framework
Bellman equation supports Markov Decision Process (MDP)
Q-learning (Watkins 1989) builds-on Bellman equation
Deep Q-Network (DQN) builds-on Q-learning (Watkins 1989)
policy gradient methods exemplifies reinforcement learning framework
actor-critic architecture builds-on policy gradient methods
AlphaGo's Monte Carlo Tree Search builds-on actor-critic architecture
exploration-exploitation tradeoff supports reinforcement learning framework
Shannon entropy supports Ananthaswamy's curve-fitting paradigm
cross-entropy loss function builds-on Shannon entropy
Kullback-Leibler divergence builds-on Shannon entropy
mutual information builds-on Kullback-Leibler divergence
information bottleneck principle builds-on mutual information
Bayesian inference framework generalizes Ananthaswamy's curve-fitting paradigm
Bayesian neural networks exemplifies Bayesian inference framework
dropout regularization supports Bayesian neural networks
batch normalization supports vanishing gradient problem
Adam optimizer builds-on gradient descent (Cauchy 1847)
learning rate scheduling supports Adam optimizer
transfer learning paradigm builds-on Ananthaswamy's representation learning
BERT masked language modeling builds-on Transformer architecture (2017)
GPT autoregressive generation builds-on Transformer architecture (2017)
neural scaling laws evidences GPT autoregressive generation
emergent abilities in large models evidences neural scaling laws
in-context learning exemplifies emergent abilities in large models
prompt engineering enables in-context learning
RLHF (Reinforcement Learning from Human Feedback) builds-on reinforcement learning framework
AI alignment problem motivates RLHF (Reinforcement Learning from Human Feedback)
neural network interpretability supports Ananthaswamy's representation learning
adversarial examples evidences loss landscape geometry
Neural Tangent Kernel (NTK) supports universal approximation theorem
lottery ticket hypothesis contradicts universal approximation theorem
double descent phenomenon contradicts bias-variance decomposition
grokking (delayed generalization) supports double descent phenomenon
implicit regularization of SGD evidences gradient descent (Cauchy 1847)
No Free Lunch theorem supports Ananthaswamy's curve-fitting paradigm
Occam's razor (inductive bias) motivates bias-variance decomposition
PAC learning framework (Valiant) generalizes Vapnik-Chervonenkis dimension
computational complexity of learning supports PAC learning framework (Valiant)
SGD noise as feature evidences gradient descent (Cauchy 1847)
loss landscape flatness correlates with generalization supports SGD noise as feature
mode connectivity evidences loss landscape geometry
neural collapse phenomenon evidences loss landscape geometry
feature learning vs kernel methods evidences Ananthaswamy's representation learning
architectural inductive biases supports feature learning vs kernel methods
data augmentation supports bias-variance decomposition
mixup and cutmix regularization builds-on data augmentation
self-supervised learning enables Ananthaswamy's representation learning
contrastive learning (SimCLR, MoCo) exemplifies self-supervised learning
meta-learning (learning to learn) generalizes transfer learning paradigm
few-shot learning exemplifies meta-learning (learning to learn)
Neural Architecture Search (NAS) enables architectural inductive biases
iterative magnitude pruning enables lottery ticket hypothesis
quantization and low-precision training supports Neural Architecture Search (NAS)
knowledge distillation builds-on transfer learning paradigm
continual learning and catastrophic forgetting generalizes transfer learning paradigm
multi-task learning exemplifies transfer learning paradigm
fairness and bias in ML evidences AI alignment problem
causality versus correlation contradicts Ananthaswamy's curve-fitting paradigm
distribution shift and robustness evidences causality versus correlation
uncertainty quantification builds-on Bayesian inference framework
symbolic-neural hybrid approaches generalizes Ananthaswamy's representation learning
energy-based models generalizes generative modeling paradigm
world models and model-based RL builds-on reinforcement learning framework
intelligence without understanding contradicts Ananthaswamy's representation learning
Ananthaswamy's curve-fitting paradigm evidences loss landscape geometry
Ananthaswamy's representation learning evidences loss landscape geometry
high-dimensional space intuition supports loss landscape geometry
manifold hypothesis supports Ananthaswamy's feature space transformation
Variational Autoencoder (VAE) exemplifies manifold hypothesis
Generative Adversarial Network (GAN) exemplifies manifold hypothesis
information bottleneck principle supports Ananthaswamy's representation learning
BERT masked language modeling exemplifies transfer learning paradigm
GPT autoregressive generation exemplifies transfer learning paradigm
RLHF (Reinforcement Learning from Human Feedback) builds-on GPT autoregressive generation
neural scaling laws evidences depth over width efficiency
in-context learning exemplifies meta-learning (learning to learn)
prompt engineering enables GPT autoregressive generation
adversarial examples evidences distribution shift and robustness
neural network interpretability supports AI alignment problem
loss landscape flatness correlates with generalization evidences loss landscape geometry
implicit regularization of SGD enables loss landscape flatness correlates with generalization
dropout regularization supports bias-variance decomposition
batch normalization evidences implicit regularization of SGD
residual connections (ResNet) enables depth over width efficiency
attention mechanism exemplifies architectural inductive biases
self-attention (query-key-value) enables Transformer architecture (2017)
contrastive learning (SimCLR, MoCo) enables Ananthaswamy's representation learning
diffusion probabilistic models requires score matching (Hyvärinen)
Deep Q-Network (DQN) builds-on Ananthaswamy's representation learning
AlphaGo's Monte Carlo Tree Search builds-on reinforcement learning framework
fairness and bias in ML evidences distribution shift and robustness
uncertainty quantification supports AI alignment problem
world models and model-based RL builds-on Ananthaswamy's representation learning
intelligence without understanding contradicts emergent abilities in large models
Ananthaswamy's feature space transformation enables kernel trick (implicit high-dimensional mapping)
word embeddings as geometry evidences high-dimensional space intuition
autoencoder architecture exemplifies Ananthaswamy's representation learning
Neural Tangent Kernel (NTK) generalizes kernel trick (implicit high-dimensional mapping)
lottery ticket hypothesis evidences implicit regularization of SGD
double descent phenomenon contradicts Vapnik-Chervonenkis dimension
grokking (delayed generalization) evidences implicit regularization of SGD
mode connectivity supports loss landscape flatness correlates with generalization
neural collapse phenomenon evidences Ananthaswamy's representation learning
data augmentation exemplifies architectural inductive biases
self-supervised learning enables transfer learning paradigm
few-shot learning supports in-context learning
Neural Architecture Search (NAS) enables depth over width efficiency
quantization and low-precision training supports iterative magnitude pruning
knowledge distillation builds-on Ananthaswamy's representation learning
continual learning and catastrophic forgetting generalizes multi-task learning
symbolic-neural hybrid approaches supports intelligence without understanding
energy-based models generalizes diffusion probabilistic models
cross-entropy loss function enables gradient descent (Cauchy 1847)
Kullback-Leibler divergence enables Variational Autoencoder (VAE)
information bottleneck principle supports Variational Autoencoder (VAE)
Bayesian inference framework enables uncertainty quantification

