Not a citable source: do not quote the graph as an authority. Edge labels and importance scores are interpretive judgments by the generating agent. Any claim worth citing must be traced back to the original paper.
Reliability note: Headline structure and importance-5 nodes are stable across runs. Mid-tier nodes (importance 2–3) and edge type distinctions are interpretive and may differ between runs. Each node carries a source citation; nodes marked "training memory" or "inferred" were not directly verified against the source document.
LOOMUS™ and the Knowledge-Loom methodology are proprietary. Visual system is original to LOOMUS.
Knowledge Graph: Why Machines Learn: The Elegant Math Behind Modern AI (Anil Ananthaswamy, 2024)
Editorial spotlight: the geometry of learning (what gradient descent sees in high-dimensional space)
Concepts
Ananthaswamy's curve-fitting paradigm (importance 5): The central metaphor: all machine learning is finding the right curve through data points in high-dimensional space. From Legendre's method of least squares to deep neural networks. Source: (from training memory of book).
Ananthaswamy's feature space transformation (importance 5): The key insight bridging linear and nonlinear learning: transform the input into a higher-dimensional space where the problem becomes linearly separable. Source: (from training memory of book).
loss landscape geometry (importance 5): The high-dimensional surface of the loss function. Neural networks navigate vast spaces with countless local minima, saddle points, and plateaus. Source: (from training memory of book).
Ananthaswamy's representation learning (importance 5): Neural networks don't just fit curves — they learn useful internal representations. Each layer transforms inputs into progressively more abstract features. Source: (from training memory of book).
bias-variance decomposition (importance 4): Total error splits into bias (underfitting), variance (overfitting), and irreducible noise. The fundamental tradeoff in statistical learning. Source: (from training memory of book).
word embeddings as geometry (importance 4): Words mapped to vectors where semantic similarity corresponds to geometric proximity. Linear relationships capture analogies (king - man + woman ≈ queen). Source: (from training memory of book).
high-dimensional space intuition (importance 4): In high dimensions, most of the volume is near the surface of a sphere. Random points are nearly orthogonal. Counter-intuitive geometry dominates ML. Source: (from training memory of book).
generative modeling paradigm (importance 4): Learn the probability distribution of data, then sample from it to generate new examples. Contrasts with discriminative models that learn decision boundaries. Source: (from training memory of book).
reinforcement learning framework (importance 4): Agent learns by trial and error, receiving rewards from environment. Goal is to maximize cumulative reward over time through learned policy. Source: (from training memory of book).
transfer learning paradigm (importance 4): Pre-train on large dataset, fine-tune on specific task. Lower layers learn general features reusable across tasks; upper layers specialize. Source: (from training memory of book).
Gaussian justification for least squares (importance 3): Gauss showed that if measurement errors follow a normal distribution, least squares gives the maximum likelihood estimate. Connected curve-fitting to probability theory. Source: (from training memory of book).
Vapnik-Chervonenkis dimension (importance 3): Measure of a model's capacity: the largest number of points it can shatter (correctly classify all possible labelings). Controls generalization bounds. Source: (from training memory of book).
chain rule as gradient highway (importance 3): Derivatives propagate backward through composed functions. Each layer's gradient is the product of local derivatives along the path. Source: (from training memory of book).
vanishing gradient problem (importance 3): In deep networks with sigmoid/tanh, gradients shrink exponentially through layers during backprop. Early layers learn extremely slowly. Source: (from training memory of book).
distributional hypothesis (Harris) (importance 3): Words that occur in similar contexts have similar meanings. Linguistic foundation for learned word representations. Source: (from training memory of book).
curse of dimensionality (importance 3): As dimensions increase, data becomes exponentially sparse. Distances lose meaning. Need exponentially more data to maintain density. Source: (from training memory of book).
GAN minimax game (importance 3): The discriminator maximizes the log-probability of correctly telling real samples from generated ones; the generator minimizes that same objective. At the Nash equilibrium the generator's distribution matches the data distribution. Source: (from training memory of book).
Markov Decision Process (MDP) (importance 3): Formal framework: states, actions, transition probabilities, rewards. Markov property: future depends only on current state, not history. Source: (from training memory of book).
Bellman equation (importance 3): Value of state equals immediate reward plus discounted value of next state. Recursive relationship enabling dynamic programming solutions. Source: (from training memory of book).
exploration-exploitation tradeoff (importance 3): Balance trying new actions (exploration) versus choosing known good actions (exploitation). Fundamental dilemma in sequential decision making. Source: (from training memory of book).
Shannon entropy (importance 3): Measure of uncertainty/information content: H(X) = -Σ p(x) log p(x). Foundation for information-theoretic view of learning. Source: (from training memory of book).
Kullback-Leibler divergence (importance 3): Measure how one distribution differs from another: KL(P||Q) = Σ p log(p/q). Asymmetric; central to VAE and information bottleneck. Source: (from training memory of book).
information bottleneck principle (importance 3): Optimal representation compresses input X while preserving information about output Y. Trade off compression I(X;Z) against prediction I(Z;Y). Source: (from training memory of book).
Bayesian inference framework (importance 3): Update beliefs through Bayes' rule: P(θ|data) ∝ P(data|θ)P(θ). Posterior combines likelihood and prior; principled uncertainty quantification. Source: (from training memory of book).
emergent abilities in large models (importance 3): Capabilities not present in smaller models suddenly appear at scale (few-shot learning, chain-of-thought reasoning). Qualitative phase transitions. Source: (from training memory of book).
in-context learning (importance 3): Large language models adapt to new tasks from examples in the prompt without weight updates. Meta-learning emerges from scale. Source: (from training memory of book).
AI alignment problem (importance 3): Ensuring AI systems do what humans want them to do. Specification gaming, reward hacking, distributional shift — optimizers find unexpected solutions. Source: (from training memory of book).
neural network interpretability (importance 3): Understanding what models have learned and how they make decisions. Activation visualization, attention maps, probing classifiers — reverse-engineering black boxes. Source: (from training memory of book).
adversarial examples (importance 3): Imperceptible perturbations that cause misclassification. Reveal brittleness of learned representations; challenge to robustness and security. Source: (from training memory of book).
implicit regularization of SGD (importance 3): Stochastic gradient descent biases toward simpler solutions even without explicit regularization. Optimization algorithm itself has inductive bias. Source: (from training memory of book).
feature learning vs kernel methods (importance 3): Neural networks learn task-relevant features during training; kernel methods use fixed features. Enables transfer learning and sample efficiency. Source: (from training memory of book).
architectural inductive biases (importance 3): Architecture choices encode assumptions about problem structure. Convolutions → translation equivariance (approximate invariance once pooling is added). Attention → permutation equivariance + content-based access. Source: (from training memory of book).
self-supervised learning (importance 3): Learn from unlabeled data by predicting parts from other parts. Contrastive learning, masked prediction, denoising — labels come from data structure. Source: (from training memory of book).
distribution shift and robustness (importance 3): Models fail when test distribution differs from training. Covariate shift, label shift, concept drift. Fundamental challenge to deployment reliability. Source: (from training memory of book).
intelligence without understanding (importance 3): Models exhibit sophisticated behavior without human-like understanding. Stochastic parrots vs genuine comprehension — philosophical question with practical implications. Source: (from training memory of book).
mutual information (importance 2): How much knowing X reduces uncertainty about Y: I(X;Y) = H(Y) - H(Y|X). Quantifies dependence between variables. Source: (from training memory of book).
Neural Tangent Kernel (NTK) (importance 2): In infinite-width limit, neural networks behave like kernel methods with fixed kernel. Training is equivalent to linear regression in feature space. Source: (from training memory of book).
double descent phenomenon (importance 2): Test error first decreases, then increases (classical overfitting), then decreases again with more parameters. Challenges bias-variance tradeoff intuition. Source: (from training memory of book).
grokking (delayed generalization) (importance 2): Models suddenly generalize long after achieving perfect training accuracy. Understanding generalizes later than memorization. Source: (from training memory of book).
Occam's razor (inductive bias) (importance 2): Prefer simpler explanations consistent with data. Learning algorithms embed inductive biases toward particular solution classes. Source: (from training memory of book).
PAC learning framework (Valiant) (importance 2): Probably Approximately Correct: formalize learnability via sample complexity bounds. How many examples are needed to learn with high confidence? Source: (from training memory of book).
computational complexity of learning (importance 2): Many learning problems are NP-hard in worst case. Practical success comes from structure in real data and approximate solutions. Source: (from training memory of book).
SGD noise as feature (importance 2): Mini-batch noise isn't just inefficiency — it helps escape sharp minima and find flat regions with better generalization. Source: (from training memory of book).
meta-learning (learning to learn) (importance 2): Train models to quickly adapt to new tasks with few examples. Learn initialization, optimizer, or architecture that generalizes across task distribution. Source: (from training memory of book).
few-shot learning (importance 2): Generalize from very few examples per class. Requires transfer from prior tasks or architectural inductive biases like attention over support set. Source: (from training memory of book).
continual learning and catastrophic forgetting (importance 2): Learning new tasks causes forgetting of old tasks unless explicitly prevented. Elastic weight consolidation, memory replay, progressive networks. Source: (from training memory of book).
multi-task learning (importance 2): Train single model on multiple related tasks. Shared representations enable positive transfer; task-specific heads specialize. Source: (from training memory of book).
fairness and bias in ML (importance 2): Models inherit biases from training data. Disparate impact across demographics; feedback loops amplify inequality. Technical and ethical challenge. Source: (from training memory of book).
causality versus correlation (importance 2): ML finds correlations; causal reasoning requires interventions and counterfactuals. Spurious correlations fail under distribution shift. Source: (from training memory of book).
uncertainty quantification (importance 2): Models should know what they don't know. Aleatoric (data noise) vs epistemic (model uncertainty). Calibration, conformal prediction, Bayesian methods. Source: (from training memory of book).
symbolic-neural hybrid approaches (importance 2): Combine neural networks' pattern recognition with symbolic reasoning's compositionality and interpretability. Neuro-symbolic AI, differentiable programs. Source: (from training memory of book).
energy-based models (importance 2): Assign energy to configurations; probability ∝ exp(-energy). Unifies many model classes; training via contrastive divergence or score matching. Source: (from training memory of book).
world models and model-based RL (importance 2): Learn predictive model of environment dynamics, use for planning or data augmentation. Sample efficiency gains; harder to train than model-free. Source: (from training memory of book).
neural collapse phenomenon (importance 1): Late in training, within-class features converge toward class means; class means form simplex equiangular tight frame. Last-layer geometry simplifies. Source: (from training memory of book).
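Worked example for the information-theory entries above. The Shannon entropy, Kullback-Leibler divergence, and mutual information nodes each state a formula; the sketch below evaluates those three formulas on toy distributions. It assumes discrete distributions given as lists of probabilities and base-2 logarithms (the entries do not fix a base). The function names and the example numbers are illustrative and do not come from the book.

```python
import math


def entropy(p):
    """H(X) = -sum p(x) log2 p(x), in bits; zero-probability terms contribute 0."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def kl_divergence(p, q):
    """KL(P||Q) = sum p log2(p/q); assumes q > 0 wherever p > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mutual_information(joint):
    """I(X;Y) = H(Y) - H(Y|X), where joint[i][j] = P(X=i, Y=j)."""
    px = [sum(row) for row in joint]          # marginal of X (row sums)
    py = [sum(col) for col in zip(*joint)]    # marginal of Y (column sums)
    h_y_given_x = sum(
        px[i] * entropy([pxy / px[i] for pxy in joint[i]])
        for i in range(len(joint))
        if px[i] > 0
    )
    return entropy(py) - h_y_given_x


fair_coin = [0.5, 0.5]
biased_coin = [0.9, 0.1]
independent = [[0.25, 0.25], [0.25, 0.25]]   # X tells nothing about Y
copied = [[0.5, 0.0], [0.0, 0.5]]            # Y is an exact copy of X

print("H(fair coin)      =", entropy(fair_coin))
print("KL(fair||biased)  =", kl_divergence(fair_coin, biased_coin))
print("KL(biased||fair)  =", kl_divergence(biased_coin, fair_coin))
print("I(independent)    =", mutual_information(independent))
print("I(copied)         =", mutual_information(copied))
```

The two KL values differ, which is the asymmetry the Kullback-Leibler entry mentions, and the mutual information runs from zero for independent variables up to the full entropy of Y when X determines Y.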
Claims
Minsky-Papert XOR impossibility (importance 4): Minsky and Papert (1969) demonstrated that single-layer perceptrons cannot learn the XOR function, which is not linearly separable. The result is widely credited with contributing to the first AI winter. Source: (from training memory of book).
universal approximation theorem (importance 4): A single hidden layer with enough neurons and a suitable nonlinear activation can approximate any continuous function on a compact domain to arbitrary accuracy. Theoretical justification for neural network expressiveness; it does not guarantee that training will find such a network. Source: (from training memory of book).
depth over width efficiency (importance 4): Deep networks can represent certain functions exponentially more efficiently than shallow wide networks. Depth creates hierarchical feature abstraction. Source: (from training memory of book).
manifold hypothesis (importance 4): Real-world high-dimensional data (images, text) lies on or near low-dimensional manifolds embedded in high-dimensional space. Justifies dimensionality reduction. Source: (from training memory of book).
neural scaling laws (importance 4): Model performance follows predictable power laws in compute, data, and parameters. Loss decreases smoothly with scale; emergent abilities appear at thresholds. Source: (from training memory of book).
lottery ticket hypothesis (importance 2): Dense networks contain sparse subnetworks that can train to comparable accuracy in isolation. Weight initialization contains lucky subnetworks. Source: (from training memory of book).
No Free Lunch theorem (importance 2): Averaged over all possible problems, all optimization algorithms perform equally. Success requires matching algorithm to problem structure. Source: (from training memory of book).
loss landscape flatness correlates with generalization (importance 2): Solutions in flat basins (low curvature) generalize better than sharp minima. Robust to perturbations in weights. Source: (from training memory of book).
mode connectivity (importance 1): Different local minima can be connected by paths of low loss. Loss landscapes have more structure than randomly initialized networks suggest. Source: (from training memory of book).
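Worked example for the XOR claim above. The Minsky-Papert entry, read together with the feature space transformation concept, can be checked on the four XOR points. The sketch below uses the textbook perceptron update rule in plain Python; it is not code from the book. The extra product feature x1*x2, the epoch cap, and all function names are assumptions chosen for illustration.

```python
def train_perceptron(samples, labels, max_epochs=1000):
    """Textbook perceptron rule; labels are +1 / -1. Returns (weights, bias)."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified, or exactly on the boundary
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:  # a full pass with no errors: converged
            break
    return w, b


def accuracy(w, b, samples, labels):
    """Fraction of points whose predicted sign matches the label."""
    correct = 0
    for x, y in zip(samples, labels):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        predicted = 1 if score > 0 else -1
        correct += predicted == y
    return correct / len(samples)


raw = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_labels = [-1, 1, 1, -1]

# Hypothetical feature map: append the product x1*x2 as a third coordinate.
lifted = [(x1, x2, x1 * x2) for x1, x2 in raw]

w_raw, b_raw = train_perceptron(raw, xor_labels)
w_lift, b_lift = train_perceptron(lifted, xor_labels)

raw_acc = accuracy(w_raw, b_raw, raw, xor_labels)
lifted_acc = accuracy(w_lift, b_lift, lifted, xor_labels)
print("raw inputs:   ", raw_acc)
print("lifted inputs:", lifted_acc)
```

On the raw two-dimensional inputs no line separates the classes, so the perceptron never classifies all four points correctly however long it runs. In the lifted three-dimensional space XOR equals x1 + x2 - 2*x1*x2, a linear function of the features, so the same update rule converges.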
Empirical results
AlexNet ImageNet breakthrough (2012) (importance 4): Deep convolutional network won ImageNet competition by massive margin, reigniting interest in neural networks after decades of dormancy. Source: (from training memory of book).
Methods
Rosenblatt's Perceptron (1958) (importance 5): First algorithm that learned from examples by adjusting weights. Could learn linear separators in feature space through iterative weight updates. Source: (from training memory of book).
gradient descent (Cauchy 1847) (importance 5): Move in the direction of steepest descent of the loss function. The optimization workhorse behind modern deep learning. Source: (from training memory of book).
backpropagation algorithm (importance 5): Efficient computation of gradients in multi-layer networks using the chain rule. Popularized by Rumelhart, Hinton, Williams (1986). Source: (from training memory of book).
attention mechanism (importance 5): Weighted focus on different parts of input based on learned relevance. Key innovation enabling Transformers to process sequences without recurrence. Source: (from training memory of book).
Transformer architecture (2017) (importance 5): Purely attention-based sequence model. Self-attention layers compute representations by weighted combinations of all positions in parallel. Source: (from training memory of book).
Generative Adversarial Network (GAN) (importance 5): Two networks in competition: generator creates fake samples, discriminator distinguishes real from fake. Generator improves to fool discriminator. Source: (from training memory of book).
Legendre's method of least squares (1805) (importance 4): The foundational curve-fitting technique: minimize the sum of squared vertical distances from points to a line. First systematic approach to learning from data. Source: (from training memory of book).
kernel trick (implicit high-dimensional mapping) (importance 4): Compute inner products in high-dimensional feature space without explicitly transforming the data. Makes SVMs computationally tractable. Source: (from training memory of book).
Vapnik's Support Vector Machine (importance 4): Find the maximum-margin hyperplane that separates classes. Only the support vectors (boundary points) matter for defining the decision boundary. Source: (from training memory of book).
convolutional layers (LeCun) (importance 4): Sliding filters that detect local patterns, sharing weights across spatial positions. Exploit translation invariance in images. Source: (from training memory of book).
residual connections (ResNet) (importance 4): Skip connections that add layer input to output: y = F(x) + x. Allow training networks hundreds of layers deep by providing gradient shortcuts. Source: (from training memory of book).
self-attention (query-key-value) (importance 4): Each position queries all others, producing attention weights via softmax of key-query dot products. Output is weighted sum of values. Source: (from training memory of book).
Variational Autoencoder (VAE) (importance 4): Encode inputs as probability distributions in latent space, not fixed points. Train via variational inference to maximize lower bound on data likelihood. Source: (from training memory of book).
diffusion probabilistic models (importance 4): Learn to reverse a gradual noising process. Train network to denoise at each step; at inference, start from noise and iteratively denoise to generate samples. Source: (from training memory of book).
Deep Q-Network (DQN) (importance 4): Use deep neural network to approximate Q-function. Experience replay and target network stabilize training. Mastered Atari games from pixels. Source: (from training memory of book).
AlphaGo's Monte Carlo Tree Search (importance 4): Combine deep neural networks with tree search. Networks guide exploration; search provides training signal. Defeated world champion Lee Sedol. Source: (from training memory of book).
BERT masked language modeling (importance 4): Pre-training by predicting masked words from context. Bidirectional Transformer learns rich contextual representations of language. Source: (from training memory of book).
GPT autoregressive generation (importance 4): Transformer trained to predict next token given previous tokens. Unidirectional attention; scales to massive models generating coherent long-form text. Source: (from training memory of book).
RLHF (Reinforcement Learning from Human Feedback) (importance 4): Train reward model on human preferences, then optimize language model policy against it via RL. Aims to align model outputs with human preferences. Source: (from training memory of book).
ReLU activation function (importance 3): max(0, x), a simple nonlinearity that mitigates vanishing gradients (its derivative is 1 for positive inputs) and speeds training. Replaced sigmoid/tanh in most modern architectures. Source: (from training memory of book).
positional encoding (importance 3): Add position information (sinusoidal in the original Transformer) to input embeddings. Necessary because self-attention is permutation-equivariant: without an explicit position signal it cannot distinguish token orderings. Source: (from training memory of book).
Word2Vec (Mikolov 2013) (importance 3): Learn word vectors by predicting context words (skip-gram) or target from context (CBOW). Shallow networks producing rich semantic spaces. Source: (from training memory of book).
principal component analysis (Pearson 1901) (importance 3): Find orthogonal directions of maximum variance. Project data onto top k principal components for dimensionality reduction. Source: (from training memory of book).
autoencoder architecture (importance 3): Neural network trained to reconstruct its input through a bottleneck layer. Bottleneck learns compressed representation of data. Source: (from training memory of book).
reparameterization trick (importance 3): Make stochastic sampling differentiable by expressing random variable as deterministic function of input and external noise. Enables backprop through VAE sampling. Source: (from training memory of book).
score matching (Hyvärinen) (importance 3): Learn the gradient of log probability (the score function) instead of the probability itself. Avoids intractable normalization constants. Source: (from training memory of book).
Q-learning (Watkins 1989) (importance 3): Learn action-value function Q(s,a) through iterative updates based on observed rewards and max Q of next state. Model-free temporal-difference learning. Source: (from training memory of book).
policy gradient methods (importance 3): Directly optimize policy by gradient ascent on expected reward. Use likelihood ratio trick to compute gradient through stochastic actions. Source: (from training memory of book).
actor-critic architecture (importance 3): Actor network outputs actions (policy), critic evaluates them (value function). Combine policy gradient and value estimation for variance reduction. Source: (from training memory of book).
cross-entropy loss function (importance 3): Measure divergence between predicted and true probability distributions. Standard loss for classification: -Σ y log(ŷ). Source: (from training memory of book).
dropout regularization (importance 3): Randomly zero activations during training with probability p. Prevents co-adaptation; approximate Bayesian inference over ensemble of subnetworks. Source: (from training memory of book).
batch normalization (importance 3): Normalize layer inputs to zero mean, unit variance per mini-batch. Originally motivated as reducing internal covariate shift; enables higher learning rates. Source: (from training memory of book).
Adam optimizer (importance 3): Adaptive learning rates per parameter using first and second moment estimates. Combines momentum and RMSProp; de facto standard optimizer. Source: (from training memory of book).
contrastive learning (SimCLR, MoCo) (importance 3): Learn representations by pulling together augmented views of same image, pushing apart different images. Can match or exceed supervised pre-training on some benchmarks. Source: (from training memory of book).
Langevin dynamics sampling (importance 2): Iteratively move in direction of score (gradient of log-prob) plus noise. Provably converges to target distribution given perfect score function. Source: (from training memory of book).
Bayesian neural networks (importance 2): Place probability distributions over network weights instead of point estimates. Intractable exact inference; approximations via variational methods or MCMC. Source: (from training memory of book).
learning rate scheduling (importance 2): Gradually decrease learning rate during training (warmup, cosine decay, step decay). Helps convergence and fine-grained optimization late in training. Source: (from training memory of book).
prompt engineering (importance 2): Carefully design input text to elicit desired model behavior. Examples, instructions, role-play, chain-of-thought — interface with frozen models. Source: (from training memory of book).
data augmentation (importance 2): Artificially expand training set via invariance-preserving transformations. Rotations, crops, color jitter for images; back-translation for text. Source: (from training memory of book).
Neural Architecture Search (NAS) (importance 2): Automate architecture design via reinforcement learning or evolutionary methods. Search over graph of operations; expensive but finds novel designs. Source: (from training memory of book).
quantization and low-precision training (importance 2): Reduce numerical precision (FP32 → INT8) for faster inference and smaller models. Surprisingly robust; 8-bit often sufficient. Source: (from training memory of book).
knowledge distillation (importance 2): Train small student model to match large teacher's outputs. Soft targets contain more information than hard labels; often compresses the model with little accuracy loss. Source: (from training memory of book).
mixup and cutmix regularization (importance 1): Train on interpolated examples (mixup) or pasted image regions (cutmix). Smooths decision boundaries; improves calibration and robustness. Source: (from training memory of book).
iterative magnitude pruning (importance 1): Remove smallest-weight connections, retrain, repeat. Discovers sparse subnetworks; used to find lottery tickets. Source: (from training memory of book).
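Worked example for the least-squares and gradient-descent entries above. Those two methods meet in the simplest curve-fitting problem: fitting a line by minimizing squared vertical distances. The sketch below fits the same toy data twice, once with the closed-form least-squares solution and once by stepping down the gradient of the mean squared error. The data, learning rate, step count, and function names are assumptions chosen for illustration, not values from the book.

```python
def fit_line_closed_form(xs, ys):
    """Least-squares line y = w*x + b from the normal equations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b


def fit_line_gradient_descent(xs, ys, lr=0.05, steps=5000):
    """Minimize mean squared error by repeated steepest-descent steps."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        residuals = [w * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of (1/n) * sum(residual^2) w.r.t. w and b.
        grad_w = (2.0 / n) * sum(r * x for r, x in zip(residuals, xs))
        grad_b = (2.0 / n) * sum(residuals)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data lying exactly on y = 2x + 1 (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w_cf, b_cf = fit_line_closed_form(xs, ys)
w_gd, b_gd = fit_line_gradient_descent(xs, ys)
print("closed form:     ", w_cf, b_cf)
print("gradient descent:", round(w_gd, 6), round(b_gd, 6))
```

Both routes arrive at the same slope and intercept here because the squared-error loss for a line is a convex bowl with a single minimum. The learning rate matters: for this data, values much above roughly 0.15 would make the iteration diverge rather than settle.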
Relations
Ananthaswamy's curve-fitting paradigm exemplifies Legendre's method of least squares (1805)
Legendre's method of least squares (1805) motivates Gaussian justification for least squares
Ananthaswamy's curve-fitting paradigm generalizes Ananthaswamy's feature space transformation