When I was first trying to learn ML for AI safety research, people told me to learn linear algebra. And today lots of people I talk to who are trying to learn ML[1] seem under the impression they need to master linear algebra before they start fiddling with transformers. I find in practice I almost never use 90% of the linear algebra I’ve learned. I use other kinds of math much more, and overall being good at empiricism and implementation seems more valuable than knowing most math beyond the level of AP calculus.
The one part of linear algebra you do absolutely need is a really, really good intuition for what a dot product is, the fact that you can do them in batches, and the fact that matrix multiplication is associative. Someone smart who can’t so much as multiply matrices can learn the basics in an hour or two with a good tutor (I’ve taken people through it in that amount of time). The introductory linear algebra courses I’ve seen[2] wouldn’t drill this intuition nearly as well as the tutor even if you took them.
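To make those three facts concrete, here's a minimal NumPy sketch (the shapes and arrays are arbitrary examples):

```python
# Minimal sketch of the three facts above: a dot product, a matrix multiply
# as a batch of dot products, and associativity of matrix multiplication.
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.standard_normal(8), rng.standard_normal(8)

# A dot product: multiply elementwise, then sum.
assert np.isclose(x @ y, np.sum(x * y))

# A matrix multiply is a batch of dot products: out[i, j] = row_i(A) . col_j(B).
A, B = rng.standard_normal((4, 8)), rng.standard_normal((8, 5))
out = np.array([[A[i] @ B[:, j] for j in range(5)] for i in range(4)])
assert np.allclose(A @ B, out)

# Associativity: (AB)C = A(BC), so you can pick the cheaper multiplication order.
C = rng.standard_normal((5, 3))
assert np.allclose((A @ B) @ C, A @ (B @ C))
```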
In my experience it’s not that useful to have good intuitions for things like eigenvectors/eigenvalues or determinants (unless you’re doing something like SLT). Understanding bases and change-of-basis is somewhat useful for improving your intuitions, and especially useful for some kinds of interp, I guess? Matrix decompositions are useful if you want to improve cuBLAS. Sparsity sometimes comes up, especially in interp (it’s also a very very simple concept).
The same goes for much of vector calculus. (You need to know that you can take your derivatives in batches, and that this is why you write your d/dx as ∂/∂x or as an upside-down triangle (∇, the gradient). You don’t need curl or divergence.)
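As a minimal sketch of the “derivatives in batches” point (assuming NumPy; the function f is a made-up example): the gradient ∇f is just the vector of partial derivatives ∂f/∂xᵢ, checked here against finite differences.

```python
# Minimal sketch: the gradient of f(x) = x . x is the batch of partials 2*x_i,
# checked against central finite differences.
import numpy as np

def f(x):
    return x @ x

x = np.array([1.0, -2.0, 3.0])
eps = 1e-6
grad_fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(len(x))])
assert np.allclose(grad_fd, 2 * x, atol=1e-4)  # gradient of x . x is 2x
```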
I find it’s pretty easy to pick things like this up on the fly if you ever happen to need them.
Inasmuch as I do use math, I find I most often use basic statistics (so I can understand my empirical results!), basic probability theory (variance, expectations, estimators), having good intuitions for high-dimensional probability (which is the only part of math that seems underrated for ML), basic calculus (the chain rule), basic information theory (“what is KL-divergence?”), arithmetic, a bunch of random tidbits like “the log derivative trick”, and the ability to look at equations with lots of symbols and digest them.
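As one example, the “what is KL-divergence?” item fits in a few lines of NumPy (p and q are made-up discrete distributions):

```python
# Minimal sketch: KL(p || q) for discrete distributions, the expected extra
# log-loss you pay for modelling samples from p with the wrong model q.
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))  # assumes p, q > 0 everywhere

p = np.array([0.5, 0.4, 0.1])
q = np.array([0.4, 0.4, 0.2])
print(kl(p, q))  # small positive number; KL >= 0, and 0 iff p == q
print(kl(p, p))  # 0.0
```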
In general most work and innovation[3] in machine learning these days (and in many domains of AI safety[4]) is not based in formal mathematical theory, it’s based on empiricism, fussing with lots of GPUs, and stacking small optimizations. As such, being good at math doesn’t seem that useful for doing most ML research. There are notable exceptions: some people do theory-based research. But outside these niches, being good at implementation and empiricism seems much more important; inasmuch as math gives you better intuitions in ML, I think reading more empirical papers or running more experiments or just talking to different models will give you far better intuitions per hour.
[1] By “ML” I mean things involving modern foundation models, especially transformer-based LLMs.
[2] It’s pretty plausible to me that I’ve only been exposed to particularly mediocre math courses. My sample size is small, and it seems like course quality and content vary a lot.
[3] Please don’t do capabilities mindlessly.
[4] The standard counterargument here is that these parts of AI safety are ignoring what’s actually hard about ML, and that empiricism won’t work: for example, we need to develop techniques that work on the first model we build that can self-improve. I don’t want to get into that debate here.
This is a great list!
Here’s some stuff that isn’t in your list that I think comes up often enough that aspiring ML researchers should eventually know it (and most of this is indeed universally known). Everything in this comment is something that I’ve used multiple times in the last month.
Linear algebra tidbits:
- Vector-matrix-vector products
- Probably einsums more generally
  - And the derivative of an einsum wrt any input
- Matrix multiplication of matrices of shape [A, B] and [B, C] takes 2ABC FLOPs

This stuff comes up when doing basic math about the FLOPs of a neural net architecture; see the sketch below.
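Here’s that sketch, a minimal NumPy illustration of the items above (arbitrary example shapes): a vector-matrix-vector product as an einsum, its derivative wrt each input, and the 2ABC FLOP count.

```python
# Minimal sketch: vector-matrix-vector product as an einsum, its derivatives,
# and the 2ABC FLOP count for a matrix multiply.
import numpy as np

rng = np.random.default_rng(0)
u, M, v = rng.standard_normal(4), rng.standard_normal((4, 5)), rng.standard_normal(5)

# Vector-matrix-vector product: s = u^T M v.
s = np.einsum("i,ij,j->", u, M, v)
assert np.isclose(s, u @ M @ v)

# The derivative of an einsum wrt any input is another einsum over the rest:
ds_du = np.einsum("ij,j->i", M, v)   # ds/du = M v
ds_dM = np.einsum("i,j->ij", u, v)   # ds/dM = u v^T (outer product)
ds_dv = np.einsum("i,ij->j", u, M)   # ds/dv = M^T u

# FLOP counting: an [A, B] @ [B, C] matmul is A*C dot products of length B,
# each costing B multiplies and B adds, hence 2*A*B*C FLOPs.
A, B, C = 256, 512, 128
print(f"{2 * A * B * C:.2e} FLOPs")  # 3.36e+07
```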
Stuff that I use as concrete simple examples when thinking about ML:
- A deep understanding of linear regression, covariance, and correlation. (This is useful because it is a simple analogy for fitting a probabilistic model, and it lets you remember a bunch of important facts.)
- Basic facts about (multivariate) Gaussians; Bayesian updates on Gaussians
- Variance reduction and importance sampling. Lots of ML algorithms, e.g. value baselining, are basically just variance reduction tricks. Maybe consider the difference between paired and unpaired t-tests as a simple example.
  - This is relevant for understanding ML algorithms, for doing basic statistics to understand empirical results, and for designing sample-efficient experiments and algorithms.
- Errors go as 1/sqrt(n), so sample sizes need to grow 4x if you want your error bars to shrink 2x
- AUROC is the probability that a sample from distribution A will be greater than a sample from distribution B; it’s the obvious natural way of comparing distributions over a totally ordered set (see the sketch after this list)
- Maximum likelihood estimation, MAP estimation, full Bayes
- The Boltzmann distribution (aka softmax)
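Here’s the promised sketch (assuming NumPy; the Gaussians are made-up examples): AUROC estimated by Monte Carlo as P(a > b), with the estimate’s spread shrinking roughly as 1/sqrt(n), i.e. 4x the samples for half the error bar.

```python
# Minimal Monte Carlo sketch of AUROC as P(a > b) and of 1/sqrt(n) error bars.
import numpy as np

rng = np.random.default_rng(0)

def auroc_estimate(n):
    a = rng.normal(1.0, 1.0, n)  # n samples from distribution A
    b = rng.normal(0.0, 1.0, n)  # n samples from distribution B
    return np.mean(a[:, None] > b[None, :])  # fraction of pairs with a > b

# True value for two unit Gaussians one sigma apart: Phi(1/sqrt(2)) ~= 0.76.
for n in [100, 400, 1600]:
    estimates = [auroc_estimate(n) for _ in range(200)]
    # The spread of the estimate roughly halves each time n grows 4x.
    print(n, round(np.mean(estimates), 3), round(np.std(estimates), 4))
```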
And some stuff I’m personally very glad to know:
- The Price equation / the breeder’s equation. We’re constantly thinking about how neural net properties change as you train them, and it is IMO helpful to have the quantitative form of natural selection in your head as an example (stated after this list).
- SGD is not parameterization invariant; natural gradients
- Bayes nets
- Your half-power-of-ten times tables
- (barely counts) Conversions between different units of time (e.g. there are about 30M seconds in a year, 3.6k seconds in an hour, and about 1e5 seconds in a day)
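For reference, minimal standard statements of the two equations named at the top of this list ($z$ is a trait value, $w$ is fitness, and bars denote population means):

$$\bar{w}\,\Delta\bar{z} = \operatorname{Cov}(w, z) + \mathbb{E}[\,w\,\Delta z\,] \quad \text{(Price equation)}$$

$$R = h^2 S \quad \text{(breeder's equation)}$$

The covariance term is selection proper (traits that covary with fitness become more common); the expectation term is transmission bias. In the breeder’s equation, the response to selection $R$ is the heritability $h^2$ times the selection differential $S$.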
Re: “Errors go as 1/sqrt(n)”: I think you wanted to say standard error of the mean.
Re: “being good at math doesn’t seem that useful for doing most ML research”: I somewhat disagree here. I think that often even good empirics-focused researchers have informal, not-so-respectable background models informed by mathematical intuition. My source is probably some Dwarkesh Patel interview, but I’m not sure which.
This feels intuitively true to me, but I’m also very biased: I’ve basically shovelled all of my skill points into engineering and research intuition, and have only a passable understanding of math, and this generally has not been a huge bottleneck for me. But maybe if I knew more math I’d know what I’m missing out on.
I think this is largely right point by point, except I’d flag that if you find you rarely use eigendecomposition (mostly at the whiteboard, less so in code), you may be bottlenecked by a poor grasp of eigenvectors and eigenvalues.
Also, a fancy linear algebra education will tell you exactly how the matrix log and matrix exponential work, but all you need is this: 99% of the time, any manipulation you can do with regular logs and exponents works completely unmodified with square matrices and matrix logs/exponentials. If you don’t know matrix logs exist at all, though, this is a glaring hole: I use them constantly in actual code. (Actually, the 99% is definitely sampling bias. For example, given matrices A and B, log(AB) only equals log(A) + log(B) if A and B commute (share eigenvectors), and even then getting numerical equality may require being careful about which branch of the log you pick. You might say “well, of course”, but in practice you’d only think to try the identity when the matrices do commute and your later operations kill the branch differences, so when you try it, it works.)
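A minimal sketch of this (assuming NumPy and SciPy; the matrices are made-up examples): when A and B share eigenvectors, the scalar identities carry over unmodified, and when they don’t, they generally fail.

```python
# Minimal sketch: matrix log/exp behave like scalar log/exp when matrices commute.
import numpy as np
from scipy.linalg import expm, logm

rng = np.random.default_rng(0)

# Two commuting SPD matrices: same eigenvectors Q, different positive eigenvalues.
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
A = Q @ np.diag(rng.uniform(0.5, 2.0, 4)) @ Q.T
B = Q @ np.diag(rng.uniform(0.5, 2.0, 4)) @ Q.T

print(np.allclose(logm(A @ B), logm(A) + logm(B)))  # True: log(AB) = log A + log B
print(np.allclose(expm(A + B), expm(A) @ expm(B)))  # True: e^(A+B) = e^A e^B

# For generic non-commuting matrices the identities fail:
M = rng.standard_normal((4, 4))
C = M @ M.T + 4 * np.eye(4)  # SPD, but does not commute with A
print(np.allclose(logm(A @ C), logm(A) + logm(C)))  # False
```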