This is a great list!
Here’s some stuff that isn’t in your list that I think comes up often enough that aspiring ML researchers should eventually know it (and most of this is indeed universally known). Everything in this comment is something that I’ve used multiple times in the last month.
Linear algebra tidbits
Vector-matrix-vector products
Probably einsums more generally
And the derivative of an einsum wrt any input
Multiplying matrices of shape [A,B] and [B,C] takes 2ABC FLOPs.
This stuff comes up when doing basic math about the FLOPs of a neural net architecture.
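A minimal numpy sketch of what I mean (the shapes and values here are just illustrative):

```python
import numpy as np

u = np.random.randn(4)
M = np.random.randn(4, 5)
v = np.random.randn(5)

# Vector-matrix-vector product u^T M v as an einsum.
scalar = np.einsum("i,ij,j->", u, M, v)
assert np.allclose(scalar, u @ M @ v)

# The derivative of that einsum w.r.t. M is itself an einsum:
# d(u^T M v)/dM = outer(u, v).
dM = np.einsum("i,j->ij", u, v)

# FLOPs for a matmul of shapes [A, B] @ [B, C]: each of the A*C outputs
# needs B multiplies and B adds, hence 2*A*B*C.
A, B, C = 8, 16, 32
matmul_flops = 2 * A * B * C
```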
Stuff that I use as concrete simple examples when thinking about ML
A deep understanding of linear regression, covariance, correlation. (This is useful because it is a simple analogy for fitting a probabilistic model, and it lets you remember a bunch of important facts.)
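(For the 1-D case, the kind of facts I mean: the OLS slope is $\hat\beta = \operatorname{Cov}(x,y)/\operatorname{Var}(x)$, the correlation is $\rho = \operatorname{Cov}(x,y)/(\sigma_x \sigma_y)$, and so $\hat\beta = \rho\,\sigma_y/\sigma_x$.)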
Basic facts about (multivariate) Gaussians; Bayesian updates on Gaussians
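E.g. the conjugate update for a Gaussian mean with known noise variance: with prior $\mu \sim \mathcal N(\mu_0, \tau_0^2)$ and $n$ observations with sample mean $\bar x$ and noise variance $\sigma^2$,
$$\frac{1}{\tau_n^2} = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2}, \qquad \mu_n = \tau_n^2\left(\frac{\mu_0}{\tau_0^2} + \frac{n \bar x}{\sigma^2}\right)$$
i.e. precisions add, and the posterior mean is a precision-weighted average of the prior mean and the sample mean.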
Variance reduction, importance sampling. Lots of ML algorithms, e.g. value baselining, are basically just variance reduction tricks. Maybe consider the difference between paired and unpaired t-tests as a simple example.
This is relevant for understanding ML algorithms, for doing basic statistics to understand empirical results, and for designing sample-efficient experiments and algorithms.
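A minimal simulation of the paired-vs-unpaired point (the numbers are arbitrary; it’s just meant to show the size of the variance reduction when the two measurements per item share most of their noise):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Per-item difficulty is shared across both conditions, so A and B are correlated.
difficulty = rng.normal(0.0, 1.0, size=n)
a = difficulty + rng.normal(0.0, 0.3, size=n)        # condition A
b = difficulty + 0.1 + rng.normal(0.0, 0.3, size=n)  # condition B, true effect = 0.1

# Unpaired: the variance of the difference of means adds the two variances.
se_unpaired = np.sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)

# Paired: the shared per-item difficulty cancels in the per-item differences.
se_paired = (b - a).std(ddof=1) / np.sqrt(n)

print(se_unpaired, se_paired)  # the paired standard error is several times smaller
```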
Errors go as 1/sqrt(n), so sample sizes need to grow 4x if you want your error bars to shrink 2x
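(For the mean of $n$ i.i.d. samples with standard deviation $\sigma$, the standard error is $\sigma/\sqrt{n}$, so halving the error bar quadruples the required $n$.)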
AUROC is the probability that a sample from distribution A will be greater than a sample from distribution B; it’s the obvious natural way of comparing distributions over a totally ordered set
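A quick numerical check of that equivalence (the distributions here are arbitrary; `roc_auc_score` is given the A samples as positives and the raw values as scores):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
a = rng.normal(1.0, 1.0, size=500)  # samples from distribution A
b = rng.normal(0.0, 1.0, size=500)  # samples from distribution B

# P(A > B) over all pairs (ties would count 1/2, but are measure-zero here).
pairwise = np.mean(a[:, None] > b[None, :])

# AUROC of a "classifier" whose score is just the sample value.
labels = np.concatenate([np.ones(len(a)), np.zeros(len(b))])
scores = np.concatenate([a, b])
auroc = roc_auc_score(labels, scores)

assert np.isclose(pairwise, auroc)
```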
Maximum likelihood estimation, MAP estimation, full Bayes
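The one-line version I keep in mind:
$$\hat\theta_{\text{MLE}} = \arg\max_\theta\, p(D \mid \theta), \qquad \hat\theta_{\text{MAP}} = \arg\max_\theta\, p(D \mid \theta)\,p(\theta), \qquad p(\theta \mid D) \propto p(D \mid \theta)\,p(\theta)$$
with full Bayes keeping the whole posterior rather than a point estimate.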
The Boltzmann distribution (aka softmax)
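With energies $E_i$ and temperature $T$:
$$p_i = \frac{e^{-E_i/T}}{\sum_j e^{-E_j/T}} = \operatorname{softmax}(-E/T)_i$$
so temperature scaling of logits is the same object.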
And some stuff I’m personally very glad to know:
The Price equation/the breeder’s equation: we’re constantly thinking about how neural net properties change as you train them, and it is IMO helpful to have the quantitative form of natural selection in your head as an example
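For reference, the forms I mean, with trait value $z_i$ and fitness $w_i$:
$$\Delta \bar z = \frac{\operatorname{Cov}(w_i, z_i)}{\bar w} + \frac{\mathbb E[w_i\,\Delta z_i]}{\bar w}$$
(the Price equation), and the breeder’s equation $R = h^2 S$, where $R$ is the response to selection, $S$ the selection differential, and $h^2$ the heritability.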
SGD is not parameterization invariant; natural gradients
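A tiny illustration of the non-invariance (toy loss, arbitrary learning rate): same function, same starting point, same learning rate, but rescaling the parameter changes the effective step size by the square of the rescaling.

```python
# Minimize f(x) = x^2 from x = 1 with one plain gradient step.
lr = 0.1

# Parameterization 1: optimize x directly.
x = 1.0
x -= lr * 2 * x               # df/dx = 2x, so x becomes 0.8

# Parameterization 2: write x = u / 10 and optimize u (same function of x).
u = 10.0                      # same starting point: x = u / 10 = 1.0
u -= lr * 2 * (u / 10) / 10   # d/du (u/10)^2 = 2u/100, so the implied x becomes 0.998

print(x, u / 10)  # 0.8 vs 0.998: the two trajectories differ

# The natural gradient preconditions the update with the Fisher metric,
# which removes this dependence on the choice of parameterization.
```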
Bayes nets
Your half-power-of-ten times tables
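(i.e. using $\sqrt{10} \approx 3$, so the table is $1, 3, 10, 30, 100, \dots$ and e.g. $3 \times 3 \approx 10$, $30 \times 30 \approx 1000$)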
(barely counts) Conversions between different units of time (e.g. “there are 30M seconds in a year, there are 3k seconds in an hour, there are 1e5 seconds in a day”)
I think you wanted to say standard error of the mean.