This is a great list!
Here’s some stuff that isn’t in your list that I think comes up often enough that aspiring ML researchers should eventually know it (and most of this is indeed universally known). Everything in this comment is something that I’ve used multiple times in the last month.
Linear algebra tidbits
Vector-matrix-vector products
Probably einsums more generally
And the derivative of an einsum wrt any input
Multiplying matrices of shape [A,B] and [B,C] takes 2ABC FLOPs.
This stuff comes up when doing basic math about the FLOPs of a neural net architecture.
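A minimal numpy sketch of what I mean (the shapes and values here are just illustrative):

```python
import numpy as np

u = np.random.randn(4)
M = np.random.randn(4, 5)
v = np.random.randn(5)

# Vector-matrix-vector product u^T M v as an einsum.
scalar = np.einsum("i,ij,j->", u, M, v)
assert np.allclose(scalar, u @ M @ v)

# The derivative of that einsum w.r.t. M is itself an einsum:
# d(u^T M v)/dM = outer(u, v).
dM = np.einsum("i,j->ij", u, v)

# FLOPs for a matmul of shapes [A, B] @ [B, C]: each of the A*C outputs
# needs B multiplies and B adds, hence 2*A*B*C.
A, B, C = 8, 16, 32
matmul_flops = 2 * A * B * C
```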
Stuff that I use as concrete simple examples when thinking about ML
A deep understanding of linear regression, covariance, correlation. (This is useful because it is a simple analogy for fitting a probabilistic model, and it lets you remember a bunch of important facts.)
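(For the 1-D case, the kind of facts I mean: the OLS slope is $\hat\beta = \operatorname{Cov}(x,y)/\operatorname{Var}(x)$, the correlation is $\rho = \operatorname{Cov}(x,y)/(\sigma_x \sigma_y)$, and so $\hat\beta = \rho\,\sigma_y/\sigma_x$.)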
Basic facts about (multivariate) Gaussians; Bayesian updates on Gaussians
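E.g. the conjugate update for a Gaussian mean with known noise variance: with prior $\mu \sim \mathcal N(\mu_0, \tau_0^2)$ and $n$ observations with sample mean $\bar x$ and noise variance $\sigma^2$,
$$\frac{1}{\tau_n^2} = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2}, \qquad \mu_n = \tau_n^2\left(\frac{\mu_0}{\tau_0^2} + \frac{n \bar x}{\sigma^2}\right)$$
i.e. precisions add, and the posterior mean is a precision-weighted average of the prior mean and the sample mean.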
Variance reduction, importance sampling. Lots of ML algorithms, e.g. value baselining, are basically just variance reduction tricks. Maybe consider the difference between paired and unpaired t-tests as a simple example.
This is relevant for understanding ML algorithms, for doing basic statistics to understand empirical results, and for designing sample-efficient experiments and algorithms.
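A minimal simulation of the paired-vs-unpaired point (the numbers are arbitrary; it’s just meant to show the size of the variance reduction when the two measurements per item share most of their noise):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Per-item difficulty is shared across both conditions, so A and B are correlated.
difficulty = rng.normal(0.0, 1.0, size=n)
a = difficulty + rng.normal(0.0, 0.3, size=n)        # condition A
b = difficulty + 0.1 + rng.normal(0.0, 0.3, size=n)  # condition B, true effect = 0.1

# Unpaired: the variance of the difference of means adds the two variances.
se_unpaired = np.sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)

# Paired: the shared per-item difficulty cancels in the per-item differences.
se_paired = (b - a).std(ddof=1) / np.sqrt(n)

print(se_unpaired, se_paired)  # the paired standard error is several times smaller
```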
Errors go as 1/sqrt(n), so sample sizes need to grow 4x if you want your error bars to shrink 2x
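(For the mean of $n$ i.i.d. samples with standard deviation $\sigma$, the standard error is $\sigma/\sqrt{n}$, so halving the error bar quadruples the required $n$.)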
AUROC is the probability that a sample from distribution A will be greater than a sample from distribution B; it’s the obvious natural way of comparing distributions over a totally ordered set
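A quick numerical check of that equivalence (the distributions here are arbitrary; `roc_auc_score` is given the A samples as positives and the raw values as scores):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
a = rng.normal(1.0, 1.0, size=500)  # samples from distribution A
b = rng.normal(0.0, 1.0, size=500)  # samples from distribution B

# P(A > B) over all pairs (ties would count 1/2, but are measure-zero here).
pairwise = np.mean(a[:, None] > b[None, :])

# AUROC of a "classifier" whose score is just the sample value.
labels = np.concatenate([np.ones(len(a)), np.zeros(len(b))])
scores = np.concatenate([a, b])
auroc = roc_auc_score(labels, scores)

assert np.isclose(pairwise, auroc)
```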
Maximum likelihood estimation, MAP estimation, full Bayes
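The one-line version I keep in mind:
$$\hat\theta_{\text{MLE}} = \arg\max_\theta\, p(D \mid \theta), \qquad \hat\theta_{\text{MAP}} = \arg\max_\theta\, p(D \mid \theta)\,p(\theta), \qquad p(\theta \mid D) \propto p(D \mid \theta)\,p(\theta)$$
with full Bayes keeping the whole posterior rather than a point estimate.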
The Boltzmann distribution (aka softmax)
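With energies $E_i$ and temperature $T$:
$$p_i = \frac{e^{-E_i/T}}{\sum_j e^{-E_j/T}} = \operatorname{softmax}(-E/T)_i$$
so temperature scaling of logits is the same object.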
And some stuff I’m personally very glad to know:
The Price equation/the breeder’s equation: we’re constantly thinking about how neural net properties change as you train them, and it is IMO helpful to have the quantitative form of natural selection in your head as an example
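For reference, the forms I mean, with trait value $z_i$ and fitness $w_i$:
$$\Delta \bar z = \frac{\operatorname{Cov}(w_i, z_i)}{\bar w} + \frac{\mathbb E[w_i\,\Delta z_i]}{\bar w}$$
(the Price equation), and the breeder’s equation $R = h^2 S$, where $R$ is the response to selection, $S$ the selection differential, and $h^2$ the heritability.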
SGD is not parameterization invariant; natural gradients
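A tiny illustration of the non-invariance (toy loss, arbitrary learning rate): same function, same starting point, same learning rate, but rescaling the parameter changes the effective step size by the square of the rescaling.

```python
# Minimize f(x) = x^2 from x = 1 with one plain gradient step.
lr = 0.1

# Parameterization 1: optimize x directly.
x = 1.0
x -= lr * 2 * x               # df/dx = 2x, so x becomes 0.8

# Parameterization 2: write x = u / 10 and optimize u (same function of x).
u = 10.0                      # same starting point: x = u / 10 = 1.0
u -= lr * 2 * (u / 10) / 10   # d/du (u/10)^2 = 2u/100, so the implied x becomes 0.998

print(x, u / 10)  # 0.8 vs 0.998: the two trajectories differ

# The natural gradient preconditions the update with the Fisher metric,
# which removes this dependence on the choice of parameterization.
```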
Bayes nets
Your half-power-of-ten times tables
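(i.e. using $\sqrt{10} \approx 3$, so the table is $1, 3, 10, 30, 100, \dots$ and e.g. $3 \times 3 \approx 10$, $30 \times 30 \approx 1000$)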
(barely counts) Conversions between different units of time (e.g. “there are 30M seconds in a year, there are 3k seconds in an hour, there are 1e5 seconds in a day”)
I think you wanted to say standard error of the mean.