It won’t explain the difference between Adam and AdamW, but it will explain the difference between hierarchical bayesian networks, linear regression and modern deep learning
CGP Grey’s video is a decent example source. Most of the differences between hierarchical bayesian networks and modern deep learning come across pretty well if you model the latter as a type of genetic algorithm search:
The resulting structure of the solution is mostly discovered not engineered. The ontology of the solution is extremely unopinionated and can contain complicated algorithms that we don’t know exist.
Training consists of a huge amount of trial and error where you take datapoints, predict something about the result, then search for nearby modifications that do better, then repeat until performance plateaus.
You are ultimately doing a local search, which means you can get stuck at local minima, unless you do something like increase your step size or increase the mutation rate
There are also just actually deep similarities. Vanilla SGD is perfectly equivalent to a genetic search with an infinitesimally small mutation size and infinite samples per generation (I could make a proof here but won’t unless someone is interested in it). Indeed in one of my ML classes at Berkeley genetic algorithms were suggested as one of the obvious things to do in an indifferentiable loss-landscape as generalization of SGD, where you just try some mutations, see which one performs best, and then modify your parameters in that direction.
Vanilla SGD is perfectly equivalent to a genetic search with an infinitesimally small mutation size and infinite samples per generation (I could make a proof here but won’t unless someone is interested in it)
Source?
CGP Grey’s video is a decent example source. Most of the differences between hierarchical bayesian networks and modern deep learning come across pretty well if you model the latter as a type of genetic algorithm search:
The resulting structure of the solution is mostly discovered not engineered. The ontology of the solution is extremely unopinionated and can contain complicated algorithms that we don’t know exist.
Training consists of a huge amount of trial and error where you take datapoints, predict something about the result, then search for nearby modifications that do better, then repeat until performance plateaus.
You are ultimately doing a local search, which means you can get stuck at local minima, unless you do something like increase your step size or increase the mutation rate
There are also just actually deep similarities. Vanilla SGD is perfectly equivalent to a genetic search with an infinitesimally small mutation size and infinite samples per generation (I could make a proof here but won’t unless someone is interested in it). Indeed in one of my ML classes at Berkeley genetic algorithms were suggested as one of the obvious things to do in an indifferentiable loss-landscape as generalization of SGD, where you just try some mutations, see which one performs best, and then modify your parameters in that direction.
Oh, I actually did that a year or so ago