Sounds like we agree :)
On 2): Being overparameterized doesn’t mean you fit all your training data. It just means that you could fit it with enough optimization. Perhaps the existence of savants shows that the brain could memorize way more than it normally does.
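A minimal sketch of that distinction (my own toy example, not from the thread): with more parameters than data points, the model can interpolate the training data if you optimize to convergence, but stopped early it doesn’t.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, n_params = 10, 50                   # more parameters than data points
x = rng.uniform(-1, 1, n_points)
y = rng.normal(0, 1, n_points)
F = np.cos(np.outer(x, np.arange(n_params)))  # random-feature design matrix

mse = lambda w: np.mean((F @ w - y) ** 2)

# Fully optimized, the overparameterized model interpolates: the
# least-norm solution drives the training error to ~0 ...
w_full = np.linalg.pinv(F) @ y
print("fully optimized:", mse(w_full))        # ~0

# ... but the same model, stopped early, does not fit the data.
w = np.zeros(n_params)
for _ in range(50):                           # only a few gradient steps
    w -= 0.01 * 2 * F.T @ (F @ w - y) / n_points
print("early stopping:", mse(w))              # clearly > 0
```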
On 3): The number of our synaptic weights is stupendous too—about 30,000 for every second of our lives.
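Back-of-the-envelope for that figure, assuming ~1e14 synapses and a ~100-year life (both round numbers I’m supplying, not from the comment):

```python
synapses = 1e14                      # rough synapse count for a human brain
lifetime = 100 * 365.25 * 24 * 3600  # ~100-year lifespan in seconds, ~3.2e9
print(synapses / lifetime)           # ~31,700 synapses per second of life
```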
On 4): You can underfit at the evolution level and still overparameterize at the individual level.
Overall, you’ve convinced me that underparameterization is less likely, though. Especially on your definition of overparameterization, which is relevant for double descent.
Why do you think that humans are, and powerful AI systems will be, severely underparameterized?
Potential paper from DM/Stanford for a future newsletter: https://arxiv.org/pdf/1911.00459.pdf
It addresses the problem that an RL agent will delude itself by finding loopholes in a learned reward function.
Also interesting to see that all of these groups were able to coordinate to the disadvantage of less coordinated groups, but were not able to reach peace among themselves.
One explanation might be that the more coordinated groups also have harder coordination problems to solve because their world is bigger and more complicated. Might be the same with AI?
If X is “number of paperclips” and Y is something arbitrary that nobody optimizes, such as the ratio of the number of bicycles on the moon to the number of flying horses, optimizing X should be equally likely to increase or decrease Y in expectation. Otherwise “1−Y” would go in the opposite direction, which can’t be true by symmetry. But if Y is something like “number of happy people”, Y will probably decrease: the world is already set up to keep Y up, and a misaligned agent could disturb that state.
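A toy simulation of both halves of that claim (my own construction, assuming the world state is a unit “resource” vector, X and Y are linear in it, and optimizing X means reallocating all resources to X’s direction):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, trials = 100, 10_000

def unit(v):
    return v / np.linalg.norm(v)

dy_arbitrary, dy_maintained = [], []
for _ in range(trials):
    u = unit(rng.normal(size=dim))   # direction the agent pushes (X)
    v = unit(rng.normal(size=dim))   # direction we evaluate (Y)
    w_new = u                        # all resources reallocated to X

    # Case 1: Y is arbitrary -- the world starts in a random state.
    w_old = unit(rng.normal(size=dim))
    dy_arbitrary.append(v @ w_new - v @ w_old)

    # Case 2: the world is already set up to keep Y high
    # (the state is aligned with Y's direction).
    w_old = v
    dy_maintained.append(v @ w_new - v @ w_old)

print(np.mean(dy_arbitrary))    # ~0: optimizing X is neutral for a random Y
print(np.mean(dy_maintained))   # ~ -1: Y regresses from its maximum to ~0
```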
I should’ve specified that the strong version is “Y decreases relative to a world where neither X nor Y is being optimized”. Am I right that this version is not true?
Thanks for writing this! It always felt like a blind spot to me that we only have Goodhart’s law, which says “if X is a proxy for Y and you optimize X, the correlation breaks”, when we really mean a stronger version: “if you optimize X, Y will actively decrease”. Your paper clarifies that what we actually mean is an intermediate version: “if you optimize X, it becomes harder to optimize Y”. My conclusion, then, would be that the intermediate version is true but the strong version is false. Would you say that’s an accurate summary?
Posted a little reaction to this paper here.
My tentative view on MuZero:
Cool for board games and related tasks.
The Atari demo seems sketchy.
Not a big step towards making model-based RL work—instead, a step towards making it more like model-free RL.
A textbook benefit of model-based RL is that world models (i.e., predictions of observations) generalize to new reward functions and environments. They’ve removed this benefit.
The other textbook benefit of model-based RL is data efficiency. But on Atari, MuZero is roughly as data-inefficient as model-free RL. That’s perhaps unsurprising: by removing the need to predict observations, it moves much closer to model-free methods. Plus it trains with 40 TPUs per game where other algorithms use a single GPU and a similar training time. What if they spent that extra compute to get more data?
In the low-data setting they outperform model-free methods. But, suspiciously, they didn’t compare to any model-based method. They’d probably lose there, because you’d need a world model for data efficiency.
MuZero only plans K=5 steps ahead—far fewer than AlphaZero. Two takeaways: 1) This again looks more similar to model-free RL, which essentially has K=1. 2) This makes me more optimistic that model-free RL can learn Go with just a moderate efficiency (and stability?) loss (Paul has speculated this; also, the trained AlphaZero policy net is apparently still better than Lee Sedol without MCTS).
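For concreteness, here’s a stripped-down sketch of that K-step latent unroll (the three learned functions and K=5 are from the MuZero paper; the stub bodies and shapes are placeholders I made up):

```python
import numpy as np

K = 5  # MuZero unrolls its learned model only K=5 steps during training

# Stubs for MuZero's three learned functions (networks in the real system):
def representation(observation):   # h: observation -> latent state
    return np.zeros(8)

def dynamics(state, action):       # g: (latent, action) -> (reward, next latent)
    return 0.0, state + action

def prediction(state):             # f: latent -> (policy, value)
    return np.ones(4) / 4, 0.0

def unroll(observation, actions):
    """K-step rollout entirely in latent space. MuZero never predicts
    future observations, which is why it sits closer to model-free RL
    (essentially K=1) than to classical model-based RL."""
    state = representation(observation)
    outputs = []
    for action in actions[:K]:
        policy, value = prediction(state)
        reward, state = dynamics(state, action)
        outputs.append((policy, value, reward))
    return outputs

# Training matches each predicted (policy, value, reward) against targets
# from the real trajectory; no loss term ever touches observations.
```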
“the big questions are just how large a policy you would need to train using existing methods in order to be competitive with a human (my best guess would be a ~trillion to a ~quadrillion)”
Curious where this estimate comes from?
Why just a 10x speedup over model-free RL? I would’ve expected much more.
Should I share the Alignment Research Overview in its current Google Doc form or is it about to be published somewhere more official?
Yes. To the extent that the system in question is an agent, I’d roughly think of many copies of it as a single distributed agent.
Hmmm, my worry isn’t so much that we have an unusual definition of inner alignment. It’s more the opposite: that outsiders associate this line of research with quackery (which only gets worse if our definition is close to the standard one).
Re whether ML is easy to deploy: most compute these days goes into deployment. And there are a lot of other deployment challenges that you don’t have during training, where you train a single model under lab conditions.
Fair—I’d probably count “making lots of copies of a trained system” as a single system here.
Yes—the part that I was doubting is that it provides evidence for relatively quick takeoff.
For the record, two people who I consider authorities on this told me some version of “model sizes matter a lot”.
“Continuous” vs “gradual”:
I’ve also seen people internally use the word gradual, and I prefer it to continuous because 1) in maths, a discontinuity can be an insignificant jump, and 2) a fast takeoff is about fast changes in the growth rate, whereas continuity is about jumps in the function value (you can have either without the other). I don’t see a natural way to say “non-gradual” or “a non-graduality” though, which is why I do often say discontinuous instead.
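A toy pair of functions (my own example) makes point 2) concrete:

```latex
% Continuous but "fast": no jump anywhere, yet the relative growth rate
% f'(t)/f(t) = 1/(1-t) blows up as t -> 1.
f(t) = \frac{1}{1-t}, \qquad t < 1

% Discontinuous but "slow": a single insignificant jump of size \epsilon,
% with zero growth everywhere else.
g(t) = \begin{cases} 1 & t < 0 \\ 1 + \epsilon & t \ge 0 \end{cases}
```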