The fundamental architecture, training methods and requirements for progress for modern AI systems are all completely different from the technology Yudkowsky imagined in 2008, yet nothing about the core MIRI story has changed.
We could say — and certainly Yudkowsky and Soares would say — that this isn’t important, because the essential dynamics of superintelligence don’t depend on any particular architecture. But that just raises a different question: why does the rest of the book talk about particular architectures so much? Chapter two, for example, is all about contingent properties of present day AI systems. It focuses on the fact that AIs are grown, not crafted — that is, they emerge through opaque machine learning processes instead of being designed like traditional computer programs. This is used as evidence that we should expect AIs to have strange alien values that we can’t control or predict, since the humans who “grow” AIs can’t exactly input ethics or morals by hand. This might seem broadly reasonable — except that this was also Yudkowsky’s conclusion in 2006, when he assumed that AIs would be crafted. Back then, his argument was that during takeoff, when an AI rapidly self-improves into superintelligence, it would undergo a sudden and extreme value shift. Yudkowsky and Soares still believe this argument, or at least Soares did as of 2022. But if this is true, then the techniques used to build older, dumber systems are irrelevant — the risk comes from the fundamental nature of superintelligence, not any specific architecture.
I think Clara misunderstands the arguments and how they’ve changed. There are two layers to the problem. In the first layer, which was the one relevant in the old days, the difficulties are mainly reflective stability and systematic-bias-introduced-through-imperfect-hypothesis-search.[1] Sufficient understanding and careful design would be enough to solve these, and this is what agent foundations was aiming at. How difficult these problems are does depend on the architecture, as with other engineering safety problems, and MIRI was for a while working on architectures that they hoped would solve them (under the heading HRAD, highly reliable agent design).
The second layer of argument became relevant with deep learning: if a system is grown rather than engineered, the inner alignment issue becomes almost unavoidable,[2] and the reflective stability issue doesn’t go away. The reflective stability issue doesn’t depend on FOOM, as Clara claims it does; it just depends on the kind of learning-how-to-think-better or reasoning-about-how-to-think-better that humans routinely do. The book focuses on the second layer of argument, but does mention the first (albeit mostly explained via analogy):
It’s also discussed more in the supplemental material (note the non-dependence on FOOM), which emphasises at the end that the reflective stability issue (layer 1) is disjunctive with the inner alignment issue (layer 2).
Importantly, the second layer of argument did change the conclusion. It caused Yudkowsky to update negatively on our chances of succeeding at alignment,[3] because it adds a second layer of indirection between our engineering efforts and the ultimate goals of a superintelligence.
I find it unpleasant how aggressive Clara is in this article, especially given her shallow understanding of the argument structure and how it has changed.
[1] This is a difficult problem to explain, but extreme examples of it are optimization daemons and the malign prior.
[2] Because you’ve deliberately created a mesa-optimizer.
[3] I’m sure there was a tweet or something where he said something like “the success of deep learning was a small positive update (because CoT) and a large negative update (because inscrutable)”. Can’t find it.