Rob Bensinger comments on Evaluating the historical value misspecification argument

Rob Bensinger 8 Oct 2023 6:19 UTC
5 points
Remember that MIRI was in the business of poking at theoretical toy problems and trying to get less conceptually confused about how you could in principle cleanly design a reliable, aimable reasoner. MIRI wasn’t (and isn’t) in the business of issuing challenges to capabilities researchers to build a working water-bucket-filler as soon as possible, and wasn’t otherwise in the business of challenging people to race to AGI faster.
It wouldn’t have occurred to me that someone might think ‘can a deep net fill a bucket of water, in real life, without being dangerously capable’ is a crucial question in this context; I’m not sure the thought ‘when might such-and-such DL technique successfully fill a bucket?’ ever even occurred to us. It would seem just as strange to me as going to check the literature to make sure no GOFAI system ever filled a bucket of water.
(And while I think I understand why some see ChatGPT as a large positive update about alignment’s difficulty, I hope it’s also obvious why others, MIRI included, would not see it that way.)
Hacky approaches to alignment do count just as much as clean, scrutable, principled approaches—the important thing is that the AGI transition goes well, not that it goes well and feels clean and tidy in the process. But in this case the messy empirical approach doesn’t look to me like it actually lets you build a corrigible AI that can help with a pivotal act.
If general-ish DL methods had already been empirically OK at filling water buckets in 2016, just as GOFAI already was, I suspect we still would have been happy to use the Fantasia example, because it’s a simple, well-known story that can help make the abstract talk of utility functions and off-switch buttons easier to mentally visualize and manipulate.
(Though now that I’ve seen the confusion the example causes, I’m more inclined to think that the strawberry problem is a better frame than the Fantasia example.)