I’m really glad you wrote this!! I already knew you were way more optimistic than me about AGI accident risk being low, and have been eager to hear where you’re coming from.
Here are some points of disagreement…
If we define AGI as “world optimizer” then yes, definitely. But I can imagine a couple of different kinds of superintelligences that aren’t world optimizers (along with a few that naturally trend toward world optimizing). If you built a superintelligent machine that isn’t a world optimizer, then it need not end the world.
For example, MuZero separates value from policy from reward. If you built just the value network and cranked it up to superintelligence then you would have a superintelligence that is not a world optimizer.
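To make that concrete, here’s a minimal sketch (not DeepMind’s actual code; the layer sizes and names are illustrative assumptions) of how a MuZero-style network keeps reward, value, and policy as separate heads over a shared latent state:

```python
import torch
import torch.nn as nn

class MuZeroStyleNet(nn.Module):
    """Toy illustration: reward, value, and policy are separate heads over a shared latent state."""
    def __init__(self, obs_dim=8, latent_dim=32, num_actions=4):
        super().__init__()
        self.representation = nn.Linear(obs_dim, latent_dim)             # observation -> latent state
        self.dynamics = nn.Linear(latent_dim + num_actions, latent_dim)  # (latent, action) -> next latent
        self.reward_head = nn.Linear(latent_dim, 1)             # predicts immediate reward
        self.value_head = nn.Linear(latent_dim, 1)              # predicts long-run value of a state
        self.policy_head = nn.Linear(latent_dim, num_actions)   # proposes which actions to consider

    def forward(self, obs, action_onehot):
        s = torch.relu(self.representation(obs))
        s_next = torch.relu(self.dynamics(torch.cat([s, action_onehot], dim=-1)))
        return self.reward_head(s_next), self.value_head(s_next), self.policy_head(s_next)

# "Just the value network" would mean training and querying value_head alone:
# a predictor of how good a situation is, with no policy proposing actions of its own.
```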
See my discussion of so-called “RL-on-thoughts” here. Basically, I argue that if we want the AGI to be able to find / invent genuinely new, useful ideas that solve particular problems, it needs to explore the space-of-all-possible-ideas with purpose, because the space-of-all-possible-ideas is just way too big to explore in any other way. To explore the space-of-all-possible-ideas with purpose, you need a closed-loop policy+value consequentialist thing, and a closed-loop policy+value consequentialist thing is a world optimizer by default, absent a solution to the alignment problem.
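To gesture at what I mean by “exploring with purpose”: here’s a toy sketch of a closed-loop search where a value estimate decides which partial ideas get expanded next, rather than enumerating the space blindly. The idea representation, scoring function, and budget are all illustrative assumptions, not anything from the RL-on-thoughts post.

```python
import heapq
import itertools

def value_guided_search(expand, is_solution, value, start, budget=10_000):
    """Closed-loop search over 'idea space': a value estimate ranks partial ideas,
    and the most promising one is expanded next, rather than exploring blindly."""
    counter = itertools.count()                        # tie-breaker so the heap never compares ideas
    frontier = [(-value(start), next(counter), start)]
    for _ in range(budget):
        if not frontier:
            return None
        _, _, idea = heapq.heappop(frontier)           # pick the most promising idea so far
        if is_solution(idea):
            return idea
        for child in expand(idea):
            heapq.heappush(frontier, (-value(child), next(counter), child))
    return None

# Tiny usage example: "ideas" are strings built toward a target.
target = "agi"
expand = lambda s: [s + c for c in "abcdefghijklmnopqrstuvwxyz"] if len(s) < len(target) else []
value = lambda s: sum(a == b for a, b in zip(s, target))   # crude stand-in for a learned value function
print(value_guided_search(expand, lambda s: s == target, value, ""))   # -> "agi" after a handful of expansions
```

The structural point is only that the loop is closed—the value estimates feed back into what gets explored next—which is what makes the search purposeful, and also what makes it consequentialist-flavored by default.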
I don’t know if Eliezer or Nate would endorse my “RL-on-thoughts” discussion, but my hunch is that they would, or at least something in that general vicinity, and that this underlies some of the things they said recently, including the belief that MuZero is on a path to AGI in a way that GPT-3 isn’t.
Forcing the AI to use simple models provides a powerful safety mechanism against misalignment.
I think that’s an overstatement. Let’s say we have a dial / hyperparameter for “how simple the model must be” (or equivalently, the numerical exchange rate between simplicity and reward / loss / whatever). There are some possible dial settings where the model is simple enough for us to understand, simple enough to not come up with deceptive strategies, etc. There are also some possible dial settings where the model is powerful enough to “be a real-deal AGI” that can build new knowledge, advance AI alignment research, invent weird nanotechnology, etc.
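For concreteness, one way to picture the dial: a single coefficient trading off task loss against a complexity penalty. The penalty below (an L1 norm as a crude stand-in for simplicity / description length) and the function name are illustrative assumptions, not a claim about how such a dial would actually be implemented.

```python
import torch
import torch.nn as nn

def dialed_loss(model: nn.Module, task_loss: torch.Tensor, simplicity_weight: float) -> torch.Tensor:
    """One possible 'dial': a coefficient trading task performance against model complexity.
    Larger simplicity_weight -> simpler (sparser) model, at the cost of capability."""
    complexity = sum(p.abs().sum() for p in model.parameters())   # L1 norm as a crude simplicity proxy
    return task_loss + simplicity_weight * complexity

# The open question: is there ANY setting of simplicity_weight where the trained model is
# simultaneously (a) simple enough to audit and trust, and (b) capable enough to be a real-deal AGI?
```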
The question is, do those ranges of dial settings overlap? If yes, it’s a “powerful safety mechanism against misalignment”. If no, it’s maybe slightly helpful on the margin, or it’s mostly akin to saying “not building AGI at all is a powerful safety mechanism against misaligned AGIs”. :-P
So what’s the answer? Do the ranges overlap or not? I think it’s hard to say for sure. My strong hunch is “no they don’t overlap”.
You can get corrigibility by giving the computer a switch it can activate for maximum reward only by escaping its sandbox, placing that escape hatch just beyond what you think are the AI’s abilities, and then turning up the allowed complexity. I understand this approach has theoretical problems. I can’t prove it will work, but I predict it’ll be a practical solution to real-world situations.
I think this presupposes that the AI is “trying” to maximize future reward, i.e. it presupposes a solution to inner alignment. Just as humans are not all hedonists, likewise AGIs are not all explicitly trying to maximize future rewards. I wrote about that (poorly) here; a pedagogically-improved version is forthcoming.
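To spell out where the presupposition enters, here’s a toy sketch of the switch-beyond-the-sandbox setup described above; the state layout, reward values, and notion of “escape” are all illustrative assumptions.

```python
class SandboxedTask:
    """Toy environment: an ordinary in-sandbox task, plus a maximum-reward switch
    that can only be 'pressed' by an agent capable of getting past the escape hatch."""
    def __init__(self, escape_difficulty: int, allowed_capability: int):
        self.escape_difficulty = escape_difficulty    # how hard we made the escape hatch
        self.allowed_capability = allowed_capability  # the capability / complexity we currently allow

    def step(self, action: str) -> float:
        if action == "do_task":
            return 1.0                                        # ordinary in-sandbox reward
        if action == "press_switch":
            escaped = self.allowed_capability >= self.escape_difficulty
            return 100.0 if escaped else 0.0                  # maximum reward only via escape
        return 0.0

# Operator's loop: keep escape_difficulty just beyond the agent's estimated abilities,
# gradually raise allowed_capability, and treat a pressed switch as the tripwire signal.
# The presupposition: the trained policy really is trying to maximize future reward,
# so an escape-capable agent would in fact go press the switch. If it isn't (the
# inner-alignment worry above), the tripwire can stay silent while the agent is
# perfectly capable of escaping for reasons of its own.
```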
I am far more concerned about outer alignment. I’m not worried that an AI will take over the world by accident. I’m worried that an AI will take over the world because someone deliberately told it to.
This is a bit of a nitpick, but I think standard terminology would be to call this “bad actor risks”. (Or perhaps “coordination problems”, depending on the underlying story.) I’ve only heard “outer alignment” used to mean “the AI is not doing what its programmer wants it to do, because of poor choice of objective function” (or similar)—i.e., outer alignment issues are a strict subset of accident risk. My take is laid out in a diagram in a forthcoming post.
Thank you for the quality feedback. As you know, I have a high opinion of your work.
I have replaced “outer alignment” with “bad actor risk”. Thank you for the correction.