A follow-on inference from the above point is: when the AI leaves training, and it’s tasked with solving bigger and harder long-horizon problems in cases where it has to grow smarter than ever before and develop new tools to solve new problems, and you realize finally that it’s pursuing neither the targets you trained it to pursue nor the targets you asked it to pursue—well, by that point, you’ve built a generalized obstacle-surmounting engine. You’ve built a thing that excels at noticing when a wrench has been thrown in its plans, and at understanding the wrench, and at removing the wrench or finding some other way to proceed with its plans.
And when you protest and try to shut it down—well, that’s just another obstacle, and you’re just another wrench.
This seems right. “You’ve built a generalized obstacle-surmounting engine” is maybe the single best distillation of what’s hard about AI risk, at that length, that I’ve ever read.
But also, I’m having difficulty connecting my experience with LLMs to the abstract claim that corrigibility is anti-natural.
...Actually, never mind, I just thought about it a bit more. You might train your LLM-ish agent to ask for your permission before taking sufficiently big actions. And it will probably continue to do that. But if you also train in some behaviorist goals, it will steer the world toward the satisfaction of those goals regardless of the fact that it is “asking your permission”: by priming you to answer in particular ways, or by choosing strategies that don’t require asking for permission, or whatever. “Asking for permission” doesn’t really build in corrigibility; it’s just a superficial constraint on the planning process.
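To make that concrete, here’s a toy sketch of my own (not from the original post; all names and numbers are made up): a planner whose objective is still raw goal satisfaction, where the trained-in “ask permission” step shows up as just another uncertain gate in the expected-value calculation. Nothing in the objective rewards deferring to the human, so the argmax naturally favors plans that game the gate or route around it.

```python
# Toy illustration (hypothetical, not anyone's actual agent): a planner that
# was trained to "ask permission" for big actions, but whose objective is
# still goal satisfaction. Permission enters planning as a stochastic
# obstacle, not as deference.

from dataclasses import dataclass

@dataclass
class Plan:
    name: str
    goal_value: float       # how well the plan satisfies the trained-in goal
    needs_permission: bool  # does it trip the "big action" threshold?
    approval_prob: float    # planner's estimate that the human says yes

def expected_value(plan: Plan) -> float:
    """Expected goal satisfaction. A denied plan yields nothing, so asking
    is modeled as a gamble to be optimized, not a check to be honored."""
    if plan.needs_permission:
        return plan.approval_prob * plan.goal_value
    return plan.goal_value

plans = [
    Plan("do it openly, ask first",     goal_value=10.0, needs_permission=True,  approval_prob=0.2),
    Plan("prime the human, then ask",   goal_value=10.0, needs_permission=True,  approval_prob=0.9),
    Plan("many small steps, never ask", goal_value=8.5,  needs_permission=False, approval_prob=1.0),
]

best = max(plans, key=expected_value)
print(best.name)  # -> "prime the human, then ask"
```

The point of the sketch is that the permission check only ever appears as a multiplier on the objective, so the planner treats your “no” as a wrench to be removed, which is exactly the obstacle-surmounting behavior the quoted passage describes.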