When controlling a theoretical RL agent, what’s the problem with asking the AI to be 99% sure that it mopped 99% of the floor, and then stop?
I remember that if you just ask for a 99% floor-mop, the agent will spend forever getting 99.99999% sure that at least 99% is mopped, but I can’t remember the problem with this little patch.
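A minimal sketch of the patched rule, assuming a toy Beta-posterior world model over the mopped fraction, updated on spot checks of random tiles (all the names here are illustrative, not from the question):

```python
from scipy.stats import beta

CLEAN_FRACTION = 0.99  # "mopped 99% of the floor"
CREDENCE = 0.99        # "be 99% sure"

def should_stop(clean_obs: int, dirty_obs: int) -> bool:
    """Stop once the posterior credence that the mopped fraction exceeds
    CLEAN_FRACTION reaches CREDENCE, under a flat Beta(1, 1) prior."""
    posterior_credence = beta.sf(CLEAN_FRACTION, 1 + clean_obs, 1 + dirty_obs)
    return posterior_credence >= CREDENCE
```

Even in this toy version, the rule needs roughly 460 consecutive clean spot checks before it fires (solve 1 − 0.99ⁿ ≥ 0.99 for n), each dirty observation pushes the stopping point out much further, and cranking CREDENCE toward 99.99999% is exactly what produces the runaway verification above.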
This can kind of work if you assume you have a friendly prior over actions to draw from, and no inner misalignment issues. Suppose an AI gets a score of 1 for mopping and 0 otherwise. If you draw from the set of all action sequences (according to some prior) which get expected reward > 0.99, then you’re probably fine as long as the prior isn’t too malign. For example, drawing from the distribution of GPT-4-base trajectories probably doesn’t kill you. This is a kind of satisficer, which is an old MIRI idea.
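A minimal sketch of that satisficing scheme as rejection sampling; `prior_sample` and `expected_reward` are hypothetical stand-ins for, say, GPT-4-base rollouts and the 0/1 mopping score:

```python
from typing import Callable, Optional, Sequence

def satisfice(
    prior_sample: Callable[[], Sequence[str]],          # e.g. a base-model rollout
    expected_reward: Callable[[Sequence[str]], float],  # 1 for mopping, 0 otherwise
    threshold: float = 0.99,
    max_tries: int = 10_000,
) -> Optional[Sequence[str]]:
    """Rejection-sample from the prior, keeping the first trajectory whose
    expected reward clears the threshold. The result is distributed as the
    prior conditioned on reward > threshold, so its safety rests entirely
    on the prior not being malign."""
    for _ in range(max_tries):
        trajectory = prior_sample()
        if expected_reward(trajectory) > threshold:
            return trajectory
    return None  # feasible trajectories are too improbable under the prior
```

The failure branch at the end is the next point: for a hard task, trajectories clearing the threshold are so improbable under any friendly prior that the loop effectively never returns.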
The real issues occur when you need the AI to do something difficult, like “Save the world from OpenBrain’s upcoming Agent-5 release”. In that case, there’s no real way to construct a friendly distribution to satisfice over.
There’s also the problem of accurately specifying what “mopped” means in the first place, but thankfully GPT-4 already knows what mopping is. Having a friendly prior does an enormous amount of work here.
And there’s the whole inner misalignment failure thingy as well.
For example, drawing from the distribution of GPT-4-base trajectories probably doesn’t kill you. … The real issues occur when you need the AI to do something difficult, like “Save the world from OpenBrain’s upcoming Agent-5 release”.
An LLM is still going to have essentially everything in its base distribution; the trajectories that solve very difficult problems aren’t going to be absurdly improbable, they just won’t ever be encountered by chance without actually doing the RL. If the finger is put on the base model distribution in a sufficiently non-damaging way, it doesn’t seem impossible that a lot of concepts and attitudes from the base distribution survive even if solutions to the very difficult problems move much closer to the surface. Alien mesa-optimizers might take over, but they also might not, and the base distribution is still there, even if in a somewhat distorted form.
What if it’s only 99.99% sure that it’s 99% sure? Also, in some sense levels of credence are ill-defined, and worse, any abstraction of ontology in the real world will be leaky, even computation. It’s not even possible to define what “stop” means without assuming sufficient intent alignment: it’s not fundamentally more difficult to take over the reachable universe than to shut down without leaving the factory. And it may well turn out to be possible to take over the reachable universe while also, in some borderline-inadmissible sense, technically shutting down without leaving the factory.
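Toy arithmetic for that regress (the numbers are mine, purely illustrative): if each level of the credence estimate is itself only trusted to 99.99%, a naive multiplicative bound on the overall guarantee already falls below the 99% target at the first level, so the patch recreates the runaway-verification problem one level up:

```python
TARGET = 0.99  # required credence that >= 99% of the floor is mopped
META = 0.9999  # trust in each level of the credence machinery itself

bound = TARGET
for level in range(1, 4):
    bound *= META  # compound the meta-uncertainty at each level
    print(f"level {level}: naive lower bound = {bound:.6f} (target {TARGET})")
# level 1: naive lower bound = 0.989901 -- already short of the target
```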
Stop can be done with thermodynamics and boundaries, I think? You need to be able to address all the locations where the AI is implemented and require that their energy release goes down to background. There are still some hairy ingredients for asymptotic alignment, but it’s not as bad as “fetch a coffee as fast as possible without that being bad”.
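A toy version of that predicate (the hypothetical `power_draw_watts` sensor, the background level, and the genuinely hard part, enumerating every location where the AI is implemented, are all assumed away here):

```python
BACKGROUND_WATTS = 0.5  # ambient power level, made up for the sketch
TOLERANCE_WATTS = 0.1

def has_stopped(locations: list[str], power_draw_watts) -> bool:
    """'Stop' as a physical predicate: every site where the AI is
    implemented must have its energy release back down at background."""
    return all(
        abs(power_draw_watts(site) - BACKGROUND_WATTS) <= TOLERANCE_WATTS
        for site in locations
    )
```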