This can kind of work if you assume you have a friendly prior over actions to draw from, and no inner misalignment issues. Suppose an AI gets a score of 1 for mopping and 0 otherwise. If you draw from the set of all action sequences (according to some prior) which get expected reward >0.99, then you’re probably fine as long as the prior isn’t too malign. For example, drawing from the distribution of GPT-4-base trajectories probably doesn’t kill you. This is a kind of satisficer, which is an old MIRI idea.
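For concreteness, here is a minimal sketch of that conditioning-on-reward scheme, with toy stand-ins for both the prior and the reward. None of the names below correspond to a real API; pretend the “prior” is something like GPT-4-base rollouts on the mopping task.

```python
import random

# Toy stand-ins, purely illustrative: a toy "prior" over short action
# sequences, and a reward that is 1 only if the floor ends up mopped.
ACTIONS = ["wander", "fetch_mop", "mop_floor", "knock_over_bucket"]

def sample_trajectory(rng, length=4):
    """Draw one action sequence from the (assumed friendly) prior --
    think GPT-4-base rollout, not an optimized policy."""
    return [rng.choice(ACTIONS) for _ in range(length)]

def expected_reward(trajectory):
    """Score 1 if the floor got mopped (and not subsequently ruined), else 0."""
    return 1.0 if "mop_floor" in trajectory and trajectory[-1] != "knock_over_bucket" else 0.0

def satisfice(rng, threshold=0.99, max_tries=100_000):
    # Rejection sampling: keep drawing from the prior until a draw clears the
    # reward threshold. The result is distributed as the prior *conditioned on*
    # expected reward > threshold -- no optimization pressure beyond that
    # filter, which is why a non-malign prior keeps the output non-scary.
    for _ in range(max_tries):
        traj = sample_trajectory(rng)
        if expected_reward(traj) > threshold:
            return traj
    return None  # the prior may simply never clear the bar (the hard-task case below)

print(satisfice(random.Random(0)))
```

For hard enough tasks the loop effectively never terminates, which is the point of the next paragraph: the prior puts vanishingly little probability on the trajectories you actually need.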
The real issues occur when you need the AI to do something difficult, like “Save the world from OpenBrain’s upcoming Agent-5 release”. In that case, there’s no real way to construct a friendly distribution to satisfice over.
There’s also the problem of accurately specifying what “mopped” means in the first place, but thankfully GPT-4 already knows what mopping is. Having a friendly prior does an enormous amount of work here.
And there’s the whole inner misalignment failure thingy as well.
For example drawing from the distribution of GPT-4-base trajectories probably doesn’t kill you. … The real issues occur when you need the AI to do something difficult, like “Save the world from OpenBrain’s upcoming Agent-5 release”.
An LLM is still going to have essentially everything in its base distribution. The trajectories that solve very difficult problems aren’t going to be absurdly improbable; they just won’t ever be encountered by chance without actually doing the RL. If the finger is put on the base model distribution in a sufficiently non-damaging way, it doesn’t seem impossible that a lot of concepts and attitudes from the base distribution survive even if solutions to the very difficult problems move much closer to the surface. Alien mesa-optimizers might take over, but they also might not, and the base distribution is still there, even if in a somewhat distorted form.
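One possible reading of “putting the finger on the base model distribution in a sufficiently non-damaging way” is the standard KL-penalized RL objective. The sketch below is only that reading, with invented names and numbers; it is not a claim about what the comment has in mind or what any lab actually does.

```python
# A KL-shaped reward: the fine-tuned policy is rewarded for solving the task
# but penalized for drifting from the base model. All names and numbers here
# are made up for illustration.
def kl_shaped_reward(task_reward, logprob_policy, logprob_base, beta=0.02):
    """Shaped reward for one sampled trajectory.

    task_reward    -- the task score (e.g. 1.0 if the hard problem got solved)
    logprob_policy -- log-probability the fine-tuned policy assigns to the sampled tokens
    logprob_base   -- log-probability the base model assigns to the same tokens
    beta           -- strength of the pull back toward the base distribution
    """
    # When trajectories are sampled from the policy, (logprob_policy - logprob_base)
    # is a single-sample estimate of KL(policy || base). Larger beta keeps the
    # policy closer to the base distribution, so more of its concepts and
    # attitudes survive; smaller beta lets RL pull solutions to hard problems
    # closer to the surface, at the cost of more distortion.
    return task_reward - beta * (logprob_policy - logprob_base)

# Toy illustration: a trajectory that solves the task but is very unlikely
# under the base model gets much of its reward clawed back.
print(kl_shaped_reward(1.0, logprob_policy=-5.0, logprob_base=-40.0))
```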