For example, drawing from the distribution of GPT-4-base trajectories probably doesn’t kill you. … The real issues arise when you need the AI to do something difficult, like “Save the world from OpenBrain’s upcoming Agent-5 release”.
An LLM is still going to have essentially everything in its base distribution: the trajectories that solve very difficult problems aren’t going to be absurdly improbable; they just won’t ever be encountered by chance, without actually doing the RL. If the finger is put on the base model distribution in a sufficiently non-damaging way, it doesn’t seem impossible that many concepts and attitudes from the base distribution survive, even as solutions to the very difficult problems move much closer to the surface. Alien mesa-optimizers might take over, but they also might not, and the base distribution is still there, even if in a somewhat distorted form.