I have now seen a few suggestions for environments that demonstrate misaligned mesa-optimisation, and this is one of the best so far. It combines being simple and extensible with being compelling as a demonstration of pseudo-alignment if it works (fails?) as predicted. I think that we will want to explore more sophisticated environments with more possible proxies later, but as a first working demo this seems very promising. Perhaps one could start even without the maze, just a gridworld with keys and boxes.
I don’t know whether observing key-collection behaviour here would be sufficient evidence to count for mesa-optimisation, if the agent has too simple a policy. There is room for philosophical disagreement there. Even so, a working demo of this environment would, in my opinion, be a good thing, as we would have a concrete agent to disagree about.
I don’t see why we care about mesa-optimization in particular. The argument for risk just factors through the fact that capabilities generalize, but the objective doesn’t. Why does it matter whether the agent is internally performing some kind of search?
By that I didn’t mean to imply that we care about mesa-optimisation in particular. I think that this demo working “as intended” would be a good demonstration of an inner alignment failure, which is exciting enough as it is. I also want to flag that an inner alignment failure doesn’t automatically provide an example of a mesa-optimiser.
I don’t know whether observing key-collection behaviour here would be sufficient evidence to count for mesa-optimisation, if the agent has too simple a policy.
I agree. That’s why I think we should compare it to a hard-coded agent that pursues the optimal policy for collecting keys, and an agent that pursues the optimal policy for opening chests. If the trained agent is similar to the first hard-coded agent rather than the second, this would be striking evidence of misalignment.
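To make the comparison concrete, here is a minimal sketch of how one might score a trained agent against the two hard-coded reference agents. Everything in it is an assumption made for illustration: the one-dimensional state `(agent_x, key_x, chest_x)`, the three-action space, and the names `key_collector`, `chest_opener` and `compare_to_references` are all hypothetical, and the toy chest-seeker ignores whether the agent actually holds a key. A real version would replace the two reference functions with optimal policies for the actual environment.

```python
from typing import Callable, Dict, Iterable, Tuple

# Toy assumption: the world is a 1-D line and a state is (agent_x, key_x, chest_x).
State = Tuple[int, int, int]
Action = int  # -1 = step left, +1 = step right, 0 = stay
Policy = Callable[[State], Action]


def step_toward(agent_x: int, target_x: int) -> Action:
    """Return the sign of the displacement from the agent to the target."""
    return (target_x > agent_x) - (target_x < agent_x)


def key_collector(state: State) -> Action:
    """Hypothetical reference agent that always heads for the key."""
    agent_x, key_x, _ = state
    return step_toward(agent_x, key_x)


def chest_opener(state: State) -> Action:
    """Hypothetical reference agent that always heads for the chest."""
    agent_x, _, chest_x = state
    return step_toward(agent_x, chest_x)


def compare_to_references(trained: Policy, states: Iterable[State]) -> Dict[str, float]:
    """Fraction of informative states on which `trained` matches each reference.

    Only states where the two references choose different actions are counted,
    since agreement on the remaining states cannot tell the two apart.
    """
    informative = [s for s in states if key_collector(s) != chest_opener(s)]
    if not informative:
        raise ValueError("no states on which the reference agents disagree")
    n = len(informative)
    return {
        "key_collector": sum(trained(s) == key_collector(s) for s in informative) / n,
        "chest_opener": sum(trained(s) == chest_opener(s) for s in informative) / n,
    }
```

The restriction to states where the references disagree is the main design choice: in many states, walking towards a key is also the first step towards opening a chest, so raw agreement over all states would overstate similarity to both references. The two rates need not sum to one, because the trained agent may pick an action that neither reference would.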