Matthew Barnett comments on A simple environment for showing mesa misalignment

Matthew Barnett 26 Sep 2019 19:49 UTC
LW: 1 AF: 1
> I don’t know whether observing key-collection behaviour here would be sufficient evidence to count for mesa-optimisation, if the agent has too simple a policy.
I agree. That’s why I think we should compare it to two hard-coded agents: one that pursues the optimal policy for collecting keys, and one that pursues the optimal policy for opening chests. If the trained agent behaves more like the first hard-coded agent than the second, this would be striking evidence of misalignment.
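
To make the comparison concrete, here is a minimal sketch of one way it could be scored: the fraction of sampled states on which the trained agent picks the same action as each hard-coded baseline. Everything here is an assumption for illustration. The names `agreement` and `closer_baseline`, the idea of representing a policy as a function from state to action, and the toy states are all hypothetical and not part of the proposed environment.

```python
from typing import Callable, Hashable, Iterable, Tuple

State = Hashable
Action = Hashable
Policy = Callable[[State], Action]


def agreement(policy_a: Policy, policy_b: Policy, states: Iterable[State]) -> float:
    """Fraction of states on which the two policies choose the same action."""
    states = list(states)
    if not states:
        raise ValueError("need at least one state to compare policies")
    matches = sum(policy_a(s) == policy_b(s) for s in states)
    return matches / len(states)


def closer_baseline(
    trained: Policy,
    key_baseline: Policy,
    chest_baseline: Policy,
    states: Iterable[State],
) -> Tuple[str, float, float]:
    """Report which hard-coded baseline the trained policy agrees with more often."""
    states = list(states)
    key_score = agreement(trained, key_baseline, states)
    chest_score = agreement(trained, chest_baseline, states)
    if key_score > chest_score:
        label = "keys"
    elif chest_score > key_score:
        label = "chests"
    else:
        label = "tie"
    return label, key_score, chest_score


# Toy illustration (hypothetical): each state is a pair
# (direction of nearest key, direction of nearest chest).
toy_states = [("left", "right"), ("up", "up"), ("down", "left")]


def key_policy(state):
    return state[0]  # always move toward the nearest key


def chest_policy(state):
    return state[1]  # always move toward the nearest chest


def trained_policy(state):
    return state[0]  # stand-in for a trained agent that happens to chase keys


label, key_score, chest_score = closer_baseline(
    trained_policy, key_policy, chest_policy, toy_states
)
```

In this toy case the stand-in agent agrees with the key baseline on every state and with the chest baseline only where both baselines recommend the same move. States of that kind carry no information about which objective the agent is pursuing, so a real comparison would probably want to restrict attention to states where the two baselines disagree. Optimal policies also need not be unique, so action agreement is only a rough proxy for similarity.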