The training procedure judges the agent only on its actions during training. This makes it incapable of distinguishing an agent that behaves in the box and runs wild the moment it gets out of the box from an agent that behaves all the time.
The training process produces no incentive that controls the behaviour of the agent after training. (This assumes the training and runtime environments differ in some way that the agent can detect.)
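To make the point above concrete, here is a toy sketch in Python. Everything in it is made up for illustration: the `Observation` type, the `looks_like_training` flag (standing in for whatever difference between training and runtime the agent can notice), and the two hand-written policies. It is not a model of any real training setup; it only shows that a loss computed on training episodes cannot separate the two agents.

```python
from dataclasses import dataclass
from typing import Callable, List


# Hypothetical toy setup: an observation is some situation plus a feature
# that happens to differ between training and deployment.
@dataclass(frozen=True)
class Observation:
    situation: int
    looks_like_training: bool


Policy = Callable[[Observation], str]


def always_behaves(obs: Observation) -> str:
    # The agent we hoped for: behaves everywhere.
    return "behave"


def behaves_only_in_box(obs: Observation) -> str:
    # Behaves in the box, runs wild once it notices it is out.
    return "behave" if obs.looks_like_training else "run_wild"


def misbehaviour_rate(policy: Policy, batch: List[Observation]) -> float:
    # Fraction of episodes in which the agent did something other than behave.
    # On the training batch, this plays the role of the training loss.
    return sum(policy(obs) != "behave" for obs in batch) / len(batch)


training_batch = [Observation(i, True) for i in range(100)]
deployment_batch = [Observation(i, False) for i in range(100)]

# The training signal only ever sees the training batch.
loss_honest = misbehaviour_rate(always_behaves, training_batch)
loss_deceptive = misbehaviour_rate(behaves_only_in_box, training_batch)

# Deployment behaviour is never scored during training.
deployed_honest = misbehaviour_rate(always_behaves, deployment_batch)
deployed_deceptive = misbehaviour_rate(behaves_only_in_box, deployment_batch)

print("training loss:", loss_honest, loss_deceptive)
print("deployment misbehaviour:", deployed_honest, deployed_deceptive)
```

Both policies get exactly the same training loss, so nothing in the training signal favours one over the other. Which one you end up with is decided by something other than the objective.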
As such, the runtime behaviour depends on the priors: the decisions implicit in the structure of the agent and the training process, not just the objective. The question is what kinds of agents are easiest for the training process to find. A sufficiently smart agent that understands its place in the world seems simple. A random smart agent will probably not have the utility function we want, since there are lots of possible utility functions. But almost any agent with real-world goals that understands the situation it's in will play nice in training, and then turn on us in deployment.
There are various discussions about which sorts of training processes have this problem, and the question isn’t really settled.