The key question is your second paragraph, which I still don’t really buy. Taking an action like “attempt to break out of the box” is penalized if it is done during training (and doesn’t work), so the optimization process itself will select for systems that do not do that. The system might know that it could break out, but why would it? Doing so in no way helps it, in the same way that outputting anything unrelated to the prompt would be selected against.
The idea is that a model of the world that helps you succeed inside the box might naturally generalize to making consequentialist plans that depend on whether you’re in the box. This is closely analogous to human intellect—we evolved our reasoning capabilities because they helped us reproduce in hunter-gatherer groups, but since then we’ve used our brains for all sorts of new things that evolution never selected for. And when we are placed in the modern environment rather than hunter-gatherer groups, we actually use our brains to invent condoms and otherwise deviate from the reproductive ends that brains were originally selected for.