The second part of the report examines the prerequisites for scheming. In particular, I focus on:
Situational awareness: the model understands that it is a model in a training process, what that process will reward, and the basic nature of the objective world in general.
We can prevent situational awareness by training in simulation sandboxes—carefully censored training environments. The cost is that you give up modern world knowledge, and thus most of the immediate economic value of the training run, but in exchange you can safely explore the alignment of powerful architectures without creating powerful agents. If every new potentially-dangerous, compute-record-breaking foundation agent were trained first in a simbox, that would only roughly double the training cost.