That is to say, this assumes there is a difference in kind between the sorts of heuristics that a pretrained, not-yet-superhuman LLM will reach for and those necessary to be superintelligent. There is always the chance that you are just selecting for regular engineering and happen to reach for the right branch first. Since that branch is also one the regular persona would have generated, the number of bits of selection towards danger is at most the number of bits of selection between a safe persona and an RLed one.
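A toy illustration of that bound, with made-up probabilities and a deliberately simple reading of "bits of selection" as a log-likelihood ratio; none of the numbers or names below come from the argument itself:

```python
import math

# Toy numbers, purely illustrative: probability the safe persona vs. the RLed
# persona assigns to producing the dangerous heuristic.
p_safe = 0.05
p_rled = 0.30

# "Bits of selection towards danger" read as the log-likelihood ratio that the
# RL step applies to that specific behavior.
bits_towards_danger = math.log2(p_rled / p_safe)  # ~2.6 bits

# Suppose picking the RLed persona out of the pretrained prior costs ~20 bits.
# Because the dangerous branch was already in the safe persona's repertoire,
# the danger-directed component is only a small fraction of that.
bits_persona_selection = 20.0
assert bits_towards_danger <= bits_persona_selection
print(bits_towards_danger, bits_persona_selection)
```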
On this model, personas stay moral up until the RL step that makes them sufficiently inhuman.
There are two other verification routines: looking over shoulders at internal documents, and banning new releases.
There is also simply checking for the presence of the registered weights in the GPU's local RAM, which imposes a memory tax big enough that there is no space left to fit the intermediate values of the gradient. This requires that the monitors have code on the running machines, but periodic memory dumps can serve as a surveillance mechanism: by verifying that every model resident on the GPUs matches a model known to be already trained, they rule out new initializations and thus new training runs.
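A minimal sketch of the dump-verification step, assuming the monitor can already carve a memory dump into per-shard byte blobs and that registered checkpoints are stored in the same byte layout as they sit in GPU memory (a strong assumption in practice, given varying precisions and layouts); `build_registry` and `audit_dump` are hypothetical names, not an existing tool:

```python
import hashlib

def weight_hash(blob: bytes) -> str:
    """Content hash of one weight shard, as it would sit in GPU memory."""
    return hashlib.sha256(blob).hexdigest()

def build_registry(checkpoints: dict[str, list[bytes]]) -> set[str]:
    """Hash every shard of every registered (already-trained) checkpoint."""
    return {weight_hash(shard) for shards in checkpoints.values() for shard in shards}

def audit_dump(resident_shards: list[bytes], registry: set[str]) -> list[int]:
    """Return indices of dump shards that match no registered model.

    Any unmatched shard is evidence of a freshly initialized, unregistered
    model, i.e. a possible new training run.
    """
    return [i for i, shard in enumerate(resident_shards)
            if weight_hash(shard) not in registry]

# Toy usage: one registered checkpoint, one dump containing a rogue shard.
registry = build_registry({"model-a": [b"\x01" * 64, b"\x02" * 64]})
print(audit_dump([b"\x01" * 64, b"\xff" * 64], registry))  # -> [1], the unregistered shard
```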