Unsure if this is implicit, but I’m excited about an additional affordance for diagnosis—having access to training checkpoints.
This would be especially useful for models which may acquire some alien drives during post-training, and by the end of it can competently hide it. At the earlier stages, the drive would not be strong enough for the model to develop the instrumental goal of hiding alien drives.
Unsure if this is implicit, but I’m excited about an additional affordance for diagnosis—having access to training checkpoints.
This would be especially useful for models which may acquire some alien drives during post-training, and by the end of it can competently hide it. At the earlier stages, the drive would not be strong enough for the model to develop the instrumental goal of hiding alien drives.