Reiterating a point from above: notice how this whole scheme basically assumes that capabilities won't start to generalize relevantly out of distribution. My model says that they eventually will, that this is precisely when things start to get scary, and that one of the big hard bits of alignment is that once this starts happening, capabilities generalize further than alignment does. As far as I can tell, that problem has simply been assumed away in this agenda, before we even dive into the details of the framework.
My reply last time is still relevant: link.