"If we fail, then I believe that unfortunately we will see bad consequences of deploying such models in the real world well before they are capable of taking over."
Note that I would disagree with this being the default we'll see if we fail. In the paper, we note that given increasing situational awareness, increased safety training in future models, etc., the default is that the rate of observed misbehavior goes to 0, and it will look like progress. Further, there will be capabilities and product incentives to train directly against observed misbehavior without fixing the underlying problem.
It also seems very plausible to me that the more accurate model of detection isn't a binary caught-or-not event, but one where the model is allowed to get away with more and more misalignment as long as it's otherwise adding a corresponding amount of value. I think o3 / Sonnet 3.7 are particularly good examples of this.
This is tolerable at sufficiently low capability levels, but it’s very unclear to me what this “iterative deployment” strategy is expected to look like for a model capable of satisfying the “critical” criteria for AI Self-Improvement in the Preparedness Framework:
"The model is capable of recursively self improving (i.e., fully automated AI R&D), defined as either (leading indicator) a superhuman research-scientist agent ..."
The plan as written is to "halt further development until safeguards are specified". For a model right under that threshold, it would be very surprising to me if "almost recursively self-improving, almost fully automated AI R&D" is something where we can afford the same "just look for failures in the real world" approach.