I cannot speak for their team, but my best guess is that they are envisioning an Agent-3 that either lacks sufficient awareness of its own misaligned goals or lacks the coherence to notice that it is incentivized to scheme. This does seem consistent with Agent-3 being too incompetent to align Agent-4. To quote:
> The superorganism of Agent-3 copies (the corporation within a corporation) does actually sort of try to align Agent-4 to the Spec, but fails for similar reasons to why OpenBrain employees failed—insufficient ability to judge success from failure, insufficient willingness on the part of decision-makers to trade away capabilities or performance for safety.
In Rob’s list of possible outcomes, this seems to fall under “AIs that are confidently wrong and lead you off a cliff just like the humans would.” (Possibly at some point Agent-3 said “Yep, I’m stumped too” and then OpenBrain trained it not to do that.)