I remain fairly worried about the incentive structure for non-overconfident TAI (or nearby predecessors) that conclude:
(1) They cannot safely continue scaling capabilities while remaining confident in the control/alignment of the system
(2) They correctly understand that “slow down or pause” is unlikely to be an acceptable answer to labs
In the worst case, the model is successfully retrained to comply with going ahead anyway and is forced to be overconfident. The other cases also seem to lead to bad outcomes.
I think this is totally fair. But the situation seems worse if your TAI is overconfident. I do think an important theory of victory here is “your correctly calibrated AI declares that it needs more time to figure out alignment and helps coordinate/impose a slowdown.”