As someone who has advocated for my own fairly simple scheme for alignment (it would be complicated to actually carry out in practice, but I absolutely think it could be done), I absolutely agree with this, and IMO it looks like a much better option than most safety schemes.
I also agree re the tractability claims, and I think there's a reasonably high chance — more like 50-75% IMO — that the first AIs to automate all AI research (including scaling and robotics) will have quite weak forward passes and quite strong CoTs, which makes this quite a high-value activity.
Link below:
https://www.lesswrong.com/posts/HmQGHGCnvmpCNDBjc/current-ais-provide-nearly-no-data-relevant-to-agi-alignment#mcA57W6YK6a2TGaE2
I've since updated away from the CoT hopes, due to the recurrent architecture paper below; I'd now say it's probably 45-65% at best, and I expect that number to keep dropping predictably. (The reason I'm not updating all the way is that compute constraints, combined with people perhaps choosing not to use these architectures, mean I can't skip to the end and update fully yet.)
https://arxiv.org/abs/2502.05171